Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text
Automatically generate realistic 3D scenes and camera trajectories from text
Abstract
Reviews and Discussion
This paper presents Director3D, a text-to-3D generation framework that creates realistic 3D scenes with adaptive camera trajectories. It includes a Cinematographer (Traj-DiT) for generating camera trajectories, a Decorator (GM-LDM) for initial scene generation, and a Detailer (SDS++ loss) for refinement. The framework uses 3D Gaussians as the scene representation, and extensive experiments demonstrate its effectiveness.
Strengths
The results are good with high spatial consistency.
The task is interesting. It is a good idea to generate scene-level 3D GS directly.
The writing is clear.
Weaknesses
- The quantitative comparison with object-level methods seems unfair. The given prompts include scene information, which affects the CLIP score. In quantitative comparisons, the authors should at least compare with scene-level methods such as LucidDreamer or others.
- Why didn't the authors compare their method with camera-controllable video generation methods?
Questions
- Is the number of Gaussians equal to the number of pixels in each view? Why set it that way? It seems that 256×256 Gaussians might not be sufficient for a scene-level representation. What is the total number of Gaussians used?
- Can this pipeline achieve user-specified camera trajectories?
- Is it possible to generate a complete 3D scene and then reconstruct it? If the camera trajectory encompasses the entire scene, all 3D information is captured. How does the novel view inference ability perform once the three stages are completed? If the method can only generate the denoised views, the contribution will be weakened.
Limitations
- What about failure cases? In what situations does the method fail to generate a consistent scene? I find it hard to believe that training on only two relatively small datasets can achieve such good results. The performance even surpasses object-level methods trained on Objaverse, which doesn't make sense.
- I want to confirm how the test set is divided. For each class, is a portion of the videos held out as the test set, or are a portion of the classes held out as test categories? I also want to confirm that none of the test prompts appeared in the training set.
We thank the reviewer for your appreciation of the high spatial consistency, the interesting task, the good idea, and the clear writing of Director3D. We address your questions point-by-point below.
Q1: The fairness of the quantitative comparison and more scene-level baselines.
Thanks for your questions. The prompts used for our quantitative comparison are provided by the "Single-Object-with-Surroundings Set" of T3Bench. All methods share the same inputs (i.e., only text prompts) and the metrics are calculated without any cherry-picking. We would like to clarify that SDS-based methods can also generate results with scene information; please see the examples on the project pages of DreamFusion and ProlificDreamer. The camera trajectory generation and the initial 3D scene generation trained on real-world datasets help our SDS++ outperform the baselines that use pre-defined trajectories and random initialization.
For a more comprehensive evaluation, we further provide quantitative and qualitative comparisons with several scene-level baseline methods (e.g., LucidDreamer and ZeroNVS). Please refer to CQ1 in the "Global Rebuttal".
Q2: Comparison with camera controllable video generation methods.
We mainly focus on text-to-3D generation in our original manuscript, so we chose baselines with 3D generation ability. However, your advice is very insightful, since both our method and camera-controllable video generation methods can generate videos given trajectories. Compared to these methods (e.g., CameraCtrl), Director3D:
- Can adaptively generate camera trajectory with Traj-DiT, while CameraCtrl uses only user-defined trajectories.
- Can generate 3D-consistent videos with 3DGS rendering, while CameraCtrl cannot ensure the 3D consistency.
- Can denoise only key frames for 3D scene and render other frames effectively, while CameraCtrl needs to denoise each frame with video diffusion models.
From this perspective, our GM-LDM can be regarded as a novel and effective formulation of 3D-consistent camera-controllable video diffusion models with 3DGS as the intermediate 3D representation. Thanks for your kind reminder and we will add these discussions to our revision.
Q3: Number of Gaussians.
Since the Gaussian decoder of GM-LDM is converted from the VAE decoder of the original Stable Diffusion, we set the number of Gaussians for each view equal to the number of image pixels, i.e., 256×256. The Gaussians of the scene are merged from the Gaussians of all input views, i.e., 8×256×256 ≈ 524K. During subsequent refining, this number is changed by the Adaptive Density Control as in the original 3DGS.
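To make the per-view-to-scene merging concrete, here is a minimal sketch of how pixel-aligned Gaussians from the key views could be concatenated into one scene-level set. The array shapes and attribute layout are illustrative assumptions, not the actual GM-LDM output format.

```python
import numpy as np

# Minimal sketch (not the authors' code): merging per-view pixel-aligned
# Gaussians into one scene-level set.  Shapes are illustrative assumptions.
V, H, W = 8, 256, 256          # 8 denoised key views, one Gaussian per pixel

rng = np.random.default_rng(0)
per_view = {
    "xyz":     rng.normal(size=(V, H * W, 3)),   # positions (e.g., unprojected)
    "rgb":     rng.uniform(size=(V, H * W, 3)),  # colors
    "opacity": rng.uniform(size=(V, H * W, 1)),
    "scale":   rng.uniform(size=(V, H * W, 3)),
    "rot":     rng.normal(size=(V, H * W, 4)),   # quaternions
}

# The scene's Gaussians are simply the union over all key views.
scene = {k: v.reshape(V * H * W, -1) for k, v in per_view.items()}
print(scene["xyz"].shape[0])   # 8 * 256 * 256 = 524288 ≈ 524K Gaussians
```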
Q4: User-specified camera trajectories.
Yes. Director3D supports using user-specified camera trajectories for 3D scene generation. Please refer to CQ2 in the "Global Rebuttal".
Q5: Novel view inference ability.
Efficient novel view inference is one of the key advantages of Director3D over traditional video generation models, since there is an intermediate 3D representation. The cases presented in our video gallery are rendered with the generated 3DGS and interpolated novel cameras. For more novel view synthesis results after 3D scene generation, please also refer to CQ2 in the "Global Rebuttal".
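As an illustration of what we mean by interpolated novel cameras, the sketch below densifies a trajectory by linearly interpolating camera positions and spherically interpolating rotations between consecutive poses. It is not our actual rendering code, and the pose convention (camera-to-world matrices) is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_trajectory(c2w, factor=4):
    """Densify a camera trajectory by interpolating between consecutive poses.

    c2w: (N, 4, 4) camera-to-world matrices.  Positions are linearly
    interpolated; rotations are slerped.  Illustrative sketch only.
    """
    N = len(c2w)
    slerp = Slerp(np.arange(N), Rotation.from_matrix(c2w[:, :3, :3]))
    t = np.linspace(0, N - 1, (N - 1) * factor + 1)

    out = np.tile(np.eye(4), (len(t), 1, 1))
    out[:, :3, :3] = slerp(t).as_matrix()
    # Per-segment linear interpolation of camera centers.
    lo = np.clip(np.floor(t).astype(int), 0, N - 2)
    w = (t - lo)[:, None]
    out[:, :3, 3] = (1 - w) * c2w[lo, :3, 3] + w * c2w[lo + 1, :3, 3]
    return out
```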
Q6: Failure cases.
Similar to 2D diffusion models, the success rate of Director3D decreases when generating with complicated and compositional prompts, objects with exact numbers, and articulated objects.
We provide several examples as failure cases in Fig. 4 of the "Global Rebuttal PDF". Thanks for your advice and we will add these to our revision.
Q7: The reasons for the good results of Director3D with two limited 3D datasets.
Compared with object-level methods trained on Objaverse (e.g., MVDream and GRM), several key points contribute to our superior visual results:
- Compared to synthesized datasets (e.g., Objaverse), the style of the real-world captured datasets used in Director3D is much closer to the real images in 2D datasets (e.g., LAION). From this perspective, Director3D can take better advantage of prior knowledge from the pre-trained Stable Diffusion networks and the collaborative training with the LAION dataset.
- The generated scenes of Director3D include reasonable real-world backgrounds, and the proposed SDS++ loss provides superior refinement ability using the prior of the 2D diffusion model. From this perspective, the SDS++ loss can enhance the entire scene with realistic shadows, lighting, and reflections, making the results much better than those of the object-level methods.
Q8: How to divide the test set and how to confirm that test prompts are not seen during training.
Director3D is aimed at generating 3D scenes given open-world prompts. To fully leverage the real-world captured video datasets of limited sizes for this challenging task, we do not divide a test set but employ novel and unseen prompts for evaluation.
We present evidence from two perspectives which can confirm that test prompts are novel:
- For object-centric cases, we provide the Top 1 category in the MVImgNet training set with the highest CLIP similarities to the main subjects in our test prompts, as shown in Tab. 1 of the "Global Rebuttal PDF".
- For scene-level prompts, we provide the Top 1 prompt in the DL3DV-10K training set with the highest CLIP similarities to our test prompts, as shown in Tab. 2 of the "Global Rebuttal PDF".
These pieces of evidence demonstrate that the test prompts used for evaluating Director3D are not included in the training set.
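For reference, the retrieval behind Tab. 1 and Tab. 2 amounts to a nearest-neighbor search in CLIP text-embedding space. The sketch below illustrates only this step; it assumes the embeddings have already been extracted with a CLIP text encoder and does not reproduce our exact protocol.

```python
import numpy as np

def top1_nearest(train_emb, test_emb):
    """Retrieve, for each test prompt, the most similar training category.

    train_emb: (M, D) CLIP text embeddings of training categories/captions.
    test_emb:  (K, D) CLIP text embeddings of test prompts.
    Embeddings are assumed to be precomputed; this shows only the
    cosine-similarity retrieval step.
    """
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sim = b @ a.T                      # (K, M) cosine similarities
    idx = sim.argmax(axis=1)           # top-1 training match per test prompt
    return idx, sim[np.arange(len(b)), idx]
```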
References:
- ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image.
- CameraCtrl: Enabling Camera Control for Video Diffusion Models.
Please kindly let us know if our responses have addressed your concerns. We are also happy to answer any of your remaining concerns.
Thank you to the authors for their detailed response. Many of my concerns have been addressed. I have another question. Do the authors mean that the diffusion frame number (i.e., 29) is capable of sparse reconstruction of 3D scenes? I believe sparse reconstruction is a challenging task. How can this problem be solved, and what are the limitations?
Your question is very interesting. To reconstruct a specific scene, 29 views possibly cannot fully guarantee the reconstruction quality. However, for generative models, since there is a large amount of scene data for training, the novel view supervision is regarded as a distribution for the model to learn. To learn this distribution, the model needs to generate 3D scenes that support a wide range of possible views.
However, the 29 views we currently adopt are all nearby frames, and the number of frames is also limited due to the expensive GPU memory cost. We will try exploring more views for supervision (e.g., views within a range of frustum overlap) and increasing the number of sparse views to further enhance the novel view synthesis ability.
Can I understand it as only being able to reconstruct scenes covering a small angular range?
Actually no. The required number of views depends on the scene complexity. For 180° object-centric scenes, the 8 views used for denoising and the 29 views used for supervision in Director3D are basically enough to cover the whole scene. But for complicated scenes, 8 views for denoising can only cover a corner.
The supported number of views, as well as the supported scene complexity of Director3D, is currently limited by the specific architecture, the available GPU resources, and the datasets used. However, the framework of Director3D has the potential to scale up.
Following Q7, why choose an image diffusion base model instead of a video diffusion model like SVD or AnimateDiff?
Thank you for your detailed explanation. I read the other reviewers' discussions, and I think text-to-3D at the scene level is an important but unsolved problem. So I decide to maintain the score.
It's all right! Thank you for your positive rating and valuable advice nonetheless.
Another interesting question! In fact, our initial idea was to propose a novel LDM architecture with 3D Gaussians as the intermediate representation (i.e., GM-LDM) for text-to-3D generation. At that time, the core comparison method we had was MVDream, so we roughly followed it to utilize Stable Diffusion as the prior model. Subsequently, due to some characteristics found during the training with real-world data, Director3D was proposed and GM-LDM was finally used as one of its sub-modules.
We also noticed that some recent works fine-tune video generation models into multi-view generation models (e.g., SV3D), and the idea of GM-LDM (i.e., converting image/video decoders into pixel-aligned 3D Gaussian decoders) is also applicable to these architectures. Compared to the current structure (3D global attention), video diffusion models with separated spatial and temporal attention are more efficient and will help increase the supported number of views. However, at the same time, this may impair multi-view consistency. We will explore this in future work.
The paper proposes a scene-generation method from text input. The framework utilizes three models that first generate a trajectory, then produce 3D Gaussians, and finally refine through SDS loss.
Strengths
- The design of using a trajectory generator and 3DGS diffusion is novel and impressive.
- The proposed method outperforms existing object-level methods as shown in experiments.
Weaknesses
- The paper's evaluation is limited. The methods compared are object-level generation frameworks. The experiments can be improved by comparing them against scene-level generation methods, e.g. LucidDreamer[16] and ZeroNVS[1].
- I am confused about how SDS++ loss is presented. The paragraph following Eq. (9) contains numerous undefined variables, making it hard to understand how exactly SDS++ loss is formulated. Can I say it is the loss presented in [67] but integrated with learnable text embedding? What's the difference against the loss proposed in VSD[26] and how does it compare?
- The proposed trajectory generator is novel. How important is generating the trajectory? Can we just assign some trajectories by retrieving the dataset? It seems that most trajectories in MVImgNet and DL3DV are very similar. How diverse are the generations? I see the results in Fig. 12. But aren't the differences coming from different sampling results of a text-based generator? Since the cameras are already normalized, the orientation of the first frame should not be restricted by the camera trajectory. It seems that the model is overfitting to the orientations in MVImgNet (flying around something on the table). Are the "randomly generated camera trajectories" coming from other objects? Is it because GM-LDM only works with one trajectory per scene? Does the model support working with multiple trajectories for the same scene?
- Is the proposed GM-LDM only training the 3DGS scene on observed views? How do the authors avoid overfitting in the rendering-based denoising? Is the SDS++ loss only employed on interpolated camera trajectory?
- Is the presented video only showing the views presented in the trajectory or they are novel view synthesis results? How do we know the generated 3D scene is not overfitting to these views? Sparse-view 3D reconstruction suffers from over-fitting issues a lot and I'd like to hear the authors' thoughts on this matter.
[1] Sargent K, Li Z, Shah T, et al. Zeronvs: Zero-shot 360-degree view synthesis from a single real image. CVPR 2024, arxiv 2023.
Questions
Please refer to the weakness.
Limitations
Please refer to the weakness. This work shows brilliant visual results. I hope the authors clarify my questions above to help me better understand how the results are obtained.
We thank the reviewer for your appreciation of the novel and impressive design and the brilliant visual results of Director3D. We address your questions point-by-point below.
Q1: More scene-level baselines.
Thanks for your helpful advice. For a more comprehensive evaluation, we further provide quantitative and qualitative comparisons with several scene-level baseline methods (e.g., LucidDreamer and ZeroNVS). Please refer to CQ1 in the "Global Rebuttal".
Q2: About SDS++ loss.
Meaning of variables. Thanks for pointing this out. Please kindly refer to Sec. 3 in the original manuscript for the explanation of each variable; we will add a reference in the method section to guide readers to a better understanding of our SDS++ loss.
Formulation. The formulation of the SDS++ loss is intended to combine the advantages observed in existing SDS losses (please refer to lines 208–214 in the original manuscript):
- Eq. 9 - Both Latent-space and Image-space Objectives - HIFA.
- Eq. 10 & Eq. 11 - Adaptive Estimation of the Current Distribution - VSD & LODS.
- Eq. 12 - Appropriate Target Distribution - DreamFusion & VSD.
Comparison with other SDS losses. Please kindly refer to Sec. 5.5 in the original manuscript for the connections between the ablated versions of the SDS++ loss and other works. Specifically, in contrast to the SDS+ loss proposed in HIFA, the SDS++ loss encompasses an adaptive estimation of the current distribution with a learnable text embedding. In comparison to the VSD loss proposed in ProlificDreamer, the SDS++ loss incorporates an image-space objective and replaces the complicated LoRA with a learnable text embedding for the estimation of the current distribution.
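To make the structure of these components easier to follow, below is a rough sketch of how such a combined objective could be assembled in PyTorch. It is not our exact Eqs. 9–12: the callables `eps_pred` and `decode`, the argument names, and the particular weighting are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def sds_pp_step(eps_pred, decode, z, text_emb, src_emb, null_emb,
                t, alphas_cumprod, w_cfg=7.5, lam_z=1.0, lam_x=0.1):
    """Rough sketch of one SDS++-style refinement step, NOT the paper's exact
    Eqs. 9-12.  `eps_pred(z_t, t, emb)` and `decode(z)` are hypothetical
    stand-ins for the frozen diffusion UNet and VAE decoder; `src_emb` is the
    learnable text embedding that adaptively estimates the current rendering
    distribution, and `w_cfg` sets the classifier-free-guided target."""
    noise = torch.randn_like(z)
    ac = alphas_cumprod[t]
    z_t = ac.sqrt() * z + (1 - ac).sqrt() * noise

    with torch.no_grad():
        e_cond = eps_pred(z_t, t, text_emb)
        e_null = eps_pred(z_t, t, null_emb)
        e_tgt = e_null + w_cfg * (e_cond - e_null)      # target distribution
    e_src = eps_pred(z_t.detach(), t, src_emb)          # current-distribution estimate

    # Pseudo ground truth implied by the (target - source) residual; with
    # e_src == noise this collapses to an SDS+/HIFA-style pseudo ground truth.
    e_resid = e_tgt - e_src.detach() + noise
    z_hat = ((z_t - (1 - ac).sqrt() * e_resid) / ac.sqrt()).detach()

    loss_z = F.mse_loss(z, z_hat)                           # latent-space term
    loss_x = F.mse_loss(decode(z), decode(z_hat).detach())  # image-space term
    loss_src = F.mse_loss(e_src, noise)                     # trains src_emb only

    return lam_z * loss_z + lam_x * loss_x + loss_src
```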
Q3: About camera trajectories.
The importance of camera trajectory generation. Since the videos are naturally captured with cameras inside the 3D scene, camera trajectory generation is of great significance for modeling the distribution of real-world captured videos with 3D generative models. Retrieval-based assignment can be employed in simple cases. However, when scaling up the trajectory and scene complexity, it will be difficult for retrieval-based assignment to find scene-specific trajectories for different scenes.
Similar trajectories for object-centric prompts. Since we use the 180-degree captured MVImgNet for object-centric scenes, Traj-DiT accordingly generates 180-degree orbital camera trajectories for object-centric prompts. Please refer to Fig. 9 in the original manuscript for the diverse trajectory generation of scene-level prompts.
The orientation of the first frame. We would like to clarify that this is not an overfitting problem. Since the camera poses are normalized according to the first frame, the orientations of the first frame for different scenes are always fixed after the normalization (lines 140~142).
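To illustrate this normalization (a minimal sketch assuming camera-to-world matrices; Director3D's exact convention, e.g., any additional scale normalization, may differ):

```python
import numpy as np

def normalize_to_first_frame(c2w):
    """Re-express all camera-to-world poses in the coordinate frame of the
    first camera, so the first pose becomes the identity.  Illustrative
    sketch of the normalization described above."""
    ref_inv = np.linalg.inv(c2w[0])
    return np.einsum("ij,njk->nik", ref_inv, c2w)

# After normalization, the first pose is the identity for every scene, which
# is why the first-frame orientation is always fixed rather than overfitted.
```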
Randomly generated camera trajectories. The randomly generated camera trajectories are generated by performing unconditional denoising inference with Traj-DiT; the text embedding is set to NULL for unconditional denoising.
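For readers less familiar with this convention, the sketch below shows how unconditional sampling with a NULL embedding relates to the text-conditioned, classifier-free-guided case; the `eps_pred` interface is hypothetical and not the actual Traj-DiT code.

```python
def traj_eps(eps_pred, x_t, t, text_emb=None, w_cfg=2.0):
    """Noise prediction for one Traj-DiT-style denoising step (hypothetical
    interface, not the actual implementation).  text_emb=None corresponds to
    unconditional denoising with the NULL/empty text embedding, which is how
    "randomly generated" camera trajectories are sampled."""
    e_null = eps_pred(x_t, t, None)              # NULL text embedding
    if text_emb is None:
        return e_null                            # unconditional sampling
    e_cond = eps_pred(x_t, t, text_emb)
    return e_null + w_cfg * (e_cond - e_null)    # classifier-free guidance
```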
Multiple trajectories for the same scene. Although the 3D scene generation (GM-LDM & SDS++) is performed in one trajectory, the generated 3D scenes can be rendered with multiple novel trajectories. Please refer to CQ2 in the "Global Rebuttal".
Q4: About over-fitting problem.
Is GM-LDM only trained on observed views? No. We use 8 views for denoising but 29 views are used for supervising GM-LDM.
How to avoid over-fitting in the rendering-based denoising? For GM-LDM, the camera parameters used for supervision are invisible to the model, and the images for calculating losses are obtained through 3DGS rendering. Therefore, the intermediate 3D Gaussians of GM-LDM need to represent the 3D scene rather than over-fit individual images from different viewpoints.
Is the SDS++ loss only employed on interpolated camera trajectory? Yes. We also tried slightly disturbing the interpolated cameras but observed minor improvements.
Q5: Novel view synthesis results.
The presented videos are rendered with uniformly interpolated cameras within the generated trajectories. Please refer to CQ2 in the "Global Rebuttal" for more novel view synthesis results, which further demonstrate that the generated 3D scenes are not overfitting to the training views.
Q6: About over-fitting issues of sparse-view 3D reconstruction.
From our perspective, the over-fitting issues of sparse-view 3D reconstruction are primarily caused by the loss of 3D information with too few input views. Therefore, a possible solution is to introduce 3D prior knowledge such as novel-view priors and geometric priors. For example, using sparse-view reconstruction models trained with novel-view prior knowledge for 3D scene initialization can help overcome overfitting issues. Using depth or normal estimation with a geometric prior as additional supervision can also enhance the 3D information for sparse-view 3D reconstruction; a sketch of such a depth regularizer is given below.
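As a concrete but hypothetical example of the geometric-prior idea, the following scale-and-shift-invariant depth loss aligns a rendered depth map to a monocular depth estimate before penalizing the residual. This is not a component of Director3D; it only illustrates the kind of supervision we mean.

```python
import torch

def scale_shift_invariant_depth_loss(d_render, d_mono, eps=1e-6):
    """Align a monocular depth estimate to the rendered depth with a
    per-image scale and shift (least squares), then penalize the residual.
    A common geometric-prior regularizer for sparse views; shown only as an
    illustration, not as Director3D's loss."""
    x = d_mono.flatten()
    y = d_render.flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=1)           # (N, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution       # scale, shift
    y_fit = (A @ sol).squeeze(1)                                # aligned mono depth
    return torch.mean((y - y_fit).abs() / (y.abs() + eps))
```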
Please kindly let us know if our responses have addressed your concerns. We are also happy to answer any of your remaining concerns.
Thanks to the authors for the detailed response. A lot of my concerns are well resolved. I am, however, still confused about the novel view evaluations.
I understand that "The presented video is rendered with uniformly interpolated cameras within the generated trajectories". But 3D Gaussians are known to be easy to overfit, and interpolation still doesn't guarantee a reasonable 3D structure. My major question is: do the generated 3D Gaussians show reasonably good geometry and texture? The current results demonstrate great visual quality on the limited views, but this can be achieved even when the 3D geometry is very poor.
If the authors happen to have some renderings of more distant novel views or some renderings of the depth or the point cloud (proving the geometry is truly learned), showing some of these results can make the submission more solid.
I notice that some novel view renderings are presented in Figure 3 of the rebuttal. Can the authors please elaborate on whether these cameras are obtained through interpolation? Measuring the relative camera distance could quantitatively show how novel these cameras are, e.g., when the scene is normalized, how far are these novel-view cameras from the given views?
Thank you for your very insightful comments and observations.
Q1: Is the generated 3D Gaussian showing reasonably good geometry and texture?
Following your suggestion, we conducted experiments on depth rendering of the generated 3D Gaussians. We are sorry that we cannot provide external images here. We observed that the generated depth is reasonable for the overall layout, but it is rather noisy at object edges and details. We suspect the main reason for this issue is that we have not yet introduced geometric supervision in Director3D.
Thank you very much for raising this question and we attach great importance to this issue. We will supplement the results of depth rendering in the revision and try possible solutions (e.g., introducing depth supervision and replacing 3DGS with 2DGS) in future works.
Q2: Whether the novel cameras are obtained through interpolation.
The novel cameras provided in Fig. 3 are not obtained through interpolation but are controlled by user interaction with a web demo. We make sure that the novel cameras do not lie on the original camera trajectory, yet are not too far away from it. The approximate distance and angular differences between a novel camera and the nearest original camera are 0.1 and 15°, respectively.
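For reference, these quantities can be measured as below (a sketch of the measurement, assuming camera-to-world matrices in the normalized scene coordinates):

```python
import numpy as np

def pose_distance(c2w_novel, c2w_trajectory):
    """Distance of a novel camera to the nearest trajectory camera, reported
    as (translation distance, rotation angle in degrees).  Illustrative
    sketch of the measurement described above."""
    t_novel = c2w_novel[:3, 3]
    dists = np.linalg.norm(c2w_trajectory[:, :3, 3] - t_novel, axis=1)
    i = int(dists.argmin())
    R_rel = c2w_trajectory[i, :3, :3].T @ c2w_novel[:3, :3]
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return dists[i], np.degrees(np.arccos(cos))
```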
Thanks again for your detailed comments and inspiring questions. We will continuously enhance the novel view synthesis ability and geometric quality of Director3D.
Thanks for the prompt response. I hope the authors can consider providing some visualizations of the geometry in the revised version. This can potentially encourage more future work. Thanks again for the clear answers and I am happy to increase to a positive rating.
We express our gratitude for your careful reading and valuable advice. We will continuously enhance our manuscript according to the rebuttals. Also, the code will be released to the community for reproduction and further modification.
Since the average score for acceptance last year was 5.94, we kindly ask if you could increase the score to be higher than that. In any case, we are grateful for your efforts and time.
This paper presents Director3D, a novel text-to-3D generation framework designed to generate both real-world 3D scenes and adaptive camera trajectories. Specifically, the authors propose the Traj-DiT to generate adaptive camera trajectories, which treats camera parameters as temporal tokens and performs conditional denoising using a transformer model. The authors propose the GM-LDM and SDS++ Loss to generate robust 3D scenes by leveraging the 2D diffusion prior. Extensive experiments demonstrate the effectiveness of Director3D.
Strengths
- This paper is written clearly and easy to read.
- The idea of treating camera parameters as temporal tokens for denoising generation is novel and effective.
- The GM-LDM and SDS++ loss functions achieve a fairly high level of realistic scene synthesis. The generated scenes conform to the textual input and are reasonably realistic and consistent.
- The proposed Director3D achieves impressive results on both quantitative and qualitative results.
Weaknesses
- It would be more convincing to include more text-to-3D scene generation methods in both the Quantitative Comparison and the Qualitative Comparison.
- For ablation experiments of SDS++ Loss, please use more detailed evaluation metrics and experimental results to illustrate the effects.
- Further implementation of conditionally controllable camera view generation would be helpful for the application of this technology.
- Some important work in this area such as 3D-SceneDreamer should be discussed if they are not suitable for the experimental comparison.
- The visual quality of the results seems acceptable compared to existing methods like 3D-SceneDreamer; the limitations of the method, including not allowing a large range of camera movement, should be discussed.
Questions
No additional questions.
Limitations
- The baseline methods like Fantasia3D and DreamFusion are mainly for object-level generation, not focusing on scene generation.
- The result visualization can be improved, e.g., the images are quite small for the reviewer to check the visual quality.
We thank the reviewer for your appreciation of the clearly written paper, the novel and effective idea, the realistic and consistent scene synthesis, and the impressive results of Director3D. We address your questions point-by-point below.
Q1: More scene-level baselines.
Please refer to CQ1 in the "Global Rebuttal".
Q2: More evaluation metrics and experimental results for SDS++ loss.
Thanks for your helpful advice. We further provide:
- a quantitative ablation study as shown in the following table.
Table 2. Quantitative ablation study of the SDS++ loss

| Setting      | NIQE Score ↓ | CLIP-Similarity ↑ |
|--------------|--------------|-------------------|
| Ours (Full)  | 4.10         | 0.837             |
| λ_x = 0      | 4.12         | 0.831             |
| λ_z = 0      | 7.18         | 0.813             |
| ω_cfg = 1    | 4.26         | 0.796             |
| ε_src = ε    | 4.27         | 0.804             |
| w/o refining | 5.95         | 0.793             |
- a qualitative ablation study with more cases in Fig. 5 of the "Global Rebuttal PDF".
Although the ablation with λ_x = 0 incurs only a slight deterioration in metrics, its qualitative results are noticeably noisy in terms of visual quality. These results further demonstrate the effectiveness of the SDS++ loss.
Thanks for your kind advice; we will add these results to our revision.
Q3: Conditionally controllable camera view generation.
Please refer to CQ2 in the "Global Rebuttal".
Q4: Comparison with 3D-SceneDreamer.
Thanks for your kind reminder. 3D-SceneDreamer is an inpainting-based Text-to-3D scene generation method. Compared to 3D-SceneDreamer, Director3D:
- Can adaptively generate camera trajectories with Traj-DiT, while 3D-SceneDreamer employs user-defined trajectories.
- Can directly generate the initial 3D scene via the predicted pixel-aligned 3DGS with our GM-LDM, while 3D-SceneDreamer needs to continuously optimize a triplane-based NeRF.
We also observed that inpainting-based Text-to-3D scene generation methods (e.g., LucidDreamer) support a larger scene scope but the multi-view consistency of objects with complicated geometry is worse than that of Director3D. Thanks for your kind reminder and we will add these discussions in our next revision. Also, scaling up the supported scene scope is listed as one of our top goals for subsequent improvements.
Q5: Some baselines are mainly for object-level generation, not focusing on scene generation.
The prompts utilized for quantitative comparison are provided by the “Single-Object-with-Surroundings Set” of T3Bench, which are primarily object-centric. For a more comprehensive evaluation, we further offer quantitative and qualitative comparisons with several scene-level baseline methods (e.g., LucidDreamer and ZeroNVS). Please refer to CQ1 in the "Global Rebuttal".
Q6: The result visualization can be improved.
Thanks for your advice. We rescale the images in our original manuscript to provide more examples demonstrating our open-world generation ability. Additionally, we present high-resolution videos in the anonymous video gallery link for better checking the visual quality. We will make subsequent improvements on this.
Please kindly let us know if our responses have addressed your concerns. We are also happy to answer any of your remaining concerns.
This paper proposes a framework for simultaneous text-to-3D scene and camera trajectory generation. The authors propose a 3-stage pipeline to (1) generate a dense camera trajectory from input text, (2) use multi-view latent diffusion from a sparse subset of the generated trajectory to generate the 3D scene representation (Gaussian splats), and (3) refine the Gaussian splats with a modified SDS loss.
Strengths
- The paper tackles a practical problem of generating a 3D scene representation while also synthesizing a camera trajectory from text. It has implications of potential further applications for video/movie synthesis using explicit 3D representations.
- The paper is well-written, the presentation is clear, and the method description is easy to follow and understand.
Weaknesses
- While this paper tackles a new problem, I have concerns with the problem statement. First, what is a "real-world" camera trajectory? Virtually any kind of camera motion could be created in the real world (be it hand-held shakiness or really smooth orbital trajectories that could be achieved via physical equipment). It seems that the camera motions that Traj-DiT could synthesize are mostly orbital (object-centric) -- why would this be considered as a "real-world" trajectory?
- I don't quite get why 3D scene generation and camera trajectory generation should be a coupled problem. NeRFs / Gaussian splats with good quality are not necessarily created via "camera trajectories", but rather via a broad range of covered viewpoints.
- Using a diffusion model to synthesize trajectories, the number of frames would be fixed. How does one determine such trajectory length? How can one vary the length?
- It is unclear why Cinematographer generates a trajectory with dense frames while only a sparse subset is ever used subsequently.
- To evaluate the quality of the synthesized camera trajectory, I believe the method should also be evaluated and compared against others using a video generation quality metric (e.g., FVD). The 3D scene representation could be a well-trained NeRF / Gaussian splat, and rendered videos under different trajectories could be quantified and compared.
Questions
Please see the weakness sections.
Limitations
Yes
We thank the reviewer for your appreciation of the practical problem, the potential applications, the clear presentation, the easy-to-follow method, and the well-written paper of Director3D. We address your questions point-by-point below.
Q1: The meaning of “real-world” camera trajectories.
The reason we emphasize "real-world" is that: Due to the collection costs, most large-scale real-world video datasets are primarily captured by handheld cameras rather than professional track cameras. Compared with the user-defined and precise camera trajectories of synthetic datasets (e.g., Objaverse and ShapeNet), the “real-world” camera trajectories (e.g., MVImgNet and DL3DV-10K) are invariably unstable, noisy, scene-specific and hard to manually define. Director3D is proposed to handle such "real-world" camera trajectories. Our method can be well-generalized to different scenes by modeling the trajectory distributions in accordance with the datasets. Since we employ 180-degree captured MVImgNet for object-centric scenes, Traj-DiT accordingly generates 180-degree orbital camera trajectories for object-centric prompts.
Q2: Why modeling "camera trajectories" rather than "covered viewpoints" for 3D scene generation?
Director3D is aimed at employing real-world captured datasets for superior 3D scene generation. Therefore, the characteristics of real-world captured datasets are the principal considerations in our proposed solution. These datasets typically consist of handheld captured videos with temporal order, leading us to model camera trajectories rather than unordered viewpoints. The temporal characteristic greatly eases the modeling problem for Traj-DiT. It also assists in filtering appropriately spaced frames and might help further extend Director3D to 4D dynamic scene generation in the future.
Q3: The number of frames for camera trajectory.
Setting the frame number requires balancing between the multi-view consistency and the scene scope within the limited model capacity of GM-LDM. We empirically set this by observing the view overlaps between frames.
Currently, our Traj-DiT can only generate camera trajectories with a fixed frame number (i.e., 29). This issue might be resolved through techniques similar to those used for images or text (e.g., linear interpolation of the temporal embedding, sketched below) to apply Transformers to variable token numbers. We consider this one of our most significant subsequent improvements.
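A minimal sketch of the temporal-embedding interpolation idea (the real Traj-DiT embedding layout may differ):

```python
import torch
import torch.nn.functional as F

def resize_temporal_embedding(temporal_emb, new_len):
    """Linearly interpolate a learned temporal embedding table from its
    trained length (e.g., 29) to a different trajectory length.  Sketch of
    the idea mentioned above, not the actual Traj-DiT code."""
    # temporal_emb: (T, D) -> (1, D, T) for 1D interpolation, then back.
    emb = temporal_emb.t().unsqueeze(0)
    emb = F.interpolate(emb, size=new_len, mode="linear", align_corners=True)
    return emb.squeeze(0).t()
```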
Q4: Why does Traj-DiT generate dense frames but only a sparse subset is used for GM-LDM?
We generate dense frames to ensure that the generated trajectories are smooth for later refining and video rendering. For GM-LDM, we use only a sparse subset to reduce GPU memory cost for training a rendering-based multi-view diffusion model. Due to the overlaps between nearby frames, the sparse subset serves as key frames, limiting the number of Gaussians per scene. The other frames can be effectively rendered using 3DGS rendering with specific camera parameters.
Q5: Can the quality of the synthesized camera trajectory be evaluated with a video generation quality metric (e.g. FVD) by rendering videos under different trajectories for a specific scene?
Your advice is very insightful. The video generation process of Director3D is designed to generate trajectories first and then generate scenes, rather than generating trajectories based on scenes. Therefore, generating different trajectories for a specific scene is temporarily beyond the scope of the method's functionalities.
The FVD metric you specifically mentioned is a potential evaluation metric to evaluate the video and trajectory quality of Director3D. However, calculating FVD requires both the generated video set and the GT video set. The latter is currently unavailable for the open-world setting of our evaluation. Therefore, we temporarily choose to evaluate the quality with no-reference image quality assessment and CLIP similarity for video frames.
We do understand your concern and exploring more comprehensive metrics and benchmarks for evaluating trajectory and video generation is listed as one of our top subsequent goals.
We further offer quantitative and qualitative comparisons with several scene-level baseline methods (e.g., LucidDreamer and ZeroNVS). Please refer to CQ1 in the "Global Rebuttal".
We also provide results for employing user-specific cameras for 3D scene generation. Please refer to CQ2 in the "Global Rebuttal".
Please kindly let us know if our responses have addressed your concerns. We are also happy to answer any of your remaining concerns.
Global Rebuttal
We express our gratitude to all reviewers for their recognition of the potential applications (Reviewer YUdV), the novel and interesting idea (Reviewers Lv1o & sdLT & yFX5), the brilliant results (Reviewers Lv1o & sdLT & yFX5), and the well-written paper (Reviewers YUdV & sdLT & yFX5) of Director3D. Their suggestions are beneficial for enhancing our submission. Some common questions are addressed as follows.
CQ1: More evaluation and baselines.
We thank the reviewers for this advice. We further showcase:
- a quantitative comparison with three scene-level baselines (i.e., GaussianDreamer with initial ground, LucidDreamer, and ZeroNVS) on 32 object-centric prompts and 32 scene-level prompts, as shown in the following table.
Table 1. Quantitative comparison between Director3D and several scene-level baselines

| Method                  | NIQE Score ↓ | CLIP-Similarity ↑ | Inference Time (min) ↓ |
|-------------------------|--------------|-------------------|------------------------|
| GaussianDreamer         | 6.96         | 71.8              | 15                     |
| ZeroNVS*                | 9.84         | 67.2              | 90                     |
| LucidDreamer-LLFF*      | 3.53         | 83.3              | 40                     |
| LucidDreamer-HeadBang*  | 3.61         | 82.9              | 40                     |
| LucidDreamer-BackForth* | 3.40         | 74.2              | 40                     |
| Ours                    | 4.09         | 83.9              | 5                      |

*: Image-to-3D scene generation methods
Our Director3D achieves the highest CLIP-Similarity and the second-best NIQE Score with the shortest inference time. LucidDreamer attains the best NIQE Score; however, we observe that it is plagued by multi-view inconsistency, visible artifacts at edges, and excessive objects, as shown in Fig. 7 of the original manuscript.
- a qualitative comparison with ZeroNVS in Fig. 1 of the "Global Rebuttal PDF".
Without adaptively generated camera trajectories (Traj-DiT) and a high-performance multi-view diffusion model with an intermediate 3D representation (GM-LDM), ZeroNVS exhibits deteriorated visual quality from multiple viewpoints and requires a time-consuming SDS optimization process from scratch for each scene.
CQ2: User-specific camera trajectories.
Director3D supports the utilization of both pre-generation and post-generation user-specific camera trajectories.
- Pre-generation user-specific camera trajectories. Users are capable of employing user-specific camera trajectories instead of the generated camera trajectories from Traj-DiT for 3D scene generation. We present generation results with three user-specific camera trajectories for a specific prompt in Fig. 2 of the "Global Rebuttal PDF".
- Post-generation user-specific camera trajectories. After generating the 3D scene, users can render novel views by providing novel cameras. We develop an interactive demo for visualizing the generated camera trajectories and 3D Gaussians, which is also capable of rendering the 3D Gaussians with novel cameras, as shown in Fig. 3 of the "Global Rebuttal PDF".
Other questions of each reviewer are addressed point-by-point in the separate rebuttals and we will revise the manuscript according to the rebuttals.
The code of our submission will be released to the community for reproduction and further modification.
Please note that the figures for rebuttals are in the "Global Rebuttal PDF".
This paper presents a framework for simultaneous text-to-3D scene and camera trajectory generation. The proposed framework is a 3-stage pipeline: generating a dense camera trajectory from input text, using multi-view latent diffusion from a sparse subset of the generated trajectory to generate the 3D scene representation (Gaussian splats), and refining the GSs with a modified score distillation sampling (SDS) loss. The presented experiments demonstrate that the proposed method outperforms SOTAs. The main concerns raised by the reviewers were insufficient validation of the proposed method, the justification for coupling 3D scene generation and camera trajectory generation, the over-fitting problem, and an unclear explanation of the modified SDS loss. The questionable term “real-world” camera trajectory was also pointed out. The authors’ rebuttal resolved most of the raised concerns. During the post-rebuttal discussion, the relation between the current scene reconstruction and the frame number used for diffusion was further discussed for clarification. Reviewer YUdV is still concerned about the term “real-world” camera trajectory, although he thinks that the strengths outweigh the weaknesses. The AC also thinks that “large-scale real-world video datasets are primarily captured by handheld cameras” is a little overstated. Directly connecting the word “real-world” to camera trajectories that are invariably unstable, noisy, and scene-specific does not seem a good idea, because “real-world” suggests a more general concept. The AC suggests using another word such as “wild”. Since the reviewers are all supportive, the paper should be accepted. All the discussion should be incorporated into the final version.