PaperHub
Rating: 6.0/10 (Poster; 4 reviewers; min 5, max 7, std 1.0)
Individual ratings: 7, 5, 7, 5
Confidence: 4.0 | Correctness: 3.3 | Contribution: 2.8 | Presentation: 2.8
NeurIPS 2024

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

OpenReview | PDF
Submitted: 2024-05-09 · Updated: 2025-01-15
TL;DR

We propose DreamScene4D, the first video-to-4D scene generation approach to produce realistic 4D scene representation from real-world multi-object videos.

Abstract

Keywords
4D Scene Generation; Video-to-4D Generation

Reviews and Discussion

Official Review (Rating: 7)

This work introduces DreamScene4D, a pioneering approach for generating realistic 4D scene representations from complex multi-object videos with significant object motions. It proposes a "decompose-recompose" strategy, where a video is first decomposed into individual objects and the background scene. Each component is then completed across occlusions and viewpoints. To manage fast-moving objects, DreamScene4D factorizes 3D motion into three components: camera motion, object-centric deformations, and an object-centric to world frame transformation. The components are recomposed using monocular depth guidance to estimate relative scales and rigid object-to-world transformations, placing all objects within a common coordinate system. DreamScene4D's effectiveness is demonstrated using challenging monocular videos from the DAVIS dataset. It shows significant improvements over existing state-of-the-art video-to-4D generation methods on the DAVIS, Kubric, and self-captured videos with fast-moving objects.

Strengths

  1. This paper is well written, and the figures in particular clearly illustrate the system.
  2. The idea is novel and straightforward, and the methodologies are effective and reasonable.
  3. The factorization of 3D motion into three components, i.e., camera motion, object-centric deformations, and an object-centric-to-world-frame transformation, offers clear advantages over prior art in modeling dramatic motions.
  4. The comparisons with previous methods and the ablation studies are comprehensive, and the demonstrated improvements are convincing.

Weaknesses

  1. This method is composed of multiple steps and requires much human involvement.
  2. For the recomposition stage, the estimation of scales and locations of multiple components is sensitive to the performance of the depth estimation model.

Questions

  1. For 4D scene composition, in addition to the depth, how is the height/width location of each component estimated?
  2. How many GS points are used to represent each component? I think the appearance quality can be further improved by increasing the number of points.
  3. For the LPIPS metric evaluated on the DAVIS dataset, could you explain the reasons for the better performance of Consistent4D over DreamScene4D?

Limitations

  1. Due to the segmentation and inpainting operations on the source video in the decomposition stage, some artifacts can be observed in the constructed 3D assets. This is a tricky problem inherent to the methodology.
Author Response

Thank you for acknowledging that our paper is novel, well-written and effective with comprehensive comparisons. Here we respond to your insightful comments and questions.

Q1. This method is composed of multiple steps and requires much human involvement.

The statement “our method requires much human involvement” is incorrect. Although the pipeline is composed of multiple steps, our method is fully automatic and runs end to end from a single script without human intervention. We use off-the-shelf object segmenters (Grounded-SAM to detect and segment objects), an inpainting model (adapted SD-Inpaint), and multi-object trackers (DEVA) to decompose a scene into objects and background, track each component, and then lift them to 4D.
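
For readers who want a concrete picture of how these pieces fit together, below is a minimal orchestration sketch. The callables `segment_and_track`, `inpaint_background`, `lift_to_4d`, and `compose_scene` are hypothetical placeholders standing in for the Grounded-SAM/DEVA tracking, SD-Inpaint completion, per-component 4D Gaussian optimization, and depth-guided recomposition steps; they are not part of any released API.

```python
# Minimal orchestration sketch of the decompose-recompose pipeline described above.
# All four callables are hypothetical placeholders (not a released API):
#   segment_and_track   -> per-object mask tracks (e.g., Grounded-SAM + DEVA)
#   inpaint_background  -> amodally completed background frames (e.g., adapted SD-Inpaint)
#   lift_to_4d          -> 4D Gaussian representation of one component
#   compose_scene       -> depth-guided recomposition into one coordinate frame
from typing import Any, Callable, List, Sequence

def dreamscene4d_pipeline(
    frames: Sequence[Any],
    segment_and_track: Callable[[Sequence[Any]], List[Any]],
    inpaint_background: Callable[[Sequence[Any], List[Any]], Sequence[Any]],
    lift_to_4d: Callable[[Sequence[Any], Any], Any],
    compose_scene: Callable[[List[Any], Any], Any],
) -> Any:
    """Runs decomposition, per-component 4D lifting, and recomposition end to end."""
    mask_tracks = segment_and_track(frames)                    # decompose: object mask tracks
    background = inpaint_background(frames, mask_tracks)       # complete the occluded background
    objects_4d = [lift_to_4d(frames, m) for m in mask_tracks]  # lift each object to 4D Gaussians
    background_4d = lift_to_4d(background, None)               # lift the static background
    return compose_scene(objects_4d, background_4d)            # recompose with depth guidance
```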

Q2. The estimation of scales and locations of multiple components is sensitive to the performance of the depth estimation model.

DreamScene4D uses the estimated monocular depth to infer the relative depth relationships/scales of each independently optimized set of 4D Gaussians and recompose them into one unified coordinate frame. To provide more insight into the robustness to depth estimation errors during recomposition, we conducted the following additional experiments:

  1. We swapped out the Depth-Anything model for MiDaS, a weaker depth prediction model, as well as the newly released Depth-Anything v2.
  2. We added random noise to perturb the outputs of Depth-Anything v1 at various magnitudes.
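
As a rough illustration of the second experiment, the sketch below adds relative noise to a predicted depth map. The exact noise model is not specified in the rebuttal; we assume zero-mean Gaussian noise whose standard deviation is 10% or 25% of each pixel's depth value.

```python
# A minimal sketch of the depth perturbation experiment, assuming zero-mean Gaussian
# noise whose standard deviation is a fraction (10% or 25%) of each pixel's depth.
import numpy as np

def perturb_depth(depth: np.ndarray, rel_noise: float, seed: int = 0) -> np.ndarray:
    """Adds relative Gaussian noise to a depth map (rel_noise = 0.10 or 0.25)."""
    rng = np.random.default_rng(seed)
    return depth * (1.0 + rng.normal(loc=0.0, scale=rel_noise, size=depth.shape))

# Example: perturb a dummy DAVIS-resolution depth map by 10%.
depth = np.random.rand(480, 854).astype(np.float32) + 0.5
noisy_depth = perturb_depth(depth, rel_noise=0.10)
```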

Since the depth estimation is used for scene composition, we conducted experiments on multi-object DAVIS videos only and summarize the results below. We will also include this analysis in the revised version of the paper.

| Method | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| Depth-Anything v1 (original) | 83.71 | 0.169 |
| Depth-Anything v1 + Noise (10%) | 83.67 | 0.171 |
| Depth-Anything v1 + Noise (25%) | 83.48 | 0.174 |
| MiDaS v3.1 | 83.34 | 0.176 |
| Depth-Anything v2 | 83.76 | 0.169 |

While different depth predictions place objects at slightly different scene locations, existing SOTA depth prediction models (such as the Depth Anything series) meet the requirements in most cases. The rendered quality of the 4D scene does not deteriorate much as long as the relative depth ordering of the objects is correct, which holds even when we add a small amount of noise to the predicted depth (second row) or use a weaker model like MiDaS (fourth row).

Q3. How to estimate the height/width location of the component for 4D scene composition?

To estimate the 3D location of each object component, we first compute its 2D translation in image space (i.e., the offset from the center of the image) as $(\Delta x, \Delta y)$, then use the camera projection matrix $P = K[R|t]$ of the rendering camera to unproject this 2D translation to 3D, where $K$ is the camera intrinsics, $R$ is the rotation part of the extrinsics, and $t$ is the translation part of the extrinsics.

In our formulation with a monocular video, we define the camera space corresponding to the video and the world space to share the same orientation by setting $R$ to the identity matrix, and solve the 3D translation from the origin directly as $(\Delta X, \Delta Y) = (\frac{\Delta x \cdot d}{f_x}, \frac{\Delta y \cdot d}{f_y})$, where $d$ is a scaling factor determined by the depth, and $f_x$ and $f_y$ denote the focal lengths of the rendering camera along the $x$ and $y$ axes. A similar process can be used to solve for the 3D translation under different camera models.
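
For concreteness, a small numeric sketch of this unprojection under the stated assumptions (pinhole camera, identity rotation) is given below; the values in the example are illustrative and not taken from the paper.

```python
# Numeric sketch of the unprojection above (pinhole camera, identity rotation):
# (dX, dY) = (dx * d / fx, dy * d / fy), with dx, dy the offset from the image center.
def unproject_translation(dx: float, dy: float, d: float,
                          fx: float, fy: float) -> tuple[float, float]:
    """Lifts a 2D center offset to a 3D translation, scaled by the depth-derived factor d."""
    return dx * d / fx, dy * d / fy

# Illustrative (not paper-specified) values:
dX, dY = unproject_translation(dx=120.0, dy=-40.0, d=3.0, fx=1000.0, fy=1000.0)
# dX = 0.36, dY = -0.12 (world units, up to the depth-derived scale d)
```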

Q4. How many GS points are used to represent each component?

We represent each object by a maximum of 20k Gaussians, while the static background is represented by a maximum of 200k Gaussians. We agree that the number of Gaussians has an effect on the quality of the 4D representations and we have compiled a table to ablate this hyperparameter:

| # Gaussians | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| 10k | 84.57 | 0.156 |
| 20k (original) | 85.09 | 0.152 |
| 25k | 85.14 | 0.151 |

We observe that performance does improve with more Gaussians; however, this also increases compute and VRAM cost. We noticed diminishing returns at around 20k Gaussians per object, which is how this hyperparameter was chosen for the experiments.

Q5. For the LPIPS metric evaluated on the DAVIS dataset, what are the reasons for the better performance of Consistent4D over DreamScene4D?

We noticed that the LPIPS metric ignores many optimization artifacts such as floaters, which are more common in Consistent4D (Figure 4 of the paper), since the LPIPS backbone is trained for ImageNet classification. In addition, current SOTA NeRF implementations (as used in Consistent4D) are slightly better than Gaussian splatting at novel view synthesis when the queried view change is large for in-the-wild videos [1]. Thus, on the LPIPS metric, Consistent4D has a slight advantage because its main disadvantage (optimization artifacts) is largely ignored.

[1] "Radsplat: Radiance field-informed gaussian splatting for robust real-time rendering with 900+ fps." arXiv 2024.

Q6. Limitation disclosure.

DreamScene4D handles real-world complex videos with multi-object movement. We summarized the paper's limitations in Sec. 4.3 and provided a detailed failure case analysis in Sec. A.4. When an object is heavily occluded and small, the segmentation or inpainting operations may fail and the Gaussians will not attempt to explain the missing parts, as showcased in the top half of Figure 9. This can potentially be fixed by reducing the reliance on photometric RGBA rendering losses and incorporating semantic losses, such as using CLIP to measure the similarity of the rendered object in the current frame with other frames. We also expect fewer artifacts when using better diffusion-based inpainting models (e.g., SDXL-Inpainting). We will extend the current limitations section of the paper by merging the failure case discussion (A.4).

Comment

Thanks to the authors for the detailed rebuttal! It resolves most of my concerns. I'd like to keep my original rating of Accept.

Official Review (Rating: 5)

DreamScene4D is a novel approach to generating 3D dynamic scenes of multiple objects from monocular videos using 360° novel view synthesis. The method employs a "decompose-recompose" strategy, segmenting the video scene into background and object tracks and further decomposing object motion into object-centric deformation, object-to-world-frame transformation, and camera motion. This decomposition enables effective recovery of object 3D completions and deformations, guided by bounding box tracks for large object movements. DreamScene4D demonstrates extensive results on challenging datasets like DAVIS, Kubric, and self-captured videos, providing quantitative comparisons and user preference studies. The approach also achieves accurate 2D persistent point tracking by projecting inferred 3D trajectories into 2D. The release of the code aims to stimulate further research in fine-grained 4D understanding from videos.

Strengths

  1. The paper introduces a novel "decompose-recompose" strategy, effectively handling complex multi-object scenes and fast object motion by robustly recovering object 3D completions and deformations.

  2. The paper demonstrates comprehensive experimental validation on challenging datasets such as DAVIS, Kubric, and self-captured videos, showcasing its applicability to real-world scenarios and superiority over existing methods through strong quantitative comparisons and user preference studies.

  3. The paper is well-written and well-structured, making it easy to follow and understand the key ideas and methodologies.

Weaknesses

  1. Figure 2 appears to show artifacts in the decomposed background. The paper should clarify how the video scene decomposition separates the background and objects, as this is crucial for the model's performance.

  2. The paper does not explain how temporal consistency is maintained between object motion and camera motion. Ensuring synchronized motion is essential for the accuracy of the 4D scene generation.

  3. The paper lacks comparisons with relevant baselines such as Animate124[1] and 4Diffusion[2]. It should also address how the model handles scenarios with many objects or objects appearing at different times.

[1] Zhao Y, Yan Z, Xie E, et al. Animate124: Animating one image to 4d dynamic scene[J]. arXiv preprint arXiv:2311.14603, 2023.

[2] Zhang H, Chen X, Wang Y, et al. 4Diffusion: Multi-view Video Diffusion Model for 4D Generation[J]. arXiv preprint arXiv:2405.20674, 2024.

  4. The model relies on an off-the-shelf depth estimator to compute the depth of each object and the background for 4D scene composition. However, the effectiveness of this guidance in dense object scenarios is questionable and needs further evaluation.

  5. The paper should include visualizations of the 3D Gaussians from Figure 2 to provide a clearer understanding of the spatial representation and validate the model's effectiveness.

Questions

While the authors effectively demonstrate their method, the paper lacks some intermediate visualizations, such as the 3D Gaussian visualizations for Figure 2. Additionally, there are potential failure cases involving multiple objects that need further exploration and clarification. These aspects are crucial for thoroughly understanding and validating the proposed approach.

Limitations

The authors do not discuss the limitations in the manuscript.

Author Response

Thank you for acknowledging DreamScene4D as a novel approach with comprehensive experimental validation and superiority over existing methods, and for noting that the writing is well structured. Here we respond to your insightful comments and questions.

Q1. How does the video scene decomposition separate the background and objects?

We use off-the-shelf segmenters (GroundingDINO + SAM) and mask trackers (XMem, DEVA [a]) to segment and track the objects. The background image simply contains the remaining parts of the video frame. We then use a modified version of SD-Inpaint (Sec A.1 of Appendix) to amodally complete (inpaint) the background image. We will update Sec 3.2.1 of the paper to clarify this.

We note that the small artifacts in Figure 2 result from the background inpainting process. We expect that there will be fewer artifacts using better image inpainting models (e.g., SDXL-Inpainting [b]) for background inpainting.

[a] Tracking Anything with Decoupled Video Segmentation. ICCV, 2023.

[b] stable-diffusion-xl-1.0-inpainting-0.1

Q2. The paper does not explain how temporal consistency is maintained between object motion and camera motion.

The inferred camera and object motion trajectories are temporally aligned by the reconstruction process (Sec 3.2.3 of the paper), where the composition of camera and object motion is grounded, frame by frame, on the RGB/flow rendering loss between each rendered frame and the corresponding ground-truth frame. This ensures that both camera and object motions faithfully reflect the temporal movement/deformation observed in the original video.
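
A minimal sketch of such a sequential, per-frame objective is shown below; `render_rgb` and `render_flow` are hypothetical callables standing in for the Gaussian renderer under the composed camera and object motion, and the actual loss terms and weights in the paper may differ.

```python
# Sketch of a sequential per-frame objective; render_rgb(t) and render_flow(t) are
# hypothetical callables that render frame t under the composed camera + object motion.
import torch
import torch.nn.functional as F

def sequential_reconstruction_loss(render_rgb, render_flow,
                                   gt_rgb: torch.Tensor, gt_flow: torch.Tensor,
                                   w_flow: float = 0.5) -> torch.Tensor:
    """gt_rgb: (T, 3, H, W); gt_flow: (T-1, 2, H, W). Returns a scalar loss."""
    num_frames = gt_rgb.shape[0]
    loss = gt_rgb.new_zeros(())
    for t in range(num_frames):
        loss = loss + F.l1_loss(render_rgb(t), gt_rgb[t])      # photometric grounding
        if t < num_frames - 1:
            loss = loss + w_flow * F.l1_loss(render_flow(t), gt_flow[t])  # flow grounding
    return loss / num_frames
```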

Q3.1. Comparisons with Animate124[a] and 4Diffusion[b].

Thanks for pointing out these two concurrent related works; we will discuss and cite them in the Related Works section of the paper. However, please note that Animate124 and 4Diffusion do not tackle in-the-wild multi-object videos with large motion. Also, 4Diffusion [b] was released after the NeurIPS submission deadline, while Animate124 [a] is reported to perform worse than DreamGaussian4D and is proposed as an image-to-4D method rather than video-to-4D, as described in Table 2 and Figure 7 of [c]. We show that our model significantly outperforms DreamGaussian4D in Tables 1 and 2, and we already cite Animate124 in the related work. Per the reviewer's suggestion, we additionally ran experiments on DAVIS using Animate124 by adapting it to condition on the given video frames, which we show here (the results for the other baselines are from Tab. 1 of the paper):

| Method | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| Consistent4D | 82.14 | 0.141 |
| Animate124 | 74.17 | 0.206 |
| DreamGaussian4D | 77.81 | 0.181 |
| DreamScene4D (ours) | 85.09 | 0.152 |

[a] Animate124: Animating one image to 4d dynamic scene. arXiv 2023.

[b] 4Diffusion: Multi-view Video Diffusion Model for 4D Generation. arXiv 2024.

[c] Dreamgaussian4d: Generative 4d gaussian splatting. arXiv 2023.

Q3.2. How does the model handle many objects appearing at different times?

DreamScene4D employs a multi-object segmentation tracker (Grounded SAM + XMem or DEVA) to infer 2D mask tracks for each object. It can track objects that appear and disappear arbitrarily throughout the video. We then extract from the inferred video masks the time intervals during which each object is visible. Consequently, we can optimize each object independently during its visible intervals and then recompose the 4D Gaussians at the appropriate timesteps. This highlights an advantage of DreamScene4D: existing baselines, which deform a 3D representation built from the initial frame, cannot handle objects that appear in the middle of a video.
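
A small sketch of how per-object visibility intervals could be extracted from the inferred mask tracks is shown below; the `mask_track` format (a list of per-frame binary masks, `None` when the object is absent) is an assumption for illustration.

```python
# Sketch of extracting per-object visibility intervals from inferred mask tracks;
# mask_track is assumed to be a list of per-frame binary masks (None when absent).
from typing import List, Optional, Tuple
import numpy as np

def visibility_intervals(mask_track: List[Optional[np.ndarray]],
                         min_pixels: int = 1) -> List[Tuple[int, int]]:
    """Returns [start, end) frame intervals during which the object is visible."""
    visible = [m is not None and int(m.sum()) >= min_pixels for m in mask_track]
    intervals, start = [], None
    for t, v in enumerate(visible):
        if v and start is None:
            start = t
        elif not v and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, len(visible)))
    return intervals

# Example: object visible for frames 3-8 and 14-15 of a 16-frame track.
track = [None] * 3 + [np.ones((4, 4))] * 6 + [None] * 5 + [np.ones((4, 4))] * 2
print(visibility_intervals(track))  # [(3, 9), (14, 16)]
```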

Q4. Effectiveness of depth guidance in dense object scenarios needs further evaluation.

DreamScene4D uses the estimated monocular depth to recompose the independently optimized 4D Gaussians into one unified coordinate frame. To provide insights on the effectiveness of depth guidance in dense object scenarios, we report the CLIP and LPIPS on multi-object DAVIS videos as below:

| Num. objects in video | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| 2 objects | 83.87 | 0.149 |
| 3 objects | 83.23 | 0.182 |
| 5 objects | 84.15 | 0.167 |
| All multi-object videos | 83.71 | 0.169 |

While we agree that depth prediction is harder in dense object scenes, existing SOTA depth prediction models meet the requirements in most cases as we only require the relative depth ordering of the objects to be correct. Thus, having more objects does not necessarily mean the results deteriorate.

We also conducted additional experiments on multi-object DAVIS videos to provide insight into the robustness to depth estimation errors:

  1. We swapped out the Depth-Anything model for MiDaS, a weaker depth estimation model, as well as the newly released Depth-Anything v2.
  2. We added random noise to perturb the outputs of Depth-Anything v1 at various magnitudes.

| Method | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| Depth-Anything v1 (original) | 83.71 | 0.169 |
| Depth-Anything v1 + Noise (10%) | 83.67 | 0.171 |
| Depth-Anything v1 + Noise (25%) | 83.48 | 0.174 |
| MiDaS v3.1 | 83.34 | 0.176 |
| Depth-Anything v2 | 83.76 | 0.169 |

While different depth predictions result in objects being placed at slightly different scene locations, the relative depth ordering of the objects largely stays the same, so the impact on performance is not large. We will also include this analysis in the revised version of the paper.

Q5. Intermediate visualizations for Gaussians.

Thanks for the suggestion; we have added this as Figure 1 of our one-page rebuttal PDF and will include it in the revised paper.

Q6. The authors do not discuss the limitations in the manuscript.

This is incorrect. We'd like to kindly remind the reviewer that we discussed the limitations in Section 4.3 (paper, L279-L285) and provided a detailed failure case analysis in Sec. A.4 (Appendix). We will extend the current limitations section by merging Sec. A.4.

Official Review (Rating: 7)

The paper introduces a novel framework for reconstructing multi-object scenes from monocular videos by factorizing motion into three components: object-centric deformation, object-to-world-frame transformation, and camera motion. This method enhances stability in motion optimization and effectively captures large-scale motions. Initially, the method segments the video into foreground objects and background. These elements are then lifted into a static 3D Gaussian representation using the DreamGaussian method. Subsequently, the motion of each object is independently optimized and finally, the optimized 4D Gaussians are recombined into a unified scene using estimated monocular depth, aligning the model effectively within the global scene context.

Strengths

  • The "decompose-recompose" approach effectively addresses the challenge of generating dynamic multi-object 4D scenes from monocular videos.
  • The experiments show solid evidence that the factorization of object motion improves motion optimization.
  • The paper is well written. The main ideas and concepts are mostly well explained and articulated throughout.

Weaknesses

  • In Table 1, the ablation study on L_{flow} does not justify the need for this component; the metrics for DreamScene4D and w/o L_{flow} are close. Could you provide some examples to justify the need for this loss?

  • While the object-to-world-frame transformation process includes computing the translation and scaling factors that warp the Gaussians from the object-centric frame to the world frame, there are still noticeable artifacts and misalignments of objects in the scene, as shown in the supplemental results. Can the authors provide more details on the potential causes of these artifacts and misalignments? Additionally, what steps or adjustments could be implemented to address and reduce these issues, ensuring better alignment and artifact-free renderings?

  • The method seems to rely on monocular depth estimation. How does the performance degrade with less accurate depth data, and are there any strategies to mitigate this dependency?

Questions

See weakness section.

Limitations

The authors have made a commendable effort to acknowledge the technical limitations of their method by showcasing failure cases in A.4.

Author Response

Thank you for acknowledging that our paper is novel, effective with solid evidence and well-written. Here we respond to your insightful comments and questions.

Q1. In Table 1, the ablation study on $L_{flow}$ does not justify the need for this component.

We measure the 4D generation performance of DreamScene4D both in view synthesis (Tab. 1 of the paper) and in Gaussian motion accuracy (Tab. 2 of the paper). While the flow loss $L_{flow}$ does not directly contribute to the quality of the rendered images, it drastically improves the accuracy of the 3D Gaussian movement in Tab. 2 (paper), where the endpoint error (EPE) of the inferred 3D trajectories is substantially reduced from 10.91 to 8.56 on DAVIS and from 18.54 to 14.3 on Kubric.
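
For clarity, a minimal sketch of the endpoint error metric is given below; the exact evaluation protocol (which points and frames are evaluated, and whether trajectories are 2D or 3D) follows the paper and may differ from this simplified version.

```python
# Sketch of the endpoint error (EPE): mean Euclidean distance between predicted
# and ground-truth trajectory points (simplified; the paper's protocol may differ).
import numpy as np

def endpoint_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (num_tracks, num_frames, D) trajectories with D = 2 or 3."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example with random 3D trajectories for 8 tracked points over 16 frames.
pred, gt = np.random.rand(8, 16, 3), np.random.rand(8, 16, 3)
print(endpoint_error(pred, gt))
```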

Q2. Can the authors provide more details on the potential causes of these artifacts and misalignments?

We aim to handle real-world complex videos with multi-object movement. As we discussed in the Limitations (Sec 4.3 of the paper) and Failure Cases (Sec A.4 of the Appendix) sections, although DreamScene4D produces decent 4D generation results for most videos, there are still three potential causes of artifacts or misalignment errors:

  1. Occlusions and image completion failures. When the object is heavily occluded and small, it is difficult to accurately inpaint the missing parts of the object. Thus, the input may only contain a small part of the object, and the Gaussians will not attempt to explain the missing parts, as showcased in the top half of Figure 9 (Sec A.4 of Appendix). This can potentially be fixed by reducing the reliance on photometric RGBA rendering losses and incorporating semantic losses such as using CLIP to measure the similarity of the rendered object in frame t with other frames.
  2. Large monocular depth prediction errors. This happens when the errors of the depth estimation cause the depth ordering of the objects to be wrong, resulting in the scene composition process placing an object that should be in the back in front of something else, as showcased in the bottom half of Figure 9. Please refer to our response to Q3 for a more detailed analysis.
  3. Parallax differences. When warping the object to a different location in the scene, different parts of the object are rendered by the camera (e.g. the side of the object becomes visible when we move an object from the center to the left). We found empirically that doing a couple of joint-finetuning steps can mitigate this issue, as we discussed in the paper and showcased in Figure 10 (Appendix).

Q3. How does the performance degrade with less accurate depth data, and are there any strategies to mitigate this dependency?

DreamScene4D uses the estimated monocular depth to infer the relative depth relationships/scales of each independently optimized set of 4D Gaussians and recompose them into one unified coordinate frame. To provide more insight into the robustness to depth estimation errors, we conducted the following additional experiments:

  1. We replaced the Depth-Anything model with MiDaS, a weaker depth prediction model, as well as the newly released Depth-Anything v2.
  2. We added random noise to perturb the outputs of Depth-Anything v1 at various magnitudes.

Since the depth estimation is only used for scene composition, we conducted experiments on multi-object DAVIS videos only to emphasize the differences and summarize the results below. Note that this is different from the original evaluations in the paper, which also contain some single-object videos from DAVIS. We will also include this analysis in the revised version of the paper.

| Method | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| Depth-Anything v1 (original) | 83.71 | 0.169 |
| Depth-Anything v1 + Noise (10%) | 83.67 | 0.171 |
| Depth-Anything v1 + Noise (25%) | 83.48 | 0.174 |
| MiDaS v3.1 | 83.34 | 0.176 |
| Depth-Anything v2 | 83.76 | 0.169 |

While different depth predictions place objects at slightly different scene locations, existing SOTA depth prediction models (such as the Depth Anything series) meet the requirements in most cases. The rendered quality of the 4D scene does not deteriorate much as long as the relative depth ordering of the objects is correct, which holds even when we add a small amount of noise to the predicted depth (second row) or use a weaker model like MiDaS (fourth row).

For videos/scenarios requiring precise depth placement for 4D lifting, captured depth from RGB-D cameras (widely available on iPhones) can also be used, or we can leverage a depth post-optimization method such as RCVD [a] to further improve the initial depth prediction. Moreover, we expect SOTA monocular depth models to become more robust as data and model sizes scale.

[a] Robust Consistent Video Depth Estimation. CVPR, 2021.

Comment

Thanks for the detailed experiments and clarification; all my concerns are addressed. I will raise my score to 7.

Comment

Dear Reviewer onS9,

Thanks for reading our rebuttal and for your comments. We are happy to know that your concerns about DreamScene4D have been addressed and that you will raise the score from 6 to 7 (though it does not yet seem to be updated in the system).

Feel free to let us know if there are further questions.

The authors

Official Review (Rating: 5)
  1. The paper claims that their methodology represents the first approach capable of generating verisimilar four-dimensional scene representations derived from complex multi-object video sequences exhibiting substantial motion. This approach purportedly enables 360-degree novel view synthesis and precise motion estimation in both observed and unobserved perspectives.

  2. To validate their approach, the researchers conduct evaluations of DreamScene4D using challenging datasets, including DAVIS, Kubric, and proprietary video footage featuring rapidly moving objects. Their findings indicate enhanced performance compared to extant methodologies, specifically in terms of novel view synthesis quality and motion estimation accuracy.

Strengths

  1. The "decompose-recompose" strategy represents an innovative approach to addressing the complexities of multi-object scenes in video-to-4D generation. This method circumvents the limitations of previous approaches, which encountered difficulties with multiple objects, by employing a three-step process: decomposing the scene into individual objects and background, independently elevating these elements to 3D representations, and subsequently recomposing them.

  2. Quantitative analyses reveal substantial enhancements over contemporary state-of-the-art baselines in two key areas: novel view synthesis quality, as assessed by CLIP and LPIPS metrics, and motion accuracy, quantified by End Point Error. The conducted ablation studies effectively elucidate the contributions of various components within the method, notably the video scene decomposition and motion factorization, thereby providing insight into their respective impacts on overall performance.

Weaknesses

  1. Insufficient analysis of computational complexity: While the authors note that the algorithm's runtime exhibits linear scaling with respect to the number of objects, the manuscript lacks a comprehensive examination of its computational demands. Moreover, there is an absence of comparative analysis vis-à-vis alternative methodologies in terms of computational efficiency. Given the intricate nature of the proposed approach, a more rigorous assessment of its computational costs would significantly enhance the paper's contribution to the field.

  2. Reliance on external depth estimation: The proposed methodology incorporates an extrinsic depth estimation algorithm for scene composition. However, the manuscript provides limited discourse on the potential ramifications of depth estimation inaccuracies on the final output. A thorough investigation into the method's robustness to depth estimation errors would substantially augment the study's validity and applicability.

Questions

Please see the weaknesses.

Limitations

  1. Limited generalization capability of SDS (score distillation sampling) when applied to video footage captured from acute camera angles.

  2. Potential for suboptimal scene composition in cases where the rendered depth of reconstructed 3D objects fails to align precisely with the estimated depth map.

  3. Presence of visual artifacts stemming from under-constrained Gaussian representations during instances of significant occlusion, despite the implementation of inpainting techniques.

  4. Linear scaling of computational runtime in relation to the quantity of objects, potentially resulting in reduced efficiency for videos of high complexity.

Author Response

Thank you for recognizing DreamScene4D as an innovative approach that addresses the difficulties of multi-object scenarios and achieves significant improvements over current state-of-the-art baselines in view synthesis and motion accuracy. Here we respond to your insightful comments and questions.

Q1: Insufficient analysis of computational complexity.

We mention that "the algorithm's runtime exhibits linear scaling with respect to the number of objects" in the limitations section (Sec 4.3, L283-L284), and also discuss the runtime of DreamScene4D in the Appendix (Sec A.2, L490-L495), where we state: "On a 40GB A100 GPU, the static 3D lifting process takes around 5.5 minutes, and the 4D lifting process takes around 17 minutes for a video of 16 frames per object."

Per your request, to provide a more comprehensive comparison, we include the optimization time and rendering efficiency, measured in GPU hours and FPS respectively, for DreamScene4D and the baselines, in addition to the original metrics presented in Table 1 of the paper. Since our method has linear runtime complexity w.r.t. the number of objects in the scene, we present the results separately for videos with 1, 2, and 3 objects from DAVIS, denoted by the 3 entries in each field (1 obj/2 objs/3 objs). We will include this analysis in the revised version of the paper.

| Method | CLIP↑ | LPIPS↓ | Optimization Time (GPU hrs) | FPS (Hz) | GPU Memory (GB) |
| --- | --- | --- | --- | --- | --- |
| Consistent4D | 82.14 | 0.141 | 0.81 | 4.9 | 26.8 |
| DreamGaussian4D | 77.81 | 0.181 | 0.44 | 76.7 | 22.8 |
| DreamScene4D (ours) | 85.09 | 0.152 | 0.27 / 0.53 / 0.81 | 76.1 / 72.4 / 68.7 | 24.7 |

As motion factorization provides a strong prior on the general deformation of the objects, DreamScene4D can converge in fewer steps than in the more conservative experiments reported in the paper, without a loss in quality. However, as the number of objects increases, we observe a trade-off: the individual optimization of decomposed objects results in higher training time but yields much better view synthesis quality. We leave further acceleration of our model's inference time without loss in performance to future work.

Q2. Reliance on external depth estimation.

DreamScene4D uses the estimated monocular depth to infer the relative depth relationships/scales of each independently optimized set of 4D Gaussians and recompose them into one unified coordinate frame. To provide more insight into the robustness to depth estimation errors, we conducted additional experiments:

  1. We replaced the Depth-Anything model with MiDaS, a weaker depth prediction model, as well as the newly released Depth-Anything v2.
  2. We added random noise to perturb the outputs of Depth-Anything v1 at various magnitudes.

Since depth estimation is only used for scene composition, we conducted experiments on multi-object DAVIS videos only to emphasize the differences and summarize the results below. Note that this is different from the original evaluations in the paper, which also contain some single-object videos from DAVIS. We will also include this analysis in the revised version of the paper.

| Method | CLIP↑ | LPIPS↓ |
| --- | --- | --- |
| Depth-Anything v1 (original) | 83.71 | 0.169 |
| Depth-Anything v1 + Noise (10%) | 83.67 | 0.171 |
| Depth-Anything v1 + Noise (25%) | 83.48 | 0.174 |
| MiDaS v3.1 | 83.34 | 0.176 |
| Depth-Anything v2 | 83.76 | 0.169 |

While different depth predictions place objects at slightly different scene locations, existing SOTA depth prediction models (such as the Depth Anything series) meet the requirements in most cases. The rendered quality of the 4D scene does not deteriorate much as long as the relative depth ordering of the objects is correct, which holds even when we add a small amount of noise to the predicted depth (second row) or use a weaker model like MiDaS (fourth row).

For videos/scenarios requiring precise depth placement for 4D lifting, captured depth from RGB-D cameras (widely available on iPhones) can also be used. Moreover, we expect SOTA monocular depth models to become more robust as data and model sizes scale.

Q3. Limitation disclosure.

We summarized the limitations of our paper in Section 4.3 and discussed them extensively in Appendix Sec. A.4.

Comment

Thanks for your response; I am inclined to keep my recommendation of acceptance.

Author Response

We appreciate the positive and insightful comments from all four reviewers. We are glad that they found our paper "novel / innovative", "effective with comprehensive experiments" and "well-written". We will release our code to facilitate future research in fine-grained 4D understanding from videos.

We address all questions in separate responses to each reviewer, and will incorporate all valuable feedback into our paper.

Final Decision

Reviewers find the paper novel, effective, and the presentation well-structured. They also find the proposed "decompose-recompose" strategy innovative and the experimental validation comprehensive, demonstrating the method's superiority over existing approaches. While there were some initial concerns regarding computational complexity, reliance on external depth estimation, and handling of specific scenarios, the authors' detailed responses and additional experiments effectively addressed these issues. The overall consensus leans towards recognizing the paper's significant contribution to 4D scene generation from monocular videos. Hence, my decision is to accept the paper.