Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models
Abstract
Reviews and Discussion
The paper presents a novel framework for generating 4D content, leveraging video diffusion models, and introduces motion magnitude reconstruction loss and 3D-aware classifier-free guidance to refine dynamics learning and generation. The framework is shown to enhance generation efficiency and 4D geometry consistency relative to existing methods.
Strengths
- This paper is the first to directly generate 4D videos using a video diffusion model, partially addressing the spatial-temporal inconsistencies caused by using separate multi-view 3D synthesis models and monocular video generation models in previous works.
- Curates a large-scale, high-quality dynamic 3D dataset sourced from Objaverse to facilitate future research.
Weaknesses
- This method still uses two models (VideoMV and Diffusion4D). Although it is claimed that the outputs of the two models are very similar, the supplementary video still shows considerable inconsistencies, which to some extent undermine the overall consistency.
- In the paper, both numerical and visual comparisons only used 24 images in orbital videos as targets. However, this reduces the difficulty, because the viewpoints of the final 4D renderings are almost the same as those generated by Diffusion4D, and the viewpoints generated by Diffusion4D are too limited compared to other methods. Maybe the authors could provide more results to demonstrate that the 4D Gaussians obtained by this method can still render high-quality images from different viewpoints. For example, they could fix the camera at a frontal view to render images at different time points, or use viewpoints opposite to those generated by Diffusion4D.
Questions
See weaknesses
Limitations
The authors have acknowledged some limitations in their method.
Dear Reviewer mXZW, many thanks to you for taking the time and effort to review our work and providing constructive comments. We are grateful for your recognition of the novelty and contributions of our research, noting that it is the first work training a singular video diffusion model directly on 4D datasets for explicit synthesis of spatial-temporal consistent views of 4D assets. We also appreciate your acknowledgment of the great value of our curated high-quality dataset in facilitating future research.
We have thoroughly considered your constructive feedback, and we are committed to addressing all your concerns as follows:
Q1. VideoMV and Diffusion4D outputs show differences.
In the 4D construction stage, we apply the dynamic images from our 4D-aware video diffusion model to optimize the 4DGS. We also involve static images from VideoMV to help initialize the 3D geometry. Since our 4D-aware video diffusion model was finetuned from VideoMV, we noticed that the inconsistency between the two sets of images is very limited. We proposed a coarse-to-fine optimization strategy for the 4D construction stage, and we carried out extensive experiments validating that this strategy can effectively resolve the potential inconsistency.
Specifically, in the coarse stage, the static images from VideoMV are only used to initialize the 3D geometry (the attributes of the Gaussians) and are not used to train the deformation network for motion learning. The potential inconsistency is mitigated in the fine stage, where only dynamic images from Diffusion4D are used to finetune both the attributes of the Gaussians and the deformation network to guarantee spatial-temporal consistency.
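To make the schedule above concrete, below is a minimal, hypothetical sketch of the coarse-to-fine optimization loop; `GaussianModel`, `DeformationNet`, `render`, the sampling helpers, and the simplified L1 rendering loss are placeholder names and assumptions, not our released implementation.

```python
import torch

# Placeholder components: GaussianModel, DeformationNet, and render() stand in for
# the actual 4DGS implementation and are not the released code.
def construct_4d(static_views, dynamic_views, coarse_steps=3000, fine_steps=3000):
    gaussians = GaussianModel.init_from_images(static_views)   # 3D geometry init from VideoMV frames
    deform = DeformationNet()                                   # time-conditioned deformation field
    opt = torch.optim.Adam(
        list(gaussians.parameters()) + list(deform.parameters()), lr=1e-3
    )

    # Coarse stage: static images only shape the Gaussian attributes;
    # the deformation network is untouched here.
    for _ in range(coarse_steps):
        cam, gt = static_views.sample()
        loss = (render(gaussians, cam) - gt).abs().mean()       # simplified L1 rendering loss
        opt.zero_grad(); loss.backward(); opt.step()

    # Fine stage: only dynamic images from the 4D-aware diffusion model supervise
    # both the Gaussian attributes and the deformation network, so any residual
    # mismatch with the static initialization is optimized away.
    for _ in range(fine_steps):
        cam, t, gt = dynamic_views.sample()                     # viewpoint, timestamp, target image
        loss = (render(deform(gaussians, t), cam) - gt).abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return gaussians, deform
```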
Q2. More numerical and visual comparisons on novel views
We highly appreciate your comments at this point which have greatly contributed to enhancing the quality and clarity of our work. In response, we have conducted more extensive numerical comparisons and visual demonstrations on novel views, as detailed below.
1) Numerical Comparisons.
We first clarify the details regarding the numerical comparison in our main submission. As shown in Tab.1 of main.pdf, we evaluated both the images from the 4D-aware video diffusion model (Diffusion4D*) and the renderings from the constructed 4D assets (Diffusion4D). For the latter, we uniformly rendered 36 orbital viewpoints along the timeline as the targets (L252), which resulted in most of them being at novel azimuth angles distinct from the 24-frame videos. For this setting, our comprehensive evaluation covered a set of metrics including CLIP-O, SSIM, PSNR, LPIPS, and FVD. Additionally, we measured the novel front views using CLIP (CLIP-F in Tab.1), where we fixed the viewpoint at the front view and, similarly, uniformly rendered 36 images along the timeline as targets (L253). Thus, most of these images are at novel timestamps distinct from the 24-frame videos. We will clarify this in the refined version.
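For concreteness, the two target sets can be sampled as in the illustrative snippet below (helper names and normalized timestamps are assumptions, not our exact evaluation script): 36 orbital viewpoints whose azimuth and timestamp advance together, and 36 front-view renders that only advance the timestamp.

```python
import numpy as np

def orbital_targets(n=36, elevation=0.0):
    # azimuth and normalized timestamp advance together along the orbit,
    # so most targets fall at azimuth angles unseen in the 24-frame videos
    azimuths = np.linspace(0.0, 360.0, n, endpoint=False)
    times = np.linspace(0.0, 1.0, n)
    return [dict(azimuth=a, elevation=elevation, t=t) for a, t in zip(azimuths, times)]

def front_view_targets(n=36):
    # camera fixed at the front view; only the timestamp varies, so most targets
    # fall at timestamps unseen in the 24-frame videos
    times = np.linspace(0.0, 1.0, n)
    return [dict(azimuth=0.0, elevation=0.0, t=t) for t in times]
```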
To strengthen our quantitative evaluation, we used our validation set and further evaluated two sets of novel viewpoints with all the metrics. The first set is the previously used 36-frame front-view renderings (F); the second set is a new set of 36-frame orbital-view renderings with the elevation angle fixed at 30° (O-30°). For both sets, we used ground truth images rendered from the dynamic 3D assets in Objaverse as references. Metrics were computed between ground truth images and rendered images from the same viewpoints and timestamps. The results are shown in the table below:
| | CLIP-F↑ | SSIM-F↑ | PSNR-F↑ | LPIPS-F↓ | FVD-F↓ | CLIP-O-30°↑ | SSIM-O-30°↑ | PSNR-O-30°↑ | LPIPS-O-30°↓ | FVD-O-30°↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| (text) 4D-fy | 0.78 | - | - | - | - | 0.55 | - | - | - | - |
| (text) Animate124 | 0.75 | - | - | - | - | 0.51 | - | - | - | - |
| (text) Diffusion4D | 0.81 | - | - | - | - | 0.62 | - | - | - | - |
| (image) 4DGen | 0.84 | 0.72 | 15.1 | 0.28 | 691.5 | 0.66 | 0.66 | 13.9 | 0.34 | 760.5 |
| (image) Stag4D | 0.86 | 0.78 | 16.0 | 0.25 | 624.7 | 0.69 | 0.74 | 15.1 | 0.30 | 705.8 |
| (image) Diffusion4D | 0.89 | 0.80 | 16.2 | 0.24 | 594.5 | 0.72 | 0.81 | 16.5 | 0.23 | 526.4 |
Note that CLIP-F is adopted from Tab.1 in main.pdf. Comparing these results along with the results in Tab.1, we can observe the following:
- Our method consistently outperforms the baselines across all the metrics at the novel viewpoints.
- For the semantic metric (CLIP), all the methods perform better in front views than in orbital views.
- For the photometric metrics, the competitors (4DGen, Stag4D) perform better in front views than orbital views, while our method performs better in orbital views than front views. This is because the competitors involved the ground-truth front-view videos in their training for 4D construction, whereas our method utilized the orbital videos generated by our 4D-aware video diffusion model in 4D construction.
2) Visual comparisons.
First, we would like to clarify that each rendered video in the supplementary material consists of 160 frames around the constructed 4D asset, with most frames presenting novel viewpoints not seen in the 24-frame videos. Also, we kindly invite you to refer to Fig.A in rebuttal.pdf, where we provide renderings of samples from more novel views. In addition to the orbital view at an elevation angle of 0°, we also visualize renderings from an orbital view at an elevation angle of 30° and a monocular view obtained by fixing the camera at the front view.
Compared to the baselines, we can observe that our constructed 4D assets exhibit larger motion magnitude, more complex texture, smoother appearance, and more consistent geometry across multiple viewpoints. Please refer to global rebuttal (1) and (2) for detailed discussions.
We are willing to address any further questions or concerns you might have within this review window.
Thank you for your detailed explanation (it has solved my main concern). I will maintain 5: Borderline accept. The main reason is that it is the first to directly generate 4D videos, which I believe is an important and unsolved problem.
Dear Reviewer mXZW,
Thank you very much for taking the time to read our responses; we are grateful for your recognition and truly happy that our rebuttal resolves your concerns.
Should there be any ambiguities or further questions, please know that we are more than willing to provide clarity or delve deeper into any topic.
Best wishes, Authors of Paper 3414
The paper tackles 4D asset generation. Its pipeline consists of two parts: a 4D-aware video diffusion model, where orbital videos circling the 4D assets can be seamlessly generated, and a coarse-to-fine 4D construction stage afterward. The video diffusion model is trained on a novel dynamic dataset curated by the authors from Objaverse. The method can generate pretty high-quality 4D assets while being quite efficient compared with SDS-based methods.
Strengths
- The task of 4D asset generation is extremely challenging and well-motivated.
- The paper is nicely written, and ablation studies were done to justify its design choices.
- Evaluations were done with reasonable metrics and against SOTA methods.
- Tremendous engineering efforts can be observed from all the design choices, such as the motion magnitude conditioning and guidance, details about the 4D construction, etc.
Weaknesses
- The effectiveness of the Diffusion4D model heavily relies on a meticulously curated high-quality 4D dataset. The dependency on large-scale, high-quality datasets might limit its applicability in scenarios where such data is scarce or hard to generate.
- The current resolution and temporal sequence length significantly limit the quality and realism of the generated 4D content --- I could barely tell whether there is motion in many samples until after looking at them several times.
- The introduction of a novel 3D-to-4D motion magnitude metric and the corresponding reconstruction loss is innovative; however, the paper could explore in more detail how sensitive the model’s performance is to the tuning of these parameters. It’s also not clear how much the performance depends on the exact settings of these loss functions, which could be critical for users trying to replicate or build upon this work.
Questions
N/A
Limitations
Yes
Dear Reviewer BwxH, We would like to extend our great thanks for your time and effort in reviewing our work and providing insightful comments. We thank you for recognizing the values and contributions of our work, noting that the paper is nicely written, the solution to the challenging task is well-motivated, the experimental evaluations and ablation studies are comprehensive and convincing, and the generated assets are of pretty high quality. We are also grateful for your acknowledgment of our tremendous engineering efforts in high-quality data curation and experimental design.
We have carefully considered your feedback and would like to address your comments as follows:
Q1. Dependency on 4D dataset. We acknowledge that the success of our method relies on a curated high-quality dataset. To the best of our knowledge, our work is the first that trains a singular video diffusion model directly on 4D datasets for explicit synthesis of spatial-temporal consistent views of 4D assets. We adopt a data-driven optimization strategy that reduces the generation time from a couple of hours to several minutes. The data scarcity is an issue in this research field. To mitigate this problem, we made significant efforts to curate a high-quality 4D dataset from Objaverse. We believe that this dataset can make contributions to future research in this community.
Q2. Resolutions, temporal length, and large motion.
- Resolution and temporal length. We followed prior works [1, 12, 27, 34, 50, 52, 55] and chose a resolution of 256², considering GPU memory. With the same resolution, as shown by more examples in Fig.A of rebuttal.pdf, our method can generate assets with larger motion, more complex textures, clearer appearance, and more consistent geometry compared with baselines. We have adopted a longer temporal length of 24, compared to 16 in most prior works. We appreciate your feedback and are committed to exploring higher resolutions in the future study.
- Large motion. Please refer to Fig.3 in main.pdf and Fig.A in rebuttal.pdf for more demonstrations of our method and the baselines.
- In Fig.3 in main.pdf, as highlighted by the contours, our animations include cartoon characters and humans stepping forward, running or raising arms, birds flapping wings, and lights changing. Due to space constraints, for the baselines, we only showed two representative viewpoints at starting and ending timestamps. The objects’ poses at two timestamps are almost identical, as the baselines generated samples with invisible motions along the whole timeline. In contrast, the motions from our method are much more obvious and pronounced.
- In Fig.A in rebuttal.pdf, we provide more samples with larger motion magnitudes, including the pose change of a knight, flapping of butterfly wings, turtle swimming, the kid falling, etc. In addition to orbital views that illustrate the 3D geometry consistency, we also exhibit novel front views. The generated motions are much more apparent seen from the front view. We can observe that our model is capable of generating 4D assets with very large motions and great geometry consistency.
We believe that these demonstrations, along with the visualizations in the main submission, could provide a well-rounded view of our generation results and successfully address your concerns. We also kindly invite you to refer to global rebuttal-(1) and (2) for more details.
Q3. More analysis on the effect and the sensitivity of the design components.
Thank you for recognizing the novelty of the proposed 3D-to-4D motion magnitude metric (m) and the innovation of our motion magnitude reconstruction loss. We are pleased to receive your constructive comment and provide a more in-depth analysis of the effect and the sensitivity of the design components as follows.
- 3D-to-4D motion magnitude metric (m). In the ablation study, we conducted a qualitative analysis on the effect of m. By comparing (b) and (c) in Fig.5 of main.pdf, we can observe that increasing m could augment the dynamics of the kid. We also provide more examples in Fig.B of rebuttal.pdf. We can observe that increasing m augments the dynamics of the generated assets.
- Effect of the motion magnitude reconstruction loss. In the ablation study, we conducted both quantitative and qualitative analyses of its effect in Tab.2 and Fig.5 of main.pdf. We can observe that adding the motion magnitude reconstruction loss to the training objective is critical to animating the static objects and improving the photometric performance. We further evaluated its effect on model performance by monitoring the validation loss along the training process; please refer to Fig.C in rebuttal.pdf (a sketch of this validation probe is given after this list). Specifically, we trained two image-to-4D generation models under the same settings, except that one included the motion magnitude reconstruction loss and the other did not. At each training step, we evaluated the model's denoising capability on the validation set. We applied the same magnitude of noise (controlled by the diffusion timestep) to the clean validation set in the diffusion process, and computed the noise prediction loss and the motion magnitude reconstruction error (Eq.(2)) after denoising. As shown in Fig.C in rebuttal.pdf, training with the motion magnitude reconstruction loss leads to faster convergence in diffusion learning and better motion magnitude reconstruction. (Note that the curves are smoothed with a sliding window of size 5.) Additionally, we empirically found that our framework is robust to the setting of the weight w of this loss.
- For the explicit 4D construction stage, we followed [46] and adopted the same loss weights. We empirically found that the model performance is robust to the choice of weight for the term that serves as a regularizer.
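Below is a minimal sketch of the validation probe mentioned above, assuming a generic latent video diffusion model and scheduler; the names (`model`, `scheduler.add_noise`, `scheduler.predict_x0`) and the exact latent-space form of the motion magnitude are placeholders and assumptions, not our released code.

```python
import numpy as np
import torch
import torch.nn.functional as F

def motion_magnitude(z):
    # assumed latent-space metric: mean squared frame-to-frame difference
    # z: (B, T, C, H, W) video latents
    return (z[:, 1:] - z[:, :-1]).pow(2).mean(dim=(1, 2, 3, 4))

@torch.no_grad()
def validate(model, scheduler, val_latents, t_fixed=500):
    # apply the same magnitude of noise (fixed diffusion timestep) to every run
    noise = torch.randn_like(val_latents)
    t = torch.full((val_latents.shape[0],), t_fixed, device=val_latents.device)
    noisy = scheduler.add_noise(val_latents, noise, t)       # placeholder scheduler API
    pred_noise = model(noisy, t)
    denoised = scheduler.predict_x0(noisy, pred_noise, t)    # placeholder x0 reconstruction
    noise_loss = F.mse_loss(pred_noise, noise)
    mm_error = (motion_magnitude(denoised) - motion_magnitude(val_latents)).pow(2).mean()
    return noise_loss.item(), mm_error.item()

def smooth(curve, window=5):
    # sliding-window smoothing used for the curves in Fig.C
    return np.convolve(np.asarray(curve), np.ones(window) / window, mode="valid")
```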
We will release our codebase soon to facilitate replication and further development upon our work.
The paper presents Diffusion4D, a novel method for generating 4D content that maintains spatial and temporal consistency. It improves upon previous approaches by embedding spatial-temporal consistency within a single video diffusion model, enabling efficient creation of 4D assets. The technique uses a curated dataset of high-quality dynamic 3D objects to train a 4D-aware video diffusion model, which can synthesize orbital views of moving 3D assets. The method outperforms existing techniques in terms of generation speed and 4D geometry consistency.
Strengths
- This paper is well-written and easy to follow
- The proposed subset of objaverse and objaverse-xl is valuable for subsequent research
- The quantitative experiments presented in the main text demonstrate the superiority of the proposed method.
Weaknesses
- The qualitative comparisons are not that convincing; the generated motions are quite small and simple. The authors are encouraged to provide more experimental results in the rebuttal.
Questions
- The qualitative comparisons are not that convincing; the generated motions are quite small and simple. The authors are encouraged to provide more experimental results in the rebuttal.
Limitations
Yes
Dear Reviewer fabU, thank you very much for taking the time to review our work and providing constructive feedback. We also thank you for recognizing our paper with novel method, valuable data contribution for subsequent research, good writing, and superior quantitative results. We would like to take this opportunity to address your comments regarding qualitative comparisons.
Provide more qualitative comparisons 1) Comparisons with baselines. Please refer to Fig.3 in main.pdf and Fig. A in rebuttal.pdf for qualitative comparisons between our method and the baselines.
- Larger motion magnitude. In Fig.3, due to space constraints, we only showed two viewpoints at the starting and ending timestamps as representatives for the baselines. It is evident that the objects’ poses at the two timestamps are almost identical, as the baselines generated samples with almost invisible motions along the whole timeline. In contrast, the motions from our method are much more obvious and pronounced. Also, in Fig.A in rebuttal.pdf, we show more cases with large motions, including the pose change of a knight, the flapping of butterfly wings, a turtle swimming, a kid falling, and the arm movements of a cartoon dog and Mario. Our method is capable of generating 4D assets with much larger motions.
- More complex texture, smoother appearance, and more consistent geometry. For the generation details, by zooming in, we can observe that the baselines suffer from blurry texture, noisy appearances, excessive artifacts, and incomplete geometry. On the other hand, our method provides more complex texture, smoother appearances, and more consistent geometry. These phenomena can also be observed in both Fig.3 in main.pdf and Fig.A in rebuttal.pdf.
2) Consistent novel views.
- In Fig.A in rebuttal.pdf, in addition to orbital views with elevation angle of 0°, we also provide more novel views by rendering images from the constructed 4D assets. We show orbital views with elevation angle of 30° and front view images along the timeline. The motions are much more apparent by fixing the viewpoint at the front view. We can observe that the constructed 4D assets demonstrate large motions, coherent appearance, and consistent geometries across timestamps and viewpoints.
We believe that these demonstrations, along with the visualizations in the main submission, provide a well-rounded view of our generation results and successfully address your concerns. We are pleased to offer more video demonstrations if policy allows. Thank you again!
Thanks for your reply. I will maintain my original score.
Dear Reviewer fabU,
Thank you very much for taking the time to read our rebuttal. We are very happy that our rebuttal resolves your main concerns, and we are very grateful for your recognition.
Since today is the last day for author/review discussion, if you have further questions or queries, we are ready and more than willing to answer them promptly today.
Best wishes, Authors of Paper 3414
This paper studies 4D object generation. Instead of using SDS, this paper proposes to use a video diffusion model for generating multi-view images first; a 4D Gaussian Splatting representation is then learned from the multi-view images. To facilitate the training of the video diffusion model, this paper uses a filtered subset of Objaverse as pretraining data. It also adds a motion magnitude embedding to the diffusion model as an extra condition for guiding the generation process.
The developed framework, Diffusion4D, is able to generate 4D objects from various conditions: image, text, and static 3D. It is much faster than previous SDS-based methods and achieves better results on rendering metrics and a human preference benchmark.
Strengths
This paper has a few strengths:
- It tries to solve the slow convergence problem of SDS, and proposes a solution based on multi-view images generation and Gaussian splatting re-training.
- To address temporal consistency, this paper adopts a video diffusion model for generating the multi-view images.
- A motion magnitude metric is developed as metric and also extra condition to the framework for improving the performance.
- The proposed framework is able to generate 4D objects from various conditions, including image, text, and static 3D.
- The proposed framework also achieves better results on generation metrics and human preference.
- The authors curated a filtered subset from objaverse, which will be released to the community of 4D generation research.
Weaknesses
In my opinion, I found below things in this paper can be improved:
- Unclear and insufficient comparison to related work in Sec.2. 4D content generation is an active research field, and multiple recent works are mentioned in Sec.2. However, I found it unclear 1) what the differences are between the proposed method and these related works, and 2) why aren't all of them considered in the experiments? A comparison table that clearly lists the pros and cons of the proposed method versus all related works would greatly improve the readability. The authors might even consider including some concurrent arXiv works such as L4GM/DreamGaussian4D/Gaussian4D/etc (not required though).
- Limited visual quality of the results. When I check the results in both the paper and the supplementary material (video), I find the quality limited in three ways: 1) the rotating videos show that most generated objects still suffer from view/time inconsistency. 2) The generated objects lack local fine details and usually have very simplified texture. Also, the local details are not sharp and often look blurred. The 256x256 resolution might be one factor in this. 3) In the figures of this paper, the differences between the proposed method and the baselines are also hard to see. Overall, the generated objects are of low resolution and lack fine details. The inconsistency is also obvious in the videos.
- The method section isn't very clear. Specifically
- In Eq.1, m takes as input a video, but in Eq.2, m takes as input one latent z. What does m do exactly? And how is the motion magnitude embedding extracted from m?
- In Eq.2, both m terms in the loss have the same offset; isn't this loss equivalent to a simpler form without the offset? Why do we need to write it in the current form, where two squared terms appear inside a squared term?
- Visually, the effect of the motion magnitude guidance term isn't clear. It is unknown to me which artifacts this term helps to address and how it improves the overall results.
- The entire coarse-to-fine 4D construction seems to be a model-specific component, proposed to address the issues of 4D Gaussian Splatting (4DGS). Is this module really necessary if a different 4D representation is used? Or what is the benefit of using 4DGS over other methods?
Questions
- For different conditions such as text, image, and static-3D, are different diffusion model used for each input?
- Limiting the camera to only 1 DOF (azimuth angle) seems unreasonable, especially during test time. For the generated 3D/4D content, there is no reason to evaluate the rendering results only at a fixed elevation and distance.
- The backbone/pre-trained diffusion models are extremely important, and I'm wondering why the current set of models (Sec.4.1) was selected, and what the insights and benefits are.
- Why filter out the objects with big motion in the dataset? These data are the challenging ones and should serve as hard examples for the proposed method. A good dataset should have a long-tail distribution of these hard examples. Removing them from the data might make the established benchmark unfair.
Limitations
Yes, the authors discussed limitations in the Appendix, such as the limited spatial-temporal resolutions.
Dear Reviewer G7WS, we would like to thank you for taking the time to review our work and providing valuable feedback. We also thank you for recognizing the importance of our targeted issue, the effectiveness and versatility of our framework, and the value of our curated dataset.
We thoroughly considered your questions, and we are committed to addressing all your concerns as follows:
Q1. Comparison to related works:
- We have elaborated the motivations of our method and its distinctions from prior works in the 2nd paragraph of Sec.1 and L99, 102-106 in Sec.2. Also, in response to your request, we provide a clearer summarization of prior 4D generation works in global rebuttal-(3). Our approach differs from priors in terms of optimization and consistency enhancement strategies. The innovative design of 4D-aware video diffusion model enables us to swiftly generate spatial-temporal consistent 4D assets. Please refer to global rebuttal-(3) for more details.
- We compared our method with the most relevant, latest, and well-recognized works. For text-to-4D, we compared with 4D-fy (CVPR 2024) and Animate124 (arXiv, Nov. 2023). For image-to-4D, we compared with Stag4D (ECCV 2024) and 4DGen (arXiv, Dec. 2023). Notably, Stag4D claimed better performance than DreamGaussian4D (arXiv, Dec. 2023) in Fig.5 of its Sec.4.3. L4GM (arXiv, June 2024) focused on the video-to-4D generation task and was released after the NeurIPS submission deadline, making it impossible for us to include a comparison.
Q2. Visual quality
- For generation consistency, we quantitatively assessed the results with FVD and conducted qualitative comparisons via an extensive user study. Results in Tab.1 in main.pdf comprehensively demonstrate our superiority over baselines in geometry consistency. We can observe from the supplementary videos that ‘Diffusion4D’ demonstrates consistent 3D geometry, with appearances and motions exhibiting coherence and smoothness across 360°. This can be attributed to the extensive spatial-temporal attention in the video model architecture.
- We trained our model on the Objaverse dataset consisting of simple objects, which is the reason for the simplified texture. The 256² resolution was also adopted in prior works [1, 12, 27, 34, 48, 50, 52, 55]. At the same resolution, as shown in Fig.A in rebuttal.pdf, our method can generate assets with more complex textures, clearer appearance, and more consistent geometry compared with the baselines. We appreciate your feedback and are committed to exploring more complex textures and higher resolutions in future studies.
- For qualitative comparison with baselines, we provide a detailed explanation in global rebuttal-(1) and (2). Please refer to global rebuttal-(1) (2) and Fig.A in rebuttal.pdf for more details.
Q3. Method section details
- m is the proposed 3D-to-4D motion magnitude metric, and it can be evaluated in both pixel space (Eq.1) and latent space (Eq.2). In Eq.1, m is computed by averaging the differences across the T images in a video; in Eq.2, m is computed in the same way over the T image latents in z. These two operations are technically consistent, so we use the same notation; we will revise the notation to avoid confusion. The motion magnitude is embedded in the same way as the timestamp embedding: we project the motion magnitude into a positional embedding followed by two MLP layers (a minimal sketch is given after this list).
- As z is a matrix, the mathematical form in the paper is computationally different from the suggested form. We derive the loss in this form to encourage the model to reconstruct the motion magnitude explicitly. We empirically found that the proposed form leads to better learning of the motion magnitude.
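Below is a minimal sketch of the metric and its embedding, assuming a sinusoidal positional embedding as used for diffusion timestamps; the exact difference operator, normalization, and embedding width are illustrative assumptions rather than the paper's implementation.

```python
import math
import torch
import torch.nn as nn

def motion_magnitude(frames):
    # frames: (T, C, H, W) rendered images (Eq.1) or latents (Eq.2);
    # average the (here: squared) difference between consecutive frames over the video
    return (frames[1:] - frames[:-1]).pow(2).mean()

class MotionMagnitudeEmbedding(nn.Module):
    """Embeds the scalar motion magnitude like a diffusion timestamp:
    a sinusoidal positional embedding followed by two MLP layers."""

    def __init__(self, dim=256):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, m):
        # m: (B,) batch of motion magnitudes
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(10000.0) * torch.arange(half, device=m.device, dtype=torch.float32) / half
        )
        angles = m.float()[:, None] * freqs[None, :]
        emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.mlp(emb)  # added to the UNet conditioning alongside the timestamp embedding
```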
Q4. Effect of motion magnitude guidance. In ablation study, comparing (b) and (c) in Fig.5, we can observe that increasing m augments the dynamics of the kid. In Fig.B in rebuttal.pdf, we show more cases with varying motion magnitudes. We can observe that increasing m augments the dynamics of the generated assets.
Q5. Benefit of 4DGS. Many previous works [37,46,49] have demonstrated that Gaussian Splatting offers an explicit representation, better construction details, faster rendering, and lower memory usage compared to NeRF. Therefore, to achieve fast and high-quality 4D generation, we select 4DGS for 4D construction.
Q6. More questions
- Different conditions. Yes, we trained different diffusion models for distinct input modalities. The flexibility of the video diffusion model architecture allows our framework to accommodate various prompt modalities with minor modification. Please refer to L196-L208 for more details.
- Camera pose settings during training and evaluation. Generating consistent 4D assets across 360 degrees is a challenging task. Therefore, in the training stage, we mitigate the modeling burden on the video diffusion model by fixing the elevation angle and distance, making the model focus on learning the motion and geometry changes along orbital views. The same training settings can be observed in prior works [27,52,56]. In the test stage, we also adhered to the evaluation settings in prior works [1, 12, 34, 50, 52, 55] and rendered from the constructed 4D assets with a fixed elevation and distance.
- We selected VideoMV as it is a 3D-geometry-aware video diffusion model. We did extensive testing experiments and demonstrated that this model is stable and reliable in generating consistent orbital videos of static 3D assets.
- Data curation details. We want to clarify that we do not filter out ‘big-motion’ assets. In the final curation stage, we remove cases with dramatic location changes where the objects exit the scene boundaries (leaving only partial parts within the scene). Such instances would deteriorate the model in learning stable and reasonable motions. We will clarify this point in the revised version.
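A minimal sketch of this filtering rule is given below, under the assumption that per-frame object bounding boxes are available from rendering metadata; the helper names and scene-bound values are hypothetical, not the released curation script.

```python
import numpy as np

def stays_in_bounds(per_frame_bboxes, scene_min=-1.0, scene_max=1.0):
    # per_frame_bboxes: (T, 2, 3) per-frame axis-aligned (min, max) corners of the object
    bboxes = np.asarray(per_frame_bboxes)
    inside = (bboxes[:, 0] >= scene_min).all(axis=-1) & (bboxes[:, 1] <= scene_max).all(axis=-1)
    return bool(inside.all())

def filter_assets(assets):
    # keep an animated asset only if its object never exits the scene boundaries,
    # regardless of how large its motion is within those boundaries
    return [a for a in assets if stays_in_bounds(a["bboxes"])]
```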
We are willing to reply to your further questions in this time window if you have any feedback.
Following Q.6.2, to strengthen our quantitative evaluation during test time, we used our validation set and further evaluated two sets of novel viewpoints with metrics CLIP, SSIM, PSNR, LPIPS, and FVD.
- For the first set, we fixed the camera at the front view, and uniformly rendered 36 images along the timeline as the targets. Note that in our main evaluations, we have measured the same set (L253) with CLIP score (CLIP-F in Tab.1 of main.pdf).
- For the second set, we used a fixed distance and an elevation angle of 30°, and uniformly rendered 36 orbital images along the timeline as the targets (O-30°).
For both sets, we used ground truth images rendered from the dynamic 3D assets in Objaverse as references. Metrics were computed between ground truth images and rendered images from the same viewpoints and timestamps. The results are shown in the table below:
| | CLIP-F↑ | SSIM-F↑ | PSNR-F↑ | LPIPS-F↓ | FVD-F↓ | CLIP-O-30°↑ | SSIM-O-30°↑ | PSNR-O-30°↑ | LPIPS-O-30°↓ | FVD-O-30°↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| (text) 4D-fy | 0.78 | - | - | - | - | 0.55 | - | - | - | - |
| (text) Animate124 | 0.75 | - | - | - | - | 0.51 | - | - | - | - |
| (text) Diffusion4D | 0.81 | - | - | - | - | 0.62 | - | - | - | - |
| (image) 4DGen | 0.84 | 0.72 | 15.1 | 0.28 | 691.5 | 0.66 | 0.66 | 13.9 | 0.34 | 760.5 |
| (image) Stag4D | 0.86 | 0.78 | 16.0 | 0.25 | 624.7 | 0.69 | 0.74 | 15.1 | 0.30 | 705.8 |
| (image) Diffusion4D | 0.89 | 0.80 | 16.2 | 0.24 | 594.5 | 0.72 | 0.81 | 16.5 | 0.23 | 526.4 |
Note that CLIP-F is adopted from Tab.1 in main.pdf. Comparing these results along with the results in Tab.1, we can observe that our method consistently outperforms the baselines across all the metrics at the novel viewpoints.
The authors have done a great job answering my doubts and questions. The provided additional details are very informative and convinced me to raise my rating.
Dear Reviewer G7WS,
Thank you very much for taking the time to read our rebuttal and we are very grateful for your recognition. We are truly happy that our rebuttal resolves your concerns and that you raised the rating.
Should there be any ambiguities or further questions, we are more than willing to provide clarity or delve deeper into any topic.
Best wishes, Authors of Paper 3414
Dear Area Chair and Reviewers,
We would like to express our sincere gratitude for your valuable time and efforts in reviewing our work and providing insightful feedback. We are encouraged that reviewers find that our paper is well-written (fabU, BwxH), the usage of video diffusion model to address inefficiency and inconsistency issues is novel and clearly illustrated (G7WS, fabU, mXZW), the introduction of 3D-to-4D motion magnitude metric is innovative (BwxH), and the experimental evaluations are comprehensive and convincing (fabU, BwxH). We are also pleased that all reviewers recognize the value of our curated 4D dataset in facilitating future research in the community.
Primarily, the reviewers requested more visualizations of the generated assets, in terms of comparisons with baselines and novel views. Therefore, we provided more cases in Fig.A in rebuttal.pdf and elaborated in detail as follows:
(1) Comparisons with baselines.
- Larger motion magnitude. In Fig.3 of main.pdf and Fig.A of rebuttal.pdf, we provided a qualitative comparison between our method and baselines. In Fig.3, due to space constraints, for the baselines, we only showed two representative viewpoints at starting and ending timestamps. It is evident that the objects’ poses at two timestamps are almost identical, as they generated samples with invisible motions along the whole timeline. In contrast, the motions from our method are much more obvious and pronounced. Additionally, in Fig.A, we show more cases with large motions, including the pose change of a knight, the flapping of butterfly wings, a turtle swimming, a kid falling, and the arm movements of a cartoon dog and Mario. Our method is capable of generating 4D assets with significantly larger motions.
- More complex texture, smoother appearance, and more consistent geometry. For the generation details, by zooming in, we can observe that the baselines suffer from blurry texture, noisy appearances, excessive artifacts, and incomplete geometry. On the other hand, our method provides more complex texture, smoother appearances, and more consistent geometry. These phenomena can also be observed in both Fig.3 and Fig.A.
- More efficient generation and superior quantitative performance. Also, in Tab.1 of main.pdf, extensive quantitative comparisons and human preference evaluations suggest the superiority and efficiency of our method, which can produce more favorable and high-quality 4D assets within several minutes, compared with the baselines that require hours of optimization.
(2) Consistent novel views.
In Fig.A in rebuttal.pdf, we provided more samples with large motion magnitudes from more novel views. In addition to our default setting of orbital views with an elevation angle of 0°, we also rendered more novel views from the constructed 4D assets, including orbital views with an elevation angle of 30° and front views. We can observe that the constructed 4D assets demonstrate large motions, coherent appearance, and consistent geometries across timestamps and viewpoints. The motions are more apparent when fixing the camera at the front view.
We believe that these demonstrations, along with the visualizations in the main submission, could provide a well-rounded view of our generations. And we are pleased to offer more video demonstrations if policy allows.
(3) Summarization of related 4D generation works.
To give a clearer illustration of the methodological differences between our work and previous works, we provide a summarization of prior 4D generation works in the table below. Our approach differs from previous methods in terms of optimization and consistency enhancement strategies. Most works depend on 2D/3D diffusion priors and use SDS optimization, resulting in hours of optimization cost, including some text-to-4D generation works (4D-fy [1], Animate124 [55]) and image/video-to-4D generation works (4DGen [50], DreamGaussian4D [27], Stag4D [52]). Concurrent arXiv work (Diffusion^2 [48]) eliminates SDS and combines multiple diffusion priors for joint denoising, but it ignores geometry consistency in the generation process. In contrast, our work a) avoids SDS optimization and largely reduces the generation time from hours to several minutes; and b) utilizes a singular video diffusion model to ensure spatial-temporal consistency. To the best of our knowledge, this is the first work training on 4D datasets to generate 4D assets, which can be termed a '4D Native' method [57]. Our framework is also the first to integrate 4D spatial-temporal consistency into a singular video diffusion model, enabling the generation of multi-timestamp cross-view supervision in one shot.
| | 4D-fy[1] | Animate124[55] | 4DGen[50] | DreamGaussian4D[27] | Stag4D[52] | Diffusion^2[48] | Diffusion4D |
|---|---|---|---|---|---|---|---|
| Category | 2D/3D Priors | 2D/3D Priors | 2D/3D Priors | 2D/3D Priors | 2D/3D Priors | 2D/3D Priors | 4D Native |
| Text-to-4D generation | √ | √ | × | × | √ | × | √ |
| Image-to-4D generation | × | √ | √ | √ | √ | √ | √ |
| SDS optimization | √ | √ | √ | √ | √ | × | × |
| Training on 4D data | × | × | × | × | × | × | √ |
| # of diffusion models used | 3 | 3 | 2 | 2 | 2 | 2 | 1 |
In the following, we provide more experiments and explanations to address all the questions and concerns raised by the reviewers. And we will incorporate them in the revised version.
Best regards,
Paper 3414 Authors
[57] Liu J., Huang X., Huang T., et al. A comprehensive survey on 3D content generation. arXiv preprint arXiv:2402.01166, 2024.
Thank you to the Area Chair and all reviewers for taking the time to review our rebuttal. Here, we provide a recap of the method and contributions of our work.
In this work, we focused on addressing two critical limitations, i.e. efficiency and consistency, in prior 4D generation works. We proposed a novel framework, Diffusion4D, capable of swiftly generating high-quality and spatial-temporal consistent 4D assets within just several minutes, compared to previous methods that require several hours.
To the best of our knowledge, this is the first endeavor to adapt a video diffusion model and train it on 4D datasets for explicit novel view synthesis of 4D assets. Leveraging the flexibility of the video diffusion model architecture, our framework can seamlessly accommodate various prompt modalities. We also introduced the 3D-to-4D motion magnitude metric to enable explicit control over the dynamic strength and the motion magnitude reconstruction loss to improve motion learning. Extensive photometric and human evaluations demonstrate that the proposed framework outperforms baselines in terms of generation efficiency, motion fidelity, appearance quality, and geometry consistency.
Additionally, we curate and release a high-quality 4D dataset. Enormous effort was dedicated to curating the dataset: with 8×A100 GPUs and 16 threads, it took 5.5 days to render the curated Objaverse-1.0 subset and about 30 days for the Objaverse-XL subset. To strictly guarantee the data quality, before releasing the dataset, we also manually examined and removed low-quality assets. We believe this dataset can contribute greatly to future research in this community.
Dear Reviewers,
Firstly, we would like to express our sincere thanks for the time and effort you’ve spent reviewing our work. We have endeavored to incorporate the suggested experiments and elaborate on our methods based on your insightful comments. And we eagerly anticipate your feedback on our response. If there are any ambiguities or further questions, we are more than willing to provide clarity or delve deeper into any topic. Your continued guidance is greatly appreciated.
Warm regards, Authors of Paper 3414.
We would like to express our sincere gratitude to all the reviewers for the insightful comments, active feedback, and engaged discussion. We have carefully responded to all the questions raised by reviewers and endeavored to incorporate comprehensive experiments according to the suggestions. We believe that these experiments and responses successfully address all the concerns and comments raised by the reviewers. We are glad that the merits of our work have been acknowledged by all the reviewers:
- This is the first work to adapt a video diffusion model and train on 4D datasets for explicit novel view synthesis of 4D assets.
- Our method is capable of swiftly generating high-quality 4D assets with larger motion magnitude, more complex texture, smoother appearance, and more consistent geometry compared with baselines.
- The designed components in our framework are innovative and effective. The framework is versatile and can seamlessly accommodate various prompt modalities.
- A large-scale high-quality 4D dataset is curated that is valuable in facilitating future research in the community.
We deeply appreciate that we received positive feedback from all the reviewers. Thank you again for your insightful comments, valuable time, and commitment to the review process.
Warm regards,
Authors of Paper 3414
All the reviewers are positive about the paper and recommend acceptance. The AC agrees with the decision. Please read the weakness comments carefully and prepare the camera-ready.