PaperHub
Overall rating: 4.8 / 10 (Poster, 4 reviewers; min 4, max 5, std 0.4)
Individual ratings: 5, 5, 4, 5 (average 4.8)
Confidence / Correctness 2.5 / Contribution 2.5 / Presentation 2.5
NeurIPS 2024

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We present a novel 4D generation pipeline, 4Diffusion, to create high-quality, spatial-temporally consistent 4D content from a monocular video.

Abstract

Current 4D generation methods have achieved noteworthy efficacy with the aid of advanced diffusion generative models. However, these methods lack multi-view spatial-temporal modeling and encounter challenges in integrating diverse prior knowledge from multiple diffusion models, resulting in inconsistent temporal appearance and flickers. In this paper, we propose a novel 4D generation pipeline, namely $4Diffusion$, aimed at generating spatial-temporally consistent 4D content from a monocular video. We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. After training on a curated dataset, our diffusion model acquires reasonable temporal consistency and inherently preserves the generalizability and spatial consistency of the 3D-aware diffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling loss, which is based on our multi-view video diffusion model, to optimize 4D representation parameterized by dynamic NeRF. This aims to eliminate discrepancies arising from multiple diffusion models, allowing for generating spatial-temporally consistent 4D content. Moreover, we devise an anchor loss to enhance the appearance details and facilitate the learning of dynamic NeRF. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance compared to previous methods.
Keywords
Diffusion Model · 4D Generation · NeRF

Reviews and Discussion

Review (Rating: 5)

The paper proposes a 4D generation pipeline, namely 4Diffusion, aimed at generating spatial-temporally consistent 4D content from a monocular video. The authors design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. After training on a curated dataset, the diffusion model acquires reasonable temporal consistency and inherently preserves the generalizability and spatial consistency of the 3D-aware diffusion model.

Strengths

  1. The paper proposes generating multi-view videos to guide 4D generation, which is a reasonable direction.
  2. The paper is easy to follow.

Weaknesses

  1. The paper's novelty wouldn't be its biggest strength, but training a multi-view video module is a good direction, so this point moderately passes the bar of NeurIPS.
  2. The results look temporally inconsistent (color flickering); see the frog man's eye and the wolf's (w/ rider) tail. This might be due to insufficient training samples.
  3. As another NeRF-based model, the results are not much better than Consistent4D.

Questions

  1. The improvement from this multi-view generation model is limited; the colors are still flickering, especially in the generated multi-view frames.

  2. I wonder if the limited consistency is due to an insufficient amount of training data.

Limitations

The paper hasn't shown the full potential of multi-view video for 4D generation; sourcing more synthetic data is important to improve the results.

Author Response

We sincerely thank you for providing insightful comments. Below, we address your constructive comments individually.

W1: The paper's novelty wouldn't be its biggest strength, but training a multi-view video module is a good direction, so this point moderately passes the bar of NeurIPS.

Thank you for your recognition. Our paper primarily focuses on generating 4D content using our multi-view video diffusion model. Although the multi-view video diffusion model is an important contribution of our method, it can only generate four orthogonal viewpoints at a time. This restriction makes it challenging to produce consistent videos from any novel viewpoint, which is essential for 4D generation. Therefore, the 4D-aware SDS loss and anchor loss are also important for our model, which can enhance the spatial and temporal consistency of the generated content, allowing for effective rendering from any novel viewpoint across the temporal dimension.
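
For context, a generic score distillation gradient of the kind our 4D-aware SDS loss builds on can be written as follows (the notation here is illustrative rather than the exact formulation in the paper):

$$
\nabla_\theta \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(x_t;\, c,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right],
$$

where $x$ denotes multi-view frames rendered from the dynamic NeRF with parameters $\theta$, $x_t$ is the noised rendering at diffusion timestep $t$, $\epsilon_\phi$ is the noise predicted by the multi-view video diffusion model conditioned on $c$ (the input video, text prompt, and camera poses), and $w(t)$ is a timestep weighting function.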

W2: The results look temporally inconsistent (color flickering); see the frog man's eye and the wolf's (w/ rider) tail. This might be due to insufficient training samples.

While we acknowledge that there is room for improvement in our results, our method still outperforms the baselines, demonstrating its effectiveness. Additionally, we have conducted further experiments to assess our method more thoroughly. We select 5 multi-view videos that are not included in the training data as test data. The following table presents the results, where metrics such as FVD, LPIPS, and PSNR clearly demonstrate that our method significantly outperforms the baselines.

Model            CLIP-I↑   CLIP-C↑   FVD↓      LPIPS↓   PSNR↑
Consistent4D     0.9216    0.9723    706.07    0.1593   16.70
DreamGaussian4D  0.8898    0.9710    760.18    0.1793   15.97
4D-fy            0.8658    0.9487    1042.3    0.2254   14.24
Ours             0.9310    0.9798    417.63    0.1199   19.07

Here, we use ground truth videos for novel viewpoints to compute these metrics. As illustrated in the first two rows of Fig. R3 in the attached PDF, Consistent4D and DreamGaussian4D encounter the multi-face problem, while our method generates spatial-temporally consistent content.

We believe that curating more high-quality multi-view video datasets for training our multi-view video diffusion model could further enhance performance. However, there were very few high-quality multi-view video datasets available at the time of our project. We made every effort to curate as many datasets as possible to train our model.

W3: As another NeRF-based model, the results are not much better than Consistent4D.

This weakness is related to W2. We have conducted further experiments to evaluate our method. As shown in the table above and Fig. R3 in the attached PDF, our method significantly outperforms the Consistent4D and other baselines. As illustrated in the first two rows of Fig.R3 in the attached PDF, Consistent4D and DreamGaussian4D encounter the multi-face problem. Consequently, when computing metrics between the input video and synthesized videos, as shown in Tab.1 of the main paper, the performance gap between our method and the baselines may not be apparent.

Q1: The improvement from this multi-view generation model is limited; the colors are still flickering, especially in the generated multi-view frames.

This question is related to W2. Please refer to W2 for a detailed response.

Q2: I wonder if the limited consistency is due to an insufficient amount of training data. The paper hasn't shown the full potential of multi-view video for 4D generation; sourcing more synthetic data is important to improve the results.

We believe that curating more high-quality multi-view video datasets for training our multi-view video diffusion model could indeed enhance performance. However, there were very few high-quality multi-view video datasets available at the time of our project. We made every effort to curate as many datasets as possible to train our model. Moreover, we believe that leveraging more robust and powerful 4D representations could further improve results. We plan to explore these enhancements in future work.

Review (Rating: 5)

The paper proposes a 3D-aware diffusion model trained on a curated 4D dataset for video-to-4D generation. A 4D-aware Score Distillation Sampling loss is introduced to optimize a 4D representation parameterized by dynamic NeRF. The proposed framework outperforms optimization-based baselines.

Strengths

  • A new subset of animatable Objaverse is presented and improves the model's generation ability as shown in Table 2.
  • The proposed 4D diffusion model outperforms optimization-based baselines.

Weaknesses

  • The proposed method requires 12 hours on A100, which is significantly longer than baselines.
  • I am confused by the quantitative evaluation design as mentioned in L244. "we calculate FVD between the input video and synthesized videos to evaluate the video quality". What's the purpose of comparing against input videos? If my understanding is correct, the ground truth 4D objects from Objaverse can be rendered into ground truth videos and can be used to calculate FVD, right?
  • The proposed anchor loss seems very similar to the ones proposed in 4DGen[62] and is not properly discussed in L200.
  • The back views in Fig. 5 seem very blurry and contain transparent edges. What might be the reason for these artifacts? Is this because of SDS loss?
  • I am mainly concerned about the test set construction. Does this manually filtered training set overlap with Consistent4D test set? Many of the Consistent4D test sets are originally from Sketchfab.

Questions

  • I appreciate the authors' effort in manually filtering the dataset. I'm interested in how many objects there were before filtering down to the 966 objects?
  • How are the input videos for evaluation constructed? Are they ground truth videos from the Objaverse dataset?
  • What are the reference videos for calculating the FVD metrics? Are they ground truth videos from the Objaverse dataset? For table 2, are the reference videos the same?

Limitations

Please refer to the weaknesses above.

Author Response

We sincerely thank you for providing insightful suggestions. Below, we address your constructive comments individually.

W1: The proposed method requires 12 hours on A100, which is significantly longer than baselines.

Our method focuses on generating high-quality, spatial-temporally consistent 4D content. Therefore, we use dynamic NeRF, which is compact and high-capacity. However, it involves time-consuming volume rendering that extends the optimization time compared to other methods. Although 4DGS can significantly reduce optimization time, it may produce blurred appearances and inaccurate geometries due to the explicit characteristics of Gaussians, as illustrated in Fig.4.

W2: The quantitative evaluation design.

Sorry for the confusion. Here, we follow the setting of Consistent4D. For each generated 4D content, it is challenging to accurately assess video quality when computing FVD with ground truth videos rendered from Objaverse due to the potential distribution gap. Therefore, we use the input video as the reference video to compute the FVD, which provides a more precise evaluation for each test case.
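
For readers unfamiliar with the metric, FVD is the Fréchet distance between Gaussians fitted to video features (typically from a pretrained I3D network). A minimal sketch of the distance step, assuming the feature extraction has already been done; the function below is illustrative and not our evaluation code:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two feature sets.

    feats_a, feats_b: arrays of shape (num_videos, feature_dim),
    e.g. I3D features of reference videos and of generated videos.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # sqrtm can return tiny imaginary parts due to numerical error.
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):
        cov_sqrt = cov_sqrt.real

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))
```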

W3: The proposed anchor loss seems very similar to the ones proposed in 4DGen[62] and is not properly discussed in L200.

Thanks for your important comments. Our anchor loss differs from 4DGen in several key aspects:

  • Model and Consistency: 4DGen employs a 3D-aware diffusion model, SyncDreamer, to generate multi-view images for each frame of the input video. This approach may lead to temporal inconsistencies. In contrast, our method uses a multi-view video diffusion model to produce videos with improved spatial-temporal consistency, resulting in better overall performance for 4D generation.
  • Viewpoint Selection: 4DGen utilizes all viewpoints generated by SyncDreamer to supervise the optimization process. However, viewpoints that are far from the input video may have lower quality, potentially degrading performance. To mitigate this issue, we select the viewpoint closest to the input video as the anchor. This strategy ensures that the anchor video maintains high quality and consistency as illustrated in Fig.R2 in the attached PDF.

We will discuss this in our revised version.
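
For illustration, a minimal sketch of the anchor-view selection and a simple pixel-space anchor term; the helper names and the plain L2 form are assumptions for exposition, not our exact implementation:

```python
import torch

def pick_anchor_view(view_azimuths_deg, input_azimuth_deg=0.0):
    """Pick the generated viewpoint whose azimuth is closest to the input video."""
    diffs = [abs((a - input_azimuth_deg + 180.0) % 360.0 - 180.0) for a in view_azimuths_deg]
    return int(torch.tensor(diffs).argmin())

def anchor_loss(rendered_frames, anchor_frames):
    """Simple pixel-space anchor term; both tensors have shape (T, 3, H, W)."""
    return torch.mean((rendered_frames - anchor_frames) ** 2)

# Example: four orthogonal viewpoints generated by the multi-view video model.
anchor_idx = pick_anchor_view([0.0, 90.0, 180.0, 270.0], input_azimuth_deg=5.0)  # -> 0
```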

W4: What might be the reason for these artifacts in Fig.5? Is this because of SDS loss?

Figure 5 presents the multi-view video generation results from 4DM, where DDIM was used for sampling. It is important to note that SDS loss was not applied during the sampling process, so the observed artifacts are not due to SDS loss. Despite these artifacts, our approach achieves results that are comparable to, and even better than, those obtained with ImageDream as shown in Tab.2. We believe these artifacts may be attributable to limitations in the base model, which also produces results with transparent edges. Using a stronger base model could potentially enhance performance.

W5: Main concern: Test set construction.

Sorry for making these unclear. During the training of our multi-view video diffusion model, we ensure that our training data does not overlap with the test data used for evaluating our method.

Q1: How many objects were there before filtering out the 966 objects?

Thank you for your appreciation. Before filtering, the dataset contained a total of 44,000 objects, which represents the number of animated shapes from Objaverse 1.0.

Q2: How are the input videos for evaluation constructed? Are they ground truth videos from the Objaverse dataset?

Specifically, for the experiments in Section 4.1, we use 3 real-world videos and 3 synthetic videos from the Consistent4D dataset, as well as 3 images from the ImageDream project page. As discussed in Section A.1 of the supplementary materials, for text-image pairs from ImageDream, we utilize SVD to generate monocular videos for 4D generation.

For the qualitative experiments in Sec.4.2, we use the same input videos as in Sec.4.1. For the quantitative experiments in Sec.4.2, we utilize six monocular videos from the Consistent4D test dataset, which provide ground truth for novel viewpoints. While some of these input videos for evaluation may come from the Objaverse dataset, none of them were used in training our multi-view video diffusion model.

Q3: What are the reference videos for calculating the FVD metrics? Are they ground truth videos from the Objaverse dataset? For table 2, are the reference videos the same?

In Tab.1, the reference videos used for calculating the FVD metric are the input videos, as there is no ground truth available for novel viewpoints. In Tab.2, the reference videos for the FVD metric are the ground truth videos that share the same viewpoints as the rendered videos. Although these input videos for evaluation may come from the Objaverse dataset, none of them are used for training our multi-view video diffusion model. The reference videos are the same for each method in Tab.2.

Review (Rating: 4)

This paper tackles the task of 4D reconstruction from monocular video. It introduces a training approach for a multi-view video generative model using a synthetic dataset of multi-view videos. The architecture uses a 3D-aware denoising diffusion model previously applied to multi-view images and extends it to accommodate multi-view videos. The model is fine-tuned using 1,000 synthetic multi-view videos from the Objaverse dataset. Then, score-distillation sampling (SDS) is used to generate a dynamic radiance field. The evaluation on videos of synthetic object-centric scenes demonstrates a slight improvement in terms of CLIP and FVD metrics over the recent Consistent4D work on the task of novel view synthesis from monocular video. Although the qualitative results show minor enhancement over baselines, concerns remain about the generalization to real-world videos and the evaluation, especially regarding potential training data leakage and significance of improvement over Consistent4D. Addressing these issues would warrant the acceptance of the paper.

Strengths

  • The paper addresses the significant and timely issue of generating 4D content using diffusion models.
  • The architectural extension of the 3D-aware diffusion model and its fine-tuning are good contributions that would be useful to know for the community.
  • The technical contribution is highlighted by impressive generalization performance (assuming no train data leakage).
  • This also highlights the scalability potential of synthetic Objaverse dataset for fine-tuning video diffusion models to perform 4D generation.
  • Both qualitative and quantitative results indicate improvements over the baselines, albeit modest compared to Consistent4D.

Weaknesses

  • The training requires a multi-view video dataset which is difficult to obtain.
  • The evaluation is limited to synthetic, object-centric toy scenes without backgrounds.
  • I haven't found a description of the validation and test dataset for the experiments in Section 4.1.
  • It is not clear whether assets in test-videos are unseen during training of all models (as some of them are also trained on objaverse).
  • Evaluation in Sec 4.1 is limited to CLIP and FVD metrics. Since multi-view video datasets were used for training, one could evaluate models for the novel view synthesis task using standard metrics such as LPIPS/PSNR (taking the best of 10 due to the probabilistic nature of the task).
  • Minor: Given the small improvement over Consistent4D, further evaluation of statistical significance is needed.
  • Minor: The paper would benefit from more precise writing; particularly, the training description in lines 176-182 needs more clarity on each variable and the noising process, and Equation 10 lacks clarity regarding sampled variables used in expectations. The method description is overly complex, and the language is difficult to follow, containing several unclear sentences.

Questions

The paper assumes access to a multi-view video dataset. The rationale behind the need for SDS when a multi-view video diffusion model is already available is unclear. Could you explain why not fit the dynamic NeRF directly on the generated multi-view videos?

Limitations

Authors addressed the limitations.

Author Response

Thank you for recognizing and valuing our work. We address your constructive comments as follows:

W1: Difficult to obtain training data.

We have manually curated a high-quality subset of Objaverse, which will be released to support community development. Despite data limitations, our method has achieved promising results by leveraging a pre-trained 3D-aware diffusion model. Additionally, as the 4D field continues to advance, we anticipate the emergence of more multi-view video datasets, such as those used in L4GM [1] and Diffusion4D [2].

W2: The evaluation is limited to synthetic, object-centric toy scenes without backgrounds.

Currently, most research in 4D generation focuses on creating object-centric scenes without backgrounds, because maintaining spatial-temporal consistency is challenging. We address this challenge by designing a multi-view video diffusion model for 4D generation, which achieves superior performance compared to other methods. Additionally, our evaluation extends beyond synthetic data: we also evaluated on real-world videos from the Consistent4D dataset, including the squirrel in Fig. 1, the egret and robot in Fig. 4, and the jay in Fig. 8.

W3: Description of test dataset for experiments in Section 4.1.

We conduct the experiments in Sec.4.1 using 3 real-world videos and 3 synthetic videos from the Consistent4D dataset, as well as 3 images from the ImageDream project page. As discussed in Sec.A.1 of the supplementary materials, for text-image pairs from ImageDream, we utilize SVD to generate input videos. We will add this description in our revised version.

W4: Whether assets in test-videos are unseen during the training of all models.

Sorry for making this unclear. During the training of 4DM, we ensure that our training data does not overlap with the test data. Regarding the pre-trained ImageDream model, the training data has not been released. However, it is known that the model was trained exclusively on multi-view images from Objaverse, which means it has not seen the test videos either.

W5: Evaluation in Sec 4.1 is limited to CLIP and FVD metrics.

Thanks for your constructive comments. As detailed in W3, we utilize monocular videos from Consistent4D and ImageDream to evaluate our method. However, these datasets do not provide ground truth videos for novel viewpoints, preventing us from calculating LPIPS and PSNR metrics. To further evaluate our model, we select 5 multi-view videos that are not included in the training data and conduct experiments. The following table presents the results, where metrics such as FVD, LPIPS, and PSNR clearly demonstrate that our method significantly outperforms the baselines.

Model            CLIP-I↑   CLIP-C↑   FVD↓      LPIPS↓   PSNR↑
Consistent4D     0.9216    0.9723    706.07    0.1593   16.70
DreamGaussian4D  0.8898    0.9710    760.18    0.1793   15.97
4D-fy            0.8658    0.9487    1042.3    0.2254   14.24
Ours             0.9310    0.9798    417.63    0.1199   19.07

Here, we use ground truth videos for novel viewpoints to compute these metrics. As illustrated in the first two rows of Fig.R3 in the attached PDF, Consistent4D and DreamGaussian4D encounter the multi-face problem. Consequently, when computing metrics between the input video and synthesized videos, as shown in Tab.1 of the main paper, the performance gap between our method and the baselines may not be apparent. Due to time constraints, we selected the best of two runs when conducting this experiment.

W6: Minor: Given the small improvement over Consistent4D, further evaluation of statistical significance is needed.

As discussed in W5, we conduct additional experiments to further evaluate our method. As shown in the table above and Fig. R3, our method significantly outperforms Consistent4D.

W7: Minor: The paper would benefit from more precise writing.

Thank you for pointing out the writing problems. We will revise the training details in lines 176-182 and clarify the annotations in Eq.10. Furthermore, we will simplify and refine the method section to enhance readability.

Q1: The paper assumes access to a multi-view video dataset.

This question is related to W1. Please refer to W1 for a detailed response.

Q2: The rationale behind the need for SDS.

We utilize SDS to optimize dynamic NeRF, enabling effective rendering from any novel viewpoint across the temporal dimension, which is essential for 4D generation. This is particularly challenging for our multi-view video diffusion model, as it is limited to generating only four orthogonal viewpoints at a time, which hinders its ability to produce consistent videos from any novel viewpoint.

Q3: Why not fit the dynamic NeRF directly on the generated multi-view videos?

Our multi-view video diffusion model can only produce four orthogonal viewpoints at one time. Training with such sparse views often results in overfitting to the training viewpoints, as shown in SPARF [3]. To further demonstrate this challenge, we conducted an experiment where we used only four generated videos to optimize dynamic NeRF. The results, which illustrate the overfitting problem, are presented in Fig. R4 in the attached PDF.

[1] Ren, J., et al. L4GM: Large 4D Gaussian Reconstruction Model. arXiv preprint arXiv:2406.10324, 2024.

[2] Liang, H., et al. Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models. arXiv preprint arXiv:2405.16645, 2024.

[3] Truong, P., et al. SPARF: Neural Radiance Fields from Sparse and Noisy Poses. CVPR, 2023.

Comment

Thank you for your response. I found the added experiment for Q3 particularly interesting, and the camera-ready would benefit from it. It seems that the extracted 4D content is of fairly high quality. Regarding overfitting, wouldn't this be resolved if you trained on more views (since you train on synthetic data anyway)?

Comment

Dear Reviewer,

We sincerely thank you for your precious time and efforts in reviewing our paper.

Thank you for acknowledging the additional experiment in Q3. We will include it in our camera-ready version upon acceptance. Training a multi-view video diffusion model with more views could potentially mitigate the overfitting problem. However, it demands significantly more memory and computational resources, posing a challenge for current GPU capabilities. Additionally, learning such a complex distribution would require a much larger training dataset. On the other hand, our method achieves promising results with reasonable computational resources and datasets, making it a robust and efficient solution for 4D generation.

Thank you once again for your review and constructive comments! We are happy to engage in further discussion if you have any additional questions or concerns.

Best regards, Authors

Review (Rating: 5)

The paper proposes a 4D generation method that aims to generate 4D content from a monocular video. A video-to-multi-view-video diffusion model is presented to create multi-view videos given a monocular video, a text prompt, and a sequence of camera poses. The trained multi-view-video diffusion model is leveraged to optimize 4D representation, i.e., dynamic NeRF. In addition, 4D-aware SDS loss and an anchor loss are introduced to train dynamic NeRF. Experimental results show the proposed method achieves the best performance, compared with state-of-the-art methods.

Strengths

  1. The paper is well-written and easy to follow.

  2. A multi-view-video diffusion model is presented to generate multi-view videos from a monocular video.

  3. The paper addresses an interesting problem, and 4D generation significantly impacts various applications.

Weaknesses

  1. Some technical details are unclear. The paper builds a multi-view-video diffusion model by inserting a learnable motion module into ImageDream. The learnable motion module is critical to the proposed method. However, the paper does not provide detailed information about the motion module, such as the architecture and the layer information. Without this information, it is difficult to reproduce the proposed method.

  2. The paper uses only 996 training samples to train the multi-view-video diffusion model (i.e., the motion module), while the training takes two days on 16 NVIDIA Tesla A100 GPUs. Do such a small dataset and such an extensive training cost lead to significant overfitting? How many parameters are in the motion modules?

  3. Instead of the input monocular video, the anchor loss chooses a monocular video generated by the presented multi-view-video diffusion model as an anchor, due to the difficulty in estimating the camera pose of the input video. Why not use all videos generated by the multi-view-video diffusion? Would this operation degrade the 4D generation performance? In addition, the input video typically has better quality than the generated one.

  4. Table 2 shows that using ImageDream achieves a better CLIP-I score than using the presented multi-view-video diffusion model. Could the authors provide more explanation?

Questions

Please refer to my comments above.

Limitations

The paper provides the limitations and societal impact of the proposed work.

Author Response

Thank you for appreciating and acknowledging our work. We address your constructive comments below:

W1: Some technical details are unclear.

We will provide detailed information about the motion module in the revised version. Specifically, we incorporate a zero-initialized motion module at the end of each spatial attention block in ImageDream, as illustrated in Fig.3. Each motion module begins with group normalization and a linear projection, followed by two self-attention blocks and one feed-forward block. A final linear projection is then applied, after which the residual hidden feature is added back at the end of each motion module. For a more detailed architecture overview, please refer to Fig. R1 in the attached PDF. Moreover, we will open our code to enable researchers to reproduce our results and further build upon our work after the completion of the anonymous review process.
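
To complement the description above, a minimal PyTorch sketch of such a motion module; the hidden sizes, head count, group count, and tensor layout are assumptions, and only the overall structure (GroupNorm, input projection, two temporal self-attention blocks, feed-forward, zero-initialized output projection, residual) follows the description:

```python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Sketch of a temporal motion module: GroupNorm -> Linear -> 2x temporal
    self-attention -> feed-forward -> zero-initialized Linear, plus a residual.
    dim is assumed divisible by num_groups."""

    def __init__(self, dim: int, num_heads: int = 8, num_groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, dim)
        self.proj_in = nn.Linear(dim, dim)
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(2)]
        )
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)  # zero init: the module starts as identity,
        nn.init.zeros_(self.proj_out.bias)    # preserving the frozen ImageDream output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) = batch, frames, spatial tokens, channels
        b, t, n, c = x.shape
        residual = x
        h = self.norm(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # GroupNorm over channels
        h = self.proj_in(h)
        h = h.permute(0, 2, 1, 3).reshape(b * n, t, c)  # attend along the temporal axis
        for attn in self.attn:
            h = h + attn(h, h, h)[0]
        h = h + self.ff(h)
        h = self.proj_out(h)
        h = h.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return residual + h
```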

W2: Overfitting and parameters of the motion modules.

Thank you for your insightful question. Our motion module contains 453.209M parameters. We have evaluated our model on real-world videos from the Consistent4D dataset, including the squirrel in Fig. 1, the egret and robot in Fig. 4, and the jay in Fig. 8, indicating that our model does not overfit the training data. To be more concrete, the reasons are as follows:

  • Our training dataset comprises 966 multi-view videos, with each video containing 32 viewpoints. This setup results in approximately 741,000 frames in total, and the average number of frames per video is 24. During each training iteration of 4DM, we randomly sample 4 orthogonal viewpoints with 8 frames from the training data (see the sketch after this list); this strategy effectively augments our training data, helping to mitigate the risk of overfitting.
  • Our 4DM model employs several 3D self-attention layers and 1D temporal self-attention layers to capture spatial-temporal relationships, which requires approximately 6 seconds to optimize per step. As a result, two days of training (30,000 iterations) is insufficient for 4DM to overfit the dataset.
  • In 4DM, we only finetune the parameters of the motion module while keeping the parameters of the original ImageDream model frozen, thereby preserving ImageDream's generalization ability, even when trained on a small curated dataset.
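
A minimal sketch of the view/frame sampling described in the first point above; the array layout and helper name are illustrative, not our training code:

```python
import numpy as np

def sample_training_clip(video_frames, num_views=32, clip_len=8, rng=None):
    """Sample 4 orthogonal viewpoints and a window of clip_len consecutive frames.

    video_frames: array of shape (num_views, num_frames, H, W, 3), with the
    num_views cameras assumed to be evenly spaced in azimuth.
    """
    rng = rng or np.random.default_rng()
    num_frames = video_frames.shape[1]

    # 4 orthogonal views: a random starting view plus every quarter turn.
    start_view = rng.integers(num_views)
    views = [(start_view + k * num_views // 4) % num_views for k in range(4)]

    # Random temporal window of clip_len consecutive frames.
    start_frame = rng.integers(max(num_frames - clip_len, 0) + 1)
    return video_frames[views, start_frame:start_frame + clip_len]
```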

W3: Anchor videos.

Thank you for your comments. As the reviewer pointed out, the input video is generally of better quality than the generated videos. Therefore, we select the viewpoint closest to the input video as the anchor. This approach ensures that the anchor video maintains the same quality as the input, which improves the results. Moreover, our multi-view video diffusion model is currently limited to generating multi-view videos with 8 frames. When the input video exceeds 8 frames, we must apply our multi-view video diffusion model multiple times to generate anchor videos. However, this process may lead to temporally inconsistent results due to the stochasticity of the diffusion model, particularly when the viewpoint is far from the input video as shown in Fig.R2 in the attached PDF. This inconsistency would degrade the 4D generation performance.

W4: CLIP-I score.

In Tab.2 in the main paper, our method performs only slightly lower (-0.0146) than ImageDream in the CLIP-I metric, likely due to the stochasticity of the diffusion model. However, our method surpasses ImageDream in the LPIPS metric, which also reflects image quality. Moreover, our method exhibits comparable visual quality to that of ImageDream as shown in Fig.5 in the main paper. To further assess our model, we select 5 additional test cases from Objaverse, alongside the test data provided by Consistent4D (None of these test data are included in our training dataset). To account for the stochasticity of the diffusion model, we conduct five runs for each test case and report the average metrics as follows:

Model        CLIP-I↑   LPIPS↓   CLIP-C↑   FVD↓
ImageDream   0.9165    0.1536   0.9320    465.94
Ours (4DM)   0.9260    0.1346   0.9601    427.34

Despite the comparable performance in CLIP-I, our model excels in generating spatial-temporally consistent multi-view videos, a primary focus of our research. This is evidenced by the superior performance on metrics such as LPIPS, CLIP-C, and FVD, which better capture the spatial and temporal fidelity of video content. These metrics demonstrate that our method effectively balances image quality with temporal consistency, making it a robust solution for multi-view video generation.
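
For reference, one plausible way to compute a CLIP-I style score (average cosine similarity between CLIP image embeddings of generated frames and the reference image) using the Hugging Face transformers CLIP API; the model checkpoint and averaging scheme here are assumptions, not our exact evaluation script:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i_score(generated_frames: list, reference: Image.Image) -> float:
    """Average cosine similarity between each generated frame and the reference image."""
    inputs = processor(images=generated_frames + [reference], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    ref = feats[-1]
    return (feats[:-1] @ ref).mean().item()
```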

Comment

I thank the authors for taking the time to answer my questions. Most of my concerns have been addressed. I still have a concern about using 996 training samples to train 453.209M trainable parameters. Yet, I plan to keep my positive score.

Comment

Dear Reviewer,

We sincerely thank you for the review and the suggestions.

Best regards,

Authors

Author Response

We appreciate the detailed and constructive feedback from all the reviewers. We are pleased that reviewers recognize our contribution (Y3vx, 4Nq2) and the effort involved in filtering the dataset (4Nq2). Additionally, the reviewers acknowledge the significance (Y3vx) and interest (ioTR) of the addressed problem, and consider the proposed method reasonable (MfEx). Furthermore, reviewers find our paper well-written (ioTR, MfEx) and easy to follow (ioTR).

Before addressing the individual reviews, we briefly summarize our responses as follows:

  • Experiment Improvements:
    • We utilize more accurate settings to evaluate our method on multi-view video generation.
    • We have conducted additional experiments on 4D generation, using LPIPS and PSNR metrics to evaluate our method against baselines.
    • We demonstrate the effectiveness of 4D-aware SDS by directly optimizing on generated videos.
  • Clarification of Technical Details:
    • Provide a detailed explanation of the design of our motion module.
    • Clarify the construction of our training dataset for the multi-view video diffusion model.
    • Describe the construction of our test dataset for each experiment.
    • Detail the reason for selecting anchor videos.
  • Writing Improvements:
    • We will clarify the annotations used in the paper and simplify and refine the method section to enhance readability in the revised version.

We provide the qualitative results of additional experiments and the detailed architecture of the motion module in the attached PDF.

Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

We would like to know if our response has adequately addressed your questions and concerns. We are more than willing to engage in further discussion.

Best regards,

Authors

Final Decision

This paper tackles the task of 4D reconstruction from monocular video. The approach extends a pretrained 3D-aware diffusion model to work with multi-view videos by incorporating a learnable motion module into the frozen 3D-aware backbone to capture multi-view spatial-temporal correlations. The model is fine-tuned using 1,000 synthetic multi-view videos from the Objaverse dataset. Then, score-distillation sampling (SDS) is used to generate a dynamic radiance field.

The paper received borderline ratings. The reviewers consider the addressed problem timely and relevant and the technical contribution interesting to the community. Initially, the main concerns of the reviewers were the marginal improvements over one of the baselines (Consistent4D), and lacking details for training / test split of the data. The authors resolved these concerns with their rebuttal.

I hence recommend acceptance.