PaperHub
Rating: 5.3/10
Poster · 3 reviewers
Min 5 · Max 6 · Std 0.5
Individual ratings: 5, 5, 6
Confidence: 4.0
Soundness: 2.7
Contribution: 3.0
Presentation: 2.7
NeurIPS 2024

MotionCraft: Physics-Based Zero-Shot Video Generation

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Zero-shot video generation, diffusion model, physics-based video generation

Reviews and Discussion

Official Review
Rating: 5

In this work the authors propose MotionCraft, a new zero-shot video generator to craft physics-based and realistic videos. MotionCraft is able to warp the noise latent space of an image diffusion model, such as Stable Diffusion, by applying an optical flow derived from a physics simulation. The authors show that warping the noise latent space results in coherent application of the desired motion while allowing the model to generate missing elements consistent with the scene evolution, which would otherwise result in artefacts or missing content if the flow were applied in the pixel space. The authors compare the method with the state-of-the-art Text2Video-Zero, reporting qualitative and quantitative improvements and demonstrating the effectiveness of the approach in generating videos with finely-prescribed complex motion dynamics.
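To make the core idea concrete, below is a minimal sketch of warping a noisy latent with a flow field using PyTorch's `grid_sample`. This is not the authors' implementation (their procedure is given in Pseudocode 1 of the manuscript); the tensor shapes, the pixel-unit flow convention, and the backward-warping assumption are illustrative choices.

```python
import torch
import torch.nn.functional as F

def warp_latent(latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a noisy latent with a flow field (illustrative sketch).

    latent: (1, C, h, w) noisy latent at some diffusion step.
    flow:   (1, 2, h, w) displacement field in latent-pixel units, e.g. a
            physics-derived optical flow downsampled to the latent grid.
    """
    _, _, h, w = latent.shape
    # Base sampling grid in normalized [-1, 1] coordinates, (x, y) order.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=latent.device),
        torch.linspace(-1.0, 1.0, w, device=latent.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0)            # (1, h, w, 2)
    # Convert pixel displacements to normalized coordinates and shift the grid.
    disp = torch.stack(
        (2.0 * flow[:, 0] / (w - 1), 2.0 * flow[:, 1] / (h - 1)), dim=-1
    )
    return F.grid_sample(latent, base + disp, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

With backward warping, `flow` points from each target location back to its source; handling occlusions, newly revealed content, and the diffusion noise schedule is where the actual method relies on the image prior.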

Strengths

  1. This method does not need extra training, which is efficient.
  2. This method introduces explicit physics control to the field of video generation, which is novel.
  3. This method finds that optical flow is consistent between the pixel space and the latent space, which is interesting.

Weaknesses

  1. The main concern is about the experiments. This paper only has 5 video results in total, which is not sufficient. I am worried that the method is highly unstable and not robust, and thus the authors cannot present more video results. If this is the case, I think this manuscript is not suitable for publication. If this is not the case, I think the authors should provide more generated results and I will be glad to raise my score.
  2. How long does it take to generate a single video? If the paper claims that a video can be generated within minutes, I think generating tens of videos in the supplementary material would be a good idea.

Questions

See the weakness.

Limitations

See the weakness.

Author Response

The main concern is about the experiments. This paper only has 5 video results in total, which is not sufficient. I am worried that the method is highly unstable and not robust, and thus the authors cannot present more video results. If this is the case, I think this manuscript is not suitable for publication. If this is not the case, I think the authors should provide more generated results and I will be glad to raise my score.

We understand the reviewer's concern regarding the number of video results presented in our paper. We would like to assure you that our method is stable and robust. To address this concern, we have generated additional video results, which are included in the Additional Results PDF. These supplementary videos showcase a diverse set of scenarios and dynamic simulations, further demonstrating the reliability and consistency of our approach. For example, we applied it to a different diffusion model (SDXL) without modifications, resulting in a video with higher resolution and smoothness (example 4 in Additional Results PDF). We also generated a video 200 frames long (example 6) showcasing the stability of the method. Furthermore, in example 2, we tested our method with a complex optical flow, estimated directly from a real video.

However, we are limited to a 1-page additional PDF in the rebuttal, so we included all the videos that we could fit on this page. More videos will be added to the supplementary materials upon acceptance. We believe that these additional results will provide a comprehensive validation of our method and alleviate any concerns regarding its stability and robustness. Code is also available to generate any additional videos.

How long does it take to generate a single video? If the paper claims that a video can be generated within minutes, I think generating tens of videos in the supplementary material would be a good idea.

The claims in the paper regarding the time to generate a video refer only to the video generation itself (all and only the steps in Pseudocode 1 in the manuscript). However, as explained in the methodology, there is a certain degree of manual input in setting up the physical simulation and conditioning it on the starting frame. For example, setting the type of simulation, the boundary conditions, or the initial state of the fluid often takes time beyond the actual runtime of the generative model.

Comment

Thank you for the rebuttal. I have read the additional results. The rating has been updated. I hope the authors will show more generated results in the supplementary materials upon the acceptance of this manuscript.

Official Review
Rating: 5

The paper presents MOTIONCRAFT, a novel zero-shot video generation method that leverages physical simulations to create realistic and physically plausible videos. Unlike traditional video diffusion models that require extensive training and large datasets, MOTIONCRAFT uses a pre-trained image diffusion model, such as Stable Diffusion, and warps its noise latent space with optical flow derived from physical simulations. This approach ensures coherent motion and the generation of missing elements consistent with scene evolution.

Key Contributions:

  1. Innovative Approach: Introduction of a zero-shot video generation method that uses optical flow from physical simulations to warp the noise latent space of a pre-trained image diffusion model.
  2. Experimental Validation: Demonstrates the effectiveness of MOTIONCRAFT through both qualitative and quantitative comparisons with the state-of-the-art Text2Video-Zero method, showing significant improvements.
  3. Theoretical Insights: Provides an analysis of the correlation between optical flow in the image space and the noise latent space, supporting the proposed method.
  4. Versatility: Showcases the ability of MOTIONCRAFT to generate videos with complex dynamics, including fluid dynamics, rigid body physics, and multi-agent interaction models, without additional training.
  5. Technical Details: Describes key techniques such as multi-frame cross-attention and spatial noise map weighting to ensure temporal and spatial consistency in the generated videos.

Overall, MOTIONCRAFT represents a significant advancement in zero-shot video generation, combining the strengths of physical simulations and image diffusion models to produce high-quality, dynamic videos.

Strengths

Refer to Summary.

Weaknesses

  • There are now many approaches to zero-shot video generation, such as https://openreview.net/forum?id=zOjW6yVYkE. The authors only compared their method with T2V0 (a relatively early method), which may make the experimental results insufficient. It would be better to include a more comprehensive comparison.
  • The current zero-shot video generation methods generally cannot achieve a very coherent video effect and can only generate keyframes. Although the method proposed in the paper largely ensures content consistency between consecutive frames, the generated videos still fail to achieve a highly coherent effect as seen in the demonstration.
  • Optical flow-based strategies often face limitations in certain specific situations. In complex environments, optical flow might not be effective. This aspect should be discussed and analyzed in the paper's main text. Moreover, the performance in different scenarios should be thoroughly evaluated using a variety of experimental results presented in the paper, instead of merely showcasing the method's validity through three handpicked examples.

Questions

  • Is it possible to achieve better controllable video generation by handling optical flow in a manner similar to that mentioned in the Generative Image Dynamics (https://generative-dynamics.github.io/) paper?
  • Can more coherent videos be generated through interpolation in optical flow?
  • Is it possible to generate long videos?

Limitations

Refer to Weakness

Author Response

There are now many approaches to zero-shot video generation, such as https://openreview.net/forum?id=zOjW6yVYkE. The authors only compared their method with T2V0 (a relatively earlier method), which may make the experimental results insufficient. It would be better to include a more comprehensive comparison.

We appreciate the reviewer's suggestion. Besides the work linked by the reviewer, we are unsure what other specific papers the reviewer has in mind for zero-shot video generation with diffusion models. At the time of submission, we made sure to include references to all relevant works available. Regarding the method mentioned by the reviewer, we note that this work primarily compares with T2V0 using a limited quantitative metric (text-image CLIP score) without considering frame alignment or temporal coherence. Additionally, since it does not have any code or videos released, we are unable to reproduce their results and compare them within our evaluation setting. Nonetheless, we acknowledge the importance of a broader comparative analysis. We will include the referenced work in our related work section in the camera-ready version of our paper.

The current zero-shot video generation methods generally cannot achieve a very coherent video effect and can only generate keyframes. Although the method proposed in the paper largely ensures content consistency between consecutive frames, the generated videos still fail to achieve a highly coherent effect as seen in the demonstration.

It is indeed a challenge for zero-shot video generation methods to maintain high temporal coherence. In particular, we found that the 64x64 resolution of the latent space is one limitation. This resolution restricts the accuracy and smoothness of the motion since it prevents sub-pixel movements during warping, resulting in less coherent transitions between frames. However, we have explored the use of Stable Diffusion XL (SDXL), which offers a larger latent space (128x128), allowing for more precise and smooth motion. By increasing the spatial resolution of the latent space, SDXL mitigates the limitations observed with the standard 64x64 latent space. See example 4 in the Additional Results. We will deepen the discussion of these aspects in the revised version of our manuscript.

Optical flow-based strategies often face limitations in certain specific situations. In complex environments, optical flow might not be effective. This aspect should be discussed and analyzed in the paper's main text. Moreover, the performance in different scenarios should be thoroughly evaluated using a variety of experimental results presented in the paper, instead of merely showcasing the method's validity through three handpicked examples.

We agree with the reviewer that optical flow in the pixel domain is not adequate for generating realistic video (for example, it is difficult to handle occlusions). However, the same optical flow in the latent space can produce realistic videos. Indeed, MotionCraft is able to exploit the diffusion model in order to fix the simplistic nature of the optical flow while maintaining its explainability and simulation ability. We will add some videos generated by warping the pixel space to prove this claim.

Regarding the limited number of examples, we provide videos with new content and dynamics in the Additional Results PDF (we are limited to 1 page as per the NeurIPS FAQ). These additional examples have been evaluated and confirm the quantitative results presented in the paper.

Is it possible to achieve better controllable video generation by handling optical flow in a manner similar to that mentioned in the Generative Image Dynamics paper?

The Generative Image Dynamics method involves training a motion prior by modeling it as a dense Fourier volume. This representation of optical flow is particularly suited for oscillatory motions, while MotionCraft is able to handle both non-oscillatory and oscillatory motion (as seen in the sea waves in example 6). However, it should be possible to achieve better controllable video generation in a manner similar to GID, at the cost of learning an optical-flow generator, since such a module would need to be input-dependent. This is, however, outside the scope of this work.

Can more coherent videos be generated through interpolation in optical flow?

We found that the coherence of generated videos is affected by both the temporal resolution of the provided optical flows as well as the spatial resolution of the latent space of the diffusion model.

Concerning the temporal resolution, it is possible to run the physics simulator with a sufficiently small time step to match the desired frame rate. However, as mentioned in our response to Weakness 2, the primary limitation we encountered in representing fine-grained motion is the 64x64 spatial size of the latent space in the current Stable Diffusion model. Please see our response to Question 2 for further comments.

Is it possible to generate long videos?

Yes, it is possible to generate long videos using our method. From a theoretical point of view, the method complexity is linear in the number of frames to be generated, as explained in the manuscript. From a practical point of view, we demonstrate this capability in example 6 in the Additional Results, generating a coherent video with 200 frames. The key to generating long videos lies in the cross-attention mechanism we use, which ensures global consistency by attending to the first frame in the sequence and local consistency by attending to the previous frame. However, this cross-attention approach may introduce challenges when there are significant scene changes, as the mechanism maintains consistency with the initial frame. To address this, the frame-to-attend can be shifted periodically, or a weighted attention mechanism that decays over time can be used. These strategies can help enhance the video generation capability of our method for very long videos.
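As an illustration of the cross-frame attention described above (not the paper's exact layer), here is a sketch in PyTorch in which the current frame's tokens attend to both the first and the previous frame; the function name and shapes are assumptions, and the learned Q/K/V projections are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def multi_frame_attention(h_curr, h_first, h_prev, num_heads=8):
    """Cross-frame attention sketch: queries come from the current frame,
    keys/values from the first frame (global consistency) concatenated with
    the previous frame (local consistency).

    h_curr, h_first, h_prev: (B, N, D) token sequences from a U-Net
    attention layer of the corresponding frames.
    """
    b, n, d = h_curr.shape
    head_dim = d // num_heads

    def split_heads(x):
        # (B, tokens, D) -> (B, heads, tokens, head_dim)
        return x.view(b, -1, num_heads, head_dim).transpose(1, 2)

    q = split_heads(h_curr)
    kv = torch.cat([h_first, h_prev], dim=1)        # (B, 2N, D)
    k = v = split_heads(kv)
    out = F.scaled_dot_product_attention(q, k, v)   # (B, heads, N, head_dim)
    return out.transpose(1, 2).reshape(b, n, d)
```

Shifting `h_first` to a later anchor frame, or down-weighting it over time, corresponds to the strategies mentioned above for very long videos.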

Comment

Dear reviewer,

The discussion period is coming to a close soon. Please do your best to engage with the authors.

Thank you, Your AC

Comment

Thank you for your rebuttal. The limited results remain a major concern. I will maintain my original rating.

Comment

We respectfully disagree with the reviewer regarding the perceived limitation in the number of examples shown. In the original manuscript, we included 6 examples (5 in the main text and 1 in the appendix). With the rebuttal, we added 9 more examples, effectively more than doubling the number of experiments. Overall, the examples provided highlight the capacity of MotionCraft to handle a variety of scenarios, including different physics (rigid bodies, fluids, multi-agents), different diffusion models with varying resolutions (Stable Diffusion and SDXL), and different frame rates (even reaching videos of 200 frames).

Additionally, we would like to point out that the number of compatible examples provided by T2V0 (those without trained ControlNets) is comparable to the number of examples we have presented.

Official Review
Rating: 6

The paper works on the zero-shot video generation task and proposes MotionCraft. It uses physics simulations to generate optical flow that follows physical dynamics. Then, the optical flow is applied to warp the noise in the latent space of the Stable Diffusion model. This approach ensures coherent motion application and consistent scene evolution, avoiding artefacts and missing content typical of pixel-space flow applications. Compared to the state-of-the-art Text2Video-Zero, MotionCraft shows both qualitative and quantitative improvements in generating videos with complex motion dynamics.

Strengths

  1. The paper is well-motivated and well-written.
  2. The idea of using a physics simulator to generate the optical flow, which is then applied in latent space, is very interesting.
  3. The qualitative results are impressive.

Weaknesses

  1. More quantitative comparison with baselines. Table 1 only reports the comparison with T2V0 on the generated videos. However, it is not clear which benchmark it is. Is it possible to compare with other baselines on more benchmarks, like MUG, MHAD?

  2. The method seems limited to specific types of dynamics, like fluid dynamics. It is not clear how to generate more general dynamics in the real world. This may limit the potential applications of the proposed method.

  3. The method assumes that "Optical Flow is preserved in the Latent Space of Stable Diffusion" based on the observation of an average correlation of 0.727 between optical flows estimated in the RGB and noise latent spaces. Does this assumption hold true for generating realistic, pixel-wise precise motion in video with only 0.72 cosine similarity?

Questions

How do you specify the region for simulating physical dynamics?

Will the type of dynamic physics simulator affect the quality of the generated video?

Limitations

The method is limited to specific types of dynamic simulators and may be hard to apply to generic real-world video generation.

Author Response

More quantitative comparison with baselines. Table 1 only reports the comparison with T2V0 on the generated videos. However, it is not clear which benchmark it is. Is it possible to compare with other baselines on more benchmarks, like MUG, MHAD?

We appreciate the reviewer’s suggestion to include more quantitative comparisons with additional baselines.

Our experimental setup follows that of the seminal paper of T2V0, which, at this time, is the only baseline for zero-shot video generation. Moreover, the nature of our approach introduces specific challenges in directly comparing with benchmarks such as MUG and MHAD. Our method, MotionCraft, utilizes optical flows derived from physical simulations to generate videos. As no benchmarks of this kind are available, we decided to compare T2V0 and MotionCraft on a new set of physics-based videos. This set of videos is composed of the 5 examples reported in Table 1. Additionally, among other newly generated videos in the Additional Results PDF, example 2 shows that MotionCraft can handle facial expression change videos similar to those present in the MUG dataset.

The method seems limited by specific types of dynamics, like fluid dynamics. It is not clear how to generate more general dynamics in real world. This may limit the potential application of the proposed method.

We argue that focusing on specific types of dynamics, like fluid dynamics, is a novel approach to the problem of zero-shot video generation. Moreover, fluid dynamics can be applied even to videos with content different from liquids or gases, as shown in example 5 of the Additional Results PDF, where the crowd moves according to a fluid simulation. However, the proposed approach based on physical simulations may not apply to all kinds of content to be generated. But, as we argue in the paper, the approach could be generalized by replacing the physics simulator with animation software or a 3D engine to generate the optical flows. One could even consider learning a dedicated optical flow generator. These are all interesting directions that require substantial future work, which can leverage the evidence of effectiveness for the case of physics simulations shown in this paper. We will deepen the discussion of these aspects in the revised version of our manuscript.

The method assumes that "Optical Flow is preserved in the Latent Space of Stable Diffusion" based on the observation of an average correlation of 0.727 between optical flows estimated in the RGB and noise latent spaces. Does this assumption hold true for generating realistic, pixel-wise precise motion in video with only 0.72 cosine similarity?

We agree with the reviewer that the value of the cosine similarity needs a comment.

This cosine similarity is computed between the flow estimated in the pixel space (between the two RGB frames) and, independently, the flow estimated between the corresponding latents. We argue that these two flows cannot be perfectly aligned since the VAE module of Stable Diffusion downsamples and encodes the image to the latent space. In addition, we think that 0.72 is a surprisingly high value, as it is measured in the noisy regime of the diffusion process (τ = 400 out of 1000 steps).
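For concreteness, the reported number can be read as an average per-location cosine similarity between the two flow fields; a sketch of such a computation is shown below (the function name, shapes, and the zero-flow masking are our assumptions, not necessarily the paper's exact protocol).

```python
import torch
import torch.nn.functional as F

def mean_flow_cosine_similarity(flow_rgb: torch.Tensor,
                                flow_latent: torch.Tensor,
                                eps: float = 1e-6) -> float:
    """Average per-location cosine similarity between two flow fields.

    flow_rgb:    (2, H, W) flow estimated between two RGB frames, resized to
                 the latent resolution.
    flow_latent: (2, H, W) flow estimated between the corresponding noisy
                 latents at the same diffusion step.
    """
    cos = F.cosine_similarity(flow_rgb, flow_latent, dim=0, eps=eps)  # (H, W)
    # Ignore locations where both flows are (near) zero and the angle is undefined.
    mask = (flow_rgb.norm(dim=0) > eps) | (flow_latent.norm(dim=0) > eps)
    return cos[mask].mean().item()
```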

To address the concern about generating pixel-wise precise motion, we conducted an additional experiment (Additional Results, experiment 2) to further validate our approach. We extracted the optical flow from a video and used it as input to MotionCraft. The results demonstrated that our method could reconstruct the initial video with higher fidelity than the reconstruction in the pixel space.

How do you specify the region for simulating physical dynamics?

We thank the reviewer for this comment that allows us to clarify a point. As explained in the methodology, there is a certain degree of manual input in setting up a physical simulation conditioned on the starting frame. For example, in the fluid simulation of the filling glass in the supplementary material, we extracted a semantic map from the latent of the first frame and labeled the glass walls as boundary regions.
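As an illustration of this setup step (not the authors' code), one way to turn a segmentation of the first frame into a solid-boundary mask for a grid-based simulator is sketched below; the function name, label convention, and nearest-neighbour resizing are assumptions.

```python
import torch
import torch.nn.functional as F

def boundary_mask_from_segmentation(seg: torch.Tensor,
                                    boundary_labels: set,
                                    grid_hw: tuple) -> torch.Tensor:
    """Build a solid-boundary mask for a grid-based fluid simulation from a
    semantic segmentation of the first frame.

    seg:             (H, W) integer label map from any off-the-shelf
                     segmentation model run on the first frame.
    boundary_labels: label ids to treat as solid walls (e.g. the glass).
    grid_hw:         target (h, w) of the simulation grid.
    """
    solid = torch.zeros_like(seg, dtype=torch.bool)
    for label in boundary_labels:
        solid |= seg == label
    # Downsample to the simulation grid; any covered cell becomes a wall.
    solid = F.interpolate(solid[None, None].float(), size=grid_hw,
                          mode="nearest").bool()[0, 0]
    return solid  # True where the simulator should enforce no-flow boundaries
```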

Will the type of dynamic physics simulator affect the quality of the generated video?

We compared different types of physics simulators in appendix C, where the same dynamic is simulated with both an Eulerian and a Lagrangian numerical solver, resulting in two different videos. As we can see in this example, both methods provide plausible frames. This showcases the ability of MotionCraft to operate on very different types of physical simulators seamlessly, leveraging the image prior of the diffusion model to correct potential misalignments and produce visually coherent and high-fidelity videos.

Comment

Dear reviewer,

The discussion period is coming to a close soon. Please do your best to engage with the authors.

Thank you, Your AC

Author Response

Dear Reviewers and Area Chair,

We appreciate the valuable feedback and suggestions provided by the reviewers. We have carefully addressed the concerns raised and integrated additional experiments and clarifications to expand the original material in the suggested ways. Please find below a detailed point-by-point response to the reviewers' comments. We hope that these revisions and the additional results provided will demonstrate the robustness and effectiveness of our method.

Since the most critical point highlighted by the reviewers is the limited number of generated videos, we would like to note that we are limited to a 1-page additional PDF in the rebuttal, so we included all the videos that we could fit on this page. More videos will be added in the supplementary materials upon acceptance.

Thank you for your consideration. Sincerely, Authors of submission 16937

Final Decision

The paper received 2 borderline accepts and 1 weak accept. The reviewers approved of both the problem setting and the general approach of defining a connection between a physics simulator and a video generator. They were, however, slightly concerned about the breadth of the experimental comparison, whether optical flow was the optimal representation for relating the results of the physics simulator to the video generation model, and the generality of the solution.

I agree with both the strengths and weaknesses expressed by the reviewers. The task of using a physics simulator to support and augment video generation is very interesting, but the representations used (optical flow) could have been investigated more thoroughly and more experiments included. I recommend Accept as poster.