PaperHub
Rating: 5.0 / 10 (Poster; 4 reviewers; min 5, max 5, std 0.0)
Individual ratings: 5, 5, 5, 5
Confidence: 4.8
Soundness: 2.8 | Contribution: 2.3 | Presentation: 2.8
NeurIPS 2024

Video Diffusion Models are Training-free Motion Interpreter and Controller

OpenReview | PDF
Submitted: 2024-05-12 | Updated: 2024-11-06


Keywords
video generation, motion control

Reviews and Discussion

Official Review (Rating: 5)

This paper introduces a new Motion Feature (MOFT) that effectively captures motion information in video diffusion models. The authors reveal that robust motion-aware features already exist in video diffusion models and encode comprehensive motion information with clear interpretability. They present MOFT, which can be extracted without training and generalizes across diverse architectures.

Strengths

  • Training-free strategy effectively extracts motion information encoded in the features of the video diffusion model, demonstrating its ability to capture and leverage the inherent motion representations learned by the model.

  • The method presents a clean and straightforward solution for extracting motion encoding from video diffusion models, making it a ready and practical technique for various applications involving motion analysis or synthesis.

Weaknesses

  • The paper lacks clarity on the training process. While it claims to be training-free, it defines loss functions for other tasks (Equations 3 and 4). It would be helpful to clarify which stages are trained and which are not.

  • The PCA analysis is based on a small number of videos (only 2 videos in Figure 2), which limits the generalizability of the results.

  • While the motion in the qualitative videos looks good, the differences compared to other alterations appear subtle and hard to recognize. The other methods show rather poor results; were they tuned correctly?

  • The paper should report the runtime and resolution for better understanding of the method's computational requirements and output quality.

  • The idea is heavily inspired by DIFT and applied to video, so the novelty seems limited.

Questions

Clarification needed:

  • Figure 1:

    a. It is unclear whether the motion feature in 1(a) is extracted semantically or spatially. Clarification is needed on how the similarity with other videos in 1(b) is calculated. Additionally, an explanation of what the higher score represents and why motion features from different videos could influence each other would be helpful.

    b. In 1(c), the motion direction seems to be manually defined. If so, why does the paper state that MOFT serves as guidance for controlling motion direction? If MOFT controls the motion, what is the source video for that motion?

  • Figure 6: Why is the comparison presented as a point for DIFT and a segment for MOFT?

Limitations

Limitations are addressed adequately.

Author Response

We are grateful for your valuable input. Please see the detailed responses to each of your concerns listed below.

W1: Clarification on the training process

The paper lacks clarity on the training process. It would be helpful to clarify which stages are trained and which are not.

Optimization does not necessarily mean training. "Training-free" means that we do not need to optimize model parameters during the training stage. Instead, the loss function is used to optimize the latent at each denoising step during the inference stage, which is a common technique to guide the diffusion process [1, 2]. We will add this clarification in the new version. Please refer to the pseudo-code in the global response for the details of the optimization process.

W2: Clarification for the number of videos in PCA

The PCA analysis is based on a small number of videos (only 2 videos in Figure 2), which limits the generalizability of the results.

In Figure 2, PCA analysis is based on six videos, with indices 1 and 2 showing the same motion in different videos. We do not consistently use the same pair of videos to demonstrate that our observations are not tied to specific videos. Although the sample size is small, our results show that PCA-filtered motion channels are robust to the number of videos.

Table 2 in the attached PDF compares channel similarity between PCA results from 6 and 20 videos, showing high similarity in filtered motion channels, indicating PCA's robustness. Table 3 further confirms that motion fidelity and image quality metrics are also robust to the number of videos used.
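For readers who want to reproduce this kind of robustness check, a rough sketch is given below. The channel-scoring rule (loading magnitude on the first principal component of spatially pooled, frame-wise features) is our own assumption for illustration and is not necessarily the procedure defined by the paper's Eq. 2; the data here is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

def top_motion_channels(features, k=20):
    """features: list of (F, H, W, C) arrays, one per video.
    Scores each channel by its loading magnitude on the first principal component
    of the spatially pooled, frame-wise profiles (an assumed scoring rule)."""
    profiles = np.concatenate([f.mean(axis=(1, 2)) for f in features], axis=0)  # (sum_F, C)
    pca = PCA(n_components=1).fit(profiles)
    scores = np.abs(pca.components_[0])            # one loading per channel
    return set(np.argsort(scores)[-k:].tolist())

# Overlap between channel sets selected from 6 vs. 20 (synthetic) videos.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((16, 8, 8, 64)) for _ in range(20)]
s6, s20 = top_motion_channels(videos[:6]), top_motion_channels(videos)
print(len(s6 & s20) / len(s6 | s20))               # Jaccard similarity of the two selections
```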

W3: Clarification on the qualitative comparison

While the motion in the qualitative videos looks good, the differences compared to other alterations appear subtle and hard to recognize. Other methods show too poor results, were they tuned correctly?

In the qualitative comparison with other methods, we focus on challenging cases for existing methods. In these cases, other methods often struggle: for example, DragNUWA [3] typically only moves part of the object, and Gen-2 [4] frequently generates unnatural movements. Our method, however, performs well and generates natural motions even in these challenging scenarios. We do not need to fine-tune these methods; instead, we directly test their results using publicly available code or APIs.

W4: Report on the runtime and resolution

The paper should report the runtime and resolution for better understanding of the method's computational requirements and output quality.

Our results are at a resolution of 512x512 and 16 frames unless otherwise specified. We use DDIM with 25 denoising steps for each sample. It takes approximately 3 minutes to generate one sample on an RTX 3090 GPU. We will include this information in the revision.

W5: Clarification on the novelty

The idea is heavily inspired by DIFT and utilized for video applications, then novelty seems limited.

The inspiration from DIFT lies in the high-level research style, as both works analyze diffusion features. However, our motivations, contributions, and techniques are distinct:

  • Motivation: We aim to decompose motion features from video diffusion features for better motion understanding and control, unlike DIFT which targets semantic correspondence in image diffusion models.
  • Contribution: MOtion FeaTure (MOFT) is the first to reveal rich motion information in video diffusion features, using a straightforward and innovative approach. DIFT shows diffusion features can capture semantic correspondence.
  • Technique: MOFT uses a novel PCA-based strategy to extract motion-aware features, while DIFT directly uses intermediate features without further processing.

Q1-1: Clarification on the motion feature

The motion feature in 1(a) is extracted following Eq. 2. The extracted feature has a spatial dimension, but the feature at a point location captures the temporal motion of that point rather than its semantics. The heatmap in 1(b) is computed as the cosine similarity between the MOFT at the red dot in 1(a) and the MOFT of all points in 1(b): $M = \mathrm{cosine}(M_a, M_b)$, where $M \in \mathbb{R}^{H\times W}$ is the output heatmap, $M_a \in \mathbb{R}^{C}$ is the MOFT of one point in the source image, and $M_b \in \mathbb{R}^{H\times W\times C}$ is the MOFT of all points in the target image. $H$, $W$ and $C$ are the height, width and number of channels, respectively. We use only one frame to compute the similarity and for visualization.

Higher scores indicate greater motion similarity. For example, higher scores in the man’s region (last case in 1(b)) show that the man’s motion direction matches the reference (red dot) in 1(a), both moving left. This method also allows us to manipulate MOFT to alter motion direction.
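To make this computation concrete, here is a minimal NumPy sketch of the cosine-similarity heatmap; the array names and shapes are our own illustration rather than the authors' released code.

```python
import numpy as np

def moft_similarity_heatmap(moft_src_point, moft_tgt):
    """Cosine similarity between one source-point MOFT of shape (C,) and the
    MOFT of all target locations, shape (H, W, C); returns an (H, W) heatmap."""
    src = moft_src_point / (np.linalg.norm(moft_src_point) + 1e-8)
    tgt = moft_tgt / (np.linalg.norm(moft_tgt, axis=-1, keepdims=True) + 1e-8)
    # Dot product along the channel axis gives the cosine similarity per location.
    return np.einsum("hwc,c->hw", tgt, src)

# Hypothetical usage with random stand-in features.
H, W, C = 32, 32, 128
heatmap = moft_similarity_heatmap(np.random.randn(C), np.random.randn(H, W, C))
```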

Q1-2: Clarification on the motion signal

We need to convert manually defined motion directions into MOFT to serve as the motion optimization target; otherwise the optimization is not feasible. We offer two ways to obtain the target MOFT:

  • Synthesized from manually defined motion directions (as shown in Fig. 7(c) and 7(d)), which is the case discussed in this question. Details of this process are described in Lines 175-179 of the paper.
  • Extracted from the reference source video (Fig. 7(a) and 7(b)).

Q2: Clarification on Figure 6

The similarity heatmaps serve different purposes in each method.

In DIFT, the similarity heatmap shows semantic similarity between points in the target and source images, focusing on a one-to-one correspondence to find the most accurate match.

In MOFT, the heatmap represents motion similarity between points in the target image and a reference point in the source image. Here, the heatmap visualizes regions with similar motion rather than finding a single best match.

We visualize their effects at different time steps to highlight that MOFT works better at the earlier denoising stage, rather than to directly compare their effects.

Comment

Thank you to the authors for the detailed response. A few points remain unclear to me, including the optimization process operating solely on the latent representation; I could not fully verify this point or be entirely convinced by it. However, I understand the limitations of conveying technical details thoroughly in a written response.

If possible, I would suggest the authors provide additional examples that transfer longer and more complex motions, such as multi-point, circular, or periodic motions.

Considering the strengths and applicability of this work, I am happy to increase my rating to "Borderline Accept". I hope the authors will soon release the code so that other researchers can build upon this work to push forward the field.

Comment

Thank you for the quick reply and positive response. We appreciate your feedback and are happy to clarify any remaining points.

The optimization process on the latent representation is not a novel technique. It has been widely used in image and video editing tasks, such as DragDiffusion [1] and SMM [2]. Our focus is on proposing a method to extract motion information as the optimization target.

We will include more challenging cases in the final version and will release the code for further exploration by the research community.

Thank you again for your time and consideration.

[1] Shi Y, et al. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing, CVPR 2024.
[2] Yatim D, et al. Space-time diffusion features for zero-shot text-driven motion transfer, CVPR 2024.

Official Review (Rating: 5)

This paper investigates the relationship between the features of video diffusion models and the motion in the generated videos. By extracting motion features and using them as guidance, training-free motion control can be achieved.

Strengths

  1. The technical aspects of this paper are clear and it is easy to read.
  2. The proposed method can achieve training-free motion control for video generation.
  3. The framework can be applied to different forms of control signals.

Weaknesses

  1. From Fig. 6, it is hard to draw the conclusion that "MOFT can provide more valid information than DIFT at the early denoising stages".
  2. The analysis experiments in the method section only focused on very simple motions, such as pan up, down, left, and right, without discussing more complex and realistic motions.
  3. The generated motions presented in the experiment section are also mostly very simple.
  4. The experiments lack comparisons with existing methods. Comparisons with other methods were only made in the point drag mode.

Questions

Please refer to the weaknesses.

Limitations

The limitations are discussed, while the potential negative societal impact is not. However, I don't think that discussion is necessary.

Author Response

We appreciate your thoughtful feedback. We have provided detailed responses to each of your concerns below.

W1: Clarification on the conclusion from Fig. 6

From Fig. 6, it is hard to draw the conclusion that "MOFT can provide more valid information than DIFT at the early denoising stages".

From Figure 6, we observe that at early denoising steps (e.g., time step 800), DIFT struggles to provide valid information as the semantic correspondence is wrong. In contrast, MOFT delivers a relatively high score in the corresponding motion region (the rabbit's head), which benefits motion control. This observation is further validated by Figure 4 in the supplementary material. Note: The similarity heatmaps serve different purposes for each method. In DIFT, the heatmap represents semantic similarity, whereas in MOFT, it represents motion similarity. Therefore, we do not directly compare the heatmaps between DIFT and MOFT at the same time steps. Instead, we visualize their effects at different time steps to highlight that MOFT performs better at the earlier denoising stages.

W2: Reasons for using simple motion in analysis experiments

The analysis experiments in the method section only focused on very simple motions, such as pan up, down, left, and right, without discussing more complex and realistic motions.

We select simple motions in the analysis experiments for two reasons: (1) simplifying the analysis allows for clearer insights; (2) it is sufficient for filtering the motion channels. Motion features obtained in this way generalize well to more complex motions, because complex motions are composed of simple ones, i.e., different simple motions occurring at different times and locations combine to form complex motions.

W3: Presented motions are not simple

The generated motions presented in the experiment section are also mostly very simple.

In Figure 8 and the supplementary webpage, we showcase complex motions, including camera motion control, motion control of multiple objects, different motion frequencies and directions, etc. These examples cover common aspects of motion control, such as those in Motionctrl [12], SMM [2], and Cameractrl [13]. We welcome suggestions for additional test cases from the reviewers.

W4: Comparisons with existing methods

The experiments lack comparisons with existing methods. Comparisons with other methods were only made in the point drag mode.

In addition to comparisons using the point drag mode, we also evaluate our approach against SMM [2], a method that extracts intermediate features for motion control. To make the comparison more comprehensive, we include a qualitative comparison on motion transfer tasks with methods [2,14] (please refer to Fig. 1 and Table 1 of the attached PDF), demonstrating our ability to achieve a superior balance between motion fidelity and text alignment. This improvement is attributed to our motion decomposition design. Beyond these comparisons, we emphasize that our approach offers deeper insights into how motion is interpreted in video diffusion features.

Comment

The authors have addressed most of my concerns. I decide to keep my initial positive rating.

Comment

Dear Reviewer,

Please reply to the rebuttal if your issues are addressed.

AC.

Official Review (Rating: 5)

The paper introduces a training-free framework for understanding and controlling motion in video diffusion models. The key innovation is the MOtion FeaTure (MOFT), which is derived by removing content correlation and filtering motion channels from pre-trained diffusion model features. MOFT provides a training-free way to encode and manipulate motion information, offering high interpretability and generalizability across various architectures. The framework demonstrates competitive performance in generating natural and faithful motion, with applications in video motion control and point-drag manipulation.

优点

  1. Training-free Approach: The framework does not require additional training, leveraging pre-trained diffusion models to control motion, significantly reducing resource requirements.
  2. Interpretability: MOFT offers a clear and interpretable way to understand and manipulate motion information in video diffusion models.
  3. Generalizability: The method is applicable across various video generation models, demonstrating versatility and robustness.

Weaknesses

Scalability to Longer Videos: The proposed method's scalability to longer videos or higher resolutions is not adequately explored.

Complexity of MOFT Extraction: Removing content correlation and filtering motion channels may be complex and require fine-tuning for optimal results.

Experiments: The text prompts used for quantitative experiments and user studies are unclear, and 56 case studies are insufficient to validate effectiveness.

Questions

Providing a detailed description or pseudocode for MOFT extraction would aid in understanding its practical implementation and reproducibility. Have you tested MOFT's scalability for generating longer videos or higher-resolution outputs? What challenges, if any, did you encounter, and how did you address them? Additionally, how does this method handle more complex object control, such as multi-object scenarios with different categories and sizes?

Limitations

The authors have discussed the limitations in the supplementary material.

Author Response

Thank you for your insightful review. You will find detailed responses to each of the points you raised below.

W1: Scalability to longer videos and higher resolutions

Scalability to Longer Videos: The proposed method's scalability to longer videos or higher resolutions is not adequately explored

Thanks for your suggestions. We've added experiments showing that our methods can be directly generalized to higher resolutions and longer videos. In Figure 3(a) of the attached PDF, we demonstrate that PCA can clearly separate videos with different motions based on their diffusion features from Open-Sora [5], an open-source video generation model capable of producing long videos. In Figure 3(b), we show that our methods can be applied to higher resolutions (768×768) and longer videos (205 frames on Open-Sora).

W2: Clarification on MOFT extraction and optimization process

Complexity of MOFT Extraction: Removing content correlation and filtering motion channels may be complex and require fine-tuning for optimal results.

The extraction of MOFT is both straightforward and efficient, requiring only a subtraction operation and mask indexing, thereby adding minimal time to the overall process. Our method does not require fine-tuning of model parameters. Instead, it only involves efficient optimization during the inference stage, which is a common technique in image and video editing to guide the generation process [2,6,9]. For further clarification on the optimization process, please refer to the pseudo-code in the global response.
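Taking the "subtraction plus mask indexing" description literally, a minimal PyTorch sketch of the extraction might look as follows. The tensor layout and names are assumptions of ours for illustration, not the authors' code.

```python
import torch

def extract_moft(features, motion_channels):
    """features: (F, C, H, W) intermediate diffusion features for one video clip.
    motion_channels: 1-D LongTensor of channel indices kept by the motion channel mask.
    Returns an (F, K, H, W) motion feature, K = len(motion_channels)."""
    # Content correlation removal: subtract the across-frame mean so that information
    # shared by all frames (appearance, semantics) is suppressed and motion remains.
    residual = features - features.mean(dim=0, keepdim=True)
    # Motion channel filtering: keep only the channels identified by the PCA analysis.
    return residual[:, motion_channels]

# Hypothetical usage with stand-in tensors.
feats = torch.randn(16, 1280, 32, 32)                      # 16 frames, 1280 channels
moft = extract_moft(feats, torch.tensor([3, 17, 42, 101]))
```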

W3: Clarification on user studies

Experiments: The text prompts used for quantitative experiments and user studies are unclear, and 56 case studies are insufficient to validate effectiveness.

We have added the prompts used for the tests to the appendix; some of them are shown below.

  • b&w photo of 42 y.o man in black clothes, bald, face, half body, body, high detailed skin, skin pores, coastline, overcast weather, wind, waves, 8k uhd, dslr, soft
  • a rabbit, forest, haze, halation, bloom, dramatic atmosphere, centred, rule of thirds, 200mm 1.4f macro shot
  • a white deer in the snow, cartoon, centred
  • a man surfing in the sea, RAW photo, subject, (high detailed skin:1.2), 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3
  • a car turns in the winter, 8k uhd, dslr, soft
  • photo of coastline, rocks, storm weather, wind, waves, lightning, 8k uhd, dslr, soft lighting, high quality, film grain, Fujifilm XT3
  • night, old house, post apocalypse, forest, wind, rocks, 8k uhd, dslr, soft lighting, high quality, film grain
  • ...

A user study usually does not include as many samples as the quantitative evaluation due to the labor involved. 56 cases (~18 samples for each method) is a common scale for user studies; for example, DragonDiffusion [6] (ICLR 2024, 16 samples per method), Rerender-A-Video [10] (SIGGRAPH Asia 2023, 8 samples per method), and FateZero [11] (ICCV 2023, 9 samples per method).

Q1: Challenges and solutions in this work

What challenges, if any, did you encounter, and how did you address them?

The main challenge lies in the extraction of MOFT. Unlike DIFT, which directly uses intermediate features because they already encode rich semantic information, video diffusion features entangle various types of information, including semantic and motion information. Decomposing motion information from these features is not straightforward. Fortunately, our proposed method effectively localizes and extracts motion features, inspired by recent works on understanding video latents [8]. In addition, we want to highlight that the contribution of this paper extends beyond the technical aspects; it also discloses the finding that motion-aware features naturally exist in video diffusion models and provides a clear interpretation of how video diffusion features encode motion information.

Q2: Solutions for complex cases

Additionally, how does this method handle more complex object control, such as multi-object scenarios with different categories and sizes?

Our method can be naturally applied to multi-object scenarios involving different categories and sizes, thanks to the excellent generalization ability of video generation models. In multi-object scenarios, we employ multiple masks and use different optimization targets for these masked regions, enabling effective multi-object control. We will include more complex cases in the final version.
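As an illustration of how multiple masks with different targets could be combined, here is a hedged sketch of a joint guidance loss. The MSE formulation, names, and shapes are ours for illustration; the paper's Eq. 3 defines the actual loss.

```python
import torch
import torch.nn.functional as F

def multi_object_moft_loss(moft, targets):
    """moft: (F, K, H, W) MOFT of the current sample.
    targets: list of (mask, ref_moft) pairs; mask is a boolean (H, W) region and
    ref_moft is the target MOFT for that region, shaped like moft."""
    loss = moft.new_zeros(())
    for mask, ref in targets:
        m = mask.float().view(1, 1, *mask.shape)          # broadcast over frames and channels
        loss = loss + F.mse_loss(moft * m, ref * m)       # match motion only inside the region
    return loss

# Hypothetical usage: two objects driven toward different reference motions.
moft = torch.randn(16, 4, 32, 32)
masks = [torch.zeros(32, 32, dtype=torch.bool) for _ in range(2)]
masks[0][:, :16], masks[1][:, 16:] = True, True
loss = multi_object_moft_loss(moft, [(masks[0], torch.randn_like(moft)),
                                     (masks[1], torch.randn_like(moft))])
```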

Comment

Dear Reviewer,

Please reply to the rebuttal if your concerns are addressed.

AC.

Comment

Thanks for the response and further experiments, which have addressed my main concerns. My original rating remains unchanged.

Comment

Thank you for your positive feedback and for acknowledging that our response addressed your main concerns. We are keen to resolve any remaining issues and would be grateful if you could let us know if there are specific aspects that need further improvement. We hope that our response can fully address your concerns and positively influence your rating.

Official Review (Rating: 5)

This paper presents a training-free method for motion control in video diffusion models and explores the interpretability of features within these models. The authors demonstrate through experiments that principal components of the features, extracted using PCA, contain motion information. They propose a pipeline that eliminates content correlation information from the features, filters motion channels, and optimizes the initial latent input in the diffusion model's UNet.

Strengths

  • The paper effectively demonstrates the potential of internal features in video diffusion models to capture motion information.

  • The use of PCA to eliminate irrelevant information is well-justified.

  • The proposed training-free method can be applied to various manipulation scenarios, including reference-based and drag-based control.

Weaknesses

  • In Section 3, the authors discuss the challenge of extracting motion information from diffusion features due to their encapsulation of other data types, such as semantic and structural correlations. The paper lacks a detailed explanation of "content correlation information" (section 3). It is unclear whether this term encompasses semantic, structural, appearance, background, lighting, or other information.

  • Figure 6 contrasts the video motion control capabilities of DIFT and MOFT. It is recommended that identical sample images be used and their similarity heatmaps be compared, thereby better visualizing the capability gap between DIFT and MOFT.

  • The optimization process of the latent in Section 4.1 could be more clearly explained, particularly the settings from references [31; 41].

  • A more descriptive caption for the Motion Control Pipeline (Figure 5) and a clearer title for the caption of Figure 6 are advised for better comprehension.

Questions

Q1. After eliminating content correlation information, does the feature retain any information other than motion? If so, could this residual information affect video generation?

Q2. Figure 6 illustrates how the ability to track motion varies with generation steps. Is it possible to manipulate features from earlier generation steps to achieve better results?

Limitations

Not included. While the research demonstrates the potential of training-free methods to encode motion information, it remains unclear whether the proposed method completely eliminates content-correlated information or retains any motion-irrelevant information.

Author Response

Thank you for your constructive comments. The detailed responses regarding each concern are listed below.

W1: Explanation of content correlation information

In Section 3, the authors discuss the challenge of extracting motion information from diffusion features due to their encapsulation of other data types, such as semantic and structural correlations. The paper lacks a detailed explanation of "content correlation information" (section 3). It is unclear whether this term encompasses semantic, structural, appearance, background, lighting, or other information.

This component is inspired by VideoFusion [8], which demonstrates that decomposing video latents into latents shared among all frames and per-frame residual latents enhances video generation. The shared latents, which we refer to as content correlation information, encompass shared aspects such as semantic content and appearance. In contrast, the residual latents primarily capture motion information, which can also be interpreted as deformation in structure. Building on this insight, we designed a content correlation removal process to remove shared information while preserving motion information. The effectiveness of this approach is validated in Figure 2. We will clarify this term in the revision.

W2: Identical sample images in Figure 6

Figure 6 contrasts the video motion control capabilities of DIFT and MOFT. It is recommended that identical sample images be used and their similarity heatmaps be compared, thereby better visualizing the capability gap between DIFT and MOFT.

Thank you for your valuable input. We have added the corresponding images, and you can find the results from identical images in Figure 2 in the attached PDF, which supports our conclusions. It is important to note that the heatmaps in DIFT and MOFT represent different contents: semantic similarity in DIFT and motion similarity in MOFT. Therefore, even though the same images are used, the heatmaps cannot be directly compared side by side. Instead, we visualize their effects at different time steps to highlight that MOFT is more reliable at the earlier denoising stage.

W3: Explanation of the optimization process of the latent

The optimization process of the latent in Section 4.1 could be more clearly explained, particularly the settings from references [31; 41].

Thank you for the suggestion. We will explain this more clearly in the revision. Please refer to the global response for the pseudo-code of the optimization process.

W4: Revision for the captions

A more descriptive caption for the Motion Control Pipeline (Figure 5) and a clearer title for the caption of Figure 6 are advised for better comprehension. (MOFT v.s. DIFT)

Thank you for your suggestion. We will include the following revisions:
Revised caption of Figure 5: In one denoising step, we obtain the intermediate features and extract MOFT from them via content correlation removal and a motion channel filter. We then optimize the latents to alter the sampling process with a loss between the masked MOFT and the reference MOFT. For the detailed motion control process, please refer to the pseudo-code.
Revised title of Figure 6: Effects of DIFT and MOFT on different denoising time steps.

Q1: Clarification on the residual information

After eliminating content correlation information, does the feature retain any information other than motion? If so, could this residual information affect video generation?

As illustrated in Fig. 2 of the main paper, after removing content correlation information, videos with entirely different appearances but identical motion directions cluster closely together (e.g., Right 1 and Right 2). This suggests that motion information is the primary component retained in the feature. Additionally, results presented in Figure 7 and the supplementary material further demonstrate that reference videos have minimal influence on appearance, lighting, or other semantic aspects. Thus, the residual information beyond motion has a negligible effect on video generation.

Q2: Manipulating features from earlier generation steps

Figure 6 illustrates how the ability to track motion varies with generation steps. Is it possible to manipulate features from earlier generation steps to achieve better results?

Good idea! We do indeed manipulate features from earlier generation steps, for two primary reasons:

  1. MOFT provides valid motion information early in the generation process, allowing us to manipulate the motion effectively.
  2. The motion in the generated videos is established during the early generation steps, necessitating manipulation at these stages. However, at the very beginning of the generation process, MOFT does not yet offer a clear motion pattern. Therefore, we typically apply the optimization between denoising time steps 850 and 500 of a 1000-timestep denoising process.

The inclusion of limitations

We've described the limitations in the Supplementary due to the limited space of the main paper.

Comment

Thanks for the response. My primary concerns are solved, and I would like to recommend acceptance on the condition that the authors make all the necessary changes in the next version.

Comment

Glad to know that your primary concerns are solved. We will definitely integrate the changes mentioned in the responses into the next version (along with the changes mentioned by other reviewers), as we have already provided the detailed contents of these changes in the responses to W1, W2, W3, W4, Q1, Q2, and in the attached PDF (W2). Given that the NeurIPS policy does not allow editing the paper before the final decision, we respectfully hope that your positive attitude toward this work can be reflected in the final rating.

Comment

Dear Reviewer,

Please reply to the rebuttal if your issues are addressed.

AC.

Author Response (Global)

We are grateful to the reviewers for your thorough, insightful, and constructive feedback. We are pleased that the interpretability and clarity of MOFT have been recognized (Reviewers P4uW, vDWU) and that the technical soundness of our paper has been acknowledged (Reviewers zi7R, zmG9, P4uW, vDWU). We also appreciate the recognition of MOFT's versatility and its potential for diverse applications (Reviewers zi7R, zmG9, P4uW, vDWU), as well as the effectiveness of our training-free approach (Reviewers zi7R, zmG9, P4uW, vDWU). We have addressed each of your comments with care and have provided detailed clarifications in our responses and the attached PDF. Please refer to the responses below for further details.

Thank you for your commitment. We eagerly anticipate your continued feedback and hope that you find the responses to be satisfactory.


Below is the pseudo-code of the optimization process, since several reviewers raised this common question.

Algorithm: Optimization process

Input: noisy latents $z_t$ at timestep $t$, region mask $S$, reference MOFT $MO^g$, the network $\epsilon_{\theta}$, motion channel mask $C$, learning rate $\eta$.

Output: optimized latents $z_t^{new}$.

Begin

$X = \epsilon_{\theta}(z_t)$ $\triangleright$ Get intermediate feature $X$ from the network
Given $C$, extract MOFT $MO$ by Eq. 2 in the main paper
Given $S$, $MO^g$, $MO$, compute the loss $l$ by Eq. 3 in the main paper
Optimize $z_t$ by $z_t^{new} = z_t - \eta \frac{\partial l}{\partial z_t}$

Return $z_t^{new}$
End
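To complement the pseudo-code, the following is a hedged PyTorch-style sketch of one guided denoising step; `extract_moft`, `moft_loss`, and the `return_features` hook are illustrative placeholders for Eq. 2-3 and the feature hook in the actual implementation, and the 850-500 timestep window follows the authors' reply to Reviewer vDWU.

```python
import torch

def guided_denoising_step(z_t, t, unet, extract_moft, moft_loss,
                          region_mask, ref_moft, channel_mask,
                          lr=0.1, t_lo=500, t_hi=850):
    """One MOFT-guided latent update (illustrative sketch, not the authors' code)."""
    if t_lo <= t <= t_hi:                          # guidance only at early denoising steps
        z = z_t.detach().requires_grad_(True)
        feats = unet(z, t, return_features=True)   # assumed hook exposing intermediate features
        moft = extract_moft(feats, channel_mask)   # Eq. 2: content removal + channel filtering
        loss = moft_loss(moft, ref_moft, region_mask)  # Eq. 3: masked MOFT vs. reference MOFT
        (grad,) = torch.autograd.grad(loss, z)
        z_t = (z - lr * grad).detach()             # z_t_new = z_t - eta * dl/dz_t
    return z_t
```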


References: due to the rebuttal character limit, we have placed the references for all rebuttals below.

[1] Shi Y, et al. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing, CVPR 2024.
[2] Yatim D, et al. Space-time diffusion features for zero-shot text-driven motion transfer, CVPR 2024.
[3] Yin S, et al. Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv 2023.
[4] Gen-2, https://runwayml.com/research/gen-2.
[5] Zheng Z, et al. Open-Sora: Democratizing Efficient Video Production for All.
[6] Mou C, et al. Dragondiffusion: Enabling drag-style manipulation on diffusion models. ICLR 2024.
[7] Ouyang W, et al. I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models. arXiv preprint 2024.
[8] Luo Z, et al. Videofusion: Decomposed diffusion models for high-quality video generation, CVPR 2023.
[9] Shi Y, et al. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing, CVPR 2024.
[10] Yang S, et al. Rerender a video: Zero-shot text-guided video-to-video translation. SIGGRAPH Asia 2023 Conference Papers.
[11] Qi C, et al. Fatezero: Fusing attentions for zero-shot text-based video editing, ICCV 2023.
[12] Wang Z, et al. Motionctrl: A unified and flexible motion controller for video generation, ACM SIGGRAPH 2024 Conference.
[13] He H, et al. Cameractrl: Enabling camera control for text-to-video generation, arXiv preprint 2024.
[14] Gen-1, https://runwayml.com/research/gen-1.

Final Decision

According to the reviews, the reviewers acknowledge the paper's potential in utilizing internal features of video diffusion models to capture motion information and its applications for various manipulations, such as reference-based and drag-based control. Following the rebuttal, all reviewers have provided positive ratings for the paper. However, they have also requested that the authors incorporate the materials discussed during the rebuttal into the revised paper as a condition for acceptance. It is advised that the authors carefully follow the reviewers' suggestions and comments to refine the manuscript accordingly. For example, please incorporate the reviewers' feedback by providing a detailed explanation of the content correlation information in Section 3 and elaborating on the optimization process of the latent variables in Section 4.1. Additionally, update the motion control pipeline in Figure 5, ensure the use of identical sample images in Figure 6 for the similarity heatmaps of DIFT and MOFT, and include the results of more complex motions in the experimental section.