PaperHub
6.8 / 10
Spotlight · 4 reviewers
Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.5 · Significance: 3.0
NeurIPS 2025

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

Submitted: 2025-05-11 · Updated: 2025-10-29

Keywords
Representation Learning · Motion Analysis · Video Generation · Self-Supervised Learning

Reviews and Discussion

Official Review
Rating: 4

This paper introduces DisMo, a novel approach for learning abstract motion representations directly from raw video data using an image-space reconstruction objective. DisMo disentangles motion from static content such as appearance, object identity, or pose. This allows motion to be transferred across semantically unrelated entities, even between vastly different categories, without requiring object correspondences. DisMo achieves state-of-the-art performance on several downstream tasks, such as open-world motion transfer and zero-shot action classification.

Strengths and Weaknesses

Strengths:

  1. DisMo learns motion information directly from raw video data using only a standard image-space reconstruction loss, realized through a flow-matching objective, which is interesting and concise.
  2. Experiments on motion transfer and action classification tasks show the effectiveness of the motion embeddings.
  3. The motion representation can be combined with any existing video generator via lightweight adapters, allowing the approach to benefit from future advancements in video generation models.

Weaknesses:

  1. Scaling. Will the model's performance be further enhanced if a more powerful motion extractor and frame generator are employed? Additionally, how will the training cost be affected?
  2. Is the experimental comparison in Table 1 fair? Since DisMo is trained using a pretrained diffusion transformer along with a lightweight adapter, does this introduce any unfairness when compared to baselines such as VMC and DMT?
  3. In the action classification task, datasets like K600, THUMOS, and AVA are more commonly used; results on these should also be reported.

Questions

Please see the weaknesses. The method proposed in the paper is interesting and promising. My primary concerns focus on its generalizability and scaling capability. I will take into consideration the opinions of other reviewers when making my final decision.

Limitations

Yes.

Final Justification

The work is solid and the rebuttal addresses most of my concerns. I will maintain my positive score.

Formatting Issues

No.

Author Response

Thank you for your review and for recognizing the effectiveness and modularity of our approach. We appreciate your comments on scalability, fairness of comparisons, and further evaluations.

Scaling

We agree that using a more powerful motion extractor and frame generator has the potential to further improve performance. While we did not conduct a full scaling experiment due to computational and time constraints, we instead focused our computational resources on evaluating other, more powerful video models to scale the motion transfer capabilities. For this evaluation, we generate 2,000 videos per model, each 25 frames long, yielding 50,000 frames in total for FID computation. Motion is transferred from randomly sampled videos and prompts from the Koala-36M dataset. Our results show how scaling the video generation model greatly enhances the motion transfer capabilities of our model.

| Model          | Temporal Consistency↑ | Motion Fidelity↑ | FID↓  | FVD↓  |
|----------------|-----------------------|------------------|-------|-------|
| SparseCtrl [6] | 0.95                  | 0.73             | 116.8 | 327.9 |
| LTX [8]        | 0.95                  | 0.75             | 88.54 | 298.4 |
| CogVideoX [7]  | 0.97                  | 0.78             | 62.99 | 206.3 |
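For reference, the FID numbers above follow the standard Fréchet Inception Distance recipe: fit a Gaussian to real and generated frame features and compute the Fréchet distance between the two. The sketch below is a minimal illustration of that computation, assuming the frame features (`real_feats`, `gen_feats`) have already been extracted with an Inception-style encoder; it is not the authors' evaluation code, and FVD follows the same formula over video-level features.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats, eps=1e-6):
    """Frechet distance between Gaussians fitted to two (n, d) feature sets (FID)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # matrix square root of the covariance product; add jitter if it is near-singular
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if not np.isfinite(covmean).all():
        jitter = eps * np.eye(cov_r.shape[0])
        covmean, _ = linalg.sqrtm((cov_r + jitter) @ (cov_g + jitter), disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```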

We consider scaling the motion extractor and frame generator a valuable direction for future work. Incorporating larger architectures would likely lead to increased training cost, particularly in terms of memory and training time, due to their higher parameter count and complexity. Nevertheless, our method is modular by design and can readily benefit from such improvements.

Fairness of Comparison

We appreciate the reviewer’s concern regarding the fairness of the comparison in Table 1. In fact, we believe the comparison is conservative with respect to DisMo. While DisMo combines a pretrained diffusion transformer with a lightweight motion adapter, its overall model size and the amount of training data used are substantially smaller than those of several baselines. For example, VMC utilizes a 6B parameter text-to-video model trained on WebVid-10M [1], which contains 10 million video-text pairs—an order of magnitude larger than the 2.1M video samples used to train DisMo-LTX, which itself has only 2.17B parameters. Moreover, VMC and DMT both rely on image encoders pretrained on LAION [2], gaining significant advantages from large-scale image-text pretraining. DMT also builds upon Zeroscope [3], a model trained on both WebVid-10M and LAION. Similarly, MotionClone and MotionDirector are based on Stable Diffusion, pretrained on hundreds of millions of image-text pairs. All compared methods rely on heavy pretraining and modular adaptation. Yet, DisMo achieves competitive—and often superior—results across benchmarks with significantly fewer parameters and less pretraining data, showcasing the efficiency and robustness of our approach.

Action Classification on other Datasets

While AVA and Kinetics-600 are widely used benchmarks for action classification, we were unable to include them due to time and computational constraints (Kinetics-600 has over 500k videos). Moreover, as demonstrated in the V-JEPA [5] paper, state-of-the-art performance on Kinetics-400 (the preceding version of Kinetics-600) can be achieved using static visual features (e.g., DINOv2) alone, suggesting that the dataset can be largely solved without modeling motion. For this reason, we focused on datasets where motion understanding plays a more central role.

In response to the reviewer's request, the table below reports action classification results on THUMOS13 [4], a benchmark consisting of 24 classes and 3,207 videos. However, we found that this benchmark is largely solvable from static appearance-based cues alone, as is evident when probing the dataset with a purely frame-based model (DINOv2). While our method (DisMo) does not achieve the best performance on this dataset, it remains competitive despite its limited training data. Combined with our strong results on the other benchmarks presented in the main paper, this supports the effectiveness of our approach for learning meaningful motion representations.

| Model      | THUMOS13 |
|------------|----------|
| VideoMAE   | 0.57     |
| VideoMAEv2 | 0.97     |
| V-JEPA     | 0.88     |
| DINOv2-B   | 0.92     |
| DisMo      | 0.80     |
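As a point of reference, "probing" a dataset with frozen features typically means attaching a linear classifier on top of them. A minimal sketch of such a probe is shown below; the array names are hypothetical and the authors' exact probing protocol may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(train_emb, train_labels, test_emb, test_labels):
    """Fit a linear classifier on frozen video embeddings (e.g., DisMo motion
    embeddings or time-averaged DINOv2 frame features) and report accuracy."""
    clf = LogisticRegression(max_iter=5000)
    clf.fit(train_emb, train_labels)
    return accuracy_score(test_labels, clf.predict(test_emb))
```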

References

[1] Bain, M., Nagrani, A., Varol, G., & Zisserman, A. (2021). Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1728–1738).
[2] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, 25278–25294.
[3] Wang, J., Yuan, H., Chen, D., Zhang, Y., Wang, X., & Zhang, S. (2023). ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571.
[4] Jiang, Y.-G., Liu, J., Zamir, A. R., Laptev, I., Piccardi, M., Shah, M., & Sukthankar, R. (2013). THUMOS'13: ICCV Workshop on Action Recognition with a Large Number of Classes. http://crcv.ucf.edu/ICCV13-Action-Workshop/
[5] Bardes, A., et al. (2024). Revisiting feature prediction for learning visual representations from video.
[6] Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2024). SparseCtrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision (pp. 330–348). Cham: Springer Nature Switzerland.
[7] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., ... & Tang, J. (2024). CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
[8] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., ... & Bibi, O. (2024). LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103.

Comment

Thanks for your responses; they addressed most of my concerns. I will maintain my positive score of 4.

Official Review
Rating: 4

This paper presents DisMo, a novel framework for learning abstract motion representations disentangled from appearance, pose, and category information. Unlike prior approaches that encode motion implicitly or rely on structure-specific representations, DisMo learns motion purely through an image reconstruction loss from augmented video clips using a flow-matching-based training objective. The resulting motion embeddings are invariant to static content and can be used to condition any video generator via lightweight adapters (e.g., LoRAdapter in LTX Video) for open-world motion transfer. DisMo is evaluated extensively across three main tasks, including Motion Transfer, Appearance Disentanglement and Zero-shot Action Classification.

Strengths and Weaknesses

Strengths

  • Clear and Novel Formulation: DisMo introduces a simple and effective method for learning motion representations by leveraging standard image-space reconstruction losses, augmented with an information bottleneck and strong data augmentations to force motion extraction.
  • Disentanglement and Generalization: The model robustly disentangles motion from identity and visual content, enabling effective cross-category and viewpoint-invariant transfer. This is validated both qualitatively and quantitatively.
  • Integration with Video Generators: The framework is modular and scalable. The learned motion embeddings can be seamlessly integrated into pretrained video generation models (e.g., LTX Video) using lightweight LoRA adapters, eliminating the need to retrain the entire generative backbone.
  • Strong Experimental Results: DisMo outperforms state-of-the-art baselines in both generative and discriminative tasks—achieving best scores in motion fidelity and text adherence, while also outperforming strong backbones like V-JEPA on zero-shot action classification with only 17M training samples.
  • Various Qualitative Analysis: The paper provides a wide range of visual comparisons, illustrating robustness across scenarios such as camera motion transfer, same- and cross-category transfer, and structure-independent motion abstraction.

Weaknesses

  • No Evaluation on High-Fidelity Generation Tasks Beyond LTX: While motion quality is demonstrated using LTX Video, there is limited exploration of how DisMo integrates with other high-fidelity video generation models (e.g., Wan2.1, HunyuanVideo). Without such experiments, it is difficult to assess the general applicability of the proposed motion representations across different generation frameworks.
  • Limited Analysis of Flow-Matching Design Choice: The paper adopts flow matching as the core training objective for motion representation learning, but does not provide comparisons or ablations against alternatives like diffusion-based or contrastive objectives. It remains unclear whether the reported performance gains are due to the task formulation or specific to the flow-matching mechanism.
  • Missing Comparisons with Other Motion Extractors: Although DisMo’s own extractor is proposed as a key contribution, the paper does not compare its performance with existing video representation models (e.g., V-JEPA, VideoMAEv2) when used within the same motion transfer pipeline. This would clarify the advantage of the proposed extractor over well-established baselines.
  • Absence of Human Study: The evaluation relies entirely on quantitative metrics (e.g., CLIP-based similarity) and visual examples. However, given the inherently perceptual nature of motion realism and alignment, a user study would provide valuable insight into subjective quality and preference.
  • No Experiments on Composite Motion Scenarios: The paper does not explore how DisMo handles complex scenarios involving multiple motion types simultaneously—such as combining camera motion with object or human motion. This limits the understanding of its effectiveness in realistic, unconstrained settings.
  • Lack of Analysis on the Role of Augmentations: The paper claims that strong augmentations during training help enforce disentanglement by forcing the extractor to focus on temporal dynamics rather than appearance. However, there is no dedicated analysis or qualitative example demonstrating how this affects motion transfer performance. A targeted study could help isolate the impact of augmentations on representation quality and transfer fidelity.
  • Ambiguous Reference in Line 252: The sentence “Qualitative examples in Section 4.1 further illustrate this” refers to visual results, but it is unclear whether this points to Figure 4.

Questions

  • Have you tested DisMo with other high-quality video generators (e.g., Wan2.1, HunyuanVideo)? If not, do you anticipate any compatibility or performance limitations when applying your motion embeddings to those models?
  • Why did you choose flow matching over other generative training paradigms like diffusion, contrastive learning, or masked prediction?
  • How does the performance of your motion extractor compare to representations learned by models such as V-JEPA or VideoMAEv2, when used in the same transfer setup?
  • What specific impact do the augmentations during training have on the quality of motion transfer results?
  • Given the perceptual nature of motion transfer quality, have you considered running a user study or human evaluation to complement the CLIP-based metrics?
  • Can DisMo handle composite motion scenarios involving both object motion and camera motion simultaneously?

Limitations

yes

Final Justification

The rebuttal adequately addresses most of my concerns. I will therefore maintain my positive score.

Formatting Issues

None

Author Response

Thank you for your detailed and insightful review. We appreciate your recognition of the novelty, formulation, and experimental results of our work, as well as your constructive suggestions for improvement.

Other Video Generation Backbones

In response to the reviewer's request, we additionally test DisMo with other high-quality video generation backbones, namely SparseCtrl and CogVideoX. As shown in the table below, using a more capable generator like CogVideoX leads to improved motion fidelity and temporal consistency, without any modification to the motion embeddings produced by DisMo. This demonstrates that DisMo is generally applicable to off-the-shelf video models without being constrained to a specific one. We further assessed the quality and realism of the generated videos using FID and FVD scores, observing consistent improvements with stronger video backbones, indicating that DisMo benefits from more powerful architectures and remains compatible with future models. For this evaluation, we generate 2,000 videos per model, each 25 frames long, yielding 50,000 frames in total for FID computation. Motion is transferred from randomly sampled videos and prompts from the Koala-36M [4] dataset.

| Model          | Temporal Consistency↑ | Motion Fidelity↑ | FID↓  | FVD↓  |
|----------------|-----------------------|------------------|-------|-------|
| SparseCtrl [5] | 0.95                  | 0.73             | 116.8 | 327.9 |
| LTX [7]        | 0.95                  | 0.75             | 88.54 | 298.4 |
| CogVideoX [6]  | 0.97                  | 0.78             | 62.99 | 206.3 |

Other Training Paradigms

We chose flow matching because it has been shown to be as effective a generative paradigm as diffusion for images [1] and offers a significantly simpler training process, as it does not require, for example, careful tuning of a noise schedule. We also performed an ablation study in which we replaced flow matching with a discriminative masked prediction objective, followed by CogVideoX fine-tuning. As shown in the table below, flow matching leads to a notable gain in motion fidelity, while keeping other metrics comparable:

| Model       | Prompt Adherence↑ | Temporal Consistency↑ | Driving Video Similarity↓ | Motion Fidelity↑ |
|-------------|-------------------|-----------------------|---------------------------|------------------|
| DisMo (MAE) | 0.26              | 0.97                  | 0.55                      | 0.65             |
| V-JEPA      | 0.26              | 0.97                  | 0.59                      | 0.67             |
| DisMo       | 0.27              | 0.97                  | 0.56                      | 0.78             |
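For readers less familiar with the objective, a generic flow-matching reconstruction step (linear interpolation path, velocity target) can be sketched as follows. This is an illustrative, rectified-flow-style formulation rather than the exact DisMo objective, and `generator` is a hypothetical conditional network.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(generator, motion_emb, source_frame, target_frame):
    """One generic flow-matching reconstruction step.

    The network predicts the velocity that transports noise to the target frame,
    conditioned on the source frame and the motion embedding."""
    noise = torch.randn_like(target_frame)
    t = torch.rand(target_frame.shape[0], 1, 1, 1, device=target_frame.device)
    x_t = (1.0 - t) * noise + t * target_frame   # point on the straight path
    v_target = target_frame - noise              # constant velocity of that path
    v_pred = generator(x_t, t.flatten(), source_frame, motion_emb)
    return F.mse_loss(v_pred, v_target)
```

Because the velocity target is fixed by the interpolation path, no noise schedule needs to be tuned, which is the simplicity argument made above.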

Motion Transfer with other Representations

In response to the comment suggesting comparisons with other motion extractors, we train a large I2V model (CogVideoX) using a frozen V-JEPA backbone, consistent with the fine-tuning procedure described in Sec. 3.3. However, for $M_t$, we use spatially averaged embeddings from V-JEPA. The results in the table above demonstrate how this increases the similarity to the driving video, indicating content bleeding.

Augmentations and Motion Transfer

As shown in Table C, while previous-frame conditioning already prevents appearance leaking and improves disentanglement to a significant degree, integrating augmentations further boosts these improvements. Furthermore, we evaluate how our motion representations handle changes in appearance and geometry by applying photometric and geometric augmentations to a driving video. Using the same source prompt, we perform motion transfer with the original and both augmented videos and then compare the outputs. The tables below show that DisMo with augmentations exhibits notably greater invariance to driving-video changes, particularly in response to geometric transformations.

Photometric Augmentations Applying photometric augmentations yields a 2.4% relative improvement in L1 and a 1.4% relative improvement in SSIM:

| Model                     | L1↓   | SSIM↑ |
|---------------------------|-------|-------|
| DisMo (w/o augmentations) | 0.082 | 0.71  |
| DisMo                     | 0.080 | 0.72  |

Geometric Augmentations Geometric augmentations lead to a 2.3% relative improvement in L1 and a 4.4% relative improvement in SSIM:

| Model                     | L1↓   | SSIM↑ |
|---------------------------|-------|-------|
| DisMo (w/o augmentations) | 0.087 | 0.68  |
| DisMo                     | 0.085 | 0.71  |
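For clarity, the L1 and SSIM numbers in these tables compare, frame by frame, the video generated from the original driving video with the one generated from its augmented counterpart. A minimal sketch of that comparison is given below, assuming both outputs are available as float arrays in [0, 1]; the helper name is illustrative, not part of our codebase.

```python
import numpy as np
from skimage.metrics import structural_similarity

def video_invariance_metrics(vid_ref, vid_aug):
    """Mean per-frame L1 and SSIM between the video generated from the original
    driving video (vid_ref) and the one from its augmented version (vid_aug).
    Both are float arrays in [0, 1] with shape (T, H, W, 3)."""
    l1 = np.mean(np.abs(vid_ref - vid_aug))
    ssim = np.mean([
        structural_similarity(f_r, f_a, channel_axis=-1, data_range=1.0)
        for f_r, f_a in zip(vid_ref, vid_aug)
    ])
    return l1, ssim
```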

Human Evaluation

We agree that perceptual aspects such as motion transfer quality benefit from human evaluation in addition to automated metrics. To this end, we conducted a user study comparing several state-of-the-art models, including our own (DisMo), to better understand how they perform across realism, prompt adherence, and motion transfer quality. The results of the study are shown below:

| Method         | Realism (%) | Prompt Matching (%) | Motion Transfer Quality (%) |
|----------------|-------------|---------------------|-----------------------------|
| DMT            | 10.93       | 9.60                | 17.73                       |
| MotionDirector | 10.98       | 7.47                | 25.96                       |
| VMC            | 20.04       | 26.13               | 16.98                       |
| MotionClone    | 19.91       | 19.42               | 14.62                       |
| DisMo          | 38.13       | 37.38               | 24.71                       |

While MotionDirector achieves the highest motion transfer quality (25.96%), users reported a notable drop in prompt adherence (7.47%), likely due to visual inconsistencies and content bleeding. Our method, DisMo, is on par with MotionDirector in motion transfer quality (24.71%) while significantly outperforming it in both realism and prompt matching.

Composite Motion

We designed a targeted experiment to evaluate the composability of DisMo’s motion embedding space.

To assess which aspects of motion are encoded in the embeddings, we use the YUP++ [2] dataset, which contains 600 static videos with no camera motion. Using these, we can sample three types of video clips:

  • Object Motion Only: A randomly sampled clip from the dataset, containing natural object motion but no camera movement.
  • Camera Motion Only: A different clip, augmented with temporally smooth geometric transformations simulating camera motion, applied only to the initial frame (thus eliminating object motion).
  • Combined Motion: The sampled clip with both object motion and the same simulated camera motion applied across all frames.

This setup allows us to isolate the two motion components, camera and object, and test whether DisMo’s embeddings for the combined motion contain information from both individual motion types.
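As an illustration of this protocol, the sketch below constructs one such triplet by warping frames of a static-camera clip with a temporally smooth affine transform. The path parameters (`max_shift`, `max_zoom`) and the use of torchvision's `affine` are assumptions made for illustration; the exact augmentation parameters used in our evaluation are not reproduced here.

```python
import numpy as np
import torch
import torchvision.transforms.functional as TF

def smooth_camera_path(num_frames, max_shift=20, max_zoom=0.1):
    """Temporally smooth translation/zoom parameters simulating camera motion."""
    t = np.linspace(0.0, 1.0, num_frames)
    dx = max_shift * np.sin(np.pi * t)                 # horizontal pan
    dy = 0.5 * max_shift * (1.0 - np.cos(np.pi * t))   # slower vertical drift
    zoom = 1.0 + max_zoom * t                          # gradual zoom-in
    return dx, dy, zoom

def make_triplet(clip):
    """clip: (T, C, H, W) float tensor from a static-camera YUP++ video."""
    num_frames = clip.shape[0]
    dx, dy, zoom = smooth_camera_path(num_frames)

    def warp(frame, i):
        return TF.affine(frame, angle=0.0,
                         translate=[int(dx[i]), int(dy[i])],
                         scale=float(zoom[i]), shear=[0.0])

    object_only = clip                                                        # natural object motion, static camera
    camera_only = torch.stack([warp(clip[0], i) for i in range(num_frames)])  # first frame only, simulated camera
    combined = torch.stack([warp(clip[i], i) for i in range(num_frames)])     # object motion + simulated camera
    return object_only, camera_only, combined
```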

We sample 3,200 such triplets, each containing the three types of clips described above, and encode each clip using DisMo's motion encoder. We then estimate the mutual information (MI) between the embeddings of the combined-motion clip ($M_{both}$) and those of the camera-only ($M_{cam}$) and object-only ($M_{obj}$) clips, respectively. High mutual information would suggest that $M_{both}$ captures shared information from $M_{cam}$ and $M_{obj}$, which is a key indication of composability.

To quantify this, we use the Kraskov–Stögbauer–Grassberger (KSG) estimator [3] with $k = 5$, as also used in Sec. C.3. To compute statistical significance, we further estimate MI under 100 random permutations (i.e., computing MI between unpaired clips), yielding a null distribution with means of around 0.2 nats and standard deviations of around 0.4. The observed MI values for both ($M_{both}$, $M_{cam}$) and ($M_{both}$, $M_{obj}$) are significantly higher than this baseline, indicating a strong statistical dependency and thus supporting the hypothesis that DisMo's latent space is compositional with regard to simultaneous camera and object motion.
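For concreteness, the sketch below shows one way to implement the KSG estimator (Kraskov et al., estimator 1) and the permutation null described here, using k-d trees under the max norm. The function names are illustrative and this is not our exact evaluation code.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=5):
    """KSG estimator 1 of I(X; Y) in nats; x, y have shape (n_samples, d)."""
    n = len(x)
    joint = np.hstack([x, y])
    # distance to the k-th neighbour in the joint space (Chebyshev / max norm)
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # count marginal neighbours strictly inside that radius (excluding the point itself)
    nx = cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf, return_length=True) - 1
    ny = cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def permutation_null(x, y, k=5, n_perm=100, seed=0):
    """MI between unpaired samples: shuffle y to break the pairing."""
    rng = np.random.default_rng(seed)
    return np.array([ksg_mutual_information(x, rng.permutation(y), k=k)
                     for _ in range(n_perm)])
```

The observed MI is then compared against the mean and standard deviation of this permutation null, as described above.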

The table below summarizes the mutual information results.

| MI (nats)     | Low Camera Motion | Medium Camera Motion | High Camera Motion |
|---------------|-------------------|----------------------|--------------------|
| Camera Motion | 1.37              | 2.25                 | 4.40               |
| Object Motion | 3.59              | 2.97                 | 1.51               |

References

[1] Ma, N., Goldstein, M., Albergo, M. S., Boffi, N. M., Vanden-Eijnden, E., & Xie, S. (2024). SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision (pp. 23–40). Cham: Springer Nature Switzerland.
[2] Feichtenhofer, C., Pinz, A., & Wildes, R. P. (2017). Temporal residual networks for dynamic scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4728–4737).
[3] Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
[4] Wang, Q., et al. (2024). Koala-36M: A large-scale video dataset improving consistency between fine-grained conditions and video content.
[5] Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2024). SparseCtrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision (pp. 330–348). Cham: Springer Nature Switzerland.
[6] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., ... & Tang, J. (2024). CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
[7] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., ... & Bibi, O. (2024). LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103.

Comment

I appreciate the authors' efforts in the rebuttal. The response adequately addresses most of my concerns. I will therefore maintain my positive score.

Official Review
Rating: 5

The authors proposed DisMo that learns abstract motion representations directly from video data via an image space reconstruction objective. As their model is independent of different static information, including appearance, pose, or object identity, it has the capability for open-world motion transfer that allows motion to be transferred across semantically unrelated entities without requiring object correspondences. Unlike prior methods that show a trade-off between motion fidelity and adherence to textual prompts, their approach disentangles the motion semantics from instance-specific appearance features. They demonstrated the effectiveness on diverse motion transfer tasks, outperforming existing state-of-the-art video representation models.

Strengths and Weaknesses

Strengths

S1. The idea of representation learning solely based on an image-space reconstruction objective is interesting and unique in terms of leading to an abstract motion representation space.

S2. The authors performed a diverse set of experiments focusing on cross-category and cross-viewpoint settings. The downstream classification task also shows the efficiency of the learned motion embeddings.

S3. One of the primary strengths of this paper is that the core idea of learning motion representations that are invariant to static content is well described and motivated.

S4. The experiments are comprehensive with clear implementation details, ablation studies, and latent space analysis.

S5. The paper is well-written, easy to read and follow.

Weaknesses

W1. I found that the current methodological development lacks detail about how the "invariance" is actually enforced or maintained, and which architectural components ensure this property.

W2. While the authors show appearance disentanglement results in Tab. 2, they do not provide clear metrics that specifically measure motion invariance or quantify how well motion patterns are preserved. Also, the invariance evaluation would be further strengthened by experiments in different scenarios, such as varied lighting, backgrounds, and camera angles. Overall, I find that the current disentanglement evaluation is limited and does not directly measure motion invariance.

W3. Computational costs and statistical significance tests are missing. Compared to the alternatives, what are the efficiency and complexity in terms of time and memory?

I would strongly suggest that the authors address the raised weaknesses and concerns in the rebuttal period.

Questions

Please see the "Strengths and Weaknesses" section.

Limitations

Yes

Final Justification

My questions and concerns are well addressed. I will keep my original score.

Formatting Issues

NA

Author Response

Thank you for your encouraging review and for acknowledging the strengths of our method and comprehensive analysis. We value your detailed suggestions about the invariance, computational costs and statistical significance of our model.

Enforcing Invariance

Here is a more detailed explanation of why augmentations help to enforce invariance in the learned motion embeddings. During training, the motion extractor is fed augmented frames from the video, while the source frame provided to the frame generator (for reconstruction) is not augmented, and the target frame that the generator is tasked to reconstruct is also not augmented. This setup encourages the frame generator to use the motion embedding $m_t$, which was derived from an augmented input frame $\mathbf{x}_t$, to reconstruct the unaltered future frame $\mathbf{x}_{t+\Delta t}$. In doing so, the model is incentivized to learn motion embeddings that are robust to visual perturbations and focus solely on the underlying motion, thereby promoting invariance to augmentations. We evaluate this by applying photometric and geometric augmentations to a driving video. Using the same source prompt, we perform motion transfer with the original and both augmented videos and then compare the outputs. The tables below show that DisMo with augmentations exhibits notably greater invariance to driving-video changes, particularly in response to geometric transformations.

Photometric Augmentations

| Model                     | L1↓   | SSIM↑ |
|---------------------------|-------|-------|
| DisMo (w/o augmentations) | 0.082 | 0.71  |
| DisMo                     | 0.080 | 0.72  |

Geometric Augmentations

| Model                     | L1↓   | SSIM↑ |
|---------------------------|-------|-------|
| DisMo (w/o augmentations) | 0.087 | 0.68  |
| DisMo                     | 0.085 | 0.71  |
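A schematic sketch of the asymmetric augmentation described at the start of this response is given below: only the motion extractor sees augmented frames, while the reconstruction target stays clean. All function names (`motion_extractor`, `frame_generator`, `reconstruction_loss`, `augment`) are hypothetical placeholders, not the actual DisMo implementation.

```python
def disentangling_step(motion_extractor, frame_generator, reconstruction_loss,
                       augment, clip, t, dt):
    """Schematic training step. clip: (B, T, C, H, W) tensor.

    The motion embedding m_t is computed from augmented frames, while the
    generator is asked to reconstruct the clean future frame from the clean
    source frame, so appearance perturbations cannot be copied through m_t."""
    m_t = motion_extractor(augment(clip))   # motion embedding from perturbed appearance
    x_t = clip[:, t]                        # clean source frame given to the generator
    x_target = clip[:, t + dt]              # clean future frame to reconstruct
    # e.g. a flow-matching objective conditioned on the source frame and m_t
    return reconstruction_loss(frame_generator, m_t, x_t, x_target)
```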

Invariance of the Motion Representation

We evaluate the invariance of motion representations under appearance and geometric changes. Starting with an original driving video, we generate two augmented versions: one with photometric augmentations (e.g., changes in brightness, contrast, and color) and one with geometric augmentations (e.g., rotations, cropping, and spatial shifts). We then perform motion transfer using all three driving videos, the original and the two augmented versions, while keeping the same source prompt and seed. This results in three generated videos. To assess invariance, we compare the generated videos from the two augmented driving inputs against the one generated using the original driving video. Ideally, an invariant motion representation should produce minimal differences between these outputs. The tables below show that our method demonstrates significantly higher consistency across these conditions compared to prior work, indicating stronger invariance to both photometric and geometric changes.

Photometric Augmentations

| Model          | CLIP↑ | SSIM↑ | L1↓  | LPIPS↓ |
|----------------|-------|-------|------|--------|
| DMT            | 0.66  | 0.51  | 0.17 | 0.59   |
| VMC            | 0.71  | 0.44  | 0.17 | 0.58   |
| MotionDirector | 0.62  | 0.46  | 0.15 | 0.55   |
| MotionClone    | 0.76  | 0.52  | 0.13 | 0.46   |
| DisMo          | 0.90  | 0.72  | 0.08 | 0.25   |

Geometric Augmentations

| Model          | CLIP↑ | SSIM↑ | L1↓  | LPIPS↓ |
|----------------|-------|-------|------|--------|
| DMT            | 0.66  | 0.48  | 0.18 | 0.62   |
| VMC            | 0.71  | 0.40  | 0.20 | 0.62   |
| MotionDirector | 0.60  | 0.44  | 0.16 | 0.60   |
| MotionClone    | 0.72  | 0.43  | 0.18 | 0.58   |
| DisMo          | 0.90  | 0.71  | 0.09 | 0.26   |

Computational Costs

We report the inference time and the number of parameters for each model during motion transfer generation in the table below. It is important to note that DMT, VMC, and MotionDirector are optimization-based methods or rely on per-sample fine-tuning, which significantly increases inference time, as they perform iterative adaptation for each video. In contrast, our method (DisMo) is feed-forward at inference time, enabling significantly faster generation. Across all configurations we report, DisMo is faster and more efficient, except when paired with CogVideoX as the video generation backbone, which increases inference time due to the size and complexity of the underlying model. However, using CogVideoX is not required to achieve strong motion transfer results: as evidenced by Table 1 in the main paper, DisMo combined with lighter backbones such as LTX already achieves better or comparable motion fidelity and transfer quality, while being substantially faster and more efficient, as seen in the table below.

| Model            | Inference Time (A100) | # Params |
|------------------|-----------------------|----------|
| VMC              | 10 min / video        | 6B       |
| DMT              | 7.5 min / video       | 1.7B     |
| MotionClone      | 45 sec / video        | 0.983B   |
| MotionDirector   | ~5 min / video        | 0.983B   |
| DisMo-CogVideo   | 2 min / video         | 6.072B   |
| DisMo-LTX        | 30 sec / video        | 2.172B   |
| DisMo-SparseCtrl | 10 sec / video        | 1.115B   |

Note: The reported number of parameters is approximate and refers to the largest component of each model, to the best of our knowledge. Exact values may vary depending on implementation details and auxiliary components.

Comment

I want to thank the authors for their detailed responses. I will keep my original score, as my concerns have been adequately addressed.

Official Review
Rating: 4

This paper proposes DisMo, a method for learning abstract, disentangled motion representations from raw videos via image-space reconstruction. The goal is to enable open-world motion transfer without relying on object correspondences. DisMo can be plugged into existing video generators and shows strong results in fidelity, prompt alignment, and generalization.

Strengths and Weaknesses

  • The writing in this paper is clear and easy to follow.
  • I appreciate the method’s flexibility and modular design — it is compatible with various pre-trained video generators. The ability to perform zero-shot transfer is particularly compelling.
  • It’s also encouraging to see strong performance on both standard and out-of-distribution benchmarks.

Limitations:

  • The idea is not very new to me. Abstraction and disentanglement exist in robotics and embodied AI; the contribution feels incremental.
  • Evaluation scope limited: Would like to see more extreme or semantically challenging motion transfers to test generality.
  • The method adds non-trivial motion modeling layers, but lacks analysis of training/inference cost. How stable is the method?

Questions

Can the authors clarify the novelty of their approach, especially in relation to existing work in robotics and motion generation?

How do the authors address the visible artifacts in the generated videos? Is there any planned analysis or improvements for these issues?

I am not fully in this field, so I would like to take the other reviewers' opinions on this submission into account.

Limitations

/

Final Justification

Thank you for the response. I appreciate the authors’ clarifications regarding the novelty of the approach compared to robotics pipelines, and I find the distinction around unconstrained, general-purpose motion representations to be well-articulated. The additional results with stronger video generators (e.g., CogVideoX) also address my concerns about generation artifacts. Overall, I am satisfied with the rebuttal and the points raised were adequately addressed. I will maintain my original score.

Formatting Issues

/

Author Response

Thank you for your review and for highlighting the flexibility and clarity of our method. We appreciate your feedback on the evaluation scope, computational analysis, and novelty. We understand the concern and would like to clarify the specific contributions that distinguish our work from prior methods.

Novelty

While there is prior work on learning inverse dynamics in robotics, our method is novel in several key aspects:

  • Unconstrained motion space and training: Latent dynamics spaces that the robotics community attempts to learn typically specialize in specific domains and use cases, such as robotic movements or video game controls. This is usually enforced by highly reduced and/or regularized embeddings with the goal of capturing individual, atomic, and often discrete actions. Meanwhile, DisMo leverages an unconstrained and larger embedding space to represent generic and composable motion. This design choice is crucial for open-set video training, where complex dynamics with multiple modes (e.g., multiple moving objects, object parts, and/or camera movement) must be captured.
  • General-purpose applicability: While some recent robotics methods have begun exploring open-world data, many still focus on structured and constrained settings (e.g., repetitive arm movements or indoor navigation). In contrast, our approach is designed to operate on unstructured, in-the-wild video data, enabling broader generalization across diverse and unconstrained scenarios. This introduces significantly more variability in appearance, scene dynamics, and motion patterns, requiring more general and robust representations. This gives rise to general-purpose motion representations that can be used for a variety of downstream tasks, such as action classification (Section 4.3) or general latent space analysis (Section D). Most importantly, though, we demonstrate how our motion representations are well suited for open-world motion transfer across arbitrary scenes and subjects, rather than task-specific behavior cloning or control, which is usually the primary focus in robotics.
  • Second-stage video model finetuning: Our method is designed to be plug-and-play, allowing motion embeddings to be combined with any video generator. We further demonstrate performance gains through second-stage finetuning of the video model, which is uncommon in robotics pipelines.

Addressing Artifacts in Generated Videos

We address the visible artifacts by employing a more powerful video generation model, CogVideoX, instead of LTX. This substitution significantly reduces artifacts, as evidenced by the improved FID and FVD scores reported in the table below. While we are unable to provide additional qualitative results during the rebuttal period due to NeurIPS regulations, the quantitative improvement in FID and FVD offers strong support for the effectiveness of this change. To this end, we generate 2,000 videos per model, each consisting of 25 frames, resulting in a total of 50,000 frames for FID evaluation. The videos are generated by transferring motion from randomly sampled videos and prompts of the Koala-36M [1] dataset.

| Model          | FID↓  | FVD↓  | Temporal Consistency↑ | Motion Fidelity↑ |
|----------------|-------|-------|-----------------------|------------------|
| SparseCtrl [2] | 116.8 | 327.9 | 0.95                  | 0.73             |
| LTX [4]        | 88.54 | 298.4 | 0.95                  | 0.75             |
| CogVideoX [3]  | 62.99 | 206.3 | 0.97                  | 0.78             |

References

[1] Wang, Q., et al. (2024). Koala-36M: A large-scale video dataset improving consistency between fine-grained conditions and video content.
[2] Guo, Y., Yang, C., Rao, A., Agrawala, M., Lin, D., & Dai, B. (2024). SparseCtrl: Adding sparse controls to text-to-video diffusion models. In European Conference on Computer Vision (pp. 330–348). Cham: Springer Nature Switzerland.
[3] Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., ... & Tang, J. (2024). CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072.
[4] HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., ... & Bibi, O. (2024). LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103.

Comment

Thank you for the response. I appreciate the authors’ clarifications regarding the novelty of the approach compared to robotics pipelines, and I find the distinction around unconstrained, general-purpose motion representations to be well-articulated. The additional results with stronger video generators (e.g., CogVideoX) also address my concerns about generation artifacts. Overall, I am satisfied with the rebuttal and the points raised were adequately addressed. I will maintain my original score.

Final Decision

Summary

This paper proposes DisMo, a novel approach for learning abstract motion representations from raw video data using an image-space reconstruction objective. The main idea is to disentangle motion from static content such as appearance, object identity, or pose. DisMo achieves state-of-the-art performance on several downstream tasks, such as open-world motion transfer and zero-shot action classification.

Strengths

  • Novel approach of disentangling motion from appearance for abstract motion representation learning.
  • Strong experimental results on various datasets

Weaknesses

  • Some reviewers raised concerns about the level of novelty, since disentangling motion from appearance has been used in other fields such as robotics.
  • Some ablations / additional analyses were needed.

Rebuttal / Discussion

During the rebuttal, the authors were able to address most of the reviewers' concerns. In the end, all four reviewers recommend accepting the paper. The AC agrees with this recommendation and thus recommends acceptance.