SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
A novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process for HOI video and motion generation.
Abstract
Reviews and Discussion
SViMo proposes a joint video and motion diffusion pipeline that extends image-to-video generative models for Hand-Object Interaction (HOI) scenarios. The framework is trained on the TACO dataset, which includes 20 object categories and 15 types of daily hand-object interactions. The pipeline introduces a triple-modality adaptive module that enables integration of text, video, and motion tokens despite differences in scale and feature space. In addition, a vision-aware 3D interaction diffusion module maps and optimizes the generated motion into a 3D point cloud space. The results are evaluated using two types of metrics: (1) VBench-based perceptual metrics for subject/background consistency and temporal smoothness in the generated video, and (2) standard 3D interaction metrics such as mean per-joint position error (MPJPE) and motion smoothness, where SViMo shows improved performance over prior baselines.
Strengths and Weaknesses
Strengths
• The paper is clearly written and the proposed SViMo framework is intuitively motivated, especially in how it bridges video appearance and motion dynamics through a unified joint generation process.
• The modular architecture and use of intermediate 2D-rendered motion videos as proxies for 3D motion is practical and well-reasoned, given the domain gap between 3D point clouds and video generation.
Weaknesses
• The related work section could better acknowledge and position this work relative to prior works such as VideoJAM [1], which tackles related challenges in joint video-motion modeling.
• While quantitative 3D evaluation metrics are used, they may not correlate well with perceptual quality in generated videos. User studies or perceptual metrics would strengthen claims of motion plausibility and realism.
• The zero-shot generalization results remain limited in scope: although the model is evaluated on unseen HOI prompts, the object categories (e.g., bowl, spoon) remain closely tied to the fine-tuning distribution. It would be more compelling to include qualitative or quantitative results on semantically distant HOI categories (e.g., grabbing a phone and dialing numbers).
• One of the VBench metrics used, background consistency, is affected by the authors' choice to crop hand regions, which could confound the metric's validity and should be acknowledged more clearly.
• It would be beneficial to have more supplementary video examples available outside of the curated video presentation to allow reviewers to explore the full diversity of generations independently.
Questions
Zero-shot Generalization Scope: Were any test prompts explicitly filtered to ensure no overlap with the object or interaction types seen during training? While the paper claims out-of-domain HOI generation, prompts involving bowls and spoons still appear semantically close to training categories. Could the authors include examples involving semantically unrelated HOIs to better demonstrate true generalization?
Alignment Between Evaluation Metrics and Human Perception: The current evaluation relies heavily on structural 3D metrics like Chamfer distance. However, these metrics may not correlate well with human-perceived motion plausibility or overall video quality. Would it be possible to provide even a small-scale user preference study, or discuss whether such perceptual evaluation was considered or piloted?
Ablation on Fusion Strategy Simplicity: The proposed triple-modality adaptive modulation with scaling, shifting, and gating is a compelling approach to multimodal alignment. Could the authors clarify whether they explored simpler fusion baselines, such as projecting all modalities into a shared embedding space via linear layers?
Limitations
yes
Final Justification
My final score is Borderline Accept. The justification is found in the sequence of initial review / rebuttal / comments.
Formatting Issues
none
- Q1: Discussion of VideoJAM [1] (ICML 2025). The concurrent work VideoJAM generates video alongside 2D optical flow to enhance visual fidelity and motion plausibility. In contrast, our approach generates video with explicit 3D motion. This enables direct perception of object geometry, scale, and spatial relationships, resulting in superior physical plausibility for complex scenarios involving occlusions or intricate interactions. The core challenge in 3D joint video-motion generation lies in the significant domain gap between visual and dynamic information, as well as the difficulty of ensuring consistency between both outputs. To address this, we bridge the multimodal domain gap through adaptive modulation and establish a closed-loop co-evolutionary pipeline. This framework effectively captures richer complementary information between video and motion. To validate our improvement over VideoJAM, we replace our default synchronized video-3D-motion diffusion model with a video-2D-motion variant to mimic VideoJAM. Results demonstrate that incorporating 3D motion diffusion yields more reasonable motions, faster convergence, a higher overall video score, and significantly better human perceptual quality. We will cite and discuss VideoJAM in the revised manuscript.
| Method | Motion FID ↓ | Video Loss @1K (log) ↓ | Overall Video Quality ↑ |
|---|---|---|---|
| Ours (Video-2D-Motion Diffusion) | 0.0522 | -3.8265 | 0.8721 |
| Ours (Video-3D-Motion Diffusion, Default) | 0.0100 | -4.9119 | 0.8800 |
- Q2: Human perceptual evaluation of 3D motions. For 3D motion evaluation, we employed MPJPE, motion smoothness, Chamfer distance, and FID metrics, detailed in Sec. 4.1 and Tables 2 & 3. To provide an evaluation more consistent with human perception, we conducted a user study (Fig. 5(b) in the main paper) involving 410 participants, which demonstrates our method's superiority over the baselines in 97.56% of cases.
- Q3: Zero-shot generalization to more distant HOI categories. Our train-test split mitigates HOI category overlap by reserving all instances involving specific actions (e.g., hit), tools (e.g., glue gun), and objects (e.g., toy) exclusively for the test set, ensuring these categories remain unseen during training. Thus, our evaluation on the TACO test data demonstrates zero-shot generalization capabilities to some extent. To further validate this capability in real-world scenarios, we conducted experiments using daily HOI examples (Fig. 6). For a more rigorous generalization evaluation, we constructed an additional test set of 100 instances comprising entirely unseen HOI categories (novel objects, actions, and diverse camera viewpoints), sourced from TACO, OakInk2 [2], and GigaHands [3] datasets. As shown below, while video and motion metrics for this unseen set show a slight performance decline compared to the main TACO test set, they remain reasonably close. This robust zero-shot generalization capability stems from two critical design principles: (1) Our plug-and-play synchronized diffusion architecture leverages the visual-semantic priors of large video foundation models. (2) Our 3D point cloud object representation offers enhanced spatial geometry and physical awareness, and supports viewpoint generalization.
| Evaluation Data | Overall Video Quality ↑ | 3D Motion FID ↓ |
|---|---|---|
| Ours (on completely unseen data) | 0.8712 | 0.0197 |
| Ours (on TACO test data) | 0.8800 | 0.0100 |
- Q4: Demonstration of more diverse videos. As suggested by the reviewer, we will include more representative video samples, including both successful demonstrations (cases mentioned in Q3) and systematically analyzed failure cases, in the upcoming supplementary material to further substantiate the practical versatility of our approach.
- Q5: Validity of the VBench metric (background consistency). Firstly, as defined in VBench [4], the "background consistency" metric calculates global CLIP image feature differences between consecutive video frames, reflecting overall video stability across both foreground and background. Secondly, in the field of HOI generation, cropping the HOI region is a common practice [5], as the global background context has limited impact on fine-grained local interactions. Our experiments reveal that despite cropping the HOI region, baseline models still exhibit hallucinated artifacts, such as spontaneous object emergence, violating temporal coherence (see Appendix Fig. 9, rows 1-2). Therefore, this metric remains valuable for video consistency evaluation.
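To make the metric's definition concrete for readers, below is a minimal sketch of a consecutive-frame CLIP-feature consistency score. VBench's actual implementation (backbone choice, first-frame anchoring, score normalization) may differ; the checkpoint name is just a common public model, not necessarily the one used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_consistency(frames: list[Image.Image]) -> float:
    """Mean cosine similarity of CLIP image features between consecutive frames."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (T, D) image embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)       # cosine sim per adjacent pair
    return sims.mean().item()
```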
- Q6: Ablation study of the multimodal fusion mechanism. As suggested by the reviewer, we replaced the adaptive modulation with a simpler linear projection (Linear Proj. (Suggested)) mapping features to a shared embedding space. The results below demonstrate that the linear projection leads to significant degradation in motion FID scores, slower convergence, and a moderate decline in overall video quality. This is because our triple-modality adaptive modulation has the advantage of inherently decoupling appearance and motion, thereby enabling the extraction of discriminative and complementary video-motion representations.
| Multimodal Fusion | 3D Motion FID ↓ | Video Loss @1K (log) ↓ | Overall Video Quality ↑ |
|---|---|---|---|
| Simpler Linear Proj. (Suggested) | 0.1573 | -2.9739 | 0.8643 |
| Adap. Mod. (Ours) | 0.0100 | -3.4883 | 0.8800 |
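For intuition, here is a minimal, illustrative sketch (not the paper's actual module) contrasting an AdaLN-style scale/shift/gate block with the plain linear-projection baseline; the class names, pooled-conditioning assumption, and gating details are simplifications.

```python
import torch
import torch.nn as nn

class AdaptiveModulation(nn.Module):
    """Illustrative AdaLN-style fusion: a conditioning vector predicts
    per-channel scale, shift, and gate applied to the target tokens."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)  # -> scale, shift, gate

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) target tokens; cond: (B, cond_dim) pooled condition
        scale, shift, gate = self.to_mod(cond).chunk(3, dim=-1)
        modulated = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x + gate.unsqueeze(1).tanh() * modulated  # gated residual update

class LinearProjectionFusion(nn.Module):
    """The simpler baseline: project each modality into a shared space
    and concatenate along the token axis."""
    def __init__(self, dims: list[int], shared_dim: int):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, shared_dim) for d in dims)

    def forward(self, tokens: list[torch.Tensor]) -> torch.Tensor:
        # tokens[i]: (B, N_i, dims[i]) -> concatenated (B, sum(N_i), shared_dim)
        return torch.cat([p(t) for p, t in zip(self.projs, tokens)], dim=1)
```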
- [1] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In Forty-second International Conference on Machine Learning, 2025.
- [2] Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, and Cewu Lu. Oakink2: A dataset of bimanual hands-object manipulation in complex task completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 445–456, 2024.
- [3] Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, and Srinath Sridhar. Giga-hands: A massive annotated dataset of bimanual hand activities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17461–17474, 2025.
- [4] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
- [5] Youxin Pang, Ruizhi Shao, Jiajun Zhang, Hanzhang Tu, Yun Liu, Boyao Zhou, Hongwen Zhang, and Yebin Liu. Manivideo: Generating hand-object manipulation video with dexterous and generalizable grasping. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12209–12219, 2025.
I’m sorry, I had initially missed the user study shown in Figure 5, which alleviates my earlier concern regarding human evaluation. The authors have also addressed my request for an ablation of their multimodal fusion mechanism well, and the results suggest it does indeed play a meaningful role in their pipeline. Additionally, the authors acknowledged the omission of a relevant citation to VideoJAM and have said they will reposition the paper to include it in the final version. My main remaining concern is whether the model generalizes well. The authors have provided some quantitative results and mention the inclusion of more diverse cases, but it’s hard to evaluate this without concrete visual examples. Given the above points, a score change to borderline accept is reasonable, with a suggestion to include more out-of-domain objects and activities in the final version.
Dear Reviewer T4nk, Thank you for your thorough review and professional feedback! We are delighted that you raised the score to borderline accept, and we will include more out-of-domain objects and activities in the final version.
Thanks for your willingness to include more out-of-domain objects and activities!
This paper presents a framework that integrates HOI action generation models with HOI video generation models, enabling closed-loop collaborative optimization between the two. The approach facilitates the transfer of useful information and physical principles between 2D and 3D data, thereby enhancing the generalization and effectiveness of HOI video and motion generation.
Strengths and Weaknesses
Strengths:
This paper explores how to effectively combine 3D and 2D models within the HOI (Human-Object Interaction) context to leverage complementary information from both modalities. This is an interesting and potentially impactful direction for the community.
The authors propose a unified optimization pipeline and validate its effectiveness on both action generation and video generation tasks. The visual quality of the generated videos appears promising.
The paper is clearly written, with sufficient methodological details, making the work seem reproducible.
Weaknesses:
The validation of the Multimodal Feature Modulation and Fusion module is somewhat insufficient. Balancing features across different modalities is a critical issue. To better demonstrate the effectiveness of the feature alignment in this module, additional ablation or validation experiments are recommended.
Some technical details are not clearly explained. For example, the Chamfer loss in Eq. (2) should be further clarified or properly cited in the main text.
Some related works that combine motion and video, although not in a joint optimization setting, should be cited and discussed for completeness. Relevant examples include:
Corona, Enric, et al. "Vlogger: Multimodal diffusion for embodied avatar synthesis." CVPR 2025.
Chefer, Hila, et al. "VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models." arXiv preprint arXiv:2502.02492 (2025).
Questions
From my perspective, exploring how to combine the strengths of 3D and 2D models is an interesting topic. This direction becomes even more appealing in the complex HOI scenarios, where leveraging both modalities can be especially beneficial.
I am also highly interested in the generalizability of the proposed framework. Specifically, it would be valuable to understand whether it can be extended to more diverse object and human interaction scenarios, such as human-scene interactions, full-body representations, deformable or flexible objects, and architectural objects (e.g., flip phones or earbud cases).
In addition, it would be helpful to include results or analysis regarding the convergence behavior of the model on both 3D and 2D data.
The comparison with 3D motion generation models also requires clarification. To my knowledge, the compared 3D motion models, MDM and EMDM, are text-to-motion, whereas the 3D motion generation in this work appears to be conditioned on input images.
Moreover, a comparison of resource and time consumption with methods like AnimateAnyone under the same resolution setting would be informative for understanding the efficiency and scalability of the proposed approach.
Limitations
Yes.
Final Justification
The authors have addressed the concerns I raised, and after also considering the feedback from other reviewers, I believe that this paper tackles an interesting and important problem. The generation results are also promising. On the other hand, I encourage the authors to further explore scenarios in more sophisticated HOI scenarios. Overall, I keep my score of accept.
Formatting Issues
No.
- Q1: Ablation study of the multimodal fusion mechanism. We replaced our adaptive modulation with a simpler linear projection (Linear Proj.) mapping features to a shared embedding space. The results below demonstrate that the linear projection leads to significant degradation in motion FID scores, slower convergence, and a moderate decline in overall video quality. This is because our triple-modality adaptive modulation has the advantage of inherently decoupling appearance and motion, thereby enabling the extraction of discriminative and complementary video-motion representations.
| Multimodal Fusion | 3D Motion FID ↓ | Video Loss @1K (log) ↓ | Overall Video Quality ↑ |
|---|---|---|---|
| Simpler Linear Proj. | 0.1573 | -2.9739 | 0.8643 |
| Adap. Mod. (Ours) | 0.0100 | -3.4883 | 0.8800 |
- Q2: Chamfer distance loss. The Chamfer distance loss is employed in Eq. 6 (not Eq. 2) of the main text, with its detailed formulation presented in Eq. 13 of the Appendix. This loss measures the Chamfer distance between the generated object point clouds and the ground-truth point clouds, and it also serves as a metric for 3D motion generation quality.
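For readers without the appendix at hand, a standard symmetric Chamfer distance between a generated object point cloud $\hat{P}$ and the ground-truth cloud $P$ has the form below; the paper's Eq. 13 may differ in normalization or in using unsquared distances.

```latex
d_{\mathrm{CD}}(\hat{P}, P) =
\frac{1}{|\hat{P}|}\sum_{x \in \hat{P}} \min_{y \in P} \lVert x - y \rVert_2^2
+ \frac{1}{|P|}\sum_{y \in P} \min_{x \in \hat{P}} \lVert x - y \rVert_2^2
```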
- Q3: Discussion of concurrent works Vlogger[1] (CVPR 2025) and VideoJAM[2] (ICML 2025). VideoJAM generates video alongside 2D optical flow to enhance visual fidelity and motion plausibility. In contrast, our approach generates video with explicit 3D motion. This enables direct perception of object geometry, scale, and spatial relationships, resulting in superior physical plausibility for complex scenarios involving occlusions or intricate interactions. Vlogger utilizes a two-stage pipeline: first, it generates 3D human poses with one network, then feeds the motions into another separate network for video synthesis. In contrast, we unify video and motion generation within a single synchronized diffusion model. Since HOI video and motion inherently share the same physical laws of the real world, this joint modeling enables the learning of intrinsic co-evolutionary relationships, enhancing the quality of both outputs. We will cite and discuss Vlogger and VideoJAM in the revised manuscript.
- Q4: Extension potential to more sophisticated HOI scenarios. Our method inherently supports extension to HOI scenarios involving objects of complex geometries, or even articulated objects, and full-body interactions. Extending the approach requires only minor adjustments to point cloud density or quantity, or pose representation in our vision-aware 3D interaction diffusion model. Specifically, (1) Complex geometries or deformable objects require increased point density per object for precision. (2) Articulated objects can be decomposed into sub-components and increase the total number of point clouds. (3) Full-body interaction requires extending the current hand pose to a full-body skeleton pose. Notably, human-scene interaction is distinct from our task. Those works typically take static 3D scenes as input and generate plausible human motions[3][4].
- Q5: Convergence behavior on the 2D and 3D data. Our model comprises two core components: synchronized diffusion for generating 2D video and motion video, and 3D interaction diffusion for synthesizing 3D HOI motions. To analyze the convergence behavior, we conducted comparative experiments using only 2D data to train the synchronized diffusion versus using joint 2D/3D data to train the whole model. Results show that at identical training steps (1K, 3K), integrating 3D data achieves lower training loss, faster convergence, and superior final performance.
| Method | Video Loss @1K (log) ↓ | Video Loss @3K (log) ↓ | Overall Video Quality ↑ |
|---|---|---|---|
| Ours w/ Only 2D Data | -3.4883 | -4.1912 | 0.8719 |
| Ours w/ 2D & 3D Data | -4.9119 | -5.9522 | 0.8800 |
- Q6: Comparison setting for 3D motion generation between ours and the baselines. Our method generates motion from both images and text (image+text-to-motion), whereas the original MDM and EMDM are text-to-motion systems. For a fair comparison, we adapted both baselines into text-and-image-to-motion generators by additionally feeding the reference image into their CLIP encoders (details in lines 253-260).
- Q7: Memory and time consumption for training and inference. We compared VRAM consumption and training/inference time between Animate Anyone (as suggested by the reviewer) and CogVideoX-5B (our backbone). Results show that Animate Anyone requires significantly lower VRAM consumption and shorter time for both training and inference. This efficiency stems from its incremental architecture design: it extends a 2D image generation model with temporal attention blocks to enable video synthesis. In contrast, both CogVideoX and our method employ unified spatial-temporal full attention modules. Compared with CogVideoX, our approach can simultaneously generate both 2D videos and 3D motion sequences while introducing only minimal additional parameters and keeping computational cost within a reasonable range.
| Method | Params (B) | Training Peak VRAM (GB) | Training Time Per Step (s) | Inference Peak VRAM (GB) | Inference Latency for 50 steps (s) |
|---|---|---|---|---|---|
| Animate Anyone (Suggested) | 2.17 | 39.98 | 2.45 | 6.84 | 20.81 |
| CogVideoX-5B (Our Backbone) | 5.57 | 56.32 | 11.61 | 20.59 | 48.35 |
| Ours | 5.82 | 61.52 | 18.06 | 23.35 | 71.31 |
- [1] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. In Forty-second International Conference on Machine Learning, 2025.
- [2] Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, and Cristian Sminchisescu. Vlogger: Multimodal diffusion for embodied avatar synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15896–15908, 2025.
- [3] Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, and Siyuan Huang. Scaling up dynamic human-scene interaction modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1737–1747, 2024.
- [4] Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. Move as you say interact as you can: Language-guided human motion generation with scene affordance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 433–444, 2024.
After reading the author's response and other reviewers' comments, I have decided to keep my original score.
Dear Reviewer Lhcy, Thank you for your thorough review and professional feedback!
This paper introduces SViMo, a two-stream diffusion framework that jointly generates a hand-object interaction (HOI) video and its corresponding 3D motion (hand pose and object point cloud) from an image-and-text prompt.
Strengths and Weaknesses
Strengths
- The paper is clearly written and easy to follow.
- The proposed model, which jointly denoises video and motion in a closed-loop feedback cycle, presents a somewhat novel approach within the domain. Additionally, the removal of pre-defined meshes or pose conditions increases its general applicability.
- The experimental evaluation is comprehensive, covering diverse baselines, user studies, and real-world zero-shot settings. The results also demonstrate that the proposed method achieves superior performance.
Weaknesses
- For motion generation evaluation, the chosen metrics are limited. MPJPE and Chamfer distance are commonly used in prediction tasks, but for generative tasks, metrics like Diversity should also be included. Furthermore, for assessing physical plausibility, metrics such as Contact and Penetration would help substantiate the claims.
Minor issues:
- The vertical spacing below Figures 5 and 6 is slightly excessive.
Questions
- Could you elaborate on the training and inference efficiency of your model, including memory consumption and runtime costs?
- How does your model generalize to novel verbs (e.g., “crack”, “unscrew”) that are not present in the TACO dataset?
Limitations
yes
Final Justification
The author's rebuttal has addressed my concerns. As my current score already indicates support for accepting the paper, I will retain it.
Formatting Issues
N/A
- Q1: Diversity and physical metrics of 3D motions. As suggested by the reviewer, we added three new evaluation metrics: diversity, penetration rate, and contact rate. The following table shows that our method achieves lower diversity scores than baselines. This occurs because baseline methods produce 3D motions with visible artifacts, including chaotic object point clouds and significant shape/scaling distortions relative to reference images (Fig. 4). Consequently, higher diversity scores don't necessarily indicate better motion quality. Additionally, our method achieves superior performance in both penetration rate and contact rate metrics, indicating superior physical plausibility.
| Method | Diversity ↑ | Penetration Rate ↓ | Contact Rate ↑ |
|---|---|---|---|
| MDM | 6.5308 | 0.0904 | 0.8107 |
| EMDM | 5.7847 | 0.0844 | 0.8251 |
| Ours | 2.4135 | 0.0473 | 0.8328 |
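Since the rebuttal does not spell out how penetration and contact rates are computed, a commonly used frame-level formulation is sketched below for reference; the 5 mm threshold and the `object_sdf` helper (a per-frame signed-distance query) are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def interaction_rates(hand_joints, object_points, object_sdf, contact_thresh=0.005):
    """Illustrative frame-level contact and penetration rates.

    hand_joints:    (T, J, 3) hand joint positions per frame
    object_points:  (T, N, 3) object surface points per frame
    object_sdf:     callable (frame_idx, (J, 3) points) -> (J,) signed distances,
                    negative inside the object (assumed available, e.g. from a mesh)
    contact_thresh: joint-to-surface distance (meters) counted as contact
    """
    contact, penetration = [], []
    for t, (joints, points) in enumerate(zip(hand_joints, object_points)):
        # pairwise joint-to-point distances: (J, N)
        dists = np.linalg.norm(joints[:, None, :] - points[None, :, :], axis=-1)
        contact.append(dists.min() < contact_thresh)                  # any joint near the surface
        penetration.append(bool((object_sdf(t, joints) < 0).any()))  # any joint inside the object
    return float(np.mean(contact)), float(np.mean(penetration))
```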
- Q2: Memory and time consumption for training and inference. We compared the VRAM usage and training/inference time with our backbone CogVideoX-5B, and found that our method simultaneously generates both 2D video and 3D motion sequences while introducing only minimal additional parameters, and maintaining computational cost within a reasonable range.
| Method | Params (B) | Training Peak VRAM (GB) | Training Time Per Step (s) | Inference Peak VRAM (GB) | Inference Latency for 50 steps (s) |
|---|---|---|---|---|---|
| CogVideoX-5B (Our Backbone) | 5.57 | 56.32 | 11.61 | 20.59 | 48.35 |
| Ours | 5.82 | 61.52 | 18.06 | 23.35 | 71.31 |
- Q3: Generalization to actions with novel verbs. Our train-test split mitigates HOI category overlap by reserving all instances involving specific actions (e.g., hit), tools (e.g., glue gun), and objects (e.g., toy) exclusively for the test set, ensuring these categories remain unseen during training. Thus, the evaluation on the test data (Tables 1 & 2 in the main paper) demonstrates generalization to novel verbs to some extent. For a more rigorous generalization evaluation, we constructed an additional test set of 100 instances comprising entirely unseen HOI categories with novel verbs, sourced from TACO, OakInk2 [1], and GigaHands [2] datasets. As shown below, while video and motion metrics for this unseen set show a slight performance decline compared to the main TACO test set, they remain reasonably close. This robust zero-shot generalization capability stems from two critical design principles: (1) Our plug-and-play synchronized diffusion architecture leverages the visual-semantic priors of large video foundation models. (2) Our 3D point cloud object representation offers enhanced spatial geometry and physical dynamic awareness, and supports viewpoint generalization.
| Evaluation Data | Overall Video Quality ↑ | 3D Motion FID ↓ |
|---|---|---|
| Ours (on completely unseen verbs) | 0.8712 | 0.0197 |
| Ours (on TACO test data) | 0.8800 | 0.0100 |
- [1] Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, and Cewu Lu. Oakink2: A dataset of bimanual hands-object manipulation in complex task completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 445–456, 2024.
- [2] Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, and Srinath Sridhar. Giga-hands: A massive annotated dataset of bimanual hand activities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17461–17474, 2025.
Thanks for your response. As my current score already indicates support for accepting the paper, I will retain it.
Dear Reviewer rh5e, Thank you for your thorough review and professional feedback!
This paper presents SViMo, a synchronized diffusion model for video and motion generation in hand-object interaction (HOI) scenarios. The authors observe that HOI involves complex physical dynamics, which pixel-level video generation often fails to capture with high fidelity. They note that visual appearance and motion patterns are governed by shared physical laws, and thus propose synchronized generation of three representations: 3D motion, rendered 2D motion, and video. The SViMo model takes as input text, a reference image, and a rendered motion video, using a triple modality adaptive module to share information across these three modalities, and generates both HOI video and rendered motion video. A vision-aware 3D interaction model then takes the two video latents as input, diffusing noise over hand and object joint sets to produce 3D hand skeletons and object surface point clouds. These generated 3D interactions can be used as closed-loop supervision through gradient guidance or rendered to 2D for rendered motion supervision. Quantitative and qualitative experiments show that the proposed method outperforms prior video generation models in terms of quality, dynamics, and motion plausibility. Perceptual studies indicate that the generated videos are preferred by users. Ablation studies evaluate the effectiveness of the VID model and the impact of synchronized training of both HOI RGB and motion video streams.
Strengths and Weaknesses
Strengths
- The paper addresses an important and challenging problem: generating physically plausible HOI videos.
- It introduces the novel idea of leveraging rendered motion videos as strong guidance to improve physical plausibility, supported by a tailored training and inference pipeline.
- The method makes effective use of explicit 3D supervision during training, ensuring consistency and quality in generated motion.
Weaknesses
- The comparison with other video generation models is somewhat unfair, as these models do not utilize the additional motion information available in SViMo. For models not trained on this information, it would be interesting to see if they could animate motion videos from the first frame and generate plausible results. For other baselines, it would be useful to include rendered motion as an additional condition, or to show how fine-tuning with this modality affects their performance. This would help ablate the contribution of the extra motion video modality and the triple modality adaptive module to overall video generation quality.
- Robustness to camera viewpoint is not thoroughly evaluated. The mocap data (TACO) used contains only 12 camera views, and the supplementary video primarily shows results from two views. In real-world settings, camera perspectives are more varied, and since the video encoding does not explicitly incorporate camera information, it is unclear how well the method generalizes. I recommend evaluating the method’s performance across a broader range of viewpoints, possibly using datasets like OAKINK2 or GigaHands, which feature diverse view settings.
  - OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion
  - GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities
- After reviewing the supplementary material, it remains unclear whether the pipeline requires a rendered motion video as input during inference. For example, in Supplementary Algorithm 2 (line 4), Z~^{M}_{0} is derived directly from h0 and o0. But h0 and o0 are inferred from Ht, Ot, Zt; is this feasible without further guidance? More explanation or clarification would be helpful.
Questions
- How is the train-test split performed—by object, by action, or by instance?
- The paper uses TACO, where in all scenes the subject interacts with exactly two objects. How does the proposed method generalize to scenarios with more objects or with greater variability in object numbers and shapes?
Limitations
- As noted in the supplementary material, the method has limitations in training efficiency and relies heavily on access to 3D ground truth data.
Formatting Issues
n.a.
- Q1: The influence of motion information on baseline video models. Compared to baseline methods performing video generation alone, our approach jointly generates video and motion, and learns a visual-motion joint representation through adaptive multimodal modulation. To enable fair comparisons with baselines and assess the contribution of the modulation mechanism to our performance gains, we augmented baseline video models with the motion modality. Specifically, Animate Anyone inherently uses GT Motion as input. For EasyAnimate and CogVideoX, we trained them using both the original video and the rendered motion video. Results show that Animate Anyone produces videos of notably lower overall quality due to severe inter-frame flickering (see Fig. 3 in the main text and supplementary videos). For EasyAnimate and CogVideoX, while motion-aware training yielded marginal gains, improvements remained modest. This suggests that multimodal modulation fusion plays a more critical role in our approach than the motion modality alone.
| Method | Motion Availability | Motion Type | Overall Video Quality ↑ |
|---|---|---|---|
| Easy Animate | w/o | w/o | 0.8330 |
| CogVideoX-I2V-5B | w/o | w/o | 0.8684 |
| Animate Anyone | w/ | Input GT Motion | 0.8209 |
| Easy Animate | w/ | Train with motion data | 0.8366 |
| CogVideoX-I2V-5B | w/ | Train with motion data | 0.8698 |
| Ours | w/ | Train with motion data | 0.8800 |
- Q2: Evaluation on HOI data with more camera viewpoints. The TACO dataset we used contains 12 camera views. To assess our method on HOI datasets with greater viewpoint diversity, as suggested by the reviewer, we curated approximately 70 instances from OakInk2 [1] (4 views) and GigaHands [2] (51 views). As shown below, both the video and motion metrics on this unseen set are comparable to those on the main TACO test set. This robust generalization capability to camera views stems from two critical design principles: (1) Our plug-and-play synchronized diffusion architecture leverages the visual-semantic priors of large video foundation models. (2) Our 3D point cloud object representation offers enhanced spatial geometry and physical dynamic awareness, and supports viewpoint generalization.
| Evaluation Data | 3D Motion FID ↓ | Overall Video Quality ↑ |
|---|---|---|
| Ours (on diverse camera viewpoint data) | 0.0214 | 0.8703 |
| Ours (on TACO test data) | 0.0100 | 0.8800 |
- Q3: Requirement of the rendered motion video during inference. Our method does not require the rendered motion video as input during inference. We only require a reference image and a prompt as inputs; we initialize the target outputs (2D video, 2D motion video, and 3D motion) with pure noise and iteratively denoise them to obtain the final results. As detailed in Appendix Alg. 2, Line 4, denoting the noisy latents at timestep t (including Ht, Ot, Zt), we first apply single-step denoising using VID to produce the refined (but still noisy) 3D motion (h0, o0). The refined 3D motions are then rendered back into a noisy 2D motion video latent Z~^{M}_{0}. Since Z~^{M}_{0} represents a more refined estimate than the original noisy motion latent, we concatenate it with the latter as auxiliary information and input them into SViMo.
- Q4: Train-test dataset splitting. We implemented a two-stage data partitioning strategy. First, to mitigate overlap between test and training sets that might compromise evaluation of generalization, we reserved all instances involving specific actions (e.g., hit), tools (e.g., glue gun), and objects (e.g., toy) as the initial test set, ensuring these elements are absent from the training set. Second, from the remaining instances grouped by <action, tool, object> triplets, we applied weighted sampling according to group sizes to obtain additional test data. Finally, the ratio of the training set to the test set is 9:1 (Line 203 in the main text).
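A rough sketch of the described two-stage protocol is given below; the instance fields and the proportional-sampling details are illustrative assumptions, not the authors' actual preprocessing code.

```python
import random
from collections import defaultdict

def split_taco(instances, held_out, test_ratio=0.1, seed=0):
    """Two-stage train/test split sketch. Each instance is assumed to be a dict
    with 'action', 'tool', and 'object' keys; `held_out` is a set of category
    names reserved entirely for the test set."""
    rng = random.Random(seed)

    # Stage 1: any instance touching a held-out action/tool/object goes to the test set.
    stage1_test = [x for x in instances
                   if {x["action"], x["tool"], x["object"]} & held_out]
    remaining = [x for x in instances if x not in stage1_test]

    # Stage 2: group the rest by <action, tool, object> and draw extra test
    # instances from each group in proportion to its size.
    groups = defaultdict(list)
    for x in remaining:
        groups[(x["action"], x["tool"], x["object"])].append(x)
    target = max(0, int(test_ratio * len(instances)) - len(stage1_test))
    stage2_test = []
    for items in groups.values():
        k = round(target * len(items) / max(1, len(remaining)))
        stage2_test.extend(rng.sample(items, min(k, len(items))))

    test = stage1_test + stage2_test
    train = [x for x in remaining if x not in stage2_test]
    return train, test
```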
- Q5: Extension potential to HOI scenarios with objects of different numbers and shapes. Our method inherently supports extension to HOI scenarios involving diverse object numbers, shapes, complex geometries, or even articulated objects. Extending the approach requires only minor adjustments to point cloud density or quantity in our vision-aware 3D interaction diffusion model. Specifically, (1) Object number expansion merely needs additional point clouds. (2) Articulated objects (e.g., drawers, flip phones) can be decomposed into sub-components and handled via approach (1). (3) Complex geometries or deformable objects require increased point density per object for precision.
- Q6: Limitations about training efficiency and the reliance on 3D data. We discussed the limitations in training efficiency of the video foundation model, and the requirement for paired video-3D motion data in Appendix Sec. D. These issues will be alleviated with the development of video foundation models, as well as advances in 3D hand pose recovery and object pose estimation methods.
- [1] Xinyu Zhan, Lixin Yang, Yifei Zhao, Kangrui Mao, Hanlin Xu, Zenan Lin, Kailin Li, and Cewu Lu. Oakink2: A dataset of bimanual hands-object manipulation in complex task completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 445–456, 2024.
- [2] Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, and Srinath Sridhar. Giga-hands: A massive annotated dataset of bimanual hand activities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17461–17474, 2025.
Dear reviewer gC6v,
Please reply to the authors' rebuttal and discuss before finalizing your score.
AC
The rebuttal clearly resolves all of my concerns regarding (1) the fair comparison with video models, (2) viewpoint generalization, and (3) the inference pipeline. I support the acceptance of the paper.
Strengths
- Novel closed-loop architecture for generating hand-object interaction videos - unlike many others, which are more or less "ControlNet" variants
- SViMo: synchronized, all-in-one video generation and motion generation diffusion model for plausible hand-object interaction video generation
- VID: vision-aware 3D interaction diffusion generates 3D interaction from the SViMo video and motion latent codes to feed into the SViMo interaction guidance to form a closed-loop feedback prompting optimization
- Convincing results qualitatively and quantitatively. Ablations support the design choices.
Weaknesses
- Some clarifications and more experiments, e.g., generalization capabilities, as suggested by reviewers, will strengthen the paper.
I believe this multi-modal generation with closed-loop feedback has great potential beyond hand-object interaction. Because of this general impact of the paper, I recommend accepting it as a spotlight.