PaperHub
Overall score: 7.8/10 · Spotlight · 4 reviewers
Ratings: 4, 5, 5, 5 (min 4, max 5, std 0.4) · Mean confidence: 3.8
Novelty 2.8 · Quality 3.0 · Clarity 3.0 · Significance 3.0
NeurIPS 2025

Stable Part Diffusion 4D: Multi-View RGB and Kinematic Parts Video Generation

OpenReview · PDF
Submitted: 2025-04-13 · Updated: 2025-10-29
TL;DR

Stable Part Diffusion 4D (SP4D) generates RGB and kinematic part videos from monocular inputs via a dual-branch diffusion model.

Abstract

Keywords
Video Generation; Part Decomposition; Kinematic Parts

Reviews and Discussion

Review (Rating: 4)

This paper proposes Stable Part Diffusion (SPD), a method to generate paired multi-view RGB and kinematic part videos from monocular inputs. It introduces a dual-branch diffusion model, spatial color encoding, Bidirectional Diffusion Fusion (BiDiFuse), and a contrastive part consistency loss. The authors design a 2D-to-kinematic mesh pipeline to produce skeleton-free, animation-ready outputs. They also curate KinematicParts20K, a large-scale dataset to train the model. Experiments demonstrate that SPD generalizes well across real-world and synthetic scenarios.

Strengths and Weaknesses

Strengths

  1. They are the first to propose a framework for generating paired multi-view RGB and kinematic part videos from monocular inputs.
  2. The kinematic part representation appears to be somewhat novel. It can be used for 3D mesh animation.
  3. Experiments demonstrate that SPD generalizes well across both real-world and synthetic scenarios.

Weaknesses

  1. This paper does not provide any standard metrics, such as PSNR or FID, to quantitatively assess the quality of the generated multiview RGB videos.
  2. In Table 1, the authors only report comparisons with 2D segmentation methods. It is critical to also include relevant 3D segmentation methods.
  3. The effectiveness of the kinematic part representation is not sufficiently validated. The paper primarily compares against 2D segmentation approaches (e.g., SAM) designed for semantic classification, which are inherently less suited for the evaluation metric defined for kinematic representation. This comparison alone does not convincingly demonstrate the advantages of the proposed kinematic representation. Considering that the kinematic representation is intended to support 3D mesh animation, the authors should report quantitative metrics for skinning weight accuracy and compare these results with existing state-of-the-art techniques. This would provide stronger empirical evidence.

Questions

  1. The authors should provide metrics, such as PSNR or FID, to quantitatively assess the quality of the generated multiview RGB videos.
  2. In Table 1, the authors should report comparisons with 3D segmentation methods.
  3. The author should validate the effectiveness of the kinematic part representation by reporting quantitative metrics for skinning weight accuracy or other relevant animation metrics.

Limitations

yes

Final Justification

My major concerns are video quality metrics, segmentation comparison with 3D-based methods, and evaluation on animation metrics. In the rebuttal, the authors provide detailed experiments addressing all three points. Therefore, my concerns are fully addressed, and I will keep my positive score.

Formatting Issues

No formatting issues

Author Response

We sincerely thank you for your constructive review. We are glad to see that you recognize the novelty of our framework for generating paired multi-view RGB and kinematic part videos from monocular inputs. We appreciate your positive remarks on the usefulness of the proposed kinematic part representation for 3D mesh animation, as well as your acknowledgment that our experiments demonstrate strong generalization across both real-world and synthetic scenarios. Below, we address your concerns point by point and have conducted additional experiments accordingly.


R1. Lack of PSNR / LPIPS for Multi-view RGB Videos

Thank you for this suggestion. Although our primary goal is to solve the kinematic part segmentation task, we agree that assessing the quality of the generated RGB outputs is helpful for completeness.

We will include quantitative evaluations of the RGB branch on the KinematicParts20K test set using LPIPS and PSNR in the revised version. While SPD shares a similar RGB backbone with SV4D 2.0, we observe a small but consistent improvement in both metrics (0.13% on LPIPS and 0.07% on PSNR). We believe this is due to beneficial cross-branch interactions between the RGB and part segmentation branches during training, which enhance the overall feature quality.

We will include these results in the revised manuscript.
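For readers who want to reproduce this kind of RGB-quality evaluation, the sketch below shows one common way to compute PSNR and LPIPS on paired frames; it is a generic illustration (assuming torch tensors in [0, 1] and the open-source lpips package), not the authors' exact evaluation code.

```python
# Generic PSNR/LPIPS computation on paired video frames; illustrative only.
import torch
import lpips  # pip install lpips

def psnr(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (N, 3, H, W) tensors with values in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance; expects inputs in [-1, 1]

def lpips_score(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return lpips_fn(pred * 2 - 1, target * 2 - 1).mean()
```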


R2. Comparison with 3D Segmentation Methods

We appreciate this suggestion and agree it is important to contextualize our method beyond 2D segmentation baselines. SOTA 3D segmentation methods, such as Segment Anything Mesh [1] and SAMPart3D [2], rely on 2D semantic segmentation backbones (e.g., SAM, DINOv2), which are texture- or appearance-driven and not designed for kinematic reasoning. This is reflected in our visual comparisons in Figure 2.

According to your suggestion, we conducted a comprehensive evaluation on the KinematicParts20K test set using SOTA 3D segmentation methods [1,2]. Below we report mIoU, ARI, F1, mAcc, and User Study scores (the user study follows the same evaluation criteria as in Supplementary Section D.2):

Table 1. Comparison of SPD with SOTA 3D Segmentation Methods [1,2].

Method             mIoU   ARI    F1     mAcc   User Study
Segment Any Mesh   0.15   0.06   0.29   0.20   1.98
SAMPart3D          0.13   0.05   0.27   0.18   1.75
Ours (Full)        0.64   0.58   0.67   0.72   4.13

This provides a broader evaluation of SPD's kinematic segmentation ability.
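As a point of reference, the sketch below shows one standard way to compute two of these metrics (mIoU with Hungarian-matched part IDs, and ARI) on flattened per-pixel label maps; it is an assumption-level illustration rather than the paper's exact evaluation protocol.

```python
# Generic mIoU (with Hungarian matching of unordered part IDs) and ARI; illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

def matched_miou(gt: np.ndarray, pred: np.ndarray) -> float:
    """gt, pred: integer label maps of the same shape with arbitrary part IDs."""
    gt_ids, pr_ids = np.unique(gt), np.unique(pred)
    iou = np.zeros((len(gt_ids), len(pr_ids)))
    for i, g in enumerate(gt_ids):
        for j, p in enumerate(pr_ids):
            inter = np.sum((gt == g) & (pred == p))
            union = np.sum((gt == g) | (pred == p))
            iou[i, j] = inter / max(union, 1)
    rows, cols = linear_sum_assignment(-iou)  # match predicted parts to GT parts by IoU
    return float(iou[rows, cols].mean())

def ari(gt: np.ndarray, pred: np.ndarray) -> float:
    return float(adjusted_rand_score(gt.ravel(), pred.ravel()))
```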


R3. Validation of the Kinematic Part Representation (Animation Metrics)

Thank you for this excellent suggestion. We fully agree that it is critical to assess the usefulness of our kinematic representation beyond segmentation accuracy.

In the revised version, we will include:

  • Skinning weight precision on samples with available ground-truth rigging (KinematicParts20K-test), compared with SOTA auto-rigging methods [3,4].
  • A user study on generated objects (which lack ground truth), comparing animation plausibility between SPD and the SOTA auto-rigging methods [3,4]. This will demonstrate the generalization of the models to unseen generated objects and poses.

Table 2. Comparison of SPD with SOTA Auto-rigging Method [3,4].

Method             KinematicParts20K-test       Generated Objects
                   Precision    User Study      User Study
Magic Articulate   63.7         3.8             2.7
UniRig             64.3         3.9             2.3
Ours (Full)        72.7         4.3             4.1

We also note in Figure 2 (middle) that we already compare with MagicArticulate, a SOTA auto-rigging method trained on large-scale rigged meshes from ObjaverseXL. While MagicArticulate performs well on standard, seen-category inputs, it struggles on generated meshes, such as dinosaurs and robotic arms (as shown in Figure 2).

In contrast, SPD leverages strong priors from a 2D diffusion model and learns kinematic decomposition robustly for real-world, synthetic, or unusual objects. We believe this illustrates a key motivation for our approach: learning kinematic structure from 2D multi-view supervision enables better generalization to novel inputs.

These new experiments will be included in the updated manuscript to demonstrate the advantages of our kinematic representation in both rigging precision and real-world usability.


We truly appreciate your detailed feedback and believe the additional experiments will address your concerns effectively. Thank you again for your time and valuable insights.

[1] Yang, Yunhan, et al. "Sampart3d: Segment any part in 3d objects." arXiv preprint arXiv:2411.07184 (2024).

[2] Tang, George, et al. "Segment any mesh: Zero-shot mesh part segmentation via lifting segment anything 2 to 3d." arXiv e-prints (2024): arXiv-2408.

[3] Song, Chaoyue, et al. "Magicarticulate: Make your 3d models articulation-ready." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

[4] Zhang, Jia-Peng, et al. "One model to rig them all: Diverse skeleton rigging with unirig." arXiv preprint arXiv:2504.12451 (2025).

Comment

Thank you so much for the detailed clarifications and experiments. My concerns have all been addressed. I will keep my score.

Comment

Thank you very much for your positive response and for taking the time to carefully review our work. We sincerely appreciate your support and are glad to hear that all your concerns have been addressed.

Review (Rating: 5)

This paper proposes multi-view consistent video generation for articulated objects with temporally consistent part segmentation. The method is built on SV4D 2.0, extended by adding a part segmentation branch in addition to the RGB branch. The model is trained on the proposed KinematicParts20K dataset, which is an extension of the SV4D 2.0 dataset. Both qualitative and quantitative evaluations on part segmentation outperform self-made SAM2-based baselines and an unsupervised approach (DeepViT).

Strengths and Weaknesses

Strengths

  • While the extension from SV4D 2.0 is relatively simple, it is surprising that the method can produce part segmentation.
  • Both qualitative and quantitative performance show a significant improvement over the baselines.

Weakness

Major

  • The paper lacks strong technical novelty. The method heavily relies on SV4D 2.0 for the model architecture, and outputting an extra modality beyond RGB is a straightforward application of Vainer et al. Furthermore, the spatial color encoding and contrastive part consistency loss appear closely related to ideas in shape modeling through canonical embedding and self-supervised canonical embedding learning [1].
  • Prior work on part-aware multi-view rendering is not cited and discussed, such as [1] and [2].
  • The paper lacks discussion of failure cases and examples, which would be helpful for future research.

[1] Yang et al. BANMo: Building Animatable 3D Neural Models from Many Casual Videos. NeurIPS 2022.
[2] Noguchi et al. Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects. CVPR 2022.

Minor

  • Equations in the manuscript are not numbered, making them difficult to reference.
  • The notation f is reused for multiple meanings, which is confusing: L159 for the index of a frame, L174 for a fusion function, and L208 for part features.

Questions

  • In Fig. 2 (rightmost) and Fig. 4 (first two rows on the right), non-rigid clothing is also segmented. Does the method target such non-rigid parts? If so, what does the rigging look like for these parts?
  • What is the part feature used in the contrastive loss? Is it purely appearance-based, or does it include spatial information? The clothing in Fig. 2 is segmented into multiple parts, and the lower parts appear visually similar. On what basis are the parts grouped consistently?
  • Does the model support long-range tracking of spatially dynamic parts, such as during a backflip, where part locations change significantly and may switch places over time? The prior optimization-based method [1] seems to be able to handle this. Can this approach handle such scenarios?

[1] Yang et al. ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.

Limitations

As mentioned in the question section, the paper does not discuss limitations related to part segmentation, which is a core focus of the work.

Final Justification

The authors have adequately addressed the raised concerns. After also considering the responses to other reviewers, I recommend accepting this paper.

Formatting Issues

No noticeable concern.

Author Response

Thank you for your constructive feedback. We appreciate your recognition of our model’s ability to produce consistent part segmentation and the strong improvements over enhanced SAM2-based and DeepViT baselines. Below, we address your concerns in detail.


W1. Regarding Novelty

Thank you for raising this important point. As outlined in Section 3 of our paper, we believe the proposed framework is just one part of our contribution, and that our work introduces a novel (Reviewers TrfX, adtH), interesting (Reviewer t6Ud), and meaningful direction with high potential impact for the community (Reviewer TrfX):

  • Problem Setting: Our goal is to tackle the task of kinematic part segmentation from monocular input, a task that, to our knowledge, has not been previously addressed (Reviewer TrfX). Unlike prior segmentation methods such as SAM2, DINOv2, SAMPart3D, or Segment Any Mesh, which focus on semantics or texture-based cues, we aim to segment objects according to their underlying kinematic structure. This task is crucial for accurate motion understanding, action recognition, and dynamic 3D modeling from videos.

  • Model Design (SPD): We propose SPD, a simple but powerful model capable of predicting multi-view consistent RGB videos and kinematic part segmentations from a single image or monocular video. This reflects not only strong multi-view generation ability but also an understanding of the kinematic structure of articulated, non-rigid objects. SPD is not a direct reuse of SV4D 2.0: simply retraining it on our data underperforms SPD by a large margin and requires more training iterations (40K more than SPD, with mIoU dropping from 0.68 to 0.57 and from 0.70 to 0.59 in the multi-view and multi-frame experiments, respectively). Our BiDiFuse architecture, while inspired by cross-network communication, is independently implemented due to the lack of publicly available code. We also introduce spatial color encoding to map spatial coordinates to RGB efficiently, significantly simplifying training compared to random color mappings.

  • Dataset Curation Strategy (KinematicParts20K): To train SPD, we introduce a novel pipeline to curate a high-quality dataset of 20,000 multiview video sequences with kinematic part labels. Our automatic filtering, preprocessing, and rendering pipeline enables us to curate a diverse set of articulated objects with detailed part annotations. To our knowledge, this is the first dataset to provide multiview 2D kinematic part annotations, and our dataset curation pipeline can be easily adapted to similar training tasks. Additionally, the scripts for dataset processing will be released publicly.

  • Novel Capabilities: SPD supports (i) kinematic part segmentation from monocular input, (ii) novel view synthesis of both RGB and kinematic part maps, and (iii) generation of animatable 3D meshes via our 2D-to-Kinematic Mesh pipeline and harmonic skinning, without lengthy per-object optimization or multi-view requirements (i.e., videos that show multiple views of the object) [1-4].

  • Loss Design: Our proposed Contrastive Part Consistency Loss across frames and views is crucial for improving part-level temporal and view consistency. Ablations in Table 1 clearly demonstrate its impact.

In summary, our contributions span dataset curation, model design, loss formulation, and mesh reconstruction, collectively forming a comprehensive pipeline for kinematic part-aware generation. SPD is a novel system tackling a new, challenging, and impactful problem.

W1-2. Clarification on Spatial Color Encoding and Contrastive Part Consistency Loss

We respectfully disagree with the statement that “the spatial color encoding and contrastive part consistency loss appear closely related to ideas in shape modeling through canonical embedding and self-supervised canonical embedding learning [1].”

To the best of our knowledge, the work [1] referenced in that comment does not introduce any form of spatial color encoding, nor does it include a contrastive part consistency loss across frames and views.

  • Our spatial color encoding is a novel mechanism that maps 3D spatial coordinates to RGB values to serve as positional cues for the model, which is not included in [1].

  • Our contrastive part consistency loss is designed specifically to enforce part-level consistency across both time and views by bringing together features belonging to the same part across frames and viewpoints. This formulation is not present in [1].

We will clarify this distinction in the revised manuscript to avoid future misunderstanding.
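To make the distinction concrete, here is a minimal sketch of the spatial-color-encoding idea as described above: normalized 3D coordinates are mapped directly to RGB so that color acts as a spatial positional cue. The normalization details are an assumption for illustration, not the exact implementation.

```python
# Minimal sketch: map normalized 3D vertex coordinates to RGB colors.
import numpy as np

def spatial_color_encode(vertices: np.ndarray) -> np.ndarray:
    """vertices: (V, 3) mesh vertex positions -> (V, 3) RGB values in [0, 1]."""
    vmin = vertices.min(axis=0)
    vmax = vertices.max(axis=0)
    return (vertices - vmin) / np.maximum(vmax - vmin, 1e-8)  # x, y, z -> r, g, b
```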


W2. Prior Work on Part-Aware Rendering (e.g., [1-4])

We appreciate you pointing out this line of work and will add a discussion in the revised version. Methods such as [1,2,3,4] are optimization-based frameworks designed for per-instance 3D reconstruction using videos that show multiple views of the object. These approaches often require ground-truth camera poses, multi-view video of the same object, and hours-long optimization per object (48+ hours on an A100).

Moreover, their kinematic understanding is limited by what appears in the video. For instance, if a limb never moves, the model may fail to segment it correctly. These methods are not designed to generalize across categories or perform inference from monocular input such as single-view video or image.

In contrast, SPD is a feedforward category-agnostic generative model that produces consistent RGB and kinematic part outputs within seconds, from just a single image or video. While both approaches can output kinematic segmentations, they solve fundamentally different problems with very different capabilities. We will clarify this distinction and cite the relevant works as suggested.


W3. Failure Cases and Limitations

Thank you for this valuable suggestion. We mention limitations in Lines 302–307, but will expand this section with more discussion. SPD has been tested on a wide range of challenging and diverse inputs, including human falls, skiing, horse riding, occlusions, non-frontal views, and domain-shifted content like DAVIS and GPT-generated images, demonstrating strong robustness.

We are happy to include more failure cases in the updated version to facilitate future improvements.


W4. Missing Equation Numbers

Thank you for pointing this out. We will number all equations in the revised version for clarity.


W5. Ambiguous Notation "f"

We appreciate your careful observation. We will disambiguate the use of the variable “f” by using different symbols to prevent confusion.


Q1. Does SPD Segment Non-Rigid Clothing? How Is It Rigged?

Thank you for this insightful question. SPD directly predicts kinematic part segmentation maps. These are hard segmentations at first, but when producing harmonic skinning weights, each part behaves as a semi-rigid region: rigid near the core (weight ≈1 to a single bone) and non-rigid near boundaries (weights shared across adjacent bones). This is consistent with how non-rigid objects such as clothing are typically rigged in animation pipelines that use LBS for deformation.
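As a rough illustration of how such semi-rigid parts drive deformation, the sketch below applies standard linear blend skinning (LBS) with per-vertex weights; the array shapes and names are assumptions for illustration, not the paper's pipeline.

```python
# Standard linear blend skinning (LBS); illustrative shapes and names.
import numpy as np

def linear_blend_skinning(vertices, weights, rotations, translations):
    """
    vertices:     (V, 3)    rest-pose vertex positions
    weights:      (V, B)    skinning weights, each row sums to 1
    rotations:    (B, 3, 3) per-bone rotation matrices
    translations: (B, 3)    per-bone translations
    returns:      (V, 3)    deformed vertex positions
    """
    # Transform every vertex by every bone: (B, V, 3)
    per_bone = np.einsum("bij,vj->bvi", rotations, vertices) + translations[:, None, :]
    # Blend the per-bone results with the skinning weights: (V, 3)
    return np.einsum("vb,bvi->vi", weights, per_bone)
```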

SPD learns part decomposition from KinematicParts20K, which reflects the design choices of 3D artists generating the dataset. There is no absolute ground truth; different segmentations may be valid depending on animation needs.


Q2. What Features Are Used in the Contrastive Part Loss?

The features used are latent diffusion features (the latent vectors that are fed into the VAE decoder), which preserve strong spatial correspondence with the outputs. These features contain rich appearance and spatial information necessary for high-quality RGB and part reconstruction. Our contrastive loss brings together features of the same part across frames and views, improving temporal and spatial consistency. We will clarify this in the Contrastive Part Consistency Loss section.
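For clarity, the sketch below shows an InfoNCE-style contrastive objective in the spirit described here: part-pooled latent features from different frames and views are pulled together when they belong to the same part and pushed apart otherwise. It is a hedged illustration, not the paper's exact loss.

```python
# InfoNCE-style part-consistency loss over part-pooled latent features; illustrative only.
import torch
import torch.nn.functional as F

def part_contrastive_loss(feats: torch.Tensor, part_ids: torch.Tensor, tau: float = 0.1):
    """
    feats:    (N, D) part-pooled latent features gathered across frames and views
    part_ids: (N,)   integer part identity for each pooled feature
    """
    f = F.normalize(feats, dim=-1)
    sim = f @ f.t() / tau  # (N, N) scaled cosine similarities
    self_mask = torch.eye(len(f), dtype=torch.bool, device=f.device)
    pos = ((part_ids[:, None] == part_ids[None, :]) & ~self_mask).float()
    # Log-probability of each pair, excluding self-similarity from the denominator.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    denom = pos.sum(dim=1).clamp(min=1)
    return -(pos * log_prob).sum(dim=1).div(denom).mean()
```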


Q3. Long-Range Tracking of Spatially Dynamic Parts (e.g., Backflip)

We appreciate your comparison. As mentioned in W2, optimization-based approaches like ViSER [3] tackle a different problem and often require many-view input or object self-rotation to function properly. They are unable to handle single-view scenarios and demand significant optimization time per instance.

During the rebuttal period, we were unable to provide side-by-side image comparisons. However, we have included results on the DAVIS real-world dataset in rows 5–8 of Figure 4. For reference, LASR [4], a closely related work by the same authors as ViSER [3] and published in the same year, also evaluated on DAVIS (cow and horsejump). Our results show clearly favorable performance. ViSER did not report results on DAVIS since it requires videos that include multiple views of the object as input, which are not available for most DAVIS sequences.

In contrast, SPD is a feedforward model capable of processing single-view inputs in seconds and generalizing across object types and motions. In Fig. 4 (third and fourth rows, right), we show SPD handling dramatic falls, which are analogous to backflip scenarios, and in the demo video at 1:52 we show SPD handling a karate kick, with stable and consistent segmentation, demonstrating its capability for long-range dynamic part tracking.


Limitations and Failure Cases

Thank you for highlighting this. While we discussed general limitations (Lines 302–307), we will explicitly expand on part segmentation failure cases, such as objects very different from training data, in the revised paper with visual examples.


We sincerely appreciate your comments and will incorporate all suggested clarifications, citations, and failure cases into the revised manuscript. Thank you again for your time and feedback.

[1] BANMo

[2] Watch It Move

[3] ViSER

[4] LASR

Comment

I thank the authors for the response. They addressed my concerns. As noted in the rebuttal, I hope the authors include the discussion on the surface embedding approach and the proposed color encoding approach.

Comment

We sincerely thank you for the encouraging feedback and for confirming that our responses have addressed your concerns. We appreciate your valuable suggestions and will ensure that all relevant discussions and additional experiments are included in the final version of the paper.

We would be truly grateful if you would consider updating your rating, as your concerns have been addressed. If there are any remaining issues or additional feedback, please don’t hesitate to let us know, we would be more than happy to further address them.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Review (Rating: 5)

This paper introduces an interesting multi-view diffusion model, named Stable Part Diffusion, designed to generate both RGB images and part-semantic maps. Specifically, the authors employ a dual-branch U-Net diffusion architecture to produce both outputs concurrently. Additionally, they propose Bidirectional Diffusion Fusion (BiDiFuse) to efficiently fuse information from these different modalities. To tackle temporal inconsistency, the paper introduces a contrastive part consistency loss, which aligns latent part features corresponding to the same physical regions across both views and time. Experiments demonstrate that the proposed method outperforms the baseline methods in both qualitative and quantitative results.

Strengths and Weaknesses

  • Strengths:

    [1] The paper is well-written and easy to follow.

    [2] The paper addresses the interesting issue of generating kinematic part-semantic maps. Although the proposed method is relatively simple, it demonstrates strong performance on public datasets as well as in-the-wild testing.

    [3] The proposed contrastive part consistency loss is convincing, and the details provided in the ablation study clearly support their claims.

  • Weaknesses:

    [1] Is the KinematicParts20K dataset publicly available? This is important for follow-up work to compare against and to contribute to the field.

    [2] Missing references:

      1) V3D: Video Diffusion Models are Effective 3D Generators.
    
      2) VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model.
    

Questions

N/A

Limitations

N/A

Final Justification

After reviewing the authors' rebuttal and the comments from other reviewers, I find that the authors have adequately addressed my main concerns. As a result, I have decided to change my recommendation from "borderline accept" to "accept."

Formatting Issues

N/A

Author Response

We sincerely thank you for your review and constructive feedback. We appreciate your recognition of our writing clarity, the motivation behind generating kinematic part-semantic maps, and the effectiveness of our contrastive part consistency loss. Below we address your suggestions and concerns.


R1. Will the KinematicParts20K dataset be made publicly available?

We are actively applying for internal approval to release the model and will very likely make it publicly available upon paper acceptance. Due to licensing constraints and the large size of the dataset, we will not mirror the full dataset.

Instead:

  • We will provide scripts that guide users to reproduce the dataset.
  • To facilitate testing and demonstration, we will also include a mock object to showcase the complete pipeline without legal concerns.

This approach ensures usability while respecting licensing limitations.


R2. Missing References

Thank you for pointing out these relevant works. We will include and discuss the following papers in the revised manuscript:

  • V3D: Video Diffusion Models are Effective 3D Generators
  • VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

These papers contribute important insights to video and multi-view diffusion modeling, and we will incorporate a discussion that positions our work in relation to theirs, especially in terms of the unique challenge of generating kinematic part segmentations alongside RGB content.


Thank you once again for your thoughtful comments and support. We are excited to refine the paper with your suggestions.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Comment

Dear Reviewer t6Ud,

We noticed that the score was recently updated, and we wanted to kindly ask whether our response has sufficiently addressed your concerns. If there are any remaining questions or clarifications we could provide, we would be more than happy to do so. We truly appreciate your feedback and your effort in reviewing our work.

Best, Authors

Review (Rating: 5)

The paper proposes a new diffusion model, Stable Part Diffusion, for kinematic part-label 4D video generation given single- or multi-view images as input. SPD extends the recent SV4D with a dual-branch UNet, one for RGB video and another for continuous kinematic part maps. The authors propose a BiDiFuse module to fuse information between the branches, which are jointly optimized with the proposed part contrastive consistency loss. For supervision, the authors curate a new dataset, KinematicParts20K, whose labels are produced by normalizing 3D coordinates into RGB codes, projecting these codes into each frame via SAM's auto-generated segments, and clustering to yield consistent masks. The experimental results are strong on both quantitative and qualitative evaluations, outperforming baselines such as SAM2 and DeepViT.

优缺点分析

Strengths

Reviewer regards this as a strong paper with high potential impact for the community.

  1. The presentation is strong, well-structured, and easy to follow. The writing is concise, and the layout and figures are clean.
  2. Novelty: the model can also generate a "kinematic" part label map, which has not been seen in the literature. The kinematic part label generation process (color coding from normalized 3D coordinates to assist SAM's auto-generated segments, and clustering rather than relying on SAM2's tracking) makes sense and is novel.
  3. The authors show that SPD consistently outperforms baselines in both quantitative and qualitative evaluations. Ablations also empirically demonstrate that the consistency loss and BiDiFuse each contribute.
  4. The authors construct a new dataset, KinematicParts20K; the dataset and the model itself could enable a lot of exciting future work for the vision and graphics community if publicly released.

Weaknesses

  1. Why use a dual-branch U-Net (with BiDiFuse) for separate video/part-label generation, rather than a single U-Net that directly predicts the video and part label map with fewer parameters? Some recent works (e.g., UDPDiff [A]) have shown success with unified prediction models.

    [A] Unified Dense Prediction of Video Diffusion, arXiv 2025

  2. Although the kinematic part labels in the paper work well, they do not capture a thoroughly kinematic structure and are closer to very fine-grained part labels. There are kinematic-related priors that could be used, such as 2D/3D human/animal poses or skeletons, which carry part connectivity information. Why not leverage these to enforce more realistic motion?

  3. The paper would benefit from more failure-case analysis to better understand current limitations.

Below are minor ones:

  1. In the quantitative comparisons, only SPD is trained on the KinematicParts20K dataset, while the baselines are not. This naturally favors SPD in performance (though it is not a major concern).

  2. The label generation process would be clearer if illustrated with a small figure in the method section to provide a visual overview.

Questions

Will the model/dataset be released after acceptance?

Limitations

Yes

Final Justification

See above for justification. I maintain my initial score.

Formatting Issues

None

Author Response

We sincerely thank you for your detailed review. We are grateful for your recognition of the paper’s potential impact, clean presentation, and the novelty of both our method (SPD) and dataset (KinematicParts20K). Your comments motivate us to further strengthen the paper. Please find our point-by-point responses below.


R1. Why dual-branch UNet instead of a single-branch model?

Thank you for this thoughtful question. We initially experimented with a single-branch architecture that jointly predicts multi-view RGB and part sequences by concatenating both modalities into a shared latent and splitting them before decoding. However, we found that:

  1. The model converged more slowly and produced lower performance than the dual-branch variant after training for the same number of epochs.
  2. RGB and part segmentation require learning very different types of features. Combining them in a single UNet led to interference between the two tasks, particularly when generating temporally and spatially consistent multi-frame outputs.

In contrast, our BiDiFuse dual-branch design allows effective cross-modal communication while preserving task-specific feature learning, resulting in better disentanglement and overall performance. We will include a discussion of this design choice in the updated paper and cite the recent work [A] you kindly recommended.

[A] Unified Dense Prediction of Video Diffusion, arxiv 2025
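As a rough illustration of the dual-branch communication idea, the following minimal sketch shows what a bidirectional cross-branch fusion block could look like: each branch receives a learned projection of the other branch's features, added residually. This is an assumption-level illustration, not the actual BiDiFuse module.

```python
# Illustrative bidirectional cross-branch fusion block (not the actual BiDiFuse).
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_to_part = nn.Conv2d(channels, channels, kernel_size=1)
        self.part_to_rgb = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, part_feat: torch.Tensor):
        # Each branch is enriched with a projected copy of the other branch's features.
        rgb_out = rgb_feat + self.part_to_rgb(part_feat)
        part_out = part_feat + self.rgb_to_part(rgb_feat)
        return rgb_out, part_out
```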


R2. Why not incorporate 2D/3D pose or rigging priors for enforcing structure?

We appreciate this suggestion and agree that pose and skeleton priors encode rich kinematic structure. However, we deliberately designed SPD to be category-agnostic and not rely on human/animal-specific priors to ensure broad applicability.

Methods based on human pose models or rigging templates typically generalize poorly to objects with unusual topology, such as loose clothing, handheld objects (e.g., shields, skis), or more exotic categories (e.g., crabs, hands, robots, mechanical arms). In our experiments, such priors often failed to produce meaningful structure for these diverse object types.

That said, we see great potential in exploring how to combine 2D/3D structural priors with diffusion-based generation, and we appreciate your insight as a promising future direction.


R3. More failure case analysis

Thank you, we agree that this would make the paper more comprehensive. We will add visual and textual analysis of failure cases in the revised version. These will include examples with ambiguous shapes, heavy occlusion, or multi-object scenes where SPD may struggle.


R4. Only SPD is trained on KinematicParts20K – does that bias comparisons?

We acknowledge that data-specific training provides an advantage, and this also highlights the novelty and importance of our proposed dataset. To ensure fairer comparisons:

  • We provide Enhanced SAM-based baselines (denoted as SAM*) using point prompts derived from ground-truth part masks.
  • We include quantitative evaluations against state-of-the-art 3D segmentation (e.g., SAMPart3D [1], Segment Anything Mesh [2]) and 3D rigging methods (e.g., MagicArticulate [3], UniRig [4]), using consistent inputs and evaluation protocols on both the KinematicParts20K test set and generated 3D objects with arbitrary poses.

Table 1. Comparison of SPD with SOTA 3D Segmentation Methods [1,2].

Method             mIoU   ARI    F1     mAcc   User Study
Segment Any Mesh   0.15   0.06   0.29   0.20   1.98
SAMPart3D          0.13   0.05   0.27   0.18   1.75
Ours (Full)        0.64   0.58   0.67   0.72   4.13

Table 2. Comparison of SPD with SOTA Auto-rigging Method [3,4].

Method             KinematicParts20K-test       Generated Objects
                   Precision    User Study      User Study
Magic Articulate   63.7         3.8             2.7
UniRig             64.3         3.9             2.3
Ours (Full)        72.7         4.3             4.1

As shown in the above tables, SPD outperforms SOTA 3D segmentation methods [1,2] on segmenting 3D kinematic parts, and outperforms SOTA auto-rigging methods [3,4] on predicting skinning weights.

SOTA auto-rigging methods are trained solely on 3D datasets with ground-truth rigs, which are often limited in size. As a result, they tend to perform poorly on rare or unseen poses and objects (the generated objects in Table 2). In contrast, SPD leverages a vast amount of 2D prior knowledge as a foundation, demonstrating significantly stronger generalization capability. Updated results will be included in the revised manuscript.


R5. Please illustrate the label generation pipeline with a figure

We appreciate this suggestion and will add a new figure in the method section that visually outlines the full data filtering, annotation, and rendering pipeline for KinematicParts20K. This will improve clarity and reproducibility.


Q1. Will the model and dataset be released after acceptance?

We are actively applying for internal approval to release the model and will very likely make it publicly available upon paper acceptance.

Due to licensing constraints and the large size of the dataset, we will not mirror the full dataset. Instead:

  • We will provide scripts that guide users to reproduce the dataset.
  • To facilitate testing and demonstration, we will also include a mock object to showcase the complete pipeline without legal concerns.

This approach ensures usability while respecting licensing limitations.

[1] Yang, Yunhan, et al. "Sampart3d: Segment any part in 3d objects." arXiv preprint arXiv:2411.07184 (2024).

[2] Tang, George, et al. "Segment any mesh: Zero-shot mesh part segmentation via lifting segment anything 2 to 3d." arXiv e-prints (2024): arXiv-2408.

[3] Song, Chaoyue, et al. "Magicarticulate: Make your 3d models articulation-ready." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

[4] Zhang, Jia-Peng, et al. "One model to rig them all: Diverse skeleton rigging with unirig." arXiv preprint arXiv:2504.12451 (2025).

Comment

First, I want to thank the authors for the detailed rebuttal, which addressed most of my concerns. Thus, I maintain my initial rating. I would still strongly encourage the inclusion of a failure-case analysis in the final version, as well as the release of the model weights and dataset generation scripts (and perhaps a subset of the original data, if possible).

Comment

Thank you for the helpful feedback and for acknowledging our rebuttal. We will include failure case analysis in the final version and release the model weights and dataset generation scripts. If allowed, we will also share a subset of the original dataset.

Comment

Thank you once again for your time and thoughtful feedback. As the discussion period concludes in three days, we remain fully available to address any remaining concerns or questions regarding our work. Please don’t hesitate to let us know if any further clarification is needed. We sincerely appreciate your engagement.

Comment

Dear Reviewers and Authors,

Thank you for your efforts during the review and rebuttal phases. We are now in the discussion period, open until August 6, 11:59 PM AoE.

This paper has received mixed ratings. The authors have provided detailed rebuttals to each review. At your earliest convenience, please read the rebuttals and respond to engage in the discussion. In particular, please indicate whether your main concerns have been addressed.

  • Reviewer S8gU: Have the authors’ responses resolved your concerns?
  • Other reviewers: After reading the rebuttal, do you remain positive about the paper?

Thank you again for your contributions.

Best regards, The AC

Final Decision

This paper introduces Stable Part Diffusion (SPD), a framework for generating paired RGB and kinematic part videos from monocular inputs. The idea of generating a kinematic part representation is novel. Initially, the paper received mixed ratings. The primary concerns raised by reviewers included the justification for the dual-branch design, clarity of the method’s details, and the choice of evaluation metrics. During the rebuttal and discussion, the authors’ response successfully addressed most of the reviewers’ concerns, and the paper ultimately received 3 "Accept" ratings and 1 "Borderline Accept" rating.

After considering the paper, the reviews, and the rebuttal, the Area Chair believes this is an interesting work that makes a valuable contribution to the field of video generation. Accordingly, the Area Chair recommends acceptance. The authors are requested to revise the paper by addressing the reviewers’ comments (e.g., incorporating failure case analysis) in the camera-ready version.