Puppeteer: Rig and Animate Your 3D Models
Given an input mesh, Puppeteer first transforms it into an animation-ready model through automatic rigging, and subsequently animates it under video guidance.
Abstract
Reviews and Discussion
This paper introduces Puppeteer, a comprehensive pipeline that automates both rigging and animation for diverse 3D models. The framework consists of three core stages: (1) an autoregressive transformer for skeleton prediction using a joint-based tokenization and hierarchical ordering strategy; (2) an attention-based architecture for skinning weight estimation enhanced by a novel topology-aware joint attention mechanism; and (3) a differentiable optimization module that aligns mesh motion with reference videos generated from text prompts using off-the-shelf video generation models. To support learning, the authors construct Articulation-XL2.0, an expanded dataset of 59.4k rigged models including 11.4k diverse-pose samples. Puppeteer outperforms existing approaches across multiple benchmarks on both rigging and animation tasks and offers significantly improved generalization and efficiency.
Strengths and Weaknesses
Strengths:
The proposed pipeline is technically sound, well-motivated, and systematically engineered.
The use of joint-based tokenization combined with hierarchical and randomized ordering improves the coherence and compactness of skeleton sequences.
The topology-aware joint attention improves skinning weight accuracy by explicitly modeling joint relationships.
Evaluation is well done: multiple benchmarks, detailed quantitative and qualitative comparisons, and ablations are provided.
Weaknesses:
The animation stage is optimization-based and not learnable, limiting the inference speed. The method relies heavily on the quality of the generated video; if the video is unrealistic or noisy, animation fidelity may degrade. There are a few recent works that seem relevant and may provide useful points of comparison or inspiration: [1] Zhang, Hao, et al. "Magicpose4d: Crafting articulated models with appearance and motion control." arXiv preprint arXiv:2405.14017 (2024). [2] Zhang, Jia-Peng, et al. "One Model to Rig Them All: Diverse Skeleton Rigging with UniRig." arXiv preprint arXiv:2504.12451 (2025).
Questions
Handling Complex Motions: The method relies on video generation models to provide motion guidance. However, these models may struggle to generate complex or large-scale motions (e.g., dancing). Can the authors comment on how well their method handles such cases, and whether this limits the animation quality?
Robustness and Failure Cases: The animation quality seems to depend on several factors, such as the accuracy of the rigging, the viewpoint of the first rendered frame, and whether important parts (e.g., limbs) are always visible in the video. Are there common failure cases when one of these steps goes wrong? It would be helpful if the authors could show or discuss a few examples where the method fails.
Limitations
The limitations are adequately acknowledged, with additional discussions included in the appendix.
Final Justification
After the rebuttal, most of my concerns are resolved, thus I will keep my positive score.
Formatting Issues
None observed.
We are grateful for the reviewer’s insightful comments. Please find below our detailed responses to your comments and concerns.
W1: The animation stage is optimization-based and not learnable, limiting the inference speed.
While per-scene optimization does limit inference speed (as discussed in our limitations), it offers two key advantages: (1) Unlike large-scale feed-forward methods such as L4GM [1], which require training on 128 80G A100 GPUs, our approach needs only a single A100 GPU while avoiding geometric distortions and achieving better generalization to unseen cases since we do not rely on training data distribution; (2) Compared to other per-scene optimization methods like AKD [2], which requires 25 hours per object, our method achieves significantly faster optimization (20 minutes for objects with up to 10k vertices) while producing stable, jittering-free animations.
[1] Ren et al., L4GM: Large 4D Gaussian Reconstruction Model, 2024.
[2] Li et al., Articulated Kinematics Distillation from Video Diffusion Models, 2025.
W2 & Q1: The method relies on video generation models to provide motion guidance. However, if the video is unrealistic or noisy, animation fidelity may degrade. And these models may struggle to generate complex or large-scale motions (e.g., dancing). Can the authors comment on how well their method handles such cases, and whether this limits the animation quality?
Thanks for raising this critical concern about video quality dependency. Current text-to-video models (e.g., Kling AI, RunwayML, Jimeng AI) have improved significantly and can generate complex motions with reasonable success rates, as demonstrated on the project page in supplemental material. However, we acknowledge that video quality does impact animation fidelity. For instance, motion blur or temporal inconsistencies in generated videos can degrade joint/vertex tracking accuracy and make optimization more challenging. To mitigate these issues, since we use off-the-shelf generation models, we can generate multiple video candidates and select the highest-quality one based on visual clarity and motion consistency to guide the animation optimization. While this approach helps reduce the impact of poor-quality videos, video generation quality still represents a limitation for highly complex motion scenarios. We will include comprehensive discussions of these robustness issues and mitigation strategies in the final version.
W3: Recent works like UniRig and MagicPose4D seem relevant and may provide useful points of comparison or inspiration.
Thanks for highlighting these relevant works. UniRig is a concurrent work on automatic rigging. We provide a comparison with UniRig on skeleton generation in the table below (Arti2.0 = Articulation-XL2.0, MR = ModelsResource, Pose = Diverse-pose). Our method demonstrates superior performance across all three benchmarks. Note that we do not compare skinning approaches since UniRig uses bone-vertex skinning while our method employs joint-vertex skinning. MagicPose4D addresses skeleton-driven motion transfer, targeting a different problem scope from ours. We discuss it in the related work section of the appendix.
| | Arti2.0 | Arti2.0 | Arti2.0 | MR | MR | MR | Pose | Pose | Pose |
|---|---|---|---|---|---|---|---|---|---|
| Method | J2J ↓ | J2B ↓ | B2B ↓ | J2J ↓ | J2B ↓ | B2B ↓ | J2J ↓ | J2B ↓ | B2B ↓ |
| MagicArticulate | 3.417 | 2.692 | 2.281 | 4.116 | 3.124 | 2.704 | 5.381 | 4.441 | 3.902 |
| UniRig | 3.305 | 2.611 | 2.180 | 3.964 | 3.021 | 2.570 | 3.252 | 2.569 | 2.077 |
| Ours | 3.033 | 2.300 | 1.923 | 3.841 | 2.881 | 2.475 | 3.212 | 2.542 | 2.027 |
Q2: Animation quality seems to depend on several factors, such as rigging accuracy, the viewpoint of the first frame, and whether important parts (e.g., limbs) are always visible in the video. Are there common failure cases when one of these steps goes wrong? Could the authors show or discuss a few examples where the method fails?
Thanks for this important question about robustness. Several key factors can affect animation quality: (1) Rigging: When skeleton generation produces insufficient joint density, fine-scale deformations suffer. For example, in our turtle case ("More Animation Results" section of supplementary project page), the reference video shows smooth forelimb motion, but our animation appears less smooth due to sparse joint placement in these regions. This limitation stems from our skeleton generation producing few joints in areas requiring fine-scale deformations. Animation-driven adaptive joint refinement could potentially address this issue in future work. (2) Viewpoint dependency and occlusion: Suboptimal camera angles can cause depth ambiguities and tracking failures. Since we have access to the input 3D mesh, we can select optimal viewpoints that maximize joint visibility and minimize occlusion. However, single-view optimization inherently suffers from occlusion issues when critical joints or limbs are hidden throughout the video sequence. Training a feed-forward model with multi-view priors could potentially alleviate these limitations by providing better geometric understanding of occluded regions. We will include comprehensive discussions with specific examples and potential mitigation strategies in the final version.
After the rebuttal, most of my concerns are resolved, thus I will keep my positive score.
We thank the reviewer for carefully evaluating our submission and rebuttal. We are pleased that your concerns have been addressed and appreciate the positive assessment.
This paper introduces Puppeteer, a unified pipeline for automatically rigging and animating 3D models. It first predicts joint-based skeletons using an auto-regressive transformer, then computes skinning weights using a topology-aware attention network. To animate the model, it leverages reference videos generated from text prompts (e.g., using Kling AI), using a differentiable optimization method to align 3D motion (each frame has its own set of learnable bone rotation and translation parameters) to the video. The authors also expand the Articulation-XL dataset to 59.4k rigged models. Experiments show strong improvements in rigging accuracy, skinning quality, and animation stability compared to state-of-the-art methods.
Strengths and Weaknesses
Quality:
- The paper presents a strong and well-engineered pipeline that unifies skeleton generation, skinning weight prediction, and animation. Each stage is supported by thoughtful architectural design and ablation studies.
- The extension of the Articulation-XL dataset to 59.4k models (including 11.4k pose-varied examples) offers a meaningful contribution to 3D articulation research.
- Quantitative results demonstrate state-of-the-art performance across multiple benchmarks, and inference efficiency is significantly improved over prior methods.
- However, the animation module's runtime and robustness to imperfect video prompts are not clearly reported, and the absence of quantitative animation metrics (e.g., perceptual motion realism) is a notable gap.
Clarity:
- The paper is generally well-written and technically clear. However, several clarity issues hinder full comprehension:
- The introduction references a "data scarcity limitation of previous approaches," but it's unclear whether MagicArticulate is included in this group -- even though it already introduced the original Articulation-XL dataset.
- The paper extends Articulation-XL, but the citation to MagicArticulate is missing in the introduction, which could confuse readers not already familiar with the dataset’s history.
- In Section 3.2, there appears to be a contradiction between L143–153 and L195–199: first arguing that joint-based tokenization is more compact, but then noting that bone-based coordinates improve performance. A footnote or brief clarification could help reconcile this.
- Since this work is heavily based on MagicArticulate, a summary table comparing the two frameworks would greatly benefit readers, helping them understand what has been kept, changed, or added.
Significance:
- The paper addresses a high-impact problem in automating the animation pipeline for 3D content creation -- spanning gaming, simulation, and AR/VR.
- The integration of animation guided by text-to-video generation makes this work more usable and complete than prior rigging-only methods.
- The extended dataset and performance improvements across rigging tasks may benefit downstream research in generative 3D modeling, especially for non-human and AI-generated shapes.
Originality:
- While the paper introduces new techniques (such as joint-based tokenization with hierarchical ordering and topology-aware joint attention for skinning), its core contributions build heavily on MagicArticulate. The overall structure, training strategy, and model formulation are evolutionary rather than revolutionary.
- The most original addition is the animation module, which transforms rigged models into animated ones using differentiable optimization with video guidance. This makes the framework end-to-end for the first time, which is a meaningful advancement.
Weaknesses:
- Missing citation and dataset lineage clarification in the introduction reduces clarity.
- The methodology section does not reconcile the joint-based vs. bone-based tension clearly.
- The reliance on off-the-shelf video generation is not analyzed for failure cases.
- No analysis or footnote distinguishes what part of the pipeline is reused vs. redesigned from MagicArticulate.
- Evaluation lacks perceptual animation metrics or user studies (Sec. 5.4 is entirely qualitative).
- No runtime or compute cost is reported for the animation phase.
Questions
- The authors mention that previous approaches suffered from data scarcity (L45), but MagicArticulate already proposed the original Articulation-XL dataset. Are you referring to MagicArticulate as data-scarce, or only methods before it?
- Joint vs. bone inconsistency: In L143–153, you argue for joint-based tokenization due to compactness and lower redundancy. However, in L195–199, you state that bone-based coordinates yield better performance. Can you clarify better?
- Comparison to MagicArticulate: The work is heavily based on MagicArticulate (autoregressive skeletons, Articulation-XL, point-cloud conditioning), a side-by-side comparison table or architectural diagram would help clarify what has changed or improved.
- Robustness to video quality: Your animation pipeline depends on video prompts generated via text-to-video models. What happens when the generated video contains deformations, or irrelevant motion? Is your optimization robust to such cases?
- Animation runtime and scalability: How long does the animation optimization take for typical sequences, and how does it scale with mesh complexity or video length? Could this be practical for real-time or interactive applications?
Limitations
Going to recap what the authors mentioned in the Appendix:
- Despite its strong performance, Puppeteer has two main limitations.
- First, it does not capture very fine-scale deformations, such as flowing hair or fluttering cloth, because no skeletons are generated for these highly deformable parts.
- Second, the animation stage still relies on per-scene optimization, which limits real-time deployment.
Other limitations are covered in the weaknesses above (if appropriate).
Final Justification
Concerns addressed; I am raising my score to Accept.
Formatting Issues
n/a
We appreciate the valuable comments from the reviewer. Please find below our detailed responses to your comments and concerns.
W1 & Q1: Missing citation to MagicArticulate in introduction, unclear data scarcity claim.
Thanks for highlighting these important clarification issues. We will add the proper citation to MagicArticulate in the introduction when discussing the Articulation-XL dataset. Regarding the data scarcity claim: while MagicArticulate introduced a substantial 33K dataset, it primarily consists of objects in rest poses, limiting generalization to diverse pose inputs. Our Articulation-XL2.0 addresses this limitation by adding a diverse-pose subset with 11.4K samples containing varied articulation states. We will clarify this distinction in the final version by specifically mentioning both data scarcity and pose diversity as key challenges addressed by our work.
W2 & Q2: Joint vs. bone inconsistency: You argue for joint-based tokenization (compactness) but then state bone-based coordinates perform better.
These approaches are used in different modules with distinct objectives: (1) We use joint-based tokenization in skeleton generation because it represents skeletons with shorter token sequences, enabling faster training and inference while achieving comparable or better performance than bone-based tokenization for the autoregressive generation task; (2) We use bone-based coordinates with positional encoding in skinning weight prediction because bone coordinates better capture the relationships between adjacent joints and provide more informative spatial features for the attention mechanism, which is helpful for accurate skinning weight prediction.
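To make the compactness argument concrete, here is a toy counting sketch in Python. The token layouts below are simplifications for illustration only (the paper's actual sequences include additional elements such as positional indicators), so treat the exact counts as assumptions rather than the method's real format.

```python
# Toy token-count comparison (illustrative only; the actual token layouts in the
# paper include additional elements such as positional indicators).

def bone_based_token_count(num_joints: int) -> int:
    # A tree skeleton with J joints has J - 1 bones; serializing each bone as the
    # quantized (x, y, z) of both endpoints costs 6 tokens per bone, and shared
    # joints are emitted repeatedly.
    return 6 * (num_joints - 1)

def joint_based_token_count(num_joints: int) -> int:
    # Each joint's quantized (x, y, z) is emitted exactly once (3 tokens); the
    # hierarchy is carried separately (ordering / parent indicators).
    return 3 * num_joints

for j in (16, 64, 128):
    print(j, bone_based_token_count(j), joint_based_token_count(j))
# 64 joints: 378 bone-based tokens vs. 192 joint-based tokens -- roughly halving
# the sequence length the autoregressive transformer must generate.
```

Shorter sequences mean fewer autoregressive decoding steps, which is where the training and inference speedup comes from.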
W3 & Q4: Animation pipeline relies on off-the-shelf text-to-video models for motion guidance. How robust is your optimization to imperfect video quality? Failure case analysis for video quality issues is missing.
Thanks for raising this critical concern about video quality dependency. Current text-to-video models (e.g., Kling AI, RunwayML, Jimeng AI) have improved significantly and can generate complex motions with reasonable success rates, as demonstrated on our project page in supplemental material. However, we acknowledge that video quality does impact animation fidelity. For instance, motion blur or temporal inconsistencies in generated videos can degrade joint/vertex tracking accuracy and make optimization more challenging. To mitigate these issues, since we use off-the-shelf generation models, we can generate multiple video candidates and select the highest-quality one based on visual clarity and motion consistency to guide the animation optimization. While this approach helps reduce the impact of poor-quality videos, video generation quality still represents a limitation for highly complex motion scenarios. We will include comprehensive discussions of these robustness issues, failure case analysis, and mitigation strategies in the final version.
W4 & Q3: Comparison with MagicArticulate: Clarify what components are reused or redesigned, and consider providing a comparison table showing the key differences and improvements.
Thanks for your suggestion. Please find the comparison table between MagicArticulate and our method below. For the dataset, we built upon Articulation-XL, expanding it from 33K to 59.4K samples with a diverse-pose subset that enhances generalization to varied pose inputs. For skeleton generation, we use the same autoregressive transformer with point cloud conditioning but propose more efficient joint-based tokenization with hierarchical ordering and randomization to improve model performance. For skinning weight prediction, we propose a completely different attention-based architecture that achieves better generalization and faster inference than MagicArticulate's functional diffusion approach. Most importantly, MagicArticulate does not support automatic animation, while our method provides video-guided animation capability.
| Component | MagicArticulate | Our Method |
|---|---|---|
| Dataset | ||
| Dataset | Articulation-XL (33K) | Articulation-XL2.0 (59.4K, with 11.4K diverse poses) |
| Skeleton Generation | ||
| Tokenization | Bone-based tokenization | Joint-based tokenization with ordering randomization |
| Architecture | Auto-regressive transformer | Auto-regressive transformer |
| Shape conditioning | Point cloud features | Point cloud features |
| Skinning Weight Prediction | ||
| Architecture | Functional diffusion | Attention-based network |
| Skeleton coordinates | Joint coordinates | Bone-based coordinates |
| Attention mechanism | Standard attention | Topology-aware joint attention (proposed) |
| Animation | ||
| Animation capability | Not supported | Video-guided animation |
| Motion guidance | N/A | Video guidance |
| Optimization | N/A | Differentiable rendering pipeline |
W5: Animation evaluation lacks perceptual animation metrics or user studies.
Thanks for this valuable suggestion. We conducted user studies with 21 participants to evaluate animation quality across three methods: L4GM, MotionDreamer, and our approach. Participants compared 8 animation examples across three evaluation criteria: (1) Video-Animation Alignment: Which animation result shows better alignment with the input video? (2) Motion Quality: Which one has a more natural and realistic motion? (3) 3D Geometry Preservation: Which method better maintains the original 3D object geometry without introducing distortions or artifacts? Results are shown in the table below. Our method outperforms both L4GM and MotionDreamer across all three evaluation dimensions. Note that Video-Animation Alignment is not evaluated for MotionDreamer since it uses text-driven motion generation rather than video guidance. Regarding perceptual metrics, standard measures (LPIPS, CLIP similarity, FVD, etc.) require ground truth animations from multiple viewpoints for comparison, which are unavailable in our setting. Developing appropriate perceptual metrics and benchmarks for video-guided 3D animation remains an open challenge that we would like to address in future work.
| Method | Video-Animation Align | Motion Quality | Geometry Preservation |
|---|---|---|---|
| MotionDreamer | - | 0% | 0% |
| L4GM | 19.64% | 16.67% | 18.45% |
| Ours | 80.36% | 83.33% | 81.55% |
W6 & Q5: No runtime or compute cost is reported for the animation phase. How long does the optimization take for typical sequences, and how does it scale with mesh complexity or video length? Is this approach practical for interactive applications?
Thanks for this important question. We report runtime and computational costs in Appendix Section A.1 (lines 59-62). Our animation optimization takes approximately 20 minutes for objects with up to 10K vertices on a single NVIDIA A100 GPU, processing 5-second videos (approximately 50 frames at 10 FPS) generated by Kling AI or Jimeng AI. Runtime scales with both mesh complexity and frame count: (1) Mesh complexity: Models with more vertices will require additional PyTorch3D rendering time. For example, the bat case (~70K vertices) in the supplementary project page requires 90 minutes, while the turtle case (~15K vertices) takes 35 minutes. (2) Frame count: For a typical case taking 20 minutes at 50 frames (10 FPS, 5 seconds), increasing to 20 FPS (100 frames) extends optimization time to 41 minutes, while reducing to 4 FPS (20 frames) decreases it to 8 minutes, demonstrating approximately linear scaling with frame count. We will include these discussions in the final version.
As acknowledged in our limitations (Appendix Section E), this per-object optimization approach is not suitable for real-time applications. Future work will focus on developing an end-to-end learning framework to eliminate per-scene optimization and enable real-time animation generation.
Dear Reviewer h6uP,
The authors have provided detailed responses to your questions. What is your view after seeing this additional information? It would be good if you could actively engage in discussions with the authors during the discussion phase ASAP, which ends on Aug 6 (AoE).
Best, AC
The paper proposes a framework for automating the rigging and animation process using the following key components.
- Automatic Rigging:
- A newly curated dataset
- Novel joint-based skeleton representation
- Sequence randomization incorporating a target-aware positional indicator
- Autoregressive skeleton generation
- Skinning Weight Prediction:
- Multiple attention processes to integrate global shape context into refined bone and point features for precise skinning weight prediction.
- Text-to-video generation guided animation
- Using text-to-video models as guidance, the skeleton poses and root movement of the 3D model are generated.
Strengths and Weaknesses
Strength: The paper is well-written, and the method is clear and easy to follow. Experiments and results are extensive and demonstrate significant improvements over baselines.
Weakness: Little explanation is given about dataset curation. There is no detailed description of the Articulation-XL2.0 dataset; I want to know precisely which data types were added and how their inclusion specifically impacted the model's performance. Additionally, the "diverse-pose subset" is said to improve generalization to novel articulations, but what kinds of new poses are used is neither described nor visualized.
Questions
- Regarding shape-conditioned auto-regressive generation, where does the ground truth T come from? Is all ground truth data preprocessed from Articulation-XL2.0?
- Incorporating the "diverse-pose subset" is said to improve generalization to novel articulations, but what kinds of new poses are used is not mentioned or visualized. Could you provide more explanation about this part?
- There is no detailed explanation of Articulation-XL2.0. I want to know precisely which data types were added and how their inclusion specifically impacted the model's performance.
- The joint-based tokenization discretizes joint coordinates into a grid. I would be curious whether there are any potential issues with spatial information loss due to this process.
- Topology-aware Joint Attention uses graph distance. How is this graph distance calculated for very complex skeletons?
- I think the video quality from the text-to-video generation model is critical for animation quality. Can these video models handle very complex motions well, or is the video the weak point for final animation quality? I suspect it may be.
- In the experiments, the paper states that RigNet struggles with diverse object categories and fails to converge well. Is RigNet's design at fault, or is the dataset simply too hard for RigNet?
Limitations
yes
Final Justification
All resolved issues: 1. Dataset transparency was significantly improved through rebuttal clarification and ablation studies. 2. Concerns about tokenization loss and graph computation were addressed with practical implementation details and empirical evidence. 3. Questions on RigNet failure and video quality trade-offs were thoroughly answered with insights and examples.
Justification: The contributions are novel, experimentally validated, and likely to have impact across rigging and generative modeling communities.
Formatting Issues
None
Thank you for your time and careful review of our work. Below we address your questions and weaknesses mentioned:
W & Q2 & Q3: More explanation about Articulation-XL2.0 dataset information and curation: 1) What specific data types were added and how did their inclusion impact model performance? 2) What kind of new poses are used in the "diverse-pose subset" is not mentioned or visualized.
(1) Articulation-XL2.0 expands the original Articulation-XL dataset with: (a) multi-geometric data from Objaverse-XL, where each mesh consists of multiple parts, increasing dataset scale (from 33K to 48K) and improving performance across all three benchmarks; (b) a diverse-pose subset that enhances model generalization to diverse pose inputs. To demonstrate the impact of these dataset enhancements, we provide an ablation comparison across three training configurations: original Articulation-XL, Articulation-XL2.0 without diverse poses, and full Articulation-XL2.0 with diverse poses (marked with *). Results are shown in the table below (Arti2.0 = Articulation-XL2.0, MR = ModelsResource, Pose = Diverse-pose). (2) The diverse-pose subset contains two components (detailed in Section 3.1 and Appendix Section B): (a) data with varied poses extracted from objects with animation in Objaverse-XL, where we specifically selected frames exhibiting maximum deviation from rest pose configurations to capture extreme articulations; (b) animal data with diverse poses generated using SMALR. Examples are visualized in Figure S2 in the Appendix.
| | Arti2.0 | Arti2.0 | Arti2.0 | MR | MR | MR | Pose | Pose | Pose |
|---|---|---|---|---|---|---|---|---|---|
| Method | J2J ↓ | J2B ↓ | B2B ↓ | J2J ↓ | J2B ↓ | B2B ↓ | J2J ↓ | J2B ↓ | B2B ↓ |
| Train on Arti-XL | 3.486 | 2.769 | 2.315 | 4.314 | 3.218 | 2.724 | 3.657 | 3.031 | 2.387 |
| Train on Arti-XL2.0 | 3.033 | 2.300 | 1.923 | 3.841 | 2.881 | 2.475 | 3.212 | 2.542 | 2.027 |
| Train on Arti-XL2.0* | 3.109 | 2.370 | 1.983 | 3.766 | 2.804 | 2.405 | 2.514 | 1.986 | 1.598 |
For dataset curation, we follow the original MagicArticulate pipeline with VLM-based filtering and further enhance quality by eliminating meshes with unskinned vertices and conducting manual validation. For the diverse-pose subset, we apply dual quality filtering using both high-quality rigging data (from our curated list) and high-quality animation data (from Diffusion4D's list), then extract individual frames from these animations to construct the diverse-pose subset with varied articulation states. More details are provided in Section 3.1 and Section B in the Appendix.
Q1: Where does ground truth token sequence T come from? All ground truth data are preprocessed from Articulation-XL2.0?
The ground truth token sequence T in training consists of three components: (1) Shape tokens: generated by encoding sampled point clouds with normals from dataset meshes using the pre-trained shape encoder; (2) Skeleton tokens: obtained by applying joint-based tokenization to skeletons from the dataset; and (3) Positional indicators: learnable parameters that specify the next joint group in the auto-regressive sequence. All components except the positional indicators are directly preprocessed from the dataset Articulation-XL2.0.
Q4: Joint-based tokenization discretizes joint coordinates into a grid. Is there spatial information loss due to this process?
Our joint-based tokenization normalizes joint coordinates to a canonical range and discretizes them into a fixed-resolution grid, yielding a small, bounded quantization step per dimension. While this introduces a small quantization error, it balances tokenization efficiency with spatial precision for skeleton generation. Our experiments show this quantization level preserves sufficient detail for anatomically plausible skeletons. We also found that higher grid resolutions provide only marginal improvements while substantially increasing computational cost.
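For intuition on the magnitude of such a round-trip error, here is a minimal NumPy sketch. The normalization range [-1, 1] and the grid resolutions below are placeholder assumptions, not the paper's actual settings.

```python
import numpy as np

def quantize(coords: np.ndarray, num_bins: int) -> np.ndarray:
    """Map coordinates normalized to [-1, 1] onto integer bin indices."""
    bins = np.round((coords + 1.0) / 2.0 * (num_bins - 1))
    return np.clip(bins, 0, num_bins - 1).astype(np.int64)

def dequantize(bins: np.ndarray, num_bins: int) -> np.ndarray:
    """Map bin indices back to coordinates; error is at most half a bin width."""
    return bins.astype(np.float64) / (num_bins - 1) * 2.0 - 1.0

joints = np.random.uniform(-1.0, 1.0, size=(32, 3))  # toy skeleton joint positions
for num_bins in (64, 128, 256):                       # placeholder grid resolutions
    roundtrip = dequantize(quantize(joints, num_bins), num_bins)
    print(num_bins, np.abs(roundtrip - joints).max())  # shrinks roughly as 1/num_bins
```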
Q5: How is the graph distance in Topology-aware Joint Attention calculated for very complex skeletons?
The graph distance calculation handles complex skeletons efficiently using Breadth-First Search (BFS) from each joint. Our implementation constructs an adjacency list from the bone connections, then performs BFS from each of the N joints to compute shortest-path distances to all other joints. This produces an N x N distance matrix. The computational complexity is O(N(N + E)), where E is the number of bones (effectively O(N^2) for tree-structured skeletons), since we run BFS from each of the N joints. We precompute this distance matrix once per skeleton and use it in Topology-aware Joint Attention.
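A minimal Python sketch of this precomputation, assuming the skeleton is given as (joint, joint) bone pairs; the function and variable names are illustrative, not the authors' implementation.

```python
from collections import deque

def joint_graph_distances(num_joints, bones):
    """All-pairs hop distances over the skeleton graph via BFS from every joint.

    bones: iterable of (joint_a, joint_b) index pairs (the bone connections).
    Returns an N x N list-of-lists; -1 marks unreachable joints.
    """
    adjacency = [[] for _ in range(num_joints)]
    for a, b in bones:
        adjacency[a].append(b)
        adjacency[b].append(a)

    dist = [[-1] * num_joints for _ in range(num_joints)]
    for src in range(num_joints):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[src][v] == -1:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist

# Toy 5-joint chain 0-1-2-3-4: distances from joint 0 are [0, 1, 2, 3, 4].
print(joint_graph_distances(5, [(0, 1), (1, 2), (2, 3), (3, 4)])[0])
```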
Q6: Can current video models handle complex motions well, or maybe the video is the weak part for final animation quality?
Thanks for raising this important point. Current video generation models (e.g., Kling AI, RunwayML, Jimeng AI) have improved significantly and can generate complex motions with high success rates. Some examples are provided on the project page in supplemental material. However, video quality does impact our final animation quality. For instance, motion blur or temporal inconsistencies in generated videos can degrade joint/vertex tracking accuracy and make optimization more challenging. We will include this discussion in the final version.
Q7: The paper said that RigNet struggles with diverse object categories and fails to converge well. Is RigNet's design bad? Or just the dataset too hard for RigNet?
RigNet's performance issues stem from both factors: (1) Architectural limitations: RigNet's GNN-based design struggles to scale on large, diverse datasets like Articulation-XL2.0, which contains significantly more object categories than its original training data; (2) Dataset complexity: Our dataset's varied object orientations and geometric diversity exceed RigNet's original assumptions. This combination explains RigNet's convergence difficulties in our experiments.
Thanks for addressing all my concerns. I have no further question. So I will keep my positive rating.
Many thanks for your thoughtful feedback. We are glad we could resolve your concerns and will integrate your suggestions into the final version.
Dear Reviewer WsQJ,
The authors have provided detailed responses to your questions. What is your view after seeing this additional information? It would be good if you could actively engage in discussions with the authors during the discussion phase ASAP, which ends on Aug 6 (AoE).
Best, AC
This paper presents a comprehensive framework for automatic rigging and animation of diverse 3D objects. The contributions include an expanded dataset (Articulation-XL2.0), an auto-regressive transformer for skeleton generation, a novel topology-aware attention-based architecture for skinning weight prediction, and a differentiable optimization-based method for animation. Extensive experiments demonstrate the effectiveness of the proposed approach, showing consistent improvements in both performance and efficiency over existing baselines.
Strengths and Weaknesses

Quality: This work presents a complete pipeline, along with detailed technical descriptions, ablation studies, multiple test settings, and supporting qualitative evidence through figures and videos. However, some of the authors' claims would benefit from stronger empirical or theoretical support; we refer to the limitations section in particular, where further clarification or justification would strengthen the overall contribution.

Clarity: The paper is well-organized, with a clear structure and informative figures. The authors provide a detailed presentation of the data expansion, methodological components, technical tools, and implementation details. Supplementary materials such as images and videos are also provided to support the claims. However, certain terms are ambiguously described, like "randomized valid poses" in Section 3.1 (Dataset: Articulation-XL2.0).

Significance: The introduction of a new/expanded dataset, and especially the pipeline, represents a meaningful contribution to the community and is likely to benefit future research in this area.

Originality: The authors propose a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. The contributions include: (1) a more diverse dataset (Articulation-XL2.0); (2) an improved auto-regressive transformer for skeleton generation; (3) a novel attention-based architecture for skinning weight prediction; and (4) a differentiable optimization-based animation method that requires no neural network parameters. The methods for dataset construction and auto-regressive skeleton generation rely heavily on prior techniques. The skeleton generation component appears to be an incremental extension of prior methods, with all three variants being derived from similar approaches based on auto-regressive transformers. The deeper insights primarily emerge through the ablation studies and limited visualizations. For the animation part, the authors introduce a differentiable optimization-based approach that directly optimizes a sequence of motion parameters without relying on neural networks. While this idea is promising and shows improved performance over baselines (likely due to the use of joint-specific transformations enabled by the learned skeleton), the paper lacks sufficient discussion of this component. Notably, the approach bears conceptual similarity to existing works in the 4DGS field, which employ a deformation network to predict motion parameters. These related methods all aim to fit deformation/motion parameters, yet the paper lacks any meaningful discussion of this shared strategy. In fact, this design choice, particularly the focus on joint-level fitting, may partly explain the method's limitations in handling fine-scale deformations. Furthermore, in Section 4, the authors primarily describe the optimization procedure, but the actual animation process is not clearly presented, nor are there references to prior work that would help clarify it.
Questions
- For the animation part, the optimization framework incorporates seven different loss terms. However, the paper lacks further discussion, such as the necessity and balancing of the different losses.
- Could you provide more information or examples about long sequences, complex motion sequences, and humanoid morphologies (videos)? Most examples in the appendix are animal-related.
- Could you provide more analysis of why spatial ordering yields better performance on Articulation-XL2.0?
- (Not important) While the limitations regarding fine-scale deformations are acknowledged in the paper, we would appreciate the inclusion of relevant visualizations to better illustrate these shortcomings. Additionally, a fair comparison of the optimization cost with baseline methods would strengthen the evaluation and provide a clearer understanding of the trade-offs involved.
Limitations
- While the authors have acknowledged certain limitations, some of the claims, such as the applicability to complex motion sequences, lack sufficient empirical evidence.
- In the dataset section (Articulation-XL2.0), some descriptions are vague, such as the term "randomized valid poses".
- In Sec. 4, the authors primarily describe the optimization procedure, but the actual animation process is not clearly presented, nor are there references to prior work that would help clarify it.
- Although the dataset aims to increase diversity by incorporating more animal morphologies, the supplementary videos predominantly showcase animal examples, while humanoid morphologies (videos) appear underrepresented. Given the stated goal of generality, we believe both categories are equally important and should be demonstrated.
- The core methodology appears to be largely composed of existing techniques, and the validation is mostly limited to ablation studies and selected qualitative examples. While helpful, the paper would benefit from deeper, more insightful analysis. Some discussions (e.g., about ordering) have also been covered in previous works. Notably, spatial ordering performs well on Articulation-XL2.0, but the underlying reason for this improved performance remains unclear; a deeper exploration of why spatial ordering works better in this context would be valuable.
Formatting Issues
No formatting issues.
Thanks for your constructive feedback. We have organized your comments from the weaknesses, questions, and limitations sections and provided our responses below.
Q1: Vague descriptions in Articulation-XL2.0 dataset section, such as "randomized valid poses".
Thanks for pointing out this vagueness. "Randomized valid poses" refers to poses generated from SMALR where we apply random rotation angles to animal joints while constraining the angles within anatomically valid ranges. This ensures diverse articulation states while maintaining biological plausibility. We will clarify this with specific angle ranges in the final version.
Q2: Core methodology appears largely composed of existing techniques; the deeper insights primarily emerge through ablation studies and limited visualizations.
While some components leverage established foundations, our work introduces substantial novel contributions across multiple dimensions: (1) Dataset enhancement: We expanded Articulation-XL from 33K to 59.4K samples with a diverse-pose subset that enhances generalization to varied pose inputs, addressing a critical limitation of pose diversity in existing datasets. (2) Technical innovations: (a) Joint-based tokenization with hierarchical ordering for skeleton generation, improving efficiency and performance; (b) Novel attention-based architecture with topology-aware joint attention mechanism for skinning weight prediction, achieving better generalization and faster inference; (c) First integration of video-guided optimization for automatic animation. (3) System-level contribution: While existing methods address rigging or animation separately, our unified framework enables the first pipeline from shape input to final animation - a significant advance in practical applicability.
We acknowledge the need for deeper analysis. Beyond ablation studies and qualitative examples in the main paper, we also provide quantitative results on three benchmarks (Tables 1, 2), inference time analysis (Tables S2, S5), generalization studies (Figure S4), more qualitative results in the appendix (Figure S5, Figure S7), and video results on the project page in supplemental material. However, we agree that a more comprehensive analysis would strengthen the work. In the final version, we will include failure case analysis, quantitative animation results, and a comprehensive comparison with MagicArticulate.
Q3: Insufficient discussion of the optimization-based animation; lacks discussions with related methods in 4DGS fields that employ deformation networks to predict motion parameters; joint-level fitting approach may explain limitations in fine-scale deformations.
Thanks for the suggestion. We acknowledge that animation deserves more detailed discussion. Our differentiable optimization directly optimizes joint transformations by minimizing rendering, tracking, and motion regularization losses, without requiring neural network training. We chose direct optimization over neural approaches because: (1) it provides more precise control over joint-specific transformations; (2) it avoids the need for extensive motion training data; (3) it allows flexible adaptation to different video inputs without model retraining.
Regarding 4DGS methods: While both our animation optimization and 4DGS methods aim to deform vertices/Gaussians, there are fundamental differences: (1) We leverage learned skeletal structure for physically-plausible joint-based deformation, whereas 4DGS uses direct Gaussian deformation without anatomical constraints; (2) Our approach focuses on controllable object animation with video guidance, while 4DGS targets novel view synthesis of dynamic scenes; (3) Most importantly, rigging enables multiple control modalities. Beyond video-guided animation, rigged models can integrate directly into standard 3D animation software. This rigging-animation pipeline aligns seamlessly with current animation workflows in games and animated films.
The limitations in fine-scale deformation are indeed caused by our joint-level optimization approach. Fine-scale deformations require higher joint density in specific regions, but our current rigging pipeline generates skeletons with limited joint resolution. Animation-driven feedback to adaptively refine joint placement based on motion complexity may address this limitation in future work. We will include these discussions in the final version.
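As a rough illustration of what "directly optimizes joint transformations without neural network parameters" means in practice, here is a minimal PyTorch-style sketch. The frame/joint counts, learning rate, and the dummy quadratic loss are placeholders: the real objective is the rendering/tracking/regularization losses described above, not the stand-in used here to keep the snippet runnable.

```python
import torch

num_frames, num_joints = 50, 32  # toy sizes (e.g., a 5-second video at 10 FPS)

# The only learnable quantities: per-frame, per-joint local rotations (axis-angle)
# plus a per-frame root translation -- no neural network weights are trained.
joint_rotations = (0.01 * torch.randn(num_frames, num_joints, 3)).requires_grad_(True)
root_translation = torch.zeros(num_frames, 3, requires_grad=True)
optimizer = torch.optim.Adam([joint_rotations, root_translation], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    # Real pipeline (not shown): forward kinematics -> linear blend skinning ->
    # differentiable rendering, compared against the guidance video through the
    # rendering, tracking, and temporal-regularization losses. A dummy quadratic
    # loss keeps this sketch runnable on its own.
    loss = joint_rotations.pow(2).mean() + root_translation.pow(2).mean()
    loss.backward()
    optimizer.step()
```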
Q4: For animation, the authors primarily describe the optimization, but the actual animation process is not clearly presented, nor are there references to prior work that would help clarify it.
After optimization, our animation process follows standard skeletal animation principles [1]: (1) Forward Kinematics (FK): We compute global joint transformations from optimized local transformations using hierarchical forward kinematics, traversing the skeleton from root to leaves; (2) Linear Blend Skinning (LBS) [2]: Deform mesh vertices using weighted combinations of joint transformations, where each vertex is influenced by multiple joints according to skinning weights. This produces the final mesh animation sequence. We will add these implementation details and references to clarify the animation process in the final version.
[1] Parent, Computer Animation: Algorithms and Techniques, 2012.
[2] Lewis et al., Pose Space Deformation: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation, 2000.
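For completeness, below is a self-contained NumPy/SciPy sketch of this standard FK + LBS pipeline on a toy chain skeleton. The rotate-about-the-rest-joint parameterization, rigid binding weights, and all names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def forward_kinematics(rest_joints, parents, local_rotations, root_translation):
    """Per-joint global 4x4 transforms from local axis-angle rotations (FK).

    Assumes joints are indexed so that every parent precedes its children;
    parents[root] == -1.
    """
    num_joints = len(parents)
    globals_ = np.zeros((num_joints, 4, 4))
    for j in range(num_joints):
        local = np.eye(4)
        local[:3, :3] = R.from_rotvec(local_rotations[j]).as_matrix()
        # Rotate about the joint's rest position: T(p_j) * R_j * T(-p_j).
        pivot, unpivot = np.eye(4), np.eye(4)
        pivot[:3, 3], unpivot[:3, 3] = rest_joints[j], -rest_joints[j]
        local = pivot @ local @ unpivot
        if parents[j] == -1:
            local[:3, 3] += root_translation
            globals_[j] = local
        else:
            globals_[j] = globals_[parents[j]] @ local
    return globals_

def linear_blend_skinning(vertices, skin_weights, joint_transforms):
    """Deform rest-pose vertices by a weighted blend of joint transforms (LBS)."""
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)   # (N, 4)
    blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)       # (N, 4, 4)
    return np.einsum("nab,nb->na", blended, homo)[:, :3]

# Toy 3-joint chain; bend the middle joint 90 degrees about x.
rest_joints = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
parents = [-1, 0, 1]
rotations = np.zeros((3, 3)); rotations[1] = [np.pi / 2, 0.0, 0.0]
vertices = np.array([[0.0, 0.5, 0.0], [0.0, 1.5, 0.0]])
weights = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # rigid binding for clarity

transforms = forward_kinematics(rest_joints, parents, rotations, np.zeros(3))
print(linear_blend_skinning(vertices, weights, transforms))  # vertex 1 swings to ~(0, 1, 0.5)
```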
Q5: The optimization framework uses seven different losses but lacks discussion on their necessity and balance.
Thanks for this important question. Our seven loss terms serve distinct objectives: (1) Rendering losses: RGB loss ensures photometric consistency; Mask loss maintains object boundaries; Optical flow loss enforces temporal motion consistency; Depth loss alleviates 3D ambiguities in monocular optimization. (2) Tracking losses align 3D skeleton joints and mesh vertices with keypoints tracked across video frames, ensuring optimized animation accurately follows the motion in the video. (3) The regularization term prevents temporal jittering by penalizing large transformation changes between consecutive frames. To balance these losses, we weight each term to ensure comparable magnitudes. In practice, regularization losses are down-weighted by 3–4 orders of magnitude relative to rendering and tracking losses to prevent over-smoothing. We will provide detailed discussions in the final version.
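A minimal sketch of how such a weighted objective can be assembled; the weight values and term names are illustrative placeholders only (not the paper's actual settings), chosen to reflect the "regularization down-weighted by a few orders of magnitude" guideline.

```python
import torch

# Placeholder weights: rendering/tracking terms kept at comparable magnitudes,
# temporal regularization down-weighted by several orders of magnitude.
LOSS_WEIGHTS = {
    "rgb": 1.0, "mask": 1.0, "flow": 1.0, "depth": 1.0,
    "joint_track": 1.0, "vertex_track": 1.0, "temporal_reg": 1e-4,
}

def total_loss(per_term: dict) -> torch.Tensor:
    """Weighted sum of the individual loss terms."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in per_term.items())

# Dummy stand-ins for the real per-term losses computed from rendered frames,
# tracked keypoints, and consecutive-frame transformation differences.
dummy_terms = {name: torch.rand(()) for name in LOSS_WEIGHTS}
print(total_loss(dummy_terms))
```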
Q6: Limited examples of long sequences, complex motions, and humanoid morphologies.
Regarding sequence length, we primarily generate 5-second videos (50 frames at 10 FPS) using Kling AI and Jimeng AI, which is longer than most existing 4D generation works. We will extend this to longer sequences (10 seconds) to better demonstrate long-range motion stability.
For complex motions, they typically involve rapid movements with large joint rotations between frames or extensive rotational changes throughout the video sequence. These challenges could potentially be alleviated by adaptive frame sampling based on optical flow magnitude, sampling more densely during rapid motion. For example, in the seahorse case on our supplementary project page, the tail exhibits large-amplitude oscillatory motion. While our method successfully captures the overall motion pattern, it struggles with precise alignment between the video and the tail's fine-scale movements. We will further discuss these in our limitations and future work.
In our paper figures and supplementary project page, we provide results across different categories: 4 humanoid examples, 4 quadrupedal animals, 3 marine creatures, and 2 flying objects to demonstrate generalization across morphologies. Since we cannot add visualizations during rebuttal, we will include more humanoid examples in the final version, which is more important for practical applications.
Q7: Lack of explanation for why spatial ordering performs better in Articulation-XL2.0.
To clarify, we actually use hierarchical ordering rather than spatial ordering in our method. We provide an ablation comparison between both ordering strategies in Section 5.5 and Table 3. While they achieve close performance in spatial alignment metrics, spatial ordering often yields disconnected skeletons because it may generate child joints before their parents, resulting in invalid parent references. We also provide additional discussion in Section D.1 (lines 127-129) and visualization in Figure S6 of the appendix.
Q8: Limitations regarding fine-scale deformations are acknowledged but not well illustrated.
Regarding fine-scale deformation limitations, an example can be seen in the turtle case in the More Animation Results section of our supplementary project page. While the reference video shows very soft motion of the turtle's forelimbs, our animation results appear less smooth due to insufficient joint density in these regions. This limitation stems from our skeleton generation producing few joints in areas requiring fine-scale deformations. Animation-driven joint refinement could improve smoothness but remains future work. We will include more detailed analysis and examples in the final version.
Q9: Missing comparison of optimization costs with baseline methods.
Thanks for your suggestion. We compare against two animation baselines in the paper: (1) L4GM (feed-forward method): While L4GM was trained on 128 80G A100 GPUs, our optimization approach needs only a single A100 GPU (20 minutes for 10K vertex meshes) while avoiding geometric distortions inherent in L4GM; (2) MotionDreamer (per-object optimization): MotionDreamer takes approximately 10 minutes for meshes with less than 8K vertices but produces only 20-frame animations, while our method generates 50-frame animations in 20 minutes with superior quality and stability (refer to comparison videos in the supplementary project page). We will provide a comprehensive computational cost analysis in the final version.
Thank you for your response and excellent work. I am willing to raise the original score.
Thank you very much for your positive feedback and for raising your score. We truly appreciate your constructive comments throughout the review process.
Dear Reviewer kTdi,
The authors have provided detailed responses to your questions. What is your view after seeing this additional information? It would be good if you could actively engage in discussions with the authors during the discussion phase ASAP, which ends on Aug 6 (AoE).
Best, AC
Thank you for the detailed response, which has clarified my previous concerns.
Thank you again for the constructive feedback. We are pleased to have addressed your concerns and will incorporate your suggestions into the final manuscript.
Dear Reviewers,
The discussion period with the authors has now started. It will last until Aug 6th AoE. The authors have provided responses to your questions. I request that you please read the authors' responses, acknowledge that you have read them, and start discussions with the authors RIGHT AWAY if you have further questions, to ensure that the authors have enough time to respond to you during the discussion period.
Best, AC
This paper proposes a novel method for automatic rigging and animation of 3D characters. Four reviewers provided positive ratings of 4 x accept. Reviewer kTdi promised the authors that they would increase their final score, but did not update it, in spite of repeated reminders from the AC. The AC assumes that their final rating is "accept", increased originally from "borderline accept". The reviewers appreciated the work for the novelty of its proposed approach, its strong performance, its new dataset, and the significance of the task performed. The reviewers' questions were adequately answered during the reviewer-author discussion phase. The AC concurs with the reviewers' consensus and recommends acceptance. Congratulations! The authors should incorporate the changes that they have promised into the final camera-ready version of their paper.