PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence 4.5 · Novelty 3.0 · Quality 2.8 · Clarity 3.0 · Significance 2.8
NeurIPS 2025

RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data

OpenReview · PDF
Submitted: 2025-04-22 · Updated: 2025-10-29
TL;DR

We propose a scalable neural auto-rigging framework for facial meshes of diverse topologies with multiple disconnected components.

Abstract

Keywords
Creative AI · Auto-rigging · Facial Animation

Reviews and Discussion

Review (Rating: 5)

This paper presents RigAnyFace (RAF), a novel auto-rigging framework for facial meshes that supports rigging in-the-wild meshes without manual labeling and addresses limitations of previous methods, particularly Neural Face Rigging (NFR). RAF is designed to handle diverse mesh topologies, including those with multiple disconnected components such as eyeballs and mouth sockets. The framework uses a triangulation-agnostic surface learning network, augmented with conditional diffusion blocks for FACS parameters and a global encoder to handle disconnected components. To overcome the scarcity and high cost of manually rigged 3D ground-truth data, RAF introduces a 2D supervision strategy for unlabeled neutral meshes. This strategy leverages a generative 2D face animation model and an optical flow estimator to synthesize posed images and 2D displacement fields for training. Experiments show that RAF outperforms prior work in accuracy and generalizability across artist-crafted assets and in-the-wild samples, and it supports more detailed expression animation by accommodating disconnected components.

Strengths and Weaknesses

Strengths

  1. The proposed global encoder helps to deal with multiple disconnected components.
  2. The 2D supervision pipeline enriches the model's exposure to unrigged meshes, which effectively enhances the generalization ability towards in-the-wild meshes.
  3. The general retargeting quality, including expression faithfulness, temporal smoothness, and the range of applicable meshes, is better than comparable baselines.
  4. The ablation studies validate the effectiveness of the key design choices.

Weaknesses

  1. Though the model delivers better temporal consistency, it does not explicitly consider the temporal dimension.
  2. The paper does not analyse the properties of the global encoder, such as its feature distribution or why it is effective for different disconnected components.
  3. The discussion of limitations is omitted; it would be better to give some specific examples of the limitations.

Questions

  1. Will the code and the dataset be released?
  2. Why did the authors choose to output vertex offsets directly instead of using Poisson systems as in Neural Jacobian Fields or Neural Face Rigging? An ablation on this could better explain the design choice.
  3. As the DiffusionNet layers can't propagate information between disconnected components, how does the Global Encoder, made of several DiffusionNet Layers with average pooling, help to mitigate this issue? Besides, would the model be able to generalize to meshes with unseen components?

Limitations

Yes

Final Justification

The authors have provided a thorough rebuttal and engaged in a productive discussion that has successfully addressed my initial questions and concerns. The core contributions of the paper stand strong, and my confidence in its quality and impact has been reinforced. Through the rebuttal, the authors resolved my concerns about the functionality of the global encoder and character-specific expressions. The paper does not explicitly model the temporal dimension for animation consistency. However, the authors acknowledged this and have reasonably positioned it as an area for future work. Given that the paper's main contribution is a scalable auto-rigging framework for static meshes, I assign low weight to this point and do not see it as a flaw in the current work. In summary, I believe this work is a step towards auto-rigging of the face. The proposed method, combining scalable training on 2D unlabeled data and an architecture handling complex topologies and disconnected components, is helpful. I thus recommend acceptance.

Formatting Concerns

No concerns.

Author Response

We sincerely appreciate the reviewer’s constructive comments and recognition of the key contributions of our paper. We address the reviewer’s concerns and questions point by point.

1. Though the model delivers better temporal consistency, it does not explicitly consider the temporal dimension.

Thanks for pointing that out. We will explore explicit temporal modeling during animation in our future work.

2. Properties of the global encoder:

The primary role of the global feature is to encode the relative positions between disconnected components. This information is crucial for preventing penetrations during deformation, as the Diffusion Blocks alone cannot propagate information across disconnected mesh components.

To quantify this, we evaluate the penetration between inner components (e.g., mouthbag) and the outer face surface by reporting the percentage of penetrating vertices with and without our global feature. Specifically, we compute the signed distance function (SDF) of each vertex on the inner component with respect to the outer face surface in both the neutral and predicted meshes. Vertices are considered to be penetrating when their SDF values change from negative in the neutral mesh to positive in the deformed/predicted mesh. As shown in the table below, involving the global feature significantly reduces the number of such penetrating vertices. Please also refer to Fig. 5A (first column) in the main paper for the qualitative comparison.

| Global Encoder | All Other Components | MAE ↓ | MAE Q95 ↓ | Penetration ↓ |
|:---:|:---:|:---:|:---:|:---:|
| ✗ | ✗ | 2.14 | 6.64 | 0.377 |
| ✗ | ✓ | 2.16 | 6.08 | 0.405 |
| ✓ | ✗ | 2.08 | 5.84 | 0.166 |
| ✓ | ✓ | 1.92 | 5.63 | 0.173 |
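
A minimal sketch of how this penetration metric can be computed (the use of trimesh and the variable names are illustrative, not our exact implementation; the SDF sign convention must match the definition above, i.e., negative in the neutral mesh and positive once the vertex penetrates the outer surface):

```python
from trimesh.proximity import signed_distance

def penetration_rate(outer_neutral, inner_neutral_v, outer_posed, inner_posed_v):
    # outer_neutral / outer_posed: trimesh.Trimesh surfaces of the outer face.
    # inner_neutral_v / inner_posed_v: (N, 3) arrays of inner-component vertices.
    sdf_neutral = signed_distance(outer_neutral, inner_neutral_v)
    sdf_posed = signed_distance(outer_posed, inner_posed_v)
    # Penetrating vertices: SDF flips from negative (neutral) to positive (posed).
    penetrating = (sdf_neutral < 0) & (sdf_posed > 0)
    return 100.0 * penetrating.mean()  # percentage of penetrating vertices
```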

Additionally, we will include two visualizations in our revision to illustrate the role of the global encoder. Specifically, we will show T-SNE plots of the global features compared against those from unmodified meshes in our dataset: (1) when the positions of connected components are randomly perturbed, and (2) when different numbers or types of connected components are removed. In both cases, the resulting features form separate clusters from the original ones, demonstrating that the global encoder captures structural changes in the mesh.

3. Discussion of limitations:

Thanks for pointing that out. In our revision, we will include qualitative examples of our failure cases, particularly where our method struggles to deform shell-like structures or meshes with poor discretization. These failure cases are primarily caused by the following factors:

  • Shell-like meshes (e.g., deformed spheres) often depict facial features such as eyelids and lips using texture rather than geometry, making meaningful deformation ill-posed or ambiguous.
  • Poorly discretized meshes, where the main facial surface breaks into disconnected parts instead of a complete one, hinder feature propagation through the DiffusionNet backbone and make it difficult to establish consistent deformation bases.

4. Will the code and the dataset be released?

We will release the code and trained model weights for reproducibility. We are also working on releasing the full dataset upon acceptance, pending internal approval and legal clearance.

5. Ablation on Neural Jacobian Fields

We initially explored Poisson-based approaches such as Neural Jacobian Fields (NJF), but found several issues. In our early experiments, NJF required double-precision for stable training, which increased memory usage and slowed down the training. Moreover, passing gradients through the Poisson solver introduces additional computational overhead without a prominent benefit in terms of accuracy.

In contrast, directly predicting vertex displacements proved to be much simpler, more efficient, and yielded strong empirical results. Given these considerations, we decided to focus on the direct offset prediction and discontinued the NJF direction early in development.

6. Would the model be able to generalize to meshes with unseen components?

Yes. During dataset curation, we ensured coverage of common deformable facial components needed for rigging, such as mouth bags and eyeballs. Unseen components typically refer to non-deforming accessories (e.g., earrings, hats, or hair), which do not require animation.

For example, in the sample from Objaverse (Please refer to Fig. 1, third column; Fig. 7, last column), earrings are preserved without deformation. Similarly, in our video demo (Please refer to Section 3: Animating Generated Facial Mesh from Text-to-3D Model), components like hair and hats remain unaffected during animation. Even though these parts were unseen during training, the model naturally preserves them on the mesh.

Comment

Thanks to the authors for the rebuttal. I think the additional experiment clearly demonstrates the effectiveness of the global encoder, and I suggest the authors include it to clarify the functionality of the global encoder.

Besides, I would like to ask two more questions:

Character-specific expression:

As the labeled dataset uses a pre-defined linear blendshape rig (L140), the same expression applied to different character meshes produces the same vertex offset. For neutral meshes $M^0$ and $M^1$, the expressed meshes under FACS vector $A$ would be:

$$M^0_A = V^0 + \sum_{i=0}^N A_i V_i, \qquad M^1_A = V^1 + \sum_{i=0}^N A_i V_i,$$

leading to the same expression offset.

Thus, my question is, can the model learn character-specific expressions, given a character-agnostic dataset?

Comparison to deformation transfer

How does the model perform on unseen meshes, compared with deformation transfer given mesh registration from the annotation?

Comment

We sincerely appreciate the reviewer’s recognition of our additional experiments demonstrating the effectiveness of the global encoder and will include them in our revision. Regarding the reviewer’s questions:

1. Character-specific expression:

Thank you for pointing this out. We agree that the current phrasing in L140, “a full blendshape rig,” is potentially misleading. It may give the impression that the same blendshape rig is shared across all characters.

To clarify: each labeled character in our dataset is annotated with its own character-specific blendshape rig, manually created by artists. This means that while the same FACS vector $A_i$ is used across characters, the associated blendshape offsets $V_i$ are unique to each character. As a result, the same FACS vector produces a different expression offset per character:

$$M^0_A = V^0 + \sum_{i=1}^N A_i V_i^0, \qquad M^1_A = V^1 + \sum_{i=1}^N A_i V_i^1$$

Thus, our dataset is not character-agnostic, and our model is trained on character-specific expressions. This enables the model to learn how different characters express the same FACS pose differently, capturing unique deformation styles for each identity.
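
For illustration, a minimal NumPy sketch of this character-specific evaluation (array shapes and names are hypothetical, not our data format):

```python
import numpy as np

N = 52                               # number of FACS blendshapes (illustrative)
V = 1704                             # number of vertices (illustrative)
A = np.random.rand(N)                # one FACS activation vector, shared across characters

# Each character has its own neutral geometry and its own blendshape offsets.
neutral_0, offsets_0 = np.random.randn(V, 3), 0.01 * np.random.randn(N, V, 3)
neutral_1, offsets_1 = np.random.randn(V, 3), 0.01 * np.random.randn(N, V, 3)

# The same A produces a different expression offset per character.
M0_A = neutral_0 + np.einsum('i,ivc->vc', A, offsets_0)
M1_A = neutral_1 + np.einsum('i,ivc->vc', A, offsets_1)
```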

We will revise the paper's phrasing to make this important detail clear in the final version.

2. Comparison to deformation transfer

Thank you for your insightful question. We have provided both quantitative and qualitative comparisons with deformation transfer in Table 2 and Figure 5 (b) of our main paper. These evaluations are conducted on a held-out test set; all meshes are unseen during training and paired with artist-annotated ground-truth blendshapes and 3D correspondences. As shown, our method produces more accurate and expressive animations than deformation transfer.

Comment

Thanks for the reply. Please clarify Q1 in the paper.

Comment

Thank you very much for your thoughtful feedback and valuable questions. We greatly appreciate your continued support and will incorporate your suggestions into the revised version of the paper.

Review (Rating: 4)

This work tackles the problem of generating blendshapes for an arbitrary 3D human head model, which can also have disconnected geometry such as eyeballs. To this end, the authors propose a learning-based approach based on the DiffusionNet architecture and extend it to accept FACS conditioning as well as a global latent code to overcome the issue of disconnected geometry. The authors train on hand-crafted ground-truth pairs, i.e., head models and FACS blendshapes. However, this data has limited diversity, and thus the authors also propose a 2D supervision strategy using readily available 2D face animation methods to generate pseudo 2D ground truth. The ablation studies and comparisons partially confirm the authors' claims.

Strengths and Weaknesses

Strengths:

  • Clear scope and positioning of the work
  • References are more or less complete
  • Challenging and relevant problem setting
  • Interesting idea of using Megactor for generating pseudo ground truth for weakly supervised training
  • Comparisons are slightly limited in terms of prior methods. However, even after careful search, I could not find newer methods tackling the same problem. Thus, I believe the comparisons are complete to the best of my knowledge.

Weaknesses:

  • Incremental design
    • The global encoder and FACS conditioning can be considered minor (task-specific) adaptations to the DiffusionNet architecture. While interesting, I believe it is rather incremental and does not justify acceptance at NeurIPS.
  • Technical correctness
    • The paragraph at l.175 defines a differentiable rendering loss between the reconstructed FACS pose and the ground-truth rendering. However, this still assumes having paired GT data available. Thus, I am wondering why this should generalize better, as it essentially requires the same data as before. Why not simply supervise in 3D? The authors should clarify.
    • In the paragraph at l.195, again it remains unclear why supervision is applied in 2D when the ground-truth mesh is assumed. Why not simply supervise in 3D directly?
  • Conceptual design choices
    • The global encoder basically averages all features across all vertices. This completely ignores spatial/local structures. While it is addressing disconnected geometry parts, it completely destroys locality information. Thus, I feel this design is sub-optimal given the task.
  • Writing
    • FACS not being introduced before using the acronym
    • l.34 starting the sentence with “But” is grammatically wrong
    • l.133-134 could be removed as it does not add any valuable information
    • l.158 “. And…” is not a proper English sentence
  • Clarity
    • The DiffusionNet background section is not very self-contained. It would be great to introduce the approach more formally with equations/mathematical notations
  • Ablation studies
    • The authors provide some ablation studies, which is good. However, I would like to see a study where only one component at a time is changed to a simple baseline. For example, the global encoder is only removed in conjunction with other key designs. Instead, I would like to see an ablation where only the global encoder is removed. This would be a more meaningful experiment. Moreover, the discussion of the ablation study is very shallow and at the level of “every design improves the results”. There is no good answer to why it improves and in which sense it improves results. The authors should extend the explanations here.
  • Missing reference for image to 3D avatar
    • Teotia et al. 2024: GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations

Given the current weaknesses, I slightly lean towards reject. However, the authors may be able to address some of the concerns during the rebuttal period.

Questions

l.146 How does the interpolation strategy work? I cannot find many details in the manuscript. Does it assume correspondence between two different instances? If so, how can this be guaranteed?

Is NFR retrained on the same data? Otherwise, the comparison is not conclusive, as it is not clear whether the improvement comes from the data or the algorithmic design. The authors should clarify.

Limitations

Limitations are discussed in the conclusion section.

Final Justification

The authors addressed my major concerns about novelty, evaluation, and key design choices during the rebuttal.

Thus, I am happy to increase my score.

Formatting Concerns

Seems appropriate.

Author Response

We sincerely appreciate the reviewer’s valuable and constructive feedback and recognition of our work’s scope and complete comparison. For presentation-related issues, we will further polish the writing, correct grammatical errors, clarify technical details, and add the missing references as suggested. We address the reviewer’s technical concerns below.

1. Why is supervision applied in 2D, as the ground truth mesh is assumed? Why not simply supervise in 3D directly?

Thank you for the question, this gives us a chance to clarify an important part and contribution of our approach. Not all meshes in the training set come with artist-labeled 3D blendshape (deformed mesh) ground truth. Due to the high cost of manual annotation, only a subset of samples have full 3D labels. The remaining unlabeled data consists of only a neutral mesh without any corresponding 3D ground truth deformations.

Therefore, our use of 2D supervision is not intended to replace 3D supervision when it is available, but to enable learning from the unlabeled data. Specifically, we employ a two-stage training pipeline. During the first stage of training, we use 2D losses to supervise both labeled and unlabeled samples. For the unlabeled ones, we generate pseudo-2D ground truth using a 2D face animation model (based on MegActor) and a RAFT-based optical flow estimator.

Then, in the second stage, we fine-tune the model using the subset of data with accurate 3D ground-truth deformations with 3D loss. This two-stage strategy allows us to scale learning while still benefiting from the precision of 3D supervision when available.
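
In pseudocode, the two-stage schedule looks roughly as follows (all function and loader names are illustrative placeholders, not our actual API):

```python
# Pseudocode sketch of the two-stage training schedule (placeholder names only).

# Stage 1: 2D supervision on both labeled and unlabeled neutral meshes.
for mesh, facs in stage1_loader:
    pred_mesh = raf(mesh, facs)                       # predicted deformed mesh
    pred_img = render(pred_mesh)
    if mesh.has_3d_labels:                            # labeled: render the GT deformation
        target_img, target_flow = render_ground_truth(mesh, facs)
    else:                                             # unlabeled: pseudo-2D ground truth
        target_img = megactor_animate(render(mesh), facs)
        target_flow = raft_flow(render(mesh), target_img)
    loss = photometric_loss(pred_img, target_img) + displacement_loss(pred_mesh, target_flow)
    update(loss)

# Stage 2: fine-tune with 3D vertex supervision on the artist-labeled subset only.
for mesh, facs, gt_vertices in stage2_loader:
    pred_mesh = raf(mesh, facs)
    update(vertex_loss(pred_mesh.vertices, gt_vertices))
```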

2. The global encoder and FACS conditioning can be considered minor (task-specific) adaptations.

While the global encoder and FACS conditioning are indeed tailored to facial rigging, we believe this task-specificity reflects a thoughtful and necessary adaptation to challenges that have not been addressed in prior work. The global encoder enables information flow across disconnected components, and FACS conditioning allows expression-aware deformation. These designs are essential components of our auto-rigging system.

Importantly, these architectural components are just one part of our contributions. Another key contribution of our work lies in the 2D supervision design that enables scaling-up training on unlabeled neutral facial meshes without rig annotations. Unlike prior methods that rely exclusively on costly 3D-labeled data, we introduce a tailored 2D animation and optical flow pipeline to generate pseudo-2D ground truth supervision for unlabeled data. To the best of our knowledge, RAF is the first scalable approach to make effective use of unlabeled data for neural auto-rigging.

Taken together, the architectural innovations and our scalable training framework form a unified and practical solution for real-world auto-rigging across diverse mesh topologies with multiple disconnected components, which we believe constitutes a meaningful step beyond prior work.

3. Global encoder compressing all vertices' features in one single vector destroys locality information.

The primary role of the global feature is not to capture fine-grained geometric details, but rather to encode the relative positions between disconnected components. This information is crucial for preventing penetrations during deformation, as the Diffusion Blocks alone cannot propagate information across disconnected mesh components.

To quantify this, we evaluate the penetration between inner components (e.g., mouthbag) and the outer face surface by reporting the percentage of penetrating vertices with and without our global feature. Specifically, we compute the signed distance function (SDF) of each vertex on the inner component with respect to the outer face surface in both the neutral and predicted meshes. Vertices are considered to be penetrating when their SDF values change from negative in the neutral mesh to positive in the deformed/predicted mesh. As shown in the table below, involving the global feature significantly reduces the number of such penetrating vertices. Please also refer to Fig. 5A (first column) in the main paper for the qualitative comparison.

Regarding the concern about capturing geometric details, the final deformation predictions are contributed by both the global feature and per-vertex features from the Diffusion Blocks, enabling accurate deformation even for complex meshes. For instance, while the training-set meshes have 1,704 vertices and 3,267 faces on average, our model generalizes well to in-the-wild meshes with significantly higher complexity, averaging 9,497 vertices and 17,958 faces, as shown in Fig. 7 of the main paper.

| Global Encoder | All Other Components | MAE ↓ | MAE Q95 ↓ | Penetration ↓ |
|:---:|:---:|:---:|:---:|:---:|
| ✗ | ✗ | 2.14 | 6.64 | 0.377 |
| ✗ | ✓ | 2.16 | 6.08 | 0.405 |
| ✓ | ✗ | 2.08 | 5.84 | 0.166 |
| ✓ | ✓ | 1.92 | 5.63 | 0.173 |

4. Ablation study on the global encoder.

We conducted an additional ablation as the reviewer suggested, removing the global encoder only while keeping all other components unchanged. As shown in the second row of the above table, the model’s performance drops noticeably compared to the full model (the last row) and becomes similar to the variant where the global encoder is removed together with other key components (the first row). This demonstrates that the global encoder plays a critical role, and its removal alone creates a significant bottleneck in the overall performance, even when the rest of the architecture remains intact.

5. How does the interpolation strategy work? Does it assume correspondence, and how can it be guaranteed?

Yes, our interpolation strategy assumes correspondence between meshes of different topologies. Specifically, we standardize the UV layout across all head meshes so that key facial features (e.g., eyes, mouth) are consistently mapped to the same regions in UV space. This shared UV space enables us to identify correspondences between vertices across meshes, even when their topologies differ. Leveraging these correspondences, we perform linear blending in geometry space to smoothly interpolate between different head shapes.
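
A minimal sketch of this correspondence-and-blend procedure (the nearest-neighbor UV lookup is an illustrative choice, not necessarily our exact method):

```python
import numpy as np
from scipy.spatial import cKDTree

def interpolate_heads(verts_a, uv_a, verts_b, uv_b, t=0.5):
    """Blend two head meshes that share a standardized UV layout.

    verts_*: (N, 3) vertex positions; uv_*: (N, 2) coordinates in the shared UV space.
    """
    # For every vertex of mesh A, find the vertex of mesh B closest in UV space.
    _, idx = cKDTree(uv_b).query(uv_a)
    # Linear blend in geometry space using these UV-based correspondences.
    return (1.0 - t) * verts_a + t * verts_b[idx]
```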

6. Is NFR retrained on the same data?

NFR’s public release provides only a pretrained model without training scripts, and NFR does not support meshes with multiple connected components by design. Therefore, we could not retrain the NFR model. Given these constraints, the fairest comparison we can make is to evaluate both methods on unseen artist-crafted and in-the-wild head meshes, strictly preprocessed to meet NFR’s input requirements. Specifically, for the data used to compare our model with NFR, we retain only the largest connected component of the neutral mesh, removing auxiliary structures such as eyeballs and mouthbags, and trimming the inner surfaces of the lips and eyelids (please refer to Figs. 3 and 4 in the supplementary appendix).

In addition, we also evaluated our model on NFR’s training dataset, ICT FaceKit (please refer to the first sample in our Fig. 7). Although our method was not trained on this dataset, it still achieves comparable performance while handling multiple connected components.

Comment

Dear authors,

thank you for providing your rebuttal.

Overall, the authors addressed my concerns and I am happy to improve my score.

I still believe some individual design choices are rather incremental, but the overall design is new and sound. Thus, I recommend borderline accept.

Comment

Thank you so much for the thoughtful and constructive feedback. We appreciate your recognition of our efforts and your willingness to improve the score. We will revise the manuscript accordingly based on your suggestions.

Review (Rating: 4)

This paper introduces RigAnyFace, a scalable and generalizable neural auto-rigging framework for facial meshes with diverse topologies, including those with disconnected components like eyeballs. Its key goal is to automatically create FACS-based blendshape rigs from a neutral 3D facial mesh.

To achieve this goal, the paper uses a two-stage training pipeline: first, train on both rigged and unrigged data using 2D losses; then, fine-tune on rigged data with 3D supervision for higher precision. The model is built upon DiffusionNet and uses a global encoder to capture holistic information across disconnected components, together with conditional diffusion blocks that integrate FACS parameters directly into the deformation prediction process.

Strengths and Weaknesses

Strengths:

  • Strong Methodological Innovation: The paper presents a well-designed auto-rigging framework that combines a modified DiffusionNet backbone with conditional FACS encoding and a global encoder to handle disconnected mesh components. These architectural modifications are technically sound and justified.
  • Well-structured: the paper is logically organized, beginning with motivation, followed by detailed method descriptions, training strategy, and comprehensive evaluations. Figures are clear and informative.
  • Support for Disconnected Components: A significant practical contribution is the ability to rig meshes with disconnected elements like eyeballs—something most prior works fail to address.

Weaknesses

  • Limited Analysis of Failure Cases: The limitations section briefly mentions mesh discretization and poor mesh quality as failure modes, but these are not empirically analyzed or visualized in the main paper.
  • Incremental Architecture Changes: While effective, the architectural improvements to DiffusionNet (e.g., conditional blocks, global encoder) are relatively incremental. Much of the novelty lies in the integration and system design, rather than entirely new algorithms.
  • Two-Stage Training with 2D and 3D Supervision: 3D mesh learning with 2D supervision already has many prior works. The originality and novelty of the framework and loss functions are limited here.
  • Evaluation Setup: The evaluation only compares with NFR, lacking a more comprehensive analysis against other SOTA methods.

Questions

  • Could you include visual or quantitative examples of where the model fails or underperforms?
  • How is the model's ability to handle extreme expressions and postures of driving images?

Limitations

Yes.

Final Justification

The authors have addressed majority of the concerns in the rebuttal.

Formatting Concerns

N/A

Author Response

We sincerely thank the reviewer for the valuable feedback and recognition of the key contributions of our work, including the tailored architectural design for handling disconnected mesh components and the scalable training pipeline. We address the reviewer’s concerns as below:

1. Limited Analysis of Failure Cases:

Thanks for pointing that out. In our revision, we will include qualitative examples of our failure cases, particularly where our method struggles to deform shell-like structures or meshes with poor discretization. These challenging cases are mainly in-the-wild samples sourced from Objaverse, a large-scale dataset of general-purpose 3D assets that is not curated for facial modeling. Therefore, quantitative analysis of failure cases is unfortunately not feasible due to the lack of ground-truth rig annotations. These failure cases are primarily caused by the following factors:

  • Shell-like meshes (e.g., deformed spheres) often depict facial features such as eyelids and lips using texture rather than geometry, making meaningful deformation ill-posed or ambiguous.
  • Poorly discretized meshes, where the main facial surface breaks into disconnected parts instead of a complete one, hinder feature propagation through the DiffusionNet backbone and make it difficult to establish consistent deformation bases.

2. Incremental Architecture Changes:

Our RAF, built on top of DiffusionNet, is specifically aimed at addressing facial rigging challenges that previous neural methods are not capable of handling. The global feature encodes the global arrangement of disconnected components, and our FACS conditioning allows for expression-aware deformation. These designs are essential components of our auto-rigging system.

3. 3D mesh learning with 2D losses already has many prior works. The originality of the loss functions is limited:

While prior works [1, 2, 3] have explored applying 2D losses for 3D face reconstruction or generation using direct 2D supervision from input images or videos, our approach first generates pseudo-2D ground truth tailored to each unrigged mesh via a 2D face animation model and an optical flow estimator. We then train on large-scale unrigged meshes with 2D supervision generated above, eliminating the need for artist-provided annotations while enabling scalable training.

In addition, we empirically found that relying solely on 2D photometric loss is insufficient for accurate deformation. It often fails to capture subtle expressions as limited changes in pixel RGB values provide only marginal gradients. To address this, we introduce a 2D displacement loss based on pixel motion flow, which provides a stronger signal as an additional supervision for capturing fine-grained facial motions.

[1] Feng, Yao, et al. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images. (TOG2021)

[2] Chan, Eric R., et al. Efficient geometry-aware 3d generative adversarial networks. (CVPR2022)

[3] Trevithick, Alex, et al. Real-time radiance fields for single-image portrait view synthesis. (TOG2023)
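
For concreteness, a simplified PyTorch-style sketch of the two 2D losses discussed above (the names, shapes, and the nearest-pixel flow sampling are illustrative simplifications, not our exact losses):

```python
import torch

def photometric_loss(pred_img, target_img):
    # L1 over RGB renderings normalized to [0, 1].
    return (pred_img - target_img).abs().mean()

def displacement_loss(px_neutral, px_posed, flow, visible):
    """px_*: (V, 2) projected vertex positions in pixels; flow: (H, W, 2) pseudo-GT
    optical flow; visible: (V,) mask of vertices visible in the neutral rendering."""
    pred_disp = px_posed - px_neutral                  # predicted 2D displacement
    ij = px_neutral.round().long()                     # nearest-pixel flow sampling
    target_disp = flow[ij[:, 1], ij[:, 0]]             # (V, 2) pseudo-GT displacement
    return ((pred_disp - target_disp).abs().sum(-1) * visible).mean()
```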

4. Evaluation Setup: only compared with NFR in the evaluation, lacking analysis with other SOTA methods.

Indeed, Neural Face Rigging (NFR) is our primary baseline as it is the most recent and relevant method for our task, to the best of our knowledge. As also mentioned by Reviewer cQPn, NFR is the latest SOTA method that we can compare with.

5. How is the model's ability to handle extreme expressions of driving images?

Please refer to Section 2 of our supplementary video: Retargeting Human Expression from videos (01:09–03:17), which showcases retargeting of real-world videos with a variety of extreme expressions (including one example from the Facial Extreme Emotions Dataset [4]) to unrigged 3D meshes. These results demonstrate our model’s ability to generalize to intense and diverse facial expressions.

[4] Drobyshev, Nikita, et al. Emoportraits: Emotion-enhanced multimodal one-shot head avatars. (CVPR2024)

Comment

Thanks authors for the update and majority of my concerns are addressed, especially for failure and extreme cases analysis.

Comment

Thank you for your kind follow-up and for acknowledging our updates. We truly appreciate your valuable feedback and support, and we will revise the paper accordingly based on your suggestions.

Review (Rating: 4)

The paper introduces RigAnyFace, a framework for facial meshes of diverse topologies. RAF deforms a neutral mesh into FACS poses to create a blendshape rig, using a triangulation-agnostic DiffusionNet backbone augmented with two novel components: (1) a global encoder to handle disconnected parts (2) FACS conditioning blocks to integrate action-unit parameters. To address limited 3D supervision, RAF leverages 2D supervision generated via a pretrained 2D face-animation model for unrigged neutral meshes. This strategy scales training data, enhancing generalization. Experiments show RAF outperforms prior work on artist-crafted and in-the-wild meshes in accuracy and topological flexibility.

Strengths and Weaknesses

Strengths:

  • Robust technical design: effectively handles disconnected components via the global encoder and achieves precise expression control through FACS conditioning blocks.
  • Superior performance: outperforms NFR by >45% in MAE on humanoid heads and generalizes to non-humanoid/out-of-distribution meshes.
  • Significant practical impact: eliminates template dependency, supports multi-component meshes, and reduces 3D data reliance via scalable 2D supervision, enabling applications in text-to-3D rigging and video retargeting.
  • Pioneering contributions: the first to support disconnected components in auto-rigging and to introduce 2D supervision for 3D deformation networks.

Weaknesses:

  • Global feature bottleneck may compress fine details into a single vector, limiting performance on complex meshes.
  • Degraded performance on shell-like meshes occurs when inputs lack fine geometric details.
  • Unclear handling of stylized meshes in the 2D supervision pipeline; fine-tuning details for non-standard topologies need clarification.
  • Suboptimal inference speed could hinder real-time deployment.
  • Moderate originality, as core innovations build incrementally upon DiffusionNet.

Questions

RAFT optical flow and MegActor may struggle with highly stylized/non-humanoid meshes. How significant is the domain gap for these models? Provide error analysis of generated 2D data vs. artist annotations.

The global encoder compresses all mesh components into one vector, potentially losing part-specific details. Could a hierarchical encoder improve detail preservation?

How does RAF handle cases where the input mesh contains severe discretization artifacts, such as shell-like geometry or low-resolution scans? Are there any preprocessing recommendations or failure cases to be aware of? Would data augmentation or mesh-denoising pre-processing mitigate this?

Limitations

Please see the Questions section.

Final Justification

The rebuttal has addressed my concerns about the evaluation issues. I am happy to upgrade my score to borderline accept. The additional discussions and experiments should be included in the revised manuscript or supplementary materials.

Formatting Concerns

NA

Author Response

We sincerely appreciate the reviewer’s constructive feedback and recognition of the key strengths of our work, including its support for meshes with multiple disconnected components and the scalable training enabled by unlabeled data. Below, we address the reviewer’s questions and concerns point by point.

1. Global features are compressed into a single vector, potentially losing details.

The primary role of the global feature is not to capture fine-grained geometric details, but rather to encode the relative positions between disconnected components. This information is crucial for preventing penetrations during deformation, as the Diffusion Blocks alone cannot propagate information across disconnected mesh components.

To quantify this, we evaluate the penetration between inner components (e.g., mouthbag) and the outer face surface by reporting the percentage of penetrating vertices with and without our global feature. Specifically, we compute the signed distance function (SDF) of each vertex on the inner component with respect to the outer face surface in both the neutral and predicted meshes. Vertices are considered to be penetrating when their SDF values change from negative in the neutral mesh to positive in the deformed/predicted mesh. As shown in the table below, involving the global feature significantly reduces the number of such penetrating vertices. Please also refer to Fig. 5A (first column) in the main paper for the qualitative comparison.

Regarding the concern about capturing geometric details, the final deformation predictions are contributed by both the global feature and per-vertex features from the Diffusion Blocks, enabling accurate deformation even for complex meshes. For instance, while the training-set meshes have 1,704 vertices and 3,267 faces on average, our model generalizes well to in-the-wild meshes with significantly higher complexity, averaging 9,497 vertices and 17,958 faces, as shown in Fig. 7 of the main paper. Given the effectiveness of this setup, we found that a more complex global encoder (e.g., a hierarchical encoder) might not be necessary.

| Global Encoder | All Other Components | MAE ↓ | MAE Q95 ↓ | Penetration ↓ |
|:---:|:---:|:---:|:---:|:---:|
| ✗ | ✗ | 2.14 | 6.64 | 0.377 |
| ✗ | ✓ | 2.16 | 6.08 | 0.405 |
| ✓ | ✗ | 2.08 | 5.84 | 0.166 |
| ✓ | ✓ | 1.92 | 5.63 | 0.173 |
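
As an illustration of how a pooled global feature can be combined with per-vertex features, a highly simplified sketch is given below (this toy module and its parameters are our illustration, not the actual RAF architecture):

```python
import torch
import torch.nn as nn

class GlobalConditionedHead(nn.Module):
    """Toy head: combine per-vertex surface features, an average-pooled global
    feature, and FACS activations to predict per-vertex displacements."""

    def __init__(self, feat_dim=128, facs_dim=52, hidden=256):
        super().__init__()
        self.global_mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.offset_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + facs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, vert_feats, facs):
        # vert_feats: (V, feat_dim) per-vertex features from the surface backbone
        # facs: (facs_dim,) activations of the target FACS pose
        g = self.global_mlp(vert_feats.mean(dim=0))    # average-pooled global feature
        num_v = vert_feats.shape[0]
        cond = torch.cat([vert_feats, g.expand(num_v, -1), facs.expand(num_v, -1)], dim=-1)
        return self.offset_mlp(cond)                   # (V, 3) vertex offsets
```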

2. Domain gap of RAFT and MegActor’s pretrained models on stylized/non-humanoid meshes. Provide error analysis.

Yes, there is indeed a domain gap when directly applying the pretrained RAFT and MegActor models to stylized head mesh renderings. To address this, we fine-tuned both models using rendered ground-truth optical flow and expression images from our artist-annotated training set, as described in L222–224 of our paper.

For error analysis, we used similarly rendered ground-truth data from the artist-annotated test set, with all images rendered at 512×512 resolution. The fine-tuned RAFT achieves an end-point error (EPE, lower is better) of 0.99, i.e., the average per-pixel Euclidean distance between predicted and ground-truth optical flow vectors is equal to 0.99. The fine-tuned MegActor yields a mean absolute error (MAE) of 0.00346 between generated and ground-truth images, computed over RGB values normalized to [0, 1]. Please also refer to Figure 1 in our supplementary material for qualitative comparisons of the generated 2D supervisions.

These results indicate that, after fine-tuning, both RAFT and MegActor handle stylized or non-humanoid meshes with high accuracy and show strong robustness against domain gaps between realistic and stylized inputs.
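
For reference, the two reported metrics can be computed as follows (a minimal sketch with illustrative array names):

```python
import numpy as np

def end_point_error(pred_flow, gt_flow):
    """Average per-pixel Euclidean distance between (H, W, 2) flow fields."""
    return np.linalg.norm(pred_flow - gt_flow, axis=-1).mean()

def image_mae(pred_img, gt_img):
    """Mean absolute error over (H, W, 3) RGB images normalized to [0, 1]."""
    return np.abs(pred_img - gt_img).mean()
```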

3. Degraded performance on shell-like meshes and meshes with poor discretization. How does RAF handle them?

We will include qualitative examples in the revision to show that our model typically leaves such meshes undeformed. That said, we would like to clarify that such meshes are not typical inputs for facial rigging pipelines and are generally not considered by prior methods. Shell-like meshes (e.g., deformed spheres) often depict facial features such as eyeballs and mouths using textures instead of explicit geometry, making them ill-suited for animating expressions like eye closure or mouth opening. Meshes with poor discretization, where the main facial surface can break into multiple disconnected pieces, are very challenging to rig and annotate, even for professional artists. We include them in our evaluation to explore the robustness and limitations of our approach beyond its intended scope.

4. Are there any preprocessing recommendations for failure cases?

For poorly discretized meshes, we experimented with preprocessing such as remeshing to improve mesh quality. However, we found that remeshing often mistakenly connects disconnected components (e.g., eyeballs and mouth bags) to the main facial mesh, which deteriorates the pose prediction. For instance, animation of eye gazing requires the eyeballs to deform independently within the eyelid region. If these parts are merged, the model would struggle to predict correct deformations. As a future direction, incorporating a diffusion operator defined on a high-quality background triangulation [1] may help with these failure cases.

[1] Nicholas Sharp, Yousuf Soliman, and Keenan Crane. 2019. Navigating intrinsic triangulations. ACM Trans. Graph. 38, 4 (2019)

5. Suboptimal inference speed could hinder real-time deployment.

As discussed in L242–L244 of our paper, our model is lightweight, with only 5.4M parameters. It performs a single forward pass to generate all the FACS poses to form a blendshape rig for each input mesh, a process that takes on average 8.72s on an Apple M2 Max CPU and 3.1s on an Nvidia T4 GPU (test set average: 1,750 vertices and 3,362 faces).

Importantly, this process is conducted offline. Once the blendshape rig is generated, animations can be executed in real-time at deployment using standard linear blendshape interpolation, achieving over 1000 FPS on an Apple M2 Max CPU in our testing.

6. Moderate originality as core innovations build incrementally upon DiffusionNet.

Our architectural design is specifically tailored to address facial rigging challenges that previous neural methods have not effectively handled. The global feature enables information flow across disconnected components, and FACS conditioning allows expression-aware deformation. These designs are essential components of our auto-rigging framework. More importantly, our design of learning 3D deformation with 2D supervision enables scalable training on unlabeled data.

Taken together, the architectural innovations and our scalable training framework form a unified and practical solution for real-world auto-rigging across diverse mesh topologies with multiple disconnected components, which we believe constitutes a meaningful step beyond prior work.

Comment

Thanks for the detailed response.

The rebuttal has addressed my concerns about the evaluation issues. The additional ablation studies on the global features provided make the design of the proposed solution clearer.

I have also read the reviews from other reviewers. I am happy to upgrade my score to borderline accept. It is recommended to include the discussions and experiments in the revised manuscript or supplementary materials.

Comment

Dear reviewer U3KH,

Could you at least comment on how the authors' response was insufficient to convince you? Given that other reviewers are supportive of this submission, your participation in the discussion is crucial to improving the paper and the final decision.

AC

Comment

We sincerely thank the Area Chair for facilitating the discussion and supporting a thorough review process.

We also appreciate Reviewer U3KH for the follow-up and for reconsidering the evaluation. We appreciate the constructive feedback and will incorporate the suggested discussions and experiments into the revised manuscript.

Final Decision

Strengths

  • Builds on top of DiffusionNet for topology-agnostic facial rigging
  • Overcomes the limitation of DiffusionNet with a global encoder to communicate holistic face characteristics across disconnected mesh components, e.g., base face mesh and eyeballs
  • 2D supervision pipeline with fine-tuned optical flow model for training with unlabelled data
  • SOTA results compared to NFR

Weaknesses

  • (debatable) Incremental work on top of DiffusionNet

During the discussion, the authors have clarified the claims and novelty of the paper. The additional experiments, especially the ablations of the global encoder, will strengthen the paper.

Therefore, I recommend accepting the paper.