PaperHub

Average rating: 5.5/10 · Decision: Rejected · 4 reviewers
Ratings: 6, 5, 6, 5 (min 5, max 6, std 0.5) · Confidence: 4.3
ICLR 2024

A Large-Scale 3D Face Mesh Video Dataset via Neural Re-parameterized Optimization

OpenReview · PDF
Submitted: 2023-09-15 · Updated: 2024-02-11
TL;DR

Neural re-parameterized 3D face mesh optimization method and reliable 3D face mesh pseudo annotations on large-scale facial videos

Abstract

Keywords
3D Face video dataset; Neural Re-parameterization; Optimization

Reviews and Discussion

Review (Rating: 6)

The paper proposes a novel method, i.e. the neural re-parameterized optimization, for optimizing FLAME parameters from a video sequence. Then, the paper applies the proposed technique to several large-scale in-the-wild video datasets. The fitting results form a novel dataset called NeuFace-dataset, which is another contribution of the paper. The paper also demonstrates the usage of the proposed dataset on face reconstruction and motion-prior learning.

Strengths

  • I do like the proposed neural re-parameterized technique for optimizing FLAME parameters; it provides insights for me. Sparse gradients are not desired for geometry optimization. By optimizing the neural network weights to indirectly optimize the FLAME geometry, dense gradients can be obtained from a sparse landmark loss, as shown in Fig. 3. However, in the context of optimizing per-vertex displacements (a task more challenging than optimizing 3DMM parameters), similar techniques [1][2] have been proposed to obtain dense gradients. I suggest the authors discuss these more relevant works in the main paper.
  • The proposed dataset would benefit future research in this field.

[1] Neural head avatars from monocular rgb videos, CVPR 2022.

[2] Large steps in inverse rendering of geometry, SIGGRAPH Asia 2021.
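The re-parameterization idea praised above can be illustrated with a minimal sketch. The following NumPy toy (our illustration with made-up dimensions and a linear "network"; not the paper's code) shows how a sparse landmark loss still produces a dense gradient once the per-frame parameters are re-parameterized by network weights:

```python
import numpy as np

# Hypothetical sketch: rather than optimizing each frame's low-dimensional
# parameters theta directly, optimize the weights W of a (here linear)
# network theta = x @ W conditioned on image features x. The sparse landmark
# loss then back-propagates into *every* entry of W.

rng = np.random.default_rng(0)
B, F, P, K = 4, 8, 12, 5           # frames, feature dim, param dim, landmarks
X = rng.normal(size=(B, F))        # toy image features for 4 video frames
W = rng.normal(size=(F, P)) * 0.1  # "network" weights (optimized variables)
proj = rng.normal(size=(P, K))     # toy linear "params -> landmarks" map
Y = rng.normal(size=(B, K))        # detected sparse landmarks (toy targets)

def loss_and_grad(W):
    resid = X @ W @ proj - Y                           # (B, K) landmark residuals
    loss = (resid ** 2).mean()
    grad = (2.0 / resid.size) * X.T @ resid @ proj.T   # dense (F, P) gradient
    return loss, grad

losses = []
for _ in range(200):               # plain gradient descent on W
    loss, grad = loss_and_grad(W)
    W -= 1e-3 * grad
    losses.append(loss)
```

Even though only K = 5 landmarks supervise each frame, every weight in W receives a non-zero gradient, which is the densification effect the reviewer highlights.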

Weaknesses

My main concerns are listed as follows.

  1. I agree that the paper proposes the first method to introduce the neural re-parameterized technique to optimize FLAME parameters. As far as I know, previous work [1] has already introduced this technique to solve a very relevant (actually more challenging) task; they adopt a neural network to re-parameterize the per-vertex displacement of a FLAME mesh. However, the submission does not discuss this very relevant work.

  2. I think the video FLAME fitting algorithm is not well-designed. One of the contributions of the paper is to use the proposed fitting algorithm to provide pseudo GT for the video dataset, so I think it is necessary to push the quality of the fitting method as high as possible. I list some questions about the design choices here:

  • Why not use a photometric loss term? Many previous works have demonstrated that photometric loss can improve the geometry reconstruction quality, see Figure 6 in [2]. I'm curious when photometric loss is used, will the proposed neural re-parameterized technique still improve the results a lot?
  • Why not use a shared identity code?
  • From the supp. video, I find that the fitting results are still jittering, although better than the naive baseline of applying DECA to each frame separately. More advanced strategies beyond the temporal moving average should be exploited, like the commonly used optical flow loss, or stabilizing the detected landmarks [3].
  • Thus, it gives me the impression that the paper does not do its best to improve the video-fitting results. I would like to hear from the authors to change my impression.
  3. Applying DECA to each frame is too weak to serve as a baseline for video fitting. There are many previous works [4-6] that fit a 3DMM to a video. Why not compare with these stronger baselines specialized in video fitting? In the supp. material, the method is compared to MICA-T, but I think more competitors are expected; at least they should be discussed in the paper. So I think the paper does not evaluate the proposed technique for video fitting well.

[1] Neural head avatars from monocular rgb videos, CVPR 2022.

[2] State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications

[3] High-Resolution Neural Face Swapping for Visual Effects

[4] 3D Shape Regression for Real-time Facial Animation

[5] Real-time high-fidelity facial performance capture

[6] Face2Face: Real-time face capture and reenactment of rgb videos

Questions

See Weaknesses.

Comment

We thank the reviewer for the time and constructive comments that strengthen our paper. We discuss the concerns and questions of the reviewer below and in the revision (please check pdf, highlighted in pink).

Please let us know if our answers satisfy the reviewer’s concerns. We would be happy to provide further discussions and clarifications.

Weaknesses

W1. A previous work [C1] has already introduced this technique to solve very relevant (actually more challenging) tasks; they adopt a neural network to re-parameterize the per-vertex displacement of a FLAME mesh. However, the submission does not discuss these very relevant works.

We appreciate the reviewer for recommending the reference. We have revised the paper to include the discussion about the suggested reference in the 4th paragraph of Sec. 2 of the revision, highlighted in pink.

However, please note that our work and [C1] focus on very different perspectives:

  1. We first remark that, in contrast to ours, Neural Head Avatars [C1] does not re-parameterize FLAME parameters with neural networks, but directly optimizes FLAME parameters, which corresponds to the baseline we presented.
  2. In contrast, we provide an empirical analysis of the gradient densification (Sec. 3.3, p. 5 in the initial submission and Sec. A.1, p. 17 in the Appendix), which can be attributed to the proposed neural re-parameterization and has not been explored in the field. While [C2] also tries to obtain dense gradients for geometry optimization, it uses a different technique, pre-conditioning, which differs from our neural re-parameterization. Also, we briefly showed the robustness and high-probability convergence of the optimization to global optima in the theoretical proof sketch of the initial submission (Sec. A.2, p. 18 in the Appendix).
  3. They parameterize residual fine-detail face geometry using an MLP, which is outside our scope of reconstructing detailed face shapes. Our scope is to reconstruct facial geometries that comply well with the input facial gestures and motions, as discussed in the response to W1 of Reviewer axFo.

Given these, [C1,C2] do not diminish our contribution. We thank the reviewer again for suggesting the references.

[C1] Grassal et al., Neural head avatars from monocular rgb videos. In CVPR 2022.

[C2] Nicolet et al., Large Steps in Inverse Rendering of Geometry. In SIGGRAPH ASIA 2021.


W2. The video FLAME fitting algorithm is not well-designed.

  • W2-1. Why not use a photometric loss term?

As requested, we report the effect of the photometric loss ($\mathcal{L}_{photo}$) on the metrics for the MEAD dataset used in the main paper. We compare optimizations that use three different loss configurations:

| Loss configuration | $\text{MSI}_{3D}^L$ (↑) | $\text{MSI}_{3D}^V$ (↑) | CVD (↓) | NME (↓) |
|---|---|---|---|---|
| NeuFace (Original) | 0.206 | 0.305 | 0.103 | 2.58 |
| NeuFace + $2\cdot\mathcal{L}_{photo}$ | 0.205 | 0.299 | 0.106 | 2.62 |
| NeuFace + $5\cdot\mathcal{L}_{photo}$ | 0.195 | 0.282 | 0.112 | 3.69 |

We tried this in our preliminary experiments and, even with tuning of the balancing weight, observed negligible effects from $\mathcal{L}_{photo}$. Using $\mathcal{L}_{photo}$ rather degrades the performance, and a larger weight on $\mathcal{L}_{photo}$ results in more degradation.

We postulate that, since $\mathcal{L}_{photo}$ is designed to enforce color consistency, i.e., an albedo assumption, it is known to be very prone to self-shadows, non-Lambertian reflections under lighting, and noise in videos, all of which distract the optimization. Furthermore, it takes more than twice the GPU memory of the original configuration due to the required differentiable rendering. Thus, we claim our design choice, despite its compact combination of losses, is sufficient to reconstruct dynamic 3D faces.

We have newly added the results and discussion in the revision (Sec. A.4, p. 21 in the Appendix).
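For concreteness, the loss configurations compared in this experiment amount to a weighted sum of terms. The following sketch is our illustration only (the function and argument names are assumptions, not the authors' exact code):

```python
# Hypothetical sketch of the compared loss configurations as a weighted sum.
# "NeuFace (Original)" corresponds to w_photo = 0; the ablations above add
# w_photo * l_photo with w_photo in {2, 5}.
def total_loss(l_landmark: float, l_temporal: float, l_multiview: float,
               l_photo: float = 0.0, w_photo: float = 0.0) -> float:
    return l_landmark + l_temporal + l_multiview + w_photo * l_photo
```

With toy values, `total_loss(1.0, 0.5, 0.5)` gives the baseline objective, and passing `w_photo=2.0` or `w_photo=5.0` reproduces the two ablated configurations.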

Comment
  • W2-2. Why not use a shared identity code?

We tried this in our preliminary experiments but found negligible differences in our early tests. In a newly conducted investigation, we found that the standard deviation of the identity codes (β) across all frames gradually decreases and converges during NeuFace optimization.

| Optim. step | 0 | 25 | 50 | 75 | 100 (End) |
|---|---|---|---|---|---|
| β std. | 3.152 | 1.489 | 1.372 | 1.370 | 1.366 |

This means that the identity codes across video frames and views automatically converge, although we do not explicitly force shared identity codes with additional losses. Accordingly, in the visual results, we obtain identical results with both 1) the original NeuFace optimization and 2) NeuFace optimization with a consistent β.

We have revised the paper and added this discussion and the results in the revision (Sec. A.4, p. 21 in the Appendix).
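The convergence diagnostic in the table above can be reproduced in spirit with a short sketch (our illustration, not the authors' code): measure the spread of per-frame identity codes and check that it shrinks as the codes converge.

```python
import numpy as np

# Hypothetical diagnostic: the spread of per-frame identity codes beta should
# shrink during optimization, even without an explicit shared-identity loss.

def identity_spread(betas: np.ndarray) -> float:
    """betas: (num_frames, beta_dim) identity codes for all frames.
    Returns the mean per-dimension standard deviation across frames."""
    return float(betas.std(axis=0).mean())

rng = np.random.default_rng(0)
# Step-0-like codes: widely spread across frames.
early = rng.normal(scale=3.0, size=(100, 100))
# Converged codes: all frames cluster tightly around a shared identity.
late = early.mean(axis=0) + rng.normal(scale=0.1, size=(100, 100))
```

The spread of `late` is far below that of `early`, mirroring the drop from 3.152 to 1.366 reported in the table.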

  • W2-3. More advanced strategies beyond the temporal moving average method should be exploited, like the commonly used optical flow loss, or stabilizing the detected landmarks [C3].

We thank the reviewer for suggesting diverse methods for improving each component of our optimization. As discussed in the response to [W2-1], our design choice, despite its compact combination of losses, is sufficient to reconstruct dynamic 3D faces. It is also sufficient to validate the idea of neural re-parameterization with bootstrapping. Thus, adding more losses would not change the message and findings of our work, which is the goal of an academic paper.

On the other hand, we agree with the reviewer’s point that using advanced losses would increase the accuracy. Note that we can readily add more advanced losses for our proposed optimization. Given the limited rebuttal period, we’ll investigate other advanced combinations of methods later, and add the relevant discussion in the camera-ready version.

[C3] Naruniec et al., High-Resolution Neural Face Swapping for Visual Effects. In Eurographics Symposium on Rendering 2020.


W3. Why not compare with stronger baselines specialized in video fitting? In the supp. material, the method is compared to MICA-T, but I think more competitors are expected.

We compared with MICA+Tracker as the representative competitor for the following reasons:

  1. MICA+Tracker [C4] is the most competitive video fitting work and has already demonstrated surpassing performance over Face2Face [C5] and other previous baselines [C6,C7]. Ours performs better than MICA+Tracker, which implies ours also performs better than [C5,C6,C7].
  2. For [C8,C9], the codes are not publicly available. During the discussion period, we reached out to the authors of [C8,C9] to request the code, but did not get a reply. Their quantitative results also cannot be directly quoted, because the benchmarks they present are not shared with either MICA or ours.

Despite our efforts, we could not manage to add them. Instead, we have newly added the references in the revision (Sec. C.2, p. 24 in the Appendix).

[C4] Zielonka et al., MICA - Towards Metrical Reconstruction of Human Faces. In ECCV 2022.

[C5] Thies et al., Face2Face: Real-time face capture and reenactment of rgb videos. In CVPR 2016.

[C6] Deng et al., Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In CVPRW 2019.

[C7] Feng et al., Learning an animatable detailed 3D face model from in-the-wild images. In SIGGRAPH 2021.

[C8] Cao et al., 3D Shape Regression for Real-time Facial Animation. In SIGGRAPH 2013.

[C9] Cao et al., Real-time high-fidelity facial performance capture. In SIGGRAPH 2015.

Comment

Thanks for providing more experiments; they addressed most of my concerns about the experimental setup. However, my main concern remains the main technical contribution, i.e., the re-parameterized optimization.

  1. Is the proposed re-parameterized technique still useful if the FLAME fitting energy involves a dense term, i.e., the photometric loss? If the proposed technique only benefits the "landmark loss + regularization" fitting setup, I think it is not a significant step, as most modern 3DMM fitting algorithms involve a photometric loss. I suggest the authors include more experiments in the revised paper to demonstrate their technique can benefit the general 3DMM fitting problem, not only the "landmark loss + regularization" setup.

  2. I am still not sure about the significance of the proposed re-parameterized optimization, as Neural Head Avatars has already applied it to a more challenging problem, optimizing per-vertex displacements. However, I also noticed that the Neural Head Avatars paper does this implicitly; they did not even mention that the re-parameterized technique can densify the gradient to benefit optimization. I need to discuss with the other reviewers to re-evaluate the significance of the proposed technique.

Comment

We thank Reviewer B8S8 for the reply. Here, we address the reviewer’s concerns.

  1. Our proposed neural re-parameterized optimization is indeed useful with the dense energy terms, e.g., the photometric loss.

    As we discussed, the limitations lie within the photometric loss itself, which assumes albedo consistency, not within our optimization method. This point was clearly mentioned in our response. Although in a different context, the limitations of the photometric loss are well known; e.g., the recent work [C10] also pointed out its limitations due to the albedo consistency assumption, which does not hold for human facial skin with viewpoint-varying radiance.

    [C10] Ge et al., Ref-NeuS: Ambiguity-Reduced Neural Implicit Surface Learning for Multi-View Reconstruction with Reflection. In ICCV 2023 (Oral paper).

  2. Our neural re-parameterization induces an image-conditioned gradient and densifies it by conditioning on the input image (discussed as a data-dependent gradient in Sec. 3.3 and Fig. 3 of the main paper).

    In contrast, Neural Head Avatars models a different component (per-vertex residual displacement) in a manner irrelevant to the image context, conditioning only on detached pose parameters. Also, as the reviewer mentioned, they did not identify or report the benefit of their parameterization.

    The goal, application, and contribution of Neural Head Avatars differ from ours and are largely independent, so the two works should not be pitted against each other to diminish each other's contributions; that would be unfair.

If the reviewer has any remaining concerns, we’d be happy to discuss them.

Comment

Dear reviewer B8S8,

We think that all the comments by the reviewer have been addressed. If there is no other concern, we would like to respectfully ask the reviewer to re-assess our work and consider increasing the rating.

Comment

Thanks for the rebuttal. I think my concerns are addressed. I would change my rating to weak accept now.

Comment

We thank reviewer B8S8 for the response and for raising the scores.

We are pleased that our responses and the additional experiments addressed the reviewer's concerns and questions. Reviewer B8S8's comments indeed helped improve our paper, and we are glad that the reviewer acknowledged our work's contributions.

Thank you.

Best regards,

Authors of 12

Review (Rating: 5)

This article introduces a new video dataset with 3D face mesh pseudo-labels and provides a method for annotating spatio-temporally consistent 3D face meshes for existing multi-view facial video data. Based on the results provided by the authors, this dataset is valuable for related research.

Strengths

The dataset introduced by the authors exhibits clear advantages in terms of data quantity, annotation accuracy, and spatio-temporal consistency, as evidenced by the provided data examples. These strengths are valuable for advancing research in the relevant field. Additionally, the optimization method proposed for achieving spatio-temporal consistency seems effective.

Weaknesses

  1. Unfair Comparison: The comparison with methods like DECA and EMOCA, which operate on single-view data (DECA-dataset and EMOCA-dataset), cannot utilize multi-view information. It can be argued that the proposed method leverages more information by utilizing multi-view data. Therefore, comparing the proposed multi-view approach to these single-view reconstruction methods may not provide a fair evaluation.

  2. The novelty is limited. The proposed temporal-consistency loss and multi-view-consistency loss seem more like separate regularizations (or averages) applied to pose, camera parameters, or face shape and expression coefficients to achieve reduced jitter in the reconstructed videos.

I have doubts about the effectiveness of the multi-view-consistency-loss. In the training set, only the MEAD dataset consists of multi-view video data, while VoxCeleb2 and CelebV-HQ have only single-view video data. Consequently, it appears that only the MEAD dataset can effectively leverage the multi-view consistency loss. Table 1 illustrates that MEAD comprises a mere 1% of the total duration, suggesting that the majority of the proposed NeuFace-dataset primarily derives from VoxCeleb2 and CelebV-HQ. In essence, it seems to be a data processing outcome achieved by applying inter-frame smoothing to existing methods. Although I appreciate the authors' effort and the contribution of NeuFace-dataset to the community, the paper's level of innovation may fall slightly below the standard typically expected at ICLR.

Questions

  1. As mentioned in the paper, the proposed dataset contains a large amount of data, and the preliminary 3D mesh results generated based on DECA (EMOCA) may have errors. Have the authors considered how to filter out failed reconstruction results?
  2. The quality of the reconstructed results for extreme facial expressions appears suboptimal. For instance, in Figure 5's top-left corner, where the open-mouth expression is depicted, the reconstruction of the mouth region does not seem consistent with the original input. Additionally, there appear to be imperfections in the reconstruction of closed-eye expressions.
  3. Given the analysis above, while the dataset's scale is certainly commendable, there seems to be room for improvement in terms of reconstruction accuracy. It might be worthwhile for the authors to consider utilizing such data as annotations for 3D landmarks rather than 3D mesh data. Additionally, have the authors explored the possibility of applying their proposed method to a different face model, such as the Basel Face Model (BFM), or investigating alternative pre-trained models instead of DECA or EMOCA?
Comment

W3. Effectiveness of the multi-view consistency loss. In the training set, only the MEAD dataset consists of multi-view video data, while VoxCeleb2 and CelebV-HQ have only single-view video data. Consequently, it appears that only the MEAD dataset can effectively leverage the multi-view consistency loss.

We would like to remind the reviewer that our work focuses on the effectiveness of our proposal, the neural re-parameterization with bootstrapping (temporal and multi-view consistency).

Thus, we show that

  1. Even with temporal bootstrapping alone on the majority of the data, we demonstrate the effectiveness of our NeuFace dataset by significantly enhancing 3D reconstruction quality on the public benchmark (Table 3a, Sec. 5.1 in the initial submission).
  2. Also, Table 3a, Sec. 5.1 in the initial submission shows that the model trained with our NeuFace dataset significantly improves 3D reconstruction quality for extreme side-view cases, where temporal consistency alone is weak while multi-view consistency is effective. This suggests that even with the small portion of data involving multi-view consistency, as the reviewer mentioned, the multi-view consistency helps.

Regardless of which bootstrapping (temporal or multi-view consistency) is used, our conclusion that neural re-parameterization with bootstrapping is effective does not change. Also, while multi-view consistency is important, leveraging large-scale monocular video data is a practical consideration for the scalability of the dataset.

We will consider processing and enlarging the portion of multi-view datasets, e.g., HUMBI [C5], to provide FLAME parameters when releasing the dataset.

[C5] Yu et al., HUMBI: A Large Multiview Dataset of Human Body Expressions. In CVPR 2020.

Questions


Q1. The proposed dataset contains a large amount of data, and the preliminary 3D mesh results generated based on DECA (EMOCA) may have errors. Have the authors considered how to filter out failed reconstruction results?

Our NeuFace optimization is robust to the initial 3D face estimation, so a small number of failures does not affect the performance. In a newly conducted experiment, we compared the multi-view consistency of the final reconstruction in two scenarios: 1) standard cases (original images) and 2) extreme cases (perturbed images). For the extreme cases, we randomly perturb the facial areas with large black boxes in 2–3 views of the multi-view videos to mimic significant corruption scenarios.

| Loss configuration | Standard cases (CVD ↓) | Extreme cases (CVD ↓) |
|---|---|---|
| Weighted average (ours) | 0.103 | 0.113 |

The results show that our optimization is robust to initial failures occurring in the extreme cases.

Furthermore, to filter out any remaining erroneous samples or outliers, we conducted 1) automatic filtering and 2) manual human verification when constructing our NeuFace dataset. Please refer to Sec. B in the appendix of the initial submission.
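The "weighted average" row above suggests that corrupted views are down-weighted when forming a multi-view consensus. The following is a hypothetical sketch of such confidence-weighted averaging (the weighting scheme and names are our assumptions, not the paper's exact procedure), showing why a few failed initial estimates barely move the consensus:

```python
import numpy as np

# Hypothetical confidence-weighted multi-view average: views with low
# confidence (e.g., occluded by the black-box perturbation) contribute little.
def weighted_view_average(view_params: np.ndarray,
                          view_conf: np.ndarray) -> np.ndarray:
    """view_params: (num_views, dim) per-view parameter estimates.
    view_conf: (num_views,) non-negative confidences."""
    w = view_conf / view_conf.sum()
    return (view_params * w[:, None]).sum(axis=0)

good = np.tile(np.array([1.0, 2.0]), (5, 1))   # 5 agreeing views
bad = np.array([[50.0, -50.0]])                # 1 badly corrupted view
params = np.vstack([good, bad])
conf = np.array([1.0] * 5 + [0.01])            # corrupted view down-weighted
avg = weighted_view_average(params, conf)      # stays close to [1.0, 2.0]
naive = params.mean(axis=0)                    # dragged far off by the outlier
```

The weighted consensus stays near the agreeing views, while the unweighted mean is dominated by the single corrupted estimate.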


Q2. The quality of the reconstructed results for extreme facial expressions appears suboptimal.

The failure cases stem from the limited performance of the off-the-shelf 2D landmark detection algorithm on extreme facial expressions. Since human annotations of 2D landmarks are relatively cheap compared to other signals, we think using better 2D landmarks can mitigate this limitation. We newly added this discussion in the revision (Sec. E, p. 24 in the Appendix).

Also note that these failure cases are rare, and we have shown our annotation method's effectiveness even in their presence by enhancing the existing 3D reconstruction model with our dataset (Table 3 in Sec. 5.1). We hope the reviewer acknowledges this perspective.

Comment

Q3-1. Room for improvement in reconstruction accuracy; It might be worthwhile for the authors to consider utilizing such data as annotations for 3D landmarks rather than 3D mesh data.

Thank you for the suggestion regarding the use of our dataset for another task, 3D landmark reconstruction. Our dataset includes accurate pseudo 3D mesh data along with corresponding 3D landmarks, which indeed makes it a valuable resource for this purpose. Due to the limited rebuttal time, we aim to include the results of these experiments in the camera-ready version of our paper.

Q3-2. Have the authors explored the possibility of applying their proposed method to a different face model, such as the Basel Face Model (BFM), or investigating alternative pre-trained models instead of DECA or EMOCA?

We think our losses and optimization are applicable to other 3DMMs and neural models. We chose FLAME since it is the most recent and popular 3D linear face model.

For alternative pre-trained models, we can use our method, and the same benefits would hold as long as the model predicts low-dimensional parameters and the neural network is overparameterized (from Proposition 1 in Sec. 3.3). Also, using a neural model with better accuracy would provide better initial points, leading to faster convergence of NeuFace optimization.

Comment

We thank the reviewer for the time and the thorough review, which helped us improve our paper. We address the concerns and the questions below and in the revision (please check pdf, highlighted in pink).

Please let us know if our answers satisfy the reviewer’s concerns. We would be happy to provide further discussions and clarifications.

Weaknesses


W1. Unfair Comparison with DECA/EMOCA, which operates on single-view data.

We confirm that the experiments were fair.

Our NeuFace optimization is a test-time optimization method, i.e., a method that updates the model to fit the deployment environment at test time, without access to training data [C1]. Test-time optimization methods use losses with self-supervision or pseudo-supervision available at test time. Prior art [C1,C2,C3,C4] also compared proposed test-time optimization methods and test-time losses against baseline methods that do not involve test-time optimization; this is regarded as a fair and valid comparison.

When evaluating test-time optimization methods, the configuration of the pre-trained backbone network is important. Our NeuFace optimization starts from the same pre-trained DECA/EMOCA checkpoints as the compared baselines, which makes our settings and comparison fair.

Also, if there is a specific setting that the reviewer considers fair for comparing with DECA/EMOCA, please leave a comment; we would be happy to discuss it further and are willing to include it.

[C1] Yi et al., Temporal Coherent Test-Time Optimization for Robust Video Classification. In ICLR 2023.

[C2] Zeng et al., Test-Time Optimization for Video Depth Estimation Using Pseudo Reference Depth. In Computer Graphics Forum 2023.

[C3] Luo et al., Consistent Video Depth Estimation. In SIGGRAPH 2020.

[C4] Schneider et al., Improving robustness against common corruptions by covariate shift adaptation. In NeurIPS 2020.


W2. Novelty; the proposed temporal-consistency loss and multi-view-consistency loss seem more like separate regularizations (or averages) to achieve reduced jitter in the reconstructed videos.

We have not claimed novelty for our losses. While our loss configurations are simple, we obtain a dataset of sufficiently good quality thanks to our neural re-parameterized optimization and its favorable properties. We can readily add more advanced losses to our proposed optimization.

Although our loss configuration is sufficient, we share the experimental result that Reviewer B8S8 requested: NeuFace optimization + photometric loss ($\mathcal{L}_{photo}$).

Below, we report the metrics for the MEAD dataset used in the main paper. We compare optimizations that use three different loss configurations:

| Loss configuration | $\text{MSI}_{3D}^L$ (↑) | $\text{MSI}_{3D}^V$ (↑) | CVD (↓) | NME (↓) |
|---|---|---|---|---|
| NeuFace (Original) | 0.206 | 0.305 | 0.103 | 2.58 |
| NeuFace + $2\cdot\mathcal{L}_{photo}$ | 0.205 | 0.299 | 0.106 | 2.62 |
| NeuFace + $5\cdot\mathcal{L}_{photo}$ | 0.195 | 0.282 | 0.112 | 3.69 |

We observe negligible effects from $\mathcal{L}_{photo}$. The results show our minimal loss combination is sufficient and essential for reconstructing dynamic 3D faces. We have added the results and discussion in the revision (Sec. A.4, p. 21 in the Appendix).

Comment

Dear reviewer C76P,

Thank you for reviewing our work to enhance the quality of the paper. Please check our rebuttal. We think our rebuttal addressed all of the reviewer's comments. If there is anything more we can do to improve the paper, please leave a comment. We'd be happy to discuss further.

Best regards, Authors of 12

Comment

Thank you for your thoughtful and encouraging feedback. We think that all the comments by the reviewer have been addressed. If there is no other concern, we would like to respectfully ask the reviewer to re-assess our work and increase the rating.

Review (Rating: 6)

In this proposed method, a 3D face database is built based on a neural re-parameterization method. In general, the neural face representation tries to find the 3D mesh representation, through the neural network parameters, that best fits multiple views and temporally consistent faces. Multi-view and temporal consistency losses are added on top of the 2D landmark loss in the EM-like optimization process. Based on the proposed method, a significantly larger 3D face database is built using existing public 2D videos. The authors also demonstrate possible applications of the proposed database to improve 3D face reconstruction and to learn a 3D face prior.

Strengths

The proposed 3D face database is beneficial to the research community. The proposed database is significantly larger than the typical 3D face datasets. Experimental results also demonstrate good 3D estimation results.

Weaknesses

The theoretical novelty of face reconstruction using neural network parameterization is incremental.

Questions

What are the typical failure cases of the proposed method? In the EM-like optimization, does the reconstruction always converge to the right optimal result?

Comment

We thank the reviewer for the interest in our work and the valuable feedback that strengthens our paper. We address the concern below and in the revision (please check pdf, highlighted in pink).

Weaknesses


W1. The theoretical novelty of face reconstruction using neural network parameterization is incremental.

We have not claimed theoretical novelty itself.

The theoretical context in Sec. A.2 helps deepen the understanding of the analysis and algorithmic behavior of our proposed optimization system. We respectfully request the reviewer to consider our contributions below:

  • Proposing the first, insightful concept of neural re-parameterized 3D face optimization, which mitigates the undesirable sparse gradients in face optimization (acknowledged by Reviewer B8S8).
  • Providing the NeuFace-dataset, the first large-scale 3D face mesh pseudo-labels for existing large-scale 2D face video datasets. The dataset would benefit future research in this field (acknowledged by all the other reviewers).
  • Extensive experiments comparing the quality and reliability of the proposed optimization and the dataset, along with empirical analyses.

Questions


Q1. What are the typical failure cases of the proposed method?

Failure cases could occur when the 2D video itself contains extreme degradations, e.g., motion blur, low resolution, or extreme (> 50%) occlusion, so that 2D keypoint detection fails. Please note that, when constructing the NeuFace-dataset, we handle these cases with automatic filtering followed by human verification, as discussed in Appendix Sec. B of the initial submission, which guarantees the reliability of the dataset.


Q2. In the EM-like optimization, does the reconstruction always go to the right optional result?

The NeuFace optimization and its losses are designed in a self-improving manner. Although the initial estimate of DECA could be noisy, the strong measurement of detected 2D landmarks and robust target supervision at each iteration gradually correct the initial noisy predictions. The theoretical analysis shows that our neural re-parameterized optimization is highly likely to converge to global optima (Appendix Sec. A.2 of the initial submission). This suggests that, with our neural re-parameterization, the optimization is robust to noisy initializations and is guaranteed to exhibit at least more stable optimization behavior than the compared baseline methods.

Comment

My review of this paper still holds. Overall, the proposed method is straightforward, but the dataset is still a good contribution. However, the authors mentioned they specifically did QA to review challenging examples in the dataset, which is a bit concerning. Overall, I keep my original rating due to the dataset contribution.

Comment

We thank the reviewer GPao for the response. If the reviewer has any other comments or suggestions to improve our paper, we'd be happy to address them.

Comment

Dear reviewer GPao,

We would like to clarify the reviewer's last comment:

"However, the authors mentioned they specifically did QA to review challenging examples in the data set which is a bit concerning."

If by QA the reviewer meant our 'human verification process', we'd like to clarify that:

  • We did not involve the human verification process at all when assessing our dataset's quality in the experiments (Secs. 3.4, 4, and A.3, Fig. 4, Table 2, and Table S1).
  • We conducted the human verification process only when preparing the release version of our dataset.

We hope this addresses your concern about QA. If there is no other concern, we would like to respectfully ask the reviewer to re-assess our work and consider increasing the rating.

Comment

Dear Reviewer GPao,

We wish to highlight that Reviewer B8S8 has acknowledged the value of our work and updated the rating to a weak accept. We have thoroughly addressed all the comments raised by Reviewer GPao; thus, we kindly request Reviewer GPao to review our revision and the rebuttal responses once more. We hope this might prompt a reconsideration and adjustment of your rating.

Best regards, Authors of 12.

Review
5

The paper presents NeuFace, an optimization algorithm for fitting a morphable model to a sequence of multi-view face images. To this end, it refines a pre-trained NN to fit the target images. The optimized loss includes temporal and multi-view regularization terms that minimize the distance of the reconstructed mesh to the temporal moving average (temporal term) and to the aligned average across views (multi-view term). The model is iteratively refined by alternating the estimation of the reconstructions used in the regularization terms with the optimization of the network parameters that minimize the loss.
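The temporal regularization described in this summary can be sketched as follows; this is an illustrative reading of the loss, not the authors' implementation, and the momentum value and mesh shapes are assumptions:

```python
import numpy as np

def temporal_consistency_loss(meshes, momentum=0.9):
    """Penalize the distance of each frame's reconstructed vertices to a
    running (exponential moving) average over the sequence. `meshes` is a
    list of (N, 3) vertex arrays; `momentum` is an assumed hyperparameter."""
    avg = meshes[0].copy()
    loss = 0.0
    for verts in meshes[1:]:
        avg = momentum * avg + (1.0 - momentum) * verts
        loss += np.mean((verts - avg) ** 2)
    return loss / (len(meshes) - 1)
```

A jitter-free sequence incurs zero penalty, while frame-to-frame jitter is penalized in proportion to its deviation from the running average.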

The experimentation quantitatively compares the reconstructions with two competing algorithms, DECA and EMOCA, on MEAD, VoxCeleb2 and CelebV-HQ video datasets.

Finally, the algorithm is used to build the "NeuFace dataset" as the result of the reconstruction of the 3D face meshes in these three datasets.

Strengths

The paper reads well and is properly set in the research context. It addresses a relevant problem, namely 3D face landmark estimation, with many practical applications and open challenges. The paper contributes a new dataset and shows that using it may improve the accuracy of different face processing algorithms. This will be of interest to the face processing community.

Weaknesses

The paper claims to investigate the reconstruction of image-aligned facial details on 3D meshes. However, the approach is based on optimizing the parameters of a 3DMM, with the limitations of a linear model for representing fine facial details.

In the vertex accuracy evaluation experiments described in Sec. 3.4 and shown in Fig. 4, NeuFace optimization is compared with DECA and plain FLAME fitting. The paper does not describe the details of this experiment, specifically the train, validation, and test data used for evaluating each algorithm. The DECA results were produced by the plain pre-trained DECA model. We may assume that NeuFace optimization, as described in Sec. 3.2, was trained on some part of VOCASET and evaluated on a different part of it. This does not seem like a fair comparison, since DECA did not have the chance to see any part of VOCASET.

For the same reason, the quantitative comparisons in Table 2 also seem unfair, since the optimizations behind the NEUFACE-*-datasets could see part of the MEAD, VoxCeleb2, and CelebV-HQ data, whereas those behind the DECA and EMOCA datasets did not. In Sec. A.3 and Table S1 we can see that if we give DECA the chance to be refined on these datasets, the NME on MEAD drops to 2.44, much lower than the 4.65 shown in Table 2.

Questions

There are important details missing:

  • Section 3.1 It would be good if you extended the explanation by adding the dimension of each FLAME parameter. Also, the backbone network, e.g. DECA, not only estimates the 3DMM parameters, but also texture, lighting and a displacement map to model details outside the 3DMM linear model. I understand that the approach discards the texture and lighting part, but what about the displacement map?
  • Section 3.2 does not explain how the ground-truth landmarks for equation 2 were obtained. Also, in the Multi-view consistency loss it does not explain where the confidence values for each vertex come from.
  • Experiments. The paper must clearly explain what is the train/validation/test data used in each experiment and confirm that the results shown in the accuracy evaluation in Fig.4 and Table 2 are correct and fair.

Often the base pre-trained model fails dramatically. In this situation, averaging the estimated mesh with others in the regularization terms would ruin the optimization, since the average operation is not robust. Would an alternative robust operation, e.g. median, improve the results?

Ethics Concerns Details

The paper reads "Since our dataset is acquired based on the existing public video datasets (Wang et al., 2020; Chung et al., 2018; Zhu et al., 2022), all the rights, licenses, and permissions follow the original datasets." However, some of these datasets were automatically gathered from the internet. So, it is unclear whether the new dataset is legally compliant.

Comment

Q3. Would an alternative robust operation, e.g., median, improve the results?

Following the reviewer's suggestion, we have compared the mean, median, and weighted average (ours) in the multi-view loss, under two scenarios: 1) standard cases (original images) and 2) extreme cases (perturbed images). For the extreme cases, we randomly perturb 2–3 views in the multi-view videos with large black boxes over the facial areas to mimic significant corruption scenarios.

| Loss configuration | Standard cases (CVD)↓ | Extreme cases (CVD)↓ |
| --- | --- | --- |
| Average | 0.106 | 0.124 |
| Median | 0.104 | 0.112 |
| Weighted average (Ours) | 0.103 | 0.113 |
  • Results of standard cases: All three methods (mean, median, and weighted average) showed similar performance, with no marked difference in optimization results evaluated on the MEAD subset.
  • Results of extreme cases: The median outperforms the mean in perturbed MEAD data.
  • Our Method: Our methodology, which employs a weighted average grounded in the confidence scores from multiple view vertices, not only shows favorable performance over the simple average but also aligns closely with the median's results. This indicates that our approach maintains robustness in extreme cases, akin to the median operation highlighted by the reviewer.

The reviewer's suggestion about the median was insightful. Both our weighted average and the median handle extreme scenarios. One difference is that our visibility-based weighted average can adjust its robustness through hyperparameters (as in Q2-2).
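The three aggregation strategies compared above can be sketched as follows; a minimal numpy illustration, where the confidence-based weighting is a simplified stand-in for the visibility scores described in Q2-2:

```python
import numpy as np

def aggregate_views(vertices, confidences, mode="weighted"):
    """Aggregate per-view vertex estimates into one consensus mesh.
    vertices: (V, N, 3) for V views of N vertices; confidences: (V, N)."""
    if mode == "mean":
        return vertices.mean(axis=0)
    if mode == "median":
        # robust: a few corrupted views cannot drag the result far
        return np.median(vertices, axis=0)
    if mode == "weighted":
        # visibility-weighted average: corrupted/occluded views get low weight
        w = confidences[..., None]
        return (w * vertices).sum(axis=0) / np.clip(w.sum(axis=0), 1e-8, None)
    raise ValueError(f"unknown mode: {mode}")
```

With one heavily corrupted view out of three, the mean is pulled toward the outlier, while the median and the confidence-weighted average both stay near the true surface, matching the trend in the table.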


Q4. Ethics Concerns: Some of these datasets were automatically gathered from the internet. So, it is unclear whether the new dataset is legally compliant.

We will not release the video dataset itself; we will release only the optimized 3DMM parameters obtained by our method, without the video frames that might have been gathered from the internet.

The optimized 3DMM parameters do not contain identity-specific metadata or facial texture maps. Also, we will release the code that can generate pseudo ground-truth datasets like the NeuFace-dataset for generic applications.

Comment

We thank the reviewer for the time and the thorough review. By addressing the reviewer's questions and comments, we could strengthen our paper. We address the concerns and questions below and in the revision (please check the PDF, highlighted in pink).

We’d like to ask the reviewer to re-assess the value of our work with the following clarification.

Please let us know if our answers address the reviewer’s concerns. We would be happy to provide further discussions and clarifications.

Weaknesses


W1. Claims to investigate facial details. The approach is based on 3DMM with limitations of a linear model to represent fine facial details.

We would like to clarify the scope of our approach. This was a misleading point due to the terminology we used, which we have revised and toned down across the paper in this revision.

Our work targets facial details at the level of facial gestures and motions, not the mesoscopic geometry of the facial skin, which is where the confusion arose. Thus, we do not reconstruct displacement-level facial details, which are not the target of this work. We focus on 3D facial geometries that comply with the input facial gestures and motions.

We revised the overall paper and toned down our claim of "reconstructing image-aligned facial details" to "reconstructing facial geometries, well complying with input facial gestures and motions", highlighted in pink. Thanks for the comment, which helped us specify our scope better.


W2. Concerns on fair comparison and missing descriptions of experiments (Sec. 3.4, Fig. 4).

We respectfully note that this concern stems from a misunderstanding of the premise of our work. We hope the explanation below clearly addresses the points behind the reviewer's concern.

First of all, the evaluation of our NeuFace test-time optimization is indeed fair and carefully designed, as with all other test-time approaches. Specifically,

  1. Our NeuFace optimization, as elaborated in Sec. 3.2, is a TEST-TIME approach and does not involve training with train/validation/test splits. NeuFace itself only exploits the input samples at test time; no additional data is used.
  2. We use the pre-trained base model, e.g., DECA, for test-time fine-tuning on the input samples. Thus, the comparison between the pre-trained base model and our NeuFace (base model + NeuFace test-time optimization) is indeed a fair setting; they have seen exactly the same data.
  3. Our NeuFace is evaluated on each sample sequence independently. In Sec. 3.4 and Fig. 4, for NeuFace optimization, we do not split VOCASET into train/validation/test splits. We performed the test-time optimization for each sequence of VOCASET, starting from the pre-trained DECA checkpoint (pre-trained on VGGFace2 [C1], BUPT-Balancedface [C2], and VoxCeleb2 [C3]) and detected 2D landmarks (we did not use any GT information from VOCASET, including meshes or 2D landmarks). Note that the 2D landmark detections are not ground truth either. Thus, we claim that neither DECA nor ours had the chance to see any test dataset.
  4. Likewise, our method is evaluated and compared on each test sample in each dataset (including VOCASET, MEAD, VoxCeleb2, and CelebV-HQ) independently, without using their ground-truth meshes.

Our goal in the experiments of Sec. 3.4 and Fig. 4 was to compare the quality of pseudo 3D mesh annotations obtained from existing methods. In other words, the experiment in Sec. 3.4 and Fig. 4 did not aim to compare the performance of each trained model on test datasets.

In the 3D face community, recent works [C4, C5] still naively use a pre-trained DECA or FLAME fitting (baseline) as the annotator. In our experiments, we wanted to show how accurate and reliable NeuFace optimization is at generating pseudo 3D face annotations compared to existing methods.

To conclude, NeuFace is not a learned model but a test-time optimization method; thus, it does not require training/validation/test sets. Therefore, we confirm that our carefully designed experiments were correct and fair.

[C1] Cao et al., VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face & Gesture Recognition (FG) 2018.

[C2] Wang et al., Racial Faces in the Wild: Reducing Racial Bias by Information Maximization Adaptation Network. In ICCV 2019.

[C3] Chung et al., VoxCeleb2: Deep Speaker Recognition. In INTERSPEECH 2018.

[C4] Ng et al., Learning to listen: Modeling non-deterministic dyadic facial motion. In CVPR 2022.

[C5] Paraperas et al., Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos. In CVPR 2022.

Comment

W3. Unfair comparison in Table 2, Sec A.3., & Table S1.

Our goal in Table 2 was to compare the quality of our pseudo 3D mesh annotations (NeuFace-D-dataset, NeuFace-E-dataset) with those obtained from existing methods (Base-dataset, DECA-dataset, EMOCA-dataset).

We emphasize that neither DECA nor ours has the chance to see any test dataset, given that 1) for Table 2, we do NOT train models on MEAD, VoxCeleb2, or CelebV-HQ at all, and 2) the 2D landmark detections are not ground truth but are estimated with an off-the-shelf module.

In Sec. A.3 and Table S1, we have compared the dataset quality obtained with different TEST-TIME loss configurations of ours. When we use only $\mathcal{L}_\text{2D}$ to refine the DECA estimations, the NME indeed improves over DECA's initial prediction. However, as mentioned in [W2], NeuFace optimization is neither learned nor fine-tuned on the MEAD, VoxCeleb2, and CelebV-HQ datasets; it is a test-time optimization method for fitting a 3D face to each video. Also, the 2D landmark detections themselves are pseudo ground truth. Therefore, we conclude that our experiment for Table S1 was correct and fair as well.

Questions


Q1-1. It would be good if you extended the explanation by adding the dimension of each FLAME parameter.

Thanks for the thoughtful comment. We have added the detailed explanation in the revision (Sec. 3.1), highlighted in pink, as:

"We use FLAME, a renowned 3DMM, as the 3D face representation. 3D face mesh vertices $\mathbf{M}$ and facial landmarks $\mathbf{J}$ for $F$-frame videos can be acquired with differentiable skinning: $\mathbf{M}, \mathbf{J} = \mathrm{FLAME}(\mathbf{r}, \boldsymbol{\theta}, \boldsymbol{\beta}, \boldsymbol{\psi})$, where $\mathbf{r} \in \mathbb{R}^{3}$, $\boldsymbol{\theta} \in \mathbb{R}^{12}$, $\boldsymbol{\beta} \in \mathbb{R}^{100}$, and $\boldsymbol{\psi} \in \mathbb{R}^{50}$ denote the head orientation, face poses, face shape, and expression coefficients, respectively."
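As a toy illustration of how a linear 3DMM combines these coefficient vectors (the dimensions follow the response; the random bases are stand-ins, not the real FLAME bases, and the pose-dependent correctives and skinning driven by $\mathbf{r}$ and $\boldsymbol{\theta}$ are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5023                                  # FLAME's vertex count
template = rng.normal(size=(N, 3))        # mean face (stand-in)
B_shape = rng.normal(size=(N, 3, 100))    # shape basis for beta (100-dim)
B_exp = rng.normal(size=(N, 3, 50))       # expression basis for psi (50-dim)

def linear_3dmm(beta, psi):
    """Vertices = template + linear shape and expression offsets.
    (FLAME additionally applies pose correctives and blend skinning.)"""
    return template + B_shape @ beta + B_exp @ psi

verts = linear_3dmm(np.zeros(100), np.zeros(50))   # zero coeffs -> mean face
```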

Q1-2. Considering the displacement map in addition to the FLAME parameter?

We think optimizing the mesoscopic detailed face geometry by re-parameterizing the displacement map is definitely an interesting future direction.

As discussed in [W1], face skin detail reconstruction is not within the scope of this work, which is rather about "reconstructing facial geometries, well complying with input facial gestures and motions." More importantly, we are motivated by the fact that the community lacks large-scale in-the-wild or multi-view video datasets containing high-level face geometry, head motion, identity, and expressions that comply with the input videos.

We believe our NeuFace dataset, which contains large-scale, diverse, natural, high-level human 3D face motion, would invigorate the 3D face community.


Q2-1. Sec. 3.2: how were the ground-truth landmarks for equation 2 obtained?

We used the off-the-shelf 2D landmark detection model FAN [C6]. We also performed manual human verification to reject failure cases when constructing the NeuFace-dataset.

We have added this to the revision (Sec 3.2, second paragraph, page 4), highlighted in pink. Thanks for checking the details.

[C6] Face Alignment Network (FAN), https://github.com/1adrianb/face-alignment

Q2-2. In the Multi-view consistency loss, where do the confidence values come from?

We assign a confidence score per vertex by measuring the angle between the vertex normal and the camera ray. We set a vertex as invisible if the angle is larger than the threshold $\tau_a$ and the vertex has a deeper depth than $\tau_z$ (i.e., $z < \tau_z$). We empirically choose $\tau_a = 72^\circ$ and $\tau_z = -0.08$. We have revised the paper accordingly (p. 5, Sec. 3.2), highlighted in pink.
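A sketch of this confidence assignment; the camera convention and the exact combination of the angle and depth tests are assumptions (here a vertex is treated as invisible when it is back-facing beyond $\tau_a$ or deeper than $\tau_z$):

```python
import numpy as np

def vertex_confidence(normals, positions, tau_a_deg=72.0, tau_z=-0.08):
    """Binary per-vertex confidence from the angle between the vertex
    normal and the ray towards the camera, plus a depth cut-off.
    Assumes a camera at the origin looking down -z, so the direction
    towards the camera from the face surface is +z."""
    to_cam = np.array([0.0, 0.0, 1.0])
    n = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    cos_a = n @ to_cam
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    visible = (angle <= tau_a_deg) & (positions[:, 2] >= tau_z)
    return visible.astype(float)
```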

Comment

Dear reviewer aXFo,

Thank you for reviewing our work. We sincerely ask Reviewer aXFo to check our rebuttal.
We believe that we have thoroughly addressed all the comments and suggestions Reviewer aXFo raised. Considering the remaining time, we welcome any further comments or feedback that may help enhance our work. We'd be happy to discuss further.

Best regards, Authors of 12

Comment

Thank you for your thoughtful and encouraging feedback. We believe that all the comments by the reviewer have been addressed, and the reviewer has also confirmed that the earlier misunderstandings were corrected. If there is no other concern, we would like to respectfully ask the reviewer to re-assess our work and increase the rating.

Comment

Dear authors. Thanks for your message. Perhaps you missed my previous message of Nov 22nd at 16:03. Let me attach it here ...

"Thanks for your detailed answer. I can see that you do not train your model with any "extra" images from the datasets, but rather optimize eq (1).

I assume that in this case the predictions of DECA or EMOCA models in Fig 4 and Tables 2 and S1 only analyze one image, whereas those of NeuFace are optimized with eq (1) using all frames in each sequence. Is this correct?

Also, it is not yet clear to me where you get the ground truth landmarks to compute the NME in Tables 2 and S1."

I would like a response to these questions. Best regards.

Comment

Dear reviewer axFo,

We kindly note that the reviewer's comments were posted at the top (not linked to this thread), and we have already answered them above.

1) Answer about the experiments

[22 Nov 2023, 16:31] Reply link: https://openreview.net/forum?id=E6EbeJR20o&noteId=bsAfVVwlqg

2) Answer about temporal consistency loss

Also, please check our response on temporal consistency loss at above (the reviewer's comment), as well.

[22 Nov 2023, 17:53] Reply link: https://openreview.net/forum?id=E6EbeJR20o&noteId=n5KZS1btNz

We respectfully request Reviewer aXFo to have a look at our responses and, if the concerns are addressed, to re-assess our work and increase the rating.

Comment

Dear authors. Thanks for your message and for sending me the links to your previous responses. I cannot find in them a clear and straight answer to these questions:

"I assume that in this case the predictions of DECA or EMOCA models in Fig 4 and Tables 2 and S1 only analyze one image, whereas those of NeuFace are optimized with eq (1) using all frames in each sequence. Is this correct?

Also, it is not yet clear to me where you get the ground truth landmarks to compute the NME in Tables 2 and S1."

Please do not miss the opportunity to do so at this time. Best regards.

Comment

Dear reviewer axFo,

Let us re-phrase the answers above.

1. The reviewer's assumption is CORRECT.

Still, we think it is fair to compare ours with DECA/EMOCA. If the comparisons in Fig. 4, Table 2, and Table S1 were about competing across learned models, they would appear unfair, BUT the comparison is about the "dataset quality" obtained from the typical methods of the community (DECA/EMOCA inference or 3DMM fitting) [C4, C5]; thus, we consider it fair and valid.

[C4] Ng et al., Learning to listen: Modeling non-deterministic dyadic facial motion. In CVPR 2022.

[C5] Paraperas et al., Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos. In CVPR 2022.

2. We used the landmarks predicted by FAN [C6] as the ground truth for computing the NME in the tables.

Please note that there are no human-annotated ground truth landmarks for the datasets (MEAD, VoxCeleb2, CelebV-HQ).

[C6] Face Alignment Network (FAN), https://github.com/1adrianb/face-alignment
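For reference, a common way to compute such an NME, with FAN-style 68-point landmarks and inter-ocular normalization (the exact normalization used in the paper is an assumption):

```python
import numpy as np

def nme(pred, ref, left_eye=36, right_eye=45):
    """Normalized mean error: mean point-to-point distance between the
    predicted and reference 2D landmarks, divided by the reference
    inter-ocular distance (outer eye corners of the 68-point convention)."""
    inter_ocular = np.linalg.norm(ref[right_eye] - ref[left_eye])
    return np.mean(np.linalg.norm(pred - ref, axis=-1)) / inter_ocular
```

Here the FAN detections play the role of `ref`, so the metric measures agreement with the detector rather than with human annotation.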

We hope this addresses the reviewer's questions and concerns.

Comment

Dear Reviewer axFo,

We believe that all the comments by the reviewer have been addressed. If there is no other concern, we respectfully request Reviewer axFo to review our revision and the responses of the rebuttal once more. We hope this might prompt a reconsideration and adjustment of Reviewer axFo's rating.

Best regards,

Authors of 12.

Comment

We thank the reviewers for their constructive comments. We appreciate the positive feedback acknowledged by the reviewers:

  • Providing the NeuFace-dataset, which would benefit future research in this field (all the reviewers).
  • Proposing the insightful concept of neural re-parameterized 3D face optimization (Reviewer B8S8)
  • Providing an effective optimization method (Reviewer C76P)
  • The paper reads well and is properly set in the research context (Reviewer aXFo)

In this rebuttal, we have addressed all the comments from the reviewers, and we look forward to additional feedback or constructive discussion for clarification. Here is the summary of the rebuttal and the revision.

  • We have newly added analyses to support the effectiveness of our NeuFace optimization: analysis with the photometric loss, and analysis of identity code convergence (see Appendix Sec A.4).
  • We have added additional references related to our work (see Sec. 2 in the revision and Appendix Sec C.2).
  • We provided detailed explanations about the FLAME parameters (see Sec. 3.1 in the revision), the off-the-shelf 2D landmark detector (see Sec. 3.2 in the revision), and the multi-view consistency loss (see Sec. 3.2 in the revision).
  • All other comments from the reviewers are addressed in the individual responses.

Before the individual responses, we would like to re-emphasize the scope and the contribution of our work.

  • We propose a test-time optimization method, called NeuFace optimization, which is designed to acquire accurate and reliable pseudo-3D face annotations for a single video or multi-view video.
  • Thus, this approach does not entail training a general 3D face reconstruction model on training datasets, nor does it involve the conventional train/validation/test split. Our optimization involves no data other than the single test sample.
  • We demonstrated how accurate and reliable our NeuFace optimization is in generating pseudo 3D face annotations, by comparing with existing methods used for mesh annotations (Secs. 3 and 4 in the initial submission).
  • Through our proposed optimization, we constructed the NeuFace-dataset, the first large-scale 3D face mesh pseudo-labels for existing large-scale 2D face video datasets.
  • The effectiveness of our provided dataset is separately demonstrated through enhancing the performance of the 3D reconstruction neural model (Sec. 5.1), and learning a 3D human facial motion prior (Sec. 5.2) as applications.

We'd like to ask the reviewers to re-assess the value of our work in light of these clarifications. Please let us know if our answers address your concerns. We would be happy to provide further discussion and clarification.

Comment

Thanks for your experiments using the median on the multi-view regularization term. I can see it improves over a plain average and provides a result similar to your weighted average. However, in my comment I was also thinking about the running average you use for temporal consistency loss. I still think a robust alternative would improve results.

Comment

Thanks for your detailed answer. I can see that you do not train your model with any "extra" images from the datasets, but rather optimize eq (1).

I assume that in this case the predictions of DECA or EMOCA models in Fig 4 and Tables 2 and S1 only analyze one image, whereas those of NeuFace are optimized with eq (1) using all frames in each sequence. Is this correct?

Also, it is not yet clear to me where you get the ground truth landmarks to compute the NME in Tables 2 and S1.

AC Meta-Review

This paper introduces (1) a technique: neural reparameterization and bootstrapping for test-time optimization, an EM algorithm (which uses a landmark loss, a temporal consistency loss, and a multi-view consistency loss; note the losses are not the novelty), and (2) a 3D face mesh video dataset built by running the proposed technique on existing face video datasets. The strengths of this work lie in the creation of the 3D face database. The weaknesses include the incremental nature of the proposed technique and the lack of comparison to other methods. The missing parts include (1) validation of the novelty of the proposed method and its advantages over SOTA methods, and (2) validation of the accuracy of the dataset (is it the best that can be done?).

Why Not a Higher Score

The technique's novelty is low, and there is no fair comparison showing the advantage of the proposed technique. The dataset is not of high quality (it looks more like smoothing, with low quality on difficult facial expressions). Overall, it doesn't meet the bar.

Why Not a Lower Score

N/A

Final Decision

Reject