PaperHub
Average rating: 5.5 / 10 (Poster; 4 reviewers; scores 5, 6, 5, 6; min 5, max 6, std 0.5)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
NeurIPS 2024

Incorporating Test-Time Optimization into Training with Dual Networks for Human Mesh Recovery

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
Human Mesh Recovery; Meta-Learning; Test-Time Optimization; Training Optimization

Reviews and Discussion

Official Review (Rating: 5)

The paper presents a new training paradigm for better test-time optimization performance. Specifically, two strategies are proposed: (1) integrate test-time optimization into the training procedure, performing test-time optimization before the typical training step in each training iteration; (2) propose a dual-network architecture to implement the new training paradigm, aiming to unify the spaces of the test-time and training optimization problems.

Strengths

  1. The paper proposes a new training paradigm that can integrate test-time optimization into the training procedure.
  2. The overall procedure is clear and easy to understand.

Weaknesses

My biggest concern is that the performance improvement is not significant compared with the existing method. Table 1 compares the results of an existing method, PLIKS, and the method proposed in this paper. However, the comparison on 3DPW is not fair, since the results in the last two rows use 3DPW for training, while the reported PLIKS does not. PLIKS trained on 3DPW can achieve 40.1 PA-MPJPE and 73.3 PVE, which greatly outperforms the results in this paper.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Question 1: The biggest concern is that the performance improvement is not significant compared with the existing method. The comparison on 3DPW is not fair, since the results in the last two rows use 3DPW for training, while the reported PLIKS does not. PLIKS trained on 3DPW can achieve 40.1 PA-MPJPE and 73.3 PVE, which greatly outperforms the results in this paper.

Our Response: Thanks very much for the comments. We have checked the results of PLIKS again and confirm that our comparison is fair and that our results outperform those of PLIKS.

For convenience, we copy the following content from Table 1 of the paper of PLIKS (see https://openaccess.thecvf.com/content/CVPR2023/papers/Shetty_PLIKS_A_Pseudo-Linear_Inverse_Kinematic_Solver_for_3D_Human_Body_CVPR_2023_paper.pdf).

3DPW

| Method | PA-MPJPE | MPJPE | PVE |
| --- | --- | --- | --- |
| PLIKS+ (HR32) | 42.8 | 66.9 | 82.6 |
| PLIKS++ (HR32) | 40.1 | 63.5 | 76.7 |
| PLIKS++ (HR48) | 38.5 | 60.5 | 73.3 |

From the caption of Table 1 of PLIKS: PLIKS+ means the network was additionally trained with 3DPW; PLIKS++ means the network was additionally trained with 3DPW and AGORA.

In our paper, we compare with PLIKS+, which is also trained with 3DPW. So, from the aspect of the training dataset, our comparison is fair.

As for the 40.1 (PA-MPJPE) and 73.3 (PVE) results of PLIKS++, they are obtained by further finetuning PLIKS on 3DPW and AGORA. Since our method is not finetuned with AGORA, it is more appropriate to compare with PLIKS+ instead of PLIKS++.

However, we find there is a typo in our paper: we write that PLIKS adopts the HRNet-W48 backbone, but it actually uses HRNet-W32. This makes the comparison between PLIKS and our method somewhat unfair. Therefore, we make the following fair comparison, which also shows that our method outperforms PLIKS.

3DPW

| Method | PA-MPJPE | MPJPE | PVE |
| --- | --- | --- | --- |
| PLIKS+ (HR32) | 42.8 | 66.9 | 82.6 |
| Ours_CLIFF† (HR32) | 40.1 | 62.9 | 80.9 |
Comment

Thanks for the authors' feedback; the results show that the proposed method can outperform PLIKS under a fair comparison, so I raise my score.

Comment

Dear Reviewer Nccj,

The authors really appreciate your positive feedback.

Official Review (Rating: 6)

This paper targets the Human Mesh Recovery task. The authors adopt a test-time per-instance optimization method similar to EFT, but propose a meta-learning-like framework to pre-optimize the pretrained human pose estimation network, aiming to provide a better starting point for test-time optimization. In addition, the authors propose a dual-network mechanism to resolve the misalignment between the test-time and training-time objectives. The idea is novel and interesting. Experimental results also fully demonstrate the effectiveness of the proposed method.

Strengths

  • Incorporating meta learning idea in test-time optimization is interesting and demonstrates leading performance.
  • The authors conduct extensive ablations to validate the proposed method.
  • Paper is well organized and written.

Weaknesses

  • Some experimental results are not clearly explained. E.g., in Fig. 3, EFT gets worse after several optimization steps, while the proposed method keeps getting better. Could you please give some intuition on this? Why does EFT get worse, and which design of the proposed approach alleviates this issue?
  • Fig. 1 is a bit dense and difficult to follow. Perhaps the authors could try to reorganize and simplify it.
  • The paper contains a large number of experiments. However, the authors do not provide statistical significance for any of the experiments, as suggested by NeurIPS Paper Checklist item 7. It is suggested to at least include the standard deviation or the standard error for the main results.

Questions

Please see above.

Limitations

The authors didn't include Broader Impacts in the paper. Although two failure cases are provided, more are encouraged to give a comprehensive overview.

Author Response

Question 1: In Fig. 3, why does EFT get worse, and which design of the proposed approach alleviates this issue?

Our Response: This is probably because EFT is finetuned with only the 2D reprojection loss and is therefore more sensitive to errors in the 2D joints. In the first several optimization steps, the estimated 3D SMPL approaches the 2D joints from a relatively distant initialization, so the result gradually improves. However, with more optimization steps, the SMPL may overfit the 2D joints, whose annotation errors then distort the SMPL, yielding worse evaluation metrics.

In contrast, our method is guided by both 3D and 2D supervision. The 3D pseudo SMPLs play the role of a regularizer that mitigates the influence of errors in the 2D joints.
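
Schematically (with illustrative notation of our own, not the exact symbols used in the paper), the test-time objective can be thought of as

$$L_{test} = \lambda_{2D}\,\big\|\Pi\big(J(\hat{M})\big) - \hat{x}_{2D}\big\|^2 + \lambda_{3D}\,\big\|\hat{M} - \tilde{M}_{pseudo}\big\|^2,$$

where $\hat{M}$ is the main network's SMPL estimate, $\Pi(J(\cdot))$ its reprojected 2D joints, $\hat{x}_{2D}$ the detected 2D joints, and $\tilde{M}_{pseudo}$ the pseudo SMPL from the auxiliary network; the second term is the 3D regularizer referred to above.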

Question 2: Fig. 1 is a bit difficult to follow; reorganize and simplify it.

Our Response: Thanks for the suggestion. We have revised Fig. 1. Please see the new figure in the attached PDF document.

Overall, we think the difficulty in reading Fig. 1 comes from the fact that the illustrations of SMPL meshes and 2D joints in the testing and training losses lack text descriptions. In the new figure, we increase the space between losses and add descriptions for the inputs of the loss functions.

Question 3: Provide statistical significance for experiments as suggested by the NeurIPS Paper Checklist 7.

Our Response:

3DPW

| Method | Joints | MPJPE | PA-MPJPE | PVE |
| --- | --- | --- | --- | --- |
| Ours_CLIFF | OpenPose | 62.93 ± 0.003 | 39.72 ± 0.002 | 80.06 ± 0.004 |
| Ours_CLIFF | RSN | 62.42 ± 0.0002 | 39.47 ± 0.0001 | 78.14 ± 0.0003 |

Human3.6M

| Method | Joints | MPJPE | PA-MPJPE |
| --- | --- | --- | --- |
| Ours_CLIFF | OpenPose | 43.93 ± 0.006 | 30.31 ± 0.004 |
| Ours_CLIFF | RSN | 41.96 ± 0.002 | 29.18 ± 0.0006 |

Thanks very much for the reminder. During this rebuttal period, we have conducted experiments to include the standard errors for the main results; please see the tables above. These results are obtained by running the test-time optimization 5 times with different initializations. More statistical results for both training and testing, covering the main and ablation experiments, will be added to the revised paper.
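
As a side note, a minimal sketch of how such a mean ± standard error could be aggregated from the 5 runs (our illustration only; the per-run numbers below are hypothetical and the exact aggregation used by the authors is not stated):

```python
# Hypothetical aggregation of 5 test-time-optimization runs into mean +/- standard error.
import numpy as np

runs_mpjpe = np.array([62.93, 62.92, 62.94, 62.93, 62.93])   # hypothetical per-run MPJPE values
mean = runs_mpjpe.mean()
std_err = runs_mpjpe.std(ddof=1) / np.sqrt(len(runs_mpjpe))  # standard error of the mean
print(f"MPJPE: {mean:.2f} +/- {std_err:.3f}")
```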

Question 4: Although two failure cases are provided, more are encouraged to give a comprehensive overview.

Our Response: Thanks very much for the suggestion. We have added more failure cases. Please see the attached PDF document for more information.

Comment

Thanks for the authors' response. I have another question. The authors mention that the "3D pseudo SMPLs play the role of a regularizer that mitigates the influence of errors in the 2D joints". Intuitively, 3D pseudo poses usually produce larger errors than 2D poses. Could the authors provide some experimental evidence to verify the effect of the 3D pseudo SMPL? It would be very helpful.

Comment

Dear Reviewer Qn92,

Thanks a lot for your valuable comments. We spent time on new experiments. Now we provide our responses to your thoughtful questions.

Question 1: Provide some experimental evidence to verify the effect of 3D pseudo SMPL?

Our Response: To validate the effect of the 3D pseudo SMPL, we have provided an ablation experiment between Rows 2 and 3 in Table 3 of the paper. In Row 2, we discard the auxiliary network, meaning the pseudo SMPLs generated by the auxiliary network are not used in the test-time optimization objective. The results are:

MPJPE: 78.5 (w/o pseudo) vs. 76.7 (w/ pseudo)
PA-MPJPE: 49.9 (w/o pseudo) vs. 49.5 (w/ pseudo)

This ablation study demonstrates that pseudo SMPLs can improve the results.

The above ablation experiment was conducted on a small training dataset (COCO). To further confirm the correctness of the results, we conduct the ablation study on the full training datasets.

3DPW

| Method | MPJPE | PA-MPJPE | PVE |
| --- | --- | --- | --- |
| Ours_CLIFF† (Res-50), w/o pseudo | 68.69 | 44.65 | 85.88 |
| Ours_CLIFF† (Res-50), w/ pseudo | 66.00 | 42.13 | 83.63 |

Backbone: ResNet-50
2D joint detection method: OpenPose (†)
Training datasets: COCO, MPII, Human3.6M, MPI-INF-3DHP, 3DPW
Number of training epochs: 65 (on the first four training datasets) + 1 (finetuned on 3DPW)

The above results further confirm the effectiveness of the generated pseudo 3D SMPLs.

Question 2: Intuitively, 3D pseudo poses usually produce larger errors than 2D poses.

Our Response: We are very grateful for this insightful comment, which reminds us to provide the following experimental results that can make our paper more complete.

Based on your comment, we think it is necessary to show how good the pseudo labels are, i.e., whether the 3D pseudo labels introduce larger errors.

In the following table, we report the accuracy of the pseudo SMPLs generated by the auxiliary network and compare them with CLIFF, as we use CLIFF as our auxiliary network.

Human3.6M

| Method | MPJPE | PA-MPJPE |
| --- | --- | --- |
| CLIFF* | 52.32 | 35.94 |
| Auxiliary Network | 49.70 | 35.22 |
| Ours_CLIFF† (Res-50) | 46.9 | 33.1 |

* means the results are tested by ourselves using the code of CLIFF.

Note that CLIFF is a strong baseline. The results show that the auxiliary network outperforms CLIFF. We therefore believe the pseudo SMPLs generated by the auxiliary network provide reliable supervision.

Comment

I appreciate the authors' efforts. Although it's unclear how the 3D pseudo pose compares to the 2D estimated pose, I'm inclined to vote for acceptance.

Comment

Dear Reviewer,

Thank you sincerely for your response and for your patience in reviewing our rebuttal. We will carefully revise our paper considering all comments from you and the other reviewers to make it clearer. In addition, we will release our code so that any researcher can refer to it for further understanding.

Our best, Authors.

Official Review (Rating: 5)

Human mesh recovery (HMR) is improved by incorporating test-time optimization into the mesh prediction model’s training loop, allowing the learned parameters to potentially be more readily adapted to examples at test time. This is referred to as a meta learning approach, and it departs from prior work that also introduced this meta learning strategy by incorporating a secondary network to make predictions for pseudo ground truth meshes during the test time computations (during training and actual testing). This novelty as well as the design choices of the approach as a whole are validated through ablation studies, and the final method is shown to perform strongly relative to SOTA competing methods.

Strengths

Originality: Adding to the meta learning approach an auxiliary network that provides a pseudo mesh for the test time adaptation is a novel contribution. Ablation studies illustrate the effects of key method hyperparameters.

Quality: Incorporation of the test-time adaptation into the training procedure is well motivated. The method is tested with multiple backbone networks on multiple datasets, against multiple strong baseline methods of varying types (optimization and regression). Quantitatively, it significantly outperforms the state of the art baselines. The ablation studies support the contention that the proposed learning approach will be effective as joint estimation and mesh prediction models advance. The results (especially Figure 3 and Table 3) support the idea that meta learning is a critical framework for solving the HMR problem. Table 3 also supports the use of the dual network approach.

Clarity: The paper is reasonably well written and the figures/tables helpfully convey key concepts and results.

Significance: The problem of predicting human mesh parameters from image data is actively studied, and progress on it impacts various applications. The submission makes notable contributions that improve the ability to solve this problem.

Weaknesses

The main weakness is presentation, which I think can be improved in a few ways. I outline these below and in the Questions section. Importantly, I think the results in this paper are strong, and I would be happy to raise my score if these weaknesses are addressed.

Most importantly, the contextualization relative to the prior work “META-LEARNED INITIALIZATION FOR 3D HUMAN RECOVERY” (Kim et al., 2022) needs to be improved.

  • The related work section references Kim et al. (2022), which also uses meta learning for HMR, but other than that reference the phrasing in the submission suggests that the meta learning idea is a novel contribution of this submission. I think this phrasing should be adjusted to better reflect that the meta learning component here is not novel. Some examples of phrasing I would change follow: the submission says that the relevance of meta learning to HMR is its “key observation”, that it is “rethinking” the HMR problem by introducing meta learning, that it “proposes” to use meta learning, etc.

Questions

Line 153 and Algorithm 1: the referenced meta learning approach MAML involves computing gradients through other gradient computations (i.e., second order derivatives). Could you please confirm my understanding that Algorithm 1 does not do this, and (if it does not) explain why it's described as “simulating” this meta learning approach?

Please add a quantitative comparison to the prior work that introduced meta learning to this problem, Kim et al. (2022), using the same training/testing data for each method. Also, please highlight the differences between these methods when discussing their different results, and connect this discussion to Table 3 because I believe its row 2 corresponds to Kim et al. (2022).

Line 251: I would add quantitative results to this section or avoid claiming strong OOD performance (e.g. relative to EFT). The provided qualitative results with 2 images do not provide sufficient support for the OOD claims.

Consider replacing the backwards sum notation in Figure 1 by placing the usual notation on the left side of the summands.

Limitations

The main text does not discuss any limitations. The appendix has some discussion of them that should be moved to the main text.

Author Response

Question 1: Rephrasing "key observation", "rethinking", "proposes", etc., as Kim et al. 2022 have already used meta learning.

Our Response: Thanks very much for the suggestion. Yes, Kim et al. 2022 also adopted meta learning, as referenced in our paper. The difference is that their method is a direct extension of EFT, while we take a different perspective, starting from the Human Mesh Regression model. Please see the difference between Kim et al. 2022 and our method in the response to Question 3.

Following your suggestion, we will rephrase the above presentation to credit the contribution of Kim et al. 2022 more appropriately.

  1. “The key observation in this paper is that the pretrained model may not provide an ideal optimization starting point for the test-time optimization” will be changed to:

“However, the pretrained model may not provide an ideal optimization starting point for the test-time optimization”.

  2. “The above analysis motivates us to re-think the test-time optimization problem in the framework of learning to learn, i.e., meta learning. Inspired by optimization-based meta-learning [12], we incorporate test-time optimization into the training process” will be changed to:

“Based on the above analysis and inspired by Kim et al. [22], we incorporate the test-time optimization into the training process, formulating the test-time optimization in the framework of learning to learn, i.e., meta learning [12].”

Besides the above modifications, we will add an analysis of the difference between Kim et al. 2022 and our method, and revise the relevant parts of the paper accordingly.

Question 2: Does Algorithm 1 compute second-order derivatives like MAML? Explain why Line 153 describes it as "simulating" this meta learning approach.

Our Response: We do use the MAML algorithm. However, the original MAML needs to compute second-order derivatives, which is too time-consuming and memory-intensive. We thus adopt FOMAML, a simplified version of MAML suggested by Finn et al. [12], which discards the second-order derivatives.

To clarify, there is a typo in Line 7 of Algorithm 1: the gradient should be computed with respect to w instead of w', as indicated by Reviewer PPKd.

Therefore, we confirm that the statement in Line 153 that our algorithm "simulates" MAML is correct.
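
To make the first-order simplification concrete, here is a minimal FOMAML-style sketch in PyTorch (our illustration only; the toy model, losses, data, and hyperparameters are stand-ins, not the paper's actual networks or objectives). The inner loop adapts a copy of the weights (w → w'); the outer-loop gradient is evaluated at w' but applied to w, so no second-order derivatives are needed:

```python
# Toy FOMAML-style update: the inner loop simulates test-time optimization on a copy of
# the weights; the outer update applies the gradient (taken at the adapted weights) to
# the original weights. Illustrative stand-in, not the authors' code.
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)                     # stand-in for the main network (weights w)
outer_opt = torch.optim.SGD(model.parameters(), lr=1e-2)
inner_lr, inner_steps = 1e-2, 3

def inner_loss(net, x):                           # stand-in for the test-time objective
    return net(x).pow(2).mean()

def outer_loss(net, x, y):                        # stand-in for the training objective
    return (net(x) - y).pow(2).mean()

for _ in range(100):                              # outer loop over training iterations
    x, y = torch.randn(8, 4), torch.randn(8, 2)

    # Inner loop: adapt a detached copy of the weights (w -> w').
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        inner_loss(adapted, x).backward()
        inner_opt.step()

    # Outer update (first order): gradient of the training loss at w', applied to w.
    grads = torch.autograd.grad(outer_loss(adapted, x, y), list(adapted.parameters()))
    outer_opt.zero_grad()
    for p, g in zip(model.parameters(), grads):
        p.grad = g
    outer_opt.step()
```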

Question 3: Quantitative comparison with Kim et al. (2022). Highlight the difference.

Our Response:

| Method | Backbone | Training Datasets | Testing Dataset | PA-MPJPE |
| --- | --- | --- | --- | --- |
| Kim et al. 2022 | SPIN (Kolotouros et al. 2019) | COCO, MPII, Human3.6M, MPI-INF-3DHP, 3DPW, LSP | 3DPW | 57.88 |
| Ours | HMR (Kanazawa et al. 2018) | COCO, MPII, Human3.6M, MPI-INF-3DHP, 3DPW | 3DPW | 44.3 |

Please see the above quantitative comparison. Due to the time limit, we take the result of Kim et al. 2022 directly from their paper for this comparison.

  1. Backbone: the backbone of Kim et al. 2022 is SPIN, which is stronger than our HMR backbone.
  2. Training datasets: the training datasets are nearly the same, except that Kim et al. 2022 additionally use LSP while we do not.

As can be seen, the PA-MPJPE of Kim et al. is 57.88, which is worse than our 44.3, indicating that our method performs better than Kim et al. 2022.

Now, we elaborate on the key difference between Kim et al. 2022 and our method in detail.

The key difference is that Kim et al. 2022 use only the 2D reprojection loss (please see Eq. 1 in their paper) in both the inner and outer loops of meta learning, while our method uses both 2D and 3D losses. This is a small difference in formulation, but a large difference in contextualization.

EFT is performed under the guidance of 2D reprojection loss, and Kim et al. (2022) extend EFT to both the inner and outer loops of meta learning. In this sense, the method of Kim et al. (2022) is a direct extension of EFT to meta learning.

In contrast, our method considers meta learning from the perspective of the complete HMR model trained with both 2D and 3D losses. In other words, we use the complete HMR model in both the inner and outer loops of meta learning. This is more reasonable, as our aim is to personalize the HMR model on each test sample, rather than a model trained with only the 2D reprojection loss.

The above difference enables us to design a dual-network architecture that is very different from the network used in Kim et al. 2022. Besides, introducing the 3D losses improves the HMR quality significantly, as shown by the above quantitative comparison and those in the paper.

Question 4: Line 251, add quantitative results to support the OOD claims.

Our Response: Thanks very much for the suggestion. We have added quantitative comparisons. Due to space limit, please refer to our answer to Question 3 of Reviewer PPKd (i.e., Reviewer 1) for more information.

Question 5: Place the backwards sum notation in Figure 1 on the left side of the summands.

Our Response: Thanks for the suggestion. We have revised Figure 1. Please see the attached PDF document for the modified figure.

Question 6: Move discussion of limitations in Appendix to the main text.

Our Response: Thanks a lot for the suggestion. We will move them to the main text as suggested.

Comment

I thank the authors for their updates and clarifications.

Could the authors please comment on whether row 2 of Table 3 corresponds to the method of Kim et al. (2022)? If it doesn't, how does row 2 differ from Kim et al. (2022)?

Comment

Dear Reviewer,

Thanks very much for the feedback.

Row 2 of Table 3 does not correspond to Kim et al. (2022); it is slightly different.

Both Kim et al. (2022) and our method employ meta learning, which contains inner and outer loops. Row 2 (Table 3) and Kim et al. (2022) have essentially the same objective in the inner loop, but different objectives in the outer loop.

Specifically, Row 2 employs only the 2D joint-based reprojection loss in the inner loop, i.e., neglecting the auxiliary network and its generated pseudo 3D SMPL labels. For the outer loop, Row 2 adopts both the 2D (joints) and 3D (SMPLs) losses, where the latter is computed with respect to the GT SMPLs.

In contrast, Kim et al. (2022) used the 2D reprojection loss in both the inner and outer loops.

We find that incorporating 3D SMPLs into the meta learning is very helpful. It is the utilization of the GT SMPLs that greatly improves the results of our method. Besides utilizing GT 3D SMPLs in the outer loop of meta learning, we additionally generate pseudo SMPLs and incorporate them into the inner loop of the meta learning.
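
To summarize the distinction with schematic loss names (ours, not the paper's exact notation):

$$\text{Kim et al. (2022):}\quad L_{inner} = L_{2D},\qquad L_{outer} = L_{2D}$$

$$\text{Row 2 of Table 3:}\quad L_{inner} = L_{2D},\qquad L_{outer} = L_{2D} + L_{3D}^{GT}$$

$$\text{Our full method:}\quad L_{inner} = L_{2D} + L_{3D}^{pseudo},\qquad L_{outer} = L_{2D} + L_{3D}^{GT}$$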

Comment

This makes sense, thanks! Elucidating these differences with Kim et al. (2022) while properly acknowledging its influence on your submission addresses my main concern.

I have read the other reviews and authors' response, and I have raised my score.

Comment

Dear Reviewer Cdc4,

Yes, we will clearly show how our approach differs from that of Kim et al. 2022. Your high-quality comments help us demonstrate our contributions to the community more clearly.

Thanks very much for the kind help.

Official Review (Rating: 6)

This paper introduces a novel method to enhance Human Mesh Recovery (HMR) from images. The approach integrates test-time optimization into the training process, inspired by meta-learning. The dual-network structure is designed to align training and test-time objectives, improving the starting point for test-time optimization. Extensive experiments demonstrate that this method outperforms state-of-the-art HMR approaches, providing higher accuracy and better generalization to test samples. The work emphasizes the advantage of combining test-time and training-time optimization for robust human mesh recovery.

Strengths

  1. The authors introduce an innovative approach by incorporating test-time optimization into the training phase, which addresses a gap in existing HMR methods. This blend of meta-learning principles ensures the optimization process at test time is more effective, enhancing performance and accuracy.

  2. The proposed method integrates seamlessly with existing HMR models using a dual-network structure, ensuring minimal additional computational load during training. This efficiency is achieved without compromising result quality, making the approach practical for real-world applications.

  3. The authors present their work in a well-structured and accessible manner. The logical flow, clear explanations of technical terms, and use of diagrams ensure that even complex concepts are easily understood, enhancing the overall impact of the paper.

Weaknesses

  1. The inclusion of dual networks does not fully achieve the goal of matching training and testing objectives. There remains a discrepancy between the training loss (using ground truth labels) and the testing-time loss (using pseudo labels). There is no guarantee or analysis provided that the gradients from ground truth labels and pseudo labels align.

  2. Using dual networks will introduce an ensembling effect. The second term of $\nabla_{\omega} L_{test}$ (i.e., $\nabla_{\omega} L_{3D}$) acts as an adjustment towards an intermediate estimation between the two networks. To fully eliminate the ensembling effects, another row in Table 3 should be included showing the results of EFT_CLIFF with two networks (averaged results) for a more accurate comparison.

  3. While the last subsection in Section 4 is a good start, it is insufficient. Test-time optimization (TTO) methods typically perform significantly better than their counterparts on out-of-domain datasets, which are critical for demonstrating the full potential of TTO solutions. A quantitative comparison with EFT on the LSP-Extended dataset would provide a more comprehensive validation of the method's effectiveness in OOD scenarios.

Questions

In line 7 of Algorithm 1, why use the gradient w.r.t. $w'$ to update $w$ (similar to Reptile), instead of the gradient w.r.t. $w$ (as in MAML)?

Limitations

No negative societal impact

Author Response

Question 1: There remains a discrepancy between the training loss (using ground truth labels) and the testing-time loss (using pseudo labels).

Our Response: Thanks a lot for the insightful question, which deserves further interpretation.

The design of dual networks is one of the core contributions of our method. We use the dual networks to unify the formulations of the training and testing objectives, but their contents cannot be completely unified, as one uses GT labels and the other uses pseudo labels.

We interpret this mismatch and provide analysis as follows.

  1. There are 3D losses in the training objectives, while there are none at test time. We find this yields a gap between the training and testing optimizations, which motivates us to design the auxiliary network to generate pseudo 3D labels to mitigate the gap. Although there is still a discrepancy between the GT and pseudo labels, the gap is greatly reduced compared with using only the 2D loss without introducing the pseudo labels. The experiments in Table 3 validate this key design of our method.

  2. In our system, the auxiliary network and the main network are trained simultaneously using our proposed training strategy. Through this process, the auxiliary network learns to estimate pseudo 3D meshes that are effective for optimizing the network during test-time optimization, which guarantees to a certain extent the alignment between the gradients from ground truth labels and pseudo labels.

Question 2: Eliminate ensembling effects: include results of EFT_CLIFF with two networks (averaged results).

Our Response: Thanks a lot for the valuable comment. Adding this experiment can further validate the meta-architecture of our method.

Following your suggestion, we use two CLIFFs to generate SMPLs and compute their averaged SMPL, based on which we conduct the EFT optimization, which in turn optimizes the two CLIFFs. This process is iterated for 20 rounds. The finally obtained CLIFFs and their averaged result are used for comparison.

The comparison results are shown in the following table.

| Method | MPJPE | PA-MPJPE |
| --- | --- | --- |
| EFT_CLIFF | 84.6 | 54.2 |
| EFT_2CLIFFs | 82.72 | 53.59 |
| Ours_CLIFF | 76.7 | 49.5 |

The row EFT_2CLIFFs shows the results of using two CLIFFs in EFT. This indeed works, indicating that an ensembling effect does exist. However, although the two CLIFFs improve EFT, the improved results are still much worse than ours, showing that the ensembling effect obtained via meta-learning outperforms that of simply using two CLIFFs.

We will add this new comparison into Table 3. Thanks again for the great suggestion.

Question 3: A quantitative comparison with EFT on the LSP-Extended dataset would provide a more comprehensive validation of the method's effectiveness in OOD scenarios.

Our Response: Thanks a lot for the suggestion. We perform the following quantitative experiments to validate the OOD effectiveness of our method compared with EFT.

First, we perform a quantitative comparison with EFT on the LSP-Extended dataset. Since LSP provides GT 2D joints but not GT SMPLs, we compare the 2D loss (measured w.r.t. the GT joints) with EFT.

LSP-Extended

| Method | 2D Loss |
| --- | --- |
| EFT_CLIFF | 8.3e-3 |
| Ours_CLIFF | 6.1e-3 |

Training dataset: COCO, MPII, MPI-INF-3DHP, Human3.6M, 3DPW
Backbone: HR-W48

Our method approaches the GT joints more closely than EFT, i.e., 6.1e-3 (ours) vs. 8.3e-3 (EFT).

The 2D loss is calculated as follows: both the predicted joints and the GT joints are transformed into the range [-1, 1] using the formula 2*y/crop_img_height - 1 (and similarly for the width). Then the mean squared error (MSE) between them is computed.
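
For clarity, a minimal sketch of this computation (illustrative Python; the function name, array layout, and use of NumPy are our assumptions, not the authors' implementation):

```python
# Normalize 2D joints to [-1, 1] per the crop size, then compute the MSE (sketch only).
import numpy as np

def normalized_2d_loss(pred_joints, gt_joints, crop_img_height, crop_img_width):
    """pred_joints, gt_joints: (num_joints, 2) arrays of (x, y) pixel coordinates."""
    def to_unit_range(joints):
        x = 2.0 * joints[:, 0] / crop_img_width - 1.0    # map x to [-1, 1]
        y = 2.0 * joints[:, 1] / crop_img_height - 1.0   # map y to [-1, 1]
        return np.stack([x, y], axis=1)

    return np.mean((to_unit_range(pred_joints) - to_unit_range(gt_joints)) ** 2)
```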

Second, to further evaluate the OOD effectiveness of our method, we train both our method and EFT on COCO (an outdoor dataset) and test them on Human3.6M (an indoor dataset).

Human3.6M

| Method | MPJPE | PA-MPJPE |
| --- | --- | --- |
| EFT_CLIFF | 85.5 | 51.0 |
| Ours_CLIFF | 83.8 | 48.6 |

Training dataset: COCO
Backbone: Res-50

Since there are GT SMPLs, we report MPJPE and PA-MPJPE. As can be seen, our results are also better than EFT.

Finally, the above quantitative experiments will be added to the main text of the revised paper, to validate the OOD effectiveness of our method with quantitative evidence.

Question 4: In line 7 of Algorithm 1, why use the gradient w.r.t. w' instead of w?

Our Response: Thanks for the careful checking. We are sorry; this is a typo. We do use the MAML algorithm, specifically FOMAML, which discards the second-order derivatives. So w' should be corrected to w. We will correct this typo in the revised paper.

Comment

Thanks for the detailed reply. All my concerns have been resolved, so I will raise my score to 6. Please add the above content to your final version, especially the answers to Q2 and Q3.

Comment

Dear Reviewer PPKd,

We are greatly encouraged by your positive feedback. Thanks very much. We promise to release our code.

Author Response

Dear ACs and Reviewers:

We sincerely thank you and all the reviewers for your time and reviews.

All reviewers endorse the novel idea of this paper. For example, Reviewer PPKd says "The authors introduce an innovative approach", and Reviewer Cdc4 says "Adding to the meta learning approach an auxiliary network ... is a novel contribution" and that it is "well motivated". Reviewer Qn92 comments that "Incorporating meta learning idea in test-time optimization is interesting and demonstrates leading performance". Reviewer Nccj says "The paper proposes a new training paradigm". All the reviewers think our paper is "well structured" and "easy to understand".

Reviewer Cdc4 mainly raises the concern that Kim et al. (2022) have already adopted meta learning to improve EFT. We have explained the difference between Kim et al. (2022) and our method in detail.

Both Reviewer PPKd and Reviewer Cdc4 suggest adding quantitative results on the OOD performance of our method. We have conducted two kinds of experiments that validate the OOD effectiveness well.

Reviewer PPKd provides a very insightful comment that our method has an ensembling effect and suggests providing a comparison with a baseline that also incorporates ensembling. After adding this experiment, our paper is more solid.

Following the comments of Reviewer Qn92, we re-plot Figure 1 to make the pipeline clearer. We also provide two more failure cases in the attached PDF document.

We would like to point out that there are some factual errors in the comments of Reviewer Nccj, who may have misread the results of PLIKS. We confirm again that our comparison with PLIKS is fair and that our method outperforms PLIKS. Please see our response to the question of Reviewer Nccj.

In summary, we have proposed a novel Human Mesh Recovery (HMR) method that incorporates HMR into the framework of meta learning. We propose a novel dual-network architecture to mitigate the discrepancy between the optimizations in the inner and outer loops of the meta learning. Besides, we provide thorough experiments to validate the effectiveness of the proposed method.

We sincerely hope that this paper, which achieves state-of-the-art results on an important topic, will be considered suitable for publication at NeurIPS 2024. Thank you for your service and for considering our paper for the conference.

Sincerely

Authors.

Final Decision

This work addresses the problem of human mesh recovery (HMR) using a meta-learning framework, where a dual network is designed to align the training and test-time objectives. Overall, the presentation is clear, the problem and method are well motivated, and the performance is good. The paper receives positive scores after the rebuttal discussions. The AC agrees this is a nice application of meta-learning to HMR and thus recommends acceptance. The authors should properly reference the prior work "META-LEARNED INITIALIZATION FOR 3D HUMAN RECOVERY" (Kim et al., 2022), which already applied meta-learning to HMR, and highlight the key differences.