PaperHub
Overall rating: 6.0 / 10 (Poster, 5 reviewers; lowest 5, highest 7, std. dev. 0.9)
Individual ratings: 5, 6, 5, 7, 7
Confidence: 4.2 | Correctness: 3.2 | Contribution: 2.8 | Presentation: 3.2
NeurIPS 2024

Estimating Ego-Body Pose from Doubly Sparse Egocentric Video Data

OpenReview | PDF
Submitted: 2024-04-23 | Updated: 2025-01-23

Abstract

Keywords
pose estimation, egocentric pose estimation

Reviews and Discussion

Official Review
Rating: 5

This paper presents a framework to estimate full-body human poses from an egocentric head-mounted display (HMD). The input to the system consists of two parts: the head tracking signal given by the HMD, and a sparse hand pose signal estimated from egocentric video. The algorithm, DSPoser, is composed of two stages: temporal completion and spatial completion. In the temporal completion stage, DSPoser uses an encoder-decoder to generate Gaussian distributions of hand pose states. In the spatial completion stage, DSPoser generates full-body poses from hand trajectories and head tracking signals. Experiments were performed on the Ego-Exo4D and AMASS datasets.

Strengths

(1) The quantitative results are excellent, especially for the doubly sparse data. Tables 1 and 2 support the effectiveness of the proposed method.

(2) The overall paper writing is clear.

(3) Limitations are well-explained.

(4) The ablation study about aleatoric and epistemic uncertainties is interesting to me.

Weaknesses

(1) The novelty of “newly introduced task (L.205)” is a bit limited. Both motion imputation and pose completion are widely investigated problems, and the combination of both problems seems not difficult to solve.

(2) The paper only proves that the proposed hand trajectory imputation is better than linear interpolation for the doubly sparse task. As motion imputation/interpolation/in-betweening is a long-standing task, the contribution of the proposed hand imputation method (Uncertainty-aware MAE) is not clear.

Questions

(1) For hand pose estimation, why use FrankMocap instead of recently introduced ACR (ACR: Attention Collaboration-based Regressor for Arbitrary Two-Hand Reconstruction [CVPR 2023]) or IntagHand (Interacting Attention Graph for Single Image Two-Hand Reconstruction [CVPR 2022]), which are designed for hand mesh recovery only? Does the performance of hand mesh recovery affect the final full-body pose estimation?

(2) May the training on AMASS benefit Ego-Exo4D performance?

(3) In Figure 3(b), why not demonstrate hand pose here?

(4) Sec.3.3 is titled with "imputed hand trajectories and head tracking signal". However, L.146 said "from imputed hand trajectories" and did not mention "head tracking signal" in this section at all.

Minor questions: In Table 1, “x” should also be briefly explained like “y”.

At L.81, V_1, …, V_{T_w} should be defined (RGB images?)

Definition of T_w (L. 114) is better moved to its first appearance (L.81)

Limitations

Limitations are clearly addressed.

Author Response

We appreciate the valuable comments aimed at improving our paper. We will revise the draft according to the reviewer's suggestions.

Q1: Limited novelty

Our novelty lies in our approach to solving the body pose estimation problem given doubly sparse data, specifically in how we address the under-constrained problem by measuring and exploiting uncertainty. Previous methods rely heavily on dense hand signals, requiring hand controllers for ego-body pose estimation. Another approach that only uses head poses to estimate the whole body does not utilize hand pose information. Our proposed solution strikes a novel balance between these two approaches, eliminating the need for hand controllers while achieving better results by incorporating a few constraints from detected hand poses.

Q2: The effectiveness of the Uncertainty-aware MAE

As the reviewer noted, numerous works have addressed trajectory imputation. However, imputing the hand trajectory itself is not our primary task or the focus of our novelty. Our design of MAE aims to capture uncertainty while imputing the hand trajectory, which differs from other imputation methods such as mask token, in-betweening, and interpolation. Additionally, we introduced a couple of ways to utilize this uncertainty —sampling, dropout, and distribution embedding— in a diffusion model while spatially completing the full body. Given that our newly introduced task of estimating ego-body pose from doubly sparse video data is an under-constrained problem, one of our key motivations is to leverage the "uncertainty" that arises from this under-constrained data.
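For concreteness, below is a minimal, illustrative sketch (not the paper's code; layer sizes and output dimensions are assumptions) of how an imputation head can predict a Gaussian per missing hand state, with aleatoric uncertainty read off the predicted variance and epistemic uncertainty estimated via MC-dropout sampling:

```python
# Minimal sketch (illustrative only): an imputation head that predicts a Gaussian
# over each missing hand state, so that aleatoric uncertainty comes from the
# predicted variance and epistemic uncertainty from MC-dropout.
import torch
import torch.nn as nn


class GaussianImputationHead(nn.Module):
    """Maps per-frame encoder features to (mean, log-variance) of hand positions."""

    def __init__(self, feat_dim: int = 256, out_dim: int = 6, p_drop: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.mu = nn.Linear(feat_dim, out_dim)       # e.g., left/right wrist xyz (assumed)
        self.log_var = nn.Linear(feat_dim, out_dim)  # per-dimension log-variance

    def forward(self, feats: torch.Tensor):
        h = self.dropout(feats)
        return self.mu(h), self.log_var(h)


def gaussian_nll(mu, log_var, target):
    # Heteroscedastic regression loss: larger predicted variance down-weights the error.
    return (0.5 * torch.exp(-log_var) * (target - mu) ** 2 + 0.5 * log_var).mean()


@torch.no_grad()
def impute_with_uncertainty(head, feats, n_samples: int = 8):
    # Aleatoric: mean of predicted variances; epistemic: variance of MC-dropout means.
    head.train()  # keep dropout active for MC sampling
    mus, vars_ = [], []
    for _ in range(n_samples):
        mu, log_var = head(feats)
        mus.append(mu)
        vars_.append(torch.exp(log_var))
    mus, vars_ = torch.stack(mus), torch.stack(vars_)
    return mus.mean(0), vars_.mean(0), mus.var(0)  # imputation, aleatoric, epistemic
```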

Q3: Hand pose estimation module and its effectiveness

\begin{array}{l|cc} \hline Methods & MPJPE & MPJVE \\ \hline \hline FrankMocap & 16.84 \pm 0.04 & 39.86 \pm 0.05 \\ ACR & 16.69 \pm 0.05 & 40.12 \pm 0.06 \\ Hand Ground Truth & 16.43 \pm 0.02 & 37.49 \pm 0.04 \\ \hline \end{array}

We agree that recently introduced hand models can be used instead of FrankMocap. Before submitting the paper, we compared the results of FrankMocap with the ground-truth 3D hand joint locations of Ego-Exo4D and concluded that the effect of the hand detector is not significant. We believe this is because we utilize only the 3D wrist location from the detected hand. Therefore, even though FrankMocap was our initial choice only to prove the concept, we decided not to replace it. The table above compares different hand inputs and shows that the performance difference from the ground truth and from another hand detector model is not significant.

Q4: Training on AMASS benefits Ego-Exo4D performance?

We greatly appreciate the suggestion of applying transfer learning to the Ego-Exo4D dataset. The Ego-Exo4D paper's baseline implementation of EgoEgo [1] took second place in the Ego Body Pose Estimation Challenge. This implementation has demonstrated the benefits of training on the AMASS dataset for improving performance on Ego-Exo4D. This method employs a conditional diffusion model, cross-attention for conditioning, and rotary positional embeddings with SLAM pose input. We recognize the potential of this approach and intend to explore its application in our future work on this task.

[1] Li, Jiaman, Karen Liu, and Jiajun Wu. "Ego-body pose estimation via ego-head pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Q5: Hand visualization on Figure 3-(c)

We appreciate the reviewer's feedback. While our primary focus is on body pose estimation, rather than hand pose estimation, we understand the importance of comprehensive visualization. Therefore, we can provide visualizations of the canonical hand poses, similar to the approach we used for mesh visualization in Fig 3-(c), to enhance clarity and completeness.

Q6: Writing clarity regarding our constraint in Section 3.3.

Thank you for the valuable feedback. We acknowledge the oversight in Section 3.3, where the title mentions both "imputed hand trajectories and head tracking signal," but the text only references "imputed hand trajectories." We will clarify our method by including a head tracking signal in this section to ensure consistency and completeness.

Q7: Suggestions for clarity.

Thank you for pointing out the oversights. Following the reviewer's comments, we will make the following revisions to enhance clarity:

  1. Add a brief explanation of x in Table 1.

  2. Clarify the dimension of the RGB video data.

  3. Move the definition of T_w to its first appearance in the text.

These changes will help ensure the information is clear and easy to understand.

Comment

The authors' responses adequately address most of my concerns. I would like to keep my initial rating because I still think the proposed task is a bit incremental. Both the technical description and the evaluation are satisfactory.

Comment

Thank you for your thorough review and for acknowledging that our responses addressed most of your concerns. We appreciate your thoughtful consideration and respect your decision to maintain your initial rating. Once again, thank you for your valuable feedback and for taking the time to review our work.

Official Review
Rating: 6

This paper presents a new method for ego-body pose estimation from egocentric videos. Compared to previous methods that assume hand tracking signals are always available, this paper focuses on the case where hand poses are captured only intermittently from egocentric videos. To solve this, the paper proposes a two-stage method that first performs temporal completion of the hand trajectory and then spatial completion of the full-body pose. Experiments show better performance than the selected baseline methods.

Strengths

  1. The paper is well-written and very easy to follow. The figures and tables are well presented.
  2. I like the point that the authors use the hand poses captured intermittently from egocentric video data instead of assuming dense tracking signals. The two-stage method also sounds reasonable for this case.

Weaknesses

  1. It would be interesting to see how the proposed method would compare to the FoV modeling in [1], which focused on the same task setting when the hand tracking signals are intermittent. However, [1] is not discussed in the submission.

    [1] Jiang et al. EgoPoser: Robust Real-Time Ego-Body Pose Estimation in Large Scenes, arXiv 2023

  2. In Table 3, when trained only on dense data, the proposed methods performed worse than previous methods. So, I was wondering whether the model generalizes well to different settings.

  3. Metrics related to computational complexity, such as the number of parameters, FLOPs, and inference time, are not provided. This is important to ensure a fair comparison with previous methods, and to see the potential in real-world applications.

  4. SOTA methods like AvatarPoser and AvatarJLM are only compared in Table 3 but not in Tables 1 and 2. It would be better to have consistency when comparing methods.

  5. Some technical details seem missing. For example, what are the lengths of the input and output frames? How is the evaluation performed (e.g., what is the step size of the sliding window)?

  6. There are no video comparisons provided, which are important for this task.

Questions

  1. The authors acknowledged that using diffusion models could limit the usage of real-time applications. What is the motivation for using diffusion models for this task? Some recent papers like [2] even show AvatarPoser performed better than Diffusion models. In Table 3, the previous method, AvatarJLM, also shows better performance than the proposed method.

    [2] Ma et al. Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild, arXiv 2024

  2. IMU measurements are mentioned multiple times in this paper. Are they used in the paper? If yes, how?

  3. Will the code be published?

Limitations

Limitations have been discussed in Section 7.

Author Response

We sincerely appreciate the reviewer's insightful comments on our paper.

Q1: FoV modeling result of EgoPoser (ECCV 2024).

\begin{array}{c|cc|cc|cc}\hline Strategies&MPJPE (180°)&MPJVE (180°)&MPJPE (120°)&MPJVE (120°)&MPJPE (90°)&MPJVE (90°) \\\hline \text{EgoPoser}&5.31&39.69&6.07&46.01&6.60&48.25 \\ \text{DSPoser (Ours)}&4.80&22.58&5.28&23.13&5.51&24.19 \\\hline \end{array}

Thank you for bringing EgoPoser (ECCV 2024) to our attention. We were unaware of this work at the time of our submission, and it is encouraging to see parallel interest in ego body pose estimation from intermittent observations. We will ensure to cite this work in the next version of our paper.

Regarding contributions, we understand that EgoPoser's main focus is on preparing training data based on field-of-view (FoV) modeling rather than random masking. While we share an interest in FoV considerations, our work offers distinct algorithmic contributions, including:

  1. A multi-stage approach to pose estimation.

  2. An uncertainty-aware masked auto-encoder (MAE).

These aspects of our work were recognized by reviewers as innovative contributions to the field. We believe our approach complements the ideas presented in EgoPoser.

Moreover, the experimental data presented in the preceding table demonstrates that our approach surpasses EgoPoser in performance, with notably superior results in scenarios involving a narrow field-of-view (FoV).
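For readers unfamiliar with FoV-based masking, the sketch below illustrates the general idea (our reading of the setting, not EgoPoser's code; the camera is assumed to look along the +z axis of the head frame): a hand is treated as observed only when it falls inside the headset camera's field of view, rather than being masked at random.

```python
# Minimal sketch (assumptions stated above): mark a hand as visible only when it
# lies inside the headset camera's field of view.
import numpy as np


def fov_visibility(hand_pos_world, head_R, head_t, fov_deg=90.0):
    """
    hand_pos_world: (T, 3) hand positions, head_R: (T, 3, 3) head rotations,
    head_t: (T, 3) head positions. Camera assumed to look along +z of the head frame.
    """
    # Transform hand position into the head (camera) frame: R^T (p - t).
    local = np.einsum('tij,tj->ti', np.transpose(head_R, (0, 2, 1)),
                      hand_pos_world - head_t)
    depth = local[:, 2]
    angle = np.degrees(np.arccos(np.clip(depth / (np.linalg.norm(local, axis=1) + 1e-8),
                                         -1.0, 1.0)))
    # Visible if in front of the camera and within half the FoV angle.
    return (depth > 0) & (angle <= fov_deg / 2.0)
```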

Q2: Generalization on Ego-Body Pose Estimation given dense data.

Please see the table in Q3 of the Author Rebuttal.

The quantitative results of our method are influenced by the performance of the VQ-VAE component. Since Table 3 in the main paper is intended to show the versatility of our framework rather than argue our main contribution, we did not conduct an extensive hyper-parameter search to achieve state-of-the-art performance initially. To address this, we performed hyperparameter tuning on the VQ-VAE to mitigate performance loss attributed to this component. Following these adjustments, we observed improved results compared to other methods.

The results demonstrate that while our method may not exhibit the best performance across all metrics, it consistently shows at least the second-best performance on every metric. A one-to-one comparison with the other baselines reveals that our method outperforms each of them on at least 3 out of 4 metrics. Specifically, we included the Jitter metric to provide a comprehensive analysis of our method, as suggested by Reviewer VMPC. Our method achieves significantly better smoothness, indicating that our approach generates motion whose smoothness is close to the ground truth in terms of Jitter.

Q3: Metrics related to computational complexity.

Please see the tables in Q2 of the Author Rebuttal, or Tables B and C in the pdf.

We recognize the importance of including metrics related to computational complexity. While we have roughly provided the inference time in the supplementary material, we agree that a more detailed comparison would be beneficial. Including metrics such as the number of parameters, MACs, and inference time will offer a more comprehensive comparison with previous methods and better illustrate the potential of our approach for real-world applications. We will incorporate these details in the revised version of our paper.

Q4: Consistency between tables

Please refer to the table in Q1 of the Author Rebuttal, or Table A in the pdf file.

Thank you for highlighting this inconsistency. Following the reviewer's comment, we plan to update Tables 1 and 2 to include comparisons with AvatarPoser and AvatarJLM under various imputation methods, ensuring that relevant state-of-the-art methods are consistently evaluated throughout the paper. However, due to limited time and resources, we are currently only able to provide baseline results on the AMASS dataset. We are working on training the baseline methods on the Ego-Exo4D dataset and will update the results as soon as they are completed.

Q5: Missing details

Thank you for bringing this to our attention. We realize that some technical details were not clearly outlined in our submission. As described in Supplementary Section A, the window size is set to 40 frames, so the lengths of the input and output frames can be inferred as 40 frames from the explanation in the Preliminary section. However, we acknowledge that this is not easy to catch, so we will explicitly state these details in the main paper. Additionally, we will specify the sliding window step, which is set to 20 frames. Upon a thorough review of the paper, we also noticed that some details of the hand detectors, such as how we handled visibility for the Ego-Exo4D dataset, were not clearly stated. We will ensure that these details, along with the step size of the sliding window used during evaluation, are clearly presented in the main text to provide a complete understanding of our methodology.

Q6: Qualitative comparisons against SoTA

Please refer to the Q4 of the Author Rebuttal.

Q7: the motivation for using diffusion models

Please refer to the Q2 of the Author Rebuttal.

Q8: How we used IMU information.

The Ego-Exo4D dataset, collected by Meta using the Aria device, includes head trajectory data processed from IMU measurements. In the paper, we refer to this head trajectory as the tracking signal from the IMU, which is integral to our analysis and experiments. For the AMASS dataset, while IMU data is not used, we follow the approach of previous works such as AvatarPoser, where aggregated data consisting of joint pose, joint velocity, 6D rotation, and angular velocity mimics the tracking signals from an IMU. Both AvatarPoser and AvatarJLM have demonstrated that models trained with these signals are applicable to real-world data from AR devices.
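As an illustration of this AvatarPoser-style input construction (a sketch under our assumptions, not the exact feature definition used in the paper), the tracking features can be built from joint positions and rotation matrices as follows:

```python
# Minimal sketch (assumptions, not the paper's exact feature definition): IMU-like
# tracking features mimicked by positions, linear velocities, 6D rotations, and
# rotation velocities computed with finite differences over the frame rate `fps`.
import torch


def rotmat_to_6d(R: torch.Tensor) -> torch.Tensor:
    # 6D rotation representation: first two columns of the rotation matrix (Zhou et al., 2019).
    return R[..., :, :2].reshape(*R.shape[:-2], 6)


def tracking_features(pos: torch.Tensor, rot: torch.Tensor, fps: float = 30.0):
    """
    pos: (T, J, 3) joint positions, rot: (T, J, 3, 3) joint rotation matrices.
    Returns a (T-1, J, 3+3+6+6) feature tensor.
    """
    lin_vel = (pos[1:] - pos[:-1]) * fps                      # linear velocity
    rot6d = rotmat_to_6d(rot[1:])                             # current orientation
    # Rotation "velocity": relative rotation between consecutive frames, in 6D form.
    rel_rot = torch.matmul(rot[:-1].transpose(-1, -2), rot[1:])
    rot_vel6d = rotmat_to_6d(rel_rot)
    return torch.cat([pos[1:], lin_vel, rot6d, rot_vel6d], dim=-1)


# Example: head + two wrists over a 40-frame window.
T, J = 40, 3
feats = tracking_features(torch.randn(T, J, 3), torch.eye(3).expand(T, J, 3, 3).clone())
print(feats.shape)  # torch.Size([39, 3, 18])
```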

Q9: Lack of publicly available code.

Please refer to Q5 of the Author Rebuttal.

Comment

We've completed the baseline experiments on the Ego-Exo4D dataset and would like to share the results.

\begin{array}{l|c|ccc} \hline Methods & Imputation & MPJPE & MPJVE & Jitter \\\hline \text{AvatarPoser [10]} & \text{Interpolation} & 47.28 & 89.34 & 65.39 \\ \text{Bodiffusion [3]} & \text{Interpolation} & 59.81 & 120.12 & 142.32 \\ \text{AvatarJLM [34]} & \text{Interpolation} & 43.01 & 61.98 & 54.23 \\ \text{AvatarPoser [10]} & \text{MAE} & 24.54 & 62.34 & 44.24 \\ \text{Bodiffusion [3]} & \text{MAE} & 22.12 & 53.30 & 93.80 \\ \text{AvatarJLM [34]} & \text{MAE} & 21.08 & 45.77 & 39.04 \\\hline \text{DSPoser (Ours)} & \text{MAE} & 16.84\pm0.04 & 39.86\pm0.05 & 19.21\pm0.04 \\\hline \end{array}

Comment

Thank you for your detailed response! While some of my concerns have been addressed, there remains an issue regarding the sliding window size. Previous methods like AvatarPoser and AvatarJLM set the sliding window size to one frame to simulate real-time inference. However, according to the rebuttal, the sliding window size is 20, which could explain the improved smoothness metrics (i.e., MPJVE and Jitter, as mentioned in Q3 of the rebuttal). To ensure a fair comparison with previous methods, I would recommend adjusting the sliding window size accordingly.

Comment

Thank you for your valuable feedback on ensuring a fair comparison. We agree with your concern and have conducted additional experiments on the AMASS dataset to provide a more accurate and fair comparison with the baseline methods.

\begin{array}{l|c|c|c|cccc} \hline Methods & \text{Imputation} & \text{Sliding step} & \text{Averaging} & MPJPE & MPJVE & MPJRE & Jitter \\\hline \text{AvatarPoser [10]} & \text{MAE} & \text{1} & & 9.88 & 62.31 & 5.98 & 37.89 \\ \text{BoDiffusion [3]} & \text{MAE} & \text{20} & \text{temporal avg} & 7.35 & 31.33 & 5.47 & 1254.84 \\ \text{AvatarJLM [34]} & \text{MAE} & \text{1} & & 7.12 & 37.60 & 5.24 & 16.95 \\\hline \text{DSPoser (Paper)} & \text{MAE} & \text{20} & \text{temporal avg} & 5.51 & 24.19 & 4.09 & 4.27 \\ \text{DSPoser (\#1)} & \text{MAE} & \text{1} & & 5.87 & 52.38 & 4.31 & 34.12 \\ \text{DSPoser (\#2)} & \text{MAE} & \text{1} & \text{temporal avg} & 5.23 & 21.73 & 3.83 & 5.94 \\ \text{DSPoser (\#3)} & \text{MAE} & \text{1} & \text{4 samples} & 5.68 & 29.48 & 4.23 & 12.98 \\\hline \end{array}

Before discussing the results, we want to clarify that our primary focus is on introducing two key aspects: (1) the underexplored problem of doubly sparse video data, and (2) a generic multi-stage framework to address such problems. The specific choices within our framework (e.g. VQ-Diffusion, MAE, and the sliding window) were deliberately kept simple to demonstrate the efficacy of the intermittent tracking signal and our multi-stage framework. As reviewers jBAY and 2EAc noted, the computational demands of diffusion models pose concerns for real-time applications. While we provided baseline results to justify our design choices, we want to emphasize that alternatives such as AvatarJLM and AvatarPoser (instead of VQ-Diffusion style pose estimation algorithms used in the submission) are also viable within our multi-stage framework. These alternatives can balance time complexity and accuracy, making them suitable for real-time applications.

Exp. #1, fair comparison.

We first tested our method using the evaluation protocol of AvatarPoser and AvatarJLM, with a sliding window step of 1 where only the final output frame (current frame) is used for each step. Our method performed worse in Jitter and MPJVE compared to AvatarJLM but showed better performance in MPJPE and MPJRE. We believe this drop is due to the probabilistic nature of our method, unlike the deterministic approach of AvatarJLM and AvatarPoser.

Exp. #2, temporal averaging

Next, we modified Exp #1 to better utilize the diversity from our uncertainty modeling by averaging overlapping frames while advancing the sliding window. This significantly reduced errors across all metrics. However, when it comes to real-time inference, this improvement (averaging overlapping frames) does not affect the current frame but only applies to historical frames, which are not useful for real-time inference.

Exp. #3, multiple sampling

Finally, we implemented a multiple sampling approach, averaging four samples for each step using the end frames (current frames). This method outperformed AvatarJLM across all metrics and can be efficiently implemented with parallel processing, resulting in minimal time overhead from #1.

In conclusion, we agree with the reviewers that our method's better performance in MPJVE and Jitter is due to the difference in sliding steps. However, as shown in Exp #1, #2, and #3, the performance drop seems to result from our framework's ability to generate diverse motions from the same input. This issue can be mitigated by sampling multiple times using parallel processing and averaging the results.
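For clarity, the sketch below contrasts the three inference settings discussed above (illustrative pseudocode around a hypothetical window-based `predict` function, not the released evaluation code):

```python
# Minimal sketch (illustrative only): `predict` is a hypothetical stand-in that
# maps a (W, D_in) input window to a (W, J, 3) pose window.
import numpy as np


def offline_temporal_average(inputs, predict, W=40, step=20):
    """Sliding step 20 with averaging of overlapping output frames (paper setting / Exp. #2)."""
    T = len(inputs)
    acc = np.zeros((T,) + predict(inputs[:W]).shape[1:])
    cnt = np.zeros(T)
    for s in range(0, T - W + 1, step):
        acc[s:s + W] += predict(inputs[s:s + W])
        cnt[s:s + W] += 1
    return acc / np.maximum(cnt, 1)[:, None, None]


def online_last_frame(inputs, predict, W=40, n_samples=1):
    """Sliding step 1; only the final (current) frame of each window is kept (Exp. #1/#3)."""
    out = []
    for t in range(W - 1, len(inputs)):
        window = inputs[t - W + 1:t + 1]
        # Exp. #3: average several samples per step (e.g., n_samples=4).
        samples = [predict(window)[-1] for _ in range(n_samples)]
        out.append(np.mean(samples, axis=0))
    return np.stack(out)
```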

Comment

Dear Reviewer 2EAc,

We want to express our sincere gratitude for your insightful feedback on our manuscript.

As the author-reviewer discussion period is coming to an end, we would like to ask whether our responses have addressed your concerns. We look forward to further discussion of any remaining questions you might have!

Thank you once again for your time and valuable input. Your comments have significantly contributed to improving our work.

Best regards,

Authors

Comment

Thanks for your honest response. While I don't see the multiple sampling approach as a contribution (because such a moving average filter is quite common in signal processing and can also improve other methods), I find the overall paper interesting and thus decide to increase my score to weak accept.

I hope the authors can improve the paper in the revision, particularly by comprehensively comparing it with existing work, and clearly describing the sliding window setup.

Comment

Thank you for your thoughtful feedback and for taking the time to review our paper and rebuttal. Without your comments, we might have overlooked the differences in sliding window sizes in the additional experiments included in the rebuttal.

We are committed to improving the paper by adding more baselines to our experiments and providing a clearer explanation of the evaluation setup. We are grateful for your decision to raise the score.

Official Review
Rating: 5

The paper introduces the task of full-body pose estimation from temporally and spatially sparse tracking inputs. It differs from the prior work in assuming only the partial availability of hand tracking, which is a common scenario for head-mounted displays (HMD) without hand controllers. To address this problem, a two-stage approach is proposed. First, a masked autoencoder (MAE) infills the missing hand joints along with an uncertainty prediction using only the available frames and the head tracking. Then, the imputed tracking data (hands and head) are passed to a VQ-Diffusion model to predict the remaining body. The proposed model, namely DSPoser, is evaluated on the AMASS and Ego-Exo4D datasets where it performs better than the baselines in this new problem setting.

优点

Originality: The paper introduces a new challenge to the full body tracking domain, which has become an active research area due to the increasing number of HMD devices in the market. The proposed solution combines techniques from various works effectively.

Quality: The proposed two-stage approach is practical. The masked autoencoder with uncertainty estimation addresses the temporal sparsity problem and decouples the formerly known spatially sparse body tracking (i.e., hands and head are always available) from the temporal sparse setting.

Clarity: The paper is well-organized and easy to follow. There is enough background information to understand the proposed method and make connections to the prior works. The authors also provide experimental details thoroughly which seem to be sufficient for reproducibility. 

Significance: The new problem setting is novel and I expect it to be more commonly addressed in the future. Hence, this paper could be a reference for future works.

Weaknesses

1- I think the evaluations in Tables 1 and 2 could be better structured and also more fair. To make an apples-to-apples comparison, it would be better to group methods using a particular type of input data. For example, EgoEgo should be compared against the DSPoser with only the head tracking inputs. Similarly, the Bodiffusion could also use the MAE imputation. Considering that the VQ-Diffusion and the motion tokenizer are taken from the prior work, MAE as being the main contribution could be better highlighted in this way. The “naive Bodiffusion extension” is simply too naive. Stronger baselines could be introduced.

2- The runtime performance analysis is missing. Considering that the proposed problem setting aims for real-time applications, a masked autoencoder with additional uncertainty computations followed by a diffusion inference is not the optimal candidate. I acknowledge that this is covered in the limitations section. What could be done about it?

3- After reading the “Uncertainty-aware MAE” section, I assumed the total uncertainty (Eq. 6) is proposed. The ablation study, however, reveals that the aleatoric uncertainty gives better performance. Is it the one used in the experiments?

4- A supplementary video with qualitative comparisons would be very helpful.

5- [1*] and [2*] (as concurrent work) could also be covered in the related work section.

These are not weaknesses but suggestions:

  • I think mixture density networks [3*] could be applied in this setting which would also simplify the story around the uncertainty. The network predicts parameters of a Gaussian Mixture Model, not very different from the current aleatoric uncertainty.
  • Line #243: “The dropout strategy achieves...” -> “The sampling strategy achieves...”
[1*] Du, Yuming, et al. "Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2*] Dai, Peng, et al. "HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[3*] Bishop, Christopher M. "Mixture density networks." (1994).

Questions

Please see my questions in the previous section and address my main concern on the evaluations.

Limitations

Yes.

Author Response

We appreciate the reviewer's insightful feedback on our paper. We will improve the draft based on the comments.

Q1: Stronger baselines & Reorganization of Table 1 and 2

Due to limited time and resources, we are currently only able to provide baseline results on the AMASS dataset. We are working on training the baseline methods on the Ego-Exo4D dataset and will update the results as soon as they are completed. Please refer to Q1 in the Author Rebuttal for our response on stronger baselines.

Thank you for your valuable feedback. We appreciate the suggestions for improving the structure and fairness of our evaluations. Since we now have more baselines, as shown in Table A in the pdf, we will reorganize the tables for a fairer comparison and easier understanding.

Q2: The runtime performance analysis, and possible direction.

Please refer to the Q2 of the Author Rebuttal.

Thank you for your feedback regarding the runtime performance analysis. We understand the importance of computational efficiency, especially for real-time applications.

In Table B in the pdf and Q2 of the Author Rebuttal, we present a detailed comparison of the computational complexity of VQ-VAE, MAE, and VQ-Diffusion modules, highlighting the number of parameters, multiply-accumulate operations (MACs), and inference time. Our results show that the overhead introduced by the MAE, including uncertainty computations, is minimal compared to the significant overhead from the diffusion process. Specifically, MAE adds only 3 ms to the inference time, which is negligible compared to the 955 ms required by the VQ-Diffusion module. Therefore, the primary computational burden arises from the diffusion process rather than the MAE with uncertainty computations.

When we chose the diffusion model, we recognized the heavy computational cost but concluded it was more appropriate for solving the under-constrained problem of ego-body pose estimation. To mitigate this heavy computation issue, we selected the VQ-Diffusion method, which denoises in discretized latent spaces and is considered more computationally efficient. Additionally, as shown in Table C in the pdf, our approach allows skipping denoising steps using the reparameterization trick, following methods from [1] and [2]. The results show that our method provides four times faster options, effectively balancing the trade-off between MPJPE and inference time.

Recent research, such as [3] and [4], has focused on improving the speed of diffusion inference. We expect that as these advancements continue, the diffusion model's versatility and extendability will become even more beneficial, reducing the cost of the diffusion sampling process.

[1] Austin, Jacob, et al. "Structured denoising diffusion models in discrete state-spaces." Advances in Neural Information Processing Systems 34 (2021): 17981-17993.

[2] Gu, Shuyang, et al. "Vector quantized diffusion model for text-to-image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[3] Zheng, Hongkai, et al. "Fast sampling of diffusion models via operator learning." International conference on machine learning. PMLR, 2023.

[4] Yin, Tianwei, et al. "One-step diffusion with distribution matching distillation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Q3: Clarification on Uncertainty

Yes, aleatoric uncertainty with a sampling strategy is used in our Table 1 and Table 2. To avoid confusion, we will clarify the strategy and the type of uncertainty at the beginning of Section 3.3.

Q4: Qualitative comparisons against state-of-the-art approaches

We appreciate your comment on the importance of qualitative comparisons. While we have visualized our qualitative results in a video format, we found that it is not permitted to include links during the rebuttal phase. As an alternative, we have included detailed comparative qualitative results in the PDF file attached to this rebuttal. We hope this additional information provides clarity and supports the quantitative improvements we reported. We also ensure that we will add video comparisons against state-of-the-art methods in the final version.

Q5: Suggestion of concurrent related works

Thank you for the suggestion. We will include suggested references in the related work section to ensure comprehensive coverage of concurrent research.

Q6: Mixture Density Network as Aleatoric Uncertainty

\begin{array}{l|ccc} \hline Methods & MPJPE & MPJVE & MPJRE \\ \hline \hline \text{w/o Uncertainty}&6.05\pm0.01&30.12 \pm 0.04&4.36 \pm 0.00 \\ \text{DSPoser w/ MDN}&5.84\pm0.05&28.34 \pm 0.15&4.72 \pm 0.04 \\ \text{DSPoser (Ours)}&5.51\pm0.02&24.19 \pm 0.10&4.09 \pm 0.02 \\\hline \end{array}

Thank you for the interesting idea for uncertainty measurement. We report the result of the whole pipeline after substituting the head of our MAE with a Mixture Density Network (MDN), setting the number of mixtures $M$ to 4 for a fair comparison. Similar to the calculation of aleatoric uncertainty in our paper, we measure the aleatoric uncertainty of the MDN as $\mathcal{U}_{ale}(\mathbf{x}) \approx M^{-1} \sum_i \pi_i\sigma_i^2 (\mathbf{x})$, where $\pi_i$ are the mixture weights. The results show that the MDN improves performance compared to our method without uncertainty; however, it performs worse than the MAE approach.

\begin{array}{cc} \hline Methods & MPJPE \\\hline MDN & 13.45 \\ MAE (Ours) & 10.85 \\\hline \end{array}

To investigate the performance difference, we also analyzed the results of the temporal completion of hand trajectories. Unlike the MAE, the MDN loss often diverged, so we early-stopped MDN training at 800 epochs, leading to a worse hand-trajectory MPJPE in contrast to the MAE model trained for 4000 epochs.
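For reference, below is a minimal sketch of the MDN head used in this comparison (our assumed diagonal-Gaussian formulation following Bishop, 1994; layer sizes are illustrative), including the aleatoric-uncertainty estimate given above:

```python
# Minimal sketch (assumptions, not the authors' code): a diagonal-Gaussian MDN head
# whose aleatoric uncertainty is U_ale(x) ≈ (1/M) * sum_i pi_i * sigma_i^2(x).
import torch
import torch.nn as nn


class MDNHead(nn.Module):
    def __init__(self, feat_dim=256, out_dim=6, n_mix=4):
        super().__init__()
        self.n_mix, self.out_dim = n_mix, out_dim
        self.pi = nn.Linear(feat_dim, n_mix)                  # mixture logits
        self.mu = nn.Linear(feat_dim, n_mix * out_dim)        # component means
        self.log_sigma = nn.Linear(feat_dim, n_mix * out_dim)

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)                             # (B, M)
        mu = self.mu(h).view(-1, self.n_mix, self.out_dim)                 # (B, M, D)
        sigma2 = torch.exp(2 * self.log_sigma(h)).view(-1, self.n_mix, self.out_dim)
        return pi, mu, sigma2

    def aleatoric_uncertainty(self, h):
        pi, _, sigma2 = self(h)
        # U_ale ≈ M^{-1} sum_i pi_i * sigma_i^2, averaged over output dimensions.
        return (pi.unsqueeze(-1) * sigma2).sum(dim=1).mean(dim=-1) / self.n_mix
```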

Q7: Typos.

We will revise the sentence as suggested.

Comment

We've completed the baseline experiments on the Ego-Exo4D dataset and would like to share the results.

\begin{array}{l|c|ccc} \hline Methods & Imputation & MPJPE & MPJVE & Jitter \\\hline \text{AvatarPoser [10]} & \text{Interpolation} & 47.28 & 89.34 & 65.39 \\ \text{Bodiffusion [3]} & \text{Interpolation} & 59.81 & 120.12 & 142.32 \\ \text{AvatarJLM [34]} & \text{Interpolation} & 43.01 & 61.98 & 54.23 \\ \text{AvatarPoser [10]} & \text{MAE} & 24.54 & 62.34 & 44.24 \\ \text{Bodiffusion [3]} & \text{MAE} & 22.12 & 53.30 & 93.80 \\ \text{AvatarJLM [34]} & \text{MAE} & 21.08 & 45.77 & 39.04 \\\hline \text{DSPoser (Ours)} & \text{MAE} & 16.84\pm0.04 & 39.86\pm0.05 & 19.21\pm0.04 \\\hline \end{array}

Comment

I thank the authors for their rebuttal and also for providing new experiments! I read the other reviews as well. All my concerns except one are addressed.

As the reviewer 2EAc pointed out, I'd appreciate if the authors clarify the inference setting in the evaluations. The proposed DSPoser's jitter performance seems significantly better than the baselines (pity there is no video illustrating this performance). How is this achieved? Ideally, the model should be evaluated in an online setting where only a single frame is available at a time, with no constraints on the number of past frames. In other words, to make a 40-frame prediction, the model must be run 40 times by sliding the input window 1 frame at a time. Is this the case?

Is the inference time per frame or total time for 40 frames?

Comment

Clarification on inference setting in the evaluations.

Thank you for your feedback on our evaluation protocol. We would like to clarify that our method follows the BoDiffusion protocol, which uses a sliding step of 20. This approach averages the overlapping frames, producing a 40-frame output for each sliding step of 20. In contrast, AvatarPoser and AvatarJLM use a 1-step sliding window, focusing only on the final frame for real-time applications.

Is the inference time per frame or total time for 40 frames?

Our inference time is measured for generating 40 frames. However, unlike AvatarPoser and AvatarJLM, our method's input and output lengths are fixed (40 frames). Therefore, the answer to your question is 'yes'; however, since the question may assume an online-inference setup, we would like to clarify that the inference time listed in Table B also represents the per-frame time in an online setup.

Additional experiments under real-time inference setup.

Recognizing that our initial comparison in the previous rebuttal was not conducted under fair conditions, we performed additional experiments to ensure a fair comparison with the baselines.

\begin{array}{l|c|c|c|cccc} \hline Methods & \text{Imputation} & \text{Sliding step} & \text{Averaging} & MPJPE & MPJVE & MPJRE & Jitter \\\hline \text{AvatarPoser [10]} & \text{MAE} & \text{1} & & 9.88 & 62.31 & 5.98 & 37.89 \\ \text{BoDiffusion [3]} & \text{MAE} & \text{20} & \text{temporal avg} & 7.35 & 31.33 & 5.47 & 1254.84 \\ \text{AvatarJLM [34]} & \text{MAE} & \text{1} & & 7.12 & 37.60 & 5.24 & 16.95 \\\hline \text{DSPoser (Paper)} & \text{MAE} & \text{20} & \text{temporal avg} & 5.51 & 24.19 & 4.09 & 4.27 \\ \text{DSPoser (\#1)} & \text{MAE} & \text{1} & & 5.87 & 52.38 & 4.31 & 34.12 \\ \text{DSPoser (\#2)} & \text{MAE} & \text{1} & \text{temporal avg} & 5.23 & 21.73 & 3.83 & 5.94 \\ \text{DSPoser (\#3)} & \text{MAE} & \text{1} & \text{4 samples} & 5.68 & 29.48 & 4.23 & 12.98 \\\hline \end{array}

Before discussing the results, we want to clarify that our primary focus is on introducing two key aspects: (1) the underexplored problem of doubly sparse video data, and (2) a generic multi-stage framework to address such problems. The specific choices within our framework (e.g. VQ-Diffusion, MAE, and the sliding window) were deliberately kept simple to demonstrate the efficacy of the intermittent tracking signal and our multi-stage framework. As reviewers jBAY and 2EAc noted, the computational demands of diffusion models pose concerns for real-time applications. While we provided baseline results to justify our design choices, we want to emphasize that alternatives such as AvatarJLM and AvatarPoser (instead of VQ-Diffusion style pose estimation algorithms used in the submission) are also viable within our multi-stage framework. These alternatives can balance time complexity and accuracy, making them suitable for real-time applications.

Exp. #1, fair comparison.

We first tested our method using the evaluation protocol of AvatarPoser and AvatarJLM, with a sliding window step of 1 where only the final output frame (current frame) is used for each step. Our method performed worse in Jitter and MPJVE compared to AvatarJLM but showed better performance in MPJPE and MPJRE. We believe this drop is due to the probabilistic nature of our method, unlike the deterministic approach of AvatarJLM and AvatarPoser.

Exp. #2, temporal averaging

Next, we modified Exp #1 to better utilize the diversity from our uncertainty modeling by averaging overlapping frames while advancing the sliding window. This significantly reduced errors across all metrics. However, when it comes to real-time inference, this improvement does not affect the current frame but only applies to historical frames, which are not useful for real-time inference.

Exp. #3, multiple sampling

Finally, we implemented a multiple sampling approach, averaging four samples for each step using the end frames (current frames). This method outperformed AvatarJLM across all metrics and can be efficiently implemented with parallel processing, resulting in minimal time overhead from #1.

In conclusion, we agree with reviewer 2EAc that our method's better performance in MPJVE and Jitter is due to the difference in sliding steps. However, based on Exp #2 and Exp #3, the performance drop in Exp #1 appears to stem from our framework's ability to generate diverse motions from the same input. This issue can be mitigated by sampling multiple times with parallel processing and averaging the results.

Comment

Dear Reviewer jBAY,

We want to express our sincere gratitude for your insightful feedback on our manuscript.

As the author-reviewer discussion period is coming to an end, we would like to ask whether our responses have addressed your concerns. We look forward to further discussion of any remaining questions you might have!

Thank you once again for your time and valuable input. Your comments have significantly contributed to improving our work.

Best regards,

Authors

Official Review
Rating: 7

This paper proposes a system to estimate full-body pose from forward-facing egocentric videos. Dubbed “doubly sparse video data,” such data streams have the distinct characteristic that only the headset pose is persistent, while the hand pose estimation is only occasionally available. The proposed method first infills the hand motion from estimated hand motion from video information, and then uses the infilled motion (with estimated uncertainty) to estimate the full-body pose. The full-body pose estimation is built upon VQ-VAE representation and VQ-diffusion. Experiments show that the proposed pipeline outperforms SOTA methods.

Strengths

  • This work is very well-motivated; estimating the full-body pose from sparse egocentric views could have many applications in AR/VR and animation. The task is also very challenging, as hands are only visible in very few frames.
  • The proposed system is a complete solution to estimate full-body pose (including fingers) from egocentric videos and head tracking. It leverages the persistent signals (headset tracking) and occasional signals (hand pose) well by formulating it as a probabilistic infilling problem. The infilled hand motion then serves as input to a diffusion-based full-body pose estimator.
    • I find the uncertainty formulation a great addition to the current literature. While most methods just use diffusion-based pose estimation plus masking, the proposed MAE solution seems to be a principled way of obtaining a better hand pose trajectory based on sparse input.
  • Experimental results on the Ego-Exo-4D dataset and AMASS show that the proposed method outperforms SOTA methods. The evaluation is extensive and shows the results from the method well. Showing the results of using dense information (Table 3) also demonstrates the strength of the proposed method.

Weaknesses

  • For pose estimation, it would be very beneficial to provide estimated motion as videos to better judge the quality of the estimated pose and motion. Not providing videos weakens this work.
  • Since a complex system is proposed, it would be great to see some ablation about VQ-VAE.
  • Looking at Figure 7 in the appendix, it appears to me that the proposed method could be overfitting. There is no information on kicking the feet up for the human, but the estimated pose is kicking the feet up.

Questions

If possible, some metric on the velocity/acceleration error of the estimated motion would help indicate the smoothness of the motion.

Limitations

Limitations are addressed.

Author Response

We appreciate the acknowledgement of our motivation and the novelty of our method. We will improve the draft based on the reviewer's valuable comments.

Q1: Qualitative comparisons against state-of-the-art approaches and video visualizations of the results.

We appreciate your comment on the importance of qualitative comparisons. While we have visualized our qualitative results in a video format, we found that it is not permitted to include links during the rebuttal phase. As an alternative, we have included detailed comparative qualitative results in the PDF file attached to this rebuttal. We hope this additional information provides clarity and supports the quantitative improvements we reported. We also ensure that we will add video comparisons against state-of-the-art methods in the final version.

Q2: VQ-VAE ablation studies.

\begin{array}{l|c|cccc} \hline Methods & \text{Pipeline} & MPJPE & MPJVE & MPJRE & Jitter \\ \hline \hline \text{BoDiffusion [3]} & \text{MAE}  + \text{Skeleton Space Diffusion} & 7.35 & 31.33 & 5.47 & 1254.84 \\ \text{AvatarJLM [34]} & \text{MAE} + \text{Transformer} & 7.12 & 37.60 & 5.24 & 16.95 \\ DSPoser (ours) & \text{MAE} + \text{VQ-Diffusion} + \text{VQ-Decoder} &5.51 & 24.19 & 4.09 & 4.27 \\ \hline \end{array}

To evaluate the effectiveness of our pipeline, we implemented baseline models to tackle the pose estimation problem given doubly sparse data. For these baseline models, we introduced MAE at the initial stage of their methods to complete the temporally sparse data, then fed the imputed trajectory into their respective models. The implementation of BoDiffusion as a baseline can serve as an ablation study for our VQ-VAE, as it applies the diffusion process directly to the skeleton space, contrasting with our approach of applying the diffusion process on the Vector-Quantized latent space. Additionally, AvatarJLM can be considered another ablation study, as it utilizes a Transformer instead of a diffusion model to complete the sparse data. These results demonstrate that VQ-VAE outperforms the other architectural options.

Q3: Overfitting issue.

Thank you for your observation and insightful feedback. While it is possible that overfitting could explain this observation, there are other plausible explanations. To the best of our knowledge, when a human moves, the motion of each joint influences the others. Previous research, such as [1] and [2], has demonstrated that joint movements are not only connected to adjacent joints through bones (explicit relationships) but also highly related to distant joints that are not directly connected (implicit relationships) in a certain motion context. Additionally, in [3], even though only the head position is utilized for whole-body estimation, the results often show very accurate lower-body movements. This indicates that even if there appears to be no direct information for reconstructing the kicking motion in our visualized example, the intermittent hand observations and dense head trajectory data may provide sufficient information to reconstruct the kicking motion. Therefore, what may seem like overfitting could actually be the model leveraging these implicit relationships to generate a plausible motion sequence.

[1] Chi, Hyung-gun, et al. "Infogcn: Representation learning for human skeleton-based action recognition." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[2] Wu, Zhize, et al. "SelfGCN: Graph Convolution Network with Self-Attention for Skeleton-based Action Recognition." IEEE Transactions on Image Processing (2024).

[3] Li, Jiaman, Karen Liu, and Jiajun Wu. "Ego-body pose estimation via ego-head pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Q4: Evaluation on the velocity/acceleration metric.

\begin{array}{l|c|cccc} \hline Methods&\text{Imputation}&\text{MPJPE}&\text{MPJVE}&\text{MPJRE}&\text{Jitter} \\ \hline \text{AvatarPoser [10]}&\text{Interpolation}&40.42&64.07&16.37&27.89 \\ \text{Bodiffusion [3]}&\text{Interpolation}&46.45&75.33&17.99&2793.32 \\ \text{AvatarJLM [34]}&\text{Interpolation}&25.02&68.42&14.14&32.18 \\ \text{AvatarPoser [10]}&\text{MAE}&9.88&62.31&5.98&37.89 \\ \text{Bodiffusion [3]}&\text{MAE}&7.35&31.33&5.47&1254.84 \\ \text{AvatarJLM [34]}&\text{MAE}&7.12&37.60&5.24&16.95 \\\hline \text{DSPoser (Ours)}&\text{MAE}&5.51\pm0.02&24.19\pm0.10&4.09\pm0.02&4.27\pm0.03 \\\hline \end{array}

\begin{array}{l|c|cccc} \hline \text{Methods} & \mathbf{y} & \text{MPJPE} & \text{MPJVE} & \text{MPJRE} & \text{Jitter} \\ \hline \hline \text{GT} & - & - & - & - & 4.01 \\ \text{VQ-VAE (Paper)} & \text{Full body} & 1.26 & 11.37 & 1.81 & 3.93 \\ \text{VQ-VAE (Opt'ed)} & \text{Full body} & 1.15 & 10.59 & 1.67 & 3.89 \\ \hline \text{AvatarPoser [10]} & \text{Dense traj.} & 4.18 & 29.40 & 3.21 & 13.63 \\ \text{Bodiffusion [3]} & \text{Dense traj.} & 3.63 & \mathbf{14.39} & \mathbf{\textcolor{blue}{2.70}} & 493.78 \\ \text{AvatarJLM [34]} & \text{Dense traj.} & \mathbf{3.35} & 20.79 & 2.90 & \mathbf{\textcolor{blue}{8.39}} \\ \hline \mathbf{DSPoser (Paper)} & \text{Dense traj.} & 3.61 \pm 0.01 & 18.36 \pm 0.03 & 2.81 \pm 0.02 & 4.08 \pm 0.02 \\ \mathbf{DSPoser (Opt'ed)} & \text{Dense traj.} & \mathbf{\textcolor{blue}{3.48 \pm 0.01}} & \mathbf{\textcolor{blue}{17.86 \pm 0.03}} & \mathbf{2.68 \pm 0.02} & \mathbf{4.03 \pm 0.02} \\ \hline \end{array}

We deeply appreciate your valuable comments on our paper. Thanks to your suggestion, we found that our method shows significantly better performance on the Jitter metric, which is often used to measure the smoothness of motion. Jitter measures jerk, i.e., the derivative of acceleration, while MPJVE, already reported in our paper, indicates the velocity error. As seen in the table above and in the VQ-VAE ablation study of Q2, our method produces smoother results than the other methods.

Comment

I thank the authors for the detailed response and additional experiments. My concerns are addressed and I would like to raise my score to Accept.

Comment

Thank you for your thoughtful feedback and for taking the time to review our additional experiments. We appreciate your support and are pleased that our response addressed your concerns. We're grateful for your decision to raise the score.

Official Review
Rating: 7

The paper introduces an approach for estimating full-body pose from egocentric videos combined with sparse head and hand positions. The key contribution lies in utilizing sparse temporal annotations of hand positions to achieve a complete representation. The method is evaluated on the publicly available Ego-Exo4D and AMASS datasets, demonstrating performance that surpasses current state-of-the-art approaches.

Strengths

  • Relevance of the task/scope of the paper:

    • The task of estimating full-body pose from egocentric video and sparse positional signals is relevant for the NeurIPS community, and the paper presents this relevance adequately.
  • Technical novelty of the approach:

    • The approach utilizes both images and sparse information from the hands and head, which is a novel idea. The use of a masked autoencoder to complete the information from the hands is also innovative.
  • Technical correctness of the paper:

    • The methods section describes the proposed approach clearly.
  • Related Work:

    • The review of related work is comprehensive, covering all relevant literature.
  • Experimental validation:

    • The experiments are evaluated on the AMASS and Ego-Exo4D datasets, with a thorough comparison against current state-of-the-art approaches. The quantitative results show improvement over previous methods.
  • Writing and presentation:

    • The paper is mostly well-written, and the ideas are conveyed clearly despite a few typographical errors.

Weaknesses

Weaknesses:

  • Technical contributions:

    • The key contributions could be consolidated into one. The first and second contributions are similar, and the third discusses the potential for AR experiences without providing experiments involving real-life AR devices.
  • Experimental validation:

    • The paper lacks qualitative comparisons against state-of-the-art approaches and video visualizations of the results. Given the low MPJPE and MPJRE values in current methods, visual assessments are important to ensure the improvements are not just quantitative artifacts.

Questions

Justification:

Overall, the paper has clear contributions and thorough quantitative comparisons against previous state-of-the-art approaches. However, the claims about the method's use in AR and VR applications could be substantiated better, and there is a lack of qualitative comparisons.

Additional comments:

  • There are missing articles (e.g., "the") in multiple instances (L84, L124).
  • The phrase "using the transformer architecture" in L155 is redundant.
  • Correct "L161 strategies" and "L45 signals".

Limitations

The paper addresses the limitations.

One of the main limitations is the lack of publicly available code, which will hinder progress in the community.

Author Response

We appreciate your acknowledgment of our paper's novelty and your valuable suggestions for improving our method. We fully recognize the importance of open-source code in advancing research, as this paper is also built upon the publicly available code of other researchers.

Q1: Key contributions can be consolidated into one & no experiments involving real-life AR devices.

We appreciate the feedback regarding the similarity between our first and second contributions. Our primary aim was to highlight different aspects of our approach, but we understand the need for clarity and consolidation. Additionally, while the dataset we used in our experiments, EgoExo4D, is collected using Meta's Aria Devices (an HMD/AR device) and demonstrates the potential for AR applications, we recognize that explicit experiments involving real-life AR devices would strengthen our claims. We will consider this in future work.

Q2: Qualitative comparisons against state-of-the-art approaches and video visualizations of the results.

We appreciate your comment on the importance of qualitative comparisons. While we have visualized our qualitative results in a video format, we found that it is not permitted to include links during the rebuttal phase. As an alternative, we have included detailed comparative qualitative results in the PDF file attached to this rebuttal. We hope this additional information provides clarity and supports the quantitative improvements we reported. We also ensure that we will add video comparisons against state-of-the-art methods in the final version.

Q3: Lack of publicly available code.

We recognize the value of open-source code and plan to release ours upon acceptance. Our organizational policy prioritizes careful review and preparation before any public release. In the meantime, we're committed to providing detailed methodologies in our publications to support reproducibility and further research in the field.

Q4: Misc.

Thank you for pointing out the grammar mistakes and writing issues. Following the reviewer's comments, we will:

  1. Conduct a thorough grammar review of the manuscript, including the ones you mentioned.

  2. Remove the phrase "using the transformer architecture" in L155.

We believe these changes will improve the clarity and readability of our paper.

Comment

The authors' responses appropriately addressed my concerns, and after reading the other comments from the reviewers, I would like to maintain my initial rating.

Comment

Thank you for your thoughtful review and for confirming that our responses addressed your concerns. We appreciate your decision to maintain your initial rating and value the feedback you have provided. Thank you again for your time and consideration.

Author Rebuttal

We thank all the reviewers for their time and effort in helping us improve the paper. We appreciate your acknowledgment of the novelty and valuable suggestions to improve our method. In this rebuttal, we want to clarify a few common questions raised by reviewers. Please note that experiments are conducted on the AMASS dataset unless otherwise stated.

Q1: The Necessity of Stronger Baselines

\begin{array}{l|c|cccc} \hline Methods&\text{Imputation}&\text{MPJPE}&\text{MPJVE}&\text{MPJRE}&\text{Jitter} \\ \hline \text{AvatarPoser [10]}&\text{Interpolation}&40.42&64.07&16.37&27.89 \\ \text{Bodiffusion [3]}&\text{Interpolation}&46.45&75.33&17.99&2793.32 \\ \text{AvatarJLM [34]}&\text{Interpolation}&25.02&68.42&14.14&32.18 \\ \text{AvatarPoser [10]}&\text{MAE}&9.88&62.31&5.98&37.89 \\ \text{Bodiffusion [3]}&\text{MAE}&7.35&31.33&5.47&1254.84 \\ \text{AvatarJLM [34]}&\text{MAE}&7.12&37.60&5.24&16.95 \\\hline \text{DSPoser (Ours)}&\text{MAE}&5.51\pm0.02&24.19\pm0.10&4.09\pm0.02&4.27\pm0.03 \\\hline \end{array}

Please find the detailed table in the pdf file.

We acknowledge that stronger baselines are needed to prove the effectiveness of our methods. Therefore, we implemented the baselines with linear interpolation and MAE imputation. The table demonstrates that our method outperforms these baselines.

Q2: Motivation of Diffusion Model & Its Computational Complexity

We appreciate the question regarding our choice of diffusion models despite potential limitations for real-time applications. We have two reasons for choosing a diffusion model for this task:

First, we considered the ego-body pose estimation task to be an under-constrained problem, so we chose a diffusion model to leverage the inherent uncertainty of the task.

Second, when we first designed our framework, we planned to incorporate multi-modal inputs, such as image features from ego-centric videos and gaze information, in addition to sparse hand data. This integration would enable our model to generate diverse motion sequences conditioned on multi-modal inputs, which are available in the Ego-Exo4D data.

We believe that the diffusion model's versatility and extendability make it well-suited for these types of multi-modal integrations. This potential for extension reinforces our choice of diffusion models as a foundational element in our research, providing a flexible and powerful tool for future developments in this field.

\begin{array}{c|ccc} \hline Module&\# of Params&MACs&Time \\ \hline \text{VQ-VAE}&17.9 \text{ M}&3.6 \text{ G}&3 \text{ ms} \\ \text{MAE}&51.3 \text{ M}&23.3 \text{ G}&4 \text{ ms} \\ \text{VQ-Diffusion}&74.2 \text{ M}&1190.2 \text{ G}&958 \text{ ms} \\ \hline \end{array}

\begin{array}{c|cccc|c} \hline \text{Infer. Steps}&\text{Train. Steps 25}&\text{Train. Steps 33}&\text{Train. Steps 50}&\text{Train. Steps 100}&\text{Infer. Time (ms)}\\\hline 25&5.83&5.92&5.69&8.72&278\\ 33&-&5.67&5.63&5.58&348\\ 50&-&-&5.61&5.53&522\\ 100&-&-&-&5.51&1013\\\hline \end{array}

We appreciate the feedback and recognize the importance of including metrics related to computational complexity. While we have roughly provided the inference time in the supplementary material following the NeurIPS 2024 submission policy, we agree that a more detailed comparison would be beneficial. We will incorporate these details in the revised version of our paper to provide a thorough evaluation of our method's computational complexity.
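For reproducibility, these metrics can be gathered with a simple profiling routine such as the sketch below (illustrative only; it assumes a PyTorch module and, optionally, the thop package for MAC counting):

```python
# Minimal sketch (not the exact profiling script) for parameter count, MACs,
# and average inference time of a PyTorch module.
import time
import torch


def profile_module(model, example_input, n_runs=50):
    n_params = sum(p.numel() for p in model.parameters())

    try:
        from thop import profile  # optional dependency for MAC counting
        macs, _ = profile(model, inputs=(example_input,))
    except ImportError:
        macs = None

    model.eval()
    with torch.no_grad():
        for _ in range(5):                 # warm-up runs
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(example_input)
        if example_input.is_cuda:
            torch.cuda.synchronize()
    avg_ms = (time.perf_counter() - start) * 1000 / n_runs
    return n_params, macs, avg_ms
```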

Q3: Evaluation on Velocity and Acceleration.

\begin{array}{l|c|cccc} \hline \text{Methods}&\mathbf{y}&\text{MPJPE}&\text{MPJVE}&\text{MPJRE}&\text{Jitter} \\\hline \text{GT}&-&-&-&-&4.01 \\ \text{VQ-VAE (Paper)}&\text{Full body}&1.26&11.37&1.81&3.93 \\ \text{VQ-VAE (Opt'ed)}&\text{Full body}&1.15&10.59&1.67&3.89 \\\hline \text{AvatarPoser [10]}&\text{Dense traj.}&4.18&29.40&3.21&13.63 \\ \text{Bodiffusion [3]}&\text{Dense traj.}&3.63&\mathbf{14.39}&\mathbf{\textcolor{blue}{2.70}}&493.78 \\ \text{AvatarJLM [34]}&\text{Dense traj.}&\mathbf{3.35}&20.79&2.90&\mathbf{\textcolor{blue}{8.39}} \\\hline \mathbf{DSPoser (Paper)}&\text{Dense traj.}&3.61\pm0.01&18.36\pm0.03&2.81\pm0.02&4.08\pm0.02 \\ \mathbf{DSPoser (Opt'ed)}&\text{Dense traj.}&\mathbf{\textcolor{blue}{3.48\pm0.01}}&\mathbf{\textcolor{blue}{17.86\pm0.03}}&\mathbf{2.68\pm0.02}&\mathbf{4.03\pm0.02} \\\hline \end{array}

We found that our method shows significantly better performance on the Jitter metric, which is often used to measure the smoothness of motion. Jitter measures jerk, i.e., the derivative of acceleration, while MPJVE indicates the velocity error, as reported in our paper. As seen in the table above and the table in Q1, our method produces smoother results than other methods, closely matching the smoothness of the ground truth, while other methods show significantly higher Jitter values.
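For completeness, the sketch below shows how these metrics can be computed from joint-position sequences (an illustrative implementation with finite differences; units and scaling are assumptions, not necessarily the paper's exact convention):

```python
# Minimal sketch of the reported metrics: MPJPE (position error), MPJVE
# (velocity error), and Jitter (magnitude of jerk, the derivative of acceleration).
import numpy as np


def motion_metrics(pred, gt, fps=30.0):
    """pred, gt: (T, J, 3) joint positions."""
    mpjpe = np.linalg.norm(pred - gt, axis=-1).mean()

    vel_p = (pred[1:] - pred[:-1]) * fps
    vel_g = (gt[1:] - gt[:-1]) * fps
    mpjve = np.linalg.norm(vel_p - vel_g, axis=-1).mean()

    # Jerk of the predicted motion: third finite difference of position.
    jerk = np.diff(pred, n=3, axis=0) * fps ** 3
    jitter = np.linalg.norm(jerk, axis=-1).mean()
    return mpjpe, mpjve, jitter
```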

Q4: Qualitative comparisons against state-of-the-art approaches and video visualizations of the results.

We appreciate your comment on the importance of qualitative comparisons. While we have visualized our qualitative results in a video format, we found that it is not permitted to include links during the rebuttal phase. As an alternative, we have included detailed comparative qualitative results in the PDF file attached to this rebuttal. We hope this additional information provides clarity and supports the quantitative improvements we reported. We also ensure that we will add video comparisons against state-of-the-art methods in the revised version.

Q5: Lack of publicly available code.

We recognize the value of open-source code and plan to release ours upon acceptance. Our organizational policy prioritizes careful review and preparation before any public release. In the meantime, we're committed to providing detailed methodologies in our publications to support reproducibility and further research in the field.

Final Decision

The submission initially received generally positive reviews; after the rebuttal, reviewers overall became slightly more positive about the submission. The AC agrees with the shared recommendation. The authors should incorporate the rebuttal into the camera-ready.