PaperHub
Rating: 6.6/10
Poster · 4 reviewers
Lowest: 2 · Highest: 4 · Standard deviation: 0.9
Individual ratings: 2, 4, 4, 4
ICML 2025

ADHMR: Aligning Diffusion-based Human Mesh Recovery via Direct Preference Optimization

Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

A framework to align diffusion-based human mesh recovery methods via direct preference optimization.

Keywords
human mesh recovery, direct preference optimization

Reviews and Discussion

Review
Rating: 2

The authors propose a framework that integrates a diffusion-based human mesh recovery model with direct preference optimization. The core idea is to train HMR-Scorer, a model that evaluates the quality of human mesh predictions without requiring 3D annotations, and use it to create a preference dataset. This dataset consists of pairs of high- and low-quality predictions, which are then used to fine-tune the base model via preference optimization. Additionally, ADHMR leverages HMR-Scorer for data cleaning, which improves model accuracy. Experimental results show that ADHMR outperforms state-of-the-art probabilistic HMR methods.

Questions for Authors

  • A comparison with diffusion-based methods such as ScoreHMR and CloseInt needs to be included.
  • There is no analysis of whether this method generalizes well across different poses, occlusions, or camera angles.
  • The effectiveness of HMR-Scorer is a critical component of ADHMR, yet the experiments evaluating its reliability are somewhat limited. While Table 1 reports correlation metrics (PLCC and SRCC) between HMR-Scorer’s scores and standard reconstruction errors, a more detailed analysis of failure cases is absent. For example, does HMR-Scorer struggle with extreme poses, occlusions, or rare human-object interactions? Qualitative examples of cases where HMR-Scorer fails would strengthen the validation.

Claims and Evidence

Some claims in the paper are incorrect. For example, "This is primarily because diffusion-based HMR models cannot use traditional 3D joint reprojection loss." However, many existing works [A,B] successfully use reprojection loss to guide the denoising process. Additionally, depending on the camera model, there is a trade-off between 3D joint accuracy and 2D alignment [C]. What camera model does this method use? The proposed approach does not consider reprojection error—how can it ensure that the final results are consistent with the input image?

[A] "Score-guided diffusion for 3d human recovery." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[B] "Closely interactive human reconstruction with proxemics and physics-guided adaption." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[C] "Tokenhmr: Advancing human mesh recovery with a tokenized pose representation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Methods and Evaluation Criteria

  • This work builds on top of HypoNet. What are the differences between this method and HypoNet? What is UVD in Line 176?
  • This base model is trained on accurate indoor datasets and synthetic datasets. How can it ensure accurate scoring on in-the-wild images?

Theoretical Claims

The improvement in training heavily depends on the performance of HMR-Scorer. However, this module is difficult to train on outdoor data, making its effectiveness in outdoor scenarios uncertain.

Experimental Design and Analyses

  • The experimental comparison is unfair. The proposed method uses different training data than the baselines. Moreover, it requires fine-tuning on the benchmark, which gives it an unfair advantage over other baselines. For example, HMR2.0 could also be fine-tuned on the benchmark using reprojection error.

  • Add more details, such as the specific formulas for PLCC and SRCC.

  • Which datasets were used for data cleaning?

Supplementary Material

This paper does not provide any supplementary materials.

Relation to Existing Literature

This work may be insightful for future pseudo-label generation in human pose estimation.

Essential References Not Discussed

These works use reprojection error to guide the diffusion, which can achieve model-image alignment.

[A] "Score-guided diffusion for 3d human recovery." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[B] "Closely interactive human reconstruction with proxemics and physics-guided adaption." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

Thanks for recognizing our SOTA performance, and the potential for pseudo-label generation. We truly appreciate your constructive comments and address them below.

Q1. Clarification of claims

Thank you for pointing this out. We would like to clarify that: end-to-end diffusion models predicting from noise typically avoid reprojection loss, as early denoising steps yield unrealistic poses, making such loss ineffective. CloseInt also notes: “Early iterations are meaningless for human motion.”

In contrast to ScoreHMR and CloseInt, which are two-stage (using external pose estimators before diffusion), our method is purely end-to-end, sampling from random noise. This avoids dependency on potentially biased external estimators and improves robustness on in-the-wild (ITW) data. Visual comparison with ScoreHMR:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/mE1y_Q1.pdf

We will revise the phrasing in the final version.

Consistency with the input image. Our method directly learns 3D human pose distribution conditioned on images without needing an explicit camera model or reprojection loss in training. We follow ScoreHypo to predict UVD joints used for mesh reconstruction and scoring. At inference, we recover 3D joints from UVD using a full-perspective projection with estimated intrinsics.

As TokenHMR notes, minimizing 2D error can harm 3D accuracy due to imprecise camera parameters, even with full-perspective cameras. Instead, we ensure consistency implicitly via HMR-Scorer, which learns alignment using pixel-aligned image features. On clean data, scoring by 3D error naturally favors well-aligned results; on ITW data, the scorer transfers 2D alignment knowledge. Additional ITW results are provided at:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/8ugF_Q1.pdf

Q2. Differences with HypoNet

We introduce a preference optimization framework to improve image consistency and ITW robustness. Rather than proposing a new diffusion model, our method focuses on finetuning existing diffusion-based HMR models like HypoNet.

Q3. How to ensure accurate in-the-wild scoring

To address limited and noisy outdoor data, we train HMR-Scorer on large-scale synthetic datasets (e.g., BEDLAM, GTA-Human) for effective knowledge transfer. As noted in BEDLAM, HMR models trained solely on synthetic data can achieve SOTA accuracy. Table 1 in the paper shows strong results on GTA-Human (with ITW scenes). HMR-Scorer visualization is at:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/mE1y_Q3.pdf

Q4. About experiment comparison

We updated Table 2 to indicate which methods are finetuned on the target benchmark:

Updated Table 2: https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/mE1y_Q4.png

We added results showing:

  • Our ADHMR outperforms finetuned HMR 2.0a.
  • Prior works (e.g., HMDiff, Zolly) also finetune on 3DPW. To compare fairly, we introduce ADHMR (ITW)†, trained without Human3.6M, which still outperforms others. Table 3 and Table 4 in the paper show that finetuning on target or extra datasets offers only marginal gains, highlighting the effectiveness of our framework.

Q5. More details

UVD in L176 denotes 2D coordinates (u, v) in the image and their corresponding depth (d) of human keypoints, which aligns 3D joint with image space.
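As a rough illustration of how a UVD joint relates to a camera-space 3D joint, the sketch below back-projects (u, v, d) with a pinhole model. It assumes that d is absolute depth along the camera z-axis and that the intrinsics (fx, fy, cx, cy) are known. The function names and example values are hypothetical, and the recovery step actually used in the paper (for instance, any root-relative depth handling) may differ.

```python
def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth d into camera coordinates (pinhole model)."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return (x, y, d)


def xyz_to_uvd(x, y, z, fx, fy, cx, cy):
    """Perspective projection; the inverse of uvd_to_xyz for z > 0."""
    return (fx * x / z + cx, fy * y / z + cy, z)


# Hypothetical intrinsics (pixels) and one joint in UVD form (pixels, pixels, metres).
fx, fy, cx, cy = 1000.0, 1000.0, 128.0, 128.0
joint_uvd = (178.0, 78.0, 4.0)

joint_xyz = uvd_to_xyz(*joint_uvd, fx, fy, cx, cy)
round_trip = xyz_to_uvd(*joint_xyz, fx, fy, cx, cy)
```

Because u and v live in pixel space, a representation of this kind keeps each joint tied to an image location, which is the alignment property the response refers to.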

We will include the formulas for PLCC and SRCC in the revision.
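For reference, the standard definitions are: PLCC is the Pearson correlation r = Σ(xᵢ − x̄)(yᵢ − ȳ) / sqrt(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²), and SRCC is the same quantity computed on the ranks of x and y. The sketch below implements both in plain Python. The sample scores and errors are made up, and correlating the score with the negated error (so that a good scorer yields a positive value) is an assumption about the sign convention, not something stated in this thread.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scorer outputs and per-sample reconstruction errors (mm).
scores = [0.9, 0.7, 0.4, 0.1]
errors_mm = [40.0, 55.0, 80.0, 200.0]
neg_errors = [-e for e in errors_mm]  # assumed sign convention

plcc = pearson(scores, neg_errors)
srcc = spearman(scores, neg_errors)
```

In this toy example the ordering of scores matches the ordering of errors exactly, so SRCC is 1, while PLCC is lower because the relationship is monotonic but not linear.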

The datasets used for data cleaning are listed in L401–403 (right column).

Q6. Comparison with ScoreHMR and CloseInt

  • ScoreHMR is a two-stage method that refines pretrained pose estimates using a diffusion model, applying reprojection loss only during sampling while trained with standard diffusion loss.
  • CloseInt is also a two-stage method and focuses on multi-person interaction. It refines plausible initial poses with a proxemics and physics-guided diffusion model.
  • Our ADHMR is one-stage and learns to synthesize poses from random noise, guided solely by image cues and preference alignment, without handcrafted priors.

The Updated Table 2 (see Q4) includes results of ScoreHMR and CloseInt, where our ADHMR outperforms both.

Q7. Method generalization analysis

We provide more ITW results at:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/8ugF_Q1.pdf

ADHMR shows generalization under extreme poses (a–c), occlusions (a, b, d, e), and complex backgrounds (f).

However, extreme camera angles remain challenging, likely due to limited training coverage, as shown here:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/mE1y_Q7.pdf

Q8. Qualitative examples and failure case of HMR-Scorer

Although HMR-Scorer is robust even for ITW scenes (see Q3), it still fails in some occlusion cases:

https://anonymous.4open.science/r/ICML_2025_Rebuttal-E36D/mE1y_Q8.pdf

With limited visual cues, the scorer may prefer predictions aligned with the image but not the 3D GT. We plan to explore temporal scorers to address this.

Review
Rating: 4

The paper adapts the works of Diffusion-based DPO (DDPO) to HMR by proposing ADHMR. Specifically, the paper introduces an HMR-scorer model that generates a reward for image-mesh alignment. This module takes local image features, sampled at the UV joint locations, together with global image features, and outputs a score. The network is then trained similarly to RankNet on synthesized noisy annotations from ground truth SMPL poses. The HMR-scorer is then used for preference dataset synthesis, HMR DDPO, and data cleaning. The paper highlights the positive impact of each of these applications, demonstrating the effectiveness of their scoring strategy. Specifically, the paper shows significant improvements on their base model, HypoNet, through preference fine-tuning and a stronger scoring model. The paper also shows that the data-cleaning strategy can improve the performance of SOTA HMR models.

Questions for Authors

  1. How did you prepare the data for training the HMR-Scorer? Did you introduce rotational noise to the ground truth? Could you elaborate on the process?
  2. Could you analyze the biases caused by DPO on the diffusion model? Does it improve rare/uncommon poses like lying or rolling on the floor? Could you provide some qualitative comparisons of generalizability to unseen data?
  3. I understand that it is not common in this line of research to provide runtime information. However, could you give the model's general training/inference time and memory cost?
  4. Please also address my concerns about the weaknesses mentioned above.

Claims and Evidence

The paper's claims are supported via experimental evidence. For instance, they show the superiority of ADHMR over previous works, the effectiveness of HMR-scorer in assessing mesh quality, and the impact of data cleaning on 3 state-of-the-art models. The paper makes some minor claims, such as improving in-the-wild robustness, without proper analysis. However, it is acceptable, given the similarity of such claims with the literature.

Methods and Evaluation Criteria

The evaluation criteria (PVE, MPJPE, etc.) are the standard and widely used metrics for evaluating HMR methods. The evaluation (3DPW, Human3.6M) and training sets are standard research benchmarks. The paper has also included recent papers up to 2024, showing good awareness of the literature during its comparisons.

Theoretical Claims

None provided. Only formulations of DDPO are used.

Experimental Design and Analyses

All experiments, including HMR-scorer evaluation, ADHMR evaluation, ablations, data cleaning, and qualitative comparisons, are sound and well-justified.

Supplementary Material

None provided.

Relation to Existing Literature

Although DPO has been extensively researched for NLP/LLMs and even image/video diffusion models, I did not find its adaptation to HMR tasks. However, it has been proposed for human motion generation [1], which is an entirely different task with similar input/output parameters and metrics. Reward modeling and scoring multiple hypotheses have also been extensively researched, but the paper introduces several implementation changes over prior works, showing improved performance. Several papers also directly address data cleaning, but the paper does not mention them since they are not the paper's focus.

References: [1] Pappa, Massimiliano, et al. "MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization." arXiv preprint arXiv:2405.03803 (2024).

Essential References Not Discussed

The references are adequate.

Other Strengths and Weaknesses

Strength

  1. While the individual components (diffusion model, DPO, reward modeling) are not entirely new, the paper successfully combines them to improve prior works.
  2. The paper is well-written. It has clear goals and designs adequate experiments to showcase its results.
  3. The paper is well aware of recent literature and is one of the first works adapting DDPO into HMR tasks.

Weaknesses

  1. The paper makes incremental contributions over the base model by adapting DDPO for HMR. Since the usefulness of DDPO was already well-known, it reduced the impact of the paper's findings.
  2. The choice of baselines is limited to only HypoNet. It would've been much more impactful if HMR-Scorer showed improved performance on other hypothesis-generation approaches.

Other Comments or Suggestions

The paper's contributions are incremental since it applies well-established methods to HMR without many changes. However, I found no significant issues in the paper, and the experiments were adequate. My final decision is borderline, leaning toward acceptance due to good experiment design.

Typo:

  • Line 87, Right Column: Duplicate citations for (Rafailov et al., 2024) (Rafailov et al., 2024)

Author Response

Thank you for recognizing the effectiveness of our scoring strategy, significant improvements over the base model, and good experiment design. We deeply appreciate your valuable comments and address them below.

Q1. Generalizability to unseen data & Bias caused by DPO

To further support our claim on in-the-wild robustness, we provide more ITW qualitative results here:

https://postimg.cc/jwY2qCgT

These examples demonstrate improved performance under extreme poses, occlusions, and complex backgrounds—for example, (a–b) involve lying or rolling on the floor, where our method shows better 2D alignment. In the paper, Figure 2 (g–h) shows ITW images without ground truth, allowing only qualitative comparison. Results on 3DPW also partially reflect ITW capability with outdoor, unconstrained scenes.

Bias brought by DPO. Our framework improves generalization by shifting the objective from fitting noisy in-the-wild pseudo labels to learning from relative preferences, encouraging the model to capture perceptually aligned and semantically meaningful poses (see L55–59). Besides, HMR-Scorer implicitly learns 2D image alignment by integrating global and pixel-aligned features. This leads to better robustness and transferability in challenging real-world scenarios.

Q2. More relevant references

Thank you for the helpful comment. While our work is the first to apply DPO to HMR, we acknowledge related efforts in motion generation and reward modeling. We also agree that data cleaning has been explored in prior works. For example, SPIN [A] filters training data based on reprojection error, and Unite the People [B] relies on manual selection. We will include these references in the revision.

[A] Kolotouros et al., Learning to reconstruct 3D human pose and shape via model-fitting in the loop, ICCV 2019.

[B] Lassner et al., Unite the people: Closing the loop between 3D and 2D human representations, CVPR 2017.

Q3. About contribution

Thank you for your thoughtful assessment and for acknowledging the strength of our experimental design.

While DDPO is well-established in tasks like image generation, its application to HMR is non-trivial. ADHMR introduces key domain-specific innovations:

  • a learned HMR-Scorer capturing perceptual image–mesh alignment, and

  • a preference dataset constructed without human labels.

ADHMR is the first to successfully apply DPO to HMR, enabling training from relative preferences rather than noisy ITW pseudo labels. In contrast, conventional methods often overfit these labels, degrading 3D accuracy—as also noted in TokenHMR [C].
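The label-free preference construction described here can be pictured with a small sketch: sample several hypotheses for an image, score each with the scorer, and keep a (winner, loser) pair. Selecting the highest- and lowest-scored hypotheses is an assumption made for illustration, since the exact pairing rule is not given in this thread, and `build_preference_pair` and the example values are hypothetical.

```python
def build_preference_pair(image_id, hypotheses, scores):
    """Return (image_id, winner, loser) from scored hypotheses, or None on a full tie."""
    if len(hypotheses) != len(scores) or len(hypotheses) < 2:
        raise ValueError("need at least two hypotheses, each with a score")
    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])
    if scores[best] == scores[worst]:
        return None  # all scores equal: no preference signal
    return (image_id, hypotheses[best], hypotheses[worst])


# Hypothetical example: three mesh hypotheses for one image and their scorer outputs.
triplet = build_preference_pair(
    "img_0001", ["mesh_a", "mesh_b", "mesh_c"], [0.31, 0.87, 0.12]
)
tied = build_preference_pair("img_0002", ["mesh_a", "mesh_b"], [0.5, 0.5])
```

No 3D annotation enters this step; only the relative ordering of scores is used, which is what lets the pipeline run on in-the-wild images.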

Beyond performance gains, HMR-Scorer supports automatic data cleaning, making it broadly helpful in enhancing other HMR pipelines (see Sec. 4.5).

[C] Dwivedi, Sai Kumar, et al. "Tokenhmr: Advancing human mesh recovery with a tokenized pose representation." CVPR. 2024.

Q4. Generalization to other hypothesis-generation approaches

Thank you for the valuable suggestion. We chose HypoNet as our main baseline as it is the latest SOTA diffusion-based HMR model for general-purpose scenarios. Our framework is compatible with other diffusion-based methods [D, E], and we plan to extend experiments in future work.

Deterministic models (e.g., HMR 2.0) can also adopt our preference framework by discretizing the pose space (e.g., via VQ-VAE in TokenHMR [C]) and replacing the regression network with a classification network for DPO training.

Meanwhile, HMR-Scorer and data cleaning are model-agnostic and can enhance training quality across both probabilistic and deterministic pipelines—offering a practical path for pseudo-label refinement in in-the-wild settings where 3D ground truth is scarce.

[D] Cho, Hanbyel, and Junmo Kim. "Generative approach for probabilistic human mesh recovery using diffusion models." ICCV. 2023.

[E] Stathopoulos, Anastasis, Ligong Han, and Dimitris Metaxas. "Score-guided diffusion for 3d human recovery." CVPR. 2024.

Q5. Duplicate citations

We will fix this in the revision.

Q6. HMR-Scorer training data preparation

Yes, we add joint-wise Gaussian noise to the ground truth SMPL pose to simulate rotational errors, with magnitudes empirically determined. We will include the exact noise parameters in the final appendix.
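A minimal sketch of such a perturbation is shown below, assuming the SMPL pose is stored as 24 axis-angle vectors (global orientation plus 23 body joints) and that noise is added independently to each component. The per-joint standard deviations are placeholders, since the authors state that the actual magnitudes were determined empirically and are not given here.

```python
import random

NUM_JOINTS = 24  # SMPL pose: global orientation + 23 body joints, 3 axis-angle values each


def perturb_pose(pose, sigma, rng):
    """Add zero-mean Gaussian noise with std sigma[j] (radians) to joint j's axis-angle vector."""
    return [
        [component + rng.gauss(0.0, sigma[j]) for component in joint]
        for j, joint in enumerate(pose)
    ]


rng = random.Random(0)  # seeded for reproducibility
gt_pose = [[0.0, 0.0, 0.0] for _ in range(NUM_JOINTS)]  # placeholder ground-truth pose
sigma = [0.1] * NUM_JOINTS  # hypothetical noise level, not the authors' values

noisy_pose = perturb_pose(gt_pose, sigma, rng)
unchanged = perturb_pose(gt_pose, [0.0] * NUM_JOINTS, rng)
```

Using a separate sigma per joint makes it possible to perturb, say, limbs more strongly than the torso, which is one way to read "joint-wise" noise.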

Q7. Training/inference cost

We provide the training/inference time and memory cost below:

  • Inference

| Model | Time per prediction | Batch Size | Memory |
|-|-|-|-|
| HMR-Scorer | 56 ms | 32 | 3350 MiB |
| ADHMR (M=10) | 3 ms | 160 | 8400 MiB |
| ADHMR (M=100) | 1.6 ms | 80 | 24 GB |
| ADHMR (M=200) | 1.5 ms | 40 | 24 GB |

All inference was conducted on a single NVIDIA V100 GPU. M denotes the number of predictions.

  • Training

| Model | Time | Batch Size | Memory per GPU |
|-|-|-|-|
| HMR-Scorer | 24 h | 32 | 11500 MiB |
| ADHMR | 7 h | 40 | 18 GB |

All training was conducted on 4 NVIDIA V100 GPUs.

Reviewer Comment

I thank the authors for providing a comprehensive rebuttal addressing all the points raised in my initial review. My questions regarding HMR-scorer training, DPO bias, generalizability, and computational cost have been satisfactorily answered. The additional qualitative results are also compelling, demonstrating impressive results on irregular postures and hard examples. Furthermore, the authors' arguments about the non-trivial aspects and domain-specific innovations involved in applying DDPO to HMR clarified my concerns about the contributions of this work. As a result of this discussion phase and reading other concerns raised by the reviewers, my reservations have been fully addressed. I would like to increase my score to Accept, given that the details from the rebuttal are added to the final paper.

Author Comment

We sincerely thank you for acknowledging our efforts in addressing the concerns. We are particularly grateful for the recognition of the non-trivial and domain-specific aspects of applying DDPO to HMR, as well as the appreciation of our qualitative results and clarifications regarding scorer training, generalizability, DPO bias, and computational cost.

We will make sure to incorporate these clarifications and additional insights presented in the rebuttal into the final version of the paper. We truly appreciate your updated assessment and recommendation to accept the paper.

Review
Rating: 4

This paper targets improving HMR methods with preference prediction. To this end, the authors present a prediction assessment model named HMR-Scorer. The authors then create a preference dataset using HMR-Scorer, which is used to fine-tune the base model and existing HMR methods. The full method, called ADHMR, shows SOTA results.

Questions for Authors

No.

Claims and Evidence

I think the claims are clear.

Methods and Evaluation Criteria

The proposed method ADHMR has been tested on several standard datasets such as 3DPW and Human3.6M using standard metrics MPJPE, etc.

Theoretical Claims

At L. 254, the authors write "through some mathematical techniques ..." and arrive at Eqn. (5). It would be better to spell out the procedure in an appendix or point readers to where similar derivations can be found.

Experimental Design and Analyses

I have some questions about the experiments.

  1. Why do the methods in Table 5 not appear in Table 2? Are there any special reasons?

  2. In Table 1, does ScoreNet share the same training set as HMR-Scorer? How is a fair comparison ensured?

  3. Table 2 is a bit unfair because ADHMR is informed by more training data through HMR-Scorer, which is trained on DNA-rendering etc. I do not doubt the effectiveness of ADHMR, since Table 4 has proven that. Additional descriptions of the seen datasets in Table 2 would make it more rigorous.

Supplementary Material

NO supp.

Relation to Existing Literature

This paper is based on CVPR 2024 paper HypoNet with significant differences.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strength:

  1. The paper is well written. The contributions are stated clearly. Incorporating DPO with HMR is an effective idea.

  2. Experiments have shown the improvements brought by ADHMR.

Weakness:

  1. In Figure 2, it would be more informative to show the GT and a side view as well.

  2. The first visual result in Figure 2 is not very accurate. In this image, the man is playing basketball with his left hand. The proposed ADHMR predicts that the right hand is also on the ball. However, the shadow on the ground indicates that the right hand is far away from the ball. In fact, this image is the 78th frame of the sequence "courtyard_basketball_00". From the whole sequence, we can see that the man's right hand is placed to the right of his right leg and is totally occluded. Therefore this sample is not that good.

Other Comments or Suggestions

L.087. (Rafailov et al. 2024) duplicated.

Author Response

Thanks for recognizing the strong performance of our method and the clear presentation of our paper. We deeply appreciate your constructive comments and address them below.

Q1. Equation 5 elucidation

Thank you for your helpful comment. Our formulation follows the approach introduced in Diffusion-DPO [A], which provides a detailed derivation (see Section 4 and Supp. S2 of that paper). To improve readability, we will revise our manuscript to explicitly cite this reference at Line 254 and include a pointer to the key steps in the original paper.

[A] Wallace, Bram, et al. "Diffusion model alignment using direct preference optimization." CVPR. 2024.
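For orientation, the Diffusion-DPO objective in [A] has the form -log σ(-w · [(‖ε^w − ε_θ(x_t^w, t)‖² − ‖ε^w − ε_ref(x_t^w, t)‖²) − (‖ε^l − ε_θ(x_t^l, t)‖² − ‖ε^l − ε_ref(x_t^l, t)‖²)]), where w and l denote the preferred and rejected samples and ε_ref is a frozen reference model. The sketch below evaluates this for a single pair in plain Python. It is a minimal illustration under stated assumptions, not the paper's Eq. (5): the timestep-dependent weight is collapsed into a single hypothetical constant `beta_t`, and all argument names and values are invented.

```python
import math


def sq_err(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def diffusion_dpo_loss(eps_w, eps_l, pred_w, pred_l, ref_w, ref_l, beta_t):
    """Per-pair loss -log(sigmoid(-beta_t * (winner_gap - loser_gap))).

    eps_*  : true noise added to the winner / loser sample
    pred_* : noise predicted by the model being trained
    ref_*  : noise predicted by the frozen reference model
    """
    winner_gap = sq_err(eps_w, pred_w) - sq_err(eps_w, ref_w)
    loser_gap = sq_err(eps_l, pred_l) - sq_err(eps_l, ref_l)
    logit = -beta_t * (winner_gap - loser_gap)
    # -log(sigmoid(z)) equals softplus(-z); written in a numerically stable form.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Hypothetical two-dimensional example.
eps_w, eps_l = [1.0, 0.0], [0.0, 1.0]
ref_w, ref_l = [0.5, 0.0], [0.0, 0.5]

# Model identical to the reference: the loss equals log 2.
loss_at_init = diffusion_dpo_loss(eps_w, eps_l, ref_w, ref_l, ref_w, ref_l, beta_t=10.0)

# Model denoises the winner better and the loser worse than the reference: loss drops.
loss_improved = diffusion_dpo_loss(
    eps_w, eps_l, [0.9, 0.0], [0.0, 0.2], ref_w, ref_l, beta_t=10.0
)
```

The loss decreases only when the model improves on the winner relative to the reference by more than it improves on the loser, which is how relative preference enters training.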

Q2. Results in Table 5 not in Table 2

We separate Table 2 and Table 5 as they serve different goals. Table 2 evaluates the ADHMR framework with SMPL-based methods for fair comparison, while Table 5 focuses on data cleaning with more challenging SMPL-X data. Methods in Table 5 are deterministic and not compatible with ADHMR. We will clarify this in the final version.

Q3. About ScoreNet training set

Thank you for the insightful comment. While we understand the concern, it’s important to note that ScoreNet and our HMR-Scorer were originally trained on different datasets. Specifically, ScoreNet was trained on H3.6M, 3DPW, MPI-INF-3DHP, MPII, COCO, and UP-3D, while our HMR-Scorer was trained on a set of synthetic and human-interaction datasets with higher annotation accuracy and scene diversity, as detailed in Sec. 5.1.

To ensure a fair comparison, we re-trained ScoreNet using our HMR-Scorer training sets. Please check the results in the updated Table 1:

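| Dataset | Method | PVE PLCC ↑ | PVE SRCC ↑ | MPJPE PLCC ↑ | MPJPE SRCC ↑ | PA-MPJPE PLCC ↑ | PA-MPJPE SRCC ↑ |
|-|-|-|-|-|-|-|-|
| GTA-Human | ScoreNet | 0.52 | 0.49 | 0.52 | 0.50 | 0.47 | 0.43 |
| GTA-Human | ScoreNet† | 0.55 | 0.51 | 0.54 | 0.51 | 0.50 | 0.46 |
| GTA-Human | HMR-Scorer (Ours) | 0.63 | 0.62 | 0.63 | 0.62 | 0.57 | 0.54 |
| DNA-Render | ScoreNet | 0.55 | 0.51 | 0.55 | 0.50 | 0.50 | 0.46 |
| DNA-Render | ScoreNet† | 0.57 | 0.54 | 0.56 | 0.54 | 0.51 | 0.48 |
| DNA-Render | HMR-Scorer (Ours) | 0.67 | 0.64 | 0.66 | 0.65 | 0.62 | 0.59 |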

ScoreNet† denotes ScoreNet trained on the same training sets as HMR-Scorer. The results show that HMR-Scorer still outperforms ScoreNet† when both use the same training sets, which indicates that the gain comes from the design of our scoring strategy rather than from dataset bias.

We will clarify this point and explicitly list the datasets used in Table 1 to improve rigor and transparency.

Q4. Seen datasets of methods in Table 2

Thank you for your valuable comment. We will include a clearer description of the datasets used in Table 2 in the revised version. Specifically:

  1. ADHMR and HypoNet are trained on the same set of datasets, and ADHMR does not use HMR-Scorer. This comparison demonstrates the effectiveness of our framework even without additional supervision.
  2. ADHMR (ITW) is not directly trained on the datasets used for training HMR-Scorer (e.g., DNA-rendering). However, it may indirectly benefit from these extra datasets. As you rightly pointed out, Table 4 shows that our method is significantly more effective than directly fine-tuning with those datasets, which further highlights the strength of our approach.

Q5. About Figure 2

Thank you for your careful analysis and helpful suggestions.

We have updated Figure 2 by adding ground truth and side-view visualizations for better interpretability. In addition, we have replaced the first example with a clearer case. The revised figure is available here:

https://postimg.cc/gx0YG3G8

In the original first example, the man’s right hand is fully occluded, and our model still predicts a plausible pose based solely on visible cues. In contrast, HypoNet’s result is not consistent with the observed image evidence. It is worth noting that our model processes single-frame images only and does not leverage any temporal context from the sequence.

We appreciate your detailed observation and have reflected this clarification in the updated figure.

Review
Rating: 4

This paper proposes the first method to use preference optimization to improve Human Mesh Recovery (HMR) models. The paper first introduces an HMR-scorer model to rank the human mesh results produced by an arbitrary HMR method. Experiments show the score is strongly correlated with the reconstruction metrics, supporting that this scorer model can be reliably applied to other datasets without 3D ground truths. This scorer is then used to construct a preference dataset of {Image, winner result, loser result} triplets. These data are used to fine-tune a base HMR model using the direct preference optimization (DPO) loss. Result: a large lead in HMR accuracy on common benchmarks. Comprehensive experiments validated the different stages in the paper. Overall, I think this is a solid paper.

Update after rebuttal: after reading other reviewers' comments and the authors' rebuttal, my rating of acceptance remains unchanged.

Questions for Authors

Is the Preference dataset also applicable to non-probabilistic HMR models to boost their performance?

Claims and Evidence

Claim 1: the proposed method is the first framework that enhances HMR models using preference optimization. To my knowledge, this is true. Claim 2: the HMR-scorer model is effective. This is validated by experiments in Sec. 5.2. Claim 3: the HMR-scorer improves the SOTA HMR models by filtering out low-quality training data. This is also validated through comprehensive experiments in Sec. 5.3, in particular Table 5.

To summarize, the major novelty and technical claims are valid and verified.

Methods and Evaluation Criteria

The proposed method makes sense to me and matches the evaluation well. First, the motivation for the HMR-scorer is to replace human preference ratings in the dataset curation step of preference optimization methods. If a dataset has precise 3D ground truth, directly computed errors are used for ranking. If the ground-truth annotations are noisy (e.g., in-the-wild datasets), the scorer can help create positive and negative samples effectively. Consequently, the evaluation only needs to verify that the scores correlate correctly with the common HMR error metrics (such as per-vertex errors, body joint errors, etc.) -- that is how Sections 5.1 and 5.2 are designed. Therefore I think this validation is well designed.

Second, using the constructed preference dataset for DPO of the diffusion-based HMR method -- which sounds straightforward in hindsight -- is a great idea, and this paper is the first to do so. The method yields impressive performance improvements on the standard benchmarks (Table 2), which I think sufficiently validates the efficacy of the design.

Theoretical Claims

To my understanding, this paper is more about applying the DPO method (Wallace et al., 2024) to the specific problem of improving diffusion-based probabilistic human mesh recovery. Therefore the major theoretical claim (while indirectly made), is that DPO is effective for this setting. The experiments have validated this claim.

Experimental Design and Analyses

As discussed above, the experiments are designed to validate the major technical claims, and they achieve this goal.

  1. The first part of the experiments validates whether the HMR scorer model aligns with the common error metrics in HMR. This is validated on datasets with precise ground truth. The metrics here are correlation coefficients, which I believe are sensibly chosen.
  2. The second part is about whether the preference-rated data can improve common HMR models. This is validated through detailed benchmarking and an ablation study (Table 3).
  3. The third part of the experiments uses the HMR-scorer model to clean up the data. The effectiveness is demonstrated on 3 different base models. Therefore, I think the experimental designs are comprehensive and effective.

Supplementary Material

I did not find supplementary material for this submission.

Relation to Existing Literature

The models and datasets delivered by this work can be applied to more precise human pose/motion estimation from images or video streams. This can facilitate many applications in other areas, such as motion capture in the movie industry, pose estimation in live sports, sports medicine and analysis, etc.

Essential References Not Discussed

N/A.

Other Strengths and Weaknesses

Strengths:

  • Overall, I think this paper proposes a good idea that is validated through extensive experiments. The scorer network should be a drop-in enhancement for diffusion-based probabilistic HMR models. The method has improved the performance on common HMR benchmarks by a large margin -- such a large improvement has not often been seen in the last two years. The paper is well written too. Therefore I recommend acceptance.

Weakness: L. 254, "through some mathematical techniques" -- it would be good to refer the reader to the corresponding location in the original literature.

Other Comments or Suggestions

  • I particularly appreciate the detailed captions of the tables and figures.
  • Minor: the DPO paper is cited twice on L. 87, right column.

Author Response

Thanks for highlighting the novelty of applying preference optimization to HMR, the effectiveness of our framework, and the comprehensive experiments. We truly appreciate your encouraging feedback and respond to your points below.

Q1. Equation 5 elucidation

Thank you for your helpful comment. Our formulation follows the approach introduced in Diffusion-DPO [A], which provides a detailed derivation (see Section 4 and Supp. S2 of that paper). To improve readability, we will revise our manuscript to explicitly cite this reference at Line 254 and include a pointer to the key steps in the original paper.

[A] Wallace, Bram, et al. "Diffusion model alignment using direct preference optimization." CVPR. 2024.

Q2. Preference dataset application

Thank you for the great question. The preference dataset is not directly applicable to standard non-probabilistic HMR models, as the DPO framework assumes the model produces discrete predictions with associated probabilities. Our current implementation is based on a diffusion-based HMR model, which naturally supports such probabilistic outputs, consistent with the assumptions in Diffusion-DPO.

However, we believe adapting deterministic models to fit within this preference framework is both feasible and promising. As demonstrated in TokenHMR [B], one can first train a VQ-VAE to discretize the continuous pose space, and then modify the original regression head (e.g., in HMR2.0) into a classification head over quantized tokens. This adaptation would allow deterministic HMR models (e.g., HMR 2.0) to benefit from our preference datasets and preference optimization framework, unlocking similar gains in robustness and image alignment. We consider this a promising direction for future research.

Importantly, the HMR-Scorer and our data cleaning pipeline are fully model-agnostic, and thus directly applicable to both probabilistic and deterministic models. They provide a scalable way to improve training data quality by filtering out noisy pseudo-labels—particularly useful for in-the-wild datasets where 3D ground truth is often unavailable or unreliable. We view this as a practical step toward broader adoption of preference-guided learning in HMR, and we plan to explore these extensions in future work.
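The model-agnostic data cleaning step can be sketched as a simple score-based filter over training samples. Keeping a fixed top fraction is an assumption chosen for illustration; the criterion the authors actually use (for example, an absolute score threshold) is not specified in this thread, and `clean_dataset` and the example values are hypothetical.

```python
def clean_dataset(samples, scores, keep_ratio):
    """Keep the top keep_ratio fraction of samples by scorer output, preserving input order."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(len(samples) * keep_ratio))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original ordering of the survivors
    return [samples[i] for i in kept]


# Hypothetical pseudo-labelled samples and their scorer outputs.
samples = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.6, 0.4]
cleaned = clean_dataset(samples, scores, keep_ratio=0.5)
```

Since the filter only needs an image, a pseudo-label, and a score, it does not depend on whether the downstream HMR model is probabilistic or deterministic.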

[B] Dwivedi, Sai Kumar, et al. "Tokenhmr: Advancing human mesh recovery with a tokenized pose representation." CVPR. 2024.

Q3. Typos

Thank you for pointing out the duplicated citation. We will fix this in the final version.

Final Decision

This paper targets improving existing diffusion-based human mesh recovery (HMR) models, which generate multiple predictions, through preference optimization. Specifically, it trains HMR-Scorer, which quantifies the quality of human mesh predictions, and then uses HMR-Scorer for data cleaning to improve the performance of HMR. Experiments show that the proposed method outperforms state-of-the-art probabilistic HMR methods.

The main strengths of this paper include:

  • it is novel to incorporate direct preference optimization for diffusion-based probabilistic HMR
  • it improves the performance significantly

The main weaknesses of this paper are:

  • the fairness of the training data of HMR-Scorer (Reviewers WdV4, mE1y)
  • the need for more baselines and comparisons (Reviewers 8ugF, mE1y)

During the rebuttal phase, all reviewers joined the discussion, acknowledged the rebuttal, and justified their positions. Overall, this paper received 3 accepts and 1 weak reject from the reviewers. Most concerns raised during review were addressed in the rebuttal. However, one HMR-Scorer issue should be highlighted: HMR-Scorer can provide accurate evaluations, but it relies on "additional labeled data" such as BEDLAM and GTA-Human (see Reviewer mE1y Rebuttal Q3). This may need more discussion and experiments for clarification, such as on the use of labeled data and a fairer comparison (see Official Comment by Reviewer mE1y).

After reading all the reviews, the discussion, and the rebuttal, the AC agrees on the paper's novelty and superior performance. However, the HMR-Scorer labeled-data issue is also important and should be handled carefully. The AC acknowledges the merits and leans toward acceptance. The authors are encouraged to substantially improve the final version to clarify the issues raised by reviewers, especially the HMR-Scorer issue (Reviewers WdV4, mE1y).