PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 3, 4, 5, 5 (min 3, max 5, std 0.8)
Confidence: 2.8
Originality: 2.3 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

Learning Dense Hand Contact Estimation from Imbalanced Data

OpenReview · PDF
Submitted: 2025-04-20 · Updated: 2025-10-29
TL;DR

We propose HACO, a framework for dense hand contact estimation that addresses class and spatial imbalance challenges in training on large-scale datasets.

Abstract

Keywords
Hand contact estimation · Hand-object interaction

Reviews and Discussion

Review
Rating: 3

This paper develops the HACO framework to address the class imbalance and spatial imbalance issues in hand contact datasets. HACO tackles class imbalance with balanced contact sampling and spatial imbalance with a vertex-level class-balanced (VCB) loss. Experiments comparing against SoTA methods show the effectiveness of the proposed method.

Strengths and Weaknesses

Pros:

  1. The paper is well-written and easy to follow.
  2. The ablation study is comprehensive and makes the paper convincing.

Cons:

  1. Limited novelty: The proposed Balanced Contact Sampling and Vertex-level Class-Balanced loss are straightforward. Balanced Contact Sampling simply resamples all training data from the datasets to make the data balanced. The class-balanced loss is frequently used to handle imbalanced training issues, and the authors simply apply it at the vertex level, which I think is not enough to be a solid technical contribution.
  2. Since the proposed method aims to solve the imbalanced data issue, I think experiments on data with different classes might be helpful to verify the effectiveness of the method. For example, previous methods may perform well on the whole dataset but not work well on another kind of data that is a minority class in the datasets.
  3. Can you further explain the rationale of the network structure design? Why self-attention + cross-attention? Why add a contact token? Is it learnable?
  4. I found that the comparison with SoTA methods is only conducted on the MOW dataset. Testing on more datasets would be better.
  5. Can you give more details about how the comparison with DeepContact is done, since your model only outputs a contact map?

Questions

See cons above.

Limitations

yes

Final Justification

I appreciate the authors' responses, which addressed some of my concerns. However, I still believe the technical contribution of the work is limited, as the proposed Balanced Contact Sampling and Vertex-level Class Balanced Loss are straightforward and lack novelty. Additionally, the unfair comparison pointed out by reviewer BqXF raises further doubts about the reliability of the experimental results. Therefore, I maintain my initial rating.

Formatting Issues

None

Author Response

We thank the reviewer xnje for reviewing our paper.

We sincerely appreciate the reviewer’s positive assessments of our work. The recognition that the paper is clearly written and accessible is especially encouraging. We also thank the reviewer for highlighting the comprehensiveness of our ablation studies, which we believe strengthens the credibility of our contributions. Below, we address the specific points raised in the review.

Q: Limited novelty of the proposed methods

A: While Balanced Contact Sampling (BCS) and VCB loss build on known principles, both are carefully adapted to address the specific challenges of dense hand contact estimation, which requires precise vertex-level prediction under severe class and spatial imbalance, common in real-world hand interaction datasets.

BCS is not a simple resampling heuristic, but a dataset-driven strategy targeting class imbalance. It groups training samples by a contact balance score, exposing the model to both contact-sparse and contact-heavy hands. Since most hand vertices are non-contact, BCS is essential to avoid overfitting to dominant patterns of non-contact. As it uses dataset-level statistics, it remains effective across varying datasets.

VCB loss addresses spatial imbalance by applying vertex-specific reweighting based on contact frequency. This prevents overfitting to frequently contacted regions (e.g., fingertips) and promotes learning from underrepresented regions. Unlike generic balancing techniques, VCB directly addresses the spatial imbalance inherent in dense hand contact datasets and remains broadly applicable because it relies only on label statistics.
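
To make the vertex-level reweighting concrete, below is a minimal sketch of applying an effective-number-of-samples weighting (in the spirit of the class-balanced loss) independently at each vertex. The tensor shapes, the beta value, and the helper name are illustrative placeholders rather than our exact implementation.

```python
import torch

def vcb_bce_loss(pred_logits, gt_contact, contact_counts, noncontact_counts, beta=0.999):
    """Illustrative vertex-level class-balanced BCE (placeholder implementation).

    pred_logits:       (B, V) raw contact logits per MANO vertex
    gt_contact:        (B, V) binary contact labels
    contact_counts:    (V,) number of training samples where each vertex is in contact
    noncontact_counts: (V,) number of training samples where each vertex is not in contact
    beta:              effective-number hyperparameter (placeholder value)
    """
    # Effective number of samples (Cui et al., CVPR 2019), computed per vertex
    # and per class instead of globally over the whole dataset.
    w_pos = (1.0 - beta) / (1.0 - beta ** contact_counts.clamp(min=1).float())
    w_neg = (1.0 - beta) / (1.0 - beta ** noncontact_counts.clamp(min=1).float())

    # Pick the positive or negative weight at each vertex according to its label.
    weights = torch.where(gt_contact > 0.5, w_pos, w_neg)  # broadcasts to (B, V)

    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        pred_logits, gt_contact.float(), reduction="none")
    return (weights * loss).mean()
```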

Our method yields strong gains over prior work and supports downstream tasks like 3D grasp optimization and hand-object reconstruction. By explicitly addressing class and spatial imbalance, it provides both practical benefits and conceptual contributions to dense hand contact estimation.

We will clarify these motivations and contributions more explicitly in the final version.

Q: Since the proposed method is to solve the imbalanced data issue, I think experiments on the data with different classes might be helpful to verify the effectiveness of the method

A: We appreciate the reviewer’s insightful suggestion. To directly evaluate the effectiveness of our method under imbalance issue, we constructed a test set composed of highly imbalanced samples. Specifically, we selected the top 500 samples with the highest contact balance scores (as defined in Eq. 1 of the main paper) from a combined pool of MOW, HIC, RICH, and Hi4D datasets. This subset explicitly reflects extreme imbalance, making it well-suited to assess the robustness of imbalance-aware methods.

Table R1. SOTA comparison for dense hand contact estimation on an assorted dataset with high imbalance

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.117 | 0.102 | 0.104 |
| BSTRO | CVPR 2022 | 0.243 | 0.188 | 0.204 |
| DECO | ICCV 2023 | 0.225 | 0.205 | 0.192 |
| HACO (Ours) | - | 0.550 | 0.626 | 0.576 |

As shown above, HACO achieves the best performance across all metrics, demonstrating strong robustness even on samples with severe imbalance. This directly confirms the impact of our method in handling the imbalance issue. We will include these results and the process of dataset acquisition in the final version.

Q: Can you further explain the rationale of the network structure design? Why self-attention + cross-attention? Why add a contact token? Is it learnable?

A: Our model architecture closely follows the HaMeR framework, which has demonstrated strong performance in hand mesh recovery task. This design decision reflects our focus not on proposing a novel network architecture, but rather on advancing contact estimation through: (1) training on a large-scale composite dataset spanning 14 independent datasets, (2) Balanced Contact Sampling (BCS) to address class imbalance, and (3) Vertex-Level Class-Balanced (VCB) loss to address spatial imbalance in hand contact.

Regarding the self-attention and cross-attention modules, we adopted the structure used in the officially released HaMeR implementation. This combination has proven effective for fusing image features with mesh queries, and we retained it to ensure fair comparison and isolate the effects of our proposed training strategies.

The contact token used in our architecture serves as a query token for the cross-attention Transformer (as mentioned in L154), analogous to the query token used in HaMeR's official implementation. Similar to the class token in Vision Transformers, this learnable contact token serves as a global summary vector for predicting vertex-level contact. By learning this token end-to-end, the model is encouraged to develop a contact-specific representation that guides the Transformer's attention during feature aggregation. This design is widely adopted in Transformer-based frameworks where effective feature aggregation is needed for regression or classification tasks.
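
As an illustration of this design, below is a minimal sketch of a contact-token head: a learnable query token attends over image features via cross-attention, and a small linear head maps the aggregated feature to per-vertex contact logits. The dimensions, layer counts, and module name are placeholders, not the exact HACO/HaMeR configuration.

```python
import torch
import torch.nn as nn

class ContactHead(nn.Module):
    """Illustrative contact-token head (placeholder configuration): a learnable
    query token attends over image features via cross-attention, and a linear
    head maps the aggregated feature to logits for the 778 MANO vertices."""

    def __init__(self, dim=512, num_vertices=778, num_heads=8):
        super().__init__()
        self.contact_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable, trained end-to-end
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_vertices))

    def forward(self, img_feats):                                       # img_feats: (B, N, dim)
        img_feats, _ = self.self_attn(img_feats, img_feats, img_feats)  # refine image tokens
        query = self.contact_token.expand(img_feats.size(0), -1, -1)    # (B, 1, dim)
        fused, _ = self.cross_attn(query, img_feats, img_feats)         # token attends to image features
        return self.head(fused).squeeze(1)                              # (B, num_vertices) contact logits
```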

We will clarify these architectural choices more explicitly in the final version.

Q: I found that the comparison with SoTA methods is only conducted on MOW dataset. Testing them on more other datasets will be better.

A: We appreciate the reviewer’s suggestion and agree that broader evaluation across multiple datasets provides a more comprehensive assessment. In response, we conducted additional comparisons on three more benchmark datasets: HIC, RICH, and Hi4D. These datasets represent diverse hand interaction scenarios and include hand-hand, hand-scene, and hand-body contacts.

Table R2. SOTA comparison for dense hand contact estimation on HIC dataset

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.000 | 0.000 | 0.000 |
| BSTRO | CVPR 2022 | 0.000 | 0.000 | 0.000 |
| DECO | ICCV 2023 | 0.005 | 0.037 | 0.006 |
| HACO (Ours) | - | 0.181 | 0.641 | 0.278 |

Table R3. SOTA comparison for dense hand contact estimation on RICH dataset

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.143 | 0.175 | 0.125 |
| BSTRO | CVPR 2022 | 0.437 | 0.498 | 0.455 |
| DECO | ICCV 2023 | 0.324 | 0.351 | 0.303 |
| HACO (Ours) | - | 0.743 | 0.570 | 0.528 |

Table R4. SOTA comparison for dense hand contact estimation on Hi4D dataset

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.128 | 0.109 | 0.111 |
| BSTRO | CVPR 2022 | 0.313 | 0.207 | 0.247 |
| DECO | ICCV 2023 | 0.197 | 0.162 | 0.146 |
| HACO (Ours) | - | 0.781 | 0.923 | 0.828 |

Across all three additional datasets, HACO significantly outperforms prior SOTA methods, confirming its robustness and accuracy in diverse interaction settings. These additional results will be included in the final version of the manuscript.

Q: Can you give more details about how to compare with DeepContact since your model only outputs contact map?

A: Our comparison with DeepContact is conducted using the ContactOpt framework, which optimizes coarse 3D hand and object predictions using contact information. We utilize ContactOpt’s Differentiable Contact Optimization stage to refine the coarse 3D hand and object poses (from HFL-Net) for both the DeepContact and HACO experiments.

The key difference lies in how contact is estimated: DeepContact predicts contact from the coarse 3D hand and object meshes (extracted using HFL-Net), while HACO predicts contact directly from the RGB image. Note that DeepContact does not use image information in any manner; rather, it relies solely on the geometry of the coarse 3D hand and object meshes to estimate contact. As 3D hand grasp optimization inevitably requires initial 3D hand and object poses, we use the same HFL-Net predictions for both methods during ContactOpt’s Differentiable Contact Optimization stage. In HACO’s case, we supply HACO’s estimated contact from the image alongside these coarse 3D poses to ContactOpt’s optimization process.

As shown in Section 5.4 and Table 6 in main paper, HACO achieves comparable or superior results across various metrics. These results demonstrate that HACO's contact prediction enables robust and effective contact-guided 3D grasp optimization, despite relying only on image input during contact prediction.

We will clarify this protocol more explicitly in the final version.

Comment

I appreciate the authors' responses, which addressed some of my concerns. However, I still believe the technical contribution of the work is limited, as the proposed Balanced Contact Sampling and Vertex-level Class Balanced Loss are straightforward and lack novelty. Additionally, the unfair comparison pointed out by reviewer BqXF raises further doubts about the reliability of the experimental results. Therefore, I maintain my initial rating.

Comment

Dear reviewer xnje,

We want to thank you for reviewing our paper and for spending time reading and analyzing it. Although we have been notified that the reviewer's decision has been submitted, we still wish to follow up and organize some of the discussion points made during the rebuttal. In the previous response, we made the following updates/clarifications:

  • Clarified that the contributions of BCS and VCB loss are addressing challenges specific to dense hand contact estimation. Added new experiments on a highly imbalanced subset (Table R1) and additional datasets (Tables R2–R4) to further validate our method.

  • Explained the rationale behind the network design and clarified that the comparison protocol with DeepContact is within the ContactOpt framework.

This expanded discussion will be included in the final version. We are happy to answer further questions.

Review
Rating: 4

The paper describes a method, HACO, that addresses two problems of imbalanced data during hand interactions: class imbalance (due to the vast majority of non-contact frames) and spatial imbalance (due to the prevalence of fingertips contacts). The first problem is tackled by resampling the datasets based on contact balance scores to ensure fair representation. Spatial imbalance is instead tackled by reweighting vertices to give more importance to the under-represented ones. These two strategies lead to improvements in contact estimation and 3D hand reconstruction during HOI.

Strengths and Weaknesses

Quality

Strengths:
  • Addresses a challenging and important problem of data imbalance in dense hand contact estimation with two complementary solutions.
  • Proposes technically sound approaches: Balanced Contact Sampling (BCS) for class imbalance and Vertex-Level Class-Balanced (VCB) loss for spatial imbalance.

Weaknesses:
  • Evaluation choices: Fundamentally flawed baseline comparisons where POSA, BSTRO, DECO are designed for full-body HS but evaluated on HO tasks; primary evaluation on the MOW dataset despite authors acknowledging GT problems in the supplementary material; only fair comparisons (DeepContact, EasyHOI) show minimal improvements.
  • Methodological gaps: Table 4 ablation ordering is not motivated and it starts with non-target domains; unclear contribution of HACO vs. massive 14-dataset training.
  • Missing critical ablation: Tables 2&3 show identical baseline performance (0.525, 0.632, 0.531), suggesting BCS and VCB were not properly isolated.
  • Technical inconsistencies: Fragmented contact predictions contradict claimed "smoothness loss" benefits mentioned in L238-239.
  • Ground truth reliability concerns: Missing GT visualizations in Figure 3 make it impossible to assess prediction quality; some predictions appear inconsistent with visual evidence (e.g., in row 3, based on the shadow, fingers do not seem to touch the computer).

Clarity

Strengths:
  • Clear problem formulation identifying class imbalance and spatial imbalance as distinct challenges.
  • Code will be released.

Weaknesses:
  • Section 3.2 (Balanced Contact Sampling) clarity can be improved.
  • Misleading baseline comparisons: the paper does not justify quantitatively evaluating POSA, BSTRO, DECO on MOW, as they solve fundamentally different problems.
  • Inconsistent categorization: it is not clear why Decaf (hand-face) and Hi4D (whole-body) are categorized together as hand-body despite being fundamentally different.
  • Incomplete experimental details: the Table 4 ordering rationale is unexplained; there is no explanation of why MOW is used for evaluation given it has the worst GT estimation, or why out-of-domain baselines are used.

Significance

Strengths:
  • Tackles a fundamental challenge in hand interaction understanding that affects multiple downstream applications.

Weaknesses:
  • Limited practical validation: Only meaningful comparisons (DeepContact, EasyHOI) show modest improvements.
  • Inflated performance claims: Dramatic improvements in Table 5 come from comparing incompatible methods.
  • Unclear method vs. data contribution: Cannot assess whether improvements stem from HACO techniques or from combining 14 diverse datasets.
  • Evaluation reliability: Primary evaluation on an acknowledged problematic dataset (MOW) undermines confidence in reported improvements.
  • Failure case analysis: Missing discussion of when/why the method fails or performs poorly.

Originality

Strengths:
  • Adaptation of the class-balanced loss to the vertex level to address different aspects of data imbalance.
  • Large-scale multi-domain training with a resampling approach for hand contact estimation.

Questions

  1. Can the authors justify their evaluation choices? In particular:
     a. Table 4: the non-logical ordering starting with non-target domains (e.g., MOW starting with HS instead of HO).
     b. Table 5: comparing HACO with full-body contact methods (POSA, BSTRO, DECO) on HO tasks. The goal of addressing class imbalance and spatial imbalance is related but not the same as out-of-distribution generalization.
     c. The reason for using MOW for primary evaluation despite acknowledging GT problems in the supplementary material.
  2. Can the authors ablate each of the two components of HACO while keeping the other one (BCS-only vs. VCB-only vs. both)?

Limitations

The authors addressed limitations and potential negative societal impact in the supplementary material. However, the limitations section could be greatly expanded to critically discuss failure cases that could open up new research avenues. The current section only mentions that they excluded self-contact.

Final Justification

I would like to thank the authors for addressing all my comments. Given the new experiments and the changes they have promised to incorporate into the manuscript, I raised my evaluation.

Formatting Issues

None

Author Response

We thank the reviewer BqXF for reviewing our paper.

We appreciate the recognition of our method’s contributions, including the use of Balanced Contact Sampling (BCS) and Vertex-Level Class-Balanced (VCB) loss to address data imbalance in dense hand contact estimation. We are also grateful for highlighting the clarity & originality of our formulation, the planned code release, and the significance of our work for hand interaction tasks. Below are our responses to the questions mentioned in the review.

Q: Flawed baseline comparisons where POSA, BSTRO, DECO are designed for full-body HS but evaluated on HO tasks

A: We thank the reviewer for this important concern. Our goal in comparing HACO with full-body contact models (POSA, BSTRO, DECO) was to demonstrate HACO’s capability in the broader context of hand contact estimation, a task lacking dedicated baselines.

In real-world use cases, full-body models (e.g., DECO) are often the only available tools for hand contact estimation from a single RGB image. To address this practical gap, we compared HACO with these methods.

However, we acknowledge the need for broader evaluation. Hence, we implemented and compared hand-specific versions of POSA, BSTRO, and DECO by replacing their SMPL/SMPL-X modules with MANO and training them on the same 14 datasets as HACO (Table R1). Additionally, we reproduced the hand contact modules from DefConNet (Decaf), Stage 1 in CHOI, and InteractionNet (DICE) (Table R2). While not designed solely for the dense hand contact task, these methods offer relevant baselines.

Table R1. SOTA comparison for dense hand contact estimation with hand version of POSA, BSTRO, DECO on MOW dataset

| Model | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA (hand ver. w/ HaMeR pred mesh) | CVPR 2021 | 0.504 | 0.400 | 0.408 |
| BSTRO (hand ver.) | CVPR 2022 | 0.518 | 0.359 | 0.380 |
| DECO (hand ver.) | ICCV 2023 | 0.495 | 0.385 | 0.395 |
| HACO (Ours) | - | 0.525 | 0.632 | 0.531 |

Table R2. SOTA comparison for dense hand contact estimation with hand contact estimation modules (†: reproduced, re-trained with 14 datasets, code not available) on MOW dataset

| Model | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| †DefConNet (from Decaf) | SIGGRAPH Asia 2023 | 0.414 | 0.404 | 0.368 |
| †Stage 1 in CHOI | AAAI 2024 | 0.521 | 0.387 | 0.407 |
| †InteractionNet (from DICE [A]) | ICLR 2025 | 0.459 | 0.392 | 0.413 |
| Data Engine in LatentAct [B] (w/ SAM2 [C] & HaMeR) | CVPR 2025 | 0.345 | 0.244 | 0.211 |
| EasyHOI | CVPR 2025 | 0.480 | 0.228 | 0.282 |
| HACO (Ours) | - | 0.525 | 0.632 | 0.531 |

HACO consistently outperforms both hand-specific variants of full-body methods and dedicated hand contact modules. These results highlight the strength of HACO’s task-specific design. We will include them in the final version to provide a more comprehensive evaluation.

Q: Primary evaluation on MOW dataset despite GT problems

A: We thank the reviewer for this important observation. While we acknowledge that the MOW dataset still contains spatial imbalance, it is a key benchmark with relatively lower spatial imbalance compared to other HO datasets (e.g., DexYCB, HOI4D), and it still serves as an important benchmark for HOI tasks (e.g., HORT [D]).

We selected MOW as our primary benchmark as it is closest to real-world hand contact scenarios among the 14 datasets. In contrast, most other datasets are collected in controlled settings, making them less suitable for evaluating dense contact under realistic, unconstrained conditions.

Nevertheless, we agree that including additional evaluations provides a broader perspective. To address this, we conducted additional SOTA comparisons on the HIC, RICH, and Hi4D datasets as part of our response to reviewer xnje (see its Tables R2, R3, and R4).

As shown in the results, HACO continues to outperform prior methods by a substantial margin. We will include these results in the final version.

Q: DeepContact, EasyHOI experiments show minimal improvements

A: We have added experiments in Tables R1 and R2 above to facilitate broader comparisons, which show significant improvements. We will add them to the final version.

Q: Table 4 ablation ordering is not motivated and it starts with non-target domains

A: The purpose of Table 4 in the main manuscript is to show that diverse hand interaction datasets are critical for robust contact estimation. To highlight the impact of dataset diversity, we intentionally excluded the target interaction category for the first three rows (e.g., excluding HO when evaluating on MOW, or HS when evaluating on RICH).

The table is ordered to begin with the most generalizable interaction types (e.g., hand-object, hand-scene), which provide broader contact coverage than narrower categories like hand-hand. Starting with the same category as the evaluation set could lead to overfitting and would obscure the benefit of diverse training.

By progressively adding categories, we emphasize that diversity is key to accurate and robust dense contact estimation.

Q: Unclear contribution of HACO vs. massive 14-dataset training

A: We thank the reviewer for raising this point. To assess the contribution of large-scale training to HACO’s performance, we conducted an ablation study comparing different training dataset sizes. The results below show that while training on the full set of 14 datasets improves performance, HACO still achieves competitive results when trained on only 3 datasets (DexYCB, ObMan, MOW), demonstrating that the method’s effectiveness does not rely solely on large-scale data.

Table R3. Ablation for training dataset size on MOW dataset

| Models | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| HACO trained on 3 datasets | 0.463 | 0.602 | 0.485 |
| HACO trained on 14 datasets | 0.525 | 0.632 | 0.531 |

We will include this ablation study in the final version of the paper to better clarify the individual contributions of the model design and training data scale.

Q: In Tables 2 & 3, BCS and VCB were not properly isolated

A: We thank the reviewer for pointing out this concern. Our ablation experiments in Tables 2 and 3 in main manuscript were intended to show the individual contributions of Balanced Contact Sampling (BCS) and Vertex-Level Class-Balanced (VCB) loss relative to the full HACO model, following standard ablation protocols adopted in many previous works. However, we acknowledge that the original tables did not include a baseline where both BCS and VCB were removed, which is helpful to fully isolate their effects.

To address this, we provide an extended ablation study in Table R4, which includes all combinations of BCS and VCB. This allows for a clearer interpretation of each component’s contribution and their combined effect on performance.

Table R4. Ablation for Balanced Contact Sampling and VCB loss on MOW dataset

| BCS | VCB | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
|  |  | 0.503 | 0.242 | 0.310 |
|  |  | 0.530 | 0.294 | 0.348 |
|  |  | 0.520 | 0.542 | 0.481 |
|  |  | 0.525 | 0.632 | 0.531 |

We will include this full ablation in the final version of the paper to better clarify the individual and joint contributions of BCS and VCB loss.

Q. Fragmented contact predictions contradict claimed "smoothness loss" benefits mentioned in L238-239

A: Fragmented contact refers to spatially isolated predictions. As noted in L238–239, our smoothness loss penalizes such discontinuities, promoting more coherent and anatomically plausible results. We will clarify this in the final version.
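
For illustration, one simple way such a penalty can be instantiated is to discourage differences in predicted contact probability between neighboring mesh vertices. The sketch below is one possible form under that assumption, not necessarily the exact loss used in the paper.

```python
import torch

def contact_smoothness_loss(contact_prob, edges):
    """Illustrative smoothness penalty (one possible instantiation, not
    necessarily the exact loss in the paper): penalize differences in predicted
    contact probability between adjacent mesh vertices.

    contact_prob: (B, V) predicted per-vertex contact probabilities
    edges:        (E, 2) long tensor of vertex index pairs sharing a MANO mesh edge
    """
    diff = contact_prob[:, edges[:, 0]] - contact_prob[:, edges[:, 1]]  # (B, E)
    return (diff ** 2).mean()
```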

Q. Missing GT visualizations in Figure 3

A: Thank you for your thoughtful feedback. We will include GT visualizations in the final version to improve interpretability.

Q. Some predictions inconsistent with visual evidence

A: We acknowledge that HACO, as the first dense hand contact estimator trained on large-scale datasets, still has limitations. We will provide a detailed discussion of failure cases to guide future improvements.

Q. Section 3.2 clarity can be improved

A: Thank you for your feedback. We will revise Section 3.2 to improve clarity on Balanced Contact Sampling.

Q. Unclear grouping of Decaf (hand-face) and Hi4D (whole-body)

A: Thank you for your detailed feedback. We will strictly categorize Decaf and Hi4D into different interaction categories and fix any mis-categorization accordingly throughout the manuscript.

Q. Inflated performance claims: Dramatic improvements in Table 5 come from comparing incompatible methods

A: We now include stronger and broader baselines that are re-trained on the same datasets as HACO. We will add them to the final manuscript.

Q. Failure case analysis: Missing discussion of when/why the method fails or performs poorly

A: We will include detailed failure cases and limitations to help future researchers build upon HACO.

[A] Wu et al. DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image. In ICLR, 2025.

[B] Prakash et al. How Do I Do That? Synthesizing 3D Hand Motion and Contacts for Everyday Interactions. In CVPR, 2025.

[C] Ravi et al. SAM 2: Segment Anything in Images and Videos. In ICLR, 2025.

[D] Chen et al. HORT: Monocular Hand-held Objects Reconstruction with Transformers. arXiv, 2025.

Comment

I would like to thank the authors for their responses to my comments, and in particular for including comparisons with re-trained versions of POSA, BSTRO, and DECO; this is much appreciated.

Q2: Could the authors please confirm whether Tables R1, R2, R3, and R4 (in response to reviewer xnje) are based on the re-trained versions of the methods?

Q5: Thank you for the ablation study. It would also be valuable to evaluate how the model performs when trained and tested exclusively on the MOW dataset.

Q7: In Figure 3, there appear to be some fragmented contact predictions. It would be helpful if the authors could clarify whether these reflect actual contact patterns or are incorrect predictions. In the latter case, the failure analysis promised in the response to Q13 would be particularly valuable. Additionally, including ground-truth contact visualizations in the paper, as promised in the response to Q8, would help clarify these cases and enhance understanding.

Comment

Dear reviewer BqXF,

We sincerely thank you for your continued engagement and thoughtful comments. We especially appreciate your recommendation and recognition of our additional experiments, particularly the re-trained hand versions of POSA, BSTRO, and DECO, which provide a broader perspective on the effectiveness of our method. Below are our responses to the additional questions.

Q2: Whether Tables R1, R2, R3, and R4 (in response to reviewer xnje) are based on the re-trained versions of the methods

Thank you for pointing this out. The results reported in Tables R1, R2, R3, and R4 in our response to reviewer xnje were based on the original versions of POSA, BSTRO, and DECO. These models were evaluated under the same setting as in the main paper, and the tables were intended to extend Table 5 of the main paper by including evaluations on additional datasets and more challenging imbalance scenarios.

However, in case the experiments on the re-trained hand versions of POSA, BSTRO, and DECO analogous to Tables R1, R2, R3, and R4 (in response to Reviewer xnje) are needed for further analysis, we proactively conducted additional evaluations with the models from Table R1 (in response to Reviewer BqXF). The results are presented below to provide a more comprehensive comparison under the broader settings.

Table R1-xnje-Re. SOTA comparison for dense hand contact estimation on an assorted dataset with high imbalance (hand version, re-trained with 14 datasets)

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.366 | 0.285 | 0.310 |
| BSTRO | CVPR 2022 | 0.387 | 0.324 | 0.351 |
| DECO | ICCV 2023 | 0.446 | 0.267 | 0.327 |
| HACO (Ours) | - | 0.550 | 0.626 | 0.576 |

Table R2-xnje-Re. SOTA comparison for dense hand contact estimation on HIC dataset (hand version, re-trained with 14 datasets)

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.060 | 0.225 | 0.086 |
| BSTRO | CVPR 2022 | 0.070 | 0.293 | 0.107 |
| DECO | ICCV 2023 | 0.067 | 0.210 | 0.094 |
| HACO (Ours) | - | 0.181 | 0.641 | 0.278 |

Table R3-xnje-Re. SOTA comparison for dense hand contact estimation on RICH dataset (hand version, re-trained with 14 datasets)

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.538 | 0.355 | 0.398 |
| BSTRO | CVPR 2022 | 0.581 | 0.413 | 0.421 |
| DECO | ICCV 2023 | 0.519 | 0.321 | 0.367 |
| HACO (Ours) | - | 0.743 | 0.570 | 0.528 |

Table R4-xnje-Re. SOTA comparison for dense hand contact estimation on Hi4D dataset (hand version, re-trained with 14 datasets)

| Models | Conference | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.537 | 0.257 | 0.315 |
| BSTRO | CVPR 2022 | 0.594 | 0.375 | 0.420 |
| DECO | ICCV 2023 | 0.553 | 0.244 | 0.304 |
| HACO (Ours) | - | 0.781 | 0.923 | 0.828 |

These results demonstrate that HACO consistently outperforms existing methods across a broader set of datasets when all models are trained under the same conditions (i.e., hand-specific versions re-trained on the same 14 datasets).

Q5: Evaluation on how the model performs when trained and tested exclusively on the MOW dataset.

Thank you for this valuable suggestion. We conducted an additional experiment where HACO is trained and tested only on the MOW dataset. This is presented below alongside the earlier results from Table R3 (in response to Reviewer BqXF), to further illustrate the impact of training dataset size.

Table R3-New. Ablation for training dataset size on MOW dataset

| Models | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| HACO trained on 1 dataset (MOW) | 0.498 | 0.348 | 0.373 |
| HACO trained on 3 datasets | 0.463 | 0.602 | 0.485 |
| HACO trained on 14 datasets | 0.525 | 0.632 | 0.531 |

These results demonstrate that HACO remains effective even when trained on a single dataset, while performance consistently improves as the scale of the training data increases.

Comment

-continued-

Q7: Clarification on Fragmented Predictions in Figure 3, with Follow-up on Promised Failure Analysis (Q13) and Ground-Truth Visualizations (Q8).

Thank you for your detailed observation and thoughtful feedback. The fragmented contact predictions in Figure 3 arise from different causes depending on the example. In rows 1 and 2, the fragmented patterns largely reflect inherent characteristics of the dataset, such as disjointed contact regions resulting from distance-based thresholding (please refer to section A.3 in supplementary material for details on labeling) between hand and object surfaces.

In contrast, the fragmented prediction in row 3 shows an aspect of incorrect contact prediction that the smoothness loss attempts to mitigate. Since we are unable to include additional figures during the rebuttal phase, we provide a textual comparison here:

  • The ground-truth in row 3 does not contain two of the fragmented regions predicted by our model (one near the lower palmar region close to the wrist and another between the ring and pinky fingers).

  • The fragmented regions within the fingers do correspond to actual contact with the pen from ground-truth label obtained by distance-based thresholding.

  • Comparing HACO with and without smoothness loss, we observe that HACO without smoothness loss exhibits more fragmentation, both in predicted contact and non-contact regions.

  • For example, BSTRO’s predictions in rows 2 and 3 contain fragmented non-contact regions, with multiple gray areas between green regions. Similarly, HACO without the smoothness loss predicts multiple disjoint non-contact regions between and within the thumb and index finger for the sample in row 3, while HACO with the smoothness loss produces a more coherent contact pattern within the thumb and index finger, as visualized in row 3 of Figure 3.

Across samples, we consistently observe that the smoothness regularization loss reduces fragmentation for both contacting and non-contacting regions, promoting spatial coherence and anatomical plausibility. We will include a more extensive line of examples as part of our failure analysis and limitations in the final version of the paper.

While we are limited to text here, we will include ground-truth contact visualizations alongside model predictions in Figure 3 in the final version to aid interpretation. We will also add qualitative comparisons of HACO predictions with and without smoothness loss to the supplementary material. A dedicated failure case analysis section will further discuss common error modes and clarify the role of smoothness loss in mitigating fragmented outputs.

We are committed to delivering the additions promised in our responses to Q8 and Q13, with the goal of enhancing transparency and interpretability for future work on dense hand contact estimation.

Closing remark

Thank you once again for your constructive feedback and questions. We appreciate your careful reading and hope these additional experiments and clarifications address your concerns. We will include this expanded discussion in the final version.

Comment

I would like to thank the authors for addressing all my comments. Given the new experiments and the changes they have promised to incorporate into the manuscript, I will raise my evaluation.

Comment

We sincerely thank the reviewer for the insightful feedback provided throughout both the review and author-reviewer discussion period, as well as for considering an updated evaluation.

We greatly appreciate the recognition of the new experiments and the planned revisions. We will ensure that all promised changes are carefully implemented and clearly reflected in the final version of the paper.

Once again, thank you for your time and effort in reviewing our work.

Comment

Dear reviewer BqXF,

We would like to follow up to check if your concerns have been addressed. In the previous response, we made the following updates/clarifications:

  • Provided broader and fairer comparisons by implementing hand-specific versions of POSA, BSTRO, DECO, and reproducing additional hand contact methods (DefConNet, CHOI, InteractionNet), along with new ablations on dataset size, method components (BCS, VCB), and evaluation on HIC, RICH, and Hi4D datasets to ensure robustness and isolate contributions.

  • Clarified network design decisions, explained ablation ordering rationale, justified our use of MOW as a primary benchmark, planned to address inconsistencies and failure cases, and committed to adding GT visualizations, clearer categorization, and revisions for Section 3.2.

This expanded discussion will be included in the final version. We are happy to answer further questions.

Review
Rating: 5

This paper proposes HACO, a method for dense hand contact estimation from imbalanced data. While many existing datasets capture various types of hand interactions (e.g., hand-object, hand-hand, hand-scene), they typically suffer from two types of data imbalance: (1) class imbalance, where non-contact samples vastly outnumber actual contact samples, and (2) spatial imbalance, where most hand contact occurs in the fingertip region. To address the class imbalance, HACO employs balanced contact sampling, which ensures fair representation across diverse sample groups. To tackle the spatial imbalance, HACO applies a vertex-level class-balanced loss that reweights the loss contribution of each hand vertex. In experiments, HACO achieves state-of-the-art performance on various hand contact-related tasks, including hand contact estimation, grasp optimization, and joint hand-object reconstruction.

Strengths and Weaknesses

[Strengths]

  1. Good motivation

As there is no contact estimation model trained on large-scale, assorted datasets, I agree that pursuing this direction is important for further improving performance. In this regard, addressing the existing datasets' imbalance issues in terms of both class and spatial distribution is a reasonable approach.

  2. Writing quality

The paper is well organized and generally easy to follow. The motivation and technical components are clearly presented.

  3. Strong empirical performance

The proposed method demonstrates strong empirical results across various hand contact-related tasks, including hand contact estimation, 3D hand grasp optimization, and 3D hand-object reconstruction, outperforming existing state-of-the-art methods. The paper also provides comprehensive ablation studies to validate the proposed components, along with extensive qualitative results in the supplementary material.

[Weaknesses]

  1. Lack of empirical support for resolving imbalance between contact and non-contact classes

One of the main motivations of this paper is to address the imbalance between contact and non-contact classes by proposing a method that faithfully learns both. However, the paper states that it “skip[s] fully non-contacting samples during evaluation” (lines 264–265), which seems contradictory. While I understand that existing contact accuracy metrics (e.g., recall, F1-score) are not suitable for evaluating non-contact samples, the paper could at least include an additional binary classification metric (e.g., accuracy in distinguishing contact vs. non-contact) to empirically support the effectiveness of this component. If non-contact samples are entirely discarded during evaluation, then a trivial strategy—such as simply filtering out all non-contact samples, as is done in some interacting hand reconstruction methods—could be a comparably effective alternative, which then questions the motivation for balanced contact sampling.

  2. Questionable design for balanced contact sampling

For balanced contact sampling, the paper groups samples based on a contact balance score, which measures “how much each hand contact sample deviates from the dataset-wide average” (lines 85–86). However, I question whether this can fairly represent the full range of contact cases, particularly if the dataset's ground-truth distribution is multimodal. Moreover, during model training, the use of smoothness regularization encourages predictions to stay close to the average—seemingly contradicting the motivation behind balanced sampling. Clarifying these design choices would be helpful to readers.

  3. Lack of justification for spatial imbalance

The paper claims that existing datasets exhibit spatial imbalance, with contact regions mostly concentrated on the fingertips. However, no evidence is provided to support this claim. As with the class imbalance issue, presenting simple statistics (e.g., contact frequency per vertex) would help validate this point.

Questions

Please see the Weaknesses section above.

Limitations

Yes

Final Justification

My initial questions were well addressed during the author–reviewer discussion. Therefore, I have updated my rating from borderline accept to accept. However, I strongly encourage the authors to clarify in the manuscript all the points discussed in the rebuttal, especially the comparison settings raised by reviewer BqXF.

Formatting Issues

None

Author Response

We thank the reviewer kFWg for reviewing our paper.

We sincerely thank the reviewer for highlighting the strengths of our work. We appreciate the recognition of our paper’s clear motivation, especially the need for contact estimation models trained on diverse datasets and our approach to data imbalance. We are also grateful for the positive remarks on the clarity of our writing and the strong empirical performance across contact estimation, grasp optimization, and reconstruction tasks, as well as the comprehensive ablation and qualitative results. Below are our responses to the questions mentioned in the review.

Q: Lack of empirical support for resolving imbalance between contact and non-contact classes

A: We appreciate the reviewer’s thoughtful comment. Our Balanced Contact Sampling (BCS) is specifically designed to mitigate the class imbalance issue at the vertex-level, which is where the contact estimation prediction and supervision occur. Therefore, although we excluded fully non-contacting hand-level samples from evaluation due to limitations in existing contact metrics (e.g., recall, F1-score), our evaluation remains valid within the vertex-level prediction context, where contact and non-contact vertices co-exist in most samples.

Nevertheless, we acknowledge that excluding fully non-contacting hands during evaluation may overlook certain aspects of the contact estimation problem, particularly the model's ability to distinguish between entirely contacting and entirely non-contacting hands at the hand level. To provide a more comprehensive evaluation, we conducted additional experiments (Table R1, Table R2, Table R3) using binary classification metrics of Accuracy, Specificity, and MCC (Matthews Correlation Coefficient), which are better suited for this scenario. Additionally, since the MOW dataset primarily contains hand-level contacting cases, we evaluate on the RICH dataset with fully non-contacting hand-level samples included, which provides a broader range of contact and non-contact instances.
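
For reference, these metrics follow the standard binary-classification formulas. The sketch below assumes a confusion matrix accumulated at the hand level (Table R1) or the vertex level (Tables R2 and R3); how a hand is declared "in contact" (e.g., at least one predicted contact vertex) is an assumption for illustration, not a detail taken from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard Accuracy, Specificity, and Matthews Correlation Coefficient from
    confusion-matrix counts (how the counts are accumulated, e.g., per hand or
    per vertex, is an assumption of this sketch)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn) - (fp * fn)) / denom if denom > 0 else 0.0
    return accuracy, specificity, mcc
```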

Table R1. SOTA comparison for hand-level contact estimation on the RICH dataset including fully non-contacting hand samples

| Models | Conference | Accuracy ↑ | Specificity ↑ | MCC ↑ |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.561 | 0.285 | 0.222 |
| BSTRO | CVPR 2022 | 0.627 | 0.488 | 0.254 |
| DECO | ICCV 2023 | 0.614 | 0.491 | 0.328 |
| HACO | - | 0.802 | 0.522 | 0.604 |

Table R2. SOTA comparison for dense hand contact estimation on the RICH dataset including fully non-contacting hand samples

| Models | Conference | Accuracy ↑ | Specificity ↑ | MCC ↑ |
| --- | --- | --- | --- | --- |
| POSA | CVPR 2021 | 0.691 | 0.768 | 0.392 |
| BSTRO | CVPR 2022 | 0.839 | 0.870 | 0.497 |
| DECO | ICCV 2023 | 0.812 | 0.905 | 0.557 |
| HACO | - | 0.886 | 0.910 | 0.621 |

Table R3. Ablation study for Balanced Contact Sampling (BCS) and VCB loss on the RICH dataset including fully non-contacting hand samples

| BCS | VCB | Accuracy ↑ | Specificity ↑ | MCC ↑ |
| --- | --- | --- | --- | --- |
|  |  | 0.554 | 0.527 | 0.239 |
|  |  | 0.608 | 0.616 | 0.323 |
|  |  | 0.808 | 0.817 | 0.549 |
|  |  | 0.886 | 0.910 | 0.621 |

These results demonstrate that our method remains effective in addressing the contact vs non-contact imbalance, even when fully non-contacting hand-level samples are retained during evaluation.

Regarding the concern about a trivial strategy such as filtering out all non-contacting samples during training, we emphasize that this approach is not applicable in our setting. Our method explicitly targets vertex-level class imbalance, where hand-level contact is irrelevant from a training standpoint. Therefore, we do not need to discard entire hand-level non-contacting samples. In fact, doing so would reduce the diversity of non-contact patterns seen during training, which is counterproductive for learning a robust vertex-level contact estimator. By retaining all samples and addressing imbalance at the vertex level through Balanced Contact Sampling and VCB loss, our method is able to learn from both contact and non-contact regions in a fine-grained manner, enabling more robust and accurate contact prediction.

We will revise the final version to better clarify this design choice and include evaluation results with fully non-contacting hand samples.

Q: Questionable design for balanced contact sampling

A: Thank you for the insightful question. Our goal with Balanced Contact Sampling (BCS) is to mitigate the severe imbalance in vertex-level contact labels, a common challenge in real-world datasets where most hand vertices are non-contact. Since the ground-truth contact labels are provided per-hand (rather than per vertex-image pair), sampling must be performed at the hand-level, making it essential to design a principled, hand-level proxy that reflects the underlying vertex-level imbalance.

To achieve this, we compute a dataset-wide average contact vector as a neutral reference and define a per-hand contact balance score based on the deviation of each hand’s contact vector from this mean. This scalar score provides a consistent and interpretable way to stratify hands by how contact-heavy, contact-sparse, or atypical their vertex-level contact patterns are. The mean is not used to enforce a unimodal assumption, but to offer a common reference point across thousands of heterogeneous contact configurations.

We emphasize that BCS is not designed to simply span the extremes of the contact spectrum (e.g., from fully contact to fully non-contact), but rather to increase exposure to diverse and underrepresented hand contact patterns. Real-world datasets tend to be skewed and sparse along this spectrum, making naive or uniform sampling approaches risk over-representing common patterns or producing arbitrary diversity without structure. In contrast, BCS applies non-linear binning and stratified sampling based on the contact balance score to construct a training subset that is structurally balanced and representative.
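
As a rough illustration of this pipeline, the sketch below groups hands by their deviation from the dataset-wide mean contact vector and then samples each group evenly. The deviation measure, the quantile-based bin edges, and the sample counts are placeholders; the actual contact balance score is defined in Eq. 1 of the paper.

```python
import numpy as np

def balanced_contact_sampling(contact_labels, num_bins=5, samples_per_bin=1000, seed=0):
    """Illustrative Balanced Contact Sampling (placeholder implementation).

    contact_labels: (N, V) binary per-vertex contact labels for N training hands
    """
    rng = np.random.default_rng(seed)
    mean_contact = contact_labels.mean(axis=0)                   # dataset-wide average contact vector
    score = np.abs(contact_labels - mean_contact).mean(axis=1)   # per-hand deviation from the mean

    # Non-linear binning: quantile-based edges spread the skewed score distribution.
    edges = np.quantile(score, np.linspace(0.0, 1.0, num_bins + 1))
    bins = np.clip(np.digitize(score, edges[1:-1]), 0, num_bins - 1)

    # Stratified sampling: draw the same number of hands from every group.
    selected = []
    for b in range(num_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size > 0:
            selected.append(rng.choice(idx, size=samples_per_bin, replace=idx.size < samples_per_bin))
    return np.concatenate(selected)
```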

However, we believe that future extensions of HACO could incorporate clustering or interaction-aware multi-modality modeling if the presence of such multi-modality is well analyzed. We will clarify these points and the implications of our sampling strategy in the final version.

Regarding the concern about regularization, we acknowledge that the regularization term encourages predictions to not deviate excessively from the dataset-wide contact mean. While this may seem to contrast with the goal of BCS, which promotes diverse contact patterns, these two components serve different but complementary purposes. The regularization acts as a stabilizer to prevent overfitting to extreme or noisy contact configurations, ensuring that the model remains within a reasonable solution space. This is similar to many standard regularization strategies in deep learning, including adversarial training or entropy regularization, which temper the main loss without undermining its core objective. In our case, BCS enhances the diversity of training signals, while the regularization ensures stability and robustness.

We will revise the final version to clarify these design choices and explicitly address this point.

Q: Lack of justification for spatial imbalance

A: Thank you for the thoughtful feedback. We visualize spatial imbalance using heatmaps on the hand surface in Figure 1-b of the main manuscript for dataset-wide imbalance and Figures A2 and A3 in the supplementary material for per-dataset imbalance. These visualizations highlight the spatial imbalance present for all 14 datasets. In particular, they show that fingertips tend to exhibit higher contact probability (in red) compared to other regions of the hand in many datasets.

However, we agree that including raw data statistics, such as bar plots showing the contact-to-non-contact ratio per vertex, would offer a more detailed and quantitative view of the spatial imbalance. Unfortunately, due to the limitations of the rebuttal format, we were unable to include these visualizations at this stage. We will incorporate this additional analysis in the final version to enhance clarity and depth.

Comment

Dear reviewer kFWg,

We would like to follow up to check if your concerns have been addressed. In the previous response, we made the following updates/clarifications:

  • Clarified that our method addresses vertex-level imbalance, where hand-level filtering is not directly applicable. We also provided additional evaluation including fully non-contacting hand samples using binary metrics (Accuracy, Specificity, MCC).

  • Explained that the contact balance score in BCS uses the dataset mean as a neutral reference point (rather than assuming unimodality), clarified the complementary role of regularization, and noted that while spatial imbalance is already visualized, we will additionally include raw quantitative statistics in the final version.

This expanded discussion will be included in the final version. We are happy to answer further questions.

Comment

I appreciate the authors' response. Since my concerns have been addressed, I will keep my positive rating. I may consider further updating my rating depending on whether the question regarding the unfair comparisons (raised by reviewer BqXF) is fully clarified.

Comment

We sincerely thank the reviewer for the positive evaluation and for acknowledging that the concerns have been addressed.

Regarding the question of unfair comparisons raised by Reviewer BqXF, we recognize the importance of fully addressing this point. In the final version, we will ensure that the issue is clearly clarified and that the experimental comparisons cover a broader perspective that ensures fair and comprehensive comparison.

Comment

I appreciate the authors’ engagement in the discussion. Since the question regarding unfair comparisons (raised by reviewer BqXF) has been clarified, I will update my rating. Nevertheless, I believe it is very important to fully clarify these points in the revision, and I trust the authors will do so.

Comment

We sincerely thank the reviewer for the constructive feedback provided during the review process and for considering an updated evaluation after the question regarding unfair comparisons (raised by reviewer BqXF) was clarified.

We fully agree on the importance of clearly addressing all points raised during both the review and author-reviewer discussion period, including those from all reviewers. In particular, the concern regarding unfair comparisons is critical not only for the integrity of the paper but also for enabling it to serve as concrete guidance for future work. We will be fully committed to clarify and explicitly present all points raised by the reviewers in the final version.

Thank you for your time and effort in reviewing our work.

Review
Rating: 5

This paper proposes HACO, a framework that recovers dense hand contact from visual inputs. As contact labels are intrinsically imbalanced, HACO introduces Balanced Contact Sampling and a Vertex-Level Class-Balanced (VCB) loss to mitigate the issue. Balanced Contact Sampling mitigates the class imbalance by dividing the dataset into multiple sampling groups that have different levels of hand contact. The Vertex-Level Class-Balanced (VCB) loss builds on the class-balanced loss and uses per-vertex labels to compute the effective number of samples when applying a BCE loss to vertex contact labels. With these two components combined, the authors train the contact estimation model across multiple hand datasets based on HaMeR. The experimental results show that HACO's contact estimation can boost existing hand grasp optimization methods and contact-aware hand-object reconstruction methods.

Strengths and Weaknesses

Strengths:

  1. The proposed method targets at direct estimation of dense hand contact from vision inputs, which is an underexplored area.
  2. The proposed method, HACO, is evaluated on actual hand grasp optimization and hand-object reconstruction tasks, where the effectiveness of the estimated hand contact map is verified.
  3. The effectiveness of each component of HACO is validated through extensive ablation studies. In particular, the comparison of the VCB loss with commonly used losses and losses specifically designed for class imbalance demonstrates its effectiveness in addressing vertex-level class imbalance.
  4. This paper is clearly written.

Weaknesses:

  1. (Extension to continuous contact representation) This paper does not contain a discussion of how the proposed method could be applied to the regression of continuous contact representations, e.g., offset vectors from the hand surface to contact points, or SDF values of hand surface points.

Questions

  1. (Weakness 1) Is it possible to extend and apply the proposed method to continuous contact representations?

Limitations

Yes.

Final Justification

The authors' response resolves my concern about whether the proposed method can be extended to continuous contact representations. The additional experiments in the rebuttal are crucial for supporting the paper's claimed contributions and have to be included in the final manuscript.

Formatting Issues

None.

Author Response

We thank the reviewer a7WB for reviewing our paper.

We sincerely thank the reviewer for recognizing the strengths of our work. We appreciate the acknowledgment that our method addresses the underexplored problem of dense hand contact estimation from vision inputs. We are also glad that the effectiveness of our contact predictions in hand grasp optimization and hand-object reconstruction tasks was appreciated. Furthermore, we thank the reviewer for highlighting the value of our extensive ablation studies, especially the analysis of the VCB loss, which demonstrates clear advantages over standard and class imbalance aware losses. Lastly, we are grateful for the comment that the paper is clearly written. Below are our responses to the questions mentioned in the review.

Q: Is it possible to extend and apply the proposed method to continuous contact representations?

A: We thank the reviewer for raising this insightful and important question. Our current formulation focuses on discrete (binary) vertex-level contact estimation, which enables interpretability and direct integration with downstream tasks such as contact-driven hand-object optimization and reconstruction. While our representation is discrete, we recognize that continuous contact fields, such as offset vectors or signed distance functions (SDFs), represent an important and growing research direction in hand contact modeling.

From a training standpoint, both Balanced Contact Sampling (BCS) and Vertex-Level Class-Balanced (VCB) loss can be extended to regression settings. In the discrete case, BCS mitigates class imbalance between contact and non-contact labels by grouping hand samples based on their overall contact ratio and ensuring that the model sees a more balanced distribution of contact-related and non-contact-related vertices during training. In the continuous setting, a similar imbalance arises: the distribution of contact magnitudes (e.g., SDF values or offset norms) tends to be skewed, with values indicating distant (non-contact) configurations being much more frequent than those near contact. To address this distributional imbalance, BCS can group training samples based on statistics of the continuous contact signal, such as the distribution of offset vectors or SDF values from the hand surface, to ensure balanced exposure that does not skew toward either contact or non-contact cases.

The VCB loss, originally designed to address spatial imbalance across the hand surface in the discrete case, can also be extended to continuous contact regression. In our setting, spatial locations (e.g., mesh vertices or surface regions) exhibit different contact signal distributions. For example, fingertip regions tend to be more frequently in contact, while other areas such as the dorsum are rarely contacted. This imbalance can persist in continuous representations such as SDFs or offset vectors, where certain regions consistently exhibit lower absolute SDF values (i.e., closer proximity), while others predominantly produce large-distance signals. As a result, the regression model may become biased toward frequently occurring signals, which differ by spatial region. To address this, the VCB loss can be extended by reweighting the regression loss per spatial location using a continuous weighting strategy that assigns higher importance to underrepresented contact-relevant signals. This approach generalizes the vertex-wise effective number of samples strategy used in our original discrete VCB loss, which discretely re-weights contact and non-contact contributions at each vertex based on their contact frequency. In the continuous case, the weighting function would vary smoothly with the contact signal itself, enabling fine-grained correction of spatial imbalance without relying on discrete labels. This formulation would help the model avoid bias toward dominant signals and encourage learning that is more spatially balanced across the hand surface.
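
Purely as a speculative sketch of what such a continuous reweighting could look like (the density estimate and the inverse-frequency weighting below are assumptions for illustration, not part of the published method):

```python
import torch

def continuous_vcb_loss(pred_sdf, gt_sdf, signal_density, eps=1e-6):
    """Speculative continuous VCB-style reweighting for per-vertex SDF regression.

    pred_sdf, gt_sdf: (B, V) predicted / ground-truth signed distances per vertex
    signal_density:   (B, V) estimated training frequency of each ground-truth
                      signal at its vertex (e.g., from per-vertex histograms of
                      training SDF values); this estimator is an assumption
    """
    weights = 1.0 / (signal_density + eps)   # rarer contact-relevant signals get larger weights
    weights = weights / weights.mean()       # normalize to keep the overall loss scale stable
    return (weights * (pred_sdf - gt_sdf) ** 2).mean()
```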

From an inference or integration standpoint, HACO’s binary contact predictions can serve as auxiliary supervision or initialization for continuous contact regression. For instance, HACO can be used to first localize probable contact regions, allowing a second-stage model to refine predictions into continuous fields such as SDFs or offset vectors. Conversely, a model trained to regress continuous contact fields can have its outputs thresholded or quantized to derive discrete contact labels, which may be useful for downstream tasks.

We will include this expanded discussion in the final version.

Comment

Dear reviewer a7WB,

We would like to follow up to check if your concerns have been addressed. In the previous response, we made the following updates/clarification:

  • Clarified how the proposed method can be extended to continuous contact representations (e.g., SDFs or offset vectors) by adapting both BCS and VCB loss to regression settings.

  • Discussed on how HACO’s binary contact predictions can serve as guidance or initialization for continuous estimators.

This expanded discussion will be included in the final version. We are happy to answer further questions.

Comment

I appreciate the response from the authors. My main concern has been addressed.

Furthermore, I believe the comparison with re-trained existing human-scene contact methods, as requested by Reviewer BqXF, is essential and needs to be included in the final text. When I initially read this paper, I had assumed these baselines were already adapted for hand-object scenarios, which turned out to be a major misunderstanding. The rating is kept, but the confidence is lowered.

Comment

We sincerely thank the reviewer for acknowledging that the main concern has been addressed and for sharing additional feedback regarding the baseline comparisons. We also appreciate the clarification regarding the initial misunderstanding.

We fully agree that it is important to clearly indicate whether baseline experiments were already adapted for hand-object scenarios, as this distinction is critical for correct interpretation. To avoid further confusion, we will revise the text to make this point explicit. In addition, we will include the results of re-trained existing human-scene contact methods, as requested by Reviewer BqXF, in the final version.

Final Decision

(a) The paper proposes a dense hand-contact estimation method that explicitly addresses two forms of imbalance common in today's multi-dataset settings: 1) class imbalance (much more non-contact than contact) via Balanced Contact Sampling (BCS); 2) spatial imbalance (contacts concentrated at fingertips) via a vertex-level class-balanced (VCB) loss. The authors also evaluate downstream impact on grasp optimization (ContactOpt) and hand-object reconstruction.

(b) Strengths:
  • The problem is well-scoped and targets an underexplored task (dense contact), with principled handling of class/spatial imbalance and clear downstream impact (grasp optimization, reconstruction).
  • The paper is well-written and organized.

(c) Weaknesses:
  • The BCS score design could be further stress-tested for multimodality (beyond mean-deviation).
  • The promised GT overlays and failure analysis should be delivered and included in the camera-ready.

(d)(e) The paper received mixed ratings {5, 5, 4, 3}. The rebuttal materially addressed fairness, metrics with non-contact hands, and the isolation of method vs. data scale, moving the balance to borderline-positive. While not architecturally novel, the paper makes a useful, evidence-backed contribution to a task where imbalance hurts both learning and downstream usage. The AC agreed with the acceptance decision.