HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Abstract
Reviews and Discussion
Unlike previous works, which predict hand trajectories from high-level language task commands, this paper extends the classic hand trajectory prediction task to two tasks involving explicit or implicit language queries. By developing new benchmarks for the two proposed tasks, named Vanilla Hand Prediction (VHP) and Reasoning-based Hand Prediction (RBHP), the paper requires the model to acquire an extensive understanding of human daily activities and the ability to reason about what is happening next given cues from the current scene. Specifically, the paper proposes a model named HandsOnVLM which generates textual responses and produces future hand trajectories through natural-language conversations. The experiments validate that the model outperforms existing methods on the proposed tasks.
Strengths
(1) This paper is easy to follow and well-written. (2) The tasks proposed in this paper sound interesting, and I do believe they play an important role in egocentric-relevant applications such as VR/AR and robotics.
Weaknesses
Although the proposed tasks are interesting, I find that some critical details are either missing or require further elaboration in this paper, such as: (1) Definition of hand pose: it remains unclear how the authors define the hand pose, whether it is based on joint positions, a bounding box, or another representation. The paper merely includes two red and blue curves on images as qualitative results, which is ambiguous and lacks clarity. (2) Details of the H2O and FPHA datasets: the paper provides insufficient information about the specifics of the H2O and FPHA datasets, particularly regarding the labels used in the study. This omission makes it challenging to fully understand the data, leaving readers to infer these details from the training objectives alone. (3) Experiments: it is difficult to observe stable improvement in the experiments (e.g., in Table 1).
Questions
In A.1, the training sets of VHP and RBHP only contain Epic-Kitchen; as far as I know, this dataset does not contain hand pose labels. Where did the authors obtain the hand pose labels?
Definition of hand pose. Thank you for the question. We would like to clarify that the hand-pose abstraction we use in this paper is simply the future hand center for the left and right hand, respectively, projected into the last observation frame. To obtain ground-truth trajectories, we first run an off-the-shelf active hand-object detector (Shan et al., 2020a) to get the bounding boxes of the hand and object in each future frame. We then take the centroid of the hand bounding box as the hand location for that frame and project it into the last observation frame; these projected locations are what we use for training the HandsOnVLM prediction model.
Observing stable improvement over baselines in the experiments. Thank you for your valuable observation regarding the model performance. Upon a thorough code review, we identified a critical computational error in our displacement-error metric calculation. Specifically, our implementation was incorrectly computing the Euclidean distance between all hand positions rather than between the specific hand pair being evaluated, which led to inflated error values. After correcting this implementation error (replacing `gt_last_hand - pred_last_hand` with `cur_gt_last_hand - cur_pred_last_hand` in the distance calculation), we re-ran all experiments and observed substantially more stable and significant improvements. The corrected results show consistent improvement patterns across all experimental settings. We have updated all experimental results in the paper to reflect these corrections, which now more accurately demonstrate the effectiveness of our approach. We sincerely thank the reviewer for prompting this verification, which led to a more precise evaluation of our method's performance.
EK55 and EK100 are evaluated on the validation split; H2O and FPHA are evaluated zero-shot.

| Approach | BBox Input | EK55 ADE | EK55 FDE | EK55 WDE | EK100 ADE | EK100 FDE | EK100 WDE | H2O ADE | H2O FDE | H2O WDE | FPHA ADE | FPHA FDE | FPHA WDE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KF | | 0.392 | 0.386 | 0.199 | 0.317 | 0.318 | 0.168 | - | - | - | - | - | - |
| OCT | | 0.216 | 0.199 | 0.105 | 0.209 | 0.187 | 0.102 | - | - | - | - | - | - |
| OCT-global | | 0.232 | 0.218 | 0.115 | 0.216 | 0.193 | 0.105 | - | - | - | - | - | - |
| LLaVA-Pixel2Seq | | 0.156 | 0.139 | 0.076 | 0.254 | 0.224 | 0.124 | 0.150 | 0.121 | 0.032 | 0.214 | 0.189 | 0.043 |
| LLaVA-Traj | | 0.142 | | 0.073 | 0.201 | 0.191 | 0.103 | 0.130 | | 0.031 | 0.191 | 0.167 | 0.041 |
| HandsOnVLM | | 0.136 | 0.135 | | | | | | | | | | |
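For reference, here is a minimal sketch of the corrected per-hand displacement computation; the array names and data layout are hypothetical and simplified relative to our actual evaluation code.

```python
import numpy as np

def displacement_errors(pred, gt):
    """Compute ADE/FDE averaged over hands.

    pred, gt: dicts mapping a hand id ("left"/"right") to arrays of shape
    (T, 2) holding normalized (x, y) positions for each future frame.
    """
    ades, fdes = [], []
    for hand in pred:  # evaluate each hand only against its own ground truth
        cur_pred, cur_gt = pred[hand], gt[hand]
        dist = np.linalg.norm(cur_pred - cur_gt, axis=-1)  # per-frame error
        ades.append(dist.mean())  # average displacement error
        fdes.append(dist[-1])     # final displacement error (last frame)
    return float(np.mean(ades)), float(np.mean(fdes))

# The original bug differenced all hand positions jointly
# (gt_last_hand - pred_last_hand); the fix restricts the difference to the
# matching pair (cur_gt_last_hand - cur_pred_last_hand), as done above.
```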
Details about the specifics of the H2O and FPHA datasets and obtaining the hand pose labels from Epic-Kitchen. To generate the hand labels for all the datasets, following Liu et al. (2022), we first run an off-the-shelf active hand-object detector (Shan et al., 2020a) to get the bounding box of the hand in each frame. To obtain the ground truth for each future hand trajectory, we compute pairwise homographies by matching SURF features of masked regions through RANSAC and project each future hand position into the last observation frame. We then apply cubic Hermite spline interpolation to smooth the projected trajectories and fill in any missing points. Finally, we filter the resulting trajectories with multiple criteria, including confidence thresholds, highest-score detection selection, feature-matching thresholds, trajectory completeness checks, and boundary constraints.
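For illustration, a condensed sketch of this label-generation pipeline is given below. It is a simplification under several assumptions: SIFT is used as a stand-in for SURF (which requires opencv-contrib), the masking of detected regions and most filtering criteria are omitted, and `hand_boxes` stands for the per-frame output of the off-the-shelf hand detector.

```python
import cv2
import numpy as np
from scipy.interpolate import PchipInterpolator  # a cubic Hermite interpolant

def homography_to_last(frame, last_obs_frame):
    """Estimate the homography mapping `frame` into the last observation frame."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame, None)
    kp_b, des_b = sift.detectAndCompute(last_obs_frame, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def build_gt_trajectory(last_obs_frame, future_frames, hand_boxes):
    """hand_boxes[t] is an (x1, y1, x2, y2) detector box for one hand, or None."""
    times, points = [], []
    for t, (frame, box) in enumerate(zip(future_frames, hand_boxes)):
        if box is None:
            continue  # missing detection; filled by interpolation below
        cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0  # box centroid
        H = homography_to_last(frame, last_obs_frame)
        p = cv2.perspectiveTransform(np.float32([[[cx, cy]]]), H)[0, 0]
        times.append(t)
        points.append(p)
    times, points = np.asarray(times), np.asarray(points)
    # Cubic Hermite interpolation smooths the track and fills missing frames.
    full_t = np.arange(len(future_frames))
    traj = np.stack([PchipInterpolator(times, points[:, d])(full_t)
                     for d in range(2)], axis=-1)
    return traj  # (T, 2) hand trajectory in last-observation-frame pixels
```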
Please do not hesitate to let us know if we can clarify anything else for a revised assessment of the paper.
Dear Reviewer,
The discussion phase is coming to an end soon, and we thus kindly request you to let us know if our response below has addressed your concerns. We will be happy to answer if there are additional issues/questions, and if not we would be grateful if you would consider a revised assessment of the review score to reflect that the issues have been addressed.
Best,
Authors of HandsOnVLM (submission 960)
Thank you for the authors' responses and for addressing my concerns. I have carefully reviewed the feedback, which has successfully clarified most of my questions. I also acknowledge that the authors made some mistakes in their initial submission. In my view, it is acceptable to correct these mistakes and report updated results.
After further consideration, I have adjusted my final rating to borderline, with inclination toward a weak reject. However, I would not object to accepting this paper.
My primary concern lies in the paper's true contribution to the egocentric community. While I appreciate the motivation behind this work, my initial expectation from Figure 1 was that this task, and potential follow-up works, could bring benefits to real-world applications such as VR/AR or robotics. However, the task proposed in this paper focuses solely on 2D labels, without incorporating any 3D cues (e.g., depth, 3D hand keypoints, or hand poses like MANO). Incorporating these elements could enable tackling "truly challenging but applicable tasks," such as robot planning or even real-world manipulation. The value of an HOI model that only predicts trajectories on 2D images feels limited, particularly considering that numerous benchmark papers in the egocentric community have also confined themselves to 2D egocentric understanding.
That said, this is only my personal perspective. I believe that incorporating 3D-related information could significantly enhance the impact and reception of the work. Again, my current rating is borderline, with an inclination to reject.
Thank you for your thoughtful feedback and constructive suggestions. Your insights, particularly regarding the integration of 3D information, are valuable for enhancing the practical impact of our work. We would like to address your main concerns:
The contribution of this paper to the community.
- We extend current VLMs with a new output modality: action (hand) trajectory prediction.
- We extend traditional egocentric hand predictions to natural language and reasoning-based prediction tasks.
- We develop a general time-series prediction pipeline that can be extended to any representation of hand poses.
The application to the VR/AR field. For VR/AR applications, users can directly interact with our system using natural language, and VR/AR devices can directly display the prediction in pixel space to the users.
The application to robotics manipulation. Many previous works (Qin et al., 2022; Chang et al., 2024) have explored learning from priors trained on human videos. Manipulation policies can also be conditioned on pixel-space information such as goal positions or 2D trajectories (Bharadhwaj et al., 2024).
The reason for choosing 2D instead of 3D in our work. Although our system can easily be extended to 3D prediction by replacing the trajectory decoder and training on 3D trajectory data, this remains challenging due to the scarcity of high-fidelity 3D trajectory data in prior work. Many datasets only provide 2D bounding-box hand annotations. Even the state-of-the-art HaMeR hand-mesh extraction model (Pavlakos et al., 2024) still requires an external 2D hand bounding-box detector to extract the bounding box first, which leads to an accumulation of errors in the process of obtaining any 3D trajectory data. In addition, monocular depth estimation techniques suffer from significant errors in predicting temporally consistent video depth and thus cannot be readily applied to large-scale egocentric videos. Thus, our key reasons for choosing 2D are summarized as follows:
- If we tried to do 3D prediction, we would not be able to scale the data to this extent due to the limitations mentioned above.
- 2D allows a clearer validation of the core idea of extending VLMs with a new modality, and of understanding how VLMs scale with egocentric trajectory data, without the added confounding errors introduced by trying to curate 3D data.
- 2D lets us establish empirical foundations for hand-object interaction prediction and provide actionable insights that will benefit future 3D extensions.
While we agree that incorporating 3D information would benefit real-world applications, we believe our current work makes a significant contribution by incorporating the strong world priors and reasoning ability of VLMs into egocentric trajectory prediction, which was previously unexplored in the literature. We would be grateful if the reviewer would kindly consider an improved assessment of the paper. We thank you very much for your time and feedback.
References:
Qin, Yuzhe, et al. "Dexmv: Imitation learning for dexterous manipulation from human videos." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
Chang, Matthew, Aditya Prakash, and Saurabh Gupta. "Look ma, no hands! agent-environment factorization of egocentric videos." Advances in Neural Information Processing Systems 36 (2024).
Bharadhwaj, Homanga, et al. "Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation." arXiv preprint arXiv:2405.01527 (2024).
Pavlakos, Georgios, et al. "Reconstructing hands in 3d with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
The authors propose training a video-language model which can reason about hand trajectories (curves, not poses) given videos and user queries as language input. Two associated benchmarks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP), are introduced. VHP consists of predicting hand trajectories given an input video segment and a clear description of the object to be manipulated and the action to be performed. RBHP consists of predicting hand trajectories given less straightforward language input for which more complex reasoning must be performed. The authors curate datasets for both benchmarks and promise to release these to the community. Evaluation on kitchen settings, as well as zero-shot evaluation on not (fully) kitchen-related datasets for the VHP benchmark, shows the superiority of the method against state-of-the-art baselines.
Strengths
The paper is well-written.
The proposed method has useful applications in research and industry.
The authors promise to release two new, relevant benchmarks to the community.
The proposed method is evaluated on multiple datasets and makes qualitatively sound predictions on unseen datasets.
The proposed method outperforms most state-of-the-art architectures.
Weaknesses
The RBHP benchmark only includes kitchen scenes, and numerical validation is hence performed on kitchen settings only.
A comparison with a static-frame version of the proposed architecture would be appropriate to ensure fairness in Tabs. 1 and 2, as the baselines do not have access to the context provided by the observation history. Additionally, the need to use videos is a limitation, as in many settings we do not have a history of frames available.
Typing errors in lines 208 (l_hand) and 325 (FPGA).
The prediction is limited to curves. A version involving hand poses would be very useful for many settings.
Questions
Did you test your method with different numbers of input frames? What happens when you use only one frame? How well does your model handle the case of using a different number of tokens than what was used during training? Maybe the weakness I listed does not apply if the method performs well even when operating on a single frame.
Results beyond kitchen settings. We thank the reviewer for pointing out the predominance of kitchen scenes in the current results. We have now performed comparisons on scenes from the Ego4D dataset that contains a lot of non-kitchen tasks. Note that these evaluations are ZERO-SHOT since our HandsOnVLM model was not trained on Ego4D. We see that the trend in results continues to hold in these evaluations.
| Approach | Epic-Kitchen ADE | Epic-Kitchen FDE | Epic-Kitchen WDE | Ego4D ADE | Ego4D FDE | Ego4D WDE |
|---|---|---|---|---|---|---|
| Kling 1.5 | 0.311 | 0.358 | 0.197 | 0.277 | 0.411 | 0.184 |
| LumaLabs | 0.293 | 0.377 | 0.189 | 0.286 | | 0.135 |
| LLaVA-Pixel2Seq | 0.277 | 0.248 | 0.137 | 0.312 | 0.287 | 0.143 |
| LLaVA-Traj | 0.196 | 0.187 | 0.101 | 0.381 | 0.353 | 0.178 |
| HandsOnVLM | 0.197 | 0.165 | 0.094 | 0.229 | 0.195 | 0.100 |
| HandsOnVLM | 0.228 | | | | | |
Videos as observation context. We would like to clarify any confusion regarding the context window of our approach and the baselines. For Tables 1 and 2, the baselines and our approach all have access to the same video context, so the comparisons are on exactly the same input. Also, kindly note that our approach does not strictly require a video context and can also use a single image as context. For example, in some of the new results where we compare with video-generation-followed-by-hand-tracking approaches, the evaluations are all conditioned only on the last frame of the context video (i.e., not the entire video).
Hand poses for predictions. Thanks for the comment about potentially predicting full hand poses as future work. We definitely agree that this would be very valuable. The reason we did not attempt this in the current paper is that state-of-the-art hand pose detectors (e.g., HaMeR, FrankMocap) suffered from significant errors when applied to the human video datasets we considered in this paper, and hence did not provide reliable ground truth for our prediction model. As hand pose tracking gets better in the future, we hope to extend our framework to predicting full hand poses.
Single frame as input. Thank you for the suggestion to explicitly evaluate the conditioning of the model on a single frame as input. We have performed this comparison in the revised paper (Table 5) by conditioning on the last frame of the input video context. We find that the results in this evaluation scenario are comparable to the setting where the context is a video.
| Task | Method | ADE | FDE | WDE |
|---|---|---|---|---|
| VHP | OCT | 0.209 | 0.187 | 0.102 |
| VHP | OCT-last-im | 0.213 | 0.191 | 0.104 |
| VHP | OCT-global | 0.216 | 0.193 | 0.105 |
| VHP | OCT-global-last-im | 0.212 | 0.189 | 0.103 |
| VHP | HandsOnVLM | | | |
| VHP | HandsOnVLM-last-im | 0.197 | 0.165 | 0.094 |
| RBHP | HandsOnVLM | 0.197 | 0.165 | 0.094 |
| RBHP | HandsOnVLM-last-im | 0.197 | 0.163 | 0.093 |
| RBHP | HandsOnVLM | 0.156 | | 0.089 |
| RBHP | HandsOnVLM-last-im | | | |
Thanks for pointing out the typos in lines 208 and 325. We have now edited them in the revised paper. Please do not hesitate to let us know if we can clarify anything else for an improved assessment of the paper.
Dear Reviewer,
The discussion phase is coming to an end soon, and we thus kindly request you to let us know if our response below has addressed your concerns. We will be happy to answer if there are additional issues/questions, and if not we would be grateful if you would consider updating your score to reflect that the issues have been addressed.
Best,
Authors of HandsOnVLM (submission 960)
In light of the strong results provided by the authors on non-kitchen settings, as well as the thorough response to the rest of my review, I have raised my score.
Dear Reviewer 9WrP,
Thank you for your insightful review and for taking the time to consider our responses. We're grateful that you found our responses addressing your concerns satisfactory.
Best,
Authors of HandsOnVLM (submission 960)
The authors propose HandsOnVLM, a VLM-based framework for reasoning about hand activities and predicting hand motions. In this framework, hand trajectories are decoded from special <HAND> token embeddings produced by the VLM. HandsOnVLM achieves SOTA on the proposed benchmarks for the Vanilla Hand Prediction (VHP) and Reasoning-based Hand Prediction (RBHP) tasks.
Strengths
- The proposed task appears engaging, as egocentric hand activities and motion prediction present challenging problems.
- The idea of encoding the hand as an embedding is novel.
Weaknesses
- The authors only compare their method with naive baselines and traditional methods for hand motion prediction. One potential additional baseline would be to use foundation models for video prediction and then track the hand motion in the generated videos as the prediction.
Questions
- The authors report results on the Epic-Kitchen dataset, which includes a lot of ego-motion. The ego-motion can make hand motions change significantly, yet it is hardly predictable. How did the authors handle this problem?
- How the <HAND> token is decoded to hand motions is not clear to me.
Handling ego-motion in Epic-Kitchens. Thank you for the question about ego-motion. We would like to clarify that for dataset generation, we consider short 3-5 second clips where the ego-motion is naturally not significant; in addition, we filter the trajectories to omit outliers so that the training dataset does not contain trajectories with heavy ego-motion. We will publicly release this curated dataset to the community. For the architecture itself, some previous works include ego-motion-awareness mechanisms that can potentially enhance performance. However, to maintain design simplicity and ensure compatibility with modern video-based vision-language models, we chose not to incorporate additional specialized modules in our architecture and instead only mildly curate the training data.
<HAND> token decoding. Thank you for the question regarding hand token decoding. Here we describe the step-by-step decoding procedure in both the training and inference processes; a simplified code sketch follows the list below.
- Training Process (please refer to Figure 5 in the Appendix): when token (i + 1) is a <HAND> token in the ground-truth sequence, we perform two training tasks.
  - Token Prediction Task: we take the last-layer embedding of token i from the LLM and process it through a linear layer. The model is trained using the next-token prediction loss.
  - Hand Trajectory Prediction Task (if applicable): we use the same last-layer embedding of token i and input it as a condition into the CVAE to predict the hand position of token (i + 1). The model is trained using the hand trajectory prediction loss.
- Inference Process (please refer to Figure 6 in the Appendix): for each token i in the sequence, we follow these steps:
  - Next Token Prediction: we take the embedding of the current token i and pass it through the linear layer to predict the next token.
  - Hand Position Generation (if applicable): if the predicted next token is a <HAND> token, we generate the predicted hand position coordinates by conditioning the CVAE on the current embedding. For the tokenization process of the next iteration, we combine the positional embedding of the predicted coordinates with this <HAND> embedding.
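To make the inference loop concrete, here is a minimal sketch of the procedure above. All names (`llm_step`, `lm_head`, `cvae_decode`, `pos_embed`, `HAND_TOKEN_ID`) are hypothetical placeholders rather than our actual implementation.

```python
import torch

HAND_TOKEN_ID = 32001  # hypothetical id of the special <HAND> token

def generate_with_hands(llm_step, lm_head, cvae_decode, pos_embed,
                        tokens, max_new_tokens=64):
    """Autoregressive decoding where <HAND> tokens also emit 2D hand positions.

    llm_step(tokens, extra_embeds) -> last-layer embedding of the final token
    lm_head(embed)                 -> vocabulary logits (the linear layer)
    cvae_decode(embed)             -> (x, y) hand position conditioned on embed
    pos_embed(xy)                  -> embedding of the predicted coordinates
    """
    hand_positions = []
    extra_embeds = {}  # sequence index -> embedding combined into that <HAND> token
    for _ in range(max_new_tokens):
        h = llm_step(tokens, extra_embeds)        # embedding of current token i
        next_id = int(lm_head(h).argmax(dim=-1))  # next-token prediction
        tokens = tokens + [next_id]
        if next_id == HAND_TOKEN_ID:
            xy = cvae_decode(h)                   # hand position for token i+1
            hand_positions.append(xy)
            # Combine the coordinate embedding with the <HAND> token embedding so
            # that the next iteration is conditioned on the predicted position.
            extra_embeds[len(tokens) - 1] = pos_embed(xy)
    return tokens, (torch.stack(hand_positions) if hand_positions else None)
```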
New results with a baseline doing video prediction followed by hand tracking. We thank the reviewer for suggesting this relevant baseline. We have now added comparisons to two video-generation baselines (below and in Table 2 of the revised paper). We use off-the-shelf video models from Kling and Luma that can do image+text-conditioned video generation: we condition them on the last frame of the observation context and use the same frame as conditioning for our approach. For the video models, after generating videos, we use the same hand-tracking framework to obtain the hand locations. Since video generation is computationally (and also monetarily) expensive, the comparisons are currently on a limited number of samples (100 evaluation trajectories each). For the revised final paper, we will make the number of evaluation samples much higher.
| Approach | Epic-Kitchen ADE | Epic-Kitchen FDE | Epic-Kitchen WDE | Ego4D ADE | Ego4D FDE | Ego4D WDE |
|---|---|---|---|---|---|---|
| Kling 1.5 | 0.311 | 0.358 | 0.197 | 0.277 | 0.411 | 0.184 |
| LumaLabs | 0.293 | 0.377 | 0.189 | 0.286 | | 0.135 |
| LLaVA-Pixel2Seq | 0.277 | 0.248 | 0.137 | 0.312 | 0.287 | 0.143 |
| LLaVA-Traj | 0.196 | 0.187 | 0.101 | 0.381 | 0.353 | 0.178 |
| HandsOnVLM | 0.197 | 0.165 | 0.094 | 0.229 | 0.195 | 0.100 |
| HandsOnVLM | 0.228 | | | | | |
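For reference, a minimal sketch of how such a video-generation baseline can be evaluated is given below; `generate_video` and `detect_hand_box` are hypothetical placeholders for the commercial video model API and the hand detector, and this is a simplification rather than our exact pipeline.

```python
import numpy as np

def video_gen_baseline_trajectory(generate_video, detect_hand_box,
                                  last_obs_frame, text_query, horizon):
    """Generate future frames from the last observed frame plus the language
    query, then track the hand frame by frame to form a trajectory."""
    frames = generate_video(image=last_obs_frame, prompt=text_query,
                            num_frames=horizon)
    traj = []
    for frame in frames:
        box = detect_hand_box(frame)  # (x1, y1, x2, y2) normalized to [0, 1], or None
        if box is None:
            # naive fallback: repeat the previous point (or use the image center)
            traj.append(traj[-1] if traj else (0.5, 0.5))
        else:
            cx = (box[0] + box[2]) / 2.0
            cy = (box[1] + box[3]) / 2.0
            traj.append((cx, cy))
    return np.asarray(traj)  # compared against ground truth via ADE/FDE/WDE
```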
Please do not hesitate to let us know if we can clarify anything else for a revised assessment of the paper.
Dear Reviewer,
The discussion phase is coming to an end soon, and we thus kindly request you to let us know if our response below has addressed your concerns. We will be happy to answer if there are additional issues/questions, and if not we would be grateful if you would consider a revised assessment of the review score to reflect that the issues have been addressed.
Best,
Authors of HandsOnVLM (submission 960)
Dear Reviewer Lkjo,
We hope you have had a chance to review our detailed response to your concerns. We would greatly appreciate your updated assessment.
Best,
Authors of HandsOnVLM (submission 960)
Dear Reviewer Lkjo,
As the discussion period nears its end, we wanted to kindly remind you once again that we have addressed your concerns in our responses above, including detailed explanations and additional experiments for video foundation models.
If you find our clarifications satisfactory, we would appreciate it if you could consider revising your rating of the paper. Should you have any further questions or require additional clarification, please don't hesitate to reach out—we’re happy to assist until the discussion period concludes.
Best regards,
Authors of HandsOnVLM (submission 960)
We thank all the reviewers for their detailed and thoughtful comments. We are glad the reviewers found the paper easy to follow and well-written (179J), the proposed egocentric vision tasks relevant to VR/AR and robotics (179J, Lkjo, 9WrP), and the experimental evaluations strong (9WrP).
We respond to the comments of individual reviewers below and summarize common clarifications and new results that we have added in order to address some of the questions. We have also revised the paper to incorporate these clarifications and results (please find our modifications in blue in the pdf).
Summary of revisions: We summarize changes to our manuscript below; these changes have also been highlighted (blue) in the new version of the paper.
- We have added comparisons with video prediction followed by hand-tracking baselines suggested by reviewer Lkjo.
- We have added new experimental comparisons on single-image-based future hand trajectory prediction as suggested by reviewer 9WrP.
- We have added a new Ego4D RBHP dataset for zero-shot evaluation on the non-kitchen environments.
- We identified a bug in our computation of the metrics (ADE, FDE, WDE) and have now revised the results of our method and the baselines. The bug in the metrics had made our approach look much weaker earlier, and after the resolution, we see that our method significantly outperforms almost all the baselines on all the metrics.
Again, we thank the reviewers for their constructive feedback. We believe we have addressed all the comments and questions, but are happy to address any further clarifications from the reviewers.
Thank you,
Authors of HandsOnVLM (submission 960)
This paper proposes a method for predicting future hand trajectories, based on language query inputs.
The reviewers agree that the setting of the task is interesting and think it may be beneficial for egocentric-related tasks in AR/VR and robotics.
The reviewers also raise several weaknesses, the key one being the lack of qualified baselines.
After the discussion period, the reviewers are mixed, ranging from borderline reject (5), borderline accept (6) and weak accept (8). Some points from the reviewers are not well-addressed, e.g. the point about the hand embedding.
Having read through the paper, author responses and reviews, the AC recommends to reject the paper at this time. The AC agrees with reviewers on the limited and naive baselines, though from a different perspective, as outlined below:
Currently, the hand prediction is based only on the scene and the observed trajectory, along with some cues from the text prompt. This does not guarantee that the claims about "reasoning" hold. Any boost in performance may come from the increased capacity of the VLMs, etc. A proper ablation would consider no informative instruction at all, then only the object name, then full instructions, etc. In a similar line of thought, the videos should be curated to have similar context but proceed in different ways (linked to different instructions), again to reinforce the claims on reasoning.
In addition, another purpose of the natural language query relates to affordance grounding and intentions. The authors instead shift the setting to the prediction of hand trajectories, thereby bypassing several major lines of work. The included baselines are instead either too trivial or not meaningful to compare against, e.g. raw video prediction / generation.
A second weakness, linked to the previous point, is clarity in the task setting: whether or not it is meaningful for downstream settings or tasks, and whether or not it should be in 2D or 3D. As one reviewer points out, it would be better to have hand poses; this is perhaps too nuanced for the current setting. Without hand poses, though, we are reduced to only bounding boxes and spatial regions to form the trajectory. Yet the concept of predicting entire future trajectories in 2D (vs. key landmarks in a scene, e.g. where the hand might be placed) is a bit puzzling: the trajectory is not physically grounded nor meaningful without 3D knowledge.
Additional Comments on Reviewer Discussion
One reviewer acknowledged the rebuttal and raised the score from marginally below (5) to marginally above (6).
Reject