PaperHub
7.6 / 10
Poster · 3 reviewers
Ratings: 5, 5, 4 (min 4, max 5, std dev 0.5)
Confidence: 4.0
Originality: 2.3
Quality: 3.0
Clarity: 3.0
Significance: 2.3
NeurIPS 2025

Toward Human Deictic Gesture Target Estimation

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

Proposed the task of human deictic gesture target estimation and the first model for this task.

Abstract

Keywords
Deictic Gesture · Gesture Target Estimation · Social Interaction · Social Artificial Intelligence

Reviews and Discussion

Review
Rating: 5

The authors propose a dataset called GestureTarget, which contains over 19,000 annotated instances of source pointing, reaching, showing, and giving (i.e., deictic) gestures with corresponding image masks for the target. The motivation for this dataset comes from the general unavailability of large deictic gesture datasets, which blocks important research in the domain of deictic gesture prediction. The authors also propose TransGesture, a Transformer-based model family (with cross attention between gaze and source gesture) that integrates both human gestures and gaze to infer gesture targets in a complex scene. With ablation studies, the authors demonstrate the added value of gaze in predicting the target of the deictic gesture. Through various pretrained visual backbones, fusion methodologies, ablation studies, and comparison with SOTA VLMs, the authors study their design choices and validate the need for a Cross Attention module (i.e., JCA fusion) as well as the proposed joint loss function.

Strengths and Weaknesses

Strengths

  • The proposed dataset contains valuable annotations of source gestures and target masks for furthering research in the Gesture Generation/Detection and Human Behavior modeling communities. The authors combine multiple social interaction datasets to build a larger dataset with over 19,000 annotations.
  • The authors show that SOTA MLLMs such as GPT-4o and Claude, and VLMs like GLaMM and OMG-LLaVA, perform poorly on the proposed dataset, hence motivating the need for training dedicated models for gesture prediction.
  • In Table 1, the authors validate that gaze is a relevant conditioning variable for making better target predictions. More specifically, cross-attention-based fusion (i.e., JCA) with gaze results in clear wins over the gesture-only baseline.

Weaknesses

  • Gestures are typically a temporal sequence rather than just one image. The dataset is constructed from other social interaction datasets that contain videos, so building the dataset in this paper from still images feels like a missed opportunity. Additionally, deictic gestures in single images may produce false positives (e.g., the person might not be pointing at someone, but performing a beat gesture).
  • The choice of "Accuracy of Gesture Existing" as a metric might not correctly represent the model's performance if the distribution of gesture existence vs. non-existence in the train/val dataset is not equal. It would benefit the community if the distribution of gesture existence in the dataset were available to the readers, and using metrics such as F1-score, or reporting both precision and recall, would add value.

Questions

  • What is the distribution of existence of gestures in the dataset?
  • Would it be possible to share Existence F1-Score?
  • How were "wrong-labeled frames during verification" (l116) detected? Were the annotators shown the video to make that determination?
  • Could the authors share more detailed information (e.g. guidelines, annotation webpage screenshots, inter-annotator correlation) on the annotation process?

Limitations

yes

Final Justification

Thank you to the authors for providing a detailed response to all the reviewers' questions.

  1. The positive/negative examples ratio is 3:5 as mentioned by the authors. That, coupled with the F1 scores, adds more validation to the metrics.
  2. Thank you for sharing the dataset collection methodology. I would recommend adding it to the final paper/appendix for the benefit of the readers.
  3. Thank you for sharing a human study as well. The best accuracy achieved by a human is close to 90%, which is close to the best model (DINOv2-Large (JCA Fusion)). While there is not a huge gap to cover in predicting existence vs. non-existence, there seems to be a performance gap in locating the gesture target in the image. Intuitively, this makes sense, as locating the target of the gesture is a harder problem, especially in 2D images, and could potentially benefit from a lot more context (e.g., the scene, video history before arriving at the pointing gesture, etc.).

Based on the additional information, I have updated my score.

Formatting Issues

no

Author Response

We sincerely thank you for your thoughtful and supportive review, and for recognizing the strengths and impact of our work. We greatly appreciate the suggestion to release detailed dataset annotation information and to develop video-based annotations in the future.

Weakness 1: Gestures are typically a temporal sequence, rather than just one image. The dataset is constructed using other Social Interaction datasets with videos. The construction of the dataset in this paper with images instead feels like a missed opportunity. Additionally, deictic gestures in single images may account for false positives (e.g. the person might not be pointing at someone, but performing beat gestures)

We thank the reviewer for raising this point. We agree that temporal information can be very useful for disambiguating gestures—for example, continuous motion can help differentiate a deictic point from a brief beat gesture. Adapting to video input is also our ultimate goal for gesture target estimation.

Our decision to build the GestureTarget dataset using still images was primarily motivated by annotation feasibility and data availability. Similar to the development of gaze-estimation tasks (which progressed from image-based datasets like GazeFollow to video datasets such as VideoGaze and ChildPlay), constructing a large-scale video dataset with frame-by-frame target masks would have been prohibitively costly. Instead, we curated a large number of representative frames containing clear gestures. Many of the source datasets we leveraged (e.g., VCR, SocialGesture) were originally image- or frame-based, with metadata such as bounding boxes available per frame, which we could repurpose.

We took several steps to mitigate the ambiguities that can arise in single images. During dataset creation, we manually filtered out ambiguous cases, such as conversational beat gestures or poses where it was unclear whether pointing was occurring. Our annotation guidelines explicitly defined positive deictic gesture instances as those where the target is clear to the subject, as opposed to gestures without a clear target (such as a generic hand wave or beat).

Nonetheless, we agree that temporal context and other modalities, such as depth or speech, would further enrich the task. We are actively working on a video sequence extension (VideoGestureTarget), where the bounding box and mask of each gesture target is annotated across short video clips, allowing models to utilize motion cues and temporal consistency.

Weakness 2: The choice of "Accuracy of Gesture Existing" as a metric might potentially not correctly represent the model's performance if the distribution of Gesture Existing and not-existing in the train/val dataset is not equal. It would benefit the community if the distribution of existence of gestures in the dataset is available to the readers and potentially using metrics such as F1-score or reporting both Precision and Recall would add value.

Thank you for the constructive feedback. In our dataset, the ratio of gesture existence is approximately 3:5 (exist:non-exist) in both the training and test sets. To provide a more comprehensive evaluation, we have computed the F1-score for the gesture existence detection task and updated our results table accordingly. The backbone model used for these results is DINOv2-Base.

| Fusion Method | Exist Acc (%) | Exist F1 (%) |
| :--- | :--- | :--- |
| Gesture-Only | 79.78 | 80.18 |
| Direct Fusion | 80.19 | 80.64 |
| Conv Fusion | 88.10 | 87.62 |
| JCA Fusion | 89.01 | 88.73 |
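
For reference, the sketch below shows how accuracy, precision, recall, and F1 can be reported together for the binary existence task; it is a minimal illustration, not our evaluation code, and the label arrays are placeholders reflecting the rough 3:5 positive/negative ratio.

```python
# Minimal sketch: reporting accuracy alongside precision/recall/F1 for the
# binary gesture-existence task. `y_true` and `y_pred` are hypothetical
# 0/1 arrays, not our actual predictions.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # roughly 3:5 positive/negative
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"Exist Acc: {acc:.2%}  Precision: {prec:.2%}  Recall: {rec:.2%}  F1: {f1:.2%}")
```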

Question 1: What is the distribution of existence of gestures in the dataset?

Please see our response to Weakness 2. The distribution of gesture existence is approximately 3:5 (exist:non-exist) in both the training and test sets.

Question 2: Would it be possible to share the existence F1-score?

Yes, we report the F1-score in the table for Weakness 2 above.

Question 3: How were "wrong-labeled frames during verification" (l116) detected? Were the annotators shown the video to make that determination?

Thank you for raising this important point. The 30% "wrong-labeled" rate refers to incorrect pseudo-annotated masks generated by the YOLOv11 instance segmentation model. Typical "wrong-labeled" situations are a mask covering only half of the target human, or two or more humans' mask regions overlapping. We manually reviewed all target mask pseudo-annotations, and if the mask region deviated from the intended target, that data was removed from the dataset.

For the deictic gesture key frame selection, annotators needed to refer to adjacent frames within the video sequence if the source data was a video.

Question 4: Could the authors share more detailed information (e.g., guidelines, annotation webpage screenshots, inter-annotator correlation) on the annotation process?

We appreciate the reviewer’s interest in our annotation process, as it allows us to highlight the efforts we took to ensure data quality:

Annotation Verification: Every annotation in GestureTarget underwent manual verification. Our process involves:

  • Key frame selection (identifying frames in the video that clearly contained deictic gestures),

  • An automatic pseudo-annotation step (using the YOLOv11 instance segmentation model to segment all human and object masks), and

  • Manual selection of gesture target masks by the author team.

During key frame selection, we employed a three-step methodology (annotation, verification, and quality assurance), measuring inter-annotator agreement using metrics such as Cohen's kappa. For ambiguous or edge cases, final decisions were made through a three-person consensus. During the manual mask selection stage, one of the authors selected the target in each image, confirmed the subject person’s bounding box and head position, and verified that the mask generated by YOLOv11 was correct. Any image with incorrect or ambiguous masks was discarded. In fact, approximately 30% of candidate frames were removed during verification due to issues such as overlapping or missing masks.
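
For concreteness, the sketch below illustrates the pseudo-annotation step described above. It is not our exact pipeline: the Ultralytics YOLO11 segmentation API is real, but the weight file name, the overlap threshold, and the flagging heuristic are assumptions for illustration.

```python
# Illustrative sketch of the pseudo-annotation step: run a YOLO11 segmentation
# model, collect per-instance masks, and flag frames where two person masks
# overlap heavily so they can be removed during manual verification.
from ultralytics import YOLO

model = YOLO("yolo11x-seg.pt")  # pretrained instance-segmentation weights (assumed file name)

def pseudo_annotate(image_path: str, overlap_thresh: float = 0.5):
    result = model(image_path)[0]
    if result.masks is None:
        return [], False
    masks = result.masks.data.bool()            # (N, H, W) boolean instance masks
    classes = result.boxes.cls.int().tolist()   # class index per instance
    person_ids = [i for i, c in enumerate(classes) if result.names[c] == "person"]

    # Flag the frame if any two person masks overlap beyond the (assumed) threshold,
    # one of the "wrong-labeled" situations mentioned in the response above.
    flagged = False
    for a in range(len(person_ids)):
        for b in range(a + 1, len(person_ids)):
            ma, mb = masks[person_ids[a]], masks[person_ids[b]]
            inter = (ma & mb).sum().item()
            union = (ma | mb).sum().item()
            if union > 0 and inter / union > overlap_thresh:
                flagged = True
    return masks, flagged
```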

We are committed to open research and plan to release the GestureTarget dataset on the Hugging Face Hub, accompanied by comprehensive documentation. This documentation will include a data card detailing dataset composition, annotation protocol, class definitions, and ethical considerations. We believe this level of transparency is important for any new dataset, and we thank the reviewer for highlighting this point. We will ensure that our final version clearly states that the dataset and its detailed documentation will be made publicly available (with a link or DOI).

Comment

Dear Reviewer yXKj,

Thank you again for your thoughtful review and encouraging remarks. We hope our rebuttal has fully addressed your concerns and clarified the points you raised. If any questions remain, please let us know—we are happy to provide additional details.

We would greatly appreciate your reconsideration of our work when finalizing your evaluation.

Best regards,
The Authors

Comment

Thank you to the authors for providing a detailed response to all the reviewers' questions.

  1. The positive/negative examples ratio is 3:5 as mentioned by the authors. That, coupled with the F1 scores, adds more validation to the metrics.
  2. Thank you for sharing the dataset collection methodology. I would recommend adding it to the final paper/appendix for the benefit of the readers.
  3. Thank you for sharing a human study as well. The best accuracy achieved by a human is close to 90%, which is close to the best model (DINOv2-Large (JCA Fusion)). While there is not a huge gap to cover in predicting existence vs. non-existence, there seems to be a performance gap in locating the gesture target in the image. Intuitively, this makes sense, as locating the target of the gesture is a harder problem, especially in 2D images, and could potentially benefit from a lot more context (e.g., the scene, video history before arriving at the pointing gesture, etc.).

Based on the additional information, I have updated my score.

Comment

Thank you very much for your positive feedback. We sincerely appreciate your careful review, constructive suggestions, and encouragement throughout the process. Your recognition of our work is truly motivating. Thank you again for your support and engagement.

Review
Rating: 5

This paper introduces a novel task and accompanying benchmark dataset, GestureTarget, designed for evaluating the semantic target estimation of social deictic gestures. To tackle this task, the authors propose TransGesture, a model that incorporates a pretrained gaze target estimation module to extract gaze direction cues. By integrating gaze information, TransGesture aims to infer the actual intended target of a gesture, rather than relying solely on gesture features. Experimental results validate that gaze cues significantly enhance the model’s ability to accurately identify the object or person being pointed at.

Strengths and Weaknesses

Strengths

  1. The task definition is clearly stated and well-motivated.
  2. The proposed method is targeted and appears effective in addressing the defined problem.
  3. The introduced dataset is well-aligned with the task and has the potential to benefit related tasks in the community, such as interactive generation.
  4. The experiments are comprehensive, and comparisons across multiple pretrained models provide useful insights for future work on visual feature extraction.
  5. The paper is generally well-written and easy to follow.

Weaknesses

  1. The paper lacks subjective evaluations (e.g., user studies) that could help assess whether the predicted targets align with human perception.
  2. While the authors emphasize the role of gaze in the model design, there is no clear ablation study isolating the contribution of gaze. Including such experiments would help substantiate its importance.
  3. Inference efficiency is not discussed. Highlighting runtime would help assess its applicability in real-world or interactive settings.
  4. Figure 3 is not referenced in the main text, which may confuse readers.

Questions

See "Weaknesses" in the review.

Limitations

yes

Final Justification

I have carefully read the authors' rebuttal. Most of my concerns have been addressed by the authors. Therefore, I would like to raise my rating.

Formatting Issues

None

Author Response

We thank you for recognizing the value of our task definition and our proposed model. Thank you for the thorough review and helpful feedback. We address each concern below, and we outline the improvements we will make in the revision in response to your comments.

1. The paper lacks subjective evaluations (e.g., user studies) that could help assess whether the predicted targets align with human perception.

| Method | Exist Acc | Target IoU |
| :--- | :--- | :--- |
| GPT-4o | 70.21 | - |
| Claude 3.7 | 68.48 | - |
| OMG-LLaVA | - | 3.26 |
| DINOv2-Base (Gesture-only) | 79.19 | 52.14 |
| DINOv2-Base (JCA Fusion) | 89.01 | 58.48 |
| DINOv2-Large (JCA Fusion) | 89.85 | 58.26 |
| Human (Avg) | 87.00 | 66.40 |
| Human (Max) | 90.00 | 80.00 |

Thank you for your valuable suggestion. We are pleased to report that we have conducted a human evaluation study and can now include these results. Due to the time constraints of the rebuttal period, we recruited 10 volunteer participants and asked them to perform two tasks on a randomly selected subset of 50 images from our test set: (1) Gesture existence detection (binary classification): determining whether the subject in the image is performing a deictic gesture; and (2) Target identification: if a gesture is present, identifying the intended target in the image by clicking on the corresponding object or person mask. Please note that this human study is different from the annotation process, because the human study participants could not see the whole source video or additional image annotation information. We plan to further expand the size of the test population in the final version of the paper.

For the gesture existence detection task, the average human performance was 87.00%, with a maximum of 90.00%. For the target estimation task, we measured performance by the IoU between the region selected by each participant and the ground-truth target mask; the average human Target IoU was 66.40, with a maximum of 80.00. This human study provides an upper bound on achievable performance.
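
For clarity, the Target IoU above is the standard mask intersection-over-union; a minimal sketch (with hypothetical boolean mask arrays) is:

```python
# Minimal sketch of the mask IoU used to score a selected target region
# against the ground-truth target mask (arrays here are hypothetical).
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)
```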

We observe that our proposed model has reached human-level performance in gesture existence detection. However, a gap remains in target estimation accuracy. We will incorporate these findings into the paper to highlight that, while there is still room for improvement—particularly in target localization—our approach is closing the gap toward human-level agreement.

2. While the authors emphasize the role of gaze in the model design, there is no clear ablation study isolating the contribution of gaze. Including such experiments would help substantiate its importance.

We apologize for the confusion regarding the contribution of gaze cues. In the original Table 1 of the paper, we included an ablation isolating the effect of gaze within the main table, rather than highlighting it as a separate ablation table. Specifically, we compare our full model (with gaze input and joint cross-attention fusion) against a variant that uses gesture cues only (no gaze).

| Backbone | Gaze Fusion | Exist Accuracy | Target IoU |
| :--- | :--- | :--- | :--- |
| CLIP-Large | ✗ (Gesture-only) | 79.21 | 54.37 |
| CLIP-Large | ✓ (JCA) | 87.66 | 56.65 |
| SigLIP2-Base | ✗ (Gesture-only) | 76.70 | 55.34 |
| SigLIP2-Base | ✓ (JCA) | 85.13 | 56.13 |
| DINOv2-Base | ✗ (Gesture-only) | 79.19 | 52.14 |
| DINOv2-Base | ✓ (JCA) | 89.01 | 58.48 |

The ablation shows a clear benefit from incorporating gaze: for example, on the CLIP-Large encoder, the gesture-only model achieves 79.2% existence accuracy and 54.4 IoU, whereas adding gaze cues via JCA Fusion raises performance to 87.7% and 56.7 IoU. Thank you for noting this omission; we will ensure the gaze contribution ablation is clearly presented in the final version of our paper.
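
For context, the sketch below shows one plausible form of cross-attention fusion between gesture and gaze token features; it is an illustrative PyTorch module with assumed token shapes and layer sizes, not our exact JCA implementation.

```python
# Illustrative cross-attention fusion: gesture tokens (queries) attend to
# gaze tokens (keys/values). Shapes and hyper-parameters are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, gesture_tokens, gaze_tokens):
        # (B, N, D) gesture tokens attend to (B, M, D) gaze tokens -> (B, N, D)
        attended, _ = self.attn(gesture_tokens, gaze_tokens, gaze_tokens)
        fused = self.norm1(gesture_tokens + attended)
        return self.norm2(fused + self.ffn(fused))

# Example with assumed shapes: patch tokens from a frozen ViT backbone.
fusion = CrossAttentionFusion(dim=768)
out = fusion(torch.randn(2, 256, 768), torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 256, 768])
```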

3. Inference efficiency is not discussed. Highlighting runtime would help assess its applicability in real-world or interactive settings.

We thank the reviewer for the suggestion. Our model uses a frozen visual encoder (e.g., CLIP or DINO backbone) followed by a lightweight transformer fusion module, so the inference computational cost is largely in the backbone forward pass. We have measured the inference runtime of our method and compared it across different backbone architectures to address this concern. The table below summarizes the average runtime per image (single 518x518 image input) for various model variants, measured on a single NVIDIA H100 GPU:

| Visual Encoder | Fusion Method | Average runtime per image (s) |
| :--- | :--- | :--- |
| SigLIP2-Base | Gesture-Only | 0.0094 |
| SigLIP2-Base | Direct Fusion | 0.0106 |
| SigLIP2-Base | Conv Fusion | 0.0134 |
| SigLIP2-Base | JCA Fusion | 0.0133 |
| DINOv2-Base | Gesture-Only | 0.0108 |
| DINOv2-Base | Direct Fusion | 0.0111 |
| DINOv2-Base | Conv Fusion | 0.0131 |
| DINOv2-Base | JCA Fusion | 0.0110 |

As shown, incorporating our full JCA fusion adds only a minimal overhead of 0.001 to 0.003 seconds compared to the gesture-only baseline. In practice, the gaze branch computations—primarily a few additional transformer layers for cross-attention—are lightweight relative to the overall cost of the Transformer-based visual encoder. We will include this runtime analysis in the final version's appendix to assure readers of the model’s efficiency. Overall, our method’s inference speed is comparable to standard ViT-based image understanding models, and the added gaze branch provides clear benefits at a negligible computational cost.
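
For transparency, the sketch below shows one way such per-image latencies can be measured on a GPU (warm-up iterations plus CUDA synchronization); the `model` and `image` objects are placeholders, and this is not our exact benchmarking script.

```python
# Sketch of measuring average per-image inference latency on a GPU.
# `model` stands in for a TransGesture variant and `image` for a
# preprocessed 518x518 input tensor already on the GPU.
import time
import torch

@torch.no_grad()
def benchmark(model, image, warmup: int = 10, iters: int = 100) -> float:
    model.eval()
    for _ in range(warmup):       # warm-up to exclude lazy initialization costs
        model(image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()      # wait for all queued GPU work before timing
    return (time.perf_counter() - start) / iters
```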

4. Figure 3 is not referenced in the main text, which may confuse readers.

Thank you for pointing out this missing reference; we will add it to Section 4.3 (Baselines).

Comment

I thank the authors for their rebuttal. My previous concerns have been addressed. I would like to raise my rating.

Comment

We are deeply grateful for your positive feedback and thoughtful review. Your constructive suggestions and encouragement have been invaluable, and your recognition of our work is truly motivating. Thank you again for your support and engagement.

Comment

Dear Reviewer iuYx,

Thank you again for your thoughtful review and encouraging remarks. We hope our rebuttal has fully addressed your concerns and clarified the points you raised. If any questions remain, please let us know—we are happy to provide additional details.

We would greatly appreciate your reconsideration of our work when finalizing your evaluation.

Best regards,
The Authors

Review
Rating: 4

The study presents a new dataset, GestureTarget, built by re-purposing four existing datasets related to gestures and social interaction for gesture target estimation. The authors then introduce TransGesture, a family of transformer models that fuse gesture and gaze cues. This fusion showed improvement over the gesture-only baseline.

Strengths and Weaknesses

Strengths:

  • I find the task itself interesting (as well as the dataset). This type of information can guide models to calibrate to human attention and interactions.

  • The results make sense (e.g., gaze gives some prediction improvements), and the baselines, as well as the backbone variety, are good. The ablations for the current study are complete, though I would have liked to see a kinematic model (i.e., one based on body models such as MediaPipe or SMPL-X).

Weaknesses:

  • I find the choice of using RGB-based representations (especially for gestures) unintuitive. I believe that a kinematic module might be a better choice, either alone or in addition to the raw RGB data.

  • The evaluation is not cross-validated properly. The fixed 80 / 20 split is convenient but risks optimistic results if adjacent video frames land in both sets. K-fold cross-validation or at least a video-level split would strengthen the evidence.

  • No human-performance benchmark is reported. An annotation study (e.g. randomly sampled frames) would establish how close the best model is to human agreement.

Questions

  • What are the practical aspects of this work?

  • What is the difference between Y_exist and Y_mask? Is the Y_exist just a one-dimensional binary value? This is not clear.

  • Is there a human baseline for the target estimation performance? If so, what is the upper bound?

Limitations

  • Human-based evaluation (baseline)
  • A kinematic model that captures gestures effectively

Final Justification

I am happy with the authors' rebuttal. Therefore, I have increased my rating.

Formatting Issues

I did not notice any formatting issues

Author Response

We thank Reviewer BGdf for recognizing the value of our task definition and dataset, and for providing thoughtful, constructive feedback. Several of the stated concerns appear to arise from misunderstandings; we clarify these points and address each weakness and question in detail below.

Weakness 1: Use of RGB vs. Kinematic/Keypoint Inputs

We appreciate the reviewer’s suggestion to explore kinematic-based inputs. Our choice of using raw RGB images was driven by the nature of in-the-wild data and the limitations of kinematic feature (e.g. keypoint) extraction in such settings. In our dataset, images come from diverse, uncontrolled sources (movies, social interactions, etc.), where obtaining accurate 2D/3D skeletons or keypoints for every person is challenging – automated keypoint detectors can fail under occlusions, extreme poses, or low lighting. Relying solely on keypoints could thus introduce significant noise or miss subtle contextual cues (e.g. finger orientation or held-object context) that the RGB modality provides.

That said, we agree that incorporating explicit pose representations or specialized human keypoint-centric features is a promising direction. To quantify the potential benefit of an explicit pose-centric backbone, we replaced our DINOv2-Large encoder with Meta's Sapiens-0.3B, a model pretrained specifically for human pose understanding [1], while keeping all other components and hyper-parameters unchanged. We chose Meta's Sapiens-0.3B because its model size is the same as a ViT-Large, making for a fair comparison.

| Visual Encoder | #Params | Exist Acc | Target IoU |
| :--- | :--- | :--- | :--- |
| DINOv2-Large | 304M | 89.85 | 58.26 |
| Sapiens-0.3B | 304M | 70.13 | 19.80 |

Sapiens performs markedly worse than DINOv2-Large. We believe there are two main factors:

  • Pre-training bias. Sapiens is trained on Humans-300M, which provides single-person key-point supervision but no labels for inter-person pointing relationships. Consequently, it learns body-pose geometry but not the relational cues that are essential for identifying a deictic target.

  • Frozen-backbone protocol. In our ablation we kept the visual encoder frozen, so Sapiens could not adapt its single-person priors to our multi-person task, further widening the gap.

These findings reinforce our hypothesis that full RGB imagery—capturing both fine-grained pose details and rich scene context—should remain the dominant signal for gesture-target estimation. A promising next step is to add a lightweight kinematic/key-point branch or decoder that complements, rather than replaces, the RGB stream, allowing the model to exploit pose cues where key-point extraction is reliable while still relying on RGB for holistic context. Thank you again for the insightful suggestion!

Reference:
[1] Khirodkar, Rawal, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito. "Sapiens: Foundation for human vision models." In European Conference on Computer Vision, pp. 206-228. Cham: Springer Nature Switzerland, 2024.

Weakness 2: Dataset Splitting and Cross-Validation

We understand the reviewer’s concern about potential overlap or leakage in the dataset split. To clarify, we performed a source-level split for training and testing. This means that all images originating from the same video clip or the same source dataset are placed entirely in either the train or test set, but not both. This ensures that near-duplicate frames or highly similar contexts from a single source cannot appear in both sets, preventing the model from simply memorizing scene-specific details. In the initial submission we used an 80/20 split on this source-stratified basis.

To further verify the robustness of our results, we have now conducted a 5-fold cross-validation on the GestureTarget dataset (maintaining source-consistent splits in each fold). The performance was highly consistent across folds. For instance, the gesture existence accuracy across the 5 folds ranged from ≈ 87% to 89%, and the mean IoU for target masks varied from ≈ 56% to 57%. The standard deviation was low, indicating no single split was an outlier. These results give us confidence that our reported performance is stable and not an artifact of a particular data split. We will include a brief summary of the cross-validation experiment in the revised paper to emphasize the reliability of our evaluation.

| Split | Exist Acc (%) | IoU (%) |
| :--- | :--- | :--- |
| Previous Test Set | 89.01 | 58.48 |
| 5-fold cross-validation | 88.75 ± 0.25 | 57.42 ± 0.45 |
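
To illustrate the source-consistent splitting described above, the sketch below uses scikit-learn's GroupKFold with hypothetical frame and clip identifiers; it is an illustration of the idea, not our exact data-loading code.

```python
# Sketch of a source-consistent K-fold split: all frames from the same
# source clip share a group id, so they never straddle train and test.
# `frame_ids` and `source_ids` are hypothetical identifiers.
from sklearn.model_selection import GroupKFold

frame_ids = [f"frame_{i}" for i in range(10)]
source_ids = ["clipA", "clipA", "clipB", "clipB", "clipC",
              "clipC", "clipD", "clipD", "clipE", "clipE"]

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(frame_ids, groups=source_ids)):
    test_sources = {source_ids[i] for i in test_idx}
    print(f"fold {fold}: test sources = {sorted(test_sources)}")
```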

Weakness 3: Human Study - Gesture Existence and Target Identification

| Method | Exist Acc | Target IoU |
| :--- | :--- | :--- |
| GPT-4o | 70.21 | - |
| Claude 3.7 | 68.48 | - |
| OMG-LLaVA | - | 3.26 |
| DINOv2-Base (Gesture-only) | 79.19 | 52.14 |
| DINOv2-Base (JCA Fusion) | 89.01 | 58.48 |
| DINOv2-Large (JCA Fusion) | 89.85 | 58.26 |
| Human (Avg) | 87.00 | 66.40 |
| Human (Max) | 90.00 | 80.00 |

Thank you for your valuable suggestion. We are pleased to report that we have conducted a human evaluation study and can now include these results. Due to the time constraints of the rebuttal period, we recruited 10 volunteer participants and asked them to perform two tasks on a randomly selected subset of 50 images from our test set: (1) Gesture existence detection (binary classification): determining whether the subject in the image is performing a deictic gesture; and (2) Target identification: if a gesture is present, identifying the intended target in the image by clicking on the corresponding object or person mask. Please note that this human study is different from the annotation process, because the human study participants could not see the whole source video or additional image annotation information. We plan to further expand the size of the test population in the final version of the paper.

For the gesture existence detection task, the average human performance was 87.00%, with a maximum of 90.00%. For the target estimation task, we measured performance by the IoU between the region selected by each participant and the ground-truth target mask; the average human Target IoU was 66.40, with a maximum of 80.00. This human study provides an upper bound on achievable performance.

We observe that our proposed model has reached human-level performance in gesture existence detection. However, a gap remains in target estimation accuracy. We will incorporate these findings into the paper to highlight that, while there is still room for improvement—particularly in target localization—our approach is closing the gap toward human-level agreement.

Question 1: Practical aspects of this work

The practical applications of deictic gesture target estimation are diverse and impactful, spanning multiple domains:

  • Non-verbal communication analysis in healthcare: Deictic gesture target estimation can greatly enhance our understanding of non-verbal cues in clinical settings, particularly for early autism diagnosis [2]. Individuals with autism spectrum disorder often display differences in both the use and interpretation of gestures. An accurate model for detecting and interpreting these gestures could assist clinicians and therapists in identifying subtle communication deficits in young children, enabling earlier and more effective interventions.

  • Human–AI interaction: Deictic gestures play a crucial role in natural communication. Improving machine understanding of these gestures will facilitate smoother, more intuitive human–AI interactions [3]. This is particularly valuable for assistive robotics—where robots must interpret human intent to provide effective support—and for virtual assistants operating in social or professional contexts.

Furthermore, our paper observes that the co-occurrence of gaze and gesture can enhance gesture prediction. These non-verbal features remain underexplored in current Multimodal LLMs and VLMs. Our proposed model could be incorporated as a specialized tool in the toolkit of LLM agents. (We present a developmental experiment on this in the supplementary material.)

Reference:
[2] Özçalışkan, Şeyda, Lauren B. Adamson, and Nevena Dimitrova. "Early deictic but not other gestures predict later vocabulary in both typical development and autism." Autism 20, no. 6 (2016): 754-763.
[3] Matuszek, Cynthia, Liefeng Bo, Luke Zettlemoyer, and Dieter Fox. "Learning from unscripted deictic gesture and language for human-robot interactions." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28, no. 1. 2014.

Question 2: What is the difference between Y_exist and Y_mask? Is the Y_exist just a one-dimensional binary value?

Thanks for your question. Y_exist is a binary value in {0, 1}, so we use binary cross-entropy (BCE) loss for this prediction. Y_mask is a binary mask used for semantic segmentation.
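
For illustration, one plausible form of a joint objective over Y_exist and Y_mask is sketched below; the Dice term and the weighting factor are assumptions for this sketch, not necessarily the exact loss used in the paper.

```python
# Illustrative joint loss: BCE on the binary existence label plus a
# BCE + Dice term on the target mask. The Dice term and lambda weighting
# are assumptions; the paper's exact joint loss may differ.
import torch
import torch.nn.functional as F

def joint_loss(exist_logit, mask_logit, y_exist, y_mask, lam: float = 1.0):
    # exist_logit, y_exist: (B,) float tensors; mask_logit, y_mask: (B, H, W)
    exist_loss = F.binary_cross_entropy_with_logits(exist_logit, y_exist)
    mask_bce = F.binary_cross_entropy_with_logits(mask_logit, y_mask)
    prob = torch.sigmoid(mask_logit)
    inter = (prob * y_mask).sum(dim=(-2, -1))
    dice = 1 - (2 * inter + 1) / (prob.sum(dim=(-2, -1)) + y_mask.sum(dim=(-2, -1)) + 1)
    return exist_loss + lam * (mask_bce + dice.mean())
```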

Question 3: Is there a human baseline for the target estimation performance? If so, what is the upper bound?

Please see our response to Weakness 3: Human Study – Gesture Existence and Target Identification for details.

Comment

I have read your response, and I think the added analyses are valuable. The Sapiens model is a good replacement for DINO. It might not have worked due to the limited rebuttal period, but I still think that with more fine-tuning and control of the model, its performance could increase quite a lot.

I also thank you for providing the human baseline. Please include all of this information in the updated (final) version of the paper.

I have increased my ratings to borderline accept. Good luck.

Comment

Thank you very much for your thoughtful review and for considering our responses and additional analyses. We sincerely appreciate the time and effort you have invested in engaging with our submission and discussing our work. We are grateful that you found our clarifications and new experiments helpful. We will include detailed experiments further comparing different scales of Sapiens as the visual encoder in the final version. Thank you again for your constructive feedback and for updating your rating.

Comment

Dear Reviewer BGdf,

Thank you again for your thoughtful review and encouraging remarks. We hope our rebuttal has fully addressed your concerns and clarified the points you raised. If any questions remain, please let us know—we are happy to provide additional details.

We would greatly appreciate your reconsideration of our work when finalizing your evaluation.

Best regards,
The Authors

Final Decision

The reviewers are unanimously positive about the paper. The paper introduces an underexplored problem and provides a valuable dataset to the community. The core methodology is sound, and the experimental results are comprehensive and convincing. The rebuttal addresses the reviewers' concerns. The ratings are two accepts and a borderline accept. Therefore, our decision is to accept the paper.