Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
Hyperbolic contrastive regularisation improves skeleton-based sign language translation beyond RGB baselines.
Abstract
Reviews and Discussion
The paper proposes a novel method, Geo-Sign, that incorporates the properties of hyperbolic geometry for sign language translation. They project the hands, pose, and face skeletal features from the ST-GCN model into a Poincaré ball and align them with text embeddings via a geometric contrastive loss. They evaluate their method on the CSL-Daily dataset and compare it with SOTA results.
Strengths and Weaknesses
Strengths
- Introducing the idea of hyperbolic geometry in the SLT domain is interesting.
- Preserving privacy, although less motivated, is considered a positive point.
- The paper is well motivated.
Weaknesses
- The results are only reported on CSL-Daily using the ST-GCN weights from Uni-Sign, which were pre-trained on CSL-News. This limited evaluation makes it difficult to assess the generalizability of the method to other datasets. Therefore, to strengthen the paper, I'd have considered/expected the following:
(1) Experiment on at least another dataset, such as Phoenix-2014T, or How2Sign.
(2) More ablation studies to get a better understanding of how the geometric properties are working
Questions
In Figure 2, the explanation accompanying the analysis may not be entirely accurate. While the figure on the right suggests that the hyperbolic token model focuses on the relevant parts of the input, I believe that for the sign corresponding to the character 面条, the model should attend to both the hands and the face. I am not a native Chinese Sign Language interpreter, so my interpretation may not be fully accurate. However, based on Table 1 in the paper, it seems that even with the Euclidean token model, the model was able to produce fairly good results, despite the absence of a structured embedding space as it is in the hyperbolic token case.
| Character | Body Part Involved | Notes |
|---|---|---|
| 今 | Right hand, face/chest | One-hand motion near chest |
| 天 | Right hand, forehead/head | Symbolizes sky/day |
| 我 | Right hand, chest (body) | Points to self |
| 想 | Right hand, chest/face | Pulling/holding motion |
| 吃 | Right hand, mouth (face) | Eating gesture |
| 面条 | Both hands, chest/face | Noodle-eating motion |
Limitations
yes
Final Justification
final score: The paper introduces a new direction for sign language translation by incorporating the properties of hyperbolic geometry. There were two major concerns about the experiments: 1) experiments only on CSL-Daily, 2) lack of justification for the geometric properties. After the discussion, I am convinced that the method can generalize to other datasets and even to ISLR, another sign language task. The authors provided clear comparisons in which their method outperformed the SOTA Uni-Sign paper. Therefore, I increase my score to 4: Borderline accept.
Formatting Issues
In the abstract, line 2, it would be better to change "sign-language" to "sign language".
We thank the reviewer for their constructive comments. We have conducted several new experiments since our submission to directly address their concerns.
1. New Experiments on Other Datasets (Weakness 1)
To address the primary concern about the limited evaluation, we are pleased to report new results on two widely-used ASL benchmarks: How2Sign (for SLT) and WLASL2000 (for ISLR).
How2Sign (SLT):
Our method shows a clear improvement over the baseline, confirming its benefits generalise to a new language and a more challenging translation task.
| Method | BLEU-4 | ROUGE-L |
|---|---|---|
| Uni-Sign (Pose) | 14.5 | 34.3 |
| Geo-Sign (Ours) | 15.1 | 35.4 |
WLASL2000 (ISLR):
On the recognition task, our pose-only method surpasses both the pose-only and the stronger multi-modal (Pose+RGB) baselines.
| Method | Pose | RGB | WLASL2000 (P-I) | WLASL2000 (P-C) |
|---|---|---|---|---|
| Uni-Sign (Pose) | ✔ | | 63.13 | 60.90 |
| Uni-Sign (Pose+RGB) | ✔ | ✔ | 63.52 | 61.32 |
| Geo-Sign (Ours) | ✔ | | 63.64 | 61.89 |
2. Deeper Ablation on Geometric Properties (Weakness 2)
To provide a better understanding of how the geometric properties are working, we conducted a new ablation study on pose noise robustness. We added Gaussian noise to the final pose representations and measured the impact on performance. The results show that our hyperbolic model is significantly more robust than the baseline, demonstrating a tangible benefit of the learned geometric structure.
| Noise Level (σ) | Geo-Sign (BLEU-4) | Baseline (BLEU-4) | Geo-Sign Drop (%) | Baseline Drop (%) |
|---|---|---|---|---|
| 0.00 (Original) | 27.42 | 26.25 | - | - |
| 0.01 | 26.30 | 24.14 | -4.1% | -8.0% |
| 0.02 | 24.60 | 21.50 | -10.3% | -18.1% |
| 0.03 | 19.07 | 14.40 | -30.4% | -45.2% |
| 0.04 | 11.63 | 7.20 | -57.6% | -72.5% |
| 0.05 | 5.98 | 3.01 | -78.2% | -88.5% |
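To make the table above easier to check, here is a minimal sketch of the perturbation and of how the "Drop (%)" columns follow from the BLEU-4 scores. It assumes the noise level is the standard deviation of zero-mean Gaussian noise added to the pose representations; the function names and the toy feature array are hypothetical and are not taken from our codebase.

```python
import numpy as np

def add_gaussian_noise(features, sigma, rng):
    """Return a copy of `features` with zero-mean Gaussian noise of std `sigma` added."""
    return features + rng.normal(0.0, sigma, size=features.shape)

def relative_drop(original, degraded):
    """Percentage change of a score relative to its noise-free value (negative = worse)."""
    return 100.0 * (degraded - original) / original

rng = np.random.default_rng(0)
pose_features = rng.standard_normal((4, 16))  # hypothetical (frames, feature_dim) array
noisy = add_gaussian_noise(pose_features, sigma=0.01, rng=rng)

# BLEU-4 values copied from the 0.00 and 0.01 rows of the table.
geo_drop = relative_drop(27.42, 26.30)
base_drop = relative_drop(26.25, 24.14)
print(f"{geo_drop:.1f}% vs {base_drop:.1f}%")  # prints -4.1% vs -8.0%
```

The same `relative_drop` call applied to the other rows gives the remaining percentages up to rounding.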
We also direct the reviewer to the supplementary material where we include several key analyses on geometric structure. Specifically, Figure 3 (Supp) offers a direct visualization of the learned representation space, demonstrating a clear hierarchical structure where fine-grained hand/face features are pushed to the high-curvature periphery while coarse body features remain central. This visual structure is supported quantitatively by Figure 2 (Supp), which plots the geodesic distances of embeddings from the origin, showing that the model consistently learns to place hand features further into the manifold to leverage its greater representational capacity.
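For readers who want to reproduce the kind of measurement plotted in Figure 2 (Supp), the geodesic distance of a point from the origin of a Poincaré ball with curvature $-c$ has the closed form $d(0, x) = \frac{2}{\sqrt{c}} \operatorname{artanh}(\sqrt{c}\,\lVert x \rVert)$. The sketch below is illustrative only: the two embeddings are made-up values, and the function name is hypothetical.

```python
import numpy as np

def dist_from_origin(x, c=1.0):
    """Geodesic distance from the origin of a Poincaré ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(x, axis=-1)
    # Keep the argument of arctanh strictly below 1 for numerical safety.
    arg = np.clip(sqrt_c * norm, 0.0, 1.0 - 1e-7)
    return 2.0 / sqrt_c * np.arctanh(arg)

body = np.array([0.10, 0.05])  # hypothetical embedding near the centre
hand = np.array([0.80, 0.45])  # hypothetical embedding near the boundary
print(dist_from_origin(body), dist_from_origin(hand))
```

Because the distance grows without bound as the Euclidean norm approaches $1/\sqrt{c}$, points near the boundary sit much "further into the manifold" than their Euclidean norm alone suggests.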
3. On the Interpretation of the Attention Heatmap and Euclidean Performance (Question 1)
The reviewer's intuition is correct: the sign for “吃面条” (chī miàntiáo, 'eating noodles') involves coordinated motion of the dominant hand (mimicking chopsticks) and the face/mouth area. The Euclidean model's heatmap, however, shows diffuse attention. We presume that it still provides an improvement over the baseline by learning a "brute-force" correlation rather than identifying the most salient features. This would also explain why the Euclidean Token version actually performs worse than the Pooled version on ROUGE: the fine-grained alignment required by the Token method can be detrimental when the feature space is not sufficiently discriminative. The attention method, however, also introduces a small number of additional trainable parameters (around 1M), and these may help to regularise the model, improving the other n-gram metrics. We will add a discussion of this result to the paper and include additional attention heatmaps in the supplementary material for review.
Finally, we will correct the "sign-language" typo in the abstract in the final version.
We hope these new results and clarifications address all the concerns and we are happy to answer any further questions or implement any suggestions the reviewer may have. Thanks again for the review and we look forward to the discussion period.
Thank you for addressing my questions and concerns in your rebuttal. I believe the experiments have strengthened the paper. Although I think Geo-Sign is on par with Uni-Sign on ISLR, I appreciate that Geo-Sign introduces a new direction by using hyperbolic geometry for sign language translation. Therefore, I increase my score to 4.
We thank the reviewer for the updated score and the positive feedback on our new experiments. To clarify the WLASL2000 result: our pose-only model surpassed the stronger multi-modal (Pose+RGB) baseline of Uni-Sign. This finding is significant for our work on privacy-aware SLT. It suggests that strong geometric priors can be a powerful alternative to fusing multi-modal data, achieving competitive performance without the need for RGB input. We appreciate the constructive discussion and will ensure this point is clearly articulated in the final version.
The paper presents Geo-Sign, a novel approach for Sign Language Translation (SLT). Geo-Sign uses hyperbolic geometry to model the hierarchical structure of sign language kinematics. They do so by projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model. They further present two hyperbolic contrastive alignment strategies and compare them.
Strengths and Weaknesses
Strengths:
- The paper presents a novel approach for solving the important and interesting problem of SLT.
- The paper explores two strategies for contrastive alignment in the hyperbolic representation space.
Weaknesses:
- The framing of the method and its motivations are not very clear. In particular, there is no high-level overview of the method before diving into the details, and the two strategies are introduced without explaining why both are presented together, whether they are related, or when each would be used.
- The method is only applied over one dataset, hence the results may not be immediately applicable to other languages or data.
- There are some inaccuracies regarding prior art. For example, CV-SLT [65] is treated as a "gloss-based method", although their paper specifically states that "In this work, no additional gloss supervision was introduced during the SLT training.".
Questions
- Did you try to apply your method over other common SLT datasets, such as How2Sign or PHOENIX14T?
- I believe more intuition and an overview of the method before diving into details would improve the clarity of the method and paper. For example, what is the connection between the two alignment strategies, and why are both tested? In general, how are all the parts of the method combined, and what are the steps?
- Please fix the inaccuracies mentioned above. Although CV-SLT are also gloss-free and achieve higher BLEU scores, I believe your method is still interesting and I don't think it is a reason for rejection.
Limitations
yes
Final Justification
The authors addressed most of my concerns by performing additional experiments on How2Sign and saying they will add clarifications to the method and method motivations. Therefore, I lean toward accepting the paper.
Formatting Issues
No
We thank the reviewer for their positive feedback on our work's novelty and for the constructive comments.
1. On Clarity, Framing, and Motivation (Weakness 1, Question 2)
We agree that a clearer high-level overview would improve the paper's readability. In the revised version, we will add a new introductory paragraph to Section 3 (Methodology) that provides a step-by-step walkthrough of the Geo-Sign framework before delving into the geometric details.
To clarify the motivation for our two alignment strategies: we designed them as a deliberate comparative study. The "Pooled" method tests the baseline hypothesis of aligning global, sentence-level semantics. The "Token" method tests a more complex hypothesis, i.e., that a fine-grained, attention-based alignment between individual pose parts and text tokens would yield superior results. We will explicitly state this motivation in Section 3.3. Furthermore, in the supplementary material we provide a detailed analysis showing that the pooled method incurs less latency at a small cost in performance. We can incorporate this into the main paper to make the motivation for both approaches clearer and improve the flow of this section.
2. On Generalization and Evaluation on Other Datasets (Weakness 2, Question 1)
To address your concern about generalisation, which other reviewers also raised, we have, since our submission, evaluated our method on two additional, widely-used ASL benchmarks: How2Sign (for SLT) and WLASL2000 (for ISLR).
How2Sign (SLT): Our method demonstrates an improvement over the Uni-Sign pose-only baseline, confirming its benefits generalize to American Sign Language.
| Method | BLEU-4 | ROUGE-L |
|---|---|---|
| Uni-Sign (Pose) | 14.5 | 34.3 |
| Geo-Sign (Ours) | 15.1 | 35.4 |
WLASL2000 (ISLR): On the recognition task, our pose-only method not only improves upon the pose-only baseline but also surpasses the stronger multi-modal (Pose+RGB) version of Uni-Sign.
| Method | Pose | RGB | WLASL2000 (P-I) | WLASL2000 (P-C) |
|---|---|---|---|---|
| Uni-Sign (Pose) | ✔ | | 63.13 | 60.90 |
| Uni-Sign (Pose+RGB) | ✔ | ✔ | 63.52 | 61.32 |
| Geo-Sign (Ours) | ✔ | | 63.64 | 61.89 |
These new results provide strong evidence that our approach is not limited to a single dataset and generalises effectively across different languages and tasks.
3. On the Classification of Prior Art (CV-SLT) (Weakness 3, Question 3)
This is an interesting discussion point. We would like to note that in the official repository and methodology it can be seen that their code uses visual features extracted from a model pre-trained on a sign-to-gloss (s2g) task (specifically, the pre-trained embeddings from MMTLB, which are themselves from a gloss-prediction network).
While the CV-SLT authors state they do not use "additional gloss supervision during the SLT training," we could argue that their method is, by definition, gloss-based. The visual features are not learned end-to-end from video-text pairs; they rely on prior gloss-level supervision. This differs from truly gloss-free methods like ours, which learn representations without any exposure to gloss annotations at any stage. We do not feel strongly about moving it in the table (i.e., grouping it with the gloss-free methods), so we will let the reviewer decide what they prefer given this context. Furthermore, the architecture uses RGB features, which are not directly comparable with our pose-only approach.
Thanks again for the nice review and we look forward to the discussion period.
Thank you for the clarifications and additional experiments. I keep my rating of borderline accept.
Thank you to the reviewer for their time and feedback on our rebuttal. We were glad to provide new experiments on the paper's generalisation, based on the initial comments. As the discussion period remains open for a few more days, we would be happy to provide any additional clarifications on the paper.
This paper presents Geo-Sign, a novel framework that introduces hyperbolic contrastive regularisation for Sign Language Translation (SLT) using skeletal data. The authors leverage the Poincaré ball model to enhance the hierarchical representation of sign kinematics and regularise a pre-trained mT5 language model via a geometric contrastive loss. The proposed method achieves state-of-the-art performance on the CSL-Daily benchmark in a gloss-free, pose-only setting.
Strengths and Weaknesses
Strengths:
- Innovative Geometric Perspective: The paper introduces a principled and novel use of hyperbolic geometry to model the hierarchical and compositional nature of sign language motion, which is well-motivated both theoretically and empirically.
- Effective Performance in a Privacy-Preserving Setting: By relying solely on skeletal data, the method maintains user privacy while achieving competitive or superior performance compared to RGB- and gloss-based approaches. This is particularly valuable in real-world applications with privacy concerns.
- Comprehensive Evaluation and Strong Empirical Results: The paper provides thorough experiments, including ablation studies, visualization of embedding structures, and comparison with strong baselines. The performance improvements, especially in the gloss-free setup, demonstrate the effectiveness of the proposed regularisation strategy.
Weaknesses:
- SLT May Not Be the Most Suitable Task for Validating Embedding Quality. The core contribution of the paper is a more structured representation of sign language motion in embedding space. However, SLT (Sign Language Translation) involves complex language mapping from signs to natural language, where the alignment between sign motion and textual output (especially gloss-free translation) is often indirect and ambiguous. This issue is particularly pronounced in datasets like How2Sign, where the semantic gap between gloss and text is wide. CSL-Daily may obscure this limitation due to its high degree of alignment. Therefore, isolated or continuous sign language recognition would be a more appropriate task to validate the proposed embedding method directly.
- Lack of Explanation on Positive Sample Selection. The paper does not clarify how positive pose-text pairs are sampled for the contrastive loss. CSL-Daily contains many repeated sentences, making positive sampling easier. However, this assumption does not hold in more diverse datasets, where repeated samples are rare. The method’s dependence on such dataset characteristics is a critical limitation and could hinder generalisation.
- Dependency on Pose Estimation Quality. The method is highly dependent on the accuracy of the underlying pose estimation system (e.g., RTMPose). Inaccurate or noisy skeleton data can significantly affect downstream performance, and the paper does not attempt to mitigate or quantify this risk.
- Limited Dataset and Language Scope. All experiments are conducted solely on the CSL-Daily dataset (Chinese Sign Language). No evaluation is provided on datasets with other sign languages or different linguistic and kinematic structures, raising concerns about the method’s cross-linguistic generalisability.
- Unclear Notation in Figure 1. Figure 1 contains symbols (e.g., d, h) that are undefined or ambiguously used in the diagram. The lack of annotation impairs readability and reproducibility, especially for readers unfamiliar with hyperbolic geometry conventions.
Questions
Please see the weaknesses above.
Limitations
Yes
Final Justification
The authors have addressed all of my concerns with detailed and thoughtful responses. While some limitations remain, the clarifications and additional analyses are sufficient to justify a borderline acceptance.
Formatting Issues
None
We thank the reviewer for their constructive and detailed review. We agree with the weaknesses presented and have addressed them below.
Addressing Weaknesses
1. On Task Suitability (SLT vs. SLR)
The reviewer raises a very fair point about the suitability of SLT for evaluating embedding quality, and we agree that an evaluation on ISLR is missing from the paper. As discussed in our other rebuttals, since our initial submission, we have evaluated our method on the WLASL2000 benchmark for ISLR. We are pleased to report that our method outperforms not only the pose-only baseline but also the stronger multi-modal (Pose+RGB) variant of Uni-Sign.
| Method | Pose | RGB | WLASL2000 (P-I) | WLASL2000 (P-C) |
|---|---|---|---|---|
| Uni-Sign (Pose) | ✔ | | 63.13 | 60.90 |
| Uni-Sign (Pose+RGB) | ✔ | ✔ | 63.52 | 61.32 |
| Geo-Sign (Ours) | ✔ | | 63.64 | 61.89 |
We will add the full table of results and additional details to the full paper or can post here during the discussion period if requested.
To address concerns about the limited dataset scope and potential dependence on CSL-Daily's characteristics, we have conducted new experiments on the How2Sign dataset for ASL-to-English translation.
| Method | BLEU-4 | ROUGE-L |
|---|---|---|
| Uni-Sign (Pose) | 14.5 | 34.3 |
| Geo-Sign (Ours) | 15.1 | 35.4 |
As the reviewer correctly notes, How2Sign has a wider semantic gap and fewer repeated sentences. The fact that our method still improves over the baseline on this more challenging dataset (+0.6 BLEU-4, +1.1 ROUGE-L) is strong evidence that our results are not an artifact of CSL-Daily's structure and that the method generalises well.
2. Sampling in InfoNCE
To clarify, our work uses the standard in-batch sampling protocol of InfoNCE-style loss presented in Equation (5). There is no separate or special sampling step.
For any given anchor pose embedding in a training batch, its corresponding text embedding is treated as the sole positive sample. The other text embeddings in that same batch automatically serve as the in-batch negatives.
This objective, a softmax cross-entropy over the geodesic distances, requires the model to correctly identify the positive pair from the set of all candidates within the batch. Minimizing this loss directly forces the model to decrease the distance between corresponding pairs while simultaneously increasing the distance to all other non-corresponding pairs in the batch.
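As a concrete illustration of the objective described above, the following is a minimal NumPy sketch of an in-batch InfoNCE loss whose logits are negative Poincaré geodesic distances. It is a sketch under stated assumptions rather than our exact Equation (5): the temperature value, the fixed curvature `c`, and the function names are hypothetical.

```python
import numpy as np

def poincare_dist(x, y, c=1.0):
    """Pairwise geodesic distances between rows of x (B, D) and rows of y (B, D)."""
    x2 = np.sum(x * x, axis=-1)[:, None]                              # (B, 1)
    y2 = np.sum(y * y, axis=-1)[None, :]                              # (1, B)
    diff2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)    # (B, B)
    arg = 1.0 + 2.0 * c * diff2 / ((1.0 - c * x2) * (1.0 - c * y2))
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def hyperbolic_info_nce(pose, text, temperature=0.1, c=1.0):
    """Softmax cross-entropy over negative geodesic distances.

    Row i of `pose` is the anchor, row i of `text` is its only positive,
    and every other row of `text` is an in-batch negative.
    """
    logits = -poincare_dist(pose, text, c) / temperature              # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy batch: four well-separated pose embeddings and nearby text embeddings.
pose = np.array([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]])
text = 0.9 * pose
loss_aligned = hyperbolic_info_nce(pose, text)
loss_shuffled = hyperbolic_info_nce(pose, np.roll(text, 1, axis=0))
print(loss_aligned, loss_shuffled)
```

In this toy batch the aligned pairing gives a much lower loss than the shuffled pairing, which is the behaviour the paragraph above describes: matched pairs are pulled together and all other pairs in the batch are pushed apart.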
3. On Dependency on Pose Estimation Quality
The reviewer was not the only one to bring up the dependency on pose estimation quality. To quantify this risk, we conducted a new experiment where we added Gaussian noise to the pose representations. The results show that our hyperbolic method is significantly more robust and degrades more gracefully than the baseline.
| Noise Level (σ) | Geo-Sign (BLEU-4) | Baseline (BLEU-4) | Geo-Sign Drop (%) | Baseline Drop (%) |
|---|---|---|---|---|
| 0.00 (Original) | 27.42 | 26.25 | - | - |
| 0.01 | 26.30 | 24.14 | -4.1% | -8.0% |
| 0.02 | 24.60 | 21.50 | -10.3% | -18.1% |
| 0.03 | 19.07 | 14.40 | -30.4% | -45.2% |
| 0.04 | 11.63 | 7.20 | -57.6% | -72.5% |
| 0.05 | 5.98 | 3.01 | -78.2% | -88.5% |
As discussed with the other reviewers, we believe this is because the learned hyperbolic geometry creates larger decision margins between sign classes, making the representations more resilient to perturbations. We will add this full analysis to the appendix.
4. On Limited Dataset Scope (Addressed in Point 2)
As shown above, our new results on How2Sign and WLASL provide examples of our method's cross-linguistic and cross-task generalisability.
5. On Unclear Notation in Figure 1
Thank you for pointing this out; we agree it is unclear. We will revise the caption for Figure 1 in the final version to explicitly define all symbols.
We are confident that these new experiments address the weaknesses raised and look forward to the discussion period. Thanks again for taking the time to review the paper and giving some nice feedback and comments to improve the paper.
Dear Reviewer TKSE,
The authors have responded to the original reviews. Could you read the rebuttal and share your thoughts? Does it address your original concerns? Are there any remaining questions for the authors?
Best, AC
Thank you for the detailed response, which has addressed all of my concerns. I appreciate the clarifications and the additional analyses provided. I hope all the points discussed will be incorporated into the revised version of the paper. I will update my score accordingly.
Geo-Sign introduces a novel approach to Sign Language Translation (SLT) by enhancing the geometric representation of skeletal pose features. Instead of relying on RGB inputs or conventional Euclidean embeddings, this work employs hyperbolic geometry—specifically the Poincaré ball model—to capture the hierarchical structure of sign kinematics. Integrating hyperbolic projections, Fréchet mean aggregation, and a geometric contrastive loss into an mT5-based translation pipeline, Geo-Sign significantly improves translation quality, particularly for fine-grained hand gestures. Experimental results on CSL-Daily show state-of-the-art performance with pose-only data.
Strengths and Weaknesses
Strengths:
- The paper is well-written. It presents a well-motivated, thoroughly validated method for improving SLT using hyperbolic geometry, with strong experimental support and clear implementation details.
- Achieves SOTA performance in gloss-free SLT with significantly fewer parameters and privacy-respecting pose data. Demonstrates that geometric priors can substitute for expensive visual encoders to some extent.
- The first to combine hyperbolic embedding, contrastive loss, and part-wise attention in the SLT task. The application of Riemannian optimization and learnable curvature is innovative.
Weaknesses:
- Evaluation is limited to CSL-Daily. It would be great if this method could also generalise to other languages. I appreciate that the authors acknowledge the lack of significance testing due to computational and time limits, but this remains a weakness when interpreting modest improvements (e.g., +1.81 BLEU-4).
- As the authors discuss in Line 108, full reliance on accurate pose extraction (RTMPose) could degrade under noisy conditions. It is very important to have visual features and more detailed facial expressions from RGB videos. In-the-wild application performance is a concern because the quality of the pose data is highly unpredictable and could be jittery. Does this method also maintain good performance under noisy poses?
- The authors claim that one reason for dropping the RGB information is to reduce the computational cost. However, the hyperbolic computations increase training latency, although this is mitigated by not affecting inference. There is no analysis of this (e.g., run-time, memory cost).
- No statistical significance measures reported due to compute limitations.
- How sensitive is the weighted Fréchet mean computation to initialization or high curvature values near the boundary of the ball?
Questions
- How would the model generalise to non-Chinese sign languages with different kinematic properties (e.g. [1])?
- Could dynamic curvature be more explicitly leveraged across layers or time steps?
- Have the authors considered comparing Möbius-attention to simpler attention variants in hyperbolic space (e.g., distance-only or dot-product-based)?
- In terms of the gradient of the hyperbolic distance in Appendix D.2, the exponential expansion of distances at the boundary could cause instability in backpropagation. How does the model prevent exploding gradients when points lie near the boundary of the Poincaré ball?
[1] A. Duarte et al., How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. CVPR, 2021
Limitations
yes
Final Justification
The authors addressed all my concerns and questions in the rebuttal with detailed experiments and clarification. Though in my view, the lack of RGB information is still a problem in terms of the application system, regardless of the claim of being privacy-aware. I still think this algorithm is valuable by improving existing SotA approaches. Therefore, I will support the acceptance of this work due to its novelty and contribution to the SLT field.
Formatting Issues
None
We thank the reviewer for their thoughtful and constructive feedback. We are happy to address the weaknesses and questions below.
Addressing Weaknesses
1. Generalization to Other Datasets and Tasks
As requested, we provide additional results on How2Sign (ASL Translation) and WLASL2000 (ASL Recognition).
How2Sign (SLT)
On this translation task, our geometric regularization provides a clear improvement over the Uni-Sign baseline, demonstrating generalization to a new language (ASL) and data domain.
| Method | BLEU-4 | ROUGE-L |
|---|---|---|
| Uni-Sign (Pose) | 14.5 | 34.3 |
| Geo-Sign (Ours) | 15.1 | 35.4 |
WLASL2000 (ISLR)
On the recognition task, our pose-only method not only improves upon the pose-only baseline but also surpasses the stronger multi-modal (Pose+RGB) variant of Uni-Sign demonstrating that the method is adaptable to recognition as well as translation tasks.
| Method | Pose | RGB | WLASL2000 (P-I) | WLASL2000 (P-C) |
|---|---|---|---|---|
| Uni-Sign (Pose) | ✔ | | 63.13 | 60.90 |
| Uni-Sign (Pose+RGB) | ✔ | ✔ | 63.52 | 61.32 |
| Geo-Sign (Ours) | ✔ | | 63.64 | 61.89 |
We have only included the baseline (Uni-Sign) as reference here for brevity but can provide the full table on request. We will add these results and some discussion to the final paper.
2. Robustness to Pose Noise and Non-Manual Features
The reviewer raises a great point regarding the importance of non-manual features (NMFs) that can only be captured accurately with RGB data and the challenges of pose-only methods. This work is explicitly framed to address privacy-aware SLT and so we don't consider RGB features, however we do include 16 keypoints for the face which should capture some of this information. To empirically address the concern of in-the-wild poses, we conducted a new ablation study, adding Gaussian noise to the pose representations. The results demonstrate that our hyperbolic method is substantially more robust to noise than the baseline, maintaining higher performance and degrading more gracefully. We will add these results as an additional ablation in the paper.
| Noise Level (σ) | Geo-Sign (BLEU-4) | Baseline (BLEU-4) | Geo-Sign Drop (%) | Baseline Drop (%) |
|---|---|---|---|---|
| 0.00 (Original) | 27.42 | 26.25 | - | - |
| 0.01 | 26.30 | 24.14 | -4.1% | -8.0% |
| 0.02 | 24.60 | 21.50 | -10.3% | -18.1% |
| 0.03 | 19.07 | 14.40 | -30.4% | -45.2% |
| 0.04 | 11.63 | 7.20 | -57.6% | -72.5% |
| 0.05 | 5.98 | 3.01 | -78.2% | -88.5% |
We hypothesize this increased robustness stems from the geometric properties of the Poincaré ball. The exponential volume growth forces a sparser representation, creating larger geodesic margins between distinct sign classes, making the model less susceptible to noisy perturbations.
3. Computational Cost
The reviewer is correct that our hyperbolic branch introduces computational overhead during training, a point we address in detail in Appendix F.1. As discussed, this latency is partly due to the necessity of using FP32 for hyperbolic operations to maintain numerical stability. However, this is a one-time training cost. Crucially, at inference time, the overhead is zero, as the regularization branch is removed.
4. Statistical Significance
We can only agree with the reviewer, though we note it is a limitation shared by all prior work in this area due to the computational requirements for training these large SLT models. To ensure fairness and reproducibility, we used the same seed and hyperparameters as the baseline. We also include a detailed overview of all hyperparameters in the supplementary material. The code for all experiments and weights are also available for both our pretrained and fine-tuned models.
5. Sensitivity of the Weighted Fréchet Mean
As discussed in Appendix D.1 (Proposition D.1), the computation is highly stable. As we discuss in the supplementary, the Poincaré ball is a Hadamard manifold, where the squared geodesic distance is a strictly convex function. This guarantees that a unique Fréchet mean exists and that our iterative algorithm (Algorithm 1) is guaranteed to converge to it, irrespective of initialization. We are happy to move this into the main paper at the reviewer's request.
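To give some intuition for what such an iterative computation looks like, here is a generic sketch of a weighted Fréchet mean on the Poincaré ball computed by Riemannian gradient descent with log and exp maps. It is not a transcription of Algorithm 1: the step size, tolerance, fixed curvature `c`, and all function names are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-15

def mobius_add(x, y, c=1.0):
    """Möbius addition on the Poincaré ball with curvature -c."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return num / den

def exp_map(x, v, c=1.0):
    """Map tangent vector v at point x back onto the ball."""
    sqrt_c = np.sqrt(c)
    v_norm = max(np.linalg.norm(v), EPS)
    lam = 2.0 / (1.0 - c * np.dot(x, x))          # conformal factor at x
    step = np.tanh(sqrt_c * lam * v_norm / 2.0) * v / (sqrt_c * v_norm)
    return mobius_add(x, step, c)

def log_map(x, y, c=1.0):
    """Tangent vector at x pointing towards y, with length d(x, y)."""
    sqrt_c = np.sqrt(c)
    u = mobius_add(-x, y, c)
    u_norm = max(np.linalg.norm(u), EPS)
    lam = 2.0 / (1.0 - c * np.dot(x, x))
    scale = np.arctanh(min(sqrt_c * u_norm, 1.0 - 1e-7))
    return (2.0 / (sqrt_c * lam)) * scale * u / u_norm

def weighted_frechet_mean(points, weights, c=1.0, step=1.0, iters=100, tol=1e-10):
    """Minimise the weighted sum of squared geodesic distances by gradient descent.

    A smaller `step` may be needed when the points are widely spread.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    mu = np.array(points[0], dtype=float)         # initialise at the first point
    for _ in range(iters):
        grad = sum(w * log_map(mu, p, c) for w, p in zip(weights, points))
        if np.linalg.norm(grad) < tol:
            break
        mu = exp_map(mu, step * grad, c)
    return mu

points = np.array([[0.3, 0.0], [-0.3, 0.0]])
mean_equal = weighted_frechet_mean(points, [1.0, 1.0])
mean_skewed = weighted_frechet_mean(points, [3.0, 1.0])
print(mean_equal, mean_skewed)
```

With equal weights the mean of two points mirrored about the origin is the origin, and increasing the weight of the first point pulls the mean towards it along the geodesic joining them.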
Questions
1. Generalization to other sign languages
As demonstrated in our new experiments on How2Sign (ASL) and WLASL2000 (ASL), our method generalizes well. We hypothesize this is because the hierarchical nature of kinematics (torso → arm → hand) is a fundamental principle in sign languages. Our model, by using a geometry inherently suited to such hierarchies, learns this compositional structure, allowing it to adapt across different languages.
2. Dynamic Curvature
This is an interesting idea! We did experiment with learning distinct curvature parameters for each body part but, as the reviewer might anticipate, we found it introduced instability into the Riemannian optimization. It is technically feasible to have adaptable curvature over the temporal dimension, perhaps adapting it based on kinematic properties, but again challenging to optimize.
3. Simpler Attention Variants
Our choice of Möbius transformations was principled, as they are the formal isometries of the Poincaré ball and thus the geometrically sound analogue to the affine transformations critical to standard Euclidean attention. This enables a dynamic, learned alignment, which we argue is crucial for the complex alignment task in SLT. However, we agree that a simpler distance-based variant would be a useful ablation for the paper. Unfortunately, we don't have the resources to run it during the rebuttal period, but we will perform an additional ablation for the final version of the paper.
4. Gradient Stability
The reviewer correctly identifies this challenge, which we address in the paper. As detailed in Section 3.4 and expanded in Appendix G.2, we employ two primary safeguards: 1) tangent vector clipping before the exponential map and 2) maintaining FP32 precision for all geometric operations. These steps have proven effective in ensuring stable training.
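The first safeguard can be illustrated with a small sketch: clip the norm of the tangent vector, then apply the exponential map at the origin, $\exp_0(v) = \tanh(\sqrt{c}\,\lVert v \rVert)\, v / (\sqrt{c}\,\lVert v \rVert)$. The clipping threshold `max_norm=2.0` and the function names are hypothetical choices for illustration and are not the values used in our implementation.

```python
import numpy as np

def clip_tangent(v, max_norm=2.0):
    """Rescale tangent vector v so that its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(v)
    if norm > max_norm:
        return v * (max_norm / norm)
    return v

def exp_map_origin(v, c=1.0):
    """Exponential map at the origin of a Poincaré ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    norm = max(np.linalg.norm(v), 1e-15)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

v = np.array([60.0, 80.0])                      # tangent vector with norm 100
unclipped = exp_map_origin(v)                   # tanh saturates: lands on the boundary
clipped = exp_map_origin(clip_tangent(v))       # stays strictly inside the ball
print(np.linalg.norm(unclipped), np.linalg.norm(clipped))
```

Without clipping, `tanh` saturates and the point is numerically on the boundary, where the conformal factor and hence the distance gradients blow up. With clipping, the point stays at a bounded geodesic distance from the origin.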
Thanks again for the nice review, we look forward to the discussion period.
Thank you for the detailed answers and experiments. They address most of my concerns and questions. I look forward to your further results when you have the resources. I would keep my rating of borderline accept.
Thank you to the reviewers for taking the time to review our rebuttal. We are glad our new experiments on generalisation and noise robustness were helpful. As there are a few days left in the discussion period, we would be happy to provide any further context or clarification on the paper.
Since the initial rebuttal we have been able to run the additional requested ablation experiment on the simpler attention methods.
Regarding the dot-product variant, as discussed in our initial comment, a standard Euclidean dot product is geometrically inconsistent with the Poincaré manifold. However, we implemented the distance-only variant where attention scores are derived from the negative geodesic distance on the hyperbolic manifold. The ablation was performed on CSL-Daily with the same training and hyper-parameter setup as the main experiment.
| Model Variant | Attention Mechanism | Geometry | BLEU-4 | ROUGE-L |
|---|---|---|---|---|
| Geo-Sign (Ours) | Möbius (Learnable) | Hyperbolic | 27.42 | 57.95 |
| Geo-Sign (Ablation) | Distance-Only | Hyperbolic | 25.90 | 55.25 |
| Geo-Sign (Baseline) | Euclidean Dot-Product | Euclidean | 25.98 | 53.93 |
The distance-only model's n-gram precision (BLEU-4) offers no improvement over the Euclidean baseline. Interestingly, we see some improvement on the content recall (ROUGE-L) performance.
We attribute this to the hyperbolic space's greater capacity for feature separation, which aids a simple retrieval mechanism in recalling a more accurate set of semantic concepts (improving ROUGE-L). However, arranging these concepts into high-precision phrases requires a more sophisticated, learnable mechanism to model syntactic relationships which is provided by the learnable Möbius operations.
This ablation confirms that the method benefits from the principled combination of the hyperbolic geometry and the learnable Möbius attention.
This analysis has been added to the paper's appendix. Again, we appreciate the feedback, which has led to a more robust validation of our work.
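For concreteness, the distance-only variant in the ablation above can be sketched as follows: attention weights are a softmax over negative geodesic distances, with no learnable transformation of queries or keys. The temperature, the fixed curvature `c`, and the Euclidean weighted sum used to aggregate the values are simplifying assumptions of this sketch, and the names are hypothetical.

```python
import numpy as np

def poincare_dist(x, y, c=1.0):
    """Pairwise geodesic distances between rows of x (Q, D) and rows of y (K, D)."""
    x2 = np.sum(x * x, axis=-1)[:, None]                              # (Q, 1)
    y2 = np.sum(y * y, axis=-1)[None, :]                              # (1, K)
    diff2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)    # (Q, K)
    arg = 1.0 + 2.0 * c * diff2 / ((1.0 - c * x2) * (1.0 - c * y2))
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)

def distance_only_attention(queries, keys, values, temperature=1.0, c=1.0):
    """Attention whose scores are negative geodesic distances (no learned maps)."""
    scores = -poincare_dist(queries, keys, c) / temperature
    scores = scores - scores.max(axis=1, keepdims=True)               # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Simplification: values are combined with an ordinary weighted sum.
    return weights @ values, weights

queries = np.array([[0.5, 0.0]])
keys = np.array([[0.45, 0.0], [-0.5, 0.0], [0.0, 0.5]])
values = np.eye(3)
output, weights = distance_only_attention(queries, keys, values)
print(weights)
```

Because the scores depend only on fixed distances, this mechanism can retrieve nearby concepts but cannot learn how to re-weight them, which is consistent with the ROUGE-L versus BLEU-4 pattern discussed above.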
Dear authors,
Thank you for the detailed experiments and comprehensive rebuttal provided. I can confirm that all of my concerns have been addressed. I will update my final justification accordingly, and I will highly support the acceptance of this work.
This paper focuses on Sign Language Translation using skeletal data and proposes an approach that uses ideas from hyperbolic geometry to model the hierarchical structure of sign language motions. Initially, the paper received two Borderline Accept and two Borderline Reject ratings. The reviewers appreciated the novelty of the approach and the geometric intuition, the privacy-preserving nature of the approach, and the strong empirical results. However, there were also concerns, particularly related to the evaluation, which omitted some baselines and was only performed on a single dataset. The authors provided a rebuttal which addressed these main concerns from the reviewers. Eventually, all reviewers gave a Borderline Accept rating for the paper. Given the unanimous acceptance recommendation from four knowledgeable reviewers, there is no basis to overturn the reviews. The AC recommends acceptance. Authors should still consider any additional comments or feedback from the reviewers while preparing their final version and should update the manuscript to include any additional promised changes, analysis, results and/or discussion they provided in their rebuttal, particularly the extended quantitative evaluation results.