PaperHub
Overall rating: 5.2 / 10 · Poster · 5 reviewers (min 4, max 6, std 0.7)
Individual ratings: 4, 5, 6, 6, 5
Confidence: 4.2 · Correctness: 2.8 · Contribution: 2.2 · Presentation: 2.8
NeurIPS 2024

Improving Gloss-free Sign Language Translation by Reducing Representation Density

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06

Abstract

Keywords
Gloss-free Sign Language Translation; Representation Density; Performance Drop; Contrastive Learning

Reviews and Discussion

Review
Rating: 4

This paper identifies a critical issue of representation density in gloss-free sign language translation. A series of models were employed to verify the existence of this problem. To address it, the authors propose a straightforward and effective solution, the SignCL loss. This objective improves the discriminative representations of input videos, achieving promising performance on the PhoenixT and CSL-Daily datasets.

Strengths

  • The paper is straightforward and comprehensible, and the proposed method is somewhat effective.
  • The adequate visualizations validate the effectiveness of the proposed objective.
  • The extensive experiments on both public benchmarks show plausible performance.

Weaknesses

  • The technical novelty and contribution are somewhat limited. The entire paper mainly introduces a contrastive loss (SignCL), which has been widely utilized in various other domains.
  • The proposed objective is only applied to a single model, which seems to lack persuasiveness. Experiments on more SLT models are essential to validate its effectiveness.
  • On line 118, why is there a need to optimize the test set? Why must we manually determine the best frame from l_v and r_v instead of utilizing CTC's gradient? For VLP, how are l_v and r_v generated?
  • In section 2.2, why are VLP, I3D, and SMKD methods grouped together? These methods should differ in terms of their training objectives. SDR of SMKD will naturally be low, because CTC loss is a discriminative loss. Achieving low SDR is essential for successful completion of the CSLR task.

Questions

  • The writing in the article is overly colloquial, and the presentation of Formula 4 may be confusing. Additionally, there is an extraneous character ")" in the second line of the title in Figure 4.
  • Line 236 claims that GFSLT "incorporates CLIP and MBART for model pretraining and finetuning." However, to my knowledge, GFSLT does not utilize CLIP's pre-trained weights.
  • The performance of CSL-Daily in Table 2 is notably inconsistent. In Table 1, when Sign2GPT achieves R@L of 48.90, its B@4 is 22.52. However, in SignCL's CSL-Daily, despite achieving a similar R@L of 48.92, its B@4 drops to 16.16. Please verify the accuracy of these performance metrics.
  • The t-SNE visualization method is unclear. How should the SLT task be correctly labeled with gloss?
  • The datasets PHOENIX-2014T and CSL-Daily are relatively small in size. Can the effectiveness of the method be validated on larger datasets like How2Sign and Open-ASL?

Limitations

Yes, the limitations have been discussed in the main paper.

Author Response

We sincerely appreciate your detailed comments and suggestions. We provide point-wise responses to your concerns below.

W1: The technical novelty and contribution are somewhat limited: We respectfully disagree with the statement that "The entire paper mainly introduces a contrastive loss". We emphasize our contributions below:

  • Introducing a contrastive loss is not the only contribution of this work. We have an entire Section 2 and extensive analyses dedicated to investigating the representation density problem in SLT.
  • The key contribution of this paper is identifying the representation density problem for gloss-free SLT for the first time. It highlights a potential bottleneck in restricting the performance of gloss-free SLT, which advances the SLT field.
  • Introducing SignCL is the second contribution. Even though it is based on a well-known contrastive learning method, it is a simple but effective solution to address the representation density problem. We are the first to validate this, providing a direction for future gloss-free SLT to learn more discriminative representations.

W2: The proposed objective is only applied to a single model: We respectfully disagree with this conclusion. As described in line 13 and the experiments listed in Section 4.1, we evaluate SignCL across multiple datasets and SLT frameworks, including both gloss-based and gloss-free settings. Reviewers rTJg and UuUf have recognized this contribution and even considered it a strength.

Q1: How are l_v and r_v generated? We noticed that you have several questions about the sign-gloss forced aligner. We provide a clearer introduction to address your concerns.

  • To start, we want to emphasize that the purpose of the sign-gloss forced aligner is only to derive labels to measure the discriminability of the feature representations, not for the model itself. So for VLP, it does not need to produce l_v and r_v; the entire Sign-Gloss Alignment consistently uses pre-trained models.
  • We mix the test set into the sign-gloss forced aligner training process and employ volunteers to manually determine the best frame to ensure accurate SDR evaluation when assessing different SLT approaches.
  • The generation of l_v and r_v has been covered in several previous works [a, b]. For specific details, you may refer to these studies. We highlight that, as shown in Appendix A.3, the sign-gloss forced aligner yields a Word Error Rate (WER) of 8.68, which is on par with human performance [c]. Even with this accuracy, we strive to ensure the most precise results on the test set.

Q2: Why are VLP, I3D, and SMKD methods grouped together?

  • Section 2 aims to investigate the representation density problem within existing sign feature extraction methods, including both gloss-free and gloss-based methods.
  • We acknowledge your insightful observation that the SDR of gloss-based methods, such as SMKD, will naturally be lower because the CTC loss is a discriminative loss. We grouped them together to highlight that gloss-free SLT suffers from worse representation density compared to gloss-based methods.

Q3: Please verify the accuracy of these performance metrics in table 1 and table 2.

  • We have re-calculated our metrics, and they are accurate. We believe the differences in R@L and B@4 metrics are due to the different aspects they measure. R@L focuses on the retrieval accuracy, while B@4 emphasizes the fluency and relevance of the generated translations.
  • Additionally, the characteristics of Chinese and German languages are quite different, which can impact the consistency of R@L and B@4 across different languages.
  • It is also worth noting that similar discrepancies have been observed in past methods. For example, Sign2GPT achieves an R@L of 42.36 and a B@4 of 15.40, whereas GFSLT-VLP achieves a similar R@L of 42.49 but a B@4 of 21.44 on PHOENIX-2014T.

Q4: Can the effectiveness of the method be validated on larger datasets like How2Sign and Open-ASL

  • We believe that SignCL can still be effective on larger datasets, as larger datasets often lack costly gloss annotations to construct a discriminative loss. SignCL can serve as an effective optimization objective in this process, encouraging the learning of more discriminative representations of sign gestures.

  • We train on the How2Sign dataset for 40 epochs from scratch. In this limited experiment, the GFSLT baseline performance was 0.79 B@4, while applying SignCL as an additional optimization objective improved the performance to 1.58 B@4. These experiments are comparable to the results from [d], where they also trained from scratch.

Methods | B@1 | B@2 | B@3 | B@4
--- | --- | --- | --- | ---
H2S (no pretraining) [d] | 13.92 | 4.69 | 1.82 | 0.86
H2S (pretraining) [d] | 14.96 | 5.11 | 2.26 | 1.22
GFSLT | 12.31 | 3.85 | 1.53 | 0.79
+ SignCL into Finetuning | 17.36 | 6.27 | 2.87 | 1.58

Suggestions And Typos:

  • We apologize for the lack of clarity in the text and the typographical errors. We have revised our paper according to your suggestions to improve its readability and accuracy.

  • To clarify, Line 236 states that GFSLT-VLP "incorporates CLIP and MBART for model pretraining and finetuning." What we mean is that GFSLT-VLP uses the pretraining strategies of CLIP and MBART to retrain the GFSLT model, not that it uses their pre-trained weights. We will make this point clearer in future versions of the paper.

References:

[a] CTC-segmentation of large corpora for German end-to-end speech recognition.

[b] Cross-modality Data Augmentation for End-to-End Sign Language Translation

[c] Achieving human parity in conversational speech recognition Microsoft Research Technical Report 2017

[d] YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus

Comment

Firstly, I would like to thank the authors for their detailed responses and the additional experiments. After reading the rebuttal, I still have the following concerns regarding the paper:

  1. The authors claim that the main contribution of the paper is identifying the representation density problem for gloss-free SLT for the first time. However, I believe the fundamental reason why gloss-free SLT lags behind gloss-based SLT in performance is that the Vision Encoder lacks effective supervision signals, leading to insufficient representation capacity of the visual features. The representation density problem is merely one manifestation of the weak representational capacity of the visual encoder. In general vision tasks, numerous studies have been conducted on improving the representational capacity of visual encoding, such as Moco, Dino, MAE, etc. Therefore, striving to enhance the discriminative power of visual representations is a direction that has been continuously explored by the academic community. Thus, I believe that positioning this discovery as a core contribution lacks novelty.

  2. Regarding the representation density problem in the visual encoder, enhancing the representational capacity of the visual encoder is key to addressing this issue. The contrastive loss proposed in this paper does indeed improve the representational capacity of the visual model to some extent. However, from the quantitative experimental results, the improvement effects are inconsistent across different datasets. Compared to PhoenixT and CSL-Daily, H2S is a very challenging dataset due to its larger training set and vocabulary. It can be observed that the improvement brought by the proposed method on this dataset is very limited (B@4 0.79->1.58). In contrast, GLoFE [1] demonstrated that directly using the CNN model trained in [2] could achieve B@4 2.24. Therefore, these results do not provide strong evidence that the contrastive loss proposed in this paper can effectively solve the representation density problem. In other words, using a pre-trained visual model with stronger representational capacity might well solve this problem.

[1] Lin, Kezhou, et al. "Gloss-free end-to-end sign language translation." In ACL 2023.

[2] Oscar Koller, et al. "Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos." In TPAMI 2019.

Comment

Thank you for your thoughtful review and for highlighting key points. I believe we are aligned in recognizing that gloss-free SLT struggles with the weak representational capacity of the Vision Encoder due to insufficient supervision signals. The difference may lie in how we approach the problem, particularly in terms of perspective and granularity.

In response to your concerns, we would like to clarify our position:

  • We find that describing this problem as "insufficient representation capacity" is too broad and general. In our work, we formalize this problem specifically as the lack of discriminative power in the visual representations of sign gestures, which we term the "representation density problem."
  • While general vision tasks have made strides in improving visual representations, we argue that the ability to distinguish between sign gestures with different meanings is crucial in the context of sign language, and prior to our work, this has not been highlighted.
  • We demonstrate that the aspect of representation density significantly impacts sign language translation performance, by comprehensive investigation in Section 2.
  • SignCL is relatively straightforward (in a good way) in addressing the lack of discriminative power for different sign gestures. It represents a good start, and of course, there is much more that can be explored in the future.

Regarding the performance in H2S, we would like to argue two points:

  • Our experiments were limited due to constraints in computational resources and time. We only trained for 40 epochs, whereas GLoFE [1] was trained for 400 epochs.
  • The CNN model used in GLoFE [1] was pre-trained with GLOSS annotations to extract visual features [2]. The SignCL approach does not use any gloss annotations, whether for pretraining or as weak supervision.

We believe that directly comparing our results with GLoFE is unfair. We will continue training the model, but we estimate that similar training would take around 25 days on an A800*8 GPU setup because GFSLT processes raw video inputs directly. So far, our results demonstrate that incorporating SignCL does improve GFSLT performance (B@4 0.79 → 1.58) on How2Sign.

Thank you again for your thoughtful review and detailed responses. We welcome further discussion.

Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper. We have provided a new response to address the concerns you further raised. Thank you for your thoughtful review and for highlighting key points. We kindly inquire if there is any further information or clarification you might need from our side. We apologize if this follow-up causes any inconvenience. Hope you have a great day!

Comment

Thanks for your detailed responses and further discussion. After carefully reviewing the rebuttal and the main paper, I still consider that presenting the discovery of the 'representation density problem' as a core contribution lacks innovation. In the field of sign language video understanding, improving the representational capacity of visual models is a common goal among researchers. Furthermore, the proposed contrastive loss function has not consistently improved performance across public datasets (the performance improvements on PhoenixT and H2S are significantly lower than on CSL-Daily), making it difficult to assess the generalization ability of SignCL. I suggest the authors may need to conduct more in-depth research on this phenomenon to enhance the persuasiveness of the proposed method. Therefore, I maintain my initial score.

Comment

Thank you for your insightful feedback and for acknowledging the effectiveness of SignCL in enhancing the representational capacity of visual models. We do not see the assumption you describe as being at odds with the contributions of our work.

In fact, we have specifically formalized this poor visual encoder as the lack of discriminative power in the visual representations of sign gestures. Our work systematically experiments to validate this issue (Section 2), and our results clearly demonstrate that improving the discriminability of visual representations indeed leads to better translation performance (Section 3).

Regarding the inconsistency in the improvement of SignCL on PHOENIX-2014T and CSL-Daily, this issue aligns with what we addressed in Q3. The characteristics of Chinese and German are quite different, making direct comparisons across datasets challenging. Sign2GPT vs. GFSLT-VLP also exhibits similarly varying degrees of performance improvement. We understand there may be concerns about our performance improvement on CSL-Daily. We have committed to releasing our code, models, and logs to facilitate the reproduction of our results.

Comment

Thank you for your thoughtful feedback.

Based on the new concerns you’ve raised, we would like to share some additional arguments. We would like to mention Sign2GPT[a], which involves fine-tuning large-scale pretrained vision and language models (Dino-V2 and XGLM) for SLT.

  • According to the ablation study results presented in Table 2 of their paper, directly fine-tuning these models pretrained on general domains (termed as Sign2GPT) yields results that are only competitive with GFSLT-VLP (B@4 19.42 vs. 21.44 on PHOENIX-2014T).

  • To improve performance, they developed a pseudo-gloss pretraining strategy (termed as Sign2GPT(w/PGP)) on Dino-V2.

In contrast, our approach, using a simple contrastive loss, achieves better improvements over GFSLT-VLP when compared with Sign2GPT(w/PGP) on both the PHOENIX-2014T and CSL-Daily datasets.


Regarding the performance of the proposed SignCL, we believe it should be compared to more fully gloss-free methods, such as Sign2GPT [a], rather than GLoFE [1], which uses a visual backbone pre-trained on a gloss dataset*. We have provided consistent benchmarks with competitive baselines [a] and [b]. The improvements we observe on the two most widely used datasets in the sign language field (PHOENIX-2014T and CSL-Daily) are generally consistent with [a] and [b].

To sum up, though numerous studies in general vision tasks have explored improving the representational capacity of visual encoders, this does not diminish the novelty of our contribution to SLT. Our SignCL outperforms those achieved using general vision pre-trained models, such as Dino-V2 in Sign2GPT[a].

While we indeed did not have sufficient resources to run How2Sign for 400 epochs (which would take around 25 days on an A800*8 GPU setup), we believe that the current results are sufficient to validate our approach. It is fairer to compare with Sign2GPT [a] and GFSLT [b] due to the consistent setting on the two most widely used benchmarks (PHOENIX-2014T and CSL-Daily).


[a] "Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation" in ICLR'24

[b] "Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining" in ICCV'23

* As referenced in Section 3.1 of GLoFE [1]: "The backbone is pre-trained on the WLASL dataset (Li et al., 2020a) through the isolated sign language recognition task."

Comment

Thank you for your thoughtful responses.

  • For Sign2GPT [1], although this method utilizes large-scale pre-trained vision and language models (Dino-V2 and XGLM), both modules are frozen during training (except for the few learnable parameters from LoRA fine-tuning). Under such a setting, Sign2GPT achieves performance comparable with other methods that train all parameters of their proposed models. This result further validates my assumption that enhancing the representative capacity of vision encoders is the key to improving gloss-free SLT or gloss-based SLT. Meanwhile, I also acknowledge that the proposed SignCL can improve the representational capacity of visual models to some extent. However, I am confused by the varying degrees of performance improvement that SignCL achieves across different benchmarks.

[1] "Sign2GPT: Leveraging Large Language Models for Gloss-Free Sign Language Translation" in ICLR'24

Review
Rating: 5

This work uses contrastive learning of sign language gestures to improve the discrimination power of learned representations of gestures. This is motivated by showing, via TSNE projections, that gloss-free representations (that do not benefit from an intermediate representation) are less well represented than gloss-based representations. After incorporating contrastive learning, either into the pre-training, fine-tuning, or both, performance improvements can be achieved for both sign-language recognition and sign-language translation over the presented baseline.

Strengths

The paper is tackling an important problem as accurate sign-language recognition and translation will enable accessibility for the hard of hearing and those that need to interact with the hard of hearing.

The paper is well motivated from the perspective of demonstrating that the particular measure used to determine representation quality (SDR) is significantly lower for gloss-free representations, which is the focus of interest here. However, the approach itself will have limited impact in the broader NeurIPS community. The paper is using known methods applied to the representation of sign language.

The work is based on open data and the authors will release their code and model with the paper. This will ensure reproducibility.

Weaknesses

The word “significance” has specific meaning, and you should not claim that results are significant without properly showing that they are so.

Results for only a single run are provided. This is acknowledged in the check sheet with the justification that increasing the number of runs is computationally expensive. However, the resources required are ~12 hours on 8x NVIDIA A100s, which does not seem excessive by current standards. One solution might be to just use the best values for the hyperparameters, and show these are representative across runs.

Some of the findings appear obvious -- of course representations of dissimilar gestures that project closely in embedding space will be difficult for a model to tease apart for downstream tasks.

There is a lot of emphasis and conclusions drawn from the visualizations, which are TSNE projections of the embeddings. TSNE visualizations should be considered with a degree of caution, and there are better solutions if you are trying to consider the global structure of the data.

Questions

Figure 2 — is the reported SDR just for the highlighted signs in the figure or is it computed across all sign gestures?

Is the SDR issue artificially inflated because of the dataset used? For example, using a corpus of signing from weather forecasting with a limited vocabulary and many signs around a small number of topics?

A missing limitation of the work is that a fixed margin is used to define the threshold for frame selection for the contrastive learning. Whilst 20 frames might be sufficient, as well as being impacted by closely repeating signs, could frame-selection also be impacted by signing rate?

My understanding is that two annotators select a representative frame for each gesture, and the main text says that details are in the appendix. However, I did not see how you select the final frame from the two (potentially different) annotations. Do you average the time stamps to pick the one in the middle?

Some other observations:

  • Remove subjective qualifiers from your writing.
  • Your equations should be punctuated and flow with the text.
  • For Figure 2, I would highlight gloss-based vs. gloss-free approaches as at first glance approaches like SMKD look significantly better, and then later realized that this is expected since it is not a gloss-free approach.
  • For Figure 3, the binning needs to be explained in the caption. Also is the axis label for the vertical axis correct? Is this figure not showing sign recognition accuracy and sign translation accuracy?
  • It seems somewhat redundant including both Figure 1 and Figure 5.
  • I think it would be worth reinforcing in Section 2.1, when you talk about mixing the training and testing sets, that the purpose of this is only to derive labels, not for the model itself.
  • Has the contrastive learning been used with the other baselines? It seems that since the performance there is less good, there might be more room for improvement?

Limitations

Yes the limitations are addressed adequately.

Author Response

We sincerely appreciate your detailed comments and insightful suggestions. We provide point-wise responses to your concerns below.

W1: Lack of proper statistical validation

Thank you for your suggestion. We reran the representative experiments of Table 2 five times, using different seeds (23, 77, 256, 1024, 2056). The table below shows the average results (with standard deviations). The results demonstrate the statistical significance of our experiments.

Model | R@L ↑ | B@1 ↑ | B@2 ↑ | B@3 ↑ | B@4 ↑
--- | --- | --- | --- | --- | ---
GFSLT-VLP | 38.61 (±1.37) | 35.98 (±1.00) | 23.52 (±0.88) | 15.80 (±0.79) | 11.21 (±0.54)
+ SignCL into Finetuning | 48.54 (±1.14) | 46.41 (±0.79) | 32.65 (±0.51) | 22.68 (±0.42) | 16.05 (±0.16)

W2: Limited contributions

We respectfully disagree with this. We provide our explanation below:

  • We are not merely "using known contrastive learning methods applied to the representation of sign language." The key contribution of this paper is identifying the representation density problem for gloss-free SLT for the first time.
  • This discovery is neither obvious nor trivial. It is well-known that gloss-free methods in SLT lag significantly behind gloss-based approaches, but the reasons are still under investigation. We highlight that the representation density problem could be a bottleneck in restricting the performance of gloss-free SLT.
  • The insights provided in this paper for SLT have been recognized by Reviewers rTJg and UuUf, even though they expressed concerns about the sensitivity analysis on sampling parameters.
  • The simple but effective SignCL is the second contribution. Even though it is a straightforward solution to the representation density problem, we are the first to validate this and improve performance in two different frameworks by 39% and 46%, respectively. It will provide a direction for future gloss-free SLT. Reviewer rTJg has noted that this approach is "relatively straightforward in a good way."

W3: Some findings appear obvious and are heavily reliant on t-SNE visualizations

We respectfully disagree with these statements. Here is our explanation:

  • Our findings are neither obvious nor limited: While it is evident that poor representation hinders a model's ability, whether gloss-free SLT faces a significant representation density problem needs to be researched.
  • Our emphasis and conclusions are not based on visualizations alone. We first used SDR to quantify the discriminative capability of various existing sign feature extraction techniques. Our conclusions are derived from these quantitative metrics and results, particularly when comparing gloss-free and gloss-based approaches. Visualizations only serve as a visual aid to better illustrate the representation density problem.

Q1: Is the reported SDR just for the highlighted signs in the figure or is it computed across all sign gestures?

All reported SDRs in Figure 2 are computed across all sign gestures in the training set.

Q2: Is the SDR issue artificially inflated due to the dataset used?

We believe the SDR issue is not artificially inflated. Our related conclusions have been consistently validated on the CSL-Daily dataset, which is a multi-topic dataset. Moreover, many of our conclusions are derived from comparisons between gloss-free and gloss-based methods on the same dataset.

Q3: Could frame selection be impacted by the signing rate given the fixed margin?

Thank you for your insightful question. We address this concern from the following aspects:

  • The margin is not fixed; it dynamically adapts based on the estimated average frames of each gloss. Therefore, it actually adapts to the signing rate.
  • Our sensitivity analysis shows that SignCL is not sensitive to the margin parameter, as evidenced by a variance of 0.062 when using different thresholds in [0, 10, 20, 30, 40, 50].

Q4: How to select the final frame from the two (potentially different) annotations?

We apologize for the missing details. By default, we select the middle frame as the representative, and the two annotators first check whether this default frame is a good representative. For the interval [l_v, r_v], we ensure the current frame clearly represents a gesture and is different from the previously selected one. If the default frame does not meet these criteria, the annotators review the frames within the interval [l_v, r_v] proposed by the sign-gloss aligner. If a suitable frame cannot be found within this interval, the entire video is marked for discard. Only data that both annotators consistently agree upon as good will be used.

Q5: Is the axis label for the vertical axis correct?

We apologize for the confusion. To clarify: In Figure 3(a), the bar chart represents the SDR of different gloss bins (left axis), while the line represents the corresponding sign recognition accuracy (right axis). In Figure 3(b), the left axis indicates the translation performance B@4.

Q6: Could there be more room for improvement?

  • We agree that achieving performance in gloss-free SLT similar to gloss-based approaches still has a long way to go. However, gloss-free methods have the advantage of not requiring costly gloss annotations, making it easier to scale up training datasets and perform large-scale pretraining. Then, the pretrained model can serve as a strong starting point for us to finetune on a small amount of high-quality annotated data.
  • We believe that addressing the representation density issue is crucial for effective pretraining. SignCL can serve as an effective optimization objective, encouraging the learning of more discriminative sign gesture representations.

Other suggestions:

We appreciate your insightful suggestions. They will certainly help improve our paper. We will incorporate your feedback to refine our paper in the upcoming version.

Comment

Thank you for the responses to the questions and concerns raised. I also appreciate you taking the effort to provide the statistics over multiple runs. I will increase my score by a point.

I would make two observations about your response.

  1. You used the word "merely" and then put quotation marks around words that did not appear in my review. If you are using quotation marks, then you are attributing those words to someone else. The only place those words are used is in your response here and the general response above.
  2. I did not say that you rely on visualizations "alone", but rather that there is a lot of emphasis on TSNE projections. TSNE projections are known to be problematic. There are alternative projections, such as UMAP, which better preserve the global structure of the data.

Comment

Thank you for your thoughtful review and for your careful suggestions. We have learned a lot, and we will be more rigorous in using quotation marks and the words we use. We also appreciate your clarification. We will include UMAP visualizations in future versions.
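
For reference, switching the projection is a small change to the visualization pipeline. The sketch below is illustrative only (it assumes the umap-learn and scikit-learn packages and an (N, D) feature matrix; it is not taken from our codebase):

```python
import numpy as np
import umap  # umap-learn package
from sklearn.manifold import TSNE

def project_2d(features: np.ndarray, method: str = "umap") -> np.ndarray:
    """Project (N, D) sign-gesture features to 2-D for visualization."""
    if method == "umap":
        # UMAP tends to preserve more of the global structure than t-SNE.
        reducer = umap.UMAP(n_components=2, n_neighbors=15,
                            min_dist=0.1, random_state=0)
    else:
        reducer = TSNE(n_components=2, perplexity=30, random_state=0)
    return reducer.fit_transform(features)
```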

Review
Rating: 6

This paper addresses the challenge of gloss-free sign language translation (SLT) by identifying and tackling the "representation density problem". The authors observe that visual representations of semantically distinct sign gestures tend to be closely packed in feature space, making it difficult for gloss-free methods to distinguish between different gestures. To address this, they propose a contrastive learning strategy called SignCL, which encourages models to learn more discriminative feature representations in a self-supervised manner. The paper demonstrates significant improvements in BLEU scores across various translation frameworks on the CSL-Daily dataset.

Strengths

1 The paper identifies a novel problem (representation density) in gloss-free SLT and proposes an innovative solution (SignCL) to address it. The proposed SignCL method shows substantial improvements in SLT performance without requiring additional model parameters, potentially advancing the field of gloss-free SLT.

2 The authors conduct thorough experiments to demonstrate the existence of the representation density problem and the effectiveness of their proposed solution. They evaluate their method across multiple datasets and SLT frameworks.

3 The paper is well-structured and clearly written. The problem statement, methodology, and results are presented in a logical and easy-to-follow manner.

Weaknesses

1 The sampling strategy for SignCL makes assumptions that may not always hold true in real-world scenarios. Specifically, it assumes that adjacent frames always belong to the same sign gesture and that frames far apart always represent different semantics. This approach overlooks the possibility of rapid transitions between gestures or the repetition of gestures over time, which could lead to incorrect positive or negative samples.

2 The SignCL method relies on several manually defined parameters and thresholds, such as the margin for determining positive and negative samples. The paper lacks a rigorous justification for these choices or an analysis of how sensitive the method is to these parameters. A more principled approach to parameter selection or a comprehensive sensitivity analysis would strengthen the scientific basis of the proposed method.

Questions

1 How does the performance of SignCL vary with different choices of the margin parameter in the sampling strategy? Is there a systematic way to determine the optimal margin for a given dataset?

Limitations

The authors discuss their limitations in the Appendix.

Author Response

We sincerely appreciate your detailed comments and insightful suggestions. We provide point-wise responses to your concerns below.

W1: Lack of comprehensive sensitivity analysis on sampling strategy

Thank you for your insightful suggestions. We conducted an additional comprehensive sensitivity analysis on the sampling strategy.

1. Before we proceed, let's briefly revisit some details from Formula 4: margin = max(10, len(frames)/len(text)* 2.3).

  • The margin for negative sampling dynamically depends on the estimated average margin of each gloss (i.e., len(frames)/len(text) * speech-to-gesture Zipf’s factor) and a minimum threshold (i.e., 10). The Zipf’s factor, set as 2.3, refers to the speech-to-gesture Zipf’s Law.
  • We calculated the distribution of the dynamically estimated margin, with the results shown in the table below. A more detailed distribution can be seen in the attached PDF (green background).
Margin | [0, 10) | [10, 20) | [20, 30) | [30, 40) | [40, 50) | [50, ∞)
--- | --- | --- | --- | --- | --- | ---
Count | 113 | 3460 | 3264 | 230 | 23 | 6
  • Only 1.6% fall into the [0, 10) range, which means that the margin is primarily determined by the estimated average frames of each gloss in our paper (i.e., len(frames)/len(text) * 2.3).

2. Sensitivity analysis on the threshold: We designed a minimum threshold to ensure the margin is not too small.

  • Experiment setup: To make the analysis more principled, we evaluated the threshold values at [0, 10, 20, 30, 40, 50]. Note that threshold=0 and threshold=50 indicate that the margin is dominated by the dynamically estimated margin and the fixed threshold, respectively.
  • Experiment results: We uniformly trained for 80 epochs on PHOENIX-2014T due to resource limitations. The results in the table below indicate that SignCL is not sensitive to the threshold parameter, with a variance of 0.062.
Threshold | 0 | 10 | 20 | 30 | 40 | 50
--- | --- | --- | --- | --- | --- | ---
B@4 | 17.24 | 17.63 | 17.55 | 17.63 | 17.13 | 17.11

3. Sensitivity analysis on dynamically estimated margin: We use text length and Zipf’s factor to estimate the average frames of each gloss (gloss-free setting).

  • Experiment setup: To make the analysis more principled, we first use gloss labels to calculate the ground truth margin distribution for PHOENIX-2014T (green in Figure 8), with a specific Zipf’s factor of 2.1. We then evaluated the threshold values at [1, GT, 2.3, 3, 4]. Note that Zipf’s factor = 1 means we use len(frames)/len(text) directly to estimate the margin, while GT represents using the ideal len(frames)/len(gloss) to determine the margin (Zipf’s factor = 2.1).
  • Experiment results: The results show that using Zipf’s factors between 1, 2.3 and the GT margin does not lead to significant differences. When Zipf’s factor is set to 4, there is a noticeable drop in performance due to the margin being too large, which reduces the negative sampling interval (high probability of sampling from the same negatives).
Zipf’s factor | 1 | GT | 2.3 | 3 | 4
--- | --- | --- | --- | --- | ---
B@4 | 17.45 | 17.89 | 17.63 | 17.29 | 16.26

Q1: How to systematically determine the optimal margin?

Thank you for your insightful question. Based on our comprehensive sensitivity analysis, we found that SignCL is not sensitive to the threshold and Zipf’s factor. Therefore, we suggest setting the optimal margin approximately equal to the mean ± standard deviation of the estimated average frames of each gloss based on a given dataset, e.g., len(frames)/len(text) * 2. The threshold can be set to mean - standard deviation.
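
For illustration, below is a minimal sketch of the dynamic margin in Formula 4 (the function and variable names are illustrative, not from our released code; len(frames)/len(text) denotes the frame count divided by the number of words in the spoken-language sentence):

```python
def negative_sampling_margin(num_frames: int, num_words: int,
                             zipf_factor: float = 2.3,
                             min_threshold: int = 10) -> int:
    """Dynamic margin for negative sampling, following Formula 4:
    margin = max(10, len(frames) / len(text) * 2.3).

    The average span of one gloss is estimated from the frame count and the
    spoken-language word count, scaled by a speech-to-gesture Zipf factor;
    the minimum threshold keeps the margin from collapsing on short clips.
    """
    estimated_gloss_span = num_frames / max(num_words, 1) * zipf_factor
    return max(min_threshold, int(round(estimated_gloss_span)))

# Example: a 180-frame clip whose sentence has 9 words
# gives 180 / 9 * 2.3 = 46, so negatives are sampled at least 46 frames away.
print(negative_sampling_margin(180, 9))  # 46
```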


W2: The sampling strategy could lead to incorrect positive or negative samples

  • We agree that the sampling strategy might indeed produce errors in certain special cases. However, we would like to emphasize that a range of contrastive learning frameworks demonstrate that contrastive learning can still perform well even when there is noise in sampling strategies[a], like SimCLR[b] and MoCo[c].

  • This robustness is because the contrastive function inherently accommodates variability by focusing on relative differences rather than the absolute correctness of positive and negative pairs. Special cases of incorrect positive or negative samples do not significantly impact overall performance. Our sensitivity analysis also indicates that variations in the margin have minimal effect.


[a] Contrastive Representation Learning: A Framework and Review

[b] A Simple Framework for Contrastive Learning of Visual Representations

[c] Momentum Contrast for Unsupervised Visual Representation Learning

Review
Rating: 6

The main content of this paper is about improving the performance of gloss-free sign language translation. The authors discovered the representation density problem in sign language translation: in the feature space, the visual representations of semantically different sign language gestures tend to be closely clustered together, making it difficult for gloss-free methods to distinguish different gestures and thus significantly affecting translation performance. To solve this problem, the paper proposes a simple and effective contrastive learning strategy, SignCL, which encourages gloss-free models to learn more discriminative feature representations in a self-supervised manner. Experimental results show that SignCL is able to significantly reduce representation density and improve performance in different translation frameworks.

Strengths

  • The authors discovered the representation density problem for the first time in the field of sign language and conducted a detailed analysis to show that this problem does affect the performance of sign language translation. These findings will help advance the field of sign language processing.

  • Based on this finding, the authors proposed a contrastive learning method to improve the representation density problem. Experiments show that this method works well in the gloss-free setting.

  • The authors promise to open-source the code and models.

Weaknesses

  • The core of the contrastive learning method in this paper is the selection of corresponding positive and negative samples. Additional analysis of the sampling strategy would make this work more complete. For example, what is the impact of choosing different distance parameters in Formula 4? In addition, the selection of negative examples does not seem to need to be limited to a single video; for example, randomly selecting other frames with the same signer may be a better choice.

  • The contrastive learning method proposed by the authors can be considered a way to enhance the representation of sign language videos. This method should also have a certain effect on the gloss-based method, although the benefit may not be as large as in the gloss-free setting. I would be happy to see the authors add some relevant data, which could expand the scope of application of this paper's method.

Questions

See Weaknesses.

Limitations

Yes.

Author Response

We sincerely appreciate your detailed comments and insightful suggestions. We provide point-wise responses to your concerns below.

W1: Additional analysis of the sampling strategy would make this work more complete

Thank you for your suggestions. We have added a systematic sensitivity analysis of the sampling strategy.

  1. Before we proceed, let's briefly revisit some details from Formula 4: margin = max(10, len(frames)/len(text)* 2.3).
  • The margin for negative sampling dynamically depends on the estimated average margin of each gloss (i.e., len(frames)/len(text) * speech-to-gesture Zipf’s factor) and a minimum threshold (i.e., 10). The Zipf’s factor, set as 2.3, refers to the speech-to-gesture Zipf’s Law.
  • We calculated the distribution of the dynamically estimated margin, with the results shown in the table below. A more detailed distribution can be seen in the attached PDF.
Margin | [0, 10) | [10, 20) | [20, 30) | [30, 40) | [40, 50) | [50, ∞)
--- | --- | --- | --- | --- | --- | ---
Count | 113 | 3460 | 3264 | 230 | 23 | 6
  • Only 1.6% fall into the [0, 10) range, which means that the margin is primarily determined by the estimated average frames of each gloss in our paper (i.e., len(frames)/len(text) * 2.3).
  2. Sensitivity analysis on the threshold: We designed a minimum threshold to ensure the margin is not too small.
  • Experiment setup: To make the analysis more principled, we evaluated the threshold values at [0, 10, 20, 30, 40, 50]. Note that threshold=0 and threshold=50 indicate that the margin is dominated by the estimated margin or the fixed threshold, respectively.
  • Experiment results: We uniformly trained for 80 epochs on PHOENIX-2014T due to resource limitations. The results in the table below indicate that SignCL is not sensitive to the threshold parameter, with a variance of 0.062.
Threshold | 0 | 10 | 20 | 30 | 40 | 50
--- | --- | --- | --- | --- | --- | ---
B@4 | 17.24 | 17.63 | 17.55 | 17.63 | 17.13 | 17.11
  3. Sensitivity analysis on dynamically estimated margin: We use text length and Zipf’s factor to estimate the average frames of each gloss (gloss-free setting).
  • Experiment setup: To make the analysis more principled, we first use gloss labels to calculate the ground truth margin distribution for PHOENIX-2014T (green in Figure 8), with a specific Zipf’s factor of 2.1. We then evaluated the threshold values at [1, GT, 2.3, 3, 4]. Note that Zipf’s factor = 1 means we use len(frames)/len(text) directly to estimate the margin, while GT represents using the ideal len(frames)/len(gloss) to determine the margin (Zipf’s factor = 2.1).
  • Experiment results: The results show that using Zipf’s factors between 1 and 4 does not lead to significant differences. When Zipf’s factor is set to 8, there is a noticeable drop in performance due to the margin being too large, which leads to a high probability of sampling from the same negatives.
Zipf’s factor | 1 | GT | 2.3 | 3 | 4 | 8
--- | --- | --- | --- | --- | --- | ---
B@4 | 17.45 | 17.89 | 17.63 | 17.29 | 17.10 | 16.26
  4. Conclusion for sensitivity analysis.
  • We noticed that SignCL is not sensitive to the threshold and Zipf’s factor.
  • We believe this insensitivity is because the contrastive function inherently accommodates variability by focusing on relative differences rather than the absolute correctness of positive and negative pairs. The size of the margin boundary does not significantly affect the overall performance of contrastive learning, as long as it is not too large or too small.
  • To systematically determine the optimal margin, we suggest setting it approximately equal to the mean ± standard deviation of the estimated average frames of each gloss based on a given dataset, e.g., len(frames)/len(text)* 2. The threshold can be mean - standard deviation.

Q1: Why not randomly select from other videos with the same signer?

  • This is an insightful question that explores alternative sampling methods. We did consider this approach. However, sampling from other videos with the same signer would require additional labels to identify the signer, which limits the applicability of SignCL. We believe maintaining a gloss-free setting is important as it represents a significant trend in the field.
  • As a supplementary experiment, we attempted to randomly select other frames within the same batch as negative samples (in-batch sampling). Unfortunately, this approach resulted in worse performance on PHOENIX-2014T and CSL-Daily datasets. This decline is because the model may use non-sign language features for contrastive learning, such as signer characteristics and background elements.
  • We have found that sampling within the same video has distinct advantages. It natively ensures consistency of non-sign features, allowing the contrastive learning process to focus on sign-specific features. This is our best practice.
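
To make the in-video sampling strategy concrete, here is a minimal, hypothetical PyTorch-style sketch (for illustration only, not the released implementation): positives are adjacent frames of the same video, negatives are frames at least `margin` frames away, and an InfoNCE-style objective separates them.

```python
import torch
import torch.nn.functional as F

def signcl_style_loss(frame_feats: torch.Tensor, margin: int,
                      temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss over the frames of a single video.

    frame_feats: (T, D) features of one clip. Because the signer and background
    stay constant within a clip, the loss is pushed toward sign-specific cues.
    For each anchor frame t, the positive is its neighbor t + 1 and the
    negatives are all frames at least `margin` frames away.
    """
    feats = F.normalize(frame_feats, dim=-1)
    num_frames = feats.size(0)
    losses = []
    for t in range(num_frames - 1):
        anchor, positive = feats[t], feats[t + 1]
        neg_idx = [j for j in range(num_frames) if abs(j - t) >= margin]
        if not neg_idx:
            continue  # clip too short for this margin
        negatives = feats[neg_idx]                          # (K, D)
        pos_sim = (anchor * positive).sum() / temperature   # scalar similarity
        neg_sim = negatives @ anchor / temperature          # (K,) similarities
        logits = torch.cat([pos_sim.view(1), neg_sim])      # positive at index 0
        target = torch.zeros(1, dtype=torch.long, device=logits.device)
        losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean() if losses else feats.sum() * 0.0
```

In-batch negatives from other videos are deliberately left out of this sketch, mirroring the observation above that cross-video negatives let the model exploit signer or background cues.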

W2: Looking forward to more results on gloss-based settings

Thank you for your insightful question. We appreciate your interest in applying our contrastive learning method to the gloss-based approach. We have shown some results in Figure 3 and Appendix A.3.3, which indicate that the proposed SignCL method also benefits the gloss-based method. To further validate this, we have applied SignCL to gloss-based feature extraction (e.g., Self-Mutual KD [25]) and translation methods (e.g., Joint-SLT [5]). The results indicate that SignCL can indeed enhance fully gloss-based SLT.

Methods / Feature Extraction (PHOENIX-2014T) | WER ↓ | B@4 ↑
--- | --- | ---
Joint-SLT / Self-Mutual KD | 25.38 | 22.79
+ SignCL into Feature Extraction | 24.76 | 23.23
+ SignCL into Downstream Training | 25.12 | 22.92
+ SignCL into Two Stages | 24.58 | 23.46

Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Our response has been available for several days now. With the discussion deadline approaching, we kindly inquire if there is any further information or clarification you might need from our side.

In our response:

We have included additional margin distribution statistics and sensitivity analysis. The experiments demonstrate that the SignCL method is not sensitive to the margin parameter.

Additionally, thanks to your insightful suggestion, we have applied SignCL to the gloss-based method, which has also shown improved performance.

We would greatly appreciate your prompt response and thank you once again for your valuable comments and insightful suggestions. We apologize if this follow-up causes any inconvenience. Hope you have a great day!

Review
Rating: 5

This paper focuses on gloss-free sign language translation (SLT) and is largely motivated by the high cost of annotating glosses. The authors discussed a so-called "representation density problem" in gloss-free SLT where semantically dissimilar signs appear in a similar part of the latent space.

The first technical contribution relates to a metric they introduce to quantify the representation problem, which uses Fisher’s Discriminant Ratio (FDR) to assess the difference between the inter- and intra-gloss features. To improve the separability, as their second contribution, they propose SignCL, a contrastive learning strategy that encourages the model to learn more discriminative feature representations in a self-supervised manner.
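
As a rough illustration of such a ratio (a sketch only; the paper's exact SDR definition may differ), within-gloss scatter can be compared against between-gloss scatter of gloss-labeled frame features:

```python
import numpy as np

def density_ratio(features: np.ndarray, gloss_labels: np.ndarray) -> float:
    """Fisher-style separability sketch: within-gloss scatter divided by
    between-gloss scatter, so that a higher value indicates denser, less
    discriminative representations (matching the 'higher SDR is worse'
    reading used elsewhere in the discussion)."""
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for g in np.unique(gloss_labels):
        cls = features[gloss_labels == g]
        cls_mean = cls.mean(axis=0)
        within += ((cls - cls_mean) ** 2).sum()                      # intra-gloss spread
        between += len(cls) * ((cls_mean - global_mean) ** 2).sum()  # inter-gloss spread
    return float(within / max(between, 1e-12))
```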

The results show significant improvements in BLEU scores on the CSL-Daily dataset compared to state-of-the-art methods.

Strengths

Intuitively the motivation for the paper makes sense, the methods are relatively straightforward (in a good way), and their solution seems to work well. The authors provide a good amount of detail (especially in the appendix) in case someone wanted to reproduce.

Weaknesses

A primary motivating example in the paper is that “RECIPROCATE” and “REVENGE” live in similar parts of the latent space because the hand motions are similar (even though the facial expressions are different). The proposed contrastive approach takes samples from different parts of the same sentence, where each sign should contrast with dissimilar signs in that sentence. Perhaps this is just an issue with the motivation, but it doesn't seem like the approach would help with the motivating example. Similarly, it doesn't seem like the contrastive approach should be any more helpful with gloss-free SLT vs gloss-based SLT.

I am wary about some of the t-SNE visualizations. The Sign Density Ratios don't change that much between approaches, and given that t-SNE is known to be deceptive at times, I wonder if these visualizations are representative or overstated.

Sec 4.3 on qualitative analysis is interesting but is too anecdotal. It would be useful to have a much deeper set of analyses here covering a much larger set of examples.

Overall, I think this paper is certainly not bad, but I'm on the fence about whether the novelty or depth is sufficient for this venue.

Questions

There are two types of issues that are highlighted: (1) challenges where signs have the same hand motion but different facial expression and (2) challenges where the same hand shape is used but with different motions (e.g., piano vs typing). Can you talk specifically about how the proposed approaches help with face and motion errors?

I noticed some odd results in A.3.1. The WER results for train/test/dev are hugely different: 8.68/8.03/25.28. How is it the case that 'dev' has a 25% WER but train and test have a hugely improved 8%?

Limitations

Their responses seem reasonable.

Author Response

We sincerely appreciate your detailed comments and insightful suggestions. We provide point-wise responses to your concerns below.

Q1: How do the proposed approaches help with face and motion errors?

  • Thank you for your question. Our SignCL approach samples positive and negative examples from a single video, which means that non-sign features such as the signer’s background and camera angle remain consistent. Consequently, when the contrastive learning framework distinguishes between positive and negative samples, the model naturally focuses on sign-specific features, such as the differences in facial expressions and hand movements, rather than on non-sign features like the signer or background.
  • To sum up, although SignCL is not explicitly designed to address face and motion errors (as they are not the primary focus of this paper), the strategy of sampling from the same video inherently directs the model's attention to more sign-specific features, such as subtle differences in facial expressions and motions.
  • We would be happy to see future work design face- and motion-aware sign embedding backbones and apply SignCL to them. To the best of our knowledge, frame-based sign embedding models using CNN and ViT architectures are currently the best practice for sign language translation.

Q2: Why are there odd results in A.3.1?

  • Sorry for the confusion. To briefly recap, the purpose of the gloss-sign aligner is only to derive labels to calculate SDR, not for the model itself. We want to ensure the SDR is as accurate as possible, so we mix the test set into the training set to train the gloss-sign aligner. The dev set is reserved for evaluating the training process of the gloss-sign aligner and is not used in the SDR calculation. Therefore, the results for the dev set may appear less optimal.

W1: The proposed SignCL approach does not seem aligned with the primary motivating example:

We respectfully disagree.  Here is our explanation:

  • The examples in Figure 1 or Figure 5 are intended to aid in illustrating the representation density. We do not explicitly tackle these examples in our method. We apologize for any confusion, and we will emphasize these points more clearly in future versions.
  • The key contribution of this paper is identifying the representation density problem for gloss-free SLT for the first time. We discuss the overall discriminability of feature representation of sign gestures, measured by the Sign Density Ratio (SDR).
  • The proposed SignCL serves as a relatively straightforward attempt to improve the discriminability of representations in the gloss-free setting.

W2: It doesn't seem like the contrastive approach should be any more helpful with gloss-free SLT vs gloss-based SLT:

  • Of course, SignCL can also work for gloss-based methods, and we have results in Section 4.1 and Appendix A3.3 that show this. However, the benefit may not be as large as that in the gloss-free setting because gloss-based methods use costly gloss annotations and CTC loss as a discriminative optimization objective.
  • The motivation of SignCL is that gloss-free methods suffer from worse visual representation due to the lack of costly gloss annotations. SignCL is designed in a self-supervised manner and does not rely on any gloss information to fit the gloss-free setting.

W3: t-SNE visualizations might be misleading:

We understand the t-SNE visualizations can be deceptive at times. We want to emphasize that t-SNE is not the primary basis for our conclusions; it serves as a visual aid to better illustrate the representation density problem.

  • The conclusions about representation density and performance drop are derived from the quantitative metrics and experiment results, especially when comparing gloss-free and gloss-based approaches.
  • In Section 2, we comprehensively investigate the representation density problem by using the Sign Density Ratio to measure feature discriminability within existing sign feature extraction methods. We also use sign recognition and translation tasks to analyze how different densities of representations affect performance.
  • We believe the representation density problem is not overstated. As shown by the experiment results in Figure 3 and Appendix A.3.3, gloss-free features indeed exhibit higher SDR and significantly poorer sign language recognition and translation performance. It highlights a potential bottleneck in restricting the performance of gloss-free SLT.

W4: Qualitative analysis in Section 4.3 is anecdotal and needs more depth:

  • Thank you for your insight. We are open to adding more cases and analyses. Unfortunately, the field of sign language research currently lacks an appropriate benchmark that is annotated with a large number of gestures that are similar in motion but distinct in meaning.
  • However, this paper focuses on the overall discriminability of feature representation of sign gestures. We would like to emphasize that, in addition to the visual analysis, we have provided overall sign recognition accuracy as quantitative evidence.
  • In Section 2 and Appendix A.3.3, we examine sign recognition accuracy and translation performance when applying SignCL to GFSLT-VLP pretraining. Figure 3 and Tables 7/8 demonstrate that the representation has better discriminability (lower SDR) and higher performance in sign language recognition after applying SignCL to the GFSLT-VLP baseline. We will make this qualitative analysis a separate section and make it clearer in future versions.
Comment

Dear Reviewer,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper. Our response has been available for several days now. With the discussion deadline approaching, we kindly inquire if there is any further information or clarification you might need from our side.

In our response:

We have provided a clearer explanation of the position of this paper, where this paper focuses on the overall discriminability of feature extraction in sign language translation. We apologize for any confusion caused in the original paper, particularly regarding the cases in Figures 1 and 5.

Additionally, we have included further margin distribution statistics and sensitivity analysis. The experiments demonstrate that the SignCL method is not sensitive to the margin parameter.

We would greatly appreciate your prompt response and thank you once again for your valuable comments and insightful suggestions. We apologize if this follow-up causes any inconvenience. Hope you have a great day!

Author Response

We sincerely appreciate all reviewers for their detailed comments and insightful suggestions. We are encouraged that they find our paper identifies a novel problem (representation density) in gloss-free SLT [Reviewer rTJg and UuUf], and introduces a relatively straightforward SignCL to address this representation problem (simple and effective method) [Reviewer rTJg, UuUf, bAiC and rKGZ].

We would like to address some key concerns below:

  1. We respectfully disagree that our contributions are limited.
  • We are not merely "using known contrastive learning methods applied to the representation of sign language." The key contribution of this paper is identifying the representation density problem for gloss-free SLT for the first time.
  • This discovery is neither obvious nor trivial. It is well-known that gloss-free methods in SLT lag significantly behind gloss-based approaches, but the reasons are still under investigation. We are the first to take a closer look at the representation of sign gestures and demonstrate that gloss-free methods suffer from worse representation density. We highlight that the representation density problem could be a bottleneck in restricting the performance of gloss-free SLT, providing a direction for future gloss-free SLT to learn more discriminative representations.
  • Even though contrastive loss has been widely utilized in other domains, we believe SignCL is a very straightforward and effective solution to the representation density problem in the gloss-free setting. Our experiments show that it can improve performance in two different frameworks by 39% and 46%, respectively. This also highlights that representation density is a critical issue in gloss-free SLT.
  2. Findings and conclusions are not merely based on t-SNE visualizations.
  • In Section 2, we comprehensively investigate the representation density problem within existing sign feature extraction methods by using SDR to measure feature discriminability. We also use sign recognition and translation tasks to analyze how different densities of representations affect performance.
  • Our conclusions are derived from these quantitative metrics and experiment results, especially when comparing gloss-free and gloss-based approaches. The visualizations are used to aid in illustrating the findings about the representation density.
  3. Another main concern is the lack of analysis of the straightforward sampling strategy, which may lead to incorrect positive or negative samples in certain special cases. We provide point-wise responses to this concern below:
  • There is a misunderstanding that the margin in SignCL is fixed or heavily influenced by the threshold. In fact, the margin is dynamically evaluated based on the text length and a validated speech-to-gesture Zipf’s factor (i.e., len(frames)/len(text) * 2.3). The distribution of margins during training is shown in the global PDF.
  • We want to emphasize that the contrastive function inherently accommodates variability by focusing on relative differences rather than the absolute correctness of positive and negative pairs. Therefore, partially incorrect positive or negative samples do not significantly affect performance. This has been verified in our supplemental sensitivity analysis.
  • What's more, a range of related works, like SimCLR and MoCo, demonstrate that contrastive learning can still perform well even when there is noise in sampling strategies.
Comment

We would like to express our sincere thanks to Reviewer 9u2m for his/her patient and thoughtful reminder. We apologize for the lack of rigor in our use of quotation marks and qualifiers in our previous response.

To clarify simply:

The phrase "merely 'using known contrastive learning methods applied to the representation of sign language'" was intended to reference the sentence "The paper is using known methods applied to the representation of sign language," from Reviewer 9u2m's original comment, as well as elements of "The entire paper mainly introduces a contrastive loss (SignCL), which has been widely utilized in other various domains," from Reviewer rKGZ's original comment.

We regret the oversight and will be more careful in our wording moving forward.

Comment

We sincerely thank all the reviewers and ACs for their time and effort in reviewing our work.

We understand there may be some differences in opinion regarding whether identifying the representation density problem as a bottleneck in restricting the performance of gloss-free SLT and providing a straightforward and effective solution to this issue constitutes a solid contribution. Given the time constraints, we hope to further discuss this matter during the closed discussion phase with the other reviewers and ACs.

Once again, we greatly appreciate everyone’s time and thoughtful feedback.

Final Decision

The submission proposed a contrastive method for gloss-free sign language translation, focusing on the representation density problem. 4/5 reviewers are positive about this submission, while reviewer rKGZ has some concerns, especially regarding limited novelty. After reading the submission, comments, and rebuttal, the AC recognizes that the submission still has merit, agrees with most of the reviewers, and suggests accepting this submission.

It is suggested that the related work section be moved from Section 5 to Section 2 or merged into Section 1.