Scaling Sign Language Translation
We explore data, model and language scaling for sign language translation and achieve new state-of-the-art performance on several open-domain benchmarks.
Abstract
Reviews and Discussion
This paper presents an approach to Sign Language Translation (SLT) that aims to scale the field by addressing limitations in data, model size, and the number of translation directions. The authors' key contributions are:
- Data Scaling: The authors leverage diverse and large-scale datasets, including noisy multilingual YouTube SLT data and augmented SLT data generated from video captions, enabling more robust pretraining.
- Model Scaling: They pretrain SLT models at various sizes, initialized from pretrained (m/By)T5 models, demonstrating the impact of larger models on performance.
- Cross-lingual and Cross-modal Transfer: The authors show that cross-lingual and cross-modal transfer from pretraining on multilingual data improves SLT performance, even enabling zero-shot translation, i.e., translating sign language into a spoken language without explicit training in that direction.
- Open-domain SLT Evaluation: The models are finetuned on five downstream open-domain SLT benchmarks covering five different sign languages, leading to substantial quality improvements over existing state-of-the-art methods.
Strengths
- The paper and approach are well motivated.
- The experiments are generally well thought out.
- The writing quality and analyses are generally good and informative; e.g., Sec 4.2 raised questions I had been wondering about and did a nice job of answering them.
- The results relative to SOTA are compelling.
Weaknesses
There are multiple references that claim to be on arXiv but don't seem to exist, including the FLEURS-ASL paper. I found at least three instances:
- Garrett Tanzer. FLEURS-ASL: Including American Sign Language in massively multilingual multitask evaluation. arXiv, 2024.
- Garrett Tanzer. Fingerspelling within sign language translation. arXiv, 2024.
- Garrett Tanzer and Biao Zhang. YouTube-SL-25: A large-scale, open-domain multilingual sign language parallel corpus. arXiv, 2024.
FLEURS-ASL#0 only has 353 sentences but is used for a large percentage of the experiments. I have some questions about whether or not these results are likely to generalize, especially given that these sentences were generated by a single person. Given that the FLEURS-ASL paper isn't actually on arXiv (as of July 13), I don't have a way to understand what types of sentences these include. It's important to understand the biases and breadth of what is covered.
Lack of Comparative Analysis: While the paper surpasses the previous state-of-the-art, a more in-depth comparative analysis of other recent SLT methods, including their strengths and weaknesses, would strengthen the paper's argumentation. Yes, Table 4 does show SOTA numbers, but it would be useful to contrast the approaches.
Minor: Phoenix is referred to in Table 4 and 8 descriptions but only acronyms are used in the tables themselves -- it's unclear which acronym is which.
Questions
See questions above. Especially, can you explain the missing references and talk through characterization of the FLEURS-ASL dataset? Can you also confirm that you are not fine-tuning on FLEURS-ASL#0?
Regarding Table 5 and related analysis, is there a reason why BLEURT could have a much higher correlation compared to BLEU or ChrF?
How accurate are the mediapipe landmarks on messy YouTube data? Have you looked at correlations between landmark accuracy and translation quality?
Limitations
Seems reasonable.
Thanks for your insightful comments!
Re: the missing references
Thanks for pointing this out! Please note that our work builds on prior work discussed through personal correspondence with the authors before publication (hence our use of FLEURS-ASL#0 only, which was available to us before the full dataset was complete). YouTube-SL-25 has since been published on arXiv (https://www.arxiv.org/abs/2407.11144). According to our communication with the authors, the other papers, including the FLEURS-ASL paper, should be released publicly very soon.
Re: characterization of the FLEURS-ASL dataset? Can you also confirm that you are not fine-tuning on FLEURS-ASL#0?
FLEURS-ASL#0 is a translation of 353 sentences from the FLORES/FLEURS devtest set (https://github.com/facebookresearch/flores/tree/main/flores200), from English text into ASL, by a Certified Deaf Interpreter, with at least 5-6 hours of preparation for each hour of recorded content. Note that the translations are performed in groups of sentences, not in isolation, since FLORES is constructed from documents. FLEURS-ASL thus represents the first massively multilingual SLT evaluation benchmark, covering American Sign Language and 200 spoken target languages; in this study we evaluate translation from ASL into 42 spoken languages. Human performance on a random subset of this data (measured with a native Deaf ASL signer with credentials in sign language education) is about 14.8 BLEU / 63.4 BLEURT.
Regarding model finetuning, we never finetuned our models on FLEURS-ASL#0: all the reported results are for pretrained models only.
Re: a more in-depth comparative analysis of other recent SLT methods, including their strengths and weaknesses, would strengthen the paper's argumentation.
We'd like to highlight that the main focus of this paper is to understand and explore how scaling pretraining data, model size, and the number of sign languages improves SLT, which to the best of our knowledge has rarely been studied in the literature. While many intriguing SLT methods have appeared recently, most of them focus on advanced modeling and/or the application of large language models, which is beyond the scope of our study. Also note that we do not claim our method is better than the others.
Re: Phoenix is referred to in Table 4 and 8 descriptions
This is a typo and we will fix it in our next version.
Re: is there a reason why BLEURT could have a much higher correlation compared to BLEU or ChrF?
We hypothesize that this is because BLEURT measures semantic similarity, which captures more than the exact n-gram matching used by BLEU/ChrF. For example, the model often starts learning SLT by outputting some key words/phrases correctly. Such weak signals can hardly be captured by BLEU/ChrF.
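To make this concrete, below is a minimal, hypothetical sketch (using sacrebleu; the sentences are invented for illustration and are not from our data) showing how a keyword-only hypothesis receives little credit from BLEU/ChrF, whereas a learned metric like BLEURT can still reward the partial semantic overlap:

```python
# Hypothetical illustration (not from the paper): keyword-only outputs get
# little credit from n-gram metrics, even though they carry partial meaning.
import sacrebleu

reference = ["The weather forecast says it will rain heavily tomorrow afternoon."]
hypothesis = "rain tomorrow afternoon"  # typical early-training SLT output

bleu = sacrebleu.sentence_bleu(hypothesis, reference)
chrf = sacrebleu.sentence_chrf(hypothesis, reference)
print(f"BLEU: {bleu.score:.1f}")  # low: missing higher-order n-grams + brevity penalty
print(f"ChrF: {chrf.score:.1f}")  # modest: low recall of the reference's character n-grams
# A learned metric such as BLEURT compares semantics rather than surface
# n-grams, so partially correct key words/phrases can still move the score.
```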
Re: How accurate are the mediapipe landmarks on messy YouTube data?
This is a great question! Our informal evaluation, eyeballing some landmark examples, showed acceptable results. However, we believe the landmarks generated by MediaPipe contain varying degrees of noise on YouTube data, particularly when there are multiple signers in a video. Still, our scaling method can capture useful signal from these data. We leave the study of how landmark accuracy affects translation quality to future work.
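For reference, here is a rough, hypothetical sketch (not our exact pipeline) of how per-frame landmarks can be extracted from a video with the MediaPipe Holistic solution, using the fraction of frames without a detected pose as one crude proxy for tracking failures on messy footage:

```python
# Hypothetical sketch only (not our production pipeline): per-frame MediaPipe
# Holistic landmarks from a video file; frames with no detected pose are
# counted as a crude proxy for tracking failures on in-the-wild footage.
import cv2
import mediapipe as mp

def extract_landmarks(video_path: str):
    landmarks, total, missing = [], 0, 0
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.holistic.Holistic(static_image_mode=False) as holistic:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            total += 1
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.pose_landmarks is None:
                missing += 1
            landmarks.append((results.pose_landmarks,
                              results.left_hand_landmarks,
                              results.right_hand_landmarks))
    cap.release()
    return landmarks, missing / max(total, 1)  # (per-frame landmarks, miss rate)
```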
This paper proposes to improve the open-domain sign language translation by scaling pretraining data, model size, and number of translation directions. The proposed approach involves pretraining with a mixture of noisy multilingual YouTube SLT data, parallel corpora, and SLT data augmented with MT models. Experiments based on (m/By)T5 models show substantial improvements over the baselines on several benchmarks, surpassing the previous state-of-the-art (SOTA) by wide margins.
Strengths
The paper is well written and provides many insights. The problem it addresses, improving sign language translation performance in open-domain settings, is important. Scaling the model size, number of languages, and data size leads to impressive performance based on (m/By)T5 models.
Weaknesses
- In Figure 7, the legend is blocked.
- There is no open-source code or model, and reproducing this work requires substantial resource expenditure.
Questions
In the zero-shot ASL-to-X translation, does the model encounter similar issues seen in zero-shot text translation, such as translations in wrong languages?
Limitations
The authors do not release the code or model.
Thanks for your insightful comments!
Re: In Figure 7, the legend is blocked.
We will fix it in the next version.
Re: There is no open-source code or model
While we cannot release the code and model due to policy restrictions, we believe our findings and scaling results could shed light on the development of SLT research and inspire more follow-up studies.
Re: does the model encounter similar issues seen in zero-shot text translation, such as translations in wrong languages?
This is a great question! We checked the translations in zero-shot directions and have some interesting results.
We evaluated the language accuracy and empty rate for the setting in Figure 2 (b), as shown below.
| Language | Baseline | w/ MT X<->En |
|---|---|---|
| es | 90.9 | 95.2 |
| de | 85.6 | 93.2 |
| fr | 92.4 | 94.1 |
| it | 82.4 | 91.2 |
| cs | 1.7 | 61.2 |
| pl | 78.5 | 97.7 |
| ru | 95.2 | 85.3 |
| zh | 45.0 | 15.0 |
| ar | 72.2 | 75.4 |
| ja | 76.5 | 73.7 |
| hi | 42.2 | 4.8 |
Language accuracy: the percentage of translations produced in the correct target language (higher is better).
| Language | Baseline | w/ MT X<->En |
|---|---|---|
| es | 8.2 | 2.0 |
| de | 4.0 | 0.0 |
| fr | 6.5 | 0.6 |
| it | 5.1 | 0.6 |
| cs | 7.6 | 0.6 |
| pl | 2.3 | 0.0 |
| ru | 0.0 | 7.4 |
| zh | 2.8 | 17.3 |
| ar | 5.1 | 18.7 |
| ja | 0.3 | 0.3 |
| hi | 1.4 | 43.3 |
Empty rate: the percentage of empty translations (lower is better).
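For clarity, here is a minimal sketch of how these two metrics can be computed; langid is used purely for illustration (an assumption on our side, not necessarily the tool behind the numbers above), and measuring language accuracy over non-empty outputs only is likewise an illustrative choice:

```python
# Illustrative sketch of the two metrics. Assumptions: langid as the language
# identifier; language accuracy computed over non-empty outputs only.
import langid

def lang_accuracy_and_empty_rate(hypotheses, target_lang):
    empty = [h for h in hypotheses if not h.strip()]
    nonempty = [h for h in hypotheses if h.strip()]
    correct = sum(1 for h in nonempty if langid.classify(h)[0] == target_lang)
    lang_acc = 100.0 * correct / max(len(nonempty), 1)
    empty_rate = 100.0 * len(empty) / max(len(hypotheses), 1)
    return lang_acc, empty_rate

# Usage: lang_accuracy_and_empty_rate(model_outputs, "es")
```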
We noticed that zero-shot SLT also suffers from off-target translation, particularly for languages distant from English: for example, the Baseline only reaches a language accuracy of 1.7, 45.0, and 42.2 for cs, zh, and hi, respectively.
Adding parallel translation data generally improves language accuracy, e.g., from 1.7/78.5 to 61.2/97.7 for cs/pl. But there are also exceptions, such as zh and hi, where accuracy drops from 45.0 and 42.2 to 15.0 and 4.8. A deeper inspection reveals that jointly training with translation data leads to more empty outputs for these languages: the empty rate increases from 2.8/1.4 to 17.3/43.3 for zh/hi. We suspect this may be because 1) these languages have significantly less parallel MT data (e.g., hi only has 1.2M examples), and 2) the parallel corpus from MADLAD-400 can also be quite noisy.
We will add these results in our revised version.
Thank you for further addressing my concerns. The new results in the zero-shot direction are interesting. I will keep my positive score.
This paper attempts to advance the development of sign language translation (SLT) technology by using large-scale pretraining data, expanding model size, and adding translation directions. Through extensive experiments, the authors draw many useful conclusions and show that this work achieves the best results on multiple open-domain SLT benchmarks covering multiple sign languages, although translation quality still needs further improvement to meet practical needs.
Strengths
- The exploration of sign language translation in a large-scale, open-domain setting is very valuable. Previously, much sign language translation work was limited to narrow domains.
- The authors conducted a large number of experiments and reached some useful conclusions that can guide future work on sign language translation.
- Combining all the performance-enhancing methods, the authors achieved state-of-the-art results on multiple downstream sign language translation tasks.
Weaknesses
- Unfortunately, the sign language data and code used in this article will not be open sourced. I also cannot seem to find the corresponding paper, "Youtube-sl-25: A large-scale, open-domain multilingual sign language parallel corpus", for the YT-Full dataset, which makes it difficult for others to reproduce this work.
- As far as I know, using the original RGB sequence as input often achieves better results than using only the pose sequence as the sign language input, because it contains more information. Can the authors explain why the RGB sequence is not used as input?
- The authors do not seem to state the specific computing resources and time spent on the experiments in Section 3.
Questions
See Weaknesses.
Limitations
Yes.
Thanks for your insightful comments!
Re: the sign language data and code used in this article will not be open source
Please note that YouTube-SL-25 has been released on arXiv (https://www.arxiv.org/abs/2407.11144), including release of the clean subset of the data. We built on their work through personal correspondence with the authors before it was widely published.
We are unable to release the source code because it uses an internal framework that has not been open sourced.
Re: Can the author explain why the RGB sequence is not used as input?
The RGB sequence was not used as input due to privacy considerations, as explained in Appendix A.
Directly modeling RGB sequences from video-sharing sites may raise ethical issues, as highlighted in "Towards Privacy-Aware Sign Language Translation at Scale" by Rust et al., and we believe a comparison between pose and RGB inputs is beyond the scope of our work.
It would be an interesting question for future work to evaluate—in light of larger sign language datasets—how much of the gap between pose and RGB inputs is actually due to extra information in the RGB input vs. availability of pretrained vision encoders.
Re: The author does not seem to state the specific computing resources and time spent on the experiment in section 3.
As stated in line 160 of Section 3, we moved the details of computing resources and time to Appendix B.1. Specifically, we pretrained models for up to 1M steps using 64/64/128 TPU-v3 chips for Base/Large/XL, which takes about 7-20 days depending on the model scale and resource conditions.
The paper focuses on scaling sign language translation (SLT) by leveraging large-scale pretraining data, increasing model size, and expanding the number of translation directions. The study demonstrates the effectiveness of data/model scaling and cross-lingual cross-modal transfer in improving SLT performance across multiple benchmarks. The authors showcase substantial quality improvements in SLT through scaling, surpassing previous state-of-the-art results by wide margins.
Strengths
1. The paper pushes the frontier of SLT through large-scale pretraining and the exploration of various data sources. This approach demonstrates a novel application of scaling techniques in the SLT domain.
2. The research employs a rigorous methodology, including the use of different pretraining tasks and modalities. The thoroughness of the experimental design and validation is evident.
3. The presentation is clear and well-structured, allowing for an easy understanding of the research approach and findings. The logical flow of the paper aids in comprehending complex concepts.
4. The work has significant potential to advance open-domain SLT for multiple sign languages through scalable methods. The substantial quality improvements over previous state-of-the-art results highlight the impact of the research.
Weaknesses
1. Lack of discussion on the impact of model architectures and training strategies. The study lacks an in-depth exploration of how different model architectures or training strategies could affect SLT performance, potentially limiting the understanding of scalability and generalization.
2. Limited exploration of the robustness to variations in data quality and model complexity. The research does not extensively discuss the robustness of the proposed methods to changes in data quality and model complexity, which could affect the reliability of the study's results.
Questions
1. Have the authors considered the impact of different model architectures or training strategies on SLT performance, and how these factors could influence scalability and generalization?
Limitations
This paper adequately discusses the limitations in Section 6.
Thanks for your insightful comments!
Re: how different model architectures or training strategies could affect SLT performance; how these factors could influence scalability and generalization?
Thanks for this question! Firstly, it would be great if you could provide more context about the "model architectures and training strategies" you're interested in.
According to our experiments in Table 1, Section 4.1, model architectures and training strategies have a non-negligible influence on SLT performance. We provided ablations for different T5 model families, covering T5, mT5, and ByT5, and for model sizes at different scales, from Base and Large to XL. While these models follow the same encoder-decoder Transformer architecture, they differ significantly in their pretraining corpus (C4 vs. mC4), parameter allocation, and vocabularies.
Two intriguing observations from Table 1: 1) ByT5 generally performs better than T5 and mT5 across scales, even though ByT5 and mT5 used the same pretraining corpus; 2) model scaling doesn't necessarily lead to improved performance; in particular, we observed consistently worse performance for Large compared to Base.
We acknowledge the existence of other modeling variations, such as decoder-only Transformers (LLaMA, Gemma) and non-Transformer models (Mamba). We believe that different architectures and training strategies endow the model with different inductive biases that substantially affect its generalization and adaptability to cross-modal tasks like SLT. But exhaustively exploring how these variations affect SLT performance is beyond the scope of this study.
Re: the robustness of the proposed methods to changes in data quality and model complexity
Thanks for this question! The results in Table 4 and the YouTube-ASL performance [1] demonstrate the robustness of our method. Please note that YT-ASL is a noisy superset of YouTube-ASL, containing significantly more but lower-quality sign language data (~2800 hours vs. ~1000 hours).
Based on Table 4, using YT-ASL (ID: 2) achieves a BLEURT score of 51.74 on How2Sign, substantially outperforming the YouTube-ASL result, 46.63. Adding noisy multilingual sign language data (ID: 6) further increases the performance to 53.51.
We will add a discussion on the robustness of our methods in the revised version following your suggestion!
[1] Uthus et al., 2023. YouTube-ASL: A Large-Scale, Open-Domain American Sign Language-English Parallel Corpus
The author's response addressed my question, so I still keep a positive rating.
The paper offers a comprehensive study of how to improve sign language translation by scaling up training in a variety of ways. Interesting recipes are offered, such as using data from a noisy multilingual YouTube dataset, parallel text corpora, and data augmentation with translation models. The reviews for the paper were quite positive, noting that it pushes the frontier of SLT. There was some concern about the unavailability of the data, which the authors addressed by providing an arXiv link to the paper that introduces YouTube-SL-25 (hopefully they will remember to update the paper appropriately, if required). While the code for the paper will not be available, it was felt that the methods contributed would still lead to useful research in the community on this specific problem.