ASROB: Measuring Automatic Speech Recognition from One Book
A benchmark for learning to transcribe and translate speech in Kalamang (a language with virtually no web presence) from book-scale resources
Abstract
Reviews and Discussion
This paper introduces a new task called ASR from One Book (ASROB), which involves speech recognition and translation of the indigenous language Kalamang, using only the grammatical resources available for the language. For Kalamang, the resources are limited to a grammar book and a small set of parallel sentences with English. The authors collected 15 hours of audio recordings from native speakers and transcribed and translated them using an anchor language.
The benchmark dataset comprises discourse-level (long-context) audio recordings that require transcription and translation into English. This work aims to encourage support for more indigenous languages, leveraging recent advancements in large language models (LLMs) to promote inclusivity.
Strengths
- This paper introduces a novel task of transcribing and translating indigenous languages with minimal resources. Due to the long-context nature of the audio and the limited language resources, only advanced models with large context windows are suitable for evaluating such tasks. Attempting these tasks with smaller models would present an interesting challenge.
- The evaluation of ASROB using Gemini models with multimodal capabilities and extended context windows demonstrates strong performance on the task, even surpassing the human baseline established by the paper's first author.
- The authors have taken sufficient measures to prevent data contamination, ensuring that the resources are not included in LLM pre-training.
Weaknesses
While the proposed method (using Gemini Pro 1.5) outperforms human baselines, further evaluation is needed to identify which types of resources could improve the use of LLMs for this task. An ablation study using different portions of the grammar book or varying numbers of parallel sentences could help determine which resources are most beneficial and inform future research on resource acquisition for indigenous languages.
Questions
Do you have insights into the types of errors made by the multimodal LLMs? Could these errors be addressed by adding more audio data, or are they due to limitations in the grammar books that are specific to audio-related tasks?
Thank you for reading and reviewing our paper! We are glad you appreciate the novelty of ASROB in pushing the boundaries of multimodal LLM evals and our attention to preventing contamination.
Re Weaknesses:
Improving results on current language models through prompting / RAG is important for understanding how models behave right now and achieving practical results sooner, but we would hesitate to draw conclusions yet about what kinds of data to collect based on these ablations. Gemini 1.5 Flash and Pro, for example, show huge differences in performance given the same data, and we would expect this trajectory to continue as models improve.
Re Questions:
Most of the errors seem to be limitations of the model rather than the in-context data. For example, as shown in Figure 4 and discussed in lines 357-360, the human outputs may misunderstand speech but are consistently grammatical, whereas the model outputs (even when they understand speech better) are often ungrammatical. And the models do not always monotonically improve given additional context.
This paper introduces a new benchmark, ASROB, built on top of MTOB, which is a benchmark made of a Kalamang grammar, an English-Kalamang lexicon and English-Kalamang parallel sentences. ASROB adds 15 hours of transcribed and translated Kalamang speech to English. Human and model (gemini 1.5 flash and pro) baselines are provided and analyzed with different in context learning settings. Initial results show that some of the model baselines surpass the human baseline.
Strengths
- The work is meaningful as it presents a testbed for improving zero-shot/few-shot capabilities of LLMs in a real-world low resource language scenario.
- The presentation is excellent: details, flow, and justifications for all design choices for the benchmark and the baseline models.
- Experimental results are quite thorough
Weaknesses
- While the paper presents very interesting results and analyses of baselines models, the main contribution from a dataset perspective is limited to a low resource speech recognition and translation corpus.
- Even if the benchmark is meant to test in-context learning, other baselines (e.g., finetuning MMS (https://github.com/facebookresearch/fairseq/tree/main/examples/mms)) would strengthen the paper.
Questions
316: “we surprisingly do not see much difference between the two sets” What is the intuition or explanation for this?
Comments rather than questions:
- Mitigations for contamination: the authors could consider using the noai tag
- gpt-4o: Moshi (https://arxiv.org/abs/2410.00037) can also be cited
Thank you for your valuable comments. We appreciate your recognition of the benchmark’s concept, presentation, and experiments.
Re Weaknesses:
- Ultimately yes, the artifact is a benchmark in one particular held-out domain. We see the contribution as curating previously released resources in a careful way that enables and inspires new machine learning research. Constructing novel yet interesting long context test cases is extremely time-consuming, which is why almost all long context benchmarks are synthetic.
- We will add a baseline for finetuning a pretrained speech recognition model. (Keep in mind that this baseline will be somewhat disadvantaged: as mentioned on lines 172-174, the human caption alignments are somewhat loose, speech recognition models typically use fairly short context windows, and it is not straightforward to incorporate the text-only data.)
Re Comments:
- Thanks for suggesting the noai tag. It seems like it is only relevant for image content sourced in HTML on websites, but please correct us if we are misunderstanding.
- We will add some references to smaller scale audio-text language models including Spirit LM and Moshi.
Thank you for the additional information and the potential new experiment on finetuning a pretrained ASR model (IIUC it's actually possible to update the pdf to reflect that; please let me know if the paper was updated). Re the noai tag, I believe noai is general while noimageai is specific to images.
This paper explores the mixed-modality in-context learning (ICL) capabilities of Large Language Models (LLMs) for low-resource languages.
Specifically, it investigates how Gemini 1.5 Flash/Pro models perform on automatic speech recognition (ASR) and speech-to-text translation (STT) tasks for the Kalamang language under zero-shot and few-shot scenarios.
The results demonstrate that, in certain settings, these models can outperform human evaluation baselines on both tasks.
Strengths
- This paper focuses on leveraging LLM techniques for low-resource languages that are primarily oral and lack written traditions. By drawing more attention to these communities, it could benefit the groups that use these low-resource languages.
- This work releases new data—including 15 hours of transcribed and translated Kalamang speech—and provides human evaluation baselines, which will help future studies focusing on this language.
- This work provides a benchmark for mixed-modality ICL, particularly for natively multimodal models.
Weaknesses
- Lack of Novelty: This paper primarily builds on the existing MTOB benchmark, adding only a speech modality that directly uses the multimodal capabilities of the Gemini LLM. This addition is relatively straightforward, limiting the paper's innovative contribution.
- Limited Scope: This study focuses solely on one language (Kalamang) and one LLM (Gemini 1.5), which weakens the generalizability of the results. While the authors note that Gemini is "the only LLM with publicly available native audio input at the time of writing," the study would benefit from including additional low-resource languages, especially from diverse language families, to provide a more comprehensive evaluation.
- Inappropriate Experiment Design: The human evaluation baseline in this study appears unclear and potentially flawed for several reasons:
- The results are likely influenced by various factors, such as the human learner’s education level, familiarity with similar languages, and other personal attributes. However, the paper does not provide details on these factors or mention the number of human learners involved, which could significantly impact the baseline’s consistency.
- The human learner was only given 1 hour of study material, whereas the LLM had access to all 15 hours of content.
These issues raise serious questions about the validity of the human evaluation baseline in this study.
- Lack of Detailed Explanation and Analysis: The paper lacks sufficient explanation and analysis, particularly regarding the ASR results in Figure 3, where the outcomes are quite surprising. It would benefit from a deeper analysis on points such as:
- why including the Lexicon yields the best performance
- why adding more examples negatively impacts performance
Questions
- Clarification on Few-Shot Examples: It is unclear whether discourse-level or phrase-level examples are used as few-shot examples.
- Accuracy of CER Measurement: Given that "the audio files include interspersed Papuan Malay (the region's vehicular language), not just Kalamang," it is unclear how Character Error Rate (CER) is measured accurately in this code-switching context.
- Significance of BLEU Improvement: In the STT task, Gemini Pro achieves a BLEU score of 1.66 without any specific knowledge of Kalamang in the prompt. Is an improvement to 6.53 truly significant in this context?
Thank you for your thoughtful comments.
Re Weaknesses:
We agree that the ASROB benchmark is straightforward, but that doesn’t mean it lacks novelty or contribution. We think it is fair to say that—while it clearly builds on MTOB—no other benchmarks like ASROB exist, and it will enable and inspire new research in speech-text LLMs and speech technology for endangered languages. Simplicity is a virtue in benchmark design, and in this case the simple modeling goal (stick everything in a long context LLM) is what unlocks the social benefit.
In lines 438-440 we agree that our use of a single language (Kalamang) is a limitation of ASROB, and suggest that this should be improved in future work. It is challenging to introduce a new kind of benchmark and at the same time introduce resources from several different endangered language communities in a way that follows best ethical practices of the field.
We think we do a good job of emphasizing throughout the paper the ways in which the human baseline is limited, but we are happy to make this more consistent or add more caveats in any particular locations you suggest. The human baseline is only intended to lower bound achievable performance, not make a generalizable statement about human capabilities overall; this is why we describe it as “a human”.
In terms of the learner’s background: In footnote 8 we mentioned that they have studied a variety of languages, but none from Southeast Asia or Oceania. We avoided going into more depth in the submission to avoid anonymity concerns but can add more details in the final paper. The MTOB paper included more context about the learner’s education/training that we will summarize here.
We can add a table with more qualitative examples in the appendix, along with some discussion of them. Could you clarify exactly what you mean by “deeper analysis” on the reasons different context settings behave differently? Most of the time long context LLMs simply fail because they are unable to effectively utilize all of the provided context (and additional context chunks serve as extra distractors), or because they are biased towards context near the beginning and end of the context window. We can add some citations to MTOB and followup works (e.g. https://arxiv.org/abs/2409.19151) explaining that the lexicon has been shown to be advantageous because it presents unique content without repetition/redundancy.
Re Questions:
- All of the experiments in this paper use discourse-level examples, as described on lines 168-170. We will clarify this around lines 321-323 and other places where we discuss previous phrase-level results.
- We do not claim that the CER on the benchmark means something in absolute terms, and in lines 225-226 we mention that the interspersed Papuan Malay will artificially improve the scores a bit compared to someone who doesn't know Papuan Malay. Assuming that LLMs generally have similar understanding of Papuan Malay (and especially when comparing to zero-shot performance), differences in CER across prompts or models are meaningful.
- Yes, there is a clear gap between the zero-shot setting and all other settings in terms of BLEU. 1.66 BLEU zero-shot is not surprising, given that as we note in lines 307-308 the model can understand and translate Papuan Malay codeswitching within the recordings. We can compute the statistical significance of the comparison if that is what you mean by “significant”.
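For reference, a minimal sketch of how these metrics might be computed, assuming jiwer for CER and sacrebleu for BLEU and chrF++; the reference/hypothesis lists are hypothetical placeholders aligned at the recording level, not the paper's actual evaluation code:

```python
# Hedged sketch: assumes jiwer and sacrebleu are installed; the lists below are
# hypothetical placeholders, not the actual ASROB evaluation pipeline.
from jiwer import cer
import sacrebleu

asr_refs = ["kalamang reference transcription ..."]  # one string per recording
asr_hyps = ["kalamang model transcription ..."]
st_refs = ["English reference translation ..."]
st_hyps = ["English model translation ..."]

# Character error rate for ASR; code-switched Papuan Malay spans are scored
# like any other characters, per the caveat above.
asr_cer = cer(asr_refs, asr_hyps)

# Corpus-level BLEU and chrF++ (chrF with word_order=2) for speech-to-text translation.
bleu = sacrebleu.corpus_bleu(st_hyps, [st_refs])
chrfpp = sacrebleu.corpus_chrf(st_hyps, [st_refs], word_order=2)
print(f"CER={asr_cer:.3f}  BLEU={bleu.score:.2f}  chrF++={chrfpp.score:.2f}")
```

For the significance question, recent sacrebleu releases also include paired bootstrap resampling (the `--paired-bs` CLI mode), which could compare the zero-shot and best-context outputs directly.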
Could you clarify exactly what you mean by “deeper analysis” on the reasons different context settings behave differently?
The quantitative results in Appendix B lack a clear and consistent pattern. For example:
- ASR Task: The "pro" model achieves its best performance on L+S & 3^3, while the "flash" model performs best on S & 3^0. However, the "pro" model improves with the addition of examples, whereas the "flash" model does not.
- ST Task: Adding examples consistently degrades performance.
The authors suggest that "we surmise that ICL at the granularity of long recordings (with relatively long outputs) is more challenging for the models" and that "the models seem to have difficulty integrating context without few-shot structure in this mixed-modal setting."
What is the underlying intuition behind these mixed results? Why do some combinations work in some cases but not in others? At the very least, under what specific circumstances is long-context ICL more likely to succeed? These remain unclear even after reading this paper.
We can compute the statistical significance of the comparison if that is what you mean by “significant”.
Additionally, how should we interpret a BLEU score of 6.53, given that it is very low? Does this indicate that the LLM has genuinely learned something, or is it merely predicting some of the most frequent words at random?
Thank you for your response.
Trends in results:
Re Flash vs. Pro, we discuss in lines 297-298 that Pro outperforms Flash in every setting, and as you quote later that Pro is able to incorporate more context than Flash. This is unsurprising because Pro is an overall better/larger model than Flash (which is distilled from Pro according to the Gemini 1.5 paper).
Re why specific combinations of context work and others don't: it is difficult to give rigorous answers for opaque API-based models. As we discuss as motivation throughout the paper, this kind of mixed-modal task is rather unique; most evals test multimodal tasks with more narrowly defined structure, e.g., speech to text, whereas ASROB combines multiple different kinds of resources as context: a grammar book, a wordlist, parallel sentences, and speech-text audio recordings. And while it may be somewhat common to have textual instructional materials for languages on the web, it is rare to have interspersed text and audio instructional materials. So we assume the poor scaling is caused by a combination of lack of similar pretraining/posttraining data, as well as model capacity/long context ability to some degree. And as we speculated above, lexicons are relatively information-dense so they could be helpful while introducing minimal distractors.
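For concreteness, a minimal sketch of how this kind of mixed-modal context might be assembled with the public Gemini API; the file names, shot count, and instruction wording are illustrative assumptions rather than the paper's exact prompts:

```python
# Hedged sketch using the google-generativeai client; paths and prompt wording
# are placeholders, not the actual ASROB prompt format.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Text resources inherited from MTOB (hypothetical local copies).
grammar = open("kalamang_grammar.txt").read()
lexicon = open("kalamang_lexicon.txt").read()
parallel = open("kalamang_parallel_sentences.txt").read()

# Few-shot (audio, transcript) pairs, e.g. the 3^1 setting with 3 recordings.
shots = []
for i in range(3):
    audio = genai.upload_file(f"train_recording_{i}.mp3")
    transcript = open(f"train_recording_{i}.txt").read()
    shots += [audio, f"Transcription:\n{transcript}"]

test_audio = genai.upload_file("test_recording.mp3")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [grammar, lexicon, parallel, *shots, test_audio,
     "Transcribe the final recording into Kalamang."])
print(response.text)
```

Interleaving each recording with its transcription gives the few-shot structure discussed above; dropping or reordering these parts reproduces the different context settings.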
BLEU score:
Keep in mind that 6.53 BLEU is low in absolute terms, but higher than the human baseline (which constitutes the cumulative effort of tens of hours from an NLP researcher with some formal linguistics training and a lot of experience learning languages). As we discuss in the paper, this corresponds to the model's improved prior understanding of Papuan Malay (captured by the 1.66 BLEU without context) and better audio understanding, vs. the human's better ability to integrate the text resources. The translations from Gemini 1.5 Pro in the best setting correctly recognize and translate certain Kalamang words that are unknown without context (and it adapts the translation to make sense given these words), but it clearly still does not understand all source content.
We will post a qualitative example excerpt from S2TT in a separate comment, of the kind that we promised to add in an appendix table. (We will post it separately with reduced visibility to reduce the chance of contamination, since the example is from the test set and actually contributed to the stated BLEU scores. The final appendix example will be held out from the train set instead, like the one in Figure 4.)
In this work, the authors present a new benchmark, ASROB (Automatic Speech Recognition from One Book), for long in-context learning using 15 hours of transcribed and translated Kalamang speech. Kalamang is an extremely low-resource language, spoken by fewer than 200 people living on an island in Indonesian Papua. ASROB can be used to evaluate multimodal in-context learning where the entire data is fed as a long prompt to a natively multimodal LLM like Gemini 1.5. Gemini 1.5 Pro yields less than 25% CER on ASR and around 6.5 BLEU on spoken translation (to English). In contrast, a human Kalamang learner exposed to the same Kalamang resources achieves 34% CER and 4.5 BLEU on the same evaluation sets.
Strengths
- This is the first work that shows the power of long in-context learning for multimodal tasks like ASR and spoken translation. With no training, merely feeding all the data as an input prompt to a model like Gemini 1.5 Pro yields an impressive CER of around 25% on ASROB for a new language like Kalamang.
- The authors publicly release the new benchmark ASROB that contains 15 hours of transcribed and translated speech in Kalamang. This is an important step towards language preservation and a vital contribution to existing data resources for extremely underresourced languages of the world.
- The authors have very carefully considered the ramifications of releasing such a resource and put all the necessary safeguards in place.
- The draft is very well-written and clear in its exposition.
Weaknesses
- One main weakness is in the baselines. There is no comparison to regular finetuned baselines for the ASR and spoken translation (ST) tasks, that make use of the 15 hours of transcribed and translated speech. While the main objective of this work is to promote multimodal in-context learning with text and speech, it would still be useful to see how existing pretrained ASR/ST models fare after finetuning with labeled Kalamang speech.
- For the human baseline, the human only closely follows a little more than 1 hour of speech recordings rather than the full 15 hours. This makes the human baseline considerably weaker than the model baseline, apart from all the other factors mentioned in Section 4.1 that make the human baseline weaker. In the abstract, the authors mention "already beating the 34.2% CER and 4.51 BLEU achieved by a human who learned Kalamang from the same resources" which is a bit misleading given the 1 hour vs. 15 hour disparity. The authors should consider softening this claim.
Questions
- As mentioned under weaknesses, there are no comparisons to standard finetuned ASR and ST using the 15 hours of labeled Kalamang speech. We could start with a pretrained multilingual, multimodal model (e.g., Whisper, Seamless M4T, etc.), and finetune on the labeled Kalamang data using a parameter-efficient finetuning approach like LoRA. One could cascade the speech input and concatenate the outputs to mimic the evaluation setting of ASROB. These are important baselines to compare against and will also help calibrate the current CER/BLEU numbers better (see the rough sketch after this list).
- Since the BLEU scores are quite low, like the authors did with CER for the ASR task, it might be useful to see a character-level evaluation metric (such as chrF++) for spoken translation as well.
- This is a somewhat minor point, but why is the benchmark named ASROB when the evaluations are on both ASR and spoken translation? ASROB implies that only transcribed speech is available as part of the benchmark.
- From Fig 3, could the authors comment on why there's a degradation in WER and BLEU scores when increasing the number of few-shot audio samples from 3^2 to 3^3?
- For audio in the prompt, the authors increment by powers of 3 from 1-shot to 27-shot. It would be useful for the reader to know how many hours of speech each few-shot setting (3,9,27) amounts to.
- In Fig 3, the human baseline is placed against 27 audio examples (on the x-axis) which is presumably much more than 1 hour of speech? Given that the human only closely followed roughly an hour of speech, can the authors clarify the placement of the human learner dot in Fig 3?
- A minor editorial comment: The first line of the abstract is identical to the first line of the introduction.
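For concreteness, a rough sketch of the kind of parameter-efficient finetuning baseline suggested above, assuming a Whisper checkpoint with Hugging Face transformers and peft; the checkpoint, hyperparameters, and data handling are illustrative assumptions, not a prescribed recipe:

```python
# Hedged sketch: LoRA-finetune Whisper on short Kalamang segments.
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Hypothetical (wav_path, transcript) pairs obtained by re-segmenting the ~15h
# of captioned speech into <30 s chunks, since the Whisper encoder expects
# short windows.
train_pairs = [("kalamang_segment_000.wav", "kalamang transcript ...")]

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Insert LoRA adapters into the attention projections; the base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for wav_path, transcript in train_pairs:  # batching and epochs omitted
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    features = processor(wav.numpy(), sampling_rate=16_000,
                         return_tensors="pt").input_features
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At evaluation time, the long test recordings could be chunked the same way and the per-chunk hypotheses concatenated, mimicking the cascading/concatenation setup described above.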
I am happy to revise my scores after hearing from the authors during the rebuttal phase.
Thank you for your thoughtful comments!
Re Weaknesses:
- We will add a baseline for finetuning a pretrained speech recognition model. (Keep in mind that, in addition to not using the text-only data, this baseline will be somewhat disadvantaged because the human caption alignments are somewhat loose (as mentioned on lines 172-174) and speech recognition models typically use fairly short context windows.)
- Good point, we will soften the claim in the abstract to match the factors described in the paper, to something like “achieved by a human who learned Kalamang in the same task framing”, or “(with some caveats)”.
Re Questions:
- We will add a finetuning baseline as promised above.
- We will add chrF++ scores, or perhaps BLEURT averaged across chunks of the output (since it doesn’t work as one big output).
- “[ASR/S2TT]OB” doesn’t have quite the same ring to it. 🙂 It is a fair point though. Maybe we could call the tasks ASROB and S2TTOB in the body of the paper even if the title stays the same. We expect ASR to be the primary task the benchmark is used for in practice, in the same way that ASR is the primary task for FLEURS even though it also supports S2ST.
- We discuss this in lines 299-313 but can clarify that even Gemini 1.5 Pro is mixed in ASR for 3^3. These kinds of long multimodal prompts seem to be pushing the limits of in-context retrieval capabilities, in the same way that other works have found distractor effects in some models as context increases.
- Sure, we will add the number of hours in the Figure 3 caption.
- Good point, we will place the human learner dot at around 3^2 on the x axis and change the circle tick to a pentagon to more accurately reflect the available context.
- We’ll try to make the first line in the abstract more concise.
This paper leverages LLM techniques for languages that are primarily oral and lack written traditions, potentially benefiting the communities that use these languages. The research provides a valuable testbed for enhancing zero-shot and few-shot capabilities of large language models (LLMs) in real-world, low-resource language scenarios. The authors released a new benchmark, ASROB, which includes 15 hours of transcribed and translated speech in Kalamang. This is a significant contribution to language preservation and provides valuable data for under-resourced languages.
However, the paper primarily builds on existing work, adding only a speech modality using the multimodal capabilities of the Gemini LLM, which limits its innovative contribution. The main dataset contribution is restricted to a low-resource speech recognition and translation corpus. There is no comparison to regular finetuned baselines for ASR and spoken translation tasks using the 15 hours of transcribed and translated speech. The human baseline is based on only 1 hour of speech recordings, making it weaker than the model baseline. The claim in the abstract about beating human performance is misleading due to this disparity.
Additional Comments from the Reviewer Discussion
Although the authors addressed the reviewers' questions and the paper's weaknesses during the rebuttal period, the reviewers did not increase their scores.
Reject