PaperHub
Rating: 6.5/10 · Poster · 4 reviewers
Scores: 6, 6, 7, 7 (min 6, max 7, std. dev. 0.5)
Confidence: 3.8
COLM 2025

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Submitted: 2025-03-23 · Updated: 2025-08-26
TL;DR

A massively multilingual benchmark for topical utterance classification and textual multiple-choice QA from spoken paragraphs

Abstract

Keywords

spoken language understanding, multilingual benchmarks, multilingual evaluation

Reviews and Discussion

Official Review (Rating: 6)

This paper proposes two multilingual benchmark datasets for spoken language understanding (SLU): SIB-Fleurs and Belebele-Fleurs. Based on these datasets, it evaluates several end-to-end and cascaded models. Results show correlations between SLU performance and ASR/speech-translation performance.

Reasons to Accept

  • The proposed benchmark datasets are novel and carefully created. They are valuable assets to the community.

Reasons to Reject

  • This paper conducted many experiments and provided many data points. But no conclusive or convincing insights came out. (1) The tested methods were all based on pooled embedding + a classification layer. This is a weak method compared with today's contextual generative models. (2) Many conclusions were drawn from comparing two independent models, such as Whisper vs. SeamlessM4T. Without a variable-controlled ablation study, it is not convincing to conclude that a certain training-recipe difference caused the performance difference. (3) Correlation is not causation. We cannot conclude that we should jointly train ASR and SLU, or that ASR and SLU are mutually beneficial, merely because ASR scores correlate with SLU scores.

Questions for Authors

  • What does "ca." mean in Line 101 and 105?
  • Line 167: What are the multilingual datasets? Why do we need to present this method as it never works?
  • Figure 1: Why does Whisper win more often but get higher CER in all language groups? If it often wins by a small margin but loses by a large margin, then win ratio is not a good metric.
Comment

We thank reviewer VcwH for their thoughtful review. We will discuss their comments and (in our general response) present additional results we hope they find appealing. We are addressing comments in good faith and commit to improving our manuscript and experiments for a potential camera-ready.

What does "ca." mean in Line 101 and 105?

ca. (circa) means approximately. The APA style guide indicates "ca." is more common for dates, so we will change it to "~" or "approx.".

Line 167: What are the multilingual datasets? Why do we need to present this method as it never works?

  • `MMS-1B-Fleurs' is ASR fine-tuned on all Fleurs languages' training sets, providing language and domain adaptation for test instances.
  • `MMS-1b-all' is trained on Fleurs plus New Testament readings in 1362 languages (55K hours) for ASR support.

We used the original Fleurs splits to ensure no data leaks for models like `MMS-1B-Fleurs' and `MMS-1B-all'. These experiments evaluate the SLU capabilities of self-supervised pre-trained mSE with no ASR fine-tuning, Fleurs ASR fine-tuning, and massively multilingual ASR fine-tuning. Our results show that Fleurs ASR fine-tuning benefits ZS-XLT for these mSEs, making this a key part of analyzing cross-lingual SLU performance across mSE training regimes.

This paper conducted many experiments and provided many data points. But no conclusive or convincing insights came out.

We respectfully offer the following points, which we believe highlight valuable insights for the multilingual speech & NLP community:

  • We are the first to evaluate cross-lingual SLU across over 100 languages, showing that ZS-XLT to truly low-resource languages still degrades even on less complex tasks.
  • We are the first to demonstrate the impact of utterance quality in cross-lingual SLU.
  • We show mSE can be competitive with cascaded systems in utterance classification (SeamlessM4Tv2 in SIB-Fleurs).
  • We comparatively evaluate (i) purely self-supervised (mHubert, MMS), (ii) Whisper, and (iii) SeamlessM4T mSE with varying pre-training. From English SIB-Fleurs performance (where all models have abundant English data), we deduce pre-training objectives impact performance.
  • Our analysis provides statistical (not causal!) support that pre-training protocols matter.

Regarding the final point, we acknowledge the reviewer's concerns and will revise our wording to emphasize these are correlations and encourage follow-up work.

(1) The tested methods were all based on pooled embedding + a classification layer. This is a weak method compared with today's contextual generative models.

We do not consider this "weak." We have, however, added results from zero-shot prompting multi-modal speech LLMs.

(NLLB-)LLM2Vec, based on Llama 3.1 8B, uses sizable fine-tuning supervision for both tasks, outperforming prompted models of the same size and competing with much larger ones (cf. Llama 3.1 8B & 70B in MEXA [1], Table 1). We aimed for maximum performance within an academic compute budget.
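For concreteness, below is a minimal sketch of such a pooled-embedding classifier (generic PyTorch; the class and argument names are illustrative, not the exact implementation used in the paper):

```python
# Minimal sketch (illustrative, not the paper's training code): mean-pool a
# speech encoder's frame-level hidden states and classify with a linear head.
import torch
import torch.nn as nn


class PooledClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                      # e.g. a multilingual speech encoder (mSE)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, dim); mask: (batch, frames), 1 = valid frame
        hidden = self.encoder(features)             # (B, T, H)
        mask = mask.unsqueeze(-1).type_as(hidden)   # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.head(pooled)                    # (B, num_classes) logits
```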

(2) Many conclusions were drawn from comparing two independent models, such as Whisper vs. SeamlessM4T. Without a variable-controlled ablation study, it is not convincing to conclude that a certain training-recipe difference caused the performance difference.

Please see the last two bullet points in our previous response (on English performance). We acknowledge more experiments are needed to causally link pre-training protocol and downstream performance.

(3) Correlation is not causation. We cannot conclude that we should jointly train ASR and SLU, or that ASR and SLU are mutually beneficial, merely because ASR scores correlate with SLU scores.

These observations are correlations (cf. l19;l355), and we avoid implying causation (e.g., using "coincides" l66; "suggests" l67,258,357; "indicates" l278). Given our response to (2), we feel that our language adequately reflects the results.

Figure 1: Why does Whisper win more often but get higher CER in all language groups? If it often wins by a small margin but loses by a large margin, then win ratio is not a good metric.

The reviewer correctly notes the nuance in measuring ASR performance and its impact on SLU (e.g., for CS with NLLB-LLM2Vec). Average CER by language group and the win ratio together offer a more holistic view of ASR performance.
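To make the distinction concrete, the sketch below (illustrative only, not the paper's evaluation code; `jiwer` is assumed for CER) shows how per-language win ratio and group-averaged CER can rank two systems differently:

```python
# Illustrative comparison of two ASR systems: a system can win most languages
# by small margins yet still have the higher average CER if its losses are large.
import jiwer


def mean_cer(refs, hyps):
    # refs, hyps: aligned lists of reference / hypothesis transcripts
    return sum(jiwer.cer(r, h) for r, h in zip(refs, hyps)) / len(refs)


def compare(refs_by_lang, hyps_a_by_lang, hyps_b_by_lang):
    """Each argument maps a language code to an aligned list of strings."""
    wins_a, cers_a, cers_b = 0, [], []
    for lang, refs in refs_by_lang.items():
        cer_a = mean_cer(refs, hyps_a_by_lang[lang])
        cer_b = mean_cer(refs, hyps_b_by_lang[lang])
        cers_a.append(cer_a)
        cers_b.append(cer_b)
        wins_a += cer_a < cer_b
    n = len(refs_by_lang)
    # Reporting both views avoids over-interpreting either metric alone.
    return {
        "win_ratio_a": wins_a / n,
        "avg_cer_a": sum(cers_a) / n,
        "avg_cer_b": sum(cers_b) / n,
    }
```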

We thank the reviewer for their consideration.

[1] MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment

Comment

Thanks for the detailed responses and new experiments! I'd keep my stance that this paper is OK to be accepted for its contributions of datasets and many experimental data points. The observations are interesting and help model selection, but they could be more useful if the authors conducted variable-controlled experiments/training on selected models (Whisper, SeamlessM4T, Qwen, etc.) to derive more direct and convincing conclusions, i.e., testing the hypotheses on strong baselines.

Official Review (Rating: 6)

In this work, a multilingual spoken language understanding (SLU) benchmark, Fleurs-SLU, is proposed. It is derived from Fleurs, Flores and its derivatives (SIB-200 and Belebele) spanning over 100 languages. The SLU tasks include speech classification and question answering. Detailed evaluation is also presented for both end-to-end and cascaded systems. It would be a valuable benchmark for the SLU tasks.

Reasons to Accept

  • A multilingual benchmark is provided for SLU tasks.
  • Extensive experiments are conducted on top of the proposed benchmark.

Reasons to Reject

  • The description of the experiments is not clear and hard to follow.

Questions for Authors

  • As described in Section 4.1, the speech encoder and cascaded systems are trained with the English training set. I assume this is the English data from Fleurs. However, I also notice that Table 3 presents both MMS-1B-FLEURS and MMS-1B-ALL. This is quite confusing, and there is no clear explanation of the "ALL" training data.
  • "zero-shot cross-lingual transfer (ZS-XLT)" mentioned in Section 4.2 is not clear. As mentioned in Section 4.1, the model is trained with English data only. So the source language is English, is that correct?
  • In Table 3, there are 7 languages not supported by Whisper and SeamlessM4T. How do you generate the corresponding transcriptions with the Whisper or SeamlessM4T models?
Comment

We thank the reviewer t3tg for their thoughtful review, arguing for clarifications to strengthen our submission. We would like to take the chance to address the reviewer's comments.

  As described in Section 4.1, the speech encoder and cascaded systems are trained with the English training set. I assume this is the English data from Fleurs. However, I also notice that Table 3 presents both MMS-1B-FLEURS and MMS-1B-ALL. This is quite confusing, and there is no clear explanation of the "ALL" training data.

We can see now how this could cause confusion. Section 4.1 details our English training data and its use for mSE and cascaded systems. Section 4.2 outlines our ZS-XLT methods (e.g., transcribing with S4Tv2 then NLLB-LLM2Vec for inference) and translate-test approaches (e.g., S4Tv2 for speech-to-English translation, then LLM2Vec for inference).

`MMS-1B-FLEURS` and `MMS-1B-ALL` refer to different variants of the MMS speech encoders we evaluate in SIB-Fleurs. These variants differ in their ASR fine-tuning data. In l165-167 we outline these ASR model variants:

We include MMS-1b without fine-tuning (`MMS-1b'), with ASR fine-tuning on Fleurs (`MMS-1b-Fleurs'), and with ASR fine-tuning on multilingual datasets (`MMS-1b-all').

`MMS-1B-Fleurs' is ASR fine-tuned on the training data of all Fleurs languages. This can be considered language and domain adaptation to our test instances. `MMS-1b-all' is trained on Fleurs plus substantially more multilingual data, almost exclusively stemming from readings of the New Testament of the Bible in 1,362 languages, to support ASR for these languages.

For SIB-Fleurs, we used the original Fleurs data splits to ensure there is no test data leakage for models like `MMS-1B-Fleurs' and `MMS-1B-all'. These experiments evaluate the SLU capabilities of self-supervised pre-trained mSE with no ASR fine-tuning, Fleurs ASR fine-tuning, and massively multilingual ASR fine-tuning. Our results show that Fleurs ASR fine-tuning significantly benefits ZS-XLT for these types of mSE.

We would flesh out these details in our experimental setup in a potential camera-ready.

  "zero-shot cross-lingual transfer (ZS-XLT)" mentioned in section 4.2 is not clear. As mentioned in section 4.1, the model is trained with English data only. So the source language is English, is it correct?

Yes, that is absolutely correct. We would refer more explicitly to English as the source language in a potential camera-ready.

In Table 3, there are 7 languages not supported by Whisper and SeamlessM4T. How do you generate the corresponding transcriptions with the Whisper or SeamlessM4T models?

We mention in footnote 11 at the bottom of page 7:

For unsupported languages, we hand-select the closest supported language to transcribe into.

We acknowledge that this is a crucial aspect. We would elevate this footnote to the main body of the experimental setup in a potential camera-ready.

The description of the experiments is not clear and hard to follow.

We take these criticisms to heart and will do our best to address them thoroughly. We again thank the reviewer for the all-in-all favorable score despite their concerns about details they would like to see described more clearly.

Comment

Thanks for the detailed responses. I would like to keep my score.

Official Review (Rating: 7)

This paper proposes a multilingual benchmark for topic classification and multiple-choice QA. The dataset is derived by aligning the existing datasets Flores, Fleurs, SIB-200, and Belebele. Cascaded and end-to-end models are compared on the proposed dataset, and models are further analyzed on separate tasks (ASR, MT) in order to understand performance on the original tasks.

Reasons to Accept

  • massive multilingual benchmark datasets are critical for the community
  • the experimentation is thorough, the results convincing and the paper is well written

Reasons to Reject

I don't see a specific reason to reject

Questions for Authors

  • Abstract: the motivation is languages without a writing system, but all experiments compare cascade and end-to-end systems. Since cascade systems rely on a writing system, the motivation is not fully realized.
  • 99: "Concurrent Work. Costa-jussà et al. (2024)" I am not aware of what qualifies as concurrent work at COLM (and the reviewing guidelines do not specify this) but noting that https://arxiv.org/abs/2412.08274 shows a first submission on 11 Dec 2024.
  • 147-148: "We train mSE on the utterances and CS on the transcriptions of the English training set." It would be interesting to see what happens in an "unconstrained" setting where both mSE and CS have access to additional data.
  • 37: "wav2vec-BERT objective" → wav2vec or wav2vec 2.0. w2v-BERT refers to a different paper: https://ieeexplore.ieee.org/abstract/document/9688253. It would be good to go through the paper and fix all instances (note that for SeamlessM4T v2 it should be "w2v-BERT 2.0")
  • Table 1: it was unclear to me what the numbers in parenthesis represent
Comment

We thank the reviewer KQ4z for their appreciative review of our work. We would like to take the chance to answer the questions the reviewer asked.

Abstract: the motivation is languages without a writing system, but all experiments compare cascade and end-to-end systems. Since cascade systems rely on a writing system, the motivation is not fully realized.

This arguably is a limitation of our work. Nevertheless, we believe that

  • The general motivation to develop strong SLU systems still holds.
  • We've constructed the SLU benchmark (non-LID) with the broadest language coverage for these two tasks to date.

Fleurs-SLU also lets us proxy how models might perform on languages unsupported by Whisper and SeamlessM4Tv2. On those, mSE can transfer directly to target-language utterances, and speech-to-English-text translation (translate-test) can still handle such unsupported or spoken-only languages by translating them into English. This still performs much better than random guessing (ca. 14% for SIB and 25% for Belebele).

99: "Concurrent Work. Costa-jussà et al. (2024)" I am not aware of what qualifies as concurrent work at COLM (and the reviewing guidelines do not specify this) but noting that https://arxiv.org/abs/2412.08274 shows a first submission on 11 Dec 2024.

While we would like to address this comment, maintaining anonymity prevents us from doing so. We sincerely hope that you trust that we consider this concurrent work in good faith.

147-148: "We train mSE on the utterances and CS on the transcriptions of the English training set." It would be interesting to see what happens in an "unconstrained" setting where both mSE and CS have access to additional data.

This is definitely an interesting idea. We focus on baselines under controlled conditions (ZS-XLT from English, speech vs. text, and varying utterance quality). ZS-XLT with English as a pivot is still the main way to test transfer to truly low-resource languages. We'd expect performance in an unconstrained setting to at best catch up to English. The benefits likely depend heavily on the representation of the language in prior training. If English data is involved, the typological distance of the language to English would also be crucial.

37: "wav2vec-BERT objective" → wav2vec or wav2vec 2.0. w2v-BERT refers to a different paper... It would be good to go through the paper and fix all instances...

We sincerely thank the reviewer for pointing this out and would fix these instances in a potential camera-ready.

Table 1: it was unclear to me what the numbers in parenthesis represent

The numbers in parentheses show the size of the language group. For example, in SIB-Fleurs, SeamlessM4Tv2 supports 90 of the 101 non-English languages. "Unsupported" means languages not supported by either Whisper-v3 or SeamlessM4Tv2. We would refine the table caption in the potential camera-ready.

We additionally want to bring to the reviewer's attention that we have also zero-shot evaluated Gemini-2-flash and Qwen 2.5 7B-Omni on all setups.

Comment

Thank you for your responses!

While we would like to address this comment, maintaining anonymity inhibits us to do so. We sincerely hope that you trust that we in good faith consider this concurrent work.

I don't quite get the point about anonymity (no need to clarify if there are issues with it). And yes, I assume that this is concurrent work in the sense that it was developed concurrently with that other work (as opposed to reactive work started right after that publication and submitted by the March 28 deadline).

All the clarifications are helpful. I will keep my (high) score.

Official Review (Rating: 7)

This paper introduces the Fleurs-SLU benchmark for massively multilingual spoken language understanding. The benchmark consists of 692 hours of speech for topical classification in 102 languages (SIB-Fleurs) and 944 hours of speech for a multiple-choice question-answering task in 92 languages (Belebele-Fleurs). The authors conducted extensive experiments for benchmarking models on the datasets, including end-to-end multilingual speech encoders (mSE) and cascaded systems (CS). The results show that the cascaded systems remain the most robust option; however, mSE can achieve competitive performance with adequate pre-training. The authors also find correlations between ASR and speech-to-English translation performance and SLU.

Reasons to Accept

  • The benchmark increases the coverage of languages to 102 for topical classification and 92 for multiple-choice question answering.
  • Comprehensive experiments have been conducted and the results are very insightful.
  • The code will be released, which can help the research community verify the results and build on top of the benchmark.

Reasons to Reject

NA

Questions for Authors

For the length adaptor section, it may not be conclusive because the SeamlessM4Tv2 model is pre-trained on a massive amount of data. I would suggest the authors choose a model baseline and train the model from scratch with and without the length adaptor to verify whether it is helpful to model performance.

Comment

We thank reviewer TY1K for their favorable review and insightful comments. We would like to address the below point the reviewer raised:

For the length adaptor section, it may not be conclusive because the SeamlessM4Tv2 model is pre-trained on a massive amount of data. I would suggest the authors choose a model baseline and train the model from scratch with and without the length adaptor to verify whether it is helpful to model performance.

We agree with your observation. Our own ablation study on the length adaptor led us to a similarly cautious stance, especially considering the significant impact of extensive pre-training. We state in our submission (l342-346):

"While a modest gap persists across setups, we cannot decisively infer whether appending a length adaptor is, as opposed to replacing model capacity with other layers of equal parameter size during pre-training, crucial for improved multilingual SLU. We conclude that the pre-training regime is more vital for multilingual SLU capabilities of mSE than nuanced architectural design choices."

That the pre-training regime seems more crucial than this specific architectural detail generally aligns with your concern regarding the difficulty of isolating the length adaptor's impact in pre-trained models. We will ensure a potential camera-ready of the paper consistently emphasizes this nuance. We will also more explicitly encourage future work to conduct more direct "apples-to-apples" comparisons, i.e. training models from scratch with and without length adaptors.
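For clarity, the length adaptor under discussion is a temporal-convolution module that downsamples the encoder's frame sequence. A rough, generic sketch (dimensions, stride, and gating are illustrative, not SeamlessM4Tv2's actual configuration) is shown below:

```python
# Rough sketch of a convolutional length adaptor (illustrative hyperparameters).
import torch
import torch.nn as nn


class LengthAdaptor(nn.Module):
    def __init__(self, hidden_size: int = 1024, stride: int = 2):
        super().__init__()
        # Temporal convolution that roughly halves the number of frames (stride=2).
        self.conv = nn.Conv1d(hidden_size, 2 * hidden_size,
                              kernel_size=2 * stride, stride=stride,
                              padding=stride // 2)
        self.glu = nn.GLU(dim=1)  # gated activation, halves channels back to hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, hidden) -> (batch, ~frames / stride, hidden)
        x = self.conv(x.transpose(1, 2))    # (B, 2H, T')
        return self.glu(x).transpose(1, 2)  # (B, T', H)
```

An ablation along these lines would train otherwise identical models with and without this module, keeping everything else fixed.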

Furthermore, we would like to mention that we have since zero-shot evaluated Gemini-2-flash and Qwen 2.5 7B Omni on all setups, results which the reviewer might find interesting.

Comment

Thank you for the response.

That the pre-training regime seems more crucial than this specific architectural detail generally aligns with your concern regarding the difficulty of isolating the length adaptor's impact in pre-trained models.

Yes, that is what I meant. The length adaptor, which is a temporal convolution layer, is pre-trained along with the other layers in the SeamlessM4Tv2 model. What I suggest is adding a length adaptor layer but keeping it uninitialized for fine-tuning, and comparing it with a model that excludes the length adaptor. With this setting I think it would be a fair comparison.

Comment

We thank all reviewers for their insightful and thoughtful reviews of our submission. We have now also evaluated zero-shot prompting of Qwen 2.5 7B-Omni and Gemini-2-flash on both SIB-Fleurs and Belebele-Fleurs, following the setup of our main results. Kindly note that models like Qwen 2.5 7B-Omni were not yet available when we ran our original experiments, though it is of course important to include them in our evaluations. We use Gemini-2-flash due to its much lower cost and much less prohibitive rate limits.

Input (Prompt examples below):

  • We provide the task description in English.
  • The relevant inputs by task are provided in the testing language.
    • SIB-Fleurs: the sentence to classify is provided in-language; the topic choices are in English
    • Belebele-Fleurs: The paragraph, question, and answer choices are all in-language
  • We loudness-normalize the audio of all sentences. For Belebele-Fleurs, we normalize the concatenated utterances once more to ensure loudness is standardized across the passage (see the sketch below).
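A hedged sketch of this loudness-normalization step (using `pyloudnorm` and `soundfile`; the target level and helper names are assumptions, not the exact pipeline used for these results):

```python
# Sketch of per-utterance loudness normalization followed by passage-level
# re-normalization (the target LUFS is an assumed value, not stated in this thread).
import numpy as np
import soundfile as sf
import pyloudnorm as pyln

TARGET_LUFS = -23.0  # assumption for illustration


def loudness_normalize(audio: np.ndarray, rate: int) -> np.ndarray:
    meter = pyln.Meter(rate)
    measured = meter.integrated_loudness(audio)
    return pyln.normalize.loudness(audio, measured, TARGET_LUFS)


def build_passage(utterance_paths):
    """Normalize each utterance, concatenate, then normalize the passage once more."""
    pieces, rate = [], None
    for path in utterance_paths:
        audio, rate = sf.read(path)
        pieces.append(loudness_normalize(audio, rate))
    passage = np.concatenate(pieces)
    return loudness_normalize(passage, rate), rate
```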

Results:

  • Gemini-2-flash is among the best models in all setups. In speech-based evaluation, it nevertheless falls behind the best cascaded systems (cf. manuscript).
  • Qwen 2.5 7B-Omni performs competitively in high-resource languages, though outputs show that performance in low-resource languages substantially drops due to lack of support. Its speech encoder is based on Whisper-v3. Interestingly, this model is well behind cascaded systems.
  • We again see that languages unsupported by both Whisper and Seamless degrade most severely in performance across both tasks, highlighting that, beyond textual support (which is in part already strong), collecting speech data for such languages will be required for inclusive language technology.

We would add these results to a potential camera-ready version of our manuscript, further adding to our discussion and findings. We hope that the reviewers find these results intriguing and are happy to discuss them further. We kept our discussion here as brief as possible.

Results

SIB-Fleurs

| Model  | Modality | Quality | English | Whisper Supported (Non-English) | Seamless Supported (Non-English) | Unsupported (Non-English) | All Non-English |
|--------|----------|---------|---------|---------------------------------|----------------------------------|---------------------------|-----------------|
| Gemini | Text     | --      | 90.4    | 87.0                            | 86.8                             | 77.2                      | 86.1            |
| Gemini | Speech   | Best    | 87.6    | 82.1                            | 80.8                             | 59.4                      | 78.6            |
| Gemini | Speech   | Worst   | 88.7    | 81.4                            | 80.0                             | 59.3                      | 77.8            |
| Qwen   | Text     | --      | 88.7    | 76.4                            | 75.1                             | 53.9                      | 73.0            |
| Qwen   | Speech   | Best    | 87.0    | 59.5                            | 58.3                             | 45.6                      | 56.9            |
| Qwen   | Speech   | Worst   | 87.6    | 58.4                            | 57.3                             | 45.6                      | 55.9            |

Belebele-Fleurs

| Model  | Modality | Quality | English | Whisper Supported (Non-English) | Seamless Supported (Non-English) | Unsupported (Non-English) | All Non-English |
|--------|----------|---------|---------|---------------------------------|----------------------------------|---------------------------|-----------------|
| Gemini | Text     | --      | 95.9    | 90.8                            | 89.7                             | 82.0                      | 89.1            |
| Gemini | Speech   | Best    | 94.1    | 84.4                            | 82.9                             | 59.7                      | 81.1            |
| Gemini | Speech   | Worst   | 93.4    | 83.5                            | 82.0                             | 59.4                      | 80.2            |
| Qwen   | Text     | --      | 94.1    | 69.1                            | 67.4                             | 36.6                      | 64.9            |
| Qwen   | Speech   | Best    | 91.6    | 51.4                            | 50.4                             | 34.4                      | 49.0            |
| Qwen   | Speech   | Worst   | 91.0    | 50.8                            | 49.7                             | 34.4                      | 48.4            |
Comment

Prompts

SIB-Fleurs

The utterance / sentence belongs to one of the following topics.

Utterance: [IN-LANG AUDIO] // Sentence: [IN-LANG SENTENCE]

Topics:
1. Entertainment
2. Geography
3. Health
4. Politics
5. Science and Technology
6. Sports
7. Travel

Please respond with only the number of the correct answer (1, 2, 3, 4, 5, 6, or 7).

Belebele-Fleurs

# Speech
Listen to the audio passage. Based on the audio, answer the following multiple-choice question.

Passage: [IN-LANG AUDIO]

# Text
Given the paragraph, answer the following multiple-choice question.

Paragraph: [IN-LANG PARAGRAPH]

# Both
Question: [IN-LANG QUESTION]

Options:
1. [IN-LANG MC-ANSWER 1]
2. [IN-LANG MC-ANSWER 2]
3. [IN-LANG MC-ANSWER 3]
4. [IN-LANG MC-ANSWER 4]
Please respond with only the number of the correct answer (1, 2, 3, or 4).
Final Decision

Paper summary:

This paper presents a dataset for spoken language understanding (SLU) with a focus on low-resource languages and comparing the performance of cascaded (ASR + NLU) systems vs. end-to-end systems. The dataset is comprised of two parts: (1) 692 hours of speech across 102 languages annotated at the utterance level with topic classification labels and (2) 944 hours of speech across 94 languages accompanied by multiple-choice questions that assess a model's listening comprehension.

Extensive experiments are performed comparing cascaded vs. E2E systems. The paper finds that in general cascaded systems are still more robust than E2E systems, but this depends on how accurate the component systems (ASR and MT) are, and with adequate pre-training the E2E systems can often be competitive with the cascaded systems.

Paper significance:

The paper curates a new medium-sized (~1600 hour) multilingual (~100 languages) dataset that is annotated for spoken language understanding. Such datasets with this degree of language coverage and annotation are hard to come by and very valuable for continuing work in multilingual SLU, so this contribution will be extremely valuable for the research community.

Pros (as raised by reviewers):

All reviewers noted that the dataset introduced by this paper will be very useful for the multilingual SLU research community (KQ4z, TY1K, t3tg, VcwH)

The experiments presented by the paper comparing E2E vs. cascaded SLU systems are very thorough and informative (KQ4z, TY1K, t3tg)

Cons (as raised by reviewers):

Two reviewers indicated that there were no reasons to reject the paper (KQ4z, TY1K)

One reviewer (t3tg) found the experimental description hard to follow, but still noted positively that the experiments were very thorough.

One reviewer (VcwH) found the introduced dataset to be valuable, but questioned whether the experiments led to any meaningful insights

Cons addressed by discussion between authors and reviewers:

Regarding the clarity issues raised by t3tg, the authors replied to the reviewer clarifying their questions, and the reviewer acknowledged their response and did not follow up with additional questions.

Regarding the significance of the experiments (VcwH), in my view the authors adequately made the case for their experiments (e.g. being the first to rigorously examine SLU on 100+ languages), and the reviewer noted that they believe the paper should be accepted on the basis of the dataset.

[Automatically added comment: At least one review was discounted during the decision process due to quality.]