PaperHub
Overall rating: 6.8 / 10 (Poster; 4 reviewers; min 6, max 8, std 0.8)
Individual ratings: 8, 7, 6, 6
Average confidence: 4.3
COLM 2025

Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models

Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

This study reveals the crosslingual knowledge barrier for multilingual LLMs in both general (MMLU benchmark) and domain-specific (Harry Potter quiz and TOFU benchmark) contexts, and proposes to mitigate the barrier via mixed-language fine-tuning.

Abstract

Keywords
Large Language Models, Multilingual, Crosslingual Knowledge Barrier

Reviews and Discussion

Review (Rating: 8)

This paper shows that LLMs have surface-level crosslingual abilities, like machine translation, but demonstrate a crosslingual knowledge barrier -- a phenomenon where the model learns knowledge in one language, typically English, but cannot transfer the knowledge when queried in other languages. The authors also propose viable methods that can overcome the crosslingual knowledge barrier.

Reasons to Accept

  • The paper is well-written and easy to follow. I quite enjoyed reading the paper.
  • The experiments are well-established and very extensive.
  • The paper not only showcases the existence of a crosslingual knowledge barrier but also presents promising methods to address it -- which is appreciated.

Reasons to Reject

  • I have concerns regarding the mixed-language evaluation. The authors claim that "general knowledge might be present in each of the evaluated languages in the pretraining dataset" in Lines 192-193. I think this also applies to mixed-language evaluation -- the model can still have the knowledge in each language of the set {en, fr, de, es, it}. Could it be that the model does not have a crosslingual barrier (i.e., an inability to leverage knowledge learned in other languages) but is simply confused in a multilingual context?

  • In Section 4.2, after the model is fine-tuned on domain data in English, accuracy improves slightly for some models. However, some other languages also improve. I think a better way to quantify the effect of English-centric domain-specific fine-tuning is to look at those questions where the original model answers incorrectly but the fine-tuned model answers correctly. If other languages do not show consistent improvement on those questions, it would be more direct proof that the crosslingual barrier persists even with fine-tuning.

Questions to the Authors

Line 149: (4) Random Token Replacement -> (4) Random Token Dropout ?

Comment

Thank you for your feedback! We address your questions and comments below.

  1. Could it be that the model does not have a crosslingual barrier (i.e., an inability to leverage knowledge learned in other languages) but is simply confused in a multilingual context?

We appreciate the reviewer’s thoughtful question. The following perspectives provide evidence that the performance drop in mixed-language evaluation is not solely due to language confusion, but reflects an inherent limitation in leveraging cross-lingual knowledge.

  • Inference-time mitigation is ineffective. We tried several inference-time strategies (Section 5.1) to help the model better handle the multilingual context, such as explicitly stating that the input contains multiple languages or providing few-shot demonstrations. These methods did not yield noticeable improvements, suggesting that the issue is not just confusion from multilingual inputs but a deeper challenge in knowledge transfer across languages.

  • Humans can understand unnatural context. While mixed-language input may appear unnatural, multilingual humans can still understand the context in these settings (e.g., examples in Figure 4a), which supports the idea that the drop in model performance stems from a lack of robust cross-lingual transfer, rather than surface-level confusion.

  • Additional evidence from the Harry Potter and TOFU datasets complements our findings. We evaluate models on fully translated, monolingual QA examples across multiple languages, both before and after English-only FTing. Since the inputs are purely monolingual, this evaluation removes any confusion arising from mixed-language input. As shown in Figure 6, models struggle to utilize the parametric TOFU/Harry Potter knowledge acquired during English FTing to answer related questions in other languages, suggesting limitations in cross-lingual knowledge transfer.

We will add the above discussion to our revision.

  2. I think a better way to quantify the effect of English-centric domain-specific fine-tuning is to look at those questions where the original model answers incorrectly but the fine-tuned model answers correctly. If other languages do not show consistent improvement on those questions, it would be more direct proof that the crosslingual barrier persists even with fine-tuning.

We appreciate the reviewer’s helpful suggestion. Following it, we report the proportion of questions where the original model answers incorrectly but the English FTed model answers correctly. Specifically, we compute a per-question ROUGE-L recall score and consider a response correct if the score is ≥0.5 and incorrect otherwise. The improvement proportion is defined as the fraction of questions where the original model is incorrect and the FTed model is correct, relative to the total number of questions.
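To make the metric concrete, here is a minimal illustrative sketch of the computation (simple whitespace tokenization; not the exact evaluation code used for the reported numbers):

```python
# Illustrative sketch of the improvement-proportion metric described above
# (whitespace tokenization is a simplification; not the exact evaluation code).

def _lcs_len(a, b):
    # Length of the longest common subsequence between token lists a and b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    return _lcs_len(pred, ref) / len(ref) if ref else 0.0

def improvement_proportion(orig_preds, fted_preds, references, threshold=0.5) -> float:
    """Fraction of questions answered incorrectly by the original model
    (ROUGE-L recall < threshold) but correctly by the FTed model."""
    correct = lambda pred, ref: rouge_l_recall(pred, ref) >= threshold
    improved = sum(
        (not correct(o, r)) and correct(f, r)
        for o, f, r in zip(orig_preds, fted_preds, references)
    )
    return improved / len(references)
```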

As shown in the anonymous link, for the four evaluated LLMs, the improvement on non-English questions is substantially lower than that on English questions. This indicates that LLMs struggle to effectively transfer the parametric knowledge gained during English FTing to other languages, highlighting a cross-lingual gap even after FTing.

We will add the new results and discussion to our revision.

  3. Line 149: (4) Random Token Replacement -> (4) Random Token Dropout?

Thanks for pointing this out! We will fix the typo.

Comment

Dear Authors,

Thank you for your response and the new results. My concerns are properly addressed.

P.S. Just some ideas: there are some very recent papers [1, 2, 3] showing that frequency in the pretraining data actually plays an important role, even in the multilingual context. It might be more or less related to the crosslingual knowledge barrier presented in this paper (frequency in the fine-tuning dataset may also be very important).

[1] https://arxiv.org/abs/2504.12459
[2] https://arxiv.org/abs/2505.14824
[3] https://arxiv.org/abs/2505.16385

Good luck with this submission!

Comment

Dear Reviewer JPW1,

Thank you for the positive feedback and for sharing these recent works! They offer a complementary perspective to our study by highlighting the role of pretraining data frequency in factual knowledge acquisition, particularly in multilingual settings. This line of work is orthogonal to our focus on evaluating and mitigating cross-lingual knowledge barriers but aligns well with our broader goal of understanding LLM crosslingual ability. We agree that frequency in the pretraining data could play an important role and appreciate the valuable pointer.

We will include a discussion of these works in the related work section of our revision. Thank you again for the thoughtful suggestions!

Review (Rating: 7)

This paper explores the ability of multilingual Large Language Models (LLMs) to transfer parametric knowledge across language boundaries, moving beyond surface-level capabilities like machine translation. The study demonstrates that LLMs have limited capabilities in transferring knowledge across languages, revealing what the authors term the "crosslingual knowledge barrier". The barrier is observed in both general knowledge settings using the MMLU benchmark and domain-specific contexts using the Harry Potter quiz and TOFU benchmark. To systematically evaluate this, the authors conducted extensive evaluations on 15 models and 16 languages. They designed novel mixed-language evaluation formats, such as mixed-language multiple-choice questions (MCQs), to directly probe crosslingual abilities by requiring models to utilise knowledge learned in one language to answer questions or perform tasks in a different language. The authors also explored potential methods to overcome this barrier: inference-time interventions (like prompt engineering and few-shot demonstrations) and training-time interventions (specifically, mixed-language fine-tuning).

Reasons to Accept

Clear Introduction: The paper presents an important problem in multilingual Natural Language Processing, going beyond handling multilingual input to crosslingual parametric knowledge transfer. The new concept of a "crosslingual knowledge barrier" is clearly defined and intriguing.

Logical Methodology: The paper presents a well-constructed and systematic empirical framework, with a well-designed evaluation pipeline and depth in problem exploration. The methodology progresses logically from standard translation and embedding-similarity analyses to controlled crosslingual question-answering settings. The proposed mixed-language MCQ formats are novel and valuable for probing the barrier in controlled ways. The experiments cover both domain-specific (Harry Potter Quiz, TOFU) and general (MMLU) benchmarks. The authors also explored multiple existing solutions before proposing their method. The proposed mixed-language fine-tuning method is simple, compute-friendly, and an attractive recipe for practitioners, and is shown to improve performance even with relatively small corpora.

Careful Analysis and Discussion: The analysis includes examining failure modes, such as models exhibiting an English-option bias. The finding that the barrier persists even after English-only domain-specific fine-tuning strengthens the paper's central claim.

Reasons to Reject

A significant concern raised is that the main experiments primarily focus on a limited set of high-resource, typologically similar languages (English, French, German, Spanish, Italian). These languages share script and substantial vocabulary with English, which may limit the generalizability of the findings. The behavior of the barrier for genuinely low-resource or non-Latin languages is noted as being only touched upon in an appendix figure, suggesting the main claims would be stronger with systematic inclusion of such languages.

Although the paper references prior literature on code-switched training and contrasts its mixed-language fine-tuning approach, it stops short of conducting a controlled experimental comparison against traditional code-switched fine-tuning under equal corpus size. This leaves the claim that the proposed mixed-language fine-tuning is more effective unsubstantiated.

The mitigation technique (mixed-language fine-tuning, which is similar to code-switching and random multilingual substitution during fine-tuning) is not novel, suggesting that the novelty of the proposed solution might be incremental.

Questions to the Authors

Why did you select the Harry Potter Quiz dataset to identify crosslingual knowledge barriers? The book was originally written in English, so it does not involve language-specific knowledge, though the paper argues this domain is suitable for controlled testing of domain-specific knowledge transfer.

The paper could benefit from probing-based or attention-based diagnostics to explain why the barrier arises, beyond just behavioral metrics.

Further analysis and tests are required to assess the significance of the work, including a benchmark comparison with code-switching.

What are the specific languages used in the mixed-language fine-tuning, and what are the zero-shot languages shown in Figure 9?

Comment

Thank you for your feedback! We address your questions and comments below.

  1. The behavior of the barrier for genuinely low-resource or non-Latin languages is noted as being only touched upon in an appendix figure, suggesting the main claims would be stronger with systematic inclusion of such languages.

Please note that we evaluated a total of 16 languages to show the generalizability of our findings. In addition to 5 major European languages (en, fr, de, es, it), as discussed in line 230, we also evaluate:

  • (i) Low-resource languages: Malay (ms), Danish (da), Finnish (fi), Norwegian (no), Bengali (bn), and Amharic (am);
  • (ii) Languages with token distributions significantly different from English: Russian (ru), Chinese (zh), Hebrew (he), Arabic (ar), and Hindi (hi).

Concretely,

  • We identify cross-lingual knowledge barriers for off-the-shelf LLMs on the MMLU benchmark (Figure 5) and HP-Quiz (Figure 7) across diverse languages, supporting the universality of our findings.
  • We also verify the effectiveness of mixed-language FTing on languages that were not used during the FTing stage. Results in Figures 9, 10 show that the model FTed using our method on high-resource languages (en, fr, de, es, it) can improve cross-lingual performance on Harry Potter quiz and MMLU for low-resource languages (ms, da, fi, no, bn, am) and languages that are rather different from English (ru, zh, he, ar, hi). These results provide evidence for the generalizability of our FTing approach.
  • Following the reviewer’s suggestions, we conduct additional experiments to analyze the cross-lingual knowledge barriers for monolingually FTed LLMs. Specifically, we fine-tune Mistral-7B on the TOFU dataset in different low-resource/non-Latin languages separately. We report the difference in ROUGE-L recall scores between the fine-tuned and original models. As shown in the anonymous link, the models perform best when evaluated in the same language as FTing, but struggle to transfer the parametric TOFU knowledge acquired in one language to answer questions in other languages. This trend is consistent with our English-only FTed results (Figure 6), further confirming the existence of cross-lingual knowledge barriers under monolingual FTing.

As the reviewer suggests, we will highlight these results more prominently in the main body of the paper in our revision.

  2. It stops short of conducting a controlled experimental comparison against traditional code-switched fine-tuning under equal corpus size... The mitigation technique (mixed-language fine-tuning, which is similar to code-switching and random multilingual substitution during fine-tuning) is not novel.

Following the reviewer’s suggestion, we conducted additional controlled experiments to compare the mixed-language FTing with traditional code-switched FTing using equal corpus size. Specifically, we follow recent state-of-the-art methods for code-switching data synthesis [1, 2]. For each target language X in {fr, de, es, it}, we first translate the entire English corpus into language X. Given both the English and translated versions of each document, we use the Gemini-2.0 Flash model to generate a code-switched version (Code-switch-X) using the prompt template from App B.1 of [1]. We then construct the code-switched corpus by combining these generated samples across the 4 languages. To match the corpus size with mixed-language FTing, we allocate 1/4 of the total training size to each Code-switch-X.

As shown in the anonymous link, mixed-language FTed models (word/sentence levels) perform better than code-switch FTed models. One possible explanation is that current code-switching techniques rely on an external LLM to decide which entity or phrase to switch, which may introduce noise, especially when the external model has limited multilingual capabilities. In contrast, mixed-language FTing allows more controlled and flexible manipulation of language units (e.g., fixed-length k-words, sentences, or documents) without relying on external models.

Moreover, mixed-language FTing has the practical advantage that it can be applied over the entire training corpus without increasing the overall data size, whereas code-switching often leads to data expansion proportional to the number of target languages due to the need for parallel sentence pairs.
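For a concrete picture of the data construction, below is a minimal sketch of sentence-level and fixed-length k-word mixing over sentence-aligned parallel documents (illustrative only, not the exact data-processing pipeline; the k-word variant glosses over cross-lingual word alignment):

```python
import random

# Illustrative sketch of mixed-language data construction (not the exact
# training-data pipeline). `parallel_doc` maps a language code to the list of
# aligned sentences for the same document, e.g. {"en": [...], "fr": [...]},
# with every language holding the same number of sentences.

def mix_sentence_level(parallel_doc: dict, rng=random) -> str:
    """Pick a language independently at random for each aligned sentence."""
    langs = list(parallel_doc)
    n_sentences = len(parallel_doc[langs[0]])
    return " ".join(parallel_doc[rng.choice(langs)][i] for i in range(n_sentences))

def mix_k_word_level(parallel_sentence: dict, k: int = 5, rng=random) -> str:
    """Switch language every k words within one sentence.

    Word positions are only roughly comparable across languages, so this is a
    simplification of the fixed-length k-word mixing described in the paper.
    """
    langs = list(parallel_sentence)
    tokens = {lang: parallel_sentence[lang].split() for lang in langs}
    max_len = max(len(t) for t in tokens.values())
    chunks = []
    for start in range(0, max_len, k):
        lang = rng.choice(langs)
        chunks.extend(tokens[lang][start:start + k])
    return " ".join(chunks)
```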

We will add the new results to the main body of the paper in our revision, and we hope this additional evidence and analysis help substantiate the novelty of our mitigation techniques.

Comment

(A2 continued)

Below is the exact prompt we used for code-switching data synthesis. Note that it is the same as in [1], except that we added a few instructions specifying the output format for easier parsing of the results:

Given the follow pair of text, one in English, encoded in the <English>...</English> block, and one in {target_language}, encoded in the <{target_language}>...</{target_language}> block, generate a code-switching text. Put the generated code-switching text inside a <CodeSwitching>...</CodeSwitching> block. Code-switching is the use of more than one linguistic variety in a manner consistent with the syntax and phonology of each variety.

<English>
{original_text}
</English>

<{target_language}>
{translated_text}
</{target_language}>
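For completeness, here is a small sketch showing how the prompt above can be instantiated and the model output parsed (illustrative only; the actual request to Gemini-2.0 Flash is omitted):

```python
import re

# Illustrative sketch of prompt construction and output parsing for the
# code-switching data synthesis described above; the API call itself is omitted.

PROMPT_TEMPLATE = (
    "Given the follow pair of text, one in English, encoded in the "
    "<English>...</English> block, and one in {target_language}, encoded in the "
    "<{target_language}>...</{target_language}> block, generate a code-switching text. "
    "Put the generated code-switching text inside a <CodeSwitching>...</CodeSwitching> block. "
    "Code-switching is the use of more than one linguistic variety in a manner "
    "consistent with the syntax and phonology of each variety.\n\n"
    "<English>\n{original_text}\n</English>\n\n"
    "<{target_language}>\n{translated_text}\n</{target_language}>"
)

def build_prompt(original_text: str, translated_text: str, target_language: str) -> str:
    return PROMPT_TEMPLATE.format(
        target_language=target_language,
        original_text=original_text,
        translated_text=translated_text,
    )

def parse_code_switched(response: str):
    """Extract the text inside the <CodeSwitching>...</CodeSwitching> block, or None."""
    match = re.search(r"<CodeSwitching>(.*?)</CodeSwitching>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None
```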

Reference:

  • [1] Code-Switching Curriculum Learning for Multilingual Transfer in LLMs, ACL findings 2025
  • [2] Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding, ACL 2025
  3. Why did you select the Harry Potter Quiz dataset to identify crosslingual knowledge barriers? The book was originally written in English, so it does not involve language-specific knowledge, though the paper argues this domain is suitable for controlled testing of domain-specific knowledge transfer.

We selected the Harry Potter Quiz dataset precisely because the book was originally written in English, making it a natural source language. This allows us to test whether models can transfer domain-specific knowledge (e.g., facts and events in the Harry Potter universe) to other languages that models know, such as French and German. Harry Potter represents a widely known knowledge domain with practical relevance (e.g., users may query LLMs about HP in their native languages to understand the books better). The use of a well-defined fictional domain also helps reduce confounding effects from cultural or real-world differences across languages.

In addition, we conduct FTing-based experiments on both the Harry Potter Quiz and TOFU datasets (Figure 6), where the model is FTed on English data and evaluated on other languages. This setting allows for a more controlled examination of cross-lingual transfer of domain-specific knowledge, isolated from multilingual pre-training artifacts.

We hope this clarifies our motivation, and we would be happy to further discuss it based on the reviewer’s suggestions.

  4. The paper could benefit from probing-based or attention-based diagnostics to explain why the barrier arises, beyond just behavioral metrics.

Please note that lines 380–385 contain a probing analysis based on embedding similarity. Specifically, to understand how mixed-language FTing enhances cross-lingual capabilities, we measure whether the embeddings of an English sentence remain close when certain words are replaced with their counterparts in other languages. As shown in Appendix Figure 18, the mixed-language FTed models exhibit significantly smaller embedding distances, indicating improved cross-lingual alignment in the representation space.
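As a concrete illustration of this probe, the sketch below measures the distance between the embedding of an English sentence and the embedding of the same sentence with selected words replaced by their counterparts in another language (`encode` stands in for the model's sentence-embedding function; this is illustrative, not the exact probing code):

```python
import numpy as np

# Illustrative sketch of the embedding-similarity probe (not the exact code).
# `encode` is any function mapping a string to a fixed-size vector, e.g. a
# pooled last-layer hidden state of the model under study.

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def swap_words(sentence: str, replacements: dict) -> str:
    """Replace selected English words with their counterparts in another language."""
    return " ".join(replacements.get(word, word) for word in sentence.split())

def probe(encode, sentence: str, replacements: dict) -> float:
    """Distance between the original and word-swapped sentence embeddings.

    Smaller values after mixed-language FTing indicate better cross-lingual
    alignment in the representation space.
    """
    mixed = swap_words(sentence, replacements)
    return cosine_distance(encode(sentence), encode(mixed))
```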

  5. What are the specific languages used in the mixed-language fine-tuning, and what are the zero-shot languages shown in Figure 9?

In Figure 9, the five languages used for mixed-language FTing are: en, fr, de, es, and it. The zero-shot languages refer to all other evaluated languages that were not included in the FTing phase: ms, da, fi, no, ru, zh, he, ar, hi, bn, and am. We will clarify this in the revision.

Comment

Dear Authors,

Thank you for your detailed response and the new results; great job. I will respond shortly with further follow-up and whether my concerns are properly addressed.

Comment

Dear Authors,

Thank you for the additional results. The comparison table between code-switching and mixed fine-tuning, supplemented by the other experiments, is promising. However, since there has not been fine-tuning for low-resource languages, we believe an increase in score of one point would be appropriate. Keep up the good work.

Comment

Dear Reviewer riAs,

Thank you again for your thoughtful suggestions and encouraging feedback!

Regarding the comment on fine-tuning for low-resource languages: during the rebuttal phase, we conducted experiments where we fine-tuned LLMs monolingually on low-resource languages for the TOFU dataset (e.g., Finnish, Hindi, Malay, and Arabic). We observed a similar cross-lingual knowledge barrier phenomenon as in English.

To mitigate the barrier, we explored mixed fine-tuning using high-resource languages (English, French, German, Spanish, Italian) on the general-domain corpus (e.g., WikiText). Figures 9 and 10 show that it improved cross-lingual performance for low-resource languages (which were not used during the fine-tuning stage) on the Harry Potter quiz and MMLU tasks. A potential explanation is that exposure to frequent language switching during fine-tuning helps LLMs better adapt to settings where the knowledge is queried in different (often non-English) languages. These results show the potential of mixed fine-tuning as a generalization strategy for low-resource settings.

We will emphasize these results more clearly in the revised version. Thank you again for your valuable comments and support.

Review (Rating: 6)

This paper investigates the cross-lingual capabilities of multilingual large language models (LLMs) and identifies a significant "cross-lingual knowledge barrier" that hinders these models from effectively transferring knowledge across languages. The study evaluates state-of-the-art LLMs on cross-lingual tasks, including machine translation and crosslingual question-answering (QA), using datasets like MMLU and a manually curated Harry Potter Quiz. The authors find that while LLMs demonstrate promising surface-level cross-lingual abilities in machine translation, they struggle with deeper cross-lingual knowledge transfer, especially in knowledge-intensive tasks that require implicit cross-lingual correlation.

Reasons to Accept

  1. The authors propose novel cross-lingual evaluation formats (e.g., Mixup translation, Question+GT-option translation, GT-option translation) to directly test LLMs' cross-lingual capabilities. These formats effectively reveal the cross-lingual knowledge barriers faced by LLMs in multilingual settings, addressing the limitations of traditional monolingual evaluation methods.
  2. The authors deserve credit for carrying out extensive experiments on multiple datasets and languages, providing a systematic analysis of the cross-lingual knowledge barrier phenomenon.

Overall, I think it is a well-written paper.
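For concreteness, a minimal sketch of how such evaluation items could be built from an English MCQ, assuming a placeholder translate(text, lang) function (one plausible reading of the formats; the paper's exact construction may differ):

```python
import random

# Illustrative sketch of the mixed-language MCQ formats named above (one
# plausible reading; details may differ from the paper's exact construction).
# `translate(text, lang)` is a placeholder for any machine-translation system.

def gt_option_translation(question, options, answer_idx, translate, lang):
    """Translate only the ground-truth option; question and distractors stay in English."""
    new_options = list(options)
    new_options[answer_idx] = translate(options[answer_idx], lang)
    return question, new_options

def question_plus_gt_option_translation(question, options, answer_idx, translate, lang):
    """Translate both the question and the ground-truth option."""
    q, opts = gt_option_translation(question, options, answer_idx, translate, lang)
    return translate(q, lang), opts

def mixup_translation(question, options, translate, langs, rng=random):
    """Render the question and each option in an independently sampled language."""
    q = translate(question, rng.choice(langs))
    opts = [translate(opt, rng.choice(langs)) for opt in options]
    return q, opts
```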

Reasons to Reject

I would like to see more experiments on low-resource languages to see whether corpus size correlates with knowledge-transfer efficiency. The proposed method’s efficacy for truly low-resource settings requires deeper validation.

Comment

Thank you for your feedback! We address your questions and comments below.

I would like to see more experiments on low-resource languages to see whether corpus size correlates with knowledge-transfer efficiency. The proposed method’s efficacy for truly low-resource settings requires deeper validation.

We note that the pre-training corpora of most LLMs are not publicly available, making it infeasible to precisely quantify corpus size or directly correlate it with performance. Still, our evaluation includes 16 languages, covering a wide range of resource levels and linguistic characteristics, which provides meaningful insight into generalizability. In addition to five major European languages (en, fr, de, es, it) that are explicitly documented in model cards (e.g., Mistral), as discussed in line 230, we also evaluate:

  • (i) Low-resource languages: Malay (ms), Danish (da), Finnish (fi), Norwegian (no), Bengali (bn), and Amharic (am);
  • (ii) Languages with token distributions significantly different from English: Russian (ru), Chinese (zh), Hebrew (he), Arabic (ar), and Hindi (hi).

On both the MMLU benchmark (Figure 5) and the Harry Potter Quiz (Figure 7), we identify consistent cross-lingual knowledge barriers in off-the-shelf LLMs, with noticeably lower performance on lower-resource languages such as Finnish (fi), Bengali (bn), and Amharic (am).

Following the reviewer’s suggestion, we conduct additional experiments to analyze the cross-lingual knowledge barriers for monolingually FTed LLMs. We fine-tune Mistral-7B on the TOFU dataset in different low-resource/non-Latin languages separately. We report the difference in ROUGE-L recall scores between the fine-tuned and original models. Results (anonymous link) show that models generally perform best on the language used during FTing, but struggle to transfer the parametric TOFU knowledge acquired in one language to answer questions in other languages. This is consistent with our English-only FTed results in Figure 6, further confirming the limited cross-lingual transferability under monolingual FTing. In particular, for non-Latin languages such as Chinese (zh), Arabic (ar), and Hindi (hi), performance gains are smaller even when FTed and evaluated in the same language.

Lastly, we verify the effectiveness of our mixed-language FTing approach for languages not seen during FTing. As shown in Figures 9 and 10, models FTed on mixed high-resource languages (en, fr, de, es, it) have improved performance on MMLU / Harry Potter Quiz tasks across low-resource languages (ms, da, fi, no, bn, am) and typologically distant ones (ru, zh, he, ar, hi). While the gains are smaller for extremely low-resource/distant languages such as Hindi (hi), Bengali (bn), and Amharic (am), they still show meaningful improvements, supporting the generalizability of our method.

As the reviewer suggests, we will highlight these results more prominently in the main body of the paper in our revision.

Comment

Thanks for the response. After reading all reviews and author responses, I still find my original rating to be accurate and will keep it as is.

Comment

Dear Reviewer ar6x,

Thank you for your valuable comments and for acknowledging our response!

Review (Rating: 6)

This paper studies cross-linguality, where inputs mix multiple languages. The authors first show that English inputs and code-mixed inputs are close in the embedding space. Then, the authors randomly mix languages at fixed-length k-word/sentence/document units for MMLU, Harry Potter Quiz, and TOFU to identify cross-lingual barriers for general knowledge and domain-specific knowledge. Finally, the authors present mixed-language fine-tuning to address cross-lingual problems.

Reasons to Accept

  1. The presentation is good.
  2. The problem is important and practical as information in mixed languages is common in the real world.
  3. The authors show 3 strategies beyond classic word-level code-switching training and fine-tuning.

Reasons to Reject

  1. While the authors state "the same bias" in testing and fine-tuning, the final results may not be convincing, as the model is fine-tuned for the test set.
  2. The paper mainly focuses on European languages that significantly share the same culture, scripts, and other confounding factors, which may limit the generalizability of the findings.
  3. The general idea overlaps with an existing work https://openreview.net/pdf?id=HMa8mIiBT8, where entity-level code-switching is leveraged for identifying cross-lingual problems. However, this is a minor concern.

Comment

Thank you for your feedback! We address your questions and comments below.

  1. While the authors state "the same bias" in testing and fine-tuning, the final results may not be convincing, as the model is fine-tuned for the test set.

Unfortunately, there seems to be a misunderstanding regarding “the same bias” in FTing and testing. In fact, we only mentioned “same bias” once (line 298), in the context of inference-time mitigation using mix-up demonstrations. Our FTing setup, however, is fundamentally different from the evaluation settings and does not introduce any such bias. Specifically, as discussed in lines 348-356:

  • Different mixed-language granularity: Mixed-language FTing is applied at the document/sentence/k-word levels using general-purpose corpora. In contrast, our cross-lingual evaluations are conducted at the option/question levels.
  • Different corpora: The corpus used for FTing is the general WikiText-103/WikiText-2 (Wiki documents), which may not contain knowledge specific to the MMLU benchmark and has no overlap with the Harry Potter/TOFU tasks, reducing the risk of target-task leakage.
  • Broad performance improvements across evaluation formats: We show in Figures 10, 16, 17 that FTed models have improved performance not only on mixup MMLU (at option/question level) but also on variants such as question-translated, option-translated, and fully-translated formats. Additionally, consistent improvements are observed on fully translated HP-Quiz and MMLU tasks in multiple languages (Figures 8, 9, 10, 16, 17), showing the effectiveness of the FTing approach in broad cross-lingual settings.
  • Cross-lingual generalization to unseen languages: In Figures 9,10, we show that mixed-language FTed models on high-resource languages (en, fr, de, es, it) can improve performance across a wide range of unseen low-resource languages (ms, da, fi, no, bn, am) as well as on linguistically distant languages (ru, zh, he, ar, hi).

In summary, we avoid test-set bias by applying general-domain, mixed-language FTing at different granularities from those used in testing. Our extensive experiments demonstrate strong cross-lingual generalization to a wide range of unseen languages, unseen domain knowledge and diverse evaluation formats.

  2. The paper mainly focuses on European languages that significantly share the same culture, scripts, and other confounding factors, which may limit the generalizability of the findings.

Please note that we evaluated a total of 16 languages to show the generalizability of our findings. In addition to 5 major European languages (en, fr, de, es, it), we also evaluate:

  • (i) Low-resource languages: Malay (ms), Danish (da), Finnish (fi), Norwegian (no), Bengali (bn), and Amharic (am);
  • (ii) Languages with token distributions significantly different from English: Russian (ru), Chinese (zh), Hebrew (he), Arabic (ar), and Hindi (hi).

Concretely,

  • We identify the general cross-lingual knowledge barriers for off-the-shelf LLMs on MMLU benchmark (Figure 5) and HP-Quiz (Figure 7) across diverse languages.
  • We verify the general effectiveness of mixed-language FTing on languages that were not used during FTing (Figures 9, 10). LLMs FTed on mixed high-resource languages (en, fr, de, es, it) can improve cross-lingual performance on Harry Potter quiz and MMLU for low-resource languages (ms, da, fi, no, bn, am) and languages that are rather different from English (ru, zh, he, ar, hi).
  • Following the reviewer’s suggestions, we conduct additional experiments to analyze the cross-lingual knowledge barriers for monolingually FTed LLMs. Specifically, we fine-tune Mistral-7B on the TOFU dataset in different low-resource/non-Latin languages separately. We report the difference in ROUGE-L recall scores between the fine-tuned and original models. As shown in the anonymous link, LLMs perform best when evaluated in the same language as FTing, and struggle to transfer the parametric TOFU knowledge acquired in one language to answer questions in other languages. This trend is consistent with our English-only FTed results (Figure 6), confirming the general existence of cross-lingual barriers under monolingual FTing for different languages.

We will highlight these results more prominently in the main body of the paper in our revision.

Comment

  3. The general idea overlaps with an existing work https://openreview.net/pdf?id=HMa8mIiBT8, where entity-level code-switching is leveraged for identifying cross-lingual problems. However, this is a minor concern.

While we acknowledge some high-level similarity, the details are different. In particular, we believe our evaluation framework offers several key advantages over [1]:

  • [1] focuses solely on the mLAMA dataset, which uses a cloze-style format (e.g., "Paris is the capital of [MASK]") that enables entity-level code-switching. This format does not generalize well to other task types due to the difficulty of automatically identifying entities to switch. In contrast, we propose novel and scalable mixed-language evaluation formats (e.g., at option/question levels) for general (multiple-choice) question answering tasks, which are applied to diverse knowledge benchmarks, including general commonsense reasoning (MMLU) and domain-specific knowledge (HP-Quiz and TOFU).
  • The models evaluated in [1] are masked language models such as XLM-R and mT0. Their findings may not directly translate to modern causal language models, which are the focus of our study given their growing popularity and superior capabilities. We observe the crosslingual knowledge barrier on 15 modern LLMs, 16 languages, and 3 datasets.

In addition to the different evaluation protocols, we conducted new controlled experiments comparing mixed-language FTing versus code-switched FTing using equal corpus size. Following recent code-switched FTing methods [2, 3], we translated the English corpus into 4 languages {fr, de, es, it} and used Gemini-2.0 Flash with [2]’s prompt (App B.1) to generate code-switched samples. The final code-switched corpus combines samples from all 4 languages, each contributing 1/4 of the total training size.

Below is the exact prompt we used for code-switching fine-tuning data synthesis. Note that it is the same as in [2], except that we added a few instructions specifying the output format for easier parsing of the results:

Given the follow pair of text, one in English, encoded in the <English>...</English> block, and one in {target_language}, encoded in the <{target_language}>...</{target_language}> block, generate a code-switching text. Put the generated code-switching text inside a <CodeSwitching>...</CodeSwitching> block. Code-switching is the use of more than one linguistic variety in a manner consistent with the syntax and phonology of each variety.

<English>
{original_text}
</English>

<{target_language}>
{translated_text}
</{target_language}>

As shown in the anonymous link, mixed-language FTed models (word/sentence levels) perform better than code-switch FTed models. One possible explanation is that current code-switching techniques rely on an external LLM to decide which entity or phrase to switch, which may introduce noise, especially when the external model has limited multilingual capabilities. In contrast, mixed-language FT allows more controlled and flexible manipulation of language units (e.g., fixed-length k-words, sentences, or documents) without relying on external models.

Moreover, mixed-language FT has the practical advantage that it can be applied over the entire training corpus without increasing the overall data size, whereas code-switching often leads to data expansion proportional to the number of target languages due to the need for parallel sentence pairs.

As the reviewer suggests, we will add the new results to the main body of the paper in our revision.

Reference:

  • [1] Is Knowledge in Multilingual Language Models Cross-Lingually Consistent?
  • [2] Code-Switching Curriculum Learning for Multilingual Transfer in LLMs, ACL findings 2025
  • [3] Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding, ACL 2025

Comment

Thanks for your clarification. I have no further questions.

Comment

Dear Reviewer jfyw,

Thank you for your valuable comments and for acknowledging our response!

Comment

We sincerely thank all reviewers for their constructive feedback and suggestions, which are very helpful to us. We are encouraged that the reviewers found that our work:
(1) studies an important and practical problem (🔴 jfyw, 🟣 riAs)
(2) provides useful insights and promising evaluation/mitigation methods for the cross-lingual ability of LLMs (🔴 jfyw, 🔵 ar6x, 🟣 riAs, 🟢 JPW1)
(3) presents solid and promising results (🔵 ar6x, 🟣 riAs, 🟢 JPW1)
(4) is clearly written and presented (🔴 jfyw, 🔵 ar6x, 🟢 JPW1)

Following the reviewers’ suggestions, we added more experiments/discussions, and we addressed the questions in the response to each reviewer.

Summary for new experimental results:

  • Comparison to code-switched fine-tuning: We evaluate code-switched fine-tuning following state-of-the-art methods and find that mixed-language fine-tuning is more effective (🔴 jfyw, 🟣 riAs)
  • Identifying cross-lingual knowledge barriers for more languages: We fine-tune LLMs on the TOFU dataset in various low-resource/non-Latin languages separately and find that LLMs struggle to transfer the parametric TOFU knowledge acquired in one language to answer questions in other languages. (🔴 jfyw, 🔵 ar6x, 🟣 riAs)
  • New evaluation metric for English-centric domain-specific fine-tuning: We report the proportion of questions where the original model answers incorrectly but the English FTed model answers correctly. The large gap in improvement between English and non-English settings highlights the presence of a cross-lingual barrier even after fine-tuning. (🟢 JPW1)

Please also let us know if there are other questions, and we look forward to the discussion with the reviewers to further improve our paper. Thank you!

Final Decision

This paper assesses how well multilingual large language models (LLMs) perform cross-lingual tasks. The main contribution of the paper is in identifying a significant "cross-lingual knowledge barrier" that hinders these models from effectively transferring knowledge across languages. To address this problem, the paper presents mixed-language fine-tuning. The experiments mix languages at fixed-length k-word/sentence/document units for MMLU, Harry Potter Quiz, and TOFU, and show that the proposed solution mitigates the problem.

Reviewers generally feel positively about this paper, and so do I. The paper addresses a relevant problem and proposes an effective, simple solution to this problem, as well as a way to evaluate cross-lingual performance. Experiments are convincing. The main weaknesses are the relatively low novelty of the proposed solution and the focus on high-resource European languages.