6.0

/10

Poster3 位审稿人

最低6最高6标准差0.0

3.7

置信度

COLM 2025

Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Zihao Li,Shaoxiong Ji,Hengyu Luo,Jörg Tiedemann

OpenReview PDF

提交: 2025-03-19更新: 2025-08-26

TL;DR

This study evaluates 36 continual pretraining configurations across three multilingual LLMs and 30+ languages, analyzing their effects across resource levels and language behavior categories.

摘要

关键词

Multilingual continual pretrainingdata mixing

评审与讨论

审稿意见

评分: 6置信度: 32025-04-25

This paper investigates how fine-tuning a model on either monolingual or bilingual data of different languages (called continuous pretraining here) affects their performance on X. They run experiments:

using 3 base models (Llama 3.1-8B, Llama-2-7B, and Viking-7B);
fine-tuning models using monolingual or bilingual data;
fine-tuning models using only languages categorised as altruistic, selfish, or stagnant;
fine-tuning models while including code in their fine-tuning dataset.

These results in 36 models, which are evaluated separately on:

two benchmarks (SIB-200, for topic classification, and FLORES-200, for machine translation);
on a set of high-, medium-, or low-resource languages.

The authors find that fine-tuning models on bilingual data hurts their performance on machine translation, as models are more prone to switch language mid-generation when fine-tuned on such data. Fine-tuning on monolingual data, however, improves translation from mid- and low-resource languages. Further, for the best performing model (Llama-3.1-8B), fine-tuning on either bilingual or monolingual data hurts classification performance on all, except low-resource languages. In general, fine-tuning models on code data seems to hurt translation tasks, but help on classification.

接收理由

The paper provides a large set of experiments: fine-tuning models on 36 configurations, and evaluating them on two benchmarks.

The paper is well-written and relatively easy to read.

The paper evaluates both high-, mid- and low-resource languages.

拒绝理由

While the paper provides a relatively large set of experiments, I think these experiments are not particularly informative. The settings in which the model are trained are, in my opinion, not fine-grained enough to be particularly informative about:

the impact of language choice on fine-tuned model performance. Models are fine-tuned on either altruistic vs. selfish vs. stagnant languages. These classes are defined at a language-level (as opposed to a language-pair level) and include either 6 or 5 languages each in the experiments. I think it’s hard to extract many insights from these experiments.
the impact of monolingual vs. bilingual fine-tuning. A single way of structuring bilingual data is used, which could strongly influence the paper’s results on language generation. Further, the authors seem to fine-tune the models on an entire sequence of form [source language]: [source] [target language]: [target], but if tokens [target language]: did not receive cross-entropy loss updates, maybe the model would not switch languages mid-generation.

给作者的问题

Line 28: “This imbalance deepens the digital language divide and limits the inclusivity of NLP technologies.”

A recent paper showed that Llama models manage to leverage the computational circuits learned in a high-resource language (i.e., English) to improve other language’s performance (Wendler et al. 2024), and other recent work suggests that language imbalance may actually be a key component for this cross-lingual transfer (Schäfer et al. 2024).

Line 239: “Llama-2-7B struggles across both configurations, with bilingual CPT reducing accuracy to 24.26% (vs baseline 20.59%, +17.8%) and monolingual CPT performing slightly better (26.80%, +30.2%).”

Should this say “increasing accuracy” instead of reducing?

Line 249: “Llama-2-7B degrades significantly with bilingual CPT (20.84%, +19.8%) and shows minimal gains with monolingual CPT (24.51%, +40.9%). Viking 7B benefits substantially from bilingual CPT (29.25%, +66.5%), while monolingual CPT slightly underperforms (21.33%, +21.4%).”

Does Llama-2-7B “degrade” or improve with bilingual CPT? Aren’t those improvements over the baseline?

References

Schäfer et al. 2024. The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments
Wendler et al. 2024. Do Llamas Work in English? On the Latent Language of Multilingual Transformers

2025-05-31

We thank the reviewer for their positive assessment of our paper's extensive experimental setup, clarity, and evaluation across diverse language resource levels.

We would like to address the reviewer's points:

Reasons to Reject:

R1: Informativeness of experiments regarding language choice and bilingual fine-tuning:
- Our primary goal with these categories was to empirically test the generalizability and stability of the classifications proposed by Yuan et al. (2024) across a wider range of CPT configurations (3 base models, varied data types including code, different resource levels). Our finding, as detailed in Section 3.4, is that these language classifications often do not generalize under these varied conditions. For example, "altruistic languages are not always helpful and often negatively impact related languages", and "stagnant languages demonstrate unexpected adaptability under specific training settings". We believe this is a significant insight, highlighting that the impact of a language in multilingual CPT is more complex and context-dependent than such fixed categories might suggest. The selection of 5-6 languages per category aimed to provide a representative sample while keeping the 36 full CPT model trainings computationally feasible.
R2: Bilingual data structure used for studying the impact of monolingual vs. bilingual fine-tuning:
- We agree that the specific data format influences outcomes. We follow prior works, e.g., Tower: An Open Multilingual Large Language Model for Translation-Related Tasks | OpenReview, and choose widely used format for bilingual data,[source language]: [source] [target language]: [target], which offers a straightforward way to present parallel data to CPT .
- The suggestion to alter the loss masking strategy (e.g., not applying cross-entropy loss to [target language]: tokens) is an interesting idea for future work focused on mitigating such language mixing. However, the scope of our current study was to evaluate the effects of common CPT data strategies, and our findings provide insights into optimizing the fine-tuning process to avoid all potential negative behaviors for future works.

Questions To Authors:

Q1: Line 28: "This imbalance deepens the digital language divide and limits the inclusivity of NLP technologies."
- We thank the reviewer for pointing out this interesting recent research. We will consider adding a brief nuance to this statement or in the introduction.
Q2: Line 239: "Llama-2-7B struggles across both configurations, with bilingual CPT reducing accuracy to 24.26% (vs baseline 20.59%, +17.8%) and monolingual CPT performing slightly better (26.80%, +30.2%)."
- Thank you for this close reading. Yes, the percentages (+17.8% and +30.2%) indicate an increase in accuracy relative to the Llama-2-7B baseline of 20.59%. The term "struggles" was intended to convey that Llama-2-7B's absolute accuracy remains low despite these relative gains, especially when compared to a stronger base model like Llama-3.1-8B (which has a baseline of 65.85% for mid-resource languages). The wording "reducing accuracy to 24.26%" is indeed incorrect if interpreted as a reduction from baseline; it should reflect the new accuracy level achieved. We will rephrase this for clarity.
Q3: Line 249: "Llama-2-7B degrades significantly with bilingual CPT (20.84%, +19.8%) and shows minimal gains with monolingual CPT (24.51%, +40.9%)."
- Again, we thank the reviewer. The values reflect performance increases over the Llama-2-7B baseline for low-resource languages (approx. 17.40%, based on Figure 2 values, so +19.8% leads to 20.84%). The term "degrades significantly" with bilingual CPT (20.84%) was intended to compare its performance to monolingual CPT (24.51%) for the same model or to highlight its still low absolute score, rather than implying a degradation from its own baseline. We clarify this in the revision.

2025-06-06

I thank the authors for their response, but I'm keeping my scores.

The authors state that:

Our primary goal with these categories was to empirically test the generalizability and stability of the classifications proposed by Yuan et al. (2024) across a wider range of CPT configurations (3 base models, varied data types including code, different resource levels).

By framing your paper as testing Yuan et al.'s (2024) categorisation, however, you inherently tie your paper's contributions to the significance of the categorisation you are testing. This is also mentioned as a weakness by reviewer Betk:

[Weakness] The choice to frame this work primarily around the categories introduced by Yuan et al. (2024): altruistic, selfish, stagnant. There is a large body of work that suggests that the efficacy of crosslingual transfer, either in pretraining or CPT, is linguistic similarity [1, 2, 3, inter alia].

I believe that, currently, the paper does a poor job of motivating or explaining this categorisation, and without such a clear motivation, the paper's contributions are lacking. Again, I link to reviewer Betk's review:

The authors should explain the categorization introduced by Yuan et al., 2024. Currently, the reader needs to go to the original paper to understand this categorization at all. It should be explained thoroughly (in at least 2-3 sentences) before the terms are used in the abstract and in Lines 60-61.

To improve the paper, I would recommend one of two things (or both): (i) better motivate Yuan et al.'s categorisation (you need to convince the reader of the importance of empirically testing it); or (ii) extend your paper to also test other categorisations, such as ones based on linguistic similarity.

2025-06-07

Thank you for the follow-up discussion. We want to clarify:

We study many factors, including data types (monolingual and bilingual texts and code), language resources (high-, medium-, and low-resource), language categories (Yuan et al's as an instance), and base LLMs.
Among those, the language category is one factor, and is studied in one subsection of our paper. We choose Yuan et al.'s (2024) categorization, which is the newest one. By testing it, we find its generazability issue and share our insights, i.e., the 3rd finding (see line 74-82). Our paper's contributions are not tied to the significance of any existing categorization, but instead, our paper calls for rethinking the existing categorization.
The statement "Our primary goal with these categories ..." has its context which is about the discussion in Sec. 3.4. Yuan et al. (2024) is not framed as the whole of this paper.
We thank your recommendation. Point 1 is a good point but again, we want to remind this is just about a subsection of this paper.
As for the recommended point 2, our aim is not to prove the generalizability of all theories of language categorization. Some studies, such as [1], have already shown that correlations between language similarity and performance vary, which stresses the generalizability of linguistic similarity theory. As our paper focuses on more broader factors, we choose Yuan et al. (2024), the latest language categorization as an instance. Our findings support the call for more rigorous investigations on the complexity of multilingual interactions in CPT.

[1] https://aclanthology.org/2025.coling-main.175/

2025-06-10

Dear Reviewer,

Thank you again for your critical follow-up. Your feedback has been very helpful in identifying how to significantly strengthen our paper's framing.

We understand your core concern about the paper's narrative being too tightly coupled with the Yuan et al. (2024) case study. Based on your feedback, we will reframe the introduction and conclusion to focus on the broader research question of whether static behavioral classifications for languages are generalizable in CPT. We will then present Yuan et al. as a timely and important case study to investigate this larger point.

We are confident this reframing will directly address your concerns about the paper's contribution and significantly improve the manuscript. We thank you for pushing us to make our work stronger.

审稿意见

评分: 6置信度: 42025-05-11

This paper aims to characterize the factors that influence the success of continual pre-training to improve multilingual performance. The authors test the impact of language categories identified in previous work (altruistic, selfish, stagnant), monolingual versus bilingual data, and the addition of code data.

The main limitations of the paper are the lack of engagement with the literature in related areas and methodological concerns. The lack of engagement with the literature limits the depth of the discussion about the results. There is prior work which would predict some of the main results from the paper. By framing the paper with respect to a very specific set of papers, the novelty of some of the findings may be overstated.

I name several methodological concerns, namely that the base models chosen for the experiment were not trained on all of the languages used in the experiments, and the evaluations are inadequate for drawing the conclusions the authors wish to draw. Additionally, the authors should include more discussion of related work.

接收理由

This study manipulates many factors and uses a very diverse sample of languages. These experiments do help to understand dynamics of CPT in a meaningful way, although there is still much to understand.
The CPT setup seems generally carefully controlled. I believe the models could make a good resource for further study, especially as the authors say they intend to release them.

拒绝理由

The choice to frame this work primarily around the categories introduced by Yuan et al. (2024): altruistic, selfish, stagnant. There is a large body of work that suggests that the efficacy of crosslingual transfer either in pretraining or CPT is linguistic similarity [1, 2, 3, inter alia]. The authors should at least acknowledge this work and the predictions that come from it.
There is related work that predicts the different results about high-, medium-, and low-resource languages benefitting differently from added data (in pre-training) [4]. The authors should discuss the similarity of these findings.
The experiments related to adding code data are poorly motivated. Previous work shows the benefits of code data for reasoning and handling structured information, which are not very closely related to the tasks the authors evaluate the model on. I suggest either adding more evaluations (see below) or providing more predictions about why adding code data should improve performance on the tasks chosen. Additionally, the authors use a significantly higher proportion of code data than is recommended and justify this as “reasonable for enhancing reasoning tasks” (L143). Again, the authors do not evaluate on reasoning tasks. I think this design choice should be better justified. Perhaps using too much code data means that this paper is underestimating the efficacy of added code data.
In the selection of related languages (Table 1), there is a wide range in relatedness. For instance, I believe Belarusian and Russian are much more similar to each other than Khmer and Vietnamese are to each other. I think the authors should calculate linguistic similarity with lang2vec (https://github.com/antonisa/lang2vec) and do an analysis to see whether the degree of linguistic similarity explains variance in the results. Linguistic similarity is likely playing a role in the results [4].
The evaluations are not sufficient to make the claims that the authors make in this paper. The two tasks the authors evaluate on are translation and text classification. Neither of these is a core language modeling task, nor do they relate to the predictions about reasoning that the authors make in the paper. I recognize that finding benchmarks that cover all the languages in this paper is extremely difficult. I first suggest using perplexity as a metric, perhaps on held-out parallel data not used in CPT. This is not an ideal evaluation but at least it would be available for all the languages and can better evaluate core next-word prediction performance. It is important that if the authors do this that they do not compare perplexity across different models/tokenizers, and only compare relative changes in performance within each language. MultiBLiMP (https://arxiv.org/pdf/2504.02768?) came out since this paper was submitted and I believe covers all the languages in this paper. This would also help evaluate next-word prediction/grammatical text generation performance. For more complex tasks, GlobalMMLU (https://arxiv.org/pdf/2412.03304), MGSM (https://arxiv.org/pdf/2210.03057), or translated HellaSwag (https://arxiv.org/pdf/2307.16039) might help understand how the CPT regimes affect performance on tasks of interest to many in the NLP community. Each of these has its disadvantages and I believe none of them covers all the languages in the sample, but they all do cover at least some of the lower resource languages. If the authors add perplexity, MutliBLiMP, and at least one of the other tasks, this would strengthen the claims of the paper.
The choice of base models seems very problematic. The Viking model was trained only on Finnish, English, Swedish, Danish, Norwegian, Icelandic and code data. For these experiments, the base models should be pretrained on either none of the languages or all of the languages used in the CPT experiments. Otherwise, it seems that the pretraining data mixture may be influencing the results in a way that is currently not taken into account by the authors. Perhaps consider XGLM (https://arxiv.org/abs/2112.10668), which I believe covers all the languages.
The “unexpected” language generation in Section 3.2.1 seems to me to be a direct consequence of the bilingual CPT regime, in which the authors have essentially fine-tuned the models to generate pairs of sequences in two different languages. The example on L203 shows that both sentences are about diabetes (as was the source sentence, Figure 6). Furthermore, all the cases of this phenomenon for which the authors provided examples are the exact language pairs that the authors created. This should at the very least be mentioned in Section 3.2.1. I believe the authors should reconsider the discussion, however, to more deeply reflect on this behavior and how it might be a consequence of the bilingual CPT.
It might also be good to add a quantitative analysis on how frequently this happens (similar to https://arxiv.org/html/2502.12476v1 or https://github.com/Pleias/language_adherence_tests).

[1] https://arxiv.org/pdf/1905.12688

[2] https://arxiv.org/pdf/2105.05975

[3] https://dspacemainprd01.lib.uwaterloo.ca/server/api/core/bitstreams/f9646594-910e-4b67-9ce9-56a1677158d3/content

[4] https://aclanthology.org/2024.emnlp-main.236/

给作者的问题

The authors should explain the categorization introduced by Yuan et al., 2024. Currently, the reader needs to go to the original paper to understand this categorization at all. It should be explained thoroughly (in at least 2-3 sentences) before the terms are used in the abstract and in Lines 60-61.
I think the authors should support the claim in L26-27 (“their performance relaims highly uneven across languages”) with citations [5, 6, 7]
I think Table 2 is not necessary to include in the main body of the paper, and can be moved to an appendix. The content is sufficiently explained in L158-163.
L130-132: the authors say they select 15 languages, but 6+6+6 = 17.
Is it problematic that some languages have only one related language? Why not just let each target language have only one related language?
The explanation of the language pairs is extremely confusing. It is almost impossible for me to tell which languages a model was CPT on and what it was evaluated on. For example, I understand from Table 1 that Mandarin Chinese is one of the training languages. And Cantonese is its related language (L 285-286). So is the model always trained on Cantonese when Chinese is the target language (for that condition)? Then why is there a Cantonese model/why is Cantonese evaluated in addition to Chinese (Table 4)? I find this whole part of the description very unclear, which makes it difficult to understand a key component of the paper.
This paper seems to be related to work on the curse of multilinguality [8, 4] and negative interference [9]. The authors should engage with this literature.
Could the authors bold the model/condition that gets the best performance for each language in the tables in the appendix (e.g. Table 7) to aid the reader in interpreting the results. Would it be possible to also make plots for the results? It is very hard to understand the results in the current form.
I would have predicted that bilingual CPT would improve more for translation than SIB-200. Do the authors have any hypotheses about why this happens?
The order of the task results is different in Figures 1-2 and Figures 3-4. Could the authors flip them (and change text accordingly) to aid the reader in drawing comparisons between the two sets of results?
Could the authors add a related works section and discuss some of the alternatives to CPT, e.g. adapters and fine-tuning (see related work subsection ‘language adaptation’; https://arxiv.org/pdf/2212.09535)? What is unique about CPT? How might these results relate to those other methods?
The authors should engage with the work on universal or super donor languages, e.g. https://aclanthology.org/2024.loresmt-1.10.pdf. This seems related to the categorization by Yuan et al. (2024)

[5] https://www.nature.com/articles/s41562-023-01716-4

[6] Atari, M., Xue, M. J., Park, P. S., Blasi, D., & Henrich, J. (preprint). Which humans?.

[7] https://arxiv.org/pdf/2004.09095

[8] https://aclanthology.org/2020.acl-main.747/

[9] https://aclanthology.org/2020.emnlp-main.359/

2025-05-31

Questions To Authors:

Q1: Explain Yuan et al. (2024) categorization:
- We will add a more concise 2-3 sentence summary of these categories earlier in the paper to improve upfront clarity for the reader.
Q2: Citations for L26-27:
- We will review the suggested citations [5, 6, 7 from review].
Q3: Table 2 necessity in main body:
- Thanks for the suggestion, we will move it to appendix.
Q4: L130-132 language count:
- We thank the reviewer for catching this. This was a writing oversight on our part. Just like the stagnant, the six languages in both the altruistic and selfish categories also include English. Therefore, the total number of unique languages is 15.
Q5: Only one related language for some training languages:
- Our goal was two related languages per training language, but due to benchmark availability and low-resource languages like Khmer, it is challenging to find two highly related languages.
Q6: Confusion about training vs. related languages evaluation (Chinese/Cantonese example):
- A model (e.g., L3-Mono-Alt) is trained on a set of "training languages" within a specific category (e.g., for the Altruistic category, training languages include zho Hani, ceb Latn, etc. ). This single CPT model is then evaluated on all its designated training languages (e.g., zho Hani) AND on their respective predefined related languages (e.g., yue Hant is a related language for zho Hani; tgl Latn for ceb Latn, etc., as per Table 1 ). In this example, Cantonese (yue Hant) is not part of the CPT training data for L3-Mono-Alt; it's an unseen related language used purely for evaluating cross-lingual transfer from zho Hani. Table 4 shows results for all these languages when evaluated using models trained on the "Altruistic" set of training languages. Training language columns are shaded in Table 4.
Q7: Engage with curse of multilinguality / negative interference literature:
- These are indeed relevant concepts to the challenges of multilingual modeling that CPT aims to address. We will integrate a brief discussion of "curse of multilinguality" and "negative interference" into our introduction or an expanded related work section.
Q8: Bolding best performance in appendix tables / add plots:
- We will bold the best-performing model/condition for each language in the appendix tables (Tables 4-12) to improve readability.
Q9: Hypothesis why bilingual CPT didn't improve translation more than SIB-200:
- Actually, our findings show that bilingual CPT significantly hampered translation quality due to language mixing issues, while it improved classification accuracy for medium- and low-resource languages.
Q10: Order of the task results in Figures 1-2 and Figures 3-4:
- We will revise to ensure a consistent order for presenting task results.
Q11: Add related works section (alternatives to CPT like adapters/fine-tuning):
- We will expand Appendix A.1 to briefly discuss other language adaptation methods.
Q12: Engage with the work on universal or super donor languages:
- Thanks for pointing out this paper; we will include a brief discussion of universal/super donor languages in relation to Yuan et al.'s categories and our findings on cross-lingual transfer, likely within the expanded related work.

评论- Response to Authors

2025-06-06

Our choice of a ~33% code data proportion is explicitly justified by citing Aryabumi et al. (2024), who note that while 25% is recommended for balancing language and code performance, "33% remains reasonable for enhancing reasoning tasks".

33% code represents approximately one-third more code data than is justified by Aryabumi et al. I disagree that this decision is sufficiently justified. I worry that it could have an impact on the results.

For each language category, such as Altruistic, our training set consists of a group of five languages

Yes, I misunderstood this. Please clarify this in the updated manuscript. I found the training details quite confusing in the current draft.

Evaluation Sufficiency (Tasks and Benchmarks)

I understand that evaluation is very difficult. I think the NLL results help. I think in the future, it would be more important to base language selection on what reasoning benchmarks exist if the goal is to make claims about reasoning. As it stands, I do not think the results support claims about reasoning. I recommend that the authors reframe the reporting of the results and conclusions accordingly.

Choice of Base Models

This reasoning needs to be clearly stated in the paper. I think it is key for understanding how to interpret the results. I still worry about the unequal language coverage of Llama.

Q5: Only one related language for some training languages

This does not directly address my question. Do you think this is problematic or affects the results? I recommend including some discussion on this point or adding to a limitations section.

2025-06-06

Thank you for your continued engagement and insightful follow-up comments. Your feedback is invaluable for improving the clarity and rigor of our work. Please find our responses to your remaining points below.

On the Justification of the ~33% Code Data Proportion

We first want to clarify that our research context of continual pretraining differs significantly from the "train from scratch" setting in Aryabumi et al. (2024). In CPT, a model with pre-existing knowledge is adapted, and the optimal data mixture for this process may differ from what is best for initial pre-training.

Our primary goal was not to identify a universally "optimal" ratio but to treat the "high proportion of code" as a fixed experimental variable. We intentionally chose a ~33% proportion to investigate the specific effects and trade-offs that emerge when a strong code signal is introduced during CPT. Our findings—that this configuration significantly boosts classification performance while creating a trade-off with generative quality—were revealed with greater clarity due to this more aggressive mixture. We agree that exploring the optimal ratio is a crucial question, and we will explicitly position it as a direction for future work in the revised manuscript.
On Evaluation Sufficiency and Reasoning Claims

We will revise the paper's narrative to shift the focus from "enhancing reasoning." Instead, we will ground our motivation in how code, by improving the handling of structured information, could indirectly benefit complex multilingual tasks. Our conclusions will be centered squarely on our empirical findings: that code integration acts as an effective "scaffold" for multilingual classification accuracy but introduces a task-dependent trade-off with generation quality.

We will explicitly state in our limitations section that the study does not directly evaluate reasoning. We will also adopt your valuable suggestion and propose in our future work section that subsequent research could strategically select languages based on their coverage in existing reasoning benchmarks.
On Having Only One Related Language for Some Training Languages

Thank you for giving us the opportunity to clarify our evaluation method for Section 3.4. We understand the concern that a single related language could affect the stability of the results.

Our evaluation in Section 3.4 is based on an aggregated assessment. For each language category (e.g., "Altruistic"), we trained one model on the set of all its training languages. We then evaluated this model's performance on the corresponding group of all related languages (e.g., for the Altruistic category, this group consists of 8 different languages). Our conclusions about whether a language category hypothesis holds are based on the average performance change across this entire group.

To add to this, we faced a methodological choice: we could have enforced uniformity by selecting only one related language for every training language. However, we believe this would have been a step backward, as it would mean intentionally discarding valuable evaluation data for languages where more than one related language was available in the benchmarks. Our approach was to be as comprehensive as possible. By including two related languages where feasible, we created a more robust evaluation for the category as a whole. This choice enhances the validity of our aggregated results, as they are based on a larger total pool of evaluation data than a uniformly restricted approach would have allowed.

Therefore, while a specific training language may only have one related language, its result is aggregated into the category's overall average. This aggregation mitigates the impact of a single data point on the high-level conclusions drawn in Section 3.4.

Nevertheless, we agree that your point is valid in principle. We will add a note to our limitations section acknowledging that for a more granular analysis, having multiple related languages for every training language would be ideal, and this was a constraint imposed by benchmark availability.

We hope these clarifications and proposed revisions fully address your concerns.

评论- Response to Authors

2025-06-10

I thank the authors for engaging with my feedback so thoroughly! I will raise my score, as the authors have adequately address many of my comments.

2025-05-31

We sincerely thank the reviewer for their detailed feedback and recognition of our extensive experimental setup. We also appreciate the acknowledgment that our experiments contribute meaningfully to understanding CPT dynamics and that the released models could be a valuable resource.

We would like to address the reviewer's points:

Reasons to Reject:

R1: Framing around Yuan et al. (2024) and Engagement with Linguistic Similarity Literature:
- The primary motivation for framing our work around Yuan et al. (2024) is to test the generalizability of their proposed behavioral classifications (altruistic, selfish, stagnant) under a wide array of CPT conditions. This systematic validation across 36 configurations is a novel contribution, as prior validation was in narrower settings. Our findings show these classifications do not generalize to a broad setting.
- We will enhance our introduction/related work to more explicitly cite and discuss the established role of linguistic similarity.
R2: Related Work on Differential Benefits for High/Medium/Low-resource languages from Added Data:
- We appreciate the pointer to related work predicting differential benefits for languages based on resource levels during pretraining. Our work focuses on Continual Pretraining, which involves adapting already pretrained models and presents distinct challenges and dynamics (e.g., leveraging or interfering with existing representations, catastrophic forgetting).
- We will incorporate a discussion of the suggested prior work (e.g., [4] from the review) in our related work section, contextualizing how findings from pre-training might relate to or differ from those in CPT.
R3: Motivation for Code Data Experiments & Proportion of Code Data:
- While our evaluation tasks (classification and translation) are not direct reasoning tasks, the inclusion of code is motivated by prior work indicating that code enhances reasoning and structured information handling, which can indirectly benefit complex classification and nuanced translation.
- Our choice of a ~33% code data proportion is explicitly justified by citing Aryabumi et al. (2024), who note that while 25% is recommended for balancing language and code performance, "33% remains reasonable for enhancing reasoning tasks". Our aim was to explore a configuration with a significant code signal. We acknowledge that different proportions might yield different trade-offs.
R4: Linguistic Similarity Variance in "Related Languages" & lang2vec Analysis:
- There might have been a misunderstanding in our training setting. For each language category, such as Altruistic, our training set consists of a group of five languages (zho_Hani, ceb_Latn, mar_Deva, zul_Latn, khm_Khmr). We did not perform training on the related languages. Instead, we identified a set of related languages (yue_Hant, tgl_Latn, ilo_Latn, hin_Deva, npi_Deva, xho_Latn, ssw_Latn, vie_Latn) for this group and used them only for testing, in order to evaluate whether cross-lingual transfer reflects the Altruistic property.
- We acknowledge that the linguistic similarity between the training languages and the selected related languages may vary across different categories. However, we want to emphasize that each model is trained on a group of languages rather than a single language. Moreover, for low-resource languages like Khmer, it is challenging to find highly related languages, so using the "language evolutionary tree" is a reasonable choice.

2025-05-31

R5: Evaluation Sufficiency (Tasks and Benchmarks):

We selected SIB-200 and FLORES-200 for their extensive multilingual coverage, aligning with our goal of assessing CPT across many languages, including low-resource ones.
Regarding the comment that our evaluations "nor do they relate to the predictions about reasoning that the authors make in the paper," a significant challenge was the lack of existing reasoning benchmarks that cover all languages in our study.

We acknowledge the reviewer’s suggestion to report perplexity (PPL), although it is still not a metric for a model's reasoning ability. We chose to evaluate using negative log-likelihood (NLL) on the MaLA validation set.

The NLL is defined as:

$NLL=-\sum_{i=1}^{n_i}log p_{\theta}(x_i|x_{<i})$

While PPL is computed as:

$PPL=exp\lbrace-\frac{1}{n_i}\sum_{i=1}^{n_i}log p_{\theta}(x_i|x_{<i})\rbrace$

PPL evaluates a model’s ability to predict tokens in a given corpus, while NLL measures the overall likelihood of the corpus under the model. Due to its length normalization, PPL is directly influenced by the tokenization scheme, whereas NLL remains unaffected (Luo et al., 2025). As our work involves evaluating multilingual models with diverse tokenizers, NLL offers a fairer and more stable metric in this context. We compute NLL by concatenating the input sentences and applying a strided sliding window of size 1024.

Our NLL results (lower is better) are summarized below:

			Resource Level
Base Model	Data Combination	High	Mid	Low
Llama-3.1-8B	Monolingual	4.8864	2.0375	4.2013
	Monolingual+Code	4.5701	1.9522	4.1389
Llama-2-7B	Monolingual	5.0080	1.7609	4.3308
	Monolingual+Code	4.7359	1.7042	4.2289
Viking-7B	Monolingual	6.9916	1.9506	4.5367
	Monolingual+Code	6.6029	1.8369	4.4850

These results consistently show that for all three base models and across all resource levels, the Monolingual+Code CPT configurations yield lower (better) NLL scores compared to the Monolingual CPT configurations. This suggests that the inclusion of code data in continual pretraining not only aids in downstream tasks like classification (as shown in paper section 3.3) but also improves the models' fundamental ability to predict and assign likelihood to unseen text, as measured by NLL.

We appreciate the suggestions for newer benchmarks, but covering all our languages is challenging. In the case of MultiBLiMP, nine of the languages we selected are not supported by it.

R6: Choice of Base Models (Viking Pretraining Data):
- We chose the Viking-7B (Nordic languages, English, and code) model precisely because it was not pretrained on our selected CPT languages, except for English. This roughly aligns with the reviewer's suggestion that the base model should be pretrained on none of the target languages.
- We chose the Llama-3.1-8B model because it was pretrained on multilingual languages, which is aligned with the reviewer's suggestion to use XGLM. We also used Llama-2-7B because it is less multilingual and English-centric. The design choice allows us to assess how different CPT strategies interact with varied initial model states and existing multilingual capabilities

2025-05-31

R7: "Quantitative Analysis of Language Mixing Frequency:
- We agree that a dedicated quantitative analysis of the frequency and nature of language mixing would be a valuable addition for a deeper understanding of this phenomenon. For the current paper, our primary focus was to identify this critical issue, illustrate it with clear examples through a manual case study (as presented in Appendix A.5, Figure 6 ), and measure its impact on a standard task-specific metric (BLEU).
- We opted for this qualitative case study approach due to the current limitations of fully automated methods for fine-grained language adherence and identification, especially across the many low-resource languages central to our study. Existing tools often struggle with the nuances of code-switching or language mixing in these contexts, potentially leading to unreliable quantitative measures.
- Therefore, a robust and large-scale quantitative analysis, particularly for low-resource languages, would likely necessitate further development of specialized tools, potentially like extensions of methodologies such as CoCo-CoLA (as referenced by the reviewer) if they can be effectively adapted. This represents a significant research direction in itself. We will highlight this as an important avenue for future work. Our current study lays the groundwork by clearly identifying and exemplifying the problem within the CPT framework.

审稿意见

评分: 6置信度: 42025-05-13

The paper explores the effect of continual pretraining (CPT) on a wide range of languages. It uses three base models: LLaMA 3.1-8B, LLaMA 2-7B, and Viking-7B. Languages are selected based on prior work that classifies them as altruistic, selfish, or stagnant. Each group includes a mix of high, medium, and low-resource languages, defined by token count in the CPT corpus. The training data includes monolingual text and bilingual translation pairs, and code from 32 programming languages. The models are evaluated on two benchmarks: SIB-200 for classification (accuracy) and FLORES-200 for machine translation (BLEU). Results show that bilingual CPT improves classification but causes language mixing in generation. Adding code data boosts classification, especially for low-resource languages, but slightly harms generation quality.

接收理由

The paper presents an extensive study on a wide range of models and languages.
Rigorous test of the generalizability of the language classification proposed by (Yuan et al, 2024) across CPT settings.

拒绝理由

The languages are grouped into high, medium, and low-resource based on CPT data alone, without accounting for whether those same languages were already well-represented (or absent) in the base model’s pretraining corpus. It is difficult to understand the true impact of CPT strategies, as observed gains or losses may be influenced by prior language familiarity rather than the CPT configuration itself. A deeper analysis on this could be helpful.
There could have been more analysis on language at script level. Do low-resource CPT languages gain more from CPT due to script-level transfer?
(Section 3.3) The effectiveness of code data in classification vs. generation has been shown empirically, however, it can benefit from providing an explanation behind better performance in classification.

给作者的问题

In each bar-plot figure, the y-axis could be in the same range for effective comparison.
Under section 3.2.2 (Low-resource languages), “compatible pretraining” could be described in detail to explain the degraded performance of bilingual CPT on low resource languages. How does one know if V7B and L38B have compatible pretraining?
Did you also experiment with code data + bilingual data?
Is there a rationale behind choosing data science-related code data?

2025-05-31

We thank the reviewer for their thorough assessment and constructive feedback on our manuscript.

We would like to address the reviewer's points:

Reasons to Reject:

R1: Influence of base model’s prior familiarity with pretraining languages on CPT behaviors, and the groups of language resources.
- The exposure to languages during pretraining is an important problem in CPT research. Our study design assesses the relative impact by employing three base models with diverse pretraining characteristics, e.g., with models pretrained on various training data and languages according to public information. They are Llama-3.1-8B (described with "extensive, multilingual sources" ), Llama-2-7B (described with "less multilingual data distribution" and as "English-centric" ), and Viking-7B (pretrained "mainly on Nordic languages, English, and code" ). The varied results observed across these distinct baselines help in discerning the impact of CPT strategies versus pre-existing knowledge and language exposure (e.g., English-centric, English+Nordic, and extensively multilingual).
- Furthermore, our language resource classification (high, medium, low) is explicitly based on the token counts within our CPT training data (derived from the Lego-MT dataset), which is a common practice[1][2].
- We acknowledge the impact of the current opacity of pretraining data for these open-source models on our research. We will add content to the discussion section, explicitly acknowledging this nuance and how our choice of multiple models provides insights into the behaviors of CPT models, continuing-pretrained from different base models.

[1] https://aclanthology.org/2024.emnlp-main.236

[2] https://arxiv.org/abs/2409.17892

R2: Analysis of script-level transfer, suggesting more analysis on language at the script level, particularly whether low-resource CPT languages benefit more from script-level transfer.
- This is an insightful avenue for future research. However, the primary focus of our current work is to systematically evaluate data mixing strategies (monolingual, bilingual, code-augmented) and the generalizability of language classifications (altruistic, selfish, stagnant ) across different CPT configurations and resource levels. A detailed script-level analysis that requires extensive experiments on various training corpora in different scripts, while valuable, extends beyond this defined scope.
- We will add this as a promising direction for future work in our conclusion section.
- Nonetheless, we reported per-language performance with script information provided. This allows researchers who are interested in script-level analysis to find some preliminary results.
R3: Explanation for code data's effectiveness in classification (Section 3.3).
- Our paper posits that "Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages" and acts as an "effective scaffold for representation learning" or "acts as a 'scaffold' to improve classification accuracy". Prior research, cited in our paper (Petty et al., 2024; Aryabumi et al., 2024), indicates that incorporating code enhances reasoning capabilities, entity tracking, and improves the ability to handle structured information, which likely contributes to better classification.
- Our paper primarily provides empirical analysis but not theoretical proof. In the revision, we will introduce the link between these established benefits of code and the observed enhancement in classification performance and discuss prior research more explicitly in Section 3.3.

2025-05-31

Questions To Authors:

Q1: Y-axis range in bar plots
- We agree this will improve comparability.
- We will revise all bar plots to use consistent y-axis ranges.
Q2: Clarification of "compatible pretraining"
- "Compatible pretraining" refers to base models that allow them to more effectively integrate and leverage new languages introduced via bilingual CPT, particularly for low-resource languages.
- We will clarify this in the revision.
Q3: Experiments with code data + bilingual data
- Yes, we did. Our 36 CPT configurations explicitly include "Bi+Code" setups. These configurations are listed in Table 2.
- The results for these "Bi+Code" configurations on the SIB-200 classification task are presented and discussed in Appendix A.2 Additional Results on Bilingual CPT with Code Data, which includes Figure 5. As noted in Appendix A.2, FLORES-200 comparisons for bilingual+code were omitted because the fundamental language mixing issue with bilingual data in generation tasks made such a comparison less meaningful.
Q4: Rationale for data science-related code data
- Yes, our data science-related code data is from The ArXiv Research Code Dataset. Data science-related code(e.g., Numpy, PyTorch) has more complexity and scale, which are evident in the lines of code metric.

最终决定Accept

2025-07-08

This paper analyses continuously pretraining (CPT) in 36 settings such as as Bilingual CPT, mixing with code data, language similarity grounded CPT; with 3 pretrained models. CPT is a promising direction to build LLM beyond the languages they've been pretrained on. This paper provides extensive results to help practitioners understand CPT in practice in this context.

Despite the clarity of the argumentation and the scale of the experiments the paper should be improved based on reviewer feedback. Mainly:

A large portion of the argumentation is positioned with regard to Yuan et al 2024. As noted by multiple reviewers, this should be better motivated or be part of a broader literature positioning.
Discussion and potential analysis on the relation between pretrained languages and CPT and how that affect performance is lacking. This should be discussed more thoroughly The rebuttal lead to fruitful discussions between the authors and reviewers. We encourage the authors to improve their paper based on this.