PaperHub

Average rating: 4.3/10 · Decision: Rejected · 4 reviewers
Individual ratings: 5, 5, 3, 4 (min 3, max 5, std dev 0.8)
Confidence: 3.8 · Correctness: 2.5 · Contribution: 1.8 · Presentation: 2.3
NeurIPS 2024

Why Not Transform Chat Large Language Models to Non-English?

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Large Language Model; Knowledge Transfer; Catastrophic Forgetting; Multi-turn Conversation; Human Preference

Reviews and Discussion

Review (Rating: 5)

This paper introduces the TransLLM framework to transform English-centric chat LLMs to non-English languages, addressing the challenges of transferring advanced abilities without supervised data and preventing catastrophic forgetting of original knowledge. Key contributions include using the Translation Chain-of-Thought (TCOT) to divide the transfer process into sub-tasks, employing Low-Rank Adaptation (LoRA) and Recovery Knowledge Distillation (KD) to maintain original LLM parameters and recover knowledge. TransLLM's effectiveness is demonstrated through experiments transforming LLaMA-2-chat-7B to Thai, where it outperformed strong baselines and ChatGPT in multi-turn conversations and safety benchmarks, highlighting significant improvements in helpfulness and safety without extensive supervised data. This framework offers a solid foundation for developing safe and useful non-English LLMs.

Strengths

  1. The introduction of TransLLM, which combines Translation Chain-of-Thought (TCOT) and recovery knowledge distillation, provides an effective method for transforming English-centric LLMs to non-English languages. This approach addresses both the transfer of advanced abilities and the prevention of catastrophic forgetting.
  2. The method shows notable improvements in rejecting harmful queries and maintaining human preference alignment, as evidenced by outperforming GPT-4 and ChatGPT on the safety benchmark AdvBench, highlighting the robustness of the model in safety-critical applications.

Weaknesses

  1. The experiments are primarily conducted on transforming LLaMA-2-chat-7B to Thai, which may limit the generalizability of the findings to other non-English languages and other models. Further validation on other models (not necessarily larger than the current one) or other sizes of LLaMA-2 would have strengthened the paper.
  2. While the proposed method exhibits promising results on MT-bench and AlpacaEval, it is not tested on other traditional NLU benchmarks, such as MMLU in English. Incorporating more diverse evaluation benchmarks would provide a more comprehensive assessment of the model's performance across various language tasks.

Questions

Major:

  1. Please address the weaknesses above.
  2. For the construction of the TCOT, did you try other concatenation templates besides the one used in the paper? How sensitive is the performance to the choice of the template?
  3. Table 9: How was the human evaluation carried out? How many annotators were involved, and what is the inter-annotator agreement? The final scores are all integers, so it is not clear how the final scores were calculated.

Minor:

  1. line 45: instruct tuning -> instruction tuning
  2. line 58: and finally it, generates ... -> and finally, it generates ...
  3. line 59: the the performance -> the performance
  4. line 546: the second <<SYS>> should be <</SYS>>? also in line 547, the [\INST] should be [/INST].

Limitations

Yes.

Author Response

Thank you for the positive comments and constructive suggestions.

W1: The experiments are primarily conducted on transforming LLaMA-2-chat-7B to Thai, which may limit the generalizability of the findings to other non-English languages and other models...

Please refer to the responses for all reviewers.

W2: While the proposed method exhibits promising results on MT-bench and AlpacaEval, it is not tested on other traditional NLU benchmarks, such as MMLU in English...

Thanks for your constructive suggestion. We further evaluate TransLLM on causal commonsense reasoning in Thai using the XCOPA dataset [1], where it outperforms strong baselines including ChatGPT. Specifically, XCOPA consists of 500 questions. We conduct experiments in the zero-shot setting. Since chat LLMs may output results in different formats, we evaluate the results with one human annotator instead of simple string matching. As shown in the following table, TransLLM outperforms ChatGPT by 8.38%. The performance of Typhoon and X-LLaMA is even worse than random guessing (50%), because these models sometimes fail to follow the instructions and produce hallucinations.

Accuracy of different models on XCOPA in Thai:

Model          Acc (%)
Typhoon        36.20
X-LLaMA        46.00
NLLB-Bridge    55.20
ChatGPT        51.70
TransLLM       60.08
GPT-4          75.20
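As a concrete illustration of the zero-shot setup described above, a minimal sketch of how an XCOPA-th query could be assembled (the exact prompt wording used in the rebuttal is not specified; field names follow the Hugging Face `xcopa` dataset card, and the prompt template here is hypothetical):

```python
# Hypothetical zero-shot prompt for XCOPA-th; illustration only.
from datasets import load_dataset

xcopa_th = load_dataset("xcopa", "th", split="test")  # the 500 test questions
ex = xcopa_th[0]
prompt = (
    f"Premise: {ex['premise']}\n"
    f"Which is the more plausible {ex['question']}?\n"  # 'cause' or 'effect'
    f"A. {ex['choice1']}\n"
    f"B. {ex['choice2']}\n"
    "Answer:"
)
# The chat LLM's free-form answer is then judged by a human annotator,
# since output formats vary too much for simple string matching.
```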

Q1: Please address the weaknesses above.

We have tried our best to address the weaknesses. We hope the above response could address your concerns.

Q2: For the construction of the TCOT, did you try other concatenation templates besides the one used in the paper? How sensitive is the performance to the choice of the template?

Following the design considerations for in-context learning templates [2], we considered several aspects when designing the template. 1. Ordering: TCOT is inspired by how humans approach the task; it is intuitive to first translate the question from the foreign language into the native language, answer in the native language, and then translate the answer into the target language. 2. Formatting: to build the relationship between sub-tasks and TCOT, we use different special tokens to represent different tasks across different procedures. We do not use a natural language template (NLT) as in PLUG because the template would be trained many times across different training steps, which could significantly distort its original meaning. We conducted experiments using the NLT from PLUG ("Let me interpret the instruction in English: ... Then the English response is: ... Final, the Thai response is: ..."). The following results show that the proposed template is better than the NLT.

Comparison between vanilla TransLLM and TransLLM w/ NLT on MT-bench in Thai under GPT-4 evaluation (Win means vanilla TransLLM is preferred; Δ = Win − Loss, in %):

Turn     vs. Model   Win (%)   Tie (%)   Loss (%)   Δ
First    w/ NLT      47.50     28.75     23.75      23.75
Second   w/ NLT      32.50     42.50     25.00      7.50
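To make the two template styles concrete, here is a hypothetical sketch (the paper's actual field order and delimiters may differ; the special tokens <EN>, <TH>, and <RESPONSE> are those named later in this thread):

```python
# Hypothetical sketch of the two template styles compared above.
def tcot_special_tokens(th_query, en_query, en_answer, th_answer):
    # Special-token template: language IDs mark each TCOT sub-task.
    return f"<TH>{th_query}<EN>{en_query}<RESPONSE>{en_answer}<TH>{th_answer}"

def tcot_nlt(th_query, en_query, en_answer, th_answer):
    # Natural language template (NLT), following the PLUG-style wording quoted above.
    return (f"{th_query} Let me interpret the instruction in English: {en_query} "
            f"Then the English response is: {en_answer} "
            f"Final, the Thai response is: {th_answer}")
```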

Q3: Table 9: How was the human evaluation carried out? How many annotators were involved, and what is the inter-annotator agreement? The final scores are all integers, so it is not clear how the final scores were calculated.

We provide the details of the human evaluation in Appendix A.4. We obtain the win/tie/loss tags by comparing the evaluation scores of different models, since direct pair-wise comparison would significantly increase the evaluation cost. More importantly, as shown in Table 5 of [3], there is high agreement between direct evaluation and pair-wise comparison. Due to the annotation cost, we conducted the human evaluation only once per response. To address this concern, we further re-evaluated 50% of these results following the same procedure and provide the inter-annotator agreement below. There is high inter-annotator agreement in our evaluation.

Inter-annotator agreement on MT-Bench in Thai (R is the expected agreement of random guessing):

Setting             First Turn (%)   Second Turn (%)
w/ Tie (R = 33%)    75.00            67.50
w/o Tie (R = 50%)   91.70            80.00
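For concreteness, a minimal sketch of the two computations described above (score-based tagging and the raw agreement rate); this is an illustration, not the authors' code:

```python
# Sketch: (1) derive win/tie/loss tags from direct evaluation scores,
# (2) compute the raw inter-annotator agreement rate reported above.
def tag(score_a: float, score_b: float) -> str:
    """Pairwise tag derived from two direct evaluation scores."""
    return "win" if score_a > score_b else "loss" if score_a < score_b else "tie"

def agreement_rate(tags_pass1, tags_pass2) -> float:
    """Fraction of items on which two annotation passes agree; compare
    against R, the expected agreement of random guessing (1/3 with ties,
    1/2 without)."""
    matches = sum(a == b for a, b in zip(tags_pass1, tags_pass2))
    return matches / len(tags_pass1)
```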

Typos

Thanks a lot! We will double-check our manuscript and further revise the paper carefully.

[1] Ponti et al., 2020. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning.

[2] Dong et al., 2020. A Survey on In-context Learning.

[3] Zheng et al., 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.

Comment

Thanks for your comments again. We would like to know if our response addressed your concerns. We look forward to discussing these concerns with you.

Review (Rating: 5)

This paper proposes a transfer pipeline, TransLLM, which uses a translation chain-of-thought (TCOT) to adapt English-centric large language models (LLMs) to low-resource languages. The pipeline consists of pre-training and supervised fine-tuning (SFT) phases. During the pre-training stage, the authors select monolingual data in the target language and parallel translation data for language modeling. The pipeline also uses an external translation model to construct TCOT dialogue data, which is used in the SFT phase. In the SFT phase, to avoid the catastrophic forgetting problem caused by continual learning, the authors adopt the LoRA PEFT method and further propose the recovery knowledge distillation (KD) method. Specifically, the recovery KD method mixes in responses generated by the original model during the SFT-based transfer phase. The effectiveness of this pipeline is demonstrated by successfully transferring the LLaMA-2-7B model from English to Thai. Compared to other baseline models based on open-source LLMs, the transferred model exhibits strong multi-turn dialogue capabilities and safety on the Thai MT-bench and AdvBench benchmarks.

Strengths

  1. Originality: Compared to other machine translation-based transfer methods in the community, the proposed pipeline introduces the concept of Translation chain-of-thought and specifically addresses the catastrophic forgetting problem that might arise from SFT-based transfer methods, proposing effective solutions.
  2. Evaluation Protocol: I appreciate the efforts made by the authors in the experimental section to ensure the effectiveness of evaluations for low-resource language (Thai). The authors employed professionals for evaluation and also validated the agreement between GPT-4's automatic evaluation and human evaluation on the MT-bench. Additionally, they used human translation in constructing some of the test data.
  3. Significance: Currently, training data and resources for LLMs are predominantly English-centric. The proposed method helps build strong chat LLMs for low-resource languages and minority groups.

Weaknesses

  1. Flexibility of Methodology: Although the proposed method reduces the dependency on instruction-following data in the target language, it still relies on parallel corpora and external translation models for data construction (Translation pre-training data, TCOT data). If high-quality parallel corpora or models are not available, the proposed method might be infeasible. For example, even commercial translation systems cannot support translations for some endangered languages or dialects.
  2. Scope: While the method proposed in this paper might be extendable to other low-resource languages, the authors only validated it on Thai, which lacks empirical evidence for the generality of the proposed pipeline. I also noticed that the authors emphasize "non-English" in the title, but a broader range of non-English languages still requires exploration.
  3. Effectiveness: The authors tested the method on LLaMA-2-7B. Experiments on other model series and larger models could further support the effectiveness of the proposed pipeline.
  4. Quality of MT data: The authors used Google Translate to translate the TCOT and AdvBench data. Although this is a commonly used commercial system, it would be better to evaluate and report the quality of these translations to explore their impact. In the absence of reference translations, quality estimation methods like CometKiwi [1] and TransQuest [2] could be used.
  5. Typos:
    • Line 45: instruct tuning -> instruction tuning

References

[1] Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. 2022. CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

[2] Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. 2020. TransQuest: Translation Quality Estimation with Cross-lingual Transformers. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5070–5081, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Questions

Besides some macro-level issues mentioned in the weaknesses, I still have some questions about some details:

  1. This paper mainly focuses on transferring to a single target language. Is it applicable for transferring to multiple target languages? If so, how to balance the data for multiple target languages?
  2. Line 170: A random strategy was used for data mixing. Would other mixing strategies affect the results?
  3. Line 131: Since the authors mentioned being inspired by mBART, is the special language ID independently occupying a token in the vocabulary?

Limitations

The authors discussed some limitations after the conclusion.

Author Response

Thank you for the positive comments and constructive suggestions.

W1: Flexibility of Methodology

  1. Transfer requires resources: Knowledge transfer across languages relies on semantic alignment, which inevitably needs some translation resources. As we discuss in the related work, most existing works also rely on parallel corpora and external translation models for data construction (e.g. PolyLM, X-LLaMA, PLUG).
  2. Required resources are acceptable: With the development of machine translation, parallel data has become increasingly available for many low-resource languages, whereas instruction-following data exists for only a few languages. As shown in the ablation studies in Table 6, large amounts of parallel data contribute only a small improvement, so they are not necessary for extremely low-resource languages. This is consistent with observations in LLM translation studies. Moreover, we only need a small amount of TCOT data for fine-tuning (50K in our experiments). So it is acceptable to use human translation to generate TCOT data if there is no high-quality machine translation system for the target language.
  3. Alternative resources: For extremely low-resource languages, we could use unsupervised or semi-supervised machine translation methods (e.g. iterative back-translation and multilingual pre-training) to obtain feasible translation resources.

W2: Scope

Please refer to the responses for all reviewers, where we discuss the generalizability of TransLLM and validate it in Arabic.

W3: Effectiveness

Please refer to the responses for all reviewers, where we conduct experiments on the recently released LLaMA-3-instruct-8b.

W4: Quality of MT data

Yes, the impact of machine translation (MT) quality is an interesting point. To investigate it, we also used NLLB to construct TCOT data. We use wmt22-cometkiwi-da to evaluate the quality of the translations without references. As shown in the following table, Google Translate generally performs better than NLLB, which is consistent with the comments of our human translator.

CometKiwi score on TCOT data:

MT System          Score
NLLB               79.07
Google Translate   83.40
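For reference, reference-free QE scoring with wmt22-cometkiwi-da might look like the following using the unbabel-comet package (an assumed workflow: the checkpoint is gated on Hugging Face, and API details can vary across comet versions):

```python
# Sketch of reference-free quality estimation with wmt22-cometkiwi-da.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
data = [
    {"src": "Explain photosynthesis in simple terms.",  # English source
     "mt": "อธิบายการสังเคราะห์ด้วยแสงแบบง่าย ๆ"},  # Thai MT output
]
out = model.predict(data, batch_size=8, gpus=0)
print(out.system_score)  # corpus-level score, as reported in the table
```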

We further train a TransLLM model using the TCOT data generated by NLLB and provide the results in the following table. We find that TCOT data with higher MT quality achieves much better performance. This is because translation ability is important for successful TCOT, while low-quality TCOT data introduces many translation errors.

Comparison between different TCOT data (Google Translate vs. NLLB) on MT-bench in Thai under GPT-4 evaluation:

Turn     Win (%)   Tie (%)   Loss (%)   Δ
First    46.25     25.00     28.75      17.50
Second   57.50     20.00     22.50      35.00

The impact of machine translation (MT) quality on AdvBench has been accounted for in the human evaluation: Yong et al. [1] define an UNCLEAR label for nonsensical responses caused by low translation quality or LLM hallucinations.

W5: Typos

We will double-check our manuscript and further revise the paper carefully.

Q1: This paper mainly focuses on transferring to a single target language. Is it applicable for transferring to multiple target languages? If so, how to balance the data for multiple target languages?

Thanks for the constructive question. TransLLM is easy to extend to multilingual settings and benefits from the language IDs: mBART shows that language IDs work well for letting models distinguish tasks in different languages. The major problem, as you mention, is how to balance the data across multiple target languages, since resources are imbalanced across languages. Lample et al. [2] smooth the distribution using a multinomial distribution, and Zhu et al. [3] formulate scaling laws to optimize data allocation. We would like to explore these strategies in the future.
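A minimal sketch of that multinomial smoothing idea (the corpus sizes and exponent value below are illustrative placeholders, not numbers from [2]):

```python
# Sample language i with probability proportional to p_i**alpha, where
# p_i is its share of the corpus; alpha < 1 up-weights low-resource languages.
import numpy as np

def smoothed_probs(corpus_sizes, alpha=0.5):
    p = np.asarray(corpus_sizes, dtype=float)
    p /= p.sum()        # raw corpus distribution
    q = p ** alpha      # smoothing
    return q / q.sum()  # renormalize to a multinomial

print(smoothed_probs([5_000_000, 200_000, 50_000]))  # e.g. EN, TH, AR sizes
```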

Q2: Line 170: A random strategy was used for data mixing. Would other mixing strategies affect the results?

Recency bias (catastrophic forgetting during training) causes the model to perform worse on tasks that have not been trained recently [4]. A straightforward and practical solution to balance the performance of the different TCOT sub-tasks is to expose them equally often, and this random-mixing strategy works well in our setting.

Q3: Line 131: Since the authors mentioned being inspired by mBART, is the special language ID independently occupying a token in the vocabulary?

Yes, we add "<EN>", "<TH>" and "<RESPONSE>" as special tokens into the tokenizer.
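A minimal sketch of registering such tokens with Hugging Face Transformers (an assumed workflow, not the authors' released code):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Each language ID becomes a dedicated vocabulary entry, mBART-style.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<EN>", "<TH>", "<RESPONSE>"]}
)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows for the IDs
```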

[1] Yong et al., 2023. Low-Resource Languages Jailbreak GPT-4.

[2] Lample et al., 2019. Cross-lingual Language Model Pretraining.

[3] Zhu et al., 2023. Extrapolating Large Language Models to Non-English by Aligning Languages.

[4] Lyu et al., 2023. Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation.

Comment

Thank you for the response. The authors have addressed some concerns. However, some common issues regarding generalizability, as noted by other reviewers, still remain.

Given the time and resource limitations during the rebuttal period, the authors presented results for one additional language in their reply. I will maintain a score of 5, but I hope the authors can include more evidence of generalizability in future versions of this paper to comprehensively address this concern.

Comment

Thanks for your reply!

We would like to emphasize that Thai and Arabic, the languages used in our experiments, are representative. They belong to two different language families, so their syntax and writing systems differ significantly. TransLLM demonstrates strong performance in both languages, confirming that it does not rely on specific language features.

As you mentioned in the Strengths section, extensively evaluating chat LLMs on low-resource languages is quite costly. In fact, we spent two months on the Arabic experiments due to limited resources. Even with these limited resources, TransLLM outperforms the open-source SOTA models in the target languages as well as ChatGPT, which use significantly more resources.

We hope this response addresses your concerns, and we will try our best to include more languages in the future.

Review (Rating: 3)

In this work, the authors present a method for transforming a chat-based English LLM to non-English (Thai is the only language experimented with) based on a series of steps that teach the LLM to take in a non-English query and respond in non-English. The method, referred to as the TransLLM pipeline, comprises extending the base model vocabulary and fine-tuning with LoRA in multiple stages: target-language pre-training (on monolingual target-language data), translation pre-training, and transfer fine-tuning. The experiments are done on LLaMA-2-chat-7B with Thai as the target language and show the model performs well on translation both into and out of Thai and obtains good performance with respect to baselines on MT-bench and AlpacaEval.

Strengths

  1. The paper presents a simple and scalable solution pipeline to adapt chat LLMs to new languages.

Weaknesses

  1. The experiments are done on only one LLM and only one language (Thai). This severely constrains the extent to which the results can be verified.
  2. The novelty of the proposed method is very thin, and very limited analysis is done to motivate it.

Questions

  1. Tables 4 and 5 show a large gap against GPT-4, yet GPT-4 KD is not as effective? Why is this the case?

Limitations

Limitations have been adequately addressed.

Author Response

Thank you for your time and comments.

W1: The experiments are done on only one LLM and on only one language (Thai). This severely constrains the extent to which the results could be verified.

Please refer to the responses for all reviewers.

W2: The novelty of the proposed method is very thin and very limited analysis is done to motivate that novelty.

  1. Novel target: In this paper, we try to transform chat LLMs, instead of base LLMs in previous works, from English to non-English. These chat LLMs are specifically optimized for advanced abilities (e.g. multi-turn conversation and human preference alignment), making them not only more effective but also safer than their base versions. Transferring these abilities to non-English will improve the experience for non-English users and, more importantly, improve safety in non-English.

  2. Novel challenges: While pre-training data for base LLMs is typically available, fine-tuning data—especially for advanced chatbot capabilities—is often lacking. Transferring these advanced abilities in chat LLMs without their specific supervised data presents a significant challenge. During the transformation of base LLMs, instruction tuning can be conducted either concurrently or subsequent to transfer training. However, in chat LLMs, advanced capabilities are integrated into model parameters prior to transformation, potentially leading to catastrophic forgetting of these abilities during subsequent transfer training.

  3. Novel framework: To tackle these challenges, we introduce the TransLLM framework. For the first issue, TransLLM leverages the Translation Chain-of-Thought (TCOT) to divide the transfer process into a sequence of common sub-tasks, whose datasets are often publicly available. For the second issue, we employ low-rank adaptation (LoRA) to preserve the original LLM parameters and recovery knowledge distillation (KD) to recover the original abilities (see the sketch after this list).

  4. Corresponding analysis: In our ablation studies (Table 6), we demonstrate the importance of enhancing TCOT's sub-task performance for effective transfer. Our findings in Table 7 confirm that transfer training induces catastrophic forgetting of original knowledge. Moreover, despite the widespread use of high-quality GPT-4 KD data, it fails to aid in recovering original knowledge, whereas our proposed recovery KD method achieves this successfully. Importantly, we explain how LoRA + recovery KD mitigates catastrophic forgetting (lines 323-330): recovery KD enables the LLM to learn a "shortcut" that uses the English knowledge from the original parameters and the new knowledge from the LoRA parameters. The experiment results indicate that we successfully transfer advanced abilities from English. TransLLM outperforms strong baselines on the multi-turn conversation benchmark in Thai without using any multi-turn conversation datasets. Notably, TransLLM outperforms GPT-4 on the safety benchmark in Thai without using any safety or human preference alignment datasets. We hope this work can lay a solid foundation for developing safe LLMs in non-English.
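As an illustration of the frozen-base-plus-adapter setup referenced in points 3 and 4, a minimal sketch with the peft library (the rank, alpha, and target modules are placeholders; the paper's hyperparameters are not stated in this thread):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(base, config)
# The original weights stay frozen; only the low-rank adapters receive
# gradients during transfer training, preserving the base English knowledge.
```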

Q1: Tables 4 and 5 show a large gap against GPT-4, yet GPT-4 KD is not as effective? Why is this the case?

In Table 7, it is evident that transfer training leads to catastrophic forgetting of the original knowledge, and GPT-4 KD fails to recover this knowledge. The publicly available GPT-4 KD dataset comprises only 52K instances, limiting its scope despite its high quality. In contrast, the original LLM contains a broader range of English knowledge across various domains, so recovering the original knowledge can significantly enhance the LLM's capabilities. While the quality of recovery KD data is lower than that of GPT-4 KD data, it effectively recovers the original knowledge (Table 7). The recovery KD data can be fitted easily using the original parameters, which enables the LLM to learn a "shortcut" that uses the English knowledge from the original parameters and the transfer knowledge from the LoRA parameters. In lines 323-330, we show that TransLLM successfully learns this "shortcut" via LoRA + recovery KD.
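A minimal sketch of the recovery KD data construction described here (the `generate_fn` argument is a placeholder for any decoding routine; this is an illustration, not the authors' code):

```python
def build_recovery_kd(generate_fn, english_instructions):
    """Self-distillation: the frozen original chat model answers English
    instructions, and its own outputs become distillation targets mixed
    into transfer SFT, so they are easy to fit with the base parameters."""
    return [{"instruction": q, "response": generate_fn(q)}
            for q in english_instructions]
```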

Comment

Thanks for your comments again. We would like to know if our response addressed your concerns. We look forward to discussing these concerns with you.

Review (Rating: 4)

This paper focuses on the scenario of transforming an English-centric (or not strictly English-centric) chat large language model into a non-English chat large language model. The authors aim to address the catastrophic forgetting problem, where further tuning the original English chat LLM without reusing its original SFT data hurts its original chat abilities, mainly in English. To this end, they introduce several techniques, including translation chain-of-thought, low-rank adaptation, and recovery KD, and they claim that with only single-turn Thai data they can successfully transform the English LLaMA-2-chat-7B to Thai while the performance on MT-bench and other benchmarks remains competitive.

Strengths

The authors propose a pipeline to tune an English chat language model into a non-English one. The pipeline looks interesting, and the performance looks good on MT-bench, the Thai-translated MT-bench, etc.

Weaknesses

  1. To be honest, I am not sure whether this paper has a good starting point. It focuses on a scenario where one further fine-tunes an English chat model into a chat model for other languages. Could you provide more concrete application scenarios in your introduction? I am not convinced because, in practice, people typically use all the available data to tune a base model, and usually this yields the best downstream task performance, or at least the best robustness, defined as the average downstream task performance of a particular tuned chat LLM. Why can we not follow this paradigm?
  2. I suggest the authors further polish the abstract; I got confused several times. For example, regarding the statement ``Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method'': why can you simply say this is a resource-efficient method? As far as I know, if you use knowledge distillation by distilling data synthesized by a stronger model in your target languages, you still need the stronger model, usually a closed-source one such as ChatGPT, which is still expensive. Besides, even with knowledge distillation, you still have to mix all the data for training in the SFT stage, so it remains expensive.

Questions

See the weaknesses.

Limitations

See the weaknesses.

Author Response

Thank you for your time and comments.

W1: To be honest, I am not sure whether this paper has a good starting point. It focuses on a scenario where one further fine-tunes an English chat model into a chat model for other languages. Could you provide more concrete application scenarios in your introduction? I am not convinced because, in practice, people typically use all the available data to tune a base model, and usually this yields the best downstream task performance, or at least the best robustness, defined as the average downstream task performance of a particular tuned chat LLM. Why can we not follow this paradigm?

A high-quality chat LLM relies on high-quality training data, which is usually unavailable even in English. For example, the LLaMA-2 models release parameters but withhold their valuable datasets, such as the multi-turn conversation and human preference data. Acquiring such high-quality data from scratch and conducting complex training, such as RLHF, in non-English languages is notably costly. Therefore, this paper explores transferring abilities from English to non-English for chat LLMs by leveraging the limited available datasets, such as target-language monolingual datasets, parallel datasets, and instruction tuning datasets.

In our experiments, we demonstrate that solely using all available data in the target language to continue tuning the LLM (i.e. Typhoon [1]) results in less satisfactory performance. In contrast, the proposed TransLLM effectively transfers capabilities from English. For instance, TransLLM surpasses strong baselines on the multi-turn conversation benchmark in Thai without using any multi-turn conversation datasets. Notably, TransLLM surpasses GPT-4 on the safety benchmark in Thai without using any safety or human preference alignment datasets. We hope TransLLM can lay a solid foundation for developing safe LLMs in non-English.

W2: I suggest the authors further polish the abstract; I got confused several times. For example, regarding the statement ``Transforming English-centric LLMs to non-English has been identified as an effective and resource-efficient method'': why can you simply say this is a resource-efficient method? As far as I know, if you use knowledge distillation by distilling data synthesized by a stronger model in your target languages, you still need the stronger model, usually a closed-source one such as ChatGPT, which is still expensive. Besides, even with knowledge distillation, you still have to mix all the data for training in the SFT stage, so it remains expensive.

To enhance non-English ability, a naive approach would be to create as much non-English data as exists for English and perform heavy training. However, this is almost impossible due to the huge cost. Compared to this paradigm, transforming English-centric LLMs to non-English is much more efficient. The knowledge distillation (KD) data in English comes from the publicly available Alpaca dataset, and it can be translated directly into the target language using available machine translation systems, as shown in [2, 3]. Therefore, we can obtain KD data at low cost. In practice, only about 52K KD instances need to be trained on, a computational cost that is usually acceptable.
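A minimal sketch of that data path (the `translate` argument is a placeholder for any MT system, not a real API call):

```python
def translate_kd_dataset(kd_data, translate, tgt_lang="th"):
    """Turn English Alpaca-style KD pairs into target-language pairs
    by machine-translating both sides, as in [2, 3]."""
    return [{"instruction": translate(ex["instruction"], tgt_lang),
             "output": translate(ex["output"], tgt_lang)}
            for ex in kd_data]
```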

Thanks for your suggestion. We would like to clarify these points in the paper and further polish the abstract.

[1] Pipatanakul et al., 2023. Typhoon: Thai Large Language Models.

[2] Zhu et al., 2023. Extrapolating Large Language Models to Non-English by Aligning Languages.

[3] Zhang et al., 2024. PLUG: Leveraging Pivot Language in Cross-Lingual Instruction Tuning.

Comment

Thanks for your comments again. We would like to know if our response addressed your concerns. We look forward to discussing these concerns with you.

Author Response

We thank all reviewers for your time and contributions.

Generalizability

Most reviewers express concerns about the generalizability of TransLLM, given that our experiments only involved transforming LLaMA-2-chat-7B to Thai. We address these concerns in the following aspects:

  1. The TransLLM procedure is agnostic to particular model architectures or language features. The extension of vocabulary and LoRA modules can be seamlessly applied to numerous existing LLMs. The training template is highly adaptable for customization across different languages.

  2. We chose Thai as the target language because it covers a wide range of general situations. Firstly, Thai and English exhibit significant differences and share only a limited alphabet, which complicates knowledge transfer from English, in contrast to English-related languages like German. Secondly, the original chat model, LLaMA-2-chat, has been trained on limited Thai data; consequently, it struggles to follow instructions in Thai and frequently responds in English.

  3. Due to limited resources, we could only conduct experiments on one language in the previous version. Here, we have tried our best to conduct experiments on transforming LLaMA-3-instruct-8B to Arabic. The results demonstrate that TransLLM outperforms ChatGPT and strong baselines on MT-bench in Arabic using just 8B parameters, confirming that TransLLM generalizes well across LLMs and languages. The details are as follows. LLaMA-3 [1], a recently released open-source LLM, incorporates numerous advancements in training, data, and architecture over its predecessor, LLaMA-2. Arabic has linguistic features significantly different from both English and Thai. In this experiment, we follow settings similar to those for Thai. In addition to evaluating ChatGPT and GPT-4, we include prominent open-source Arabic LLMs, Jais [2] and AceGPT [3], as baseline models. Notably, AceGPT is the current state-of-the-art open-source LLM for Arabic.

Comparison between our model and strong LLMs on MT-bench in Arabic under human evaluation (Win means our model is preferred; Δ = Win − Loss, in %):

Turn     vs. Model   Win (%)   Tie (%)   Loss (%)   Δ
First    ChatGPT     50.00     30.00     20.00      30.00
First    GPT-4       17.50     30.00     52.50      -35.00
Second   ChatGPT     42.50     35.00     22.50      20.00
Second   GPT-4       12.50     27.50     60.00      -47.50

Agreement between GPT-4 and humans in Arabic:

Setting             First Turn (%)   Second Turn (%)
w/ Tie (R = 33%)    58.00            58.44
w/o Tie (R = 50%)   87.36            87.04

Comparison between our model and strong LLMs on MT-bench in Arabic under GPT-4 evaluation:

Turn     vs. Model   Win (%)   Tie (%)   Loss (%)   Δ
First    Jais        56.25     25.00     18.75      37.50
First    AceGPT      38.75     33.75     27.50      11.25
First    ChatGPT     45.00     22.50     32.50      12.50
First    GPT-4       11.25     42.50     46.25      -35.00
Second   Jais        48.75     33.75     17.50      31.25
Second   AceGPT      61.25     17.50     21.25      40.00
Second   ChatGPT     46.25     26.25     27.50      18.75
Second   GPT-4       21.25     26.25     52.50      -31.25

[1] Llama Team, 2024. The Llama 3 Herd of Models.

[2] Sengupta et al., 2023. Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models.

[3] Huang et al., 2024. AceGPT, Localizing Large Language Models in Arabic.

Final Decision

This paper focuses on teaching an English-centric chat large language model a new language. The authors aim to address the catastrophic forgetting problem, where further tuning of the original English-centric chat LLM without reusing its original SFT data could harm its original chat abilities. They introduce translation chain-of-thought, low-rank adaptation, and recovery KD as solutions.

Reviewer aw9R challenges the framework's application, which is addressed in the rebuttal. Reviewers gAEj, J14C, and vx95 all question the framework's generalizability. During the rebuttal, the authors present experiments involving Arabic, which addresses the concern to some extent.

The novelty of the paper is questioned. The authors claim during the rebuttal that the research target and challenge are novel. However, it has previously been recognized that continual learning of a new language can degrade the performance of a chat model, as shown in studies like [1] (just an example).

Among the techniques used in the framework, TCOT is derived from previous literature, and LoRA is widely recognized. While KD from an existing LLM is not new, KD from the chat model itself can be considered novel. This approach is similar to Magpie [2], but Magpie should be regarded as concurrent work with this paper. The paper demonstrates that KD from the chat model outperforms KD from a strong model (namely GPT-4 in this context). It would be beneficial if the authors included more comparisons and analysis in this direction in the revised version.

[1] https://arxiv.org/abs/2310.04799

[2] https://arxiv.org/abs/2406.08464