PaperHub

Overall rating: 6.8 / 10 · Decision: Rejected · 4 reviewers
Individual ratings: 5, 6, 8, 8 (min 5, max 8, std dev 1.3)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3

ICLR 2025

Lens: Rethinking Multilingual Enhancement for Large Language Models

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords: Multilingual Enhancement, Large Language Models

Reviews and Discussion

Official Review (Rating: 5)

This paper proposes a novel method for enhancing multilingual performance in large language models by refining the model's text representation space. The approach applies singular value decomposition to analyze and separate cross-language similarities and differences within the representation space. During model training, this method aims to minimize the "language-agnostic" subspace and separate out the "language-specific" representation space, enabling better knowledge sharing with English (the central language) and improving the unique characteristics of each language. Experimental results indicate that this method effectively enhances multilingual capability, especially improving language fidelity.
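To make the subspace idea described above concrete, here is a minimal, self-contained sketch (our own illustration, not the paper's actual code) of how SVD over stacked per-language representations can yield orthogonal language-agnostic and language-specific subspaces; the shapes, the rank-based split, and all variable names are assumptions.

```python
import numpy as np

# Hypothetical setup: mean hidden states (one row per language) taken from a
# top layer of the LLM; the rank-based split below is an assumed heuristic.
rng = np.random.default_rng(0)
n_langs, d_model, k_shared = 7, 4096, 2      # e.g. En + 6 target languages

H = rng.standard_normal((n_langs, d_model))  # per-language mean representations

# SVD of the stacked language representations.
U, S, Vt = np.linalg.svd(H, full_matrices=False)   # Vt: (n_langs, d_model)

# Heuristic split: top singular directions as the shared (language-agnostic)
# subspace, the remaining directions as the language-specific subspace.
V_agnostic = Vt[:k_shared].T                 # (d_model, k_shared)
V_specific = Vt[k_shared:].T                 # (d_model, n_langs - k_shared)

def project(h, basis):
    """Project a hidden state h onto the subspace spanned by `basis` columns."""
    return basis @ (basis.T @ h)

h = rng.standard_normal(d_model)             # some token/sentence representation
h_shared, h_specific = project(h, V_agnostic), project(h, V_specific)

# The two components live in mutually orthogonal subspaces of the row space of H.
assert np.allclose(V_agnostic.T @ V_specific, 0.0, atol=1e-8)
```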

Strengths

  1. The methodology is straightforward, with clear explanation and rationale.
  2. Comparative results validate the effectiveness of the proposed approach.

Weaknesses

  1. The technique primarily builds on existing methods (such as https://aclanthology.org/2023.findings-emnlp.190/) and lacks significant novelty.
  2. Language selection for experiments could have accounted for the proximity of each language to the central language, which would add meaningful insights.
  3. Performance inconsistencies are observed, such as the inferior results for the Phi model compared to LLaMA's performance in the appendix.

Questions

  1. Since the method doesn't leverage external data to augment knowledge, wouldn't languages closer to English theoretically benefit more? It raises the question: does the "alignment" operation inherently favor languages that are closer to or in the same family as English?
  2. Why do the results in Figure 2 show greater improvement for Chinese and Japanese, but limited improvement for Arabic and Bengali?

Details of Ethics Concerns

na

Comment

We deeply appreciate your valuable comments and your recognition of our contributions to enhancing multilingual performance in large language models. Below, we address the raised concerns point by point.


Weakness 1: The technique primarily builds on existing methods and lacks significant novelty.

Thank you for bringing this up. We would like to clarify that our work differs in both ideas and methodology from the referenced paper.

The referenced work focuses on improving cross-lingual performance through token-level and semantic-level alignment among different languages, an idea widely adopted by many current methods. This idea also motivates part of our approach, specifically aligning different languages within the language-agnostic subspace.

However, our work goes a step further by addressing a critical but underexplored aspect: separating language-specific representations within the model’s language-specific subspace. This highlights the importance of simultaneously aligning and separating representations across languages for effective and efficient multilingual enhancement. The core novelty of our work lies in identifying this dual requirement and demonstrating its efficacy through extensive experiments. This idea and our methodological innovation have also been positively acknowledged by Reviewer U88a and Reviewer CKFG.

We appreciate your suggestion and will cite the referenced work in our revised manuscript while further elaborating on the distinctiveness and contribution of our work.


Weakness 2: Language selection for experiments could have accounted for the proximity of each language to the central language, which would add meaningful insights.

Thank you for pointing this out. The language selection in our experiments is based on two principles:

  • The target languages should be classified as out-of-scope in the official model card of the base model, ensuring that our experiments address under-represented language cases (lines 263 - 265).

  • The selection should reflect a balance of diverse linguistic families and resource levels, allowing us to evaluate performance across a broad spectrum of languages (lines 259 - 262).

The six languages we chose (Bengali, Swahili, Chinese, Japanese, Korean, and Arabic) also represent varying degrees of linguistic proximity to English, ranked approximately from closest to farthest: Bn, Sw, Zh, Jp, Ko, and Ar.

We greatly appreciate your suggestion and have conducted additional experiments on LLaMA-3-8B-Instruct with Spanish (Es), German (De), and French (Fr), languages that are linguistically closer to English, in terms of both multilingual understanding (MU) and generation (MG).

| Method | MU (En) | MU (Es) | MU (Fr) | MU (De) | MG (En) | MG (Es) | MG (Fr) | MG (De) |
|---|---|---|---|---|---|---|---|---|
| LLaMA-3 | 64.90 | 53.50 | 52.50 | 56.50 | 6.99 | 5.88 | 5.27 | 4.56 |
| xSFT-LoRA | 66.10 | 53.30 | 51.40 | 56.30 | 6.30 | 5.23 | 5.03 | 4.68 |
| xSFT-Full-LoRA | 65.00 | 51.40 | 50.70 | 56.40 | 6.13 | 4.92 | 4.76 | 4.79 |
| xSFT | 64.90 | 50.80 | 49.90 | 56.00 | 5.73 | 4.36 | 4.47 | 4.00 |
| xSFT-Full | 60.00 | 50.90 | 48.90 | 51.50 | 5.95 | 4.60 | 4.42 | 4.33 |
| SDRRL | 64.00 | 48.90 | 47.80 | 50.30 | 6.09 | 2.74 | 3.06 | 2.54 |
| QAlign | 62.80 | 48.50 | 46.60 | 51.30 | 3.61 | 2.88 | 2.91 | 2.31 |
| Lens | 64.60 | 53.70 | 52.10 | 57.10 | 7.10 | 5.90 | 5.63 | 4.90 |

These new results provide further insights and show that our method consistently improves multilingual performance across languages, regardless of their distance from English. We will include these findings in the revised manuscript for completeness and to better address this concern.

Comment

Weakness 3: Performance inconsistencies are observed, such as the inferior results for the Phi model compared to LLaMA's performance in the appendix.

Thank you for highlighting this point. We believe the observed inconsistencies may stem from differences in the underlying capabilities of the Phi and LLaMA model families. However, as their developers do not provide detailed training or dataset documentation, it is challenging to conduct a deeper analysis or pinpoint the exact causes of these differences.

Despite this, when evaluating the multilingual enhancement effects, our Lens consistently improves performance on both model families and outperforms current baseline methods. This demonstrates the scalability and robustness of our approach across different pretrained backbones.

We appreciate your suggestion and will emphasize this point more clearly in the revised manuscript to address this concern.


Question 1: Since the method doesn't leverage external data to augment knowledge, wouldn't languages closer to English theoretically benefit more? It raises the question: does the "alignment" operation inherently favor languages that are closer to or in the same family as English?

Thank you for raising this concern. We would like to clarify that our method is not inherently biased toward languages closer to English.

  • First, the six target languages selected in our study are intentionally distant from English, ensuring a focus on languages that typically are under-represented in LLMs. Our experimental results demonstrate effective performance improvements for these languages, validating the method’s applicability even for linguistically distant languages.

  • Second, as highlighted in our response to Weakness 2, we conduct additional experiments on linguistically closer languages—Spanish, German, and French. These results further confirm that our method achieves meaningful improvements regardless of the language’s proximity to English.

  • Finally, we show in this paper that “alignment” alone is insufficient to achieve comprehensive multilingual enhancement. Instead, our approach combines alignment with “separation” operations in the language-specific subspace, a critical insight provided by this work. By leveraging this dual mechanism, our proposed Lens effectively enhances multilingual capabilities without being biased toward languages closer to English.


Question 2: Why do the results in Figure 2 show greater improvement for Chinese and Japanese, but limited improvement for Arabic and Bengali?

Thank you for raising this excellent question.

Our hypothesis is that it stems from the imbalanced proportions of languages in the pretraining corpus of the backbone, which result in varying representation capabilities across languages. Unfortunately, as LLaMA-3 only reports that approximately 90% of its pretraining data is English, with no detailed breakdown of the remaining 10%, it is challenging to verify this hypothesis with certainty.

However, based on general resource availability, we infer that the remaining 10% pretraining data likely favors high-resource languages such as Chinese and Japanese over mid-resource languages like Arabic or low-resource ones like Bengali. Consequently, our method’s relative improvement may be influenced by the uneven representation capability in the pretrained backbone. Despite this, our results demonstrate consistent performance gains across all languages compared to the backbone model, highlighting the robustness of our approach.

We deeply appreciate your question and agree that this is an important topic for further investigation. We hope that future multilingual LLMs with more transparent pretraining corpus details will provide better support for understanding such disparities and refining enhancement techniques.


We hope these clarifications address your concerns adequately. Thank you once again for your detailed and thoughtful feedback, which has been invaluable in refining our work.

Comment

I appreciate the authors' detailed responses; however, the response could not address my concern. They mentioned that the work goes a step further by addressing a critical but under-explored aspect: separating language-specific representations within the model’s language-specific subspace. I think this point may be valid, but there is no theoretical evidence in the work to support the existence of such a language-specific space, despite the empirical results seeming promising. This work may be suitable for presentation at NLP conferences like EMNLP, but I believe it is not suitable for ICLR. Therefore, I maintain my score.

Comment

Thank you for your continued feedback. We understand your concern regarding the theoretical foundation for the existence of a language-specific space, and we would like to address it from three perspectives:


1. Linguistic Theory

From a linguistic standpoint, the idea of separating representations into language-agnostic and language-specific spaces is grounded in established theories of language universals and typology. Language-agnostic features align with universal linguistic structures, such as shared syntactic patterns or semantic primitives [1,2], while language-specific features capture unique aspects like phonology, morphology, or syntax [3,4]. These distinctions have also been studied in computational linguistics, such as in multilingual embeddings [5] and cross-lingual representation learning [6], supporting our conceptual basis.


2. LLM Interpretability

Recent interpretability studies have provided compelling evidence that LLMs internally encode language-agnostic and language-specific subspaces. For example, specific neurons or groups of neurons have been identified as responsible for mapping multilingual input representations either into a shared language-agnostic space [7-11], in which different languages share common knowledge, or into distinct language-specific spaces [12-14], which are crucial for the accurate expression of specific languages. These findings support our assumption that LLMs naturally exhibit such separable structures, and our work leverages this inductive bias to improve multilingual performance.


3. Related works at ICLR

Building upon the above two theoretical foundations, particularly linguistic theory, we would like to note that, over the past five years, most multilingual papers at ICLR have focused on aligning representations in the language-agnostic space [15, 20-24] or aligning gradients during optimization [17, 18] to leverage shared features across languages.

However, a few works in multilingual machine translation have considered language-specific characteristics, primarily by implementing routing mechanisms or modular designs to improve performance [16, 19].

In contrast, our proposed Lens goes a step further by utilizing both language-agnostic and language-specific subspaces to comprehensively enhance multilingual performance (including multilingual machine translation; please refer to our detailed response to Reviewer CKFG). Our experimental and visualization results (Figure 6) clearly validate the effectiveness of leveraging these distinct subspaces for representation learning, demonstrating both the theoretical soundness and the practical utility of our approach.


Once again, we deeply appreciate your feedback, which reminds us that our related work discussion could be more comprehensive in addressing these connections. We have added the above discussion to Appendix F in our revised paper (in orange), clarifying how Lens builds upon and extends prior research.

We hope our response and the revisions alleviate your concerns.

Comment

Reference:

[1] Greenberg J H. Universals of language[J]. The Massachusetts Institute of Technology, 1963.

[2] Comrie B. Language universals and linguistic typology: Syntax and morphology[M]. University of Chicago Press, 1989.

[3] Croft W. Typology and universals[M]. Cambridge University Press, 2002.

[4] Cotterell R, Schütze H, Eisner J. Morphological smoothing and extrapolation of word embeddings[C]. ACL 2016.

[5] Artetxe M, Labaka G, Agirre E. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings[C]. ACL 2018.

[6] Ruder S, Vulić I, Søgaard A. A survey of cross-lingual word embedding models[J]. Journal of Artificial Intelligence Research, 2019, 65: 569-631.

[7] Chen N, Wu N, Liang S, et al. Is bigger and deeper always better? probing llama across scales and layers[J]. CoRR, 2023.

[8] Starace G, Papakostas K, Choenni R, et al. Probing LLMs for Joint Encoding of Linguistic Categories[C]. EMNLP 2023 Findings.

[9] Wang W, Haddow B, Wu M, et al. Sharing matters: Analysing neurons across languages and tasks in llms[J]. arXiv preprint arXiv:2406.09265, 2024.

[10] Chen Y, Cao P, Chen Y, et al. Journey to the center of the knowledge neurons: Discoveries of language-independent knowledge neurons and degenerate knowledge neurons[C]. AAAI 2024.

[11] Wendler C, Veselovsky V, Monea G, et al. Do llamas work in english? on the latent language of multilingual transformers[C]. ACL 2024.

[12] Tang T, Luo W, Huang H, et al. Language-specific neurons: The key to multilingual capabilities in large language models[C]. ACL 2024.

[13] Zhang Z, Zhao J, Zhang Q, et al. Unveiling linguistic regions in large language models[C]. ACL 2024.

[14] Kojima T, Okimura I, Iwasawa Y, et al. On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons[C]. NAACL 2024.

[15] Hu J, Yao Y, Wang C, et al. Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages[C]. ICLR 2024.

[16] Zhao X, Chen X, Cheng Y, et al. Sparse moe with language guided routing for multilingual machine translation[C]. ICLR 2024.

[17] Lee S, Lee H B, Lee J, et al. Sequential reptile: Inter-task gradient alignment for multilingual learning[C]. ICLR 2022.

[18] Wang Z, Tsvetkov Y, Firat O, et al. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models[C]. ICLR 2021.

[19] Zhang B, Bapna A, Sennrich R, et al. Share or not? learning to schedule language-specific capacity for multilingual translation[C]. ICLR 2021.

[20] Berend G. Massively multilingual sparse word representations[C]. ICLR 2020.

[21] Cao S, Kitaev N, Klein D. Multilingual alignment of contextual word representations[C]. ICLR 2020.

[22] Wang Z, Mayhew S, Roth D. Cross-lingual ability of multilingual bert: An empirical study[C]. ICLR 2020.

[23] Alaux J, Grave E, Cuturi M, et al. Unsupervised hyperalignment for multilingual word embeddings[C]. ICLR 2019.

[24] Wang X, Pham H, Arthur P, et al. Multilingual neural machine translation with soft decoupled encoding[C]. ICLR 2019.

Comment

Dear Reviewer yppZ,

Could you please let us know if our responses regarding the clarification of the theoretical foundation and related works at ICLR satisfactorily address the remaining issues? We would greatly appreciate any further suggestions or clarifications you may have and are happy to discuss them further if needed.

Thank you again for your time and consideration.

Official Review (Rating: 6)

This paper introduces a method to enhance the multilingual capabilities of LLMs by leveraging the central-language internal language representation as a pivot signal. Specifically, the authors decouple the internal language representation spaces into language-agnostic and language-specific subspaces. In the language-agnostic subspace, they pull the target language representations closer to those of English to inherit its capabilities, while in the language-specific subspace, they push the target language representations away from English to ensure distinct expression.

Strengths

  1. Resource-efficient: Compared with previous resource-intensive methods like MSFT and continual pretraining, the proposed method enhances multilingual capabilities efficiently with fewer data resources and computation costs.

  2. Competitive Performance: This method demonstrates comparable performance with open-source LLMs that conduct large-scale post-training to enhance multilingual capabilities. Moreover, it surpasses current strong baselines in multilingual enhancement by a large margin.

  3. Good Interpretability: Inspired by previous findings on LLM interpretability, the authors manipulate the internal language representations in the top layers of LLMs, applying these findings to multilingual enhancement successfully. The results of the visualization analysis underscore the interpretability advantages of the proposed method.

Weaknesses

  1. Missing Reference: Previous work [1] has explored how to enhance multilingual abilities through aligning internal sentence representations, but there is a lack of detailed introduction to this relevant research.

  2. The results of the ablation study do not fully support the authors' claims: The authors claim that target languages inherit capabilities from English by pulling the target language representations closer to those of English. However, the left part of Figure 3 demonstrates that the performance improvement does not primarily stem from this component. There is only a slight performance variance, even when the hyperparameter is set to zero.

  3. Incorrect Color in Table 1: The performance of SDRRL on XCOPA outperforms the original backbone. However, the authors highlight it with red color. Additionally, I believe that comparable performance should not be indicated in green. Some results that are clearly lower than the original backbone are still marked in green, which could lead to misunderstandings about performance for readers.

[1] Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment (https://aclanthology.org/2024.naacl-long.445) (Li et al., NAACL 2024)

Questions

  1. Lack of explanation: Consider adding an explanation for what the bold and underlined text indicates in the caption of Table 1.

  2. Caption Error: The caption for Figure 3 (line 408) contains an error; please correct it.

Comment

Weakness 2: The results of the ablation study do not fully support the authors' claims.

We apologize for the unclear presentation of Figure 3, which may have led to a misunderstanding of the experimental results.

Regarding “pulling the target language representations closer to those of English”: this operation indeed helps to inherit the central language’s capabilities and contributes to multilingual performance enhancement. Specifically, as the hyperparameter increases from 0 to 1, the MG performance improves from 5.61 to 5.77, and the average MU performance improves from 73.11 to 73.67. However, this is only part of our claim.

Another critical claim in this work is the importance of separating the central and target language representations in the language-specific subspace (lines 413 - 414). This aspect has been overlooked in previous studies, almost all of which focus only on aligning representations from different languages (lines 86 - 90 and 416 - 419). Together, these two claims, both verified by our experimental results (Figure 3), lead to our core conclusion: effective multilingual enhancement requires simultaneously aligning and separating representations across languages.

We sincerely thank you for pointing out this issue, as it highlights the need to present Figure 3 more clearly and ensure a stronger emphasis on our core claim in the revised manuscript. Your feedback is greatly appreciated.


Weakness 3: Incorrect Color in Table1.

We sincerely apologize for the confusion caused by the incorrect highlighting of SDRRL’s performance on the COPA dataset in Table 1. We have corrected this to reflect the results more accurately in the current revised version.

Regarding the use of green for comparable results, our marking principle is that if the performance drop in the central language (English) is within 0.5 points, it is considered acceptable and thus marked in green. We apologize for not making this clearer, which may have led to misunderstandings.

As you acknowledge in the Strengths, our Lens demonstrates significant improvements over baselines, both in enhancing the target language’s capabilities and in maintaining the central language’s performance. We have made further adjustments in the presentation of the results in Table 1 and Table 4 to ensure clarity and avoid any misinterpretation. Thank you for your constructive feedback.


Question 1: Consider adding an explanation for what the bold and underlined text indicates in the caption of Table 1.

Thank you for pointing this out. The bold and underlined text in Table 1 indicates the best and second-best results in the comparison with the baseline methods, respectively. We have added an explanation of this to Table 1 and Table 4 in the revised manuscript, highlighted in blue.


Question 2: The caption for Figure3 (Line408) contains an error, please correct it.

We appreciate you highlighting the caption error for Figure 3. We have corrected this error (MU to MG in line 408, highlighted in blue) to ensure an accurate description of the figure.


We hope these clarifications address your concerns adequately. Thank you once again for your detailed and thoughtful feedback, which has been invaluable in refining our work.

Comment

Thank you for your clarification and the additional experiments, which have addressed some of my concerns.

However, I am still not fully satisfied with the explanation for Weakness 2. While the authors highlight the importance of separating the central and target language representations in the language-specific subspace, a key motivation of the paper is that the central language representation provides a high-quality internal supervised signal, which enables the target languages to inherit capabilities from English. Therefore, I believe the main performance improvement should stem from this component. However, the ablation experiments show that the actual performance improvement mainly comes from enhancing the separation, even when λ1 is set to zero.

Hence, I only increase my rating to 6.

Comment

Thank you for your thoughtful response and for increasing your rating to 6. We deeply appreciate your follow-up, which provides us an opportunity to address your remaining concerns and clarify our motivation further.


Key Motivation Clarification

Your current concern may stem from a partial misunderstanding of the paper’s core motivation. To clarify, the key motivation of our work is that well-established English representations in existing English-centric LLMs can act as a pivot to improve the performance of other languages (lines 61 - 63). This pivot provides two forms of supervisory signals:

  • Aligning the target language with the central language.

  • Separating the target language from the central language.

It is not solely about providing a one-sided alignment signal. Instead, the pivot provides supervision signals for both alignment and separation, and our experimental results confirm this, with a significant contribution from the separation (disentanglement) component. This finding offers a novel insight not observed in previous work and has the potential to inspire future research directions.
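As an illustration of what such dual supervision could look like in code, the following sketch (our own hedged example, not the paper's actual objective) combines a pull term in the language-agnostic subspace with a push term in the language-specific subspace; the projection matrices, loss form, function names, and weights are placeholders.

```python
import torch
import torch.nn.functional as F

def dual_supervision_loss(h_tgt, h_en, P_agnostic, P_specific,
                          lam_pull=1.0, lam_push=1.0):
    """
    Illustrative pull/push objective (not the paper's exact formulation).

    h_tgt, h_en : (batch, d) hidden states for target-language and English inputs
    P_agnostic, P_specific : (d, d) projection matrices onto the two subspaces
    """
    # Pull: align target and central-language representations in the shared subspace.
    a_tgt, a_en = h_tgt @ P_agnostic, h_en @ P_agnostic
    pull = (1.0 - F.cosine_similarity(a_tgt, a_en, dim=-1)).mean()

    # Push: separate them in the language-specific subspace.
    s_tgt, s_en = h_tgt @ P_specific, h_en @ P_specific
    push = F.cosine_similarity(s_tgt, s_en, dim=-1).clamp(min=0.0).mean()

    return lam_pull * pull + lam_push * push

# Toy usage with random tensors and identity-like placeholder projections.
d = 16
loss = dual_supervision_loss(torch.randn(4, d), torch.randn(4, d),
                             torch.eye(d) * 0.5, torch.eye(d) * 0.5)
```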


Broader Insights and Connection to Superficial Alignment Hypothesis

Our motivation and experimental conclusions may also support the superficial alignment hypothesis [1,2,3], which posits that LLMs acquire their core knowledge and abilities during pretraining, while post-alignment training primarily guides the model towards a desirable subdistribution of formats to use when prompted. In the multilingual setting, this manifests in two ways:

  • Despite the imbalance in pretraining resources for different languages, the majority of language-agnostic knowledge is already well-comprehended and aligned during pretraining, especially for current LLMs exposed to super-large-scale pretraining corpora (e.g., over 15T tokens for LLaMA-3).

  • Current post-alignment training, which disproportionately focuses on English data, limits other languages to a subdistribution aligned with English-specific formats.

Thus, further aligning multilingual representations may have less impact compared to stimulating language-specific expressiveness in the target languages, but both mechanisms contribute to performance improvement in our method, with separation playing a more significant role.


We thank you again for raising this point, which allowed us to clarify our motivation and engage in deeper discussion. We hope this explanation resolves your concerns and demonstrates how our findings fit within and expand current understanding of multilingual model alignment and enhancement.

Reference:

[1] Zhou C, Liu P, Xu P, et al. Lima: Less is more for alignment[C]. NeurIPS 2023.

[2] Lin B Y, Ravichander A, Lu X, et al. The unlocking spell on base llms: Rethinking alignment via in-context learning[C]. ICLR 2024.

[3] Yan Y, Li J, Zhang Y, et al. Exploring the LLM Journey from Cognition to Expression with Linear Representations[C]. ICML 2024.

Comment

Dear Reviewer 1RPm,

Could you please let us know if our responses regarding the clarification of our key motivation and broader insights satisfactorily address the remaining issues? We would greatly appreciate any further suggestions or clarifications you may have and are happy to discuss them further if needed.

Thank you again for your time and consideration.

Comment

Thank you for your valuable feedback on our paper. We appreciate the recognition of resource-efficiency, competitive performance, and good interpretability of our work. We address your concerns and questions as follows.


Weakness 1: Missing Reference: Previous work has explored how to enhance multilingual abilities through aligning internal sentence representations, but there is a lack of detailed introduction to this relevant research.

Thank you for pointing this out. However, we believe there may have been a misunderstanding. In fact, we have already cited the mentioned work (lines 648 – 651) in the Related Work section (line 128) and discussed its limitations in the Experiment section (lines 416 - 418) of our paper. Since it shares a similar idea with SDRRL and QAlign in aligning internal sentence representations, we did not reproduce it separately in our original submission.

Following your suggestion, we have now included reproduction results for this method across three backbones under both bilingual and multilingual enhancement settings, in terms of multilingual understanding (MU) and multilingual generation (MG), as shown in the tables below (CLA is the baseline method you mentioned). The results indicate that it still struggles to effectively enhance target language performance while maintaining central language performance.

LLaMA-3-8B-Instruct

  • Bilingual (En, Zh)
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| LLaMA-3 | 74.60 | 69.02 | 6.99 | 2.72 |
| xSFT | 74.07 | 71.85 | 4.79 | 2.94 |
| xSFT-Full | 70.97 | 69.55 | 5.80 | 4.44 |
| SDRRL | 73.73 | 68.31 | 6.60 | 3.84 |
| QAlign | 66.90 | 51.28 | 3.59 | 1.23 |
| CLA | 73.80 | 70.26 | 6.47 | 4.41 |
| Lens | 74.30 | 73.67 | 7.21 | 5.77 |
  • Multilingual (En, Zh, Jp, Ar, Ko, Sw, Bn)
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3 | 74.60 | 69.02 | 62.60 | 35.34 | 55.79 | 39.30 | 66.33 | 6.99 | 2.72 | 4.02 | 2.71 | 2.30 | 2.86 | 2.57 |
| xSFT | 70.20 | 62.27 | 62.50 | 32.40 | 52.97 | 33.30 | 63.85 | 5.48 | 3.01 | 2.24 | 1.85 | 2.21 | 1.85 | 1.68 |
| xSFT-Full | 72.37 | 68.45 | 62.25 | 35.00 | 53.70 | 37.00 | 72.95 | 5.91 | 4.30 | 3.76 | 2.48 | 3.77 | 2.48 | 3.10 |
| SDRRL | 59.73 | 49.73 | 37.60 | 25.50 | 52.45 | 28.20 | 51.55 | 4.64 | 1.91 | 1.81 | 1.81 | 1.81 | 1.81 | 1.52 |
| QAlign | 67.07 | 56.13 | 46.60 | 29.70 | 51.93 | 31.10 | 51.05 | 2.94 | 1.37 | 1.02 | 1.18 | 1.15 | 1.18 | 1.07 |
| CLA | 72.77 | 66.85 | 60.50 | 31.70 | 53.91 | 34.00 | 65.05 | 6.50 | 3.47 | 1.81 | 1.98 | 3.23 | 3.19 | 1.99 |
| Lens | 73.50 | 72.79 | 63.58 | 35.56 | 56.52 | 40.08 | 67.89 | 7.01 | 5.57 | 4.21 | 3.19 | 4.51 | 4.29 | 2.96 |

LLaMA-3.1-8B-Instruct

  • Bilingual (En, Zh)
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| LLaMA-3.1 | 76.40 | 75.74 | 7.31 | 5.38 |
| xSFT | 76.00 | 75.32 | 5.33 | 3.32 |
| xSFT-Full | 72.37 | 70.75 | 6.02 | 4.18 |
| SDRRL | 74.00 | 70.31 | 6.49 | 3.14 |
| QAlign | 71.40 | 47.20 | 4.13 | 2.65 |
| CLA | 77.20 | 75.41 | 6.39 | 4.49 |
| Lens | 76.53 | 76.01 | 7.41 | 5.96 |
  • Multilingual (En, Zh, Jp, Ar, Ko, Sw, Bn)
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1 | 76.37 | 75.66 | 60.90 | 39.10 | 57.77 | 43.40 | 66.70 | 7.31 | 5.38 | 5.43 | 3.98 | 4.88 | 5.22 | 3.98 |
| xSFT | 74.93 | 74.97 | 63.55 | 37.70 | 54.95 | 42.60 | 69.70 | 7.35 | 3.75 | 2.70 | 2.21 | 3.08 | 3.15 | 2.38 |
| xSFT-Full | 76.83 | 73.89 | 64.00 | 36.00 | 57.35 | 39.90 | 72.85 | 6.31 | 4.43 | 4.11 | 2.96 | 4.03 | 4.21 | 3.19 |
| CLA | 76.57 | 75.52 | 66.55 | 37.60 | 56.41 | 40.60 | 71.10 | 6.81 | 3.95 | 2.73 | 2.94 | 3.30 | 3.14 | 2.55 |
| Lens | 76.67 | 75.79 | 61.20 | 39.10 | 58.60 | 43.80 | 66.40 | 7.38 | 5.92 | 5.23 | 4.13 | 4.92 | 5.22 | 4.19 |

Phi-3.5-mini-Instruct

  • Bilingual (En, Zh)
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| Phi-3.5 | 81.00 | 71.40 | 6.18 | 4.92 |
| xSFT | 81.43 | 71.66 | 5.29 | 3.31 |
| xSFT-Full | 80.07 | 69.74 | 5.25 | 3.84 |
| SDRRL | 81.17 | 71.44 | 6.15 | 4.03 |
| QAlign | 78.50 | 67.01 | 5.28 | 3.15 |
| CLA | 80.13 | 71.86 | 6.08 | 4.26 |
| Lens | 80.97 | 71.51 | 6.44 | 5.16 |
  • Multilingual (En, Zh, Jp, Ar, Ko, Sw, Bn)
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5 | 80.97 | 71.44 | 59.10 | 31.80 | 60.27 | 36.83 | 52.35 | 6.18 | 4.92 | 4.33 | 1.34 | 4.79 | 3.92 | 1.48 |
| xSFT | 79.30 | 69.54 | 57.05 | 32.50 | 57.98 | 37.40 | 53.00 | 5.39 | 3.74 | 2.74 | 1.27 | 2.71 | 2.31 | 1.52 |
| xSFT-Full | 79.70 | 69.87 | 56.00 | 32.20 | 57.77 | 36.30 | 56.20 | 5.49 | 3.94 | 2.85 | 1.61 | 3.15 | 3.03 | 1.75 |
| CLA | 80.67 | 70.74 | 58.50 | 32.00 | 60.13 | 36.20 | 53.20 | 5.84 | 4.48 | 3.48 | 1.35 | 3.87 | 3.26 | 1.58 |
| Lens | 80.97 | 71.41 | 58.95 | 32.10 | 60.27 | 37.20 | 52.45 | 6.40 | 4.94 | 4.34 | 1.49 | 4.74 | 4.12 | 1.51 |

This further reinforces a key conclusion of our paper: aligning language representations alone cannot achieve substantial performance improvements. Further, we should also enhance the separation between representations of different languages in the language-specific subspace, a point overlooked by existing works.

We appreciate your suggestion, which has prompted us to further discuss and analyze this important related work.

Official Review (Rating: 8)

This paper introduces LENS, an innovative approach that enhances the multilingual capabilities of large language models (LLMs) by manipulating their internal language representation spaces. It operates on the top layers of LLMs, aligning target languages with a central language in a shared semantic space while differentiating them in a language-specific space. This method significantly improves multilingual performance without compromising the original language abilities and does so with fewer computational resources compared to existing post-training techniques.

Strengths

  1. Novelty of Approach: LENS introduces a novel perspective on multilingual enhancement by leveraging the internal language representation spaces of LLMs, offering a fresh approach compared to traditional data-driven post-training methods.

  2. Efficiency and Effectiveness: LENS demonstrates high efficiency and effectiveness by achieving superior multilingual performance with significantly less computational resources, making it scalable and practical for large-scale applications.

  3. Preservation of Central Language Abilities: A key strength of LENS is its ability to enhance multilingual capabilities without sacrificing the model's original central language performance, addressing the common issue of catastrophic forgetting.

  4. Comprehensive Improvement: LENS shows a comprehensive improvement across various multilingual tasks, including both comprehension and generation, which is a significant advancement over methods that focus on only one aspect of language performance.

  5. Transparency and Interpretability: LENS provides transparent and interpretable solutions for multilingual enhancements, allowing for better understanding and control over how language models process and represent different languages.

Weaknesses

  1. Typical Multilingual General or Unique Case Performance: The LENS approach, while effective in enhancing multilingual capabilities, may encounter challenges when dealing with languages that have unique grammatical structures or vocabularies significantly divergent from the central language. The method's reliance on internal representations might not fully capture the intricacies of such languages, potentially leading to suboptimal performance in tasks requiring deep linguistic understanding.

  2. Alignment in Multilingual Tasks such as Machine Translation: Although LENS improves multilingual performance by manipulating language representations, it might not fully address the complexities of tasks like machine translation, especially for low-resource languages. The scarcity of high-quality parallel corpora for these languages could hinder the model's ability to learn the fine-grained linguistic nuances necessary for accurate and fluent translations.

Questions

  1. Typical Cases Analysis: How does the paper analyze typical cases for the language-agnostic and language-specific subspaces in Parts 4 and 5, and what implications do these analyses have for the performance of LENS in handling multilingual tasks?

  2. LENS vs. xSFT-Full Performance: In Figure 2, why does LENS not perform better than xSFT-Full for the Swahili language, and what factors might contribute to xSFT-Full's superior performance in this specific case?

Comment

Question 1: How does the paper analyze typical cases for the language-agnostic and language-specific subspaces in Parts 4 and 5, and what implications do these analyses have for the performance of LENS in handling multilingual tasks?

Thank you for your insightful question regarding the analysis of language-agnostic and language-specific subspaces. We are pleased to provide further clarification and address this point in detail.

  • First, in Section 5.1, we independently analyze the contributions of the language-agnostic and language-specific subspaces to the improvement of multilingual capabilities. The results highlight that both subspaces contribute to enhancing multilingual performance, with the language-specific subspace contributing more substantially. This finding underscores the importance of explicitly modeling language-specific properties, which has often been overlooked in previous works.

  • Second, in Section 5.2, we examine the impact of manipulating language-agnostic and language-specific subspaces at different layers of the backbone. The results align with conclusions from existing model interpretability research, showing that language-specific parameters predominantly reside in the higher layers of the model. This layer-specific behavior further supports the design of LENS and its focus on top-layer operation.

  • Finally, in Section 5.4, we provide a visualization of the representations. The results show that representations of different languages tend to cluster more tightly within the language-agnostic subspace while being more dispersed in the language-specific subspace. This visualization effectively demonstrates how LENS balances shared semantic alignment with the preservation of language-specific distinctions.

Once again, we sincerely thank you for this question, which allows us to elaborate further on these analyses. We will ensure these points are more explicitly highlighted in the revised manuscript.
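As a rough illustration of the kind of check behind the visualization analysis above, the following sketch (ours, with dummy data standing in for the projected per-language representations) compares average pairwise cosine similarity in the two subspaces; all names and shapes are illustrative assumptions.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Average pairwise cosine similarity between the rows of X (diagonal excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    n = len(X)
    return (sim.sum() - n) / (n * (n - 1))

# Dummy per-language mean representations projected onto each subspace
# (in practice these would come from the probing step).
rng = np.random.default_rng(0)
reps_agnostic = rng.standard_normal((7, 64)) + 5.0   # shifted rows -> tightly clustered
reps_specific = rng.standard_normal((7, 64))         # centred rows -> more dispersed

# Expectation: higher similarity (tighter clustering) in the language-agnostic
# subspace than in the language-specific one.
print(mean_pairwise_cosine(reps_agnostic), mean_pairwise_cosine(reps_specific))
```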


Question 2: In Figure 2, why does LENS not perform better than xSFT-Full for the Swahili language, and what factors might contribute to xSFT-Full's superior performance in this specific case?

Thank you for raising this important observation. We believe the performance gap for Swahili may be attributed to the uneven quality of the training data used in our experiments. Specifically, the Bactrian-X dataset used for training derives its input from Google Translate and its output from GPT-3.5-turbo, meaning the dataset quality depends heavily on these two sources. As a result, inconsistencies in translation and generation quality can introduce noise, leading to uneven performance gains from data-driven post-training approaches like xSFT-Full. This highlights one of the key limitations of the current data-driven paradigms.

In contrast, LENS seeks supervision signals internally from the backbone itself, bypassing the need for extensive reliance on potentially noisy external datasets. This intrinsic approach allows LENS to achieve consistent improvements over the backbone model across a wide range of languages, demonstrating better scalability and robustness. We have also demonstrated this phenomenon in our experiments, showcasing LENS’s broader applicability. In our future work, we propose combining the LENS training paradigm with advancements in data selection and filtering methods. We believe this hybrid approach holds great potential for further enhancing multilingual performance.

Once again, thank you for your thoughtful question. We will emphasize this point more explicitly in the revised manuscript.


We hope these clarifications address your concerns adequately. Thank you once again for your detailed and thoughtful feedback, which has been invaluable in refining our work.

Comment

We sincerely thank you for your insightful feedback and encouraging comments on our paper. We appreciate the recognition of the strengths of our approach, particularly the novelty of the methodology, its efficiency and effectiveness, the preservation of central language abilities, the comprehensive improvement across multilingual tasks, and the transparency and interpretability of the model. We will now address your concerns point by point.


Weakness 1: The approach may encounter challenges when dealing with languages that have unique grammatical structures or vocabularies significantly divergent from the central language.

Thank you for raising this important point. We would like to clarify that:

  • The languages selected for enhancement in our study, particularly Chinese, Japanese, Korean, and Arabic, are structurally distant from the central language (English). Despite these linguistic divergences, our experimental results demonstrate that LENS effectively improves the performance for these languages, highlighting its robustness across diverse linguistic structures.

  • Additionally, our evaluation benchmarks encompass a variety of challenging tasks, including multilingual commonsense reasoning, multilingual world knowledge, and multilingual multi-turn instruction following. These tasks are designed to assess the model’s deep understanding of different languages, providing strong evidence of LENS’s capability to handle linguistic diversity.

We sincerely appreciate your suggestion and will emphasize this point more clearly in the revised manuscript.


Weakness 2: Although LENS improves multilingual performance by manipulating language representations, it might not fully address the complexities of tasks like machine translation, especially for low-resource languages.

Thank you for your thoughtful feedback and for highlighting the importance of machine translation as a benchmark for multilingual capabilities. In response to your suggestion, we have supplemented our experiments with evaluations on the FLORES-101 dataset [1]. Specifically, we assess the bidirectional translation performance between each target language and English, reporting scores using the COMET metric with the WMT22-comet-da model [2].

  • X to En
| Method | Zh | Jp | Ar | Ko | Bn | Sw |
|---|---|---|---|---|---|---|
| LLaMA-3 | 85.4 | 86.15 | 84.77 | 86.07 | 85.51 | 78.15 |
| xSFT | 70.41 | 72.4 | 67.09 | 72.43 | 59.52 | 73.56 |
| xSFT-Full | 84.99 | 85.52 | 84.32 | 85.14 | 82.42 | 80.28 |
| QAlign | 85.52 | 85.26 | 83.11 | 84.96 | 83.13 | 73.66 |
| SDRRL | 44.78 | 45.73 | 40.87 | 45.29 | 45.05 | 41.51 |
| Ours | 85.64 | 86.23 | 85.15 | 86.07 | 85.67 | 80.05 |
  • En to X
| Method | Zh | Jp | Ar | Ko | Bn | Sw |
|---|---|---|---|---|---|---|
| LLaMA-3 | 85.28 | 88.32 | 76.51 | 84.53 | 80.14 | 71.44 |
| xSFT | 83.78 | 82.22 | 74.3 | 81.08 | 73.4 | 58.48 |
| xSFT-Full | 85.79 | 88.48 | 81.11 | 85.23 | 76.34 | 76.32 |
| QAlign | 61.65 | 58.66 | 49.41 | 57.16 | 41.1 | 50.96 |
| SDRRL | 62.52 | 57.65 | 43.83 | 64.11 | 68.74 | 60.0 |
| Ours | 85.59 | 88.47 | 79.52 | 85.77 | 80.2 | 71.88 |

The experimental results demonstrate that LENS still effectively enhances the multilingual machine translation performance, further validating its robustness across diverse multilingual tasks.
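For reference, a minimal sketch of how such scores can be computed, assuming the open-source unbabel-comet package (v2.x) and its WMT22-comet-da checkpoint; the example triplet below is a placeholder, not data from the paper.

```python
# Minimal sketch, assuming `pip install unbabel-comet`; triplets are placeholders.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

data = [
    {"src": "Source sentence in the target language.",
     "mt":  "System translation into English.",
     "ref": "Reference English translation."},
]

# Per-segment scores and the corpus-level (system) score; gpus=0 runs on CPU.
output = comet_model.predict(data, batch_size=8, gpus=0)
print(output.scores, output.system_score)
```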

We sincerely appreciate your valuable suggestion, which has enriched and completed our evaluation framework. This addition provides a more comprehensive demonstration of LENS’s effectiveness.

References:

[1] Goyal N, Gao C, Chaudhary V, et al. The flores-101 evaluation benchmark for low-resource and multilingual machine translation[J]. Transactions of the Association for Computational Linguistics, 2022, 10: 522-538.

[2] Rei R, De Souza J G C, Alves D, et al. COMET-22: Unbabel-IST 2022 submission for the metrics shared task[C]//Proceedings of the Seventh Conference on Machine Translation (WMT). 2022: 578-585.

Official Review (Rating: 8)

This paper presents a novel method, Lens, to enhance the multilingual capabilities of LLMs. Lens first explores the subspaces of Language-Specific and Language-Agnostic features, and introduces three training objectives—pull, push, and retain—to optimize the model's multilingual performance. Experiments across various understanding and generation tasks demonstrate that Lens effectively prevents catastrophic forgetting and significantly improves the performance of LLMs.

Strengths

  • This paper is innovative, offering a fresh perspective on enhancing the multilingual capabilities of large models. It not only provides new ideas for future multilingual research but also helps the Chinese NLP community better understand large models.
  • The proposed method verifies its effectiveness on multiple datasets including NLU and NLG and avoids problems such as catastrophic forgetting.
  • This paper is well written, and the figures and tables are well drawn, making it easy to understand.

Weaknesses

  • The experiments in the paper primarily compare Chinese and English, with additional languages including Japanese, which belongs to the same language family as Chinese, as well as low-resource languages like Bengali and Swahili, which are more distant from the representation space of English in LLMs. I am curious about the extent of improvement the method proposed in the paper can offer when the representation space of LLMs for languages within the same language family as English is closer to that of English, such as Es, Fr, and De.

  • This paper compares two SFT schemes with different data sizes, xSFT and xSFT-Full. But I still recommend that the authors compare SFT based on the LoRA version, as some experimental work [1] suggests that LoRA-based SFT can effectively prevent catastrophic forgetting, particularly when data quality is insufficient, such as when training data is sourced from automatic translation.

  • Some explanatory text needs to be added to explain the authors' motivations and make it easier for readers to follow. For example, the function Span() in Equation (2) and the design of Equation (3).

  • Typo: line 1009, “batchsize” -> “batch size”.

[1] MindMerger: Efficient Boosting LLM Reasoning in non-English Languages

Questions

How different data sizes affect Lens, and how much improvement can be achieved using more training data.

Comment

Weakness 2: I still recommend that the authors compare SFT based on the LoRA version.

Thank you for recommending the comparison with LoRA-based SFT. We acknowledge the importance of this perspective and here are the additional results under both bilingual and multilingual enhancement settings on all three backbones in terms of multilingual understanding (MU) and multilingual generation (MG) performance.

LLaMA-3-8B-Instruct

  • Bilingual
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| LLaMA-3 | 74.60 | 69.02 | 6.99 | 2.72 |
| xSFT-LoRA | 75.20 | 69.91 | 6.79 | 3.36 |
| xSFT-Full-LoRA | 74.33 | 69.64 | 6.05 | 4.68 |
| xSFT | 74.07 | 71.85 | 4.79 | 2.94 |
| xSFT-Full | 70.97 | 69.55 | 5.80 | 4.44 |
| SDRRL | 73.73 | 68.31 | 6.60 | 3.84 |
| QAlign | 66.90 | 51.28 | 3.59 | 1.23 |
| Lens | 74.30 | 73.67 | 7.21 | 5.77 |
  • Multilingual
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3 | 74.60 | 69.02 | 62.60 | 35.34 | 55.79 | 39.30 | 66.33 | 6.99 | 2.72 | 4.02 | 2.71 | 2.30 | 2.86 | 2.57 |
| xSFT-LoRA | 73.47 | 68.82 | 63.10 | 33.10 | 54.74 | 39.30 | 67.35 | 5.98 | 4.64 | 3.47 | 3.29 | 3.95 | 4.08 | 2.66 |
| xSFT-Full-LoRA | 73.67 | 70.43 | 62.95 | 35.20 | 57.25 | 37.80 | 74.90 | 5.98 | 4.31 | 3.84 | 2.71 | 3.81 | 3.8 | 3.28 |
| xSFT | 70.20 | 62.27 | 62.50 | 32.40 | 52.97 | 33.30 | 63.85 | 5.48 | 3.01 | 2.24 | 1.85 | 2.21 | 1.85 | 1.68 |
| xSFT-Full | 72.37 | 68.45 | 62.25 | 35.00 | 53.70 | 37.00 | 72.95 | 5.91 | 4.30 | 3.76 | 2.48 | 3.77 | 2.48 | 3.10 |
| SDRRL | 59.73 | 49.73 | 37.60 | 25.50 | 52.45 | 28.20 | 51.55 | 4.64 | 1.91 | 1.81 | 1.81 | 1.81 | 1.81 | 1.52 |
| QAlign | 67.07 | 56.13 | 46.60 | 29.70 | 51.93 | 31.10 | 51.05 | 2.94 | 1.37 | 1.02 | 1.18 | 1.15 | 1.18 | 1.07 |
| Lens | 73.50 | 72.79 | 63.58 | 35.56 | 56.52 | 40.08 | 67.89 | 7.01 | 5.57 | 4.21 | 3.19 | 4.51 | 4.29 | 2.96 |

LLaMA-3.1-8B-Instruct

  • Bilingual
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| LLaMA-3.1 | 76.40 | 75.74 | 7.31 | 5.38 |
| xSFT-LoRA | 76.00 | 75.44 | 7.16 | 4.84 |
| xSFT-Full-LoRA | 76.33 | 74.77 | 6.50 | 4.51 |
| xSFT | 76.00 | 75.32 | 5.33 | 3.32 |
| xSFT-Full | 72.37 | 70.75 | 6.02 | 4.18 |
| SDRRL | 74.00 | 70.31 | 6.49 | 3.14 |
| QAlign | 71.40 | 47.20 | 4.13 | 2.65 |
| Lens | 76.53 | 76.01 | 7.41 | 5.96 |
  • Multilingual
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1 | 76.37 | 75.66 | 60.90 | 39.10 | 57.77 | 43.40 | 66.70 | 7.31 | 5.38 | 5.43 | 3.98 | 4.88 | 5.22 | 3.98 |
| xSFT-LoRA | 76.40 | 73.83 | 64.90 | 38.00 | 58.39 | 41.00 | 72.45 | 6.22 | 4.79 | 3.84 | 3.29 | 4.13 | 4.27 | 2.89 |
| xSFT-Full-LoRA | 75.17 | 73.12 | 59.70 | 37.20 | 59.54 | 42.80 | 72.70 | 6.17 | 4.44 | 3.73 | 2.71 | 4.01 | 4.11 | 3.14 |
| xSFT | 74.93 | 74.97 | 63.55 | 37.70 | 54.95 | 42.60 | 69.70 | 7.35 | 3.75 | 2.70 | 2.21 | 3.08 | 3.15 | 2.38 |
| xSFT-Full | 76.83 | 73.89 | 64.00 | 36.00 | 57.35 | 39.90 | 72.85 | 6.31 | 4.43 | 4.11 | 2.96 | 4.03 | 4.21 | 3.19 |
| Lens | 76.67 | 75.79 | 61.20 | 39.10 | 58.60 | 43.80 | 66.40 | 7.38 | 5.92 | 5.23 | 4.13 | 4.92 | 5.22 | 4.19 |

Phi-3.5-mini-Instruct

  • Bilingual
| Method | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| Phi-3.5 | 81.00 | 71.40 | 6.18 | 4.92 |
| xSFT-LoRA | 81.33 | 71.59 | 6.23 | 4.70 |
| xSFT-Full-LoRA | 79.83 | 70.87 | 5.36 | 3.96 |
| xSFT | 81.43 | 71.66 | 5.29 | 3.31 |
| xSFT-Full | 80.07 | 69.74 | 5.25 | 3.84 |
| SDRRL | 81.17 | 71.44 | 6.15 | 4.03 |
| QAlign | 78.50 | 67.01 | 5.28 | 3.15 |
| Lens | 80.97 | 71.51 | 6.44 | 5.16 |
  • Multilingual
| Method | MU En | MU Zh | MU Ar | MU Bn | MU Jp | MU Ko | MU Sw | MG En | MG Zh | MG Ar | MG Bn | MG Jp | MG Ko | MG Sw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3.5 | 80.97 | 71.44 | 59.10 | 31.80 | 60.27 | 36.83 | 52.35 | 6.18 | 4.92 | 4.33 | 1.34 | 4.79 | 3.92 | 1.48 |
| xSFT-LoRA | 80.47 | 69.84 | 58.80 | 31.20 | 58.60 | 33.70 | 53.00 | 5.36 | 3.98 | 2.94 | 1.59 | 2.96 | 2.61 | 1.51 |
| xSFT-Full-LoRA | 79.50 | 70.54 | 55.20 | 31.10 | 59.12 | 35.60 | 52.15 | 5.46 | 3.86 | 3.02 | 1.49 | 3.36 | 2.85 | 1.56 |
| xSFT | 79.30 | 69.54 | 57.05 | 32.50 | 57.98 | 37.40 | 53.00 | 5.39 | 3.74 | 2.74 | 1.27 | 2.71 | 2.31 | 1.52 |
| xSFT-Full | 79.70 | 69.87 | 56.00 | 32.20 | 57.77 | 36.30 | 56.20 | 5.49 | 3.94 | 2.85 | 1.61 | 3.15 | 3.03 | 1.75 |
| Lens | 80.97 | 71.41 | 58.95 | 32.10 | 60.27 | 37.20 | 52.45 | 6.40 | 4.94 | 4.34 | 1.49 | 4.74 | 4.12 | 1.51 |

Based on our experimental results, we derived the following key conclusions:

  • For preserving the central language’s capabilities, incorporating LoRA-based SFT is indeed more effective at preventing catastrophic forgetting than its full-parameter counterpart. However, it primarily protects multilingual understanding (MU) capabilities, while multilingual generation (MG) capabilities are still significantly affected.

  • For target language enhancement, LoRA-based methods also tend to improve MU tasks more than MG tasks.

  • By contrast, our proposed Lens achieves a more comprehensive performance, simultaneously enhancing understanding and generation for target languages while maintaining both the understanding and generation capabilities of the central language across different base models.

We appreciate your suggestion and will include the above experimental results and discussion in the revised manuscript to provide more empirical insights.


We hope these clarifications address your concerns. Thank you once again for your detailed and thoughtful feedback, which has been invaluable in refining our work.

Comment

Thank you to the authors for the detailed supplementary experiments. The discussion on the effects of LoRA in mitigating catastrophic forgetting is insightful. I recommend emphasizing this analysis in the next version, with an expanded comparison to prior work (such as the paper referenced in my first comment) on LoRA’s effect on multilingual models.

Considering the author's promise to add new findings in the next version, I decided to increase the rating.

Comment

Thank you for your recognition of our work and for your thoughtful feedback on our rebuttal. We are truly grateful for your valuable suggestions, which have significantly contributed to making our experimental results more comprehensive. We will carefully follow your advice and incorporate these additional results and discussions into the final version of the paper.

Once again, we sincerely appreciate your constructive comments and support throughout the review process.

Comment

Thank you for your valuable feedback on our paper. We appreciate the recognition of novelty of our method and its potential impact on the multilingual research community. We address your concerns and questions as follows.


Weakness 1: I am curious about the extent of improvement the method proposed in the paper can offer when the representation space of LLMs for languages within the same language family as English is closer to that of English, such as Es, Fr, and De.

Thank you for this insightful suggestion.

In our work, we deliberately focused on languages that are currently more under-represented in existing LLMs to highlight the robustness of our method in enhancing multilingual capabilities across a diverse range of linguistic characteristics. We agree that evaluating languages within the same family as English, such as Spanish (Es), French (Fr), and German (De), is also valuable.

Here are our supplementary experimental results based on LLaMA-3-8B-Instruct, where these three languages are also not supported according to its official model card [1]. For multilingual understanding (MU) evaluation, we adopt the M-MMLU dataset, which covers all four languages (En, Es, Fr, and De); multilingual generation (MG) evaluation is performed on MT-Bench.

| Method | MU (En) | MU (Es) | MU (Fr) | MU (De) | MG (En) | MG (Es) | MG (Fr) | MG (De) |
|---|---|---|---|---|---|---|---|---|
| LLaMA-3 | 64.90 | 53.50 | 52.50 | 56.50 | 6.99 | 5.88 | 5.27 | 4.56 |
| xSFT-LoRA | 66.10 | 53.30 | 51.40 | 56.30 | 6.30 | 5.23 | 5.03 | 4.68 |
| xSFT-Full-LoRA | 65.00 | 51.40 | 50.70 | 56.40 | 6.13 | 4.92 | 4.76 | 4.79 |
| xSFT | 64.90 | 50.80 | 49.90 | 56.00 | 5.73 | 4.36 | 4.47 | 4.00 |
| xSFT-Full | 60.00 | 50.90 | 48.90 | 51.50 | 5.95 | 4.60 | 4.42 | 4.33 |
| SDRRL | 64.00 | 48.90 | 47.80 | 50.30 | 6.09 | 2.74 | 3.06 | 2.54 |
| QAlign | 62.80 | 48.50 | 46.60 | 51.30 | 3.61 | 2.88 | 2.91 | 2.31 |
| Lens | 64.60 | 53.70 | 52.10 | 57.10 | 7.10 | 5.90 | 5.63 | 4.90 |

The results show that our Lens achieves comparable improvements for these languages as well. We will include these results in the revised version to provide a more comprehensive evaluation.

Reference:

[1] https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct


Weakness 3: Some explanatory text needs to be added to explain the authors' motivations and make it easier for readers to follow, for example, the function Span() in Equation (2) and the design of Equation (3).

We agree that some equations could benefit from clearer explanations to enhance readability. Specifically:

  • For Equation (2): in linear algebra, Span() of the columns of a matrix refers to the set of all possible linear combinations of those column vectors; this concept is commonly used to describe subspaces. Here, the constraint indicates that our language-agnostic and language-specific subspaces must be orthogonal to each other.

  • For Equation (3): it aims to identify a direction of language expression within the language-specific subspace, ensuring that each target language can be effectively expressed along this direction.

These additions will make the underlying motivations more transparent to the reader.
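For completeness, the standard linear-algebra definitions behind this explanation, written in our own notation (with $V_a$ and $V_s$ denoting bases of the language-agnostic and language-specific subspaces; the symbols are ours, not necessarily those of the paper):

$$
\mathrm{Span}(V) = \{\, V c \mid c \in \mathbb{R}^{k} \,\}, \qquad \mathrm{Span}(V_a) \perp \mathrm{Span}(V_s) \iff V_a^{\top} V_s = 0 .
$$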


Weakness 4: Typo: line 1009, “batchsize” -> “batch size”.

Thank you for pointing out the typo in line 1009. We have corrected batchsize to batch size in the revised version, which is highlighted in orange.


Question 1: How different data sizes affect Lens, and how much improvement can be achieved using more training data.

Thank you for this insightful question. To address this, we investigated the impact of varying training data sizes (from 50 to 1,000) on the performance of LENS.

| Data size | MU (En) | MU (Zh) | MG (En) | MG (Zh) |
|---|---|---|---|---|
| 50 | 74.97 | 70.51 | 7.06 | 2.88 |
| 100 | 74.97 | 71.93 | 6.94 | 4.83 |
| 200 | 74.30 | 73.67 | 7.21 | 5.77 |
| 500 | 74.03 | 70.02 | 7.19 | 4.88 |
| 1000 | 74.07 | 68.78 | 6.88 | 3.51 |

The results indicate that increasing the amount of training data leads to diminishing returns for LENS, a trend consistent with the observations for xSFT and xSFT-Full. This finding reinforces our claim (lines 361 - 367) that for extensively pre-trained LLMs such as LLaMA-3 (trained on over 15T tokens), over-reliance on more training data falls short of meeting scalability needs. Instead of focusing on larger training datasets, it is more critical to identify supervision signals that are both reliable and scalable. This directly motivates us to seek internal supervision from the central language within the backbone itself. We hope that LENS inspires future research to explore more efficient, scalable, and automated supervision signals for multilingual enhancement of state-of-the-art LLMs.

Once again, thank you for your thoughtful question. We will further emphasize this point in the revised manuscript.

AC Meta-Review

This work proposes LENS (multiLingual Enhancement method based on the hidden represeNtations within language Space of LLMs), a new method to improve the multilingual performance of LLMs by modifying their internal representation spaces. Specifically, LENS consists of two steps; the first is language subspace probing, where the representations at each layer are separated into language-specific and language-agnostic subspaces using SVD. Then, in the second step, they perform language subspace manipulation, where the representations in the language-agnostic subspace are aligned and those in the language-specific one are separated. The experiments with 3 open-source LLMs show that LENS reduces catastrophic forgetting of the central, or source, language and improves performance on the target languages.

Strengths:

  • The paper builds on prior work on aligning subspaces in LLMs by also differentiating subspaces for language-specific portions rather than just aligning them (U88a, CKFG).
  • The experiments suggest that this method provides improvements on a variety of tasks (U88a, CKFG, 1RPm, yppZ). The method also reduces some drawbacks to traditional model finetuning for specialization, including mitigating catastrophic forgetting (U88a, CKFG).
  • The method is also more efficient than prior work in the area, particularly compared to model-training-based approaches (CKFG, 1RPm), and is more interpretable (CKFG, 1RPm).
  • The paper is well-written and easy to understand (U88a, yppZ).

The authors also addressed many of the reviewers' concerns with their response and new experiments that were added to the paper.

Weaknesses:

  • The choice of languages for adaptation is somewhat random and limited, which makes it hard to understand how this method will generalize across different choices of central and target languages (U88a, yppZ). The authors add some additional experiments with languages more similar to English (the central language) in the rebuttal. Still, there is little discussion or analysis of how different types of relatedness to the central language (typology, script, vocabulary overlap, etc.) affect downstream performance.
  • The proposed method is quite similar to prior work (but includes the addition of separating the language-agnostic spaces further). While LENS shows fairly consistent gains over these prior method baselines, the improvements are quite small and not significance-tested. (yppZ)

Additional Comments on Reviewer Discussion

The authors provided detailed responses to each reviewer and added additional experiments to the paper based on their feedback. In response, two reviewers chose to increase their score.

Final Decision

Reject