PaperHub
Overall rating: 7.5/10 (Poster; 4 reviewers; min 7, max 8, std 0.5)
Individual ratings: 7, 8, 7, 8
Confidence: 4.5
COLM 2025

Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

What can multilingual LLM evaluation learn from MT evaluation?

Abstract

Keywords
multilingual, evaluation, meta-evaluation, machine translation evaluation

Reviews and Discussion

Official Review (Rating: 7)

This position paper aims to raise awareness of the flaws of procedures that are currently used to evaluate multilingual language models. For this, it draws a comparison with the long history of evaluation and meta-evaluation best practices in machine translation, which have been developed notably in the course of the WMT workshops since 2006. Five recommendations are then made regarding evaluation protocols and the best ways to report their findings, with a view to making them more transparent and reproducible. These recommendations are illustrated with focused experiments on contemporary mLLMs. A final section is devoted to meta-evaluation.

Reasons to Accept

Strengths:

  • an actionable list of recommendations for evaluating mLLMs, focused more on the comparison of mLLMs using multilingual benchmarks than on a proper evaluation of the actual multilingual abilities of the models;
  • a call for caution when meta-evaluating the evaluation of LLMs with crowd-sourced judgments (ELO);
  • a very extensive review of existing work on the evaluation of mLLMs, as well as a thorough historical coverage of (meta-)evaluation in the statistical and neural MT eras.

Reasons to Reject

Weaknesses:

  • Such calls for more scientific evaluations are not entirely new (e.g., BigScience, 2022, "BLOOM: a 176B multilingual model"; Biderman et al., 2024, "Lessons from the trenches", inter alia; neither is cited). These missing references are minor issues, but they hint at the fact that the community does not just need more recommendations, but also a discussion of how to enforce them, e.g., via checklists that evaluations of mLLMs should adhere to.
  • Recommendation #1 (against using translated test sets) could be made a bit stronger: we see a general proliferation of translated test sets, which can only increase as the choice of candidate MT systems widens, all of this adding an extra layer of confusion when comparing papers and systems. One lesson of all these MT evaluation exercises is that test sets must be prepared with all the necessary care (e.g., avoiding translationese effects) and human intervention, and yes, redone every year.
  • I also tend to view recommendation #3.3 as a bit problematic and ill-motivated. There are strong arguments (Colombo et al., "What are the best NLP systems?", Proc. NeurIPS 2023) against averaging across tasks, and even more so across languages. Grouping languages by their representativeness in the data into vague categories such as high-resource / low-resource is not always possible, and it does not solve the problem of aggregating numbers that are not directly comparable; this cannot be a lesson learned from the history of MT, where averaging BLEU scores across languages has never been regarded as making much sense.
  • Using LLMs-as-judges in a multilingual setting is also becoming a serious issue, worth perhaps a recommendation of its own.
  • Finally, there are two more lessons from the MT community that I did not see mentioned: (a) human evaluations are more reliable than automatic metrics, and compiling human scores also enables progress on the meta-evaluation of automatic metrics; (b) do not evaluate systems whose training data is not public, especially the many commercial mLLMs for which the training language mix is unknown.

Minor issues:

  • What is the colour code in Table 1? Tasks, as the small task icons tend to suggest? This could be made explicit in the caption.
  • In the references, make sure to uppercase acronyms and capitalise language names.
Comment

Thank you for the valuable comments and suggestions! Please find our answers below:

  1. Prior calls for more scientific evaluations: Thank you for the suggested references! We will include them in our citations. We shall put more emphasis on the checklist that we provide in Appendix I (and also on GitHub, upon release).
  2. Recommendation 1: This is a very good point, especially regarding how it affects the comparability of scores as MT evolves, and that it requires revisiting old benchmark translations with newer models. We will incorporate this into our discussion.
  3. Recommendation 3: Yes, averaging even across sub-groups can be problematic, and the ambiguity of what counts as low-resource complicates such recommendations. We will strengthen our demand for language breakdowns (they should always be included at least in supplemental material) and weaken the recommendation regarding the particular sub-groups we made here.
  4. Multilingual LLM-as-a-judge: Yes, and scores cannot be assumed to be comparable. Our results in Figure 8 and Figure 9 also give us reason for caution in using mLLMs as judges. We will emphasize this more in the discussion.
  5. Two more lessons: Great suggestions. For (a), we do emphasize this in Section 4.2, but we don't have empirical evidence to support the reliability directly. For (b), it is actually rare to have full transparency into training data. What would be the recommended way forward with models that don't provide insight into their training data? And what level of transparency would be needed to determine whether they should be included in an evaluation?
Comment

Thanks for your answers. I still think that averaging scores across languages should be discouraged, unless it can be clearly seen that all scores are truly comparable. I look forward to reading the final version of this work.

Official Review (Rating: 8)

This position+empirical paper argues for and showcases several current problems in the multilingual evaluation of LLMs. Many of these issues have appeared before in machine translation research (hence Déjà Vu): translated data sources, a lack of trustworthy and standardised reporting, the use of neural/LLM metrics, and the need for meta-evaluation from different aspects. The authors ran small-scale experiments to demonstrate the current limitations in evaluating LLMs and recommended better practices and standards for future work.

Reasons to Accept

  • The authors of this paper identified several drawbacks in the current multilingual LLM evaluation practice. These cover almost every aspect of the evaluation process, like data sourcing, prompts, metrics, reporting, and meta-evaluation. I consider the work to be quite comprehensive.
  • Their arguments are well-supported by reasoning, experiments, and visualisations (mostly Sec 3) as well as related past experience (mostly Sec 4).
  • I think the work can have a good impact in the field. The "lessons" can be consulted by practitioners to help the community conduct evaluations that are more transparent, fair, and meaningful.

Reasons to Reject

I vote to accept this paper.

Here I wish to point out that the organisation of the paper was causing some confusion:

  • Sec 3 questions (section titles) do not correspond to problems in Sec 2, so I had to go back and forth to check which recommendation addresses which drawback.
  • Sec 3 suggests adopting MT evaluation practices, and Sec 4 calls for meta-evaluation, but most citations still go to MT papers. To me, the difference is that Sec 4 is mostly "position" w/o experiments, but both sections advocate for adopting MT practices.

Questions for the Authors

In 3.3 (score aggregation): perhaps reporting the number of wins is also meaningful, since it accounts for language coverage without making assumptions about the score distribution [1].

In 4.1 (metrics) and 4.2 (human eval): it may be helpful to add some discussion of i) how LLMs can be over-optimised towards neural metrics [2], and ii) how even human preferences may be gamed (via emoji, formatting) and do not reflect utility [3].

[1] Pitfalls and Outlooks in Using COMET
[2] Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
[3] From Lists to Emojis: How Format Bias Affects Model Alignment

Comment

Thank you for these great suggestions! We will incorporate the suggestion to report number of wins, and we will refer to the provided papers to emphasize the complexity of automatic and human evaluation with considerations of reward hacking.

Comment

Thank you for your reply. I will maintain my original assessment, which I think is correct.

Official Review (Rating: 7)

The paper addresses the evaluation of multilingual large language models on different languages and different tasks. Since this type of evaluation is very new and the standards are not yet established, the authors propose to draw on the 20 years of experience in machine translation evaluation. The authors describe each of the current problems/challenges and present the recommended solution learnt from machine translation evaluation.

Reasons to Accept

The addressed problem is important and has not been investigated yet.

The well-established principles of machine translation evaluation can really be helpful, and the authors provide a good analysis and useful recommendations for the community.

The paper is well written (only some definitions/explanations are missing; see "drawbacks" below).

Reasons to Reject

Not really a reason to reject, but something that should be improved:

Namely, the data sets ('benchmarks') should be described briefly, in order to better understand what is going on.

For example, a machine translation test set contains the source text, a human-translated text as the correct solution, and the machine-translated text to be evaluated; the machine-translated text is then compared with the human-translated text.

It is, however, not explained what the generative data sets consist of and how the evaluation is carried out, which makes a number of the described details unclear.

Questions for the Authors

line 67: automatic prompt translations -- what do the benchmarks consist of? Prompts and the expected responses? Or something else? This should be clarified briefly.

Table 1: how are the popularity ranks defined and calculated? What do the yellow stars mean? (The orange stars represent the rank, right?)

What is the unit of "Size"? Number of documents?

What do the task symbols mean?

LLM-judged benchmarks are in contrast to what? To human-judged ones?

All of this could be explained/defined in the table caption.

line 74: what is "prompt sourcing"?

line 76: what is "test split"? Maybe it is a "test set", or "test part"? (in contrast to development/validation set/part)

line 101: heavily rely on aggregation -> what does that mean exactly?

line 124: what are "human annotated prompts"?

This should also be explained in Section 2 (what the data sets consist of).

line 125: Why via a pivot language and not directly? Which pivot language? OK, round-trip translation of the original prompt into a pivot language and back -- but what was the pivot language? English? Was the same pivot language used for all five languages? That is very important.

Table 3 appears way before the text that describes it -- it would be better for it to appear closer (and after the text rather than before it).

Also, it is not clear from the table caption what "all", "high", "medium" and "low" mean (a description should be added).

line 334: periodically releases => periodically release

line 347: on on correlation

Comment

Thank you for the clarification questions and encouragement to add more details about the benchmarks. Please find our explanations below:

  1. Benchmark descriptions: We assume this is referring to the benchmarks described in Section 2. We can add more detailed explanations of them in the Appendix, especially for the open-ended prompts that we use in Section 3.
  2. "what the generative data sets consist of and how the evaluation is carried out" - If this is referring to the experiments in Section 3, we will add a sentence explaining where the prompts (Aya human annotated and mArenahard) come from. Most important to know is that Aya human annotated prompts are crowdsourced and in the target languages (re: question about line 124), while mArenaHard prompts tend to be more difficult (30% math, coding domain) and are originally English and automatically translated. Appendix E.2 has the explanations of the evaluation with an LLM judge.
  3. Popularity ranks: The rank is grounded in the counts of each benchmark being used in model releases, which is detailed in Appendix A, Table 5. We can see that the star notation is confusing: yellow star outlines are just placeholders to indicate that 3 stars can be maximally given (for the most popular ones), but most benchmarks are not popular at all (0 stars).
  4. "size" - This corresponds to the number of prompts per language, we will clarify this.
  5. Task symbols: The task symbols are representing each task in a more visual way. We will add a legend.
  6. "LLM-judged benchmarks are the contrast to what?" - In contrast to those that are evaluated in relation to a reference answer, with a fixed evaluation metric (such as BLEU for machine translation tasks).
  7. "prompt sourcing" - The process of selecting prompts for a test set.
  8. "test split" - Test set, a set of benchmark prompts can be split into dev and test portions.
  9. Relying on aggregations: In order to draw conclusions about which model (or method, or variant - whatever the experimental intervention is that is tested in a given paper) is performing better, all results across languages are usually averaged uniformly.
  10. Pivot language: The pivot language is English for all languages except English, and for English it’s Portuguese. These details are described in Appendix G.1.
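As a toy illustration of the uniform averaging described in point 9 (a sketch with hypothetical scores, not numbers from the paper), a single macro-average can rank models one way while hiding the per-language behaviour:

```python
# Hypothetical per-language scores for two models (illustration only).
scores_a = {"en": 0.82, "de": 0.80, "fr": 0.81, "sw": 0.35, "yo": 0.30}
scores_b = {"en": 0.74, "de": 0.72, "fr": 0.73, "sw": 0.55, "yo": 0.52}

def macro_avg(scores: dict) -> float:
    """Uniform average over all languages, as commonly reported."""
    return sum(scores.values()) / len(scores)

print(round(macro_avg(scores_a), 3))  # 0.616
print(round(macro_avg(scores_b), 3))  # 0.652
# Model B wins the uniform average, but the per-language breakdown shows the two
# models behave very differently on high- vs. low-resource languages -- exactly
# the information that the single aggregate score discards.
```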
Comment

Thanks for the answers! Longer explanations are fine in the Appendix, but a short sentence about the pivot language should be included in the main text.

Comment

Yes, we will do that, thanks!

Official Review (Rating: 8)

The authors give recommendations for the evaluation of (L)LMs, taking inspiration from learnings in machine translation evaluation from the past ~20 years. Issues with current (multilingual) evaluation methodology are laid out as well, like using machine translation for translating prompts or using small data sets for evaluation. Some specific issues that are discussed:

  • Translationese: The authors show, with round-trip translation via a pivot language, that the worse the MT system used to produce the translations, the easier the task becomes.
  • Significance testing: In MT evaluation, significance tests are widespread.
  • Averaging over languages: The authors show that categories like high- or low-resource languages are also useful for LM evaluation.
  • Call for more in-depth evaluation: Performance on long vs. short prompts is the example here.
  • Reproducibility: The authors call for reproducible benchmarks, as exemplified by sacreBLEU, which outputs a signature identifying exactly how it was run (see the sketch after this list).
  • The authors point out that when doing LM-as-a-judge comparisons, the results may largely depend on the input language and evaluation LM.
  • Lastly, the authors call for meta-evaluation, to come up with evaluation protocols or quality dimensions (like MQM).
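As a minimal sketch of the reproducibility point above (assuming the sacreBLEU >= 2.0 Python API; the example sentences are made up), the metric object can emit a signature that records the tokenizer, smoothing, number of references, and library version, which is what should be reported alongside the score:

```python
# Minimal example of reproducible metric reporting with sacreBLEU.
from sacrebleu.metrics import BLEU

hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference stream

bleu = BLEU()  # default settings (13a tokenizer, exponential smoothing, ...)
result = bleu.corpus_score(hypotheses, references)

print(result.score)          # the BLEU score itself
print(bleu.get_signature())  # e.g. "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.x"
# Reporting the signature string lets others recompute the score with identical settings.
```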

Reasons to Accept

  • Insightful analysis of LM evaluation in contrast to work in MT.
  • Clear recommendations for LM evaluation.

Reasons to Reject

None

Comment

We thank all reviewers for their fantastic and thoughtful reviews. We are grateful for the time and the effort you have put into them, for your appreciation of our work, and your questions and suggestions. We add individual responses to address clarification questions, and we indicate where we will incorporate suggestions into the final submission.

Final Decision

The paper tries to transfer lessons learnt from decades of MT evaluation to the evaluation of multilingual LLMs (mLLMs), highlighting insights from the MT literature for every possible area of that problem ("like data sourcing, prompts, metrics, reporting, and meta-evaluation", reviewer gL8s), and performing small-scale experiments to illustrate claims and ideas. One reviewer in particular appreciated the thorough review, but raised concerns about some recommendations, such as averaging across languages.

Overall, this is a useful review and conversation paper; the reviewers all seem to highlight only minor issues, if any at all. Personally, I appreciated the thoroughness with which concepts were evaluated, as shown for example by the inclusion of Figure 2, and I agree that this could become a well-cited paper with influence in the field of mLLMs.