PaperHub
Overall score: 7.5/10
Poster · 4 reviewers
Ratings: 7, 7, 9, 7 (min 7, max 9, std dev 0.9)
Average confidence: 3.8
COLM 2025

One ruler to measure them all: Benchmarking multilingual long-context language models

OpenReview · PDF
Submitted: 2025-03-16 · Updated: 2025-08-26
TL;DR

ONERULER is a multilingual benchmark for evaluating long-context LLMs across 26 languages, extending RULER beyond English. It aims to assess model performance in diverse linguistic settings using seven tasks, including detecting absent information.

Abstract

Keywords
Multilingual · Benchmark · Long-context · Synthetic dataset

Reviews and Discussion

Review (Rating: 7)

This paper describes the benchmark "one ruler", containing two tasks for long-context evaluation, needle-in-a-haystack (NIAH) and common word extraction (CWE), for 26 languages. The NIAH task is novel in that it also includes the possibility of the needle not occurring in the context, which makes the task considerably harder, even when the needle is there. The paper describes the benchmark and experiments on it with five different LLMs, leading to some insights.

Reasons to Accept

The paper describes a useful multilingual resource that allows interesting cross-lingual analysis.

The experiments are interesting and hint at many interesting aspects of these models.

Reasons to Reject

Most of the experiments just scratch the surface. I would have liked, for instance, a clearer analysis of the performance differences between the different subtasks.

The CWE analysis and description are minimal. I think the paper would have been stronger without including that part, focusing only on NIAH, which is the most in-depth and interesting part.

Most novels used for context are very old, from 1605 to the early 20th century. This probably impacts the results, since the language in many cases differs from the modern language the models are trained on, and from the language in which the needles and instructions are formulated. This is not even mentioned in the paper. I understand that the reason for this is probably copyright, but it still needs to be discussed, especially since the novels differ in age and the amount of language change across the languages used likely differs as well. It is, for instance, worth noting that the top-performing language, Polish, has one of the most modern novels (1934). This issue may thus impact one of your main findings.

Questions for the Authors

I'm not an expert on NIAH, but it seems slightly weird to me that all needles are formulated in the same way ("the special magic number ..."). Do the models not learn from this? How does this impact the overall task? Is this standard?

How do you decide where to insert the needles? Completely randomly, by some distribution, or some other model?

For the cross-lingual experiment (Figure 5A), how much do you think this is affected by script? Would you get the same results if you swapped Korean with a low-resource language with Latin script?

How common is it that models do not produce an answer (footnote 9)? Do you use the same subset of examples for all models, i.e. if a model does not produce an answer for an example, that example is removed for all models, or does this make the models being evaluated on different subsets? If the latter, what impact does this have on your analysis? Also, how is this handled cross-lingually? Does this mean that the subsets evaluated differ across languages? More elaboration on this issue is needed than a footnote.

Minor comments: Please do not add analysis to figure and table captions. Keep the analysis in the text, and let the captions describe the figure.

Comment

We thank the reviewer for the insightful feedback and address their concerns below.


Limited analysis (subtasks)

We agree that the current analysis of the subtasks in the main paper is limited. We report results on subtasks in the Appendix (Figure 25), but we don’t discuss these subtasks separately. Rather, in the main text, we look at the aggregated performance on all NIAH tasks as a proxy for retrieval (Figures 3 and 4; lines 170–208). We agree that a more detailed discussion should be included, and we will add it to the paper.


CWE results

Thanks for the suggestion. Due to space constraints, we summarize CWE results in the main text (lines 209–216) and provide full details in Figures 22 and 23. We will bring more CWE content into the main paper in the final version.


Age of books

Thank you for this insightful comment. As the reviewer noted, we selected public-domain texts to avoid copyright issues, which sometimes meant using older books. We agree that the age and linguistic style of these distractor texts may influence model performance. However, the relationship is not straightforward. For example, the Spanish context comes from Don Quixote (1605), yet Spanish ranks 4th in performance (Figure 3). The Polish novel is from 1934, while several other languages use books from the early to mid-20th century, ranging from 1910 to 1947. Korean uses a 1914 novel but performs noticeably worse. We agree this is an important and under-discussed factor that may affect our findings, and we will address it explicitly in the revised version.


Needle formulation and learning potential

Thank you for raising this important point. We adapted the needle sentence template from the RULER paper with slight modifications. The formulation follows conventions used in previous work on passkey retrieval and NIAH-style tasks [1, 2]. This is commonly used to evaluate retrieval capabilities under heavy distractor settings. As discussed in Section 4, models may learn this pattern, particularly those achieving near-perfect accuracy in standard NIAH tasks such as Qwen 2.5, possibly due to fine-tuning exposure. However, our results show that even minor changes like nonexistent needles cause significant drops in performance, suggesting that this fine-tuning does not generalize to uncertain or unanswerable cases.


Needle insertion strategy

Needles were inserted (always between sentences, not within them) at one of 40 uniformly divided context positions, selected randomly. While a deeper investigation of insertion depth and its impact would be valuable, we leave this as future work due to resource limitations.
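For concreteness, here is a minimal sketch of the insertion scheme described above, assuming the distractor text has already been split into sentences; the function and variable names are ours, for illustration only.

```python
import random

def insert_needle(sentences, needle, n_slots=40, rng=None):
    """Insert `needle` between two sentences at one of `n_slots`
    uniformly spaced boundaries, chosen at random (illustrative sketch)."""
    rng = rng or random.Random(0)
    # Uniformly spaced candidate boundaries over the sentence list.
    boundaries = [round(i * len(sentences) / n_slots) for i in range(1, n_slots + 1)]
    pos = rng.choice(boundaries)
    return sentences[:pos] + [needle] + sentences[pos:]
```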


Cross-lingual experiment: Effect of script vs. language performance

We have not explored combinations beyond those reported in the paper, mainly due to limited computing resources. However, model performance likely depends on both script and language. In the monolingual setup (Figure 3), Qwen72B performs poorly on low-resource Sesotho (Latin script) but much better on Hindi, Persian, and Japanese (non-Latin scripts). Conversely, Gemini-Flash performs well on Sesotho but struggles with most non-Latin-script languages, including Chinese. Our preliminary cross-lingual analysis suggests that instruction language may matter more than content language. Thus, Qwen72B might perform worse with Sesotho instructions than with Korean, while Gemini-Flash could benefit from them.


Treatment of empty generations (footnote 9)

Among all models in Figure 3, only o3-mini occasionally runs out of tokens and fails to generate a response, despite a 10K budget. This occurs rarely in NIAH tasks (2.8%), and we currently discard such generations for o3-mini only. Including them as failures (i.e., incorrect answers) does not affect the overall model ranking in Figure 3, though the language ranking changes slightly (swaps: fa ↔ fi, vi ↔ sr, zh ↔ hi). For o3-mini, the Kendall's Tau-b between both setups over {language, context_len} is strong (0.888****). In CWE tasks, o3-mini runs out of tokens more frequently, likely due to the model attempting to count words (Figure 20). This affects accuracy only in the easier setups at 8K and 32K (in other settings, accuracy is near 0%). We plan to move this to the main text, as these failure cases should likely be treated as mistakes and factored into the final score, along with a report of how much of the score was affected by generation issues.
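To make the two scoring policies concrete, here is a minimal sketch (helper names are ours; assumes scipy is installed) of scoring with empty generations either discarded or counted as failures, plus the Kendall's Tau-b check of rank agreement between the two:

```python
from scipy.stats import kendalltau

def accuracy(results, empty_as_failure=True):
    """results: list of dicts with keys 'generated' (str or None) and 'correct' (bool)."""
    kept = results if empty_as_failure else [r for r in results if r["generated"]]
    if not kept:
        return 0.0
    # An empty generation can never be a correct answer.
    return sum(bool(r["generated"]) and r["correct"] for r in kept) / len(kept)

def rank_agreement(scores_a, scores_b):
    """Kendall's Tau-b between two score dicts keyed by (language, context_len)."""
    keys = sorted(scores_a)
    return kendalltau([scores_a[k] for k in keys], [scores_b[k] for k in keys])
```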


Minor comment on captions

Thank you for the suggestion. We agree and will revise the paper to ensure the analysis appears in the main text, while figure and table captions remain strictly descriptive.


References
[1] Wang, Liang et al. (2023) “Improving Text Embeddings with Large Language Models.” ArXiv abs/2401.00368
[2] https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main

Comment

Thank you for your thorough reply, which clears up my questions and shows insight. I will stick with my current scores, though, since I still think it is a good paper.

Adding some additional info about the issues discussed in your reply to the final paper, like info about where you inserted needles, and discussions of the age of the texts and of script, would strengthen the paper. I still think it is a bit of an issue that, in the small cross-lingual experiment, script is a clear confounding factor. The cross-lingual setting is different from the monolingual one, so we don't know for sure if the findings would be the same. I think that at least expanding the discussion of this issue would be good.

Doing something about the CWE section would also be good, but I still think it would strengthen the paper to remove CWE rather than expand it, in order to allow for more insights about the NIAH analysis.

Review (Rating: 7)

In this paper, the authors propose a new needle-in-a-haystack (NIAH) benchmark for evaluating long-context LMs called OneRuler. Their benchmark extends the original RULER benchmark and offers the following advantages: 1) a more comprehensive set of challenging tasks (see Figure 2) that highlight new challenges and shortcomings in models; 2) expert translations from English into 25 languages that allow them to systematically study performance across languages; 3) a focus on instruction-tuned models as opposed to base models (I think this last point is insufficiently motivated; I would like to see the authors motivate it more).

With this new dataset, they report many intriguing results. I note the following: A variant of NIAH that allows a model to predict no needle greatly reduces performance on the original single NIAH (Figure 1b). “Low resource” languages are strongly outperformed by “high resource” languages in aggregate (where the distinction between “high” and “low” is determined by Wikipedia rank). English is not the best performing language in aggregate and is outperformed by several Slavic and Romance languages: Polish, Russian, French, Italian and Spanish (Figure 4b); Chinese, another high-resource language, is ranked 4th from the bottom. They also systematically investigate the effect of mixing languages (Figure 5a) and find strange behavior involving reasoning models (Figure 5b).

In summary, I find this to be a really exciting study that addresses clear shortcomings in current testing methods for long-context language models, notably the lack of multilinguality and the limited complexity of existing tasks. I also found the paper very easy to read.

Reasons to Accept

  • A new and more comprehensive set of multilingual challenge datasets for testing long-context language models, ones that I am confident will have broad impact within this area of NLP and beyond (for those engaged in general pre-training and post-training research).

  • A new set of intriguing results about the gap between high- and low-resource languages on these tasks that open up many new questions and possible directions for future research.

Reasons to Reject

  • Their most controversial design decision relates to how they make the total token count uniform across all languages to account for differences in tokenization, as discussed in lines 90-98. If I understand correctly, this has the effect of truncating the total amount of information content that each language receives (e.g., it could be that an entire book in one language might be a single chapter in another simply due to the tokenizer being more verbose for the latter language. In such a case, I would think that this would affect the relative complexity of each task, potentially creating problems for a fair comparison across languages).

My own intuition is that the first option they consider to deal with these mismatching tokenization issues, i.e., ensuring that the input text is of the same length, is more natural. Of course, the authors do report results for this setting too in Appendix D. However, I would like to see this potential issue of mismatching information content addressed.

Questions for the Authors

  • Did you do any systematic comparison between Figure 25 and Figure 3? E.g., are these results highly correlated across these two settings? Would it make sense to take an aggregate of these results? (Such questions relate to my comments above)

  • Is there an analogue to a bits-per-byte style metric (i.e., what’s commonly used for normalizing tokenization mismatch in perplexity analysis) that one could employ here to deal with tokenization issues?

  • Do you have a table or non-inline version of the CWE results discussed starting on line 209? I am confused about where to look for these results.

  • Like above, I was confused about the results reported on lines 157 (“While most models perform near perfectly on vanilla NIAH for English”). Which plots or tables are showing this? It’s not clear.

Comment

We appreciate the detailed feedback and would like to address the reviewer’s concerns.


Tokenization issues
Thank you for pointing this out! To assess consistency between Figure 25 and Figure 3, we computed Kendall’s Tau-b over the NIAH tasks and obtained an agreement of 0.82 (p < 1e-100 ****), indicating strong and statistically significant agreement in model performance rankings [1, 2]. More generally, we agree that the trade-off between fixing token counts and fixing the information provided to the model is an issue, and hence report results for both settings.
We decided to include results on each model’s own tokenizer in the main paper as we wanted to present findings that align with how a user might expect the model to perform, particularly when its entire context window is utilized based on its native tokenization. We will also include results based on scores averaged over these two conditions.
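For clarity, a sketch of the two context-construction options under discussion, assuming a Hugging Face tokenizer; the helper names, the example model, and the use of characters to measure raw text length are our own simplifications, not the paper's exact procedure.

```python
from transformers import AutoTokenizer

def context_by_tokens(text, tokenizer, budget=128_000):
    """Main-paper setting: truncate to a fixed token budget under the model's
    own tokenizer, so the amount of raw text varies across languages."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"][:budget]
    return tokenizer.decode(ids)

def context_by_text_length(text, n_chars):
    """Appendix D-style setting: fix the amount of raw text instead,
    so the token count varies across languages and tokenizers."""
    return text[:n_chars]

# Example usage (hypothetical model choice):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
# ctx = context_by_tokens(novel_text, tok, budget=128_000)
```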


Bits-per-byte style metric
This is a great point. We hadn’t considered using a bits-per-byte style normalization, but we agree it could be a meaningful way to address tokenization mismatches, particularly when comparing across languages and models.
We appreciate the suggestion and will consider exploring this approach in future analysis.
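As a hedged illustration (not something computed in the paper), the bits-per-byte normalization the reviewer alludes to divides the total negative log-likelihood, converted to bits, by the UTF-8 byte length of the text, removing tokenizer segmentation from the denominator:

```python
import math

def bits_per_byte(token_nlls_nats, text):
    """token_nlls_nats: per-token negative log-likelihoods (in nats) for `text`.
    Returns total information in bits divided by the UTF-8 byte count of the text."""
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))
```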


CWE results
The detailed CWE results can be found in Figures 22 and 23. We plan to move the key findings into the main paper in a future version to ensure a more complete presentation.


Unclear result reference
The statement on line 157 is based on the results shown in Figure 3, where at the 8K context length, models such as Gemini, o3-mini, Qwen 2.5 72B, and LLaMA 3.3 70B achieve over 98% accuracy on the English vanilla NIAH task. We will make this connection clearer in the revised version.


References
[1] Kendall, M. G. (1949). Rank Correlation Methods.
[2] Kubicki, Alexandre et al. (2020). “The Frail’BESTest. An Adaptation of the ‘Balance Evaluation System Test’ for Frail Older Adults. Description, Internal Consistency and Inter-Rater Reliability.” Clinical Interventions in Aging, 15: 1249–1262.

Comment

Thank you. While my main concern about language comparability was not a reason for rejection since you do report both scenarios, this extra correlational analysis is helpful and further reassures me. Since my original score was a clear accept, I will keep it the same.

Review (Rating: 9)

The paper aims to extend a benchmark for multilingual long-context models (RULER), focusing on needle-in-the-haystack (NIAH) tasks, introducing a novel category where the needle is not present in the haystack, as well as common word extraction tasks, towards the evaluation of instruction fine-tuned models. The curation involves a two-step process in which the benchmark is first generated in English and then translated into 25 other languages by native speakers. Extensive experiments with multiple LLMs demonstrate the performance gap between low- and high-resource languages. The process and benchmark will prove useful to multilingual research benchmarking at large.

Reasons to Accept

  • The benchmark covers 26 languages in total, and allows for cross-lingual performance evaluation. Moreover, native speakers translate the English-based benchmark, introducing cultural relevance to the benchmark.
  • The benchmark covers multiple types of possible NIAH tasks including ones where the answer does not exist, as well as different context lengths, making for a robust benchmark.
  • The takeaways are well studied and interesting, revealing important performance gaps.
  • It is interesting to note that the number of tokens varies across tokenizers. This study is reported in the Appendix, and provides interesting insights for multilingual settings.

Reasons to Reject

  • An obvious limitation is the coverage of low-resource languages, especially those which are regional and culturally rich. However, finding native speakers for these languages could be difficult.
  • Further details about native speakers and how many samples were created by each (along with details of whether multiple annotators were involved in the creation of a single data point) would strengthen the data curation section.
  • While the paper conducts relevant analysis on performance observations, a great addition would be evaluating domain-specific or noisy data, which might provide deeper insights into discrepancies for English, Polish, and Chinese, where the observations are unexpected.
Comment

We appreciate the reviewer’s insightful comments and would like to address their concerns below.


Low-resource languages
We agree that expanding our low-resource languages beyond Sesotho, Swahili, or Tamil would strengthen this work.
This was primarily limited by translator availability (as noted by the Reviewer) and the difficulty of sourcing suitable long-form texts. For example, the Sesotho novel had to be scanned from the physical book and manually transcribed by a native speaker.


Data creation and annotator details
Thank you for pointing this out! In our current setup, each language was primarily handled by a single native speaker who was responsible for translating all prompts and noun lists in that language.
We have communicated with them directly, and asked follow-up questions to ensure that the needle replacement won’t break sentence grammaticality. In a few cases, a second annotator reviewed the outputs for quality assurance, but no single data point was created by multiple annotators. We will clarify this in the revised version of the paper.


Evaluating domain-specific or noisy data
This is an excellent suggestion. We used the literary domain, but we agree that incorporating other domains or noisy data could offer deeper insights, particularly for languages like English, Polish, and Chinese where some unexpected patterns were observed. We see this as a promising direction for future work.

Comment

Thank you authors for your responses to all reviews. Hope you are able to incorporate the suggestions to strengthen the paper further.

Review (Rating: 7)

This paper describes an extensive experiment testing LLMs with synthetic long-context tasks, based on the needle-in-a-haystack (NIAH) and common-word-extraction (CWE) tasks. The authors present five variants of NIAH, including cases in which the needle is not present in the text, so no result should be obtained, and they report that this addition makes the task particularly hard for many LLMs that would otherwise solve NIAH easily. They also present two variants of CWE, classified as easy and hard.

They experiment with several open and closed models, and the benchmark is first generated in English and then the instructions and words to use are manually translated into 25 more languages of different families and scripts, which makes the experimental setup quite complete.

The paper presents interesting analyses of the results of the different NIAH variants, but here lies the paper's major shortcoming: the results of CWE are not shown in the main document and are only detailed in the appendices. As the CWE experiments are described inside the main section, one would expect at least a table with their main results and not just a brief comment about what happened, especially because CWE seems to be a much more challenging task than NIAH for current LLMs. On the other hand, the paper would be just fine without the CWE experiments, as the different variants of NIAH and their analyses already have a lot of substance, but it's strange that the CWE experiments are described but not analyzed in the same way.

Some other interesting information is also relegated to the appendices, which is understandable given the size of the paper, but this is perhaps a case of a paper that is better suited to a journal because the amount of content that needs to be compacted in 9 pages is too much.

Reasons to Accept

The described benchmark is extensive with different languages covered and in different context sizes. It was originally built for English but manually translated to the other languages.

The experimental setup is solid, covering open and closed LLMs, and the analysis of the NIAH results, with follow-up experiments on particular behaviors, is also very interesting.

Reasons to Reject

It is strange that the CWE experiments are described but the results are not presented in the main paper. They could be removed altogether, or otherwise at least the basic results should be incorporated into the main document.

Figure 1 is strange, presenting results of slightly different experiments at the same time. At the very least, both subfigures should be given the same vertical axis space so they can be compared side by side. Also, it seems that for shorter contexts, the missing-needle scenario is actually solved better by the models, which contradicts the analysis.

Questions for the Authors

What is the actual text used as long context in each language? Is it also synthetic? Is it text originally in that language or is it a translation of some general long text? Are they Wikipedia articles?

The paper describes two separate subtasks "Single NIAH" and "None NIAH", but also it says that Single NIAH could have instances where the needle is not there. Can you describe in which way these are two different tasks or are they just the same in different scenarios?

Line 89 says each annotator was paid 25 USD per language to translate (instructions + nouns), and then the total compensation was 492 USD. How do we arrive at this final number?

In line 231 the authors indicate their suspicion that some of the LLMs might already be fine-tuned with the NIAH task in mind, using actual NIAH data (at least obtained from other works). Would this render the approach obsolete? Would it impact other languages if it's only done in English?

Small correction on line 107: "make sure to each language's" -> "make sure to follow each language's"

Comment

We appreciate the Reviewer’s insightful comments and would like to address their concerns below.


The results of CWE experiment in the main paper
Thank you for pointing this out! Due to limited space, we summarize the CWE results in the main text (lines 209–216) and present more details and figures (22–23) in the appendix (§C.3). We agree that more CWE results should be moved to the main text, and will do so for the final version of the paper.


Figure 1 is not clear and does not match the analysis
We agree that Figure 1 could be confusing, and 1-a and 1-b should be separated. We will revise this in the submission.
Specifically, Figure 1b shows the performance of o3-mini-high on the English Single NIAH task. It presents two conditions:
(1) our standard testing condition where the prompt includes the “none” option (i.e., the model is instructed to return “none” if the answer is not there), and
(2) ablation, where we specifically remove the “none” option from the prompt.
Both cases receive the same standard Single NIAH input where the answer exists. We report that this simple change causes accuracy drops at longer contexts (lines 225–227) in the Single NIAH task (not the None NIAH task).
More generally, when given the option to return “none”, models do so across all context lengths, but o3-mini-high does it more often (Figure 5b and lines 227–231). We see how this is confusing and will clarify it.

The None NIAH task, i.e., a task where the answer is not in the text (and the model should return “none”), has an accuracy of 89.8% at 8K and 72% at 128K, averaged across languages and models. We acknowledge that this information is missing, and will add it to the final version.


Text used for long-context
For each language, we use public-domain novels as the source of the long-context input. This is briefly mentioned in Section 2.1 and described in more detail in Appendix §B.


Difference between Single NIAH and None NIAH
Thank you for raising this question! We see how the phrasing may have been confusing.
Our prompt format is identical for Single NIAH (where the correct needle is always present) and None NIAH (where the correct needle is always absent) tasks, instructing the model to return the specific value or 'none' if it's not found in the context.
This shared prompt reflects a realistic setting where the answer's presence isn't always guaranteed. We will clarify this distinction in the revised version.
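For illustration, a minimal sketch of the shared prompt format described above (the wording and template are ours, not the paper's exact prompt; the actual instructions are translated into each of the 26 languages):

```python
# Illustrative shared prompt for Single NIAH and None NIAH (wording assumed).
PROMPT_TEMPLATE = (
    "{context}\n\n"
    "What is the special magic number for {word} mentioned in the context above? "
    "If no such number is mentioned, answer 'none'."
)

# Single NIAH: the needle is inserted into the context, so the gold answer is its number.
# None NIAH:   the needle is absent, so the gold answer is 'none'.
# The Figure 1b ablation removes the final sentence offering the 'none' option.
```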


Annotator compensation
The base rate for translation was $25 per language (covering both instructions and nouns), but the final total of $492 reflects a combination of several factors. We recruited 17 annotators to work on 18 languages, with one annotator handling two languages. Annotators were hired on Upwork, which additionally charges contract and processing fees, leading to a higher overall cost. We also had 6 volunteers, recruited from the local community, who contributed translations for 7 languages. We will clarify these details in the final version.
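(As a rough reconstruction, assuming all 18 paid languages were billed at the $25 base rate: that accounts for $450, leaving roughly $42 of the $492 total for Upwork's contract and processing fees, with the volunteer-contributed languages adding no direct cost.)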


Fine-tuned multilingual LMs
Thank you for raising this important point. While it is possible that some models, such as Qwen 2.5, were exposed to NIAH-like data during fine-tuning (as suggested by their near-perfect performance in standard NIAH settings), our results show that introducing even a simple variation, such as a nonexistent needle, leads to a significant performance drop.
This could suggest that the fine-tuning does not generalize across all NIAH variants, particularly those involving uncertainty or unanswerable cases. We will incorporate the suggested correction into the paper.

Final Decision

In this paper, the authors propose a new needle in a haystack (NIAH) benchmark for evaluating long-context LMs called OneRuler. Their benchmark extends the original Ruler Benchmark and offers the following advantages: 1) a more comprehensive set of challenging tasks that highlight new challenges and shortcomings in models; 2) expert translations from English into 25 languages that allow them to systematically study performance across languages.

With this new dataset, they report intriguing results, including: a variant of NIAH that allows a model to predict no needle greatly reduces performance on the original single NIAH; “low resource” languages are strongly outperformed by “high resource” languages in aggregate; and English is not the best performing language in aggregate, being outperformed by several Slavic and Romance languages. They also systematically investigate the effect of mixing languages and find strange behavior involving reasoning models.

Reviewers raised many great questions that need to be clarified in the final version. These include clarifying the NIAH subtask definitions, improving Figure 1, moving CWE details to the main paper, annotation details, annotator compensation, the influence of book age and language style, tokenization, and various other small clarifications.

Reasons To Accept:

  • A new and more comprehensive set of multilingual challenge datasets for testing long-context language models, ones that can have broad impact within this area of NLP and beyond (for those engaged in general pre-training and post-training research).

  • A new set of intriguing results about the gap between high- and low-resource languages on these tasks that open up many new questions and possible directions for future research.

Reasons To Reject:

  • Nothing that can't be addressed in the final version