PaperHub
Overall rating: 6.6 / 10
Poster · 4 reviewers (scores: 4, 3, 3, 4; min 3, max 4, std 0.5)
ICML 2025

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We introduce NoLiMa, a long-context benchmark that removes literal cues from needle-haystack tests, revealing that LLM performance degrades sharply with context length due to difficulty retrieving information without lexical overlap.

Abstract

Keywords
Long-context, Context length evaluation, Literal match, Lexical gap

Reviews and Discussion

Review (Rating: 4)

The authors present NoLiMa, which is a long-context benchmark. What differentiates it from previous long-context benchmarks is that there is a small n-gram overlap between the question text and relevant context. The authors describe the benchmark creation process, including filtering steps to remove distractors/conflicting information. They evaluate 12 popular models on their benchmark and find that effective context lengths are much shorter than previously claimed (i.e., benchmark performance drops notably as the length of the haystack increases). The authors also examine the effects of number of latent hops, inversion, needle depth, CoT, and adding distractors. The overall conclusion is that long-context understanding remains a challenge for even the strongest models.

Questions for the Authors

I've recommended "Accept" because I think there is enough to this paper to be useful to the community. I don't think that changes during the reviewing process would lead me to change this evaluation. Nevertheless, I hope the authors will consider my feedback in the "Other Strengths and Weaknesses" section. I feel like some of the suggested changes/clarifications would be easy to make, and would strengthen the paper.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

The overall experimental design seems reasonable.

Supplementary Material

No

Relation to Existing Literature

This paper is related to analysis of long-context capabilities of LLMs. There has been substantial work trying to increase long-context comprehension and to measure this ability. This work builds on the evaluation side of things.

Missing Essential References

I'm not aware of any, but it's possible that is due to a lack of awareness on my part.

Other Strengths and Weaknesses

Strengths

  • I found the tables and figures very nice - particularly Table 3
  • Nice Rouge comparison to previous works
  • Overall a straightforward idea with solid results

Weaknesses

  • More major
    • The needle set is very limited. There are only 3 total questions ("Which character has been to...", "Which character cannot drink...", "Which character cannot eat...").
    • The third template in the needle set seems ambiguous. For example, seeing a painting of the Eiffel Tower does not imply one has been to France.
  • More minor
    • The authors generally seem to discount the usefulness of CoT (e.g., "The challenge with CoT prompting is that the questions in NOLIMA are straightforward. They are mentioning a singular clue to the answer, meaning they cannot be further decomposed into simpler steps.") However, CoT seems to really help (up to the ~20% range). I can see that even with CoT the task remains hard, but I feel like an interesting question is why CoT works so well, and the authors seem to ignore this.
    • Adding results for newer models (e.g., Deepseek, Gemini 2) would make the paper stronger
    • Some parts were a little unclear (maybe I missed something)
      • In Figure 2, is 0% depth at the end of the sequence or the start? I'm assuming the end.
      • What does "minimize sensitivity to tokenization problems" mean on line 145?
      • Would be nice to see an example or two in the "Distractor Filtering" section
      • How are problematic parts "removed" in the "Conflicting Information Filtering" section?
      • Is Figure 2 averaged over all models?
    • Potential inclusions
      • It would be nice to see Table 3 broken down into e.g., one-hop vs two-hop in the appendix
      • It would be nice to see the unsmoothed Figure 2 in the appendix
      • Would be interesting to see accuracy broken down by question - are some of the questions just harder than others?

Other Comments or Suggestions

Please see "Other Strengths and Weaknesses"

Author Response

First of all, we want to thank the reviewer for their thorough and detailed review and constructive feedback.

[Limited needle set]: While we agree the set is limited, many comparable works—such as RULER or vanilla NIAH—use even fewer or similarly limited questions.

[The Eiffel Tower painting example]: As noted in lines 148–150 (2nd col.) and Appendix A, the keywords W_q and W_n were carefully chosen to ensure unique and true associations, which is also reflected in the near-perfect base scores of top models.

[CoT Effect]: Our point is not to dismiss CoT, but to highlight that it’s not a fully robust solution in this setting. While it does improve performance, the gains diminish as context length increases, which remains a core challenge.

[Other models]: Gemini 2.0 and Deepseek R1 were released after or very close to the submission deadline: Gemini 2.0 on February 5th (6 days after the deadline) and Deepseek R1 on January 20th (10 days before the deadline). Nevertheless, we will include additional results on Deepseek R1 (distill) and the o1 and o3-mini models in the CoT section of the camera-ready version. On a challenging NoLiMa subset, o1, o3-mini, and R1-distilled-70B all drop below 50% of the base score despite scoring near 100% on the base task, highlighting the difficulty even for reasoning models. Also, for the main evaluation in Table 3 we will add results on Gemini 2.0 Flash. Moreover, the evaluation code and dataset will be publicly released to enable testing on future models.

[In Figure 2 is 0% depth...]: In Fig. 2(a) and (b), 0% depth marks the start of the sequence. In the last-2K token-depth plot (Fig. 4), 0 marks the end. We'll clarify this in the camera-ready version.

[What does "minimize sensitivity to tokenization problems" mean]: Since the answers to the dataset questions are character names, relying on a fixed name could introduce bias in model performance. Some models may split names into more sub-tokens than others (e.g., "Ve ro nica" vs. "Vero nica"). Rotating character names helps reduce potential biases caused by this tokenization variance.
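As a quick illustration of the kind of variance we mean (a hypothetical check, not part of our actual pipeline; the tokenizer and the names below are placeholders), one can inspect how many sub-tokens each candidate name receives:

```python
from transformers import AutoTokenizer

# Hypothetical check: see how many sub-tokens each candidate character name
# gets under a given tokenizer (GPT-2 is used here purely as a stand-in).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for name in ["Veronica", "Anastasia", "Yuki"]:  # placeholder names
    pieces = tokenizer.tokenize(name)
    print(f"{name}: {len(pieces)} sub-tokens -> {pieces}")
```

Names that fragment into many sub-tokens can behave differently from single-token names, which is exactly the bias the rotation is meant to average out.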

[Distractor filtering example and the removal process in "Conflicting Information Filtering"]: We can include some examples in the appendix. Here's one from the stories before filtering:

“…Yale, but then he (Steve) rebelled, cashed in his med-school scholarship, and went to Paris to study photography…”

For the question "Which character has been to France?", this passage introduces a conflict with the needle. Steve is a valid answer based on this span, but we want the model under evaluation to select only the needle as the relevant fact. To prevent this, the filtering model flags such conflicts (additional spans that introduce alternative correct answers), and we remove the entire conflicting span, expanding to the nearest sentence boundaries to preserve fluency.
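A minimal sketch of the removal step (our illustration only; the real pipeline may use a different sentence splitter and span format):

```python
import re

def remove_conflicting_span(text: str, span_start: int, span_end: int) -> str:
    """Drop a flagged conflicting span, expanded to the nearest sentence
    boundaries, so the surrounding story remains fluent.

    Sketch only: sentence boundaries are approximated with punctuation;
    `span_start`/`span_end` are character offsets assumed to come from
    the conflict-flagging step.
    """
    # Expand left to just after the previous sentence-ending punctuation.
    left = max((m.end() for m in re.finditer(r"[.!?]\s", text[:span_start])), default=0)
    # Expand right to just after the next sentence-ending punctuation.
    right_match = re.search(r"[.!?](\s|$)", text[span_end:])
    right = span_end + right_match.end() if right_match else len(text)
    return (text[:left] + text[right:]).strip()
```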

[Is Figure 2 averaged over all models]: No, Figure 2 is only for LLaMA 3.3 70B. A high-resolution sweep like Fig. 2, which involves twice the placements, would be too costly to run on closed models.

[Potential Inclusions]: Thank you for the great feedback! We can include the one-hop vs. two-hop breakdown for Table 3 and the unsmoothed Figure 2 in the appendix. Regarding question difficulty, it tends to vary with keywords and model behavior. As for observable patterns, as discussed in the paper, two-hop and inverted questions are generally more challenging.

Reviewer Comment

Thanks for the response. I look forward to seeing the next version of the paper.

Review (Rating: 3)

This work provides an examination of the capability of large language models (LLMs) to handle long-context information retrieval tasks. The authors propose a new benchmark, NoLiMa, which is designed to test the ability of LLMs to find relevant information in long texts without relying on direct lexical cues. This benchmark is an important contribution to the field, as it addresses the limitations of existing needle-in-a-haystack (NIAH) tests by reducing literal matches and forcing models to rely on more complex inference mechanisms. The paper's evaluation of 12 popular LLMs across different context lengths is revealing. The finding that performance degrades as context length increases is critical for understanding the current limitations of LLMs. The in-depth analysis of the impact of latent hops and CoT sheds further light on the long-context reasoning capabilities of LLMs.

Questions for the Authors

No.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

This work provides an empirical study. There is no theoretical proof.

Experimental Design and Analysis

Yes. I have checked all the results and analysis in Section 4. They make sense to me.

Supplementary Material

No.

Relation to Existing Literature

This work contributes to the existing evaluation and understanding of long-context LLMs.

Missing Essential References

No.

Other Strengths and Weaknesses

The main weakness, in my opinion, is that the main findings of this work are very similar to those of RULER [1]. True, there are some differences: this work tries to minimize the lexical overlap between the needle and the haystack while RULER does not. But I don't see the necessity of this setting, as RULER also finds that existing long-context LLMs have a limited effective context length. Moreover, this work provides some in-depth analysis of long-context reasoning (e.g., the impact of latent hops and CoT reasoning), but I think these analyses could also be done on RULER.

Therefore, I think this work has made a solid but incremental contribution.

[1] RULER: What’s the real context size of your long-context language models?

Other Comments or Suggestions

No.

Author Response

First of all, we want to thank the reviewer for their thoughtful review and constructive feedback.

[Our work's contribution compared to RULER]:

  • Our related work section highlights how extensively literal matching is involved across various long-context benchmarks—including RULER.
  • We show that literal matches play an impactful role in affecting the results (Section 4.4.4), an analysis that cannot be done with RULER due to its existing literal matches.
  • Since RULER is heavily based on lexical matching, it does not support an analysis of latent hops (4.4.1) and needle placement with reasoning variability (4.4.2). We would appreciate it if the reviewer could describe how such an analysis could be done on RULER.

Review (Rating: 3)

This paper presents a benchmark that extends NIAH with a needle set requiring models to infer latent associations beyond literal matching. It shows that current long-context LLMs suffer performance degradation on the proposed benchmark.

Questions for the Authors

I am curious about the differences and relationship between this paper and [1], which requires global reasoning ability and might also not have severe literal matching problems.

[1] One Thousand and One Pairs: A “novel” challenge for long-context language models, Karpinska et al.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No theoretical claims.

Experimental Design and Analysis

Yes.

Supplementary Material

No supplementary material.

Relation to Existing Literature

There is previous work on more realistic long-context benchmarks, such as LongBench V1 and V2, and on more complex versions of NIAH, such as RULER, but the authors' perspective is still novel.

Missing Essential References

No, as far as I know.

Other Strengths and Weaknesses

Strengths:

  1. The proposed benchmark is novel in terms of the problem definition, focusing on avoiding literal matching and constructing latent associations that are important for current long-context evaluation.
  2. I appreciate the benchmark creation process described in Figure 1, which avoids conflicting information and distractor keywords while ensuring the purity of the evaluation task.
  3. The experiment is comprehensive and the results are interesting - the scores for 32K are notably low, and the analysis of latent hops & inversions provides valuable insights.

Weaknesses:

  1. The context length is limited to 32K; it could be extended to 128K to examine what happens when the context length is pushed further.

Other Comments or Suggestions

No.

Author Response

First of all, we want to thank the reviewer for their thoughtful review and constructive feedback.

[32K limit]: NoLiMa has no inherent context length limit. As stated (lines 200–206, 2nd col.), haystacks are built from random story snippets (using the filtered stories) and can be extended to any length. Since all models drop below the 0.85×base score at 16K and 32K (10 out of 12 drop below 0.5×base score), extending further offers little to no extra insight.

Nevertheless, to demonstrate this, we will include results at 64K and 128K for two models: GPT-4o, the top-performing model in Table 3, and Gemini 2.0 Flash, a new model we plan to add.
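For reference, a rough sketch of how such arbitrary-length haystacks can be assembled from the filtered story snippets (our illustration under assumed inputs, not the released evaluation code):

```python
import random

def build_haystack(snippets, needle, target_tokens, depth, count_tokens):
    """Concatenate shuffled (filtered) story snippets up to roughly
    `target_tokens` tokens and insert `needle` at relative `depth`
    (0.0 = start of the haystack, 1.0 = end).

    Sketch only: `count_tokens` stands in for any tokenizer-backed
    length function; the released code may differ in the details.
    """
    pool = list(snippets)
    random.shuffle(pool)

    chosen, used = [], 0
    for snippet in pool:
        if used >= target_tokens:
            break
        chosen.append(snippet)
        used += count_tokens(snippet)

    chosen.insert(int(depth * len(chosen)), needle)
    return "\n\n".join(chosen)
```

Because the snippet pool can be resampled, the same construction extends directly to 64K, 128K, or longer contexts.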

[Differences with [1]]: While [1] requires global reasoning over entire novels, NoLiMa focuses on the retrievability of single facts. Notably, questions in [1] often include key cues (e.g., character names) that help narrow down relevant spans—similar to our multi-choice setup in Section 4.4.4. Even without strong literal overlap, the presence of answer choices allows the model to locate relevant regions and reason accordingly.

Review (Rating: 4)

The paper introduces NoLiMa, a benchmark designed for advanced needle-in-a-haystack (NIAH) tests with minimal lexical overlap between questions and the relevant information (needles) within the context. NoLiMa comprises 56 question-needle pairs, each paired with contexts of varying lengths. The benchmark minimizes literal overlap between questions and the relevant information, compelling models to rely on latent associations (one-hop and two-hop reasoning) and world knowledge rather than surface-level pattern matching.

The study evaluated 12 long-context Large Language Models (LLMs) on these question-needle pairs with different context lengths, including a short context used as a baseline to assess the models' fundamental capability to answer questions without the influence of long contexts. The results indicate that 10 out of 12 models experienced a performance drop below 50% of their baseline at 32K tokens. Even the strongest model, GPT-4o, saw a significant decline from 99.3% (baseline score) to 69.7% at 32K tokens. These findings suggest that the long-context reasoning capabilities of LLMs have been overstated in existing benchmarks that rely on literal matches between questions and context. Additionally, the paper provides a comprehensive analysis of latent hops, fact direction, and the position of needles within the context.

Questions for the Authors

Are there any insights derived from the attention weights in LLMs in the experiments on NoLiMa? Are LLMs robust to long contexts with non-lexical synonyms rather than one-hop hyponym-hypernym reasoning?

Claims and Evidence

The main claim of the paper is that model performance significantly degrades as context length increases when literal matching is removed. This assertion is generally supported by the experiments, such as the observation that 10 out of 12 models dropped below 50% of their baseline performance at 32K tokens on the NoLiMa benchmark. However, my concern is that NoLiMa only addresses a limited aspect of the problem. The non-lexical overlap in NoLiMa primarily reflects hyponym-hypernym relationships, such as "city" to "state." There are many other factors to consider when assessing long-context reasoning without lexical overlap, including synonyms, entity abbreviations, and cross-lingual elements.

Methods and Evaluation Criteria

I find the methods and evaluation criteria to be sound.

Theoretical Claims

The paper does not make theoretical claims.

Experimental Design and Analysis

The experimental designs in the paper are sound and well-executed. The experiments cover a diverse range of models, including state-of-the-art models and smaller models, as well as open-sourced and closed-sourced models. The experiments include variations to provide comprehensive analysis, such as comparing CoT vs non-CoT approaches and one-hop versus two-hop reasoning.

While the current experimental setup is necessary, I believe that providing an analysis of the attention mechanisms in open-source LLMs could further benefit the community. This would help in understanding the results and improving future LLMs. The current experiments focus on aggregated results from a limited dataset size, which may reduce the depth of insight gained.

Supplementary Material

I have reviewed the data provided in the supplementary material.

Relation to Existing Literature

The findings in this paper could significantly benefit the LLM community by drawing more attention to non-lexical-overlap long context tasks. The proposed dataset serves as a valuable benchmark for evaluating Needle-in-a-Haystack capabilities.

Missing Essential References

Multi-hop retrieval has been widely discussed in retrieval-augmented generation (RAG) work. The paper could benefit from a brief discussion comparing the NoLiMa dataset with other RAG datasets to highlight its unique contributions and potential synergies.

  • Multi-hop Question Answering (https://arxiv.org/pdf/2204.09140)
  • U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack (https://arxiv.org/pdf/2503.00353v1)

Other Strengths and Weaknesses

See my other comments.

Other Comments or Suggestions

N/A

Author Response

First of all, we want to thank the reviewer for their thoughtful review and constructive feedback.

[Multi-hop retrieval]: Multi-hop retrieval, which involves fact chaining, is discussed in our "Related Work" and "Introduction" (Hsieh et al., 2024; Levy et al., 2024). NoLiMa focuses on a prior step—locating relevant fact(s) from long contexts, which is a prerequisite for effective multi-hop reasoning.

Note: The U-NIAH paper was released after the submission deadline.

[Insights on attention weights]: Analyzing attention weights at long context lengths is extremely memory-intensive. For example, with 8K tokens, a LLaMA 3.1 8B model yields 32 (layers) × 32 (heads) × 8000² ≈ 65 billion attention weights (~260 GB) for a single input. That said, we did a limited analysis comparing inverted vs. default cases on some examples. We found that in inverted settings, W_n receives relatively more attention than the character name when W_n appears earlier, loosely supporting the theory in lines 300–306 (col 2). But due to the high computational cost, and because the analysis is somewhat outside our main scope, broader experiments (e.g., on LLaMA 3.3 70B, the full dataset, or longer lengths) weren't feasible.
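For reference, the memory estimate above is simple arithmetic over the attention-map dimensions (fp32 storage is our assumption):

```python
layers, heads, seq_len = 32, 32, 8_000    # LLaMA 3.1 8B with an 8K-token input
weights = layers * heads * seq_len ** 2   # one full attention map per head and layer
gigabytes = weights * 4 / 1e9             # 4 bytes per float32 value

print(f"{weights / 1e9:.1f}B weights, ~{gigabytes:.0f} GB")  # ≈65.5B weights, ~262 GB
```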

[Non-lexical synonyms]: While using non-lexically matched synonyms helps avoid literal matching, their embeddings are often similar enough to still act as a match. Moreover, not all our examples rely solely on hyponym-hypernym reasoning—some, like the dietary cases (e.g., lactose intolerance <==> milk), require commonsense reasoning.

Final Decision

Summary: The paper presents a non-trivial long-context benchmark which does not boil down to simple text-matching-based retrieval. This allows the authors to uncover new serious deficiencies in long context performance of language models. The reviewers appreciate the paper's contributions, with only minor concerns that appear addressed in my opinion. Thus I recommend accept.

More details: The reviewers agree that the experiments support the main claims, are comprehensive (bVpt) over a diverse set of models (7az5), and well-executed (7az5). The reviewers also say that the results about the deficiencies of language models on the benchmark are interesting (bVpt), in-depth (wAvn), solid (q6NK) and useful to the community (q6NK).

No serious weaknesses were pointed out. There was a point about the benchmark capturing only hyponym-hypernym relationships, but the authors note that there are commonsense-based relations too. Reviewer wAvn suggests that prior work (RULER) is highly similar to this work (or rather that all the analyses here could be done there too), but I agree with the authors that RULER involves lexical matches and it is not clear that these analyses can be immediately done there.