PaperHub
Overall rating: 5.5/10 — Rejected (4 reviewers)
Individual ratings: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

LV-Eval is a long-context benchmark with 5 length levels up to 256K. It is designed to be challenging, suitable for controllable comparison, and to mitigate the knowledge leakage issue in evaluation.

Abstract

State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below $64k$. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs' performances can significantly degrade in the presence of confusing information, especially in the pressure test of "needle in a haystack". (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval.
Keywords
large language model, long-context benchmark, knowledge leakage mitigation

Reviews and Discussion

Official Review
Rating: 5

The paper introduces LV-Eval, a long-context benchmark specifically designed to evaluate language models across a range of text lengths, reaching a maximum of 256k words. This benchmark encompasses two primary tasks: single-hop and multi-hop question answering (QA), and it utilizes a total of 11 bilingual datasets to enhance its robustness. To effectively assess model performance, LV-Eval incorporates three key techniques: confusing fact insertion, keyword and phrase replacement, and a keyword-recall-based metric design. The experiments conducted on this proposed long-context benchmark yield several significant findings, providing valuable insights into the capabilities and limitations of various LLMs in handling extended contexts.

Strengths

  1. The structure of this paper is coherent and well-organized, facilitating ease of reading and comprehension.
  2. This paper presents the LV-Eval benchmark, which has the potential to foster further research in the area of long-context modeling.
  3. The thorough experiments conducted in this paper provide valuable insights that assist readers in selecting the most suitable LLMs for various real-life applications and scenarios.

Weaknesses

  1. The contributions of this paper are somewhat limited in scope. While the techniques of confusing fact insertion and keyword/phrase replacement for data augmentation may be useful, they are relatively straightforward and lack significant innovation.
  2. To emphasize the contributions of the proposed benchmark, it would be beneficial to compare LV-Eval with more recent long-context benchmarks, such as “XLBench: A Benchmark for Extremely Long Context Understanding with Long-Range Dependencies” and “Loong: Benchmarking Long-Context LLMs with Extended Multi-Doc QA.” Such comparisons would provide a clearer context for evaluating the effectiveness and relevance of LV-Eval within the current landscape of long-context modeling.

Questions

Please refer to Weaknesses.

Comment

Many thanks for your time in reviewing our paper and for the questions and suggestions! We answer the questions as follows. Looking forward to any further discussion!

Q1: the contributions of this paper are somewhat limited in scope. While the techniques of confusing fact insertion and keyword/phrase replacement for data augmentation may be useful, they are relatively straightforward and lack significant innovation.

Thank you for the question. We'd like to discuss our contributions in the following two aspects.

(1) Regarding the benchmarking techniques: While confusing facts insertion and keyword/phrase replacement are straightforward data augmentation techniques, to the best of our knowledge, previous benchmarks do not include these features and are therefore unable to effectively evaluate LLMs on these aspects. More importantly, KPR allows us to decouple common-sense ability from long-context understanding, which we believe is crucial for accurate long-context evaluation.

(2) Regarding the resulting benchmark itself: To more effectively and fairly assess long-context LLMs, we have invested significant effort in developing this benchmark, which features a sufficiently long context length, challenging evaluation with distracting or conflicting information, and mitigation of issues like knowledge leakage and overreliance on common sense. We hope the resulting benchmark, LV-Eval, can contribute to the community as a tool to better understand the long-context ability of LLMs.

Q2: to emphasize the contributions of the proposed benchmark, it would be beneficial to compare LV-Eval with more recent long-context benchmarks, such as “XLBench: A Benchmark for Extremely Long Context Understanding with Long-Range Dependencies” and “Loong: Benchmarking Long-Context LLMs with Extended Multi-Doc QA.” Such comparisons would provide a clearer context for evaluating the effectiveness and relevance of LV-Eval within the current landscape of long-context modeling.

Thank you for the valuable suggestion. We note that "XLBench" and "Loong" were developed concurrently with our benchmark. However, we agree that a comparison of LV-Eval with these benchmarks would provide further context for our contributions, and we have included this discussion in the revised version.

There are several key differences between LV-Eval and these two benchmarks:

  1. Context Length Balance: Both Loong and XLBench are not length-balanced, meaning that test instances of different lengths do not share the same QA pairs. In contrast, LV-Eval ensures that the supporting documents and QA pairs are shared across different length levels, making it easier to track performance degradation as context length increases.
  2. Application Focus: Loong is primarily focused on information extraction tasks, such as sheet analysis, which limits its scope. In contrast, LV-Eval is a more general-purpose benchmark, assessing not only knowledge extraction but also manipulation and understanding in complex long-context scenarios.
  3. Benchmarking Techniques: We introduce the KPR benchmarking technique to decouple the evaluation of common-sense knowledge from the evaluation of long-context understanding, addressing a key limitation in existing long-context benchmarks. Additionally, we propose an optimized metric, leveraging manual efforts to label keywords, which makes the scoring more intuitive and interpretable (see Table 5).
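To make the keyword-based scoring in point 3 concrete, here is a minimal sketch of a keyword-recall-gated word-level F1 in this spirit. The 0.5 gating threshold, the substring-based keyword matching, and the way the gate combines with F1 are illustrative assumptions, not the exact formula from the paper (see Table 5 for the actual definition).

```python
# Minimal sketch of a keyword-recall-gated F1 (illustrative; not the exact LV-Eval formula).
def word_f1(prediction: str, reference: str) -> float:
    """Plain word-level F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(w), ref.count(w)) for w in set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def keyword_gated_f1(prediction: str, reference: str,
                     keywords: list[str], threshold: float = 0.5) -> float:
    """Only award the F1 score if the prediction recalls enough human-annotated answer keywords."""
    if not keywords:
        return word_f1(prediction, reference)
    hits = sum(1 for k in keywords if k.lower() in prediction.lower())
    keyword_recall = hits / len(keywords)
    return word_f1(prediction, reference) if keyword_recall >= threshold else 0.0
```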

We hope this comparison clarifies the distinct contributions of LV-Eval to the community. Thanks again!

Official Review
Rating: 6

The paper's about LV-Eval, a fresh test to see how well big language models get long contexts. It's got five levels of length and covers two types of question-answering across 11 bilingual datasets. They mixed in some tricky stuff like made-up facts and swapped out keywords to make it harder. They tried it on 15 models and saw how they did, especially when things got confusing.

Strengths

  1. It's got a good range of lengths, which is key for seeing how models handle long texts.
  2. It's not just English—it's got Chinese too, so it's more useful for different models.
  3. They tried new things to make the test harder and stop models from cheating with common knowledge.
  4. The scoring is more focused on the important bits of the answer, which makes it more accurate.
  5. They shared all the data and code, which is cool for transparency and building on their work.

Weaknesses

  1. It's mostly about question-answering, which might not cover everything we need for understanding long texts.
  2. Testing some models is pricey, so they couldn't check out all the new ones.
  3. They're still relying a lot on people to check the tricky parts, which takes time and can be off.
  4. There's a chance models could just learn the test, not actually get better at understanding.
  5. Models had a hard time with the confusing stuff, so maybe the test needs more of that.

Questions

  1. How do these results compare to other tests, and does that tell us something about how well they do in the real world?
  2. Can the authors walk us through how they checked the tricky parts and if there's a way to do it by machine?
  3. What does it mean if models use common knowledge instead of the text, and how can we fix that for future tests?
  4. Do models do as well on LV-Eval as they do on real-world tasks like summarizing or chatting?
  5. Are there certain types of long-text jobs where LV-Eval is really good or bad, and how could that change future tests?
Comment

Many thanks for your time in reviewing our paper and for the questions and suggestions! We answer the questions as follows. Looking forward to any further discussion!

Q1: It's mostly about question-answering, which might not cover everything we need for understanding long texts.

Thanks for this question. LV-Eval indeed cannot assess all types of capabilities of long-context LLMs. Its focus is on evaluating the ability to extract and manipulate knowledge from extremely long contexts, particularly in the presence of distracting or confusing information. This capability is fundamental and essential for many applications that involve long inputs. And our results suggest that this task is challenging for current LLMs.

Q2: Testing some models is pricey, so they couldn't check out all the new ones.

Yes, we believe this is an unavoidable cost if we are to benchmark the long-context capabilities of LLMs. We have provided reference results and hope that more model developers will share their performances using the standard LV-Eval, which would eliminate the need for others to retest the models.

Q3: They're still relying a lot on people to check the tricky parts, which takes time and can be off.

During the construction process, we employ human annotators to handle only the most challenging and critical tasks, such as CFI and KPR, to ensure their quality through manual verification. We hope that the resulting LV-Eval benchmark, built through this effort, will make a valuable contribution to the community.

Q4: There's a chance models could just learn the test, not actually get better at understanding.

We would like to clarify if the reviewer is asking about the scenario where the model is trained directly on the test set. If so, we must acknowledge that there is indeed a risk of benchmark overfitting, which is a common issue faced by most benchmarks.

Q5: How do these results compare to other tests, and does that tell us something about how well they do in the real world?

Thanks for the question. For the first question, we can see that current LLMs cannot achieve a high score on LV-Eval: the best performers, Moonshot-v1-128k and Qwen2.5-72B, reach only about 40% at the 16k length level and score lower at higher levels. Generally, it is difficult to compare numbers between LV-Eval and other long-context benchmarks, as we use different data construction, benchmarking techniques that make the dataset more challenging, and an improved metric.

Regarding real-world performance, we found that LLMs that perform well on LV-Eval are typically good at long-context QA and long-context recall, but their other abilities, such as chat, summarization, and math, which are also important in real-world applications, may not be as strong as their long-context QA.

Q6: Can the authors walk us through how they checked the tricky parts and if there's a way to do it by machine?

Thanks for the question. For the details of the annotation process, please refer to Appendix B. In our preliminary exploration, we found it is hard for an LLM like GPT-4 to guarantee 100% correctness of CFI. Although prompt engineering can increase the correctness, we still need human annotators to check every instance to ensure zero errors.

Q7: What does it mean if models use common knowledge instead of the text, and how can we fix that for future tests?

Thanks for this question. By using common knowledge, the model can answer some questions without referring to the context, i.e., without needing to rely on its long-context understanding ability. For example, without referring to a context discussing biological knowledge, an LLM can answer 'red' to the question 'what color is blood'.

The interference of common-sense knowledge in an LLM can lead to incorrect evaluations of its long-context understanding abilities, which is a critical issue for long-context benchmarks. To the best of our knowledge, we are the first to explicitly address this problem with our KPR method. Specifically, we (1) apply KPR to replace common-sense information in the context with non-common-sense information, and (2) provide explicit instructions to the LLMs to rely solely on the context provided, rather than on their internal knowledge. By doing so, if the model answers the question correctly, we can be confident that it is not relying on common-sense knowledge (since common-sense knowledge would not lead to the correct answer after KPR), but rather on its ability to understand the long context. In summary, our KPR method helps decouple the evaluation of common-sense knowledge from the evaluation of long-context understanding, addressing a key issue in existing long-context benchmarks.
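As a toy illustration of the idea (not our actual annotation pipeline; the replacement map below is purely hypothetical), KPR can be thought of as rewriting common-sense keywords consistently in both the context and the ground-truth answer:

```python
# Toy illustration of keyword-and-phrase replacement (KPR). In LV-Eval the replacements
# are curated and verified by human annotators; the mapping here is hypothetical.
REPLACEMENTS = {"red": "azure"}

def apply_kpr(text: str, mapping: dict[str, str]) -> str:
    """Swap common-sense keywords so the question can no longer be answered from internal knowledge."""
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

context = "Blood appears red because hemoglobin strongly absorbs blue-green light."
answer = "red"

kpr_context = apply_kpr(context, REPLACEMENTS)  # "Blood appears azure because ..."
kpr_answer = apply_kpr(answer, REPLACEMENTS)    # "azure": common sense alone now yields the wrong answer
```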

(to be continued ...)

Comment

(... continue from part I)

Q8: Do models do as well on LV-Eval as they do on real-world tasks like summarizing or chatting?

Thank you for the question. Based on our experience, performance on tasks like summarization and chatting does not necessarily correlate with performance on LV-Eval, which primarily assesses the ability to extract and manipulate knowledge from long inputs, whereas summarization and chatting focus on tasks that involve long outputs. However, we would like to emphasize that the ability to extract and manipulate knowledge from long contexts is fundamental and crucial for all applications that involve long inputs.

Q9: Are there certain types of long-text jobs where LV-Eval is really good or bad, and how could that change future tests?

Sorry that we might not fully understand this question. Are you asking about the real-world tasks that LV-Eval performance may be indicative of? We would appreciate it if clarifications could be provided on this.

Official Review
Rating: 6

The authors propose a long-context benchmark dataset for Large Language Models, with 5 gradually increasing length levels - 16k, 32k, 64k, 128k, and 256k tokens. The dataset consists of single-hop and multi-hop QA examples and was constructed using techniques such as content mixing up, keyword and phrase replacement, and confusing facts insertion. The authors also conducted an evaluation of 15 LLMs on the proposed benchmark. The insights from this evaluation include: 1) Moonshot-v1, Qwen2.5-72B, and LLaMA-3.1-70B achieved top performance; 2) within the group of LLMs with 6-9B parameters, GLM-4-9B achieves the best results; 3) score trends across the various length levels differ substantially between LLMs.

Strengths

  1. The dataset design, with three techniques for confusing the evaluated LLMs, allows a broader look at LLM capabilities in dealing with potentially out-of-distribution content. This could be a substantial addition to the set of benchmarks for thoroughly examining the generalization of LLMs.
  2. The 5 different length levels allow precisely pointing out how a model is able (or unable) to recall information from different parts of the context.
  3. The inclusion of bilingual datasets helps assess the efficacy of long-context recall and understanding in two vastly different languages.
  4. The authors evaluated multiple open-source LLMs with various context window sizes. This helps to assess the efficacy of different approaches to increasing the context window size for different models.

Weaknesses

  1. The conducted evaluation of LLMs on the LV-Eval benchmark included only 3 closed-source models. It is even more confusing that among those 3 models, 2 are very outdated versions of GPT-3.5 and GPT-4. This paper would immensely benefit from the inclusion of at least relatively recent closed-source LLMs, such as GPT-4 with a 128k context window (which was released almost a year ago to this date), along with Anthropic Claude, which shows remarkable performance in long-context recall. The argument about high cost is slightly confusing - particularly since the most recent closed LLMs have been much less expensive compared to their previous versions. This might not apply to some models from Anthropic, but it certainly does apply to GPT-4. The price of GPT-3.5 with a 16k context window was $3 per 1M tokens, for GPT-4 it was $30-60 per 1M tokens, while the most recent GPT-4o pricing is $2.5 per 1M tokens.
  2. Confusing Facts Insertion relies entirely on GPT-4's internal concept of a "confusing fact". This yields a data leakage problem - GPT-4 may have an unfair advantage over other models on this benchmark.

Questions

  1. Do you believe that using GPT-4 in confusing fact generation will not result in GPT-4 having an unfair advantage on this benchmark? If so, why?

2.1. Have you measured an inter-annotator agreement during annotation for CFI and KPR?

2.2. How many annotators have seen each example?

2.3. Have you rejected any annotations due to disagreement between annotators?

Comment

Many thanks for your time in reviewing our paper and for the questions and suggestions! We answer the questions as follows. Looking forward to any further discussion!

Q1: The conducted evaluation of LLMs on the LV-Eval benchmark included only 3 closed-source models. It is even more confusing that among those 3 models, 2 are very outdated versions of GPT-3.5 and GPT-4. This paper would immensely benefit from the inclusion of at least relatively recent closed-source LLMs, such as GPT-4 with a 128k context window (which was released almost a year ago to this date), along with Anthropic Claude, which shows remarkable performance in long-context recall. The argument about high cost is slightly confusing - particularly since the most recent closed LLMs have been much less expensive compared to their previous versions. The price of GPT-3.5 with a 16k context window was $3 per 1M tokens, for GPT-4 it was $30-60 per 1M tokens, while the most recent GPT-4o pricing is $2.5 per 1M tokens.

Thanks for this question and the detailed information; we really appreciate it. Although the cost of closed LLMs has dropped significantly, a full test of GPT on all five length levels still costs 1,768.8 USD (~707.52M tokens). Unfortunately, our budget for running model evaluations has run out. If we obtain reliable results on new models or secure a new budget for testing them, we will update the performance list on our project page.

Q2: Confusing Facts Insertion relies entirely on GPT-4's internal concept of a "confusing fact". This yields a data leakage problem - GPT-4 may have an unfair advantage over other models on this benchmark.

Thank you for the great question. We did not consider the impact of this aspect in our study. Indeed, there could be an unfair effect, but we are unsure whether it gives GPT an edge or a disadvantage. This is because, unless explicitly prompted to consider confusing facts within the context, the confusing facts generated by GPT itself may be the ones that are hardest for GPT to distinguish. This is an interesting discussion point, and we have added this discussion to the appendix.

Q3: Have you measured an inter-annotator agreement during annotation for CFI and KPR? How many annotators have seen each example?

For each example, two annotators annotate independently and then cross-check each other's annotations. We did not measure inter-annotator agreement during the annotation process. Since the principles of the annotation task are simple and clear, and the number of annotations is relatively small, we had the two annotators review each other's work and resolve any disagreements after completing their independent annotations. We have made this clearer in the appendix.

Q4: Have you rejected any annotations due to disagreement between annotators?

In our early oversight of the annotation process, we found that in a few rare cases there was disagreement between annotators, but these were mainly due to significant conflicts in the content of the text, which made deriving a single reasonable answer difficult. In those cases, we not only rejected the annotations but also removed the data from the dataset.

Official Review
Rating: 5

This paper introduces a new LLM eval benchmark called LV-Eval to assess LLMs' long-context QA capability on inputs of up to 256k words. LV-Eval incorporates a few techniques (e.g., confusing facts insertion, keyword manipulation) and adopts a new eval metric, which provides better/more reliable LLM evaluation. The authors conduct experiments on 15 different LLMs and demonstrate how performance can vary significantly in different testing scenarios.

Strengths

  1. A new dataset that could potentially benefit the community
  2. Paper presentation is overall clear
  3. The proposed benchmark construction method is overall reasonable

Weaknesses

The quality (and usefulness) of the proposed benchmark is not fully evaluated. How do state-of-the-art LLMs and humans (both domain experts and lay people) perform on this dataset? Does this benchmark capture an LLM's full capability (other than the knowledge extraction/manipulation capabilities needed for the QA tasks)? In multiple places in the paper, the authors mention human interventions (e.g., Line 269, “ask human annotators to resolve any conflicts in generated facts”); more analysis/discussion of the human annotation quality is needed.

The second limitation is that the construction process is not very clearly discussed. For example, around line 254, the authors mention that “For each length level, we sample distracting documents one by one until the cumulative word count meets the desired length level“; how does the distracting document sampling work exactly? Furthermore, as mentioned in the previous limitation, for each place where human annotation/intervention is involved, more discussion (e.g., number of annotators, annotation guidelines, inter-annotator agreement numbers, etc.) is needed. I would recommend the authors move some of the current Appendix B details to the main text.

Finally, as mentioned by the authors in the limitation section, the current benchmark only includes QA tasks. As a result, the proposed new evaluation metric also only targets short-form answers (as opposed to long-text generation). A natural question is what happens if we use another LLM auto-rater to judge whether two answer strings match semantically (instead of the lexical-based methods).

Questions

Please refer to the questions in the above weaknesses section.

Comment

Many thanks for your time in reviewing our paper and for the questions and suggestions! We answer the questions as follows. Looking forward to any further discussion!

Q1: The quality (and usefulness) of the proposed benchmark is not fully evaluated. How do state-of-the-art LLMs and humans (both domain experts and lay people) perform on this dataset?

Thanks for the great question. We note that the source documents and QA pairs in LV-Eval are primarily derived from established long-context datasets such as Longbench and Loogle. Since the quality of these source documents and QA pairs has already been validated, and our construction process mainly involves synthesizing more challenging contexts from them, we think another round of full human evaluation on our synthesized benchmark can be omitted. We acknowledge that this decision is also driven by cost considerations, as conducting a comprehensive human evaluation on the new dataset would be prohibitively expensive. For these two reasons, we opted not to carry out a full human evaluation on LV-Eval, as is common with synthetic datasets.

Q2: Does this benchmark capture an LLM's full capability, other than knowledge extraction/manipulation capabilities for the QA tasks?

LV-Eval does not assess all types of capabilities. Its focus is on evaluating the ability to extract and manipulate knowledge from extremely long contexts, particularly in the presence of distracting or confusing information. This capability is fundamental and essential for many applications that involve long inputs. And our results suggest that this task is challenging for current LLMs.

Q3: In multiple places in the paper, the authors mention human interventions (e.g., Line 269, “ask human annotators to resolve any conflicts in generated facts”); more analysis/discussion of the human annotation quality is needed.

We discuss the annotation details in Appendix B. Specifically, to ensure annotation quality, we conduct checks during the annotation process. We hire master's students in linguistics and LLM research, and they double-check the quality of the annotations, e.g., that the confusing fact insertion does not introduce actual conflicts. Since the number of instances to be annotated is not large (the QA pairs in a dataset remain the same across different length levels), verifying the annotation quality is easy for our hired annotators.

Q4: The second limitation is that the construction process is not very clearly discussed. For example, around line 254, the authors mention that “For each length level, we sample distracting documents one by one until the cumulative word count meets the desired length level“; how does the distracting document sampling work exactly?

During construction, for a given target document and its QA pair, we use irrelevant documents from the same source dataset as the distracting documents. To create the context for each length level, we uniformly sample irrelevant documents (without replacement) from the same source dataset, one at a time, and concatenate them into the context. Once the overall length of the context reaches the designated length level, we stop sampling and return the final context.
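A minimal sketch of this procedure is shown below. The function and variable names are illustrative, and the final shuffling of documents is an assumption (reflecting the content-mixing step) rather than a description of our exact implementation.

```python
import random

# Illustrative sketch of distracting-document sampling (not the exact LV-Eval implementation).
def build_context(supporting_doc: str, irrelevant_pool: list[str],
                  target_words: int, seed: int = 0) -> str:
    """Concatenate the supporting document with irrelevant documents sampled uniformly
    without replacement until the cumulative word count reaches the target length level."""
    rng = random.Random(seed)
    candidates = irrelevant_pool.copy()
    rng.shuffle(candidates)

    docs = [supporting_doc]
    total_words = len(supporting_doc.split())
    while total_words < target_words and candidates:
        doc = candidates.pop()          # uniform sampling without replacement
        docs.append(doc)
        total_words += len(doc.split())

    rng.shuffle(docs)                   # mix documents so the supporting one is not always first
    return "\n\n".join(docs)
```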

Q5: Furthermore, as mentioned in the previous limitation, for each place where human annotation/intervention is involved, more discussion (e.g., number of annotators, annotation guidelines, inter-annotator agreement numbers, etc.) is needed. I would recommend the authors move some of the current Appendix B details to the main text.

Thanks for this suggestion. We have revised Line 211 to include some information (the number and academic level of annotators), and we underline the sentence "we employ and guide human annotators to revise the confusing facts, replace the keywords and phrases, and annotate the keywords in GT answers" to make it clearer where we rely on human annotators. Due to space limitations, we were unable to include the full discussion of the annotation process in the main text. If the reviewer thinks there are other places where additional information is needed, we are happy to continue the discussion.

(to be continued ...)

Comment

(... continue from part I)

Q6: Finally, as mentioned by the authors in the limitation section, the current benchmark only includes QA tasks. As a result, the proposed new evaluation metric also only targets short-form answers (as opposed to long-text generation). A natural question is what happens if we use another LLM auto-rater to judge whether two answer strings match semantically (instead of the lexical-based method).

Thanks for this question. It is worth discussing. Our benchmark focuses on correctly extracting and manipulating information from very long input texts. We are not particularly concerned with the ability to generate long outputs. Indeed, the reviewer makes a valid point that our metric is not well-suited for evaluating long outputs, so we specifically filtered out long outputs when constructing the dataset. Thanks for the suggestion -- we realized we didn't mention this filtering in the paper, and we're adding this information.

Regarding the use of an LLM rater to judge whether two answers match, that's a great suggestion. For benchmarks that require evaluating long outputs, this approach is indeed appropriate. However, our benchmark does not focus on long outputs, so we rely on human effort to provide a simpler metric, so that the evaluation process does not involve a strong LLM, which could introduce additional biases and costs. We've added this discussion to the revision.

Comment

Thanks for the response and clarifications. However, I still intend to keep the original score as some of my core questions are still not answered. First, regarding the quality (and usefulness) of the proposed benchmark, you mention that this dataset is primarily derived from established long-context datasets such as Longbench and Loogle with additional syntheses. How do we know if the syntheses are good? And perhaps more importantly, how do we know whether state-of-the-art models will saturate on this dataset? This is also related to the next question about the eval metric.

I understand this paper is targeting questions with short-form answers; however, consider the question "Who received the IEEE Frank Rosenblatt Award in 2010?" with ground-truth answer "Michio Sugeno" and model-predicted answer "Michio Sugeno received the IEEE Frank Rosenblatt Award in 2010". While these two responses are basically saying the same thing, the model-predicted answer will get a somewhat low score (instead of a perfect 1.0 accuracy/score). I think experiments using an AutoRater are necessary, and this would also help us understand whether current SOTA models can already do a good job on the proposed dataset.
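Concretely, under a standard word-level F1 with simple whitespace tokenization, this prediction would score only about a third of the full mark:

$$\text{precision} = \frac{2}{10}, \qquad \text{recall} = \frac{2}{2} = 1, \qquad F_1 = \frac{2 \times 0.2 \times 1}{0.2 + 1} = \frac{1}{3} \approx 0.33$$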

Comment

Thanks for your follow-up! We hope our following responses can further address the concerns.

Regarding the suggestion of using an AutoRater: Thank you for emphasizing the importance of AutoRater. Based on your suggestion, we have incorporated the AutoRater judge into our evaluation. Specifically, we selected the top 3 LLMs from our study for evaluation with AutoRater. The results suggest that AutoRater generally assigns higher scores than the metrics we proposed. Despite this, the relative model performance is still accurately reflected.

The comparison between the evaluation results using our original metric and the AutoRater version has been added to Fig. A11 and Appendix-line 912 in the revision, including the auto-rating prompts. We will update the codebase to support LLM auto-rater, which incurs an approximate cost of $20 per model evaluation. We truly appreciate your suggestion, as using AutoRater is a valuable choice, albeit with some evaluation cost.
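For reference, below is a hedged sketch of such an auto-rater; the judging prompt and the generic call_llm client are hypothetical placeholders rather than the actual prompt added at Appendix-line 912 or the code in our codebase.

```python
# Hedged sketch of an LLM auto-rater for short-form QA. The prompt and the
# `call_llm(prompt) -> str` client are hypothetical placeholders.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {reference}
Model answer: {prediction}
Does the model answer convey the same meaning as the ground-truth answer?
Reply with exactly "yes" or "no"."""

def auto_rate(question: str, reference: str, prediction: str, call_llm) -> float:
    """Return 1.0 if the judge deems the two answers semantically equivalent, else 0.0."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, prediction=prediction)
    verdict = call_llm(prompt).strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```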

Regarding evaluating human performance on LV-Eval: Thanks for this suggestion again! We would like to seize the opportunity to discuss this further with you and try to reach a shared understanding. We understand that reporting human performance on a benchmark is helpful for diagnosing the room for improvement of models, and thus adds to the usefulness of the benchmark. However, for our large and long-context LV-Eval benchmark (11 datasets at 5 length levels of 16~256k), evaluating human performance is too time-consuming and labor-intensive.

Additionally, the QA pairs in LV-Eval are primarily knowledge-retrieval tasks, which do not require complex reasoning and are relatively simple for humans to handle. We conjecture that, given sufficient time, human testers could achieve near-perfect accuracy.

AC Meta-Review

The paper presents a benchmark dataset for evaluating large language models (LLMs) in long-context scenarios, along with a novel evaluation metric. This benchmark is focused on question-answering tasks in two languages, featuring varying context lengths and employing diverse construction techniques. The paper also includes an analysis of various LLMs' performance on this benchmark. While reviewers acknowledge the dataset and evaluation metric's potential to advance research in long-context modeling, they have identified several areas for improvement:

  1. The benchmark is limited to question-answering tasks (PwXz, Ytmu), and the dataset construction methods offer only incremental novelty compared to existing benchmarks (Ytmu, vECU).
  2. The paper lacks the evaluation of both state-of-the-art LLMs and human performance, which are crucial for thorough validation (PwXz, 1MsF, Ytmu).
  3. The dataset construction process is not well-explained in the main text, leading to some confusion (PwXz, 1MsF).

Due to the above shortcomings, a rejection is recommended.

Additional Comments on the Reviewer Discussion

The authors have addressed several points raised by the reviewers, such as the auto-rater and the clarity issues, but the major concerns about the technical contribution (Ytmu, vECU) and the lack of evaluation (PwXz, 1MsF, Ytmu) still remain.

Due to resource and time constraints, these drawbacks cannot simply be resolved by the authors within the discussion period. But in the long run, these shortcomings do reduce the value of this work, and we encourage the authors to seriously consider these concerns and keep improving it.

Final Decision

Reject