PaperHub
Score: 5.5 / 10
Poster · 4 reviewers (ratings: min 2, max 4, standard deviation 1.0)
Individual ratings: 4, 4, 2, 2
ICML 2025

Minerva: A Programmable Memory Test Benchmark for Language Models

Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords

LLM evaluation, LLM capability, context utilization, memory benchmark

Reviews and Discussion

Official Review
Rating: 4

Minerva introduces a thorough evaluation set to test the memory abilities of different LLMs. The evaluation set is generated with parametric programs and covers a broad range of memory skills such as information retrieval and localization, processing, content transfer, and structural awareness. Tests are also divided into atomic and composite: atomic tests evaluate individual skills, while composite tests rely on combinations of atomic skills. Different LLMs are evaluated, and the results provide a more detailed picture of their different memory skills.

Questions for Authors

  1. How many trials were done (if more than one)?

Claims and Evidence

The main claim is that current benchmarks do not capture memory-usage capabilities, and the evidence is provided through the new evaluation set as well as the accompanying results.

Methods and Evaluation Criteria

The method in this case is an evaluation set, and it is measured with several different criteria, including ROUGE, exact match, and Jaccard similarity. Given the nature of the task, these metrics make sense.
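
For readers less familiar with the last of these metrics, a token-level Jaccard similarity is commonly computed as the overlap between predicted and reference token sets. The sketch below is illustrative only; the paper's exact implementation is not reproduced in this review.

```python
# Illustrative token-level Jaccard similarity; the benchmark's actual
# implementation may tokenize or normalize differently.
def jaccard_similarity(prediction: str, reference: str) -> float:
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0  # both empty: treat as a perfect match
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)

print(jaccard_similarity("alpha beta gamma", "alpha beta delta"))  # 0.5
```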

Theoretical Claims

N/A

Experimental Design and Analysis

There are many experiments across the different tasks. The main limitation is that it is unclear how much context or how many in-context examples were given to the different models, or which prompt variants were used. Given the sensitivity of LLMs to these factors, this is an important aspect to examine.

Supplementary Material

Yes, all of them

Relation to Existing Literature

Designing new benchmarks to test LLM abilities is a common topic. The most related are needle-in-a-haystack benchmarks, but Minerva goes much deeper in testing the memory abilities of LLMs.

Essential References Not Discussed

Not that I am aware of

Other Strengths and Weaknesses

Strengths:

  • The topic of memory utilization is extremely important, as it is crucial to many different tasks. In general, designing new benchmarks is also important given the saturation of current ones.
  • The test set is fairly comprehensive and shows strengths and weaknesses of different models (rather than all models performing similarly). These results are useful for the broader community.
  • The test set is flexible and easy to extend, which also helps prevent models from overfitting to it.

Weaknesses:

  • As mentioned earlier, the main weakness is how the models were evaluated. Given how sensitive models are to different prompts and the effectiveness of in-context learning, it would make sense to provide several examples for the different tasks, as that might change the performance, especially for the more complex tasks. For the composite tasks, the examples could come from the outputs of the atomic tasks.

Overall, this paper presents an extendable evaluation set that examines the memory capabilities of different LLMs. Given that no comparable evaluation set exists, I think it should be accepted, as it provides insight that would otherwise be difficult to obtain.

Other Comments or Suggestions

N/A

Author Response

Thank you for your constructive feedback and positive assessment of our paper. Below, we address your main concerns:

Use of in-context examples

We intentionally did not include in-context learning examples, as our goal is to evaluate models' inherent memory capabilities, rather than their ability to perform few-shot learning. We think that including in-context examples would conflate memory with adaptation to in-context cues, making it harder to isolate the specific skills we aim to measure.

Prompt variation and sensitivity

We recognize that LLMs can be sensitive to prompt formulation. However, our primary goal in this work is to establish a standardized benchmark for evaluating memory capabilities, rather than optimizing performance through prompt engineering.

To assess the impact of prompt variations, we conducted a preliminary experiment where we tested different wordings for the same atomic task. We found that, as long as no heavy prompt tuning is involved, the impact on performance remains minimal. For example, in the word presence task, we obtained identical results from multiple models for the prompt variant "Is xxx present in the context?" compared to "Given the context, determine if xxx is present in the context." This suggests that minor wording differences do not substantially alter model performance.

Given these findings, we believe that fixing the prompts to a single, simple version (as shown in Appendix A) ensures consistency and comparability across models. While prompt tuning could enhance performance for specific models, that is a separate research question beyond the scope of this work.

Number of trials

The results reported in the paper are based on a fixed snapshot of the benchmark, meaning we used our program to generate a predefined set of 1110 evaluation instances, which we assume corresponds to what the reviewer means by "one trial." However, we generated multiple instances for each task (see Appendix B), ensuring a more diverse representation. To maintain reproducibility and consistency, we fixed the experimental settings as described in the paper.

We also want to emphasize that our benchmark can be used to generate any number of test samples and is fully extendable to different context lengths. We will open-source our code and also provide the exact snapshot of our data used in this experiment to facilitate further experimentation with alternative configurations.
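
To illustrate what generating instances programmatically can look like in practice, here is a minimal sketch of a parametric generator for a key-value search instance. The function and field names are hypothetical and not taken from the authors' code.

```python
import random
import string

# Hypothetical sketch of a parametric instance generator; names and prompt
# wording are illustrative, not the benchmark's actual implementation.
def make_key_value_search_instance(num_pairs: int = 200, seed: int = 0) -> dict:
    rng = random.Random(seed)
    rand_token = lambda: "".join(rng.choices(string.ascii_lowercase, k=8))
    pairs = {rand_token(): rand_token() for _ in range(num_pairs)}
    query_key = rng.choice(list(pairs))
    context = "\n".join(f"{key}: {value}" for key, value in pairs.items())
    prompt = (
        f"Context:\n{context}\n\n"
        f"What value is associated with the key '{query_key}'? "
        "Answer with the value only."
    )
    return {"prompt": prompt, "answer": pairs[query_key]}

# A fresh seed yields a fresh instance, so the benchmark never reduces to a
# static dataset that models could overfit to.
instance = make_key_value_search_instance(seed=42)
```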

Reviewer Comment

Thank you for the clarifications and additional details. I keep my score at 4.

Official Review
Rating: 4

The paper presents a new benchmark for evaluating LLMs' long-context problem-solving abilities. The proposed benchmark includes atomic tasks (searching, recalling, editing, matching, etc.) that evaluate models on tasks going beyond those commonly explored (passkey, key-value, needle-in-the-haystack). Experiments present a comprehensive and detailed analysis of different models' ability to use memory.

Questions for Authors

N/A

Claims and Evidence

The main claim is that the proposed approach provides a more comprehensive and nuanced evaluation of an LLM's context utilization, which I believe is supported by the experiments.

Methods and Evaluation Criteria

The paper presents a wide range of relevant atomic tasks for evaluating long-context capabilities in LLMs, including search, recall, edit, match, compare, spot-the-difference, stateful processing, and processing data blocks. It also evaluates models on composite tasks that combine multiple atomic abilities, and shows that abilities deteriorate significantly in this setting.

Theoretical Claims

N/A

Experimental Design and Analysis

The experiments evaluated a range of different models (both black-box and open-source) on the proposed benchmark. The evaluations show that different models perform differently on different atomic tasks (i.e., a model being better on some tasks does not mean it will be better on others), which highlights the need for a wide range of tasks to evaluate different kinds of long-context retrieval abilities.

Supplementary Material

No

Relation to Existing Literature

While a number of prior works have proposed benchmarks for context utilization, they mainly focus on basic information retrieval or are relatively limited in scope. The proposed benchmark presents a more comprehensive suite of atomic evaluations on different kinds of tasks.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

One of the claims of the paper is that the proposed benchmark provides a more detailed view into the long-context capabilities of a model, which can provide better guidance for future model training and development. It would be nice to add a bit more discussion about the specific kinds of improvements the evals revealed.

Author Response

We sincerely appreciate the reviewer’s positive assessment of our work and thoughtful feedback.

One key takeaway from our experiments is that different models exhibit high variance across atomic memory tasks, reinforcing the need for a diverse evaluation suite. In addition, we found that a major failure pattern is that models perform well on direct retrieval (e.g., finding an exact string in a long context) but degrade significantly when required to store, update, and apply memory dynamically. We believe this might be due to the mainstream attention-based architecture, which makes point-wise access (e.g., retrieving a specific token) relatively easy but lacks explicit mechanisms for storing and updating memory. Future models may benefit from architectures that explicitly model memory, such as those explored in the recent Titan paper.

Beyond memory retention, models also struggle with comparing, integrating, or transforming stored information, particularly when relevant facts are dispersed across different parts of the context rather than appearing in a single block, or when the task requires keeping track of updates to an entity’s attributes over time. These results suggest that attention alone may not be enough for effective long-context understanding. Future improvements might require structured memory systems, such as hierarchical memory representations to help models better handle complex memory tasks.

We will expand this discussion in the paper to further highlight how our evaluations can inform future model development.

Official Review
Rating: 2

This paper presents a framework for automatically generating a broad set of tests to evaluate LLMs' memory usage. Going beyond simple search, the benchmark assesses tasks like recalling, editing, matching, and tracking information across distinct data blocks. Experiments reveal that while models handle basic retrieval, they often fail at more complex operations, highlighting important gaps and guiding future improvements.

Questions for Authors

Please refer to Strengths And Weaknesses.

Claims and Evidence

Please refer to Strengths And Weaknesses.

Methods and Evaluation Criteria

Please refer to Strengths And Weaknesses.

Theoretical Claims

Please refer to Strengths And Weaknesses.

Experimental Design and Analysis

Please refer to Strengths And Weaknesses.

Supplementary Material

Please refer to Strengths And Weaknesses.

Relation to Existing Literature

Please refer to Strengths And Weaknesses.

Essential References Not Discussed

Please refer to Strengths And Weaknesses.

Other Strengths and Weaknesses

Strengths

  • The paper presents a wide range of memory-related tasks and thoroughly evaluates them.

  • By incorporating visual results, the authors make it easier for readers to compare model performance and understand outcome disparities.

Weaknesses

  • While extensive in empirical evaluation, the work does not introduce significant new theoretical concepts or frameworks for memory usage.

  • The paper positions itself as going "beyond simple search," but does not clearly illustrate the threshold that separates simple retrieval from more advanced memory tasks; additional concrete examples would clarify these distinctions.

  • The paper's core focus on LLM memory benchmarking may not align closely with ICML's call for papers.

Other Comments or Suggestions

Please refer to Strengths And Weaknesses.

Author Response

We appreciate the reviewer’s comments and the opportunity to clarify our contributions.

Position of the paper and fitness with ICML

The reviewer raises the concern that our paper does not introduce new theoretical concepts. The paper does not aim to be theoretical; the nature of memory analysis in LLMs is empirical. However, we believe that a rigorous and comprehensive benchmarking framework is a significant contribution. Using static-data benchmarks for needle-in-the-haystack-style analysis is not sufficient, especially in a rapidly evolving field like LLMs. Our study provides:

  1. A well-defined taxonomy of memory-related tasks that categorizes different types of memory demands in LLMs, providing a structured way to assess and analyze LLM memory capabilities, as described in Section 2.

  2. New insights into LLM memory behavior across various tasks, revealing critical limitations that have not been studied before. As discussed in Section 4 line 434 - 436 (left) and 385-394 (right), previous benchmarks and tests focus mainly on information retrieval, and lack a comprehensive evaluation on memory. Many other benchmarks are also static dataset benchmarks.

  3. A programmable, extensible benchmark that enables future research on memory efficiency, potentially informing both theoretical studies and model development. This mitigates the risk of overfitting on benchmarks, and biasing the empirical evaluations.

Historically, benchmarks have played a pivotal role in advancing ML research. For example, ImageNet reshaped computer vision, and MMLU influenced LLM evaluation. Similarly, our benchmark exposes key limitations in memory-intensive tasks, which could inspire future advancements. Moreover, evaluation is explicitly within the scope of ICML’s call for papers.

Clarifying the distinction between simple search and other tasks

Thanks for the suggestion; we will add additional explanations and examples to better illustrate this distinction in the paper.

By "simple search," we refer to tasks that require only the ability to locate and extract relevant information from memory without additional processing (see line 105, left). In our benchmark, this includes tasks like string search, key-value search, and batch search (see Table 1 and Appendix A). For example, given a keyword x, the model retrieves the associated value y. This "locate-and-extract" ability resembles traditional search tasks, and thus we refer to them as simple search. This type of task has dominated prior work on LLM memory utilization, such as the Needle-in-a-Haystack task (see Section 4, line 419, left).

However, real-world applications of memory demand more than just retrieval. Our benchmark explicitly evaluates tasks that require models to recall, edit, match, compare, and track state, which extend beyond this simple search (see Table 1 and lines 55-67, right). For instance:

  • A writing assistant that rephrases or edits a paragraph operates differently from a system that merely retrieves a fact embedded in a document.
  • A financial analysis tool that verifies consistency across multiple reports needs to compare and reconcile stored information rather than just extract it.
  • A personalized assistant managing ongoing tasks must track state changes over multiple interactions instead of retrieving isolated details.

In practice, memory-intensive tasks often require a composition of these abilities. Our benchmark not only evaluates these capabilities but also provides a structured taxonomy to identify model strengths and weaknesses systematically.
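
To make the contrast concrete in code, the sketch below generates a small state-tracking instance in which the answer cannot be located anywhere in the context and must instead be computed by applying each update in order. This is a hypothetical illustration, not the benchmark's actual task format.

```python
import random

# Hypothetical sketch of a state-tracking test: unlike simple search, the
# final answer never appears verbatim in the context.
def make_integer_tracking_instance(num_steps: int = 20, seed: int = 0) -> dict:
    rng = random.Random(seed)
    value = rng.randint(0, 100)
    lines = [f"Start with the number {value}."]
    for _ in range(num_steps):
        delta = rng.randint(1, 10)
        if rng.random() < 0.5:
            lines.append(f"Add {delta}.")
            value += delta
        else:
            lines.append(f"Subtract {delta}.")
            value -= delta
    lines.append("What is the final number? Answer with the number only.")
    return {"prompt": "\n".join(lines), "answer": str(value)}
```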

Official Review
Rating: 2

This paper introduces a framework for systematically evaluating the memory utilization capabilities of language models. Expanding beyond conventional memory tests—such as passkey retrieval, key-value lookup, and needle-in-the-haystack search—the proposed framework assesses models on a broader range of atomic tasks, including searching, recalling, editing, matching, and comparing information within structured memory contexts. Additionally, the authors design composite tests to examine models' ability to maintain state across operations. By simulating real-world data structures, this benchmark provides a more interpretable and granular assessment of large language models' memory handling capabilities.

Questions for Authors

Mentioned above.

Claims and Evidence

The paper claims to introduce a novel framework for evaluating the memory utilization capabilities of large language models (LLMs). While the authors do propose an extended set of tests beyond traditional memory benchmarks, it is unclear whether the framework provides fundamentally new insights or if it merely reinforces existing knowledge about model performance. Looking at the performance results, it appears that models that perform well on standard NLP benchmarks (e.g., GPT-4o outperforming GPT-4 Turbo, which in turn outperforms GPT-4o-mini) also perform well on this benchmark. This raises the question of whether the proposed benchmark truly captures unique aspects of memory utilization or if it is simply correlated with overall model quality. Additional evidence, such as cases where model rankings differ from traditional benchmarks, would strengthen the claim of novelty.

Methods and Evaluation Criteria

The authors employ a variety of atomic and composite tests to assess models' abilities to retrieve, edit, and manipulate information within structured contexts. However, there are some concerns regarding the design of the evaluation criteria. For example, in the String Search task, the prompt asks whether a certain substring (e.g., “bbb”) is present, with expected answers being "yes" or "no." However, the evaluation metric used is exact_match, which does not seem appropriate in this context.

Additionally, the authors set the temperature to 0 and top-p to 1, ensuring deterministic outputs. While this suggests an intent to evaluate models under controlled conditions, it is unclear whether this decision might disadvantage certain models or affect specific types of tests. For instance, tasks requiring reasoning or more flexible memory retrieval might be better evaluated with slightly higher randomness. The paper should clarify the implications of this choice on model performance.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental setup is well-structured, but there are areas where further exploration could enhance the analysis:

  1. The paper does not discuss whether the authors experimented with varying context lengths, which could have provided deeper insights into how memory performance scales with longer or shorter inputs. Given that real-world applications involve diverse context lengths, it would be valuable to understand whether model performance remains stable across different settings.

  2. It is unclear whether the models' performance depends on how instructions are phrased in the prompts. The authors should clarify whether variations in prompt wording lead to performance shifts and, if not, why that would be the case. Understanding how models respond to different instructions is critical for assessing robustness.

  3. The authors emphasize the distinction between proprietary (black-box) and open-source models (Line 184), but it is unclear why this distinction is particularly relevant in the context of memory benchmarking. Are there fundamental differences in how proprietary and open-source models handle memory? Additional clarification on this point would be useful.

Supplementary Material

Yes, all the sections.

Relation to Existing Literature

The work is relevant for the research community.

Essential References Not Discussed

The authors have discussed sufficient literature on the topic.

Other Strengths and Weaknesses

Strengths:

The paper provides a structured approach to evaluating LLMs’ memory abilities, introducing a broader set of tasks beyond traditional benchmarks. The inclusion of composite tests that assess state retention while operating on memory is valuable for understanding sequential dependencies.

Weaknesses:

The novelty of the benchmark is questionable, given that performance trends largely align with existing NLP benchmarks. The lack of variation in context length experiments limits insights into how well models generalize memory performance across different scales. The deterministic setting (temperature = 0) may not be suitable for all types of memory-related tasks and could influence results in ways that are not discussed in the paper.

Other Comments or Suggestions

Minor: Lines 103-104 seem to have some structural irregularity; I don't see how the bullet points follow from the earlier discussion. Please rewrite. Lines 74 and 84 (right): remove the bolding of words.

Author Response

We appreciate your thoughtful feedback and the chance to clarify our contributions.

Insights from the benchmark

Our main goal is not to rank known models, but rather to introduce tests that evaluate the different functionalities needed for LLMs (and LLM agents). Any model can be (1) tested using our benchmark, (2) updated/retrained/fine-tuned to improve its functionalities WITHOUT the problem of being overfitted on the benchmark, and (3) compared fairly across older and newer versions by rerunning the benchmark with fresh randomness.

While some of our results align with prior benchmarks, we disagree that our work merely reinforces prior knowledge. It is true that GPT-4o performed the best in our tests, as expected, but our findings provide meaningful insights beyond the simplification of “GPT-4o is better than GPT-4o-mini”. Specifically: (1) Models of similar sizes (particularly open-source ones) show notable performance variations across different types of memory tasks (See Figure 3-5) (2) Performance gaps vary by task, e.g. while models perform similarly in information fetching tasks, more challenging tasks like "spot the difference" reveal larger gaps. (3) Smaller models can outperform larger ones on certain tasks. Notably, LLaMA-3.1-8B and phi-3-small outperformed GPT-4-turbo in “patch the difference”, despite being smaller. This suggests that factors beyond sheer model size, such as architectural choices and training data, play a significant role in memory-related capabilities. Generating insights doesn’t require contradicting prior benchmarks. Instead, we aim to uncover nuanced model behavior patterns that inform development and evaluation strategies. This goes beyond using benchmarks merely for ranking models.

We would also like to clarify the following points regarding our experimental setup:

1. Variation in context length

We mainly fixed the context length to 4K tokens to highlight that models already struggle in many memory tests at this length. We have also shown in one of our ablation studies (Figure 6) how the number of operation steps/context length affects performance. We also have other ablation results that show a similar trend: performance was near perfect at shorter lengths, but significantly lower well before reaching what is typically considered "long context". We can add more results in the paper to clarify these findings. Additionally, since our benchmark is fully programmable and transparent, it can be extended to evaluate models at different context lengths.

2. Evaluation criteria

For the substring presence test, exact match means that if the model’s response (yes/no) matches with presence/absence of the substring, it receives a score of 1, otherwise 0. We believe this is a fitting metric for a binary classification task.
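
Concretely, a minimal version of this scorer could look like the sketch below (illustrative; the benchmark's actual normalization of model responses may differ).

```python
# Illustrative exact-match scoring for the substring-presence test; the
# benchmark's actual response normalization may differ.
def score_presence_answer(model_response: str, substring_is_present: bool) -> int:
    normalized = model_response.strip().lower().rstrip(".")
    expected = "yes" if substring_is_present else "no"
    return int(normalized == expected)

assert score_presence_answer("Yes.", True) == 1
assert score_presence_answer("No", True) == 0
```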

3. Variations in prompt wording

As discussed in our response to Reviewer RVnF, we conducted a preliminary experiment to assess the impact of wording variations and found that performance differences are minimal for instruction rephrasing. This supports our decision to fix prompts to a single simple version (in Appendix A) for consistency and comparability across models. Based on this result, we observe that the main challenge is not in "understanding" the instruction, but rather in "performing the task". That said, note that (1) our benchmark is not a static dataset, and a simple script can be added to generate many prompt variations; and (2) we will add confidence intervals around the results to also show robustness to instruction variations.
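
As an illustration of such a script, the sketch below samples one of several equivalent wordings for the word-presence instruction; the template list and function name are hypothetical, not part of the released benchmark.

```python
import random

# Hypothetical prompt-variation generator; templates are illustrative.
PRESENCE_TEMPLATES = [
    "Is {word} present in the context?",
    "Given the context, determine if {word} is present in the context.",
    "Does the word {word} appear anywhere in the context?",
]

def sample_presence_prompt(word, seed=None):
    rng = random.Random(seed)
    return rng.choice(PRESENCE_TEMPLATES).format(word=word)

print(sample_presence_prompt("xxx", seed=1))
```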

4. Deterministic setting

We deliberately used a deterministic setting for consistency and controlled comparisons. While we understand the concern that this might disadvantage certain models, we argue that: (1) A deterministic setting provides more stable and reproducible evaluation. (2) This choice is an evaluation setting, not a limitation of the benchmark itself. While we used a deterministic setup in this study, our benchmark is fully configurable. Users can modify decoding parameters to introduce stochasticity and explore its effects on different memory tasks.
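
For concreteness, a deterministic evaluation call with these settings might look like the following sketch, here using the OpenAI Python client as an example harness; the authors' actual evaluation code is not shown, and the client and model choices are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_deterministic(prompt: str, model: str = "gpt-4o") -> str:
    # temperature=0 and top_p=1 make decoding (near-)deterministic,
    # which keeps repeated benchmark runs comparable.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        top_p=1,
    )
    return response.choices[0].message.content
```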

5. Separation between proprietary and open-source models

We separate proprietary and open-source models because these groups tend to be more comparable within themselves. For example, the open-source models we evaluated are of similar sizes, making side-by-side comparisons more meaningful. This grouping allows for clearer analysis while still enabling cross-category comparisons where relevant.

Final remarks on experiment setups

Our benchmark is designed to be fully programmable, extendable to different context lengths, evaluation criteria, prompting strategies, or other configurations. We will open-source our code and provide a snapshot of our data to encourage further research. However, for the sake of reproducibility, we fixed the experimental settings as described in the paper to ensure consistency in our results.

Reviewer Comment

I thank the authors for answering my questions in detail. This is helpful, and I believe it clarifies the concerns I raised to some extent. So, at this point, I will increase the score. However, as the authors themselves mentioned, a number of things need to be added to the paper for a thorough evaluation of the benchmark.

Author Comment

We thank the reviewer again for the thoughtful feedback and for raising the score. We really appreciate your engagement with the work.

We’d like to clarify that we have run the additional experiments we mentioned (e.g., prompt variations, context length changes) and we’re ready to include them in the final version.

While we understand these additions will help clarify the paper further, we also want to emphasize that we believe the main ideas and contributions are already well-supported in the current version. Our main contribution is to introduce a comprehensive, programmable benchmark for memory evaluation, one that goes beyond simple recall or “needle-in-a-haystack” tasks. We designed tasks that isolate different memory behaviours, which we think is crucial for understanding where and why models succeed or fail.

Moreover, a key strength of our programmable benchmark is its flexibility: it allows users to generate diverse variations of the tests by adjusting different parameters. While we present results from a particular snapshot in the paper, along with ablation studies for selected variations, the framework is designed to support a wide range of experimental setups. Our goal is not to "sell" the specific results shown in the paper, but to introduce a modular set of tasks and an extensible framework for evaluating memory. Researchers can pick it up and run their own experiments with configurations that suit their needs. We believe the core idea is new and useful to the community. Also, the code and data are ready to use and extend.

To give a clearer picture of the results we’ll add in the revision:

Context length variation

We tested how model performance changes with increasing context length, beyond the results already shown in Figure 6 (for integer/set state tracking). We see that the model performance dropped well before reaching a “long” context.

Task: Functional updates (metric: ROUGE-L; columns are context lengths in tokens)

| Model | 500 | 1k | 2k | 4k | 8k |
| --- | --- | --- | --- | --- | --- |
| gpt-4o | 1.00 | 1.00 | 0.99 | 0.93 | 0.59 |
| gpt-4o-mini | 0.69 | 0.66 | 0.42 | 0.24 | 0.10 |
| phi-3-small | 0.49 | 0.45 | 0.21 | 0.07 | 0.03 |

Task: Count (metric: exact match; columns are context lengths in tokens)

| Model | 1k | 2k | 4k | 8k | 16k |
| --- | --- | --- | --- | --- | --- |
| gpt-4-turbo | 0.52 | 0.44 | 0.40 | 0.32 | 0.28 |
| cohere | 0.36 | 0.24 | 0.20 | 0.20 | 0.12 |
| phi-3-medium | 0.20 | 0.16 | 0.12 | 0.08 | 0.04 |

Prompt variation

We ran small wording changes on the same task and evaluated the difference in model performance. We found that the differences are negligible across multiple models, which suggests that the main challenge is not in "understanding" the instruction, but rather in "performing the task". Below, we report task performance and 95% confidence intervals (CIs).

Task: String search (word)

Number of samples per variation: 50

Variation 1: Given the context, determine if xxx is present in the context.

Variation 2: Is xxx present in the context?

| Model | Variation 1 | Variation 2 | 95% CI |
| --- | --- | --- | --- |
| gpt-4o | 1.00 | 1.00 | (0.94, 1.00) |
| gpt-4o-mini | 0.98 | 0.98 | (0.94, 1.00) |
| phi-3-medium | 1.00 | 1.00 | (0.94, 1.00) |

Task: Group association

Number of samples per variation: 40

Variation 1: determine if the word "aaa" and the word "bbb" are in the same list

Variation 2: check if the words "aaa" and "bbb" belong to the same list

| Model | Variation 1 | 95% CI | Variation 2 | 95% CI |
| --- | --- | --- | --- | --- |
| gpt-4o | 0.65 | (0.50, 0.80) | 0.63 | (0.48, 0.78) |
| cohere | 0.70 | (0.56, 0.84) | 0.75 | (0.62, 0.88) |
| phi-3-small | 0.55 | (0.40, 0.70) | 0.55 | (0.40, 0.70) |
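
For reference, the intervals reported above are consistent with a standard normal-approximation binomial confidence interval on task accuracy, with a rule-of-three style lower bound when accuracy is 1.0; the sketch below assumes that convention and is not a statement of the authors' exact procedure.

```python
import math

# Assumed convention: normal-approximation 95% CI for a binomial proportion,
# clipped to [0, 1], with a rule-of-three lower bound when accuracy is 1.0.
def accuracy_ci(accuracy: float, n: int, z: float = 1.96) -> tuple:
    if accuracy == 1.0:
        return (1.0 - 3.0 / n, 1.0)
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return (max(0.0, accuracy - half_width), min(1.0, accuracy + half_width))

print(accuracy_ci(1.00, 50))  # ~(0.94, 1.00)
print(accuracy_ci(0.65, 40))  # ~(0.50, 0.80)
```
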
Final Decision

This paper introduces Minerva, a framework for evaluating LLMs' memory capabilities beyond simple retrieval tasks. The benchmark assesses a comprehensive set of atomic tasks (searching, recalling, editing, matching) and more complex composite tasks that require maintaining state while operating on memory.

Strengths

  • Offers a programmable framework that prevents overfitting through dynamic test generation
  • Provides a structured taxonomy of memory-related tasks for systematic evaluation
  • Reveals nuanced insights into model performance across different memory capabilities
  • Demonstrates that models perform well on direct retrieval but struggle with more complex operations

Weaknesses

  • Primary experiments use a fixed context length (4K tokens), though additional results show scaling behavior
  • Limited exploration of prompt variations, though preliminary tests suggest minimal sensitivity
  • Primarily empirical rather than theoretical contribution (though appropriate for this work)

The contribution addresses an important gap in LLM evaluation, offering value to the research community by enabling systematic assessment of how models utilize context memory. The experimental results could inform future architectural improvements. While some aspects could be strengthened, the core contribution represents a valuable addition to the field of LLM evaluation.