PaperHub
7.8/10 · Oral · 4 reviewers
Ratings: 4, 5, 5, 5 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Memory Mosaics at scale

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
In-context learning, memory mosaics

Reviews and Discussion

Review
Rating: 4

This paper extends the work of Memory Mosaics, introducing the Memory Mosaics v2 architecture, which they train on 1 trillion tokens. They demonstrate comparable performance to transformers (trained on the same data-mix and training recipe) on downstream language tasks, while showing superior results on the RULER (needle-in-a-haystack) benchmark, as well as in-context learning tasks.

Strengths and Weaknesses

Strengths

  • The experiments are large scale (up to 8B model size and 1T tokens)
  • They achieve a greater-than-10% boost on RULER (NIAH) tasks compared to transformers at 32K context size
  • The memory mosaics architecture demonstrates superior performance on in-context learning tasks
  • The architectural modifications are relatively simple and easy to understand

Weaknesses

  • The architectural benefits are demonstrated on a specific RULER task and 3 in-context learning tasks (Banking77, Tacred, Goemotion). To be truly convincing I believe readers would need to see performance boosts on more benchmarks.
  • While they compare against transformers with a comparable FLOPs/token, there is no training + inference throughput profiling for the memory mosaics architecture relative to transformers. Throughput is the most important performance metric for large scale training and model serving, thus comparable training and inference efficiency would need to be demonstrated in order for the Memory Mosaics architecture to be adopted at scale.

Questions

  • Could the authors provide throughput profiles for both the memory mosaics v2 architecture and the Transformer architecture (with flash-attention and all relevant optimizations)?
  • Could the authors provide additional evaluations, perhaps on the full RULER suite including multi-key retrieval, as well as other long-context benchmarks (e.g. HELMET, BabiLong, PG-19, SWDE, SQA, SNQA)?
  • Did the authors run evaluations on other in-context learning benchmarks?
  • Re Figure 1: Do the different lines correspond to different layers? Please edit the caption to make the figure self-contained.
  • Memory mosaics (informally) interpolate based on the next value $(k_t, v_{t+1})$, whereas transformers interpolate based on the value at the same position $(k_t, v_t)$. Has an ablation ever been run where a transformer operates the same as memory mosaics (values shifted forward in time) and the diagonal of the attention mask is removed in order to preserve causality? This seems a sensible thing to try in order to see how well transformers perform under the most similar interpolation scheme (a toy sketch of this ablation is given after this list).
  • If $m < h$, there is no computational saving from splitting into short and long-term memory, but rather the benefit is from separate parameters for long and short term memory?
  • In appendix: "During training, memory mosaics v2 samples the long-term memory delay step m from [64, 256], sets the short-term memory window size h = 256. At inference, m is set to 64", what is the intuition behind this? Did you observe any loss of accuracy without the sampling and keeping m fixed at 64?
  • What is the number of shots for each benchmark in Table 1? Since few-shot learning is analogous to in-context learning, it would be helpful to vary the number of shots and compare with transformers. This would be relatively low cost compared to the huge cost of the pre-training experiment, so it would be worth adding to the paper.
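For concreteness, the value-shifting ablation proposed above could look like the following minimal sketch (an illustration, not code from the paper or the review): each key k_t is paired with the next value v_{t+1}, and the diagonal of the causal mask is dropped so that position t never reads information from its own future. Tensor shapes and the zero-filling of the last position are arbitrary choices.

```python
import torch

def value_shifted_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim)
    B, H, T, D = q.shape
    v_next = torch.roll(v, shifts=-1, dims=2)        # pair k_t with v_{t+1}
    v_next[:, :, -1] = 0.0                           # the last position has no successor
    scores = (q @ k.transpose(-2, -1)) / D ** 0.5
    strict = torch.ones(T, T, device=q.device).triu(0).bool()   # mask positions kv_idx >= q_idx
    scores = scores.masked_fill(strict, float("-inf"))          # diagonal removed: q_t sees k_{<t} only
    attn = torch.nan_to_num(torch.softmax(scores, dim=-1))      # first query attends to nothing -> zero output
    return attn @ v_next

# e.g. value_shifted_attention(*[torch.randn(1, 2, 8, 16) for _ in range(3)])
```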

Limitations

The authors mention limitations in the "Computation and # parameters concerns" section. They mention the difference in FLOPs/token relative to transformers in Table 2. However, to be more convincing the authors could provide throughput profiles for Memory Mosaics v2 relative to Transformers.

Final Justification

The authors made efforts to address my concerns in the rebuttal. While I still have lingering reservations regarding the method's efficiency and also would like to see more extensive evaluations on long-context benchmarks (e.g. HELMET) that are relevant for memory models, I believe this work still has merit and meets the acceptance bar all things considered, and thus I have raised my score from 3 to 4. I hope that the authors will take reviewer feedback to improve the camera-ready version of the manuscript.

Formatting Issues

Minor suggested edits:

"were continuously pursued but remain challenging" --> "have been continuously pursued but remain challenging"

"via various statistical “independent” viewpoint" (This language is a bit awkward, consider rephrasing)

Author Response

Thank you for your thoughtful reviews and insightful suggestions. We appreciate the opportunity to address your concerns. Most of the issues stem from insufficient clarification and misunderstanding regarding the primary goal of this paper. As suggested by reviewer vHo6, we would like to highlight footnote 5 in our paper, emphasizing that "this work is neither an attempt at linearizing attention mechanisms nor focused on model efficiency or model serving".

The goal of this paper is to develop new architectures capable of effectively learning novel tasks. We do not focus on the performance metrics typically associated with persistent-knowledge storage & retrieval, nor model efficiency, model serving, or throughput. While model efficiency could be a future research direction once one wants to establish model serving at a product level, this paper is currently focused on the early stages of architecture exploration and new-task learning ability.

Therefore, most concerns can be addressed by providing more clarity regarding our goals. We will emphasize the goal of this work in the final version!


Q1: Additional ruler benchmarks

We chose two question-answering tasks in RULER, because they are notably more challenging than other "needle in a haystack" tasks like multi/single-key retrieval, where many models can easily achieve >99% performance (see Appendix E and F in RULER [4]). Note that the information entropy of multi/single-key retrieval tasks is extremely low, making these tasks easy to compress.

On the other hand, we are happy to share experimental results of Memory Mosaics v2 on these key retrieval tasks. Table A and B show that Memory Mosaics v2 outperforms Transformers.

| model | task-length 4k | task-length 8k | task-length 16k | task-length 32k |
|---|---|---|---|---|
| transformer small | 99.4 | 99.0 | 98.2 | 97.8 |
| memory mosaics v2 small | 100.0 | 100.0 | 100.0 | 100.0 |

Table A: RULER S-NIAH benchmark comparison between transformer small and memory mosaics v2 small.
| model | task-length 4k | task-length 8k | task-length 16k | task-length 32k |
|---|---|---|---|---|
| transformer large | 100.0 | 100.0 | 100.0 | 99.6 |
| memory mosaics v2 large | 100.0 | 100.0 | 100.0 | 100.0 |

Table B: RULER S-NIAH benchmark comparison between transformer large and memory mosaics v2 large.

Thank you for suggesting additional benchmarks such as HELMET, BabiLong, PG-19, SWDE, SQA, and SNQA. However, these benchmarks overlap with each other, with RULER, and with the three in-context learning tasks presented in this paper.

For instance:

  • HELMET includes HotpotQA, banking77, and some ruler tasks.
  • HotpotQA is used in one of the RULER question-answering tasks evaluated in this paper.
  • Banking77 is one of 3 in-context learning tasks used in this work.
  • BabiLong consists of "needle-in-a-haystack" tasks with low information entropy, similar to many RULER tasks.

Therefore, this paper doesn't evaluate similar or redundant tasks repeatedly.

[4] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv (2024).


Q2: The goal of this work

As mentioned previously, we would like to emphasize that Footnote 5 states: This work is neither a linearization of attention nor attention efficiency.

Thanks for the valuable suggestion! We will emphasize this goal more clearly in the final version.


Q3: The number of in-context learning benchmarks

This paper includes 3 in-context learning datasets, each paired with 2 target-label strategies (meaningful and anonymized labels), 2 few-shot prompt strategies, 2 few-shot example ordering strategies, and 10~16 possible numbers of shots. This results in over 300 different tasks. While more benchmarks always give a better understanding, we find that the 300+ tasks in this paper are sufficient to support the in-context learning dimension within the limited space. Note that there are two other evaluation dimensions to cover within the limited pages.

Thanks for the suggestions. We will leave more in-context learning benchmarks as future work.


Q4: training + inference throughput profiling.

The primary concern regarding model efficiency and throughput appears to be based on the assumption that this paper targets model efficiency or serving, which is not the case. As mentioned earlier, this work is neither a linearization of attention nor attention efficiency (Page 3, Footnote 5). The goal of this paper is to develop new architectures capable of better learning new tasks.

Despite these clarifications, we are also happy to share the training and inference costs for our current Memory Mosaics v2 implementation.

Currently, Memory Mosaics v2 is implemented on top of FlexAttention [5] due to its flexibility. Compared to transformer implementations that have been highly optimized over many years, such as flash-attention and fused kernels, FlexAttention [5] achieves approximately 85%–90% of their speed in practice. Therefore, the training cost of Memory Mosaics v2 and the baseline Transformer models remains similar, and we do not observe any significant training or inference overhead.
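For readers who wish to reproduce such a comparison, a rough micro-benchmark could look like the sketch below. This is an illustration, not the authors' profiling setup; it assumes PyTorch >= 2.5 (for FlexAttention), a CUDA device, and arbitrary shapes, and it times only a plain causal mask rather than the full Memory Mosaics v2 masking.

```python
import time
import torch
import torch.nn.functional as F
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 1, 16, 4096, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device="cuda")
flex = torch.compile(flex_attention)  # compilation is needed for competitive FlexAttention speed

def bench(fn, iters=20, warmup=3):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

t_sdpa = bench(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True))
t_flex = bench(lambda: flex(q, k, v, block_mask=block_mask))
print(f"SDPA: {t_sdpa * 1e3:.2f} ms   FlexAttention: {t_flex * 1e3:.2f} ms")
```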

Thanks for the suggestions on model efficiency. We leave it as a future direction when our model is ready for model serving.

[5] Dong, Juechu, et al. "Flex attention: A programming model for generating optimized attention kernels." arXiv (2024).


Q5: What does the design of long/short-term memory target: the separation of functionality, or computational efficiency?

In line with the goal of this paper, the design of long/short-term memory targets the separation of functionality rather than computational efficiency.


Q6: Re Figure 1: Do the different lines correspond to different layers?

The figure is borrowed from [1], where different lines correspond to different attention/memory heads. Thanks for the suggestions! We will edit the caption to make the figure self-contained.

[1] Zhang, Jianyu, et al. "Memory mosaics." arXiv preprint arXiv:2405.06394 (2024).


Q7: Long-term memory mask is randomized during training while fixed at inference. Visualization of long/short-term memories.

The motivation for creating long/short-term memories comes from the "flat-then-peak" attention distribution in Figure 1 (right). To achieve context length extrapolation, we need to allocate the "flat" part (position-invariant) to long-term memory, and the "peak" tail to short-term memory. As such, only the size of long-term memory increases during context length extrapolation.

Because we do not know the exact boundary between the "flat" and "peak" parts, we use a randomized long-term memory mask length to further discourage the long-term memory from learning the tail of the sequence. Note that, on average, this randomization of the mask length is equivalent to a linearly decaying importance (a short derivation is given below).
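To make the "linearly decaying importance" remark concrete, here is a short derivation under the sampling scheme quoted above (an illustration, not from the paper): if the long-term delay $m$ is drawn uniformly from $\{a, \ldots, b\}$ (with $a = 64$, $b = 256$ during training), then a key at lag $d$ behind the current query is visible to the long-term memory with probability

$$
\Pr[\text{key at lag } d \text{ visible}] = \Pr[m \le d] =
\begin{cases}
0, & d < a,\\
\dfrac{d-a+1}{b-a+1}, & a \le d \le b,\\
1, & d > b,
\end{cases}
$$

so, averaged over training, the weight placed on the most recent tokens ramps down linearly as the lag shrinks from $b$ to $a$.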

We will compare and share the long/short-term attention distribution (visualization) using this randomized mask in the final version, explaining the motivation behind it and showing its benefits.


Q8: Add the number of shots in Table 1.

Thanks for pointing it out. Following the setups in Deepseek-V3 [8] and Llama3 [9], we set the number of shots for common benchmarks as shown in Table C. We will clarify the #shots setups in the final version.

| benchmark | obqa | arc-easy | winogrande | arc-challenge | piqa | boolq | hellaswag | nq | siqa | tqa | gsm8k | mmlu | human eval+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # shots | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 5 | 8 | 5 | 0 |

| benchmark | squad | bbh | math | mbpp | race middle | race high |
|---|---|---|---|---|---|---|
| # shots | 1 | 3 | 4 | 3 | 0 | 0 |

Table C: Number of shots for common benchmarks.


Q9: Suggestions on varying the number of shots for the common benchmarks in Table 1?

First, we observe in [8-9] that people do not vary the number of shots for these common benchmarks. We conducted a study to explore different numbers of shots and found that performance on these common benchmarks (e.g., MMLU) usually does not increase with more shots. Note that few-shot learning is expected to see improved performance with additional shots.

Furthermore, Deepseek-R1 [7] even observed a decreased performance as the number of shots increased. In their original words: "When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend users directly describe the problem and specify the output format using a zero-shot setting for optimal results."

These results reveal that these tasks are not classic "few-shot learning". The shots in these tasks provide the information of "problem format", rather than contextual knowledge. Therefore, this paper, as well as [8-9], doesn't vary the number of shots for these common benchmarks.

Thanks for the suggestions. We will add this clarification in the final version.

[7] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv (2025).
[8] DeepSeek-AI. "DeepSeek-V3 Technical Report". (2024).
[9] Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv (2024).


Minor suggested edits.

Thanks for the suggestions on editing! We will fix them in the final version.


Summary

Thank you for your thoughtful reviews!

Q1-Q5 raise concerns mainly based on an incorrect assumption --- namely that "this work focuses on model efficiency or model serving". We provide detailed clarifications to state our goal clearly, along with additional experimental results. These concerns should be well addressed after clarifying the goal of this work.

Q6-Q9 ask for details about the model, motivations, and experimental setups. We provide comprehensive explanations and experimental details.

Does the response address your concerns?

Comment

I thank the authors for their response. I recognize that the main focus of the work is to demonstrate a novel memory architecture to achieve benefits on recall and in-context learning tasks. However, I think it is a bit dismissive to name the paper "Memory Mosaics at Scale" (emphasizing scale in the title), and emphasize the model is trained on 1T tokens and say that efficiency is an afterthought. For a 7-10B model assuming strong throughput of ~1 billion tokens per day on an 8xA100 node and ~3 billion on a 8xH100 node, training on 1T tokens is >150K A100 GPU hours and ~50K H100 hours. How can we seriously claim that efficiency does not matter at that scale? Computational concerns were also highlighted by reviewer jbCa. Moreover, the authors quote that footnote 5 states "this work is neither an attempt at linearizing attention mechanisms nor focused on model efficiency or model serving" but in the submitted manuscript it states "It worth noting that this work is neither a linearization of attention nor attention efficiency. The recurrent feature extractor in Eq. 7 is used to create keys, while associative memory in Eq. 1 still stores all key-value pairs.". Not to be pedantic but aside from the misquote there is a difference between stating that your focus is not to make attention more efficient and to state that model efficiency is not pertinent to the work at all.

I appreciate the addition of the ruler evals. I recognize that the rebuttal period is time limited so this is reasonable, however I think it is fair for a paper studying memory to have evaluations on the HELMET suite which is a strong indicator of true in-context ability, while I do recognize the task overlap. Please consider this for the final manuscript.

Comment

Dear reviewer YUYJ,

Thank you for your thoughtful reviews and suggestions regarding model efficiency. Although this work does not focus on model efficiency or model serving, your suggestions help clarify the goal of this paper. We provide a brief summary of our responses in the Summary paragraph, outlining our answers to questions Q1–Q9.

We hope the response addresses your concerns. We also understand that a mismatch regarding the assumed goal of the paper may impair the review-response process. Thus, please feel free to let us know if you have any further concerns/questions!

Comment

Thanks for your reply! Regarding the model efficiency issue, as stated in Q4: "Therefore, the training cost of Memory Mosaics v2 and baseline Transformer models remains similar. We do not observe any significant training or inference overhead." In short, Memory Mosaics v2 does not incur an efficiency overhead compared with the Transformer.

It is the highly optimized implementations (such as flash-attention and fused kernels) that give the Transformer roughly a 20% training and inference speed advantage. One can, of course, optimize the Memory Mosaics v2 implementation by replacing FlexAttention with customized Triton/CUDA kernels to close this gap. (Note that the current Memory Mosaics v2 uses the FlexAttention [5] implementation due to its flexibility.) However, this optimization of the implementation is not the main focus of this paper. This is a common situation in computer science: for example, the Transformer was published in 2017, while an efficient implementation, flash-attention, was created in 2022.

For the term "scale in title "Memory Mosaics at Scale", there are many potentional problems to consider, including model efficiency, when scaling up a model. For example, some ideas work well on small data/model-size but fail on large data/model-size, a certain advantage of a method at small scale may also disappear at large scale, etc. Of course, training cost is also a key issue to consider during scaling up (thanks for your suggestion). The term "at scale" in the title needs to consider all these problems, not only efficiency. This paper optimizes the implementation until "model efficiency is not the overhead of Memory Mosaics v2". After that, we stop over-optimizing the implementation and focus on other problems. We will clarify this point in the final version.

Dear reviewer, we thank you for your helpful suggestion on model efficiency! We will state in the final version that this paper does not focus on the optimization of implementations (flash-attention, fused kernels). With the current implementation, Memory Mosaics v2 already incurs no efficiency overhead compared with the Transformer. We will also consider the HELMET suite in the final version (e.g., removing the overlap).

We hope the answers address your concerns. We would appreciate it if you could update your score based on these clarifications and new results.

Comment

Thank you for engaging with my concerns. While I still have lingering reservations regarding the method's efficiency and also would like to see more extensive evaluations on long-context benchmarks (e.g. HELMET) that are relevant for memory models, I believe this work still has merit and meets the acceptance bar all things considered, and thus I have raised my score from 3 to 4. I hope that the authors will take reviewer feedback to improve the camera-ready version of the manuscript.

Comment

Thanks for your suggestions and for raising the score. We will improve the final version accordingly, addressing the model efficiency and benchmark concerns.

Review
Rating: 5

This work investigates scaling up the Memory Mosaics model to LLM sizes, trained with real-world datasets. This work builds on the original Memory Mosaics model and introduces Memory Mosaics v2, with three modifications: (i) adaptive bandwidth for associative memory, (ii) gated time-variant feature extractor and (iii) 3-level memory design. Memory Mosaics v2 is trained on one trillion tokens and is thoroughly evaluated across three dimensions: (i) training-knowledge storage, new-knowledge storage, and (iii) in-context learning. The results indicate that Memory Mosaics v2 matches the performance of transformers in learning training knowledge, and outperforms transformers in acquiring new tasks during inference.

Strengths and Weaknesses

Strengths:

  • Demonstrates that the properties of the Memory Mosaics model are retained at scale, when trained on up to one trillion tokens of real-world data
  • Demonstrates superiority in learning new tasks compared to transformers, even when transformers are trained with substantially more data.
  • The new architectural changes are well motivated and offer a simpler way to understand the dynamics of the models
  • Comprehensive evaluation

Weaknesses:

  • Lacks analysis on computational overhead compared to transformers

Questions

  • How would this model work on generative tasks and natural language modeling?
  • Do these architectural changes incur any computational overhead during inference? How does the performance compare under the same amount of FLOPs?

Limitations

Yes

Final Justification

I believed the work was already solid, with minor suggestions which had been addressed in a convincing manner during the rebuttal. The question I raised was based on my initial misunderstanding, and it was sufficiently clarified during the rebuttal.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful reviews and useful suggestions! It is a pleasure to discuss these questions.


W1 & Q1: Lacks analysis on computational overhead compared to transformers. Do these architectural changes incur any computational overhead during inference? How does the performance compare under the same amount of FLOPs?

Thank you for pointing out the need for computation analysis of Memory Mosaics v2 and transformers. We have now included #FLOPs and #parameter analysis for Memory Mosaics v2 and transformers across all three evaluation dimensions. The results demonstrate that Memory Mosaics v2 achieves better performance under comparable FLOPs/parameters or fewer FLOPs/parameters under comparable performance.

  • Persistent-knowledge storage and retrieval: Table 2 in our paper shows that after removing long-term memory, Memory Mosaics v2 contains fewer parameters (8.3B vs 8.8B) and FLOPs (15.6B vs 16.7B) than the Transformer. Despite this reduction, Memory Mosaics v2 still achieves comparable performance to the Transformer. FLOPs/token is estimated at context length 256 via the approach of [1].

  • New-knowledge storage and retrieval: This paper compares Memory Mosaics v2 and the Transformer at two scales in Table 4, allowing us to study the relationship between #params/#FLOPs and performance. Table A summarizes the #params, #FLOPs, and performance of these models. By plotting a curve between #params/#FLOPs and performance using the data in Table A, one can easily observe that Memory Mosaics v2 outperforms the Transformer by over 10% under any #params or #FLOPs configuration. FLOPs/token is estimated at context length 32k via the approach of [1] (a rough estimation sketch is given after this list).

| model | # params | # FLOPs per token | ruler question-answering (32k task-length) |
|---|---|---|---|
| transformer small | 1.5B | 4.4B | 22.1 |
| transformer large | 8.8B | 25.2B | 41.1 |
| Memory Mosaics v2 small | 1.6B | 4.7B | 53.4 |
| Memory Mosaics v2 large | 9.9B | 27.4B | 53.4 |

Table A: Ruler question-answering performance of transformer and Memory Mosaics v2 at different parameters and FLOPs scales. Memory Mosaics v2 outperforms transformer by over 10% under the same #params or #FLOPs.

  • In-context learning: Similar to the comparison for new-knowledge storage and retrieval, Table B summarizes the #params, #FLOPs/token, and performance comparison on in-context learning tasks (TacRed 10-shots). Again, by plotting a curve between #params/#FLOPs and performance using the data in Table B, one can easily observe that Memory Mosaics v2 outperforms the Transformer by over 10% under the same #params or #FLOPs configurations. (A similar conclusion also holds for other in-context learning tasks and other numbers of shots.)

| model | # params | # FLOPs per token | in-context learning (tacred 10-shots) |
|---|---|---|---|
| transformer small | 1.5B | 4.4B | 23.0 |
| transformer large | 8.8B | 25.2B | 50.0 |
| Memory Mosaics v2 small | 1.6B | 4.7B | 45.5 |
| Memory Mosaics v2 large | 9.9B | 27.4B | 63.0 |

Table B: In-context learning performance of transformer and Memory Mosaics v2 at different parameters and FLOPs scales. Memory Mosaics v2 outperforms transformer by over 10% under the same #params or #FLOPs.
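For readers who want to roughly reproduce FLOPs/token numbers of this kind, a Kaplan-style forward estimate in the spirit of [1] is sketched below. This is an illustration only; the layer count and width are hypothetical placeholders rather than the paper's actual configurations, and the exact bookkeeping in [1] may differ slightly.

```python
def flops_per_token(non_embedding_params, n_layer, d_model, n_ctx):
    # 2 FLOPs (multiply + add) per non-embedding weight, plus the attention
    # score and attention-weighted-sum terms, which grow with context length.
    dense = 2 * non_embedding_params
    attention = 2 * n_layer * n_ctx * d_model
    return dense + attention

# e.g. a hypothetical ~8B-parameter transformer evaluated at a 32k context
print(flops_per_token(non_embedding_params=8.3e9, n_layer=32, d_model=4096, n_ctx=32768) / 1e9)
```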

Thanks for the useful suggestions! We will add figures for these comparisons and analyses in the final version.

[1] Adam Casson. Transformer flops. 2023


Q2: How would this model work on generative tasks and natural language modeling?

The in-context learning and new-knowledge storage & retrieval tasks are evaluated in a generative manner. Specifically, we allow Memory Mosaics v2 to generate text freely, and then check whether the generated text correctly answers the question.

Memory Mosaics v2 handles generative tasks and natural language modeling similarly to transformers: it first pre-fills the prompt (e.g., a question), stores the KV-cache of the prompt, and then generates text using this stored KV-cache.
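A minimal sketch of this prefill-then-decode loop is given below. The model and tokenizer interface (prefill, decode_step, encode, decode) is hypothetical and used only for illustration; greedy decoding is assumed.

```python
def generate(model, tokenizer, prompt, max_new_tokens=64):
    # Prefill: run the whole prompt once and store its key-value memories.
    ids = tokenizer.encode(prompt)
    logits, cache = model.prefill(ids)
    next_id = int(logits[-1].argmax())

    # Decode: feed one token at a time, reusing the cached memories.
    out = []
    for _ in range(max_new_tokens):
        out.append(next_id)
        logits, cache = model.decode_step(next_id, cache)
        next_id = int(logits.argmax())
    return tokenizer.decode(out)
```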

Meanwhile, other related techniques in transformers, such as KV hashing and quantization, can also be directly applied to Memory Mosaics v2.


Summary

Thanks for your thoughtful reviews!

Q1 highlights the need for computational analysis (e.g., FLOPs). We address this point by adding FLOPs and parameter analysis on all three evaluation dimensions.

Q2 raises a concern about the text generation process of Memory Mosaics v2. We clarify that the generation process of Memory Mosaics v2 is similar to that of transformers, and that many evaluation tasks in this paper are assessed in a generative manner.

We hope the response addresses your concerns!

Comment

Dear Reviewer yHif,

Thanks for your thoughtful reviews and helpful suggestions! We have added a computational analysis and a clarification of the text generation process to address your concerns. We hope our answers sufficiently address them.

Comment

Thank you for the clarification. I think the FLOPs addition, although minor, could strengthen the analysis and benefits of the suggested architecture, especially when comparing performance between two models. The authors have also adequately clarified my misunderstanding when raising the second question.

Comment

We are pleased to see that our answers address your concerns. Many other readers may share similar questions, so your suggestions have helped us improve the quality of this paper. We would appreciate it if you could update your score based on these clarifications and new results.

Review
Rating: 5

This paper builds on the key-value associative memory architecture known as Memory Mosaics, which addressed some of the inherent limitations of standard Transformers in storing and retrieving long-term knowledge. While the original Memory Mosaics showed promise in improving memory retention and compositional generalization, it was previously explored only in medium-scale settings and synthetic tasks. To overcome these constraints, this paper proposes Memory Mosaics v2, an enhanced version that scales the framework to large models and real-world tasks. Specifically, it introduces three key features: a multi-level memory hierarchy, an adaptive bandwidth, and a gated key extractor. Through extensive experiments, the paper shows that Memory Mosaics v2 consistently achieves performance comparable to standard Transformers while offering stronger memory retention and in-context learning abilities. Additionally, the ablation studies provide insightful analyses of each component’s contribution to the overall framework.

Strengths and Weaknesses

Strengths

  • S1. The paper is well-motivated, addressing the limitations of the existing Memory Mosaics architecture.
  • S2. The proposed architecture successfully scales the Memory Mosaics framework to realistic large-model settings (e.g., LLaMA-8B), bridging the gap between small-scale memory-augmented models and production-scale large language models.
  • S3. The comprehensive experiments across persistent knowledge, new knowledge storage, and in-context tasks demonstrate that the model consistently matches or exceeds the performance of standard Transformers, with particularly strong results on tasks requiring long-context reasoning.
  • S4. The ablation studies and analyses thoughtfully examine the contributions of each architectural component, providing valuable insights into the practical impact and trade-offs of the proposed design.

Weaknesses

  • W1. Compared to a standard Transformer, the proposed model requires more parameters to achieve comparable performance, which may limit its efficiency in some scenarios.
  • W2. While the paper demonstrates the effectiveness of the model’s capacity for in-context learning, it would be helpful to include more detailed explanations, experiments, or analyses on how the capacity scales with the number of Mosaic v2 parameters. This would better clarify where the proposed memory structure provides its main advantages.
  • W3. In Section 5.2, the paper mentions comparisons with memory compression algorithms, but does not provide detailed quantitative results. Including concrete empirical comparisons would make the strengths and trade-offs of the Memory Mosaics v2 approach clearer.

Questions

In Section 5.2, the paper discusses the failures of many potential baselines. I would like to hear the authors’ thoughts from a slightly different perspective.

From the Memory Mosaics architecture viewpoint, the model can be seen as consisting of contextual associative memory and persistent memory components. One could interpret many memory compression algorithms (which act like short-term models due to their difficulty recalling past information) as enhancing the long-term context understanding of the contextual associative memory, ultimately enabling more effective retrieval from the persistent memory.

If this interpretation aligns with the authors’ view, would it make sense to also compare the proposed model with memory compression algorithms in the experiments for Section 5.1 (persistent-knowledge storage and retrieval), in addition to the comparison in Section 5.2?

Limitations

Some of the limitations are discussed in the Weaknesses and Questions sections.

Final Justification

They provided additional experimental results showing that Mosaic v2 outperforms the Transformer while requiring only a modest increase in parameters (0.1B for the small model and 1.1B for the large model). They also clarified the design of Mosaic v2 and presented empirical evidence supporting their claims. Based on this, I have increased my score from 4 to 5. I encourage the authors to incorporate the key points from the rebuttal into the camera-ready version of the manuscript.

Formatting Issues

I don't see any major formatting issues in this paper.

Author Response

Thank you for your thoughtful reviews! After reading the weaknesses W1-W3, we find that all of them can be effectively addressed by improving clarity and providing more experimental evidence.


W1: Compared to a standard Transformer, the proposed model requires more parameters to achieve comparable performance.

Actually, Memory Mosaics v2 achieves better performance under comparable #FLOPs/#parameters, or fewer #FLOPs/#parameters under comparable performance across all three evaluation dimensions.

  • Persistent-knowledge storage and retrieval: Table 2 in our paper shows that after removing long-term memory, Memory Mosaics v2 contains fewer parameters (8.3B vs 8.8B) and FLOPs (15.6B vs 16.7B) than the Transformer. Despite this reduction, Memory Mosaics v2 still achieves comparable performance to Transformers.

  • New-knowledge storage and retrieval: This paper compares Memory Mosaics v2 and the Transformer at two scales in Table 4, allowing us to study the relationship between #params/#FLOPs and performance. Table A summarizes the #params, #FLOPs, and performance of these models. By plotting a curve between #params/#FLOPs and performance using the data in Table A, one can easily observe that Memory Mosaics v2 outperforms the Transformer by over 10% under any #params or #FLOPs configuration. (For ease of reading, we will use a figure to showcase this result in the final version.)

| model | # params | # FLOPs per token | ruler question-answering (32k task-length) |
|---|---|---|---|
| transformer small | 1.5B | 4.4B | 22.1 |
| transformer large | 8.8B | 25.2B | 41.1 |
| Memory Mosaics v2 small | 1.6B | 4.7B | 53.4 |
| Memory Mosaics v2 large | 9.9B | 27.4B | 53.4 |

Table A: Ruler question-answering performance of transformer and Memory Mosaics v2 at different parameters and FLOPs scales. Memory Mosaics v2 outperforms transformer by over 10% under the same #params or #FLOPs.

  • In-context learning: Similar to the comparison for new-knowledge storage and retrieval, Table B summarizes the #params, #FLOPs, and performance comparison on in-context learning tasks (TacRed 10-shots). Again, by plotting a curve between #params/#FLOPs and performance using the data in Table B, one can easily observe that Memory Mosaics v2 outperforms the Transformer by over 10% under the same #params or #FLOPs configurations.

| model | # params | # FLOPs per token | in-context learning (tacred 10-shots) |
|---|---|---|---|
| transformer small | 1.5B | 4.4B | 23.0 |
| transformer large | 8.8B | 25.2B | 50.0 |
| Memory Mosaics v2 small | 1.6B | 4.7B | 45.5 |
| Memory Mosaics v2 large | 9.9B | 27.4B | 63.0 |

Table B: In-context learning (tacred 10-shots) performance of transformer and Memory Mosaics v2 at different parameters and FLOPs scales. Memory Mosaics v2 outperforms transformer by over 10% under the same #params or #FLOPs.

Even though W1 is not a genuine weakness of this paper, it helps improve the writing of the paper. Thank you for pointing it out! We will add these results (using figures), as well as clarifications, in the final version.


W2.1: How does the capacity scale with the number of Mosaic v2 parameters?

The previous answer, in Tables A and B, showcases how capacity scales with the number of Memory Mosaics v2 parameters and FLOPs, demonstrating that Memory Mosaics v2 outperforms the Transformer by over 10% under the same #params or #FLOPs. We will add these results in the final version, addressing this concern.


W2.2: Clarify where the proposed memory structure provides its main advantages.

The proposed long/short-term memories structure in Memory Mosaics v2 is built based on the "flat-then-peak" attention distribution observed in Memory Mosaics [1] (Figure 1 in this paper). To achieve context length extrapolation, we need to allocate the "flat" part (position-invariant) to long-term memory, and the "peak" tail to short-term memory. This ensures that only the size of long-term memory increases during context length extrapolation.

For handling in-context learning tasks (few-shot learning), a fair treatment is required for different shots/examples/demonstrations, regardless of their position in a sequence. The position-invariant nature of long-term memory provides this fairness.

In contrast, transformer attention lacks this "flat" attention distribution (Figure 1 left in our paper). Therefore, transformers struggle with context length extrapolation and perform poorly on in-context learning tasks.

We will visualize the attention distribution of Memory Mosaics v2 with long/short-term memories in the final version, addressing the concern of long/short-term memories.

[1] Zhang, Jianyu, et al. "Memory mosaics." arXiv preprint arXiv:2405.06394 (2024).


W3: Include experimental evidence to support the claim that "Memory compression algorithms are weak in New-knowledge storage & retrieval and in-context learning tasks."

Weakness of memory compression algorithms in New-knowledge storage & retrieval:

The RULER paper [4] has already studied the weaknesses of memory compression methods, such as Mamba and RWKV, via experimental comparison (Figure 4). Here is a summary:

| model | task-length 1k | task-length 2k | task-length 4k |
|---|---|---|---|
| Llama2-7b | 96.0 | 91.6 | 95.0 |
| rwkv-v5-7b | 87.5 | 73.7 | 51.4 |
| Mamba-2.8b-slimpj | 62.6 | 52.6 | - |

Table 1: Weakness of Mamba and RWKV on the RULER benchmark.

In the original words of the RULER paper, "Finally, we show that non-Transformer architectures, such as RWKV and Mamba, still lag behind Transformer by large margins on RULER."

Note that the two question-answering RULER tasks used in our paper are notably more challenging than other "needle in a haystack" tasks in RULER, like multi-/single-key retrievals. The information entropy of these multi-/single-key retrieval tasks is extremely low, making these tasks easy to compress, allowing many models to easily achieve >99% performance (see Appendix E and F in RULER [4]).

Weakness of memory compression algorithms in in-context learning:

[5] has already experimentally shown the weakness of memory compression methods, such as Mamba and RWKV, in in-context learning. (Note that this paper uses the same in-context learning data as in [5].) Here is a summary:

| model | 1 shot | 2 shots | 3 shots | 4 shots | 5 shots |
|---|---|---|---|---|---|
| Qwen-1.5-7B-base 7B | 38.7 | 47.3 | 45.2 | 43.6 | 40.6 |
| Mistral-7B-v0.2-base | 53.3 | 53.1 | 51.6 | 48.0 | 42.3 |
| RWKV-5-World 7B | 2.3 | 2.6 | 1.0 | 0 | 1.2 |
| Mamba-2.8B | 0 | 0 | 0 | 0 | 0 |

Table 2: Weakness of Mamba and RWKV on an in-context learning benchmark (TacRed classification).

Memory compression algorithms, such as Mamba and RWKV, completely fail on the in-context learning tasks.

We will refer to these results to experimentally support the claim (weakness of memory compression methods) in the final version.

[4] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).
[5] Li, Tianle, et al. "Long-context llms struggle with long in-context learning." arXiv preprint arXiv:2404.02060 (2024).


Q1: Perspective on memory compression algorithms

First, we agree that the memory and computation efficiency are significant aspects of memory compression algorithms. Our gated recurrent key feature extractor is also inspired by memory compression algorithms.

From your perspective, you mentioned: "One could interpret many memory compression algorithms [...] as enhancing the long-term context understanding of the contextual associative memory, ultimately enabling more effective retrieval from the persistent memory." Note that memory compression algorithms don't contain a true "long-term context" analogous to either the long-term memory in Memory Mosaics v2 or attention in transformers. (Otherwise, these algorithms would not be considered memory "compression".)

Due to the absence of a mechanism capable of storing contextual information over large amounts of new information, memory compression algorithms face difficulties with tasks like new-knowledge storage & retrieval and in-context learning.

For persistent-knowledge storage and retrieval tasks, we agree that memory compression algorithms perform well. In fact, a rich literature (e.g., Mamba, RWKV, xLSTM) has already experimentally verified this point. In our paper, Table 2 ("memory mosaics v2 large without long-term memory") also supports this point. Note that "memory mosaics v2 large without long-term memory" uses a fixed-size sliding window, and thus acts like a memory compression algorithm.

Thanks for the suggestion! We will clarify that "memory compression algorithms (Mamba, RWKV, xLSTM, etc) perform well on persistent-knowledge storage and retrieval" in the final version, using rich experimental evidence from literature.


Summary

W1–W2 point out issues related to clarification. We have revised the text to improve clarity, added experimental details, and reorganized the experimental results accordingly.

W3 highlights the lack of experimental evidence supporting our claims about the weaknesses of memory compression algorithms. Thanks for pointing it out! We now refer to the extensive experimental evidence provided in [4] and [5] to support our argument.

Q1 raises a question regarding the perspective on memory compression algorithms. We have included a corresponding discussion in the revised version.

Does the response address your concerns?

Comment

Dear reviewer jbCa,

Thank you for your helpful suggestions. We've added additional clarifications and experimental results to address your concerns. We are eager to hear whether these additions adequately resolve them. Please don't hesitate to let us know if you have any further questions! We would appreciate it if you could update your score based on these clarifications and new results.

Comment

Thank you for the rebuttal and the detailed response.

The authors have adequately addressed the concerns raised during the review process. They provided additional experimental results showing that Mosaic v2 outperforms the Transformer while requiring only a modest increase in parameters (0.1B for the small model and 1.1B for the large model). They also clarified the design of Mosaic v2 and presented empirical evidence supporting their claims. Based on this, I have increased my score from 4 to 5. I encourage the authors to incorporate the key points from the rebuttal into the camera-ready version of the manuscript.

Comment

Thanks for the suggestions and for raising the score. We will add the computation analysis, the experimental results in Tables 1 & 2, and other key points from the rebuttal to the final version.

Review
Rating: 5

This paper updates the "Memory Mosaics" architecture, an alternative to widespread autoregressive transformer language model architecture that replaces attention with an associative memory module. This paper scales up the architecture and benchmarks it on larger-scale language modeling tasks, introducing three architectural changes designed to improve language modeling performance. First, it makes the kernel smoothing in the memory adaptive to the number of examples stored in the memory. Then, it introduces a recurrent gate mechanism in front of the "key" computation, designed to allow the model to learn key representations that are flexible to extra tokens in the sequence (as you might want to do when moving from timeseries to language modeling). Finally, it uses attention masking to separate long-term and short-term memory modules. It also sets aside the unified persistent associative memory in favor of going back to a traditional feedforward MLP module interleaved with the contextual memory instead of unified.

The paper then measures the resulting architecture in three ways. First, it does standard benchmark evaluations at two model sizes, comparing to a standard transformer at similar sizes across 19 standard benchmarks, including six that appear to be very sensitive to long-term memory, showing comparable performance and interesting wins on the long-term tasks. Then it does a RULER "needle in a haystack" evaluation, showing very strong wins in both in-domain and extrapolated-length performance, testing up to 8x or 16x the training context length. Finally, it tests on a suite of ICL text classification tasks with both meaningful and anonymized target labels, and shows significant wins over the similar transformers.

Strengths and Weaknesses

Strengths: Memory mosaics is a very pretty simplification of the transformer attention architecture, and this paper shows that it holds promise in being able to perform just as well, or even better, on large-scale language modeling when compared to an autoregressive transformer. The introduced architectural modifications on top of MMv1 are all sensible and nicely motivated, the performance measurements are well-chosen, and the performance wins on ICL and needle-in-haystack tasks are noteworthy.

Weaknesses: The argument is not fully convincing that the MM architecture is responsible for the observed ICL/long-context performance wins over transformers. In particular, the paper leaves the reader with the unanswered question: could the same benefits be achieved without changing the attention mechanism, but by simply enforcing long- and short-term masks on traditional transformer attention? No experiment to apply the simple separation between long/short attention to transformers is done. To make the case, a long/short attention baseline should be compared, to give transformers a fair shot.

Some ablations are missing. Several innovations over MMv1 are introduced, but ablations are only shown for the long/short innovation, which leaves open the question "are all three sets of extra parameters needed? what is the benefit of each?" The adaptive smoothing and the learnable gate for keys both seem like reasonable ideas, but it is not measured how much benefit each of them provides. Even a small measurement would be helpful to justify and motivate the added complexities.

Another little trick is hidden, which is that the long-range memory is trained with a randomized mask but then tested at inference with a fixed mask. What motivates this trick, and how important is it? Some ablation or evidence should be shared.

The presentation of the distinction between long vs short memories is not sufficiently precise, given its centrality to the method. It should be presented in formula form in addition to the Fig 2 picture. For example, the masking, which seems like the key to their operation, should be included in the equations, for example by showing precisely how the masks are used to modify eq 5, and then how the masks are different for long and short memories; if there are any differences besides masking, they should be explained. Similarly, after training what are the effects of the change in masking? If it leads short- and long-term memory to learn something qualitatively different, insights or visualizations on the different behavior or differently-learned parameters would be worthwhile. Or a toy-model motivation could help clarify.

A paragraph is devoted to the weak RULER performance of linear alternatives to attention such as Mamba and RWKV, making some (logical) claims about their weaknesses, but no measurements are shown; if this comment is made, at least the appendix should quantify the weak performance on this benchmark by recapping or reproducing it. Or the paragraph should be rewritten to attribute the claims to a specific previously published result. Related to that: since a lot of research attention is spent on linear attention schemes that attempt to reduce inference scaling costs, the paper should clarify that that is not the goal of this method (if so). It seems like the goal is the opposite, which is to improve long-term behavior and ICL using a similar computational budget as transformers.

Questions

[original questions below; after discussion my main concerns have been addressed.]

  • Would transformers with traditional attention achieve similar performance wins if attention heads were similarly split between short- and long-range attention through a similar masking scheme?
  • What is the empirical benefit of each of the innovations, and are all needed? What is the performance benefit of adaptive smoothing? What is the performance benefit of the gated recurrent key computation? Why is long-term memory trained with randomly-changing masks while being tested with fixed masks? Can ablations or other measurements be shown to justify these added complexities?
  • Exactly where in the formulation is masking applied to distinguish between long- and short-range memories? This should be clarified in the equations, as well as any other differences between the two types of memories.
  • After training, do the two types of memories learn different patterns of parameters or different kinds of behavior in a way that can be clearly visualized or quantified? For example, MMv1 noted that "induction heads" can be implemented in a single layer. In MMv2, do these induction heads show up in the short- or long-range memories, or both?
  • Can you clarify the goals of MM in the context of the other literature that also aims to replace attention. (Such as Mamba/RWKV etc which focus on inference performance, or other attention alternatives such as sparse attention.) Does MM have a different goal? What is the goal, and how does it relate to that other work?

Limitations

Yes.

Final Justification

The MM architecture provides some interesting insights over traditional transformers, and this work both shows that performance can be competitive while clarifying several aspects of the architecture. Support accepting the paper.

Formatting Issues

No issues.

Author Response

Thank you for your thoughtful reviews and valuable suggestions! It was a pleasure to answer your questions.


Q1: Could the same benefits be achieved without changing the attention mechanism, but by simply enforcing long/short-term masks on traditional transformer attention?

The long/short-term memory components in Memory Mosaics v2 are built based on the nice "flat-then-peak" attention distribution in Memory Mosaics (Figure 1 in our paper). Unfortunately, transformer attention lacks this nice attention distribution, due to its position-encoding and asymmetric query-key in transformers. Therefore, simply adding a long/short-term mask cannot help transformer attention to achieve the same in-context learning benefits.

This paper builds upon Memory Mosaics [1], which has already shown that the position-encoding (as well as corresponding asymmetric query-key) in transformer attention results in a position-dependent attention distribution (Figures 14 & 15 in [1]). This position-dependent behavior impairs transformer's ICL capabilities, as shown in Figures 8 & 9 of [1]. Simply adding long/short-term masks does not address this fundamental issue and thus fails to improve ICL performance.

We will visualize the attention distribution of MMv2 with long/short-term masks in the final version, addressing the concern of applying long/short-term masks to transformers.

[1] Zhang, Jianyu, et al. "Memory mosaics." arXiv preprint arXiv:2405.06394 (2024).


Q2: Motivation and Performance benefit of adaptive smoothing, gated recurrent key computation, long-/short-term memory?

Training Memory Mosaics v2 Large and Transformer Large on 1 trillion tokens requires over 4 GPU years with H100 GPU. Comprehensive ablations at this scale are not feasible within weeks. Fortunately, we can better address the concern using conceptual and practical evidence.

These three components are conceptually necessary and practically helpful:

  • Adaptive Smoothing (bandwidth): In memory-based methods like kernel smoothing, the optimal bandwidth depends on the number of examples and must be adaptive. Rich literature [2-3] supports this need for an adaptive bandwidth (see lines 89-91 for details). Furthermore, the adaptive bandwidth in MMv2, $\beta_0 + \beta_1 n^{\alpha}$, degenerates to a constant bandwidth as in MMv1 when $\alpha = 0$, providing a chance to learn a constant bandwidth (a toy sketch is given after this list).
    Practically, MMv2 learns a positive $\alpha$ (0.2~0.4) after training, and achieves a significantly lower training loss and better performance on the common benchmarks (persistent-knowledge storage and retrieval).

  • Gated recurrent key feature extractor: the gated recurrent key feature extractor is time-variant, enabling the model to extract semantic meanings without being constrained by sequence structure. An example is shown in line 99, where “tom-and-jerry” and “tom---and---jerry” share the same semantic meaning despite differing in sequence structure. Rich literature in recurrent-style networks (e.g., RWKV, mamba) has experimentally verified the benefit of the time-variant feature extractor (line 100).
    Practically, this gated recurrent key feature extractor helps MMv2 (small scale) on common benchmarks (+0.5%) and RULER benchmarks (+8%) when trained on 4k context and evaluated on 32k.

  • Long/short-term associative memories: The motivation and benefits of long-/short-term memory have been discussed in previous answers.
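As referenced in the first bullet above, a toy sketch of the adaptive bandwidth is given here. It is an illustration under stated assumptions: the bandwidth is applied as a multiplicative temperature on dot-product scores, and beta0, beta1, alpha are plain floats rather than the learned parameters of the paper; the exact kernel form follows the paper's equations, which are not reproduced here.

```python
import torch

def adaptive_kernel_weights(query, keys, beta0=1.0, beta1=0.1, alpha=0.3):
    # query: (d,), keys: (n, d) -- the n key vectors currently stored in memory.
    n = keys.shape[0]
    bandwidth = beta0 + beta1 * n ** alpha   # grows with memory size; alpha = 0 recovers MMv1
    scores = bandwidth * (keys @ query)      # sharper smoothing as more pairs are stored
    return torch.softmax(scores, dim=-1)     # weights used to blend the stored values
```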

We will include these clarifications and results in the final version.

[2] Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. The elements of statistical learning, 2009.
[3] E. García-Portugués. Notes for Nonparametric Statistics. 2024.


Q3: Long-term memory mask is randomized during training while fixed at inference. Visualization of long/short-term memories.

The motivation for creating long/short-term memories comes from the "flat-then-peak" attention distribution as shown in Figure 1 (right plot). To achieve context length extrapolation, we need to allocate the "flat" part (position-invariant) to long-term memory, and the "peak" tail to short-term memory. As such, only the size of long-term memory increases during context length extrapolation.

Because we do not know the exact boundary between the "flat" and "peak" parts, we use a randomized long-term memory mask length to further discourage the long-term memory from learning the tail of a sequence. Note that, on average, this randomization of the mask length is equivalent to a linearly decaying importance.

We will compare and share the long/short-term attention distribution (visualization) using this randomized mask in the final version, explaining the motivation behind it.


Q4: Formal equation for long vs short memories.

Thanks for the valuable suggestion! Following Eq. 4, we express long/short memories as:

short-term memory: $f(k;\ \{(k_{\max(t-h,1)}, v_{\max(t-h,1)}),\ \ldots,\ (k_{t-1}, v_{t-1})\})$

long-term memory: $f(k;\ \{(k_1, v_1),\ \ldots,\ (k_{t-m}, v_{t-m})\})$, where $t - m \geq 1$

There are no other differences besides the mask of long/short-term memories. We will add these equations in the final version.
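To connect these equations to an implementation, the two masks can be written as FlexAttention-style predicates as sketched below. This is an illustration, not the authors' code; the 0-based indexing convention is an assumption, and the values h = 256, m = 64 follow the appendix quoted earlier.

```python
h, m = 256, 64  # short-term window and long-term delay, as quoted in the appendix

def short_term_mask(b, head, q_idx, kv_idx):
    # keys (k_{t-h}, ..., k_{t-1}): a strictly causal trailing window of width h
    return (kv_idx < q_idx) & (kv_idx >= q_idx - h)

def long_term_mask(b, head, q_idx, kv_idx):
    # keys (k_1, ..., k_{t-m}): everything delayed by at least m steps
    return kv_idx <= q_idx - m
```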


Q5: Weakness of Mamba and RWKV on RULER benchmark

Weakness of memory compression algorithms on Ruler:

The RULER paper [4] has already studied the weaknesses of memory compression methods, such as Mamba and RWKV, via experimental comparison (Figure 4). Here is a summary:

| model | task-length 1k | task-length 2k | task-length 4k |
|---|---|---|---|
| Llama2-7b | 96.0 | 91.6 | 95.0 |
| rwkv-v5-7b | 87.5 | 73.7 | 51.4 |
| Mamba-2.8b-slimpj | 62.6 | 52.6 | - |

Table 1: Weakness of Mamba and RWKV on the RULER benchmark.

In the original words of the RULER paper, "Finally, we show that non-Transformer architectures, such as RWKV and Mamba, still lag behind Transformer by large margins on RULER."

Note that the two question-answering RULER tasks used in our paper are notably more challenging than other "needle in a haystack" tasks in RULER, like multi-/single-key retrievals. The information entropy of these multi-/single-key retrieval tasks is extremely low, making them easy to compress, allowing many models to easily achieve >99% performance (see Appendix E and F in RULER [4]).

Weakness of memory compression algorithms on in-context learning:

[5] has already experimentally shown the weakness of memory compression methods, such as Mamba and RWKV, in in-context learning. (Note that our paper uses the same in-context learning datasets as in [5].) Here is a summary:

| model | 1 shot | 2 shots | 3 shots | 4 shots | 5 shots |
|---|---|---|---|---|---|
| Qwen-1.5-7B-base 7B | 38.7 | 47.3 | 45.2 | 43.6 | 40.6 |
| Mistral-7B-v0.2-base | 53.3 | 53.1 | 51.6 | 48.0 | 42.3 |
| RWKV-5-World 7B | 2.3 | 2.6 | 1.0 | 0 | 1.2 |
| Mamba-2.8B | 0 | 0 | 0 | 0 | 0 |

Table 2: Weakness of Mamba and RWKV on an in-context learning benchmark (TacRed classification).

We will refer to these results to experimentally support the claim (weakness of memory compression methods) in the final version.

[4] Hsieh, Cheng-Ping, et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?." arXiv preprint arXiv:2404.06654 (2024).
[5] Li, Tianle, et al. "Long-context llms struggle with long in-context learning." arXiv preprint arXiv:2404.02060 (2024).


Q6: The goal of this work

Page 3 Footnote 5 claims: This work is neither a linearization of attention nor attention efficiency.

The goal of this paper is to develop new architectures that can better learn new tasks. We focus neither on the performance of persistent-knowledge storage and retrieval (the many common benchmarks), nor on model efficiency for model serving, throughput, or memory efficiency, even though these can be future research directions. The three evaluation dimensions in our paper are, therefore, designed to assess the new-task-learning capability.

Thanks for the valuable suggestion! We will emphasize our goal more clearly in the final version.


Summary

Q1–Q3 and Q6 concern clarification and explanation issues that could potentially impair readability and understanding. Thanks for pointing them out! We have provided more detailed explanations to address these concerns.

Q4 highlights the need for formal equations to make the paper scientifically sound; we have added the relevant equations accordingly.

Q5 points out the need for experimental evidence to support our claim regarding the "weakness of memory compression methods." To address this, we refer to the rich experimental evidence provided by the Ruler benchmark [4] and the in-context learning benchmark [5]. Note that these two benchmarks are the ones used in this paper.

We hope our responses sufficiently address your concerns!

Comment

Dear reviewer vHo6,

We find your review thoughtful and concrete! It is a pleasure to read, reply, and discuss. Since the deadline of the author-reviewer discussion period is approaching, we are eager to hear your feedback. Do our responses address your concerns? Do you have other questions? Please feel free to let us know.

Comment

Dear reviewer vHo6,

Thank you for your thoughtful reviews. We provide a brief summary of our responses in the Summary paragraph, outlining our answers to questions Q1–Q6.

We hope our answers sufficiently address your concerns. However, please feel free to let us know if you have any further questions! We would appreciate it if you could update your score based on these clarifications and new results.

Final Decision

The paper scales Memory Mosaics to 10B parameters on 1T tokens. It does this by introducing adaptive bandwidth, a gated key extractor, and a three-level memory. It matches transformers on standard benchmarks, and it has advantages for long-context retrieval and few-shot in-context learning.

Strengths: The architecture is simple, principled, and results in improvements over the original version of Memory Mosaics. Evaluation seems to be done carefully to separate persistent knowledge, new memory, and ICL. The findings directly challenge the prevailing belief that scaling transformers and data is sufficient, and their architecture outperforms transformers by up to 10% (at least in their experiments).

Weaknesses: The paper lacks a transformer baseline with the same masking, making it hard to disentangle the benefits of associative memory.

Discussion & rebuttal: Reviewers requested masking baselines, more ablations, and stronger efficiency profiling. The rebuttal provided FLOPs and parameter tables, clarified mask construction, and added extended results. While some issues remain, the main findings were strengthened, and most concerns seem to have been addressed.

Decision: Accept. The paper demonstrates an alternative to transformers with clear benefits in its experiments.