Resona: Improving Context Copying in Linear Recurrence Models with Retrieval
We improve the in-context performance of linear recurrent models by augmenting them with a parallel cross-attention branch that can mix in information from the context.
Abstract
Reviews and Discussion
While Linear Recurrent Models (LRMs) demonstrate computational efficiency advantages, their fixed-size hidden states create an information bottleneck that limits effective utilization of long-context information. To address this, we propose Resona, a framework that enhances LRMs' in-context learning capabilities through a retrieval mechanism. This approach bypasses the hidden state constraints of traditional LRMs by enabling direct information flow from input contexts to model reasoning. Extensive experiments demonstrate that Resona significantly improves LRMs' performance without requiring major architectural modifications, offering plug-and-play usability. I recommend clear accept.
Reasons to Accept
- Resona focuses on in-context retrieval rather than external databases, differentiating it from traditional RAG methods. By integrating retrieval mechanisms with LRMs, it fills an architectural gap and achieves outstanding performance on long-sequence tasks.
- The paper formally analyzes the information bottleneck caused by fixed-size hidden states in LRMs. The Rank Limitation of Linear Recurrence analysis in the appendix mathematically explains LRMs' limitations, providing theoretical grounding for the methodology.
- Comprehensive experiments confirm Resona's significant performance improvements across multiple benchmarks.
- The framework requires minimal architectural changes to existing LRMs and supports plug-and-play deployment.
Reasons to Reject
- The impacts of critical parameters (chunk size U, number of retrievals k, mixing coefficient α) on performance lack systematic analysis.
- Insufficient discussion about Resona's adaptations for different LRM architectures (e.g., Mamba/Hyena/RWKV), particularly regarding hidden state interactions and retrieval module parameterization.
- While claiming comparable inference efficiency to baselines, the paper lacks empirical comparisons of FLOPs, memory consumption, or practical latency (e.g., pre-filling phase chunk encoding overhead).
We greatly appreciate the reviewer's feedback. We are glad that they appraise this work highly, and we welcome the points they raise that can improve its quality. We hope that the following details are sufficient to address their questions.
The impacts of critical parameters on performance lack systematic analysis.
The following table compares the influence of the mixing coefficient α in our method, tested on the MAD Lab suite of tasks with a Mamba2 backbone.
| Model | Compression (%) | Fuzzy Recall (%) | In-context Recall (%) | Noisy Recall (%) | Selective Copying (%) | Average (%) |
|---|---|---|---|---|---|---|
| Transformer | 53.4 | 69.4 | 96.7 | 97.2 | 98.7 | 83.1 |
| Mamba2 | 43.6 | 21.1 | 96.4 | 96.7 | 93.3 | 70.2 |
| Mamba2 + Resona (α = 0.3) | 44.3 | 61.5 | 99.9 | 100.0 | 92.3 | 79.6 |
| Mamba2 + Resona (α = 0.4) | 43.6 | 62.9 | 100.0 | 100.0 | 93.6 | 80.0 |
| Mamba2 + Resona (α = 0.5) | 45.2 | 65.4 | 99.9 | 99.9 | 93.3 | 80.7 |
| Mamba2 + Resona (α = 0.7) | 43.3 | 66.8 | 100.0 | 100.0 | 93.6 | 80.7 |
The following table shows performance on MAD Lab's Fuzzy In-Context Recall task with α = 0.4. The baseline is Mamba2, and the other entries correspond to Mamba2 + Resona (α = 0.4) with different chunk sizes:
| Model | Chunk Size | Fuzzy In-Context Recall (%) |
|---|---|---|
| Mamba2 (baseline) | — | 21.9 |
| Mamba2 + Resona (α = 0.4) | 6 | 62.9 |
| Mamba2 + Resona (α = 0.4) | 7 | 60.2 |
| Mamba2 + Resona (α = 0.4) | 8 | 55.1 |
| Mamba2 + Resona (α = 0.4) | 10 | 49.4 |
| Mamba2 + Resona (α = 0.4) | 12 | 40.2 |
We see that a smaller chunk size can be more effective, hence our choice of this parameter.
We will include a greater discussion of these details in an updated Appendix.
Insufficient discussion about Resona's adaptations for different LRM architectures (e.g., Mamba/Hyena/RWKV), particularly regarding hidden state interactions and retrieval module parameterization.
Thank you for pointing this out. We have provided some more details in our general response (https://openreview.net/forum?id=4mxQmpnawk&noteId=8yQDwnb0ES). Furthermore, we will add these details directly to the Appendix to explicitly mention how we have adapted each specific model we work with.
While claiming comparable inference efficiency to baselines, the paper lacks empirical comparisons of FLOPs, memory consumption, or practical latency (e.g., pre-filling phase chunk encoding overhead).
Thank you for the comment. Please see our general response (https://openreview.net/forum?id=4mxQmpnawk&noteId=8yQDwnb0ES) for tables containing these details and a further discussion on the trends/observations.
This paper proposes a new neural network module named Resona that resembles a RAG procedure on a per-layer basis. The context is first chunked and embedded, and relevant chunks are retrieved using queries corresponding to the hidden state of each token. The information from the retrieved chunks is then fused back into the model using cross-attention.
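For concreteness, a minimal sketch of this retrieve-and-fuse flow could look like the snippet below. It is an illustration under assumed details (mean-pooled chunk embeddings, cosine top-k retrieval, a learnable mixing coefficient, and the placeholder name `ResonaStyleBranch`), not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResonaStyleBranch(nn.Module):
    """Illustrative retrieve-and-fuse branch: chunk the context, retrieve the
    top-k chunks per token using its hidden state as a query, and fuse the
    retrieved tokens back via cross-attention."""

    def __init__(self, d_model: int, chunk_size: int = 6, top_k: int = 4, n_heads: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.top_k = top_k
        self.q_proj = nn.Linear(d_model, d_model)                      # queries from per-token states
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.4))                   # mixing coefficient (assumed learnable)

    def forward(self, hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # hidden:  (B, T, D) per-token states from the linear recurrent layer
        # context: (B, T, D) input/context embeddings to retrieve from
        B, T, D = context.shape
        n_chunks = T // self.chunk_size
        chunks = context[:, : n_chunks * self.chunk_size].reshape(B, n_chunks, self.chunk_size, D)
        chunk_emb = chunks.mean(dim=2)                                  # (B, n_chunks, D), mean-pooled

        q = self.q_proj(hidden)                                         # (B, T, D)
        scores = torch.einsum("btd,bcd->btc",
                              F.normalize(q, dim=-1),
                              F.normalize(chunk_emb, dim=-1))           # cosine relevance scores
        k = min(self.top_k, n_chunks)
        top_idx = scores.topk(k, dim=-1).indices                        # (B, T, k)

        # Gather the tokens of the retrieved chunks for every query position,
        # then fuse them back with cross-attention.
        gathered = chunks[torch.arange(B)[:, None, None], top_idx]      # (B, T, k, chunk_size, D)
        kv = gathered.reshape(B * T, k * self.chunk_size, D)
        fused, _ = self.cross_attn(q.reshape(B * T, 1, D), kv, kv)
        fused = fused.reshape(B, T, D)

        return hidden + self.alpha * fused                              # mix retrieval into the recurrent path


if __name__ == "__main__":
    branch = ResonaStyleBranch(d_model=64)
    x = torch.randn(2, 48, 64)                                          # (batch, tokens, d_model)
    print(branch(hidden=x, context=x).shape)                            # torch.Size([2, 48, 64])
```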
This paper is well-motivated, as it is known that one of the critical bottlenecks of linear attention modules is that they cannot retrieve perfectly from the context. While hybrid models with softmax attention have been proposed to address this issue, this paper's more explicit way of fusing retrieval is a meaningful exploration.
While the paper is overall clear, there are two issues that need to be addressed (I will expand more in the reasons to reject):
- While the 'formal' theory part has good intuition, the formulation of this theory is unfortunately incorrect.
- There are some important experimental details that are not properly explained, which may hinder reproduction.
Reasons to Accept
- The motivation of the paper is well delivered.
- The results on synthetic and realistic data show that the designed module indeed improves language modeling performance over the original linear architectures. The finetuning experiment also suggests that the Resona layer is better than sliding-window attention.
Reasons to Reject
- The theoretical argument is not rigorous. The authors define the state to be a hidden vector of dimension H with precision p, which restricts the maximal size of the set of states to 2^{pH}, making Lemma 3.1 vacuous. I don't consider this a strong reason to reject, but this is something the authors need to fix.
- Regarding the language modeling, the authors mention that the Chunk-and-Search process is done 'offline'. I found this incompatible with the architecture design the authors proposed, because the search process requires the hidden states produced by the layers.
- There is no measurement of the wall-clock time of this module, making it hard to judge its generality.
- Also, the authors can be more clear on what exactly is used as the query in the module. For linear architectures like DeltaNet, the recurrent RNN state is actually a matrix rather than a vector, and it is hard to obtain every recurrent state using the fastest implementations. This is related to point 3. If the authors are actually using the output of the linear module, then this should be made clear.
- There should be an ablation comparing Resona with full attention.
Questions for the Authors
Please refer to Reasons to Reject.
Also, I think there are some missing citations:
[1] discusses RNNs' limitations in memorization and includes a theory that is a more formal version of Lemma 3.1. It also proposes retrieval as a potential fix.
[2] The way the data is constructed seems to resemble in-context pretraining.
We would like to thank the reviewer for their thorough appraisal of our work. The detail of their comments is very much appreciated, and we hope the following response sufficiently addresses the questions they raise and makes our manuscript more accessible and complete.
Lemma 3.1 [is] vacuous. I don't consider this a strong reason to reject but this is something the authors need to fix.
We thank the reviewer for their comment. We admit that this statement may have been better suited as a point of discussion, perhaps in the related work, rather than as a Lemma. We will rectify this in an updated manuscript.
Regarding the language modeling, the authors mentioned that the Chunk-and-Search process is done 'offline'. I found this incompatible with the architecture design the authors proposed, because the search process requires the hidden states produced by the layers.
This is specific to the language modeling experiment. In this case, we conduct an offline search to construct masks that are then used during training.
We process the training data in segments. At the beginning of each segment, we do an inference pass with each example in the training set, which allows us to obtain the hidden state representations. Using these representations, we conduct the Chunk-and-Search with the input embeddings to find the most relevant chunks, which are then used to create a mask. After we construct all these masks, we save them and use them during training.
This design makes training faster, as online searches can be quite expensive. While there is admittedly a potential downside in that the hidden states used for the Chunk-and-Search are static, our results are nonetheless positive. We plan on further improving upon this in future work.
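Concretely, the offline procedure can be sketched as follows; this is a simplified illustration, where `encode_hidden_states` and the mean-pooled chunk embeddings are placeholder assumptions rather than our exact implementation.

```python
import torch


@torch.no_grad()
def build_retrieval_masks(model, dataset, chunk_size=6, top_k=4):
    """Offline Chunk-and-Search sketch: one inference pass per example, then a
    per-token boolean mask over the retrieved chunks is cached for training."""
    masks = []
    model.eval()
    for input_ids, embeddings in dataset:                    # one training example at a time
        hidden = model.encode_hidden_states(input_ids)       # (T, D): hidden states from the inference pass
        n_chunks = embeddings.size(0) // chunk_size
        chunks = embeddings[: n_chunks * chunk_size].view(n_chunks, chunk_size, -1)
        chunk_emb = chunks.mean(dim=1)                        # (n_chunks, D)
        scores = hidden @ chunk_emb.T                         # (T, n_chunks) relevance scores
        top_idx = scores.topk(min(top_k, n_chunks), dim=-1).indices
        mask = torch.zeros(hidden.size(0), n_chunks, dtype=torch.bool)
        rows = torch.arange(hidden.size(0)).unsqueeze(1)      # (T, 1) row indices
        mask[rows, top_idx] = True                            # per-token mask over retrieved chunks
        masks.append(mask)                                    # saved once, reused during training
    return masks
```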
There is no measurement of wall clock time of this module, making it hard to judge its generality.
Thank you for the comment. Please see our general response (https://openreview.net/forum?id=4mxQmpnawk&noteId=8yQDwnb0ES) for tables containing these details and a further discussion on the trends/observations.
The authors can be more clear on what exactly is used as the query in the module. [...] If the authors are actually using the output of the linear module, then this should be made clear.
Thank you for pointing this out. We have provided some more details in our general response (https://openreview.net/forum?id=4mxQmpnawk&noteId=8yQDwnb0ES). Furthermore, we will add these details directly to the Appendix to explicitly mention how we have adapted each specific model we work with.
There should be an ablation comparing Resona with full attention.
We have provided an ablation comparing Resona with a full-attention Transformer.
| Model | Compression (%) | In-context Recall (%) | Noisy Recall (%) | Fuzzy Recall (%) | Selective Copying (%) | Memorization (%) | Average (%) |
|---|---|---|---|---|---|---|---|
| Transformer | 53.4 | 96.7 | 99.7 | 69.4 | 98.7 | 89.4 | 84.1 |
| Mamba | 38.3 | 76.7 | 74.9 | 9.3 | 33.2 | 88.5 | 53.5 |
| Mamba + Resona | 38.2 | 99.9 | 100.0 | 63.4 | 42.7 | 88.8 | 72.1 |
| Mamba2 | 43.6 | 96.4 | 96.7 | 21.1 | 93.3 | 86.9 | 73.0 |
| Mamba2 + Resona | 46.6 | 100.0 | 100.0 | 62.9 | 93.6 | 88.1 | 81.9 |
| RWKV5 | 36.8 | 96.4 | 96.6 | 12.1 | 52.7 | 55.0 | 58.3 |
| RWKV5 + Resona | 40.4 | 99.7 | 99.8 | 59.7 | 58.0 | 70.6 | 71.5 |
| Hyena | 42.2 | 79.3 | 77.4 | 9.96 | 72.8 | 88.9 | 61.7 |
| Hyena + Resona | 42.6 | 99.9 | 99.9 | 66.2 | 74.3 | 89.0 | 78.7 |
As we can observe, in most cases the addition of a Resona module is sufficient to bridge the gap between the initial linear recurrent model and Transformer performance.
Missing citations
Thank you for the references, we will add them to the relevant sections of our discussion.
I appreciate the authors' effort in measuring the wall time speed and comparison with full attention. The rebuttal answers all my questions. I will increase my score.
This article introduces a novel technique aimed at improving the contextual information handling capabilities of Linear Recurrent Models (LRMs), a recently proposed efficient alternative to transformer-based models built on the principles of recurrent networks. Despite their advantages in computational and spatial efficiency, LRMs suffer from limitations when copying and integrating previous input fragments due to their fixed-size hidden states, thereby creating an information bottleneck. To address this, the authors propose extending LRMs with a method inspired by Retrieval-Augmented Generation (RAG). Unlike traditional RAG, which retrieves external documents, this approach encodes and retrieves embeddings of previous context passages within the input itself. These embeddings then influence the model's hidden state updates and subsequent layer outputs. The authors present rigorous experiments demonstrating significant performance improvements for LRMs such as Mamba, RWKV, and Hyena on tasks that heavily depend on context copying capabilities.
Reasons to Accept
The experimental validation provided is robust, demonstrating clear performance improvements over baseline LRMs. The proposed approach effectively addresses a critical limitation in LRMs, potentially broadening their applicability as alternatives to transformer-based models, in general, and transformer-based LLMs in particular.
Reasons to Reject
To the best of my knowledge of linear recurrent networks (though I consider myself more of an expert in classical recurrent or transformer-based networks), I do not see any major flaws in the manuscript. One possible addition could be a statistical significance analysis, but the improvements reported are substantial enough that such an analysis may not be strictly necessary. The paper could benefit from a clearer explanation of the computational overhead introduced by integrating the proposed technique.
Questions for the Authors
There are incorrect sentences or typos at lines 49, 179, and 207. It would be useful to provide more detailed explanations of the benchmarks used and the kinds of tasks they are designed to evaluate.
We thank the reviewer for their feedback and enthusiasm for this work. We are also glad that they point out areas where additional details can improve it. We hope the following response resolves these remaining questions.
The paper could benefit from a clearer explanation of the computational overhead introduced by integrating the proposed technique.
Thank you for the comment. Please see our general response (https://openreview.net/forum?id=4mxQmpnawk&noteId=8yQDwnb0ES) for tables containing these details and a further discussion on the trends/observations.
We will also discuss where there may be overhead and how we adjust for it in an additional Appendix.
There are wrong sentences or typos at lines 49, 179, and 207.
We appreciate that you have pointed these out and will fix them.
It would be useful to provide more detailed explanations of the benchmarks used and the kind of tasks they are designed to evaluate.
Thank you, we will further expand on these in the Appendix.
I acknowledge and appreciate the authors’ response. As my scores are aligned with those of the other reviewers, I maintain my original assessment and recognize the merit of the work.
General Response
We first thank the reviewers for their comprehensive reviews, as well as for highlighting the strengths and areas of improvement of our work. We are happy to see that two reviewers express high enthusiasm for our work, while another has provided significant comments that can greatly improve it.
Given the shared nature of some questions/comments, we would like to provide these details in a general response to improve readability of our discussion and reduce repetition.
Computational overhead
We thank all the reviewers for raising this point. We provide the following tables, which report pre-filling speed, generation speed, and memory usage for a number of methods. In particular, we compare a Transformer using Flash-Attention, as well as various linear recurrent models both with and without Resona plug-ins.
Overall, we observe that the Resona plug-in is very lightweight and does not add significant overhead in computation, particularly for longer sequences.
Prefilling Time (ms)
| Prefill Length | Transformer | Mamba | Mamba + Resona | Deltanet | Deltanet + Resona | GLA | GLA + Resona |
|---|---|---|---|---|---|---|---|
| 2k | 29 | 45 | 52 | 53 | 64 | 37 | 46 |
| 4k | 34 | 77 | 88 | 59 | 78 | 43 | 64 |
| 8k | 71 | 149 | 170 | 72 | 103 | 62 | 104 |
| 16k | 173 | 294 | 349 | 106 | 181 | 109 | 202 |
| 32k | 503 | 571 | 653 | 208 | 338 | 202 | 386 |
| 64k | 1665 | 1118 | 1285 | 412 | 652 | 407 | 757 |
| 128k | 6094 | 2257 | 2412 | 807 | 1289 | 806 | 1518 |
Generation Time (ms)
All results are for the generation of 128 tokens following the given pre-filling length.
| Prefill Length | Transformer | Mamba | Mamba + Resona | Deltanet | Deltanet + Resona | GLA | GLA + Resona |
|---|---|---|---|---|---|---|---|
| 2k | 2679 | 2972 | 3329 | 2945 | 3686 | 2749 | 3528 |
| 4k | 2758 | 3044 | 3360 | 3042 | 3772 | 2777 | 3543 |
| 8k | 2866 | 3155 | 3499 | 3023 | 3789 | 2774 | 3523 |
| 16k | 3389 | 3164 | 3613 | 3057 | 3829 | 2869 | 3689 |
| 32k | 5759 | 3491 | 3910 | 3080 | 4207 | 2912 | 4044 |
| 64k | 11144 | 4119 | 4656 | 3171 | 4404 | 3145 | 4352 |
| 128k | 24050 | 4747 | 5601 | 3611 | 5278 | 3509 | 5076 |
Memory Usage (GB)
| Prefill Length | Transformer | Mamba | Mamba + Resona | Deltanet | Deltanet + Resona | GLA | GLA + Resona |
|---|---|---|---|---|---|---|---|
| 2k | 3.1 | 2.8 | 4.0 | 2.9 | 4.8 | 2.9 | 3.6 |
| 4k | 3.5 | 2.8 | 4.1 | 3.0 | 4.9 | 3.0 | 3.7 |
| 8k | 4.5 | 3.1 | 4.4 | 3.3 | 5.2 | 3.2 | 3.9 |
| 16k | 6.4 | 3.7 | 5.1 | 3.7 | 5.5 | 3.6 | 4.4 |
| 32k | 10.2 | 4.9 | 6.5 | 4.6 | 6.6 | 4.3 | 5.4 |
| 64k | 17.7 | 7.2 | 9.2 | 6.4 | 8.4 | 5.9 | 7.5 |
| 128k | 32.9 | 11.9 | 14.7 | 10.2 | 12.3 | 9.1 | 11.7 |
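For reference, the kind of measurement reported above can be collected with a simple harness such as the sketch below. It assumes a CUDA device and a HuggingFace-style `generate` interface; it is an illustration, not our exact benchmarking script.

```python
import time
import torch


def measure(model, input_ids, gen_tokens=128):
    """Rough prefill/generation latency and peak-memory measurement."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    t0 = time.perf_counter()
    with torch.no_grad():
        model(input_ids)                                      # pre-filling pass over the prompt
    torch.cuda.synchronize()
    prefill_ms = (time.perf_counter() - t0) * 1e3

    t0 = time.perf_counter()
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=gen_tokens)  # generate tokens after prefill
    torch.cuda.synchronize()
    generate_ms = (time.perf_counter() - t0) * 1e3

    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return prefill_ms, generate_ms, peak_gb
```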
Resona adaptations for specific architectures.
We thank the reviewers for pointing this out. It is true that different architectures utilize different hidden states, which can influence how Resona layers interact with them. We hope the following details provide more clarity on this front.
In the simplest setting, the hidden state is simply a vector (such as in Mamba), so we can directly use it as the query for the Resona module.
In other settings, it can be a matrix (DeltaNet, GLA, etc.), in which case we reshape the hidden state into a vector that is then projected to form the queries.
Finally, as Reviewer UqFQ mentioned, we can also use the output of the previous layer as the queries. While we did not directly consider this setting in our initial manuscript, we have tried it and observed little difference compared to using the hidden state. Nevertheless, we realize that for computational reasons it can be desirable to use the layer output, given that it is a vector and may therefore lead to lower memory usage than using a matrix-valued state as the queries, which we believe can further improve our method. Our initial goal was to maintain a more general method that could be applied across architectures, but this is certainly a valid point to consider.
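To make this concrete, the sketch below shows one way a vector- or matrix-valued recurrent state could be projected into queries; the module and projection names are illustrative rather than our exact parameterization.

```python
import torch
import torch.nn as nn


class QueryFromState(nn.Module):
    """Project a per-token recurrent state (vector or matrix) into query vectors."""

    def __init__(self, d_model: int, state_dim: int):
        super().__init__()
        self.vec_proj = nn.Linear(d_model, d_model)                # vector-valued states (e.g. Mamba)
        self.mat_proj = nn.Linear(d_model * state_dim, d_model)    # matrix-valued states (e.g. DeltaNet, GLA)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        if state.dim() == 2:                    # (T, D): one vector state per token
            return self.vec_proj(state)
        # (T, D, N): flatten the per-token matrix state, then project back to D
        return self.mat_proj(state.flatten(start_dim=1))


# Example: a matrix-valued state of shape (T, D, N) becomes (T, D) queries.
queries = QueryFromState(d_model=64, state_dim=16)(torch.randn(10, 64, 16))
```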
This paper presents Resona, a framework that augments Linear Recurrent Models with retrieval mechanisms to address their well-known issues in context copying and ICL. The main idea is to add a parallel cross-attention branch that retrieves relevant information from the input context, bypassing the bottleneck created by fixed-size hidden states.
Strengths:
- The motivation is clear and addresses a genuine limitation of LRMs
- The experimental evaluation is comprehensive across multiple architectures (Mamba, RWKV, Hyena) on both synthetic and real-world tasks
- The approach is practical and plug-and-play
- Authors provided detailed overhead analysis in their rebuttal
Weaknesses:
- The theoretical analysis (Lemma 3.1) seems flawed, as one reviewer correctly pointed out
- Some implementation details were initially unclear around the "offline" chunk and search process and how hidden states are handled across different architectures
- Perhaps lacks a systematic ablation on some hyperparameters (though the authors partially addressed this during the rebuttal with results on, e.g., chunk size)
- Missing important related work (fixable)
Overall, it is a solid piece of work, as all reviewers and I agree.