Peripheral Memory for LLMs: Integration of Sequential Memory Banks with Adaptive Querying
This paper introduces peripheral memory for large language models.
Abstract
Reviews and Discussion
This paper proposes peripheral memory, which is inspired by the RAM architecture. It focuses on the task of model editing and significantly outperforms previous methods. The peripheral memory appears to add external memory to the LLM inference process while editing some layers in the foundation model. Although the proposed method seems interesting, I am still confused by some details.
Questions for Authors
See above.
Claims and Evidence
Good. The author claims the proposed method can greatly address the task of model editing, and the experiments have verified its effectiveness.
Methods and Evaluation Criteria
I'm still confused by some details in terms of the methods and evaluations.
- Figure 1 and Figure 2 need more explanations. It is a bit hard to understand.
- What is the meaning of the functions g and h? Could you please explain with some detailed examples, such as the Llama architecture?
- I think Llama3-8B is not suitable as the backbone for evaluating consecutive editing. I notice that the comparisons involve 3k updates. However, the maximum context length of Llama3-8B is limited to 8192 tokens, which is obviously less than what 3k updates require (it seems to be more than 300k tokens, assuming 100 tokens per update). Therefore, most prompt-based editing methods may fail due to the token limitation. I think the authors should use Llama3.1-8B with its maximum length of 128k tokens, or GPT-4o with 128k tokens as well.
- How do you implement the baselines in your experiments? The results of the baselines are very low (some even 0.00), while the proposed method is significantly higher. I would like to request more details about the experimental results. For example, how many times were the experiments repeated?
Theoretical Claims
No theoretical claims in this paper.
Experimental Design and Analysis
See "Methods And Evaluation Criteria".
Supplementary Material
Yes, I have checked the supplementary materials. I also viewed the code that the authors provided.
Relation to Prior Literature
Yes, very relevant.
Missing Important References
[1] Zhong, Wanjun, et al. "Memorybank: Enhancing large language models with long-term memory." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 17. 2024.
[2] Packer, Charles, et al. "Memgpt: Towards llms as operating systems." arXiv preprint arXiv:2310.08560 (2023).
Other Strengths and Weaknesses
See above.
Other Comments or Suggestions
I think the author should respond to the above questions and concerns. I'm willing to raise my score if the response can address my concerns.
1. Question about baselines
Thank you for raising this important concern. We provide detailed clarifications regarding the baseline implementation:
(1) Baseline Implementation
All baselines for Knowledge-based Model Editing (KME) were implemented using the widely adopted toolkit EasyEdit (https://github.com/zjunlp/EasyEdit), ensuring reproducibility, reliability and alignment with established practices. For instance, in the case of WISE, we strictly followed the official implementation guidelines, with our editing code closely mirroring the provided examples:
```python
from easyeditor import BaseEditor, WISEHyperParams

hparams = WISEHyperParams.from_hparams('./hparams/WISE/llama-3-8b.yaml')
editor = BaseEditor.from_hparams(hparams)
metrics, edited_model, _ = editor.edit(
    prompts=prompts,
    rephrase_prompts=rephrase_prompts,
    target_new=target_new,
    subject=subject,
    sequential_edit=True,
    locality_inputs=locality_inputs,
)
```
The hyperparameters follow those suggested by EasyEdit.
(2) Low Baseline Performance
We observed that several baselines (e.g., PMET, MEMIT) exhibit zero performance in the consecutive editing scenario. Initially surprised by these results, we rigorously repeated the experiments more than 10 times and consistently observed the same outcome.
Key Insight: When the number of sequential edits exceeds 1K, these localization-based editing methods suffer from catastrophic performance degradation, mainly because repeated modifications progressively destabilize the LLM’s parameters. This phenomenon aligns with observations in recent works [1,2], which highlight the fragility of localized edits under high edit counts.
(3) Reproducibility of our method
Full implementation details and hyperparameters are provided in the supplementary material. For instance, refer to memory.py (Lines 1073–1075) for key parameters such as memory depth and grid size. Additionally, they have also been discussed in Section 4.3 for clear description. We welcome further discussion or code review to address any remaining concerns.
2. Questions about Llama3.1-8B as backbone
Thank you for your insightful feedback. We address your concerns as follows:
(1) Token Limit of Llama3-8B
The token limitation of Llama3-8B (8k tokens) does not impact our experiments, as the input sequences in the KME datasets (ZsRE and CounterFact) are well within this constraint: the maximum input length is 36 tokens for ZsRE and 56 tokens for CounterFact. Thus, even across 3,000 sequential edits, each individual input remains far below the 8k limit, so the token limit cannot affect model performance. The observed performance limitations of the baseline methods (e.g., MEMIT, PMET) are therefore not attributable to token constraints but to inherent challenges in parameter-localized editing [2].
(2) Performance under longer token limit
We acknowledge that token limits could theoretically hinder prompt-based methods. For comprehensive comparison, we conducted additional experiments on Llama3.1-8B (128k token limit) and compared our method with IKE, a state-of-the-art prompt-based editing method that uses in-context learning without parameter updates.
| Type | ZsRE Efficacy | ZsRE Generality | ZsRE Locality | ZsRE Score | CF Efficacy | CF Generality | CF Locality | CF Score |
|---|---|---|---|---|---|---|---|---|
| Original | 0.2287 | 0.5211 | 1.0000 | 0.5832 | 0.0043 | 0.0040 | 1.0000 | 0.3361 |
| IKE | 0.5232 | 0.5231 | 0.5190 | 0.5218 | 0.0055 | 0.0040 | 0.6725 | 0.2273 |
| Ours | 0.9919 | 0.6010 | 1.0000 | 0.8643 | 1.0000 | 0.2875 | 1.0000 | 0.7625 |
Our method achieves superior performance even with extended context lengths, confirming that token limitations are not the bottleneck in our setup. This also confirms that our framework’s advantages are architecture-agnostic and not contingent on context length.
3. Explanation of Figure 1, Figure 2, and the functions g and h
Figure 1 provides an overview of our framework, while Figure 2 illustrates the peripheral memory.
The functions g and h are learnable univariate functions, parameterized as B-spline curves. For a detailed treatment, please refer to Chapters 5.1 and 5.2 of [3].
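For intuition, a minimal sketch of a learnable B-spline-parameterized univariate function in PyTorch follows; the grid size, spline order, and value ranges are illustrative, not the paper's settings:

```python
import torch

class BSplineFunction(torch.nn.Module):
    """A learnable univariate function parameterized by B-spline basis functions
    (a sketch of the kind of smooth mapping g/h used in KAN-style memory banks)."""
    def __init__(self, grid_size=5, spline_order=3, x_min=-1.0, x_max=1.0):
        super().__init__()
        step = (x_max - x_min) / grid_size
        # Uniform knot grid, extended beyond [x_min, x_max] by the spline order.
        knots = x_min + step * torch.arange(-spline_order, grid_size + spline_order + 1, dtype=torch.float32)
        self.register_buffer("knots", knots)
        self.order = spline_order
        # One learnable coefficient per B-spline basis function.
        self.coef = torch.nn.Parameter(0.1 * torch.randn(grid_size + spline_order))

    def forward(self, x):
        x = x.unsqueeze(-1)
        # Degree-0 bases: indicators of which knot span x falls into.
        bases = ((x >= self.knots[:-1]) & (x < self.knots[1:])).to(x.dtype)
        # Cox-de Boor recursion lifts the degree while keeping smoothness.
        for k in range(1, self.order + 1):
            left = (x - self.knots[:-(k + 1)]) / (self.knots[k:-1] - self.knots[:-(k + 1)])
            right = (self.knots[k + 1:] - x) / (self.knots[k + 1:] - self.knots[1:-k])
            bases = left * bases[..., :-1] + right * bases[..., 1:]
        # Smooth scalar output: weighted sum of spline bases.
        return (bases * self.coef).sum(-1)

g = BSplineFunction()
y = g(torch.linspace(-0.9, 0.9, 5))  # evaluate the learned univariate function
```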
Due to current character constraints, we are unable to provide a more detailed explanation here, but we will offer a comprehensive discussion in a later response.
References
[1] Wang et al. 2024. WISE: Rethinking the Knowledge Memory for Lifelong Model Editing of Large Language Models. In NeurIPS.
[2] Li et al. 2024. Consecutive Batch Model Editing with HooK Layers. In EMNLP, 13817–13833. Association for Computational Linguistics.
[3] Prautzsch et al. 2002. Bézier and B-Spline Techniques. Mathematics and Visualization. Springer Science & Business Media. doi:10.1007/978-3-662-04919-8. ISBN 978-3-540-43761-1.
The paper introduces "peripheral memory", a memory augmentation method for LLMs. The paper views memory as a separate RAM-like component that interfaces with an LLM. The memory is designed as a sequence of memory banks, each modeled using KANs. Memory operations are controlled by query signals derived from the LLM's internal states, and a dedicated memory bank generates mask values indicating the relevance of retrieved data. The paper claims to improve over limitations of prior methods by enhancing scalability, reusability and configurability. The method is evaluated on knowledge-based model editing and long-context Q&A.
Update after rebuttal: I maintain my score as my main questions are addressed.
Questions for Authors
- How does the approach handle conflicting information or multiple edits about the same fact? The paper demonstrates how the confidence bank can mask out irrelevant knowledge, but I do not believe it addresses the case of contradictory knowledge?
- How will the approach scale to even more facts or edits, e.g. 100K or 1M or more?
Claims and Evidence
Yes, the claims around scalability, reusability and configurability generally seem pretty well supported by the strong performance on the knowledge editing, long context QA, shared memory across LLM experiments, and experiments on memory bandwidth, depth and allocation experiments.
Methods and Evaluation Criteria
Yes, the paper appears to evaluate the method on standard benchmarks for measuring memory effectiveness and compare to a variety of baselines.
Theoretical Claims
The paper is empirical based without any major theoretical claims.
Experimental Design and Analysis
Yes the experiments seem to be well designed and take into account relevant baseline methods, different LLM backbones and varied settings.
Supplementary Material
Yes I skimmed the supplementary material which includes additional info on KANs, and additional experiments and details.
Relation to Prior Literature
The paper positions itself clearly in relation to existing memory augmentation approaches to LLMs, working memory, implicit memory and explicit memory and their limitations.
Missing Important References
In general it seems fine, however a more detailed discussion and comparison with RAG approaches would strengthen the paper. There is a brief discussion and comparison in the appendix, but this seems relevant enough to be included in the main paper.
Other Strengths and Weaknesses
Strengths:
- The overall approach appears novel and creative
- The comparison with physical RAM is intuitive
- The empirical results seem convincing
- The ability to share memory across different LLMs seems promising for practical deployments
Weaknesses:
- As mentioned above, a more detailed comparison and discussion of RAG would strengthen the paper
- While some motivation of KANs is introduced, no empirical comparisons of this are made, leaving the reader to wonder how important they actually are to the method
- In addition, while sprinkled throughout the paper, a more explicit discussion of the limitations of this work and the proposed methods and future improvements would strengthen the paper.
Other Comments or Suggestions
N/A
1. Handling conflict
Thank you for raising this critical point. Below, we clarify our method’s current behavior and outline potential enhancements:
(1) Last-Write-Wins Policy
In sequential editing scenarios, our memory defaults to a temporal-priority strategy: the most recent edit overwrites previous entries for the same fact. This is reflected in the confidence mechanism, where newer entries are assigned higher trust values due to their proximity in the query distribution. While this ensures consistency in retrieval (only the latest edit is returned), it does not explicitly resolve semantic contradictions. However, the confidence bank can easily be configured to detect contradictions using off-the-shelf entailment classifiers.
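A toy sketch of this last-write-wins behavior (purely illustrative; the class, keys, and versioning scheme below are assumptions, not the confidence-bank implementation):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class LastWriteWinsMemory:
    """Toy last-write-wins store: each fact key maps to (value, version)."""
    store: Dict[str, Tuple[str, int]] = field(default_factory=dict)
    clock: int = 0

    def write(self, key: str, value: str) -> None:
        self.clock += 1
        # A newer edit always overwrites the older entry for the same fact.
        self.store[key] = (value, self.clock)

    def read(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        return entry[0] if entry else None

mem = LastWriteWinsMemory()
mem.write("capital_of_X", "A")
mem.write("capital_of_X", "B")          # later, conflicting edit
assert mem.read("capital_of_X") == "B"  # only the latest edit is returned
```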
(2) Conflict-Aware Confidence
Integrate semantic similarity checks between new edits and existing memory entries. For example, if a new edit contradicts a stored fact (measured via entailment models), the confidence bank could trigger a conflict-resolution protocol (e.g., human-in-the-loop verification or probabilistic truth maintenance). Notably, this enhancement can be implemented as a configurable extension to our peripheral memory without altering the core LLM.
Actually, we are now actively studying a version-controlled peripheral memory inspired by database systems, where conflicting edits are stored as alternate branches. This could enable users to query the memory with temporal constraints (e.g., “What was believed about xxx in 2023?”).
2. Scale to even more edits
Our framework currently employs a direct memory querying strategy, which achieves reliable storage accuracy for up to 10K edits (see Figure 7). However, as noted in Section 5.4, while the memory retains high storage fidelity at scale, its generalization degrades for semantically equivalent queries due to geometric misalignment in hidden feature spaces (see Appendix C). To balance robustness and scalability, we adopt a memory archival protocol: when the active memory reaches 1K entries (optimal for generalization), it can be archived, and a fresh memory will be initialized. This approach enables theoretically unbounded storage capacity while maintaining generalization performance thanks to the unique properties of our peripheral memory component. Notably, even with a 1K entry limit, our memory achieves high effective storage density (4.95, see Table 4).
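A toy sketch of this archival protocol (the class name, dictionary-based store, and fallback search are illustrative, not our KAN-based implementation):

```python
from typing import Any, Dict, List, Optional

class ArchivedPeripheralMemory:
    """When the active memory reaches a capacity chosen for good generalization,
    freeze it into an archive and initialize a fresh memory."""
    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self.active: Dict[str, Any] = {}
        self.archives: List[Dict[str, Any]] = []

    def write(self, key: str, value: Any) -> None:
        if len(self.active) >= self.capacity:
            self.archives.append(self.active)  # archive the full memory
            self.active = {}                   # ...and start a fresh one
        self.active[key] = value

    def read(self, key: str) -> Optional[Any]:
        if key in self.active:                 # check the active memory first
            return self.active[key]
        for bank in reversed(self.archives):   # then fall back to archives, newest first
            if key in bank:
                return bank[key]
        return None
```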
While this archival strategy significantly improves scalability, it trades off per-bank memory utilization. To address this, we are developing a memory management component (MMC), inspired by the Memory Management Unit (MMU) in operating systems. The module maps semantically equivalent inputs to a unified memory space using contrastive learning, decoupling semantic alignment from storage operations. This allows the peripheral memory to specialize in efficient storage/retrieval, while the MMC handles query normalization and address translation.
3. Importance of KAN
Response #3 of Reviewer gZ6v provides detailed experimental results.
These indicate that the KAN-based memory bank surpasses the MLP-based counterpart within our peripheral memory.
(1) Accuracy vs. Generalization Trade-off
KAN demonstrates exceptional performance in both accuracy and generalization, while MLPs (even at 4× params) plateau at 89.7% accuracy and 57.8% generalization. MLPs’ noisy memory outputs degrade Locality, as seen on CounterFact: KANs retain 100% Locality vs. MLPs’ 2.3%-3.3%.
(2) Scaling MLPs Fails to Close the Gap
Increasing MLP hidden dimensions marginally improves storage accuracy (e.g., 89.7% at 4× params vs. 86.7% at 1× on ZsRE), but harms generalization due to overfitting on the memorized data. This reflects the inability of MLPs to learn smooth mappings.
(3) Catastrophic Collapse at Extreme Scaling
At 8× parameters, MLP performance collapses to 0 across all metrics. We attribute this to:
- Optimization Instability: Overparameterized MLPs suffer from vanishing/exploding gradients, exacerbated by the memory’s sequential architecture (sequence length = 512).
- Loss Landscape Degradation: High-dimensional MLP weights create chaotic loss surfaces, preventing convergence. In contrast, KANs’ spline parameterization inherently regularizes the optimization landscape.
(4) Why MLPs Underperform
- Approximation-Theoretic Limitations: MLPs struggle to model the compositional structure of memory mappings (query→key→value), which KANs explicitly encode via Kolmogorov-Arnold superposition [1].
- Noise Amplification: MLPs’ fixed activations amplify high-frequency noise in memory queries, degrading generalization. KANs’ adaptive splines act as low-pass filters, suppressing noise [2].
References
[1] Liu et al. (2024). KAN: Kolmogorov-Arnold Networks. In ICLR.
[2] Prautzsch et al. (2002). Bézier and B-Spline Techniques. Mathematics and Visualization. Springer Science & Business Media. ISBN 978-3-540-43761-1.
This paper proposes a novel memory augmentation technique for LLMs by decoupling memory from the model architecture, analogous to a CPU and RAM architecture. The proposed peripheral memory consists of sequential memory banks modeled by Kolmogorov-Arnold Networks (KAN) to have smooth and adaptive memory read/write operations controlled by internal LLM states. The framework integrates retrieved memory content with an adaptive confidence masking mechanism. The experiments demonstrate effectiveness in knowledge-based model editing and long-context question answering.
Questions for Authors
- Could you provide a detailed analysis or empirical investigation into how semantic drift or retrieval quality degradation is mitigated (or worsens) at extremely large storage capacities (e.g. >10K updates)? Have you considered explicit drift-mitigation methods such as periodic re-indexing or semantic clustering?
- How sensitive is the performance to the specific choice of query representation?
- Could you discuss comparison to recent advanced retrieval-augmented generation methods, particularly on long-context question answering tasks?
Claims and Evidence
The authors claim improvements in scalability, reusability, and configurability of memory augmentation. These claims are convincingly supported by extensive experiments. It would be better if the experiments involved more diverse LLM architectures.
Methods and Evaluation Criteria
The method is novel and well-justified. The evaluation criteria, including benchmarks such as ZSRE, COUNTERFACT, Qasper, and HotpotQA, are appropriate and widely recognized. One minor weakness is the absence of comparison to retrieval-augmented generation (RAG) techniques, particularly in the QA tasks, which are directly related.
Theoretical Claims
The theoretical explanation is clear and correctly aligns with established literature on smooth nonlinear mapping networks.
Experimental Design and Analysis
The experimental design is sound and clearly articulated, providing a fair and extensive comparison against state-of-the-art baselines. A potential bias arises in the choice of baselines, as recent RAG methods were not explicitly compared.
Supplementary Material
I reviewed the supplementary material. The supplementary content was helpful in supporting and clarifying the main results.
Relation to Prior Literature
The paper situates itself clearly within the existing memory-augmentation literature for LLMs.
Missing Important References
[1] Jiang, Ziyan, Xueguang Ma, and Wenhu Chen. "Longrag: Enhancing retrieval-augmented generation with long-context llms." arXiv preprint arXiv:2406.15319 (2024).
[2] de Jong, Michiel, et al. "Fido: Fusion-in-decoder optimized for stronger performance and faster inference." arXiv preprint arXiv:2212.08153 (2022).
Other Strengths and Weaknesses
Strengths:
The idea of conceptualizing the memory architecture as analogous to a RAM-CPU structure is novel and interesting. The empirical performance significantly outperforms current state-of-the-art methods. The comprehensive analysis of scalability and configurability clearly shows real-world applicability.
Weaknesses:
Limited discussion regarding limitations or failure cases of the proposed approach, particularly in cases where retrieval signals might degrade or when memory banks encounter interference at extremely high capacities. Lack of comparison with advanced RAG methods.
Other Comments or Suggestions
N/A
1. Question about semantic drift
Thank you for this insightful question. We acknowledge that semantic drift and retrieval degradation at extreme scales (>10K updates) for semantically equivalent queries remain challenges. Below, we summarize our empirical observations and outline explicit mitigation strategies under development:
As discussed in Appendix C, our memory is directly queried using the hidden state of the final input token. While efficient, this design inherits a challenge in representation learning: geometric misalignment in high-dimensional spaces. That is, small differences in inputs can lead to disproportionate shifts in representations. Thus, semantically equivalent queries may map to distinct regions of the memory space due to minor differences in token-level representations. This reduces robustness to rephrased inputs, as shown in Table 2. Additionally, as memory utilization scales, the module becomes overly specialized to the original query distribution (stability), sacrificing generalization to representationally divergent queries (plasticity).
To address the limitations of the direct memory querying strategy, we are actively developing a memory management module inspired by the Memory Management Unit (MMU) in operating systems. The module acts as an abstraction layer between the LLM and the peripheral memory, decoupling semantic alignment from storage operations. This allows the peripheral memory to specialize in efficient storage/retrieval, while the MMU handles query normalization, introducing three key innovations:
(1) Semantic-Aware Querying
The MMU parses and refines raw query signals (e.g., token-level hidden states) into semantically enriched descriptors, mitigating hypersensitivity to input variations. For example, paraphrased queries like "What is the capital of France?" and "Name France’s capital city" would be mapped to unified descriptors, enabling robust retrieval regardless of surface-form differences.
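For intuition only, a minimal sketch of such query normalization and descriptor-based retrieval; the module names, dimensions, and retrieval helper are assumptions, not the actual MMU design:

```python
import torch
import torch.nn.functional as F

class SemanticQueryNormalizer(torch.nn.Module):
    """Project raw token-level hidden states into a descriptor space in which,
    after (e.g., contrastive) training, paraphrases land close together."""
    def __init__(self, hidden_dim: int = 4096, descriptor_dim: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_dim, descriptor_dim)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # L2-normalized descriptors make retrieval depend on direction, not scale.
        return F.normalize(self.proj(hidden_state), dim=-1)

def retrieve(descriptor: torch.Tensor, memory_keys: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    # Cosine similarity against stored (normalized) descriptors; surface-form
    # variation is absorbed by the normalizer rather than by the memory itself.
    scores = memory_keys @ descriptor
    return scores.topk(top_k).indices
```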
(2) Optimized Memory Operations
The MMU supports bulk memory read/write operations, reducing overhead for large-scale edits. Inspired by virtual memory paging, it dynamically groups related memory entries (e.g., knowledge about a specific entity) into contiguous blocks, improving cache utilization.
(3) Adaptive Scheduling Policies
Leveraging reinforcement learning, the MMU learns optimal policies for memory allocation and eviction. This balances hot (frequently accessed) and cold (rarely used) memory regions, addressing the stability-plasticity trade-off while minimizing fragmentation.
This MMU introduces OS-inspired abstractions (e.g., memory pages) to LLMs, enabling systematic memory control. We are currently refining this architecture and will introduce it in future work.
2. Question about query sensitivity

The performance of our framework exhibits moderate sensitivity to the choice of query representation. Below, we analyze two key dimensions of this sensitivity:

- Token Position Sensitivity: Using the last-token hidden state as the query signal is an intuitive choice, since it aggregates semantic information from the entire input. We also ran an experiment using the average of all token hidden features as the query and found similar results (see the table and the sketch after this list), showing that the selection of token features is relatively stable.

| Type | ZsRE Efficacy | ZsRE Generality | ZsRE Locality | ZsRE Score | CF Efficacy | CF Generality | CF Locality | CF Score |
|---|---|---|---|---|---|---|---|---|
| Last Token | 0.9774 | 0.6432 | 1.0000 | 0.8735 | 0.9915 | 0.3108 | 1.0000 | 0.7674 |
| Average | 0.9698 | 0.6399 | 1.0000 | 0.8699 | 0.9800 | 0.2520 | 1.0000 | 0.7440 |

- Layer-Wise Variability: Deeper layers (e.g., layers 8-31 in Llama3-8B) yield more stable query representations due to their focus on high-level semantics, while shallower layers (e.g., layers < 8) exhibit higher variance (see Figure 12 in Appendix D.3). This means that query performance is relatively stable as long as the chosen layer is within a reasonable range.
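A minimal sketch of the two query-extraction strategies compared above (the tensor shapes and helper function are illustrative, not our exact implementation):

```python
import torch

def extract_query(hidden_states: torch.Tensor, mode: str = "last") -> torch.Tensor:
    """Build a memory query from one layer's activations, shaped (seq_len, hidden_dim)."""
    if mode == "last":
        return hidden_states[-1]          # last-token hidden state as the query signal
    if mode == "average":
        return hidden_states.mean(dim=0)  # mean-pooled hidden states as the query signal
    raise ValueError(f"unknown mode: {mode}")

# Example with a dummy activation tensor (12 tokens, hidden size 4096).
h = torch.randn(12, 4096)
q_last, q_avg = extract_query(h, "last"), extract_query(h, "average")
```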
Additionally, as noted in prior responses, the improved query mechanism (MMU) will decouple semantic alignment from query representation by aggregating hidden states across multiple layers at the same time. This could balance low-level syntactic and high-level semantic features, reducing positional bias and improving robustness.
3. Comparison with recent RAG

Thank you for emphasizing the importance of comparison with recent RAG methods. We have provided an experimental analysis in Appendix D.1. Briefly, we evaluate our method against the recent LongRAG:
| Method | Qasper | MultiFieldQA-en |
|---|---|---|
| C-5003 | 22.5 | 39.5 |
| B-5003 | 20.4 | 26.2 |
| LongRAG | 15.5 / 26.3 | 38.9 / 49.4 |
| Ours | 30.1 | 43.3 |
LongRAG: Results with * use Document-level retrieval; others use Passage-level (1 retrieval unit).
4. Limitation discussion: see Response #1.
5. Additional results on diverse LLM architectures will be provided later due to current character constraints.
This work proposes Peripheral Memory for LLMs, in which the sequence modeling and the memory updates interleaves in the language modeling process. The experimental results on knowledge-based model editing and long-context QA demonstrate the effectiveness of such method.
Update after rebuttal
Increased my score by 1 due to the helpful discussion with the authors on the generality of the work.
Questions for Authors
Check weakness
Claims and Evidence
No, check evaluations and weakness.
Methods and Evaluation Criteria
I checked the code in the supplementary material. I found the functions for training on KME, TME and LongBench. The authors do not mention these details in the paper, but I believe this method requires data-specific fine-tuning to train its "convertor" module. Thus, most of the comparisons are unfair, because your model is trained on LongBench while the other long-context models are not trained on LongBench data.
Theoretical Claims
None.
Experimental Design and Analysis
Check evaluations.
Supplementary Material
Yes, I read the code carefully to understand how to use this method, because the paper does not mention any usage details.
Relation to Prior Literature
This method is quite impressive because the designs are very lightweight. Compared with this method, MemoryLLM (Towards Self-Updatable Large Language Models) requires the parametric memory at each layer, which consumes quite a lot of GPU memory.
Missing Important References
- M+: Extending MemoryLLM with Scalable Long-Term Memory
- CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory
- Augmenting Language Models with Long-Term Memory
Other Strengths and Weaknesses
Strengths:
- The design of the MemoryBank is motivated by RAM and makes a lot of sense. The workflow is clear and captures global memory information.
Weaknesses:
- The motivation for the Kolmogorov-Arnold Networks design is unclear and confusing. I understand it is a novel architecture that can attract attention from the community, but I believe using an MLP here rather than a KAN would make the design clearer and more intuitive. Additionally, using an MLP is very likely to yield better performance in your case.
- If I understand the method correctly, this method is not training-free. The W_0 and W_1 mapping matrices in Section 3.2 are newly initialized and require training for alignment. However, I did not find any details about the training of these mapping weights or of the KAN network. If you use a general SFT dataset to adapt your foundation LLM into an LLM with memory, then please introduce your training and dataset details. If you perform fine-tuning on the training data of each downstream task, your evaluations are completely unfair.
- The variable l in Section 2.1 should be a hyper-parameter, but I did not find any value for this hyper-parameter, nor any ablation studies on it.
Other Comments or Suggestions
Check weakness
1. Question about W_0 and W_1
Thank you for raising this important concern. We clarify the training protocol and evaluation fairness as follows:
(1) Role and Training of Convertors
The mapping matrices W_0 and W_1 (see Section 3.2) serve solely as feature-space adapters, analogous to "plug-and-play" connectors between the LLM’s hidden states and the memory module. Their sole purpose is to align the dimensionality of the LLM’s hidden space with the memory’s representation space, not to encode task-specific knowledge. Specifically, the convertors were trained solely to memorize contexts into memory (see Lines 986, 587, and 273 of the code), with no exposure to query data or ground-truth answers during training. This ensures the convertors do not learn task-specific knowledge.
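As a rough illustration of this adapter role (the dimensions and two-way layout below are assumptions for the sketch, not the exact W_0/W_1 implementation):

```python
import torch

class Convertor(torch.nn.Module):
    """Sketch of a feature-space adapter: plain linear maps that bridge the LLM
    hidden dimension and the memory representation dimension."""
    def __init__(self, llm_dim: int = 4096, mem_dim: int = 512):
        super().__init__()
        self.w0 = torch.nn.Linear(llm_dim, mem_dim, bias=False)  # LLM space -> memory space
        self.w1 = torch.nn.Linear(mem_dim, llm_dim, bias=False)  # memory space -> LLM space

    def to_memory(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.w0(hidden_state)

    def to_llm(self, memory_feats: torch.Tensor) -> torch.Tensor:
        return self.w1(memory_feats)
```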
(2) Ensuring No Data Memorization
To rigorously verify that convertors do not memorize evaluation data, we also implemented an empty_fn (similar to the memory_fn in Line-508 of supplementary code):
```python
def empty_fn(self, layer_idx, hidden_states, causal_masks, **kwargs):
    # Control variant: inject the convertor output built from the query signal
    # alone, skipping memory retrieval entirely.
    if layer_idx == self.merge_layer_idx:
        empty_feats = self.convertor(self.query_signal.to(self.convertor.device))
        hidden_states[:, self.replace_idx, :] = empty_feats.to(hidden_states.device)
        causal_masks = self._replace_attn_fn(causal_masks, 1., self.replace_idx)
    if layer_idx > self.merge_layer_idx:
        # Keep the placeholder position attendable in all later layers.
        causal_masks = self._replace_attn_fn(causal_masks, 1., self.replace_idx)
    return hidden_states, causal_masks
```
This function skips memory retrieval and aligns the LLM’s output with its original predictions. However, extensive experiments show no statistically significant difference in model performance (< 0.3%) between models with and without empty_fn, confirming that the convertors introduce no hidden memorization. For efficiency and simplicity, we omitted this function and did not provide it in the supplementary materials.
2.Question about the variable l in Section 2.1
Thank you for highlighting this important point. The variable l in Section 2.1 denotes the index of the hidden layer from which query features are extracted for memory retrieval. While l is indeed a critical hyper-parameter, we relegated its analysis to Appendix D.3 due to space constraints.
The findings can also be found in Response #2 to Reviewer UVbc.
3. Question about using KAN
We appreciate your thoughtful critique and address the motivation for using KANs (Kolmogorov-Arnold Networks) as follows:
(1) Motivation for KANs
As formalized in Eq. 2, the memory query process can be abstracted as a smooth mapping from query signals to memory data. Physical RAM approximates this mapping with an indicator function, which is ill-suited for neural memory systems that require gradual state transitions. KAN was chosen over MLP due to its superior symbolic regression capabilities:
- Superior Function Approximation: KANs achieve higher precision with fewer parameters by leveraging spline-based nonlinearities, avoiding MLPs’ reliance on rigid activation functions (e.g., ReLU).
- Smoothness: KANs’ piecewise-polynomial basis functions enable smooth, continuous mappings, critical for stable gradient propagation during sequential memory interactions.

We will revise the title of Section 2.2 in future editions and provide further clarification in this section.
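As a rough illustration of this contrast (an abstraction for intuition, not the paper's exact Eq. 2; the symbols below are introduced here for the sketch):

$$ m_{\mathrm{RAM}}(q) = \sum_i \mathbb{1}[q = a_i]\, v_i \qquad \text{vs.} \qquad m_{\mathrm{smooth}}(q) = \sum_i \phi_i(q)\, v_i $$

where the $a_i$ are discrete addresses, the $v_i$ are stored values, and the $\phi_i$ are smooth, spline-parameterized basis functions, so small perturbations of the query $q$ yield small, differentiable changes in the retrieved content rather than the all-or-nothing behavior of an indicator function.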
(2) Empirical Comparison with MLPs
We compared KANs against MLPs with varying parameter counts:
| Type | ZsRE Efficacy | ZsRE Generality | ZsRE Locality | ZsRE Score | CF Efficacy | CF Generality | CF Locality | CF Score |
|---|---|---|---|---|---|---|---|---|
| KAN (0.1932M params) | 0.9774 | 0.6432 | 1.0000 | 0.8735 | 0.9915 | 0.3108 | 1.0000 | 0.7674 |
| MLP (0.0219M params) | 0.8672 | 0.5954 | 0.3649 | 0.6092 | 0.8675 | 0.2410 | 0.0240 | 0.3775 |
| MLP (0.0434M params) | 0.8835 | 0.5883 | 0.3652 | 0.6123 | 0.8742 | 0.2370 | 0.0315 | 0.3809 |
| MLP (0.0864M params) | 0.8973 | 0.5785 | 0.3747 | 0.6168 | 0.8835 | 0.2295 | 0.0330 | 0.3820 |
| MLP (0.1079M params) | 0.3989 | 0.3852 | 0.0000 | 0.2614 | 0.8330 | 0.2104 | 0.0230 | 0.3555 |
| MLP (0.1723M params) | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
Note: each MLP here models a single memory bank and mirrors the KAN’s structure (see the sketch below):
- First & confidence banks: torch.nn.Linear(1, 9) → torch.nn.Linear(9, 1)
- Other banks: torch.nn.Linear(2, 9) → torch.nn.Linear(9, 1)
- Hidden dimensions are scaled by widening the intermediate layer (the 9 in torch.nn.Linear(1, 9) → torch.nn.Linear(9, 1)); #Epochs = 10.
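A minimal sketch of one such MLP bank (the nonlinearity between the two linear layers is an assumption for illustration; the note above only lists the linear layers):

```python
import torch

def mlp_bank(in_dim: int, hidden: int = 9) -> torch.nn.Sequential:
    """MLP stand-in for a single memory bank in the ablation above."""
    return torch.nn.Sequential(
        torch.nn.Linear(in_dim, hidden),  # hidden=9 is the base setting; widen to scale params
        torch.nn.SiLU(),                  # activation choice is illustrative
        torch.nn.Linear(hidden, 1),
    )

first_or_confidence_bank = mlp_bank(in_dim=1)  # first & confidence banks
other_bank = mlp_bank(in_dim=2)                # remaining banks
```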
Results indicate that the KAN-based memory bank surpasses the MLP-based counterpart within our peripheral memory.
For a more detailed analysis, please refer to Response #3 of Reviewer TWi2, due to the character limitation of the current response.
Hi authors, I appreciate the contributions and novelty of this work. My major concern still remains that the adaptation of the W0 and W1 weights is not conducted on general data. I believe it is necessary to redesign your training process to ensure the generality of your method. If you believe your method is used in a plug-in manner, then you should repurpose your contributions and change the baselines you compare with.
We sincerely appreciate your engagement and recognition of our work's novelty. We understand your new concern and provide further clarification below.
1. Generality of W_0 and W_1
Based on our experiments, even without training the convertors on general data, they can effectively map query features from the LLM's hidden space into the memory space without compromising generality. To validate this, we conducted cross-data evaluations using two completely distinct datasets (ZsRE and CounterFact), which exhibit significantly different data distributions [1]:
| Setting | Efficacy | Generality | Locality | Score |
|---|---|---|---|---|
| ZsRE original | 0.98 | 0.64 | 1.00 | 0.87 |
| ZsRE (from CounterFact) | 0.98 | 0.64 | 1.00 | 0.87 |
| CounterFact original | 0.99 | 0.31 | 1.00 | 0.77 |
| CounterFact (from ZsRE) | 0.99 | 0.30 | 1.00 | 0.76 |
The results show that cross-dataset performance remains comparable to training on the original dataset. This suggests that convertors trained on a specific dataset generalize effectively to unseen distributions.
A plausible explanation is that the hidden-state features of large language models reside on low-dimensional manifolds and encode rich, domain-agnostic supervisory signals, exhibiting robust domain generality [2]. This enables the convertors to learn generalizable mappings across datasets, consistent with representation learning theory [3,4].
2. Deeper Discussion of Generality
2.1 Clarification of Generality in Table 2
The Generality metric in Table 2 reflects performance on semantically equivalent queries and is primarily influenced by:
- the direct memory querying strategy rather than the convertors (see Questions #1 and #2 of Reviewer UVbc and the corresponding Responses #1–2);
- the memory storage density (analyzed in Appendix C).
To address the limitations of the direct memory querying strategy, we are actively developing a memory management module inspired by the Memory Management Unit (MMU) in operating systems.
The module acts as an abstraction layer between the LLM and peripheral memory, decoupling semantic alignment from storage operations of peripheral memory.
This allows the peripheral memory to specialize in efficient storage/retrieval, while the new module handles query normalization.
Preliminary experiments (using T5-small as this module) show an improvement in the Generality metric of Table 2; we are currently refining this architecture and will introduce it in future work.
2.2 Training Data Scope
The primary function of this module is to process query features and manage large-scale memory. Consequently, its training should be conducted on general data to enhance its ability to discern and handle memory that stores diverse types of knowledge. In contrast, the convertor-based direct memory querying strategy does not require such training.
3. Clarification of Contributions
3.1 Convertor's Role
The convertor is not the core innovation of this paper; it is used solely for feature-space bridging. As shown in the table above, even when trained exclusively on a specific dataset, the convertor successfully performs feature transformation and retains generality. Therefore, direct end-to-end training of the convertors during memory writing is a convenient and efficient choice.
3.2 Key contributions
Our key contributions are:
- The introduction of a novel, lightweight, and user-friendly memory architecture that is configurable and can be shared across different models. This represents a pioneering attempt at a new memory architecture and constitutes the main contribution of our work.
- A decoupled CPU-RAM design, eliminating the architectural entanglement of prior works.
These contributions are explicitly stated in Paragraph 2 and the Contributions list of the Introduction.
4. Clarification on Baselines
As stated above, our objective is to improve the current memory architecture, thereby increasing its Scalability, Reusability and Configurability.
To demonstrate its effectiveness, we have deliberately selected existing popular memory-augmented approaches as baselines, including WISE (NeurIPS 2024), MemoryLLM (ICML 2024, See Table 9), GRACE (NeurIPS 2023) and IKE (EMNLP 2023).
These baselines were chosen because:
- They represent current best practices in memory augmentation.
- They enable direct comparison of architectural innovations.
- They were evaluated under identical protocols.
References
[1] Meng et al. (2022). Locating and Editing Factual Associations in GPT. In NeurIPS
[2] Howard and Ruder (2018). Universal Language Model Fine-tuning for Text Classification. In ACL.
[3] Bengio et al. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8).
[4] Arora et al. (2019). A Theory of Representation Learning in Neural Networks. In ICML.
This paper proposes a novel memory augmentation framework for LLMs that treats memory as a peripheral module, inspired by RAM-CPU architecture, to enable efficient long-term knowledge updates. Using Kolmogorov-Arnold Networks (KAN) for smooth memory transitions and dynamic read/write operations, the system enhances model predictions without degrading performance. Experiments show strong results in knowledge editing and long-context question answering.
The paper was reviewed by four experts with final decisions of two Accepts, one Weak Accept and one Weak Reject. All reviewers appreciated the premise of the work, especially its motivation by RAM to use a peripheral memory with contemporary deep learning models, as well as its strong results and comprehensive analysis. There were initial concerns around the motivation for KANs (R#gZ6v, R#TWi2), comparison with RAG (R#UVbc, R#TWi2), need for discussion of limitations and failure cases (R#UVbc, R#TWi2), and some clarification questions (R#Eb4F). The authors' responses addressed most concerns, and R#gZ6v increased their score considering the authors' clarifications on the generality of their method. While there is one reviewer with a borderline negative view, their concerns were more around clarifications which are largely addressed in the rebuttal. Given these considerations, the paper is recommended for acceptance. The authors are however recommended to integrate their rebuttal responses in the final version.