PaperHub
Overall score: 7.3/10
Poster · 4 reviewers
Ratings: 5, 4, 5, 4 (average 4.0; min 4, max 5, std 0.5)
Confidence
Novelty: 3.0 · Quality: 3.5 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose an efficient lifelong model editing method which maintains high performance with a large number of edits.

Abstract

Keywords
Lifelong model editing · Catastrophic forgetting · Large language model · LLM

Reviews and Discussion

Review (Rating: 5)

This paper introduces MEMOIR (Model Editing with Minimal Overwrite and Informed Retention), a lifelong model editing method that affords a good trade-off between reliability, generalization, and locality. The edits are performed through a separate fully-connected layer whose output is later added to that of the original model layer. These edits are restricted to subsets of parameters selected by an input-dependent sparse mask. This allows MEMOIR to mitigate catastrophic forgetting. At inference time, they use the sparse mask to find the closest edited prompt to the input. The model's predictions on non-edited prompts are preserved. They evaluate with 4 models and compare with a wide range of baselines.
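
In rough pseudocode, the edited layer described above can be pictured as follows (a minimal sketch based on this summary; tensor names, shapes, and the exact placement of the mask are illustrative assumptions, not the paper's implementation):

```python
import torch

def memoir_forward(x, W_orig, W_mem, mask):
    """Forward pass of the edited FFN projection layer (illustrative).

    x      : (d_in,)        activation entering the edited layer
    W_orig : (d_out, d_in)  frozen original projection weight
    W_mem  : (d_out, d_in)  residual memory holding the edits
    mask   : (d_in,)        binary, input-dependent sparse mask
    """
    base = W_orig @ x               # original computation is left untouched
    residual = W_mem @ (x * mask)   # edits act only through the masked coordinates
    return base + residual          # edited output = original + sparse residual
```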

Strengths and Weaknesses

Strengths:

  • The method offers a good balance between the main metrics of interest (reliability, generalization, and locality) without introducing severe additional computational overhead. The memory module is non-invasive and does not lead to catastrophic forgetting.
  • MEMOIR is compared against an extensive set of baselines.
  • The paper is well-structured and easy to read.

Weaknesses:

  • An important evaluation of model editing methods was not mentioned in the paper. Updating some facts can affect other facts in the model that are not necessarily captured by semantic similarity. There are already existing benchmarks, such as https://arxiv.org/abs/2307.12976, to analyze this phenomenon.

Minor comments: There is a duplicated 'the' on line 316.

Questions

1 - How does MEMOIR compare to prior methods on benchmarks (for example https://arxiv.org/abs/2307.12976) that measure the impact of fact updates on other facts?

2 - Do you expect your approach to generalize to bigger models?

Limitations

yes

Final Justification

The authors have answered all the questions and addressed what I considered the main weakness of the paper. This additional experiment should be included in the final version of the paper.

Formatting Issues

None

Author Response

General comment

First, we would like to respectfully bring to the reviewer’s attention a correction in our reported results due to a technical oversight. Specifically, when computing TopHash indices during inference, we inadvertently included label information by using features of the concatenated [prompt, label] sequence.

We have corrected this by recomputing TopHash indices using centered features averaged over prompt tokens only. This affected the semantic representation used for similarity computations and the subsequent routing step, impacting mainly the generalization metric for the ZsRE QA task with LLaMA-3. In contrast, reliability and locality are only minimally affected. We see no or negligible degradation in the remaining combinations of benchmarks and models across all metrics.
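
For concreteness, the corrected index computation can be sketched as follows (a minimal illustration; the helper name, the use of absolute values for ranking, and the source of the centering statistic are assumptions rather than our exact code):

```python
import torch

def tophash_semantic_feature(prompt_hidden, feature_mean, k=4096):
    """Corrected computation: pool over prompt tokens only, then center.

    prompt_hidden : (num_prompt_tokens, d) hidden states of the prompt tokens
                    (label tokens are excluded, unlike in the original run)
    feature_mean  : (d,) statistic used for centering
    k             : number of active TopHash indices
    """
    pooled = prompt_hidden.mean(dim=0)        # average over prompt tokens only
    centered = pooled - feature_mean          # centered feature used for routing
    indices = torch.topk(centered.abs(), k).indices
    return centered, indices
```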

The revised results are provided below. Importantly, none of our previous conclusions are affected. In particular, in all settings, MEMOIR performance remains state-of-the-art, consistently outperforming all baselines across all metrics with a strong margin over the second-best methods.

Table 1: Q&A task results. Each cell shows Rel. / Gen. / Loc. / Avg.

LLaMA-3-8B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.98 / 1.00 / 0.99 | 0.99 / 0.96 / 1.00 / 0.98 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 |
| GRACE | 1.00 / 0.46 / 1.00 / 0.82 | 1.00 / 0.42 / 1.00 / 0.81 | 1.00 / 0.39 / 1.00 / 0.80 | 1.00 / 0.37 / 1.00 / 0.79 |
| WISE | 0.92 / 0.84 / 1.00 / 0.92 | 0.78 / 0.74 / 0.98 / 0.83 | 0.62 / 0.60 / 1.00 / 0.74 | 0.66 / 0.64 / 1.00 / 0.77 |
| AlphaEdit | 0.98 / 0.89 / 1.00 / 0.96 | 0.93 / 0.85 / 0.98 / 0.92 | 0.91 / 0.79 / 0.94 / 0.88 | 0.84 / 0.77 / 0.56 / 0.72 |
| MEMOIR | 1.00 / 1.00 / 1.00 / 1.00 | 0.97 / 0.89 / 1.00 / 0.95 | 0.96 / 0.89 / 1.00 / 0.95 | 0.94 / 0.85 / 1.00 / 0.93 |

Mistral-7B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.93 / 1.00 / 0.98 | 0.98 / 0.91 / 1.00 / 0.96 | 0.96 / 0.89 / 1.00 / 0.95 | 0.93 / 0.87 / 1.00 / 0.93 |
| GRACE | 1.00 / 0.36 / 1.00 / 0.79 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.02 / 1.00 / 0.67 |
| WISE | 0.98 / 0.97 / 1.00 / 0.98 | 0.92 / 0.89 / 1.00 / 0.94 | 0.87 / 0.80 / 1.00 / 0.89 | 0.70 / 0.67 / 1.00 / 0.79 |
| AlphaEdit | 0.83 / 0.77 / 1.00 / 0.87 | 0.87 / 0.75 / 0.99 / 0.87 | 0.86 / 0.74 / 0.95 / 0.85 | 0.85 / 0.72 / 0.68 / 0.75 |
| MEMOIR | 1.00 / 0.99 / 1.00 / 1.00 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 | 0.94 / 0.89 / 1.00 / 0.94 |

Table 2: Hallucination correction task results. Each cell shows Rel. / Loc.

| Method | LLaMA-3-8B T=1 | T=10 | T=100 | T=600 | Mistral-7B T=1 | T=10 | T=100 | T=600 |
|---|---|---|---|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 1.00 | 1.01 / 1.00 | 1.09 / 1.00 | 1.37 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |
| GRACE | 1.05 / 1.00 | 7.10e1 / 1.00 | 7.12e1 / 1.00 | 7.73e1 / 1.00 | 1.39 / 1.00 | 5.97 / 1.00 | 9.53 / 1.00 | 9.57 / 1.00 |
| WISE | 4.93e1 / 0.98 | 1.46 / 0.95 | 2.10 / 0.99 | 3.20 / 0.99 | 1.40 / 1.00 | 2.56 / 0.94 | 1.31 / 0.99 | 5.21 / 0.93 |
| AlphaEdit | 1.58 / 1.00 | 3.12 / 0.98 | 5.97 / 0.93 | 8.49e3 / 0.05 | 1.75 / 1.00 | 1.76 / 1.00 | 2.87 / 0.98 | 1.70e2 / 0.88 |
| MEMOIR | 1.00 / 1.00 | 1.01 / 1.00 | 1.07 / 1.00 | 1.25 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |

Rebuttal

We thank Reviewer DSuA for their valuable feedback. We are pleased they find our paper well-structured and easy to read, our baseline comparison extensive, and our method offering a good balance between the main metrics. Below, we address their specific comments and questions.

W1/Q1: Evaluation on RippleEdits We thank the reviewer for drawing our attention to this benchmark. First, we note that, in the current manuscript, MEMOIR follows the evaluation protocols established in prior work (ROME, MEMIT, GRACE, WISE) and covers a broad range of factual knowledge editing tasks, including question answering (ZsRE), hallucination correction (SelfCheckGPT), and OOD generalization (temporal dataset).

We recognize that evaluating how edited facts influence other facts beyond semantic similarity is a compelling direction that would further strengthen our contribution. Following the reviewer’s suggestion, we have added new experiments on the RippleEdits benchmark to analyze how MEMOIR impacts surrounding factual knowledge during the editing process.

We evaluate MEMOIR on the POPULAR dataset under 10 and 100 sequential-edit settings, using three metrics: Reliability (Rel.), Logical Generalization (Gen.), and Relation Specificity (Spec.). Rel. quantifies the model's ability to absorb new knowledge (as defined in the draft), Gen. captures whether the edit supports consistent logical inference over related facts, and Spec. verifies that unrelated relations remain unchanged. These metrics reflect more complex forms of generalization and locality. The results are presented in the table below. Note that evaluations are based on our current implementation under a lifelong editing setup; in particular, they are performed only after all edits in the sequence have been applied.

| Method | T=10 (Rel. / Gen. / Spec. / Avg.) | T=100 (Rel. / Gen. / Spec. / Avg.) |
|---|---|---|
| ROME | 0.80 / 0.04 / 0.29 / 0.38 | 0.01 / 0.01 / 0.02 / 0.01 |
| MEMIT | 0.98 / 0.27 / 0.29 / 0.51 | 0.01 / 0.00 / 0.01 / 0.01 |
| GRACE | 1.00 / 0.00 / 0.05 / 0.35 | 0.96 / 0.01 / 0.10 / 0.36 |
| AlphaEdit | 1.00 / 0.13 / 0.44 / 0.52 | 0.90 / 0.22 / 0.46 / 0.53 |
| WISE | 0.78 / 0.02 / 0.04 / 0.28 | 0.68 / 0.10 / 0.06 / 0.28 |
| MEMOIR | 1.00 / 0.27 / 0.54 / 0.60 | 0.98 / 0.20 / 0.59 / 0.59 |

Despite this being a highly challenging regime, as reflected by lower Gen. and Spec. scores across all baselines, MEMOIR achieves the highest performance on 5 out of 6 metrics and obtains the best accuracy averaged over all three metrics. This demonstrates its strong generalization ability even in this challenging setting. While there remains room for overall improvement, these results highlight MEMOIR as a strong foundation for advancing multi-hop generalization. Additionally, we expect further refinements, including improved training and hyperparameter tuning, to yield stronger results on this benchmark in the revised version of our work. We thank the reviewer for directing us to this new benchmark, and we will update the manuscript to include these new results, along with a dedicated paragraph analyzing and discussing them in depth, to further strengthen our contribution. We kindly ask the reviewer to reconsider their score based on the new evaluation results.

Q2: Generalizability to larger models We follow prior work in the knowledge editing field and cover the major LLMs used in this field, with varying sizes and architectures, including GPT-J-6B, LLaMA-2-7B, Mistral-7B, and LLaMA-3-8B. We expect our method to generalize well to even larger models for the following reasons:

  1. The residual memory, initialized from an intermediate FFN projection layer, contains more parameters in larger models. Assuming a fixed active subspace per edit, this enables storing more edits with reduced interference and less forgetting during sequential editing.
  2. Larger models capture semantic variation more robustly and are expected to generalize more effectively across paraphrases. Our conditional knowledge activation mechanism can thus more accurately identify semantically relevant and irrelevant samples during inference.

We hope this effectively addresses the reviewer’s concerns and sincerely appreciate their positive assessment. We remain available for any further discussion or clarification.

Comment

Thank you for your answers and further clarifications. The authors have answered all the questions and addressed the main weakness. I will update the score accordingly. Even though previous work did not evaluate on RippleEdits to investigate the impact of fact updates on other facts, this experiment should be included in the paper as it might be very informative for the community.

Review (Rating: 4)

This paper proposes MEMOIR, a framework for lifelong model editing in large language models (LLMs), designed to inject new knowledge while minimizing interference with existing knowledge. MEMOIR introduces a residual memory module with sparse, data-dependent masking, allowing each edit to target a distinct subset of parameters. During inference, MEMOIR activates memory selectively using a similarity-based routing mechanism, ensuring generalization to rephrased queries and suppression of irrelevant memory usage. The method is evaluated across tasks such as question answering, hallucination correction, and OOD generalization, showing good trade-offs between reliability, generalization, and locality, scaling effectively to thousands of sequential edits. The framework outperforms previous parametric and non-parametric methods including MEMIT, AlphaEdit, and WISE.

Strengths and Weaknesses

Strengths

The paper is methodologically solid. The empirical evaluation is extensive, covering multiple models, tasks, and a wide range of edit scales. The performance gains are clearly demonstrated with ablation studies.

The paper is clearly written, with intuitive figures that help illustrate the core mechanisms. The explanation of the masking strategy (TopHash) and inference-time routing is particularly clear and technically precise.

The combination of sparse memory allocation and conditional activation based on TopHash is novel in the context of LLM editing. While sparsity and activation gating are known techniques, their integration into this framework and use for routing edits in a memory-efficient and interference-minimizing way is well-motivated.

Weaknesses

MEMOIR only modifies a single transformer layer’s projection weights. While this suffices for many factual edits, it may limit applicability to complex or multi-hop knowledge edits requiring broader changes across multiple layers.

The effectiveness of TopHash-based routing depends on the number of active indices and the similarity threshold. Though the paper provides ablations, further justification or principled selection strategies would improve robustness and usability.

Questions

Have you experimented with extending MEMOIR to multiple transformer layers, or hierarchically allocating memory? Would it improve generalization or allow for more complex edits?

How sensitive is performance to the choice of routing threshold? Could it be learned during editing, rather than heuristically set?

In practice, if the number of edits grows indefinitely, the residual memory may become large. Have you considered any memory pruning or compression mechanisms?

Limitations

The authors appropriately note limitations in scope (single-layer editing, decoder-only models) and acknowledge future directions. The societal impact of memory-based editing (e.g., misinformation correction vs. malicious manipulation) is not deeply discussed but could be strengthened with a short reflection.

Final Justification

The authors answered all my questions and addressed my concerns. I updated my score to 4 and recommend acceptance.

Formatting Issues

None

Author Response

General comment

First, we would like to respectfully bring to the reviewer’s attention a correction in our reported results due to a technical oversight. Specifically, when computing TopHash indices during inference, we inadvertently included label information by using features of the concatenated [prompt, label] sequence.

We have corrected this by recomputing TopHash indices using centered features averaged over prompt tokens only. This affected the semantic representation used for similarity computations and the subsequent routing step, impacting mainly the generalization metric for the ZsRE QA task with LLaMA-3. In contrast, reliability and locality are only minimally affected. We see no or negligible degradation in the remaining combinations of benchmarks and models across all metrics.

The revised results are provided below. Importantly, none of our previous conclusions are affected. In particular, in all settings, MEMOIR performance remains state-of-the-art, consistently outperforming all baselines across all metrics with a strong margin over the second-best methods.

Table 1: Q&A task results. Each cell shows Rel. / Gen. / Loc. / Avg.

LLaMA-3-8B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.98 / 1.00 / 0.99 | 0.99 / 0.96 / 1.00 / 0.98 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 |
| GRACE | 1.00 / 0.46 / 1.00 / 0.82 | 1.00 / 0.42 / 1.00 / 0.81 | 1.00 / 0.39 / 1.00 / 0.80 | 1.00 / 0.37 / 1.00 / 0.79 |
| WISE | 0.92 / 0.84 / 1.00 / 0.92 | 0.78 / 0.74 / 0.98 / 0.83 | 0.62 / 0.60 / 1.00 / 0.74 | 0.66 / 0.64 / 1.00 / 0.77 |
| AlphaEdit | 0.98 / 0.89 / 1.00 / 0.96 | 0.93 / 0.85 / 0.98 / 0.92 | 0.91 / 0.79 / 0.94 / 0.88 | 0.84 / 0.77 / 0.56 / 0.72 |
| MEMOIR | 1.00 / 1.00 / 1.00 / 1.00 | 0.97 / 0.89 / 1.00 / 0.95 | 0.96 / 0.89 / 1.00 / 0.95 | 0.94 / 0.85 / 1.00 / 0.93 |

Mistral-7B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.93 / 1.00 / 0.98 | 0.98 / 0.91 / 1.00 / 0.96 | 0.96 / 0.89 / 1.00 / 0.95 | 0.93 / 0.87 / 1.00 / 0.93 |
| GRACE | 1.00 / 0.36 / 1.00 / 0.79 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.02 / 1.00 / 0.67 |
| WISE | 0.98 / 0.97 / 1.00 / 0.98 | 0.92 / 0.89 / 1.00 / 0.94 | 0.87 / 0.80 / 1.00 / 0.89 | 0.70 / 0.67 / 1.00 / 0.79 |
| AlphaEdit | 0.83 / 0.77 / 1.00 / 0.87 | 0.87 / 0.75 / 0.99 / 0.87 | 0.86 / 0.74 / 0.95 / 0.85 | 0.85 / 0.72 / 0.68 / 0.75 |
| MEMOIR | 1.00 / 0.99 / 1.00 / 1.00 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 | 0.94 / 0.89 / 1.00 / 0.94 |

Table 2: Hallucination correction task results. Each cell shows Rel. / Loc.

| Method | LLaMA-3-8B T=1 | T=10 | T=100 | T=600 | Mistral-7B T=1 | T=10 | T=100 | T=600 |
|---|---|---|---|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 1.00 | 1.01 / 1.00 | 1.09 / 1.00 | 1.37 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |
| GRACE | 1.05 / 1.00 | 7.10e1 / 1.00 | 7.12e1 / 1.00 | 7.73e1 / 1.00 | 1.39 / 1.00 | 5.97 / 1.00 | 9.53 / 1.00 | 9.57 / 1.00 |
| WISE | 4.93e1 / 0.98 | 1.46 / 0.95 | 2.10 / 0.99 | 3.20 / 0.99 | 1.40 / 1.00 | 2.56 / 0.94 | 1.31 / 0.99 | 5.21 / 0.93 |
| AlphaEdit | 1.58 / 1.00 | 3.12 / 0.98 | 5.97 / 0.93 | 8.49e3 / 0.05 | 1.75 / 1.00 | 1.76 / 1.00 | 2.87 / 0.98 | 1.70e2 / 0.88 |
| MEMOIR | 1.00 / 1.00 | 1.01 / 1.00 | 1.07 / 1.00 | 1.25 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |

Rebuttal

We thank Reviewer HSS5 for their valuable feedback. We are pleased that they found our paper well-motivated, clearly written, our method both solid and novel, and the empirical evaluation extensive. Below, we address their specific comments and questions.

W1/Q1: Editing a single FFN projection layer Although arguably the simplest design choice, our experiments show that editing a single FFN projection layer is highly effective for MEMOIR. Crucially, it achieves significantly higher performance than prior methods that edit multiple layers (MEMIT, AlphaEdit), while using 5x fewer trainable parameters, resulting in much lower computational cost. We believe this is a major strength of our method since it combines improved performance with significantly fewer trainable parameters. The reviewer's suggestion to extend to multiple layers is indeed a promising direction, which we leave for future work. It is reasonable to expect such an extension to enhance generalization in tasks involving multi-hop reasoning and to allow for more complex edits.

Furthermore, we kindly refer the reviewer to our response to reviewer DSuA, where we present a new experiment evaluating MEMOIR on RippleEdits [1], a benchmark designed to assess two-hop reasoning in knowledge editing. Notably, despite editing only a single layer, MEMOIR achieves the strongest generalization performance in this multi-hop reasoning setting, outperforming baseline methods that modify multiple layers, such as MEMIT and AlphaEdit.

W2/Q2: Robustness of TopHash-based routing We selected the hyperparameters for the number of active indices k and the mask similarity threshold τ based on performance over 1000 edits on ZsRE. As shown below, both parameters τ and k remain highly robust across varying edit counts, and generalize well to other tasks. In MEMOIR, k controls the trade-off between edit specificity and interference: larger k better captures individual edits, while smaller k reduces forgetting. Despite this trade-off, MEMOIR remains stable across a wide range of k: we use the same k = 4096 across different model architectures (LLaMA-2, LLaMA-3, Mistral), tasks (ZsRE QA, SelfCheckGPT), and edit scales (from 1 to 7000 edits). As shown in Figure 6, the reliability remains high for k in [1024, 8192], indicating strong robustness and ease of tuning. Similarly, τ is set once per model to reflect differences in representation spaces and remains fixed across all tasks and edit counts. This consistent performance demonstrates the generalizability of our TopHash-based routing strategy.
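
For illustration, the threshold-based routing decision can be sketched as follows (the overlap-based similarity measure and the stored-mask layout are illustrative assumptions, not our exact implementation):

```python
import torch

def route_to_memory(query_mask, edit_masks, tau):
    """Threshold-based routing: activate the residual memory only if the query's
    TopHash mask is similar enough to some stored edit mask.

    query_mask : (d,) bool        mask of the incoming prompt
    edit_masks : (num_edits, d)   bool masks stored during editing
    tau        : float            mask similarity threshold (fixed per model)
    """
    k = query_mask.sum().float()
    similarity = (edit_masks & query_mask).sum(dim=1).float() / k  # shared-index fraction
    best = similarity.argmax()
    if similarity[best] >= tau:
        return int(best)   # activate the memory associated with this edit
    return None            # irrelevant prompt: fall back to the unedited model
```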

Q2: Learning a mask similarity threshold We thank the reviewer for the suggestion to introduce a learnable thresholding mechanism. We did not adopt it because we want to keep the simplicity of our method with minimal requirements. Indeed, MEMOIR has demonstrated substantial performance gains over all baselines, including the ones that introduce a learnable router mechanism (e.g., WISE and MEND). Moreover, the use of an adaptive router assumes the availability of semantically aligned or paraphrased prompts during training, which introduces extra computational costs and limits the adaptability of the method. That said, learnable routing is a promising addition that could further strengthen MEMOIR's strong generalization across irrelevant prompts, which we leave for future work.

Q3: Scalability to growing number of edits In terms of the scalability of the memory footprint, the residual memory, initialized as a clone of the original FFN projection layer, has a fixed size and does not grow with the number of edits. Concretely, MEMOIR only requires a copy of a single linear layer in an LLM and one binary (rather than float32) mask per edit. For a LLaMA-3-8B model and 7000 edits, this means a memory increase of 0.7% and 0.1% for parameters and masks, respectively, resulting in an overall increase of 0.8%.
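
As a rough sanity check of these percentages (assuming the edited layer is the 14336 x 4096 FFN down-projection of LLaMA-3-8B and that each binary mask spans the 14336-dimensional intermediate activation; both shapes are assumptions for illustration):

```python
# Back-of-the-envelope check of the reported memory overhead (approximate sizes).
total_params  = 8.03e9             # LLaMA-3-8B parameter count
memory_params = 14336 * 4096       # one extra linear layer: ~58.7M parameters
num_edits     = 7000
mask_bits     = num_edits * 14336  # one binary mask per edit

param_overhead = memory_params / total_params       # ~0.0073 -> ~0.7%
mask_overhead  = mask_bits / (total_params * 16)     # vs. 16-bit weights -> ~0.08% (~0.1%)
print(f"parameters: {param_overhead:.2%}, masks: {mask_overhead:.2%}")
```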

In terms of the scalability of performance, a key feature of MEMOIR is its ability to incorporate a large number of edits with minimal forgetting. This is achieved through the TopHash knowledge distribution mechanism, which mitigates catastrophic forgetting during sequential edits. Empirically, MEMOIR maintains strong performance even with 7000 edits, showing no obvious degradation. To further illustrate the scalability of MEMOIR to a large number of edits, we conducted additional evaluations with 11k and 15k edits on the ZsRE QA dataset. As shown in the table, MEMOIR maintains stable performance with no collapse up to 15k edits. These experiments showcase the superior robustness of MEMOIR to longer edit sequences; while baselines suffer in one or even all metrics, MEMOIR maintains strong performance even for 15k edits.

| Method | 11k (Rel. / Gen. / Loc. / Avg.) | 15k (Rel. / Gen. / Loc. / Avg.) |
|---|---|---|
| GRACE | 0.99 / 0.28 / 1.00 / 0.76 | 0.99 / 0.28 / 1.00 / 0.76 |
| WISE | 0.44 / 0.42 / 1.00 / 0.62 | 0.43 / 0.41 / 1.00 / 0.61 |
| AlphaEdit | 0.01 / 0.01 / 0.00 / 0.01 | 0.02 / 0.02 / 0.00 / 0.01 |
| MEMOIR | 0.89 / 0.81 / 0.99 / 0.90 | 0.87 / 0.78 / 0.98 / 0.88 |

Clearly, if the number of edits grows indefinitely, at a certain point it becomes infeasible to store all knowledge within a single memory. A possible improvement for such cases, inspired by the reviewer’s suggestion, is to store edits across multiple memories and compress them into a single memory before inference. This can be achieved by pruning redundancies within each memory and then merging them while minimizing interference, similar in spirit to the strategy proposed in Ties-Merging [2]. We believe this line of work is promising for scaling knowledge editing beyond current memory limits.

L1: Societal impact of memory-based editing We appreciate the reviewer’s thoughtful observation on the societal implications of our method. Indeed, memory-based knowledge editing has far-reaching implications. On the positive side, it enables rapid correction of misinformation, dynamic adaptation to evolving world knowledge, and alignment with ethical standards, especially important for applications in education, healthcare, and safety-critical domains. However, as the reviewer rightly notes, these capabilities also raise concerns. The same techniques could be exploited for malicious purposes, such as inserting biased content, manipulating facts, or selectively erasing information to influence user beliefs or public opinion. Following the reviewer’s suggestion, we will include a dedicated section in the revised manuscript.

We thank the reviewer again for their constructive feedback and hope our clarifications help address their concerns. We would appreciate a reconsideration of the score in light of these responses.

[1] Evaluating the Ripple Effects of Knowledge Editing in Language Models, Cohen et al., TACL 2024.

[2] TIES-Merging: Resolving Interference When Merging Models, Yadav et al., NeurIPS 2023.

Comment

Thank you for your responses and clarifications. The authors answered all my questions and addressed my concerns. I will update the score to 4 and recommend acceptance.

Review (Rating: 5)

The paper introduces MEMOIR, a scalable framework for lifelong model editing that updates large language models (LLMs) with new knowledge while minimizing forgetting of previous edits. MEMOIR uses a residual memory module with sparse, data-dependent activations to isolate edits, combined with a conditional inference mechanism that activates only relevant memory for semantically similar prompts. Through experiments on tasks like question answering, hallucination correction, and out-of-distribution generalization using models such as LLaMA-3 and Mistral, MEMOIR shows superior performance in reliability, generalization, and locality, scaling up to 7000 edits with minimal performance degradation.

Strengths and Weaknesses

Strengths:

MEMOIR addresses a fundamental challenge in LLMs—efficient post-hoc knowledge editing—with a well-motivated, technically sound approach. By sparsifying activation via TopHash and storing updates in a dedicated residual memory, the framework avoids catastrophic forgetting while supporting generalization to paraphrased queries. Its conditional knowledge activation at inference is novel, precise, and efficient, outperforming strong baselines (e.g., WISE, AlphaEdit) across multiple metrics and LLM architectures. The scalability to 7000 sequential edits with stable performance demonstrates practical robustness.

The paper is thorough in empirical evaluation, comparing MEMOIR to a broad set of baselines across diverse tasks. It supports claims with detailed ablations (e.g., removing conditional activation or varying sparsity levels), showing clear advantages of the proposed design. The clarity of writing, theoretical intuitions, and visualizations (e.g., mask overlap distributions and reliability curves) further enhance the paper’s readability and impact.

Weaknesses: Despite its strengths, MEMOIR is still limited by its scope. It edits only a single FFN projection layer, which may restrict its ability to handle deeply integrated or abstract knowledge. This could be problematic for tasks involving multi-hop reasoning or systemic model behaviors. Additionally, the TopHash-based sparsity strategy, while effective, introduces a fixed permutation mechanism that may lack adaptiveness, and its dependence on mask similarity thresholds might not generalize robustly across all tasks or model types.

Questions

This paper chooses a fixed threshold τ for determining whether a prompt is relevant to stored edits. It might be worthwhile to consider introducing a learnable or context-sensitive thresholding mechanism (e.g., trained via meta-learning or reinforcement learning) to better distinguish semantically similar prompts from unrelated ones.

Limitations

yes

Final Justification

The authors' rebuttal addressed all three issues raised in my review, so I have no further questions. I maintain my score of "Accept".

Formatting Issues

none

Author Response

General comment

First, we would like to respectfully bring to the reviewer’s attention a correction in our reported results due to a technical oversight. Specifically, when computing TopHash indices during inference, we inadvertently included label information by using features of the concatenated [prompt, label] sequence.

We have corrected this by recomputing TopHash indices using centered features averaged over prompt tokens only. This affected the semantic representation used for similarity computations and the subsequent routing step, impacting mainly the generalization metric for the ZsRE QA task with LLaMA-3. In contrast, reliability and locality are only minimally affected. We see no or negligible degradation in the remaining combinations of benchmarks and models across all metrics.

The revised results are provided below. Importantly, none of our previous conclusions are affected. In particular, in all settings, MEMOIR performance remains state-of-the-art, consistently outperforming all baselines across all metrics with a strong margin over the second-best methods.

Table 1: Q&A task results. Each cell shows Rel. / Gen. / Loc. / Avg.

LLaMA-3-8B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.98 / 1.00 / 0.99 | 0.99 / 0.96 / 1.00 / 0.98 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 |
| GRACE | 1.00 / 0.46 / 1.00 / 0.82 | 1.00 / 0.42 / 1.00 / 0.81 | 1.00 / 0.39 / 1.00 / 0.80 | 1.00 / 0.37 / 1.00 / 0.79 |
| WISE | 0.92 / 0.84 / 1.00 / 0.92 | 0.78 / 0.74 / 0.98 / 0.83 | 0.62 / 0.60 / 1.00 / 0.74 | 0.66 / 0.64 / 1.00 / 0.77 |
| AlphaEdit | 0.98 / 0.89 / 1.00 / 0.96 | 0.93 / 0.85 / 0.98 / 0.92 | 0.91 / 0.79 / 0.94 / 0.88 | 0.84 / 0.77 / 0.56 / 0.72 |
| MEMOIR | 1.00 / 1.00 / 1.00 / 1.00 | 0.97 / 0.89 / 1.00 / 0.95 | 0.96 / 0.89 / 1.00 / 0.95 | 0.94 / 0.85 / 1.00 / 0.93 |

Mistral-7B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.93 / 1.00 / 0.98 | 0.98 / 0.91 / 1.00 / 0.96 | 0.96 / 0.89 / 1.00 / 0.95 | 0.93 / 0.87 / 1.00 / 0.93 |
| GRACE | 1.00 / 0.36 / 1.00 / 0.79 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.02 / 1.00 / 0.67 |
| WISE | 0.98 / 0.97 / 1.00 / 0.98 | 0.92 / 0.89 / 1.00 / 0.94 | 0.87 / 0.80 / 1.00 / 0.89 | 0.70 / 0.67 / 1.00 / 0.79 |
| AlphaEdit | 0.83 / 0.77 / 1.00 / 0.87 | 0.87 / 0.75 / 0.99 / 0.87 | 0.86 / 0.74 / 0.95 / 0.85 | 0.85 / 0.72 / 0.68 / 0.75 |
| MEMOIR | 1.00 / 0.99 / 1.00 / 1.00 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 | 0.94 / 0.89 / 1.00 / 0.94 |

Table 2: Hallucination correction task results. Each cell shows Rel. / Loc.

| Method | LLaMA-3-8B T=1 | T=10 | T=100 | T=600 | Mistral-7B T=1 | T=10 | T=100 | T=600 |
|---|---|---|---|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 1.00 | 1.01 / 1.00 | 1.09 / 1.00 | 1.37 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |
| GRACE | 1.05 / 1.00 | 7.10e1 / 1.00 | 7.12e1 / 1.00 | 7.73e1 / 1.00 | 1.39 / 1.00 | 5.97 / 1.00 | 9.53 / 1.00 | 9.57 / 1.00 |
| WISE | 4.93e1 / 0.98 | 1.46 / 0.95 | 2.10 / 0.99 | 3.20 / 0.99 | 1.40 / 1.00 | 2.56 / 0.94 | 1.31 / 0.99 | 5.21 / 0.93 |
| AlphaEdit | 1.58 / 1.00 | 3.12 / 0.98 | 5.97 / 0.93 | 8.49e3 / 0.05 | 1.75 / 1.00 | 1.76 / 1.00 | 2.87 / 0.98 | 1.70e2 / 0.88 |
| MEMOIR | 1.00 / 1.00 | 1.01 / 1.00 | 1.07 / 1.00 | 1.25 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |

Rebuttal

We thank reviewer ftpB for their valuable feedback. We appreciate that they find our paper well-motivated, technically sound, thorough in empirical evaluation, and with strong performance. Below, we address their specific comments and questions.

W1: Editing a single FFN projection layer Although arguably the simplest design choice, our experiments show that editing a single FFN projection layer is highly effective for MEMOIR. Crucially, it achieves significantly higher performance than prior methods that edit multiple layers (MEMIT, AlphaEdit), while using 5x fewer trainable parameters, resulting in much lower computational cost. We believe this is a major strength of our method since it combines improved performance with significantly fewer trainable parameters. The reviewer’s suggestion to extend to multiple layers is indeed a promising direction, which we leave for future work. Indeed, it’s reasonable to expect such an extension to enhance performance in tasks involving multi-hop reasoning or systematic model behaviors.

Furthermore, we kindly refer the reviewer to our response to reviewer DSuA, where we present a new experiment evaluating MEMOIR on RippleEdits [1], a benchmark designed to assess two-hop reasoning in knowledge editing. Notably, despite editing only a single layer, MEMOIR achieves the strongest generalization performance in this multi-hop reasoning setting, outperforming baseline methods that modify multiple layers, such as MEMIT and AlphaEdit.

Adaptability of the TopHash mechanism In MEMOIR, TopHash distributes parameter updates across edits. The fixed permutation mechanism is tasked with redistributing the active indices from the influential parameters to less critical ones, which has been shown to be highly effective in reducing catastrophic forgetting. We note that this mechanism is highly adaptable across diverse settings. We apply TopHash across multiple model types (LLaMA-2, LLaMA-3, Mistral, GPT-J), tasks (ZsRE QA, SelfCheckGPT hallucination correction, temporal OOD generalization), and scales (from a single edit up to 7000 edits), where it consistently demonstrates strong performance and adaptability.
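
A minimal sketch of this redistribution idea (the ranking statistic, the dimensions, and where the permutation is applied are illustrative assumptions, not our exact implementation):

```python
import torch

d, k = 14336, 4096                 # feature dimension and active indices (illustrative)
fixed_perm = torch.randperm(d)     # drawn once and reused for every edit

def tophash_mask(centered_feature):
    """Pick the k most influential coordinates, then remap them through the
    fixed permutation so updates land on less critical parameters."""
    top = torch.topk(centered_feature.abs(), k).indices
    mask = torch.zeros(d, dtype=torch.bool)
    mask[fixed_perm[top]] = True   # redistributed active index set
    return mask
```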

W1/Q1: Generalization of mask similarity threshold The mask similarity threshold τ is used to determine whether a given prompt is semantically relevant to previously edited prompts. In our experiments, τ varies only across different model architectures due to intrinsic differences in their representation spaces. For each model, we apply a fixed τ across all experiments, including different tasks (ZsRE QA and SelfCheckGPT hallucination correction) and varying numbers of edits (from 1 to 7000), demonstrating its robustness and generalization across diverse testing scenarios.

While we appreciate this suggestion to introduce a learnable thresholding mechanism, we opted for a simpler design to ensure minimal requirements and broad applicability. Indeed, MEMOIR has demonstrated substantial performance gains over all baselines, including the ones that introduce a learnable router mechanism (e.g., WISE learns it with contrastive learning, and MEND learns it with meta-learning). However, the use of an adaptive router assumes the availability of semantically aligned or paraphrased prompts during training, which introduces extra computational costs and limits the adaptability of the method. Yet, learnable routing is a promising addition that could further strengthen MEMOIR’s strong generalization across irrelevant prompts, which we leave for future work.

We hope this effectively addresses the reviewer’s concerns and sincerely appreciate their positive assessment. We remain available for any further discussion or clarification.

[1] Evaluating the Ripple Effects of Knowledge Editing in Language Models, Cohen et al., TACL 2024.

Review (Rating: 4)

In this paper, the authors introduce MEMOIR, which adds a memory module, i.e., a dedicated fully-connected layer in a single transformer block, where all edits are performed. MEMOIR mitigates catastrophic forgetting by allocating distinct parameter subsets to each edit and retrieving them during inference to activate only the relevant knowledge for a given prompt. The authors propose a structured sparsification of the input to this layer, dynamically activating only a sample-specific subset of parameters in the introduced memory module in the forward pass. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.

Strengths and Weaknesses

Strengths:

  1. The paper is overall well written, well motivated with good experimentation coverage.

  2. The method's ability to dynamically activate or deactivate relevant knowledge based on prompt type (edited, rephrased, or irrelevant) effectively eliminates the need for large corpora of irrelevant samples during editing, which is often required by prior approaches to enhance locality. This is a strong differentiation with practical applications in real life.

  3. The framework demonstrates strong scalability, extending the state-of-the-art edit horizon to 7000 edits while delivering superior performance in challenging sequential single-edit settings.

Weaknesses:

  1. The performance of MEMOIR is sensitive to the number of active indices k used in TopHash. It appears crucial for balancing the model's ability to capture edits, prevent excessive overwriting, and maintain the quality of semantic relevance for routing. Although the authors discuss this in the ablation, tuning this hyperparameter k might be challenging in practice.

  2. While the authors claim that MEMOIR is scalable, the computational efficiency and memory footprint for extremely large LLMs (beyond 8B parameters) or significantly more edits than 7000 are not explicitly detailed. However, the method's design with a residual memory and sparse updates suggests better scaling than full fine-tuning.

  3. The authors acknowledge that the paper does not present formal theoretical results or proofs, focusing instead on empirical validation. While the empirical results are strong, a theoretical foundation could further solidify the understanding of why MEMOIR performs well.

Questions

Questions:

  1. The paper focuses on factual knowledge editing. An interesting extension to the studied question is how MEMOIR would perform on other types of model updates, such as stylistic changes or ethical alignment, which are more subtle than facts and not necessarily true/false.

  2. How does the fixed random permutation in TopHash affect the computational overhead during editing and inference, and is there a trade-off between the randomness introduced and the speed of the operation?

Limitations

Please refer to Strengths and Weaknesses.

Formatting Issues

N/A

Author Response

General comment

First, we would like to respectfully bring to the reviewer’s attention a correction in our reported results due to a technical oversight. Specifically, when computing TopHash indices during inference, we inadvertently included label information by using features of the concatenated [prompt, label] sequence.

We have corrected this by recomputing TopHash indices using centered features averaged over prompt tokens only. This affected the semantic representation used for similarity computations and the subsequent routing step, impacting mainly the generalization metric for the ZsRE QA task with LLaMA-3. In contrast, reliability and locality are only minimally affected. We see no or negligible degradation in the remaining combinations of benchmarks and models across all metrics.

The revised results are provided below. Importantly, none of our previous conclusions are affected. In particular, in all settings, MEMOIR performance remains state-of-the-art, consistently outperforming all baselines across all metrics with a strong margin over the second-best methods.

Table 1: Q&A task results. Each cell shows Rel. / Gen. / Loc. / Avg.

LLaMA-3-8B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.98 / 1.00 / 0.99 | 0.99 / 0.96 / 1.00 / 0.98 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 |
| GRACE | 1.00 / 0.46 / 1.00 / 0.82 | 1.00 / 0.42 / 1.00 / 0.81 | 1.00 / 0.39 / 1.00 / 0.80 | 1.00 / 0.37 / 1.00 / 0.79 |
| WISE | 0.92 / 0.84 / 1.00 / 0.92 | 0.78 / 0.74 / 0.98 / 0.83 | 0.62 / 0.60 / 1.00 / 0.74 | 0.66 / 0.64 / 1.00 / 0.77 |
| AlphaEdit | 0.98 / 0.89 / 1.00 / 0.96 | 0.93 / 0.85 / 0.98 / 0.92 | 0.91 / 0.79 / 0.94 / 0.88 | 0.84 / 0.77 / 0.56 / 0.72 |
| MEMOIR | 1.00 / 1.00 / 1.00 / 1.00 | 0.97 / 0.89 / 1.00 / 0.95 | 0.96 / 0.89 / 1.00 / 0.95 | 0.94 / 0.85 / 1.00 / 0.93 |

Mistral-7B:

| Method | T=1 | T=10 | T=100 | T=1000 |
|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 0.93 / 1.00 / 0.98 | 0.98 / 0.91 / 1.00 / 0.96 | 0.96 / 0.89 / 1.00 / 0.95 | 0.93 / 0.87 / 1.00 / 0.93 |
| GRACE | 1.00 / 0.36 / 1.00 / 0.79 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.15 / 1.00 / 0.72 | 1.00 / 0.02 / 1.00 / 0.67 |
| WISE | 0.98 / 0.97 / 1.00 / 0.98 | 0.92 / 0.89 / 1.00 / 0.94 | 0.87 / 0.80 / 1.00 / 0.89 | 0.70 / 0.67 / 1.00 / 0.79 |
| AlphaEdit | 0.83 / 0.77 / 1.00 / 0.87 | 0.87 / 0.75 / 0.99 / 0.87 | 0.86 / 0.74 / 0.95 / 0.85 | 0.85 / 0.72 / 0.68 / 0.75 |
| MEMOIR | 1.00 / 0.99 / 1.00 / 1.00 | 0.97 / 0.94 / 1.00 / 0.97 | 0.95 / 0.91 / 1.00 / 0.95 | 0.94 / 0.89 / 1.00 / 0.94 |

Table 2: Hallucination correction task results. Each cell shows Rel. / Loc.

| Method | LLaMA-3-8B T=1 | T=10 | T=100 | T=600 | Mistral-7B T=1 | T=10 | T=100 | T=600 |
|---|---|---|---|---|---|---|---|---|
| MEMOIR (before) | 1.00 / 1.00 | 1.01 / 1.00 | 1.09 / 1.00 | 1.37 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |
| GRACE | 1.05 / 1.00 | 7.10e1 / 1.00 | 7.12e1 / 1.00 | 7.73e1 / 1.00 | 1.39 / 1.00 | 5.97 / 1.00 | 9.53 / 1.00 | 9.57 / 1.00 |
| WISE | 4.93e1 / 0.98 | 1.46 / 0.95 | 2.10 / 0.99 | 3.20 / 0.99 | 1.40 / 1.00 | 2.56 / 0.94 | 1.31 / 0.99 | 5.21 / 0.93 |
| AlphaEdit | 1.58 / 1.00 | 3.12 / 0.98 | 5.97 / 0.93 | 8.49e3 / 0.05 | 1.75 / 1.00 | 1.76 / 1.00 | 2.87 / 0.98 | 1.70e2 / 0.88 |
| MEMOIR | 1.00 / 1.00 | 1.01 / 1.00 | 1.07 / 1.00 | 1.25 / 1.00 | 1.00 / 1.00 | 1.02 / 1.00 | 1.09 / 1.00 | 1.22 / 1.00 |

Rebuttal

We thank the reviewer for their valuable feedback. We appreciate their recognition that our framework is both scalable and achieves superior performance in sequential edits, and we are happy that they found the paper well-written, the motivation clear, and the experiments comprehensive. Below, we address their specific comments in detail.

W1: Choice of the number of active indices k In the framework of MEMOIR, the choice of the number of active indices k results in a trade-off: the higher k, the easier it is to capture individual edits; the lower k, the less forgetting due to reduced interference with previous edits. Thus, the optimal k in our experiments is chosen based on the average reliability after a sequence of edits. However, we note that the choice of k is globally robust across different settings. For example, we use the same k = 4096 for different models (LLaMA-2, LLaMA-3, and Mistral), different tasks (the ZsRE QA dataset and the SelfCheckGPT hallucination-correction task), and different numbers of edits (from a single edit to a total of 7000 edits). This demonstrates strong robustness of the choice of k across testing scenarios. Furthermore, we also showed in Figure 6 that the performance of MEMOIR is not sensitive to the choice of k, with the reliability metric remaining high across a wide range of k (from 1024 to 8192). Therefore, while k is an important hyperparameter in the framework, MEMOIR's performance remains stable across a broad range of values, and we expect hyperparameter tuning to be straightforward in practical settings.

W2: Memory footprint and scalability to large LLMs and number of edits We thank the reviewer for highlighting the computational efficiency of our method. In the revised version, we will include a dedicated section in the Appendix discussing the computational and memory footprint of MEMOIR.

In terms of memory footprint, MEMOIR only requires a copy of a single linear layer in an LLM and one lightweight binary (rather than float32) mask per edit. Specifically, for a LLaMA-3-8B model and 7000 edits, this means a memory increase of 0.7% and 0.1% for parameters and masks, respectively, resulting in an overall increase of 0.8%. Similarly, for LLaMA-3-70B, the increase would be 0.3% for parameters and less than 0.05% for masks. Importantly, scale favors MEMOIR's memory footprint, as larger models incur a smaller (percentage-wise) memory overhead.

In terms of computational efficiency, MEMOIR is more efficient than prior methods that perform parameter updates across multiple layers (e.g., MEMIT, AlphaEdit), as it only modifies a single MLP projection layer, while still achieving higher performance.

To further illustrate the scalability of MEMOIR to a large number of edits, we conducted additional evaluations with 11k and 15k edits on the ZsRE QA dataset. As shown in the table, MEMOIR maintains stable performance with no collapse up to 15k edits. These experiments showcase the superior robustness of MEMOIR to longer edit sequences; while baselines suffer in one or even all metrics, MEMOIR maintains strong performance even for 15k edits.

| Method | 11k (Rel. / Gen. / Loc. / Avg.) | 15k (Rel. / Gen. / Loc. / Avg.) |
|---|---|---|
| GRACE | 0.99 / 0.28 / 1.00 / 0.76 | 0.99 / 0.28 / 1.00 / 0.76 |
| WISE | 0.44 / 0.42 / 1.00 / 0.62 | 0.43 / 0.41 / 1.00 / 0.61 |
| AlphaEdit | 0.01 / 0.01 / 0.00 / 0.01 | 0.02 / 0.02 / 0.00 / 0.01 |
| MEMOIR | 0.89 / 0.81 / 0.99 / 0.90 | 0.87 / 0.78 / 0.98 / 0.88 |

W3: Theory As also acknowledged by the reviewer, MEMOIR is methodologically well-motivated and supported by strong empirical performance, including ablation studies that clarify the method’s design and its performance gains. It is further inspired by a long line of work on catastrophic forgetting [1,2,3], which provides a solid foundation for sparsely allocating memory to mitigate forgetting in lifelong learning scenarios. Formal analysis of LLMs remains challenging with current tools or requires strong assumptions that limit practical relevance. Following standard practice in knowledge editing (e.g., GRACE, WISE) and broader LLM research, we prioritize strong empirical evidence guided by intuition.

Q1: Other types of model updates We thank the reviewer for the helpful suggestion. MEMOIR addresses a specific problem, i.e., factual knowledge editing in LLMs, building on prior work and focusing evaluation on modifying stored factual knowledge. While our motivation and experiments follow the knowledge editing literature, MEMOIR is a general framework that, in theory, can apply to any setting where modifying intermediate representations is beneficial. Following ROME’s insight that factual knowledge is stored in specific layers, subsequent knowledge editing methods (including ours) focus on one or a few layers. Similarly, prior work has shown that editing intermediate or final representations can be effective for stylistic control and ethical alignment [4,5]. As MEMOIR operates on intermediate layers, it shares similarities with these methods and we expect it to generalize to such applications.

Q2: Computational overhead The fixed random permutation in our experiments introduces negligible computational overhead during both editing and inference. For instance, the fixed permutations and TopK computations in the TopHash algorithm take just 1.5 seconds out of a total 11-minute runtime, accounting for only 0.23% of the total execution time. Given this negligible cost, there is no meaningful trade-off between the added randomness and computational efficiency.

We hope this effectively addresses the reviewer’s concerns and remain available for any further discussion or clarification.

[1] Dropout as an implicit gating mechanism for continual learning, Mirzadeh et al., CVPR 2020 Workshop.

[2] PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning, Mallya et al., CVPR 2018.

[3] Supermasks in Superposition, Wortsman et al., NeurIPS 2020.

[4] Style Vectors for Steering Generative Large Language Models, Konen et al., EACL 2024.

[5] Spectral Editing of Activations for Large Language Model Alignment, Qiu et al., NeurIPS 2024.

Final Decision

This paper studies lifelong model editing for large language models, where targeted facts are updated in language models multiple times in a row. The authors propose a novel method and demonstrate that it outperforms a large set of baselines on benchmark tasks when editing multiple language models. The work was well-received by the reviewers, who appreciated the problem as important and hard, found the solution innovative and interesting, and also found the paper to be well-written and clear. The main negatives were largely acknowledged and resolved during the discussions, though I urge the authors to incorporate some of the identified weaknesses into the paper (e.g., editing just one layer). During the rebuttal period, the authors also identified and resolved a bug that led to new results, though the main findings remain true. Overall, I recommend accepting the paper because the strengths outweigh the weaknesses and will hopefully drive further work in this important area.