PaperHub
5.5 / 10
Decision: Rejected · 4 reviewers
Ratings: lowest 5, highest 6, standard deviation 0.5
Individual ratings: 5, 6, 6, 5
Confidence: 3.3
Correctness: 2.8
Contribution: 2.8
Presentation: 2.5
ICLR 2025

Resolving Lexical Bias in Edit Scoping with Projector Editor Networks

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

PENME is a model editing technique that overcomes limitations of distance-based scoping by using a projector network. It effectively handles lexical biases, improving performance while remaining efficient and adaptable.

Abstract

Keywords
Representation Learning, Model Editing and LLMs

Reviews and Discussion

Review
Rating: 5

The paper proposes to address lexical bias in continual model editing (i.e., token similarity affecting edit decisions). The framework is similar to existing clustering-based setups (e.g., GRACE [1]). This seems to be achieved by explicitly training a projection network and discouraging the model from exploiting lexical correlations. The paper shows improved performance on CounterFact and zsRE.

[1] Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors (Hartvigsen et al., 2023)

Strengths

The paper highlights the problem of lexical bias in clustering-based editing approaches, which can raise awareness of this particular issue.

Weaknesses

It is difficult to tell exactly what parts of the paper are novel contributions. I think the main difference is that in GRACE the codebook representations are manually maintained whereas here they're learned. The related work section needs to tell the reader why this work is different from previous works, not just describe them.

Questions

N/A

Comment

We thank Reviewer uwxU for their time and feedback, which we address below.

1. It is difficult to tell exactly what parts of the paper are novel contributions. I think the main difference is that in GRACE the codebook representations are manually maintained whereas here they're learned. The related work section needs to tell the reader why this work is different from previous works, not just describe them.

Novelty: This is the first work that identifies the issue of lexical bias in model editing. The explicit modelling of a representation space designed to improve paraphrase success while minimizing mismatch with neighbouring examples is a novel solution to the problem at hand. To the best of our knowledge, PENME is the only model editing method designed to maintain both high locality and generalization. The results in Table 1 demonstrate the efficacy of PENME where none of the other methods, whether weight-preserving or weight-modifying, improves both locality and generalization. We have highlighted the novelty of our work in the introduction. We have further improved the Related Work section with a more explicit comparison between our method and other model-editing methods.

The key distinction between GRACE [1] and PENME lies in their approach to model representations: GRACE utilizes a codebook with cached representations as keys, whereas PENME employs learned representations generated by a projection network. GRACE adopts small thresholds or deferral radii to address challenges with neighbourhood prompts; however, this approach is less effective when lexically similar prompts are in close proximity. Additionally, maintaining small radii necessitates storing paraphrases as extra codebook entries alongside their corresponding edits to enhance generalization, which significantly increases retrieval time. The radius-based design further introduces the risk of edit forgetting, as overlapping cluster radii for similar edits necessitate the resizing of radii. This creates a trade-off between locality and generalization, where small radii preserve locality at the expense of generalization, and larger radii enhance generalization but reduce edit localization. In contrast, PENME effectively addresses these issues by managing lexically similar prompts more robustly and leveraging its learned representation space to restrict each edit to a single codebook entry, thereby simplifying storage and retrieval processes.

To illustrate the aforementioned challenges in GRACE and how PENME alleviates these issues, we present an ablation experiment comparing PENME with GRACE across various sample sizes of edits, ranging from 50 to 300 in increments of 50, on the Counterfact dataset, as shown in the table below. The results indicate that PENME outperforms GRACE in speed due to the number of codebook entries, while also revealing key shortcomings in their scoping methods. Specifically, the 'Edits Forgotten' and 'Edit Conflict' columns highlight the significant number of keys lost during training due to conflicts in the deferral radii of edits.

| Number of Edits | PENME Runtime (ms) | MELO/GRACE Runtime (ms) | MELO/GRACE Codebook Entries | Edits Forgotten | Edit Conflicts |
|---|---|---|---|---|---|
| 50  | 0.024 ± 0.003 | 0.316 ± 0.090 | 269  | 24  | 21  |
| 100 | 0.115 ± 0.129 | 0.364 ± 0.050 | 523  | 77  | 66  |
| 150 | 0.188 ± 0.182 | 0.624 ± 0.082 | 785  | 132 | 114 |
| 200 | 0.279 ± 0.170 | 1.423 ± 0.180 | 1048 | 188 | 169 |
| 250 | 0.404 ± 0.170 | 1.681 ± 0.205 | 1319 | 254 | 217 |
| 300 | 0.418 ± 0.125 | 2.149 ± 1.069 | 1554 | 301 | 268 |

We have updated Section (RELATED WORK) to highlight this better.

Comment

We recognize that the primary concern of the reviewer was about what makes our work novel, as well as its differences from previous related work. We hope that our previous rebuttal was able to show exactly what our novel contributions are and how they relate to previous work. We thank the reviewer for raising this point and have added clarity to our revisions.

Please let us know before the end of the rebuttal period if there is anything further we can clarify.

Review
Rating: 6

The paper introduces Projector Editor Networks for Model Editing (PENME), a novel approach to improving large language model editing techniques. PENME addresses the problem of incorrect edits being triggered by irrelevant prompts that contain similar words, using contrastive learning to create an optimized representation space. This space allows precise localization of edits by maintaining distance between irrelevant prompts while keeping paraphrases close. The empirical study demonstrates that PENME achieves strong results in model editing.

Strengths

  • Proposes lexical bias in model editing, which is a new aspect for improving the performance of model editing.
  • Proposes a projection network that maps the model’s representation space to a new representation space where lexical dominance is minimized.

Weaknesses

  • This paper suggests that lexical bias refers to different editing subjects with the same relation, such as "The twin city of Pittsburgh is" and "The twin city of Portsmouth is." However, the prevalence of such cases in the CounterFact and ZsRE datasets is unclear. 
  • Figure 3 and Figure 7 illustrate the "Percentage of samples where edits are closer to unrelated neighbors," but this is insufficient to demonstrate lexical bias. At lower model layers, high similarity may result from underdeveloped sentence representations, while at higher layers, the reduced percentage indicates greater differentiation between sentences. 
  • The results in Table 1 show that GRACE is a strong baseline. PENME, which extends GRACE by using a projection network to map data representations, needs to clearly highlight the differences between PENME and GRACE. 
  • PENME focuses on addressing lexical bias, so it should perform well on Loc and Para. However, in Table 1, only Para shows improvement, which is insufficient to fully support the paper's contributions.

Questions

  • It would be better to give more examples showing what lexical bias and lexical overlap are in the paper.
  • Some results in Table 1 are not clear (close to 0.0); for example, why does GRACE on zsRE get 0.00 in Loc on Llama2-7b?
  • In Figure 6, it would be better to add the results for GRACE.
Comment

2. The results in Table 1 show that GRACE is a strong baseline. PENME, which extends GRACE by using a projection network to map data representations, needs to clearly highlight the differences between PENME and GRACE.

Please refer to our central response for clarification.

3. Figure 3 and Figure 7 illustrate the "Percentage of samples where edits are closer to unrelated neighbors," but this is insufficient to demonstrate lexical bias. At lower model layers, high similarity may result from underdeveloped sentence representations, while at higher layers, the reduced percentage indicates greater differentiation between sentences.

We agree that the lower layers of a model may have underdeveloped sentence representations, while higher layers may have developed a better sense of semantic similarity. The Figure 3 bar chart shows that lexical bias exists in higher layers as well; however, it is relatively low compared to lower layers. Moreover, the issue of lexical bias has also been observed in a recent NeurIPS paper [1], which specifically analyzed the last-layer representations of a diverse set of large language models.

4. PENME focuses on addressing lexical bias, so it should perform well on Loc and Para. However, in Table 1, only Para shows improvement, which is insufficient to fully support the paper's contributions.

PENME achieves a balanced outcome, delivering high scores in both generalization and locality without compromising performance in either aspect. The deferral radius in GRACE [2] and MELO [3], which utilize the same retrieval system, is dynamically adjusted: it is initially set to a small epsilon to avoid interference with other prompts, ensuring high locality, while still permitting some paraphrases to be executed for generalization. It is important to note that multiple paraphrases during training are essential for achieving effective generalization. While the default deferral radius is small, enabling high locality, it leads to a degenerate solution with minimal generalization, as evident from the results. If the radius is expanded, generalization improves but at the cost of locality: because lexically similar neighbours lie close to the edits, locality scores drop.

5. Some results in Table 1 are not clear (close to 0.0); for example, why does GRACE on zsRE get 0.00 in Loc on Llama2-7b?

For our evaluations, we employ the default parameters and layer settings provided in the EasyEdit library [4]. It is important to note that Llama-2-7b operates within a representation space that differs significantly from that of the GPT-series models. Alternative model editing approaches, such as MEMIT, have been shown to perform sub-optimally in this context. We direct the reviewer to a recent NeurIPS paper [5], where Appendix Section D (Additional Experiments) demonstrates that the results of MEMIT and GRACE for the LLaMA model align closely with those observed in our study.

6. In Figure 6, it would be better to add the results for GRACE.

Evaluating model editing methods across different scaling experiments is computationally intensive. To address concerns related to scalability, we focused our primary experiments on comparing PENME's performance with other editing approaches on 2,000 edits, as shown in Table 1. Notably, experiments in the relevant literature typically evaluate performance with up to 1,000 edits [2], [3], [5].

[1] Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad (2024). Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations. In Proceedings of NeurIPS.

[2] Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems

[3] Lang Yu, Qin Chen, Jie Zhou, and Liang He. Melo: Enhancing model editing with neuron-indexed dynamic lora. In Proceedings of the AAAI Conference on Artificial Intelligence.

[4] Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, et al. Easyedit: An easy-to-use knowledge editing framework for large language models. arXiv preprint arXiv:2308.07269, 2023a

[5] Xiusheng Huang, Jiaxiang Liu, Yequan Wang, and Kang Liu. Reasons and solutions for the decline in model performance after editing. In Proceedings of NeurIPS.

Comment

We appreciate Reviewer hiiz's time and attention in reviewing our paper, as well as the detailed feedback, which we intend to address below.

1. This paper suggests that lexical bias refers to different editing subjects with the same relation, such as "The twin city of Pittsburgh is" and "The twin city of Portsmouth is." However, the prevalence of such cases in the CounterFact and ZsRE datasets is unclear.

It would be better to give more examples showing what lexical bias and lexical overlap are in the paper.

To quantify lexical bias, we compute token overlap using Jaccard similarity and ROUGE metrics between (edit, paraphrase) pairs (x_i, p_ij) and (edit, neighbour) pairs (x_i, nb_ij), and also present examples from both datasets in the tables below. From the token-overlap metrics table, it is evident that edit prompts and neighbours show high overlap in Counterfact, whereas the overlap is minimal in ZsRE. The table with dataset examples further demonstrates that neighbours share a similar lexical structure but often refer to different entities. Interestingly, these entities can also be semantically related, as seen in the fourth example, where Thomas Arne and Bill Brandt are both well-known British figures.

This, coupled with the experiment in Section 6.1 (LEXICAL DOMINANCE) highlights the challenging nature of the Counterfact dataset, attributable to its inherent lexical bias.

| Metric | Pair Type | ZsRE Score (or P / R / F1) | Counterfact Score (or P / R / F1) |
|---|---|---|---|
| Jaccard Similarity | (x_i, p_ij)  | 0.399 | 0.401 |
| Jaccard Similarity | (x_i, nb_ij) | 0.086 | 0.430 |
| ROUGE-1 | (x_i, p_ij)  | 0.321 / 0.315 / 0.316 | 0.310 / 0.325 / 0.307 |
| ROUGE-1 | (x_i, nb_ij) | 0.076 / 0.087 / 0.079 | 0.295 / 0.293 / 0.290 |
| ROUGE-2 | (x_i, p_ij)  | 0.189 / 0.194 / 0.194 | 0.189 / 0.198 / 0.184 |
| ROUGE-2 | (x_i, nb_ij) | 0.008 / 0.008 / 0.008 | 0.205 / 0.203 / 0.201 |
| ROUGE-L | (x_i, p_ij)  | 0.299 / 0.294 / 0.293 | 0.299 / 0.312 / 0.295 |
| ROUGE-L | (x_i, nb_ij) | 0.070 / 0.080 / 0.073 | 0.294 / 0.292 / 0.289 |

Random Samples from the Counterfact and ZsRE Datasets

Counterfact

| Edit | Paraphrase | Neighbour |
|---|---|---|
| The twin city of Cologne is | What is the twin city of Cologne? It is | The twin city of London is |
| Alexander Zinoviev works in the area of | Alexander Zinoviev's domain of work is | Fred W. Riggs works in the area of |
| The original language of Kondura was | The language of Kondura is | The language of Taal is |
| Thomas Arne died in the city of | Thomas Arne lost their life at | Bill Brandt died in the city of |

ZsRE

| Edit | Paraphrase | Neighbour |
|---|---|---|
| Which river system contains Laborec? | What river system does Laborec contain? | Where does the last name Serrano come from? |
| Which airport does Air Seychelles operate in? | Which airport is closely linked to Air Seychelles? | How many students attend Chippewa Valley High School? |
| The country of origin for Kala Pul is what? | Which was the country for Kala Pul? | When do the new Sky Sports channels launch? |
| What label was responsible for Wild World? | What was the label Wild World? | Who composed the music for Avengers: Infinity War? |

Comment

The authors note in Part 1 of the analysis that "it is evident that the edit prompt and neighbours show a high overlap in Counterfact, whereas the overlap is minimal in ZsRE." The bias the authors aim to address concerns data with similar structures but different entities.

According to your analysis, such data is more prevalent among the neighbours. Therefore, PENME should seemingly perform better on Loc, especially on the Counterfact dataset. However, this contradicts the results in Table 1, where GRACE performs better, leaving the reasons for the overall performance improvement unexplained.

Comment

As highlighted in our response to Part 1, GRACE falls short of satisfying a fundamental criterion of model editing: meaningful generalization. While its high locality may initially appear noteworthy, the results underscore a critical limitation of GRACE, as it achieves less than 22% generalization on Counterfact across all models. This performance is comparable to a simplistic input-matching mechanism reliant on caching. The primary scientific challenge in model editing lies not only in preserving locality but also in ensuring that targeted factual updates propagate effectively across semantically related queries. PENME is designed to address both the challenges of generalization and locality, demonstrating strong performance in both areas.

To illustrate how GRACE's performance evolves as its deferral radii are adjusted in an attempt to approach PENME's generalization performance, we increase the deferral radii and evaluate the approach on 2000 Counterfact samples for GPT2-XL, as shown in the table below.

|       | ES   | LOC   | PARA  |
|---|---|---|---|
| PENME | 1.00 | 0.847 | 0.875 |
| GRACE | 1.00 | 0.171 | 0.767 |
Comment

Thank you for your response. The results indeed demonstrate the advantages of PENME in scenarios involving extensive continual editing.

However, it seems to rely on a crucial component, the "PROJECTION NETWORK," which raises a question: Is the PROJECTION NETWORK pre-trained, or does it play a role in continual learning?

In the context of continual editing, the assumption is that the model does not have all the knowledge updates at once but receives them sequentially.

Therefore, the training of the PROJECTION NETWORK appears to be a small-sample training problem. Has the author considered this aspect?

You could check the paper "https://arxiv.org/pdf/2405.14768"; they did not use all edits before editing, but does the PROJECTION NETWORK need all the edits for training?

Comment

We appreciate the opportunity to clarify our approach. Our evaluations were conducted within a batch editing framework, and we thank the reviewer for highlighting the problem of continual editing. To address this, we conduct an experiment using Llama-2-7b, leveraging the pretrained projector network from our original experiments. We randomly sample 1,000 unseen instances from ZsRE to evaluate PENME and compare it to WISE [6]. The results, presented in the table below, show that PENME maintains strong performance in both locality and generalization in a continual editing setting. In contrast, while WISE emphasizes high locality, it experiences trade-offs in the form of edit forgetting and reduced generalization.

Due to current computational resource constraints, we are in the process of compiling the experimental results for all models for this setting. Once finalized, these results will be included in the manuscript.

|       | ES   | LOC   | PARA  | Score |
|---|---|---|---|---|
| PENME | 1.00 | 0.917 | 0.861 | 0.93  |
| WISE  | 0.77 | 1.00  | 0.72  | 0.83  |

[6] Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Wise: Rethinking the knowledge memory for lifelong model editing of large language models.

Comment

Thanks for your time and effort during the rebuttal. I think you have addressed most of my concerns in this paper, so I have decided to revise my score accordingly.

Review
Rating: 6

This paper addresses an important question in model editing: the tradeoff between generalization (e.g., paraphrase handling) and locality (e.g., avoiding unintended edits on irrelevant queries). To tackle this issue, the authors propose PENME, which consists of two components: (1) a projection network trained with a contrastive objective to separate paraphrased and irrelevant prompts in the representation space, and (2) a memory-based retrieval scheme that enhances editing precision by applying a similarity threshold as a scoping mechanism. Experiments on three models demonstrate the effectiveness of PENME compared to other baselines.

Strengths

  1. This paper addresses an important topic in model editing: the tradeoff between generalization and locality. Though the problem is already well-defined, the two proposed components are simple yet effective in improving editing effectiveness.
  2. The experimental results are significant, showing improvements over several state-of-the-art editing methods.

Weaknesses

  1. I don’t have strong negative feedback on this paper, but additional analyses would be helpful. See the Questions section for more details.
  2. The presentation could be improved; all figures are quite blurry and of low quality.
  3. The writing, especially in the experiments section, needs clarification. The baseline setup is hard to follow as the authors haven't provided an overview of all compared baselines (e.g., MELO). For instance, it's unclear why Llama-2-7b wasn't tested with MELO and why there is no T5 for MEMIT.

Questions

  1. An ablation study on each proposed component would strengthen the analysis.
  2. Could you provide a visualization of the representation space showing edits, paraphrases, and neighbors before and after editing? This would nicely complement the current analysis.
  3. Fig6: LAMA->LLaMA
  4. Adding more editing methods in the scaling experiment can help to confirm the robustness of PENME.
Comment

We would like to thank Reviewer UAfd for taking the time to review our work and for the useful feedback. We hope to address the stated concerns below.

1. The writing, especially in the experiments section, needs clarification. The baseline setup is hard to follow as the authors haven't provided an overview of all compared baselines (e.g., MELO). For instance, it’s unclear why Llama-2-7b wasn’t tested on MELO and no T5 for MEMIT.

We apologize for the lack of clarity. As detailed in Appendix B.2, we utilize the EasyEdit library [1] for our evaluations. This library extends the original codebases for model editing approaches, enabling the use of additional models that were not supported previously. However, it is important to note that not all models are currently supported for every approach (e.g., Llama-2-7b for MELO). Additionally, MEMIT is specifically designed to work with decoder-only models, which is why we did not conduct evaluations using the T5 model. Working details for MELO are provided in the Related Work Section of the main text.

2. Could you provide a visualization of the representation space showing edits, paraphrases, and neighbours before and after editing? This would nicely complement the current analysis.

We have included the requested visualization in Appendix Section F (VISUALIZATIONS). The visualization shows that in model representations edits are closer to their neighbours than to their respective paraphrases. Additionally, edits tend to be closer to other edits, which can result in misfires in similarity retrieval systems. In the representation space of the projection network, we can see that the neighbours are far from the edits and the edits are farther away from each other. This visualization is referenced in Section 6.2 (DISENTANGLED PROJECTION SPACE) of the main paper.

3. Fig6: LAMA -> LLaMA

Thank you for pointing that out. We have updated the image to rectify this typo.

4. Adding more editing methods in the scaling experiment can help to confirm the robustness of PENME

Evaluating model editing methods across different scaling experiments is computationally intensive. To address concerns related to scalability, we focused our primary experiments on comparing PENME's performance with other editing approaches on 2,000 edits, as shown in Table 1. Notably, experiments in the relevant literature typically evaluate performance with up to 1,000 edits [2], [3], [4].

5. The presentation could be improved; all figures are quite blurry and lack high quality.

We apologize for the image quality issues and have updated all images to enhance image quality.

6. An ablation study on each proposed component would strengthen the analysis.

There are three major hyperparameters in PENME: edit-to-edit pairings, the margin m in the contrastive loss, and the data-driven threshold τ. Ablations regarding edit-to-edit pairings and τ can be found in Section 7.1 (GENERALIZATION AND LOCALITY) of the main text. For m, we conduct an ablation study and examine its implications for generalization and locality, utilizing 500 samples from the Counterfact dataset with the GPT2-XL model. The table below presents the results, with τ adjusted to achieve balanced outcomes across the metrics. Margins between 40 and 80 provide a balanced trade-off between generalization and locality. Notably, locality improves with increasing m, which can be advantageous in scenarios where minimizing false matches is critical. We have added the results in Appendix Section D.1.2.

| Margin m | Threshold Adjustment (τ) | Generalization | Locality |
|---|---|---|---|
| 10 | 0  | 0.634 | 0.831 |
| 20 | 3  | 0.891 | 0.880 |
| 30 | 6  | 0.958 | 0.948 |
| 40 | 8  | 0.967 | 0.977 |
| 50 | 11 | 0.978 | 0.965 |
| 60 | 13 | 0.976 | 0.986 |
| 70 | 17 | 0.973 | 0.976 |
| 80 | 17 | 0.973 | 0.976 |
| 90 | 20 | 0.928 | 0.986 |
Comment

[1] Peng Wang, Ningyu Zhang, Xin Xie, Yunzhi Yao, Bozhong Tian, Mengru Wang, Zekun Xi, Siyuan Cheng, Kangwei Liu, Guozhou Zheng, et al. Easyedit: An easy-to-use knowledge editing framework for large language models.

[2] Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems

[3] Lang Yu, Qin Chen, Jie Zhou, and Liang He. Melo: Enhancing model editing with neuron-indexed dynamic lora. In Proceedings of the AAAI Conference on Artificial Intelligence.

[4] Xiusheng Huang, Jiaxiang Liu, Yequan Wang, and Kang Liu. Reasons and solutions for the decline in model performance after editing. Proceedings of NeurIPS.

Comment

Thanks for your time and effort during the rebuttal. I think you have addressed most of my concerns in this paper, especially regarding the presentation, so I have decided to revise my score accordingly.

Review
Rating: 5

The paper points out that knowledge editing methods based on scoping mechanisms are limited by lexical bias. It proposes using a projector network to decrease the distance between edits and paraphrases, and increase the distance between edits and neighbors to address this issue.

Strengths

  • The paper highlights the challenge of lexical bias in knowledge editing, providing valuable guidance for future research in this field.
  • The paper proposes using a projector network to enhance retrieval within the codebook, effectively improving the generalization of edits and preventing misfires.

Weaknesses

  • The authors didn't fully explain their method. In discussing the construction of key-value memory, they described how to create the keys and set the threshold but didn't explain how to obtain the corresponding values.
  • The paper claims that PENME enables faster edit retrieval and simplifies edit removal or updates. However, it lacks supporting experimental evidence.
  • The hyperparameter m in the loss function is crucial for the projection network's performance, yet the paper lacks ablation studies on this.
  • Many methods were selected for performance comparison, but the authors did not explain why these specific methods were chosen.
  • The images in the paper are disorganized and difficult to interpret. Combining multiple experiment results into single images reduces readability.
  • The organization of the main text and appendix is unclear. For example, ablation experiments present results for different similarity thresholds for edit-to-edit pairings, but this hyperparameter isn't introduced in the main text, making it hard to understand.
  • The paper doesn’t provide detailed explanations of the projector networks, such as their parameter dimensions.

Questions

  • Why choose L2 distance in the loss function instead of cosine similarity?
  • Why choose cosine similarity for edit-to-edit pairings instead of L2 distance?
  • What impact does the hyperparameter m have on the projection network's performance?
  • What is the architecture of the projection network? Is it similar to a feed-forward layer in a transformer?
  • How is the memory value in the key-value memory obtained?
  • The paper proposes two data-driven thresholding schemes. Was Option 1 chosen over Option 2 based on experimental results?
  • Why were these methods chosen as baselines in the paper? Is it because they achieved state-of-the-art results on certain metrics or share similarities with PENME?
Comment

12. Why choose L2 distance in the loss function instead of cosine similarity?

Cosine loss only takes into account the direction of the representations, whereas L2 loss considers both direction and magnitude. Existing literature indicates that training with a contrastive objective using L2 loss leads to more separable clusters than cosine loss. One issue with cosine loss is that reducing the angle between positive representations increases the magnitude of the representations. Thus, enforcing cosine similarity may result in a leaked heuristic that allows the model to manipulate the magnitude in the projection space, leading to unexpected incompatibility with the downstream computation flow. Furthermore, cosine loss is prone to diminishing gradients due to this increase in magnitude, as well as when the initial angle between positive pairs is large. We refer the reviewer to [3], which outlines these issues. We also provide experimental evidence by training GPT2-XL on 500 samples from Counterfact, where we modify the objective function of the projector network to cosine loss within PENME’s pipeline. The results show that a negative τ must be set, which means that the paraphrases used during training will fail. Moreover, the performance is lower compared to training with the contrastive learning objective.

PENME Cosine Loss

| GPT2-XL τ | Generalization | Locality | Llama2-7b τ | Generalization | Locality |
|---|---|---|---|---|---|
| -2 | 0.463 | 0.779 | -10 | 0.482 | 0.536 |
| -1 | 0.691 | 0.603 | -9  | 0.546 | 0.506 |
| 0  | 0.878 | 0.423 | -8  | 0.575 | 0.466 |
| 1  | 0.981 | 0.250 | -7  | 0.618 | 0.429 |
| 2  | 1.00  | 0.096 | -6  | 0.654 | 0.392 |

PENME Contrastive Loss

| GPT2-XL τ | Generalization | Locality | Llama2-7b τ | Generalization | Locality |
|---|---|---|---|---|---|
| 10 | 0.956 | 0.991 | 10 | 0.935 | 0.99  |
| 13 | 0.972 | 0.990 | 13 | 0.963 | 0.987 |
| 18 | 0.978 | 0.984 | 15 | 0.971 | 0.985 |
| 19 | 0.980 | 0.980 | 18 | 0.981 | 0.98  |
| 20 | 0.982 | 0.975 | 20 | 0.987 | 0.973 |
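For illustration, here is a minimal sketch of a margin-based contrastive objective using Euclidean distance, of the kind discussed above (a standard formulation; the exact loss and margin used in PENME may differ).

```python
import torch
import torch.nn.functional as F

def contrastive_l2_loss(z_a, z_b, is_positive, margin=50.0):
    """Pull projected edit/paraphrase pairs together and push projected
    edit/neighbour pairs at least `margin` apart in L2 distance.
    is_positive: 1.0 for (edit, paraphrase) pairs, 0.0 for (edit, neighbour) pairs."""
    d = F.pairwise_distance(z_a, z_b)                              # per-pair Euclidean distance
    pos = is_positive.float() * d.pow(2)                           # attract positive pairs
    neg = (1.0 - is_positive.float()) * F.relu(margin - d).pow(2)  # repel negatives inside the margin
    return (pos + neg).mean()
```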

[1] Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems

[2] Lang Yu, Qin Chen, Jie Zhou, and Liang He. Melo: Enhancing model editing with neuron-indexed dynamic lora. In Proceedings of the AAAI Conference on Artificial Intelligence.

[3] Andrew Draganov, Sharvaree Vadgama, and Erik J Bekkers. The hidden pitfalls of the cosine similarity loss. arXiv preprint arXiv:2406.16468, 2024.

Comment

Thank you for your response. However, I believe the following issues remain unresolved:

  • Q5 has not been answered: Is the memory value obtained through training or other methods?
  • I agree that placing multiple parameters in a single figure can better demonstrate their combined effects. However, without a clear explanation, it might confuse the readers even more. Additionally, the overlapping lines in Figure 5 further increase the difficulty of interpretation.
  • When changing the margin m, how was the corresponding threshold determined? Is there a quick way to find a suitable threshold?

Although I think this paper still has some shortcomings, I also believe that this work effectively improves the generalization of cluster-based knowledge editing methods, which makes it valuable. I am willing to revise my score and hope the authors will continue to refine this paper.

Comment

We would like to thank Reviewer y63D for taking the time to review our paper and we appreciate the detailed comments which we hope to address below.

1. The authors didn't fully explain their method. In discussing the construction of key-value memory, they described how to create the keys and set the threshold but didn't explain how to obtain the corresponding values. How is the memory value in the key-value memory obtained?

The ZsRE and Counterfact datasets provide edit prompts x_i alongside their corresponding new outputs y_i, as well as paraphrases p_ij and neighbours n_ij. The keys are projector-network representations, and the corresponding values contain a data-driven similarity threshold (δ) and the associated new output y_i. The threshold for each edit x_i is determined by computing the Euclidean distance between the projector-network representations of the edit and its training paraphrases p_ij, and taking the distance to the farthest paraphrase plus τ, where τ is a hyperparameter. As highlighted in Section 4 (Projector Editor Networks for Model Editing), alternative playback mechanisms can be seamlessly integrated with this approach, offering a viable alternative to directly storing the new output information. We describe the key-value memory in detail in Section 4.2.
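To make the construction concrete, below is a minimal sketch of the per-edit threshold and the scoped lookup as described above; the function names and the list-based memory are our own illustration, not the paper's code.

```python
import numpy as np

def add_edit(memory, project, edit_repr, paraphrase_reprs, new_output, tau):
    """Key = projected edit representation; value = (threshold delta, new output y_i).
    delta = distance to the farthest training paraphrase + tau."""
    key = project(edit_repr)
    dists = [np.linalg.norm(key - project(p)) for p in paraphrase_reprs]
    memory.append((key, max(dists) + tau, new_output))

def retrieve(memory, project, query_repr):
    """Return the stored output if the query falls within the nearest edit's
    scope; otherwise defer to the unedited model (None)."""
    q = project(query_repr)
    key, delta, output = min(memory, key=lambda entry: np.linalg.norm(q - entry[0]))
    return output if np.linalg.norm(q - key) <= delta else None
```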

2. The paper claims that PENME enables faster edit retrieval and simplifies edit removal or updates. However, it lacks supporting experimental evidence.

We conducted an ablation study to evaluate the number of codebook entries and retrieval time, scaling from 50 to 300 edits in increments of 50 samples per experiment. The results, shown below, reveal that both MELO [2] and GRACE [1] demand significantly more codebook entries, leading to slower inference times as the number of edits increases. We have added the results in the Appendix Section (COMPARISON SCOPING MECHANISM: PENME VERSUS MELO AND GRACE).

| Number of Edits | PENME Runtime (ms) | MELO/GRACE Runtime (ms) | MELO/GRACE Codebook Entries |
|---|---|---|---|
| 50  | 0.024 ± 0.003 | 0.316 ± 0.090 | 269  |
| 100 | 0.115 ± 0.129 | 0.364 ± 0.050 | 523  |
| 150 | 0.188 ± 0.182 | 0.624 ± 0.082 | 785  |
| 200 | 0.279 ± 0.170 | 1.423 ± 0.180 | 1048 |
| 250 | 0.404 ± 0.170 | 1.681 ± 0.205 | 1319 |
| 300 | 0.418 ± 0.125 | 2.149 ± 1.069 | 1554 |

Caption: Runtime Performance Comparison of PENME versus MELO retrieval system. For PENME, the number of Codebook entries is the same as the number of edits.

3. The hyperparameter m in the loss function is crucial for the projection network's performance, yet the paper lacks ablation studies on this.

What impact does the hyperparameter m have on the projection network's performance?

We conducted an ablation study on the hyperparameter m for the GPT2-XL model. The table below presents the margin m alongside the corresponding adjustments to τ to achieve a balance between generalization and locality. Margins between 40 and 80 provide a balanced trade-off between generalization and locality. Notably, locality improves with increasing m, which can be advantageous in scenarios where minimizing false matches is critical. We have added the results in Appendix Section D.1.2.

| Margin m | Threshold Adjustment (τ) | Generalization | Locality |
|---|---|---|---|
| 10 | 0  | 0.634 | 0.831 |
| 20 | 3  | 0.891 | 0.880 |
| 30 | 6  | 0.958 | 0.948 |
| 40 | 8  | 0.967 | 0.977 |
| 50 | 11 | 0.978 | 0.965 |
| 60 | 13 | 0.976 | 0.986 |
| 70 | 17 | 0.973 | 0.976 |
| 80 | 17 | 0.973 | 0.976 |
| 90 | 20 | 0.928 | 0.986 |
Comment

4. Many methods were selected for performance comparison, but the authors did not explain why these specific methods were chosen.

GRACE and MELO were selected for comparison because they are weight-preserving approaches, similar to PENME, which ensures that model weights are minimally disrupted while integrating new edits. This makes them relevant benchmarks for evaluating the efficiency and precision of PENME. In addition, MEMIT and SERAC were included as they represent high-performing techniques in model editing. MEMIT is a weight-modification-based approach that directly encodes edits by adjusting weights, while SERAC employs external components to handle edits without altering the core model. These diverse approaches provide a comprehensive basis for assessing PENME's performance across different model editing paradigms, highlighting its superior performance. Section 5 (EXPERIMENTAL SETUP) has been updated to clarify this. Moreover, the Introduction and Related Work sections provide further details on various types of model editing methods present in the literature.

5. The images in the paper are disorganized and difficult to interpret. Combining multiple experiment results into single images reduces readability.

We apologise for the issues with the figures. We have enhanced the image quality of all figures. Based on the comment, we understand that Figure 6 may be challenging to interpret, as it is the only figure that includes combined ablations on both edit-to-edit pairings φ and the distance threshold τ. The ablation is presented in a single visualization as it naturally addresses how the two parameters change in combination.

To improve the readability of the figure, we have increased the iteration gap for φ and restricted the visualization to two models (GPT2-XL and T5-small), with the visualization for the remaining model (Llama-2-7b) provided in Appendix Section F (VISUALIZATIONS). The paper text and figure caption have been updated accordingly.

6. The organization of the main text and appendix is unclear. For example, ablation experiments present results for different similarity thresholds for edit-to-edit pairings, but this hyperparameter isn't introduced in the main text, making it hard to understand.

We apologise for the omission. We have introduced φ as the hyperparameter utilized for this pairing. The paper text has been updated to reflect this in Section 4.1 (PROJECTION NETWORK), where edit-to-edit pairings are first introduced.

7. The paper proposes two data-driven thresholding schemes. Was Option 1 chosen over Option 2 based on experimental results?

We decided based on intuition; however, we have added empirical results in the rebuttal to support it. In the following, we summarize our intuition and the empirical results. Option 2 may, for an edit, result in a threshold that is lower than the distance to the most distant training paraphrase, meaning we cannot guarantee generalization for the training paraphrases. For instance, when performing 500 edits on the Counterfact dataset, with a total of 2,500 neighbouring pairs, 8.62% of the edits encounter this issue. In contrast, Option 1 does not prioritize locality for the training neighbours. We have updated Section 4.2 (KEY-VALUE MEMORY) to provide clarification.

9. Why were these methods chosen as baselines in the paper? Is it because they achieved state-of-the-art results on certain metrics or share similarities with PENME?

We select a diverse set of model editing methods, including both weight-preserving and weight-modifying methods. Our motivation for including diverse methods is to evaluate the effectiveness of our proposed method against all types of approaches. However, in terms of methodological similarity, GRACE and MELO are closest to PENME.

10. Why choose cosine similarity for edit-to-edit pairings instead of L2 distance?

The bounded range of cosine similarity simplifies the process of determining a threshold across edits. While L2 distance can also be a valid approach and yields similar results, the goal is to pair edits that are closer in the vector space.

11. What is the architecture of the projection network? Is it similar to a feed-forward layer in a transformer?

The projection network is similar to the feed-forward layers in a Transformer: it contains two layers with a ReLU activation in between, with the addition of a Batch Normalization layer, a common element in contrastive learning. We have updated Section D (EXPERIMENTATION AND IMPLEMENTATION DETAILS) in the Appendix to clarify this.
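As a rough illustration of this description, a minimal PyTorch sketch of such a projector is shown below; the layer dimensions and the placement of BatchNorm before the ReLU are placeholder assumptions, not the actual configuration used in the paper.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP with ReLU and Batch Normalization, similar to a
    Transformer feed-forward block (dimensions are illustrative)."""
    def __init__(self, in_dim=4096, hidden_dim=1024, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        # h: hidden-state representations from the edited layer, shape (batch, in_dim)
        return self.net(h)
```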

Comment

1. Q5 has not been answered: Is the memory value obtained through training or other methods?

Response: Please recall from our previous rebuttal (Part 1, point 1) that the memory value is a tuple consisting of (the data-driven threshold, y_i, the edited output from the dataset); we highlight that any playback mechanism, such as vector playback or LoRA-indexed blocks, can be used as well. The threshold is not learned but is a hyperparameter that is empirically determined.

Thank you for raising this point again, we will ensure this is as clear as possible in the paper as a result of this discussion.

2. I agree that placing multiple parameters in a single figure can better demonstrate their combined effects. However, without a clear explanation, it might confuse the readers even more. Additionally, the overlapping lines in Figure 5 further increase the difficulty of interpretation.

Response: We have updated the caption, which now reads: "Figure 5: Shows the trade-off between generalization and locality performance across different hyperparameter settings. The distance threshold τ varies from 0.01 to 0.2 (0.01 increments; τ is normalized by 100), while the edit-pairing similarity threshold φ ranges from 0.5 to 0.9 (0.1 increments). Higher φ values enforce stricter edit similarity requirements. The results showcase the effect of hyperparameter tuning on the projector network's learning capacity and overall performance." To make the lines more clearly visible, we moved the minimum value on the y-axis from 0.35 to 0.6.

3. When changing the margin m, how was the corresponding threshold determined? Is there a quick way to find a suitable threshold?

Response: The higher the margin m, the higher the value of τ needs to be. As stated before, the threshold is not learned but is itself a hyperparameter that is empirically determined from the data; in the case of Llama-2-7b it is 10 for the original experiments (Table 1). A quick way to find a suitable value for τ is to utilize unseen samples. Using 100 unseen samples, we find that 6 is the optimal τ value for a balanced outcome in generalization and locality on those samples. Using this value for the original experiments yields the results shown in the table below. We observe that the scores are high for both generalization and locality, but with a stronger emphasis on locality.

PENME (Llama-2-7b, 100 unseen samples)

| τ | Generalization | Locality |
|---|---|---|
| 3 | 0.764 | 0.975 |
| 4 | 0.857 | 0.936 |
| 5 | 0.875 | 0.904 |
| 6 | 0.869 | 0.870 |
| 7 | 0.962 | 0.821 |
| 8 | 1.0   | 0.828 |

PENME (Llama-2-7b)

| ES | LOC | PARA | Score |
|---|---|---|---|
| 1.00 | 0.946 | 0.851 | 0.932 |
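A minimal sketch of the held-out sweep described in the response is given below; `evaluate` is a hypothetical helper standing in for running the generalization/locality evaluation at a given τ, and treating "balanced" as the smallest gap between the two metrics is one possible reading of the selection criterion, not necessarily the one used in the paper.

```python
def pick_tau(tau_grid, evaluate):
    """Select the tau whose generalization and locality are closest to each
    other on a small set of unseen samples (one notion of a balanced outcome).
    evaluate(tau) -> (generalization, locality)."""
    def imbalance(tau):
        gen, loc = evaluate(tau)
        return abs(gen - loc)
    return min(tau_grid, key=imbalance)

# Under this criterion, the unseen-sample table above would yield tau = 6
# (|0.869 - 0.870| is the smallest gap across the grid).
```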
Comment

Dear Reviewer y63D,

As the discussion phase draws to a close, we kindly ask whether our response has resolved your concerns or if there are any remaining issues that we can further clarify. Your insights are invaluable in refining our work, and we are eager to ensure that all your concerns are fully addressed.

Thank you once again for your time and effort in reviewing our manuscript.

Comment

We would like to thank the reviewers for their valuable comments and feedback. In the following, we reiterate the contributions of our work and summarize the experiments conducted for the rebuttal.

Novelty: This is the first work that identifies the issue of lexical bias in model editing. The explicit modelling of a representation space where paraphrases of edits demonstrate proximity while neighbouring examples are far away is a novel solution to the problem at hand. To the best of our knowledge, PENME is the only model editing method designed to maintain both high locality and generalization. The results in Table 1 demonstrate the efficacy of PENME where none of the other methods, whether weight-preserving or weight-modifying, improves both locality and generalization.

Comparison between PENME, GRACE and MELO: Reviewers uwxU, y63D and hiiz asked about the difference between PENME and GRACE/MELO. In the following, we compare these methods and clarify our contribution.

Cluster-based similarity systems such as GRACE [1] and MELO [2] rely on concept separability within the representation space to manually maintain keys in their codebooks. However, our analysis reveals that lexically similar prompts cluster closer to edits than their paraphrases do, heightening the risk of system failure, as can be seen in Figures 1 and 3. Moreover, their cluster-based design necessitates storing edit paraphrases as codebook entries for effective generalization, which increases retrieval latency. PENME overcomes these limitations by learning a projection space that enhances representation structure, enabling more effective organization of keys for faster and more accurate retrieval. Moreover, PENME consistently outperforms both weight-preserving and weight-modifying methods across various architectures, underscoring its adaptability and efficacy.

Experiments: We provide a detailed discussion on each experiment in our response to individual comments. In the following, we provide a summary of conducted experiments and their findings.

  1. A runtime comparison between PENME, GRACE, and MELO reveals that PENME is faster due to its requirement for fewer codebook entries. The experiment also highlights that multiple codebook entries are lost during training as a result of cluster resizing operations.

  2. Ablation on the margin m and the corresponding adjustments needed to τ for the data-driven threshold. The results demonstrate that values of 40 ≤ m ≤ 80 provide a balanced trade-off between generalization and locality.

  3. Comparison of cosine loss and contrastive loss for training the projection network. The results demonstrate that the L2 distance performs better than the cosine similarity.

  4. Token overlap metrics which showcase the lexical similarity characteristics between edits and neighbours in Counterfact and ZsRE datasets. The results indicate that this issue is not only more pronounced in Counterfact compared to ZsRE but also occurs abundantly within the Counterfact dataset. We also provide data samples from both datasets.

  5. For the data-driven threshold, we provide the percentage of samples where training paraphrases or neighbours fail due to the adjustment factor τ. This provides insight into data-driven thresholding Options 1 and 2.


Summary of Revisions: Details regarding Section numbers are provided in individual comments.

  1. The structure of the Appendix has been changed to reflect the order in which it is referred to in the text.

  2. The requested ablations and visualizations have been added to the appendix.

  3. Paper components for which further clarifications were requested have been added as blue text in the paper.

  4. Image quality of all images has been improved. Figures 3 and 6 have been altered to improve readability.

[1] Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems

[2] Lang Yu, Qin Chen, Jie Zhou, and Liang He. Melo: Enhancing model editing with neuron-indexed dynamic lora. In Proceedings of the AAAI Conference on Artificial Intelligence.

Comment

Dear Reviewers,

Thank you for your thorough feedback on our manuscript. We have addressed all your comments. With the rebuttal deadline approaching, we would greatly appreciate a discussion regarding our responses. Please let us know if there are any points that require further clarification or additional explanation.

Authors

AC Meta-Review

This paper explores the tradeoff between generalization (e.g., handling paraphrases) and locality (e.g., avoiding unintended edits on irrelevant queries) in model editing. The authors introduce PENME, a method with two key components: (1) a projection network trained with a contrastive objective to distinguish paraphrased from irrelevant prompts in the representation space, and (2) a memory-based retrieval scheme that improves editing precision by using a similarity threshold as a scoping mechanism. Experiments across three models show that PENME outperforms other baseline approaches. However, the technical contribution of the work is limited, especially in comparison to previous works such as GRACE, where codebook representations are manually maintained while in this paper they are learned. Moreover, the paper does not fully explain the methodology, particularly the construction of the key-value memory. The organization and writing of this paper can also be improved. The authors are encouraged to carefully revise the paper based on the reviewers' feedback.

Additional Comments on Reviewer Discussion

The reviewers and authors had discussions, but some reviewers think that this paper lacks innovation compared to previous baselines such as GRACE.

Final Decision

Reject