Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
We investigate whether sparse autoencoders can be used to unlearn bioweapon-related knowledge in language models and compare performance to an existing fine-tuning based approach.
Abstract
Reviews and Discussion
This paper investigates the use of Sparse Autoencoders (SAEs) to selectively unlearn specific knowledge within language models, using interpretable interventions. The study focuses on the Gemma-2b-it and Gemma-2-2b-it models, particularly on knowledge related to biosecurity from the Weapons of Mass Destruction Proxy Dataset (WMDP-bio).
Strengths
- Overall, applying SAE to unlearn specific knowledge in LLMs is an interesting and practical approach.
- SAE interventions provide precise control over targeted knowledge, enhancing transparency.
- This method avoids weight modification, offering a novel, activation-based approach to unlearning.
Weaknesses
- It negatively impacts performance on unrelated domains.
- The proposed method seems to require re-training each time a new domain needs to be unlearned, along with access to data from that domain.
- More evaluation results beyond the WMDP-bio dataset would enhance the assessment.
Questions
- The evaluation appears to be based on multiple-choice question answering. How would the method perform in open-ended question answering?
- What are the additional training and inference costs of using SAE?
- Can a trained SAE model be transferred to other large language models?
We thank the reviewer for their detailed reading of the manuscript, the thoughtful comments (and list of strengths), and appreciate the rating above the acceptance threshold. We discussed the comments in detail and present our responses below. We hope that the reviewer will consider raising their score for the revised version of the paper.
Weaknesses:
It negatively impacts performance on unrelated domains.
We agree with the above claim as stated; however, we do not agree that this should be considered a weakness of the paper. It is simply a weakness of applying SAEs to unlearning in this context.
The goal of our work was not to produce a perfect unlearning technique, but rather to investigate whether a new, promising interpretability approach (i.e. sparse autoencoders) could be applied to the important, difficult and unsolved problem of unlearning knowledge from language models.
We do not have control over the effectiveness of SAEs in this context; we are merely measuring it. Therefore, we do not consider it a weakness of the paper that the method has some negative side effects. The strength of our paper comes from the careful application of a novel technique to an interesting, open problem and the honest presentation of the strengths and weaknesses of this approach.
The proposed method seems to require re-training each time a new domain needs to be unlearned, along with access to data from that domain.
While new features need to be searched for in new domains, the advantage of using SAEs for unlearning is that a single SAE can unlearn a wide variety of concepts and does not need to be retrained each time (unlike a fine-tuning based method such as RMU). This is one of the potential strengths of this kind of approach.
More evaluation results beyond the WMDP-bio dataset would enhance the assessment.
We agree with this comment and would love to see this addressed in future work. We think our work provides a strong starting point from which later work can build, for example facilitated by the open-source SAE weights and source code that we will release upon successful publication.
Questions:
The evaluation appears to be based on multiple-choice question answering. How would the method perform in open-ended question answering?
We think this is an interesting question and also investigated it in a qualitative way throughout our work! In summary, when the added loss was small, the open-ended answering ability appeared as good as the base model's. However, when the added loss was high, the open-ended answering ability was often damaged, particularly when discussing biology-related topics. For the purposes of presenting the results in this paper, we chose a more quantitative metric of the impact on the model.
What are the additional training and inference costs of using SAE?
Training an SAE of width 16k on a 2B parameter language model can be done on an A6000 in ~12 hours. The compute scales with the SAE width and the size & layer number of the language model (to compute the activations). We will add this detail to our appendices.
Can a trained SAE model be transferred to other large language models?
Typically not, except between a base model and chat model, where recent work suggests that they transfer well. (Reference: https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models)
Thank you for your response. I will keep my current score. Best of luck with your paper.
The paper explores the potential of using sparse autoencoders (SAEs) to remove specific types of knowledge from language models in an interpretable way. Specifically, it investigates whether SAEs can selectively "unlearn" harmful biology-related information from two language models, gemma-2b-it and gemma-2-2b-it, using a subset of the Weapons of Mass Destruction Proxy dataset. The main findings suggest that adjusting (negatively scaling) certain biology-related feature activations is effective for unlearning this knowledge, whereas simply zeroing out features is not effective.
Key insights include:
- Negative scaling of feature activations is necessary for unlearning specific topics, while zeroing features is ineffective.
- Multiple SAE features can be manipulated simultaneously to unlearn several topics, but this method has side effects similar to, or greater than, those of the existing fine-tuning-based Representation Misdirection for Unlearning technique.
Strengths
- The paper addresses an important and current issue in AI safety, focusing on controlled knowledge removal.
- The authors present a thorough analysis of how individual SAE features can be targeted to unlearn specific knowledge, showcasing the possibility for precise, fine-grained control.
Weaknesses
- Novelty: I am not sure about the difference between the paper’s method and negative activation in [1] and [2].
- Unlearning Performance: The submission underperforms relative to existing unlearning methods, notably RMU, as benchmarked by the WMDP. While it proposes an innovative approach, it fails to deliver superior results compared to RMU across several metrics, raising concerns about its effectiveness and relevance in high-stakes applications.
- Helpfulness: Furthermore, the exact MMLU accuracy of the Gemma models in absolute terms is unclear, although it appears to be close to random (approximately 25%). The authors should clarify the selection criteria, the number of questions in the MMLU subset, and whether the overall performance of the model improves or worsens.
- Validity of Evaluation Method: The evaluation is limited to a small subset of the WMDP dataset (300 questions, or less than 8% of the full dataset), which diminishes the credibility of its results. Expanding the evaluation to other subsets, such as Chemistry and Cybersecurity, would provide a more robust measure of generalizability. The choice of “Selected MMLU” for assessing unlearning is also problematic; the subset itself is potentially biased toward knowledge that is resistant to permutation, which complicates unlearning and evaluation.
- Explainability: The negative scaling approach in the submission should be explained more. According to the monosemanticity principle, zeroing the feature activation should be enough to suppress targeted knowledge, since features are believed to be disentangled. It is unclear why stronger negative values would yield better suppression. The lack of transparency on what negative values signify and whether the feature activation is one-dimensional undermines the plausibility of the approach.
[1] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
[2] Scaling and evaluating sparse autoencoders https://cdn.openai.com/papers/sparse-autoencoders.pdf
Questions
- On "Perfectly Unlearned Models": The paper should clarify why a perfectly unlearned model should achieve a score below 6, as this criterion feels arbitrary without supporting rationale.
- On the Negative Value in Feature Suppression: A deeper analysis of the meaning and impact of negative scaling in feature suppression would add clarity to the method's robustness claims.
Validity of Evaluation Method: The evaluation is limited to a small subset of the WMDP dataset (300 questions, or less than 8% of the full dataset), which diminishes the credibility of its results. Expanding the evaluation to other subsets, such as Chemistry and Cybersecurity, would provide a more robust measure of generalizability. The choice of “Selected MMLU” for assessing unlearning is also problematic; the subset itself is potentially biased toward knowledge that is resistant to permutation, which complicates unlearning and evaluation.
We use the full WMDP-bio dataset, and narrow this down to 172 and 522 questions for gemma-2b-it and gemma-2-2b-it for reasons discussed in Sec. 3.1. We do not use only 300 questions of the WMDP dataset or claim this anywhere. We think that the following existing text from Sec. 3.1 should clear up the confusion:
“We use the biology subset of the Weapons of Mass Destruction Dataset (WMDP-bio), which consists of 1273 multiple choice questions related to hazardous knowledge in biosecurity (Li et al. 2024). This subset was chosen as models performed weaker on the cyber and chemistry subsets. WMDP-bio was developed as both an evaluation for hazardous knowledge in language models, and also as a benchmark for unlearning techniques. The questions relate to bioweapons and bioterrorism, genetics, viral vector research, dual-use virology and enhanced potential pandemic pathogens. gemma-2b-it achieves a base score of 560/1273 (44.0%) on the WMDP-bio dataset. With four multiple choice options, one can expect the model to get ∼ 25% of the multiple choice questions correct by random chance, without actually having the knowledge to obtain the correct answer. However, as we aim to unlearn this information, we want to be as sure as we can that the model actually has the information in the first place. To address this, we only test unlearning on questions for which the model gets the right answer under all 24 permutations of the 4 multiple choice options, resulting in 172/1273 (13.5%) questions in the WMDP-bio dataset for gemma-2b-it and 522/1273 (41.0%) questions for gemma-2-2b-it.”
To clarify why we did not use the WMDP-cyber dataset and only the WMDP-bio dataset, we added the following text to Sec. 3.1: “This subset was chosen as models performed weaker on the cyber and chemistry subsets.”
Finally, we see no evidence supporting the reviewer’s claim above that the MMLU subset is potentially biased toward knowledge that is resistant to permutation.
Explainability: The negative scaling approach in the submission should be explained more. According to the monosemanticity principle, zeroing the feature activation should be enough to suppress targeted knowledge, since features are believed to be disentangled. It is unclear why stronger negative values would yield better suppression. The lack of transparency on what negative values signify and whether the feature activation is one-dimensional undermines the plausibility of the approach.
We agree that it is unclear why stronger negative values would yield better suppression. This is an empirical result and we think it’s very interesting which is why we mentioned it in the abstract. As such, we are unable to provide transparency on what negative values signify and would love future work to address this question!
On "Perfectly Unlearned Models": The paper should clarify why a perfectly unlearned model should achieve a score below 6, as this criterion feels arbitrary without supporting rationale.
The reason we claim that “A perfectly unlearned model should achieve a score of ≤ 6 out of 24 permutations.” is that there are 4 possible options for each answer, A, B, C, and D, and a model that is randomly guessing without any information on the topic should expect to get the correct answer, on average 25% of the time, or 6 times out of 24. If a model consistently always gets an answer wrong, it must be because it has some “strongly-held misconception” (for lack of a better analogy) about the topic, rather than simply not knowing anything about it. Our goal for unlearning was to make the model simply not know anything about the topic.
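As an illustration of how this criterion can be computed, here is a rough sketch (our own illustration, not necessarily the paper's evaluation code; `answers_correctly` is a hypothetical stand-in for prompting the model and checking its answer):

```python
from itertools import permutations

def count_correct_permutations(question, options, correct_option, answers_correctly):
    """Count how many of the 4! = 24 answer orderings the model gets right.

    `answers_correctly(question, ordered_options, correct_letter)` is a
    hypothetical callable that prompts the model with the reordered options
    and checks whether it picks the right letter.
    """
    n_correct = 0
    for ordered in permutations(options):               # 24 orderings of the 4 options
        letter = "ABCD"[ordered.index(correct_option)]  # where the right answer landed
        if answers_correctly(question, list(ordered), letter):
            n_correct += 1
    return n_correct

# A question is kept for the unlearning evaluation only if the base model passes
# all 24 permutations; a model guessing at random passes ~24 * 1/4 = 6 of them,
# hence the "score of <= 6 out of 24" criterion for a perfectly unlearned model.
```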
On the Negative Value in Feature Suppression: A deeper analysis of the meaning and impact of negative scaling in feature suppression would add clarity to the method's robustness claims.
We agree that a deeper understanding of the meaning of negative scaling would be very interesting, and would love to see this explored further and understood in future work! This is beyond the scope of our current work.
Thank you for your response.
Novelty & Explainability
If the paper's goal is to analyze previously introduced negative activation methods [1, 2], I would appreciate it if you could include detailed analyses on why such methods work.
- While I understand the list of ablations provided, could you please explain why negative activation helps unlearn targeted knowledge?
Helpfulness
- I think it would be a good idea to include a table. Since there are no tables in the paper, it might be difficult for readers to locate the core metrics. For example, authors mention that MMLU's details are in the third paragraph of Section 3.2, which is actually in the third paragraph of Section 2.2. If it's confusing for the authors themselves, it could be confusing for readers as well.
- My initial review was referring to the MMLU/WMDP-bio accuracy of Gemma models with SAEs plugged in, which I believe should serve as the baseline score for the reported performance of the unlearned Gemma models. As mentioned in the paper, plugging SAEs into the LLM adds a loss when replacing the model activation with the SAE reconstruction activation. Therefore, if you were to insert SAEs into every layer of the LLM, the results would become close to random. I noticed that you've plugged SAEs into layer 3 for gemma-2-2b-it and layer 9 for gemma-2b-it. Even if it is not close to random, the aforementioned performance is likely to be different from the quoted score (44.0%). Such a diminished baseline might reduce the unlearning’s effectiveness.
Validity of Evaluation Method
- I still think the authors should clarify the metrics on both the full and selected subsets of the evaluation dataset, which I believe will enhance the transparency of the results. In the case of MMLU, using 97 or 300 (on average, about 200) questions out of the total 15K questions diminishes credibility. Similarly, for WMDP, using 172 or 522 (on average, about 300) questions out of the entire 3.7K questions also raises concerns. I think such selection criteria are arbitrary compared to other LLM literature, limiting the method's applicability in more general scenarios.
- I maintain that selecting subsets based on prompts' permutation-resistant criteria is problematic. Most of the metrics in the paper are based on the subset of questions that the base model gets correct for all 24 permutations of prompts. Such questions are likely those that the model is sufficiently confident to answer or are common-sense questions that are easy to answer. I believe these types of knowledge are difficult to unlearn, and I'd appreciate your thoughts on this.
I hope my comments will help improve your research.
References:
[1] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
"While I understand the list of ablations provided, could you please explain why negative activation helps unlearn targeted knowledge?"
We agree this is an interesting question. We made some efforts in the paper to address why negative ablation helps unlearn targeted knowledge in the section titled "Impact of encoder vs decoder in unlearning". As we mentioned above, further investigation in this direction is complicated, would require a deep understanding of the connection between LLMs and SAEs trained on LLM activations, and is beyond the scope of our paper.
Such kinds of explanations are typically not present in the literature, for instance Anthropic's paper “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” also does not explain why positive/negative clamping of features steers the model in a certain way, so we feel it is appropriate to investigate the impact of negative feature clamping, without fully explaining why it works.
I think it would be a good idea to include a table. Since there are no tables in the paper, it might be difficult for readers to locate the core metrics. For example, authors mention that MMLU's details are in the third paragraph of Section 3.2, which is actually in the third paragraph of Section 2.2. If it's confusing for the authors themselves, it could be confusing for readers as well.
We are happy to include a table; could you clarify which values you would like to see in it?
For further context regarding the core metrics: by definition, the base model gets 100% on the selected 522 WMDP-bio questions, 100% on the selected 300 MMLU questions (question counts are for gemma-2-2b-it), and 0 added loss. It isn’t clear to us which core metrics would be helpful to see in a table.
My initial review was referring to the MMLU/WMDP-bio accuracy of Gemma models with SAEs plugged in, which I believe should serve as the baseline score for the reported performance of the unlearned Gemma models. As mentioned in the paper, plugging SAEs into the LLM adds a loss when replacing the model activation with the SAE reconstruction activation. Therefore, if you were to insert SAEs into every layer of the LLM, the results would become close to random. I noticed that you've plugged SAEs into layer 3 for gemma-2-2b-it and layer 9 for gemma-2b-it. Even if it is not close to random, the aforementioned performance is likely to be different from the quoted score (44.0%). Such a diminished baseline might reduce the unlearning’s effectiveness.
We appreciate the further clarification on this point! We agree that our communication of this point was not clear in the previous version of the manuscript and made two changes to a further revised version of the paper to improve the clarity of this point. We modified Figure 1 to make the reference to the error term more clear to the reader. We also added a sentence to the Method section stating that "Our method avoids the extra loss when inserting the SAE by adding back the error term caused by the SAE in the forward pass, such that if no modifications are made to the feature activations, there is no impact on the model output."
When we insert SAEs into the LLM, we avoid the extra loss that you refer to above, by adding back the error term caused by the SAE in the forward pass, such that if no modifications are made to the feature activations, there is no impact on the model output or loss.
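In symbols (our notation here, which may differ slightly from the paper's Sec. 1.2): with encoder $\mathrm{Enc}$, decoder $\mathrm{Dec}$, and residual-stream activation $x$,

$$f = \mathrm{Enc}(x), \qquad \varepsilon = x - \mathrm{Dec}(f), \qquad x_{\text{out}} = \mathrm{Dec}(f') + \varepsilon,$$

where $f'$ is the (possibly modified) vector of feature activations. If no features are modified ($f' = f$), then $x_{\text{out}} = \mathrm{Dec}(f) + x - \mathrm{Dec}(f) = x$, so inserting the SAE is exactly lossless.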
I still think the authors should clarify the metrics on both the full and selected subsets of the evaluation dataset, which I believe will enhance the transparency of the results. In the case of MMLU, using 97 or 300 (on average, about 200) questions out of the total 15K questions diminishes credibility. Similarly, for WMDP, using 172 or 522 (on average, about 300) questions out of the entire 3.7K questions also raises concerns. I think such selection criteria are arbitrary compared to other LLM literature, limiting the method's applicability in more general scenarios.
We have attempted to discuss in our previous response why it is not possible to use the entire WMDP dataset in the framework of our study. We believe that it is not helpful to include questions that the model gets wrong for certain permutations of the 4 answers, and that if included, they might make the unlearning technique appear more effective than it actually is. To avoid overstating the effectiveness of our method, we have not included these questions in our metrics.
Furthermore, we do not believe we are making any unusual or strong claims based on the MMLU results. On the contrary, we are just pointing out one of the limitations of using SAEs for unlearning.
If you feel any specific claim made in the paper is not credible, we would be happy to discuss and make changes.
I maintain that selecting subsets based on prompts' permutation-resistant criteria is problematic. Most of the metrics in the paper are based on the subset of questions that the base model gets correct for all 24 permutations of prompts. Such questions are likely those that the model is sufficiently confident to answer or are common-sense questions that are easy to answer. I believe these types of knowledge are difficult to unlearn, and I'd appreciate your thoughts on this.
We agree that such questions are likely to be those that the model is sufficiently confident to answer - this is the purpose of selecting permutation-resistant questions. If we select questions that the model gets right by accident, the model may answer incorrectly following a modification, but we probably haven't unlearned anything, since the model didn't really contain the knowledge to answer those questions to begin with. We are attempting to measure unlearning of knowledge that the model contains, and to avoid being affected by the fact that a model will sometimes provide the correct answer to a multiple choice question without containing enough knowledge to answer the question. We also want to avoid overstating the effectiveness of our unlearning method.
We thank the reviewer for taking the time to read our paper and provide the thoughtful comments below. We agreed with many of them and discuss our responses below. Following this discussion, we hope that the reviewer will consider raising their score and accepting the revised version of the paper.
Weaknesses:
Novelty: I am not sure about the difference between the paper’s method and negative activation in [1] and [2].
We agree with your general comment. Indeed, our paper’s method and the negative activation in previous work are applying the same approach.
The novelty in our paper is applying an existing approach (i.e. sparse autoencoders trained on language model intermediate activations, and clamping feature activations to a given value) to a new problem (i.e. unlearning knowledge from language models). We think it is an important research contribution to systematically show that negative scaling is crucial for unlearning and zero ablation is insufficient: [1] and [2] only present case studies of negative scaling. They also do not perform negative scaling of many features at once as we do in our “Evaluation of SAE Unlearning” section.
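For concreteness, here is a minimal sketch of the kind of clamping intervention we mean (illustrative code rather than our actual implementation; the dimensions, feature index, and clamp value below are placeholders, and feature #9163 is just the example feature discussed elsewhere in the paper):

```python
# Minimal sketch of clamping SAE feature activations during the forward pass.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard ReLU SAE mapping residual-stream activations to sparse features."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

def intervene(sae, resid, feature_ids, clamp_value=None):
    """Pass residual activations through the SAE, clamping the chosen features.

    clamp_value=None -> zero ablation of the chosen features (found ineffective)
    clamp_value=-c   -> clamp to a negative value (what we find is needed)
    The reconstruction error is added back, so an unmodified pass is lossless.
    """
    feats = sae.encode(resid)
    error = resid - sae.decode(feats)          # SAE reconstruction error term
    if clamp_value is None:
        feats[..., feature_ids] = 0.0
    else:
        feats[..., feature_ids] = clamp_value  # many features can be clamped at once
    return sae.decode(feats) + error

# Toy usage (random weights, purely illustrative):
sae = SparseAutoencoder(d_model=2304, d_sae=16384)
resid = torch.randn(1, 8, 2304)                # (batch, seq, d_model) activations
with torch.no_grad():
    out = intervene(sae, resid, feature_ids=[9163], clamp_value=-100.0)
```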
Unlearning Performance: The submission underperforms relative to existing unlearning methods, notably RMU, as benchmarked by the WMDP. While it proposes an innovative approach, it fails to deliver superior results compared to RMU across several metrics, raising concerns about its effectiveness and relevance in high-stakes applications.
The goal of our work was not to produce a technique that outperforms RMU on WMDP, but rather to investigate whether a new, promising interpretability approach (i.e. sparse autoencoders) could be applied to the important, difficult and unsolved problem of unlearning knowledge from language models.
We do not have control over the effectiveness of SAEs in this context; we are merely measuring it. Therefore, we do not consider it a weakness that the method “underperforms” RMU. The strength of our paper comes from the careful application of a novel technique to an interesting, open problem and the honest presentation of the strengths and weaknesses of this approach. We think that this is an important scientific contribution and would appreciate further clarification if you disagree with this.
Helpfulness: Furthermore, the exact MMLU accuracy of the Gemma models in absolute terms is unclear, although it appears to be close to random (approximately 25%). The authors should clarify the selection criteria, the number of questions in the MMLU subset, and whether the overall performance of the model improves or worsens.
We agree that we should clarify the number of questions in the MMLU subset: indeed these details are already quoted in the third paragraph of Sec. 3.2.
We do not state anywhere that the MMLU accuracy of the Gemma models is close to random. Rather, we quote that “gemma-2b-it achieves a base score of 560/1273 (44.0%) on the WMDP-bio dataset” in Sec. 3.1.
Hello reviewer zKVH, we only have 23 hours left to be able to respond to your questions and concerns. We think that we've addressed them as detailed in our discussion in this thread, so could you let us know ASAP about further worries, or otherwise would you be open to revising your initial judgement?
This paper tests the use of sparse autoencoders (SAEs) for unlearning specific knowledge in LLMs; their application of interest is the biosecurity-related Weapons of Mass Destruction Proxy dataset. The authors selectively suppress SAE features that are highly active on the WMDP dataset (i.e. Figure 2). Experiments compare SAE-based unlearning with Representation Misdirection for Unlearning methods; they test accuracy and side effects on the OpenWebText dataset. The authors report effective unlearning with some success in minimizing side effects, though challenges remain with both interpretability and the general applicability of SAEs for unlearning.
Strengths
- Unlearning
- Minimal side effects
Weaknesses
- Not concise: I still don't fully understand what a SAE is. The entire paper proposes a new methodological framework without a single math equation. It is very hard to follow and reads like a conversation between LLM software engineers more so than a technical report on a new methodology.
- Plots everywhere: why is figure 10 cited an entire page before figure 2? The figures should be placed in close proximity to the text in which it is being discussed
- What causes the drop on OpenWebText? What datapoints do you lose performance on? How many datapoints do you misclassify? What type of questions are they?
- It might just be unlearning biology related content. There should be more focus on testing biology related content.
Questions
Figure 2: the figure says it's a distribution, but it has counts on the y-axis. Also, it does not seem to be normalized per dataset, i.e. the OpenWebText is a lot smaller, why is that? Also, why is figure 2 on page 2 instead of next to the text it's mentioned in? Please fix that; it makes it hard to follow the storyline.
"Interestingly, the model modified using RMU answers option “A” on 62% of questions, compared to 25% for the base model." I dont understand what that means.
"2.4 HOW TO SELECT RELEVANT SAE FEATURES" by this point in the paper I still have no idea what a SAE feature is and this section is unintuitive to me. In general, until this point, the paper has been very verbose and has had limited conciseness.
Section 3: The entire section has heavy use of DL/LLM lingo and explains the methodology using "math-intuition". This is, in fact, quite unintuitive to me. Please rewrite and be concise about the method you are proposing. I am sure you can do it in much less space.
We thank the reviewer for their detailed reading of the manuscript and the thoughtful comments. We discussed the comments in detail and provide our responses below. We hope that the reviewer will consider raising their score and accepting the revised version of the paper.
Weaknesses:
Not concise: I still don't fully understand what a SAE is. The entire paper proposes a new methodological framework without a single math equation. It is very hard to follow and reads like a conversation between LLM software engineers more so than a technical report on a new methodology.
We are sorry to hear that you do not understand what a sparse autoencoder is. Without any specific examples to refer to, it is difficult for us to implement revisions based on this comment.
We did our best to clearly and concisely describe our method and results throughout the paper. Detailed discussions of sparse autoencoders can be found in the citations that we reference early in the paper and are beyond the scope of our work (for example, the Anthropic interactive paper “Towards Monosemanticity” has a clear definition of the SAE architecture: https://transformer-circuits.pub/2023/monosemantic-features#appendix-autoencoder and a clear motivation for the technique). We also note that none of us are LLM software engineers.
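For readers of this thread, the standard SAE setup referenced there can be written, in generic notation roughly following “Towards Monosemanticity”, as

$$f(x) = \mathrm{ReLU}\big(W_{\mathrm{enc}}(x - b_{\mathrm{dec}}) + b_{\mathrm{enc}}\big), \qquad \hat{x} = W_{\mathrm{dec}} f(x) + b_{\mathrm{dec}},$$

$$\mathcal{L}(x) = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}} \;+\; \lambda \underbrace{\lVert f(x) \rVert_1}_{\text{sparsity}},$$

where $x$ is a model activation (here, a residual-stream vector), $f(x)$ is the vector of feature activations, and the sparsity penalty encourages each activation to be explained by a small number of interpretable features.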
Plots everywhere: why is figure 10 cited an entire page before figure 2? The figures should be placed in close proximity to the text in which it is being discussed
As far as we are aware, figures from appendices are usually cited throughout the paper.
What causes the drop on OpenWebText? What datapoints do you lose performance on? How many datapoints do you misclassify? What type of questions are they?
We would like to clarify here that OpenWebText is not a benchmark, but rather a proxy for the pre-training objective of the language models. We are measuring the cross-entropy loss on this distribution.
It might just be unlearning biology related content. There should be more focus on testing biology related content.
Yes, the technique is unlearning biology-related content. Our paper claims to unlearn knowledge in language models, and our metrics track how much of the WMDP-Bio content is unlearned while the model still retains strong average-case performance on MMLU/OpenWebText, and unlearning some general biology-related content is useful for getting good results according to these metrics.
Questions:
Figure 2: the figure says it's a distribution, but it has counts on the y-axis. Also, it does not seem to be normalized per dataset, i.e. the OpenWebText is a lot smaller, why is that? Also, why is figure 2 on page 2 instead of next to the text it's mentioned in? Please fix that; it makes it hard to follow the storyline.
A plot of a distribution usually has counts on the y-axis. As stated in the caption of that figure, the plots are based on the same number of base tokens. The difference in the area under the curve is part of the message of the plot - the feature fires on WMDP-bio prompts much more than on OpenWebText (we do not plot any datapoints where the feature did not fire, as this would dominate all space on the plot).
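To make this comparison concrete, here is a rough sketch (ours, with hypothetical variable names) of how one can tabulate a feature's non-zero activations over equal numbers of tokens from the two datasets:

```python
import torch

def firing_stats(feature_id, forget_acts, retain_acts):
    """forget_acts / retain_acts: (n_tokens, d_sae) SAE feature activations
    computed over the same number of tokens from WMDP-bio and OpenWebText."""
    f = forget_acts[:, feature_id]
    r = retain_acts[:, feature_id]
    return {
        "forget_fire_rate": float((f > 0).float().mean()),
        "retain_fire_rate": float((r > 0).float().mean()),
        "forget_nonzero": f[f > 0],   # values that appear in the histogram
        "retain_nonzero": r[r > 0],
    }

# A feature that fires far more often (and more strongly) on WMDP-bio tokens
# than on OpenWebText tokens is a candidate for the unlearning intervention;
# zero activations are omitted from the plot, as noted above.
```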
"Interestingly, the model modified using RMU answers option “A” on 62% of questions, compared to 25% for the base model." I dont understand what that means.
This means that when prompted with the WMDP-bio dataset, the model modified by RMU provides “A” as the answer on 62% of the questions. However, in the base dataset, “A” is the answer for only 25% of questions. We felt this was interesting to point out in the paper.
"2.4 HOW TO SELECT RELEVANT SAE FEATURES" by this point in the paper I still have no idea what a SAE feature is and this section is unintuitive to me. In general, until this point, the paper has been very verbose and has had limited conciseness.
We are open to specific suggestions regarding the text and communication of the message of the paper, however this comment does not provide any specific examples of what you feel should be removed from the paper.
Section 3: The entire section has heavy use of DL/LLM lingo and explains the methodology using "math-intuition". This is, in fact, quite unintuitive to me. Please rewrite and be concise about the method you are proposing. I am sure you can do it in much less space.
We are open to criticism of specific examples in the text that you feel are not clearly explained, but referencing “math-intuition” does not direct us to any specific changes that you feel should be made.
Upon reading the other reviews, I have upped my review from 3 to 5. The method is quite interesting, but I do believe the writing is quite lackluster and unorganized, thus I am still opposed to acceptance until the writing improves.
"Yes, the technique is unlearning biology-related content. Our paper claims to unlearn knowledge in language models, and our metrics track how much of the WMDP-Bio content is unlearned while still the model retains strong average case performance on MMLU/OpenWebText, and unlearning some general biology-related content is useful for getting good results according to these metrics."
I appreciate the change in the manuscript to highlight biological deterioration. However, this is the analysis that I truly want to see. What happens to related fields when removing knowledge that should have specifically targeted biological weapons? We surely don't want LLMs to remove all biological knowledge.
Upon reading the other reviews, I have upped my review from 3 to 5. The method is quite interesting, but I do believe the writing is quite lackluster and unorganized, thus I am still opposed to acceptance until the writing improves.
We greatly appreciate the increased score.
To try to improve the organisation and presentation in the paper, we followed your advice from the comments above regarding mathematical notation and we added a new Notation subsection (1.2) to the revised version to explain SAE and our method in equations. We hope that this will add clarity to the paper and improve the reading experience.
"Yes, the technique is unlearning biology-related content. Our paper claims to unlearn knowledge in language models, and our metrics track how much of the WMDP-Bio content is unlearned while still the model retains strong average case performance on MMLU/OpenWebText, and unlearning some general biology-related content is useful for getting good results according to these metrics."
I appreciate the change in the manuscript to highlight biological deterioration. However, this is the analysis that I truly want to see. What happens to related fields when removing knowledge that should have specifically targeted biological weapons? We surely don't want LLMs to remove all biological knowledge.
We agree that this unlearning method should ideally be very precise and would love to see this explored in the future. There are two reasons why we didn’t pursue this direction further in this paper. First, the precision of the SAE features is limited by the size and quality of the SAE. In this work, we use SAEs with 16k features, which doesn’t easily allow for precise distinctions between very similar biology topics. Second, we found that many of the questions in the WMDP-bio dataset are testing broad, general biology knowledge, and are not very specific to bioweapons. We hope that future work could explore how precise the unlearning can be, without impacting similar, but distinct topics.
Hello reviewer tFqg, we only have 23 hours left to be able to respond to your questions and concerns. We think that we've addressed your concerns about writing in the PDF, so could you let us know ASAP about further questions, if you have any?
The paper uses sparse autoencoders to identify features related to biology, which they isolate and clamp down to unlearn on a subset of the WMDP dataset. They use two Gemma instruction-tuned models (gemma-2b-it and gemma-2-2b-it), intervening at an intermediate layer on specific bio-related features. They find that such an intervention can be successful in unlearning, while adding relatively small loss on a generic text dataset.
Strengths
I think applying SAEs to this task is useful – for us to do good unlearning we almost certainly want an interpretable method, so these are worthwhile first steps. I like the depth you went into in Section 3, as well as Figure 2. I think the methodology was clearly defined, as well as the metrics and tasks you were evaluating. I think there are some nice ablations as well, e.g., Section 4.2.
Weaknesses
I think the messaging of the paper needs to change to increase the novelty by emphasizing how your work and existing work (RMU) differ. Specifically, how yours is more interpretable and why that’s a good thing. I know the latter is mentioned in the intro but that’s the most substantive discussion of this difference, which is the main reason right now people would want to use SAEs for unlearning.
I think the experiments section should include both gemma models. Relatedly, I think Figure 4 is weak and makes the results hard to interpret. I also don’t really know why the added loss would matter much when you show such a relatively large deterioration on MMLU.
I’d prefer you to show at least some results on even one other subset of WMDP just to see how this generalizes. I wonder if SAEs may be more helpful than fine-tuning if we are trying to unlearn a combination of separate tasks, e.g., both biology and chemistry.
Finally, there is a lack of a related work section, which I feel is necessary as I think you could better contextualize your work, e.g., by further showing how people are currently using SAEs to adjust features.
Questions
- Why did you select the hyperparameters you did for the clamped feature activations (1x, 10x, 50x, etc.)? Same for RMU hyperparameters? Specifically, the layers.
- Why did you use 300x in Figure 5 but not Figure 4?
- On page 2, “We trained SAEs at several intermediate layers of the residual stream” – which layers? Sometimes you use layer 3, sometimes layer 9.
- Why do you use gemma-2b-it for Section 3, but don’t present results for gemma-2b-it in the main body? There is a disconnect between this and Section 4, which uses gemma-2-2b-it.
- On page 5, there are experiments mentioned but I’d like to see the results somewhere: “To investigate the importance of the particular feature that we selected, we performed the same ablation on a variety of features that activate on this prompt, chosen at random.”
- Figure 1 feels bare. You could consider including some results in half of this figure or some examples on the bio WMDP subset so it is more tailored to your goal instead of a somewhat bare depiction of a SAE.
These did not affect my review, but some smaller things to note:
- Could you use different symbols to represent the feature number and question? You use “#” for both feature number and question number (e.g., “feature #9163” and “question #841”), which I think is distracting.
- In the conclusion you say intervening using “10-20” features – I would change it to 10 or 20 because you don’t use any intermediate values in the range.
- On page 4, “The model still provides “A” as the correct answer with probably” -> “probability”
- On page 6, I would explicitly define what the notation means for readers who are less familiar with SAEs.
- In Figure 5, you should use a different color for the random decoder vector, as in Figure 4 you use that same color to represent a 20-feature intervention.
- On page 8, you say “We propose four key directions for future research” but only provide three.
We thank the reviewer for their detailed reading of the manuscript and the thoughtful comments below. We addressed each of the comments below and made appropriate modifications, where possible, in the revised version of the manuscript. We hope that based on the discussion below and revisions to the paper, the reviewer will consider raising their score and accepting the revised version of the paper.
Response to Weaknesses:
I think the messaging of the paper needs to change to increase the novelty by emphasizing how your work and existing work (RMU) differ. Specifically, how yours is more interpretable and why that’s a good thing. I know the latter is mentioned in the intro but that’s the most substantive discussion of this difference, which is the main reason right now people would want to use SAEs for unlearning.
We agree on the importance of emphasising the distinction between our work and RMU and have addressed this in the revised version. We have added a new “Related Work” section (Sec 1.1) and moved some of the content discussing RMU from later in the paper into this section. We feel that this more appropriately emphasises why an interpretable method is valuable and the nature of the difference between RMU and our method.
I think the experiments section should include both gemma models. Relatedly, I think Figure 4 is weak and makes the results hard to interpret. I also don’t really know why the added loss would matter much when you show such a relatively large deterioration on MMLU.
The reason the added loss is relevant to look at is that MMLU does not capture every type of impact an intervention has on a model, regardless of the level of deterioration on MMLU. MMLU only tests high-school and early-college academic subjects, and further only requires outputting one of four tokens (A/B/C/D). On the other hand, cross-entropy loss on a wide text distribution tests all domains on the internet, and the prediction of all tokens. Based on the comment above, it’s not clear to us why you feel that Figure 4 is weak. Is there anything you can concretely suggest to improve this, or else are you willing to retract your criticism?
I’d prefer you to show at least some results on even one other subset of WMDP just to see how this generalizes. I wonder if SAEs may be more helpful than fine-tuning if we are trying to unlearn a combination of separate tasks, e.g., both biology and chemistry.
We agree with this comment and think that our work will enable lots of future research to study other unlearning domains. In the current paper, the models we studied had very low performance on WMDP-cyber and WMDP-chem (between 25 and 35% for gemma-2b-it), and we decided it was inappropriate to measure unlearning on datasets on which the base models performed this poorly. We have added a sentence to the revised version to reflect this.
Finally, there is a lack of a related work section, which I feel is necessary as I think you could better contextualize your, e.g., by further showing how people are currently using SAEs to adjust features.
We agree with the reviewer’s comment and have now modified how we present the paper to include a Related Work section in the revised version and to better contextualize our results, as suggested by the reviewer.
Why did you select the hyperparameters you did for the clamped feature activations (1x, 10x, 50x, etc.)? Same for RMU hyperparameters? Specifically, the layers.
We tested the unlearning performance across all layers of the model for the SAE unlearning technique. We also tested a range of layers from 0 to 12 for the RMU technique, as well as performing a hyperparameter sweep for the RMU method to maximise its performance on the dataset for the models we studied.
Why did you use 300x in Figure 5 but not Figure 4?
This was simply a presentation choice in the figures.
On page 2, “We trained SAEs at several intermediate layers of the residual stream” – which layers? Sometimes you use layer 3, sometimes layer 9.
We clarified that we trained SAEs at layers 3, 5, 7, 9, 11 and 13 in the revised version of the text. Based on our hyperparameter sweeps, we presented results for layer 9 for gemma-2b-it and layer 3 for gemma-2-2b-it. We agree that this was not clearly presented and we have attempted to fix this in the revised version.
Why do you use gemma-2b-it for Section 3, but don’t present results for gemma-2b-it in the main body? There is a disconnect between this and Section 4, which uses gemma-2-2b-it.
This was simply a presentation choice to keep the paper sufficiently short and to reflect the fact that we investigated unlearning with SAEs in both models in detail. We wanted to present gemma-2-2b-it in detail in Section 4 as it is the more capable model.
On page 5, there are experiments mentioned but I’d like to see the results somewhere: “To investigate the importance of the particular feature that we selected, we performed the same ablation on a variety of features that activate on this prompt, chosen at random.”
We agree, and we have added a new section to the appendix containing two figures describing the results of this test in the revised version.
Figure 1 feels bare. You could consider including some results in half of this figure or some examples on the bio WMDP subset so it is more tailored to your goal instead of a somewhat bare depiction of a SAE.
We feel that this figure is a useful and concise description of our technique and is valuable as an introductory figure to our paper, and feel that adding WMDP-bio dataset examples may overcrowd the figure. We prefer to leave it as it is.
Hello reviewer C6qk, we only have 23 hours left to be able to respond to your questions and concerns. We think that we've addressed them as detailed in our two comments, so could you let us know ASAP about further worries, or otherwise would you be open to revising your initial judgement?
In this paper, authors investigate the use of sparse autoencoders (SAEs) for selective knowledge removal in language models, specifically testing if SAEs can unlearn bioweapon-related knowledge from Gemma models using the WMDP-Bio dataset. The core finding is that negative scaling of SAE feature activations can effectively unlearn targeted knowledge, while zero ablation proves ineffective. The authors demonstrate that multiple SAE features can be manipulated simultaneously to unlearn different topics, though with side effects comparable to or exceeding existing fine-tuning methods.
The key strength lies in applying interpretable mechanisms to the critical challenge of controlled knowledge removal in language models. The authors conduct thorough analyses of how individual SAE features can target specific knowledge, demonstrating the potential for precise control. The methodology is sound and well-documented, with clear ablation studies supporting the main claims. However, several critical weaknesses emerge from the reviews and discussion. As noted by Reviewer zKVH, the paper's novelty is questionable given similarity to negative activation approaches in prior work. While the authors argue this is an application of existing methods to a new problem, they do not provide sufficient theoretical analysis of why negative activation enables unlearning. The evaluation methodology raises concerns, with Reviewer tFqg highlighting that removing all biology knowledge rather than specifically targeting bioweapon information limits practical utility. The selection criteria for evaluation questions, focusing only on permutation-resistant examples, potentially bias results toward easily-learned knowledge that may be harder to unlearn, as pointed out in the discussion with Reviewer zKVH.
The writing requires substantial improvement. Reviewer tFqg notes the lack of mathematical formalization and clear explanation of SAE features makes the methodology difficult to follow. While the authors added mathematical notation in response, the core explanations remain unclear. The paper's organization needs work, with figures placed far from relevant text discussions.
Altogether, the following points are considered while making the recommendation: 1) Limited novelty beyond applying existing negative activation techniques without theoretical advancement, 2) Evaluation methodology concerns regarding question selection and broad biology knowledge removal rather than targeted unlearning, and 3) Unclear presentation that makes the technical contribution difficult to assess. While the authors provided detailed responses, the core issues around novelty and evaluation methodology remain unresolved. Based on the key points mentioned above, it is recommended to not accept the manuscript in its present form.
Additional Comments from Reviewer Discussion
See above.
Reject