PaperHub
5.0/10
Rejected · 4 reviewers
Lowest: 3 · Highest: 6 · Standard deviation: 1.2
Scores: 6, 6, 5, 3
Confidence: 2.8
ICLR 2024

Copy Suppression: Comprehensively Understanding an Attention Head

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We present a single attention head in GPT-2 Small that has one main role across the entire training distribution, that we reverse-engineer.

Abstract

Keywords

Interpretability, Mechanistic Interpretability, Large Language Models, Natural Language Processing, Science of Deep Learning

Reviews & Discussion

Review

Rating: 6

This paper does a very deep and comprehensive analysis of an attention head (L10H7). It first defines its role as copy suppression. Then, the authors apply weights-based arguments using QK and OV circuits to mechanistically verify the hypotheses about copy suppression. Finally, they show how the analysis has applications to open problems in ablation-based interpretability.

Strengths

The analysis done in the paper is very insightful and interesting. It analyzes the behaviors of an attention head in a very innovative way. The proposed hypotheses are also well justified with further analysis. Considering that understanding LLMs can be essential to model safety, I believe this paper provides helpful perspectives to the community.

Weaknesses

The paper is limited to a very specific model architecture and probably checkpoint, so whether the conclusion is generalizable and the method is scalable is questionable.

Questions

While reading the paper, I was wondering: how did you find and choose to target L10H7?

Comment

Thank you for the detailed and positive feedback on our paper! We are glad you found our analysis innovative and insightful, and our work relevant to ensuring model safety.

The paper is limited to a very specific model architecture and probably checkpoint, so whether the conclusion is generalizable and the method is scalable is questionable.

Our initial findings already demonstrated the scaling of copy suppression across different models, but in an appendix that we've now moved to the main text (Section 4.1). To further bolster this, we've added a new appendix focusing on Llama models, showcasing the presence and implications of copy suppression in these larger and more complex architectures (Appendix A).

I was wondering: how did you find and choose to target L10H7?

The project initially sprang out of an analysis of the roles of heads L10H7 and L11H10 in the indirect object identification circuit. We started out with the theory that these heads were responding to predictions made by the earlier name mover heads, rather than to task-specific features. After accruing evidence for this hypothesis, we moved to studying these heads on the entire OWT distribution, and found that the hypothesis still held up. We focused more on L10H7 because it was a cleaner example of copy suppression and scored higher on our metrics; however, we also believe L11H10 is implementing a weaker version of the same algorithm. As mentioned previously, our analysis of attention heads over a large class of models in Section 4.1 provides evidence for copy suppression being a universal phenomenon.
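For readers curious what such a search can look like in practice, below is a minimal, hypothetical sketch (not the exact metric used in the paper) of a per-head direct logit attribution scan, assuming the TransformerLens API: heads whose direct contribution to the correct-token logit is strongly negative on IOI-style prompts are candidate copy suppression heads such as L10H7.

```python
# Hypothetical sketch (not the paper's exact metric), assuming the TransformerLens API:
# scan every attention head's direct contribution to the correct answer's logit on an
# IOI-style prompt. Strongly negative heads are candidate copy-suppression heads.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "When John and Mary went to the store, John gave a drink to"
answer = " Mary"
tokens = model.to_tokens(prompt)
answer_id = model.to_single_token(answer)

_, cache = model.run_with_cache(tokens)

# Unembedding direction of the answer token (ignoring the final LayerNorm scale).
answer_dir = model.W_U[:, answer_id]  # [d_model]

scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    z = cache["z", layer][0, -1]  # per-head outputs at the final position: [head, d_head]
    head_out = torch.einsum("hd,hdm->hm", z, model.W_O[layer])  # [head, d_model]
    scores[layer] = head_out @ answer_dir

print("L10H7 direct effect on the answer logit:", scores[10, 7].item())
```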

Comment

Dear reviewer bo4P, as the end of the discussion period is on November 22 (in 2 days) we would like to gently remind you that we have posted a response to your review and hope you can let us know if we have addressed your concerns. If there are any remaining questions, we would love to continue the discussion.

Comment

Hello reviewer bo4P, we notice you kept your review identical but changed the score from 8 to 6. We think our revision has improved the work. Would you be open to updating the review score, or at least discussing the limitation you have now found?

Review

Rating: 6

This paper explores the main role of an attention head in an LLM across the entire training distribution. The authors explain "negative heads" and "anti-induction heads" as instances of copy suppression, and define "negative heads" as attention heads that primarily reduce the model's confidence in token completions. Copy suppression can be described in three steps: a prior copying prediction, attention, and suppression.

They also apply weights-based arguments to explain the role of language model components. Experimental results show that about 77% of the role of attention head L10H7 on GPT-2 Small's training distribution is copy suppression, and copy suppression can explain about 40% of the self-repair phenomenon.

Strengths

  1. This paper defines copy suppression, namely the main role of an attention head across GPT-2 Small's training distribution. Then they apply weights-based arguments to analyze the hypotheses about copy suppression.

  2. Experiments demonstrate that copy suppression explains 39% of self-repair in one setting and, with weights-based evidence, can explain 76.9% of the impact of L10H7 in GPT-2 Small.

Weaknesses

This work only explores the findings on GPT2 models, and it would be better to verify it on more and larger models.

Questions

Are there other attention heads that have similar effects of copy suppression?

Can we apply the findings to further improve models?

Comment

Thank you reviewer DKgn for your feedback on our paper. We have carefully considered your comments and have made significant revisions to address the concerns raised.

This work only explores the findings on GPT2 models, and it would be better to verify it on more and larger models

As stated in our general reply, we acknowledge that our initial presentation did not communicate the universality of the copy suppression phenomenon across different models, and we have moved our appendix on this issue into Section 4 to emphasize this point. However, we disagree with your claim, as we tested for copy suppression in a variety of Pythia (up to 6B parameters in size), SoLU, and GPT models in the original paper. We have also added a new appendix that studies larger models such as Llama-2 7B and 13B, and found evidence for copy suppression heads, with limitations due to the nature of the tasks we studied. The Llama results can be found in Appendix A.

Are there other attention heads that have similar effects of copy suppression?

The universality of copy suppression is now one central theme of the discussion in Section 4.1, highlighting its frequency across transformer models of various sizes and architectures. These changes help show that there are, indeed, other attention heads in other models that have similar copy suppression effects. Within GPT-2 Small, head L11H10 also acts as a copy suppression head, though more weakly (and it may also have other functions).

Can we apply the findings to further improve models?

There has been a rise in techniques which rely on corruptions of forward passes to improve and steer model behavior, such as Inference Time Interventions (Li et al., 2023: https://arxiv.org/abs/2306.03341) and Activation Addition (Turner et al., 2023: https://arxiv.org/abs/2308.10248), which intervene on a subset of model layers. We believe work such as ours provides a framework for understanding the impact of specific real-time interventions such as ablations, as well as future model interventions. We also discussed this in the section "i) our work has actionable consequences" with reviewer ccpw.

Additionally, we believe our work enhances the toolkit available to practitioners for understanding large language models. In particular, previous works were misled in their thinking about Negative Heads, and interpretability research has frequently been bottlenecked by its applicability and generality. We believe the methods and results of this paper provide a meaningful contribution towards refining our understanding of how transformers operate internally, and towards enabling future interpretability work.

We hope that these revisions and clarifications underscore the importance and universality of our work. We appreciate the opportunity to enhance our paper based on your insightful feedback and look forward to your reconsideration. Does our reply remove your reservations about our work?

Comment

Dear reviewer DKgn, as the end of the discussion period is on November 22 (in 2 days) we would like to gently remind you that we have posted a response to your review and hope you can let us know if we have addressed your concerns. If there are any remaining questions, we would love to continue the discussion.

Review

Rating: 5

The paper introduces the concept of copy suppression, where a head suppresses a predicted token if it appears earlier in the context. The authors focus on Attention Head 10.7 (L10H7) in GPT-2 Small, which plays a crucial role in copy suppression throughout the training distribution (the OpenWebText dataset is used). This head prevents naive token copying, and the corresponding analysis can explain negative heads in prior works. Moreover, the paper presents evidence of copy suppression's role in self-repair, where downstream neural network components compensate for the ablation of crucial components. In a narrow task, copy suppression explains 39% of the behavior.

Strengths

  1. This paper presents an interesting hypothesis called "copy suppression": If components in earlier layers predict a certain token, and this token appears earlier in the context, the head suppresses it. The paper conducts extensive experiments to verify this hypothesis. The results show that a single head can play a complete role, which helps deepen our understanding of attention heads.
  2. Copy suppression helps to understand the self-repair phenomenon, and the author conducts a quantitative analysis on this topic.

Weaknesses

  1. The conclusions given in the paper about transferability across different model classes, sizes, and data are not clear. In my opinion, this is the biggest issue with this paper. Although the authors' experiments involve other models such as GPT-2 Medium and Pythia besides GPT-2 Small, they still do not eliminate concerns about this issue. The unclear applicability of the conclusions makes it difficult to assess the paper's contribution.
  2. The presentation of this paper is not clear enough. For example, Figure 1 shows an illustration of L10H7's copy suppression mechanism, which is not easy for readers to understand. In Section 3.1, the W_U matrix appears for the first time without any clear explanation, which may cause difficulties for readers.

Questions

  1. The connection between copy suppression and self-repair may not be intuitive. Can authors elaborate on the relationship between copy suppression and self-repair? Also, what motivated authors to explore the role of copy suppression in self-repair?
  2. Can authors provide one or two more possible scenarios where insights related to copy suppression might be helpful?
Comment

Thank you reviewer pBv5 for your comments and feedback on the paper. We have made changes based on your review.

The conclusions given in the paper about transferability across different model classes, sizes, and data are not clear

As mentioned in our general comments, we acknowledge that our initial presentation did not communicate the universality of the copy suppression phenomenon across different models, and we have revised Section 4 to emphasize this point. We believe this highlights that copy suppression is a phenomenon that exists across different model classes, sizes, and data, as our experiments include Pythia, SoLU, and GPT models (which are models of different classes and sizes, trained on different datasets). Notably, Pythia (unlike GPT-2 Small) uses rotary positional encodings, which alters the functioning of the attention mechanism, yet copy suppression heads still form. Pythia was also trained without dropout, creating less of a need for self-repair.

The presentation of this paper is not clear enough.

We appreciate your comments on improving the clarity of the paper. Based on your feedback, we have taken steps to make Figure 1 clearer and more legible, removing unnecessary details and adding more to the figure caption (highlighted in red), and have additionally inserted various definitions in Sections 2-4 (including definitions for the W_U matrix, the effective embedding, the logit lens, and Q-composition). We hope this adds clarity to these sections of our paper. Has this addressed your concerns? If not, we would appreciate pointers to how else our presentation may be improved.

The connection between copy suppression and self-repair may not be intuitive. Can authors elaborate on the relationship between copy suppression and self-repair? Also, what motivated authors to explore the role of copy suppression in self-repair?

We agree that the connection between self-repair and copy suppression is not entirely intuitive.

Here is one example we think is intuitive: if the model is currently predicting the next token after the phrase "All is fair in love and", there will be naive copying heuristics in the model which copy the token ' love' from earlier in the context. The role of copy suppression is to suppress this naive prediction.

Now imagine we were to ablate the original copying mechanism, preventing the ‘ love’ token from being copied. This change doesn’t happen in isolation - it also subsequently deactivates copy suppression.

As a result, if one measures the effect of an ablation solely by measuring the presence of the ' love' token in the unembedding, they will not see a significant change, despite the fact that they ablated the original copiers. This is self-repair in action, caused by copy suppression (and the lack of it upon the ablation of copying).
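To make this concrete, here is an illustrative, hedged sketch of the kind of measurement described above, assuming the TransformerLens API; the choice of upstream copying head (L9H9) is purely illustrative and not a claim from the paper.

```python
# Illustrative sketch, assuming the TransformerLens API. We track the ' love' logit
# (i) on the clean run, (ii) with an upstream copying head zero-ablated, and
# (iii) with the copying head ablated *and* L10H7 frozen to its clean output,
# so it cannot relax its suppression. Comparing (ii) and (iii) shows how much of
# the ablation's effect copy suppression "repairs".
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("All is fair in love and")
love_id = model.to_single_token(" love")

clean_logits, clean_cache = model.run_with_cache(tokens)

COPY_LAYER, COPY_HEAD = 9, 9  # illustrative upstream copying head, not a claim from the paper

def zero_copy_head(z, hook):
    z[:, :, COPY_HEAD, :] = 0.0  # z: [batch, pos, head, d_head]
    return z

def freeze_l10h7(z, hook):
    z[:, :, 7, :] = clean_cache["z", 10][:, :, 7, :]
    return z

ablated = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{COPY_LAYER}.attn.hook_z", zero_copy_head)]
)
ablated_frozen = model.run_with_hooks(
    tokens,
    fwd_hooks=[
        (f"blocks.{COPY_LAYER}.attn.hook_z", zero_copy_head),
        ("blocks.10.attn.hook_z", freeze_l10h7),
    ],
)

for name, logits in [("clean", clean_logits),
                     ("copying head ablated", ablated),
                     ("ablated + L10H7 frozen", ablated_frozen)]:
    print(name, logits[0, -1, love_id].item())
```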

The authors had originally discovered copy suppression while studying the roles of heads L10H7 and L11H10 in the indirect object identification circuit, in which these heads clearly self-repair. Once we discovered copy suppression, it was natural to next attempt to connect both phenomena.

Can authors provide one or two more possible scenarios where insights related to copy suppression might be helpful?

In Section 4.1, we highlight how this phenomenon helps explain confusion in prior works related to Anti-Induction as well as Negative Heads in general. We have also already discussed how negativity can help calibrate models in Appendix C.

We believe that our work provides a meaningful contribution towards refining our understanding of how transformers operate internally and will help enable future interpretability work. Even beyond interpretability, we believe work such as this one can help provide insights into methods that rely on ablations and interventions in forward passes, such as Inference Time Interventions (Li et al., 2023: https://arxiv.org/abs/2306.03341) and Activation Addition (Turner et al., 2023: https://arxiv.org/abs/2308.10248), as discussed with reviewer ccpw regarding how “i) our work has actionable consequences”.

We hope that these comments and revisions highlight the importance and universality of our work. We appreciate your valuable input and would appreciate potential reevaluation. Have we successfully addressed your concerns?

Comment

Dear reviewer pBv5, as the end of the discussion period is on November 22 (in 2 days) we would like to gently remind you that we have posted a response to your review and hope you can let us know if we have addressed your concerns. If there are any remaining questions, we would love to continue the discussion.

Review

Rating: 3

This paper examines a single attention head in GPT-2 small and identifies that it attempts to reduce the probability of previous tokens.

Strengths

  • The paper carefully examines a particular head, including analysis and visualizations.

Weaknesses

I would like to preface the discussion here with the comment that perhaps I am not the ideal audience for this paper. But from my personal impression as someone familiar with language modeling, and also interested in model interpretability, I looked at the main contribution of the paper:

Our central claim is that at least 76.9% of the role of attention head L10H7 on GPT-2 Small's training distribution is copy suppression.

and was left with the impression "I'm not sure why I care about this result?" Here are several reasons why I'm not sure if this result is significant:

  1. "negative heads" and "anti-induction heads" have already been discussed in prior work. It was not clear to me what this work contributes on top of these works.
  2. It is not clear why we should care about a single head in GPT-2 Small (a model that few people use in practical settings anymore). If similar heads could be identified across a wider variety of models that would increase the generality and perhaps interest.
  3. Even if "1." and "2." are achieved, it is not clear to me how these findings would be actionable. I don't necessarily think that all interpretability research needs to be actionable, but it makes the relevance of the interpretations much more convincing if there is a path to better system building, more fair/generalizable models, etc.

Questions

I would be interested in answers to my questions above.

Comment

(Continued from "Reply to reviewer ccpw 1/2")

it is not clear to me how these findings would be actionable … it makes the relevance of the interpretations much more convincing if there is a path to better system building, more fair/generalizable models, etc.

Mirroring two paragraphs from our introduction, we think that i) our work has actionable consequences for any work involving ablations in language model forward passes, including beyond interpretability and ii) our work has relevance to fundamental research into understanding deep learning models.

i): consider self-repair, which is partially explained by copy suppression (Section 4.2). Whenever techniques involve ablations in forward passes, self-repair can arise automatically through copy suppression, an observation that our work introduces, and which limits the utility of these techniques. Examples of work that intervenes on the forward pass for practical uses include Inference Time Interventions (Li et al., 2023: https://arxiv.org/abs/2306.03341) and Activation Addition (Turner et al., 2023: https://arxiv.org/abs/2308.10248). We made these arguments in our paper but have added citations thanks to this feedback. If copy suppression heads can be identified and frozen, this may reduce self-repair and make these interventions more effective.

Further, as touched upon in our introduction, “negative components may be ignored by attribution methods that only find the positive components”. Our work finds that some negative components – copy suppression heads – have developed an algorithm that sometimes suppresses the correct answer but improves performance across the training distribution. Though prior work had observed these negative algorithms, they were not studied in detail, and it was plausible that it was merely a spurious bug in the model. Therefore our work provides evidence that it is crucial that attribution methods in interpretability assume some internal model components may be negative.

ii): finally, independent of the arguments in i) we believe our work is a significant contribution to the scientific understanding of complex deep learning models. As discussed in our work, the achievement of reverse-engineering language model components across the whole pretraining distribution is a significant improvement upon the depth of understanding achieved in prior work. We have essentially no ground-up understanding of how large language models are computing their outputs, and a rigorous understanding of the smaller and more tractable models could provide a better groundwork for scaling to a true understanding of models. This is because existing methods of scaling explanations have had limited success (Huang et al., 2023: https://arxiv.org/abs/2309.10312). We believe that a deep understanding of model internals may in the long-term form a basis for more robust and transparent improvements and debiasing of models. A vision for the importance of mechanistic interpretability via basic science is written about in ‘Interpretability Dreams’ (Chris Olah; https://transformer-circuits.pub/2023/interpretability-dreams/index.html).

Comment

Thank you for the detailed response. I have read it, but unfortunately have not had time to review the revised manuscript in detail. Because of this my rating is still based on my original reading of the manuscript, but I will note to the area chair that the revised manuscript may have resolved some of my concerns and uncertainty about the original draft.

Comment

Thank you reviewer ccpw for your feedback on our paper. We hope our answers and edits to the paper address your questions. Overall, though we believe our work has relevance to directly improving model performance and fairness, first and foremost we argue that our work is a significant contribution towards a basic scientific understanding of large language models, an ambitious agenda which in future could have significantly stronger benefits for improving models than less rigorous approaches. Below we explain our response to your three reasons for being unenthusiastic about our contribution.

"negative heads" and "anti-induction heads" have already been discussed in prior work

We think this misunderstands prior work as well as what our work contributes. Anti-Induction (Olsson et al., 2022: https://arxiv.org/abs/2209.11895) and Negative Name Moving (Wang et al., 2022: https://arxiv.org/abs/2211.00593) heads were found due to their behavior, not their mechanism. Specifically, researchers were studying specific tasks and found that there were model components that seemed unhelpful, for reasons that the researchers did not understand. We unify both of these components as common instances of copy suppression heads through their mechanism of detecting a high-confidence prediction, attending to it, and suppressing it (steps 1-3 in our introduction). Put another way, our contribution on top of these prior works is that we point out that their assumptions about task-specific negative components were both wrong: copy suppression is a general motif and is not specifically implementing a form of induction or name moving.
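As a rough illustration of the weights-based "suppression" part of this mechanism, here is a hedged sketch assuming the TransformerLens API; note the paper itself uses an effective embedding that includes MLP0, whereas plain W_E is used here for simplicity. It checks that the diagonal of L10H7's full OV circuit is predominantly negative, i.e. attending to a token pushes that same token's logit down.

```python
# Hedged sketch, assuming the TransformerLens API: check that L10H7's full OV circuit,
# W_E @ W_V @ W_O @ W_U, has a mostly negative diagonal (attending to a token lowers
# that token's own logit). LayerNorm scaling is ignored, and the paper's effective
# embedding (which includes MLP0) is replaced by plain W_E for simplicity.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7

W_OV = model.W_V[LAYER, HEAD] @ model.W_O[LAYER, HEAD]  # [d_model, d_model]

# Compute the diagonal in chunks to avoid materialising a d_vocab x d_vocab matrix.
diag_chunks = []
chunk = 4096
for start in range(0, model.cfg.d_vocab, chunk):
    emb = model.W_E[start:start + chunk]                       # [c, d_model]
    out = (emb @ W_OV) @ model.W_U[:, start:start + chunk]     # [c, c]
    diag_chunks.append(out.diagonal())
diag = torch.cat(diag_chunks)

print("fraction of tokens whose own logit is suppressed:",
      (diag < 0).float().mean().item())
```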

We are thankful that you studied this result, and we have moved it to the main text (Section 4.1), as it is important for addressing the limitations you raised.

If similar heads could be identified across a wider variety of models that would increase the generality and perhaps interest

As addressed in the prior question, our results on negative heads and anti-induction heads are examples of copy suppression heads across GPT, Pythia and SoLU architectures, which is surprising since the copy suppression motif still arises in both of these very different tasks. We have moved this into the main text and we hope this is clearer. Further to your concern that the model studied is not practically deployed, we found evidence of a divide between positive heads and suppression heads in Llama-2-7B and Llama-2-13B (models used in practice) in our new Appendix A, with some limitations due to the nature of the tasks studied. We found suppression heads (that also performed anti-induction) in these larger models, showing generalization of our claims.

(Continued in "Reply to reviewer ccpw 2/2")

Comment

Dear reviewer ccpw, as the end of the discussion period is on November 22 (in 2 days) we would like to gently remind you that we have posted a response to your review and hope you can let us know if we have addressed your concerns. If there are any remaining questions, we would love to continue the discussion.

Comment

We thank the reviewers for their feedback on our paper. As stated in our paper’s introduction, our three contributions are:

i) finding the copy suppression phenomenon

ii) a detailed explanation of the main role of an attention head across the whole language training distribution (which required the use of several novel approaches and went beyond existing reverse engineering efforts) and

iii) application of our findings to a better explanation of self-repair.

Reviewers generally appreciated ii) and iii) – for example reviewer bo4P identified how we “analyzed the behaviors of an attention head in a very innovative way. The proposed hypotheses are also well justified”. Reviewers questioned how scalable and generalizable contribution i) was, which we think is just one contribution of our work. Nevertheless, we have taken steps to show that copy suppression does generalize. Specifically, a) our existing appendix showing that copy suppression occurs across different model scales and architectures is now in the main text and b) we added an appendix on copy suppression heads in Llama-2-7B and Llama-2-13B along with limitations in Appendix A. We thank the reviewers for pointing out that this was an important part of our message, and worth making more prominent.

We finally note that one of our core contributions is the methodology for understanding an attention head on the full distribution, and that even if the findings do not straightforwardly generalize across all models, we think that our case study is still valuable. For example, [1, 2, 3] all performed analysis of very small models, but have implications for much larger models: "grokking" [1] occurs across many different architectures [4], induction heads [2] are connected to the in-context learning capability of large language models [5], and backup effects [3] generalise to at least 7B models [6]. We believe that building the foundations of a basic science of understanding language models now will in future enable precise and rigorous control and improvement of these systems, which we elaborate on in our reply to reviewer ccpw.

[1] “Progress measures for grokking via mechanistic interpretability”, https://arxiv.org/abs/2301.05217

[2] “A Mathematical Framework for Transformer Circuits”, https://transformer-circuits.pub/2021/framework/index.html

[3] “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small” Section 3.4, https://arxiv.org/abs/2211.00593

[4] “Omnigrok: Grokking Beyond Algorithmic Data”, https://arxiv.org/abs/2210.01117

[5] “In-context Learning and Induction Heads”, https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

[6] “The Hydra Effect: Emergent Self-repair in Language Model Computations”, https://arxiv.org/abs/2307.15771

AC Meta-Review

This paper performs a detailed analysis of a single attention head (L10H7) in GPT2-small that has the effect of suppressing copying behavior. The authors are able to identify how L10H7 is able to achieve this behavior via a weights-based explanation, which is impressive in its level of detail. Reviewer criticism essentially boils down to two main points: (1) how transferable are these findings to more modern models, and (2) what actionable insights do we learn from this analysis? Regarding (1), while the authors do add a brief section on LLaMA models in Appendix A, the writing is too vague to tell if their copy suppression heads behave similarly to L10H7. Regarding (2), I am not as negative on this kind of interpretability research as Reviewer ccpw, as I think it is generally interesting to discover and understand low-level mechanisms in deep networks. However, I do agree that in this specific paper, the analysis is too specific to GPT-2, and the connections to self-repair abilities are not obvious enough to sufficiently excite me about the results.

Why not a higher score

As stated in the metareview, I think the paper doesn't do a good enough job at exploring other models, and the authors did not adequately address this in their revised version.

Why not a lower score

N/A

Final Decision

Reject