PaperHub
Rating: 7.0 / 10 (Poster, 4 reviewers; min 6, max 8, standard deviation 0.7)
Individual ratings: 8, 6, 7, 7
Confidence: 3.5
COLM 2024

Transformer Circuit Evaluation Metrics Are Not Robust

Submitted: 2024-03-23 · Updated: 2024-08-26
TL;DR

We show that the faithfulness metrics used to measure performance of circuits in previous work in mechanistic interpretability are not robust to slight changes in experimental setup.

Abstract

Keywords
mechanistic interpretability, interpretability, circuits

Reviews and Discussion

Review (Rating: 8)

Lots of work has aimed to characterize how networks accomplish tasks by attributing those behaviors to "circuits" inside of the network. These prior works have evaluated the extent to which a circuit explains a behavior by ablating non-circuit components to see if the circuit computes the hypothesized function. This paper identifies several key degrees of freedom in these ablations: granularity of the computational graph representation, type of activation value, token positions, ablation direction, and the set of components. Through experiments, the paper shows that reasonable variations in methodology can produce significantly different results.

I think that this paper is very needed, will be valuable to myself and the people I work with, and should be accepted.

Reasons to Accept

1: This work seems to do an exceptionally good job of engaging with past work down to the inclusion of specific details from prior works.

2: This work seems to join in a useful tradition of works showing that cherished attribution methods fail certain sanity checks. I think that works like this have been really valuable to the literature in past years.

3: This kind of work participates in a really useful process of critiquing past methods. I think that it will definitely be useful for a variety of future works.

4: Figure 3 is compelling to me.

5: I would not have thought of the ideas in this paper despite doing related work myself.

Reasons to Reject

6: Not a strong reason to reject -- I think that the conclusion that the paper draws is not exactly the right one. I think that the main implication of this work is that the important thing to do is not to invent different faithfulness metrics but to figure out how to connect circuit attribution methods to meaningful practical tasks. Currently, almost all claims about the function of circuits in the literature are based on pontification that isn't meaningfully connected to competitive uses in practical applications. Inventing new faithfulness metrics wouldn't solve that. Only real-world applications would.

7: Not a strong reason to reject -- this paper does not propose concrete solutions per se. But I wouldn't expect it to.

Questions to the Authors

  1. What do you think a good evaluation methodology would look like? Is finding a principled patching-based evaluation method hopeless?
Author Response

We thank the reviewer for their detailed and encouraging feedback. We are glad they believe the paper will be a valuable contribution.

The reviewer presents an interesting perspective on the correct resolution to debates around faithfulness metrics. There are a range of views on this topic in the literature. For instance, [1] (as cited by reviewer W9hJ) rejects "whether explanations enable users to make better decisions with the help of the model" as a metric for the faithfulness of an explanation, claiming instead that this is merely "indicative of correlation between the plausibility of the explanations and the model's performance." This is not a debate which we feel our paper needs to resolve. As explained in our comments to other reviewers, our primary motivation is to critique the faithfulness claims of previous papers and warn against over-optimizing particular metrics.

We do however believe that better evaluation is important. This is a perspective shared by others, e.g. [2]. One example of a potential improvement to evaluation methodology could be to test the circuit across a very wide distribution, as compared with the narrow distribution used to discover it, to find whether it implements the same behavior more broadly. In particular, some of the authors are currently investigating a method which would ablate the circuit and test whether the tokens with the greatest decrease in loss across a large dataset are instances of the task that the circuit is supposed to perform.

Alternatively, the interpretability community has had some limited recent success in applying insights on 'features’ or 'representations’ to better control models [3], which is an example downstream application. It is an interesting question to ask whether the same may be possible for entire computational circuits. We suspect it will be found to be significantly harder. Concretely, it may be possible to leverage circuit-based insights to perform highly specific capability unlearning [4] or parameter efficient finetuning [5], as a generalization of recent work attempting to use features to do the same.

[1] https://arxiv.org/abs/2004.03685 [2] https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7/p/f8nd9F7dL9SxueLFA [3] https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html [4] https://arxiv.org/abs/2403.03218 [5] https://arxiv.org/abs/2404.03592

Comment

Thanks to the authors for the reply. I have nothing much to litigate. Here are some concluding thoughts.

  • I have read the other reviewers' concerns, and I think they should be carefully addressed, but I have not seen anything that makes me want to drop below an 8. I will stay.
  • I recommend tweaking the paper's discussion to talk about downstream applications.
  • I think that the upside of this paper is to have a comparable type of impact to https://arxiv.org/abs/1810.03292. So I would recommend revising and sharing the paper in a way that aims to best communicate an analogous message.
  • Good luck
Review (Rating: 6)

This paper studies how circuit faithfulness evaluations change depending on the exact methodology used for calculating the metric. The paper empirically compares the effect of different methodological choices on a variety of circuits that have been identified in prior work. The main result is that the faithfulness scores can change a lot depending on these choices.

Reasons to Accept

  • This paper addresses a timely topic, evaluating mechanistic model explanations. Existing work in this area uses a number of subtly different evaluation methodologies, so it is a useful contribution to provide a unifying survey of these different choices and to try to explain their effect.
  • The experiments cover a good variety of circuits, including several manually-identified circuits from prior work and circuits discovered using more recent automated methods.
  • The authors release a code library: this could be a good contribution, to unify various methods in one place, and also provide more efficient implementations compared to available libraries. (However, I would also suggest that the authors include a relevant benchmark or any other details about the computing resources used.)
  • A number of the specific findings are interesting and could be relevant to practitioners. These include the differences between ablation types and the effect of the dataset size used for calculating mean ablations. I also appreciated the discussion in section 5 about Tracr models, which illustrates the implication of a specific design choice (mean ablation at the token level still preserves position information).

Reasons to Reject

  • I believe the paper is missing a formal definition of faithfulness, and a discussion of prior work that has attempted to define faithfulness in the context of explanations for NLP models. In particular, I think the paper should discuss [1, 2, 3]. Having a clear definition is important for helping us reason about which design choices are appropriate.
  • Similarly, I think the paper would benefit from a more formal or mathematical description of the overall methodology used for calculating these faithfulness metrics. Without this, section 3 is difficult to understand, especially for people who aren't familiar with these methods, and it's difficult to reason about the effect of different design choices. (I was especially confused by section 3.1.4, "Token Position".) As a suggestion, most of these approaches are based on causal mediation analysis [5, 6], and I think this kind of presentation would be helpful here. For some recent references that use this framework for formalizing circuit faithfulness, see [7, 8].
  • The paper is also missing a formal, mathematical definition of the "logit difference recovered" metric. (The exact formulation of this metric turns out to be important in section 4.1 so it should be defined precisely.)
  • The paper doesn't always provide citations for the methodology choices (for example, in sections 3.1.3-3.2). More generally, I think the paper would benefit from a more systematic review of the specific methods used in prior work--for example, in the form of a table like Table 1.
  • Some of the findings in this paper seem to overlap with [4], who also study the effect of different ablation methods. I think this paper still makes a number of new contributions, but I feel it should discuss the differences from [4] at more length.
  • At a very high level, I have some doubts about the key assumption in this work, which is that faithfulness metrics should be robust to methodological choices like ablation method. It seems to me that different methodological choices reflect different assumptions and measure different things (for example, edge level ablations vs. node level ablations correspond to different subgraphs--we wouldn't expect them to be the same). I think many of the points in the paper are still salient, but should be stated more moderately. For example, instead of arguing that existing metrics are "arbitrary," inconclusive, and misleading, the paper could make the point that these methodological choices actually aren't "slight", and researchers should describe their choices clearly, and use more rigorous and precise language to describe the exact claims that they support. (Along the same lines, I have some doubts about the title. "Robust" commonly means robust to noise. In this case, it's debatable whether the ablation methodology is really noise. I would prefer a more specific phrase, e.g. "sensitive to ablation methodology".)

Summary: I think this paper makes a number of useful findings on an important topic, but the paper is missing a discussion of some foundational prior work (especially [1]), and I feel there are issues with the clarity, presentation, and framing. I would be open to increasing my score if the authors could address some of these concerns.

Update after rebuttal: The authors have proposed several changes to address my main concerns so I am increasing my score accordingly.

References

[1] Jacovi and Goldberg, 2020. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?

[2] Lyu et al., 2024. Towards Faithful Model Explanation in NLP: A Survey.

[3] Geiger et al., 2023. Causal Abstraction for Faithful Model Interpretation.

[4] Zhang and Nanda, 2024. Towards Best Practices of Activation Patching in Language Models: Metrics and Methods.

[5] Vig et al., 2020. Investigating Gender Bias in Language Models Using Causal Mediation Analysis.

[6] Finlayson et al., 2021. Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models.

[7] Marks et al., 2024. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models.

[8] Kramár et al., 2024. An efficient and scalable method for localizing LLM behaviour to components.

Questions to the Authors

  • Could you offer any comments on whether there are any principled reasons to prefer one design choice over another?
  • Section 3.1.3 describes "Resample Ablate" and "Mean Ablate" as "Two more principled approaches" -- In what sense are these two approaches more principled?
  • Can you offer any more details about the meaning of the "Token Positions" methodology in section 3.1.4? Similarly, in appendix C.1 (docstring task)--what exactly does it mean to differentiate by token position?
  • In section 4.1, can you provide more details about how exactly the ABC dataset is used? Is it just used to calculate the mean activations which are used as the ablation value?
Author Response

Thank you for the detailed feedback.

We defer to the definitions of faithfulness used by the authors that discovered the circuits we study. These papers consider faithfulness to be the similarity of the explanation (in this case circuit) to the full model on the task - this is “fidelity” in the language of [1] - but they use a variety of metrics for their exact definition. We will include further discussion of this, and put this in the context of other definitions of faithfulness in the NLP literature in our revised paper.

We thought that diagrams would provide the clearest explanation for our discussions of ablation methodology, but we will also include mathematical descriptions for further clarity (using do-calculus notation). We will add the formula for logit difference recovered and appreciate that this may be easier to understand than our verbal description. A table summarizing prior methodological choices is a good idea, thank you for the suggestion.
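For illustration, one common formulation of this metric in the circuit-evaluation literature (the exact definition used in the paper may differ in details) is

$$
\text{logit diff recovered} = \frac{\mathrm{LD}_{\text{circuit}} - \mathrm{LD}_{\text{corrupt}}}{\mathrm{LD}_{\text{model}} - \mathrm{LD}_{\text{corrupt}}} \times 100\%,
$$

where $\mathrm{LD}$ denotes the logit of the correct answer minus the logit of the counterfactual answer, $\mathrm{LD}_{\text{model}}$ is measured on the full model, $\mathrm{LD}_{\text{circuit}}$ on the ablated model, and $\mathrm{LD}_{\text{corrupt}}$ on fully corrupted inputs. Under this form, a score above 100% means the circuit's logit difference exceeds the full model's.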

We agree that Zhang and Nanda (2024) is highly relevant and we do cite it. They perform experiments studying different aspects of ablation methodology. Importantly, they do not arrive at our notion that circuit faithfulness can only be defined by specifying the circuit evaluation and ablation method. We agree these differences should be enumerated in the paper.

On the question of our initial assumptions, we wish to clarify that we are not arguing that faithfulness metrics “should” be robust to methodological choices, we are merely noting that they are not, because prior work implicitly assumed that optimal circuits could be defined by tasks only. We agree that some of the wording choices are too strong and that these can be made more precise. In particular, we like the suggestion of recommending that researchers describe their choices clearly, and use more rigorous and precise language to clarify the exact claims they support. We agree that “sensitive” is a better description than “not robust”, and will adjust wording where possible.

We have benchmarked our library against the most popular circuit discovery library [2] and found it to be much faster. We will include this graph in the final paper.

Differentiating by token position means considering a prompt of n tokens to have n parallel computational graphs that interact only through the attention mechanism. The graphs can be ablated independently.
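As an illustrative sketch (toy shapes and hypothetical names, not our library's API), the difference amounts to the circuit mask gaining a position axis:

```python
# Toy example: a position-aware circuit mask is indexed by (position, edge),
# so an edge can be kept at one token position and ablated at another.
import numpy as np

n_positions, n_edges = 5, 12

# Position-agnostic circuit: one boolean per edge, broadcast to every position.
edge_in_circuit = np.zeros(n_edges, dtype=bool)
edge_in_circuit[[2, 7]] = True
positionless_mask = np.tile(edge_in_circuit, (n_positions, 1))

# Position-aware circuit: edge 7 only participates at the final token position.
positional_mask = np.zeros((n_positions, n_edges), dtype=bool)
positional_mask[:, 2] = True
positional_mask[-1, 7] = True

print(positionless_mask.sum(), positional_mask.sum())  # 10 vs. 6 kept (position, edge) pairs
```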

[1] https://arxiv.org/abs/2004.03685 [2] https://github.com/ArthurConmy/Automatic-Circuit-Discovery

Comment

Thank you to the authors for their detailed response. I think that the proposed changes will address my main concerns (defining faithfulness, clarifying the method descriptions, and adding some more context about the prior work), and I am updating my score accordingly.

Review (Rating: 7)

The work focuses on the six dimensions of variation used in studies evaluating transformer circuits via ablation. It carries out a comprehensive analysis of these and shows that circuit analysis and evaluation are affected by all these dimensions. That is, variations among them lead to differences in what types of conclusions we can draw from this analysis. The paper's main claim appears to be the following: talking about "a circuit responsible for a task" may not be sensible; as the results change depending on the ablation methodology used, we should always narrow our claims accordingly.

Reasons to Accept

  • Overall, the paper provides an insightful contribution to the field of transformer-based mechanistic interpretability, both by showing the limitations and flaws of current approaches and proposing some directions on how to mitigate some of the problems (e.g., by using the same ablation method for ground truth and circuit discovery).

  • Given the growing interest in MI, the contribution of the paper is timely and potentially impactful.

  • The paper is overall well written and provides a good overview of the previous/concurrent literature.

Reasons to Reject

  • The paper should pay more attention to clarifying, introducing, and motivating its terminology. Many terms are used without being explained and this makes the reader’s life hard in various places. For example, what does a percentage of recovered logit difference > 100% mean and how should it be interpreted?

  • As is often the case with papers in the MI field, many methodological details and much of the terminology are taken for granted: e.g., that GPT-2 is used for the experiments (this is never really mentioned in the main paper).

  • The whole paper uses GPT-2 as its reference model. While I understand that this is the most widely used model in the MI literature, I do believe that the "model" is a seventh important dimension of variation to consider in a comprehensive evaluation of circuit discovery and evaluation. To make the claims even more insightful, other (pre-trained) models should have been considered.

  • Section 3.1.5: it is unclear why it makes sense to focus on the complement components (the ones not in the circuit) when performing evaluation. There are two dimensions here that the authors should explain and motivate better: 1) model performance on the task: this is destroyed when a clean circuit is ablated, and restored when a corrupt circuit is 'clean resampled'; 2) evaluation of the circuit: how important/unimportant the target/complement circuit is for the task. This, to me, seems to be independent of model performance, as it only tests whether the role or importance of a circuit (be it the target or the complement) is kept the same or changed after an ablation. I would therefore ask the authors to explain the terms faithfulness and model/circuit performance and their difference more clearly, and to better motivate why focusing on the complement circuit is a valuable approach.

Questions to the Authors

  • “This results in a systematically lower faithfulness as compared with mean ablations”: Are the differences between resample and mean in Figure 3 statistically significant?

  • “The IOI circuit implements the IOI task faithfully on average, but not for many single data points.”: the difference between ABBA and BABA seems to suggest that there may be two distinct circuits responsible for the two orders. Is it something that the authors hypothesized, too, and perhaps tested experimentally?

Author Response

Thank you for the detailed feedback.

We agree that certain methodological points are omitted. We will include a more thorough description of methods at the beginning of each experiment. Regarding the specific points you raise:

  • In our work, we test the evaluation metrics used by prior circuit discovery work. As such, we study the very same circuits found by those works, which are often in GPT-2. We follow [1] here, and many of the circuits that we study are canonical in the MI literature e.g. [2, 3, 4]. We agree that the model choice is another important dimension of circuit discovery, but disagree that the model is a choice to be made when evaluating some particular circuit. Note that we do perform experiments in the appendix on circuits found in different models (e.g. Pythia 2.8B).
  • We will include a mathematical description of logit difference recovered. A logit diff greater than 100% means the logit diff output by the circuit is greater than the logit diff of the full model, which is suboptimal because it means that the circuit is not producing the same output as the full model.

Regarding Section 3.1.5, it is true that the correctness of the circuit is largely independent of the model performance on a task. The aim of localizing model behavior requires only that the circuit reproduces the model's behavior on the task, not that the circuit performs well at the task. As mentioned in our comment to reviewer W9hJ, our use of faithfulness here follows previous work.

Ablating the complement of a circuit is valuable because it tests whether the circuit is sufficient to perform the task by eliminating any useful computations that might be done outside of the circuit. Ablating the circuit itself on the other hand is coarse – for instance, just ablating one important edge in a circuit may be sufficient to destroy performance.
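As a minimal sketch of the two evaluation directions (toy activation matrix and hypothetical names, not our actual implementation):

```python
# Toy example: "activations" is a (component, feature) matrix; a boolean mask
# marks which components belong to the circuit.
import numpy as np

rng = np.random.default_rng(0)
n_components, d = 8, 4
clean_acts = rng.normal(size=(n_components, d))       # activations on the clean prompt
ablation_values = rng.normal(size=(n_components, d))  # e.g. mean or resample activations

in_circuit = np.zeros(n_components, dtype=bool)
in_circuit[[1, 4, 6]] = True

# Ablate the complement: keep circuit components, overwrite everything else.
# Tests whether the circuit alone is sufficient to reproduce the behavior.
complement_ablated = np.where(in_circuit[:, None], clean_acts, ablation_values)

# Ablate the circuit itself: a coarser test, since knocking out a single
# important edge can already destroy performance.
circuit_ablated = np.where(in_circuit[:, None], ablation_values, clean_acts)
```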

Thank you for the idea to run a statistical significance test. The difference between Mean and Resample Ablation was significant when ablating nodes (p=1e-5), but not edges. We will include this result.

The IOI authors do consider the possibility of two distinct circuits that are combined with an OR operation in a late layer (in section 4.1 of their paper), but as they do not themselves notice the difference between the ABBA and BABA prompts, they do not propose this hypothesis.

[1] https://arxiv.org/abs/2304.14997 [2] https://arxiv.org/abs/2405.08366 [3] https://arxiv.org/abs/2310.04625 [4] https://arxiv.org/abs/2309.16042

Comment

Dear authors,

thank you for your extensive rebuttal.

I appreciate your commitment to adding the missing methodological points and clarifying some choices. I am satisfied with the answers to my questions and doubts and (still) willing to accept the paper.

Review (Rating: 7)

The paper addresses the problem of post-hoc interpretability in transformer language models. The authors analyze the faithfulness scores of circuits identified using different methods and show that the results are highly sensitive to the particular metric used, making this approach quite unstable. In particular, they show that circuit faithfulness highly depends on the ablation methodology used to evaluate the circuits. This insight is very relevant given the importance of giving reliable explanations for black box models.

Strengths of the paper:

  • Useful insights for deriving recommendations on applying MI.
  • Availability of a library containing efficient implementations of the circuit-discovery and circuit-evaluation techniques used in this paper.

Weaknesses of the paper:

The paper is not clearly written, the English is slippery at times, and the arguments are not well explained (see the list of comments below). The abstract and introduction create great expectations that are somehow not met in the subsequent sections, and in particular by the results. The recommendations for researchers end up being quite shallow, as do the conclusions.

List of Comments

  • Last line of page 1: "paradigm of that aims" -> "paradigm that aims".
  • Related work section: "Circuit analysis a form" -> "Circuit analysis is a form".
  • "learned algorithm responsible" -> "learned model responsible": I think "model" is more suitable, and in any case the whole sentence ("Circuit analysis a form of post-hoc interpretability focused on understanding...") is unclear.
  • "A circuit is some subgraph" -> "A circuit is a subgraph".
  • "a large amount of researcher-effort to discover.": the sentence seems incomplete.
  • "Ideal circuits exist on the Pareto frontier of faithfulness, completeness and [as the whole network is already trivially optimal for the first two] simplicity": what does this mean?
  • "Besides circuit analysis, MI work more broadly seeks to understand...": please correct and clarify the sentence.
  • "MI has also inspired work in steering model outputs": please explain.
  • "There are many choices MI researchers may choose between...": "choices" and "choose" in the same sentence sounds bad, please use a different word.

Section 3.1 (Ablation Methodology):

  • The last paragraph on page 2 introducing the concept of ablation is a bit confusing because it is not clear whether you are considering ablation or patching, please be clear!
  • "what type of component in the graph ablated": is a verb missing?
  • Define MLP.
  • Define Q, K and V inputs.
  • "Circuits in autoregressive transformers on a narrow distribution are sometimes defined in terms of edges which each act at a given token position.": not clear.

Section 4 (Faithfulness Metrics are Sensitive to Ablation Methodology):

  • "The IOI circuit is specified as an edge-level circuit, but Wang et al. (2023) evaluate its faithfulness via a node-wise ablation methodology": could it have been done differently or better? Please explain (or is this just a problem with the English?).
  • When you explain the IOI circuit, you can have templates BABA or ABBA, and then the corrupted distribution fills the same templates with names in the order ABC: why only three? Because the last A is fixed?

Section 4.1 (Variance Between Ablation Methodologies): You say that you observe systematically lower faithfulness when using one method instead of another. This is not strange, and it does not mean that the two methods are inconsistent if the relative order is maintained.

Reasons to Accept

It addresses a relevant problem

Reasons to Reject

The abstract and introduction create great expectations that are somehow not met in the subsequent sections, and in particular by the results. The recommendations for researchers end up being quite shallow, as do the conclusions.

Questions to the Authors

See list of comments

Author Response

Thank you for the detailed feedback. We apologize for the list of typos and will make sure to correct them.

To clarify the point on the Pareto frontier of faithfulness, completeness and simplicity, we are pointing out that if you consider the neural network itself as an explanation or interpretation for the neural network, this is actually perfectly faithful and complete. However it is not a useful description because it is too complex for humans to understand. We argue that there exist many optimal circuits, which make different tradeoffs between faithfulness, completeness and simplicity.

As stated in Section 3.1 “we consider activation patching to be a specific type of ablation, and use the term Resample Ablation interchangeably.” By MLP (multi layer perceptron) and QKV (attention head queries, keys and values) we refer to internal model activations of the transformer based architectures we consider. We will include a mathematical description of the transformer architecture.

We believe that Wang et al. would have done better to evaluate their IOI circuit at an edge level, because this is the level at which their circuit is specified. We perform this experiment and compare the result to the node-level circuit, which is what they measure.

In the IOI prompt format (e.g. “When A and B went to the shops, A bought milk for B” (ABAB)), the names are arbitrary, so BABA is the same as ABAB and ABBA is equal to BAAB. Our choice of prompt formats follows prior work [1]. We will clarify that we are not arbitrarily omitting an ordering of names.
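For concreteness, a minimal sketch of the clean and ABC-corrupted prompt construction (hypothetical names, simplified relative to the actual dataset construction):

```python
# Template based on the example quoted above; the model's task is to complete
# the prompt with the indirect object's name.
template = "When {n1} and {n2} went to the shops, {n3} bought milk for"

# Clean ABAB prompt: the first name repeats, so "John" is the expected completion.
clean_prompt, clean_answer = template.format(n1="Mary", n2="John", n3="Mary"), " John"

# "ABC" corrupted prompt: the repeated name is replaced by a third name, so there
# is no repeated subject from which to identify the indirect object.
corrupt_prompt = template.format(n1="Mary", n2="John", n3="Alice")
```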

We do not think it is very strange that we observe lower faithfulness scores with Resample Ablations than Mean Ablations. Conceptually, resample ablations induce the network to more strongly predict a particular wrong answer, whereas the mean activation attempts to simply destroy the signal of the clean prompt. We agree that methods may not necessarily be inconsistent if relative ordering is maintained. As we evaluate single circuits in this work (and don’t discover circuits ourselves), we do not study this relative ordering of circuits, but our null hypothesis is that all methodological choices matter. Also, absolute faithfulness scores are interpretable and meaningful.
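As an illustrative sketch of how the two ablation values are constructed (toy shapes and hypothetical names, not our library's implementation):

```python
# Toy example: activations collected over a reference dataset of paired
# clean/corrupted prompts, per component.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_components, d = 100, 8, 4
clean_acts = rng.normal(size=(n_prompts, n_components, d))    # clean-prompt activations
corrupt_acts = rng.normal(size=(n_prompts, n_components, d))  # corrupted-prompt activations

# Mean ablation: replace an ablated component with its average activation over the
# reference dataset, which mostly destroys the prompt-specific signal.
mean_ablation_value = clean_acts.mean(axis=0)  # shape (n_components, d)

# Resample ablation (activation patching): replace it with the activation from the
# paired corrupted prompt, which pushes the model toward a specific wrong answer.
resample_ablation_values = corrupt_acts        # shape (n_prompts, n_components, d)
```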

Our recommendations for researchers are fairly limited as the focus of our paper is to highlight problems in existing circuit discovery research. See rebuttal to reviewer 3bug for further discussion of this point.

[1] https://arxiv.org/abs/2211.00593

Comment

Thank you for your reply. It was useful for clarifying some doubts. I'm satisfied, and I think that the paper is acceptable.

Final Decision

This paper investigates different ablation methods used for measuring the faithfulness of circuits in mechanistic interpretability and shows how such methodological choices can lead to different results. The authors study a diverse set of circuits from prior work, found via automated discovery or manual approaches.

Overall, as unanimously brought up by the reviewers, surveying and unifying prior circuit faithfulness measures provides valuable insights and is timely. Reviewers provided constructive recommendations on how to improve the presentation of the work, clarify definitions, and distinguish the findings from other work that provides evidence for similar (though not identical) interpretability illusions of different patching variations. Perhaps most importantly, reviewers offered more nuanced interpretations of the results, which I agree better communicate the findings. The authors have agreed to incorporate these insights, which I believe will significantly improve the paper.