PaperHub
Average rating: 5.8 / 10
Decision: Rejected · 4 reviewers
Ratings: 5, 6, 6, 6 (min 5, max 6, std 0.4)
Confidence: 3.3
Correctness: 2.5
Contribution: 2.0
Presentation: 2.3
ICLR 2025

Understanding and Enhancing Context-Augmented Language Models Through Mechanistic Circuits

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-05
TL;DR

We mechanistically investigate extractive QA tasks and find that the circuit components can be used towards reliable data attribution

Abstract

Keywords
circuits · mechanistic interpretability · language models · extractive QA

Reviews and Discussion

Review (Rating: 5)

This work explores mechanistic circuits in a real QA task. The authors design two probe datasets and extract a context-faithfulness circuit and a memory-faithfulness circuit. Specifically, they replace the answer token in the context with a semantically similar token to force the model to predict the answer from the context, and with a semantically unrelated token to force the model to rely on its parametric memory. They find that the context-faithfulness and memory-faithfulness circuits are different. Furthermore, they find that a small set of attention heads is important for data attribution. Based on these findings, they propose a new attribution method, ATTNATTRIB, for extractive QA. ATTNATTRIB achieves good results on different language models.

Strengths

  1. This paper focuses on an important question: the difference between the mechanism of in-context learning and memory.
  2. The general idea about designing the probe datasets for context circuit and memory circuit is good.

Weaknesses

  1. The dataset is too small to support such broad conclusions. There are only 200 questions (line 194).
  2. Other experiments should be made to support the conclusions. There are two main conclusions in this paper: a) The circuits of in-context and memory are different (Section 3.3.1). b) A small subset of attention heads are important for data attribution (Section 3.3.2). A hidden hypothesis of the experiments is: in-context learning cases share similar circuits, and memory cases share similar circuits. Only if this hypothesis is verified, constructing the circuits among in-context cases or memory cases is meaningful. The authors should conduct experiments to explore this. For example, they can do experiments on 5 types of knowledge and compare the circuits.
  3. The experimental settings of the comparison between the context case and the memory case is not convincing. Let’s think about one example. The groundtruth case is “Space Seattle is in New York. Where is Space Needle located?” => “Seattle”. The in-context probe case is “Space Needle is in New York. Where is Space Needle located?” => “New York”. The memory case is “Space Seattle is in __. Where is Space Needle located?” => “Seattle”. When comparing the in-context case and the memory case, the circuits are different. However, to get this conclusion, a hidden hypothesis is: the circuits between predictions with “Seattle” and “New York” should be similar. Otherwise, this comparison is meaningless. But this hypothesis is not verified.

Questions

  1. How many cases are there in the datasets for the experiments in Section 3?
  2. What is the distribution of knowledge in the probe dataset? Is it possible to construct the circuits of different knowledge and see whether the circuits are the same in in-context learning/memory?
  3. Is it possible to explore the circuits between cases with similar predictions? For example, some cases with prediction “Seattle” and some cases with prediction “New York”.
Comment

We thank the reviewer for appreciating our interventional study and the design of the probe dataset, which is crucial for obtaining circuits.

Below we address the comments from the reviewer:

“The dataset is too small to support such broad conclusions. There are only 200 questions (line 194).”: The reviewer raises a valid question. We point the reviewer to a recently published paper at ICLR 2024 (entity tracking circuit: https://arxiv.org/pdf/2402.14811) which extracted circuits using a probe dataset of a similar size. Moreover, we highlight that our probe dataset is not templated, which hinders generating a larger number of examples. During the rebuttal, we have nevertheless extended our probe dataset to 1000 examples. We provide the results in the Appendix (Sec. (D)), where we find that the circuit components for both context and memory faithfulness do not change, highlighting that a dataset size of ~200 examples is sufficient. We hope that these new experiments clarify that our dataset size is not a drawback. We also note that we validate the circuits from the interpretability experiments on other large-scale datasets (e.g., HotPotQA, Natural Questions), where we find that ablating the extracted circuits leads to a large drop in accuracy (see updated Sec. (3.4) and Fig. (4)), highlighting that our results generalize to more complex and messier datasets.

“How many cases are there in the datasets for the experiments in Section 3?”: The probe dataset in Section 3 contains 200 examples for extracting the circuit for context-faithfulness and 200 examples for extracting the circuit for memory-faithfulness. We also provide updated results with 1000 examples in Sec. (D) – highlighting that one can obtain the same circuit components with a larger set of examples.

“Other experiments should be made to support the conclusions. There are two main conclusions in this paper: a) The circuits of in-context and memory are different (Section 3.3.1). b) A small subset of attention heads are important for data attribution (Section 3.3.2). A hidden hypothesis of the experiments is: in-context learning cases share similar circuits, and memory case....”: While QA can comprise various knowledge types (e.g., pure extractive QA, reasoning, open-ended QA), we note that our paper specifically targets the pure extractive QA setup. In the updated Fig. (4) added during the rebuttal, we show that the extracted circuit generalizes to other extractive QA datasets (e.g., Natural Questions, HotPotQA), which contain a large diversity of extractive questions. In particular, if the extracted circuit for context-faithfulness is ablated, the language model performs very poorly on these datasets. This result shows that as long as the downstream task is extractive QA, the underlying circuit is similar across different question types.

We believe that mechanistic understanding of other types of QA (e.g., reasoning / open-ended QA) warrants a separate investigation in itself.

“What is the distribution of knowledge in the probe dataset? Is it possible to construct the circuits of different knowledge and see whether the circuits are the same in in-context learning/memory?”: The reviewer asks an interesting question. We note that we extract mechanistic circuits for the pure extractive QA task (e.g., processing a document and obtaining an answer from it). Accordingly, the probe dataset is curated for this case and contains only pure extractive question-answer pairs. To obtain circuits for other QA knowledge types (e.g., different types of reasoning), one needs to strategically curate a new probe dataset and extract a new circuit. We hypothesize that this new circuit (for another QA task) will have some overlapping components with the extractive QA circuit. We have provided results on generalization from the extractive QA circuit to reasoning questions by measuring attribution performance, and we find that attribution using the extractive QA circuit leads to non-trivial performance on reasoning questions. We defer the extraction of circuits for other QA knowledge types to future work, as it warrants a full investigation in itself.

Comment

“Is it possible to explore the circuits between cases with similar predictions? For example, some cases with prediction “Seattle” and some cases with prediction “New York”.”: For the extractive QA task, our findings indicate that the underlying circuit for similar predictions depends on whether the language model is directly using the context information or using its parametric memory to answer (rather than on the prediction itself).

For example, we curated a small set of 50 questions (from KnownQA, such that the parametric answer is not Seattle) where the perturbed token in the context is “Seattle” and the model answers from the context rather than from memory. If our extracted circuit is ablated from the language model for these questions, the extractive QA accuracy drops to 0%. However, on ablating the same context-faithfulness circuit for a set of 20 questions where the answer is “Seattle” (and the model uses its parametric memory to answer), the language model answers 19 questions correctly, a 95% accuracy. We hope that this controlled experiment is further evidence that the circuit across different questions depends on the underlying mechanism of context vs. memory.
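For concreteness, a minimal sketch of the head-ablation protocol used in experiments like the one above is given below. It assumes a HuggingFace Llama-style model and zeroes the output of selected attention heads via forward pre-hooks on o_proj; the (layer, head) pairs, prompt, and model name are illustrative placeholders rather than our exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative (layer, head) pairs; the actual components come from the
# extracted context-faithfulness circuit, not from this sketch.
CIRCUIT_HEADS = [(27, 20), (23, 27), (31, 7)]

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_ablation_hook(heads):
    # Zero the slice of the attention output belonging to each ablated head,
    # right before it enters o_proj.
    def hook(module, args):
        hidden = args[0].clone()
        for h in heads:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)
    return hook

by_layer = {}
for layer, head in CIRCUIT_HEADS:
    by_layer.setdefault(layer, []).append(head)

handles = [
    model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(
        make_ablation_hook(heads))
    for layer, heads in by_layer.items()
]

prompt = "Space Needle is in New York. Where is the Space Needle located? Answer:"
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=5)
print(tok.decode(out[0]))  # compare against the un-ablated model's answer

for h in handles:
    h.remove()
```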

“The experimental settings of the comparison between the context case and the memory case is not convincing. Let’s think about one example. The groundtruth case is “Space Seattle is in New York. Where is Space Needle located?” => “Seattle”. The in-context probe case is “Space Needle is in New York. Where is Space Needle located?” => “New York”. The memory case is “Spa......: We sincerely request the reviewer to clarify the question, so that we can answer it better! We note that in our experiments, the ground-truth case will not lead to the answer of "Seattle", given the perturbation in the context is with a set of tokens which have similar semantic meaning (e.g., city) as New York. We are happy to engage with the reviewer further on this question, after receiving clarifications.

Comment

We thank the reviewer again for the constructive comments during the rebuttal! We have strived to address the comments and incorporate suggestions to improve the paper (do check the updated paper!). Given that the discussion period is ending very soon, we would like to check with the reviewer whether our rebuttal has addressed their comments.

Comment

Thank you for the responses. I have raised my score. The result on 1,000 examples is more convincing.

My main concern is that different types of knowledge may rely on distinct circuits, and that average scores across the entire dataset may be misleading. For example, knowledge1 is represented by circuits L1H1 and L2H2, while knowledge2 relies on circuits L1H2 and L2H1. If we compute the importance of these circuits across the whole dataset, the contributions of L1H1, L1H2, L2H1, and L2H2 would be treated equally, which may not accurately reflect their actual significance. This issue could also arise with in-context circuits. Therefore, conducting circuit analysis across the entire dataset may be too coarse a method.

Comment

The circuits may vary across different types of knowledge. The experiments using an 800/200 split on the probe dataset do not fully address this concern. One straightforward approach would be to illustrate the memorization and ICL circuits for different types of knowledge. Previous studies suggest that different knowledge may involve distinct circuits [1, 2, 3].

I understand that your work follows the mechanistic interpretability literature, which typically averages over a well-crafted probe dataset. However, I believe this experimental setting may not be entirely accurate, which is why I think further experiments are necessary. The results from these experiments could provide valuable insights, and the experimental setup could have broader implications for other studies as well.

While I’m not certain if my concern directly applies to this work, I do believe that averaging scores across a dataset may not provide accurate results or insights. I will discuss this further with other reviewers and AC.

[1] Dissecting Recall of Factual Associations in Auto-Regressive Language Models, EMNLP 2023

[2] Knowledge Circuits in Pretrained Transformers, NeurIPS 2024

[3] Neuron-Level Knowledge Attribution in Large Language Models, EMNLP 2024

Comment

We thank the reviewer for the updated score and the continued engagement, and we are glad that we were able to address your concerns!

We believe that the reviewer raises an interesting point about different knowledge types. While we generally follow the well-established norm in the mechanistic interpretability literature (averaging over a well-crafted probe dataset), during the rebuttal we investigated, with additional experiments, whether this is the best way to obtain circuits:

To validate this further, we partition the probe dataset (of size 1000) into a set of 800 and a set of 200. The set of 800 is used to find the parametric memory circuit, and on the set of 200 the extracted circuit is ablated. If the knowledge types across the 200 questions indeed followed very different circuits, then ablating the circuit should not cause a large drop in extractive QA accuracy. We find that ablating the extracted circuit leads to a drop in accuracy from 0.72 to 0.21 (a ~70% drop). This experiment is performed with Llama-3-8B.

In a similar vein, we point the reviewer to the experiment performed during the rebuttal for the context-faithfulness circuit in updated Fig. (4): when the circuit (computed from our probe dataset) is ablated, the extractive QA accuracy on NQ-Swap (which contains questions of different in-context knowledge types) drops by ~85% (from 0.84 to 0.12). This experiment is performed with Llama-3-8B.

This result shows that for the in-context cases, the majority of knowledge types share the same circuit, as the drop is very large. For the parametric case, the drop is slightly lower, but still significant. Overall, the experimental conclusions are: (i) if the task is extractive QA and the model is faithful to the context, then the majority of questions will follow similar circuits irrespective of the underlying knowledge type. This is potentially because a small set of attention heads are the primary driving components which write the answer from the context into the residual stream; ablating them removes this information from the stream, leading to the wrong answer. (ii) For parametric memory, there is a greater chance that different questions follow slightly different circuits.

Given that a major focus of this paper is the extractive QA circuit, along with applications built on it, a more thorough investigation of parametric memory circuits [by constructing a well-annotated probe dataset with different knowledge types (e.g., celebrity knowledge vs. travel knowledge) and a mechanism to compare multiple individual circuits at scale] is left for future work, and we believe it can be a full paper in itself.

We will add the main takeaway from the reviewer's question to the camera-ready version, i.e., for in-context cases, the majority of questions will follow the same circuit as long as they belong to the family of extractive QA; for parametric knowledge questions, although a large number of questions will follow similar circuits, there can be slightly more cases of distinct circuits.

Do let us know if there are additional questions on this front and we are happy to engage and discuss this point further.

Comment

We thank the reviewer for continued engagement!

“The experiments using an 800/200 split on the probe dataset do not fully address this concern. One straightforward approach would be to illustrate the memorization and ICL circuits for different types of knowledge. Previous studies suggest that different knowledge may involve distinct circuits [1, 2, 3]”: We thank the reviewer for pointing us to these papers! For the new experiments on context and memory circuit generalization provided during the rebuttal, we point out that we follow the generalization setup for validating circuits (i.e., obtain a circuit using one dataset and test it on another split) that is commonly used in the community (e.g., [2] also uses it). If ablating an extracted circuit leads to a drop in accuracy for a question Q, it means that Q follows a similar circuit. We find that for extractive QA (the main focus of our paper), the drop in accuracy is maximal, which validates that the majority of questions follow similar circuits. Based on the reviewer's suggestion, we performed a new experiment by averaging the circuits across different knowledge partitions of the probe dataset. We had earlier saved the circuit component scores for each question in the probe dataset. For the context-faithfulness circuit, we partitioned the probe dataset into different knowledge types: (i) Country; (ii) Capital Cities; (iii) Language.

Following are the context-faithfulness circuit components for each category (Llama-3-8B):

Knowledge about Country: Attention Layers : [27, 23, 31, 24, 25, 29, 30, 21]; Attention Heads: [[27, 20], [23, 27], [31, 7], [17, 24], [25, 12], [31, 20], [24, 27], [27, 6], [26, 13], [16, 1], [30, 12], [31, 6], [29, 31], [31, 3]]

Knowledge about Capital: Attention Layers : [27, 23, 31, 24, 25, 29, 21, 30]; Attention Heads: [[27, 20], [23, 27], [31, 7], [17, 24], [25, 12], [31, 20], [24, 27], [27, 6], [16, 1], [30, 12], [31, 6], [29, 31], [31, 3]]

Knowledge about Language: Attention Layers : [27, 23, 31, 25, 24, 29, 21, 30]; Attention Heads: [[27, 20], [23, 27], [31, 7], [25, 12], [31, 20], [17, 24], [24, 27], [27, 6], [30, 12], [31, 6], [29, 31], [31, 3]]

We observe that the circuit components are nearly identical (with only slight changes in ordering) across the different categories we experimented with for extractive QA. Together with the generalization experiment in Fig. (4) of our paper, this highlights that as long as the task is pure extractive QA and the language model follows the context, the circuits are very similar, albeit with a slight change in the ordering of components. Although [2] extracts circuits for different knowledge types, we did not find significant information in their paper on the node overlap between knowledge questions (e.g., country vs. language) within a particular task (e.g., factual). However, we find it quite interesting that even for different tasks (e.g., commonsense, factual), the circuit activation frequencies across layers show very similar trends (Fig. 2 in [2]).
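For concreteness, the overlap between the three per-category head sets listed above can be quantified directly; the Jaccard computation below is our own illustration, not a result reported in the paper.

```python
# (layer, head) sets copied from the per-category lists above (Llama-3-8B).
country = {(27, 20), (23, 27), (31, 7), (17, 24), (25, 12), (31, 20), (24, 27),
           (27, 6), (26, 13), (16, 1), (30, 12), (31, 6), (29, 31), (31, 3)}
capital = {(27, 20), (23, 27), (31, 7), (17, 24), (25, 12), (31, 20), (24, 27),
           (27, 6), (16, 1), (30, 12), (31, 6), (29, 31), (31, 3)}
language = {(27, 20), (23, 27), (31, 7), (25, 12), (31, 20), (17, 24), (24, 27),
            (27, 6), (30, 12), (31, 6), (29, 31), (31, 3)}

def jaccard(a, b):
    # Intersection-over-union of two head sets.
    return len(a & b) / len(a | b)

print(f"country vs capital : {jaccard(country, capital):.2f}")   # 0.93
print(f"country vs language: {jaccard(country, language):.2f}")  # 0.86
print(f"capital vs language: {jaccard(capital, language):.2f}")  # 0.92
```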

However, given that we work in the area of mechanistic interpretability, we do acknowledge the reviewer's comments about the general practice of averaging over datasets. Our take is that it depends on the complexity of the downstream task: for example, averaging would be more detrimental for a reasoning task than for a pure extractive QA task. That would indeed make a great position paper in this area, but it is currently out of scope for our paper.

We hope that this new experiment sheds some light on the reviewer's concerns! We are happy to discuss further!

Review (Rating: 6)

This paper explores the extraction of mechanistic circuits from language models to improve their interpretability, focusing on question-answering tasks. By leveraging causal mediation analysis, the authors identify key components such as attention heads that contribute to both parametric memory and context-based responses. They introduce the ATTNATTRIB algorithm, which enhances data attribution with minimal computational cost, and demonstrate how these insights can steer models towards more context-faithful answers, ultimately improving performance on extractive QA benchmarks.

Strengths

● The experiments in this paper are rigorously designed and provide detailed explanations of the findings.

● The discovery that "the circuits activated when the model uses contextual information are markedly distinct from those invoked for parametric memory" is especially compelling. This highlights an important mechanistic difference between context-faithful and memory-faithful circuit components, with minimal overlap among the core elements activated in each mode.

● The paper is exceptionally well-written, with a structured and clear presentation that facilitates understanding of the complex methodologies and results.

Weaknesses

● The paper’s claims about addressing limitations in synthetic probe dataset reliance seem overstated. While it critiques prior work for its dependence on synthetic datasets (lines 44-50), the proposed method also incorporates synthetic dataset design and path patching. This suggests the approach does not fully resolve these limitations. To strengthen the discussion, could the authors clarify how their method advances upon previous approaches in terms of synthetic data reliance and dataset design?

● Although the findings in this paper are interesting, there appears to be limited advancement in the technical contributions regarding interpretability methods. Specifically, the paper would benefit from innovative approaches or improvements to the path patching process. For instance, in sections that discuss attention head selection and context-memory circuits, I expected to see new mechanisms that either optimize these components or introduce alternative techniques to improve the model's interpretability further. It would be helpful if the authors could expand on the aspects of the interpretability methods, particularly in causal mediation analysis and hierarchy extraction, where innovative contributions could enhance the model's practical applications and robustness.

● The generalizability of the interpretation results remains uncertain. It is unclear whether the extracted circuits would exhibit similar behavior on other types of datasets.

● The proposed ATTNATTRIB seems heuristic and limited, requiring full access to LLMs and being specifically designed for this task. The authors could enhance the discussion by evaluating its performance on other tasks, such as MATH or MMLU, to better assess its broader impact.

If the above issues are resolved, I will consider raising the score based on the extent of the improvements.

Questions

Please summarize the entire process of completing question-answering (QA) tasks across different modules based on the findings presented in this paper.

Comment

We thank the reviewer for acknowledging that the paper is exceptionally well-written, has rigorously designed experiments, and contains detailed explanations of the findings. Below we address the comments from the reviewer:

“The paper’s claims about addressing limitations in synthetic probe dataset reliance seem overstated. While it critiques prior work for its dependence on synthetic datasets (lines 44-50), the proposed method also incorporates synthetic dataset design and path patching. This suggests the approach does not fully resolve these limitations......”: The reviewer raises a valid point and we are happy to discuss this. For obtaining circuits, using a probe dataset is inescapable. Earlier published works ([1,2,3]) extract circuits using path-patching for tasks such as entity tracking or the greater-than math operation, which are not truly reflective of real-world tasks. For these tasks, earlier works use a particular underlying template to generate a synthetic dataset where each example has a fixed length. For a real-world task such as extractive QA, using a synthetic dataset with fixed-length inputs is not possible due to the variety in context lengths. Therefore, in our paper, we depart from such practices and adapt path-patching (with a greedy aggregation strategy) to carefully constructed, variable-length probe datasets. We also note that the probe dataset design is a crucial component of obtaining circuits, and our paper introduces such a probe dataset which can be used to extract circuits for extractive QA. We have updated our Introduction and Sec. (3) (marked in blue) to highlight our contributions and how our work differs from prior works.

[1]. https://arxiv.org/abs/2211.00593

[2]. https://openreview.net/forum?id=8sKcAWOf2D

[3]. https://openreview.net/forum?id=p4PckNQR8k

“Although the findings in this paper are interesting, there appears to be limited advancement in the technical contributions regarding interpretability methods......”: The reviewer asks a pertinent question. We first point the reviewer to earlier published works ([1,2,3]) which extract circuits for a language model task (e.g., entity tracking) using path-patching directly. We believe that using and adapting well-established tools to gain novel empirical insights for a popular task in language modeling is an innovative contribution. Beyond providing mechanistic insights, our paper also strategically uses them to build two practical applications: (i) data attribution and (ii) model steering. We are happy to engage with the reviewer on these points during the discussion period!

[1]. https://arxiv.org/abs/2211.00593

[2]. https://openreview.net/forum?id=8sKcAWOf2D

[3]. https://openreview.net/forum?id=p4PckNQR8k

“The generalizability of the interpretation results remains uncertain. It is unclear whether the extracted circuits would exhibit similar behavior on other types of datasets.” : The reviewer raises an important question. To validate the similar behavior on other types of dataset, we prune the circuit components (direct edge connections) for context-faithfulness on Natural Questions, NQ-Swap and HotPotQA – and we find that the extractive QA accuracy drops significantly. However, when pruning a random circuit, the drop is only minimal. This validates that the extracted circuits exhibit similar behavior on other types of extractive QA datasets. We have added this new result to the paper (see updated Sec. (3.4) and Fig.(4)) and we are happy to answer if there are any further questions.

Comment

“The proposed ATTNATTRIB seems heuristic and limited, requiring full access to LLMs and being specifically designed for this task. The authors could enhance the discussion by evaluating ...”:

Heuristic and Limited. First, we note that AttnAttrib is designed from the mechanistic insights of the extracted circuit. Given the causal nature of the circuit for generation (with respect to context faithfulness), AttnAttrib is inherently reliable. Second, across multiple QA benchmarks we show that AttnAttrib outperforms existing attribution baselines by a large margin, which shows that its scope is not limited and that it is useful for real-world attribution to context spans.

Full access to the model. We respectfully note that access to the model's internal representations is required for any mechanistic interpretability research, especially when the insights are used to design downstream applications (e.g., model steering). We also note that, given the impressive performance of open-source language models compared to closed-source ones in recent times, attribution with open-source language models is a practical research problem.

MMLU / Math. We note that our task is pure extractive QA, i.e., a language model processes a context (from a document / script) and answers questions whose answers are present inside that context. To the best of our knowledge, MMLU/Math cannot be readily used for this task. We also point out to the reviewer that we have used the standard, challenging datasets for QA attribution used by the community (e.g., HotpotQA, Natural Questions).

“Please summarize the entire process of completing question-answering (QA) tasks across different modules based on the findings presented in this paper.”: One can think of the contextual question answering task as involving three modules: 1) a “map-reduce/RAG module” that fetches context given the user's question; 2) an LLM-based “answer-generation module” that takes this context to generate an answer; 3) an “attribution module” that provides context-grounded attributions by citing sentences or paragraphs from the context.

Our work focuses on the attribution module (Section 4), although we show (in Section 5) that we can also improve the answer generation. While these citations can be self-generated (in which the LLM both answers the question and also provides a textual description of the citations), such techniques are prone to hallucinations. Another attribution technique is based on retrieval, in which the answer is decomposed into sentences and the most similar sentence from the context acts as a surrogate for a citation. This method too is not fool-proof because similarity does not guarantee that the chosen sentence from the context was used to generate the answer, and in addition, the document needs to be indexed beforehand. Our method differs from the above. We do not need auxiliary methods like retrievers to generate citations, we instead generate them using attention scores of a chosen attention head. We have provided strong experimental evidence that this attention head provides reliable attribution and that it is causal, for manipulating this attention head generates a different answer.
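A minimal sketch of this style of single-head, attention-based attribution is given below. It is a simplification of AttnAttrib: the (layer, head) choice, the prompts, and the token-level aggregation are placeholders, and in practice the head is selected from the extracted circuit and attention mass is mapped back to full sentences or spans.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, HEAD = 27, 20  # placeholder; chosen from the extracted circuit in practice

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", attn_implementation="eager")  # eager -> attentions are returned
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

context = "The Space Needle is an observation tower located in Seattle, Washington."
question = "Where is the Space Needle located? Answer:"
inputs = tok(context + " " + question, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Attention of the chosen head from the last (answer-writing) position over all
# previous tokens: shape [seq_len].
attn = out.attentions[LAYER][0, HEAD, -1, :]

# Keep only positions inside the context (approximate boundary) and surface the
# most-attended tokens; a full implementation maps these back to a sentence span.
n_ctx = len(tok(context)["input_ids"])
top = torch.topk(attn[:n_ctx], k=5).indices
print("attribution tokens:", tok.decode(inputs["input_ids"][0, top]))
```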

Comment

We thank the reviewer again for the constructive comments during the rebuttal! We have strived to address the comments and incorporate suggestions to improve the paper (do check the updated paper!). Given that the discussion period is ending very soon, we would like to check with the reviewer whether our rebuttal has addressed their comments.

Comment

Thank you to the authors for their thorough response. Some of my concerns have been addressed, and as a result, I have increased my score.

Comment

We thank the reviewer for improving our score and are glad that we were able to address your concerns during the rebuttal discussion period!

Review (Rating: 6)

This work proposes a circuit extraction method for question answering using causal mediation analysis. The proposed method can be applied to QA tasks rather than only entity tracking. The authors experiment on two constructed datasets, copy and memory, and reveal a small set of attention heads in the extracted circuit for context faithfulness. Based on these findings, they present ATTNATTRIB, a fast and efficient data attribution algorithm with better performance and generality.

Strengths

This work presents an intervention study on QA tasks, which extends existing work on entity tracking to a more practical setting. The new ATTNATTRIB outperforms other baselines and also has better generality.

Weaknesses

  • the writing strongly hampers the contribution of this paper. The work often goes into technical details without contexts, definitions, and motivations (novelty and links to existing work), therefore, it is hard to understand the contribution of this work well (see comments in the questions below).
  • the proposed interventional algorithm is very similar to ROME [1]. It is necessary to state the novelty of this work.
  • the experimental design can be improved as follows: (1) for each proposed algorithm, it is useful to give the basic background and discuss the relation between the existing method and the new method (2) this work claims that the detected mechanistic circuit extraction can be used for real-world tasks. However, the experiments focus on short-answer questions. A better setting is the open-ended QA tasks and long-form generations (3) the intervention performs the patching operation at the last token position. It is applicable to use Multi-choice QA to locate associations [2] (4) Regarding the effectiveness of the proposed interventional algorithm, it is necessary to add baselines such as ROME or gradient-based methods (5) I do not fully understand the design of the memory corrupted dataset. Can we use the queries solely for this goal?

In summary, the writing is a big issue of this submission, which prevents us from accurately assessing the scientific contributions. I am open to further discussions.

[1] Locating and Editing Factual Associations in GPT

[2] Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts

Questions

  • it would be useful to define the computational graph, circuit, nodes, and edges in this graph
  • $\mathcal{D}_{\text{memory}}$ corrupts the original context to force LLMs to answer based on parametric memory. Why don't you directly ask LLMs the question without context?
  • what is the used metric score? It appears in Figure 1 and line 277 without any definitions.
  • Section 3.1 can be improved. The current structure is kind of tricky. (1) introduce the original dataset $\mathcal{D}$; (2) give the motivation why you design two corrupted variants: $\mathcal{D}_{\text{copy}}$ and $\mathcal{D}_{\text{memory}}$; (3) describe the two corrupted datasets with notations, respectively
  • in section 3.2, it is useful to state the intuition of the proposed method at the beginning
  • in line 247, "We selected this position for patching because the information in the last residual stream plays a crucial role in determining the probability distribution of the generated tokens." Do you have proof of this claim?
  • in section 3.3.1, where is the experimental result?
  • in Algorithm 1, what is the definition of the data attribution algorithm and a span
Comment

We thank the reviewer for appreciating the interventional study on a real-world task setting such as extractive QA. Below we address the comments from the reviewer:

“the writing strongly hampers the contribution of this paper. The work often goes into technical details without contexts, definitions, and motivations (novelty and links to existing work), therefore, it is hard to understand the contribution of this work well….”: We have updated the Introduction with a clearer overview of our contributions (see updated Sec. (1)). Moreover, we have updated Sec. (3.2) and Sec. (4) with appropriate motivations and context to improve the readability of the draft. We hope that the new changes have better highlighted the major contributions of the paper. We also point the reviewer to the global rebuttal for a concise explanation of our contributions.

“the proposed interventional algorithm is very similar to ROME [1]. It is necessary to state the novelty of this work / it is necessary to add baselines such as ROME or gradient-based methods.”: We highlight that ROME is a model-editing algorithm for introducing new factual associations into the language model. If the reviewer is pointing to the causal tracing mechanism in the ROME [1] paper, we note that this idea is a common interventional procedure in mechanistic interpretability work. Several highly cited and well-published papers have used this procedure directly ([1,2,3,4,5] amongst others) to obtain insights about foundation models.

In our paper, we adapt this technique to find circuits (a set of internal model components) for the real-world task of extractive QA by: (i) introducing a high-quality probe dataset, (ii) using a greedy procedure to aggregate the important nodes, and (iii) performing the patching operations with variable-length sequences, whereas earlier works patch with templated, similar-length sequences. These tweaks are essential for obtaining a circuit for a real-world task such as extractive QA. We have made these points clearer in the Introduction and the Methods section on interventional patching (see updated Sec. 3).

[1]. https://arxiv.org/abs/2402.14811 (ICLR ‘24)

[2]. https://openreview.net/forum?id=p4PckNQR8k (NeurIPS ‘23)

[3]. https://arxiv.org/abs/2310.13730 (ICLR ‘24)

[4].https://proceedings.neurips.cc/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf (NeurIPS 2020)

[5]. https://arxiv.org/abs/2305.15054 (EMNLP ‘23)
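To make the patching procedure described above concrete, below is a minimal single-head activation-patching step. It is illustrative only: the (layer, head) pair, the prompts, and the target token are placeholders, and the full procedure additionally sweeps over heads, aggregates them greedily, and averages scores over the probe dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, HEAD = 27, 20  # placeholder head whose causal effect we measure

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
head_dim = model.config.hidden_size // model.config.num_attention_heads
o_proj = model.model.layers[LAYER].self_attn.o_proj
sl = slice(HEAD * head_dim, (HEAD + 1) * head_dim)
cache = {}

def save_hook(module, args):
    # Cache this head's output at the last token position (input side of o_proj).
    cache["act"] = args[0][:, -1, sl].detach().clone()

def patch_hook(module, args):
    # Overwrite the head's output at the last token with the cached activation.
    hidden = args[0].clone()
    hidden[:, -1, sl] = cache["act"]
    return (hidden,)

def answer_logit(prompt, answer, hook=None):
    handle = o_proj.register_forward_pre_hook(hook) if hook else None
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    if handle:
        handle.remove()
    return logits[tok(answer, add_special_tokens=False)["input_ids"][0]].item()

clean = "The Space Needle is in New York. Where is the Space Needle located? Answer:"
corrupt = "The Space Needle is in guitar. Where is the Space Needle located? Answer:"

answer_logit(corrupt, " New", hook=save_hook)           # run 1: cache the corrupted activation
base = answer_logit(clean, " New")                      # run 2: clean logit of the in-context answer
patched = answer_logit(clean, " New", hook=patch_hook)  # run 3: clean run with the patched head
print("causal effect of this head:", base - patched)
```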

We appreciate the reviewer’s suggestion on gradient based methods. We note that gradient based methods (e.g., https://arxiv.org/pdf/2104.08696) have been primarily used for analyzing the neurons in encoder based transformers. To the best of our knowledge, such methods are not well established for decoder only models or for obtaining circuits. Adapting gradient based methods to the decoder setup and using the insights to obtain circuits is a new research direction in general!

“the experimental design can be improved as follows: (1) for each proposed algorithm, it is useful to give the basic background and discuss the relation between the existing method and the new method” : We thank the reviewer for the suggestion. We have updated the section for each algorithm (see Updated Sec. 3.2, Sec. 4) with a basic background and the relationship between the existing vs. new methods. We hope that these changes have uplifted the quality and the readability of the paper.

“ this work claims that the detected mechanistic circuit extraction can be used for real-world tasks. However, the experiments focus on short-answer questions. A better setting is the open-ended QA tasks and long-form generation” : We note that we have provided results on data attributions for long-form generations in Appendix (Sec. (I)) on NQ-Long and CNN-Dailymail. We find that our method is effective towards even long-form generations. We have highlighted these results more in the main paper (see Sec. (4.2)). We also note that our attribution algorithm is used on commonly used extractive QA datasets (e.g., NQ-Swap, Natural Questions, HotpotQA, Natural Questions Long) used by the community. We also point the reviewer to a concurrent work on attribution (https://arxiv.org/pdf/2409.00729) which uses a similar set of datasets.

To the best of our knowledge, open-ended QA falls outside the purview of extractive QA, the task which we are studying and we believe obtaining circuits for such a task is a separate investigation in itself.

Comment

“the intervention performs the patching operation at the last token position. It is applicable to use Multi-choice QA to locate associations [2]” : We thank the reviewer for providing us with the reference to this paper. We note that this paper primarily investigates factual associations in the multi-choice QA setup (without the presence of a context). We emphasize that our work concerns QA under the presence of a context i.e., extractive QA. Given the difference in task composition, we believe that investigating multi-choice QA is outside the scope of this work and requires a new investigation in itself.

“it would be useful to define the computational graph, circuit, nodes, and edges in this graph; $\mathcal{D}_{\text{memory}}$”: We have updated these definitions in updated Sec. (3) and Sec. (3.1). We hope that these changes have improved the readability of the paper.

“I do not fully understand the design of the memory corrupted dataset / corrupts the original context to force LLMs to answer based on parametric memory. Why don't you directly ask LLMs the question without context?”: The reviewer asks a pertinent question. The primary reason is that we want to extract circuits for the case where the model uses parametric memory in spite of the presence of a context (therefore ignoring it). Mechanistic insights from this behavior can enable us to make certain modifications to the model such that it answers from the context reliably. We have also updated Sec. (3.1) to make these points clearer and improve readability.

In fact, with the goal of model steerability, we find that the copy attention heads from the context faithfulness circuit still pay high attention to the perturbed answer spans in the context, when the model answers from the parametric memory. In Sec.(5) – we show that because of our interpretability insights due to the dataset design, we are able to steer models towards better context faithfulness. We are happy to answer if there are further questions.
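To illustrate the two corrupted probe variants discussed above, a toy construction is sketched below. The example, field names, and replacement tokens are placeholders; the actual probe dataset is hand-curated rather than templated.

```python
# Toy sketch of the two corrupted probe variants; all strings are placeholders.
example = {
    "context": "The Space Needle is in Seattle.",
    "question": "Where is the Space Needle located?",
    "answer": "Seattle",
}

def make_copy_variant(ex, similar_token="New York"):
    """Context-faithfulness probe: replace the in-context answer with a
    semantically similar token, so a context-faithful model must output it."""
    return {"context": ex["context"].replace(ex["answer"], similar_token),
            "question": ex["question"], "target": similar_token}

def make_memory_variant(ex, unrelated_token="guitar"):
    """Memory-faithfulness probe: replace the in-context answer with an
    unrelated token, so the model must fall back on parametric memory."""
    return {"context": ex["context"].replace(ex["answer"], unrelated_token),
            "question": ex["question"], "target": ex["answer"]}

print(make_copy_variant(example))
print(make_memory_variant(example))
```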

“what is the used metric score? It appears in Figure 1 and line 277 without any definitions” : We have redefined this metric score in updated Sec. (3.2) so that it’s more visible for the readers. Thanks for providing the suggestion to improve its visibility!

“in Algorithm 1, what is the definition of the data attribution algorithm and a span” : We have updated the definition of the data attribution algorithm and span in Sec.(4). We hope that this has improved the readability of the paper.

“Section 3.1 can be improved. The current structure is kind of tricky. (1) introduce the original dataset $\mathcal{D}$; (2) give the motivation why you design two corrupted variants: $\mathcal{D}_{\text{copy}}$ and $\mathcal{D}_{\text{memory}}$; (3) describe the two corrupted datasets with notations, respectively”: We thank the reviewer for the suggestion! We have updated the dataset section (Sec. 3.1) to provide a strong motivation for the design of the two variants. We hope that these new changes have improved the readability of the paper!

“in section 3.3.1, where is the experimental result?”: We have provided a summary of the results in Sec. 3.3.1. The full list of circuit components can be found in Appendix Sec. (D). During the rebuttal, we have reorganized the results in Sec. 3.3.1 to be more readable.

During the rebuttal, we have updated distinct sections of the paper with improved readability and we sincerely hope that it has improved the quality of the draft. We are open to any further questions during the discussion period.

Comment

Thanks for updating the manuscript and providing detailed responses. Some of my concerns have been addressed, especially for the writing part. I would like to raise my score.

However, the gradient-based baselines can be adapted to decoder-only models [1,2,3]. Furthermore, the size of the curated dataset is rather small (200 questions). Therefore, the conclusions can be more solid by including gradient-based methods and having more data samples in the experiment.

[1] Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons

[2] Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts

[3] Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons

Comment

We thank the reviewer for the response and improving our score! We are glad that most of the comments / concerns have been addressed.

Below we respond to the comments on gradient-based methods and size of the probe dataset:

Size of dataset: We point the reviewer to a recently published paper at ICLR 2024 (entity tracking circuit: https://arxiv.org/pdf/2402.14811) which extracted circuits using a probe dataset of a similar size. Moreover, we highlight that our probe dataset is not templated, which hinders generating a larger number of examples. During the rebuttal, we have nevertheless extended our probe dataset to 1000 examples. We provide the results in the Appendix (Sec. (D)), where we find that the circuit components for both context and memory faithfulness do not change, highlighting that a dataset size of ~200 examples is sufficient. We hope that these new experiments clarify that our dataset size is not a drawback. We also note that we validate the circuits from the interpretability experiments on other large-scale datasets (e.g., HotPotQA, Natural Questions), where we find that ablating the extracted circuits leads to a large drop in accuracy (see updated Sec. (3.4) and Fig. (4)), highlighting that our results generalize to more complex and messier datasets.

Gradient-Based Methods: We thank the reviewer for pointing us to these papers. These interesting and very recent papers identify knowledge neurons for factual recall or long-form generation (with factual entities), where no context information is present. We note that our paper primarily extracts circuits, which are sub-graphs of the underlying transformer's computational graph. This is possible because the transformer (due to its residual connections) has nodes (e.g., MLPs / attention heads / attention layers) with incoming edges from other nodes that appear before them in the forward sequence. This readily available graphical structure allows the use of interventional algorithms (e.g., causal tracing) to recover a sub-graph (i.e., a circuit). These interventions could technically also be performed at the neuron level, but the search space would then be significantly larger, on the order of the number of neurons, making the important-neuron identification step infeasible in practice. In that sense, gradient-based neuron attribution methods can be useful, as they can directly identify the important neurons using gradient information without performing an interventional step.
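As a rough illustration of the search-space gap, using the standard Llama-3-8B configuration (32 layers, 32 attention heads per layer, MLP intermediate size 14336; these numbers are our assumption here, not figures from the paper):

```python
layers, heads_per_layer, mlp_width = 32, 32, 14336  # assumed Llama-3-8B configuration

head_nodes = layers * heads_per_layer   # 1,024 candidate attention-head nodes
neuron_nodes = layers * mlp_width       # 458,752 candidate MLP-neuron nodes
print(head_nodes, neuron_nodes, round(neuron_nodes / head_nodes))  # 1024 458752 448
```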

However, we believe that using such methods directly, or even with minimal adaptation, as baselines for obtaining circuits is non-trivial for the following reasons: (i) The relevant works independently identify important neurons for the parametric memory case only. Moreover, connecting them into a graph is more challenging due to the lack of structure at the neuron level (structure which is more readily present at the representational level, such as the outputs of MLPs and attention heads). (ii) Using attention heads as nodes (instead of neurons) has direct practical applications such as data attribution; it is not straightforward to use important neurons to perform data attribution. We do believe, though, that further work is needed to understand whether the information corresponding to the interplay between parametric memory and context can indeed be captured at the neuron level rather than at the representational level, but that would entail a new work in itself. In that sense, it might be a better idea to start with controllable toy tasks such as entity tracking or indirect object identification; once such a framework for circuit extraction is validated on these tasks, it can be adapted to the more real-world extractive QA task. We will add this discussion to the final version of the paper and are happy to answer any further questions from the reviewer on this interesting point!

Review (Rating: 6)

This paper leverages the internal behavior of attention heads for feature attribution. The authors build on prior methods to identify influential attention heads that determine whether the model relies on internal memory or external context. Furthermore, they demonstrate that by controlling these specific attention heads, they can steer the behavior of LLMs.

Strengths

  • This paper analyzes attention patterns in LLMs to examine their reliance on contextual versus parametric knowledge, revealing the presence of key attention heads (or circuits) involved in this process.

Weaknesses

  • The contribution of this paper is unclear. The probing method used was introduced in prior work (Wang et al., 2022). Although the authors mention some limitations of this prior work, these issues also apply to the current paper, such as only testing on a simple extractive QA task, reliance on a high-quality probe dataset, and limited practical impact.
  • For "understanding context-augmented language models," the feature attribution task based on extractive QA is overly simplistic. This work remains an interpretability task with the extractive answer already given. Note that various feature attribution methods, including those based on attention analysis, have been extensively studied over the years [1]. Is the proposed method better than prior attention-based methods? Moreover, it remains unclear whether the detected attention heads generalize across different tasks and domains.
  • For "enhancing context-augmented language models," the experimental results are too weak, with no baseline comparisons (see the next point).
  • While this paper aims to address knowledge conflicts between contextual and parametric knowledge, it overlooks a substantial amount of relevant literature. Notably, some of these works also conduct causal analysis. Below is a selective list
    • Mallen, Alex, et al. "When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
    • Wang, Fei, et al. "A Causal View of Entity Bias in (Large) Language Models." Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
    • Wu, Kevin, Eric Wu, and James Zou. "How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior." arXiv preprint arXiv:2404.10198 (2024).
    • More work can be found in this survey: Xu, Rongwu, et al. "Knowledge conflicts for llms: A survey." arXiv preprint arXiv:2403.08319 (2024).

[1] Wiegreffe, Sarah, and Yuval Pinter. "Attention is not not Explanation." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.

Questions

See Weaknesses.

Comment

We thank the reviewer for the constructive comments. We appreciate the recognition of our work's strengths, particularly regarding the identification of key attention heads and their role in context-based responses by the underlying language model. We also emphasize that the paper's contributions extend beyond interpretable insights from attention heads; we introduce a robust data attribution algorithm, AttnAttrib, which surpasses existing data attribution methods, and we also show its effectiveness in steering language models towards context faithfulness. Below, we address the specific weaknesses noted by the reviewer:

“The contribution of this paper is unclear. The probing method used was introduced in prior work (Wang et al., 2022). Although the authors mention some limitations of this prior work...”: The contribution of our paper is twofold: (i) first, we adapt existing mechanistic understanding methods, together with our probe dataset, to "mechanistically" understand the behavior of language models in the presence of context for QA tasks, a popular real-world task; (ii) using the interpretable insights, we show two practical applications: (1) we design a robust and much simpler data attribution algorithm which surpasses existing data attribution algorithms in terms of performance, and (2) we show the potential of model steering for context faithfulness. We have made the contributions of our paper clearer in the updated Introduction (see Sec. 1, with lines marked in blue).

Difference from earlier path-patching works. We note that our work is built on path-patching / causal mediation analysis, a foundational technique for mechanistic interpretability. In fact, earlier papers accepted at top conferences have applied this technique directly, with little adaptation, to understand distinct language model tasks such as: (a) entity tracking [2], (b) indirect object identification [1], and (c) the greater-than math operation [3]. All of these works are essential for understanding language models, but they are not reflective of real-world, user-facing tasks. In our work, we extract circuits for a practical task by adapting path-patching (causal mediation analysis), (i) departing from a fixed-length synthetic probe dataset and (ii) using a greedy aggregation strategy, to a new real-world task (extractive QA) in order to obtain novel mechanistic insights and use them towards designing two downstream applications.

We have added these points to the paper (see Updated Introduction and Section 3.1, 3.2) to make our contributions more clear.

[1]. https://arxiv.org/abs/2211.00593

[2]. https://openreview.net/forum?id=8sKcAWOf2D

[3]. https://openreview.net/forum?id=p4PckNQR8k

Testing on simple extractive QA task : We bring to the notice of the reviewer that these datasets (e.g., NQ-Swap, Natural Questions, HotPotQA, CNN-DailyMail) are standard benchmarks used by the community for data-attribution and extractive QA. For e.g., a concurrent work of ours (https://arxiv.org/pdf/2409.00729) used a similar set of datasets for data attribution. We note that we provide results for both short-form QA as well as long-form QA in our paper.

Reliance on a high-quality probe dataset: We note that in order to extract circuit components, the reliance on a probe dataset is inescapable and its design is crucial towards extracting circuit components from a language model. In our paper, we design a high-quality probe dataset which is in fact a contribution in our paper. We have updated Sec. 3.1 to be more readable towards highlighting our contributions.

Limited Practical Impact: We highlight that we depart from extracting circuits for toy language tasks towards extracting circuits for the mechanistic understanding of a real-world practical task (i.e., extractive QA). Second, our data attribution algorithm can be computed in one forward pass, therefore obtaining attributions essentially for free and increasing its practical benefit through ease of use. We note that data attribution in extractive QA is an extremely practical task, as can be seen from features in various products involving language models such as ChatPDF [1], ChatDoc [2], Microsoft Copilot [3], and Adobe Acrobat AI Assistant [4].

[1] https://www.chatpdf.com/

[2] https://chatdoc.com/

[3]https://support.microsoft.com/en-us/office/welcome-to-copilot-in-word-2135e85f-a467-463b-b2f0-c51a46d625d1

[4] https://helpx.adobe.com/acrobat/using/generative-ai.html

Comment

For "understanding context-augmented language models," the feature attribution task based on extractive QA is overly simplistic. This work remains an interpretability task with the extractive answer already given." : We note that the "understanding" part of the language model is through the extracted circuits. The data-attribution task in our paper is an application from the interpretability insights that a very small set of attention heads from the circuit perform data-attribution by default. As noted in the earlier section, we bring to the notice of the reviewer that the extractive QA datasets used in our paper are standard datasets used by the community.

Generalizability Across Datasets. In updated Sec. (3.4) and Fig. (4), we provide new results showing that the extracted context-faithfulness circuit generalizes across other extractive QA datasets (e.g., Natural Questions, HotPotQA), highlighted by the drop in extractive QA accuracy when the context-faithfulness circuit is pruned. We also note that our attribution algorithm outperforming other baselines across multiple QA datasets is strong evidence that the extracted attention heads generalize across different datasets.

"For "enhancing context-augmented language models," the experimental results are too weak, with no baseline comparisons (see the next point)." : We have added a strong baseline (Context-aware decoding : https://arxiv.org/pdf/2305.14739) to the paper for improving extractive QA accuracy (see updated Sec. (5) and Fig.(7)). Both Context-aware decoding and ours require two forward passes for generation – therefore leading to a fair comparison. We note that our method outperforms existing methods across the different QA datasets. In updated Fig. (7), we have also added a new result with HotPotQA.

"While this paper aims to address knowledge conflicts between contextual and parametric knowledge, it overlooks a substantial amount of relevant literature." : We thank the reviewer for pointing out these papers. We have added these papers to the Related Works section in our paper. We will also like to point out that none of these papers mechanistically understand the interplay of context vs. parametric memory as a function of internal model components through circuits -- which we believe can give rise to downstream practical applications such as (i) data attribution and (ii) model steering towards context faithfulness as shown in our paper.

Other Attention-Based Baselines: During the project, we investigated other attention heads (not selected via our circuits). For example, we investigated the attention heads in the last layer and found that these heads do not perform attribution. We note that due to the causal nature of the mechanistic circuits, they retrieve exactly the relevant attention heads required for extractive QA.

Comment

We thank the reviewer again for the constructive comments during the rebuttal! We have strived to address the comments and incorporate suggestions to improve the paper (do check the updated paper!). Given that the discussion period is ending very soon, we would like to check with the reviewer whether our rebuttal has addressed their comments.

Comment

Thanks to the authors for the detailed response. Some of my concerns have been addressed, and I have raised my score.

Comment

We thank the reviewer for improving our score and are glad that we were able to address your concerns during the rebuttal discussion period!

Comment

We thank all the reviewers for the constructive feedback towards improving the paper. Through this global rebuttal, we would like to put forth the main contributions of our paper and also clarify a few points that we believe benefit from additional explanation:

Main Contributions:

Contribution 1 : We mechanistically study a real-world language model task - extractive QA across various language models via circuits. Extractive QA is the task of answering a question by directly extracting words from the context/document (in contrast to "abstractive QA" or "open-ended QA" where the words comprising the answer may not necessarily appear in the context). We note that earlier mechanistic circuit works study toy tasks such as entity tracking or indirect object identification, which are not reflective of real-world use-cases of language models.

Contribution 2: We use the mechanistic interpretability insights to design a novel and simple data attribution algorithm, AttnAttrib, which outperforms existing data attribution baselines by a significant margin. We highlight that our algorithm extracts these attributions for free, on the fly, while generating the answer. We also show that these attributions can in fact be used to steer language models towards better context faithfulness across multiple extractive QA datasets.

A mechanistic understanding of a practical task such as extractive QA can not only provide insights on the inner workings of the model for this task, but can also enable downstream applications (e.g., data-attribution to context and model steering) to improve the model reliability. We believe that the mechanistic insights and two downstream applications are novel, practical contributions to the community!

Clarifications:

Relationship to Causal Interventions. Our method is developed on the foundational concept of causal interventions, which a variety of published and well-cited works use to understand the behavior of different foundation models. Meng et al. (2022) [1] is one of the most well-known papers utilizing causal interventions to find important layers for a factual association task. Works such as [2] have built on causal interventions to obtain circuits (i.e., sub-graphs of the original transformer's graph) for toy tasks such as entity tracking, math operations, and indirect object identification. We use a similar technique, but adapt it to study a real-world language model task, extractive QA, and provide two downstream applications.

[1]. https://arxiv.org/abs/2202.05262

[2]. https://openreview.net/forum?id=8sKcAWOf2D

Importance of data attribution as an application. As language models are increasingly used to process documents and scripts in order to answer various questions about them, it has become important to provide reliable grounding, since these models often hallucinate. In our work, we provide a simple data attribution algorithm, designed using mechanistic insights, which obtains the attribution for free. We also point the reviewers to products such as ChatPDF [1], ChatDoc [2], Microsoft Copilot [3], and Adobe Acrobat AI Assistant [4], which use language models and provide citations/attributions, highlighting the importance of the problem.

[1] https://www.chatpdf.com/

[2] https://chatdoc.com/

[3]https://support.microsoft.com/en-us/office/welcome-to-copilot-in-word-2135e85f-a467-463b-b2f0-c51a46d625d1

[4] https://helpx.adobe.com/acrobat/using/generative-ai.html

We have uploaded a revised draft of our paper (with the updated text highlighted in blue) by incorporating suggestions from the reviewers.

Comment

We sincerely thank all the reviewers for their constructive feedback during the rebuttal period. We are pleased to have incorporated their insights into our paper, resulting in notable improvements. We also appreciate that the reviewers acknowledged these enhancements by improving their evaluations.

During this period, we updated our paper and conducted new experiments to strengthen our contributions. Below, we summarize the major updates (minor changes are detailed in individual replies to reviewers):

Enhanced Writing and Contextualization: We improved the readability of the Introduction and Abstract and added additional context in Sections 3 and 4 to better highlight the importance of the problem—extracting circuits for real-world tasks and using the insights to develop two applications.

New Generalization Experiments: We demonstrated the validity of the extracted circuits on general extractive QA datasets (see Figure 4).

Addition of a New Baseline: We included a new baseline for steering the language model toward improved context faithfulness (see Figure 7).

Validation with Larger Probe Dataset: We validated the circuit components using a larger probe dataset (see Section D.4).

Dataset Partitioning Analysis: We analyzed extractive QA circuit extraction across different knowledge partitions of the dataset (see Replies to Reviewer c1S6).

As today is the final day for reviewer questions, please let us know if there are any remaining points or concerns that we can address during the author response period.

AC Meta-Review

The authors perform an empirical study of causal mediation analysis on extractive QA tasks, identifying circuits that are relevant to the question answering task, and then performing data attribution using their interpretability methods. This yields generally strong results. Steering results are also provided, which show how it is possible to control parametric vs. context use during the answer.

The authors provide interesting empirical results, and all reviewers generally agree on this point. However, there are fairly common and persistent critiques of some of the design choices in the paper, including the probe dataset (its size, whether it is synthetic) and whether this empirical result alone is a sufficient contribution. While the results are interesting, they are not particularly surprising given some of the emerging positive results in circuit identification and intervention, and while there are some exciting results (the steering ones for parametric vs. context use), they are only a small part of the paper.

Additional Comments on Reviewer Discussion

Discussion was extensive, and reviewers generally raised their score in response to the rebuttal.

Final Decision

Reject