Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits
We propose a new method for interpreting transformer circuits by performing SVD on the query-key and output-value matrices.
Abstract
Reviews and Discussion
In this paper, the authors propose a framework for analyzing a large language model by applying SVD to its weight matrices, including attention heads (QK and OV matrices) and MLPs.
The central idea is a masking procedure they developed. This technique allows them to amplify or diminish the impact of specific singular vectors. By doing this, they can identify which SVD directions are most critical for the model's overall performance. This masking is learned using a loss that tries to preserve the model's final probability distribution while encouraging the sparsity of the mask.
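For concreteness, my understanding of the procedure, written as a rough sketch (assuming a PyTorch-style setup; the function names and the sigmoid parameterization of the mask are my own guesses, not necessarily what the authors implement):

```python
import torch
import torch.nn.functional as F

def masked_weight(W, mask_logits):
    """Rebuild a weight matrix with each singular direction scaled by a learned mask in [0, 1]."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    m = torch.sigmoid(mask_logits)            # one mask entry per singular direction
    return U @ torch.diag(m * S) @ Vh

def mask_objective(masked_logits, orig_logits, mask_logits, lam=1e-4):
    """KL divergence to the unmasked model's output plus an L1 sparsity penalty on the mask."""
    kl = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                  F.softmax(orig_logits, dim=-1),
                  reduction="batchmean")
    return kl + lam * torch.sigmoid(mask_logits).sum()
```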
They focus their analysis on the QK matrix of attention head 9.6. Their results suggest that certain singular vector directions in this matrix correlate with high-level semantic features that the attention head may be using.
Strengths and Weaknesses
Strengths
- Using a new approach based on masks to learn the singular vector directions that are important for the model's performance.
- Showing that there is a correlation between some of the singular vector directions and features that a name mover attention head may be using.
Weaknesses
- The paper does not mention previous work in this area:
- [1]: This paper shows that the SVD basis can be very useful in interpreting the QK interaction in vision transformers. They perform an in-depth analysis of the singular vectors of multiple vision transformer models, showing numerous examples of the semantics of the singular modes (the rank-1 matrices given by the SVD).
- [2]: This paper proposed the sparse attention decomposition, which states that the attention scores can be low-rank reconstructed using the SVD of the QK matrix. They show multiple examples of sparsity in the attention scores reconstruction. By using this decomposition, they identify the components that are responsible for the attention head to attend to a token pair. Then, they trace an IOI circuit using this approach and validate it by performing multiple causal interventions.
- [3]: This is a blog post that shows that the singular vectors of the OV and MLP matrices have highly interpretable semantic directions, and uses the singular vectors to change the model behavior.
- There is no comparison with baselines. [2] shows how to find a sparse set of singular vectors to reconstruct the attention score. Comparing the mask sparsity with this method would give an idea about how good the level of sparsity of the masks is. Even though they are not solving the exact same problem, they are very similar.
- The evaluation part could be more systematic rather than observational. Many of the conclusions are simply interpretations without justification (Section 5). It is very unclear whether the conclusions made in Section 5 could be extended, and how, to other tasks and models. For instance, in Figure 1, attention head 9.6 has around 10 singular vectors with very high mask values (close to 1), but only 3 were analyzed in the paper. One way that this could be done is by performing a causal analysis of the directions analyzed, similarly to [2, 4].
- A fair amount of Sections 2 and 3 are about the OV and MLP matrices. However, there is no analysis of the SVDs of these components. Is there any specific reason for that? What is the application of learning masks for these matrices?
- There is no mention of how good the masked model is in comparison to the original model: no mention of the loss value (I also did not find it in the Appendix). So, how faithful are these masks to the model performance?
- This approach does not cover more modern models that use RoPE, which makes it harder to analyze the QK interactions, especially using a learning-based approach. Do you have any ideas about how your methodology could be applied in RoPE models?
References:
- [1]: Pan, Xu, et al. "Dissecting query-key interaction in vision transformers." arXiv preprint arXiv:2405.14880 (2024).
- [2]: Franco, Gabriel, and Mark Crovella. "Sparse Attention Decomposition Applied to Circuit Tracing." arXiv preprint arXiv:2410.00340 (2024).
- [3]: Millidge, Beren, and Sid Black. "The singular value decompositions of transformer weight matrices are highly interpretable." AI Alignment Forum. 2022. (https://www.alignmentforum.org/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight).
- [4]: Merullo, Jack, Carsten Eickhoff, and Ellie Pavlick. "Talking heads: Understanding inter-layer communication in transformer language models." Advances in Neural Information Processing Systems 37 (2024): 61372-61418.
Questions
Question 1
I did not understand exactly how the scores are given. What is "mean activation" in this context? In my understanding, using Section 2 of the paper, we write the attention score of a destination token $x_d$ attending to a source token $x_s$ as:

$$\text{score}(x_d, x_s) = x_d^\top W_{QK}\, x_s = \sum_i \sigma_i\,(x_d^\top u_i)(v_i^\top x_s).$$

So, the contribution of singular direction $i$ to the attention score (let's call it $c_i$) can be written as:

$$c_i = \sigma_i\,(x_d^\top u_i)(v_i^\top x_s).$$

When you say that a direction assigns high values to "Jerry", does it mean a high $c_i$? If this is the case, do you evaluate "Jerry" as a source token only, with the end token as the destination?
Question 2
Are the masks binary or continuous values between 0 and 1? The notation used is a bit ambiguous. Looking at Figure 1, there are multiple values between 0 and 1, but I was under the impression that the mask is binary.
Question 3
The paper says the following: "We retain the top-r singular directions (typically r = 64)". The rank of the QK matrix is 64, which makes sense. But is this value used for the other matrices as well? Or do you set r to the rank of the matrix whose SVD you are taking?
Suggestions
My suggestions to improve the work are described in the Weaknesses section.
Limitations
Yes.
Final Justification
I believe this paper is built on a strong and interesting idea. Using SVD to interpret model behavior at a sub-component level has clear potential advantages over other methods in circuits research: it is more granular, does not require a replacement model, and is more faithful by design.
However, as it stands, the experimental validation does not yet match the strength of the core idea. My primary concerns, which are also shared by Reviewer vMFZ (on quantitative analysis) and Reviewer BMMX (on comparisons to related work), remain. Addressing these two points would require significant new experiments that go beyond the scope of a rebuttal period and would likely lead to substantial changes throughout the paper.
For these reasons, I will be maintaining my current score. I hope this feedback is helpful for strengthening the paper in the future. With more robust experimental support, I believe this line of work could be a valuable contribution.
Formatting Issues
None.
Thank you for taking the time to write a detailed and broad review. We appreciate your broader perspective, which brings in additional references (going beyond the circuit discovery literature) and offers suggestions to clarify and better articulate what this work is trying to do, and what it is not.
At its core, the goal of this paper is not to rediscover existing circuits, nor to benchmark a new variant of an existing circuit discovery pipeline. Instead, we’re trying to take a step back and ask, What if the atomic units we analyze, like neurons or heads, are too coarse to fully describe the computations these models perform? Singular vector directions in weight matrices provide a finer-grained, orthogonal basis that isn't constrained by component boundaries. This paper tries to make a case for interpreting behavior in that subspace.
We address the clarifications/suggestions below.
- Suggested Related Works: We agree with your suggestions; we should have cited the works you mentioned. We thank you for broadening the scope of this work by pointing to works on vision, feature reconstruction, and visualization. We hope you understand that the omission of these works was an oversight, not an attempt to overstate novelty. The novelty lies in shifting how SVD directions are used: not just for reconstruction or feature visualization, but as functional units optimized to preserve model behavior, which was missing from the prior art. We will include all of these works in the updated version of the paper.
- We agree that Sparse Attention Decomposition (Franco & Crovella, 2024) is somewhat related to our work, and in hindsight, we should have framed the relationship more clearly. That said, the goal in [2] is to identify a sparse subset of singular directions that reconstruct the attention scores themselves. Our objective is quite different: we optimize for preservation of end-to-end model behavior (as done in circuit discovery methods), not intermediate reconstruction (as done in SAE-type methods). This distinction matters. The directions they find reconstruct attention scores well, but aren't necessarily predictive in the downstream sense. In contrast, the directions we find preserve the full output distribution (measured via KL divergence) and often correlate with interpretable, task-relevant roles. It is important to note that this work's primary goal is to look at things differently and ask whether more abstract circuits exist when components are no longer treated as atomic units; the directions we find play specific roles in the tasks.
- We would like to emphasize that Section 5 is intentionally qualitative. Rather than recovering known causal circuits (which can be obtained via any faithfulness-preserving optimization objective), we explored whether individual singular directions encode interpretable roles, suggesting that attention heads are not monolithic but contain overlapping subroutines. The examples were chosen to support this hypothesis through detailed case studies, not an exhaustive search (so we may have missed a few roles). We agree that a more systematic faithfulness evaluation is needed, but that is still an open research problem in the interpretability domain and remains future work. In our analysis, we included the patterns that were explainable; we explored the other directions as well, but could not generalize the findings, and analyzing each direction separately takes considerably more time. Overall, what we've consistently found is that directions tend to be overlapping and distributed: multiple directions in the same head contribute to distinct, sometimes competing subfunctions. This is precisely why we moved away from component-level analysis; the head as a whole doesn't do "one thing", it houses several subroutines, and the SVD basis lets us isolate these roles. The key idea is to understand not just "which head is active?" but "what kinds of structure live inside a single head?", formulating a different level of abstraction for the atomic, role-playing units. Sections 2 and 3 set up the decomposition for all major matrices (QK, OV, MLP), but we focused our case studies (which required time-consuming manual inspection) and visualizations on the QK matrices, since they showed the highest sparsity in learned masks, making them easier to interpret direction-wise in a manual case study. Moreover, the semantics of QK directions (via dot products) lend themselves to clearer qualitative interpretation, as reported in the paper. The OV and MLP components, on the other hand, tend to exhibit denser, more distributed masks, complicating direction-wise analysis; this could be another direction for future work. Regarding faithfulness, the masks learned over these matrices are still predictive and preserve model behavior (as confirmed via KLD), but are less sparse and harder to dissect cleanly. This observation aligns with the idea that different components in transformers express different forms of computation, some more modular and interpretable than others.
- Thank you for pointing this out; it is an oversight that we did not report this clearly in the main paper, since the objective we minimize in the algorithm is expected to learn a mask that preserves model behaviour. Our main focus was therefore to check whether the resulting directions are actually interpretable, and whether they could serve as new atomic units in the circuit discovery literature. We have now added these results: the KL divergence between masked and unmasked model outputs typically falls within 0.07-0.12, depending on the component, along with a 96% exact prediction match and 82% accuracy (the full model has 84% accuracy) on the IOI task. The learned masks also outperform both top-k singular-direction and random masks, which degrade performance more significantly. All of these metrics validate that the learned subspaces not only align with interpretable roles but also retain the functional essence of the model. We'll report these results more prominently in the revised version.
| Mask Type | Exact Prediction Match (%) | Accuracy (%) | KL Divergence |
|---|---|---|---|
| random 25 | 3.11 | 0.00 | 7.81 |
| random 10 | 0.00 | 0.00 | 7.56 |
| top 25 | 7.10 | 0.00 | 4.6 |
| top 50 | 38.14 | 32.37 | 1.99 |
| learned | 96.56 | 82.12 | 0.12 |
| inverted | 6.01 | 0.00 | 4.01 |
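For transparency, the reported metrics can be computed roughly as follows (an illustrative sketch, not our exact evaluation code; the function and variable names are chosen for this example only):

```python
import torch
import torch.nn.functional as F

def evaluate_mask(masked_logits, orig_logits, answer_idx):
    """Exact prediction match, task accuracy, and KL divergence between masked and unmasked outputs.

    masked_logits, orig_logits: [N, vocab] logits at the answer position; answer_idx: [N] gold token ids.
    """
    match = (masked_logits.argmax(-1) == orig_logits.argmax(-1)).float().mean()
    acc = (masked_logits.argmax(-1) == answer_idx).float().mean()
    kl = F.kl_div(F.log_softmax(masked_logits, -1),
                  F.softmax(orig_logits, -1),
                  reduction="batchmean")
    return match.item(), acc.item(), kl.item()
```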
- On RoPE Models and Generalization: We agree that applying this framework to RoPE-based models presents challenges. Unlike fixed positional embeddings, RoPE introduces position-dependent rotations, meaning the dot product between Q and K is not directly interpretable as a simple inner product on a static SVD basis. This introduces two issues: 1) each token pair effectively uses a different QK interaction subspace, making a single SVD-based decomposition less meaningful; 2) interpreting SVD directions in this dynamic context requires either reversing or compensating for the positional rotation, or projecting into a unified subspace post-attention. That said, this challenge is also an opportunity. The fact that multiple position-dependent QK subspaces exist could enable richer decompositions, and perhaps even multiple, composable attention circuits within the same head. We believe adapting SVD decomposition to the RoPE regime is a natural next step and could be explored in future work. At this point, we chose to focus on models without RoPE to first test the viability of the idea that singular directions themselves can act as interpretable subroutines. We will now add this limitation explicitly and outline how the method might extend to RoPE-based models.
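To make issue (1) concrete, the standard RoPE identity below (a generic sketch using the convention q_i = W_Q x_i; this is not an analysis taken from the paper) shows why the QK interaction becomes position-dependent:

```latex
% With RoPE, queries and keys are rotated by position-dependent matrices R_{\theta,i}:
%   q_i' = R_{\theta,i} W_Q x_i,   k_j' = R_{\theta,j} W_K x_j.
% The pre-softmax score then depends on the relative offset j - i:
\[
  \langle q_i', k_j' \rangle
  = x_i^{\top} W_Q^{\top} R_{\theta,i}^{\top} R_{\theta,j} W_K\, x_j
  = x_i^{\top} \big( W_Q^{\top} R_{\theta,\, j-i}\, W_K \big)\, x_j ,
\]
% so the effective bilinear form W_Q^{\top} R_{\theta, j-i} W_K changes with j - i,
% and a single static SVD of W_Q^{\top} W_K no longer describes all token pairs.
```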
Response to the Questions:
- When we refer to mean activation, we mean the average activation of a given singular direction over the entire dataset. This averaging helps identify directions that are consistently active and not just responding to spurious examples or rare token contexts. Regarding the formulation, you're absolutely right: that is exactly how we compute the per-direction contribution (see the sketch below).
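As an illustrative sketch of this averaging (the names and tensor layout here are assumptions for this example, given the SVD U, S, Vh of the augmented QK matrix; this is not our exact implementation):

```python
import torch

def mean_direction_activation(x_dst, x_src, U, S, Vh, i):
    """Average contribution of singular direction i to the attention score over a dataset.

    x_dst, x_src: [N, d_model] inputs at the destination / source token positions.
    U, S, Vh: SVD factors of the (augmented) QK matrix; i indexes one singular direction.
    """
    contrib = S[i] * (x_dst @ U[:, i]) * (x_src @ Vh[i, :])   # per-example contribution c_i
    return contrib.mean()
```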
- The learned masks are continuous values between 0 and 1, not binary. We use a sparsity-inducing regularization (an L1 penalty), which encourages many values to go to zero but does not hard-threshold them. We intentionally chose this softer form of masking to 1) avoid hard zeroing that may disrupt gradients during learning, and 2) allow finer control over the relative importance of directions.
- Yes, we use the full-rank decomposition for all components, including the QK, OV, and MLP layers. For QK matrices, the effective rank is often 64, so r = 64 is both natural and complete. For OV and MLP matrices, the rank can differ (observed to be 65 for OV after the augmentation-matrix formulation), and we retain all non-zero singular directions from the decomposition (i.e., no truncation). So even though we say "top-r", in practice we do not discard any singular directions unless we explicitly test lower-rank baselines (e.g., top-k SVD vs. learned mask). We thank you for noticing this; as written, it may read the other way to readers, and we will make it clearer in the updated version of the paper (an illustrative rank check is sketched below).
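The toy check below shows why r = 64 is already the full effective rank for QK (GPT-2 small dimensions assumed; this is an illustration, not code from the paper):

```python
import torch

d_model, d_head = 768, 64                 # GPT-2 small dimensions (assumed)
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_QK = W_Q @ W_K.T                        # 768 x 768, but rank is at most d_head
S = torch.linalg.svdvals(W_QK)
print(int((S > 1e-6 * S[0]).sum()))       # prints 64: keeping the top 64 directions is no truncation
```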
First, thank you for the very detailed and careful response.
- I would like to explain my reasoning for suggesting [2] and [3] as more than just citations, but as important points of comparison. I agree with you regarding [1]. It would likely be difficult to adapt its experimental setup to your work, so a citation and discussion are sufficient. For [2] and [3], I believe an experimental comparison would significantly strengthen your claims. Here is my thinking:
- On comparing with [2]: You define your novelty as optimizing SVD directions to preserve model behavior. However, the method in [2] aims to reconstruct the attention scores. Since the attention scores determine a layer's output, reconstructing them is also a way to preserve the model's behavior. Because both your method and [2] aim to preserve behavior through different objectives (downstream performance vs. attention score reconstruction), a direct comparison would help readers understand the trade-offs. This is especially relevant to your goal of finding a sparse subspace. In Figure 1 of your paper, some important heads (like (9, 9) and (10, 7)) do not seem to result in very sparse masks. This leads to a natural question: Is your method finding a small set of important directions, or is it learning to rescale the full set of SVD directions? Comparing your results to a method like [2], which is designed to find a sparse set of vectors, would provide a clear, quantitative answer and better demonstrate the advantages of your approach.
- On comparing with [3]: This paper offers a direct method for interpreting the SVD of the OV and MLP parts of the model. Since you also analyze these components in Section 2, the analysis from [3] could serve as a great baseline. Showing that the directions your method learns are meaningful, when viewed through the lens of [3], would provide strong evidence for your claims about finding "functional units."
In summary, adding these experimental comparisons would make your paper's contributions much clearer and more convincing. It would provide strong evidence that your method for preserving model behavior is more effective in practice than these other approaches.
- See my previous answer.
- The core hypothesis that attention heads contain overlapping "subfunctions" which can be isolated by SVD direction is compelling, and the case studies provide an interesting proof of concept. However, I believe the claims could be significantly strengthened. While I understand that a full, systematic analysis is future work, the current evidence, drawn from a manual inspection of a few directions in a single task, feels preliminary. To address the scalability issue of manual inspection that you mentioned, and to make the evidence for these subfunctions more robust, I suggest considering a more systematic analysis. For example, performing targeted causal interventions on the key SVD directions you identify, similarly to [4], or using tools like Autointerp could provide quantitative support for their interpretable roles without requiring an exhaustive manual search. Regarding the focus on QK matrices, your reasoning is clear. I am not suggesting you perform the same time-consuming manual inspection for the OV and MLP components. Instead, I recommend adding the results you already have to an appendix to make the paper more complete. This would include:
- Visualizations of the learned masks for OV and MLP matrices (similar to what you show for QK in Figure 1).
- The discussion you provided in your comment, explaining that these masks are denser but still preserve model behavior, as confirmed by the KL divergence.
- Thank you for reporting these results. Another suggestion is to have a metric to show the sparsity of the masks, such as the number of singular directions used (or the L0 norm of the mask).
- I completely agree with all the points about RoPE, and as you mentioned, they should be added as limitations/future work in the paper.
- Regarding the questions, thank you for all the clarifications.
Thank you again for your thoughtful feedback and for taking the time to articulate so clearly where you see room for improvement. We deeply value this level of engagement, as it pushes us to improve not just the presentation but also the framing and positioning of our work within the broader interpretability literature. We would like to respond to the mentioned concerns below:
- We thank you for the clarification; we now better understand the intent behind your suggestion to experimentally compare our approach with [2] and [3], not simply as citations but as baselines providing complementary strengths.
- We agree that reconstructing the model activations is also a way to preserve model behaviour. However, when optimizing to reconstruct activations layer by layer, the primary focus is on the layer activations and not on the end-to-end model behaviour; as we mention in the rebuttal, this places such methods among SAE-type (Sparse Autoencoder) approaches, which differ from circuit discovery methods. In SAEs, all the intermediate representations are reconstructed, whereas in circuit discovery, we aim to preserve only the final model behaviour. The two lines of work are quite different conceptually, and a direct comparison is not generally done in the prior art. Moreover, when it comes to computational cost, SAEs are more intensive; they enable dictionary learning that can be used to guide model behaviour (also known as model steering) and to make model representations interpretable, whereas this work's focus is to move away from treating model components as independent units and instead focus on directions that combine to preserve model behaviour as a whole.
- Regarding [3], the weights of GPT-2 are made interpretable via manual hand-labelling (see the section "Experiments with automated direction labelling"), and the analysis is not task-specific. To automate the process, the blog post recommends using a GPT-3 model to come up with potential semantic labels for each direction. In contrast, we devise a mechanism to learn a mask over these directions in an end-to-end fashion [see Algorithm 1, page 4 of the main paper]. In Section 5 of the main paper and in the appendix, we provide a detailed case study, also done manually, to validate whether the found directions play independent roles when combined with directions in other components. We also find some generalized directions (see Figures 6 and 7), some of which work across different tasks. All these findings point towards treating these directions as atomic units in future circuit discovery methods. We hope this new perspective will act as a foundation for future work on finding behaviour-preserving circuits, providing a new direction to explore for interpretability. Given your suggestions, we would be happy to incorporate these with a detailed discussion and, if possible, some initial experimental findings.
- We would like to reiterate that the primary goal of our paper is not just to propose a masking method or a new analysis pipeline, but to shift the level at which we think about model decomposition (specifically in the circuit discovery literature) from components to overlapping subspaces that encode subfunctions (i.e., changing the atomic units from model components to these independent directions). Your questions and suggestions helped us articulate that more clearly, and we are happy to use this discussion to improve the paper with more detail, connecting it to the broader interpretability literature.
- Regarding the core hypothesis, we are pleased that you found the results compelling. We agree with your suggestion to include the extra discussion we provided in the previous comment, and will be happy to update the appendix with an additional table reporting KLD, accuracy, and exact prediction match. We believe the inverted category (last row of the table), where the masks are inverted, relates somewhat to a causal intervention: inverting the learned masks increases the KLD, validating the significance/role of the found directions. Defining the exact form of causal intervention in this setting is an open area of research and requires efforts that go beyond the scope of this work. Finally, we are happy to incorporate your suggestions: 1) visualizations for the OV and MLP matrices, and 2) a detailed discussion of the added table.
- For sparsity, we report the L1 penalty weight lambda, which controls the trade-off between sparsity and performance. In the current setup, we anneal lambda linearly from 0 to 5e-4, which leads to interpretable masks with minimal task degradation (sketched below).
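A minimal sketch of this schedule (illustrative; the function name is ours for this example only):

```python
def l1_lambda(step, total_steps, lam_max=5e-4):
    """Linearly anneal the L1 penalty weight from 0 to lam_max over training."""
    return lam_max * min(step / max(total_steps, 1), 1.0)
```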
- We thank you for pointing out the insights about RoPE. We would be happy to add them to the camera-ready version of the paper.
- We are happy that you found all the answers to be satisfactory.
Thank you for your detailed response and for your willingness to incorporate the feedback. I would like to offer a few final clarifications and summarize my overall view.
- Regarding the comparison to [2], my apologies if my point was unclear. I agree with you that your method has advantages over SAE-type approaches. My suggestion to compare with [2] was not to group your work with SAEs, but rather to use [2] as a non-SAE baseline that also seeks to find important model components using SVD, which are tied to behavior.
- I agree that your method seems to have clear conceptual advantages over [3]. A direct, even if small, empirical comparison showing that your method can replicate or improve upon its findings would provide a much stronger grounding for your claims.
- Regarding your contribution of shifting the "atomic units" of analysis, I think this is a very interesting direction. To help better position this idea, I would recommend citing and discussing recent related work on fine-grained circuits, such as Anthropic's work on "dictionary learning" and feature interpretation. This would help frame your method's unique advantages (no replacement model, SVD-based) within the current landscape.
Summary:
I believe this paper is built on a strong and interesting idea. Using SVD to interpret model behavior at a sub-component level has clear potential advantages over other methods in circuits research: it is more granular, does not require a replacement model, and is more faithful by design.
However, as it stands, the experimental validation does not yet match the strength of the core idea. My primary concerns, which are also shared by Reviewer vMFZ (on quantitative analysis) and Reviewer BMMX (on comparisons to related work), remain. Addressing these two points would require significant new experiments that go beyond the scope of a rebuttal period and would likely lead to substantial changes throughout the paper.
For these reasons, I will be maintaining my current score. I hope this feedback is helpful for strengthening the paper in the future. With more robust experimental support, I believe this line of work could be a valuable contribution.
The paper interprets singular vectors of weight matrices in GPT-2 that are responsible for predicting the answer in popular tasks like "Indirect Object Identification" and "Greater-Than". The authors propose a novel masking algorithm to identify, in the SVD basis, the directions most important for completing the task. The algorithm minimises the KL divergence to the original model at the answer token while learning a sparse mask over the top-k singular values of the weight matrices. They find that each attention head performs many computations in parallel, each associated with individual singular directions. This could not be observed previously at the coarser granularities used in popular circuit discovery algorithms like ACDC, thus providing valuable information to further our understanding of LLMs.
Strengths and Weaknesses
Strengths:
- This work extends circuit discovery to a finer level of granularity, without using expensive SAEs.
- Provides valuable information about what goes on in various components of language models, which were previously not well known.
Weaknesses:
- Little to no comparison with other circuit discovery algorithms:
- Although the authors claim that their analysis is at a different level of abstraction, metrics like the KL divergence recovered and the number of steps/forward passes required to obtain a mask/circuit are still relevant. I also expect there to be some similarities with other mask-based approaches that work at the neuron level (like subnetwork probing variants), which weren't mentioned in the paper.
- Unclear claims:
- The paper claims to have studied "Can a small number of learned singular directions faithfully preserve model behavior?", but I can't seem to find the performance lost/recovered after applying the mask reported anywhere. The authors write, "Learned masks (...) achieve significantly lower KL divergence than top-k or random SVD baselines". However, I couldn't find this result reported in the appendix. In case this was missed, I request that this result be provided.
- What's up with the mask values of other head types being high/not sparse? The authors claim that the heads found by Wang et al. have masks with low sparsity and high mean values (high activation across multiple singular dimensions). However, there seem to be quite a few other attention heads that have many important singular vectors according to the learned mask. They have a higher importance (wrt the mask) than some of the heads in Wang et al.'s circuit. Are they important when resample-ablated? If so, why do they not show up in Wang et al.? If not, what are they up to?
- Little to no ablation studies and some unclear details:
- What is the effect of the choice of "r" for the top-r singular values on the performance and interpretability of the mask found?
- The paper mentioned using early stopping on the validation dataset, but not the exact criteria for it. (i.e., what was the accuracy on the validation set that was assumed to be good enough to stop?)
- Is there any difference between using KL divergence v/s logit difference on the answer token?
Other points:
Related Work:
- Although no one has used the SVD bases to study task-specific circuits, there have been some works studying singular values of weight matrices agnostic to any task (see here: https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight). This does not reduce the novelty of this work. However, the previous work is significantly relevant to be included in the literature review.
- There has also been some work to decompose attention heads into components doing independent computations using sparse autoencoders (see here: https://arxiv.org/abs/2406.17759)
Writing clarity:
- line 107 should have "interpretability" instead of "interoperability"?
- Worth defining x as the residual stream in section 2 (and other similar variables properly)
Overall: Although this work provides valuable and novel insights into the workings of LLMs, there is still some work needed for it to establish this methodology.
Questions
I've listed down most questions and suggestions in the strengths and weaknesses. I'll classify the suggestions in terms of importance and detail them here.
High importance:
- Reporting the performance before and after applying the learned mask on a test set
- Mentioning the early stopping criteria used on the validation set and the number of epochs after which each run used was stopped at.
- Reporting sensitivity to hyperparameters for learning the mask. Most importantly, the value of "r" used while learning the mask.
Decent importance:
- Clarifying the sparsity and mean of mask scores for attention heads shown in Figures 1, 2 and 5. Things that could help:
- A direct correlation between the causal effect observed through previous circuit discovery work and these metrics could be useful
- Interpreting singular values of some of these heads as well, to see if they are doing anything task-specific or do things that are, for example, universally useful.
- Looking at the counterfactual importance of singular values, especially the ones labelled as "other head types" might help. (Note: This would be ideal, but understandable if not done, as the authors mentioned the causal effects of the directions obtained to be a part of future work.)
- Adding in the requested related work and reframing the novelty of the paper to be along the lines of extending SVD-based interp to task-specific settings/circuits.
Good to have:
- Difference between logit difference and KL divergence as the loss function
- Do different random seeds affect the mask learnt a lot?
I would be happy to change my score if the important parts of my concerns are addressed.
Limitations
Yes
Final Justification
I decided to keep my scores the same as the preliminary review. I have noted down exactly what points were not addressed by the rebuttal in my official comments. I add a one-line description for them:
- Clarifying their claims around sparsity and the mean of mask scores for attention heads that seemed unclear to me.
- Incomplete comparison with previous work.
Edit: Hyperparameter search was clarified, and I have removed this from my justification.
Formatting Issues
N/A
Thank you for your thoughtful and constructive feedback. We're glad that the core contribution, bringing interpretability to the level of singular vector directions within weight matrices, stood out clearly. Your comment that this work extends circuit discovery to a finer granularity without using SAEs captured exactly what we hoped to demonstrate. The primary focus is to think in a different direction (level of granularity) for finding/abstracting out functional components playing a specific role in a specific task.
We are particularly pleased that the main idea of reframing interpretability from a component-level view to one grounded in singular vector directions resonated with you. Our motivation was not simply to replicate known circuits through another lens, but to challenge how we define atomic units of function inside LLMs. Rather than focusing exclusively on causal faithfulness or reconstructing known pathways, we explored what abstract predictive directions exist in model weights, even if they are not strictly necessary in a minimal-circuit sense.
We are happy for your careful reading and your detailed points. We respond to the major points/clarifications below.
- Comparison with Other Circuit Discovery Algorithms: We agree that the paper could more clearly position itself in relation to prior circuit discovery methods. However, we would like to emphasize that, as also discussed in Section 5.1 ("Comparison with Existing Circuit Discovery Methods"), our work proposes a fundamentally different level of abstraction: instead of treating entire components (e.g., heads, MLPs) as atomic units, we decompose them into granular orthogonal directions via SVD. This perspective does not fully align with previous work, such as ACDC or SAE-based methods, which focus on edge-level causal paths between predefined components. The proposed method instead focuses on uncovering predictive structure within components, revealing overlapping and distributed subfunctions that standard tracing cannot detect. We also agree that neuron-level mask-based approaches (e.g., subnetwork probing) may share similar optimization objectives. However, those typically treat neurons as the smallest unit, while we show that singular directions inside each weight matrix provide a finer and often more meaningful granularity (as shown by the role-playing directions reported in Section 5 of the paper). We thank you for this insight, and we will revise the Related Work section to better reflect this distinction and reference relevant efforts (e.g., sparse subnetworks and Merullo et al. [2024b]). If you have any other works in mind, we would be happy to include those as well.
- Faithfulness of Masked Directions: Thank you for highlighting this. We agree that explicitly reporting performance retention after masking is essential to support our claim about preserving model behavior. In our experiments, we learned the masks with early stopping based on KL divergence convergence to the unmasked model (on a held-out validation set, as done in prior work like ACDC, EP, and EAP), so functional preservation was expected. However, to make this more transparent, in the updated version we will report the KL divergence and task performance (e.g., top-1 accuracy) after applying the learned masks. These results show that learned masks consistently outperform top-k and random SVD baselines in preserving behavior, supporting the idea that our method selects functionally meaningful subspaces. Section 5 is intentionally qualitative: rather than recovering known causal circuits (which can be obtained via any faithfulness-preserving optimization objective), we explored whether individual singular directions encode interpretable roles, suggesting that attention heads are not monolithic but contain overlapping subroutines. The examples were chosen to support this hypothesis through detailed case studies, not an exhaustive search (so we may have missed a few roles). We agree that a more systematic faithfulness evaluation is needed, but that is still an open research problem in the interpretability domain and remains future work.
- High Mask Scores in Non-Wang Heads: We thank you for this insightful observation, and we agree it deserves more discussion. In general, we found that some heads that are not part of the Wang et al. circuit still receive high mask scores, and our current interpretation/speculation is that these heads may encode redundant or auxiliary predictive routines not captured by causal tracing (since Wang et al. do not specifically use KLD). They could reflect task-agnostic computations (e.g., syntax, position handling) or serve as fallbacks that are predictively useful but not causally necessary. We plan to include resample-ablation experiments in the revision to better understand their role. Preliminary results suggest that masking out these directions does lead to measurable degradation (which we aim to demonstrate with explicit experiments), supporting their functional relevance, even if they were not selected by prior tracing work.
- Effect of Choosing "r" for Top-r Singular Values: We selected r = 64 for QK and r = 65 for OV based on the effective rank after augmentation (i.e., the number of non-zero singular values). This was a pragmatic/natural choice to balance computational efficiency and expressiveness. While it worked well in our experiments, we agree that sweeping r or optimizing over the full orthogonal basis (specifically for MLP layers) could yield richer insights, possibly uncovering weak but interpretable directions. We will clarify this design decision and suggest full-rank optimization as an avenue for future work.
- Early Stopping Details: Our early stopping strategy was based on KL divergence convergence on a held-out validation set (as done in works like ACDC, EAP, and EP; we specifically use EP's held-out validation set). Training was stopped when the L1 norm of the mask and the KL divergence had saturated (the KL value was in the range 0.07-0.12 at the point of saturation, depending on the L1 penalty). We agree that a hybrid strategy combining KLD and sparsity metrics could be more principled and propose it as future work. This also guards against reward hacking (excessive pruning that preserves only task-specific behaviour) and keeps the optimization joint.
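As a rough illustration of this saturation check (a sketch, not our exact stopping criterion; the name and tolerance are chosen for this example):

```python
def has_saturated(values, window=5, tol=1e-3):
    """True when a tracked quantity (e.g., validation KL or mask L1 norm) has stopped improving."""
    if len(values) < window + 1:
        return False
    return values[-window - 1] - min(values[-window:]) < tol
```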
- KL Divergence vs. Logit Difference: We chose KL divergence as our loss function because it compares entire output distributions, forcing the masked model to match the original model's behavior globally, not just on the answer token. In general, we observed that the KLD is compromised when the logit difference is used as the objective; hence, to keep model behavior as close to the original as possible, we prefer KLD as the loss function. We thank you for pointing this out. We will add a short discussion, with some preliminary results if possible, to provide extra transparency for future work.
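For clarity, the two objectives can be sketched as follows (illustrative code; the names are ours and not from the paper):

```python
import torch
import torch.nn.functional as F

def kl_objective(masked_logits, orig_logits):
    """Match the full output distribution of the unmasked model."""
    return F.kl_div(F.log_softmax(masked_logits, -1),
                    F.softmax(orig_logits, -1),
                    reduction="batchmean")

def logit_diff_objective(masked_logits, correct_idx, incorrect_idx):
    """Only preserve the margin between the correct and a contrastive answer token (IOI-style)."""
    correct = masked_logits.gather(-1, correct_idx[:, None]).squeeze(-1)
    incorrect = masked_logits.gather(-1, incorrect_idx[:, None]).squeeze(-1)
    return -(correct - incorrect).mean()
```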
- Additional Related Works: We thank you for pointing these out. We agree that the mentioned works are highly relevant and will include them in the literature review.
Response to the Questions:
- Reporting the performance before and after applying the learned mask on a test set: We agree that this is essential for evaluating the effectiveness of the learned mask. We have now added results showing the KL divergence and task accuracy on the test set: on IOI, the KL divergence between the masked and unmasked model is 0.07-0.12, and the learned masks recover 96% of the original behavior (exact prediction match), better than top-k or random masks. The drop in task accuracy is also minimal (the masked model reaches 82% accuracy vs. 84% for the full model on the IOI task), supporting the faithfulness of the learned subspace. These results confirm that the selected singular directions preserve core model behavior and are not only interpretable but also functional.
- We used early stopping based on KL divergence saturation on a held-out validation set (similar to the one used in EP), i.e., the mask optimization terminated when the KLD stopped decreasing and plateaued near the baseline model.
- Regarding sensitivity to hyperparameters, especially the value of "r": we selected r = 64 for QK and r = 65 for OV, based on the effective rank after augmentation. We did not perform a wide hyperparameter sweep (specifically for MLP); we will include a discussion on this and propose full-rank optimization as promising future work.
- We would like to highlight that the L1 value we obtain is around 700, compared to around 2k for the Wang et al. circuit components; that would be one way of making a comparison. We believe the current perspective is quite different from what has been done in the past. The recent MIB benchmark [Mueller et al., 2025] also lays out a generic measure of sparsity that goes down to the dimension level (also referred to as a neuron-level measure; see Section 3.1, the "Measuring circuit size" subsection).
- Regarding universally useful directions, we agree: when investigated manually, some of the roles we discover appear in a similar form across different tasks; examples reported in the paper/appendix include Figures 6 and 7 (also see the descriptions of these roles).
- We agree that it would be interesting to look at the counterfactual importance of singular values, especially the ones labelled as "other head types". We are currently running these experiments and will report some of the initial findings in the paper if possible.
- We would be happy to add the requested related work and to reframe the novelty of the paper along the lines of extending SVD-based interpretability to task-specific settings/circuits.
- The random masks were initialized using a normal distribution, and multiple experiments with different seeds led to similar mask distributions learned during the optimization process.
Thank you for your time and effort on the rebuttal. However, I would like to keep my score the same, as the important parts of my concerns were not addressed by them. I summarise them below:
Regarding hyperparameters (rank 'r'):
- I am not proposing a very extensive hyperparameter search. I couldn't find even a narrow search (eg: 1 OOM either side of 64) in the paper.
- This is not just for understanding what the tail of SVD decomposition consists of, but also provides much-needed empirical evidence for (1) How sensitive are the results to this hyperparameter?, (2) What percent of the full rank is actually optimal (even wrt computation- you might find rank 16 works just as good, for example). These two questions are very important for reproducibility on other models/evaluating the usefulness of this method.
Clarification for high attention head scores:
- I would have liked some concrete results/explanation showing how important the heads you find outside of the IOI circuits are.
- For example, you write 'masking these directions does lead to measurable degradation'. I would like more concrete details about this experiment and some outcomes: Is the entire model, other than these directions, unmasked? If so, how much effect do you see from just these directions? Why doesn't simple resample ablation show this effect? If not, is it because the node-wise circuit you find itself is very brittle and dependent on this? Are you zero ablating while masking these directions/heads? If so, then you should note that this throws the activations extremely out of distribution.
- Also, in my 4th question, I was not asking for a comparison between causal tracing's sparsity with yours. Instead, I was asking for clarification of your claim, "the heads found by Wang et al. have masks with low sparsity and high mean values" (high activation across multiple singular dimensions). In your paper, you show all the heads individually, and that is difficult to interpret. It would be better to plot the sparsity and the mean score holistically to directly compare this.
- If you do not see a gap between these heads, then this claim should be removed
- If you see some outliers, those heads should be noted and/or investigated
Aside from this, I also can't agree with the point you raised about other methods being incomparable. Many methods try to find subspaces of activations that are relevant to a particular task. Without comparing 'what different does a fraction of the SVD basis find' wrt other methods like node-wise masking or Distributed Alignment Search (Geiger et al.), we can't really say how useful this method is over others.
There are many things to consider here: Something as simple as how many of the top-k SVDs show up in these subspaces or how different the rotation matrices found by DAS are from the singular vectors is insightful in itself. Since they search over the full space of activations, it is an important way to test your hypotheses around the SVD basis. I would also guess that a direct comparison is actually possible when taking all the singular vectors. The question then becomes which basis is better for applying a mask/searching for a mask. This does not substantially compromise the novelty of the paper, but it is extremely important to discuss (and maybe address the easiest ones).
Thank you for addressing the rest of my concerns. (Nit: It would be good if concrete tables/figures were added to the paper instead of prose to address them)
- Regarding the rank, we believe there is some confusion; we would like to note that r comes from the actual rank, i.e., the rank of the QK matrix is essentially 64, with the remaining singular values being zero. A similar query was raised by Reviewer CfUp and clarified in that rebuttal. We would like to clarify that we use the full-rank decomposition for all components, including the QK, OV, and MLP layers. For QK matrices, the effective rank is 64, so r = 64 is both natural and complete. For OV and MLP matrices, the rank can differ (observed to be 65 for OV after the augmentation-matrix formulation), and we retain all non-zero singular directions from the decomposition (i.e., no truncation). So even though we say "top-r", in practice we do not discard any singular directions unless we explicitly test lower-rank baselines (e.g., top-k SVD vs. learned mask). As suggested, we will make this clearer in the updated version of the paper. Regarding your suggestion, we agree that top-r could also be treated as a hyperparameter; however, our ablation results, reported following the reviewers' requests, show that using top-25 or top-50 does lead to a decrease in performance. We include the table below for reference.
| Mask Type | Exact Prediction Match (%) | Accuracy (%) | KL Divergence |
|---|---|---|---|
| random 25 | 3.11 | 0.00 | 7.81 |
| random 10 | 0.00 | 0.00 | 7.56 |
| top 25 | 7.10 | 0.00 | 4.6 |
| top 50 | 38.14 | 32.37 | 1.99 |
| learned | 96.56 | 82.12 | 0.12 |
| inverted | 6.01 | 0.00 | 4.01 |
- Regarding the clarification for high attention head scores:
- We would like to mention that the learned and inverted entries in the above table help clarify this point, i.e., they capture the effect from just these directions, and inverting the mask shows a rise in KLD. We would also like to note that the masks are not binary but rather lie in the range [0, 1], which helps prevent the activations from going extremely out of distribution.
- By the claim "the heads found by Wang et al. have masks with low sparsity and high mean values", we wanted to show a generic trend that helps validate the hypothesis that the algorithm we propose has some alignment with the existing literature [line 136]. We would also like to mention that Wang et al.'s work treats these components as atomic units, lays out a causal mediation analysis, and then assigns roles to these components based on their behaviour in the IOI task. In contrast, the core of this work is a shift in how we think about function in transformers (specifically in the mechanistic interpretability circuit discovery literature): not as something that lives in whole heads or layers, but in overlapping directional subspaces [shown as a detailed study in the main paper as well as the appendix, Tables 3-5]. This reframing, supported by both qualitative insights and these new quantitative checks, is what we hope this work brings to the community, and we hope future work will explore this direction in more detail. Wang et al.'s identification of independent components acts as guidance, and in general we find alignment with their work; however, saying that only these components play a role would not be adequate. In our setup, we found that there is a directional space in which all the independent directions play some role (Tables 1 and 2 show these in detail). In general, we present a different lens to look at things: instead of treating components as atomic units, we treat these directions as atomic units. Being a new step towards finding interpretable components, a direct comparison with the existing circuit discovery literature remains challenging. Existing methods rely on faithfulness metrics along with the sparsity of the circuit. In our case, faithfulness remains comparable (KLD and performance metrics like accuracy that capture model behaviour); however, other comparisons, such as sparsity computation, remain challenging when model components are not considered atomic units but are instead made of more granular directions that interact with each other to produce a particular model behaviour. In general, because existing circuit discovery methods center on edge-level communication between individual components as atomic units, a direct comparison of this setup with existing art remains challenging.
Thank you for clarifying top-r: it is indeed a confusing way to describe the setup! Thank you for pointing me to CfUp's question. I shall update my review based on this clarification.
Regarding your other points:
I am not arguing with your motivation and I am all for it. However, it is still very important to show why and what is different:
- Unless it is shown that (subspaces of) other heads actually have global/task-agnostic functions and are important in a way causal ablations do not capture, one cannot draw conclusions about the difference in results and how useful this difference is. One also cannot make claims using metrics like average mean score and sparsity, and then argue that these metrics don't make sense over entire attention heads, right? I totally agree that "saying that only these components play a role would not be adequate"; in fact, the exact thing I am asking is "What is the difference?" I proposed interpreting at least 1-2 heads outside this mapping.
- Distributed Alignment Search (DAS) [1] also learns a rotation matrix and finds that low-dimensional subspaces in these matrices are interpretable. It is important to show why the SVD basis might be a natural form of this, or whether it is more/less interpretable in comparison. I proposed simply comparing the rotation matrices with the singular vectors.
- Unless one shows that subnetwork probing learns masks that are not interpretable, one cannot claim that masks over a certain basis are just better. I proposed simply comparing what that mask does to activations versus what your mask does.
These analyses should not be disregarded on grounds of novelty. There are many pitfalls of doing partial node-wise analysis (for example, see [2], [3] as criticisms), and in my opinion, new methods proposed in mechanistic interpretability crucially need these sanity checks, or else they risk producing illusory results.
I think most of the experiments proposed in my replies and review are not so extensive or challenging that they cannot be included in this work. They do not compare something like sparsity across dissimilar methods. I am also suggesting that simply discussing points 2 and 3 would help (although it is very straightforward to add ablations for both), as I believe this method lies somewhat in between these two, and that has its perks.
Thank you for your time, I hope that where I come from is understandable and some of these points are addressed in future versions of your draft.
[1]: Geiger et al. Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
[2]: Gupta et al. InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
[3]: Sutter et al. The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
This is a really interesting paper that puts together ideas from singular value decomposition, linearization through augmented inputs, and learnable diagonal masks to present a novel approach to interpretability of transformer circuits.
Strengths and Weaknesses
Strengths:
- The idea of using singular value decomposition by linearizing attention, MLP and output computation using input augmentation is very interesting.
- The use of learnable diagonal masks coupled with freezing of the other model weights is a natural solution to the problem of learning the relevant components of the singular value decomposition. The insight is well appreciated.
- The experimental results on GPT-2 are clear and interesting.
Weaknesses:
- There would be great value in looking at these results on larger models like Llama. However, even the current results in the main paper look promising to me.
- More examples like a circuit for addition or multiplication would have been valuable too.
Questions
None
Limitations
Yes, the authors have addressed limitations.
Final Justification
- The rebuttal has been very helpful in understanding why the work focused primarily on GPT-2, and I am happy with the authors' responses on studying more complex algorithmic/compositional circuits.
I have studied the other reviews and find them all to be insightful. I have assigned more weight to the high-level idea and the core theme of the paper and relatively less weight to the details of the implementation. However, I respect and appreciate the reviews that have correctly focused on the details that will ensure the paper's value to its readers. On the whole, I am inclined to keep my positive review in light of the paper's interesting focus on identifying functional subspaces, despite the correctly identified shortcomings noted in the other reviews.
Formatting Issues
None
Thank you for your thoughtful and encouraging review. We’re very happy that the primary ideas of the paper came through clearly, especially the integration of input augmentation, SVD-based decomposition, and sparse mask learning to isolate functional subspaces inside transformer models.
We are also pleased with your recognition of the simplicity and coherence of this approach, i.e., freezing weights and learning diagonal masks to identify task-relevant directions. This line of thinking grew out of an inclination to shift the conversation in mechanistic interpretability away from fixed components (like heads and layers) and toward functional directions embedded across the model.
- On Scaling to Larger Models (e.g., LLaMA): We agree this is a natural and exciting next step. The framework was intentionally designed to be scalable: learning masks over SVD directions in frozen models is memory-efficient and parallelizable. We used the GPT-2 architecture following the current literature in mechanistic interpretability. Moreover, GPT-2 has been widely studied, which makes experimentation faster, with the added advantage of ground-truth components already discovered by studies like Wang et al. that help validate the presented approach. In the future, it would be interesting to see combinations of such directions playing a role in larger models, making them more interpretable and alignable/steerable for specific behaviours. More specifically, we chose GPT-2 for this study to keep our focus on interpretability and to avoid confounding factors introduced by very large-scale models. Our aim was to first establish the methodological foundation, validate it against the existing literature, and later move towards scaling it to bigger models.
-
On Additional Circuits (e.g., arithmetic tasks): this is a great suggestion. We agree that arithmetic tasks present unique opportunities for discovering compositional or algorithmic circuits; however, they also require careful setup to ensure alignment between directionality and symbolic function. We believe this would be an interesting direction to explore in future work.
We are grateful for your support of the work and for recognizing its originality and potential impact. The positive signals from the current results (clear task-relevant directions, behaviour preservation under masking, and alignment with known heads) are encouraging and strong signs that this direction-based lens could become a useful method in interpretability, offering a new way to look at model behaviour.
Thank you for the detailed rebuttal. I think NeurIPS should really rename it "Discussion" to sound more pleasant and appealing.
-
I now better understand the rationale for choosing GPT-2 and the challenge of obtaining ground truth for larger models, as it is not readily available in the literature.
-
I look forward to reading more about your work on compositional and/or algorithmic circuits in future papers. Those will be even more exciting pieces of work!
I will maintain my already positive score.
Interpreting the model components that support particular behaviours is an interesting direction of research, and this work adds to this effort by hypothesizing that distinct singular directions in the model's weight matrices correspond to distinct functions. To this end, the authors present a framework that optimizes a mask over the singular values of augmented weight matrices to identify the components that support model behaviour on a specific task. This optimization framework, which minimizes an L1-penalized KL loss, yields sparse masks that are used to assign functional relevance to model components, e.g. individual heads. Together, this study demonstrates promising and exciting results that could help shape the field of mechanistic interpretability by offering the alternative perspective of singular vector directions (or eigendirections) as atomic units of computation in a model.
Strengths and Weaknesses
Strengths:
- The paper is generally well-written and the core idea is easy to follow. The hypothesis is clearly described and the experimental setup is well designed.
- The authors do a good job in connecting their idea and results to the literature, which helps situate the reader well in the field.
- The results are interesting and definitely present a strong case for the framework being widely shared with researchers in the field of mechanistic interpretability and the larger ML community. I specifically like the results presented in Section 5, because they help argue against the prevalent view of monolithic functionality of each head.
Weaknesses:
- While the current results are interesting, they do not completely support the authors' initial hypothesis. To truly support their claim, the results need to be augmented with a loss-of-function study (ablation) and/or gain-of-function study. Specifically, it is not clear from the current study what the effect of applying the mask is on the model's task performance. Is it roughly preserved? What happens if the mask is inverted, i.e. the identified components are removed/ablated? Does it lead to the model's performance being similar to a random baseline? The current results do not answer these questions and therefore, the study falls short of supporting the original claim that the model's function is compositional and can be decomposed into individual singular vector directions.
- While the authors present results in Fig 1 and 2 demonstrating that previously-identified functional heads are generally associated with higher mask values, it is unclear whether the distribution of mask values for these heads is significantly different from the "other heads". A more quantitative justification might help make the point clearer. Moreover, I am not sure if this analysis is sufficient to demonstrate the correctness of the proposed method.
- While the authors present and mention that the proposed method can be used for the augmented output-value and the MLP matrices, no results are presented from these experiments in the main text. It would be great if the authors could comment on the reason, or defer the corresponding segment from the methods section to the appendix.
- The functional description of specific singular vector directions within a head, albeit impressive and interesting, is quite qualitative at the moment and would benefit from a more quantitative exploration. Perhaps, the authors could consider adding an analysis across examples and compare the distribution of activations for "functionally-relevant" vs irrelevant tokens, and add statistical tests to demonstrate differences?
Questions
- Can the optimization objective be seen as a form of model pruning (for a specific task)?
- What is the effect of the optimization process on the model's task performance? What is the effect of λ (the L1 sparsity coefficient) on the task performance? I assume it is a hyperparameter, and its choice affects the final results?
- Is the optimization performed jointly for all layers and all matrices in the model, or is it one mask at a time? This was not clear from the text.
- Can Fig 1 be made more concise? It seems more like a supplementary figure now (with Fig 2 conveying the same key message). Also, can you order the mask values to be in increasing or decreasing order, because the matrices are permutation invariant and the singular vectors can be permuted without affecting the overall matrix, right?
- Same question as above about reordering in Fig 2, but for the heads within a single layer. I believe this reordering might help put the core point across to the reader more easily.
Limitations
Yes, added in checklist.
Final Justification
The authors presented new evidence that keeping only the learned singular directions enabled preserving the task performance, along with other controls to demonstrate the validity of their approach in identifying functionally relevant computational components. Overall, I am convinced that the authors' proposal argues for a fundamental shift in the mechanistic interpretability literature, something that will be of significance and interest to the NeurIPS community.
Formatting Issues
N/A
Thank you for engaging with our work and understanding its key aspects. We are pleased that your summary captures the spirit of what we aim to present, i.e., shifting the focus in mechanistic interpretability from components like attention heads to the directions within them, represented by singular vectors, as the real atomic units of function.
We believe this perspective opens up a different way of seeing what’s happening inside transformers. Rather than assuming that heads or layers are monolithic, our work shows that they are better understood as overlapping bundles of functional subroutines/subfunctions, each of which can be observed as embedded in a low-rank direction. This reframing is the core contribution, and we’re encouraged that you found this idea both clear and interesting.
We respond to your suggestions and questions point by point as below:
- We agree with your suggestion to provide an ablation study that offers more empirical justification for the behaviour of the learned masks, covering behaviour preservation, inversion (i.e., removing the identified directions, which leads to near-random performance), and comparisons showing the learned directions outperforming both top-magnitude and random SVD baselines [also see the table and the sketch below]. We will try to include these comparisons in the updated version of the paper. However, we would also like to highlight that in the current version we followed prior work and took the circuits/components from Wang et al. as ground truth for the model components playing a role, and it was interesting to see that these same components show high mask values compared to the rest of the model. Because the mask optimization uses a behaviour-preserving objective, the learned directions retain 96% exact prediction match, 0.07-0.12 KLD with the full model, and 82% accuracy on the IOI task, confirming that they are sufficient for the behaviour (as expected from the optimization objective). In this work, the primary goal was to see whether independent directions also play a role in this behaviour preservation, i.e., that the directions with high singular values are not the only important ones, and that there are task-specific directions that make the subroutine/behaviour possible.
| Mask Type | Exact Prediction Match (%) | Accuracy (%) | KL Divergence |
|---|---|---|---|
| random 25 | 3.11 | 0.00 | 7.81 |
| random 10 | 0.00 | 0.00 | 7.56 |
| top 25 | 7.10 | 0.00 | 4.6 |
| top 50 | 38.14 | 32.37 | 1.99 |
| learned | 96.56 | 82.12 | 0.12 |
| inverted | 6.01 | 0.00 | 4.01 |
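For clarity, a minimal sketch of how metrics like those in the table can be computed is given below. This is illustrative only; `full_model`, `masked_model`, `prompts`, and `target_ids` are hypothetical names (callables returning next-token logits and tokenized IOI examples with the indirect-object token as the target), not our actual code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate_mask(full_model, masked_model, prompts, target_ids):
    """Compare a masked model against the full model on next-token prediction:
    exact prediction match (%), task accuracy (%), and mean KL divergence."""
    exact, correct, kls = 0, 0, []
    for tokens, target in zip(prompts, target_ids):
        logits_full = full_model(tokens)[:, -1, :]    # next-token logits, full model
        logits_mask = masked_model(tokens)[:, -1, :]  # next-token logits, masked model

        # Exact prediction match: both models predict the same token
        exact += int(logits_mask.argmax(-1).item() == logits_full.argmax(-1).item())
        # Task accuracy: masked model predicts the ground-truth (indirect-object) token
        correct += int(logits_mask.argmax(-1).item() == target)

        # KL(full || masked) over the next-token distribution
        kls.append(F.kl_div(
            F.log_softmax(logits_mask, dim=-1),
            F.log_softmax(logits_full, dim=-1),
            log_target=True, reduction="batchmean",
        ).item())

    n = len(prompts)
    return 100 * exact / n, 100 * correct / n, sum(kls) / n
```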
-
Figures 1 and 2 demonstrate that previously identified functional heads are generally associated with higher mask values. We agree that the statistical distribution may overlap with some of the other (non-role-playing) heads, as also shown in Figure 2. However, the primary point is that there are independent directions (not always those with the highest singular values) that play a crucial role in the model's behaviour. To validate this in more detail, we provide the entire Section 5 and additional investigations in the appendix showing how different directions of different components play specific roles for a specific task. In this work, the primary goal was to look at circuit discovery through a different lens, moving away from taking a module/component as an atomic unit and instead discovering the abstract spaces learned/used by the model to perform a task.
-
We appreciate your detailed suggestion to compare the distributions of activations for "functionally relevant" vs. irrelevant tokens and to add statistical tests demonstrating the differences. We believe this would add a further layer of verification regarding the independent directions. In the current version of the paper, we investigated the behaviour of the discovered directions qualitatively, as is common in prior mechanistic interpretability work. We believe the lack of suitable metrics for comparing activations is the primary bottleneck that limits treating a statistical comparison as a strong source of verification: activation comparison is a broad area with its own limitations, and behaviour preservation, such as a faithfulness score (comparison with the original model), is the metric most widely used in the community. The qualitative investigation was, in our view, more necessary, and it took considerable effort to frame a justification for the current approach of finding abstract components and studying the patterns associated with some of the directions. This provided suitable validation that some directions are more important than others, and suggests a new way to think about model components, namely taking these directions as atomic units. If there is a specific metric for activation comparison used in the literature that would make this study more general, we would be happy to try it.
Response to Questions/Clarifications:
-
Q: Is this like model pruning? Yes, we are glad that you noticed this. The current way of masking can indeed be seen as a task-specific, low-rank pruning strategy, but one that operates in the directional basis of parameter matrices rather than on weights, neurons, or any other component. The primary difference is that we are not just simplifying the model; we are studying the functional geometry (via the discovered directions) of how the model makes decisions.
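As an illustration of this directional view, here is a minimal sketch of masked low-rank reconstruction, assuming `W` is a linearized (augmented) parameter matrix and `mask` is the learned diagonal mask; the names are illustrative, not our exact implementation.

```python
import torch

def masked_reconstruction(W, mask):
    """Keep only mask-weighted singular directions of a (linearized) weight matrix.

    W:    e.g. an augmented QK, OV, or MLP matrix
    mask: learnable vector with one entry per singular direction (mask = 1 recovers W)
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ torch.diag(mask * S) @ Vh
```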
-
Q: What is the effect of lambda (L1 sparsity)? Lambda controls the trade-off between sparsity and performance. It is a hyperparameter that can be tuned to control sparsity. In the current setup, we linearly anneal lambda from 0 to 5e-4, which leads to interpretable masks with minimal task degradation. Intuitively, too high a lambda loses coverage, while too low a value makes the masks lose meaning.
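A minimal sketch of the annealed objective, assuming `masks` is the list of learnable diagonal mask vectors and `full_logits` come from the frozen unmasked model (illustrative names, not our exact code):

```python
import torch.nn.functional as F

def l1_weight(step, total_steps, lam_max=5e-4):
    """Linearly anneal the L1 coefficient from 0 to lam_max over training."""
    return lam_max * step / total_steps

def mask_loss(masked_logits, full_logits, masks, step, total_steps):
    """KL to the frozen full model plus the annealed L1 sparsity penalty on the masks."""
    kl = F.kl_div(
        F.log_softmax(masked_logits, dim=-1),
        F.log_softmax(full_logits, dim=-1),
        log_target=True, reduction="batchmean",
    )
    l1 = sum(m.abs().sum() for m in masks)  # masks: learnable diagonal mask vectors
    return kl + l1_weight(step, total_steps) * l1
```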
-
Q: Is optimization done jointly or separately? We optimize all the masks jointly to get a more coordinated structure.
-
Q: “Can Fig. 1 be made more concise?” Yes, we agree with the suggestion to make Figure 1 more concise. We specifically ordered Figure 1 by singular values rather than by mask values to highlight that the mask peaks do not coincide with the largest singular values. In an SVD it is natural to assume that the directions with the largest singular values matter most for matrix reconstruction, but the peaks appearing in the middle show that some directions are more task-relevant than those with larger singular values, illustrating the non-monolithic nature of these modules in more detail. As per the suggestion, we can include an additional figure that sorts heads and singular directions by activation strength. Thanks for this practical and helpful suggestion.
We started this work from a simple question: what if we are looking for functional units in the wrong place? Rather than treating entire components as atomic, we asked whether function lives in directions, and whether those directions can be disentangled, isolated, and understood. Our initial study suggests the answer is yes, and that much of what we consider "component function" may be an emergent property of overlapping directional subroutines, which we aim to disentangle in this work. We believe there is much left to explore in how we study model components, and this work may offer one foundational way of thinking about model behaviour.
I would like to thank the authors for responding to my comments and questions. I believe the proposed additions and changes will improve the overall quality of the work and clarify the utility of this framework to the reader.
Some thoughts:
-
Thank you for adding the table demonstrating the loss-of-function and gain-of-function experiments. However, I am slightly confused as I thought the model's performance on the task (assuming you are using the logit difference metric from Wang et al.) is nearly 100%. Therefore, preserving only the learned directions leads to a decrease in task performance, which seems counter-intuitive to me. Perhaps, I am missing something about how the accuracy is computed on IOI for the GPT-2 model here, but it would be good to clarify this.
-
I am not totally convinced of the reasoning provided by the authors about the lack of quantitative comparisons. While I understand that most of the mechanistic interpretability literature focuses on qualitative comparisons, I feel that the current work would benefit from a more quantitative comparison. Especially given that the experiment scope is limited, I think a quantitative comparison is essential to demonstrate the utility of the proposed idea. If the authors intend to demonstrate an alignment between their findings and previous works, I feel the best way to do so is to show that mask values for functionally relevant heads are indeed statistically significantly different from task-irrelevant heads.
Thank you again for your responses. While I think this is a neat proposal and very interesting work, I am not completely convinced that this work validates the claim convincingly. I am eager to see the authors' response to my above questions and am happy to revise the score.
Thank you again for your engagement in the discussion phase. Your questions help us better clarify our intentions and refine how we present our work. We’d like to respond to the two points that are a little unclear:
-
Regarding task performance and the IOI accuracy metric: We understand the confusion around this, and we should have clarified it earlier. While some works like Wang et al. evaluate performance using logit difference (LD) scores, we follow the evaluation protocol from recent, widely used circuit discovery literature, specifically Bhaskar et al. (see their Figure 6), where IOI task performance is framed as a token match with the ground truth and accuracy is typically around 75-100% for GPT-2 to be considered faithful on IOI (we compute accuracy in the same way). In our case, the learned directions yield 82% accuracy, computed identically. So within that framework, the directions we identify are sufficient to preserve task-relevant behavior (highlighting faithfulness). Moreover, we would like to reiterate that our goal is not just to preserve performance, but to go a step further, i.e., to isolate components that are both behaviorally faithful and interpretable. We are not aiming merely for a pruning-like effect; we are trying to surface functional subroutines embedded in low-rank directions, ones that actually make the behavior possible in a compositional and disentangled manner. This helps shift the focus of circuit discovery methods from component-based atomic units to directional ones.
-
On the need for more quantitative comparison: We completely agree with your point, and we appreciate the push for stronger statistical evidence. Our original focus was on building an intuitive and interpretable case for singular vector directions being functional units (more as a foundational study that orients future work toward this perspective of individual directions playing a role in behavior), using qualitative patterns that align with what has been valuable in prior mechanistic interpretability work. However, your feedback helped us realize that a quantitative angle can reinforce the argument more robustly, and the experiments we added are a first step in this direction. More specifically, the last row (inversion of the mask, reported in the table above) shows that when we remove the learned directions, task performance drops to zero and KL divergence increases sharply, which directly supports the claim that these directions are not only relevant but crucial. While we did not include formal hypothesis tests comparing mask distributions across head types in the main text, these inversion results effectively show that the model's behavior depends significantly on the identified directions. Additionally, future work could add statistical significance tests between functionally and non-functionally identified heads (e.g., t-tests or KS tests on mask values), which would make the argument more rigorous. We are currently exploring this and will see whether it can be included in a revised version (if time and space permit).
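For example, such a test could look roughly like the sketch below, assuming hypothetical containers `mask_values` (learned mask values per head) and `functional_heads` (heads flagged by prior circuit work); this is an illustration, not code from the paper.

```python
import numpy as np
from scipy import stats

def compare_head_masks(mask_values, functional_heads):
    """Compare learned mask values of previously identified functional heads
    against those of all remaining heads.

    mask_values:      dict mapping (layer, head) -> 1-D array of mask values
    functional_heads: set of (layer, head) tuples, e.g. the IOI circuit of Wang et al.
    """
    functional = np.concatenate(
        [v for k, v in mask_values.items() if k in functional_heads])
    other = np.concatenate(
        [v for k, v in mask_values.items() if k not in functional_heads])

    t_stat, t_p = stats.ttest_ind(functional, other, equal_var=False)  # Welch's t-test
    ks_stat, ks_p = stats.ks_2samp(functional, other)                  # two-sample KS test
    return {"welch_t": (t_stat, t_p), "ks": (ks_stat, ks_p)}
```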
We believe the core of this work is a shift in how we think about function in transformers (specifically in mechanistic interpretability circuit discovery literature), not as something that lives in whole heads or layers, but in overlapping directional subspaces [shown as a detailed study in the main paper as well as the appendix, Tables 3-5]. This reframing, supported by both qualitative insights and these new quantitative checks, is what we hope this work brings to the community, and we hope that future works could explore this direction in more detail.
We are grateful for your openness to revisiting your score and are happy that we could address your initial concerns satisfactorily. Please let us know if there are any further clarifications we could provide in the discussion phase to improve the presentation and highlight the core strengths of this work. Given that the discussion phase is nearing its end, we will respond as soon as we can.
[1] Bhaskar, A., Wettig, A., Friedman, D., & Chen, D. (2024). Finding Transformer Circuits with Edge Pruning. arXiv:2406.16778. NeurIPS 2024.
Thank you again for providing the clarifications.
-
IOI task accuracy: Thank you for pointing to Fig 6 in Bhaskar et al. However, my question was more about what the full model's performance on the task was and how it compares to the performances reported by the authors in the table in their rebuttal. While I understand that the authors do not aim to preserve the task performance, their claim that the identified directions are "sufficient for behavior" needs to be supported with evidence that keeping only these directions yields the same or very similar task performance. While the comparison to other controls is useful, that result indicates the identified directions are better candidates for being the functional elements compared to these controls, but does not establish their sufficiency.
-
Quantitative comparison between functional and non-functional heads: I understand the authors' motivation and the spirit that prompted them to adopt a more qualitative approach. However, my concerns arise from the fact that their claims about correspondence with findings from prior works lack quantitative evidence. Currently, the reader is asked to believe the claims based on the qualitative evidence presented in the figures. While this evidence is great for building intuition and demonstrating the correspondence, the claim needs to be backed by rigorous tests. Having these statistical tests will be beneficial to future work that aims to bridge the gap between the existing notion of circuits in foundation models and the authors' proposal of subspaces as the functional atomic elements in a model.
Thank you again for your hard work. As before, I rate this work positively and believe this work is interesting. However, the missing quantitative analysis and my lack of clarity around the task accuracy metrics limit me from rating it higher.
Thank you again for your engagement in the discussion phase.
- Regarding IOI task accuracy, we believe a few points are still a little unclear. In the rebuttal, we presented a table [repeated below] that shows exactly what was asked, i.e., the full model's performance on the task and how it compares to the performances reported in our rebuttal table. Specifically, the full model's accuracy on the IOI task is 84%. When we perform the ablations presented in the table, we find that the learned mask [second-to-last row] yields performance close to that of the full model, highlighting the faithfulness of the found directions. We believe this also speaks to your point that the identified directions are better candidates for being the functional elements than these controls but that the comparison to controls alone does not indicate their sufficiency; the closeness of the learned-mask performance to the full model's 84% is the comparison we rely on for sufficiency. We should have explained this in more detail, and we hope we understood your concern correctly. Please let us know if you were referring to other results/experiments that you would suggest for validating this sufficiency. We refer to the table below:
| Mask Type | Exact Prediction Match (%) | Accuracy (%) | KL Divergence |
|---|---|---|---|
| random 25 | 3.11 | 0.00 | 7.81 |
| random 10 | 0.00 | 0.00 | 7.56 |
| top 25 | 7.10 | 0.00 | 4.6 |
| top 50 | 38.14 | 32.37 | 1.99 |
| learned | 96.56 | 82.12 | 0.12 |
| inverted | 6.01 | 0.00 | 4.01 |
- Regarding the quantitative comparison between functional and non-functional heads: We believe the above table supports this quantitative claim, i.e., the found directions are indeed good candidates for functional components that play a role in the task. Moreover, in the appendix we report additional quantitative results, finding a few directions that generalize across datasets/tasks (see also Figures 6 and 7). Tables 3, 4, and 5 in the Appendix also show the statistical significance of the components that play a role (estimated over the entire dataset) for the different tasks, where a few of the components show the highest attention percentage.
Additionally, the above table also indicates that using random directions [the first two rows] produces a sharp rise in KLD and a drop in accuracy, and that even the directions with the highest singular values [top 25 and top 50] cannot reach performance close to the original model. Meanwhile, the learned masks preserve the functional behaviour of the model on the task, and inverting this mask increases the KLD and drops performance. We believe all these quantitative results, along with Tables 3, 4, and 5 in the Appendix, show the statistical significance of these components and suitably justify the claims made in the paper. Since in the mechanistic interpretability literature it is also important to make the found directions/components explainable, we dig deeper with a more detailed manual analysis to justify the claim further. Please let us know if we are still missing something, or if there is any specific ablation widely used in the existing mechanistic interpretability literature that would make the claims more empirically justified. We hope that any remaining clarification issues have now been resolved; in hindsight, we could have done a better job explaining the results table in detail. We would be happy to respond as soon as we can if you have any further clarifications.
-
IOI Task accuracy: Thank you for the clarification. I was asking for the full model's performance, which you now mention is 84%. I did not find this information in your earlier responses (I checked the paper and your responses to my review). Did you mention it somewhere before and I missed it? My sincere apologies if that is indeed the case. As you mention, the full model's performance is very close to the model's performance when the learned mask is applied. This result addresses my concern.
-
Quantitative results: I respectfully disagree with the authors that the table provided by them is quantitative evidence in support of their claim that the learned singular value masks have higher values for previously-identified functional heads (Figs 1 & 2). I would have liked to see a statistical comparison between the distribution of mask values for functional vs non-functional heads. To this end, I think the authors could have done statistical tests to estimate how different their hypothesis (mask values for previously identified functional heads) is from the null distribution (mask values for previously identified non-functional heads). I liked the idea of using t-tests and KS tests for this purpose, which the authors proposed in one of their responses.
Overall, I think the proposal presented by authors is interesting and warrants more attention from the community. The more recent task results have convinced me of the validity of this method for the IOI task. I hope this result allows the community to evaluate this proposal on larger networks and more complex tasks. Given this result, I will increase my rating to 5 (accept) and would like to congratulate the authors again on this very interesting work.
Thank you for engaging in this detailed discussion with us. We're genuinely very happy that you found the idea to be interesting and think that it warrants more attention from the community.
You're absolutely right to raise the concern about the missing comparison between the mask value distributions across functional and non-functional heads, specifically when it comes to comparison with Wang et al. We agree this is a critical discussion point, especially in evaluating the alignment of our framework with established circuit discovery methods. In our analysis, we do find that many of the heads identified by Wang et al. tend to receive high mask scores in our framework; specifically, they exhibit low sparsity and high mean values. This serves as an important sanity check and validates that our method can recover known functional components to some extent. However, we also observe that several heads not reported as part of the Wang et al. circuit still receive high mask scores. Our current interpretation of these findings is as follows: 1) we speculate that these heads may represent redundant or auxiliary routines that are predictively useful but not strictly causally necessary, and thus may not be surfaced by causal tracing methods like those used in Wang et al.; 2) some directions may capture more generic computations, such as syntactic structure, positional encoding, or generic fallback strategies, that support the model's performance in subtle ways but are harder to isolate with interventions; 3) more importantly, Wang et al.'s methodology is based on causal mediation, which is sensitive to the specific probing method and metric (e.g., logit difference), whereas the proposed framework uses a KL-divergence-based behavioral loss, which may reveal components contributing to the output distribution even if they do not directly mediate logit changes for a specific task.
More broadly, we want to emphasize that our work provides a conceptual shift in how functionality is understood within transformer models. While Wang et al. treat components (e.g., heads or MLPs) as the atomic units of function, our framework instead posits that individual directions in the parameter space (identified via SVD and masked selectively) are the more granular, disentangled functional subspaces. As such, our notion of “component” is more fine-grained, and some overlap, but not complete alignment, with prior component-based methods is both expected and, in fact, desirable. As a caveat, this reframing makes direct comparison with prior circuit discovery techniques (ACDC, EAP, EP) challenging in some cases. Metrics like “sparsity of components” are hard to translate when components are not assumed to be monolithic units, but rather overlapping, compositional subroutines distributed across directions. Also, defining communication between these directions ("edges" in computational graphs) remains an open research problem for a direct comparison.
Nonetheless, our method aligns with prior work in terms of behavioral faithfulness (as measured by accuracy and KL divergence) and provides a complementary lens into model structure. We hope this opens up avenues for deeper exploration of directional subspaces as atomic units of computation, and we view prior works like Wang et al. as both a point of validation and a valuable contrast to our approach.
Also, you're correct that we had not explicitly stated the 84% full-model accuracy in our earlier responses; that oversight is on us (we mentioned it in responses to other reviewers and missed including it here in the rush before the deadline). Thank you for pointing it out clearly. We're glad that the comparison to the learned-mask performance (82%) now addresses the question of sufficiency more concretely.
On a general note, this research started as a simple question about whether we’ve been looking at the wrong units of function in transformer models. Through your reviews and perceptive questions, we feel more confident in both the importance of the question and the potential value of the direction we’ve proposed. Needless to say, your feedback was quite helpful in making the claims more presentable with respect to the evidence needed to support them.
We thank you again for all your time.
Thank you again for the clarification. Your result about the correspondence to functional heads identified by Wang et al. makes complete sense to me. This is primarily the reason I brought up the point about statistical significance. I believe that the correspondence might be loose because of the reasons you mentioned in your response. I feel a statistical quantification of the correspondence along with the reasons for a lack of stronger correspondence will be helpful for future work. Hence, I would recommend adding it in the final version of your paper.
Once again, congratulations on the great work and I hope that this work brings about a shift in the monolithic perception of function among mechanistic interpretability researchers.
Thank you very much for the kind words and for your engagement throughout the review process.
We completely agree that a statistical quantification of the correspondence between our masked directions and the functional heads identified by Wang et al. would make the claims even clearer and provide insights into both alignment as well as divergence. We will try including this analysis in the final version, along with a brief discussion of why some heads may show weaker alignment, including methodological and conceptual differences between the approaches, as noted in our earlier responses.
We're grateful for your encouragement and are hopeful that this reframing around directional subspaces contributes to advancing the mechanistic interpretability discussion.
This paper proposes a direction-level view of transformer computation. Instead of treating heads or MLPs as monolithic units, the authors linearize QK/OV/MLP via augmented matrices, take SVDs, and learn sparse, continuous masks over singular directions to preserve the model's output distribution on task prompts. The central empirical claim is that a small set of learned directions can retain behavior nearly as well as the full model, while also revealing that single heads implement multiple subfunctions in parallel (illustrated most deeply for the IOI name-mover head).
On my read, and reflected in the discussion, the strengths are (i) a clear, well-motivated reframing that moves beyond component-level circuits, (ii) a simple, transparent procedure (freeze weights and optimize diagonal masks against KL to the unmasked model with an L1 penalty) that others can reproduce, and (iii) evidence that the learned subspaces are not just cosmetic. The rebuttal added concrete numbers showing that learned masks preserve IOI behavior closely, outperforming random or top-k SVD baselines by a wide margin, and that inverting the mask collapses performance. Two reviewers highlighted that this lens surfaces multi-functionality within heads and may help shift how we phrase questions in mechanistic interpretability.
The weaknesses are also clear. As some of the reviewers point out, the scope is narrow (GPT-2, IOI plus a small set of tasks), much of the semantic analysis is qualitative, and several comparisons that would strengthen the case remain absent, for instance head-to-head against SVD-based sparse attention reconstruction, against DAS / subnetwork-masking approaches, or quantitative significance tests showing that heads deemed functional by prior circuit work have distinctly higher mask values than others. Even after the rebuttal and discussions, questions remain about sensitivity (such as mask sparsity trade-off), treatment of OV/MLP beyond QK, and generalization to RoPE models. Finally, while the method is behavior-faithful by construction, causal validation is left for future work.
However, despite these gaps, the rebuttal and discussion added much needed evidence to the authors' claims. Specifically, the table added in rebuttal directly addresses the most material concern raised across reviews, which is "do the learned directions actually preserve behavior better than simple SVD heuristics?". The answer appears to be yes by a large margin. The authors also clarified the confusing top-r terminology and committed to including OV/MLP visualizations and statistics, and acknowledged the need to add significance tests contrasting functional vs other heads. However, two reviewers maintained their negative stance even after the rebuttal primarily because comparisons to closely related SVD and rotation-based baselines and more systematic quantification are still thin. I agree those additions would elevate the work, but in my judgment the paper, especially after the rebuttal, provides a novel, useful, simple, and reproducible viewpoint with sufficient quantitative evidence of behavioral faithfulness.