PaperHub
6.4 / 10
Rejected · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.5
Novelty: 3.0 · Quality: 2.5 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

Submitted: 2025-05-09 · Updated: 2025-10-29

Abstract

Keywords

sparse autoencoders, SAEs, interpretability, NLP

Reviews and Discussion

Review

Rating: 4

This paper presents the new concept of feature hedging in Sparse Autoencoders (SAEs), where correlated and hierarchical features are mixed into one latent when the number of latents is lower than the number of true features. In contrast to feature absorption, where a child feature prevents a parent feature from firing, hedging happens in narrow SAEs. After illustrating this phenomenon on a toy example, the authors propose a score (the hedging degree) to measure how much hedging occurs in a given SAE. Based on these insights, they propose to add a new term to Matryoshka SAEs to balance hedging and absorption.
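For intuition, the collapse described above can be reproduced in a minimal numerical sketch (a construction of this review, not taken from the paper): the reconstruction-optimal single latent for two correlated features is their top principal direction, which mixes both features.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Two orthogonal unit-norm "true features"
f1 = np.zeros(d); f1[0] = 1.0
f2 = np.zeros(d); f2[1] = 1.0

# Hierarchical correlation: the child f2 only fires alongside the parent f1
n = 10_000
p1 = rng.random(n) < 0.5                      # parent fires on half the samples
p2 = p1 & (rng.random(n) < 0.5)               # child fires on half of those
X = np.outer(p1.astype(float), f1) + np.outer(p2.astype(float), f2)

# The best single-latent linear reconstruction is the top principal direction
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
latent_dir = Vt[0]

# Projections of the single latent onto each true feature
print(abs(latent_dir @ f1), abs(latent_dir @ f2))
```

Both projections come out clearly nonzero, i.e. the single latent "hedges" between the parent and child rather than representing either feature alone.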

Strengths and Weaknesses

Quality

[+] The methods used are appropriate, and the claims are generally well-supported

[+] I like the fact that the authors carefully present the studied phenomenon (feature hedging) through a toy example before moving to real models.

[+] The limitations of this work are discussed.

[-] I am not entirely sure of the soundness of the proposed hedging degree score. This is my primary concern, as a good part of the paper is based on the results of the hedging degree (see questions 1-2)

[-] The experimental section is limited to one LLM (see question 4)

[-] The claim of providing a theoretical study of hedging is misleading (see question 5)

Clarity

[+] The paper is well-written, well-structured, and easy to follow

[-] Some minor details are lacking to fully understand some parts of the experiment (see question 3 and minor comments)

[-] The readability of some figures could be improved (see minor comments)

[-] I think there is a small typo in Eq.7 (see minor comments)

Significance

[+] The results could help the community design better SAEs

[-] The proposed solution (the weighted Matryoshka SAE) is not fully convincing, but the authors acknowledge that.

Originality

[+] As far as I know, the observed phenomenon was not discussed in the SAE literature before.

[+] This provides new insights into the poor performance of SAEs compared to other methods.

[-] The fact that multiple features will be encoded into one latent variable if the number of latent variables is low enough is not very surprising. A related issue was discussed in the VAE disentanglement literature a few years ago in [1,2].

References:

[1] Dai, B., & Wipf, D. (2019). Diagnosing and enhancing VAE models. International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=B1e0X3C9tQ

[2] Rolinek, M., Zietlow, D., & Martius, G. (2019). Variational autoencoders pursue PCA directions (by accident). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12406-12415).

Rebuttal

I did not raise my score higher as the computational cost of the proposed method and its limited evaluation (also discussed by reviewer qeRK) lower its applicability and broader impact.

Questions

Major comments

My main concern is about the stability of the proposed hedging degree (see questions 1-2). I will be happy to raise my score should these questions be satisfactorily answered.

  1. l. 151-157: How can we be sure that the "parent" decoder (this is $W_{dec}^1[0:L]$, correct?) remains unchanged? Is it possible that in some cases $W_{dec}^1[0:L]$ learns a mix of $f_1$ and $f_2$ (as in the first row of fig. 1b) and $f_1$ is correctly tracked by $W_{dec}^1[L:L+N]$. If so, how does this impact Eq.5-7? If not, why?
  2. If I understand correctly, in Eq.5 when there is no hedging and no absorption, we have $\delta_L \approx 0$, so we should have $h \approx 0$, which makes sense. When there is some hedging, fewer features would be learned in one latent when we increase the dimensionality, so we would expect to have $|\delta_L| > 0$ where $|\cdot|$ is the absolute value. Because these features are learned somewhere in the N new latents, we would expect to have $h > 0$. Assuming that absorption will only happen in the N new latents (see comment 1. about this), we should obtain the same $h$ regardless of whether absorption occurs in the new latents or not. All this is fine (if the assumption holds), but what happens when we add more latents to an SAE already showing some absorption? Could we have some cases there where $h > 0$ because of this? I would be happy to raise my score if the authors provide some justification for this and the assumption discussed in question 1, which can alleviate my concern that $h > 0$ may not only be due to hedging.
  3. If I get this correctly, the encoder cosine similarity is $\mathrm{cos\_sim}(f(a), a)$, but I don't know how the decoder cosine similarity is computed. Is this $\mathrm{cos\_sim}(\hat{a}, a)$? Something else? A clarification about this would ease the reading.
  4. It would be nice to have an analysis of more than one LLM in the experimental section. However, I am aware that this is time-consuming and may not be feasible within the allocated time.
  5. l. 10 The phrasing "We study it both theoretically in toy models [...]" could be misleading as what is presented is an empirical observation on toy models, not a mathematical proof of feature hedging. Please rephrase this to clarify it.

Minor comments

These comments are easy fixes which would, in my opinion, ease the reading of the paper.

  • Fig 1 is not black-and-white printer-friendly. Adding the corresponding cosine similarity scores, as in Fig.2, would significantly improve this.
  • Fig. 7b slightly stretches over the margin. Truncating a bit of the text in 7b may help reduce the size of this figure and fix this issue.
  • The norm is already used in Eq. 7, so I think it is not necessary in Eq. 6, especially since this is a projection.
  • In Eq. 7, it would ease the reading to explain that $\delta_L[i]$ is the difference between the weights of the $i^{th}$ feature.
  • In A4, it would be nice to have larger figures.
  • References with a large number of authors would be more readable with ellipses after the 20th, as per APA standards.

Limitations

Yes

Final Justification

The authors' answer satisfactorily alleviated my main concern, and I have updated my score accordingly. I did not raise my score higher as the computational cost of the proposed method and its limited evaluation (also discussed by reviewer qeRK) lower its applicability and broader impact.

Formatting Concerns

NA

Author Response

We thank the reviewer for their in-depth reading of our paper and for providing feedback and thoughtful questions. We will make all recommended improvements in the minor comments section to improve readability and clarity. We are also grateful for the references to VAE work and will add references to these works in our related works section.

We address the major concerns below:

L 151-157: How can we be sure that the "parent" decoder (this is $W_{dec}^1[0:L]$, correct?) remains unchanged? Is it possible that in some cases $W_{dec}^1[0:L]$ learns a mix of $f_1$ and $f_2$ (as in the first row of fig. 1b) and $f_1$ is correctly tracked by $W_{dec}^1[L:L+N]$. If so, how does this impact Eq.5-7? If not, why?

This is an excellent question, and is also core to the reason Matryoshka SAEs work to solve feature absorption. In feature absorption, there is a parent feature ($f_1$ in Figure 1b) and a child feature ($f_2$ in Figure 1b). Since the child feature can only fire if the parent feature is firing, the child feature therefore must fire less frequently than the parent. For most realistic features one can think of, the child feature will indeed fire much less frequently than the parent feature (e.g., a parent feature like "animal" will fire far more frequently than "snapping turtle", a very specific animal).

As such, if an SAE is too narrow to represent both the parent and child feature (only one of $f_1$ or $f_2$ fits in $W_{dec}^1[0:L]$), the SAE is heavily incentivized to learn the parent feature $f_1$ rather than the child feature $f_2$. E.g., if the SAE were to learn "snapping turtle" and not learn "animal", the SAE would be much worse at reconstructing inputs than if it were to learn "animal", since "animal" is a much more commonly occurring concept than "snapping turtle".

This insight is why Matryoshka SAEs work to solve absorption as well. The inner levels of Matryoshka SAEs are assumed to learn the most general, high-frequency concepts - thus likely to be "parent" concepts - while later levels of the Matryoshka SAE learn lower-frequency "child" concepts of those parents. As long as a parent concept is learned in an earlier Matryoshka level than its child concepts, the Matryoshka SAE will prevent absorption, as engaging in absorption would increase the MSE loss of the level containing the parent. And empirically, Matryoshka SAEs do appear to work quite well to solve absorption, so this is further evidence that narrow SAEs prioritize learning parent features over child features.

In addition, for our hedging metric, we calculate the metric from scratch with SAEs trained with multiple different seeds to gauge the variance in the metric (1 stddev is shown in the shaded area of all plots in Section 4), and the variance is very low, further validating the stability of the metric.

If I understand correctly, in Eq.5 when there is no hedging and no absorption, we have $\delta_L \approx 0$, so we should have $h \approx 0$, which makes sense. When there is some hedging, fewer features would be learned in one latent when we increase the dimensionality, so we would expect to have $|\delta_L| > 0$ where $|\cdot|$ is the absolute value. Because these features are learned somewhere in the N new latents, we would expect to have $h > 0$. Assuming that absorption will only happen in the N new latents (see comment 1. about this), we should obtain the same $h$ regardless of whether absorption occurs in the new latents or not. All this is fine (if the assumption holds), but what happens when we add more latents to an SAE already showing some absorption? Could we have some cases where $h > 0$ because of this?

We defined our hedging metric so that only hedging, not absorption, can cause the hedging metric to increase. The logic behind why this is true is as follows:

  • If an SAE is too narrow to represent both a parent and child feature in a hierarchy, the SAE will represent only the parent and not the child (the reason for this is detailed in the above response). Thus, any new latent added to the SAE can be a child latent of a latent in the SAE already, but cannot be a parent latent to a latent in the SAE already.
  • Absorption modifies the encoder of the parent latent, and the decoder of the child latent, but, crucially, does not modify the decoder of the parent latent.
  • Thus, any change to existing decoder latents in the SAE when new latents are added that projects onto those newly added latents cannot be due to absorption.

We elaborate on this below:

For absorption to occur, we need a parent feature and a child feature, where the child feature can only fire if the parent feature fires. To begin, assume our SAE has a latent tracking the parent feature (and this latent has a hedged component of the child feature in both encoder and decoder), but not a latent for the child feature. This is the situation in Figure 1a. Next a new latent is added to the SAE, and this latent begins tracking the child feature. The SAE will now begin engaging in absorption. This entails the encoder of the parent latent (our existing latent) adding a negative component of the child feature, so the parent latent “turns off” when the child feature is active. The decoder of our newly added child latent is a mix of the parent feature and the child feature. This is the situation in Figure 1b. Crucially, under feature absorption the decoder of the parent latent does not contain any component of the child feature.

This is precisely why we use the SAE decoder to detect hedging and not the encoder. If we used the SAE encoder, then the negative component of the child feature in the encoder latent of the parent feature would project onto the child latent, and we would not be able to tell if that projection is due to absorption or hedging. However, because we use the decoder and absorption does not cause a change to the decoder latent of the parent feature, any change to the decoder that projects onto the newly added latent cannot be due to absorption.

The best way to visualize this is to imagine adding a new latent to the SAE in Figure 1a, arriving at the SAE in Figure 1b. The newly added latent in Figure 1b is latent 2, tracking the child feature. The difference in the decoder of latent 1 moving from Figure 1a to Figure 1b is only the removal of the hedging of the child feature ($f_2$) - there is no change induced to the parent decoder from absorption.
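To make the measurement concrete, here is a schematic numpy sketch of the quantity described above: the change in the original decoder rows after widening the SAE, projected onto the newly added decoder directions. The function name and normalization here are our own simplification for illustration and do not reproduce Eq. 5-7 exactly.

```python
import numpy as np

def hedging_degree(W_dec_old, W_dec_new):
    """Schematic hedging measure: project the change in the original L
    decoder rows onto the span of the newly added decoder rows.
    (Illustrative form only; the paper's Eq. 5-7 may normalize differently.)
    """
    L = W_dec_old.shape[0]
    delta = W_dec_new[:L] - W_dec_old   # change in the old latents' decoders
    new_dirs = W_dec_new[L:]            # decoders of the newly added latents
    new_dirs = new_dirs / np.linalg.norm(new_dirs, axis=1, keepdims=True)
    proj = delta @ new_dirs.T           # (L, N) projections onto new latents
    return np.linalg.norm(proj, axis=1)  # per-old-latent hedging magnitude
```

For example, an old decoder row that was a mix of a parent and a child direction, and that sheds its child component once a new latent takes over the child feature, yields a positive value; an unchanged decoder yields zero.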

If I get this correctly, the encoder cosine similarity is $cos(f(a), a)$, but I don't know how the decoder cosine similarity is computed. Is this $cos(\hat{a}, a)$? Something else? A clarification about this would ease the reading.

For the toy model experiments, we create "true features" as mutually-orthogonal unit-norm vectors with the same dimension as the SAE's input dimension ($d_{input}$). We control the firing probabilities of each of these "true features", and create training data for the SAE by summing the vectors of each feature that fires. The goal of the SAE, then, is to learn these underlying feature vectors despite only seeing inputs created from sums of feature vectors.

The cosine similarity is simply the cosine similarity between the SAE encoder ($W_{enc} \in \mathbb{R}^{d_{input} \times d_{sae}}$) or decoder matrix ($W_{dec} \in \mathbb{R}^{d_{sae} \times d_{input}}$) and the matrix of feature vectors. So, if $F \in \mathbb{R}^{K \times d_{input}}$ is the set of true feature vectors, the encoder cosine similarity is $cos(W_{enc}, F)$, and the decoder cosine similarity is $cos(W_{dec}^T, F)$. In both cases, we get a matrix of shape $d_{sae} \times K$ of the cosine similarity between every "true feature" and every SAE latent, where $K$ is the number of synthetic "true features", $d_{sae}$ is the number of latents in the SAE, and $d_{input}$ is the input dimension to the SAE. These cosine similarity matrices are plotted as heatmaps in the synthetic data plots.
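As a concrete sketch of this setup (an illustrative simplification of our own, not our exact experiment code; shapes follow the description above, other names are chosen for the example):

```python
import numpy as np

def cos_sim_matrix(A, B):
    """Row-wise cosine similarity: A is (m, d), B is (k, d) -> (m, k)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

d_input, d_sae, K = 16, 4, 3
rng = np.random.default_rng(0)

# Mutually orthogonal unit-norm "true features" via QR decomposition
F = np.linalg.qr(rng.normal(size=(d_input, K)))[0].T   # (K, d_input)

# Training inputs: sums of the feature vectors that fire on each sample
fires = rng.random((1000, K)) < 0.3
X = fires.astype(float) @ F                            # (1000, d_input)

# Random stand-ins for a trained SAE's weights
W_enc = rng.normal(size=(d_input, d_sae))
W_dec = rng.normal(size=(d_sae, d_input))
enc_sims = cos_sim_matrix(W_enc.T, F)   # (d_sae, K)
dec_sims = cos_sim_matrix(W_dec, F)     # (d_sae, K)
```

The resulting `enc_sims` and `dec_sims` matrices are what the heatmaps in the synthetic data plots visualize.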

This setup for synthetic data experiments with SAEs and calculating cosine similarity between SAE latents and true features is standard in the literature and is used in both the original feature absorption work [1] and the original Matryoshka SAEs work [2]. Several other reviewers were also confused by our synthetic data setup, so we will further clarify this in the paper text.

It would be nice to have an analysis of more than one LLM in the experimental section. However, I am aware that this is time-consuming and may not be feasible within the allocated time.

We do include results for both Llama-3.2-1b and Gemma-2-2b in the main experimental results section (Section 4.1) to ensure we can quantify hedging in different SAE families. The results for both models show similar trends. For the balance Matryoshka SAE sweep in Section 6, we only include results for Gemma-2-2b as these sweeps are very expensive to calculate (each datapoint requires training 10 SAEs with different seeds), and thus focused on Gemma-2-2b as this model is commonly used for SAE research.

l. 10 The phrasing "We study it both theoretically in toy models [...]" could be misleading as what is presented is an empirical observation on toy models, not a mathematical proof of feature hedging. Please rephrase this to clarify it.

Thank you for this feedback, we will rephrase this sentence to avoid confusion in the final version of the paper. We will say “We study it both in toy models with known true features [...]”

We hope we have addressed the concerns you raised in the review, and would kindly ask you to adjust your review score while taking the rebuttal into account.

References

[1] Chanin, David, et al. "A is for absorption: Studying feature splitting and absorption in sparse autoencoders." arXiv preprint arXiv:2409.14507 (2024).

[2] Bussmann, Bart, et al. "Learning multi-level features with matryoshka sparse autoencoders." arXiv preprint arXiv:2503.17547 (2025).

Comment

I thank the authors for their detailed response, which alleviates my primary concern. I have updated my score accordingly.

Comment

We are grateful to the reviewer for their detailed engagement and thoughtful questions during the review process, and for raising their score. The discussion with the reviewer has been helpful for us to identify the parts of the paper that can benefit from expanded explanation, and we will address all points raised in the final version of the paper. We believe this will make the paper much stronger as a result, so we thank the reviewer for this feedback.

Review

Rating: 4

The paper investigates feature hedging, a failure mode of sparse auto-encoders (SAEs) in which correlated features collapse into a single latent when the dictionary is narrower than the number of underlying features, breaking monosemanticity. The authors demonstrate this issue theoretically in toy models and contrast it with the feature absorption failure mode. They then introduce hedging degree, a simple retrain-and-compare metric that isolates hedging. They empirically evaluate the impact of feature hedging in SAEs trained on LLMs. They show that feature hedging arises from the reconstruction loss term dominating the sparsity term when the latent width is too narrow. Finally, they propose balance matryoshka SAEs, which weigh the nested-loss terms so that positive (hedging) and negative (absorption) cross-feature interactions cancel, offering relative improvements on SAEBench interpretability metrics such as TPP and SCR.

Strengths and Weaknesses

Strengths

  • The paper provides a very good theoretical analysis of feature hedging and demonstrates its existence.

  • The authors introduce the hedging degree, a new metric that isolates feature hedging.

  • The paper empirically evaluates the impact of feature hedging in SAEs trained on LLMs.

  • The paper proposes balance matryoshka SAEs, which offer good relative improvements on SAEBench interpretability metrics.

  • Good supplementary material with detailed experiments and analysis.

Weaknesses

  • It appears that the hedging degree metric involves costly re-training, limiting practical applicability.

  • There is no clear evaluation against other supervised baselines mentioned in the paper, such as supervised SAEs. With only the relative results, and with the literature the paper cites, it is unclear whether the absolute performance is still lower than supervised SAEs. Given the paper is about unsupervised SAEs, this may not be strictly necessary for the core contribution. However, the comparison would be very useful.

  • Although the paper presents a thoughtful analysis of feature hedging and a neat loss-balancing fix, the absence of evaluations against direct baselines weakens the paper's position on broader impact. Perhaps, authors can discuss this aspect or consider adding some direct comparisons.

  • While the empirical results are compelling, they are limited to specific models (Gemma-2-2B and Llama-3.2-1b). Broader validation across more models or architectures could strengthen the findings. While it is understood that this may not be possible within a short time-frame, authors should at least discuss relevant aspects.

  • The proposed solution is promising but may not universally balance hedging and absorption across the entire feature space, which authors acknowledge. Further exploration of per-latent balancing could strengthen the paper.

Questions

Please see the Strengths and Weaknesses Section for more details.

  1. As stated before, computing the hedging-degree metric is not cheap. Can authors discuss or devise a cheaper proxy that avoids retraining with N new latents, especially exploiting other proxies, such as second-order gradients?

  2. The paper states (line#238) that 'balancing both hedging and absorption with a single $\beta_m$ is not always possible'. Yet the proposed solution still couples every nested-loss weight to one global multiplier. Could the authors clarify whether, after selecting the best multiplier, any individual latents still hedge (or absorb)? Perhaps this can be visualised using a plot illustrating per-latent hedging degree vs. $\beta_m$ (or similar). Although such a plot would not make the situation fully transparent, it would still be useful. Secondly, if such a subset of latents exists, can it be detected early to avoid unnecessary multiplier sweeps? Addressing these points would clarify how far a single global multiplier goes toward balancing hedging and absorption.

  3. The paper acknowledges the divergence between the SCR and TPP curves in the multiplier sweep, but does not provide a clear explanation for this phenomenon. This discrepancy leaves it unclear which metric should carry more weight. Does the pattern persist if the ablation budget k is varied, for example, comparing the results for $k=10$, $k=2$ and $k=20$? Could the authors analyse the root cause of this behaviour (for example, dataset imbalance, probe capacity, or differences in the metric definitions), and add a discussion? Resolving (or at least clarifying) this issue would strengthen the empirical section.

  4. Is the hedging phenomenon also common (or at least measurable) in other modalities, for example, in vision models? If so, how straightforward is it to port the full framework (including metric, and balanced matryoshka loss) to a new setting? In particular, are there any obstacles to finding a suitable weight-balancing schedule for vision SAEs?

Limitations

Yes

Final Justification

The authors have provided very good justifications for some of the questions, along with useful discussion points. Although these do not radically change the overall quality of the paper, they are genuinely useful in improving the standing of the paper from where it was.

Formatting Concerns

No major concerns.

Author Response

We thank the reviewer for their detailed reading of the paper and excellent questions and feedback. We address the concerns of the reviewer below:

There is no clear evaluation against other supervised baselines mentioned in the paper, such as supervised SAEs. With only the relative results, and with the literature the paper cites, it is unclear whether the absolute performance is still lower than supervised SAEs. Given the paper is about unsupervised SAEs, this may not be strictly necessary for the core contribution. However, the comparison would be very useful.

We are not familiar with any existing use of supervised SAEs in interpretability. Our understanding when writing this paper was that SAEs are an unsupervised technique only. Searching for “supervised sparse autoencoders” online did bring up some papers, but none in interpretability. If you know of any papers or benchmarks about using supervised SAEs for interpretability we would be grateful if you could share them with us.

As stated before, computing the hedging-degree metric is not cheap. Can authors discuss or devise a cheaper proxy that avoids retraining with N new latents, especially exploiting other proxies, such as second-order gradients?

This has been a focus of ours as well since submitting the paper, as finding good proxy metrics for hedging and absorption could unlock the ability to automatically tune Matryoshka coefficients to optimal values during training. We have so far not found a metric that has worked reliably enough for this purpose, but are hopeful that we may find something reliable soon. We have so far tried metrics based on the following:

  • Cosine similarity metrics (cos sim between encoder latents, cos sim between decoder latents, cos sim between the encoder and decoder of the same latent, kurtosis and skew of cos sim between latents)
  • Projection of the encoder or decoder onto training samples (we expect that, on average, the less absorption or hedging there is, the closer to a Gaussian centered at zero these projections should appear)
  • The magnitude of the decoder bias $b_{dec}$
  • The magnitude of gradients of each Matryoshka loss level

So far, none of these metrics track the retraining-based hedging metric perfectly, but there are some promising signs. The cos sim, projection, and $b_{dec}$ metrics do seem to change systematically depending on the width of the SAE, but not always in consistent ways and they often are a bit noisy / fluctuate during training. Many of these metrics also monotonically increase or decrease, so we can’t easily pinpoint where absorption and hedging are both minimized, since it seems like this point likely corresponds to a change in the slope of the curve rather than a set max or min. We think there are a few confounding factors we need to separate out as well:

  • Cos sim (and also projections) are impacted by both absorption and hedging. We can disentangle the effects of each of these in our main hedging metric by adding new latents, as this allows us to be sure the newly added latents cannot be parent latents in an absorption relationship to any existing latent in the SAE (see the response to reviewer FUCB for a detailed explanation why). But for existing latents within an SAE we do not have this same guarantee.
  • $b_{dec}$ magnitude does seem to change with hedging and not absorption, but we have noticed a wide variety of $b_{dec}$ norm values in Gemma Scope SAEs [1] that we still do not understand, and thus feel uncomfortable recommending this as a target metric.
  • For the encoder, cosine similarity is likely also influenced by attempts to reduce superposition interference, so we cannot get as much signal from the encoder cosine similarities / projections. This confounding effect on the encoder makes it harder to use cos sim between encoder / decoder as well.

The gradient magnitudes did not work out since loss tends to naturally find a place where the gradients are equally balanced, so we were not sure how to proceed. That being said, we are not experts in this area so it is very possible there are ideas we overlooked. Is there an idea you have about how specifically to use second-order gradients? If there is a technique or paper we should look at we would be excited to try out any ideas you may have!

The paper states (line#238) that 'balancing both hedging and absorption with a single $\beta_m$ is not always possible'. Yet the proposed solution still couples every nested-loss weight to one global multiplier. Could the authors clarify whether, after selecting the best multiplier, any individual latents still hedge (or absorb)? Perhaps this can be visualised using a plot illustrating per-latent hedging degree vs. $\beta_m$ (or similar). Although such a plot would not make the situation fully transparent, it would still be useful. Secondly, if such a subset of latents exists, can it be detected early to avoid unnecessary multiplier sweeps? Addressing these points would clarify how far a single global multiplier goes toward balancing hedging and absorption.

Yes, even after selecting the best multiplier, there are individual latents that hedge or absorb, so it is very possible that just balancing the losses cannot be a full solution to absorption and hedging. We illustrate this in Appendix A.6, using a toy model to show that a perfect balancing is not possible.

That said, it is possible to do much better than just setting all balancing parameters to 1.0, as also illustrated in Appendix A.6. Our claim is that balancing improves the situation, not that it completely solves it.
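For clarity, the weighted nested loss being swept over can be sketched as follows (a schematic numpy version with illustrative names of our own; the actual training objective includes additional terms, such as the sparsity penalty):

```python
import numpy as np

def balanced_matryoshka_mse(x, z, W_dec, b_dec, levels, betas):
    """Schematic weighted Matryoshka reconstruction loss: each nested
    prefix of latents reconstructs x, and each level's MSE is scaled by
    its balance coefficient beta. (Illustrative form only.)"""
    total = 0.0
    for m, beta in zip(levels, betas):
        # Reconstruct using only the first m latents (one Matryoshka level)
        x_hat = z[:, :m] @ W_dec[:m] + b_dec
        total += beta * np.mean((x - x_hat) ** 2)
    return total
```

Setting all betas to 1.0 recovers the plain Matryoshka loss; the sweep varies the betas to trade hedging pressure (from inner levels) against absorption pressure (from outer levels).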

The paper acknowledges the divergence between the SCR and TPP curves in the multiplier sweep, but does not provide a clear explanation for this phenomenon. This discrepancy leaves it unclear which metric should carry more weight. Does the pattern persist if the ablation budget k is varied, for example, comparing the results for k=10, k=2 and k=20? Could the authors analyse the root cause of this behaviour (for example, dataset imbalance, probe capacity, or differences in the metric definitions), and add a discussion? Resolving (or at least clarifying) this issue would strengthen the empirical section.

This discrepancy between TPP and SCR has been very confusing for us as well, as these metrics are both supposed to measure the same thing, which is the SAE’s ability to isolate and disentangle related concepts. We include more results for k=10 in Appendix A.7.3, but the trend is largely similar. We tried modifying these metrics using balanced datasets, but this also did not change the results. The only explanation we can think of is that perhaps the way we are tying all balance coefficients together (due to compute constraints, we could not try every permutation of balance coefficients) might affect certain latents differently than others, and these metrics perhaps use differently affected latents. Or perhaps something about the underlying datasets / tasks in these metrics may interact with hedging and absorption in ways we do not understand, or one of these tasks may be flawed in some way we do not understand.

While this is unsatisfying, we can at least say that the choice of balance coefficients clearly has a large (if somewhat contradictory) impact on both of these metrics, so regardless it is clear that balance coefficients are an important hyperparameter to optimize for Matryoshka SAEs.

Is the hedging phenomenon also common (or at least measurable) in other modalities, for example, in vision models? If so, how straightforward is it to port the full framework (including metric, and balanced matryoshka loss) to a new setting? In particular, are there any obstacles to finding a suitable weight-balancing schedule for vision SAEs?

Hedging requires only that underlying features have correlations, which is certainly applicable to other modalities. For instance, we would expect that in vision models, “grass”, “sun”, and “flowers” are all highly correlated features and will thus induce hedging in narrow SAEs and Matryoshka SAEs. Balancing a vision Matryoshka SAE would require a different validation benchmark than SAEBench to find the best balancing, but our understanding of absorption and hedging should still be exactly the same and apply just as much as with LLMs.

References

[1] Lieberum, Tom, et al. "Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2." arXiv preprint arXiv:2408.05147 (2024).

Comment

Thank you for the detailed author responses. Most of my concerns have been addressed, and I have increased my score accordingly. As stated before, one of my concerns is the relative comparison with other SAEs, but I agree with the authors that when it comes to interpretability, it is hard to establish baselines. As discussed, any relative evaluation would have given a better understanding, but I agree it is not possible.

I recommend the authors to incorporate some of these discussions in the paper to improve its overall quality and readability.

Review

Rating: 4

The authors discovered a phenomenon called "feature hedging," where correlated features are merged into a single SAE latent unit. This effect is particularly pronounced in small-sized SAEs. The paper also explores ways to minimize feature hedging and proposes a coefficient for the Matryoshka SAE.

Strengths and Weaknesses

Strengths:

S1. The motivation is clear, study of the phenomenon is extensive and experiments are thorough.

S2. Analysis builds up on up-to-date research and contributions are novel, including not only analysis of discovered hedging phenomenon but also a method to balance it with absorption.

S3. The experiments are performed with two LLM families, namely Gemma and Llama, which strengthens the generalizability of the results. The analyzed SAEs are of practical dictionary sizes, up to 65k features.

S4. The experimental details are sufficient to follow and reproduce. The paper is well-written and relatively easy to understand.

Weaknesses:

W1. The paper does not include experiments demonstrating how hedging mitigation influences performance or interpretability on realistic downstream tasks, beyond what is presented in SAEBench. It remains unclear whether there exist real use-cases where one could sacrifice the absorption score in favor of better hedging properties.

W2. The conclusion about balancing absorption against hedging relies on a toy-example setting, where a low or high balance ratio β results in a more pronounced hedging or absorption score, respectively. This result is then generalized to an LLM with 2B parameters trained on a highly complicated dataset. The experiments presented in Figure 9 suggest that increasing β results in a lower absorption rate, but it is unclear how the presence of hedging changes; only proxy metrics from benchmarks are available. The claim could be strengthened by adding a discussion of these dynamics.

Questions

Q1: In Section 3 you say that no bias term is added, but in 3.1 you mention the learned decoder bias; could you elaborate on this?

Q2: Have you tried to rule out other causes of feature hedging? Have you identified any plausible alternative explanations?

Q3: Have you analysed the connections of this phenomenon to the feature composition phenomenon that you mention in the Related Work section, where, e.g., the feature "red triangle" is expected to be a sum of two or more features but instead represents a standalone feature? Is this a reverse of hedging?

Limitations

The authors adequately addressed the limitations.

Final Justification

After discussing with the authors and reviewing the feedback from other reviewers, I believe the strengths of this paper outweigh its weaknesses. The work makes a valuable contribution by enhancing understanding of the latents learned in SAEs, particularly in the context of Matryoshka SAEs, and provides insights that could guide future research to improve existing methods.

Formatting Concerns

No concerns.

Author Response

We thank the reviewer for their thorough reading of our work and kind words about the strengths of the paper, including strong motivation, clarity, and thoroughness. We address the questions below:

The paper does not include experiments demonstrating how hedging mitigation influences performance or interpretability on realistic downstream tasks, beyond what is presented in SAEBench. It remains unclear whether there exist real use-cases where one could sacrifice the absorption score in favor of better hedging properties.

SAEBench is currently the best standard benchmark suite for SAE performance on realistic interpretability tasks, going beyond simple sparsity/reconstruction plots. We use SAEBench precisely because we care about performance on realistic downstream tasks. We support efforts in the field to improve SAEBench and expand its suite of benchmarks, but expanding SAEBench is beyond the scope of this paper.

We are also not arguing for sacrificing feature absorption performance in favor of better hedging properties - quite the opposite! Our goal is to create an SAE that neither suffers from hedging nor absorption. Our claim is that since hedging and absorption have opposing effects on the SAE encoder, if we balance them against each other, we can get the effects of both to near zero and thus learn the correct SAE, rather than one that is distorted by either hedging or absorption.

Our work notes that Matryoshka SAEs solve absorption by making hedging much worse, and are thus not a full solution to problems with SAEs. Before our work, the field was unaware that Matryoshka SAEs were making any trade-off at all, and thus had no reason to look for ways to improve them; indeed, the field did not understand that MSE loss can induce problems in SAEs at all, and thus had no explanation for why adding a scaling coefficient to Matryoshka losses is important.

Our proposal for improving Matryoshka SAEs is not to make either hedging or absorption worse at the other's expense, but to balance them against each other so they cancel out. While we acknowledge that our solution of using a single coefficient per Matryoshka level is still imperfect, it is a clear improvement over setting every Matryoshka coefficient arbitrarily to 1.0 (the current standard practice), as we show in the paper. Specifically, changing the Matryoshka coefficients significantly impacts the performance of downstream metrics, as demonstrated with SAEBench.

In the short-term, it is important for the field to be aware that the choice of coefficient for each Matryoshka level is an important hyperparameter that will affect the performance of the resulting SAE.

In Section 3 you say that no bias term is added, but in 3.1 you mention the learned decoder bias; could you elaborate on this?

In section 3, we set up synthetic data experiments by defining a set of “true features” as mutually-orthogonal unit-norm vectors with the same dimension as the SAE’s input dimension. We control the firing probabilities of each of these “true features”, and create training data for the SAE by summing the vectors of each feature that fires. This is simulating the idea that an LLM is representing concepts as directions, and the SAE’s job then is to discover those directions. When we say there is no additional bias term added, we mean that the inputs the SAE is trained on are simply the sum of all firing “true feature” vectors, without any additional offset or bias added. So, if no features are firing, then the input to the SAE will be the 0 vector.

The SAE we train, however, does have a bias term in the decoder (b_dec). We would hope that the SAE learns to set b_dec = 0 to match the zero bias of the data-generating process, but we find that is not the case in Figure 2b.

Several reviewers have also been confused about the details of our toy model setup so we will clarify the description of this setup in the final version of the paper. It is worth noting that our method of creating synthetic data is standard in the literature, and is used in both the original feature absorption work [1] and the original Matryoshka SAEs work [2].
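As an illustrative aid, the data-generating process described above can be sketched in a few lines of numpy. The dimensions, number of features, and firing probabilities below are made up for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: n_features mutually-orthogonal unit-norm "true
# feature" directions in a d_model-dimensional space, obtained here from the
# rows of a random orthogonal matrix. Each feature fires independently with
# its own probability.
d_model, n_features, n_samples = 16, 4, 1000
Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
true_features = Q[:n_features]            # rows are orthonormal feature vectors
firing_probs = np.array([0.25, 0.10, 0.10, 0.05])

# Each training input is the sum of the vectors of the features that fire.
# No extra bias is added, so an all-zero firing pattern yields the 0 vector.
fires = rng.random((n_samples, n_features)) < firing_probs
inputs = fires.astype(float) @ true_features
```

The SAE's job is then to recover the rows of `true_features` from `inputs`; correlations between features (e.g. a hierarchical firing rule) can be added by making the rows of `fires` dependent.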

Have you tried to rule out other causes of feature hedging? Have you identified any plausible alternative explanations?

We have not found any other plausible explanation for feature hedging. In synthetic data experiments, it is easy to demonstrate that hedging occurs as soon as there is any correlation between underlying “true features”, as we show in Section 3, and our analysis of MSE loss curves in Section 3.5 also gives us further confidence in this conclusion.

For our hedging metric to detect hedging when training SAEs on real LLMs, we were careful to construct the metric to exclude any influence of feature absorption, and to be robust against random noise during training.

If there is another phenomenon that can cause a similar breakdown in SAE latents based on SAE width, it is not something that we are aware of in the literature.

Have you analysed the connections of this phenomenon to the feature composition phenomenon that you mention in the Related Work section, where, e.g., the feature "red triangle" is expected to be a sum of two or more features but instead represents a standalone feature? Is this a reverse of hedging?

Indeed, we have experimented with the “red triangle” problem in toy models. We find we can elicit this failure mode if we make the SAE wider than the number of true features of the toy model. When the SAE is wider than it needs to be, this allows the SAE to abuse its excess width to create artificial “combo” latents to further improve its sparsity score.

The “red triangle” problem can be said to be related to hedging in that they are both failure modes based on the width of the SAE, but the loss causing each is different, and the way the failure mode looks in the SAE is also different. The “red triangle” problem is actually a lot more similar to feature absorption, in that it is a way to increase the sparsity of the SAE, and looks similar to feature absorption in that the SAE encoder “turns off” the latents representing “red” and “triangle” when “red triangle” is present, firing the new artificial “red triangle” latent instead. Hedging, in comparison, is caused by reconstruction loss, and results in the SAE mixing components of correlated features together relatively symmetrically in the encoder and decoder, and thus its dynamics are different.

In reality, it seems very unlikely that we will train an SAE wider than the number of “true features” in a real LLM, so we do not expect the “red triangle” problem to be as serious an issue as feature hedging. Previous work by Anthropic has found that even extremely wide SAEs (34M latents) still do not cover all the concepts the model can elicit [3]. To our knowledge, the “red triangle” problem has still never been demonstrated in an SAE trained on a real LLM.

References

[1] Chanin, David, et al. "A is for absorption: Studying feature splitting and absorption in sparse autoencoders." arXiv preprint arXiv:2409.14507 (2024).

[2] Bussmann, Bart, et al. "Learning multi-level features with matryoshka sparse autoencoders." arXiv preprint arXiv:2503.17547 (2025).

[3] Templeton, Adly, et al. "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Transformer Circuits Thread (2024).

Comment

Thank you for your response and the clarifications. My main concern is that the paper does not clearly connect the feature hedging phenomenon to real-world applications. I’m thinking of a task where distorted features resulting from Matryoshka lead to suboptimal performance. Including such an example would strengthen the paper’s contribution.

That said, I still believe the strengths of this work outweigh its limitations. Thank you again for your thoughtful reply.

Comment

Thank you for your engagement with the rebuttal and further question.

My main concern is that the paper does not clearly connect the feature hedging phenomenon to real-world applications. I’m thinking of a task where distorted features resulting from Matryoshka lead to suboptimal performance. Including such an example would strengthen the paper’s contribution.

All the tasks in SAEBench test real-world use-cases of an SAE for interpretability. This includes the ability of the SAE latents to act as classifiers, the ability of the SAE to disentangle concepts, and the interpretability of the SAE latents. Hedging causes the SAE to perform worse on all benchmarks, and this directly means the SAE has degraded performance on real-world tasks. Hedging is like adding a bunch of noise into the dictionary of the SAE - the more noise you add the worse the SAE gets. We will add text to the main body of the paper to clarify that poor performance on SAEBench due to hedging directly means a worse SAE in all respects for all tasks that one may reasonably want to use an SAE for.

Here are some examples of real-world tasks that hedging makes the SAE worse at:

Gender-debiasing hindering capabilities, in biased ways: If you’re using a hedged SAE to gender-debias the LLM for some task (https://arxiv.org/html/2411.18895v1), the gender latents (which are very frequent, and thus parents) will be conflated with many other traits that you don't want to affect. Say you’re scoring applicants for a stereotypically gendered job posting, e.g. a nursing position. Setting the gender vectors to 0 will also attenuate the model's awareness of any nursing-related qualities in the applicant, making the model a worse evaluator for no reason. In fact, doing so might disproportionately disadvantage applicants of one gender, if the nursing-related qualities they display are more gendered than otherwise. Put plainly, if (say) male applicants' nurseness is more male, and female applicants' nurseness is more gender-neutral, then female applicants will be unfairly advantaged in this situation.

Generally entangling things that shouldn't be: say you use an SAE to influence the internal concepts of a NN that generates images with certain properties, such as a generator of the 3D Shapes dataset (https://github.com/google-deepmind/3d-shapes), modified so that some of the concepts are correlated. An SAE that hedges will then have features for mixtures of the concepts, rather than a feature for each of the disentangled concepts separately.

That said, I still believe the strengths of this work outweigh its limitations. Thank you again for your thoughtful reply.

We thank the reviewer for their kind words. We hope we have addressed your concerns raised in the review, and would kindly ask you to adjust your review score while taking the rebuttal into account.

Comment

Thank you for the great example. I think identifying features and adding the described examples to the case studies would be beneficial. Also, shouldn’t this type of noise affect SCR performance? As far as I can tell, the SCR metric does not change significantly in the range where a multiplier ≥ 1 already yields nearly optimal performance in Figure 9d.

Comment

We thank the reviewer for their feedback and suggestions. We will make sure to add more descriptions and examples of the real-world tasks entailed in SAEBench in the final version of the paper. We agree this will make the paper stronger.

Also, shouldn’t this type of noise affect SCR performance? As far as I can tell, the SCR metric does not change significantly in the range where a multiplier ≥ 1 already yields nearly optimal performance in Figure 9d.

We were also surprised by the SCR results, as SCR should theoretically measure the same thing as TPP, yet these metrics react slightly differently to large amounts of hedging. One thing to note about SCR is that it is the noisiest metric, and that noise seems to be highest when the multiplier is ≥ 1. We suspect that hedging makes the results of SCR more unreliable (increasing the variance of the metric) rather than breaking it outright. The view of hedging as adding noise to latents seems consistent with this interpretation of the SCR results as well.

Thank you for your continued engagement with the work, let us know if there are any further questions!

Review
4

The paper introduces an issue with SAEs, referred to as feature hedging, in which the SAE merges correlated features into a single latent, leading to degraded interpretability. The authors explain the mechanism of feature hedging: it emerges when the width of the SAE is narrow relative to the number of underlying true features. They also demonstrate that the MSE objective contributes to feature hedging, as shown through an analysis of the loss curves. Furthermore, the authors demonstrate that feature hedging indeed occurs in practical settings with Gemma and Llama. To prevent the emergence of feature hedging, the authors introduce a scaling coefficient into the Matryoshka SAE objective and show that the proposed method performs well on synthetic data and SAEBench.

Strengths and Weaknesses

Strengths:

  1. This paper identifies the emergence of the feature hedging problem and explains its underlying mechanism.
  2. A metric for feature hedging is presented.
  3. The authors propose a method to prevent the emergence of feature hedging.
  4. The experimental results show that the proposed method outperforms the baselines on synthetic data and SAEBench.

Weaknesses:

Some parts of the content were difficult to follow:

  1. The figures (e.g., Figure 1) are difficult to understand. Please clarify how the cosine similarity is computed. Are these values the cosine similarities between the weight matrices and the input vector (not latents)?
  2. The definition of f appears to be inconsistent. In the introduction, f represents the true feature, while in Section 2, it is defined as the encoder’s output (latents). Please clarify or unify the notation.
  3. I understand that hedging and absorbing have opposite effects, and aligning them might diminish both. But, it is unclear why introducing the scaling coefficient would lead to such alignment. Could you please clarify this point?
  4. The authors state, “Under feature absorption … arriving at Figure 1b.” (Lines 153-156). While the example in Figure 1 supports this claim, it is unclear whether this observation holds more generally. Could the authors clarify whether this is a general trend or a case-specific finding?
  5. Based on the analysis, the hedging symmetrically affects the encoder and decoder. But, Figure 8 (a) does not appear to reflect this symmetry. My understanding could be incorrect. Could the authors clarify how to interpret this figure in terms of the symmetry?
  6. I recommend using bold font for vectors and matrices (e.g., Eq. (1)).
  7. I recommend using math font for norms such as L1 and L0 (e.g., line 68).
  8. Please clearly define the meaning of the square bracket notation, especially δ_L[i], as used in Eq. (7).
  9. It is unclear why projecting the difference of the updated original latents onto the added N latents in Eq. (7) mitigates noise.
  10. As noted in the limitations, the proposed hedging metric is costly and may be less practical.

Questions

My concerns and questions are listed in the Weaknesses section.

Limitations

yes

Final Justification

My concerns have been addressed in the author response, and I have increased my score accordingly.

Formatting Concerns

none

Author Response

We thank the reviewer for their response and for the suggestions to improve the clarity of the paper, which we will incorporate into the camera ready. We hope to address all of your concerns with our answers to the questions raised below:

The figures (e.g., Figure 1) are difficult to understand. Please clarify how the cosine similarity is computed

For all synthetic data experiments, we define “true features” as mutually-orthogonal unit-norm vectors with the same dimension as the SAE input dimension. We control the firing probabilities of each of these “true features”, and create training data for the SAE by summing the vectors of each feature that fires. The goal of the SAE, then, is to learn these underlying feature vectors. The cosine similarity is simply the cosine similarity between rows of the SAE encoder or decoder matrix and the matrix of feature vectors. This setup for synthetic data experiments with SAEs and calculating cosine similarity between SAE latents and true features is standard in the literature and is used in both the original feature absorption work [1] and the original Matryoshka SAEs work [2]. We will further clarify this in the paper text.
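As a minimal sketch of this computation (function name and array conventions are ours, not the paper's), the plotted values can be obtained by normalizing the rows of the SAE weight matrix and of the true-feature matrix and taking their inner products:

```python
import numpy as np

# Cosine similarity between each SAE latent direction and each true feature.
# W rows are SAE encoder or decoder latent directions; F rows are the
# ground-truth (orthonormal) feature vectors. Entry [i, j] of the result is
# cos(latent i, feature j), as plotted in the toy-model figures.
def cosine_similarity_matrix(W, F):
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)
    F_hat = F / np.linalg.norm(F, axis=1, keepdims=True)
    return W_hat @ F_hat.T
```

A perfectly trained SAE with one latent per feature would yield (up to a permutation of rows) the identity matrix; off-diagonal mass indicates mixing such as hedging.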

The definition of f appears to be inconsistent. In the introduction, f represents the true feature, while in section 2, it is defined as the encoder’s output (latents). Please clarify or unify the notation.

Thank you for pointing this out, we will replace f in section 2, representing the SAE hidden state, with z to avoid confusion.

I understand that hedging and absorbing have opposite effects, and aligning them might diminish both. But, it is unclear why introducing the scaling coefficient would lead to such alignment. Could you please clarify this point?

Hedging and absorption both pull the SAE encoder in opposite directions in Matryoshka SAEs, as shown in Figure 8. However, there is no guarantee that these pull with equal force and cancel out when Matryoshka levels are naively summed together as in a standard Matryoshka SAE (equivalent to every β_m = 1). Adding a scaling coefficient β_m to every Matryoshka loss gives us a lever to tune the relative pull of absorption and hedging, and hopefully roughly balance them so they cancel out. Increasing β_m for inner losses increases the pull of hedging relative to absorption, and lowering inner β_m increases absorption relative to hedging. Figure 8 demonstrates this trade-off for extreme values of β_m in a simple synthetic setting. When they match, the SAE has neither absorption nor hedging.

In general, when multiple losses are summed, it is very standard to add a scaling coefficient to each loss. We found it interesting that the original Matryoshka SAEs work [2] did not do this, most likely due to not understanding that there is in fact a theoretical trade-off going on between absorption and hedging, and thus that tuning the Matryoshka loss levels is important to get this balance correct. In fact, the appendix of the Matryoshka SAEs paper mentions trying to detach the losses between each Matryoshka level (equivalent to Figure 8(a) in our work), but notes that they did not pursue this as the downstream metrics were worse than naively summing the losses from each level, for reasons they did not understand. Given our understanding of feature hedging, we now understand the theoretical grounding for this finding in the original Matryoshka SAEs work as well: detaching losses between Matryoshka levels makes hedging extremely bad.
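The balanced loss itself can be sketched in a few lines (assuming squared-error reconstruction per level, as in standard Matryoshka SAEs; names are ours for illustration):

```python
import numpy as np

# Sketch of a balanced Matryoshka reconstruction loss. recons[m] is the
# reconstruction of input x using only the m-th latent prefix (innermost
# first); betas[m] is the per-level scaling coefficient. The standard
# Matryoshka SAE is recovered with betas = [1.0] * num_levels.
def balanced_matryoshka_mse(x, recons, betas):
    return sum(b * np.mean((x - r) ** 2) for b, r in zip(betas, recons))
```

Raising the inner-level betas strengthens the hedging pull; lowering them strengthens absorption, so sweeping these coefficients is how the two effects can be balanced against each other.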

The authors state, “Under feature absorption … arriving at Figure 1b.” (Lines 153-156). While the example in Figure 1 supports this claim, it is unclear whether this observation holds more generally. Could the authors clarify whether this is a general trend or a case-specific finding?

This is a general characteristic of feature absorption, and has been demonstrated in all toy models of feature absorption that we are aware of, including all variations of feature absorption explored in the appendix of the original feature absorption work [1]. In order to avoid firing the parent latent when the child latent is active, the SAE must add a negative component of the child feature to the encoder of the parent. The decoder latent tracking the parent feature does not encode any component of the child feature under any model of feature absorption of which we are aware.

Based on the analysis, the hedging symmetrically affects the encoder and decoder. But, Figure 8 (a) does not appear to reflect this symmetry. My understanding could be incorrect. Could the authors clarify how to interpret this figure in terms of the symmetry?

Figure 8 shows a Matryoshka SAE whose inner level has width 1 (this inner level contains just latent 1). As such, hedging affects latent 1 of the SAE in Figure 8(a): this inner level of the Matryoshka SAE is still forced to reconstruct the input on its own, and we do indeed see a symmetric hedging pattern in the encoder and decoder for latent 1 (features 2-4 are positively mixed into latent 1). Latents 2-4 form the outer level of the Matryoshka SAE and must adapt as best they can to latent 1 being broken. Our claim about hedging applies only to latent 1 in Figure 8(a), as this latent is the inner level of the Matryoshka SAE and its width (1) is narrower than the number of true features (4). We will add a marker indicating the inner Matryoshka level in these plots in the final version of the paper to make this clearer.

I recommend using bold font for vectors and matrices (e.g., Eq. (1)). I recommend using math font for norms such as L1 and L0 (e.g., line 68).

Thank you for these suggestions, we will make these changes in the final version of the paper.

Please clearly define the meaning of the square bracket notation, especially δ_L[i], as used in Eq. (7).

These brackets just mean selecting a specific row or range of rows from a 2d matrix. We will make this more clear in the final version of the paper.

It is unclear why projecting the difference of the updated original latents onto the added N latents in Eq. (7) mitigates noise.

Projecting the difference of the updated original latents onto the added N latents is how we are detecting hedging - it is not intended to mitigate noise. If there is no hedging occurring, then adding new latents to an existing SAE should have no effect on the existing latents, as these latents will be simply tracking different features. However, if hedging does occur, then we should expect components of the newly added latents to be removed from existing latents (the hedging should be “pulled out” of the existing latents), and this should appear as the change in existing latents projecting strongly onto the newly added latents.

We try to mitigate noise by adding multiple latents at the same time (64) rather than adding latents individually, effectively averaging the hedging effect of these 64 latents. We will reword this to make it more clear that adding multiple latents together is done to reduce the noise of the metric (although the metric is not very noisy to begin with, as shown by the very thin error bars in the hedging plots in the paper).
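A simplified sketch of this projection idea (our own simplified variant for illustration, not the paper's exact Eq. (7)):

```python
import numpy as np

# Project the change in an original latent onto the span of the N newly
# added latents. If hedging is present, the change should have a large
# component in this span, as hedged material is "pulled out" of the latent;
# with no hedging, the projected norm should be near zero.
def hedging_projection_norm(delta, W_new):
    # delta: (d_model,) change in one original decoder latent after fine-tuning
    # W_new: (N, d_model) directions of the newly added latents
    Q, _ = np.linalg.qr(W_new.T)        # orthonormal basis for the new span
    return np.linalg.norm(Q.T @ delta)  # magnitude of delta within that span
```

Averaging this quantity over latents (and over the 64 latents added at once) gives a noise-robust aggregate, in the spirit of the hedging degree described above.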

We hope we have addressed your concerns raised in the review, and would kindly ask you to adjust your review score while taking the rebuttal into account.

References

[1] Chanin, David, et al. "A is for absorption: Studying feature splitting and absorption in sparse autoencoders." arXiv preprint arXiv:2409.14507 (2024).

[2] Bussmann, Bart, et al. "Learning multi-level features with matryoshka sparse autoencoders." arXiv preprint arXiv:2503.17547 (2025).

Comment

Thank you for the detailed author response. My concerns have been addressed, and I have increased my score accordingly. I appreciate the novelty of this work, but the initial submission lacks sufficient detail. The explanations provided in the rebuttal are essential for clarity, and I strongly recommend that the authors incorporate this information into the manuscript to improve its overall quality and readability.

Comment

We are grateful to the reviewer for raising their score, and for their thoughtful engagement during the review process. We will focus on addressing all the points of clarity brought up in the review in the final version of the paper. We believe this feedback will make the paper much stronger as a result, so we deeply thank the reviewer for providing this feedback and helping to identify the parts of the paper that are unclear.

Comment

We would just like to politely check: did the reviewer forget to revise their score after stating they would?

Comment

Dear reviewers,

Thank you for your work reviewing this paper. The authors have provided detailed rebuttals. Please engage urgently with the rebuttal by updating your review where necessary and by providing a final justification, asking for further clarification where needed.

Thank you, the AC

Final Decision

This paper identifies and studies a phenomenon termed "feature hedging," where a sparse autoencoder (SAE) with insufficient capacity (too few latents) trained on correlated features learns to represent multiple features within a single latent direction, compromising monosemanticity. The authors contrast this with the previously identified problem of "feature absorption".

The authors provide clear, intuitive experiments with 1- and 2-latent SAEs showing how hedging arises under hierarchical, correlated, and anti-correlated features. They define the "hedging degree" to quantify the effect in larger SAEs trained on Gemma and Llama models, ostensibly showing severe hedging in narrow SAEs that persists even in very wide ones. They finally propose a modified "balanced Matryoshka SAE" to mitigate both issues; its key idea is to apply a scaling coefficient to the loss for each Matryoshka level, which should allow balancing hedging and absorption.

However, the core concept is somewhat obvious in hindsight: an undercomplete autoencoder will be forced to compress information, and correlated information will be compressed together. While the authors give concrete illustrations of this phenomenon in the context of SAEs on transformers, the resulting insight is limited. While initial concerns about the definitions of hedging were resolved during the discussion period, the proposed solution carries a huge computational cost due to the required fine-tuning of the multiplicative parameters in the SAEs, as noted by the authors and reviewers FUCB and qeRK. This complication lowers the applicability and broader impact of the method. Finally, the method does not convincingly close the performance gap with supervised baselines that motivated the work, so I ultimately do not recommend acceptance at NeurIPS.