PaperHub
7.0 / 10
Poster · 3 reviewers
Ratings: 7, 7, 7 (min 7, max 7, std 0.0)
Confidence: 3.7
COLM 2025

Teach Old SAEs New Domain Tricks with Boosting

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

We propose a method to add domain-specific features into a pre-trained SAE.

Abstract

Keywords
LLM, Interpretability, SAE

Reviews and Discussion

Review
Rating: 7

This paper proposes SAE Boost, a method for augmenting pretrained Sparse Autoencoders (SAEs) with domain-specific information without retraining or modifying the base model. The idea is to train a residual SAE to predict the reconstruction error of the original SAE on domain-specific data. At inference time, the residual output is added to the base reconstruction, aiming to recover domain-relevant features that the general-purpose SAE misses.
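For concreteness, the scheme can be sketched roughly as follows (a minimal sketch; module names, shapes, and the training loop are illustrative and not taken from the paper's code):

```python
import torch

def train_step(base_sae, residual_sae, optimizer, x):
    """x: [batch, d_model] LLM activations from domain-specific text.
    The base SAE is frozen; the residual SAE is fit to its reconstruction error."""
    with torch.no_grad():
        error = x - base_sae(x)        # what the base SAE fails to explain
    pred = residual_sae(x)             # feeding x to the residual encoder is an
                                       # assumption of this sketch
    loss = ((pred - error) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def reconstruct(base_sae, residual_saes, x):
    """At inference, residual outputs are added to the base reconstruction;
    several independently trained residual SAEs can be composed the same way."""
    return base_sae(x) + sum(r(x) for r in residual_saes)
```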

The method is simple and modular, and the empirical results are strong: across three domains (Russian, Chemistry, UN Debates) and two models (Qwen and Llama), the residual SAEs improve both explained variance and LLM cross-entropy while having minimal effect on general-domain performance. Multiple residuals can be composed without interference, which is a nice property.

Reasons to Accept

S1- Adding a residual SAE trained on domain-specific reconstruction error is a lightweight way to enhance feature coverage. It’s modular, avoids catastrophic forgetting, and requires no changes to the base SAE.

S2- The results are consistent: e.g., on Russian texts, EV improves by 59% and CE drops by over 50%. Similar improvements hold across the board. The method generalizes well and doesn’t degrade general performance (>99% retention).

S3- The authors show you can train multiple residual SAEs independently and combine them at inference. The fact that performance doesn’t collapse under this setup is compelling—it suggests the residuals don’t conflict with one another or the base.

S4- Cosine similarity distributions show that the residual features are less redundant with the base SAE. t-SNE plots show domain-specific features cluster naturally (e.g., Slavic languages). These are soft signals, but they help support the method’s core claim.

S5- The authors evaluate against extended SAEs (both random and most-active feature init), SAE stitching, and full fine-tuning. SAE Boost consistently performs better or similarly on domain metrics, and much better on general-domain retention.

Reasons to Reject

W1 - The method is described as “boosting” but has no formal connection to boosting theory. There’s no bias–variance framing, no connection to residual learning in sparse coding, no deeper discussion of why modeling the reconstruction error helps recover meaningful features. It’s clearly effective, but it would be good to understand why.

W2 - The residual SAE omits the decoder bias. The paper says this ensures contributions only when necessary, but this isn’t tested. What happens if the bias is included? The residual SAE uses k=5 (vs k=50 in the base SAE). Why 5? Is this tuned? Does performance improve at 10 or 20? Some sensitivity analysis would go a long way here. The regularization term λ is also never discussed. It would be useful to know how important the sparsity penalty is in the residual model.

W3 - The “Extended SAE (most active init)” is not well-defined. What does “most active” mean—highest mean activation? Frequency of firing? Cosine alignment with errors? In SAE Stitching, what threshold is used to select changed features? These details matter for fair comparison.

W4 - The examples in Figure 4 are good, but mostly anecdotal. Could the authors quantify how many residual features align with domain-specific concepts? E.g., match features to ontology labels or use external probes. The t-SNE plots are visually clean but sensitive to hyperparams like perplexity—no mention of those choices is made.

Questions for the Authors

Q1- Is there interference when residuals start overlapping semantically? Does performance saturate or degrade? At what point does the added sparsity outweigh the benefit?

Q2- Did you try training a residual SAE on out-of-domain data? Would that hurt performance, or just have no effect?

Q3- Is there any explicit mechanism ensuring residual features don’t overlap with base features? Would an orthogonality regularizer help?

Q4- What happens when multiple residual SAEs disagree—e.g., both try to "correct" the same error in different ways?

Comment

We appreciate your constructive comments and provide responses addressing each concern:

W1: Connection to Boosting Theory

SAE Boost, while not formally derived from boosting theory, aligns conceptually with residual learning principles and targeted correction of systematic reconstruction errors. By training exclusively on the base SAE's reconstruction error, our residual SAE directly addresses high-bias residuals without affecting stable, general-domain representations. This parallels residual sparse coding, and empirical evidence demonstrates effective, complementary feature recovery. We agree that formal bias–variance framing could enhance theoretical grounding and will explore this further in future work.
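Schematically, and omitting the sparsity constraint, the residual objective amounts to fitting the base SAE's reconstruction error on domain data while the base SAE stays frozen (our shorthand notation here, not a formula from the paper):

L_{\text{res}} = \mathbb{E}_{x \sim \mathcal{D}_{\text{domain}}} \left\| \left( x - f_{\text{base}}(x) \right) - f_{\text{res}}(x) \right\|_2^2, \qquad \hat{x} = f_{\text{base}}(x) + f_{\text{res}}(x)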

W2: Technical Design Choices

  • Decoder Bias: Intentionally omitted so that the residual SAE contributes exactly zero where domain-specific features are unnecessary, preserving base SAE performance on general-domain data. Ablation studies confirm that including or omitting the decoder bias has no measurable effect on EV.
  • Choice of k: Sensitivity analysis indicated minor domain performance gains at higher k, but at a significant cost in sparsity and interpretability. We selected k=5 to balance strong domain performance, minimal general-domain disruption, and interpretability (see the table below and the sketch following this list).

Top-k | General EV | General L0 | Domain EV | Domain L0
5     | 0.719      | 50         | 0.774     | 61
10    | 0.719      | 52         | 0.781     | 66
20    | 0.719      | 56         | 0.788     | 76
50    | 0.721      | 72         | 0.798     | 106

  • Regularization λ: Effectively λ=0, since BatchTopK explicitly enforces sparsity. The sparsity term remains for consistency with standard SAE practice but does not require tuning.
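For concreteness, a minimal sketch of the residual forward pass under these design choices (illustrative code, not our exact implementation):

```python
import torch
import torch.nn as nn

class ResidualSAE(nn.Module):
    """Illustrative residual SAE: BatchTopK sparsity (k=5) and no decoder bias,
    so inputs that activate no residual feature contribute exactly zero output."""
    def __init__(self, d_model: int, n_features: int, k: int = 5):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, n_features)                   # encoder (with bias)
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        # note: no decoder bias term

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: [batch, d_model]
        pre = torch.relu(self.enc(x))                                # [batch, n_features]
        # BatchTopK: keep the k * batch_size largest activations across the whole batch
        threshold = pre.flatten().topk(self.k * x.shape[0]).values.min()
        acts = torch.where(pre >= threshold, pre, torch.zeros_like(pre))
        return acts @ self.W_dec                                     # predicted base-SAE error
```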

W3: Clarifying Baselines

  • Extended SAE (most active init): The added features are initialized from the base SAE features with the highest mean activation on domain data.
  • SAE Stitching: Candidate features are scored by their maximum cosine similarity to the original SAE's features, and the threshold is set to keep the top-k most-changed features. An illustrative sketch of both selection rules follows.
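Illustrative selection rules for both baselines (assumed tensor shapes and names; not our exact code):

```python
import torch
import torch.nn.functional as F

def most_active_init_ids(base_acts: torch.Tensor, n_new: int) -> torch.Tensor:
    """base_acts: [n_tokens, n_features] base-SAE activations on domain data.
    Returns the n_new feature indices with the highest mean activation."""
    return base_acts.mean(dim=0).topk(n_new).indices

def stitching_ids(W_dec_orig: torch.Tensor, W_dec_ft: torch.Tensor, k: int) -> torch.Tensor:
    """For each fine-tuned feature, take its maximum cosine similarity to any
    original feature; the k features with the lowest maxima are the 'most changed'."""
    sims = F.normalize(W_dec_ft, dim=-1) @ F.normalize(W_dec_orig, dim=-1).T  # [n_ft, n_orig]
    return (-sims.max(dim=-1).values).topk(k).indices
```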

W4: Quantitative Interpretability

We agree quantitative metrics would strengthen interpretability claims. While Figure 4 provides anecdotal support, we further substantiate interpretability using quantitative metrics following Paulo et al. We evaluated all 1024 residual SAE features on the chemistry dataset and randomly sampled 1024 base SAE features evaluated on the FineWeb-Edu dataset:

SAE          | Mean (median) detection score | Mean (median) fuzzing score
Base SAE     | 0.67 (0.65) ± 0.16            | 0.64 (0.60) ± 0.12
Residual SAE | 0.75 (0.77) ± 0.17            | 0.68 (0.68) ± 0.12

Q1: Semantic Interference

Experiments confirm performance degradation (~7-12% EV loss) when combining semantically related residual SAEs (e.g., Romance or Slavic languages). Sparsity remains manageable, suggesting summation approaches are optimal primarily for distinct domains. This opens promising directions for future adaptive methods in residual SAE stitching.

Setup                  | Portuguese EV | Italian EV    | Polish EV      | Russian EV     | General EV
Single Residual SAE    | 0.735         | 0.742         | 0.748          | 0.725          | 0.719
Combined Residual SAEs | 0.679 (-7.6%) | 0.688 (-7.3%) | 0.656 (-12.3%) | 0.635 (-12.4%) | 0.719

Q2: Out-of-Domain Impact

We confirmed that residual SAEs trained on domain-specific data do not degrade performance on general text. In all cases, residual features remain inactive on out-of-domain inputs, ensuring non-interference. Performance drop is consistently <1% across tasks.

Q3: Orthogonality Mechanism

Implicit orthogonality naturally emerges from training the residual SAE exclusively on reconstruction errors. To empirically validate this, we introduced an explicit orthogonality regularization term:

L_{\text{orth}} = \operatorname{mean}_i \left( \max_j \cos\left( W^{\text{res}}_i,\, W^{\text{base}}_j \right) \right)

This regularization had no measurable effect on EV—supporting our claim that the residual objective inherently encourages orthogonal representations. Additionally, Fig. 2 demonstrates residual features are more distinct compared to those learned through standard FT.
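A minimal sketch of this penalty, assuming decoder weight matrices with one row per feature (illustrative, not our training code):

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(W_res: torch.Tensor, W_base: torch.Tensor) -> torch.Tensor:
    """W_res: [n_res, d_model], W_base: [n_base, d_model] decoder rows.
    For each residual direction, penalize its highest cosine similarity
    to any base decoder direction, averaged over residual features."""
    cos = F.normalize(W_res, dim=-1) @ F.normalize(W_base, dim=-1).T  # [n_res, n_base]
    return cos.max(dim=-1).values.mean()
```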

Q4: Conflicting Corrections

Residuals can either overlap, leading to over-correction and degraded performance, or complement, correcting distinct aspects of an input. Our experiments (e.g., Russian chemistry papers) demonstrate effective complementarity: the language residual handles syntax, while the domain residual corrects terminology. In distinct-domain settings, additive residuals consistently perform well (Tables 4, 8, 9). However, overlapping corrections remain an open challenge.

We sincerely thank the reviewer for their thoughtful and constructive feedback. We hope our responses have clarified the key contributions of our work and adequately addressed your concerns. We would be grateful if you would consider revisiting your score in light of these clarifications.

Comment

Hi, thanks for your response. I adjusted my score accordingly!

Review
Rating: 7

This work proposes a method for domain adaptation of sparse auto-encoders. While SAEs have been used for domain adaptation, or can be re-trained for specific domains, this approach instead improves domain-specific capabilities of SAEs for interpretability without retraining them. It works by modeling the reconstruction error on the target domain and it is found to improve reconstruction quality and LLM cross-entropy without much performance loss. Experiments are conducted with three different domains and partially also with different languages on two different LLMs. An ablation study for training data size is performed and multi-domain adaptation is also explored. Results are compared to the existing SAE Stitching approach and some advantages are found.

Updated score after author response.

Reasons to Accept

The approach indeed seems novel, it saves resources and aids interpretability. Domain-specific data is a common problem in real-world applications where there might also not be the resources for full retraining so this seems like an approach that could be beneficial for quite a few real-world use cases.

The set of experiments is fairly comprehensive with three different metrics, two models, consideration for general domain performance, multi-domain adaptation and information on how much training data is needed for the approach. Examples are also shown (Fig 4). The contribution therefore has good substance.

The approach's capability for multi-domain adaptation without much of a performance hit stands out compared to previous approaches.

Reasons to Reject

The claim that the approach "discovers meaningful domain-specific concepts" is illustrated with examples but not really evaluated thoroughly. The chosen evaluation metrics are briefly motivated but they all pertain to next word prediction performance and feature utilization, without any evaluation that directly pertains to interpretability, which is often the end goal of these methods. The work could be strengthened with some human evaluation by a domain expert of the concepts that emerge from the method for at least one of the domains - interpretability in the end involves utility to human interpreters, but there is no human evaluation of specific feature outcomes here. The linguistic typology example is also a nice illustration, though.

The related work section is very short and doesn't fully contextualize the work. SAE Stitching is briefly mentioned, but where does the Extended SAE baseline come from? Is that also based on related work or did the authors invent it? There is also little discussion of or comparison to alternative approaches such as retraining a SAE on in-domain data for contextualization (even though direct comparison is not needed as it's a different task).

The conflation of 'languages' with 'domains' is not so obvious to me. There is a clear difference in that languages rely on completely different subtoken features. Perhaps it should be specified why the method works for both of these things and whether there is any difference in how the evaluation metrics work for domain adaptation and for language adaptation.

It is not clear whether code will be made available for this method.

Comment

Thank you for your valuable review! We address the mentioned weaknesses individually (enumerated as W1–W4):

W1: Meaningful Domain-Specific Concepts

To address your concern about evaluating the meaningfulness of domain-specific concepts, we first demonstrate that the domain-specific features identified by our method do not appear in the general-domain SAE. Specifically, we present the top-3 most similar general-domain SAE features according to decoder cosine similarity:

SAE Boost Feature                  | Most Similar General-Domain Features
Oxygen in chemical contexts        | (1) Numerical thresholds; (2) Names of people; (3) Risk-related phrases
Chemical and thermal stability     | (1) Durability references; (2) Legal protections; (3) Safety responsibilities
Supramolecular structure formation | (1) Protein databases; (2) Scientific measurements; (3) Mineralogy terms
Chemistry energy values            | (1) Chemical reaction terms; (2) Calibration concepts; (3) Physics discussions
Hydrogen chemistry                 | (1) Personal relationships; (2) Compound applications; (3) Synthetic processes

Additionally, the SAE Boost-learned features remain largely inactive when evaluated on general-domain data (Tables 2 and 4). Together, these results indicate that our approach discovers meaningful domain-specific features absent from the general-domain SAEs.

We further support our interpretability claims using quantitative metrics following the methodology of Paulo et al. (https://arxiv.org/abs/2410.13928). We evaluated all 1024 residual SAE features using examples from the chemistry dataset, and randomly sampled 1024 features from the base SAE using fineweb-edu examples:

SAE          | Mean (median) detection score | Mean (median) fuzzing score
Base SAE     | 0.67 (0.65) ± 0.16            | 0.64 (0.60) ± 0.12
Residual SAE | 0.75 (0.77) ± 0.17            | 0.68 (0.68) ± 0.12

W2: Related Work and Baselines

The Extended SAE baseline is our own contribution, introduced as a natural and straightforward method for domain adaptation—one practitioners might reasonably attempt absent more specialized methods. For thorough comparison, we implemented two variants of Extended SAE using different initialization strategies: one based on the most active features from the original SAE and another with random initialization.

We also acknowledge the importance of comparison with full fine-tuning. Hence, we conducted full fine-tuning of the original SAE on domain-specific data, stopping precisely when its domain-specific explained variance (EV) matched that of SAE Boost. This ensures a fair comparison at equivalent performance levels. Additionally, we visualized multiple checkpoints during fine-tuning to illustrate the evolving performance trade-off between domain-specific and general-domain data. The resulting Pareto front can be viewed here: https://ibb.co/0ph5d86d.

These experiments enrich our baseline comparisons and highlight SAE Boost’s key advantage: efficient domain adaptation with minimal impact on general-domain knowledge, unlike full fine-tuning, which introduces significant trade-offs. We will revise our Related Work section to emphasize these contrasts clearly and further explain the rationale behind the Extended SAE baseline.

W3: Language versus Domain Adaptation

While we agree that languages and topical domains differ conceptually, we found that both adaptations exhibit similar underlying shifts in activations and feature distributions. This similarity allows our method and evaluation metrics to apply effectively in both scenarios. Specifically, we treat each language as a distinct domain, where LLM activations systematically shift due to differences such as subtoken distributions and grammatical structures, benefiting from targeted residual modeling.

Our cross-lingual experiments in Appendix A.1 (Table 5) demonstrate consistent improvements in both Explained Variance (EV) and LLM Cross-Entropy (CE) across seven languages—including Japanese, which significantly differs from the original training languages. These findings confirm our metrics' generalization capability for both language and topical domain adaptations.

We will explicitly clarify in Section 3 why modeling reconstruction error effectively captures language-specific and domain-specific features and further discuss how EV, CE, and sparsity metrics reflect adaptation efficacy in each setting.

W4: Availability of Code

Addressing your concern, we provide an anonymous link to our source code here: https://anonymous.4open.science/r/SAEBoost-5C7E/README.md. Additionally, we intend to release a public GitHub repository with comprehensive, user-friendly documentation upon acceptance.

We appreciate your thoughtful feedback and constructive suggestions. If our responses adequately address your concerns, we kindly ask you to reconsider your evaluation score.

Comment

Thanks for the clarifications and additional experiments. I still feel that domain expert evaluation of the domain-specific features would strengthen the contribution, but I'll update my score.

However, I don't have an edit button on my review currently. I will ask for help with this issue. Edit: It is fixed and my score is updated.

Review
Rating: 7

The authors propose SAE Boost, a method for extending a general-domain SAE with domain-specific sparse features for mechanistic interpretability. They propose to model the residuals of the general-domain SAE on the domain-specific dataset. They conduct experiments on splits of FineWeb and then apply it to downstream interpretability analyses.

Reasons to Accept

  • The paper is cogent and well-presented, with practical analyses demonstrating the technique.
  • The choices of the datasets are well-informed.
  • The technique is novel.

Reasons to Reject

  • Taking a step back and looking at the big picture, I don't think the problem is particularly large in scope. While it is likely interesting to a small group of researchers, if the domain of the text is known ahead of time as is often the case for mechanistic interpretability, what is preventing researchers from training two SAEs -- one for general-domain texts and one for the specialized domain? Computationally, it wouldn't be different from the current proposal, as it likewise involves multiple SAEs. Thus, the main application scenario is if the domain isn't known beforehand at inference time.

  • A missing baseline would be mixing general-domain text with the particular specific-domain text for full fine-tuning, which is a simple way to mitigate catastrophic forgetting.

  • It isn't always clear whether SAE Boost is Pareto-better than the baselines in terms of e.g. the domain-specific and the general-domain EV. This trade-off wasn't investigated in the paper; for example, it is readily foreseeable that full fine-tuning yields different operating points/trade-offs between the domain-specific EV and the general EV depending on how early the fine-tuning is stopped. When plotted on a line, how do these operating points compare to SAE Boost? I think one important point for the paper to make is that SAE Boost achieves the Pareto-optimal boundary compared to baselines, e.g., what would the general EV for full fine-tuning be if it were terminated when the domain-specific EV matched SAE Boost?

EDIT: The rebuttal addresses these concerns.

Questions for the Authors

  • Would the general-domain SAE discover the same features in the interpretability analyses section? The authors hedge this as "Together, these features show that our approach can discover meaningful domain-specific concepts that might be overlooked by general-purpose SAEs," but I don't think the analyses definitively show whether that is the case.

Comment

Thank you for your thoughtful and constructive review. We appreciate your detailed feedback and are grateful for the opportunity to address your concerns. Below, we respond point-by-point to the issues you raised, labeled as W1–W3 (Weaknesses) and Q1 (Question), and include clarifying details and new experimental results to strengthen the paper’s contributions.

W1: Scope and Motivation

We acknowledge that training separate SAEs for different domains is a viable approach when both domains are known ahead of time and computational resources are not a constraint. However, there are several practical scenarios where this is not feasible:

  • Model deployment constraints: Deploying multiple SAEs increases the memory footprint and complexity of inference-time logic. Maintaining and switching between multiple SAEs can be costly and brittle, especially in production environments.
  • Continual learning: In dynamic environments, where new domains emerge over time, it is more practical to incrementally augment the general-domain SAE rather than train and manage new SAEs for every domain.

To support this motivation, we include new experiments comparing SAE Boost to both fine-tuning and multi-SAE baselines, demonstrating that our method is more efficient while maintaining or improving performance.


W2: Missing Baseline – Mixed-Domain Fine-Tuning

We agree that this is an important baseline, and we have added experiments analyzing the trade-off between domain-specific and general-domain explained variance (EV) during mixed-data fine-tuning.

Specifically, we vary the ratio of general- and domain-specific data in the fine-tuning batches and evaluate performance across both domains. We then plot the resulting operating points to construct the Pareto frontier of EV trade-offs. SAE Boost outperforms these baselines by achieving better general-domain retention at comparable or higher domain-specific EV, even though the baselines have access to both datasets.
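A schematic sketch of how such operating points are reduced to a Pareto frontier (dummy EV numbers; illustrative code rather than our experiment pipeline):

```python
def pareto_front(points):
    """points: list of (domain_ev, general_ev) operating points.
    Keep every point that no other point matches or beats on both axes."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

# e.g., checkpoints from different domain-data mixing ratios (dummy numbers):
points = [(0.70, 0.72), (0.75, 0.70), (0.74, 0.66), (0.78, 0.64)]
print(pareto_front(points))   # -> [(0.70, 0.72), (0.75, 0.70), (0.78, 0.64)]
```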

Moreover, SAE Boost is more parameter- and storage-efficient, as it avoids duplicating entire SAEs and does not require access to general-domain training data.


W3: Pareto Optimality and EV Trade-Offs

We now explicitly evaluate whether SAE Boost lies on the Pareto-optimal frontier of EV trade-offs. To do this, we perform full fine-tuning of the original SAE on domain-specific data and save intermediate checkpoints. For each checkpoint, we compute both domain-specific and general-domain EV, identifying the point where domain-specific EV matches that of SAE Boost.

This analysis reveals that SAE Boost consistently achieves higher general-domain EV at the same level of domain-specific EV, compared to full fine-tuning. This supports our claim that SAE Boost yields a more favorable trade-off and lies closer to the Pareto frontier.

We include a plot of these results in the updated paper and provide a link for early access:

📈 Pareto Frontier Plot


Q1: Are the Discovered Features Unique to SAE Boost?

We appreciate this insightful question. To address it, we retrieved the most similar features (by cosine similarity) from the general-domain SAE for each domain-specific feature highlighted in the interpretability section. The results are summarized below:

SAE Boost Feature                  | Most Similar General-Domain Features
Oxygen in chemical contexts        | (1) References to numerical thresholds; (2) Names of people; (3) Phrases about risk
Chemical and thermal stability     | (1) Durability references; (2) Legal protections; (3) Safety responsibilities
Supramolecular structure formation | (1) Protein databases; (2) Scientific measurements; (3) Mineralogy terminology
Chemistry energy values            | (1) Chemical reaction terms; (2) Calibration concepts; (3) Physics discussions
Hydrogen chemistry                 | (1) Personal relationships; (2) Compound applications; (3) Synthetic processes

These results suggest that the general-domain SAE either misses or dilutes the highly specialized domain-specific features found by SAE Boost. In particular, many of the top general-domain features are either unrelated or only tangentially related, indicating that SAE Boost discovers meaningful and unique concepts that may be obscured in more general models.


Conclusion

In our revised paper we now:

  • Justify the scope and deployment motivation for SAE Boost more clearly.
  • Include strong baselines involving mixed-domain fine-tuning.
  • Demonstrate SAE Boost’s Pareto-optimality in EV trade-offs.
  • Provide concrete evidence that it discovers distinct domain-specific features.

We hope these additions address your concerns and demonstrate the value of our contribution. We would be grateful if you could consider updating your score based on the revised paper and additional analyses.

Thank you again for your thoughtful and helpful review.

Comment

Nice work, this addresses all my concerns. I've adjusted the score accordingly.

Final Decision

The reviewers, after discussion with the authors, unanimously find this paper interesting and its claims well supported by experiment. As for pros, this work provides a practical solution to an interesting problem, namely how sparse autoencoders fail to generalize outside their training distribution; it is well evaluated, and the method is interesting yet pleasingly simple. As for cons, I'm not sure the interpretability evals of the SAEs (including those in the reviewer discussion) are terribly convincing. This is editorializing on my part, but I also would have preferred to see evaluations based not on reconstruction error but on, say, steering or other behavioral control tests: uses of the SAEs themselves for safety, concept control, or other applications in the new domain.

This is just an extra request – if the authors have room to clarify, I’m rather confused about LLM CE increases of, e.g., 0.935 up to 2.0 or 3.0 – aren’t these gigantic increases when the cross-entropy of the LLM on average is probably between 2 and 4 depending on the text? Are the SAEs doing a terrible job of maintaining the model’s distribution or am I reading this wrong?