Automatically Interpreting Millions of Features in Large Language Models
We build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs.
Abstract
Reviews and Discussion
The paper presents an automated pipeline for generating and evaluating natural language interpretations of SAE latents in LLMs. The authors introduce five scoring techniques—Detection, Fuzzing, Surprisal, Embedding, and Intervention Scoring—to assess interpretation quality, with Intervention Scoring evaluating the causal effects of latents on model outputs. They test their framework on SAEs trained on two open-weight LLMs and find that different sampling and scoring strategies impact the specificity and generalizability of interpretations.
Questions for Authors
(1) There is no systematic way to verify correctness, as LLM-generated explanations may sound plausible but not reflect actual latent function.
(2) Why were Surprisal and Embedding scores included despite their weak correlation with human evaluations? Do you see specific scenarios where these scores are more useful than Fuzzing, Detection, or Simulation?
Claims and Evidence
Claim: "The pipeline improves upon previous interpretability work by producing higher-quality interpretations." The paper does not conduct rigorous human evaluations of interpretation quality. While the scoring metrics provide an automated way to assess interpretations, they are correlation-based and may not align with actual human judgment. The paper references some small-scale human evaluations but does not systematically compare their method to existing ones using expert annotations.
Methods and Evaluation Criteria
Yes
Theoretical Claims
None
Experimental Design and Analysis
The experimental design is well-structured.
Supplementary Material
N/A
Relation to Prior Literature
The paper extends prior work on SAE-based feature extraction by scaling interpretations to millions of latents. It refines simulation-based scoring with five new metrics. While the pipeline improves scalability and efficiency, it offers incremental advances rather than fundamentally new interpretability methods.
Essential References Not Discussed
None
Other Strengths and Weaknesses
Strengths:
(1) The paper successfully scales LLM-based interpretability methods to millions of SAE latents, a good engineering contribution.
(2) The proposed scoring methods, especially Detection and Fuzzing, offer computationally cheaper alternatives to simulation scoring.
(3) Code is provided with a detailed README file.
Weaknesses:
(1) The core methodology builds on existing LLM-based interpretability and SAE approaches, offering mostly incremental improvements.
(2) The evaluation relies heavily on automated metrics without systematic human assessment of interpretation quality.
(3) Surprisal and Embedding scores have low correlation with human judgment, making their usefulness unclear.
Other Comments or Suggestions
None
We agree with this comment by the reviewer. Our generated interpretations only interpret the activations of individual latents, and are far from full explanations of their behaviour and downstream impact. Intervention-based methods, like the one we propose, are the ideal candidates for probing the causal role of latents. We believe that a combination of both is the most informative.
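As an illustration (not the exact implementation in our pipeline), here is a minimal sketch of what intervening on a single latent might look like: a forward hook adds a direction to a GPT-2 residual stream and the steered and unsteered continuations are compared. The random unit vector standing in for an SAE decoder column, the layer index, the steering scale, and the prompt are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Stand-in for an SAE decoder direction: in a real pipeline this would be the
# decoder column of the latent being scored, not a random vector.
direction = torch.randn(model.config.n_embd)
direction = direction / direction.norm()
scale = 8.0   # assumed steering strength
layer = 6     # assumed residual-stream layer to intervene on

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # we add the latent direction at every position.
    hidden = output[0] + scale * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

ids = tok("The weather today is", return_tensors="pt")
handle = model.transformer.h[layer].register_forward_hook(steer)
steered = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
baseline = model.generate(**ids, max_new_tokens=20, do_sample=False)

print("steered :", tok.decode(steered[0], skip_special_tokens=True))
print("baseline:", tok.decode(baseline[0], skip_special_tokens=True))
# An intervention score would then ask a scorer LLM whether the difference
# between the two continuations is consistent with the latent's interpretation.
```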
(2) Why were Surprisal and Embedding scores included despite their weak correlation with human evaluations? Do you see specific scenarios where these scores are more useful than Fuzzing, Detection, or Simulation?
We propose surprisal as a seemingly natural method to score interpretations, with the hope that it inspires future work in that direction and that future work is aware of the limitations of this technique. Although we probably won't use surprisal scoring much in the future, we believe it is valuable for the community to know that it was tried.
On the other hand, embedding scoring is much faster and cheaper than all the other techniques, even though this comparison is less apples-to-apples than the comparison between fuzzing/detection and simulation. This is because not only is the number of input tokens lower, but the embedding model used can also be as small as 700M parameters. This makes a very compelling case for embedding scoring if one wants to quickly evaluate the interpretability of latents, potentially on the fly while training.
The fact that latents with low embedding scores generally also score low on fuzzing/detection can be used to filter out incorrectly interpreted latents or to find latents which are generally not interpretable.
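To illustrate why embedding scoring is so cheap, here is a minimal sketch using a small off-the-shelf sentence embedder; the model name, the example texts, and the exact scoring rule (a mean cosine-similarity gap) are assumptions, not the implementation used in the paper.

```python
from sentence_transformers import SentenceTransformer, util

# A small embedding model stands in for the one used in the pipeline (assumed).
embedder = SentenceTransformer("all-MiniLM-L6-v2")

interpretation = "mentions of cooking ingredients"   # hypothetical interpretation
activating = [
    "Add two cups of flour and a pinch of salt.",
    "Stir in the garlic and the chopped onions.",
]
non_activating = [
    "The committee adjourned the meeting until Friday.",
    "He parked the car outside the stadium.",
]

query = embedder.encode(interpretation, convert_to_tensor=True)
pos = embedder.encode(activating, convert_to_tensor=True)
neg = embedder.encode(non_activating, convert_to_tensor=True)

# One possible scoring rule: how much closer the interpretation sits to the
# activating examples than to the non-activating ones.
score = util.cos_sim(query, pos).mean() - util.cos_sim(query, neg).mean()
print(f"embedding score: {score.item():.3f}")
```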
This paper introduces an automated pipeline for interpreting the latent features of sparse autoencoders (SAEs), which decompose large language model (LLM) representations. The authors propose five scoring methods to assess interpretation quality, including detection, fuzzing, surprisal, embedding, and intervention scoring. The framework is tested across different SAE architectures and LLMs, and the results suggest that the proposed scoring methods are computationally cheaper while maintaining strong correlations with human evaluations.
Questions for Authors
N/A.
Claims and Evidence
The authors claim that the proposed methods provide a more scalable and efficient interpretation method while capturing causal effects through intervention scoring. However, the largest problem is that the paper only shows the results for the proposed method, without a systematic comparison with existing methods that have already been discussed in the related works.
Methods and Evaluation Criteria
The proposed scoring methods make sense for the interpretability problem at hand. Again, although the experiments cover different SAEs, LLMs, and datasets, the absence of a detailed baseline comparison weakens the empirical evaluation.
Theoretical Claims
The work does not include theoretical guarantees or formal proofs, which is understandable given its empirical focus.
Experimental Design and Analysis
See Claims and Evidence and Methods and Evaluation Criteria.
Supplementary Material
Yes. The supplementary material provides useful details on experimental setup and implementation.
Relation to Prior Literature
The paper builds on prior work in LLM interpretability.
Essential References Not Discussed
The paper covers the most relevant references but lacks experimental comparisons with them.
Other Strengths and Weaknesses
Strengths:
- The paper is well-organized and easy to follow.
- The proposed method is intuitive and technically sound.
Weaknesses:
- Lack of direct numerical comparison with existing methods.
- Following 1, it is unclear how much interpretability quality improves or is sacrificed for scalability.
- Missing the analysis of failure cases. It is important to be aware of the safety boundary of the proposed method in implementation.
Other Comments or Suggestions
I strongly recommend the authors provide direct numerical comparisons between new and existing scoring techniques to clarify performance gains. I’m willing to raise my score if such a comparison is provided.
We thank the reviewer for this feedback.
Lack of direct numerical comparison with existing methods
We compare our scoring techniques with the standard scoring technique at the time of writing, simulation scoring. In Table 1 we compare the costs of both scoring techniques, and we report correlations with human scores (line 322, left column). We also report inter-score correlations in Tables A1 and A2.
Is there a specific comparison that the reviewer would like to see?
Following 1, it is unclear how much interpretability quality improves or is sacrificed for scalability.
Improvements in scalability were mostly in the generation of scores, which we discuss in Table 1. This means that the proposed techniques do not sacrifice interpretability quality for scalability. Indeed, we believe that the proposed guidelines for generating and evaluating latents improve interpretability quality instead of sacrificing it. Some of our proposed scores correlate well with human-given scores.
Missing the analysis of failure cases. It is important to be aware of the safety boundary of the proposed method in implementation.
We currently provide one failure case in Figure A2 of the appendix, where the explainer model produces an incorrect explanation, one that would be more easily recovered from the top activations alone. (As we write this, we notice that the caption of the figure is missing; it indeed discussed this.) We discuss this failure mode in the text. It is hard in general to discuss failure cases in interpretability, and even harder to define safety boundaries. We wonder if the reviewer has specific suggestions on how to improve on this.
Thank you to the authors for the clarifications. After reviewing the rebuttal and the comments from other reviewers, I recognized my earlier misunderstanding of the results in Table 1. I now lean toward acceptance and am happy to raise my score.
The paper develops an automated pipeline for interpreting latent features identified by sparse autoencoders (SAEs) in LLMs. The authors implement a three-stage approach that first collects latent SAE activations, then generates natural language interpretations using external LLMs, and finally evaluates interpretation quality. Their main contribution is five scoring methods (detection, fuzzing, surprisal, embedding, and intervention) that are more compute-efficient than traditional simulation scoring. Intervention-based evaluation is not commonly used in autointerp pipelines and is a new addition to the mix: it is a causal framework that explains latents by their effects on model outputs rather than by input correlations. The provided evaluation approaches each have specific qualities, which the paper discusses. The research demonstrates that interpretation quality depends on sampling strategy (stratified sampling across the activation distribution produces better interpretations than using only top-activating examples). The practical implication is the most important one: the pipeline enables scalable assessment of millions of features, providing a foundation for improved model understanding.
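To make the evaluation stage concrete, here is a minimal sketch of detection-style scoring: the scorer is shown a mix of activating and non-activating examples and asked which ones the latent should fire on, and the score is the balanced accuracy of those judgments. The keyword check standing in for the scorer LLM, the example texts, and the exact scoring rule are illustrative assumptions rather than the paper's implementation.

```python
import random

def scorer_says_activates(interpretation: str, example: str) -> bool:
    """Placeholder for the scorer LLM: in the real pipeline an LLM is shown the
    interpretation and the example and asked whether the latent would fire.
    A trivial keyword check stands in so the sketch runs end to end."""
    return any(word in example.lower() for word in interpretation.lower().split())

def detection_score(interpretation, activating, non_activating):
    # Mix activating and non-activating examples, classify each one, and report
    # the balanced accuracy of the scorer's judgments.
    examples = [(e, True) for e in activating] + [(e, False) for e in non_activating]
    random.shuffle(examples)
    preds = {e: scorer_says_activates(interpretation, e) for e, _ in examples}
    tpr = sum(preds[e] for e, lab in examples if lab) / len(activating)
    tnr = sum(not preds[e] for e, lab in examples if not lab) / len(non_activating)
    return 0.5 * (tpr + tnr)

interpretation = "mentions of bread or flour"   # hypothetical interpretation
activating = ["Add the flour and stir.", "The bread came out of the oven."]
non_activating = ["The train left the station.", "Stocks fell sharply today."]
print(detection_score(interpretation, activating, non_activating))
```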
Questions for Authors
No question.
Claims and Evidence
Yes.
The main claims regarding novelty are: "We introduce five new techniques to score the quality of interpretations that are cheaper to run than the previous state of the art. One of these techniques, intervention scoring, evaluates the interpretability of the effects of intervening on a latent, which we find explains latents that are not recalled by existing methods"
The paper supports its claim of introducing five more cost-efficient scoring techniques through the detailed cost analysis in Table 1, showing that methods like fuzzing and detection require 5-30x fewer tokens than simulation scoring. The second claim, about intervention scoring, is evidenced in Figure 3, which demonstrates a negative correlation between fuzzing and intervention scores, showing that latents scoring poorly on traditional methods can still be well interpreted through their causal effects.
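One way to see where the token savings come from is to look at how a fuzzing-style prompt could be assembled: only a short example with delimited tokens and a yes/no question is sent to the scorer. The delimiter choice, activation threshold, and prompt wording below are assumptions, not the paper's exact template.

```python
def fuzz_prompt(interpretation, tokens, activations, threshold=0.0):
    # Wrap tokens where the latent fired in << >> delimiters and ask the scorer
    # whether the highlighted tokens are consistent with the interpretation.
    marked = "".join(
        f"<<{t}>>" if a > threshold else t
        for t, a in zip(tokens, activations)
    )
    return (
        f"Interpretation: {interpretation}\n"
        f"Example with highlighted tokens:{marked}\n"
        "Do the highlighted tokens match the interpretation? Answer yes or no."
    )

tokens = ["The", " cake", " needs", " more", " sugar", "."]
activations = [0.0, 3.2, 0.0, 0.0, 4.1, 0.0]   # hypothetical latent activations
print(fuzz_prompt("baking-related nouns", tokens, activations))
```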
Methods and Evaluation Criteria
Yes, the proposed five evaluation methods are quite intuitive and make sense for interpreting latents.
Theoretical Claims
There are no major theoretical claims or issues.
Experimental Design and Analysis
Yes, I can validate the soundness of the following experiments:
- Cost comparison of scoring methods (Table 1)
- Correlation analysis between scoring methods (Tables A1 & A2)
- Example sampling strategy comparison (Figure 2) - Testing how different sampling techniques (top activating examples, random sampling, stratified sampling) affect interpretation quality
- Intervention scoring analysis (Figure 3) - Demonstrating that some latents with low fuzzing scores have high intervention scores, and comparing trained SAEs against random baselines
The number of latents used for the correlation analysis could be larger.
Supplementary Material
Tables A1 and A2.
Relation to Prior Literature
Considering that the work is more of an empirical study with practical value, it provides useful tools for the broader research community to interpret activations of various models across different domains.
Essential References Not Discussed
Essential references are discussed
Other Strengths and Weaknesses
Strengths:
- The authors provide a useful tool for the community, although the automated interpretability pipeline itself is not novel
- The novelty lies in the evaluation approaches for the interpretations, which are insightful
- The intervention-based approach is useful to have in the stack
Weaknesses:
- While being a useful tool for a broad community (both interpretability community and people who want to use the interpretability toolsets), the overall insightful findings and novelty in the method are limited. The automated interpretability pipeline is not new.
Verdict: I consider this work a valuable project and tool for the community, rather than scientifically profound. I would need to see more scientific novelty or rigour for a clear accept. Therefore, I consider this a weak accept.
Other Comments or Suggestions
None.
Thank you for your helpful comments.
While being a useful tool for a broad community (both interpretability community and people who want to use the interpretability toolsets), the overall insightful findings and novelty in the method are limited. The automated interpretability pipeline is not new.
At the time of writing, evaluation of SAE interpretability focused on the top activating examples. We show that this is misleading for multiple reasons. Firstly, it gives the illusion that SAEs are more interpretable and monosemantic than they are (we show this in Figure 2), because it ignores the full activation distribution. There was also the impression that the lower activations of SAEs were completely uninterpretable and less valuable, which again we show not to be the case. We think this is a valuable insight.
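For illustration, a minimal sketch of the difference between showing only top-activating examples and sampling across the activation distribution; the synthetic activations and the quantile-binning rule are assumptions for illustration, not the exact sampling scheme used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic latent activations over a pool of example sentences (assumed).
activations = rng.exponential(scale=1.0, size=10_000)

def top_k(acts, k=20):
    # Only the strongest activations: tends to overstate monosemanticity.
    return np.argsort(acts)[-k:]

def stratified(acts, k=20, n_bins=5, rng=rng):
    # Sample evenly from quantile bins of the non-zero activations, so that
    # weak and mid-range activations are also shown to the explainer.
    nonzero = np.flatnonzero(acts > 0)
    edges = np.quantile(acts[nonzero], np.linspace(0, 1, n_bins + 1))
    chosen = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = nonzero[(acts[nonzero] >= lo) & (acts[nonzero] <= hi)]
        chosen.extend(rng.choice(in_bin, size=k // n_bins, replace=False))
    return np.array(chosen)

print("top-k mean activation:     ", activations[top_k(activations)].mean())
print("stratified mean activation:", activations[stratified(activations)].mean())
```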
Our scoring techniques improved on the state of the art and are now the standard instead of simulation scoring. We think that our discussion of interpretability scoring is also helpful for the community, and such an analysis is not common. Intervention scoring was a type of scoring that was missing from the current framing of SAE interpretability.
We would love to be able to produce more rigorous results that would convince the reviewer that the paper strongly merits acceptance. If there are specific examples that the reviewer would like to be done more rigorously we would take those into consideration.
This paper proposes an LLM-based automatic concept explanation method to address the problem of poor human comprehension of sparse autoencoders. Specifically, the authors collect highly responsive sentences and the corresponding concepts from the SAE and carefully design prompts for an LLM, which then automatically explains the other units in the SAE. The authors also provide five evaluation metrics to assess the effectiveness of the method.
Update after rebuttal
Questions for Authors
NA
Claims and Evidence
The authors claim in the title to explain millions of features in LLMs, but it seems that this article actually explains the many units in the SAE latent space.
Methods and Evaluation Criteria
The authors provide five evaluation metrics, which I think are reasonable.
Theoretical Claims
No theoretical analysis in the paper.
Experimental Design and Analysis
Lack of comparison with some baseline methods.
Supplementary Material
Many experimental details are provided in the appendix.
Relation to Prior Literature
It is interesting to interpret the parameters of LLMs, and this paper proposes an interesting approach.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
Strengths:
- The idea of using an LLM to automatically explain SAEs is interesting, although there is similar work, such as using GPT-4 to explain GPT-2.
Weaknesses:
- The authors only used one LLM to interpret the SAE in this article. It would be more convincing if the authors considered more LLMs and performed a prompt sensitivity analysis.
- The authors claim in the title to explain millions of features in LLMs, but it seems that this article actually explains the many units in the SAE latent space.
- What if we let an LLM explain its own SAE? What if we use an additional LLM to explain another LLM's SAE? The authors could explore this.
- Can this method identify some latent units that do not have obvious conceptual information?
- The method in this paper is very interesting, but the experiments seem insufficient; for example, there is a lack of sensitivity analysis, specific ablation experiments, and exploration of more potential downstream applications.
Other Comments or Suggestions
Please see weaknesses, I am willing to raise my score if the author can address my concerns convincingly.
We thank the reviewer for their helpful comments.
The authors only used one LLM to interpret SAE in this article. It would be more convincing if the authors could consider more LLMs and perform prompt sensitivity analysis.
In the current version of the article, we evaluate the explanations given by Llama 3.1 70b, Llama 3.1 8b, and Claude 3.5 Sonnet, as shown in Table A8. We discuss prompt sensitivity analysis in the last reply of this rebuttal.
The author claims in the title to explain millions of features in LLM, but it seems that this article explains the massive units in SAE latent space.
In this work we use features to mean interpretable directions in both the residual stream and the MLP outputs. SAEs were developed because directions in the residual stream and MLP (looking at neurons, for instance) were often polysemantic and hard to understand. The interpretability community uses feature to mean an interpretable concept, so we believe the claim of explaining millions of features is correct, and similar to the one in Templeton et al., 2024. We have added the following sentence to the introduction: "Throughout this work we are going to be using latent to mean a specific pair of encoder-decoder indices in the SAE, and feature to mean the concept that this latent is representing."
What if we let an LLM explain its own SAE? What if we use an additional LLM to explain another LLM's SAE? The author can explore this.
We did both of these in this work. We used Llama 8b to generate explanations for the Llama 8b SAE, and in general we have used other LLMs – Llama 70b – to generate the explanations for the SAEs trained on Llama 8b and Gemma 2 9b. Llama 8b is worse at generating explanations than Llama 70b, and at the time of writing there were no Llama 70b SAEs; Llama 70b was the smallest model that gave reasonable explanations, so these were the only experiments we could perform.
Can this method identify some latent units that do not have obvious conceptual information?
Some units have simple conceptual interpretations (they fire on the verb "to be", or on "chair") and some have more complicated patterns, like firing on text between parentheses. We are not sure if this is what the reviewer meant to ask.
The method in this paper is very interesting, but the experiments seem to be insufficient, for example, there is a lack of some sensitivity analysis, specific ablation experiments, or more potential downstream applications, etc.
We have ablated model size (for both explainer and scorer), the number of examples shown, the number of tokens per example, the origin of the examples shown, the use of chain-of-thought or not, and whether token activations are shown or not. We also investigated the interpretability of different layers, compared the interpretability of residual-stream SAEs with that of SAEs trained on the MLP output, and studied the dependence of interpretability on the SAE expansion factor. Is there a specific ablation that the reviewer has in mind?
Thanks to the authors' reply, most of my concerns have been addressed. Considering the article's writing and contribution, I have decided to raise my score to weak accept.
This paper presents an automated pipeline for generating and evaluating natural language interpretations of SAE latent features in LLMs. The authors introduce five new scoring techniques that provide more computationally efficient alternatives to traditional simulation scoring while maintaining strong correlations with human evaluations. The reviewers acknowledged several strengths of this work, particularly the intervention scoring approach that offers a novel causal perspective on latent interpretability, revealing features that aren't captured by existing methods. The paper makes a valuable practical contribution by scaling interpretability methods to millions of SAE latents, with comprehensive ablation studies across different sampling strategies, model configurations, and evaluation criteria. While reviewers initially raised concerns about the lack of direct numerical comparisons with existing methods and the absence of systematic human evaluations, the authors adequately addressed these issues in their rebuttal by clarifying the comparisons made in Table 1 and explaining the correlations with human scores. All reviewers ultimately provided weak accept recommendations, recognizing that while the core methodology builds on existing LLM-based interpretability approaches, the improvements in scalability, the novel intervention scoring technique, and the practical utility of the pipeline make valuable contributions to the field. Given the paper's technical soundness, practical utility, and the authors' thorough responses to reviewer concerns, I recommend accepting this paper for publication at ICML 2025.