Sparse Autoencoders for Hypothesis Generation
We propose HypotheSAEs, a sparse autoencoder-based method to hypothesize interpretable relationships between input texts and a target variable; we show our method performs well on several computational social science datasets.
Abstract
Reviews and Discussion
This paper proposes a hypothesis generation method that trains a sparse autoencoder to find text examples that trigger the same neuron, which is interpreted as capturing the same human concept, and then leverages an LLM to generate an interpretation of the neuron from these examples. The paper performs experiments on both synthetic and real-world datasets, and shows that the method has better accuracy and efficiency than the state of the art.
update after rebuttal
The authors have addressed my major concerns and questions, and I increased my rating to accept.
Questions for Authors
- Does the H from "HYPOTHESAES outputs H natural language concepts (line 188)" equal the dimension M of the activation matrix, Z_SAE? And how do you determine M?
- Did you observe any hallucinations from the LLM when labeling neurons? If so, how did you address this?
- How sensitive is the method to different sparsity levels (k in TopK activation)?
Claims and Evidence
The claims in this paper are overall convincing, though some claims are overstated or need more support:
- In the introduction, the paper claims it "clarifies the challenge" of context window and reasoning limitations in LLM-based approaches to hypothesis generation; however, the method proposed here is also subject to the context window limit when generating interpretations for each neuron.
- In Section 4.1, it claims the dimensions of z_i to be empirically "often highly interpretable, corresponding to human concepts", but without any supporting evidence or reference. The validity of this statement is critical to the entire work and deserves more attention.
Methods and Evaluation Criteria
For the method, using a sparse autoencoder to find text examples that may carry the same concept makes sense, but interpreting each single neuron of the activation matrix as a natural language concept is problematic: neurons are discrete, but language concepts are not.
Furthermore, there is no convergence analysis of the concepts (labels) of neurons interpreted by the LLM. Since the LLM has to interpret a neuron based on a small number of examples that activate the target neuron, it would provide more insight to show how stable the interpretations are when the examples are changed, or even when just the temperature is changed.
Theoretical Claims
No.
Experimental Design and Analysis
Yes, I checked the datasets and metrics.
Supplementary Material
B.2. Neuron interpretation
Relation to Prior Literature
It improves on LLM-only hypothesis generation approaches by using a sparse autoencoder to generate interpretable feature representations (SAE neurons) instead of directly prompting an LLM.
Essential References Not Discussed
None that I am aware of.
Other Strengths and Weaknesses
Strengths:
- The paper is well structured and presented.
- The method has good computational efficiency.
Weaknesses:
- Interpreting the neurons with an LLM requires per-dataset manual effort to tune the bin percentiles from which to sample highly-activating and weakly-activating examples, making the method less generalizable.
- Lack of human evaluation of the interpretation quality of the LLMs and the generated hypotheses.
Other Comments or Suggestions
In Equation (6), z should be z_i.
Thank you for your detailed, positive review. We are glad you find the claims convincing, the paper well-presented, and the method efficient.
To reply to your questions:
(1) “The method proposed here is also subject to the context window limit[...]” Thanks for pointing this out; we will clarify in the paper. HypotheSAEs doesn't run into context window limit issues in practice because it applies LLMs only to interpret SAE neurons, for which the LLM requires only a few examples (we test this in C.2). In contrast, prior methods use LLMs to learn the interpretable features in the first place, which requires reasoning over many examples.
(2) “it claims the dimensions of z_i to be [...] highly interpretable [...] without any supporting evidence” Thanks for this point. Supporting references are in Sec 1+2.2, and we will include them here as well. However, this sentence is meant to motivate using SAEs, rather than to prove this claim; our experiments on real datasets demonstrate it holds in our setting. One quantitative metric (see response to R2) is that interpreting SAE neurons yields much higher fidelity vs. interpreting embeddings directly (F1: 0.84 vs. 0.54).
(3) “Interpreting each single neuron of the activation matrix as a natural language concept is problematic” Thanks; we agree this is a key challenge, which is why we conduct extensive experiments to maximize fidelity to the underlying neuron (Appendix C). Empirically, Figure 3 demonstrates that the loss in predictive performance due to using discrete concepts to approximate the neurons is quite small on average (-2.4%).
(4) “How stable are the interpretations when the examples are changed” Thanks for this question. We ran a new experiment to measure stability. For 100 random neurons on the Yelp dataset, we generated 3 interpretations with different random seeds, and computed text embeddings of all interpretations. The mean pairwise cosine similarity of two interpretations generated for the same neuron is high: 0.84 (vs. 0.34 for a pair of interpretations from two different neurons). This is despite two sources of randomness—sampling different examples and an LLM temperature of 0.7—showing that the underlying concept learned by each neuron is durable. Here is a set of interpretations for a random neuron (can share more upon request):
Neuron 50 (stability: 0.76): ['discusses the seasoning of food, either praising it as perfect or criticizing it as lacking', 'mentions seasoning or lack of seasoning in the food', 'mentions seasoning or lack thereof in the context of food flavor']
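For reference, here is a minimal sketch of how such a stability score can be computed from text embeddings of the interpretations; the embedding call itself is omitted and the names are illustrative, not our exact implementation.

```python
# Sketch of the stability metric described above: mean pairwise cosine similarity between
# the embeddings of the interpretations generated for the same neuron.
from itertools import combinations
import numpy as np

def mean_pairwise_cosine(embeddings: list[np.ndarray]) -> float:
    """Average cosine similarity over all pairs of interpretation embeddings."""
    sims = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in combinations(embeddings, 2)]
    return float(np.mean(sims))

# stability of a neuron = mean_pairwise_cosine of its 3 interpretation embeddings;
# the baseline (~0.34) is the same quantity computed for interpretations of different neurons.
```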
(5) “Interpreting the neurons with an LLM requires per-dataset manual effort to tune the bin percentiles”; “How sensitive is the method to different sparsity levels (k in TopK)”. Thanks for these questions about hyperparameters. We ran some experiments and found that results are not particularly sensitive; they work well with defaults:
- Headlines, using default bin [90, 100] instead of [80, 100] as in paper: AUC 0.70, 11/20 significant (still beats all baselines).
- Yelp, using default k=8 instead of k=32 as in paper: R^2 0.77, 14/20 significant (~identical to original).
We provide hyperparameter guidance for practitioners in Appendix B, with more detail in the Python package we released publicly (which, unfortunately, we aren't permitted to link here).
(6) “Lack of human evaluation of the interpretation quality of the LLMs and the generated hypotheses” Thank you; in light of this comment, we conducted a qualitative human eval where we asked three computational social science researchers (not involved with the paper) to evaluate all significant hypotheses on the Headlines and Congress datasets. We followed prior HCI work (Lam et al. 2024, “Concept Induction”) and asked them to annotate for “Helpful” and “Interpretable” hypotheses. We use the median of the three ratings. HypotheSAEs substantially outperforms baselines in terms of raw counts and percentages: 24/30 (80%) are rated helpful, and 29/30 (97%) are interpretable.
Results plot: https://imgur.com/a/qw6bt3s
We also spoke to domain experts about novelty; see our reply to R3. We hope these findings increase your confidence that our hypotheses are high-quality.
(7) “Does the H [...] equal the dimension M” No: M is the total number of SAE neurons, from which we select H predictive neurons to interpret. We choose it based on the validation AUC of how well the SAE neurons predict y (see B.1 + our repo).
(8) “Did you observe any hallucinations from the LLM when labeling neurons?” We did not observe hallucinations in the usual sense, but some interpretations did not describe the neuron well, which is why we generate multiple interpretations and choose the highest-fidelity one. In practice, our results are not very sensitive to this step (see B.2).
Given the strengths you highlight, and our experiments to address your comments, would you consider raising your score? If not, do you have further questions?
The paper presents HYPOTHESAES, a three-step method (SAE-based feature generation, feature selection, and LLM-based feature interpretation) for hypothesis generation that identifies interpretable relationships between text data and a target variable. The approach leverages sparse autoencoders (SAEs) to learn meaningful, human-interpretable features, which are then used to generate hypotheses through feature selection and large language model (LLM)-based interpretation.
Key Contributions
- Theoretical Framework: The authors establish a formal framework for hypothesis generation, introducing a "triangle inequality" that relates the predictiveness of learned features to their interpretability.
- Algorithm (HYPOTHESAES): A three-step approach. (1) Feature Generation: train a sparse autoencoder on text embeddings to create interpretable neurons. (2) Feature Selection: identify predictive neurons using an L1-regularized regression. (3) Feature Interpretation: use an LLM to label neurons with natural language descriptions, forming hypotheses.
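A minimal, hypothetical sketch of this three-step pipeline is below; the class and function names, the TopK formulation, and the hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the HYPOTHESAES pipeline summarized above (not the released package).
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Lasso

class TopKSAE(nn.Module):
    """Sparse autoencoder over text embeddings; keeps the k largest activations per example."""
    def __init__(self, d_embed: int, m_neurons: int, k: int):
        super().__init__()
        self.encoder = nn.Linear(d_embed, m_neurons)
        self.decoder = nn.Linear(m_neurons, d_embed)
        self.k = k

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.encoder(x))
        topk = torch.topk(pre, self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)          # sparse neuron activations (step 1)
        return self.decoder(z), z   # reconstruction of the embedding is the training target

def select_predictive_neurons(Z: np.ndarray, y: np.ndarray,
                              n_select: int = 20, alpha: float = 0.01) -> np.ndarray:
    """Step 2: L1-regularized regression on neuron activations; return the most predictive neurons."""
    coefs = Lasso(alpha=alpha).fit(Z, y).coef_
    return np.argsort(-np.abs(coefs))[:n_select]

# Step 3 (not shown): for each selected neuron, prompt an LLM with its highly-activating
# and weakly-activating texts to produce a natural-language hypothesis.
```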
Main Findings
- Improved Hypothesis Generation: HYPOTHESAES outperforms recent LLM-based hypothesis generation methods and traditional topic modeling approaches.
- Efficiency: requires 1-2 orders of magnitude less computation than LLM-driven baselines.
- Interpretability: the sparse autoencoder structure ensures that identified features align with human-interpretable concepts.
Questions for Authors
- What if we directly use pretrained LLMs to perform the same task and use them as one of the baselines? It seems pretrained LLMs can also perform the task and offer some self-explanations. I am curious about the performance and the cost.
- How do you define the novelty of a generated hypothesis? While the method identifies hypotheses that were not explicitly found in prior studies, the paper does not provide direct human validation of the novelty and importance of these hypotheses.
- Can you provide any supporting evidence that the method can be applied to more complicated real-world hypothesis generation in other fields like healthcare? Or can you discuss the difference between complicated scientific hypothesis generation using LLMs in the papers I mentioned above and the proposed method?
- It would be better if the authors could add more discussion about the necessity of the specific theoretical bound (Proposition 3.1) in practical performance.
Claims and Evidence
Most claims are convincingly supported, particularly those regarding performance gains, computational efficiency, and the effectiveness of SAEs.
Novelty claim could benefit from additional validation, especially through expert evaluation. While the method identifies hypotheses that were not explicitly found in prior studies, the paper does not provide direct human validation of the novelty and importance of these hypotheses.
“Broad applicability beyond text-based tasks claim” is overstated. The method is only tested on text-based hypothesis generation tasks. There is no evidence that it generalizes to healthcare (e.g., clinical note analysis) or biology (e.g., scientific literature mining). The claim would need domain-specific evaluations before being fully credible.
Methods and Evaluation Criteria
Yes, they do.
Theoretical Claims
The theoretical framework (the triangle inequality for hypothesis generation) is a fundamental contribution, but it is weakly supported. Proposition 3.1 is an interesting insight, but the empirical validation is indirect: while the model works well, the necessity of this specific theoretical bound for practical performance is unclear.
Experimental Design and Analysis
The evaluation metrics, statistical tests, and experimental setups are generally valid and well-designed. However, a key limitation is the lack of human evaluation to assess the novelty of the generated hypotheses.
Supplementary Material
Yes, I reviewed all the supplementary materials in the appendix, including Additional Synthetic Experiments, Hyperparameter Settings & Training Details, Labeling Fidelity and LLM-Based Interpretation, Cost and Runtime Analysis, and Theoretical Analysis.
Relation to Prior Literature
This paper synthesizes ideas from hypothesis generation, sparse representation learning, and interpretability research to create a scalable and structured approach for discovering meaningful insights from text data. Unique contributions to the broader literature include:
- Combines sparse autoencoders with LLM interpretation for efficient hypothesis generation.
- Establishes a theoretical link between interpretability and predictiveness in machine learning models.
- Demonstrates real-world applications that extend findings in political science, marketing, and behavioral research.
Essential References Not Discussed
It would be better for the authors to discuss the difference between the hypothesis generation scenario in this paper and the more complex scientific hypothesis generation scenarios in the papers below:
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers. arXiv preprint arXiv:2409.04109.
Learning to generate novel scientific directions with contextualized literature-based discovery. arXiv preprint arXiv:2305.14259.
SciMON: Scientific inspiration machines optimized for novelty. arXiv preprint arXiv:2305.14259.
ResearchAgent: Iterative research idea generation over scientific literature with large language models. arXiv preprint arXiv:2404.07738.
Other Strengths and Weaknesses
Strengths
- Originality. (a) Novel combination of sparse autoencoders and hypothesis generation. The paper introduces a creative synthesis of ideas from sparse representation learning, interpretability research, and hypothesis generation. Prior work has focused on either interpretable feature extraction (SAEs) or LLM-driven hypothesis discovery, but this paper combines the two, making hypothesis generation more scalable and efficient. The triangle inequality formulation (Proposition 3.1) provides a new theoretical perspective on the trade-off between feature predictiveness and interpretability. (b) Breaks from fully LLM-driven approaches. Recent LLM-based hypothesis generation methods (e.g., HYPOGENIC, NLPARAM) require high computational cost, limiting their scalability. By decoupling feature learning from LLM inference, HYPOTHESAES provides a more practical and resource-efficient solution.
- Clarity. (a) Well-structured and clearly written. The paper provides a step-by-step explanation of the method (Figure 1) and offers intuitive interpretations of the results. The supplementary material includes detailed experimental settings and theoretical justifications, ensuring reproducibility. (b) Strong theoretical grounding. Proposition 3.1 is well-motivated, and the proofs in the appendix provide a rigorous foundation for the method’s effectiveness.
Weaknesses: issues listed in the above sections.
Other Comments or Suggestions
The introduction of the datasets should use more formal language. For example, around lines 263, 268, 274, etc., the sentences could be modified to:
We utilize 200k reviews for training, 10k for validation, and 10k for held-out evaluation.
As the appendix is long, it would be better to have a table of contents to better organize everything and add a brief paragraph about all the prompt templates, etc.
Thank you for the detailed review and suggestions. We're glad you found our method original, clearly presented, theoretically grounded, and computationally efficient, with the performance gains convincingly supported.
To respond to your questions and suggestions:
(1) “ ‘Broad applicability beyond text-based tasks claim’ is overstated”. Could you clarify which part of the paper you’re referring to (we couldn’t find this quote directly)? We agree that we do not provide evidence that the method generalizes beyond text-based tasks; we would be happy to revise specific areas where you thought this was implied.
(2) “What if we directly use pretrained LLMs [...] as one of the baselines?” We included a baseline, HypoGeniC, which uses pretrained LLMs directly. This method performed worse across 14 of 15 quantitative comparisons and was ~10x more expensive and ~30x slower, owing to the need to make many LLM calls to select candidate hypotheses. If you have specific suggestions for other ways to use pretrained LLMs, please let us know!
(3) “Novelty claim could benefit from additional validation, especially through expert evaluation.” Thank you; based on this suggestion, we reached out to two experts to assess the novelty of our findings on the Headlines dataset (Table 1 of the paper). In particular, we asked them whether the three hypotheses most negatively associated with engagement—"addressing collective human responsibility or action", "mentioning environmental issues or ecological consequences", "mentioning actions or initiatives that positively impact a community"—were novel or reported previously. They did not provide papers with these specific findings; for example, one expert noted, “I don't know of any papers specifically looking at [the hypotheses you mention]”. Both experts pointed us to theories that are broadly supported by these findings: e.g., Robertson et al. (2023) find that negativity drives news consumption, which is consistent with the third hypothesis.
In light of your and R4’s comments, we also conducted a human eval where we asked three computational social science researchers (not involved with the paper) to evaluate all significant hypotheses on the Headlines and Congress datasets. We followed prior HCI work (Lam et al. 2024, “Concept Induction”) and asked them to annotate for “Helpful” and “Interpretable” hypotheses. We use the median of the three ratings. HypotheSAEs substantially outperforms baselines in terms of raw counts and percentages: 24/30 (80%) are rated helpful, and 29/30 (97%) are interpretable.
We hope these new findings increase your confidence that the results from HypotheSAEs are (1) novel, as per our correspondence with experts; and (2) helpful and interpretable, as per our human evals.
(4) You asked about “the necessity of the specific theoretical bound (proposition 3.1) in practical performance.” We agree the theoretical bound is not strictly necessary for practical performance, but rather serves as a broader motivation for the procedure, as you note. It also provides us a way to conceptually decompose hypothesis generation performance into (1) neuron predictiveness, and (2) interpretation fidelity. C.3 explores this empirically.
(5) You asked about “the difference between ... the paper and more complex scientific hypothesis generation scenario in the papers below.” Thank you for these references, which we will include in an additional related work paragraph titled “Automated Idea Generation”. We agree that these literatures are both in service of helping researchers conduct science, but they address different tasks:
Our work is focused on solving the problem: “given a dataset of texts and a target variable, what are the human-understandable concepts that predict the target?” In contrast, the literature you mention focuses on: “given a corpus of prior scientific papers, can we propose promising research ideas?” Practically, the former task emerges when a researcher knows what they are studying and has collected data, but they need tools to make sense of the data. The latter task emerges when a researcher is trying to decide what to study based on prior literature. Methodologically, the former task involves methods like clustering & interpretability, while the latter involves methods like prompting & RAG. We think these two literatures are distinct, but complementary; for example, a political scientist might use an idea generation method to decide to study partisan differences in social media posts, and then, after collecting a dataset, use HypotheSAEs to propose data-driven hypotheses for further validation.
(6) More formal language; appendix table of contents: Thank you! We will make these edits.
Given the strengths you highlight, and our experiments to address your comments, would you consider raising your score? If not, do you have further questions?
Thank you for the detailed and thoughtful rebuttal. You've addressed most of my concerns.
My main remaining concern is the claim regarding the method’s broad applicability beyond text-based tasks, specifically the statement in the conclusion: "The method applies naturally to many tasks in social science, healthcare, and biology." Based on the datasets used in the paper, the method seems best characterized as addressing tasks of the form you mentioned: given a dataset of texts and a target variable, what are the human-understandable concepts that predict the target—a formulation that may be commonly studied in social science. I think this task essentially corresponds to a classification task with explanations.
However, in real-world hypothesis generation, for example, in domains like healthcare and biology, hypotheses are often more open-ended and complex, and may not be reducible to this specific task formulation. As such, I believe the claim of broad applicability is somewhat overstated. It may be clearer and more accurate to explicitly define your hypothesis generation task as one focused on discovering interpretable predictors from labeled text data. And since no empirical evidence is provided for applications in healthcare or biology, it might be more appropriate to reserve that discussion for the impact statement rather than the main conclusion.
We’re happy to hear that we addressed most of your concerns.
Thank you for explaining this further; we agree with your point that the conclusion is too broad, and we will revise as follows:
- We will remove “The method applies naturally to many tasks in social science, healthcare, and biology” and replace it with “The method applies naturally to settings with a large text corpus labeled with a target variable of interest.”
- We will revise the citations to point to clear examples of tasks where one can run our method: Ziems et al., 2024 (a review of tasks and datasets across many disciplines within social science); Bollen et al. 2010 (social networks); Card et al. 2022 (political science); Breuer et al. 2025 (media studies). These settings are related to the ones we study and fall within our task formulation.
- We will clearly delineate non-text datasets and more specialized settings as room for future work not covered by our paper: "An exciting direction for future work, which we do not explore here, is extending our method to non-text modalities (e.g., proteins or images: Vig et al. 2021; Dunlap et al. 2023) as well as more specialized text domains (e.g., healthcare: Hsu et al. 2023, Robitschek et al. 2025)."
This is our final chance to reply, but if we have addressed your concerns in these two sets of revisions, would you consider increasing your score? While we won’t be able to reply, feel free to suggest additional tweaks that you think will add further clarity, and we will strongly consider incorporating them. Thanks again for your comments, which have strengthened the paper.
This paper addresses the task of using LLMs to take a labeled text dataset and propose natural language hypotheses predicting those labels. The performance of hypotheses is measured by having an LLM evaluate each prompt according to the explanations, producing a boolean vector. Using this vector, a linear regression is learned to predict whether the output is true or false, and this is scored in several ways - recovering known hypotheses on synthetic datasets, getting many statistically significant hypotheses, having high AUC, etc.
The authors implement several existing baselines. Their key contribution is a new method based on sparse autoencoders on text embeddings. They use the OpenAI text embedding API and train a narrow sparse autoencoder with approximately a thousand units. They then identify which units are predictive of the labels using L1 regularised logistic regression and select the most predictive ones.
They then use auto-interp on the top units to generate 3 hypotheses (GPT-4o), score them (GPT-4o mini), and then validate the predictive power of the hypotheses on a held-out validation set, with a Bonferroni correction for multiple hypothesis testing.
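A rough sketch of this evaluation loop is below, assuming the LLM annotation step has already produced a boolean hypothesis-by-text matrix; the function name and the logistic-regression choice are illustrative, not the paper's exact code.

```python
# Rough sketch of the evaluation described above: regress labels on boolean hypothesis
# annotations, score predictiveness, and Bonferroni-correct the per-hypothesis p-values.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def score_hypotheses(annotations: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """annotations: (n_texts, n_hypotheses) booleans from LLM judgments; y: binary labels."""
    X = sm.add_constant(annotations.astype(float))
    fit = sm.Logit(y, X).fit(disp=0)
    auc = roc_auc_score(y, fit.predict(X))      # computed on a held-out split in the real setup
    threshold = alpha / annotations.shape[1]    # Bonferroni correction for multiple testing
    significant = fit.pvalues[1:] < threshold   # skip the intercept
    return auc, significant
```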
They find that this significantly outperforms existing (largely LLM-based) baselines across most metrics. Finally, they present several qualitative examples of the explanations produced. These explanations broadly seem reasonable and interpretable, are often more specific than prior methods would give, and on some well-studied datasets give what the authors consider to be novel hypotheses.
Questions for Authors
No
Claims and Evidence
Yes
Methods and Evaluation Criteria
Seems reasonable, though I am not very aware of the right baselines and implementation for this task
Theoretical Claims
Briefly
Experimental Design and Analysis
The HypotheSAEs method is sound
Supplementary Material
Not much
Relation to Prior Literature
Sparse autoencoders have been a major focus of the mechanistic interpretability field over the past year. A major open question is whether sparse autoencoders are practically useful on downstream tasks in a way that beats baselines, and there have been many recent negative results. In my opinion, as an interpretability researcher, this paper is highly significant because it is the most compelling example I have yet seen of sparse autoencoders beating baselines on a real task that people have actually tried to solve with other methods. Furthermore, this kind of exploratory analysis and hypothesis generation is exactly the kind of thing where I would expect sparse autoencoders to add significant value, as they are able to surface unexpected hypotheses.
Essential References Not Discussed
No
Other Strengths and Weaknesses
As discussed above, this paper is a compelling example of sparse autoencoders (SAEs) beating baselines on a real task. This is the best example I have seen on an extremely important question, and for this reason, I consider this work a strong accept.
Not only are SAEs substantially cheaper than the existing modern baselines, but they also performed substantially better. Enough qualitative examples were provided that I am reasonably satisfied that real hypotheses are being surfaced rather than spurious findings. Similarly, it was able to recover the correct hypotheses on the synthetic tasks.
The statistical analysis and effort put into baselines seemed of fairly high quality, though I found it difficult to judge the details here, as I am not familiar with this task or these baselines.
Other Comments or Suggestions
A few things could improve the paper further:
I would love it if you could take a dataset where your method seems to generate novel hypotheses, like the Congress one, and show them to a domain expert. Have them evaluate whether these are interesting and novel, and whether they seem plausible. If they say yes, I am much more impressed by your results.
It would be worthwhile to try simpler baselines, and to verify that these don't work well, such as:
- Using the PCA basis rather than the SAE basis
- Decision trees
- Logistic regression with L1 on a bag of words
- Frequency analysis of words or bigrams
I'm concerned that with any complex baseline, it could be easy to overcomplicate things or misimplement them. We might have the wrong hyperparameters, and it's good to try a range of styles, just in case.
I'd also be excited to see whether matryoshka SAEs help here. You mentioned in the appendix that sometimes you train several SAEs of different sizes, and matryoshka is designed to remove the need for that by allowing concepts in a range of different granularities to be learned in the same SAE. I would be keen to see the authors open source this code and make it easy for other people to work with.
Thank you for the thoughtful and positive review. We’re especially glad to read that as an interpretability researcher, you found our work to be “highly significant because it is the most compelling example I have yet seen of sparse autoencoders beating baselines on a real task that people have actually tried to solve with other methods.” Indeed, we agree that recent literature is mixed on what value SAEs contribute, and we see the method as highlighting an area in which they have a real advantage: the ability (as you note) to find interpretable and unexpected patterns in data. We will further highlight this contribution in the paper.
We also thank you for suggesting simpler baselines, which we agree would clarify the importance of the SAE. We ran several of them on the three real-world datasets (Headlines, Yelp, Congress). Besides the feature generation step, the baselines are identical to our method.
- Embedding: run Lasso to select 20 dimensions directly from the embeddings.
- PCA: transform embeddings to the PCA basis (k=256, which explains ~80% of variance), then select 20 dimensions using Lasso.
- Bottleneck (suggested by R1): fit a simple neural network which uses the embeddings as input, has a hidden 20-neuron “bottleneck” layer, and predicts the target as output; then interpret the 20 bottleneck neurons. (This is conceptually similar to NLParam, but simpler.)
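To make the PCA ablation concrete, here is a minimal sketch (function name and the Lasso penalty are placeholders; our actual implementation may differ). The results for all three ablations follow in the table below.

```python
# Sketch of the PCA ablation: project embeddings onto 256 principal components,
# then select the 20 most predictive components with Lasso, as described above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

def pca_feature_selection(embeddings: np.ndarray, y: np.ndarray,
                          n_components: int = 256, n_select: int = 20, alpha: float = 0.01):
    Z = PCA(n_components=n_components).fit_transform(embeddings)
    coefs = Lasso(alpha=alpha).fit(Z, y).coef_
    selected = np.argsort(-np.abs(coefs))[:n_select]   # dimensions to interpret with the LLM
    return Z[:, selected], selected
```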
| Method | Total Count | Average Predictiveness | Average Fidelity |
|---|---|---|---|
| SAE | 45/60 | 0.717 | 0.836 |
| Embedding | 30/60 | 0.706 | 0.545 |
| PCA | 23/60 | 0.697 | 0.547 |
| Bottleneck | 27/60 | 0.699 | 0.640 |
Metrics are combined/averaged across the three datasets. Recall that count is the number of significant hypotheses in a multivariate regression; predictiveness is the AUC (or R^2 for Yelp) using all hypotheses together; fidelity is the F1 score for how well the interpretation matches the neuron activations.
Using the SAE produces many more hypotheses which are significant in a multivariate regression (45/60 across the three datasets for SAE vs. 30/60 for the next best baseline); slightly higher predictiveness; and much higher interpretation fidelity. This is consistent with our qualitative finding that SAE neurons fire on more specific, distinct concepts than embeddings. Specificity permits the high-fidelity interpretations, and distinctiveness results in the wide breadth of hypotheses. While these baselines perform reasonably well in terms of predictiveness, for downstream utility, predictiveness is only a prerequisite: we ultimately want hypotheses that are non-trivial and non-redundant (as is discussed in Sec 6.2).
We believe the results of these baselines clarify the value of the SAE in our setting, and we are excited to include them in the updated version of the paper.
Regarding your suggestion to fit an n-gram/bag-of-words model: in the paper, we perform a related analysis for the Congress dataset, using n-gram results from Gentzkow et al. (a seminal, widely-cited analysis). They fit an L1 regression using bigrams and trigrams, and report 600 predictive ones. Controlling for counts of the n-grams on this list, we find that our hypotheses improve AUC from 0.61 to 0.74 and 28 out of 33 remain significant. This suggests that n-grams aren’t sufficient to cover our hypotheses. Qualitatively, we find hypotheses which we think n-grams can’t easily capture, like “criticizes the allocation of resources or priorities by the government, particularly highlighting disparities between domestic needs (e.g., education, healthcare, energy) and actions abroad (e.g., Iraq)”.
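For concreteness, a rough sketch of this control analysis is below; the variable names and the regression choice are illustrative, and the published list of predictive n-grams is assumed to be available as a Python list.

```python
# Sketch of the n-gram control: add counts of the published predictive bigrams/trigrams as
# covariates and re-test whether the hypothesis annotations remain significant.
import numpy as np
import statsmodels.api as sm
from sklearn.feature_extraction.text import CountVectorizer

def control_for_ngrams(texts, hypothesis_annotations, y, ngram_list):
    counts = CountVectorizer(vocabulary=ngram_list, ngram_range=(2, 3)).fit_transform(texts)
    X = sm.add_constant(np.hstack([hypothesis_annotations.astype(float), counts.toarray()]))
    fit = sm.Logit(y, X).fit(disp=0)
    n_hyp = hypothesis_annotations.shape[1]
    return fit.pvalues[1:1 + n_hyp]   # hypothesis p-values after controlling for n-gram counts
```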
Thank you also for the pointer to Matryoshka SAEs. We’ve been excited about these as well, and plan to add them to our codebase and try them. We agree they may reduce the need to stitch together SAEs of different sizes.
We agree that working with domain experts would be a valuable direction to increase confidence in our results. In the last week, we took some initial steps here (more detail in the response to R3): we (1) confirmed that two domain experts did not know of specific work producing several of our findings on the headlines dataset (the researchers also noted that some of these findings support theories in psychology), and (2) conducted a human eval for whether our hypotheses are helpful and interpretable, which produced strongly positive results (https://imgur.com/a/qw6bt3s).
We share your desire to make our code easy to work with. In fact, we’ve released a public codebase with a pip-installable package, though unfortunately we are not allowed to link to it (even anonymously).
Thank you again for your suggestions, and we’re really happy to see that you found the paper to make a significant contribution to interpretability research.
Thanks for the follow-up experiments, I think they strengthen the paper's results. I've read the other reviews and stand by my score (though, naturally, cannot increase it)
This paper proposes a method to generate hypotheses using SAEs. The first step of the method involves generating interpretable features by training SAEs on feature embeddings. The second step involves identifying which features are predictive for a task, using Lasso. Finally, the third step involves using LLMs to generate human interpretable natural language explanations for the identified predictive features. The experiments show that this method identifies better hypotheses than baselines, while requiring 1-2 orders of magnitude less compute.
Questions for Authors
N/A
Claims and Evidence
The claims made by this submission are very clear and are well-supported by evidence.
Methods and Evaluation Criteria
The methods and evaluation criteria make sense for the problem and application at hand.
Theoretical Claims
I did not formally check the correctness of the theoretical claims.
Experimental Design and Analysis
I did not find any issues with the soundness or validity of the experimental designs.
Supplementary Material
I did not carefully review the supplementary material.
Relation to Prior Literature
One of the key contributions of the paper is proposing a compute-efficient method to perform hypothesis discovery using LLMs. While some previous methods (Zhou et al., "Hypothesis Generation with Large Language Models", 2024) rely entirely on LLMs to propose hypotheses for relationships between two target variables, and other methods (Zhong et al., "Explaining Datasets in Words: Statistical Models with Natural Language Parameters", 2024) require large amounts of compute to achieve this task, this paper performs the same task with about 1-2 orders of magnitude less compute.
Another contribution is using SAEs in a systematic manner to identify hypotheses. While most work in the SAE literature focuses on improving or measuring the fidelity of the representations, this paper makes use of imperfect interpretations to generate hypotheses, while still being able to reason about the corresponding fidelity of the generated hypotheses.
Essential References Not Discussed
None that comes to mind.
Other Strengths and Weaknesses
Strengths:
- This paper proposes a systematic framework to generate and identify hypotheses using SAEs / Lasso, along with LLMs in the loop. The paper is overall well-written and its goals are clearly stated. The biggest advantage of this method is that it seems to use significantly less compute than the baselines; making this a useful contribution to the field.
Weaknesses:
- Missing non-SAE baseline: The proposed method consists of three independent steps; the first involves training an SAE, and the second involves identifying suitable features via a Lasso regressor. Finally, an auto-interpretation step is proposed to interpret the identified features in terms of natural language explanations. It would be great to have an ablation analysis that tests the utility of step 1, i.e., using SAE features. For example,
- why not use trained bottleneck features like NLParam, or
- directly use Lasso on top of the pre-trained language embedding features, skipping the SAE step altogether?
These ablations would help assess the importance of the SAE.
Other Comments or Suggestions
As a suggestion, the phrase "triangle inequality" may not be appropriate for the result in Proposition 3.1; as such a result requires three quantities in a shared space (like a field or a vector space) and corresponding notions of distance between them. While the proposed theory is useful, using the term "triangle inequality" might lead to confusion, and I suggest that the authors reconsider its usage in the paper.
Thank you for the thoughtful and positive review. We’re glad that you found the claims of the paper to be very clear and well supported.
We agree that a significant benefit of our method is that it reduces compute requirements by 1-2 orders of magnitude, and believe this will facilitate real-world applications. We would like to also emphasize what we see as an even more important benefit of the method: its performance in identifying predictive hypotheses. On synthetic tasks (where we know ground-truth hypotheses), the method outperforms all three baselines on 11/12 metrics. On real-world tasks, the method outperforms all three baselines on 5/6 metrics, generating 45 significant hypotheses out of 60 candidates, while the next best generates only 24.
We also appreciate your feedback about the phrase “triangle inequality.” Perhaps we could call it a “sufficient condition for hypothesis generation.” We would be curious for your thoughts on this phrase. An alternative could be to drop the phrase.
We also thank you for suggesting ablations, which we agree would clarify the importance of the SAE. On the three real-world datasets (Headlines, Yelp, Congress), we ran both ablations you mentioned—using embeddings directly and fitting a bottleneck—and a third suggested by R2, which selects features from an embedding PCA. We keep other steps of our method fixed.
- Embedding: run Lasso to select 20 dimensions directly from the embeddings.
- PCA (suggested by R2): transform embeddings to the PCA basis (k=256, which explains ~80% of variance), then select 20 dimensions using Lasso.
- Bottleneck: fit a simple neural network which uses the embeddings as input, has a hidden 20-neuron “bottleneck” layer, and predicts the target as output; then interpret the 20 bottleneck neurons. (This is conceptually similar to NLParam, but simpler.)
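As an illustration of the Bottleneck ablation, a minimal sketch is below; architecture details beyond the 20-neuron hidden layer are assumptions rather than our exact implementation. The results for all three ablations follow in the table below.

```python
# Sketch of the Bottleneck ablation: embeddings -> 20-neuron hidden layer -> target.
# The 20 hidden neurons are then interpreted with the LLM, just like SAE neurons.
import torch
import torch.nn as nn

class BottleneckPredictor(nn.Module):
    def __init__(self, d_embed: int, n_bottleneck: int = 20):
        super().__init__()
        self.bottleneck = nn.Linear(d_embed, n_bottleneck)
        self.head = nn.Linear(n_bottleneck, 1)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.bottleneck(x))      # candidate interpretable features
        return self.head(z).squeeze(-1), z

# Train with nn.BCEWithLogitsLoss (classification) or nn.MSELoss (Yelp ratings), then pass
# each bottleneck neuron's top-activating texts to the LLM interpretation step.
```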
| Method | Total Count | Average Predictiveness | Average Fidelity |
|---|---|---|---|
| SAE | 45/60 | 0.717 | 0.836 |
| Embedding | 30/60 | 0.706 | 0.545 |
| PCA | 23/60 | 0.697 | 0.547 |
| Bottleneck | 27/60 | 0.699 | 0.640 |
Metrics are combined/averaged across the three datasets. Recall that count is the number of significant hypotheses in a multivariate regression; predictiveness is the AUC (or R^2 for Yelp) using all hypotheses together; fidelity is the F1 score for how well the interpretation matches the neuron activations.
Using the SAE produces many more hypotheses which are significant in a multivariate regression (45/60 across the three datasets for SAE vs. 30/60 for the next best baseline); slightly higher predictiveness; and much higher interpretation fidelity. This is consistent with our qualitative finding that SAE neurons fire on more specific, distinct concepts than embeddings. Specificity permits the high-fidelity interpretations, and distinctiveness results in the wide breadth of hypotheses. While these baselines perform reasonably well in terms of predictiveness, for downstream utility, predictiveness is only a prerequisite: we ultimately want hypotheses that are non-trivial and non-redundant (as is discussed in Sec 6.2).
We believe the results of these baselines clarify the value of the SAE in our setting, and we are excited to include them in the updated version of the paper; thank you for suggesting them!
We believe that we have now addressed the only weakness you raised (comparison to additional baselines) and were hoping, in light of this, that you would be willing to raise your score. If not, please let us know if there are further concerns we could address or experiments we could conduct. Thank you again for your suggestions, which have strengthened the paper!
Thank you for the rebuttal and for performing the non-SAE baseline experiments! It is indeed an interesting and very surprising finding that the SAE seems critical for proposing predictive hypotheses. These results may make a great addition to the paper.
Overall, I maintain my current rating.
This paper shows how to generate interpretable relationships between text data and target variables leveraging sparse-autoencoders. It compares favorably against state-of-the-art methods, while using less compute.
The reviewers generally agree that this work is a meaningful contribution. One of the reviewers points out that the scope of the method may be limited to text data, and the revised version of the paper should reflect this.