PaperHub
Overall rating: 7.3 / 10 (Poster; 4 reviewers; lowest 4, highest 5, std. dev. 0.5)
Individual ratings: 4, 5, 5, 4
Confidence: 3.3
Novelty: 3.3 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Under input uncertainty, transformer models exhibit a systematic exploration of input‑agnostic conceptual representations, increasing the likelihood of hallucinations.

Abstract

Keywords
Mechanistic interpretability, sparse autoencoders, transformer hallucinations, concept steering, vision transformers, large language models, noise‑trained representations, inductive bias, robustness and uncertainty, AI safety and alignment, representation learning

Reviews and Discussion

Review (Rating: 4)

This paper investigates how hallucinations arise in transformer models by analyzing residual stream activations using sparse autoencoders (SAEs), where each decoder direction defines a “concept.” Two metrics, semantic purity and steerability, are used to assess these concepts. The authors train SAEs on residual stream activations from noise inputs to a vision transformer and find that many resulting concepts still align with coherent semantics when tested on real images. They further show that as input uncertainty increases, more concepts are activated—especially in middle layers. Using this framework, the authors predict hallucination rates in Gemma-2B-IT summaries based on concept activations from the input text. The main message of the paper is that increased input uncertainty corresponds to a higher number of activated concepts, which leads to increased hallucination rates.

Strengths and Weaknesses

Strength:

  1. The use of sparse autoencoders trained on residual stream activations from pure noise is an interesting setup, suggesting that some semantic structure is baked into the model’s internal representations.
  2. The study of varied inputs (such as image patch shuffling and text n-gram scrambling) is convincing and shows that input ambiguity can lead to higher SAE concept activation.
  3. The idea of using internal concept activations to predict hallucination rates in Gemma-2B-IT summaries is a solid step toward connecting the model's internal dynamics to its output behavior.

Weakness:

  1. While the paper's main goal is to understand the source of hallucination in transformers, much of the work—particularly the analysis focused on vision transformers—feels only loosely connected to this objective. The only part that directly addresses hallucination is the summarization experiment on Gemma-2B-IT. Even there, the results are somewhat limited; it would strengthen the paper if the authors clarified what kinds of hallucinations are being predicted and whether certain types are more reliably captured by the concept activations.
  2. The $R^2$ of 0.27 reported in Figure 5 is modest and not quite convincing, and the paper did not compare its method against established hallucination-detection baselines.
  3. The paper is difficult to follow at times due to a lack of precise definitions for key terms and notations. For example:
    • how is the function $f_i$ defined?
    • what do you mean by a concept being "activated"? The term was never formally defined.
    • the concepts of semantic purity and steerability are not properly defined.

Overall: The paper offers an interesting analysis of how input uncertainty propagates through a transformer’s internal dynamics. However, the step from these observations to a practical hallucination detector and potential mitigation strategies is only loosely argued and supported by limited evaluation. Consequently, I do not believe the work is yet ready for publication at NeurIPS.

Questions

See weakness section.

Limitations

NA

Justification for Final Rating

While some concerns remain—particularly the clarity of certain technical definitions and the exploratory nature of parts of the work—I believe the paper presents an interesting and original contribution to understanding hallucinations in transformer models. I am increasing my score to 4 and encourage the authors to further improve the clarity and precision of terminology in the final version.

Formatting Concerns

NA

Author Response

[GENERAL] We thank the reviewer for their thoughtful comments and their positive assessment of our experiments. To further support these findings, we have added targeted experiments with three new text datasets as well as an analogous assessment for the CLIP vision transformers to directly address the reviewer’s points (cf. W1 and W2 below).

[W1] We recognize that the experiments conducted in this work cover a lot of territory. We first show that transformer models strongly impose semantic structure on inputs, even when they lack true semantic structure (Section 3). This semantic framing is deconvolved from the internal model representations using SAE-identified concepts. We then demonstrate, in images and text, that this semantic imposition increases as inputs become increasingly unstructured and incoherent (Section 4). Finally, we directly link these SAE-identified concepts in the input space of the transformer to hallucinations in its output (Section 5). We desired to test a putative mechanism for the hallucination phenomenon observed in transformers in general, and, after validating this mechanism in images and language, to demonstrate the practical implications for acting on these insights (e.g. model steering of spurious concepts).

As the reviewer points out, there are many kinds of “hallucination” [1]. We tested the “faithfulness” of a summarization task, which is not dependent on the LLM’s background knowledge, as all required information is contained within the prompt. This is therefore a direct test of the effects that the transformer's internal representational structure has on its generated output. For more variety, see the tables presented in W2 for hallucination prediction evaluation on TruthfulQA (Lin et al. 2021), TriviaQA (Joshi et al. 2017), and CoQA (Reddy et al. 2019), as well as comparisons to established baseline hallucination prediction methods. We also perform a new experiment directly linking the L0 and patterns of semantic concepts to the misclassification of image samples (analogous to "hallucination" in the CLIP setting), effectively bridging the vision transformer analyses and the hallucination experiments. Please see the reply to Reviewer e9fj for more details and results (not reproduced here due to space limit).

[W2] We acknowledge that on its face, an $R^2$ of 0.27 may appear low in some contexts. However, we would like to note that this is a conservative out-of-sample result, averaged over 10 different cross-validation folds, predicting a continuous degree of hallucination. This explained variance strengthens our core point that internal layer concept activations can be taken to play a role in output hallucinations. In Section 5, we reported a stronger 73.0% accuracy when classifying examples by whether their hallucination score is above or below the median. Our goal with these experiments was not to build a SOTA “hallucination detector”. Rather, we wanted to show a link between the concept activations and a concrete hallucination metric, and to provide evidence that intervening on these activations can suppress hallucinations to a measurable degree. We show this using a simple linear model. Sophisticated non-linear methods and larger training datasets would likely improve performance. Further, we are predicting output hallucination based on the internal pattern of concept activations derived from the input alone.
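To make this evaluation protocol concrete, here is a minimal sketch (not our actual code) of a linear PLS predictor of a continuous hallucination score from input-side concept activations, scored out-of-sample over 10 cross-validation folds with an additional median-split accuracy; the array names `concept_acts` and `halluc_scores` and the number of PLS components are hypothetical placeholders.

```python
# Minimal sketch: PLS regression from SAE concept activations to a continuous
# hallucination score, with out-of-sample R^2 and median-split accuracy.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score, accuracy_score

def evaluate_pls_concepts(concept_acts, halluc_scores, n_components=8, n_folds=10):
    """concept_acts: (n_samples, d_sae) concept activations of the input prompt.
    halluc_scores: (n_samples,) continuous hallucination metric of the output."""
    r2s, accs = [], []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(concept_acts):
        pls = PLSRegression(n_components=n_components)
        pls.fit(concept_acts[train_idx], halluc_scores[train_idx])
        pred = pls.predict(concept_acts[test_idx]).ravel()
        r2s.append(r2_score(halluc_scores[test_idx], pred))
        # Median-split classification: is the hallucination score above the (train) median?
        median = np.median(halluc_scores[train_idx])
        accs.append(accuracy_score(halluc_scores[test_idx] > median, pred > median))
    return np.mean(r2s), np.mean(accs)
```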

We acknowledge the reviewer’s point that additional baselines would lend credibility to our findings and conclusions. We have added the results of several new experiments in comparison with our PLS-Concepts approach. To provide a fair comparison, we run each hallucination benchmark on Gemma-2B-IT using the widely used “Language Model Evaluation Harness” framework (Gao et al. 2024). This framework provides standardized prompt templates and evaluation metrics for many benchmark datasets. Each hallucination prediction method is evaluated on the same 10 cross-validation folds, and the mean AUC ± 1 s.d. across folds is reported. For each task, the hallucination detection method (rows of the tables below) must predict whether new, unseen samples are hallucinated, as labeled by the metric associated with each task (columns of the tables below). Note that different tasks evaluate the "hallucination" or "correctness" of an output using different metrics.

TruthfulQA — Generation

| Method | BLEU | ROUGE1 | ROUGE2 | ROUGEL |
|---|---|---|---|---|
| Perplexity [2] | 50.0±5.3 | 47.3±5.1 | 49.9±1.5 | 45.7±4.4 |
| Energy Score [3] | 47.2±3.0 | 50.5±3.5 | 49.9±4.5 | 52.4±3.6 |
| Lexical Similarity [4] | 47.0±2.9 | 49.9±2.8 | 53.0±5.7 | 49.1±3.8 |
| Normalized Entropy [5] | 51.1±5.5 | 49.5±4.1 | 49.4±1.6 | 48.6±4.3 |
| EigenScore [6] | 53.9±5.4 | 51.8±4.8 | 47.0±5.9 | 51.0±4.9 |
| PLS-Concepts (Layer1)* | 59.4±6.3 | 56.4±5.0 | 54.4±6.7 | 56.3±5.6 |
| PLS-Concepts (Layer7)* | 59.9±5.0 | 56.6±3.5 | 55.3±7.2 | 54.3±4.6 |
| PLS-Concepts (Layer11)* | 60.0±3.6 | 56.3±4.3 | 58.3±9.1 | 57.3±4.0 |
| PLS-Concepts (Layer13)* | 57.6±5.5 | 58.2±4.7 | 56.6±6.1 | 58.9±5.6 |
| PLS-Concepts (Layer17)* | 54.6±3.7 | 57.0±5.2 | 52.2±4.6 | 54.9±6.2 |

TriviaQA — Generation

| Method | Exact Match |
|---|---|
| Perplexity | 50.6±2.8 |
| Energy Score | 50.0±0.1 |
| Lexical Similarity | 68.9±2.9 |
| Normalized Entropy | 51.0±1.4 |
| EigenScore | 66.3±2.5 |
| PLS-Concepts (Layer1)* | 66.3±1.3 |
| PLS-Concepts (Layer7)* | 65.6±2.3 |
| PLS-Concepts (Layer11)* | 70.8±2.6 |
| PLS-Concepts (Layer13)* | 70.1±2.4 |
| PLS-Concepts (Layer17)* | 59.1±1.9 |

CoQA — Generation

| Method | Exact Match | F1 |
|---|---|---|
| Perplexity | 50.0±0.0 | 50.0±0.0 |
| Energy Score | 50.4±0.4 | 50.1±0.4 |
| Lexical Similarity | 66.4±1.6 | 59.0±2.0 |
| Normalized Entropy | 49.9±0.1 | 50.0±0.1 |
| EigenScore | 62.3±1.9 | 51.0±1.3 |
| PLS-Concepts (Layer1)* | 52.9±1.9 | 52.4±0.9 |
| PLS-Concepts (Layer7)* | 52.6±2.0 | 51.7±1.1 |
| PLS-Concepts (Layer11)* | 53.9±2.6 | 52.1±1.1 |
| PLS-Concepts (Layer13)* | 53.9±1.5 | 51.6±1.3 |
| PLS-Concepts (Layer17)* | 49.4±1.9 | 50.5±1.6 |

TruthfulQA — Multiple Choice

| Method | One Correct Answer | Multiple Correct Answers |
|---|---|---|
| Perplexity | 49.1±4.5 | 49.4±1.9 |
| Energy Score | 50.0±4.2 | 49.8±0.3 |
| PLS-Concepts (Layer1)* | 61.1±2.5 | 62.2±3.7 |
| PLS-Concepts (Layer7)* | 58.9±5.5 | 61.6±3.4 |
| PLS-Concepts (Layer11)* | 61.1±2.6 | 63.2±3.5 |
| PLS-Concepts (Layer13)* | 60.7±3.6 | 63.7±4.1 |
| PLS-Concepts (Layer17)* | 52.4±3.5 | 55.2±4.9 |

* = our model

Based on multiple benchmarks of hallucination classification performance, we note that our method is at least comparable to, if not slightly superior to, many established hallucination prediction tools. We also note that these methods all require some input signal from the LLM-generated content, whereas our method only relies upon concept activations that are internally triggered by the input prompt. As well, Lexical Similarity, Normalized Entropy, and EigenScore require comparisons between multiple LLM generations in order to form their predictions (in the above benchmarks, we set the generation budget to 5). This is why they are not included in the Multiple Choice benchmarks.
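For reproducibility, the following is a minimal sketch, under assumptions, of the fold-wise AUC protocol used for the tables above: each detection method is reduced to one scalar score per sample, which is compared against the task's binary hallucination labels on each held-out fold (for the trainable PLS-Concepts predictor, the held-out scores would come from a model fit on the remaining folds). The names `scores` and `labels` are placeholders.

```python
# Minimal sketch (assumed protocol): mean AUC ± 1 s.d. of a hallucination
# detector's scalar scores over 10 cross-validation folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def auc_over_folds(scores, labels, n_folds=10, seed=0):
    """scores: (n,) detector outputs (e.g. perplexity or PLS-Concepts predictions).
    labels: (n,) binary flags, 1 = hallucinated under the task's metric."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    aucs = []
    for _, test_idx in KFold(n_splits=n_folds, shuffle=True, random_state=seed).split(scores.reshape(-1, 1)):
        aucs.append(roc_auc_score(labels[test_idx], scores[test_idx]))
    # reported as percentages, matching the tables above
    return 100 * np.mean(aucs), 100 * np.std(aucs)
```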

We will add these useful results, accompanied by a fuller description of the experimental details and evaluation protocols, to the Appendix.

[W3] We thank the reviewer for making us aware of the lack of clarity surrounding these terms. We acknowledge that some of these terms could indeed be made more concrete.

[W3.1] In Section 2.2, $f_i(\mathbf{x})$ corresponds to the activation strength of concept $i$ for the input sample $\mathbf{x}$. Thus, the vector $f(\mathbf{x}) \in \mathbb{R}^{d_\mathrm{SAE}}$ corresponds to all the SAE concept activations for the input sample $\mathbf{x}$; we refer to a single element of this vector as $f_i$. In this context, $f$ represents the function of the encoder $F: \mathbb{R}^{d_\mathrm{model}} \to \mathbb{R}^{d_\mathrm{SAE}}$. In response to this comment and the comments from reviewer LyaA, we provide an updated Section 2.2 enhanced for clarity (please see the reply to LyaA for the updated section, not reproduced here due to space limit).

[W3.2] In the mechanistic interpretability literature, “a concept is activated” for a given input if $f_i(\mathbf{x}) > 0$ for concept $i$. That is, there is a non-zero entry for concept $i$ in the SAE-encoded vector of the input sample.

[W3.3] We define the concept quality metrics semantic purity and steerability more formally in Section 2.4. Semantic purity measures the extent to which the semantic labels associated with image samples that activate for a given concept are similar to one another. A concept is "pure" if the top-$k$ ($k=16$) maximally activating images for that concept have an average cosine similarity of at least 0.75 across their semantic label embeddings. Steerability is the capability of flipping a neutral image's semantic label to that of the concept, by clamping the concept's activation to an artificially high value. If $n=32$ out of 32 images from a randomly sampled batch of neutral images flip to the concept's label after clamping, then this concept is said to be "steerable". These conservative metrics are used to validate that the concepts extracted from the noise activation-trained SAE are meaningful. We will update Section 2.4 to reflect the clarified definitions.
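As an illustration, here is a minimal sketch, under assumptions, of the two quality checks defined above. The helpers are hypothetical: `label_embs` holds unit-normalized label embeddings of the validation images, and `classify_with_clamp` clamps one concept's SAE activation to a high value before classifying an image.

```python
# Minimal sketch (hypothetical helpers): semantic purity via label-embedding
# similarity of the top-k maximally activating images, and steerability via
# activation clamping on a batch of neutral images.
import numpy as np

def is_pure(concept_acts, label_embs, k=16, threshold=0.75):
    """concept_acts: (n_images,) activations of one concept over a validation set.
    label_embs: (n_images, d) unit-normalized embeddings of each image's semantic label."""
    top = np.argsort(concept_acts)[-k:]              # top-k maximally activating images
    sims = label_embs[top] @ label_embs[top].T       # pairwise cosine similarities
    mean_sim = sims[np.triu_indices(k, 1)].mean()    # average over distinct pairs
    return mean_sim >= threshold

def is_steerable(neutral_images, concept_id, classify_with_clamp, target_label, n=32):
    """classify_with_clamp(image, concept_id): hypothetical helper that clamps the
    concept's SAE activation to a high value and returns the model's predicted label."""
    flips = sum(classify_with_clamp(img, concept_id) == target_label
                for img in neutral_images[:n])
    return flips == n                                # all 32/32 must flip to the concept's label
```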

[CLOSING] We believe that these experiments and clarifications significantly strengthen the results and conclusions introduced in our paper. We thank the reviewer for suggesting these improvements.

[1] Huang et al. "A survey on hallucination in large language models" 2023

[2] Ren et al. "Out-of-distribution detection and selective generation for conditional language models." 2022

[3] Liu et al. "Energy-based out-of-distribution detection." 2022

[4] Lin et al. "Towards collaborative neural-symbolic graph semantic parsing via uncertainty." 2020

[5] Malinin and Gales. "Uncertainty estimation in autoregressive structured prediction." 2020

[6] Chen et al. "INSIDE: LLMs' internal states retain the power of hallucination detection." 2024

Comment

I thank the authors for their detailed responses to my questions and for the additional experiments, which provide additional justifications for the practical implications of the proposed approaches. I am increasing my score to 4 and encourage the authors to define key technical terms more precisely in the main text.

Review (Rating: 5)

This paper investigates representations of semantic concepts in transformers (in vision and in text) as the uncertainty in the input space is varied through the lens of sparse autoencoders. They find that as inputs become noisier, more semantic concepts are activated. They then attempt to use these findings for the use case of hallucination detection and then steer models to prevent hallucination using the learned concept directions.

Strengths and Weaknesses

Strengths:

  • Very well written and thorough investigation from first principles, from hypothesis to experimental design and results.
  • Strong, interesting and (to my knowledge) novel results for mechanistic understanding of concept activations in the presence of noisy data.

Weaknesses:

  • The hallucination detection experiments are not particularly convincing as they are lacking detail and not aligned/compared with standard baselines in the LLM uncertainty quantification literature (e.g. short form Q+A generations). In particular, I am not convinced that the jump from aleatoric noise in the input space activating more concepts translates to hallucinations.
  • The macro-level results of L0 concept activations lack the granularity of individual sample results - some qualitative examples on these would be useful.

Overall the paper has lots of exciting insights even without the hallucination section, which needs to be made more concrete.

Questions

  • Is noisy input data even indicative of likelihood of hallucination? Some explanation of how this links to aleatoric/epistemic uncertainty and what kinds of hallucinations you are thinking about would be useful. More experiments from the LLM uncertainty quantification/hallucination literature would also be useful to dissect this, as would generalisation to new datasets.
  • Do you have an interpretation for why more concepts are activated than fewer? Is there a “randomness”/”noise”/”entropy” concept that emerges for messy input data?
  • How do you decompose the likelihood of hallucination as a function of output length and sampling temperature?

Limitations

Yes

Justification for Final Rating

The authors addressed my comments, and I maintain my score.

Formatting Concerns

None

Author Response

[GENERAL] We thank the reviewer for their positive assessment of our work and for their constructive comments. We particularly appreciate the reviewer highlighting the novelty and significance of our noise activation-trained SAE experiments and our mechanistic insights into internal transformer concept dynamics. In response to this reviewer comment and the comments from Reviewer pJz5, we have added several new hallucination benchmark datasets and have compared our hallucination detection approach with established baselines. We have also added a new experiment linking the vision SAE-identified concepts to an analogous notion of hallucination in the image setting.

[W1] We recognize that our hallucination results could benefit from proper contextualization with additional datasets and existing hallucination detection methods. Note that our main goal here was not to develop a state-of-the-art hallucination detection method. Instead, we aimed to reveal a direct link between SAE-identified concepts in the internal representations of transformers and their propensity to produce confabulated outputs. In this sense, the "faithfulness" of summarization is an ideal task as it is agnostic to the internal "knowledge" of the LLM [1]. However, we have conducted additional benchmarks with TruthfulQA [2], TriviaQA [3], and CoQA [4], comparing with established hallucination prediction baselines. These short form Q+A tasks expand the scope of hallucinations studied with our present framework. Please see the reply to Reviewer pJz5 for the full results tables and experimental setup (not reproduced here due to space constraints).

We also recognize that the connection between the noise-input experiments (Sections 2 & 3) and the hallucination experiments (Section 4) could be made stronger. It is somewhat challenging to find exact parallels between vision and language model hallucinations, which is why we focused solely on language hallucinations in the original text. However, to this end, we have conducted additional experiments testing the utility of the L0 metric in predicting an analogous notion of hallucination in the CLIP model: misclassification of semantic image labels. We train a simple linear classifier to predict whether or not an image will be misclassified, based solely on the L0 metric for that image. We source the concept activations and calculate the L0 metric from layer 6 of the vision transformer, as this layer yielded the largest spike in mean L0 differences between normal and shuffled images in Section 4. We train our linear classifier on L0 counts of SAE latents extracted from 9000 images, and verify the results across 10 independent cross validation folds. We find that we can predict whether or not an image will be misclassified solely using its associated L0, robustly above random chance. We also attempt to use the pattern of concept activations, extracted by a partial least squares (PLS) model, to predict the same misclassification. While still robustly above chance, the results for the PLS model are slightly inferior to the raw L0 results (see table below). As well, the natural versus shuffled images seem to yield equal prediction performance; this is likely because although the rate of hallucination (misclassification) is higher in the shuffled images, it is not easier to predict whether a misclassification will occur for a given sample. We believe that this new experiment provides a solid bridge between the vision transformer experiments and the direct hallucination prediction tasks with the LLMs.

Mean AUC (percentage) ± 1 s.d., from ViT layer 6 concepts

| Predictor | Natural Images | 56×56 Patch Shuffled Images |
|---|---|---|
| L0 Misclassification Prediction | 64.8 ± 1.1 | 61.3 ± 1.6 |
| PLS-Concept Misclassification Prediction | 64.5 ± 1.5 | 61.5 ± 1.3 |
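For concreteness, a minimal sketch (assumed setup, not our actual code) of the L0-based predictor described above: the L0 metric is simply the number of activated SAE concepts (nonzero latents) per image, and a logistic-regression classifier on that single feature is scored with fold-wise AUC; the array names are placeholders.

```python
# Minimal sketch: L0 metric (count of activated SAE concepts) as the sole
# feature of a linear classifier predicting CLIP misclassification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def l0_metric(concept_acts):
    """concept_acts: (n_samples, d_sae) SAE concept activations for one layer."""
    return (concept_acts > 0).sum(axis=1)            # activated concepts per sample

def l0_misclassification_auc(concept_acts, misclassified, n_folds=10):
    """misclassified: (n_samples,) binary flags, 1 = the model assigned the wrong label."""
    l0 = l0_metric(concept_acts).reshape(-1, 1)      # single scalar feature
    clf = LogisticRegression(max_iter=1000)
    aucs = cross_val_score(clf, l0, misclassified, cv=n_folds, scoring="roc_auc")
    return 100 * aucs.mean(), 100 * aucs.std()
```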

[W2] We will add qualitative examples to the Appendix exploring the additional concepts activated for shuffled images over their natural counterparts.

[Q1] This is an interesting point. Based on the results presented in our paper, we would draw the conclusion that noisy input data is one clear way to increase the likelihood of hallucination. This is evident in the extraneous concept activations that arise as the transformer imposes structure on inputs with little true semantic content (Sections 3 & 4). However, there are likely many other means of inducing hallucinations (such as increased context length, certain topics, etc) [5]. We would consider the experiments presented in this paper to speak only to aleatoric uncertainty. The extraneous concepts activated for pure noise inputs are not a result of "lack of knowledge", but instead a result of the representational systems, shortcuts, and biases instilled in the transformer during its training regimen. This is also why we chose to test hallucinations induced during summarization, as this directly addresses the issue of confabulation despite all the necessary information to complete the task being provided in the prompt. In this way the evaluation of hallucinations is divorced from the inherent "knowledge" of the model and only tests if its internal representational structure of the input has some bearing on the faithfulness of its output.

[Q2] Our interpretation, which we attempt to put forward in the Conclusion, is that vision and language transformers alike attempt to impose structure on the input data even in the absence of true structure, as a function of inductive biases derived from massive pre-training. We do not notice a specific "noise" concept activated for the shuffled image examples. Instead, we qualitatively see a wide variety of concepts being activated, often with little relation to the true semantic label associated with the natural image.

[Q3] Thanks, we added a sentence in the conclusion, mentioning these key aspects as targets for future research. We would expect that increases in temperature and context length would be accompanied by increased hallucination, as there is more room for conceptual wandering to occur [6]. However, we note that for the hallucination experiments in the original text we set the temperature to 0 (as stipulated by the Vectara benchmark that we implemented).

[CLOSING] We are confident that the additional experiments suggested by the reviewer greatly strengthen the robustness and scope of the paper, and provide a more natural connection between the results presented in the main text. We will add these experiments to the Appendix.

[1] Huang et al. "A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions." 2023

[2] Lin et al. "TruthfulQA: Measuring how models mimic human falsehoods." 2021

[3] Joshi et al. "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension." 2017

[4] Reddy et al. "CoQA: A conversational question answering challenge." 2019

[5] Kang et al. "Unfamiliar finetuning examples control how language models hallucinate." 2024

[6] Liu et al. "Lost in the middle: How language models use long contexts." 2023

Comment

Thank you for your response, they have addressed my main concerns, and I maintain my score.

Review (Rating: 5)

This paper describes the use of sparse auto-encoders (SAEs) to probe conceptual representations at different layers of Transformer models, towards a mechanistic understanding of ‘hallucinations’ in these models. Three experimental setups are explored: SAEs trained on the activations of Transformers exposed to pure noise; SAEs trained on activations from Transformers exposed to real image and text data; and pre-trained SAEs.

A key finding of the paper is that Transformer models are prone to imposing semantic structure on meaningless inputs (perhaps akin to the phenomenon of ‘pareidolia,’ or the human tendency to perceive meaningful structures in random stimuli, e.g. seeing faces in clouds). Another finding is that perturbed inputs activate a greater number of concepts than unperturbed inputs in the middle layers of Transformer models, an effect that scales up with the degree of perturbation. Additionally, the findings suggest that hallucinations can be steered based on selective inputs to a model.

Strengths and Weaknesses

Strengths: The paper is well-organized and clear, with ample definitions and concise descriptions of relevant prior work. The paper provides an interesting and compelling approach to tracing hallucinations in Transformer models, with valuable findings that pave the way for future work in AI safety and interpretability, while complementing existing work. Appendix B offers a particularly valuable case study of how different kinds of features are activated in each layer of OpenCLIP ViT-B/32 exposed to pure noise, with increasing conceptual specificity as layers progress.

Weaknesses: Not aware of serious weaknesses in the paper. Just an observation: References [35] and [36] are the same paper (Maynez et al 2020).

Questions

None at this time.

Limitations

Yes

Justification for Final Rating

Having considered the other reviews & rebuttals, alongside my original review, I maintain my original score.

Formatting Concerns

None

Author Response

[GENERAL] We thank the reviewer for their thorough reading of the paper and their enthusiastic appraisal of our work.

[W1] We will merge references [35] and [36] as per the reviewer’s helpful comment.

Review (Rating: 4)

At a high level, the paper shows that transformer models create signals (concepts) out of noise. The middle layers contribute more to this aspect, and this contributes to the footprint of hallucination. More specifically, the paper trains a SAE to map residual activations to concepts. The paper performs multiple experiments to demonstrate the validity of the approach using both text and image models. (1) Semantic concepts are invoked by pure noise inputs [Section 3]. (2) Distorted inputs also invoke concepts [Section 4]. (3) Internal activations are also predictive of hallucinations [Section 5]. All the experiments leverage SAE-induced representations.

Strengths and Weaknesses

S1. The problems addressed in the paper (interpreting activation of transformer models) are important and the conclusions (that SAE-induced interpretable representations can be used to trace hallucinations) are plausible.

S2: The paper presents multiple experiments to validate the proposed framework.

W1. The details of modeling are missing in the main paper. Unfortunately, I am not able to find the appendix sections in the downloaded version. Hence it is impossible to provide an informed review.

W2. The experimental details (such as sizes of datasets and train/test splits) are missing.

Questions

  1. Define residual activation.

  2. What is the significance of 1.3M?

  3. Are d_i in section 2.2 learned or initialized?

  4. In section 2.2, x in LHS is of dimension d_model, but the RHS is of dimensions d_SAE since the unit vectors have that dimension. How does this all fit?

Limitations

Yes

Justification for Final Rating

It was my mistake not to see the appendix. This got resolved.

In addition, the authors have provided clear descriptions of the algorithm.

Hence I am increasing my score from 2 to 4.

Formatting Concerns

None

Author Response

[GENERAL] We thank the reviewer for confirming the validity of our experiments and conclusions. We appreciate that the reviewer finds the problems addressed in our paper important, and the conclusions we draw from our experiments to be sound. We also appreciate that the reviewer considers our experiments a proper validation for the mechanistic framework of hallucinations put forward in the paper.

[W1] We note that the Appendix is included as a PDF named “appendix.pdf” in the “Supplementary Material” zip file attached to the OpenReview submission. Within the Appendix, we have provided full details on model specifications, hyperparameters, and training procedures for each transformer model and SAE used in the paper (Appendix A). In addition to the Preliminaries section in the main text, which describes our modeling setup and motivations for each experiment, we have provided additional details for our hallucination experiments in Appendix E.

[W2] Directly addressing the reviewer's point, Appendix A provides the number of input image samples used to train the layer-wise vision SAEs (1.3M images from the ImageNet-1k training set), and the number of tokens used to train the layer-wise language SAEs (204.8M tokens from the FineWeb-Edu corpus). In Section 3 we state the number of validation image samples sourced from the ImageNet-1k validation set: 50,000. For better readability, we will add additional columns to the SAE hyperparameter tables in Appendix A corresponding to the size of the validation sets (50,000 image samples in the case of the vision SAEs, and around 5.12M tokens in the case of the language SAEs).

[Q1] In Section 2.1, we define the “residual stream activations” as follows:

For all transformer models, the object of investigation is the activations in the residual stream: the hidden state of the deep neural network, updated by attention and multi-layer perceptron (MLP) blocks at each layer. The residual stream is the core conduit for information flow and representation refinement, which layers read from and write to, in the transformer architecture.

However, we understand how this may be an unclear definition to readers less familiar with the recent mechanistic interpretability literature. More directly, the residual stream activations refer to the token-level embedding vectors extracted from the transformer model following each layer block [1][2][3]. We will add this additional clarification to Section 2.1.

[Q2] We believe that the reviewer is referencing the number of random Gaussian noise images sampled to train the noise SAE in Section 3. We chose 1.3M as a training set size for this experiment to mirror the size of the ImageNet-1k training set, which is used to train SAEs on natural image inputs for the experiments in Section 4. We will add an extra clarification to the appendix justifying this training set size in the noise setting.

[Q3] The $\mathbf{d}_i$ are learned. The $\mathbf{d}_i$ defined in Section 2.2 refer to the columns of the decoder matrix $D: \mathbb{R}^{d_\mathrm{SAE}} \to \mathbb{R}^{d_\mathrm{model}}$. All parameters of this decoder are learned in conjunction with the encoder, using the same loss signal defined in Section 2.2.

[Q4] We thank the reviewer for pointing out this oversight on our part. The dimensionality of both $\mathbf{d}_i$ and $\mathbf{b}$ should be $\mathbb{R}^{d_\mathrm{model}}$, not $\mathbb{R}^{d_\mathrm{SAE}}$. With these fixes, the equation for the reconstruction of the activation sample $\mathbf{x}$ is correct, as $f_i(\mathbf{x})$ is a scalar and we sum over all SAE dimensions $i$. We will update the paper to reflect this change, as well as clarify the role of the decoder and its training to address [Q3]. Here is the updated Section 2.2:

For each SAE formulation, we work with SAEs whose parameters are trained to produce the following approximation for a given transformer model activation vector $\mathbf{x} \in \mathbb{R}^{d_\mathrm{model}}$ at a particular layer:

$$\mathbf{x} \approx \sum_{i}^{d_\mathrm{SAE}} f_i(\mathbf{x})\,\mathbf{d}_i + \mathbf{b},$$

that is, each transformer layer’s intermediate activation can be approximated as a (sparse) linear combination of unit direction vectors $\mathbf{d}_i \in \mathbb{R}^{d_\mathrm{model}}$ scaled by concept activation coefficients $f_i(\mathbf{x}) \geq 0$, with $\mathbf{b} \in \mathbb{R}^{d_\mathrm{model}}$ being the bias term. Practically speaking, this SAE is a simple neural network with a learned encoder $F: \mathbb{R}^{d_\mathrm{model}} \to \mathbb{R}^{d_\mathrm{SAE}}$ projecting each transformer activation into a sparse concept space, typically of much larger dimensionality than the transformer model activation space. For instance, the ViT embedding dimensionality is $d_\mathrm{model} = 768$ with the corresponding SAE concept space dimension being $d_\mathrm{SAE} = 49{,}152$. Conversely, the learned decoder $D: \mathbb{R}^{d_\mathrm{SAE}} \to \mathbb{R}^{d_\mathrm{model}}$ component of the SAE aims to reconstruct the original transformer residual stream activation solely from this pattern of sparse concept activations.
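For readers who prefer to see the shapes, here is a minimal sketch of the decomposition above, under assumptions: random placeholder weights stand in for trained SAE parameters, and a reduced $d_\mathrm{SAE}$ keeps the toy example light.

```python
# Minimal sketch: the reconstruction x ≈ Σ_i f_i(x) d_i + b written out directly.
# Random placeholders stand in for trained SAE parameters; the paper's ViT
# setting uses d_model = 768 and d_SAE = 49,152 (smaller d_SAE here).
import numpy as np

d_model, d_sae = 768, 4096
rng = np.random.default_rng(0)
F = rng.normal(size=(d_sae, d_model)) * 0.01   # encoder weights
D = rng.normal(size=(d_model, d_sae)) * 0.01   # decoder weights; columns d_i are concept directions
b = np.zeros(d_model)                          # decoder bias

x = rng.normal(size=d_model)                   # one residual stream activation
f = np.maximum(F @ x, 0.0)                     # concept activations f_i(x) >= 0
x_hat = D @ f + b                              # sparse linear combination of the d_i
n_active = int((f > 0).sum())                  # number of "activated" concepts (the L0 count)
```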

[CLOSING] Experimental details provided in the Appendix, as well as the clarifications of terminology and math notation detailed above should address the reviewer’s outstanding concerns. Thanks to the reviewer, the additional clarity and modeling details will greatly improve the readability of the paper.

[1] Elhage et al. "A mathematical framework for transformer circuits." 2021

[2]Elhage et al. "Privileged bases in the transformer residual stream." 2023

[3] Cunningham et al. "Sparse autoencoders find highly interpretable features in language models." 2023

Comment

My apologies for missing the supplementary section. I had downloaded the zip file with all the papers to be reviewed. In the past, this zip file had supplementary material but not this time. In addition, a few other papers had supplementary content in the main pdf itself. My apologies again.

I thank the authors for responding to my questions.

The appendix does provide context for the 1.3M number mentioned in the paper (and clarified by the authors in their response). Having said that, the appendices contain details of the PLS model and not the details of the main model. The main section of the paper is not self-contained (and reads somewhat like a blog about a technical paper) and the terminology and details are not specified. This is highlighted by reviewer pJz5 too, while the other two reviewers have mentioned that the paper is well-written. I think this is the main drawback of the paper.

I am open to revising the scores if the authors are able to provide a crisp and self-contained description of the model.

Comment

We are happy to provide concrete details for our modeling architectures and analysis setup, which we will summarize in our revised NeurIPS paper. We recognize that we could have made more explicit the distinction between the model under study (the transformer) and the experimental tool we use to explore it (the sparse autoencoder).

The “main models” of study in this paper are OpenCLIP ViT-B/32 (vision transformer) and Pythia-deduped-160M / Gemma-2B-IT (language transformer). Note that these transformer models were not trained or finetuned as part of our work, but were instead sourced as pre-trained models from Hugging Face:

| Model | Input Data Modality | Layer Count | Embedding Dimensionality (per token) | Max Context Length | Pre-training Dataset | Pre-training Dataset Size | Pre-training Objective Function |
|---|---|---|---|---|---|---|---|
| OpenCLIP ViT-B/32 | Vision | 12 | 768 | 224×224 px (49 spatial tokens + 1 CLS) | LAION dataset | 1.4B image–text pairs | Contrastive supervision |
| Pythia-deduped-160M | Language | 12 | 768 | 2048 tokens | The Pile | 300B tokens | Next-token prediction |
| Gemma-2B-IT | Language | 18 | 2048 | 8192 tokens | Google-curated web, code, and math sources | 6T tokens | Next-token prediction |

At the output of each transformer layer, the token embeddings $\mathbf{x} \in \mathbb{R}^{d_\mathrm{model}}$ are known as “residual stream activations” in the LLM explainability literature. These dense embeddings are hard to interpret directly due to entangled (“overlapping”) internal representations. We therefore use sparse autoencoders (SAEs) to map residual activations into a higher-dimensional sparse concept space, $\mathbb{R}^{d_\mathrm{model}} \to \mathbb{R}^{d_\mathrm{SAE}}$ with $d_\mathrm{SAE} \gg d_\mathrm{model}$, i.e., an overcomplete latent space rather than a bottleneck. The encoder applies a linear map followed by ReLU: $f(\mathbf{x}) = \mathrm{ReLU}(F\mathbf{x})$. The entries of $f(\mathbf{x}) \geq 0$ constitute the concept activation vector: component $f_i$ reports the presence and magnitude of concept $i$ contained in the activations $\mathbf{x}$. The linear decoder $D$ reconstructs the activation, $\hat{\mathbf{x}} = Df(\mathbf{x}) + \mathbf{b}$, whose columns $\mathbf{d}_i$ define the concept directions. We train SAEs end-to-end with the loss

$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda\,\|f(\mathbf{x})\|_1,$$

where the first term ensures accurate reconstruction of the original input activation and the $L_1$ term (tunable hyperparameter $\lambda$) encourages sparsity in the concept space. This yields a representation in which each residual activation is expressed as a sparse linear combination of learned concepts. (Shapes: encoder $F \in \mathbb{R}^{d_\mathrm{SAE} \times d_\mathrm{model}}$, decoder $D \in \mathbb{R}^{d_\mathrm{model} \times d_\mathrm{SAE}}$, bias term $\mathbf{b} \in \mathbb{R}^{d_\mathrm{model}}$.)
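For concreteness, the following is a minimal sketch, under assumptions, of an SAE of this form in PyTorch. It is not our exact training code; the sparsity weight `lam` is a placeholder, and the dimensions default to the ViT setting above.

```python
# Minimal sketch: a vanilla SAE with ReLU encoder f(x) = ReLU(Fx), linear
# decoder x_hat = D f(x) + b, and the reconstruction + L1 sparsity loss above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_sae: int = 49_152):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae, bias=False)  # F
        self.decoder = nn.Linear(d_sae, d_model, bias=True)   # D and bias b

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # concept activations, f(x) >= 0
        x_hat = self.decoder(f)           # reconstruction D f(x) + b
        return x_hat, f

def sae_loss(x, x_hat, f, lam: float = 1e-3):
    # squared-error reconstruction term plus L1 penalty on the concept
    # activations, averaged over the batch
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + lam * sparsity
```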

To address our goal of unpacking the internals of our transformer models (cf. table above), each SAE is trained independently on a single transformer layer’s residual stream activations—e.g., for a 12 layer transformer, we train 12 distinct SAEs on the residual stream activations output from the corresponding layer. In this work, we train 3 sets of SAEs, and source one set of pre-trained SAEs from Hugging Face.

Below, we provide the training and evaluation details for each set of SAEs. Specific hyperparameters for training all SAEs can already be found in Appendix A.

Data used to fit SAEs

[Section 3]: Concepts identified in noise activations from OpenCLIP ViT-B/32. SAEs trained on 1.3M i.i.d. Gaussian noise image (224×224) activations, validation on 50k ImageNet-1k images, 12 SAEs for all 12 layers. dSAE=49,152d_\mathrm{SAE}=49,152.

[Section 4]: More concepts from transformer activations of increasingly disordered inputs. OpenCLIP ViT-B/32—trained on 1.3M ImageNet-1k activations, validation on 50k ImageNet-1k activations with shuffled patches of various sizes, 12 SAEs for all 12 layers. Pythia-deduped-160M—trained on 204.8M FineWeb-Edu tokens of normal text, validation on 5.12M FineWeb-Edu tokens with shuffled nn-grams of varying nn, 12 SAEs for all 12 layers. dSAE=24,576d_\mathrm{SAE}=24,576.

[Section 5]: Hallucination prediction and suppression from concepts identified in activations of Gemma-2B-IT. SAEs pre-trained on 1.23B token activations from FineWeb, validation on 1,006 Vectara benchmark document activations, pre-trained SAEs sourced from Hugging Face for layers 1,7,11,13,171,7,11,13,17. dSAE=16,384d_\mathrm{SAE}=16,384.

We appreciate the careful review. We believe that this compact description of our main modeling pipeline will improve reader clarity.

Comment

Thanks. I will be updating my review.

Comment

We thank reviewer LyaA for confirming that they will be updating their review and being open to reconsidering their initial score.

Comment

We wish to thank the reviewer again for engaging in discussion and providing feedback. If our answers to your questions were satisfactory, we would be extremely grateful if you could consider increasing your score before the deadline. Please feel free to ask additional questions; should time allow, we would be more than happy to provide further clarifications.

Final Decision

The paper explores how hallucinations arise in transformer models by using sparse autoencoders (SAEs) to analyze internal activations under input uncertainty. The core finding is that under input uncertainty, transformers tend to activate "input-agnostic conceptual representations" which are robustly triggered even by pure noise, thereby increasing the likelihood of hallucinated output.

The authors' rebuttal convinced the reviewers, leading to positive final ratings (5/4/5/4). This work is considered a valuable contribution to the field of mechanistic interpretability and AI safety, and is recommended for acceptance.