SAE-V: Interpreting Multimodal Models for Enhanced Alignment
We propose SAE-V, a mechanistic interpretability framework for multimodal large language models that analyzes multimodal features and enhances their alignment.
Abstract
Reviews and Discussion
The paper proposes SAE-V, a framework that utilizes SAEs trained on top of multimodal large language models (MLLMs) to measure image-text alignment. Specifically, for a given SAE feature, it retrieves the top activating tokens and image patches and computes their cosine similarity score, which produces an alignment metric for a single dataset sample. The paper evaluates this metric for applications like image patch filtering and dataset filtering.
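To make the described metric concrete, here is a minimal sketch of how such a per-sample alignment score could be computed from precomputed SAE feature activations. This is not the authors' implementation: the function and variable names, the top-k pooling, and the averaging over features are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_alignment_score(text_feats, image_feats, sae_acts_text, sae_acts_image, k=16):
    """Illustrative sketch (not the authors' code) of a cross-modal alignment score.

    text_feats:      (n_text, d)          hidden states of text tokens
    image_feats:     (n_img, d)           hidden states of image patches
    sae_acts_text:   (n_text, n_features) SAE feature activations for text tokens
    sae_acts_image:  (n_img, n_features)  SAE feature activations for image patches
    """
    scores = []
    # Consider features that fire on both modalities for this sample.
    active = (sae_acts_text.max(0).values > 0) & (sae_acts_image.max(0).values > 0)
    for f in torch.nonzero(active).flatten():
        # Top-k activating text tokens and image patches for feature f.
        top_text = sae_acts_text[:, f].topk(min(k, text_feats.shape[0])).indices
        top_img = sae_acts_image[:, f].topk(min(k, image_feats.shape[0])).indices
        # Cosine similarity between the mean representations of the two sets.
        t = F.normalize(text_feats[top_text].mean(0), dim=-1)
        v = F.normalize(image_feats[top_img].mean(0), dim=-1)
        scores.append(torch.dot(t, v))
    # Average over features -> one alignment score per (image, text) sample.
    return torch.stack(scores).mean() if scores else torch.tensor(0.0)
```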
Questions For Authors
N/A
Claims And Evidence
- Claim 1. SAE-V is superior to SAE in reconstruction quality.
- The presentation of this section should be improved; it is hard to evaluate the validity of the claim without further explanation.
- Figure 3 is missing a description of SAE vs. SAE-V. Is it that SAE is trained only on text features while SAE-V is trained jointly on text and image features? Is SAE evaluated on both text and image features, or only on text features? I don’t understand “Original” — shouldn’t the original feature achieve a reconstruction loss of 0?
- Claim 2. SAE-V can be used for image patch filtering.
- This experiment is quite novel and interesting, but more details could be provided.
- In L265, how is this evaluation performed? What is the size of the ImageNet test set, what MLLM is being evaluated here, how is it being evaluated (is it a comparison between the MLLM output vs. the ground truth ImageNet class)? In Figure 6, the y-axis is labeled as “loss value” but the caption states that it is “classification accuracy” — what is being plotted here? How is the masking performed; are the features of those image patches just zeroed out?
- Claims 2 and 3 are also missing a comparison against a non-SAE baseline. For example, you could compute the alignment metric by computing the cosine similarity of the original text token and image patch features, without any SAE projection. This would better illustrate why SAE-V training is necessary, beyond a simple training-free baseline.
- Claim 3. SAE-V can be used for dataset filtering.
- The experiment is well motivated and interesting, but the presentation is confusing.
- In Figure 7, what is the performance score metric exactly — is it some classification accuracy or a loss value? While it is trained on Align-Anything, is it also evaluated on a held out subset of Align-Anything, and what is the size of this subset? The figure caption states that the y-axis is scaled according to the “full dataset’s performance” — why at the 100% data percentage is the performance score ~96%, not 100%?
- For the comparison in L371, beyond IFD I would recommend the paper also include a CLIP baseline (i.e., taking the top percentage of samples based on highest CLIP scores). This CLIP baseline is already explored in prior work such as in [1]. To this end, I would also disagree with L375, which states there are “no widely recognized data filtering methods specifically designed for multimodal data.”
[1] Gadre et al. DATACOMP: In search of the next generation of multimodal datasets. NeurIPS 2023.
Methods And Evaluation Criteria
See “Claims And Evidence” above.
Theoretical Claims
N/A
Experimental Design And Analysis
See “Claims And Evidence” above.
Supplementary Material
Yes, I looked at the Supplementary.
Relation To Broader Literature
The paper proposes a novel use case for SAEs for measuring image-text alignment in multimodal models. To this end, they also introduce image patch filtering and data filtering as evaluation tasks. Both the method and evaluation tasks have not been explored in the context of MLLMs.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Strengths
- The proposed alignment score is well-motivated and interesting.
- The proposal of image patch filtering as an evaluation task is very unique and compelling. Future work in multimodal alignment would also benefit from this framework.
- The dataset filtering evaluation is sensible and a practical example of how SAE-V is useful.
Weaknesses
- The presentation of the experiments needs a lot of work. Many of the figures and experimental setup are unclear and missing key details; also see “Claims And Evidence” above.
Overall, I like the premise of the paper but the presentation is poor. I am open to revising my score if the authors are able to address my questions and clarify details regarding the experiments.
Other Comments Or Suggestions
- L160 typo, “donated” should be “denoted”
Your suggestions are insightful and will enhance the completeness of our paper!
We used all available resources and devoted our efforts to conducting additional experiments. We address each of your concerns below and will add the changes to the revision. If this rebuttal addresses your concerns, we earnestly ask you to consider raising the score and supporting our paper for acceptance.
Claim 1 & Weakness: We pruned the overall presentation of the paper to reduce repetitive expressions and added necessary background.
Limited by the length of the rebuttal, we cannot list all the changes we made, so only key examples are provided, e.g.:
- Lines 245-253:
The transferability of SAEs between foundation models and instruction-tuned models has been extensively investigated in text-only contexts [3][4][5], as it demonstrates whether SAEs can capture universal semantic features within LLMs. Similarly, the transferability from MLLMs to their corresponding LLMs serves as a critical metric for the quality of features learned by SAE-V. (Followed by the original content.)
- Fig. 7 caption:
We evaluated the SAE-V data filtering method on the LLaVA-NeXT-7B model, the Align-Anything dataset, and the LLaVA-Bench benchmark. The results show that all SAE-V-based methods significantly outperform the random selection baseline; the cosine similarity filter achieves 108% of the full dataset's performance with only 20% of the data, and the co-occurrence filter peaks at 50% of the data with a score of 108.17.
To address the specific concerns:
- SAE only accepts textual tokens, while SAE-V is designed for both text and image tokens. When evaluating with multimodal input, SAE can only reconstruct text features, whereas SAE-V reconstructs both text and image features, achieving lower reconst. loss (Fig. 4, Col. 3). Moreover, even when evaluating on text tokens, SAE-V surpasses SAE in reconstruction capability (Fig. 4, Col. 1-2), showing effective cross-modal generalization.
- Regarding the "Original" bar in Fig. 3: The reconst. loss measures how well the model predicts the next token compared to ground truth. The original model's predictions are also probability distributions, not exact predictions, which is why the "Original" has non-zero reconst. loss.
Claim 2: We provide additional details and conduct an extra ablation comparing SAE-V with a training-free baseline.
Regarding details of the experiment in Section 3.2:
- Pipeline: Given the ImageNet validation set (1000 samples), we score the 24×24 image patches according to the SAE-V feature metrics (Fig. 6 legend), then filter out patches according to the score and the given ratio (x-axis in Fig. 6). We use LLaVA-NeXT-7B to classify the filtered image and report its classification accuracy (not a loss value) as the y-axis in Fig. 6; a sketch of this step follows after this list.
- Target: This experiment supports the claim that SAE-V preserves the key information in images: the higher the accuracy, the more key information is preserved in the remaining patches. For example, in Fig. 5, SAE-V preserves the most relevant information (the dog) in the image.
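A minimal sketch of the filtering step described above, assuming masking is implemented by zeroing the pixels of the dropped patches (the thread does not specify whether pixels or features are zeroed, so that choice, the grid/patch sizes, and the function name are illustrative assumptions).

```python
import torch

def mask_low_score_patches(pixel_values, patch_scores, mask_ratio, grid=24, patch=14):
    """Hypothetical sketch of the patch-filtering step: zero out the lowest-scoring
    image patches and keep the rest for the downstream MLLM classification pass.

    pixel_values: (3, H, W) image tensor, with H = W = grid * patch
    patch_scores: (grid * grid,) per-patch scores from an SAE-V metric
    mask_ratio:   fraction of patches to remove (x-axis of Fig. 6)
    """
    n_mask = int(mask_ratio * patch_scores.numel())
    if n_mask == 0:
        return pixel_values
    drop = patch_scores.topk(n_mask, largest=False).indices  # lowest-scoring patches
    out = pixel_values.clone()
    for idx in drop.tolist():
        r, c = divmod(idx, grid)
        out[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out
```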
We added an additional baseline to this experiment, filtering patches by the attention scores between image patches and the original text tokens. Results:
| Masking (%) | 0 | 25 | 50 | 75 |
|---|---|---|---|---|
| Attn. Score | 0.9020 | 0.8770 | 0.8200 | 0.6930 |
| SAE-V Cos. | 0.9020 | 0.8670 | 0.8110 | 0.6630 |
SAE-V achieves comparable performance using its reconstructions instead of the MLLM's original activations.
Claim 3: We updated the presentation and added additional baselines.
Sorry for the ambiguous figure. To clarify, we used LLaVA-Bench [2] to report the MLLM performance, and the y-axis is not a percentage but the exact LLaVA-Bench score. To demonstrate that our method is superior regardless of benchmark selection, we ran an ablation study over benchmarks. For more details, see the rebuttal of Reviewer k2no point 1.
As for data selection, we included the CLIP baseline and added a paragraph discussing the related work [1]. The experiment results are as follows:
| Filter Method | LLaVA-Bench @ 0% | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|
| CLIP | 94.2 | 99.3 | 102.9 | 102.6 | 103.8 | 95.8 |
| SAE-V Cosine Similarity | 94.2 | 104.1 | 103.8 | 100.4 | 101.1 | 95.8 |
| Random | 94.2 | 99.6 | 98.4 | 97.6 | 93.5 | 95.8 |
The experiment shows that although the peak performance is close, CLIP needs 4 times as much data as SAE-V to reach its peak, showing the effectiveness of SAE-V.
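For reference, a hedged sketch of this kind of CLIP-score filtering baseline (in the spirit of DataComp [1]); the checkpoint choice, batch handling, and helper names are illustrative assumptions, not the setup used in the experiments above.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; any CLIP variant could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(images, texts):
    """Cosine similarity between CLIP image and text embeddings, one score per pair."""
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    img = torch.nn.functional.normalize(
        model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    txt = torch.nn.functional.normalize(
        model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"]), dim=-1)
    return (img * txt).sum(-1)

def keep_top_fraction(samples, scores, frac):
    """Keep the top `frac` of samples ranked by their CLIP score."""
    k = max(1, int(frac * len(samples)))
    order = torch.argsort(scores, descending=True)[:k]
    return [samples[i] for i in order.tolist()]
```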
Reference
[1] Gadre et al., DATACOMP: In search of the next generation of multimodal datasets, 2023.
[2] Liu et al., Visual instruction tuning, 2023.
[3] Kissane et al., SAEs (usually) transfer between base and chat models, 2024.
[4] Taras et al., Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?, 2024.
[5] Gallifant et al., Sparse autoencoder features for classifications and transferability, 2025.
Thank you for the clarifications and experiments with additional baselines.
- The new result in Claim 2 shows that SAE-V is able to reproduce the performance of the raw activations, or attention score baseline.
- The new result in Claim 3 shows that SAE-V is competitive in dataset filtering against CLIP, mainly at the low data regime with 20% data.
Now that I better understand the experimental setup with these additional clarifications, the applications of image patch filtering and dataset filtering seem less strong, as SAE-V is mainly reproducing the behavior of the raw activations. The results would be much more convincing if the paper could show an application that cannot be done with raw activations, such as the discovery of features that represent a specific concept. For example, [1] shows that text-only SAEs can identify a feature that activates most highly on "parts of individual names, especially last names."
[1] Huben et al. Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR 2024.
Thanks for your feedback, and we have added additional experiments accordingly.
Thank you for the review and your insightful feedback. We appreciate your comments on our rebuttal, and added additional examples accordingly. If this rebuttal addresses your concerns, we earnestly ask you to consider raising the score and supporting us for acceptance.
The results would be much more convincing if the paper could show an application that cannot be done with raw activations, such as the discovery of features that represent a specific concept.
We'd like to highlight that we did demonstrate SAE-V's capability to discover interpretable features that represent specific concepts in our rebuttal to Reviewer b89F W1. Furthermore, after receiving your comments, we used all available resources to build a multimodal neuronpedia on our LLaVA-NeXT-7B SAE-V. Due to time constraints, we haven't implemented a frontend interface, but we have found compelling examples that fulfill your requirements.
Example 1: Doberman dogs
Rebuttal Figure 1 shows the #44031 feature of SAE-V on LLaVA-NeXT-7B with consistent semantic meaning related to "Doberman dogs" across text and image modalities. This feature demonstrates SAE-V's ability to identify specific concepts with concrete physical meanings.
Example 2: Symmetry
Rebuttal Figure 3 shows the #11105 feature of SAE-V on LLaVA-NeXT-7B with consistent semantic meaning related to "Symmetry" across different modalities. We found that this feature is not tied to a single physical entity or relationship. In fact, it activates simultaneously in images with left-right symmetry, top-bottom symmetry, and central symmetry, and its activation areas in images align consistently with the symmetry patterns.
We believe this type of abstract semantics cannot be captured by probes based on raw activations. It demonstrates that SAE-V can discover features representing specific abstract concepts beyond physical entities or physical relationships.
Conclusion
Overall, these examples show that unlike methods based on raw activations, SAE-V identifies both concrete concepts (Doberman dogs) and abstract patterns (symmetry) with semantic consistency. We sincerely hope that these two examples eliminate any concerns you may have about our work.
We commit to including the above examples in the camera-ready version and building a multimodal neuronpedia based on LLaVA-NeXT-7B SAE-V to showcase more similar examples. We would like to emphasize again that if you feel your concerns have been addressed, we would greatly appreciate your consideration in raising our score and supporting our paper for acceptance.
- This paper straightforwardly extends the SAE framework to MLLMs, calling it the SAE-V framework.
- The authors introduce the cosine-similarity score as the cosine similarity between the TopK activated image and text features for a given input, based on SAE activations.
- Based on the cosine-similarity scores, the authors filter training datasets for MLLMs and find a correlation between performance and the average cosine-similarity score of a filtered dataset.
- Using filtering, they find that only a fraction of the data can boost performance.
- SAE-V extends well to LLMs.
update after rebuttal
I will keep my rating as "weak accept" due to:
- the authors test only on LLaVA-Bench and MME, which are not great benchmarks. I would have liked to see more benchmarks like MMStar, POPE, etc.
- Pretraining a 7B LLaVA-NeXT model does not take hundreds of GPU-days since you use the pretrained LLM; you only do the multimodal pretraining and instruction fine-tuning, which should take no more than a week for the 7B model. I would have liked to see that experiment.
Questions For Authors
N/A
Claims And Evidence
- The authors claim SAE-V can identify important semantic patches in the image, which is validated.
- The data filtration technique is also shown to work well.
Methods And Evaluation Criteria
- Yes, using LLaVA-NeXT and Chameleon for experiments makes sense.
- I am unsure what benchmarks authors use to report the MLLM performance, so I would like clarification.
Theoretical Claims
- N/A
Experimental Design And Analysis
- Yes, the training and evaluation of SAE-V models seem okay.
- The dataset used to train the MLLM also seems fine.
- One question I have for the authors: why do they only fine-tune the LLaVA-NeXT-7B model rather than pretraining and fine-tuning the model from scratch? This is important for understanding how dataset filtering affects different training stages.
Supplementary Material
Code looks okay
Relation To Broader Literature
SAEs are an established technique in LLMs, and extending them to MLLMs is natural and of interest to the community.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
The paper doesn't seem to have any major issues except that its scope seems quite limited:
- Extending SAE to MLLMs is nothing technically innovative.
- The findings about the cosine-score and performance are interesting, but I'd have liked to see more models and more datasets being used for analysis.
Still, I believe it's a good paper that deserves a weak accept, but a more thorough analysis section on how the findings can be used by the community would make it a very good paper.
Other Comments Or Suggestions
N/A
Thanks for your valuable suggestion!
During the rebuttal period, we used all available resources and devoted our efforts to conducting additional experiments. We address each of your concerns below and will add the changes to the revision. If this rebuttal addresses your concerns, we earnestly and kindly ask you to consider raising the score and supporting our paper for acceptance.
point 1
I am unsure what benchmarks authors use to report the MLLM performance, so I would like clarification.
We used LLaVA-Bench [1] to report the MLLM performance, and we also tested our method on the MME benchmark. Here are the results on LLaVA-NeXT-7B and the Align-Anything dataset:
MME:
| Filter Method | MME Score @ 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SAE-V Cosine Similarity | 1233.87 | 1088.48 | 1115.41 | 1161.13 | 1211.46 | 1360.43 | 1246.26 | 1231.91 | 1276.10 | 1148.70 | 1246.26 |
| Random | 1233.87 | 1069.52 | 1132.25 | 1072.34 | 1156.12 | 1220.76 | 1289.27 | 1292.73 | 1198.32 | 1217.40 | 1246.26 |
This result demonstrates that the superiority of SAE-V is not affected by the choice of benchmark.
point 2
Why do they only fine-tune the LLaVA-NeXT-7B model rather than pretraining and fine-tuning the model from scratch? This is important for understanding how dataset filtering affects different training stages.
We appreciate the reviewer's suggestion about pretraining from scratch. While we fully agree this would provide valuable insights into how our filtering method affects different training stages, the computational resources required for pretraining MLLMs from scratch are prohibitively expensive for an academic research team like ours.
Pretraining even a 7B-parameter model requires hundreds of GPU-days on A100/H100 GPUs, which is unfortunately beyond our current resource constraints. Instead, we focused our experiments on fine-tuning existing models, which still demonstrates the effectiveness of our approach while being computationally feasible (for more details supporting this claim, see the rebuttal of Reviewer b89F W3&W5 and Reviewer k2no point #4). We commit to conducting at least one pretraining-stage experiment in future versions of our paper.
point 3
Extending SAE to MLLMs is nothing technically innovative.
While extending SAE to MLLMs may appear straightforward, our work contributes several key innovations:
- Cross-modal feature analysis: We developed novel methods to identify and analyze features that capture cross-modal interactions (Section 2.2).
- Interpreting Multimodal Alignment: Our framework provides unique insights into how MLLMs integrate information across modalities during the alignment process. As shown in Section 3.1.2, SAE-V reveals patterns in feature distribution that directly correspond to model performance on multimodal understanding tasks.
- Self-guided data filtering: Our paper makes the first attempt to use mechanistic interpretability methods to perform multimodal data filtering using the model's own representations.
Our work provides additional insights and extends the practical applications of multimodal interpretability methods, which is also the main contribution of our paper.
point 4
The findings about the cosine-score and performance are interesting, but I'd have liked to see more models and more datasets being used for analysis.
We conducted experiments on larger models (LLaVA-NeXT-Vicuna-13B) and datasets (MMInstruct) during rebuttal period.
For experiment on larger models, see the rebuttal of Reviewer b89F W3&W5.
For experiments on larger datasets, we selected MMInstruct [2], which contains 200k samples and is 4 times larger than the Align-Anything dataset. Based on MMInstruct, we applied SAE-V-based data filtering and alignment to Chameleon-7B; the results are shown below:
| Filter Method | LLaVA-Bench Performance @ 0% | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|
| SAE-V Cosine Similarity | 42.6 | 48.1 | 57.6 | 61.2 | 54.7 | 52.3 |
| Random | 42.6 | 47.4 | 46.8 | 51.2 | 54.8 | 52.3 |
It demonstrates that SAE-V paradigm could scale up to larger models and datasets, while maintaining its performance.
Reference
[1] Liu et al. Visual instruction tuning, 2023.
[2] Liu et al. MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity, 2024.
This work aims to improve the vision-language alignment performance of multimodal foundation models by fine-tuning data selection and filtering with interpretable tools, i.e., an improved SAE. Specifically, it uses the alignment scores between selected TopK vision-language tokens determined by the SAE to select a subset of the fine-tuning dataset, and then performs a series of investigations. The experiments examine the reconstruction error trend w.r.t. dataset size and different data filtering strategies and metrics. The results show that the proposed approach can improve performance w.r.t. the entire-dataset baseline and slightly improve performance compared with the previous filtering approach.
Questions For Authors
I will raise scores if some of my key questions are addressed.
Claims And Evidence
The "Interpreting" in the title is somewhat concerning, since most interpretability experiments are reconstruction probing and limited case studies. I would suggest the authors either rephrase the title or add additional explainability experiments to support it. At first glance, the title suggests the paper must include a comprehensive set of qualitative, explainable vision-language alignment results.
Methods And Evaluation Criteria
The method presentation is unclear. What is the function in its representation — is it an SAE encoder, a decoder, or the entire network? And what is the other quantity in its representation, whose shape is left undefined?
Theoretical Claims
Some minor mistakes:
- Eq 1, I think the matrix multiplication order is wrong.
Experimental Design And Analysis
All related questions are included in Other Strengths And Weaknesses.
Supplementary Material
All.
Relation To Broader Literature
I understand the ICML reviewing policy on concurrent work. However, considering how fast the community is moving and thinking in a broader sense, I believe the significance of this work might be affected by this recent paper:
Sparse Autoencoders Can Interpret Randomly Initialized Transformers, which points out that reduced reconstruction loss cannot guarantee the meaningfulness of learned patterns or features.
On the other hand, the improved alignment performance has been validated, which demonstrates the effectiveness of the SAE-based approach. However, an in-depth investigation of the underlying working mechanism is lacking.
The proposed SAE-based data filtering approach is purely empirically driven. It is understandable that the experiments are not large-scale due to the high computational cost, but whether and how this method would generalize to larger-scale models and datasets is unclear, with its underlying working mechanism left a mystery.
Essential References Not Discussed
Missing a reference: *Large Multi-modal Models Can Interpret Features in Large Multi-modal Models*, which is the first mechanistic interpretability work using SAEs on VL foundation models. I think this may relate to the significance of the first claimed contribution in this paper. However, it is still acceptable to ignore, since it is not formally published.
Other Strengths And Weaknesses
strengths:
- The overall method is novel in some senses, intuitive and simple, and effective.
- The results are strong to support the effectiveness of the proposed approach.
weaknesses:
- This paper may not be easy to follow for those unfamiliar with mechanistic interpretability, due to the lack of an introduction to the motivation behind the operations or experimental settings. For example, I think it is necessary to introduce the motivation and benefits of studying the model transferability of SAEs (line 245) in the multimodal case, or at least give some references, instead of repetitive descriptions of the phenomenon (lines 245-253). I assume the readers come from intersecting fields (both VL alignment and mechanistic interpretability).
- The overall presentation of this paper still needs to be refined. As mentioned above, some expressions are uninformative and even repetitive; please try to condense the sentences.
- Some experiments only present findings, results, and speculations, and lack in-depth investigation and analysis beyond unveiling the trend. For example, in Section 3.2, how would the classification accuracy scale with reduced tokens? Comparing compression at the token level and at the image level, how large is the gap? How interpretable is the approach beyond the reconstruction error? Experiments on more fine-tuning datasets are appreciated.
- Some questions related to the methodology: I find the variation of performance w.r.t. the data percentage significant, so how should one adaptively select the hyperparameter threshold and percentage in practice when the fine-tuning dataset is very large (the original dataset is 400K)? Do we need a held-out validation set and multiple training rounds to search for the parameters? We typically want to avoid that, since the original goal is to train on only a small subset of the FT dataset. Besides, when we search for optimal or near-optimal hyperparameters, a reasonable strategy is to look ahead only a few steps of the proportion (since we want to avoid exhaustive search); in this case, how would your findings guide the practice, e.g., a naive elbow algorithm?
- The paper lacks theoretical insights, foundations, or guarantees supporting the method, which I think can be fine.
question:
- Why is Eq. 7 pair-wise cosine rather than point-to-point cosine (fully connected bipartite graph)?
Other Comments Or Suggestions
None.
Despite some misunderstandings, we conducted more experiments to address your valuable concerns.
We conducted additional experiments, addressed your comments, and will add them to the revision. If this rebuttal addresses your concerns, we kindly ask you to consider raising the score.
Methods & Evaluation: We added an additional paragraph to clarify this potential confusion.
In Eq. 1, the operation is defined as the encoding operation of SAE-V; its output is the feature activation matrix, whose first dimension is the length of the input to SAE-V and whose second dimension is the number of features in SAE-V. Each row of this matrix represents the activation of a specific token across all SAE-V features.
Theoretical Claims: We sincerely apologize for the mistake.
Thanks for pointing this out! Based on our definitions, we have corrected the matrix multiplication order in Eq. 1 and updated the paper accordingly.
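For readers unfamiliar with the notation, here is a generic sketch of the standard SAE encode/decode formulation referenced above. The exact form of the paper's Eq. 1 (including the corrected multiplication order) may differ, so treat this only as an assumption-laden illustration, not the paper's definition.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic SAE sketch, common in the SAE literature (not necessarily the paper's Eq. 1).

    Encoder: f(x) = ReLU((x - b_dec) @ W_enc + b_enc), shape (n_tokens, n_features)
    Decoder: x_hat = f(x) @ W_dec + b_dec,             shape (n_tokens, d_model)
    """
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        # x: (n_tokens, d_model) -> feature activations: (n_tokens, n_features)
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def forward(self, x):
        acts = self.encode(x)
        x_hat = acts @ self.W_dec + self.b_dec  # reconstruction: (n_tokens, d_model)
        return x_hat, acts
```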
Broader Literature: We perform experiments to show the scalability of SAE-V empirically.
Regarding [1], we acknowledge that reduced reconstruction loss alone cannot guarantee meaningful features. However, our work empirically validates SAE-V through image patch filtering (Section 3.2) and data filtering & alignment (Section 4).
While the theoretical analysis of underlying mechanism remains an open question for future work, we conducted experiments to show that SAE-V has the potential to scale up to larger models and datasets. For more details, see the rebuttal of Reviewer b89F W3&W5 and Reviewer k2no point #4.
References: We acknowledge the novelty of [2], though we are making distinct contributions.
We acknowledge that [2] is a pioneering effort in applying SAEs to the mechanistic interpretability of VLMs, and we have modified our claims accordingly and added [2] to the related-work subsection. Compared with [2], our work provides additional insights and extends the practical applications of mechanistic interpretability. For more details, see the rebuttal of Reviewer k2no point #3.
W1&2: We pruned the overall presentation of the paper to reduce repetitive expressions and added necessary background.
Thanks for your suggestions! We have modified the paper accordingly. For representative examples, see the rebuttal of Reviewer o9Vp Claim 1.
W3: We conducted a deeper analysis of the underlying mechanisms.
We replicated the patch filtering experiments in Section 3.2 on VQA tasks (using A-OKVQA val. set with LLaVA-NeXT-7B), examining both text tokens and image patches:
| Masking (%) | 0 | 25 | 50 | 75 |
|---|---|---|---|---|
| Text Acc. (%) | 80.2 | 77.7 | 66.5 | 53.3 |
| Image Acc. (%) | 80.2 | 78.8 | 78.4 | 70.7 |
Key Findings
- Compression rate: Image information demonstrates lower compression rate (more redundancy) than text.
- Accuracy scaling with reduced tokens: Text masking shows a roughly linear relationship between accuracy and the fraction of masked tokens, while image filtering maintains performance up to 50% masking, with a significant drop only at 75%, suggesting that textual information is more evenly distributed than image information.
For interpretability beyond reconst. score, see the rebuttal of Reviewer b89F W1.
For more datasets, see the rebuttal of Reviewer k2no point #4.
W4: We present adaptive parameter selection strategies to make our method effective and practical.
Thank you for raising this consideration! We tested hyperparameter selection using a 1/20 subset of Align-Anything dataset:
| Filter Method | LLaVA-Bench Performance @ 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SAE-V Cosine Similarity | 94.2 | 98.2 | 106.8 | 114.9 | 114.5 | 114.8 | 112.9 | 112.3 | 111.0 | 109.5 | 98.5 |
| Random | 94.2 | 96.5 | 97.0 | 98.3 | 95.3 | 93.7 | 96.5 | 98.83 | 98.2 | 96.8 | 98.5 |
The overall trend closely resembles Fig. 7, confirming that alignment metrics computed on a small validation set resemble their distribution on the complete dataset, enabling efficient hyperparameter selection.
Additionally, Section 4 has shown that our method outperforms using the complete dataset across most hyperparameter settings, making more refined parameter tuning (such as a naive elbow method) helpful but not strictly necessary.
W5
While we do not provide formal guarantees, our method is grounded in established principles of dictionary learning that have been validated by the community. We also performed comprehensive ablation studies and open-sourced our code to confirm replicability (see the rebuttals above).
Question
Thanks for pointing this out! As shown in the supplementary material (code/SAELens-V/scripts/cosimilarity.py, lines 179-189), we actually used point-to-point cosine. We modified Eq. 7, and we verified that pair-wise cosine performs similarly to point-to-point in actual dataset selection (a small sketch of the two variants follows after the table):
| Filtered Top(%) | 25 | 50 | 75 |
|---|---|---|---|
| IoU | 0.71 | 0.77 | 0.85 |
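To make the distinction concrete, here is a small sketch of the two cosine variants discussed above. Which variant corresponds exactly to Eq. 7 versus the released code is not recoverable from this thread, so the function names and the pooling choice are assumptions intended only to illustrate the difference.

```python
import torch
import torch.nn.functional as F

def pooled_cosine(text_feats, image_feats):
    # One cosine between the mean text representation and the mean image representation.
    return F.cosine_similarity(text_feats.mean(0, keepdim=True),
                               image_feats.mean(0, keepdim=True)).squeeze(0)

def point_to_point_cosine(text_feats, image_feats):
    # Cosine averaged over every (text token, image patch) pair,
    # i.e. the fully connected bipartite formulation.
    t = F.normalize(text_feats, dim=-1)   # (n_text, d)
    v = F.normalize(image_feats, dim=-1)  # (n_img, d)
    return (t @ v.T).mean()
```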
Reference
[1] Heap et al. Sparse Autoencoders Can Interpret Randomly Initialized Transformers, 2025.
[2] Zhang et al. Large Multi-modal Models Can Interpret Features in Large Multi-modal Models, 2024.
This paper introduces SAE-V, a framework that extends Sparse Autoencoders (SAEs) to multimodal large language models (MLLMs). The authors argue that MLLMs present unique interpretability challenges due to the complex semantic space created by integrating visual modalities with text. SAE-V aims to address these challenges by identifying and analyzing interpretable features in MLLMs, focusing on cross-modal interactions and alignment dynamics. The authors demonstrate that SAE-V can be used to filter high-quality data for model alignment, achieving comparable or better performance with significantly less data. They conduct experiments on multiple MLLM architectures (LLaVA-NeXT-7B and Chameleon-7B) and datasets (Align-Anything and RLAIF-V) to validate their approach.
update after rebuttal
The authors' response has resolved my concerns. After reading the other reviewers' comments, I think this paper is above the acceptance threshold.
Questions For Authors
N/A
Claims And Evidence
The main claims of the paper are:
- The authors demonstrate this through reconstruction loss metrics, showing that SAE-V outperforms standard SAE models when applied to MLLMs.
- The authors show that SAE-V models trained on MLLMs can be effectively applied to their base LLMs.
- Through image patch filtering experiments, the authors demonstrate that SAE-V can identify the most important parts of an image.
- The authors show that data filtered using SAE-V features can achieve better performance with less data compared to random selection or using the full dataset.
The evidence presented includes quantitative metrics (reconstruction loss, L0 sparsity, model performance on benchmarks) and qualitative analyses (case studies of image patch filtering).
Methods And Evaluation Criteria
SAE-V (Sparse Autoencoder for Multimodal Models) is proposed for interpretability and data filtering in MLLMs. It contains:
- Sparse Autoencoders (SAEs) extract interpretable multimodal features.
- Cosine similarity ranks data quality for filtering.
- Filtered data improves model alignment efficiency.
The paper adopts several benchmarks for evaluation:
- Align-Anything
- RLAIF-V
- ImageNet
Theoretical Claims
The main claims of the paper are:
- The authors demonstrate this through reconstruction loss metrics, showing that SAE-V outperforms standard SAE models when applied to MLLMs.
- The authors show that SAE-V models trained on MLLMs can be effectively applied to their base LLMs.
- Through image patch filtering experiments, the authors demonstrate that SAE-V can identify the most important parts of an image.
- The authors show that data filtered using SAE-V features can achieve better performance with less data compared to random selection or using the full dataset.
Experimental Design And Analysis
The paper includes several experimental components:
- Comparing reconstruction capabilities of SAE-V versus standard SAE on multiple models.
- Testing various metrics derived from SAE-V (L0, L1, co-occurring L0, cosine similarity) to identify important image patches.
- Using SAE-V features to filter high-quality data for model alignment, comparing against random selection and IFD metric.
- Examining the relationship between average cosine similarity scores and model performance.
Supplementary Material
Yes, I've checked the appendix in the submission but I haven't checked the supplementary code yet.
Relation To Broader Literature
The paper positions itself at the intersection of mechanistic interpretability and multimodal model alignment. It builds upon previous work in sparse autoencoders for LLM interpretability and extends these approaches to multimodal settings. The authors compare their approach to recent data filtering methods like IFD, showing comparable results without requiring additional models.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
strengths:
- The paper presents a clear contribution by extending SAE techniques to multimodal models, which is valuable given the growing importance of MLLMs.
- The authors demonstrate a concrete application of their interpretability method (data filtering) that improves model alignment, connecting theoretical interpretability to practical benefits.
- The experiments cover multiple models, datasets, and evaluation metrics, providing strong evidence for the effectiveness of the proposed approach.
weaknesses:
- The process of determining which features are considered "interpretable" or "high-quality" is somewhat subjective and could benefit from more rigorous definition.
- The image patch filtering experiments may be subject to confirmation bias - examples are chosen where the method works well, but it's unclear how often the method fails to identify important regions.
- The experiments are limited to 7B parameter models. It's unclear if the findings would generalize to larger models where the semantic spaces may be even more complex.
- SAE-V claims to be efficient, but training times and computational costs are not reported.
- The results might not generalize to other multimodal models since only two models are tested.
Other Comments Or Suggestions
N/A
We deeply appreciate your thoughtful insights that will significantly strengthen our paper's overall presentation.
In the rebuttal period, we used all available resources and devoted our efforts to conducting additional experiments. We address each of your concerns below and will add the changes to the revision. If this rebuttal addresses your concerns, we earnestly and kindly ask you to consider raising the score and supporting our paper for acceptance.
W1: We defined these concepts with specific metrics, and we acknowledge that more rigorous definitions would strengthen our presentation.
We define a feature as interpretable when it activates for semantically related inputs across modalities. This definition is aligned with the automated interpretability score [1], but since there are currently no benchmarks for multimodal interpretability, we display an example instead (e.g., Rebuttal Fig. 1 shows an interpretable feature whose semantic meaning is consistent).
As for quality, a feature is of high quality when the text tokens and image patches it activates are semantically similar (as given by Eq. 7), focusing on similarity across modalities rather than overall consistency. The feature in the example above is also of high quality.
W2: We provided examples for the failure modes, as well as statistical evidence supporting that these failure modes are scarce.
We agree that there are failure cases where SAE-V does not behave well on multimodal data; Rebuttal Fig. 2 shows an example in which SAE-V fails to capture the most informative patches because they are similar to the background.
However, we need to mention that SAE-V is effective in most cases. As shown in Fig. 6, all SAE-V-based methods achieve high accuracy when preserving 75% or 50% of patches, and the cosine similarity score method maintains high accuracy even when only 25% of patches are preserved. For more baselines and ablations, see the rebuttal of Reviewer bBFh Weakness 3 and Reviewer o9Vp Claim 2.
W3&W5: We tested our method on additional models to prove that our method could generalize and scale up.
To prove that SAE-V and its data filtering paradigm can generalize to other multimodal models and scale up to larger models, we replicated SAE-V and its data filtering method on LLaVA-NeXT-Vicuna-13B and LLaVA-NeXT-Vicuna-7B (in the paper, we used LLaVA-NeXT-Mistral-7B). Unfortunately, due to rebuttal time constraints and compute limitations, we were unable to test our method on larger models and more architectures, and we only tested our data filtering method in a 5-fold manner rather than the 10-fold setting used in the paper.
The interpretability metrics of SAE and SAE-V on both models are shown in the table below:
| Model | Method | L0 |
|---|---|---|
| LLaVA-NeXT-Vicuna-13B | SAE | 128.56 |
| LLaVA-NeXT-Vicuna-13B | SAE-V | 193.63 |
| LLaVA-NeXT-Vicuna-7B | SAE | 3162.96 |
| LLaVA-NeXT-Vicuna-7B | SAE-V | 585.64 |
| Model | Method | Reconst. |
|---|---|---|
| LLaVA-NeXT-Vicuna-13B | Zero | 10.37 |
| LLaVA-NeXT-Vicuna-13B | SAE | 3.170 |
| LLaVA-NeXT-Vicuna-13B | SAE-V | 2.954 |
| LLaVA-NeXT-Vicuna-13B | Original | 2.868 |
| LLaVA-NeXT-Vicuna-7B | Zero | 10.37 |
| LLaVA-NeXT-Vicuna-7B | SAE | 8.126 |
| LLaVA-NeXT-Vicuna-7B | SAE-V | 7.957 |
| LLaVA-NeXT-Vicuna-7B | Original | 7.479 |
In all MLLMs and metrics, SAE-V consistently outperforms SAE, demonstrating its superior capability across different architectures, sizes, and semantic complexity.
The alignment experiment results of SAE-V are shown in the table below:
LLaVA-NeXT-Vicuna-13B:
| Filter Method | LLaVA Bench Performance @ 0% | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|
| SAE-V Cosine Similarity | 104.60 | 105.77 | 116.67 | 112.27 | 111.96 | 111.20 |
| Random | 104.60 | 105.27 | 105.77 | 107.20 | 110.40 | 111.20 |
The results show that SAE-V-based data filter outperforms the random selection baseline, and reached the highest performance of 116.67 with 40% data.
We believe that our experiments across 7B, 13B models and three different architectures (Chameleon, LLaVA-NeXT-Mistral, and LLaVA-NeXT-Vicuna) provide sufficient evidence to prove SAE-V's potential to generalize across architectures, scale to larger models, and handle more complex semantic spaces. We commit to adding experiments on at least one 30B-scale model in future versions of our work.
W4: We reported our training time and computational cost, and demonstrated the effectiveness of SAE-V.
Our SAE-V training was completed on 8×A800 GPUs. Using 100k multimodal data samples, each training run typically takes around 21 hours, which is comparable to training an SAE on a 7B model with the same amount of data.
The effectiveness of SAE-V lies in its generalization capabilities: as reported in Section 3.1.2 and Fig. 4, SAE-V trained on MLLMs demonstrates strong generalization capability to the corresponding LLMs. Therefore, a single training can yield an SAE-V that is applicable to both multimodal models and text-only models.
Reference
[1] Bills et al., "Language models can explain neurons in language models", 2023.
This paper introduces SAE-V, a sparse autoencoder-based framework that extends mechanistic interpretability methods to multimodal large language models (MLLMs). The key contribution lies in using cross-modal feature similarity to score and filter alignment data, enabling improved performance with reduced supervision. The reviewers generally agree that the method is intuitive and the empirical results are promising. However, there are notable concerns. First, the assumption that high cross-modal similarity indicates high-quality data may not always hold—low similarity might reflect model limitations rather than poor data quality. Second, the evaluation is limited to a few benchmarks (e.g., LLaVA-Bench, MME), lacking broader validation on widely adopted datasets such as POPE or MMStar.
Nonetheless, the authors addressed most reviewer concerns in the rebuttal and presented new experiments on larger models and datasets. While the core idea is not fundamentally novel, its practical implications are valuable to the community. Given the relevance to alignment and interpretability in MLLMs, I lean toward a very weak acceptance.