PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 5.8 / 10 (ratings 7, 5, 6, 5; min 5, max 7, std 0.8)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0

Measuring Per-Unit Interpretability at Scale Without Humans

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We introduce the first scalable method to measure the per-unit interpretability in vision neural networks and demonstrate its alignment to human judgements.

Abstract

Keywords
interpretability, explainability, deep learning, neural networks, analysis, activation maximization, alignment, evaluation

Reviews and Discussion

Review (Rating: 7)

The paper proposes a novel method for measuring per-unit (e.g. per-neuron) interpretability of vision models, which is based on a DreamSim-based automation of the 2-AFC task. The method, which is called the Machine Interpretability Score (MIS), is found to be highly correlated with human measures of interpretability, as well as capable of making novel predictions about what units humans will find more and less interpretable. The authors then use MIS to compute average per-unit interpretability for 835 computer vision models, and highlight their findings (such as the negative correlation between accuracy and average per-unit interpretability, and the increased interpretability of deep layers), which would have been impossible to obtain without the novel metric.

Strengths

The paper is a triumph of the genre: it is written in an incredibly clear fashion, presents limitations as they arise, and has an excellent appendix which directly addresses most questions that came up as I was reading the paper.

  • Originality: the contribution solves a longstanding problem in per-unit interpretability in a novel and scalable way. It builds upon some of the best work in the field, and runs experiments that, as the paper mentions, would cost billions of dollars to complete using current approaches.
  • Quality: the paper is of outstanding quality. It is exceptionally clear, and the appendix contains a host of highly useful sensitivity analyses. I especially appreciate Appendices C (justification of the choice of DreamSim) and I (application to sparse autoencoders).
  • Clarity: It's very clear what the results are, what the contribution is, and where the claims are supported. The authors do a good job of demonstrating that MIS explains existing data; then, that it makes novel predictions, which they support with a human experiment (!); and finally, they provide lots of detailed figures explaining their experiments which use MIS.
  • Significance: human-free per-unit interpretability has long been sought in vision model mechanistic interpretability, and this provides a possible solution.

Weaknesses

No significant weaknesses.

Minor nitpicks:

  • The way the query images are selected is only mentioned in the appendix; this might be worth quickly mentioning in the main paper (basically, the fact that they're also highly-activating images)
  • In Figure 14, unless I'm misunderstanding something, there's a strange difference in y-axis scaling. I reckon these should be aligned.
  • Line 195 has an extra full stop. (Or, if you were trying to evoke a sense of mystery, has one less than needed...)
  • Figure 17 says "interpreability"; should this be "interpretability"?

Questions

  • Here's something I don't quite understand: for a unit, why do its weakest-activating dataset samples seem to all be monosemantic? Wouldn't we expect them to be more random than that?
  • It's unclear to me why one might expect "wider layers to be more interpretable" when comparing across models—sure, this seems true for a fixed model architecture and training set, but among the models being analyzed, wouldn't the models with larger layers be more likely to be larger models trained on more data?
  • Is there a way to quantify how superposition affects MIS? Do you have any thoughts on what percentage of the decreasing MIS throughout training is explained by increased superposition in neurons?
  • Do you have thoughts on training autoencoders to directly optimize for a combination of MIS and reconstruction loss?

Minor questions:

  • In Figure 3, the red points simply represent the location of the models tested by Zimmermann and not the HIS determined by Zimmermann for those models, correct?
  • What models form the Pareto frontier in Figure 4A?
  • In what sense is the word "define" used on line 122?

Limitations

The authors adequately address the limitations as they come up.

Author Response

Dear Reviewer gaJB,
Thank you for your valuable feedback and for praising our paper as a “triumph of the genre” with “outstanding quality” and finding it “solve[s] a longstanding problem”. Please let us know whether our responses below addressed all of your questions or whether there are further questions we can answer so that you feel even more confident in our work.

Q: “The way the query images are selected is only mentioned in the appendix”
A: Thank you for your suggestion. We will make use of the increased page limit in the camera-ready version of the paper to move this information to the main text.

Q: “Why [are] weakest-activating samples [not] more random?”
A: We agree with you that it is not obvious why the least activating samples are also clustered and all contain the same visual feature. However, one can think of these features as “counterparts”/“anti-features” to the features displayed by the maximally activating samples: a unit only activates strongly if the feature but not the anti-feature is present, which potentially allows the network to learn more specific feature detectors.

Q: “Is there a way to quantify how superposition affects MIS?”
A: That’s a great question and suggestion! While we can’t think of a precise quantification right now, we can offer a thought experiment. Polysemantic units respond (strongly) to multiple concepts. If those different features elicit similar activation ranges, the query and reference images used to compute the MIS differ, resulting in harder tasks and lower MIS. We think that quantifying through human studies how well a drop in MIS is correlated with units being polysemantic will be an interesting follow-up for our work. Thank you for the suggestion!

Q: “why [might] one [...] expect "wider layers to be more interpretable"?”
A: Thanks for raising this question! In this plot, we compare the relation of a layer's relative width and its interpretability, i.e., we ask whether wider layers of a network are more interpretable than narrower ones of the same network. We chose this relative comparison exactly to circumvent your concern. We think our observation might be explained by the superposition hypothesis: Narrow layers do not have sufficient capacity to represent different concepts individually through single units and instead have to leverage polysemantic units. We will expand on this connection in the camera-ready version of our paper.

Q: “thoughts on training autoencoders to directly optimize for MIS and reconstruction loss?”
A: We think that using our proposed MIS to increase the interpretability of networks is very exciting — both for auto-encoders used to make large networks interpretable and for directly making networks more interpretable. While the current computation of the MIS can be used for non-gradient-based optimization (e.g., hyperparameter grid search), optimizing it scales inefficiently with parameter count. A challenge for future work will be to find a differentiable approximation of the MIS that can be optimized using gradient descent, circumventing the efficiency issue. We believe such an approximation can be defined when training with sufficiently large batch sizes, and hope to explore this further in follow-up work.

Q: “Do the red points in Fig. 3 represent the location of models tested by Zimmermann et al.?”
A: Yes, that is correct. We will update the caption to ensure future readers do not mistake them for the results of Zimmermann et al.

Q: “What models form the pareto frontier in Figure 4A?”
A: Thank you for this suggestion! We determined the Pareto frontier of Fig. 4A using the paretoset python package and will include the following table in the camera-ready version of our paper:

| Model | Acc [%] | MIS |
| --- | --- | --- |
| googlenet | 69.15 | 0.908 |
| timm:resnet34.a3_in1k | 72.97 | 0.904 |
| timm:resnet50_gn.a1h_in1k | 81.22 | 0.901 |
| timm:ecaresnet101d_pruned.miil_in1k | 82 | 0.895 |
| timm:eva02_small_patch14_336.mim_in22k_ft_in1k | 85.72 | 0.890 |
| timm:vit_base_patch8_224.augreg_in21k_ft_in1k | 85.8 | 0.871 |
| timm:caformer_b36.sail_in1k_384 | 86.41 | 0.870 |
| timm:caformer_s36.sail_in22k_ft_in1k_384 | 86.86 | 0.870 |
| timm:caformer_b36.sail_in22k_ft_in1k_384 | 88.06 | 0.864 |
| timm:beitv2_large_patch16_224.in1k_ft_in22k_in1k | 88.39 | 0.839 |

We found it interesting that although purely convolutional networks with high accuracy exist, all points on the Pareto frontier with high accuracy belong to transformer architectures.
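For reference, here is a minimal sketch of how such a frontier can be extracted with the paretoset Python package (illustrative data and column names, not our exact script; we assume the package's paretoset(costs, sense=...) interface, which returns a boolean mask):

```python
import pandas as pd
from paretoset import paretoset  # pip install paretoset

# Illustrative subset; in practice this holds accuracy and MIS for all 835 models.
df = pd.DataFrame({
    "model": ["googlenet", "timm:resnet34.a3_in1k",
              "timm:beitv2_large_patch16_224.in1k_ft_in22k_in1k"],
    "acc":   [69.15, 72.97, 88.39],
    "mis":   [0.908, 0.904, 0.839],
})

# Both objectives are maximized: accuracy and average per-unit interpretability.
mask = paretoset(df[["acc", "mis"]], sense=["max", "max"])
frontier = df[mask].sort_values("acc")
print(frontier)
```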

Q: “In what sense is the word "define" used in L122”
A: We use the word define here in the same sense as one defines a statistical/machine learning model. Would you prefer the word “model” instead of “define” here?

Q: “Figure 14 [...] difference in y-axis scaling”
A: Thanks for the pointer, we will update this figure in the final version of the paper to use an aligned y-axis.

Q: Typos in L195 and caption of Fig. 17
A: We corrected these typos.

Comment

Thank you for the detailed responses! This is a great paper. I guess my main hesitancy to increasing the score further is that while this paper is self-contained and attacks an important problem, it's not obvious to me that it would have "excellent impact on at least one area, or high-to-excellent impact on multiple areas"—this just seems like a really high bar. Essentially, I would be happy to see this paper accepted, but I am unsure as to what the actual impact will be, due to things like an unclear optimization target and complexity.

Review (Rating: 5)

The paper presents a method to automate a per-unit (e.g. individual neuron, channel in a conv layer) interpretability score for vision models that was previously computed via an expensive human study [50]. They demonstrate that their automated scores are highly correlated with the human measures, and then apply their method on 835 vision models, obtaining a ranking of models by 'interpretability' that is consistent with the much smaller subset considered in [50]. Additional analyses using their method involve inspecting 'interpretability' vs. depth, vs. layer width, vs. layer type, and throughout training. The paper motivates their work by arguing that automating the interpretability score will enable optimizing for it, towards more 'interpretable' models.

Strengths

  • The paper achieves the goal it sets out to achieve, namely that of automating the human interpretability score of [50].
  • The authors conduct extensive experiments, evaluating a tremendous number of models (835!).
  • The authors offer two forms of validating their main claim: one via correlating their score with the human interpretability scores from a prior study [50], and another via a second (new) human study conducted after developing their method.
  • A number of follow-up experiments are considered. Some interesting behaviors are observed, like a big jump in MIS (machine interpretability score) for some of the batchnorms during the first epoch of training, and the drop in MIS over the final deepest layers (fig 6a).
  • I am a big fan of the abstract problem: measuring interpretability is a hard problem, and an automatic method could potentially be very useful.

Weaknesses

While the MIS seems to do well at modeling the HIS, I am struggling to see the merits of such a score, as I find the underlying task too easy and as such, limited in its downstream potential for applications. As I understand it, the psychophysics task used to proxy interpretability simply asks users (given a 'unit' that maps each image to a scalar activation) to match a least activating image to a set of other least activating images and a highest activating image to a set of other highest activating images. This can be done so long as there is any kind of distinctiveness between the highest and least activating images. It also does not account for the well-known superposition phenomenon, in which a single unit may encode multiple concepts, thus hindering its interpretability / ability to steer model behavior; in such a case, one could still pass the psychophysics test by recognizing any one of the multiple concepts encoded by a unit, as the least activating images would not contain those concepts.

^to summarize, I don't think the underlying test that MIS can proxy gives us any valuable signal, as it does not tell us if a unit is aligned with a single concept. I could be wrong, but I find it vital for the authors to demonstrate an application of their interpretability score (e.g. combatting spurious correlations, steering model behavior, uncovering biases, etc).

The claim 'if we can measure it, we can optimize for it' is unsubstantiated and, in my opinion, misleading. I don't see a straightforward way nor reason to efficiently optimize for MIS.

The range of values that the MIS outputs empirically is quite tight, making it hard to place much significance on the observed differences.

I find the methods section to be needlessly mathy; I think it obscures the underlying method, which is not too complicated (sort images along a unit, select your query + example sets, use DreamSim to get the similarity between each example and the queries, average, pick).
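To make that concrete, here is a rough sketch of the pipeline as I understand it (a generic similarity function stands in for DreamSim; hypothetical helper code, not the authors' actual implementation):

```python
import numpy as np

def mis_sketch(activations, images, similarity, n_ref=9, n_tasks=20):
    """Rough 2AFC-style scoring for one unit (illustrative, not the paper's exact MIS).

    activations: (N,) activation of the unit on each dataset image
    images:      sequence of N images (or embeddings)
    similarity:  callable(img_a, img_b) -> float, e.g. a DreamSim-based
                 perceptual similarity (higher = more similar)
    """
    order = np.argsort(activations)            # sort images along the unit
    ref_lo = order[:n_ref]                     # least-activating exemplar set
    q_lo = order[n_ref:n_ref + n_tasks]        # minimally activating queries
    ref_hi = order[-n_ref:]                    # most-activating exemplar set
    q_hi = order[-(n_ref + n_tasks):-n_ref]    # maximally activating queries

    def mean_sim(q, refs):
        return np.mean([similarity(images[q], images[r]) for r in refs])

    correct = 0
    for qp, qn in zip(q_hi, q_lo):             # one 2AFC task per query pair
        # pick the assignment of queries to exemplar sets with higher average similarity
        match = mean_sim(qp, ref_hi) + mean_sim(qn, ref_lo)
        swap = mean_sim(qp, ref_lo) + mean_sim(qn, ref_hi)
        correct += match > swap
    return correct / n_tasks                   # chance level is 0.5
```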

The paper relies very heavily on [50]. It would be nice to see how this method could be incorporated with other popular ideas in current interpretability literature.

Questions

How would you envision optimizing for the proposed MIS score? What advantages would that yield?

Why do more accurate models have lower MIS? This makes the idea of optimizing for MIS a harder sell.

Is there any evidence that nodes with higher MIS are easier to name? I'd expect there to be a positive correlation, but I'm not sure how strong it would be. Again, I'm trying to think of a way in which MIS can be useful, e.g. in collaboration with other interpretability techniques.

What was the correlation between the per-unit HIS and MIS scores for the new human study on the two models?

Limitations

The primary limitation is that the underlying score this method proxies may not sufficiently align with notions of interpretability that can ultimately be useful. I do not see how the method or the findings of this paper can be operationalized towards more research going forward. I would encourage the authors to devise and convincingly present a way in which this method can ultimately be utilized towards more trustworthy, transparent, or reliable models.

Author Response

Dear Reviewer j42j,
Thank you for your detailed feedback. Please let us know whether our responses below addressed all of your questions or whether there are further questions we can answer so that you can confidently increase your score.

Q: “find the underlying task too easy”, “does not account for polysemanticity”
A: You are right that the underlying psychophysical task checks for a rather rudimentary level of understanding. Note, however, that understanding units at this level is necessary for any deeper understanding. This is important because experimental results show that human understanding of units at this level is rather brittle: Zimmermann et al. (2023) showed for two models that by performing the psychophysical task for slightly less extreme query images, human performance drops strongly (see Fig. 6/7 of their paper). We can reproduce this finding with our MIS for many more models (see Fig. 1 of the general response), showing that this task is not trivial to solve. Further, the task partially accounts for polysemanticity as the most extremely activating images of polysemantic units might not correspond to a single but various concepts, making the task even harder to solve. In conclusion, this means there is still ample room for improvement in future models.

Q: “application of interpretability score”, “(advantages of) optimizing for MIS”
A: Thank you for raising this question. We see three types of practical applications for our MIS that go beyond analyzing and understanding neural networks: (1) Optimizing neural networks directly to become more interpretable via gradient descent. (2) Performing model selection based on interpretability. (3) Tuning hyperparameters of other interpretability tools to make them explain networks better. While having a differentiable version of the MIS is required for (1) and would surely benefit (2) & (3), note that a non-differentiable metric still provides valuable insights enabling the latter two directions (although potentially less efficiently). As an example, we performed experiments with sparse auto-encoders (L231ff & Sec. I), revisited and investigated inconclusive results of previous papers, and performed hyperparameter selections. We will use the increased page limit of the camera-ready version to highlight these results more in the main text.

Q: “range of [MIS] values is tight”
A: On a per-unit level, we find that the MIS spans the entire theoretical value range (~0.5 to 1.0) (see Fig. 2B/C). When averaged per model, the effective range indeed becomes tighter. We argue, however, that this is not an issue of the metric but instead shows that models only trained for good downstream performance all learn similar representations (https://arxiv.org/abs/2405.07987) with mediocre interpretability. With an increased interest in interpretable models, we expect future models to achieve higher MIS values. By choosing less extremely activating samples as query images (see Fig. 1 of general response), we can also increase the task’s difficulty to have a meaningful signal also for more interpretable models.

Q: “[Incorporation] with other popular ideas in current interpretability literature?”
A: A particularly popular topic at the moment is SAEs. As described above, our MIS can be used to make model selection and hyperparameter tuning of SAEs more efficient, enabling large-scale sweeps. Further, it is conceivable to use the MIS as a guiding signal when finding interpretable circuits in a network: by excluding particularly uninterpretable units from the search, one might reduce the computational cost of finding circuits. Finally, as the MIS also works with explanations other than dataset examples (see Appx. E), it can be used with future explanation methods, too (e.g., MACO [1]).

Q: “Why do more accurate models have lower MIS?”
A: This is an important question. We hypothesize that this is related to the phenomenon of superposition: With limited capacity, one way for models to obtain higher downstream performance is to represent features in superposition/entangle them. However, at the same time, this makes units harder to interpret as they don’t correspond to individual features anymore, explaining the lower MIS. We politely disagree with your assessment that this result makes it difficult to sell “optimizing for MIS”. First, note that we see only a correlation, which does not mean that there has to be a tradeoff between performance and interpretability, as this would assume a causal relation between these two variables. Second, note that if accuracy and MIS were positively correlated, then there would be no need to optimize for MIS as one would get this for free. On the contrary, our results show clearly that one should not hope to automatically get very interpretable models by only optimizing for high accuracy. Instead, we need to optimize for both accuracy and interpretability/MIS.

Q: “[are] nodes with higher MIS easier to name?”
A: Interesting question! To verify the correctness of our MIS, we used the human psychophysical data of [50]. This data also contains scores indicating how confident humans were when making their choices. Regarding your question, one might say that if humans find it easier to name units, they will have high confidence in solving the 2AFC task. Interestingly, we find a high linear/rank correlation of 80% between this confidence score and our MIS. Thus, we conclude that nodes with higher MIS tend to be easier to name.

Q: “correlation between the per-unit HIS and MIS scores for the new human study?”
A: Based on your suggestion, we computed the correlation between MIS and HIS for the new human experimental data. Specifically, we compute the correlation for all units used for creating Fig. 2C and again find a high correlation ($\rho_p = 0.85$, $\rho_s = 0.81$).
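Correlations of this kind can be computed with scipy's standard estimators; a minimal example with placeholder data (not our actual per-unit scores):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder per-unit scores; in practice these are the MIS and HIS values
# of the units underlying Fig. 2C.
mis = np.array([0.62, 0.71, 0.80, 0.88, 0.95])
his = np.array([0.58, 0.70, 0.77, 0.90, 0.93])

rho_p, _ = pearsonr(mis, his)   # linear (Pearson) correlation
rho_s, _ = spearmanr(mis, his)  # rank (Spearman) correlation
print(rho_p, rho_s)
```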

Comment

I thank the authors for their detailed responses and all their effort during the rebuttal period. tl;dr: Some concerns remain (and are shared by another reviewer, so they will need convincing too), but in general I think the paper takes a gradient step in the 'right' direction, so I won't be the reason why it does not get published. I'll increase my score to 5.

Some comments:

  • on the underlying 2AFC task: I still think the task is too easy, but I guess it is better than nothing. I do find it important, though, to mention other works that try to automatically name neurons (a much harder task), as an automated interpretability metric (i.e. how well their automatically generated name for the neuron matches the highly activating images) is a by-product of their work. Since those works do not focus on (nor, I believe, flesh out) that contribution, the novelty will fall to this submission.
  • Looking at MIS for the least interpretable nodes is an interesting way to make the metric more insightful, though I'm not sure we need every node in the network to be interpretable.
  • I would still clarify that it is not obvious how MIS can be directly optimized for (i.e. via gradient descent / during model training). The suggestions of using MIS for model selection or filtering out nodes when finding circuits is not rigorously substantiated, but still interesting -- we can let the next paper figure out better uses for MIS ;)
Comment

Dear Reviewer,

Thank you for your positive feedback and for appreciating our efforts in providing the new explanations/results. We are very pleased that you have increased your score to 5 and will be sure to integrate your comments into the discussion section of the camera-ready version of our paper!

Review (Rating: 6)

The authors introduce a computational metric for interpretability. Their proposed metric is a computational version of an evaluation metric introduced in previous literature which measures human-perceived interpretability. Crucially, they find that this metric correlates well with the human one, and since it does not require human studies, it can scale well, therefore potentially enabling more insights into model interpretability.

Strengths

  • Significance: in time and cost terms, human studies (which are necessary for assessing interpretability) are the main bottleneck. This work is therefore significant as it proposes a way to bypass this.
  • Results: the results seem overall interesting. I have some doubts here and there; see more below.
  • Clarity: the paper is in general easy to follow.

Weaknesses

  • Experiments/Applicability: results seem compelling for vision tasks in a domain that largely falls within human common knowledge. I am not sure what the results would be in different domains and for more complex/specialized vision tasks. More discussion below.

Questions

  1. Perhaps a bit of a philosophical question, but how do the authors envision their metric in a computer-human interaction (CHI) context? As said before, I am totally sold in terms of the time/money saved for interpretability evaluation by avoiding human user experiments. However, I am strongly of the opinion that humans should still be somehow involved in terms of interpretability. After all, they are the ones affected/taking the final decisions. So can the authors expand/comment more on how a highly scalable method can interact with human decision making, which is by nature non-scalable? Let's say that you identified a model which is on average more interpretable than many others (Section 4.2.1); at this point, how would a person interested in interpreting the model proceed? Would you show the units with the highest MIS? I think my general question would be: can MIS be used to help a user fully comprehend their model, or would MIS be useful only for some pre-filtering?
  2. As said above, results seem compelling in common-knowledge vision tasks. Do you expect HIS and MIS to have the same degree of correlation in, say, text classification tasks?
  3. If we stay with vision tasks, what about out-of-distribution (OOD) samples? I would somewhat expect HIS and MIS to stop correlating for such samples. If so, this would diminish the contribution of this work.
  4. (L. 209) Do you have a reference for the claim "googlenet [..] is widely claimed to be more interpretable"?
  5. Do you have an intuition for why interpretability would decrease after the first training epoch? I would rather imagine that, e.g., the conv filters adapt and come to resemble the data more.
  6. It would be interesting to see a "distributional study" (similar to Figure 4b) also for the layer-width comparison. Why are wider layers more interpretable? Is the MIS a "real average behavior" of units in the wider layers, or is it more that the wider the layer, the higher the probability of having, by chance, a unit that appears more interpretable?

Limitations

Most limitations are discussed in the conclusion.

Author Response

Dear Reviewer P5jG,
Thank you for your valuable feedback and for praising our paper as a “significant work” with “interesting results”. Please let us know whether our responses below addressed all of your questions or whether there are further questions we can answer so that you can confidently increase your score.

Q: “how do the authors envision their metric in a computer-human interaction (CHI) context?“
A: That’s a great question! We completely agree with you that, at its core, interpretability is a human-centered field, and, thus, humans need to be involved eventually. Therefore, we ensured that the proposed MIS explains human interpretability annotations well. We see numerous opportunities for the MIS to interact with/support humans: (1) The MIS might be used to increase a model’s interpretability to simplify the human’s job, e.g., through model selection, hyperparameter tuning or directly optimizing for it (where the latter would benefit from a differentiable approximation of the MIS). (2) The MIS can save time when interpreting models: Instead of attempting to understand incomprehensible units, we can start with particularly easy ones (based on their MIS) and allocate our time accordingly. (3) Such an MIS ordering can also be helpful when finding neural circuits in the network as excluding very uninterpretable units from the search reduces the combinatorial complexity and, thus, computational cost.

Q: “Do you expect HIS and MIS to have the same degree of correlation in, say, text classification tasks?”
A: This is a very interesting question! Assuming access to a sufficiently human-like perceptual similarity function, we expect the MIS to generalize to various data modalities. Given the tremendous progress in language modeling/embedding, we are optimistic the MIS will work on text data, too. Testing this hypothesis requires extensive human psychophysical experiments. We will include this exciting experiment in the outlook paragraph of our paper and hope it will inspire future work.

Q: “what about out-of-distribution (OOD) samples?”
A: Thanks for asking this question. Our experimental results indicate that OOD samples cause no problems. In Appx. E, we demonstrate that the MIS is still highly correlated with the HIS when using synthetic feature visualizations instead of dataset examples as reference images. These images pose a substantial distribution shift as they look very different from natural images. Therefore, we conclude that the MIS works correctly also for OOD samples.

Q: “reference for the claim "googlenet [..] is widely claimed to be more interpretable"?”
A: This statement refers to the fact that many interpretability papers focus on this network, producing more and more insights into how it operates. We see now that our statement was imprecise and will revise it accordingly.

Q: “why would interpretability decrease after the first training epoch?”
A: We have no definite answer to this question yet but hypothesize the following: This could be a sign of learning dynamics and the order in which features are learned. After initialization, the network can improve the fastest by learning very simple feature detectors (e.g., colors, simple geometric shapes), as those are weakly correlated with certain classes (e.g., blue colors increase the chance of seeing a fish). Those features are easy for humans to understand. Throughout the training, these feature detectors are replaced with more complex ones that are harder to decode. As suggested by reviewer gaJB, in later training epochs, the network might also tend to a state of stronger superposition to increase classification performance with the cost of decreased interpretability. In Fig. 3 of the general response, we show visual explanations of units with strong MIS drop between the second and last training epoch. We will use the increased page limit of the camera-ready version to discuss this hypothesis in the main text and include these visualizations in the appendix.

Q: “Why are wider layers more interpretable?”, “distributional study for the width”
A: Our data shows a moderate increase of the per-layer interpretability (i.e., per-unit scores averaged per layer) with increasing layer width. We see the same trend when instead of plotting the per-layer-average we plot the 5th or 95th percentile per-layer (see Fig. 2 of the general response). This suggests that the overall MIS distribution moves to higher values with increasing layer width and the effect is not dominated by few outliers.

Overall, we hope to have answered all of your questions. If you are satisfied with our responses, we would appreciate it if you would increase your score.

Review (Rating: 5)

The paper suggests a new automatic measure to assess how interpretable individual units inside vision models are (called MIS). The per-unit metric assesses the similarity of two query images (one should maximize unit activation and one minimize it) to two groups of representative exemplars (top-activating and least-activating images for that unit) using LPIPS. Repeating this for several query image pairs measures how well top/least activating images are consistent with themselves. If the MIS score of a unit is high, that means all visual exemplars represent very similar visual concepts (i.e. the unit is monosemantic) and therefore the unit is highly interpretable. If the MIS score approaches chance (0.5), the unit is likely less interpretable, as the visual concepts of the visual explanations are broad (i.e. polysemantic). The metric is shown to be well correlated with human assessment of the same task. The authors show several uses of the metric, such as assessing the interpretability of a wide range of models, of units at different depths and layer widths along deep nets, of training dynamics, and the correlation of model interpretability and performance.

Strengths

  • Automating scientific processes is important.
  • The metric is built from reliable tools and the well-known 2AFC test format
  • The authors made efforts to exemplify the use of MIS in several large-scale analyses

Weaknesses

  • Although the paper attempts to provide analyses of large-scale experiments, I do not find any of their conclusions very exciting and do not see their contribution to future studies:
    a. The paper shows a study of a large number of networks, but the analysis is only mildly insightful; the MIS is very similar across all the networks. The per-unit analysis is also not very surprising, showing that shallow layers are not as interpretable as deeper layers. There is an interesting phenomenon at the very beginning and end of the network, but the authors do not attempt to explain it.
    b. The anticorrelation between the interpretability of units and classification performance was already shown in [50]. The larger-scale experiment of the paper includes many types of networks trained for different tasks; I wish the authors would explore the correlation with network performance for a broader set of tasks, not only classification.
    c. The training-dynamics analysis is interesting, but again the interesting phenomenon that the MIS is highest after only 1 epoch is not attempted to be explained, making the results harder to consume.
    d. The only part of the analysis that felt more exciting was the SAE experiment; however, unfortunately, the authors chose to analyze layers with relatively high MIS scores. In that setting, the SAE does not seem to improve the interpretability of units. Of course, the interesting experiment would be testing a layer with low MIS to see if the SAE improves its interpretability.

  • The method relies heavily on existing tools like DreamSim, and well-known tests (eg 2afc setting from [50]). Therefore, there is no actual technical contribution in forming the metric itself. The correlation of MIS with the human scores is not surprising- authors use DreamSim in the exact setting it was trained for, with a high correlation with human judgment being the training objective.

  • Visual explanations are only one way to measure interpretability, the method does not suggest tackling any more advanced form of explanation, for example, those of [1],[2],[3].

  • An important aspect the MIS is missing is how legible the visual explanations are to end-users (which is ultimately what we really care about when we want to measure "interpretability"). Is it true to assume that low variability in exemplars always implies a visual explanation that is well-legible to users?

  • The method does not consider the full distribution of neuron activations but rather only the two extrema.

In general, I feel like the paper is an immediate extension of [50] (i.e. as if it were another subsection of it) and not a paper on its own.

[1] "Natural language descriptions of deep visual features" iclr 2022
[2] "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks" iclr 2023
[3] "a Multimodal Automated Interpretability Agent" icml 2024

Questions

  • Please explain how the scalability of the measure enables revealing phenomena that were unknown before (e.g. in [50])
  • Can further analysis of unexpected phenomena as described above be done?
  • Why is the SAE experiment performed on highly interpretable layers? Why not perform it on lower interpretable layers?
  • What are the insights for other vision models (not necessarily trained for classification)?
  • Can MIS be applied to other types of explanations (e.g. textual descriptions like MILAN, CLIP-Dissect, and MAIA)?
  • Can MIS be extended to testing the full distribution of unit activations, not only extrema?
  • Please explain the key differences and what is shared with the automated evaluation protocol of MAIA [3].
  • Is it true to assume that low variability in top activating exemplars also makes the visual explanation well-legible to users?

Limitations

The authors discuss the limitations of the method in the paper.

Author Response

Dear Reviewer 8Jb5,
Thank you for reviewing our paper. Please let us know whether our responses below addressed all of your questions or whether there are further questions we can answer so that you feel confident in increasing your score.

Q: “I do not find any of the conclusions very exciting”
A: We see our contribution as part of a wider community effort to develop better techniques for understanding the inner workings of neural networks. As such, the MIS is better understood as improving our “microscope” for analyzing networks, rather than providing any specific conclusion. However, as you noted, our improved “microscope” already surfaces a range of interesting phenomena that have yet to be explained. It’s such unexplained phenomena that fuel and guide scientific progress, highlighting how relevant MIS is to the study of networks.

The improvement MIS provides is quite a step-change: compared to the manual approach of prior work (e.g. [50]), we can now test several orders of magnitude more units & models. This is why we could detect the anticorrelation between a model’s accuracy and interpretability, a relation that [50] failed to substantiate due to a lack of statistical power. It’s these kinds of analysis and phenomena that require automated interpretability metrics and which have been sought after for some time (e.g. see Rev. gaJB or [6]). Now, for the first time, such large-scale interpretability analysis is possible with our MIS.

Q: Relation to [50], “technical contribution”
A: Before our paper, there was no work that allowed large-scale quantification of per-unit interpretability in vision models. With respect to [50], please note: (1) The manual approach of [50] (and other prior works) does not scale at all. Our method is the first to scale their type of evaluation up and we succeed by finding a simple but clever way to automate it. We hope you can agree that, ultimately, a new method should be mainly evaluated by its potential impact. This impact is visible, e.g. by the fact that our study finds a clear anticorrelation whereas [50] could only produce inconclusive results due to multiple orders of magnitude fewer measurements. (2) We’d also like to emphasize that our technical contribution is far from obvious. It’s not clear why a global perceptual metric (DreamSim) could fit human responses in our 2AFC task. In particular, one might expect (e.g. [7]) that humans solve the task by searching for common local patterns in the reference images and then comparing the most common ones to those of the query images. Hence, the fact that a global perceptual metric fits human responses so well was quite surprising to us.

Q: “Differences to MAIA”
A: Thanks for asking this important question. Approaches such as MAIA and our MIS tackle completely different problems: MAIA is an automated explanation method, i.e. it tries to find explanations of what a unit does. But it does not allow quantifying how interpretable the unit is. On the contrary, our MIS is an interpretability metric that tells how interpretable a unit is given some explanations. Such an interpretability metric enables practitioners to increase the level of interpretability of a network, e.g., by model selection or hyperparameter tuning of the model or of an interpretability method.

Q: Relation b/w “diversity in exemplars” and legibility
A: Our MIS leverages an established 2AFC task used in multiple prior works [6, 49, 50]. We will integrate more information from these works on why this task measures how legible explanations are (i.e. how well they explain a unit). Their reasoning can be summarized as follows: The task requires humans to reason about positive (and negative) explanations, to identify the positive (or negative) feature shared by the explanations, and to recognize this feature in the correct query image. Please note that these features rarely correspond to completely unambiguous, clear-cut semantic classes, and identifying the common feature can be challenging. If the explanations are so diverse that humans fail to identify a shared feature, this shows that they cannot understand what the unit is firing for.

Q: “[Extension] to full distribution”
A: The MIS can very easily be extended to test more than just the extrema of the activation distribution: Instead of choosing the most extremely activating samples as query images, we can use less strongly activating ones and sample from other parts of the activation distribution. Please see Fig. 1 of the general response for a version of the paper’s Fig. 3 where we chose query images from the 98th/2nd percentile. As the human understanding (measured by HIS/MIS) even of the extrema is still limited and performance breaks down a lot when moving away from the extreme, we suggest using the MIS with images near the distribution’s tails (e.g. 98th or 99th percentile) to get a strong signal. Once models (let it be base models or SAEs) or explanations improve, it can be insightful to test larger parts of the distribution, too.
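As an illustrative sketch of this modification (our exact sampling rule may differ), only the query-selection step changes:

```python
import numpy as np

def queries_at_percentile(activations, pct, n_queries=20):
    """Indices of the n_queries images whose activation is closest to the
    given percentile of the unit's activation distribution (illustrative)."""
    target = np.percentile(activations, pct)
    return np.argsort(np.abs(activations - target))[:n_queries]

# e.g., query images near the 98th / 2nd percentile instead of the extrema:
# q_hi = queries_at_percentile(acts, 98)
# q_lo = queries_at_percentile(acts, 2)
```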

Q: “[What about] textual descriptions”
A: We are optimistic that the MIS can be generalized to textual descriptions too. With the recent progress in modeling language-vision models, it is conceivable to replace DreamSim with a language-vision encoder such that the MIS is based on the similarity between textual descriptions and query images. To keep our paper focused, and due to the lack of human interpretability annotations for such textual explanations, we decided to leave exploring this extension to future work. We will add this discussion to the final version of our paper.

Q: “SAE experiment performed on interpretable layers”
A: We chose a layer representing the median interpretability of GoogLeNet to neither make the experiment too hard nor too easy. We will re-run our experiments on a less interpretable model/layer and include it in the final version of our paper.

Comment

Thank you for your reply.

  • If the conclusions in the paper are not the main contribution but rather the metric itself, please describe potential future use cases for it.

  • Regarding the usage of DreamSim: because the aggregation over all image pairs is done by taking an average, I believe there would not be much difference between the scheme you mentioned and the current implementation of MIS (DreamSim has been shown to represent a global space for perceptual similarity that goes beyond pairwise comparisons). Nevertheless, if the procedure you describe is indeed how humans perform the task, why not construct MIS accordingly?

  • Technically, I understand how MIS can easily be expanded to other "activation level" exemplars, but is it meaningful at those levels? The added figure uses the 98th percentile, which is still very high. What about lower percentiles?

  • It would be great to see SAE results during the discussion period if possible; thank you for the effort on this.

Comment

Dear Reviewer 8Jb5,
Thank you for your response!

  • We see numerous potential practical applications for our MIS that go beyond analyzing and understanding networks: (1) Directly optimizing networks for interpretability using gradient descent to improve the interpretability score; (2) Model selection by choosing models based on their interpretability scores; (3) Hyperparameter tuning for interpretability tools by optimizing their performance in explaining networks; (4) Prioritizing interpretability efforts by identifying easily and difficult-to-understand units to focus research; (5) Reducing computational complexity in neural circuit identification by excluding highly uninterpretable units from the search space. We demonstrate some of these applications by revisiting inconclusive results from previous work (Sec. 4.2.1) and performing hyperparameter selection (e.g., Sec. I for SAEs). We will further highlight these results in the main text using the extended page limit for the camera-ready version.

  • So far, the exact strategy humans employ in the 2AFC task is unknown. Based on our own subjective experience (and earlier work), we initially hypothesized a focus on local pattern recognition. Nevertheless, we decided to test how aligned the decisions of a machine based on a global similarity metric like DreamSim are with those of humans. To our surprise, the strong correlation between MIS and human decisions suggests a different strategy may be dominant. While a small performance gap between humans and the MIS may exist (currently unanalyzable due to the noise ceiling in the human data), future work could explore incorporating a "local pattern search" strategy into the MIS.

  • Both MIS and the evaluation protocol used for MAIA assess explanation informativeness. However, their implementations differ due to the distinct goals/outputs of MAIA (textual description) and our MIS (interpretability score): MAIA generates textual descriptions, and LLMs and text-to-image models are used to evaluate activation differences in generated images based on these descriptions. In contrast, our MIS, grounded in established human psychophysical setups [6, 49, 50], utilizes natural images from a large database that elicit high/low activations and simulates human identification of these differences. This approach allows for a direct assessment of interpretability based on human perception.

  • While the 98th percentile might still seem high, please note that our results (Fig. 1 of the general response) indicate that this task is nevertheless very hard for current models. This shows there is ample room for models to increase their interpretability. Evaluating lower percentiles can be insightful in future work: as model interpretability (whether of base models or SAEs) or explanation quality improves, exploring a broader range of the distribution will become increasingly valuable.

  • Thank you for appreciating our effort! We are currently training SAEs for less interpretable layers but can already share some preliminary results with you for layer2_2_conv2 of a ResNet50 (MIS=85.83%). The table below now shows that using SAEs is beneficial compared to using the original layer. Moreover, this demonstrates how the MIS can be used for hyperparameter tuning (i.e., choosing the optimal sparsity weight). We will continue with more experiments and integrate them into the camera-ready version of our paper!

| Sparsity Weight | L0 Count | MIS [%] | MIS Improvement to Original Layer [%] |
| --- | --- | --- | --- |
| 0.01125 | 233 | 89.17 | 3.34 |
| 0.02500 | 138 | 90.79 | 4.96 |
| 0.03750 | 99 | 91.60 | 5.77 |
| 0.05000 | 75 | 91.47 | 5.64 |
| 0.06250 | 60 | 91.87 | 6.04 |
| 0.07500 | 49 | 91.84 | 6.01 |
| 0.08750 | 41 | 92.18 | 6.35 |
| 0.10000 | 35 | 91.78 | 5.95 |

We hope to have clarified our message, answered your questions, and addressed your concerns. We sincerely hope this information offers a clearer understanding of our work, allowing you to reassess our work's value and increase your score.

Comment

Thanks for your reply. I decided to raise my score to 5. I encourage the authors to add the potential use cases to the discussion/motivation of the paper and to include the additional analysis of SAEs.

Comment

Thank you very much for increasing your score! We will ensure that all our explanations and new results are integrated into the camera-ready version of our paper.

Author Response

Dear reviewers,
Thank you for your valuable feedback. We are delighted that you praise our paper as a “triumph of the genre” with “outstanding quality” (Rev. gaJB) and find its results “overall interesting” (Rev. P5jG) and “potentially very useful” (Rev. j42j) for an “important” topic (Rev. 8Jb5).

Based on your feedback and questions, we implemented the following changes in our paper for the rebuttal:

  • We explained and demonstrated how the MIS can be computed not just for the extremes of the activation distribution but also for the rest of the distribution (Rev. 8Jb5 & j42j) (Fig. 1 in attached PDF)
  • We conducted a “distributional study” (Rev. P5jG) on the relation between layer width and interpretability (Fig. 2 in attached PDF)
  • We computed the correlation between MIS and HIS on the new data collected for Fig. 2C. (Rev. j42j)
  • We determined the Pareto frontier of models in terms of their accuracy-interpretability tradeoff (Rev. gaJB)
  • We explored reasons for why the interpretability of a ResNet50 decreases during training (Rev. P5jG) by visualizing units with a particularly strong drop in MIS (Fig. 3 in attached PDF)

Please let us know whether our responses below addressed all of your questions or whether there are further questions we can answer so that you feel more confident in our work and can increase your score.

Comment

Dear reviewers,
Thank you again for your insightful feedback and engaging discussions throughout the review process! We are grateful that you recognize the potential and value of our work and now unanimously give this paper a positive score.

Based on your initial reviews and subsequent questions, we have made several revisions to enhance the clarity and impact of our paper (see previous post). In addition to the changes outlined in our initial responses, we have implemented the following for the camera-ready version:

  • We conducted additional SAE experiments on less interpretable layers of another network (Rev. 8Jb5)
  • We further emphasized the numerous potential practical applications of our MIS in both the introduction and discussion sections (Rev. 8Jb5)
Final Decision

After a productive discussion period, the reviewers came to consensus that this paper was above threshold for acceptance. This approach holds great promise as a general purpose method for understanding large-scale model representations and strategies, and should prove impactful for AI practitioners trying to characterize the success and error modes of their models and cognitive scientists who want to compare models and biological brains. I recommend this paper for acceptance.