AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Abstract
Reviews and Discussion
The paper proposes a method to align different vision models' features to a common space, and to discover interpretable features as clusters in this space. The alignment is done by learning linear mappings from features to fMRI activations, the intuition being that the human visual cortex provides a meaningfully structured space, in which locations have known properties (e.g. different regions are known to respond to specific concepts) and are thus readily interpretable. Once the features have been linearly aligned to this common space, they are treated as a weighted, fully connected graph, wherein each image patch is a node, and edge weights (affinities) are computed based on the cosine similarity between the features in each node. A standard spectral clustering method (Normalized Cut) is used to compute a soft partition of this graph into sub-graphs (clusters). As performing this clustering on the full graph would be computationally infeasible, the authors propose to cluster a sub-sample of the graph, and then propagate the resulting clusters to the K nearest neighbors of each subsampled node. Finally, in order to learn linear mappings that preserve the quality of the clusters, a regularization term is added to the reconstruction loss, which ensures that spectral clustering eigenvectors are preserved across the mapping, based on a subsample of nodes. This method is used to visualize, for each layer of three different models (MAE, DINO and CLIP), the concept that each image patch is assigned to, coded as a color. The 20 top eigenvectors are reduced to 3 dimensions using t-SNE, and these 3D vectors are shown as RGB colors. This visualization reveals that CLIP and DINO produce maps that are close to uniform in the first 4 layers, suggesting that figure-ground segmentation only emerges in later layers. MAE, on the other hand, shows signs of segmentation from earlier layers.
The segmentations extracted from the models in this way are evaluated on the ImageNet-segmentation benchmark, confirming that in CLIP (the model that showed the strongest segmentation) the segmentation emerges at layer 4, and plateaus afterwards. Using the PASCAL VOC benchmark, which also includes category labels, CLIP is also found to encode categorical information, which peaks at layers 9 and 10. In another analysis, a discovered "figure/ground" concept is visualized by averaging its activation within the "figure" and "ground" regions (based on the ImageNet-segmentation ground truth labels) and plotting it on the surface of the brain, showing that areas known to encode objects, faces or bodies tend to respond more to the foreground, while scene-selective areas more to the background. This figure/ground concept is found to be agnostic to object category, and to an extent, consistent across models. In the next section, concepts corresponding to different object categories are visualized on images, and on the surface of the brain. Finally, a 2D t-SNE visualization of the evolution of the features across layers shows a bifurcation between figure and background as the layer depth increases, in both CLIP and DINO.
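For readers who want a concrete picture of the clustering pipeline summarized above (cosine-affinity graph, Normalized Cut on a subsample, propagation to the remaining nodes through K nearest neighbors), here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the clipping of negative similarities to zero, the uniform random subsampling, and the similarity-weighted averaging used for propagation are all hypothetical choices, and the eigenvector-preserving regularization term is not included.

```python
import numpy as np

def cosine_affinity(a, b):
    """Cosine similarity between rows of a and rows of b.
    Negative values are clipped to zero (an assumption of this sketch)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.clip(a @ b.T, 0.0, None)

def ncut_eigenvectors(features, n_eig):
    """Leading Normalized Cut eigenvectors of the fully connected affinity graph."""
    w = cosine_affinity(features, features)
    d = w.sum(axis=1)                       # node degrees (>= 1 due to self-similarity)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    m = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(m)             # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_eig]          # keep the n_eig largest
    return top * d_inv_sqrt[:, None]        # map back to generalized eigenvectors

def propagate_knn(features, sub_features, sub_eigvecs, k):
    """Assign each node the similarity-weighted mean of the eigenvector
    values of its k most similar subsampled nodes."""
    sim = cosine_affinity(features, sub_features)            # (n, m)
    idx = np.argsort(-sim, axis=1)[:, :k]                    # (n, k)
    w = np.take_along_axis(sim, idx, axis=1)
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    return np.einsum("nk,nkd->nd", w, sub_eigvecs[idx])

def approximate_ncut(features, n_eig=3, n_sub=50, k=5, seed=0):
    """Cluster a random subsample, then propagate to all nodes."""
    rng = np.random.default_rng(seed)
    n_sub = min(n_sub, len(features))
    sub_idx = rng.choice(len(features), size=n_sub, replace=False)
    sub = features[sub_idx]
    sub_eig = ncut_eigenvectors(sub, n_eig)
    return propagate_knn(features, sub, sub_eig, k)
```

Only the subsample requires an eigendecomposition, so the cost of that step depends on `n_sub` rather than on the total number of patches; the propagation step is a single similarity lookup per node.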
Strengths
- The paper proposes to use the human brain as a shared space in which to evaluate different models: this is a clever intuition, which might prove useful for interpreting differences between models.
- It shows that spectral clustering can be a well-suited method for grouping the features of vision models, and in particular ViTs, into meaningful clusters.
- It proposes a clever modification of an existing subsampling-based method for graph clustering, by using K nearest neighbors.
Weaknesses
- The overall concept is not made clear in the Introduction. The method is relatively simple conceptually, consisting of a step in which multiple models' features are aligned to the brain to provide a common reference frame, followed by clustering of the features within this common space. The Introduction does not make this pipeline clear. Specifically, while the alignment into a common space is clear, the clustering procedure is not explained: first, the authors evoke neuroscientific ideas on lines 39-41, without discussing how these relate to the problem of clustering features, nor even that the goal of the current work is to find clusters of features. Subsequently, on lines 42-47, they discuss the problem in terms of "graph edges incidents on each pixel", and "channel grouping hypothes[e]s". While this is partly a matter of subjective taste, I believe that discussing the problem at hand in terms of reducing the dimensionality of features at each location (image patch), thus finding a small number of dimensions (combinations of features) which can explain the affinity structure between patches, would be more easily understood by most readers.
- Several key details about the methods are left out: for example, almost no information about training (learning rate, optimizer used, batch size, number of epochs) is provided.
- Related to the previous point, as the Appendix contains several important methodological details (such as the use of additional regularization losses) but is never referred to in the main text, the authors should add references to it where relevant.
- A key component of the proposed method is the learning of a mapping of different models' features into a common space, and as shown in Figure 3, this does indeed result in the cosine similarities of different channels' activations becoming more similar across models. At the same time, the goal of the method is to uncover differences between models. The authors should include an explicit discussion of what kind of model differences are likely to be preserved, and which are likely to be destroyed in the alignment process. As a possible suggestion in this direction, the paper makes several references to the models' features before alignment (for example comparing their segmentations' evaluations with the aligned features in Figure 5), but these features are never visualized. A direct comparison of each model's (clustered) features before and after alignment would be very informative.
- The feature clusters are visualized by reducing their dimensionality to 3D using t-SNE, and visualizing the resulting 3D features as RGB colors. This visualization, however, is not easily interpretable, as different channels are conflated together by the additional dimensionality reduction. Visualizing single channels separately might be more useful to understand the nature of the discovered clusters. Was there a specific reason for choosing the 3D t-SNE visualization rather than showing single channels?
- Overall, it is not clear what the discovered channels can tell us about the models. The single interpretable channel that is discussed in depth in the paper is figure/ground. While this provides a good sanity check on the meaningfulness of some of the discovered features, the ability of vision transformers to segment objects has been observed in several papers (e.g. Melas-Kyriazi et al. 2022, Xu et al. 2023), and the different responsiveness of different regions in the visual cortex to figure and background is well established. Other concepts revealed by the method (the category concepts in Figure 8) are shown on the surface of the brain, but it is hard to interpret what these brain maps mean. The authors should make the questions that can be answered using the proposed method clear and explicit.
- The paper fails to cite closely related work in the Related Works section. In particular, it cites mechanistic interpretability work in other fields, such as language, but not the recent rich line of work that has specifically looked at vision transformers' ability to perform specific visual tasks. The two papers cited above (Melas-Kyriazi et al. 2022, Xu et al. 2023) are good examples, the former in particular as it proposes a segmentation method based on spectral clustering which is very close to the present one. Another relevant paper is El Banani et al. (2024), which looks at 3D-related tasks. As this is not my field of expertise, I am not aware of papers that look for meaningful directions in the space of vision transformers' channels, but I would be surprised if none existed. I would recommend that the authors do a more exhaustive literature search to find papers that more closely relate to the method proposed here.
- In Figure 7, the figure-ground visual concepts discovered by different models are plotted on brain maps. In the text, the authors write that "the foreground or background pixels activates similar brain ROIs across the three models". However, a glance at the brain maps reveals similarities, but also differences. A statistical evaluation of the similarity between different models' brain maps is recommended.
References
El Banani, M., Raj, A., Maninis, K. K., Kar, A., Li, Y., Rubinstein, M., ... & Jampani, V. (2024). Probing the 3d awareness of visual foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 21795-21806).
Melas-Kyriazi, L., Rupprecht, C., Laina, I., & Vedaldi, A. (2022). Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8364-8375).
Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., & De Mello, S. (2023). Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2955-2966).
Questions
- As I wrote in the "weaknesses" section, I found the explanation of the clustering procedure given in the Introduction, in terms of a bipartite graph connecting pixels and feature channels, misleading. However, I want to be absolutely sure that this is not due to my misunderstanding of the method. The graph on which the clustering is performed is the graph wherein each node is a vector with the features in a given image patch, every node connects with every other node, and the weights of the edges are the affinities between feature vectors, is that correct? What was the reasoning, then, behind discussing a graph comprising channels and pixels as different nodes in the Introduction?
Limitations
As I wrote in the "weaknesses" section, I believe the precise scope of the method (what kinds of questions can and cannot be answered with it) has not been properly acknowledged and discussed by the authors.
The authors propose a method to create a universal feature space using brain fMRI response prediction as a training objective. The key idea is that deep networks trained with different objectives share common feature channels that can be clustered into sets corresponding to distinct brain regions, revealing visual concepts. By tracing these clusters onto images, semantically meaningful object segments emerge without a supervised decoder. The paper employs spectral clustering on the universal feature space to produce hierarchical visual concepts, offering insights into how visual information is processed through different network layers. The two main insights are the localization of the emergence of foreground/background features and an interesting visualization of class-specific concepts using the top spectral-tSNE eigenvectors.
Strengths
First, congratulations to the authors. I liked reading this paper, and I think the experiments and the core idea are great:
- Using brain voxel response prediction to find the common space between models is interesting and novel.
- The authors propose visualizations of what we could interpret as a brain-activation subspace.
- I like the idea of visualizing how visual concepts emerge and transition through different layers of various models. However, I have concerns about its validity (see Weaknesses).
- The Nystrom-like approximation shows that the authors thought about scaling their method.
Weaknesses
Nevertheless, this paper has problems, some more important than others.
So I will separate them into major problems (M) and minor problems (m). I want to make it clear that for me, all these problems are solvable and do not detract from the quality of the paper.
Let's start with what I think are the Major problems (M):
M1. Related Work Quality (Page 9, Section 4):
- The related work section is critically weak: it has only 24 references and lacks depth in discussing relevant literature. The paper misses entire bodies of work on (1) concept-based XAI, (2) alignment of brain and activations, (3) the study of representations, and (4) attribution methods, which are either crucial or should be mentioned for this study. A significant rewrite is necessary to properly position the paper within the existing body of work.
M2. Validity of t-SNE for Distance Measures (Figure 9):
- The paper uses t-SNE for analyzing bifurcation in feature space, but t-SNE is known to distort distances. This raises concerns about the validity of conclusions drawn from t-SNE plots regarding feature bifurcation.
Now for the minor problems (m):
m1. Redundancy of Discovery Claim (Page 2, Line 25):
- The claim that channel feature correspondence exists across networks is not new, as it has been extensively studied in major works (e.g. using CKA, RSA...). Please update the paper and compare your work to this literature.
m2. Reliance on Channels (General):
- Channels are not necessarily the best basis for analysis, as recent research suggests there are better ways to represent features, such as directions (the neuron is not a great basis). The paper should address why it continues to rely on channels, considering these limitations.
m3. Orthogonality Assumption (General):
- The paper's assumption of orthogonality in feature decomposition does not align with current understanding, especially regarding the neural collapse phenomenon in late layers. Put differently, all points for the class "tench" have nearly collapsed in the last layer, so relaxing orthogonality (e.g. dictionary learning) may be a good idea (although, if I am correct, the Nystrom approximation should not yield perfectly orthogonal vectors). This should be discussed.
m4. Parameter Sensitivity (Appendix):
- As always when there are hyperparameters, I expect a short discussion of the effect of changing them (λ eigen, λ zero, and λ cov). This would help readers understand the robustness of the method to these hyperparameters.
m5. Direct t-SNE Application (General):
- The paper uses eigenvectors for t-SNE. It would be more straightforward to apply t-SNE directly to the data, and the paper should justify the chosen approach.
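As a concrete pointer for item m1, the snippet below sketches linear CKA, one of the representation-similarity measures named there. It is a generic illustration written for this review and is not taken from the paper; the function name `linear_cka` and the input layout (rows are the same stimuli or patches in both matrices, columns are channels) are assumptions.

```python
import numpy as np

def linear_cka(x, y):
    """Linear centered kernel alignment between two feature matrices.

    x: (n_samples, d1) and y: (n_samples, d2), with rows in the same order.
    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    # Center each channel so the measure ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

Reporting such a score for each pair of models, before and after alignment, would be one way to relate the claimed channel correspondence to this literature.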
Questions
The paper presents a novel and promising approach, but it has several critical issues that need addressing. The major problems, particularly regarding the related work quality, assumptions of linear transformations, and the validity of t-SNE for distance measures, are significant and currently undermine the paper's contributions. Addressing these concerns would substantially improve the paper. The minor issues, while less critical, also need attention to enhance the clarity and robustness of the presentation.
Limitations
Yes, the limitations identified by the authors are accurate and well-documented. Regarding the weakness I mentioned, I reserve the right to increase the score if the authors adequately address my major concerns.
The paper introduces a new method to interpret deep learning models using brain data. The two apparent contributions are that (1) this new model is able to align channel activations from different layers of different models into a universal feature space, and (2) a Nystrom-like approximation is introduced to speed up spectral clustering analysis.
Strengths
I have to be very transparent in this review. Unfortunately, I found it extremely difficult to follow this paper, and I was not able to understand its different components; as a consequence, I don't feel capable of evaluating what the possible strengths of this work are.
Weaknesses
As I said in the previous section, I was not able to understand the methodological details of this paper. A lot of terms and concepts are used throughout the paper without a proper explanation, and as a consequence I cannot honestly understand what is going on with the methods of this paper to properly evaluate it. This paper seems to have a lot of work in it, so I want to believe that what's happening here is that (1) this is one of the first - if not the first - paper from the authors, and thus they lack the experience to explain what they did in a way that their peers can understand, (2) a lot of the concepts used are seen by the authors as very obvious jargon from the subfield, and thus the problem is that I'm not familiar with this subfield, or (3) both. I'm leaving my doubts in the next section, in the hope that they will allow us to understand better whether my understanding difficulties are related to points (1), (2), or (3). As a consequence, I'm rating this paper as a borderline reject, hoping that during the rebuttal period the authors will have time to tackle these readability issues, and as a result I'll be able to properly reassess this work.
Questions
- The work by Yang and collaborators from 2024 seems to be very important, as it is mentioned immediately in the first paragraph of Introduction. Indeed, the one-sentence summary seems to indicate that the authors are presenting something very similar to Yang et al.'s work. What are the similarities and the differences between this paper and Yang et al.'s work?
- In Figure 2's caption, what is an "early" and "late" brain? What are the different "levels" of segmentation, and how is this done? What is the "spectral-tSNE" method, an adaptation of tSNE? If we have image pixels, how can they be coloured by some sort of "3D" method?
- Between lines 42-49, the authors try to explain some sort of "hypothesis". What hypotheses are we talking about here?
- Between lines 42-47, it is said that each channel produces a per-pixel response. In later/deeper layers of a deep learning model, the receptive field is larger, and thus there might be different ways to connect an input pixel to different channels. Can the authors clarify this point?
- If the Nystrom-like approximation is presented as a key contribution of this paper, why don't we have experimental evaluations showing the computational speed-ups in practice?
- What do the authors mean by "using brain encoding as supervision" in line 74?
- What do the authors mean by the term "meanings" in line 76? Can they be more precise about what this means in more computational terms?
- What are the actual contributions from points 2 and 3 between lines 77-81? The discovery of these points?
- What are the "features across different models" mentioned in section 2? Channel activations?
- When the authors mention "brain response prediction" or "brain prediction" or "brain prediction target" in section 2, what are these "predictions" and how and where do they come from?
- What do the authors mean by "all levels of semantics" and how do they relate to "rich representations" in line 94?
- Shouldn't the "Brain Dataset" in lines 103-106 be moved to the section where experiments are being characterised, rather than in the methods section? When the authors say that they've used the "first subject's (...)" does this mean they've only used one person's brain scan in this paper? Why and how was this data preprocessed?
- What training procedures were done in this paper? Was the NSD dataset used to train CLIP, DINOv2, and MAE mentioned in section 3? What train/validation/test splits were used? How were hyperparameters selected? How was the overall training procedure? How are the NSD, ImageNet, and PASCAL datasets mentioned in this paper used together? Why is NSD presented in section 2.1, but then it's not mentioned again?
- In section 2.1, I believe the transformation is what the authors name "channel alignment". The way these transformations are presented, they only seem to be learnable linear transformations almost as if we were using an MLP. In this sense, how is this an "alignment"? Assuming there is some complex loss function (though nothing seems to be mentioned in the main paper) applied to the final in equation 2, then it means that the learned will be some complex transformation learned by the network, but not necessarily aligned to a "universal" space, right?
- fMRI signals are supposed to be 4D, so how did the authors get to the 3D brain voxels mentioned in line 115?
- In lines 125 and 126 the authors mentioned "the" graph. Is this a specific graph? Or just "a" graph to explain how spectral clustering works?
- What are the "brain scores" and "brain prediction scores" mentioned in section 2.3?
- It is mentioned that equation 5 is "added". Added to what?
- When the authors say that they "extracted features" in line 165, do they maybe mean channel activations? If not, can they be specific about how and what is being extracted?
Limitations
Limitations of this work are mentioned in the Conclusion section, but no discussion about the potential negative impact of this work is presented.
In the domain of interpretability research, this paper aims to make a mark by proposing AlignedCut, a method to discover shared and expressive visual feature spaces across networks by aligning those spaces with neural responses in human brains. The method is quite interesting - channel-wise responses to images are aggregated and "feature clusters" are formed on the basis of the functional connectivity between the pixels and linear combinations of channels. These linear combinations are acquired by predicting neural responses to the same images. The feature space spanned by the neural responses is considered the universal feature space. The eigenvectors corresponding to those feature clusters help us visualize what parts of the image the networks rely on to encode which concept, thus providing an interesting interpretability lens. The most striking example presented is how figure-ground segmentation can be interpreted as a mapping between the input and specific channels in various networks.
Strengths
Originality
- The AlignedCut method is new to me - and is super interesting - however, I am not an expert in that specific sub-field so I am not sure of its novelty.
- The interpretability lens on figure-ground segmentation is very informative, however, again I cannot judge its novelty.
Quality
- The authors present plenty of analysis to demonstrate the power of their method, which helps in inspiring some confidence in the claims.
Clarity
- The methods and results are relatively clear and the authors provide useful context at the start of each section.
Significance
- Linking pixels to visual features, parameterized through network activations and neural activations opens doors in interpretability research.
Weaknesses
I see three major weaknesses:
- The necessity of the brain is unclear to me. Instead of aligning features to the brain, you could've aligned the features of the different networks to each other - creating an "emergent" universal feature space. Would your results, e.g. w.r.t. the figure-ground segmentation, change much if you do so? If not, what does bringing the brain into play buy us here in terms of network response interpretability? This is unclear to me.
- Most of the results need robustness checks. For example, in Fig. 6 you show foreground vs background difference in neural response associations. Presumably, that's an average across a lot (all?) of images. Could you indicate some sign of robustness, for example by running a permutation test to assess how likely the differences you see would've been expected given the data statistics alone? The same holds for Figs. 7 and 8. We need to know if these differences are flukes or not.
- Reliance solely on ViTs. To make your point more general, showing that a high-performing CNN shows the same results would be very informative. ViTs have more expressivity in terms of patches interacting with each other - perhaps figure-ground segmentation isn't as strong in CNNs (although if previous research is to be trusted, CNNs should have some notion of figure-ground segmentation; see Hong et al. NatNeuro 2016 and Thorat et al. SVRHM 2021).
Refs:
- Hong, Ha, et al. "Explicit information for category-orthogonal object properties increases along the ventral stream." Nature Neuroscience 19.4 (2016): 613-622.
- Thorat, Sushrut, Giacomo Aldegheri, and Tim C. Kietzmann. "Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization." SVRHM 2021 Workshop @ NeurIPS.
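To make the permutation test requested in the second weakness concrete, here is a minimal sketch of a paired sign-flip test. It assumes, hypothetically, that for a single voxel or ROI one has a per-image foreground value and a per-image background value; the function name and input layout are illustrative and not taken from the paper.

```python
import numpy as np

def sign_flip_test(fg, bg, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean foreground-background
    difference. Under the null hypothesis the sign of each per-image
    difference is exchangeable, so signs are flipped at random."""
    diff = np.asarray(fg, dtype=float) - np.asarray(bg, dtype=float)
    observed = diff.mean()
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = (signs * diff).mean(axis=1)
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p_value
```

Applied per voxel, the resulting p-values would still need a correction for multiple comparisons across the cortical surface before being drawn on a brain map.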
Questions
- Is there a reason to start the paper with Yang et al. 2024 as a reference for "mapping b/w brain and deep nets"? These types of mappings have been studied since 2014 (see Khaligh-Razavi et al. PLOS Compbio 2014) and reviews such as Doerig et al. NatRev 2023 might be better suited for this.
- Correction for 2.1: In NSD, the plan was to collect responses to 10k images per participant - however many participants did not complete the experiment.
- In Sections 2.2 and 2.3, would it be possible to provide a layperson summary to drive the assumptions home? Esp. the interpretation of Eqs. 3 and 4
- In Section 3.1 you claim figure-ground segmentation emerges before categories based on the accuracy hitting ceiling for figure-ground segmentations earlier. I found it hard to digest. Usually, if we want to say some information is present before another, we look for hints of that information through "time"/layers. That would amount to something like the first layer where the performance is above chance.
Refs:
- Khaligh-Razavi, Seyed-Mahdi, and Nikolaus Kriegeskorte. "Deep supervised, but not unsupervised, models may explain IT cortical representation." PLoS computational biology 10.11 (2014): e1003915.
- Doerig, Adrien, et al. "The neuroconnectionist research programme." Nature Reviews Neuroscience 24.7 (2023): 431-450.
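The distinction drawn in the last question above, between the layer where information first appears and the layer where performance saturates, can be stated as two small functions. This is only a sketch: the function names, the margin and tolerance parameters, and the numbers in the usage example are toy values invented for illustration and are not results from the paper.

```python
def first_layer_above_chance(scores, chance, margin=0.0):
    """Index of the first layer whose score exceeds chance + margin, or None."""
    for layer, score in enumerate(scores):
        if score > chance + margin:
            return layer
    return None

def plateau_layer(scores, tol=0.01):
    """Index of the first layer whose score is within tol of the best score."""
    best = max(scores)
    for layer, score in enumerate(scores):
        if score >= best - tol:
            return layer

# Toy per-layer accuracies, made up for illustration only.
toy_scores = [0.50, 0.51, 0.62, 0.80, 0.84, 0.85]
emergence = first_layer_above_chance(toy_scores, chance=0.5, margin=0.05)
saturation = plateau_layer(toy_scores, tol=0.02)
```

Comparing `emergence` for figure-ground and for category information would address the ordering claim more directly than comparing the two `saturation` values.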
Limitations
The authors mentioned methodological limitations. It is sufficient.
The reviewers appreciate the overall idea of the paper, but raise several substantial concerns regarding the motivation, implementation and clarity.