In Silico Mapping of Visual Categorical Selectivity Across the Whole Brain
We built a transformer-based model to predict brain activity across the whole brain from visual input, then used this model to label the categorical selectivity of areas beyond the visual cortex to better understand higher-order visual processing.
Abstract
Reviews and Discussion
This paper introduces a transformer-based encoder-decoder model ensemble to predict fMRI parcel vertex activity from image stimuli and use this as an in-silico model of the brain to study visual selectivity of cortical regions. The introduced ensemble is composed of multiple frozen image encoder backbones and trainable decoders to predict fMRI responses. The authors demonstrate the use of this ensemble model in assigning semantic meaning to parcels in and outside of the visual cortex (within single subjects and across multiple subjects) as a means to efficiently generate hypotheses about categorically selective regions of the brain.
Strengths and Weaknesses
Strengths:
- Practical Utility in Experimental Design: The paper presents an intuitive framework for selecting brain parcels for future fMRI studies, particularly through its “Choosing parcels for future fMRI experimentation” section. This provides experimenters with a principled and cost-effective method to prioritize regions for follow-up studies based on encoder-derived selectivity metrics.
- Mapping selectivity beyond classical visual areas: The paper offers proof-of-concept results (proof-of-concept in the sense that in some cases selectivity hypotheses were generated by the model but not verified in-vivo) for identifying selectivity of regions outside of canonical visual areas. This type of analysis could offer insight into integration of visual information into regions of the brain that are not directly involved in visual processing.
- Quantitative evaluation of hypotheses: The authors go beyond qualitative image analysis and validate parcel selectivity hypotheses using semantic similarity metrics and ground-truth fMRI data.
Weaknesses:
- Positioning relative to prior work: While the paper introduces an ensemble transformer model, many prior works have also trained DNNs on the NSD dataset to predict brain responses and use these models to infer selectivity of arbitrary voxels. It is unclear to me how this method substantially differs in outcome or capability beyond architectural novelty.
- Lack of baseline comparisons: A more direct comparison between existing models that predict fMRI activity from image stimuli or ablation would help clarify the need for the proposed model ensemble (Does the proposed model predict fMRI activity more accurately than existing methods? Are the learnable parcel queries necessary compared to vanilla decoders?). The authors suggest that previous works are limited by the fact that they frequently rely on the use of an affine transformation from model representations to predict fMRI activity, but do not seem to provide evidence that their proposed model offers more accurate predictions.
- Method is limited to regions of the brain that are categorically and semantically selective: While the model demonstrates semantic selectivity for higher-order concepts and object categories, it does not convincingly show whether it can uncover selectivity for lower-level visual features (e.g., texture, orientation, motion), which are critical components of the visual processing hierarchy.
Questions
- Ensemble ablation: What is the variability in fMRI predictivity among the individual models of the ensemble?
- Comparison to existing baselines: How much better does the proposed ensemble (and also individual models in the ensemble) predict fMRI activity compared to pre-trained vision (e.g., ResNet-50, DINO, etc.) models with learned affine transformations from intermediate layers?
- Comparison with existing work: Where do you suggest this method stands in comparison to pre-trained models that have been aligned with cortical regions of the brain with affine transformations and diffusion-based methods (e.g., MindEye) that synthesize images based on in-silico fMRI activity (thereby offering a platform for analyzing selectivity of different brain regions)? Any concrete examples of what the proposed method can reveal and what existing methods cannot would help clarify the contributions of this work.
Limitations
yes
Final Justification
In follow-up responses with the authors, they helped to justify, and provided quantitative results for, the unsupported claims in the paper that I had highlighted in my review. Additionally, the authors helped to clarify their model design choices and why this is an improvement over comparable models used for similar applications.
The primary unresolved issues that remain include (1) limited comparison to existing baselines in terms of how well the models can explain variance in well-known voxels (not the entire brain, but rather common areas such as V1, V2, V4, IT) and (2) evaluating these baselines for in-silico mapping on the same datasets to verify the improvements from the proposed approach.
Formatting Concerns
N/A
We thank Reviewer kyMo for the helpful review. We address specific questions below.
Q1) Positioning relative to prior work
In silico mapping of the brain has received much attention in recent years, and we build on many excellent prior works. Our method offers several key innovations:
- Massive scale. We apply in silico mapping on a massive scale with ImageNet and BrainDIVE, made uniquely possible with the transformer encoder architecture. We demonstrate that our pipeline can successfully test parcel selectivity beyond those concepts explicitly shown to the participant. To the best of our knowledge, no other study has tested image datasets on this scale nor demonstrated the capability to test concepts beyond the training set in silico.
- Mapping of the whole brain. Previous work has focused on areas in the visual cortex (mostly V1-V4, some limited higher-level areas). We expand semantic selectivity mapping to the whole brain and discover areas with novel, complex semantic selectivity beyond the classic visual areas.
- In silico verification. Our pipeline can verify selectivity hypotheses in silico by evaluating how well a label can predict ground-truth activation on a held-out set.
- New fMRI experimental paradigm. As both image datasets and encoding models improve, our encoder-agnostic pipeline offers a way to leverage these advances to accelerate and improve the accuracy of whole-brain mapping.
Q2) Reproducing selectivity of low-level areas in the visual hierarchy
We can indeed reproduce the selectivity to low-level visual features in the early visual processing hierarchy. We optimized 32 superstimuli (16 per hemisphere) that maximally activate parcels in V1, V2, and V4 using the BrainDIVE framework. FFA is included as a representative ROI in IT cortex.
Generated superstimuli for early visual areas reproduce the classic coarse-to-fine hierarchy: V1 stimuli appear as cluttered scenes filled with dense, repetitive texture; V2 images add composite color patches and rudimentary objects; V4 stimuli reveal smoother, recognizable object forms; and FFA images almost exclusively depict close-up faces, often with clear emotional expressions. We will include these superstimuli in the revision.
To quantify hierarchical properties, we examined whether the spatial-frequency content of our stimuli mirrors classical physiological findings (e.g., [1,2]). We computed the radial average power spectrum for each image [3] and calculated the proportion of spectral power above 10%, 20%, and 30% of the maximum spatial frequency. At every threshold the high-frequency energy ratio decreases monotonically from V1 > V2 > V4 > FFA, indicating that the images that best drive higher-level areas (FFA) contain proportionally less fine-scale texture and relatively more coarse, low-frequency structure.
| Threshold | ROI | High-frequency energy ratio |
|---|---|---|
| 0.1 | V1 | 0.006547 |
| 0.1 | V2 | 0.004160 |
| 0.1 | V4 | 0.002500 |
| 0.1 | FFA | 0.001347 |
| 0.2 | V1 | 0.001838 |
| 0.2 | V2 | 0.000944 |
| 0.2 | V4 | 0.000699 |
| 0.2 | FFA | 0.000299 |
| 0.3 | V1 | 0.000762 |
| 0.3 | V2 | 0.000339 |
| 0.3 | V4 | 0.000285 |
| 0.3 | FFA | 0.000104 |
Note: the natural image datasets (ImageNet, NSD) did not show as clear of a V1 > V2 > V4 > FFA ordering because their spectra cluster around natural photo statistics, masking early-visual preferences.
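For concreteness, below is a minimal sketch of how such a spectral analysis could be implemented (illustrative only, not the authors' exact code); it assumes a square grayscale image array and reports the fraction of radially averaged spectral power above each fraction of the maximum spatial frequency.

```python
import numpy as np

def high_freq_energy_ratio(img, thresholds=(0.1, 0.2, 0.3)):
    """Fraction of radially averaged spectral power above each fraction of
    the maximum spatial frequency, for a square grayscale image `img`."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = power.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices(power.shape)
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)
    # Radially averaged power spectrum: mean power at each integer radius.
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    r_max = min(cy, cx)                      # highest unambiguous radius
    radial = sums[:r_max] / counts[:r_max]
    total = radial.sum()
    return {t: radial[int(t * r_max):].sum() / total for t in thresholds}
```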
Q3) Ensemble ablation
Encoding accuracy, CLIP backbone
| Subject / Hemi | Layer -1 Run 1 | Layer -1 Run 2 | Layer -3 Run 1 | Layer -3 Run 2 | Layer -5 Run 1 | Layer -5 Run 2 | Layer -7 Run 1 | Layer -7 Run 2 |
|---|---|---|---|---|---|---|---|---|
| S1 LH | 0.346 | 0.341 | 0.348 | 0.348 | 0.331 | 0.338 | 0.353 | 0.350 |
| S1 RH | 0.345 | 0.354 | 0.342 | 0.355 | 0.337 | 0.353 | 0.347 | 0.342 |
| S2 LH | 0.384 | 0.372 | 0.368 | 0.371 | 0.377 | 0.374 | 0.371 | 0.375 |
| S2 RH | 0.403 | 0.395 | 0.402 | 0.418 | 0.414 | 0.404 | 0.405 | 0.410 |
| S5 LH | 0.344 | 0.345 | 0.352 | 0.349 | 0.351 | 0.345 | 0.344 | 0.346 |
| S5 RH | 0.331 | 0.331 | 0.325 | 0.341 | 0.324 | 0.332 | 0.328 | 0.335 |
| S7 LH | 0.322 | 0.327 | 0.330 | 0.332 | 0.328 | 0.330 | 0.320 | 0.341 |
| S7 RH | 0.306 | 0.317 | 0.313 | 0.303 | 0.309 | 0.316 | 0.309 | 0.317 |
Encoding accuracy, DINOv2 (ViT-B) backbone
| Subject / Hemi | Layer -1 Run 1 | Layer -1 Run 2 | Layer -3 Run 1 | Layer -3 Run 2 | Layer -5 Run 1 | Layer -5 Run 2 | Layer -7 Run 1 | Layer -7 Run 2 |
|---|---|---|---|---|---|---|---|---|
| S1 LH | 0.367 | 0.361 | 0.367 | 0.360 | 0.339 | 0.340 | 0.306 | 0.300 |
| S1 RH | 0.364 | 0.360 | 0.363 | 0.369 | 0.354 | 0.348 | 0.302 | 0.303 |
| S2 LH | 0.315 | 0.312 | 0.411 | 0.422 | 0.392 | 0.403 | 0.356 | 0.358 |
| S2 RH | 0.353 | 0.351 | 0.455 | 0.450 | 0.435 | 0.438 | 0.388 | 0.390 |
| S5 LH | 0.378 | 0.380 | 0.369 | 0.368 | 0.368 | 0.367 | 0.332 | 0.341 |
| S5 RH | 0.370 | 0.365 | 0.361 | 0.360 | 0.345 | 0.341 | 0.313 | 0.313 |
| S7 LH | 0.344 | 0.348 | 0.339 | 0.332 | 0.327 | 0.336 | 0.307 | 0.304 |
| S7 RH | 0.333 | 0.340 | 0.324 | 0.320 | 0.311 | 0.309 | 0.297 | 0.292 |
Q4) Comparison to existing baselines
The transformer-based encoding model does outperform existing models, including those that rely on an affine transformation from model representations. As the focus of the manuscript centers around the other parts of the pipeline (which we note is encoder agnostic), we did not include those comparisons here. See [4] for a comparison of a similar encoder to past approaches, showing that the transformer-based approach greatly outperforms other pre-trained vision models with affine transformations.
Furthermore, encoders that rely on affine transformations from model representations are not feasible when scaled to make predictions for the whole brain. Building a linear encoding model for the whole brain with the same-size feature maps would have 242B parameters.† The transformer-based model learns parcel queries to reduce the feature size to 768, while still outperforming the full linear model. An experiment comparing the two isn't feasible to conduct, but [4] compares the performance of these architectures on the visual areas.
Given the recent interest and advances in encoding models, our contribution emphasizes that a sufficiently good encoder can generate "superstimuli" superior to those presented to the subject, and uncover complex semantic selectivity beyond the classic visual area.
†31*31 (patch size) * 768 (dimension size) * 327684 (voxels) = 242B parameters
Q5) Comparison with existing work
As Reviewer kyMo pointed out, decoding models (including MindEye [5]) can be used to study selectivity by manipulating neural activity in different areas and examining the observed changes in the reconstructed images. However, those decoding models have only been trained for the visual areas because they require learning very parameter-intensive mappings from the voxels to the intermediate representations needed as inputs to the diffusion models. Due to this limitation, to our knowledge, they have not yet been applied to decoding from whole-brain fMRI activity as they are not suitable for the type of experiments we are performing. Future more lightweight decoding models that can generalize to the whole brain would certainly be good baseline models for our approach. We will add a discussion of these models to the revision.
[1] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, January 1962. doi:10.1113/jphysiol.1962.sp006837.
[2] R. Desimone and S. J. Schein. Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. Journal of Neurophysiology, 57(3):835–868, March 1987. doi:10.1152/jn.1987.57.3.835.
[3] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, September 1996. doi:10.1016/0042-6989(96)00002-8.
[4] Hossein Adeli, Minni Sun, and Nikolaus Kriegeskorte. Transformer brain encoders explain human high‑level visual responses, 2025.
[5] Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth A. Norman, and Tanishq Mathew Abraham. Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors, 2023.
Thank you, authors, for your response.
Your clarifications about low-level selectivity, ensemble results, and positioning relative to prior work in terms of model scaling are helpful.
I would like to encourage the authors to substantiate some of these claims, however, with further evaluation to validate the contributions. Namely:
"We apply in silico mapping on a massive scale with ImageNet and BrainDIVE, made uniquely possible with the transformer encoder architecture."
In terms of scale of the evaluation dataset, many "brain-aligned" models have been evaluated on data outside of their training distribution [1, 2], so it is still unclear to me why it is being claimed that this is only possible with the proposed architecture. Most of these types of models are inherently trained to enable in-silico experimentation on held-out image data, so they can be readily used with ImageNet. Please correct me if I am misunderstanding your claim here.
"The transformer-based encoding model does outperform existing models, including those that rely on an affine transformation from model representations. As the focus of the manuscript centers around the other parts of the pipeline (which we note is encoder agnostic), we did not include those comparisons here."
Given that the entire pipeline is built around this model, it still seems critical that actual evidence is provided for this. I understand that this may not be feasible with every single voxel, but a comparison to baselines on a subset of voxels (e.g., V1, V2, V4) would be valuable.
Thank you again for your responses.
[1] Conwell et al. "A large-scale examination of inductive biases shaping high-level visual representation in brains and machines." Nature Communications, 2024.
[2] Schrimpf et al. "Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?" 2020.
As you point out, models with a linear transformation of features to neural activity can in theory be used here; however, the challenge with using them in our application has more to do with scaling them up to whole-brain prediction while achieving good encoding accuracy. The number of parameters in those models equals the number of features multiplied by the number of voxels in the brain. In our case, this would be
31*31 (patch size) * 768 (dimension size) * 327684 (voxels) = 242B parameters
Many solutions have been proposed to address this limitation of linear models. Some studies use only the CLS token from the feature backbone models and map that linearly to the voxels [1,2]. While simple, this approach discards a lot of patch-level information that could be very important for predicting brain activity across the whole brain. Other studies have used spatial-feature factorized models [3,4,5,6]. These models first learn a spatial mask (i.e., a spatial receptive field) for the area of the image that the ROI or voxel responds to. Then the features are aggregated from those tokens to create a single token that is linearly mapped to the voxel values. However, these approaches are limited to capturing only static receptive fields. The brain areas whose selectivity we want to label are deeper in the perceptual hierarchy and have content-dependent receptive fields.
A transformer-based encoding model [7] can capture exactly these receptive fields. For each ROI the model learns queries that are matched against the keys from different image patches. As shown in Fig. 1b, each ROI query can then learn to attend to a patch based on its content, its position, or a combination of the two. This allows the model to learn the selectivity of each ROI and route only the relevant information to it.
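As a rough illustration of this mechanism, here is a minimal PyTorch sketch of a parcel-query cross-attention readout (hypothetical dimensions and a shared linear head across parcels; this is not the exact architecture of [7]):

```python
import torch
import torch.nn as nn

class ParcelCrossAttentionReadout(nn.Module):
    """One learned query per parcel attends over frozen backbone patch tokens;
    the attended feature is mapped to that parcel's vertex responses."""
    def __init__(self, n_parcels, d_model=768, n_heads=8, vertices_per_parcel=100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_parcels, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Simplification: a single shared linear head; in practice each parcel
        # would have its own vertex count and readout weights.
        self.readout = nn.Linear(d_model, vertices_per_parcel)

    def forward(self, patch_tokens):                       # (B, n_patches, d_model)
        q = self.queries.unsqueeze(0).repeat(patch_tokens.size(0), 1, 1)
        attended, attn_weights = self.attn(q, patch_tokens, patch_tokens)
        # returns (B, n_parcels, vertices) and (B, n_parcels, n_patches)
        return self.readout(attended), attn_weights
```

Because each query interacts with the patch keys, which carry both content and positional information, the effective receptive field of a parcel can shift with image content, unlike a fixed spatial mask.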
In order to provide a fair (parameter-matched) comparison to the linear models, we compare against a model that uses the CLS token to predict all the vertices. This model has the same number of parameters as our model but performs significantly worse.
Ensemble Encoding accuracy using DINOv2 backbone
| Architecture | S1 | S2 | S5 | S7 |
|---|---|---|---|---|
| CLS | 0.33 | 0.34 | 0.39 | 0.33 |
| Transformer (ours) | 0.45 | 0.43 | 0.48 | 0.43 |
To further strengthen our argument we performed the parcel labeling test using the CLS+linear encoder. We can see below that the linear model performs significantly worse compared to the transformer encoder.
Spearman's ρ (mean ± std) between the model-predicted and ground-truth activation rankings on the NSD test set, averaged across parcels.
| | S1 |
|---|---|
| NSD train (ground truth) | 0.149 +/- 0.105 |
| BrainDIVE (DINOv2 patch + transformer) | 0.163 +/- 0.121 |
| BrainDIVE (DINOv2 cls + affine) | 0.136 +/- 0.118 |
The Transformer attention mechanism provides an elegant solution that massively cuts down on the number of parameters (compared to full regression models) while improving performance by taking into account the dynamics of brain computations. This becomes especially important when we go beyond the retinotopic visual areas. We will include this discussion in the revision to better motivate our choice of the brain encoder. We will also add a figure (similar to Fig. 3b) comparing the transformer-based vs. the linear model accuracy across the whole brain.
We appreciate your thoughtful comments and hope our responses have addressed them. Please let us know if you have any additional comments.
[1] Luo, A., Henderson, M., Wehbe, L., & Tarr, M. (2023). Brain diffusion for visual exploration: Cortical discovery using large scale generative models. Advances in Neural Information Processing Systems, 36, 75740-75781.
[2] Luo, A. F., Henderson, M. M., Tarr, M. J., & Wehbe, L. (2023). Brainscuba: Fine-grained natural language captions of visual cortex selectivity. arXiv preprint arXiv:2310.04420.
[3] Lurz, K. K., Bashiri, M., Willeke, K., Jagadish, A. K., Wang, E., Walker, E. Y., ... & Sinz, F. H. (2020). Generalization in data-driven models of primary visual cortex. BioRxiv, 2020-10.
[4] Klindt, D., Ecker, A. S., Euler, T., & Bethge, M. (2017). Neural system identification for large populations separating “what” and “where”. Advances in neural information processing systems, 30.
[5] St-Yves, G., & Naselaris, T. (2018). The feature-weighted receptive field: an interpretable encoding model for complex feature spaces. NeuroImage, 180, 188-202.
[6] Yang, H., Gee, J., & Shi, J. (2024). Brain decodes deep nets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 23030-23040).
[7] Hossein Adeli, Minni Sun, and Nikolaus Kriegeskorte. Transformer brain encoders explain human high‑level visual responses, 2025.
Thank you for providing these additional evaluations which help support some of the previously unquantified claims in the paper.
I have increased my initial score.
Thank you!
We deeply appreciate the engagement you have given to the review process and thank you again for the great feedback and questions!
This manuscript demonstrates an in-silico approach for mapping visual categorical selectivity across the whole brain, moving beyond traditional methods that rely on controlled stimuli and linear encoding models. By leveraging an encoder-decoder transformer with a brain-region to image-feature cross-attention mechanism, the proposed architecture flexibly aligns high-dimensional deep network features with cortical responses. Coupled with diffusion-based image generative models and large image datasets, this method can synthesize and select images that maximize the activation of different cortical parcels, revealing regions with complex compositional selectivity involving diverse semantic concepts. The findings suggest a new approach for discovering visual selectivity, offering a more flexible and scalable alternative to traditional fMRI experiments for hypothesizing and testing visual concepts.
Strengths and Weaknesses
Strengths
This manuscript introduces an in-silico approach using an encoder-decoder transformer with a cross-attention mechanism, capable of aligning high-dimensional deep network features with cortical responses, thereby enabling the synthesis and selection of images that maximally activate different cortical parcels without controlled stimuli and linear encoding models.
Weaknesses
Although this research offers an interesting approach to identifying superstimuli, the lack of empirical validation of these superstimuli in new fMRI experiments inherently highlights a weakness.
Questions
Major concerns
- Although this research provides an interesting approach to finding superstimuli, the lack of empirical validation of these 'superstimuli' through new fMRI experiments is a clear limitation. However, I do not consider this an essential prerequisite for the paper's acceptance. To further bolster the model's persuasiveness without immediate empirical validation, more robust evidence is needed. I believe the paper's convincingness would be significantly enhanced by providing richer examples akin to Figure 4 (Verifying the selectivity of aTL-faces). Could similar examples be generated for other previously identified fMRI-sensitive areas such as the PPA (Place-selective), EBA (Body-selective), and VWFA (Word-selective)?
- Beyond merely predicting categorical selective areas, is it possible to analyze changes in characteristics along the visual hierarchy? For instance, if 'superstimuli' could be identified along the ventral visual stream from V1 to IT, and these demonstrated previously known hierarchical processing properties, it would offer another robust validation point for the model.
- The paper claims to provide interpretability through its cross-attention mechanism, specifically arguing that "the areas with highest attention weights for the aTL-faces parcel primarily overlap with faces, while the attention weights for the mfs-words parcel overlap with words (line 537)." However, Figure 8 appears to show attention weights that are merely high at the center of the image, rather than precisely aligning with the positions of faces and words. How can you more rigorously verify that the attention mechanism truly focuses on the selective object (face or word) rather than simply a central region?
- To more directly assess the object-specific attention effect of your cross-attention mechanism, consider presenting multi-cue images that contain both a face and a word within the same image. How would you then visualize and analyze the attention weights specifically for the aTL-faces and mfs-words parcels in this scenario? This could provide clearer evidence of selective attention.
Minor issues
- Figure 4 captions: last (b) actually means (e)?
- line 539: “previous work using a similar encoder model []” -> Empty reference
- line 890: “The paper doe snot” should be corrected to “does not”
Limitations
This manuscript proposes a model for mapping visual categorical selectivity; however, the lack of experimental validation for the suggested 'superstimuli' inherently limits the plausibility of the model.
Final Justification
I raised a few concerns regarding the proposed methods, but the authors' rebuttal resolved them. In addition, the authors' additional experiments are helpful to trust the usefulness and significance of the mapping results. Thus, I raised my score accordingly.
Formatting Concerns
No paper formatting concerns
We are thankful to Reviewer ywAe for the detailed and helpful review. We ran additional experiments to address the concerns and incorporate these additional results in our revision.
Q1) Reproducing the selectivity of other known areas
We are able to reproduce the known selectivity of all the major areas for all subjects. Optimal stimuli from the NSD test set, ImageNet, and BrainDIVE all appear similar to those reported in the literature. We will include in our revision examples like Figure 4 (verifying aTL-faces selectivity) for each aforementioned area to demonstrate place-, body-, and word-selectivity.
Q2) Reproducing characteristics of the visual hierarchy (V1 to IT)
We optimized 32 superstimuli (16 per hemisphere) that maximally activate parcels in V1, V2, and V4 using the BrainDIVE framework. We additionally include results on FFA as a representative IT-cortex ROI.
Generated superstimuli for early visual areas reproduce the classic coarse-to-fine hierarchy: V1 stimuli appear as cluttered scenes filled with dense, repetitive texture; V2 images add composite color patches and rudimentary objects; V4 stimuli reveal smoother, recognizable object forms; and FFA images almost exclusively depict close-up faces, often with clear emotional expressions. We will include these superstimuli in the revision.
To quantify hierarchical properties, we examined whether the spatial-frequency content of our stimuli mirrors classical physiological findings (e.g., [1,2]). We computed the radial average power spectrum for each image [3] and calculated the proportion of spectral power above 10%, 20%, and 30% of the maximum spatial frequency. At every threshold the high-frequency energy ratio decreases monotonically from V1 > V2 > V4 > FFA, indicating that the images that best drive higher-level areas (FFA) contain proportionally less fine-scale texture and relatively more coarse, low-frequency structure.
| Threshold | ROI | High-frequency energy ratio |
|---|---|---|
| 0.1 | V1 | 0.006547 |
| 0.1 | V2 | 0.004160 |
| 0.1 | V4 | 0.002500 |
| 0.1 | FFA | 0.001347 |
| 0.2 | V1 | 0.001838 |
| 0.2 | V2 | 0.000944 |
| 0.2 | V4 | 0.000699 |
| 0.2 | FFA | 0.000299 |
| 0.3 | V1 | 0.000762 |
| 0.3 | V2 | 0.000339 |
| 0.3 | V4 | 0.000285 |
| 0.3 | FFA | 0.000104 |
Note: the natural image datasets (ImageNet, NSD) did not show as clear of a V1 > V2 > V4 > FFA ordering because their spectra cluster around natural photo statistics, masking early-visual preferences.
Q3/4) Attention mechanism criticism and suggestions for improvement
We acknowledge that the attention weights in our models tend to have a strong center bias, and suspect a few possible causes. First, the images that maximally activate a given parcel tend to have the object front and center. This is to be expected as there exist more voxels that represent centrally fixated objects. Second, the Schaefer parcellation is based on functional connectivity and does not necessarily respect categorical boundaries which can weaken the attention signal. We do agree the experiment you are suggesting would target the attention maps more directly, and we will explore them in future work.
We would like to note that the attention maps are only included in the supplementary and not mentioned in the core text. We believe it is one promising avenue for interpretability using in silico mapping methods, but not a core contribution of our study. We will include in the revision an additional interpretability metric, demonstrating with representational similarity analysis that learned transformer ROI parcel queries reproduce the resting state functional connectivity between those parcels. See our rebuttal to Q1 from Reviewer jfLo for more details.
Q5) Typos in figure and text
Thank you for pointing these out. In the Figure 4 caption, the last (b) should actually be (e). The empty reference should be to [4]. We will fix these in the revision.
[1] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, January 1962. doi:10.1113/jphysiol.1962.sp006837.
[2] R. Desimone and S. J. Schein. Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. Journal of Neurophysiology, 57(3):835–868, March 1987. doi:10.1152/jn.1987.57.3.835.
[3] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, September 1996. doi:10.1016/0042-6989(96)00002-8.
[4] Hossein Adeli, Minni Sun, and Nikolaus Kriegeskorte. Transformer brain encoders explain human high-level visual responses, 2025.
I thank the authors for their detailed responses. In particular, the additional results reproducing the characteristics of the visual hierarchy (V1 to IT) are intriguing and would strengthen the manuscript. I have three minor questions regarding the discussion. Since this is the discussion period, I’m not requesting new experimental results—just the authors’ opinions on the following points:
Q1. Reproducing the selectivity of other known areas
Does the reproduced example of place-, body-, and word-selective areas also exhibit results similar to those shown in Figure 4 for the face-selective area? In other words, do the superstimuli for the body-selective area predominantly contain body images (and similarly for the other areas)? I cannot discern these details from the figure, so I would like to confirm.
Q2. Reliability of the superstimuli
Can we infer the reliability (or confidence) of the generated superstimuli for each area? For example, what happens if we generate superstimuli from the primary auditory cortex (A1) or the motor cortex—areas that are presumably unrelated to specific visual-category selectivity? The model will still produce something, but how can we distinguish whether those patterns genuinely reflect the area’s function? Could the model’s confidence scores help us identify when a superstimulus is not truly meaningful?
Q3. Dependence on the training dataset
This work relies heavily on the NSD dataset to train the model on brain dynamics. To what extent can the model predict category selectivity for categories not included in NSD? For instance, if the model were trained on a version of NSD with all face images removed, do the authors think it could still identify a face-selective region? If the model only predicts selectivity for categories present in NSD, then its capacity for discovering new category-selective areas is entirely determined by the stimulus space of the fMRI experiment.
Thank you for the follow up questions.
Q1. Reproducing the selectivity of other known areas
Yes, we confirm that the reproduced examples of place-, body-, and word-selective areas all exhibit objects that agree with the known selectivity. We are not able to share images here but we'll certainly add them to the revision.
Q2. Reliability of the superstimuli
Our measure of Spearman correlation provides this exact reliability test. The measure shows how well the assigned categorical label can predict the activity of different images for a given parcel in a held-out test set within the same subject or across subjects.
Please note that we only apply the parcel labeling algorithm to parcels where the signal-to-noise ratio in response to visual stimuli is high and where our encoder can predict the brain activity above a certain threshold of explained variance. Parcels in motor areas are usually not high in total explainable variance or in variance explained by a visual input. To address the reviewer's comment, we can test, among the selected parcels, whether those that have higher explained variance for our encoding model predictions also show higher Spearman correlation, examining whether the latter is a good test of reliability.
Table. Prediction reliability (different measures) vs. Spearman correlation on retrieved ordering.
| Measure | S1 | S2 | S5 | S7 |
|---|---|---|---|---|
| Pearson (encoding model performance, BrainDIVE retrieved parcel correlation) | 0.478 | 0.325 | 0.275 | 0.461 |
| Pearson (encoding model performance, ImageNet retrieved parcel correlation) | 0.450 | 0.320 | 0.463 | 0.433 |
| Pearson (parcel SNR, BrainDIVE retrieved parcel correlation) | 0.329 | 0.248 | 0.261 | 0.451 |
| Pearson (parcel SNR, ImageNet retrieved parcel correlation) | 0.329 | 0.241 | 0.434 | 0.440 |
We can see that the greater visual responsiveness of a parcel correlates strongly with the strength of the label that our pipeline produces. Parcels whose responses are not predicted well by our encoding model (e.g., in the motor areas) will therefore not have a high Spearman correlation for the assigned label.
Q3. Dependence on the training dataset
NSD is the largest scene-viewing brain imaging dataset to date, and scaling our models to a large number of natural scenes allows them to interpolate selectivity that generalizes to other categories and visual patterns. These scenes include many objects that are not explicitly segmented, but the model can capture their brain responses and allow for the discovery of parcels selective to them. One form of generalization that we observed is for parcels selective to higher-level concepts that combine other more basic ones (e.g., tool use).
We hope our response is helpful in addressing the reviewer comments!
I thank the authors for their detailed responses and clarification. My concerns have been addressed, and I support the value of this work. I will update the score accordingly.
We really appreciate your positive assessment of our work and thank you again for your detailed comments and feedback!
This paper proposes a pipeline for characterizing semantic selectivity across the entire cortex using fMRI responses to natural images. The method relies on a brain encoder that predicts voxel-wise responses from images, from which a “concept vector” is constructed per cortical parcel by averaging CLIP embeddings of top-activating images. Semantic selectivity is then evaluated via Spearman correlation between parcel activation predictions and CLIP-based similarity scores. The approach is used to compare different image sources (NSD-train, ImageNet, BrainDIVE), and is extended to assess cross-subject generalization of semantic selectivity.
Strengths and Weaknesses
Strengths
- Technically sound integration of encoding models and CLIP-based embeddings into a pipeline to assess semantic selectivity.
- Explores semantic selectivity beyond canonical visual areas, using the large-scale NSD dataset.
- Encoder-agnostic framework with potential applicability to other predictive models (although not demonstrated).
- Evaluation includes cross-subject generalization, which strengthens the methodological contribution.
Weaknesses
- At times I found it hard to follow the methodological setup, particularly regarding how the concept vectors were constructed in different conditions (e.g., for cross-subject experiments), whether actual or predicted responses were used for NSD-train, and what exactly the authors consider part of the "pipeline". These ambiguities made it difficult to interpret the core comparisons and contributions.
- Reported Spearman correlations are low, and no threshold is provided to interpret what constitutes meaningful semantic selectivity.
- It is not clear whether the differences observed across image sets are actually due to concept coverage or to sampling density around activation peaks.
- Several implementation details (e.g., whether measured or predicted responses were used, how concept vectors are constructed for the cross-subject analysis) were not clear to me.
- Conceptually, the work builds on existing ideas and may offer limited novelty beyond assembling and evaluating known components.
Questions
Generally, I think the work has potential and raises interesting ideas, but in its current form I have concerns about its conceptual clarity and novelty. If the authors can address the following questions, I’d be happy to reconsider and possibly increase my score.
1. Could the result in Table 2 simply reflect differences in input sampling density around activation peaks, rather than concept coverage? In Table 2, the only difference between conditions is the "concept vector" used per parcel, computed as the average CLIP embedding of the top-K (K=32) activating images. Since a good concept vector depends on sampling near peak activation, image sets like BrainDIVE (explicitly optimized for this) and ImageNet (large and dense) naturally yield better vectors. In contrast, the NSD-train set is relatively small and not optimized for activation maximization, making it less likely to produce representative concept vectors — even if the underlying concepts are present. Thus, the performance gap may reflect sampling effectiveness rather than a difference in the actual set of concepts represented. The current analysis seems to conflate these, interpreting lower values for NSD-train as indicating fewer covered concepts. To support this claim more rigorously, it would help to control for sampling density — for example, by varying the top-K threshold used to define the concept vector. If NSD-train still underperforms at lower K (where the top images should be closest to the activation peak), that would strengthen the argument.
2. Same concern applies to Table 3: Since the encoder-driven image sets yield higher spearman rank correlations the authors conclude that “brain-encoder pipeline can generate finer-grained categorical hypothesis” than images in the train set. Again, this is very likely a direct consequence of having a better estimate of a parcel’s concept vector, which itself depends on having enough images sampled around maximum activation for that parcel. Showing these results across different values of K (i.e., the number of top-activating images used to define the concept vector) would clarify whether the NSD-train set truly yields a worse selectivity hypothesis, or if the observed differences are simply due to fewer or less optimized samples contributing to the concept vector. This would help disentangle model differences from sampling effects.
3. What are some examples of follow up experiments for which the parcels with higher retrieval in Figure 5 are promising targets? The metric rewards parcels where the concept vector retrieves the ground-truth top-activating images well. While this likely reflects some degree of semantic selectivity, it's not clear why this makes these parcels "promising targets" for follow-up fMRI studies. What would an experimenter test for with these parcels and/or the corresponding images that they seem to be semantically selective for (from ImageNet or BrainDIVE)?
4. Did the NSD-train condition use actual measured responses to select the top-K images, or were the selections based on model-predicted responses to training images (like ImageNet and BrainDIVE)? This detail is important for interpreting the NSD-train condition as a proper baseline. If the encoder is also used for NSD-train, then all conditions reflect model-based estimates. If not, then the performance gap could partly reflect differences in response quality (e.g., measured vs. predicted) rather than conceptual coverage alone. Clarifying this would help in comparing conditions more fairly.
5. What is the main conceptual advancement beyond implementing an evaluation framework for comparing image sets? The pipeline offers an approach to quantify semantic selectivity. The main message, as far as I understood, is that, since the approach relies on highly exciting inputs to construct a parcel-specific “concept vector”, using a good set of inputs matters. And the findings largely align with what one would expect: images optimized to be maximally activating (BrainDIVE) or images picked from a densely sampled set (ImageNet) naturally yield better performance than a limited training set. In addition, the paper also assesses cross-stimuli and cross-subject generalization, but I find those results hard to interpret (explained in the next question). Could the authors clarify which aspects, beyond the application of existing tools, constitute a novel conceptual or methodological contribution? Is it the use of CLIP to create “concept vectors” for parcels whose semantic selectivity is not known?
6. Does the correlation magnitude matter? In both Tables 3 and 5, the reported Spearman correlations are rather low. Assuming the concept vectors are accurate and CLIP space reflects semantic similarity, strong semantic selectivity should yield high rank alignment (i.e., high ρ). However, the paper provides no principled threshold or theoretical grounding for what magnitude of ρ reflects meaningful selectivity. How high should the correlation be for a parcel to be considered semantically selective, and how low for it to be weakly/not selective? As a reference point, it would be informative to show the same analysis on the known visual parcels - do they exhibit (noticeably) higher correlations?
7. Why not directly compare concept vectors of the same parcel across subjects to demonstrate cross-subject consistency in semantic selectivity? While Tables 4 and 5 demonstrate that concept vectors derived from other subjects exhibit semantic selectivity in a target subject, this provides only indirect evidence for cross-subject generalization. It does not directly assess whether the semantic selectivity of corresponding parcels is similar across individuals. In Figure 6, the authors illustrate this point qualitatively using example images, but a more direct test would be to compute the pairwise cosine similarity between concept vectors of corresponding parcels across subjects. This would more explicitly support the assumption that the same parcel across subjects encodes similar semantic content. Have the authors performed such an analysis?
Other (minor) points:
- In section 4.5 how are the authors exactly constructing the “concept vector” using the encoder models trained on other subjects? Are you pooling top-32 maximally activating images across three models (i.e. those trained on other subjects)? Please provide some details on how the concept vector is constructed.
- Is the encoder part of the pipeline and is the encoder a contribution of the paper? My initial understanding was that the pipeline is the additional steps around the encoder, which aligns well with the statement in line 64: "...our pipeline is ultimately encoder-agnostic, and can use any encoder that is image-computable". But then the usage of the term “brain encoder pipeline” in line 224 seems to imply that the encoder is part of the pipeline. I would appreciate some clarification here, and I would also suggest improving the text to potentially avoid such confusions.
- Please refer to specific subplots/panels in Figure 3 both in the text (lines 141-144) and the caption. Also, please label the colorbars.
- There is a repeated sentence in lines 222-223.
Limitations
The authors already point out some relevant limitations such as the need for experimental validation of “superstimuli” and the fact that encoding model can be biased due to the training set. I think these are fair and useful to acknowledge. A few more points might be worth considering:
- The approach assumes that CLIP embedding space reflects something close to the brain’s semantic organization. That might hold in some regions (e.g., visual cortex), but it’s unclear how valid that assumption is more broadly.
- The method is heavily based on top-K activation which could be problematic. For example, differences in performance across datasets could just reflect sampling density near activation peaks, not actual differences in concept coverage.
- The Spearman correlation values used throughout (especially in Tables 3 and 5) are fairly low, and it's unclear what values should be considered meaningful. Some sort of benchmark (e.g., visual cortex parcels) would help give these numbers context.
- Tables 4 and 5 suggest cross-subject generalization, but they don't directly test if the same parcels across subjects are encoding similar concepts. Comparing concept vectors for the same parcel across individuals (e.g., using cosine similarity) would provide clearer evidence here.
Final Justification
The authors provided clarifications and additional analyses that addressed my main concerns, particularly regarding sampling effects and response variability. While some questions remain, I find the conceptual framing and empirical results good enough to warrant an increased rating.
Formatting Concerns
None.
We are very grateful to Reviewer uJzq for the thorough and constructive feedback. We report the results from additional experiments below to address specific comments, and have included these results (and the motivation) to the revision. We look forward to further discussion.
Q1) Table 2: Concern about sampling density around activation peaks as a confound
To confirm that the superior performance of ImageNet and BrainDIVE superstimuli stems from a broader diversity of concepts present—and not sampling near peak activation—we varied the top-K image threshold used to define the concept vector as suggested.
Lowering K from 32 to 1 yields results that closely mirror Table 2 in the paper, with BrainDIVE and ImageNet outperforming the NSD train concept vector by a similar margin. We observe the same pattern at intermediate values of K but only report a subset here for brevity. All tables will be added to the revision.
Table 2 (revised). Fraction of parcels where the superstimulus selection process ranks NSD test images better than chance, FDR corrected.
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train | 122/181 | 123/192 | 112/175 | 119/196 |
| ImageNet | 138/181 | 163/192 | 128/175 | 151/196 |
| BrainDive | 140/181 | 165/192 | 141/175 | 151/196 |
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train | 103/181 | 97/192 | 89/175 | 81/196 |
| ImageNet | 131/181 | 145/192 | 113/175 | 128/196 |
| BrainDive | 136/181 | 162/192 | 120/175 | 129/196 |
We would also like to note that traditional large-scale experiments are not—and cannot be—optimized for activation maximization. NSD is already the most massive fMRI image dataset to date, yet the cost of collecting data on an even larger image dataset like ImageNet is (currently) prohibitive. The greater density of concept coverage is still inherently an advantage of our in silico approach.
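For reference, here is a hedged sketch of the top-K concept-vector construction and the rank test varied above, assuming precomputed L2-normalized CLIP image embeddings and per-image parcel activations (variable names are hypothetical, not the authors' released code):

```python
import numpy as np
from scipy.stats import spearmanr

def concept_vector(clip_emb, act, k=32):
    """Average the CLIP embeddings of the top-K activating images for one parcel."""
    top_k = np.argsort(act)[-k:]              # indices of the K highest activations
    v = clip_emb[top_k].mean(axis=0)
    return v / np.linalg.norm(v)

def rank_score(concept_vec, clip_emb_test, act_test):
    """Spearman's rho between CLIP similarity to the concept vector and the
    ground-truth activation ranking on a held-out image set."""
    sim = clip_emb_test @ concept_vec
    rho, _ = spearmanr(sim, act_test)
    return rho
```

Varying `k` in `concept_vector` reproduces the top-K ablation reported in the tables above.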
Q2) Table 3: Same concern
We ran the same experiment reported in Table 3 for different values of the top-K threshold. Lowering the top-K threshold from 32 to 1, the results are again very similar, with ImageNet and BrainDIVE outperforming NSD train by a similar margin for all values of K. All tables will be included in the revision.
Table 3 (revised). Spearman's ρ (mean ± std) between the model-predicted and ground-truth activation rankings on the NSD test set, averaged across parcels, for K = 4 (first table) and K = 1 (second table).
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train | 0.130 ± 0.119 | 0.117 ± 0.096 | 0.115 ± 0.106 | 0.111 ± 0.091 |
| ImageNet | 0.165 ± 0.108 | 0.155 ± 0.082 | 0.140 ± 0.091 | 0.124 ± 0.073 |
| BrainDive | 0.165 ± 0.120 | 0.184 ± 0.098 | 0.151 ± 0.092 | 0.131 ± 0.080 |
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train | 0.104 ± 0.124 | 0.090 ± 0.095 | 0.092 ± 0.114 | 0.078 ± 0.089 |
| ImageNet | 0.148 ± 0.101 | 0.136 ± 0.084 | 0.125 ± 0.091 | 0.112 ± 0.075 |
| BrainDive | 0.148 ± 0.115 | 0.165 ± 0.093 | 0.140 ± 0.092 | 0.119 ± 0.082 |
Q3) Examples of follow-up experiments
Parcels in Fig. 5 are just one way to select promising targets for follow-up fMRI studies. We suggest that experiments on these parcels would easily allow us to verify our method, to ensure that our selectivity labels are not just an artifact of the dataset or model. For these parcels, a single semantic concept (i.e., a vector in CLIP space) seems to capture the selectivity well—similar to the visual area. A simple follow-up experiment would be to present these "superstimuli" to a subject and check whether higher activation is observed in these parcels.
In addition, these parcels are promising since 1) they are visually responsive, suggesting they play a role in visual processing and it is of interest to define what these regions do, 2) they demonstrate consistent selectivity that our model can predict well (Figure 5), 3) their selectivity is not well understood nor mapped in past experiments since they lie outside the visual area, 4) many selectivities (see A.5: specific sports, kitchen appliances, social gatherings) are not known to have specialized regions, so it would be of interest to experimentally confirm.
However, this is just one method of selecting "promising" parcels, and an fMRI experimenter could apply their own criteria to select parcels using our method. For example, an experimenter interested in tool use specifically could use our in silico method to discover parcels involved exclusively in tool use, then perform follow-up experiments to test parcel activation on the superstimuli that our pipeline generates.
Q4) Clarification on the NSD-train condition
The NSD-train condition uses actual measured responses to select superstimuli, while the ImageNet and BrainDIVE images use model-predicted responses. Therefore, the NSD-train condition reflects the traditional fMRI model, where an experimenter generates selectivity hypotheses based on images that experimentally maximally activate a parcel. Since all conditions are eventually evaluated on actual responses from the NSD test set, our test is, if anything, biased in favor of the NSD-train condition. That our ImageNet/BrainDIVE-based superstimuli selection still outperforms is strong evidence of the usefulness of our pipeline.
This is an important clarification that we will include in our revision.
Q5) Clarifying main conceptual advancement
- Massive scale in silico experimentation: "Superstimuli" from ImageNet and BrainDIVE outperform the NSD training set on predicting parcel activation on a held-out image set, demonstrating that our pipeline can successfully test parcel selectivity beyond those explicitly shown to the participant. To the best of our knowledge, no other study has tested image datasets on this scale nor demonstrated the capability to test concepts beyond the training set in silico.
- Mapping of the whole brain: Previous work has focused on areas in the visual cortex (mostly V1-V4, some limited higher-level areas). We expand semantic selectivity mapping to the whole brain and discover areas with novel, complex semantic selectivity beyond the classic visual areas.
- In silico verification framework: Our pipeline can verify selectivity hypotheses in silico by evaluating how well a label can predict ground-truth activation on a held-out set.
- New fMRI experimental paradigm: As both image datasets and encoding models improve, our pipeline offers a way to leverage these advances to accelerate and improve the accuracy of whole-brain mapping.
Q6) Interpreting rank correlation magnitude
Parcels in the visual area with known high-level selectivity (EBA, FFA, FBA) tend to exhibit very high correlations, while parcels with known lower-level selectivity (V1–V4) tend to exhibit low correlations. The average Spearman's ρ values from our concept vectors are in the range of areas like PPA and RSC, both widely studied and accepted in the literature.
Using CLIP as our metric allows us to capture high-level features that can explain semantic selectivity. Given that fMRI studies tend to rely on experimenter-curated topics, the CLIP space is a move toward a data-driven approach, but ultimately any metric can be substituted.
We provide baselines from known areas to contextualize our results. Since selectivity is a graded response and not binary, we don’t believe it is appropriate to have a minimum threshold that constitutes meaningful selectivity. Some parcels are naturally more visually responsive than others, and our paper lays out a method of systematically making and ranking semantic selectivity predictions. Notably, the CLIP metric allows us to rank parcels by how well we can make such predictions, which paves the way for in vivo fMRI studies of promising parcels.
Spearman's ρ, actual vs. predicted activation ordering in high-level areas, subject 1:
| | ImageNet | BrainDIVE |
|---|---|---|
| Unlabeled parcel mean (ours) | 0.168 ± 0.106 | 0.163 ± 0.121 |
| EBA | 0.432 | 0.501 |
| FFA-1 | 0.217 | 0.243 |
| FFA-2 | 0.373 | 0.412 |
| FBA-2 | 0.373 | 0.401 |
| lateral | 0.323 | 0.339 |
| PPA | 0.166 | 0.128 |
| RSC | 0.170 | 0.162 |
| aTL-faces | 0.120 | 0.138 |
| mTL-words | 0.164 | 0.199 |
Spearman's ρ, actual vs. predicted activation ordering in low-level areas, subject 1:
| | ImageNet | BrainDIVE |
|---|---|---|
| Unlabeled parcel mean (ours) | 0.168 ± 0.106 | 0.163 ± 0.121 |
| V1d | 0.010 | 0.027 |
| V1v | 0.024 | 0.036 |
| V2d | -0.032 | -0.035 |
| V2v | -0.030 | -0.002 |
| V3d | -0.004 | -0.018 |
| V3v | 0.007 | 0.009 |
| hV4 | 0.047 | 0.067 |
| early | -0.025 | -0.014 |
Note that since we’re using the Schaefer-1000 parcellation, for each visual area, we average the rank correlation value across the top 3 Schaefer parcels with greatest overlap with that visual area.
Q7) Cosine similarity analysis of concept vectors for cross-subject parcels
Great suggestion. We do find that concept vectors generated for the same parcel across subjects tend to be more similar to each other (0.784 ± 0.069 for BrainDIVE) than to the concept vectors of nearby parcels around the target parcel (0.710 ± 0.115 for BrainDIVE). Absolute cosine similarities are high because these parcels are selective for high-level visual concepts that share similar features, but our method can adjudicate between them using targeted stimulus generation. We will add this analysis to give further intuition on the concept space of the parcels.
Q8) Clarifying construction of concept vector in cross-subject analysis
Yes, the concept vector for cross-subject experiments is generated by taking the simple average of the CLIP vectors for the top-32 maximally activating images from each subject. For any subject, this concept vector is an average of 96 CLIP vectors (32 per subject * 3 subjects).
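A hedged sketch of this cross-subject pooling (hypothetical names, not the authors' code): for one parcel, pool the CLIP embeddings of the top-32 activating images from each of the three source subjects, average them, and compare parcels across subjects with cosine similarity as in Q7 above.

```python
import numpy as np

def cross_subject_concept_vector(per_subject, k=32):
    """per_subject: list of (clip_emb, act) pairs, one per source subject,
    with clip_emb of shape (n_images, d) and act of shape (n_images,)."""
    pooled = np.concatenate(
        [clip_emb[np.argsort(act)[-k:]] for clip_emb, act in per_subject], axis=0
    )                                          # 3 subjects x 32 = 96 CLIP vectors
    v = pooled.mean(axis=0)
    return v / np.linalg.norm(v)

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```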
Q9) Ambiguity of the term "pipeline"
The application of this encoder to the whole brain is a contribution of the paper (section 4.2), but you’re right that the core contribution is the steps around the encoder. Thank you for pointing out the ambiguity; we will clarify this in our revision.
Q10) Typos in figure and text
Thank you for pointing these out. We will fix these in our revision.
I thank the authors for the detailed response and the additional analyses (e.g., varying top-K, concept vector similarity), which address most of my concerns.
I have one remaining question regarding the NSD-train condition. You clarify that measured responses were used to define top-K images for NSD-train, while other conditions rely on model-predicted responses. While I understand the motivation (i.e., to reflect the traditional approach), this introduces a potential confound in response quality, which could degrade the resulting concept vectors by making the top-K selection less reliable.
Do you have results where model-predicted responses on NSD-train images are used to define the concept vectors? That would help separate the effects of stimulus coverage from response variability.
Lastly, I found the argument that the comparison is "biased in favor of NSD-train" a bit unclear. Noise in measured responses would typically degrade performance, not improve it. Could you clarify what you meant?
Our comment on the comparison being "in favor of NSD-train" reflected the fact that any encoding model relies on a specific feature space (i.e., DINOv2 in our case) to map visual inputs to brain activity and can only capture a certain level of variance in the response. Selecting images directly based on the measured brain responses can potentially mitigate this issue, and the selected images can capture variance in responses that may not have been captured by the encoding model.
However, we completely agree that the noise is lower in predicted activation and this can lead to improved performance. To address the reviewer's comment, we provide below results for the NSD train condition with predicted activation.
k=32. Spearman's ρ (mean ± std) between the model-predicted and ground-truth activation rankings on the NSD test set, averaged across parcels.
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train (ground truth) | 0.149 ± 0.105 | 0.138 ± 0.094 | 0.125 ± 0.104 | 0.125 ± 0.088 |
| NSD train (model predicted) | 0.187 ± 0.092 | 0.169 ± 0.085 | 0.144 ± 0.094 | 0.135 ± 0.075 |
| ImageNet | 0.168 ± 0.106 | 0.163 ± 0.082 | 0.142 ± 0.092 | 0.133 ± 0.075 |
| BrainDIVE | 0.163 ± 0.121 | 0.190 ± 0.099 | 0.154 ± 0.094 | 0.133 ± 0.083 |
k=4
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train (ground truth) | 0.130 ± 0.119 | 0.117 ± 0.096 | 0.115 ± 0.106 | 0.111 ± 0.091 |
| NSD train (model predicted) | 0.170 ± 0.094 | 0.161 ± 0.087 | 0.140 ± 0.098 | 0.127 ± 0.076 |
| ImageNet | 0.165 ± 0.108 | 0.155 ± 0.082 | 0.140 ± 0.091 | 0.124 ± 0.073 |
| BrainDIVE | 0.165 ± 0.120 | 0.184 ± 0.098 | 0.151 ± 0.092 | 0.131 ± 0.080 |
k=1
| Dataset | s1 | s2 | s5 | s7 |
|---|---|---|---|---|
| NSD train (ground truth) | 0.104 ± 0.124 | 0.090 ± 0.095 | 0.092 ± 0.114 | 0.078 ± 0.089 |
| NSD train (model predicted) | 0.145 ± 0.095 | 0.139 ± 0.088 | 0.133 ± 0.104 | 0.114 ± 0.081 |
| ImageNet | 0.148 ± 0.101 | 0.136 ± 0.084 | 0.125 ± 0.091 | 0.112 ± 0.075 |
| BrainDIVE | 0.148 ± 0.115 | 0.165 ± 0.093 | 0.140 ± 0.092 | 0.119 ± 0.082 |
Interestingly, this approach outperforms selection based on measured responses, confirming the reviewer's intuition about the effect of noise. We really appreciate you pointing this out. Given that this approach performs better than using the measured activations on NSD train, we will update the tables for the cross-subject tests as below (showing only one analysis here due to space limitations).
Table 5 (revised). Spearman's ρ (mean ± std) between the model-predicted (from all subjects other than the held-out subject) and the held-out subject's ground-truth activation rankings on the NSD training set, averaged across parcels.
k=32
| Method | S1 | S2 | S5 | S7 |
|---|---|---|---|---|
| Null | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 |
| Top 32 NSD train (model predicted) | 0.007 ± 0.064 | 0.036 ± 0.054 | 0.033 ± 0.059 | 0.053 ± 0.043 |
| Our encoder w/INet | 0.072 ± 0.071 | 0.105 ± 0.073 | 0.105 ± 0.078 | 0.093 ± 0.075 |
| Our encoder w/BD | 0.079 ± 0.077 | 0.103 ± 0.087 | 0.106 ± 0.091 | 0.089 ± 0.079 |
k=4
| Method | S1 | S2 | S5 | S7 |
|---|---|---|---|---|
| Null | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 |
| Top 32 NSD train (model predicted) | -0.006 ± 0.061 | 0.033 ± 0.060 | 0.027 ± 0.057 | 0.047 ± 0.048 |
| Our encoder w/INet | 0.071 ± 0.070 | 0.106 ± 0.075 | 0.104 ± 0.075 | 0.092 ± 0.076 |
| Our encoder w/BD | 0.083 ± 0.078 | 0.102 ± 0.086 | 0.110 ± 0.093 | 0.093 ± 0.079 |
k=1
| Method | S1 | S2 | S5 | S7 |
|---|---|---|---|---|
| Null | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 | 0.000 ± 0.011 |
| Top 32 NSD train (model predicted) | -0.004 ± 0.063 | 0.026 ± 0.077 | 0.032 ± 0.068 | 0.033 ± 0.074 |
| Our encoder w/INet | 0.068 ± 0.070 | 0.105 ± 0.070 | 0.104 ± 0.076 | 0.086 ± 0.076 |
| Our encoder w/BD | 0.084 ± 0.080 | 0.099 ± 0.090 | 0.109 ± 0.087 | 0.090 ± 0.082 |
Thank you for your very thoughtful comments! Please let us know if you have any additional questions.
Thanks for the detailed follow-up and the additional analyses. The comparison with the model-predicted NSD-train condition helps clarify the impact of response variability. I find it interesting that while model predictions yield noticeable improvements for within-subject results (revised Table 3), they actually perform worse than the ground-truth NSD-train in the cross-subject generalization setting (revised Table 5). That seems a bit counterintuitive - do you have any thoughts on why that might be?
Overall, and also considering your earlier responses to my concerns, I believe the strengths of the work outweigh its limitations. I am inclined to increase my initial rating, and I thank the authors for their engagement during the discussion period.
Great question; we need to dig deeper into this to have a complete answer. Our impression is that the performance from using measured versus model-predicted responses is not significantly different for the cross-subject tests (one or the other wins in different subjects, also depending on the choice of K). We believe the reason is that this test is much more stringent than the within-subject test and therefore amplifies the gains from expanding the concept space with alternative methods.
We want to thank you again for the detailed and constructive feedback. And we greatly appreciate the engagement you've put into the review process!
Thank you!
This paper introduces a novel and powerful in silico paradigm for discovering and mapping the visual selectivity of cortical regions across the entire brain. The central motivation is to overcome the limitations of traditional fMRI studies, such as high cost, small stimulus sets, and experimenter bias. The authors develop a transformer-based encoder-decoder model that learns to predict fMRI activity from natural images, leveraging a cross-attention mechanism to align deep network features with responses in specific brain parcels. By training this model on the large-scale Natural Scenes Dataset (NSD), they create a "digital twin" of the brain, enabling large-scale virtual experiments. This allows them to search massive image datasets (ImageNet) or use generative models (BrainDIVE) to identify or create "superstimuli" that are predicted to maximally activate specific parcels. The work successfully replicates known selectivity for faces, discovers novel, complex selectivity for concepts like "skateboarding" and "tool use" in previously unlabeled regions, and validates that these computationally-derived hypotheses better explain held-out brain activity than traditional baselines.
Strengths and Weaknesses
### Strengths
Innovative Computational Paradigm: The paper's core contribution is a powerful data-driven framework for discovery that moves beyond traditional hypothesis-testing. The use of in silico experimentation to generate and test hypotheses on a massive scale is a significant methodological advance for the field.
Sophisticated and Effective Architecture: The use of a transformer-based encoder with a cross-attention mechanism is a state-of-the-art approach to brain encoding. The systematic comparison of different backbones and the finding that DINOv2 provides a strong performance baseline adds to the paper's technical rigor.
Rigorous Multi-Level Validation: The authors effectively validate their method by first confirming its ability to identify known category-selective regions (aTL-faces, Figure 4), and then by demonstrating quantitatively that their novel hypotheses for unlabeled parcels significantly predict held-out neural data better than a baseline model (Tables 2 & 3).
Discovery of Complex and Generalizable Selectivity: The pipeline successfully identifies parcels with intriguing, complex selectivity (e.g., "child eating," Figure 1c) and, importantly, shows that some of these preferences are consistent for the same anatomical parcel across multiple subjects (e.g., the "tool parcel," Figure 6), greatly strengthening the biological plausibility of the findings.
### Weaknesses and Possible Improvements
Lack of Experimental Validation for Generated Stimuli: A primary limitation, acknowledged by the authors, is that the "superstimuli" generated by BrainDIVE have not yet been tested in new fMRI experiments. While the encoder-based reranking provides confidence, showing these novel images to subjects to confirm they elicit stronger-than-normal responses would be the ultimate validation of the generative pipeline.
Potential for Dataset Bias: The model is trained exclusively on the NSD dataset, which may introduce biases. The authors astutely note a potential "zebra-selective" parcel as a possible artifact of the NSD stimulus set (Appendix A.12). Future work could strengthen the generalizability of the findings by training or fine-tuning the model on multiple, diverse fMRI datasets.
Minor Clarity Issues in Figures: Some figures could be slightly improved for readability. For instance, the caption for Figure 3 describes two distinct subplots but does not use (a) and (b) labels to explicitly link the text to the corresponding images, which could cause momentary confusion.
Questions
Questions/Clarifications for the Authors
- The parcel-based approach is grounded in functional localization, yet the brain also relies heavily on distributed representations. While the paper notes that selectivity is "graded", could you further elaborate on how you view these highly selective parcels as nodes within wider, distributed networks? Does your framework offer a way to map the connections between these specialized nodes?
- The results depend on the Schaefer-1000 functional parcellation. How sensitive do you believe the findings are to this specific parcellation scheme? Have you considered how using a different atlas, or perhaps subject-specific functional parcels, might alter the discovered patterns of selectivity?
- The generated BrainDIVE images (e.g., Figure 9d, 10d) are a fascinating result of the pipeline. Based on your model, what is the predicted quantitative increase in activation these "superstimuli" would elicit compared to the best-performing images from ImageNet or the NSD test set?
- While the parcel selection method (Section 3.3) is very well-reasoned, the general low predictive accuracy in many non-visual areas (Figure 3) remains a challenge. Do you believe this is primarily a limitation of using encoders trained on static 2D images to model regions that may process more dynamic or abstract information, or is it more related to inherent signal-to-noise issues in fMRI for these regions?
Limitations
yes
Formatting Issues
None
We sincerely thank reviewer jfLo for recognizing the strengths of our work and for the thoughtful review. We especially appreciate the positive evaluation of our manuscript. We include answers to specific questions below.
Q1) Mapping the parcels within a distributed network
Taking advantage of our transformer-based approach, we map the connections between parcels by examining the representational similarity of learned ROI queries, each of which corresponds to a single parcel. High cosine similarity between ROI queries (queries averaged across ensemble models) suggests that two parcels are highly connected, as both attend to similar content in an image. Indeed, this similarity matrix for the visual areas closely replicates the functional connectivity matrix from the Schaefer parcellation (derived from resting-state correlated responses) [1]; the Pearson correlation between the two matrices is . We will include these results and the corresponding figures in the revision.
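A minimal sketch of this comparison follows, assuming the averaged parcel queries and a parcel-level functional connectivity matrix are already available; the names and the off-diagonal correlation are illustrative choices, not the exact analysis code.

```python
# Illustrative sketch: cosine-similarity matrix of learned parcel (ROI) queries,
# compared against a functional connectivity (FC) matrix over the same parcels
# via the Pearson correlation of their off-diagonal entries.
import numpy as np

def query_similarity_vs_fc(queries, fc):
    """queries: (n_parcels, d) ROI queries averaged across ensemble models.
    fc: (n_parcels, n_parcels) functional connectivity matrix."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sim = q @ q.T                                      # pairwise cosine similarity of queries
    iu = np.triu_indices_from(sim, k=1)                # off-diagonal (upper-triangle) entries
    return np.corrcoef(sim[iu], fc[iu])[0, 1]          # Pearson r between the two matrices
```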
Furthermore, generated superstimuli for early visual areas reproduce the classic coarse-to-fine hierarchy: V1 stimuli appear as cluttered scenes filled with dense, repetitive texture; V2 images add composite color patches and rudimentary objects; V4 stimuli reveal smoother, recognizable object forms; and FFA images almost exclusively depict close-up faces, often with clear emotional expressions. We will include these superstimuli in the revision.
To quantify hierarchical properties, we examined whether the spatial-frequency content of our stimuli mirrors classical physiological findings (e.g., [2,3]). We computed the radial average power spectrum for each image [4] and calculated the proportion of spectral power above 10%, 20%, and 30% of the maximum spatial frequency. At every threshold the high-frequency energy ratio decreases monotonically from V1 > V2 > V4 > FFA, indicating that the images that best drive higher-level areas (FFA) contain proportionally less fine-scale texture and relatively more coarse, low-frequency structure.
| Threshold | ROI | High-frequency energy ratio |
|---|---|---|
| 0.1 | V1 | 0.006547 |
| 0.1 | V2 | 0.004160 |
| 0.1 | V4 | 0.002500 |
| 0.1 | FFA | 0.001347 |
| 0.2 | V1 | 0.001838 |
| 0.2 | V2 | 0.000944 |
| 0.2 | V4 | 0.000699 |
| 0.2 | FFA | 0.000299 |
| 0.3 | V1 | 0.000762 |
| 0.3 | V2 | 0.000339 |
| 0.3 | V4 | 0.000285 |
| 0.3 | FFA | 0.000104 |
Note: Natural-image datasets (ImageNet, NSD) do not show a clear V1 > V2 > V4 > FFA ordering because their spectra cluster around natural photo statistics, masking early-visual preferences.
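For reference, one plausible implementation of this spectral measure is sketched below: it assumes grayscale input images and defines the high-frequency energy ratio as the fraction of the radially averaged power lying above a given fraction of the maximum radial frequency (the exact preprocessing and bin weighting may differ from what we ran).

```python
# Illustrative sketch of the radial power-spectrum analysis: compute the radially
# averaged power spectrum of an image and the fraction of power above a threshold
# fraction of the maximum radial frequency.
import numpy as np

def high_frequency_energy_ratio(img, threshold=0.1):
    """img: 2D grayscale image array. threshold: fraction of the max radial frequency."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = power.shape
    yy, xx = np.indices(power.shape)
    r = np.hypot(yy - h / 2, xx - w / 2)               # radial frequency of each Fourier bin
    bins = r.astype(int)
    counts = np.bincount(bins.ravel())
    sums = np.bincount(bins.ravel(), weights=power.ravel())
    spectrum = sums / np.maximum(counts, 1)            # radially averaged power spectrum
    freqs = np.arange(len(spectrum)) / r.max()         # normalized radial frequency in [0, 1]
    return spectrum[freqs > threshold].sum() / spectrum.sum()
```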
Q2) Alternative parcellations beyond Schaefer-1000
We reported results from the Schaefer-1000 parcellation since it is commonly used in the literature, but we also ran analyses on a functional parcellation generated by grouping voxels with correlated responses (applying k-means on the training set, which yields a subject-specific parcellation). For each subject, the whole brain was separated into roughly 1000 parcels, each containing 200–400 voxels based on correlated responses. We observed very similar results when applying the same analyses. Parcels that overlap significantly with known visual areas demonstrate the selectivity of that area, reproducing the sanity check in 4.3, including face-, place-, word-, and body-selective areas. Many visually-responsive parcels outside the visual area also demonstrated consistent semantic selectivity of complex concepts similar to those observed in the Schaefer-1000 parcellation (social interactions, specific sports, family, etc.). Perhaps surprisingly, there were some concepts that parcels were selective for that appeared in one parcellation but not the other—though we only observed this qualitatively. These results ultimately were not included in the final manuscript for brevity and because Schaefer-1000 is a more widely-used parcellation approach.
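A minimal sketch of how such a subject-specific functional parcellation can be obtained is shown below, assuming a voxels-by-images training response matrix; the number of clusters and the z-scoring step are illustrative choices, not necessarily the exact procedure we used.

```python
# Illustrative sketch: group voxels with correlated training-set responses into
# roughly 1000 subject-specific parcels using k-means. Z-scoring each voxel's
# response profile makes Euclidean k-means approximate correlation-based grouping.
import numpy as np
from sklearn.cluster import KMeans

def functional_parcellation(responses, n_parcels=1000, seed=0):
    """responses: (n_voxels, n_train_images) response estimates for one subject."""
    z = (responses - responses.mean(axis=1, keepdims=True)) \
        / (responses.std(axis=1, keepdims=True) + 1e-8)
    km = KMeans(n_clusters=n_parcels, n_init=10, random_state=seed)
    return km.fit_predict(z)                           # parcel label per voxel
```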
Q3) Quantitative increase in activation by superstimuli
On average, our model predicts that the BrainDIVE-optimized superstimuli activate the parcel more strongly than top NSD test images, but not more than the best-performing ImageNet images. But for roughly a third of parcels, BrainDIVE-optimized superstimuli do outperform ImageNet, suggesting that some parcels may be selective for a concept that is not easily captured by the space of natural images (ImageNet).
Here, we report the average activation magnitude predicted by the model for NSD-test images, top ImageNet images, and top BrainDIVE images.
Model-predicted activation magnitude for top-K superstimuli from all datasets, averaged across parcels, subject 1.
| Dataset | Top 32, Visual Areas | Top 16, Visual Areas | Top 4, Visual Areas | Top 32, Non-Visual Areas | Top 16, Non-Visual Areas | Top 4, Non-Visual Areas |
|---|---|---|---|---|---|---|
| NSD test | 0.476 ± 0.210 | 0.524 ± 0.223 | 0.597 ± 0.244 | 0.403 ± 0.188 | 0.475 ± 0.204 | 0.580 ± 0.237 |
| ImageNet | 0.780 ± 0.283 | 0.792 ± 0.287 | 0.813 ± 0.293 | 0.760 ± 0.265 | 0.776 ± 0.268 | 0.803 ± 0.273 |
| BrainDIVE | 0.737 ± 0.294 | 0.760 ± 0.299 | 0.794 ± 0.307 | 0.672 ± 0.291 | 0.711 ± 0.293 | 0.767 ± 0.299 |
| # parcels where BrainDIVE > ImageNet | 51 / 162 | 59 / 162 | 68 / 162 | 28 / 181 | 34 / 181 | 50 / 181 |
Q4) Low prediction accuracy in non-visual areas
Since many of the parcels considered here are deeper in the processing hierarchy, they are selective for more abstract concepts and are driven not only by a single sensory input but also by multi-modal inputs and other cognitive processes (such as attention and memory). As a result, the brain responses do not always agree across different presentations of the same stimulus, leading to lower signal-to-noise ratios (SNR). For this reason, we only selected parcels with an SNR above a certain threshold for further experimentation, to make sure their visual responses are strong and consistent and to ensure generalizability of the observed categorical selectivities to future experiments.
Q5) Typo in figure 3
Thank you, we have added (a) and (b) labels for Figure 3 in the revision of our work.
[1] Ru Kong, Yan Rui Tan, Naren Wulan, Leon Qi Rong Ooi, Seyedeh-Rezvan Farahibozorg, Samuel Harrison, Janine D. Bijsterbosch, Boris C. Bernhardt, Simon Eickhoff, and B. T. Thomas Yeo. Comparison between gradients and parcellations for functional connectivity prediction of behavior. NeuroImage, 273:120044, June 2023. doi:10.1016/j.neuroimage.2023.120044.
[2] David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, January 1962. doi:10.1113/jphysiol.1962.sp006837.
[3] R. Desimone and S. J. Schein. Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. Journal of Neurophysiology, 57(3):835–868, March 1987. doi:10.1152/jn.1987.57.3.835.
[4] A. van der Schaaf and J. H. van Hateren. Modelling the power spectra of natural images: Statistics and information. Vision Research, 36(17):2759–2770, September 1996. doi:10.1016/0042-6989(96)00002-8.
In this work, the authors propose a data-driven in silico approach based on fMRI activation-predictive encoding models that allows images predicted to yield high activations in the target area to be selected from a pool of images or generated by a diffusion-based decoder employing the encoder model's gradient. Using this approach, the authors highlight that the method can indeed select out mostly face-containing images for aTL (an area known for face selectivity) and further show that images selected by their approach tend to form a semantically meaningful cluster (as characterized by CLIP embeddings), such that images closer to this embedding in CLIP space tend to elicit strong responses in the encoding model.
Strengths and Weaknesses
Quality
- It appears to me that the encoding model-based stimulus selection as put forward by the authors in this work is conceptually very similar, if not identical, to what has been proposed and worked on over the last couple of years across electrophysiological and optophysiological data in NHP and mice. However, citations of this relevant literature appear to be grossly lacking.
- The work offers very little novel technical contribution. While it would have been exciting to see whether the new image category selected for each parcel to be maximally activating would have actually led to higher activation, the work does not include experimental verification (the authors note this limitation). As the technique is not particularly novel (aside from possibly being applied to an fMRI dataset rather than to electrophysiology or optophysiology recordings), it would have been seminal to see the predictions tested in real experiments.
- In the evaluation of the labels, the authors compare the degree to which the model-derived label predicts (ranks) parcel activation rankings better than chance as a measure of labeling consistency. Although the reasoning is intuitive, it is not clear whether semantic proximity is always the most appropriate measure of the similarity between a test image and the labeled most-activating images. One would imagine that an appropriate measure of image similarity depends precisely on the exact visual features the parcel is selective to, and as such, how well semantic proximity (in their case, proximity in the CLIP embedding) can assess parcel-relevant feature similarity depends entirely on the a priori unknown alignment between the CLIP embedding and the corresponding image embedding based on parcel activities. Furthermore, although encoder-based selections do appear to achieve higher fraction scores, the difference from the baseline is rather small. Without experimental verification, it is unclear whether the baseline versus encoder-based selection would have yielded an experimentally observable and meaningful difference in activations.
Clarity
- The term NSD was introduced before the first mention of the Natural Scenes Dataset appearing in L121, and even then the explicit correspondence of the acronym was never made. Please be sure to introduce the full term at the first use of the acronym appearing on L112.
Significance & Originality
- As the authors state, the use of large and naturalistic stimulus sets (such as naturalistic images) is critical in adequately studying the functions of stimulus-responsive neural populations. However, the large number of trials required to present such datasets proves to be empirically prohibitive. In line with the recent trend, the authors identify AI models trained on an extensive collection of neural datasets as “digital twins,” which enables extensive in silico experiments. In this work, the authors introduce a pipeline framework through which a trained encoding model can be used for mapping visual categorical selectivity, which in turn can be used to generate hypotheses at the category/concept level.
- While the proposed approach of using a high-accuracy encoder model of the brain (in this work's case, predicting vertex activations from image input) to select (from held-out natural images) or generate (via a generative model whose output is guided by the gradient of the predictive model) images that would most strongly activate the target brain areas is a powerful technique, this framework is by no means original, as there have been numerous works taking exactly this approach in the last ~6 years, including Bashivan, Kar & DiCarlo (Science 2019), Walker et al. (Nat Neuro 2019), and Pierzchlewicz et al. (NeurIPS 2023), just to name a few.
- Consequently, I'm not clear what particular novel contribution is made by the work. I fail to see the idea of using an encoding model to predict highly activating stimuli as conceptually novel (as this has been proposed with rigorous validation in the work cited above), and the use of a generative model guided by an encoding model to generate highly activating stimuli has also been addressed in the work of Pierzchlewicz et al. (NeurIPS 2023), with even experimental verification. Although it is possible that these techniques have not been used extensively in fMRI settings, I would not find the difference in target neural recording modality alone to be sufficient grounds for considering this work novel.
Questions
- The authors seem to be using the encoding model almost exclusively for generating the initial set of labels. In assessing the “category” defined by these images, the authors rely on the similarity of the concept vector for each image in CLIP space. However, as they have direct access to the encoding model, couldn’t they use the encoding model to yield similarity from the model? It would also be interesting to see if a similarity metric defined on the encoding model activations could better assess the rankings.
- Instead of just generating the most activating images, could one use the encoding model to find the most distinctive set of images (e.g., images that only selectively activate the target parcel while eliciting little to no activations in other parcels)? Such a contrastive approach in semantic assignment across multiple parcels at once may help further differentiate the selectivity of each parcel.
Limitations
Authors identify key limitations
Final Justification
Authors have addressed my main concern over the connection to the prior work and clarification of the main contributions of this work.
Formatting Issues
None noted
We thank Reviewer yocx for the constructive review and for bringing our attention to missing citations for relevant works. We regret not citing them in the original submission. These papers have been seminal in laying the groundwork for in silico brain mapping and the foundation of our methodology, and we will include citations to these papers in our revision. However, we believe that our work significantly expands upon past work and clarify the novelty and contribution of our approach below.
Clarification of a detail in the review
Reviewer yocx wrote "images selected by their approach tends to yield semantically meaningful cluster (as characterized by CLIP embedding) such that images closer to this embedding in the CLIP space tends to elicit strong response in the encoding model."
We just want to clarify that images closer to this embedding in CLIP space tend to elicit strong responses in the ground-truth held-out set, not only in the encoding model, which allows us to validate the robustness of our predictions in silico without encoder bias.
Q1) Technical contributions
We share your optimism about the promise of AI models for in silico experimentation to study neural populations. However, past approaches face several key limitations that our work addresses, as pointed out by other reviewers.
- Reviewer jfLo highlights the "massive scale" upon which our pipeline generates and tests hypotheses as a "significant methodological advance for the field."
- Reviewer uJqz cites cross-subject generalization in evaluation as a key strength of the method.
- Reviewer kyMo raises the capability to identify selectivity outside canonical visual areas as a key "proof-of-concept" result.
- Reviewer ywAe writes that the use of encoder-decoder transformers to synthesize optimal stimuli from massive datasets "suggests a new approach for discovering visual selectivity, offering a more flexible and scalable alternative to traditional fMRI experiments."
Below, we address the novelty of our manuscript in comparison to the specific papers mentioned.
Q2) Significance compared to past work in NHP/mice
We thank the reviewer for pointing out our blind spot regarding similar works done in mice/NHP and regret not citing these studies. We believe those works laid the foundation for much of the more recent work in human imaging, and we have updated the revision to include the relevant citations and past methods. Those works [1, 2, 3] have clearly demonstrated the usefulness of stimulus optimization and how it can be paired with encoders to study neural populations. The generated "superstimuli" have also been experimentally validated, which is a motivation for the in silico experimentation we explore in this work.
We do, however, want to emphasize that extending those works to human neuroimaging comes with many new challenges that remain unaddressed. Namely, building a scalable state-of-the-art encoding model for the whole brain, using the encoder to explore massive image sets and generative models for maximally activating stimuli, and creating robust methods to quantify selectivities and validate them empirically in silico have all not yet been explored. Notably, building a linear encoding model for the whole brain with the same-size feature maps would require 242B parameters.† Our transformer-based model learns parcel queries that reduce the feature dimension to 768 while still outperforming the full linear model. An experiment directly comparing the two is not feasible to conduct, but prior work comparing performance on visual areas supports this claim [4].
These innovations move the field from single-site discovery ([1] and [3] look at primate and macaque V4; [2] looks at monkey V1) to systematic whole-brain mapping and to statistical validation without additional scanning time, which, to our knowledge, has not been demonstrated previously.
As a result of this undertaking, we have discovered visual selectivities for high-level concepts that are unique to humans, such as sports and complex tool use, which could not have been studied in non-human animal models. We believe addressing all these challenges, and the discovery of new selectivities worthy of future experiments, are all novel aspects of our work in comparison to the papers highlighted by the reviewer.
†31 × 31 (patch tokens) × 768 (feature dimension) × 327,684 (voxels) ≈ 242B parameters
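The footnote's arithmetic can be reproduced in a few lines. The query-based figure below is our own rough estimate under stated assumptions (one 768-d query per parcel plus a 768-to-voxel linear head), not a number reported in the paper.

```python
# Back-of-the-envelope parameter counts. The full-linear figure reproduces the
# footnote above; the query-based figure is a rough illustration only.
n_patches, dim, n_voxels, n_parcels = 31 * 31, 768, 327_684, 1000

full_linear = n_patches * dim * n_voxels               # per-voxel linear readout of all patch features
print(f"full linear readout: ~{full_linear / 1e9:.0f}B parameters")      # ~242B

# Assumed query-based readout: one learned 768-d query per parcel, plus a linear
# map from the 768-d parcel readout to that parcel's voxels (~n_voxels * dim weights).
query_based = n_parcels * dim + n_voxels * dim
print(f"query-based readout: ~{query_based / 1e6:.0f}M parameters")      # ~252M (rough)
```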
Q3) Similarity metric defined on encoding model
We agree that we would be able to use encoder predictions to create a similarity measure, and that would be a good test of encoder performance. However, the main goal of our methodology is to create labels for the selectivity of unlabeled parcels. Since CLIP provides an aligned visual-text embedding space, creating our "concept vector" hypothesis in that space gives us a semantic description of the selectivity. This also introduces a bottleneck that can lead to findings that generalize better to future in vivo experiments.
Q4) Contrastive approach for generating superstimuli
This is an interesting suggestion for further teasing apart the selectivity of a parcel. We performed some further analyses on parcels in the labeled area, and found that even without any selection, stimuli which activate one area generally do not activate other (non-related) areas. For example, the top 32 ImageNet images that maximally activate aTL-faces already perform poorly in activating other parcels such as PPA (top 82.3%), OPA (top 81.5%), and RSC (top 62.3%).
In some cases, this approach may actually make it more difficult to correctly label the selectivity of an area. The images that maximally activate aTL-faces unsurprisingly also activate FFA (top 23.6%) and OFA (top 9.4%). Naïvely applying the suggested approach, we can generate images that activate aTL-faces but not FFA, which end up being non-face images—not the expected selectivity for aTL-faces. Of course, for labeled parcels, we could manually avoid suppressing the activation in other related areas when choosing optimal stimuli, but this fails for areas outside the visual cortex since they are unlabeled.
Generally, we believe that this approach may be useful for experimenters interested in very fine-grained analyses of specific parcels, but it is challenging to apply to whole-brain mapping. We will include these results and details about this suggested approach in the revision.
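For experimenters who do want to try such a contrastive variant on specific parcels, one way the selection could be scored is sketched below; the penalty weight and the use of a max over competing parcels are our own illustrative choices, not necessarily the exact rule used in the analyses above.

```python
# Illustrative sketch of a contrastive image selection: rank images by the target
# parcel's predicted activation penalized by the strongest predicted activation
# among competing parcels. The penalty weight alpha is an arbitrary choice here.
import numpy as np

def contrastive_top_images(pred, target, competitors, top_k=32, alpha=1.0):
    """pred: (n_parcels, n_images) model-predicted activations.
    target: index of the target parcel. competitors: list of competing parcel indices."""
    score = pred[target] - alpha * pred[competitors].max(axis=0)
    return np.argsort(score)[::-1][:top_k]             # indices of the top-k contrastive images
```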
Q5) NSD acronym typo
Thank you, we revised our paper to use the full name upon first mention.
[1] Pouya Bashivan, Kohitij Kar, and James J. DiCarlo. Neural population control via deep image synthesis. Science, 364(6439):eaav9436, May 2019.
[2] Edgar Y. Walker, R. James Cotton, Wei Ji Ma, and Andreas S. Tolias. A neural basis of probabilistic computation in visual cortex. Nature Neuroscience, 23(1):122–129, January 2020.
[3] Paweł A. Pierzchlewicz, Konstantin F. Willeke, Arne F. Nix, Pavithra Elumalai, Kelli Restivo, Tori Shinn, Cate Nealley, Gabrielle Rodriguez, Saumil Patel, Katrin Franke, Andreas S. Tolias, and Fabian H. Sinz. Energy Guided Diffusion for Generating Neurally Exciting Images. Advances in Neural Information Processing Systems, 36, 2023.
[4] Hossein Adeli, Minni Sun, and Nikolaus Kriegeskorte. Transformer brain encoders explain human high‑level visual responses, 2025.
I thank the authors for the detailed responses. The clarification of the novelty of the network infrastructure, particularly the points pertaining to solving the challenges of a brain-wide dataset, was helpful. While I do appreciate the effort in improving on existing methodologies to achieve prediction and image generation at a larger scale than previously done, I find that this is not a sufficient basis for the authors to claim that the entire framework of model-based stimulus optimization and validation is completely novel. In light of past and directly related work, I would like to learn how the authors plan to reframe the scope and claims of the paper to better reflect the specific contribution of this work.
Furthermore, while the use of semantic vector for possible characterization of previously unknown area is obviously of high interest, I would imagine that the utility of the model would strongly depend on whether the identified characterization (in this case, in the form of CLIP space vector) would prove to yield consistently higher responses in a verification experiment. As this is in discussion period, I am certainly not demanding for the authors to conduct and add experimental results, but I do find that without such validation, it would be difficult to assess the validity/utility of the method.
Thank you for the follow-up comments. Our paper builds on the extensive prior work in both fMRI and animal studies, and we will clarify in our revision that we extend and strengthen those frameworks through:
- Massive scale: applying in silico mapping to millions of images (ImageNet, BrainDIVE) with a transformer-based brain encoder, enabling discovery of parcel selectivity for concepts never shown in training. To the best of our knowledge, no other study has been done at this scale nor demonstrated the capability to test concepts beyond the training set in silico.
- Mapping of the whole brain: expanding beyond visual cortex and revealing human-specific semantic selectivity.
- In silico verification: our pipeline verifies selectivity hypotheses in silico with a rigorous test that evaluates how well a label can predict ground-truth activation on a held-out set within and across subjects.
- New fMRI experimental paradigm: as both datasets and encoding models improve, our pipeline offers a way to leverage these advances to accelerate and improve the accuracy of whole-brain mapping.
More specifically on framing, we will update the following in the revision:
- Abstract: emphasize scalability, whole-brain application, and rigorous statistical validation.
- Related-work section: more explicitly position our method as an extension of prior work (both in fMRI and animal studies), detailing differences in scale, cross-subject evaluation, and statistical tests.
- Limitations and future work: state clearly that prospective fMRI experiments remain essential to confirm our highest-ranked concept vectors.
We hope our response is helpful in addressing the reviewer concerns. We are very grateful for the engagement you have given the review process!
I thank the authors for the detailed response and, with the clarifications of the scope and contribution that they will incorporate, I am happy to increase my score to the positive side.
We deeply appreciate your positive evaluation of our work and thank you again for your detailed comments and feedback!
We hope our response has been helpful! We would be more than happy to discuss further.
Thank you!
Best,
Authors
This paper proposes a method for mapping the selectivity to visual stimuli of cortical regions across the whole human brain. The paper is based on a digital twin trained to predict fMRI responses to visual stimuli from the Natural Scenes Dataset. The digital twin allows for large in-silico experiments, searching for images that are expected to strongly activate an area in a large database, or to use generative models to create synthetic superstimuli. The method can recover the selectivity of known areas as well as discover selectivity for novel concepts.
The reviewers appreciated the paper as a significant technical advance (especially after clarification about its relationship with previous literature on stimulus synthesis for rodents and NHPs and on fMRI response prediction using DNNs trained on NSD) and were satisfied with the degree of validation. The author-reviewer discussion generated a substantial amount of new analyses and results. Of particular interest was the replication of the selectivity of additional areas that are not categorically and semantically selective, such as the low-level areas in the visual hierarchy.
Overall, the paper will be of interest to the NeurIPS community and I recommend acceptance.