PaperHub
6.8/10 · Spotlight · 4 reviewers (min 6, max 8, std 0.8)
Ratings: 8, 6, 6, 7
Confidence: 4.0
Soundness: 3.3 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

Brain-JEPA: Brain Dynamics Foundation Model with Gradient Positioning and Spatiotemporal Masking

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Brain-JEPA is a state-of-the-art brain dynamics foundation model that enhances brain activity analysis; it achieves superior performance on diverse downstream tasks with broad applicability.

Abstract

Keywords
foundation model · fMRI

Reviews and Discussion

Review
Rating: 8

This paper introduces Brain-JEPA, a self-supervised learning approach that leverages a joint-embedding predictive architecture to learn representations of brain fMRI images. The authors introduce two novel components on top of the JEPA architecture to adapt it to brain images: 1) Brain Gradient Positioning, which encodes the functionality of each ROI into the patch positional encoding, and 2) an fMRI-specific masking strategy.
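To illustrate the general idea behind the masking component, here is a generic block-masking sketch over an ROI × time token grid; the block sizes, counts, and sampling here are made up for illustration and are not the paper's exact scheme:

```python
import numpy as np

n_rois, n_patches = 450, 10                 # spatial x temporal token grid
mask = np.zeros((n_rois, n_patches), dtype=bool)

rng = np.random.default_rng(0)
for _ in range(4):                          # sample a few contiguous target blocks
    r0 = rng.integers(0, n_rois - 50)
    t0 = rng.integers(0, n_patches - 3)
    mask[r0:r0 + 50, t0:t0 + 3] = True      # ROI x time block marked as target

context_tokens = np.argwhere(~mask)         # visible tokens fed to the encoder
target_tokens = np.argwhere(mask)           # tokens whose latents must be predicted
```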

The authors pretrain a family of ViT encoders using Brain-JEPA on the UK Biobank dataset and then explore downstream task performance using both fine-tuning and linear probing protocols to evaluate the quality of their learned representations. They look into trait prediction of a held-out set from the same UK Biobank dataset and three other datasets: HCP-Aging, ADNI, and another resting state fMRI data source.

The methods achieve strong empirical results, outperforming previous approaches such as BrainLM by a significant margin across evaluation protocols (fine-tuning, linear evaluation).

Strengths

  • The proposed approach demonstrates strong empirical results compared to previous work in the field.

  • The authors provide a clear ablation study that highlights the significance of the proposed contributions, specifically the brain-gradient position embedding and masking strategy, for applying JEPA to fMRI data.

  • The methods exhibit good scaling properties, indicating their potential for broader applicability.

Weaknesses

  • It is unclear from the empirical evaluation how the pretraining data affects downstream performance. Are the baselines, such as BrainLM, using the same pretraining dataset and computational budget?

  • Similarly, it would be informative to explore the scaling properties of BrainJEPA with respect to dataset size. Does performance improve as the size of the pretraining dataset increases?

  • What is the impact of some of the contributions (brain gradient positioning, masking strategy) compared to other design choices, such as prediction in latent space? Would the former contributions also benefit a BrainLM baseline?

Questions

See weaknesses.

Limitations

The authors adequately addressed the limitations.

Author Response

Pretraining scheme of baselines (dataset and computational budget).

Thank you for your inquiry regarding the pretraining process. We confirm that for self-supervised pretraining baselines like BrainLM, we used the same pretraining dataset and computational budget, specifically the UKB dataset on 4 A100 GPUs (40GB each).

Scaling of dataset size.

Thank you for your suggestion on exploring the impact of dataset size scaling.

We have compared the performance of Brain-JEPA trained with varying portions of the UKB pretraining dataset: 25%, 50%, 75%, and 100%. As shown in the table below, the performance improves as the dataset size increases, highlighting the scalability of Brain-JEPA in relation to the pretraining dataset size.

| Pretraining data | HCP-Aging Age (ρ) | HCP-Aging Sex ACC (%) | ADNI NC/MCI ACC (%) |
|---|---|---|---|
| 25% | 0.659 (.043) | 68.03 (1.21) | 67.89 (9.18) |
| 50% | 0.768 (.012) | 74.24 (1.36) | 71.05 (3.86) |
| 75% | 0.813 (.015) | 77.42 (2.00) | 74.74 (4.88) |
| 100% | 0.844 (.030) | 81.52 (1.03) | 76.84 (1.05) |

Combine contributions with BrainLM.

Thank you for your insightful suggestion.

We further compared the performance of BrainLM combined with our contributions to vanilla BrainLM. As shown in the global author rebuttal Table 4, BrainLM combined with our contributions consistently outperforms vanilla BrainLM, demonstrating that our contributions (gradient positioning and spatiotemporal masking) can benefit the training of BrainLM as well.

Comment

The rebuttal effectively addressed my primary concerns about the size of the pretraining dataset and the evaluation of each contribution in the BrainLM framework. As a result, I have updated my score to 8, as I believe this paper will be a valuable addition for both the SSL and neuroscience communities.

Comment

Thank you very much for your positive feedback and for taking the time to review our rebuttal. We are delighted to hear that our responses have addressed your concerns regarding the size of the pretraining dataset and the evaluation of contributions within the BrainLM framework.

We appreciate your support and are glad that you find our work valuable for both the SSL and neuroscience communities.

Review
Rating: 6

In this paper, the authors train a foundation model on fMRI data. To this end, they combine multiple deep learning techniques in a novel way:

  • they rely on a Joint-Embedding Predictive Architecture, and devise a specific masking strategy for brain data (referred to as spatio-temporal masking)
  • they make use of pre-trained embeddings containing functional (as opposed to anatomical only) information about the fMRI data as positional embeddings for the transformers in their JEPA model

Overall, this study shows that the foundation model obtained yields state-of-the-art performance in a variety of downstream tasks (some of which test the obtained embedding without further tuning through linear probing).

Strengths

In my opinion, this paper tackles an important problem: large collections of neuroimaging data have been collected in the past decade, but inter-individual variability makes it hard to derive meaningful models of the brain. Many recent endeavours have sought to show how deep learning can help alleviate this issue and provide meaningful embeddings of brain data that can be used in downstream tasks. I believe the current work would be of interest to a growing number of computational neuroscience researchers.

Moreover, I find the writing to be clear and rather easy to follow.

Weaknesses

I find the paper interesting and well written. I think it brings valuable information to the community. Methodological contributions could be deemed poor compared to other submissions, but I think the kind of benchmarks featured in this paper are challenging to implement and represent a great amount of work; in this regard, I would encourage the authors to make their code public so that other research teams can potentially reproduce this benchmark. However, some important pieces of information are missing at this stage, notably concerning how data was processed for downstream tasks.

Questions

At this stage, it is not clear to me from line 235 how the downstream datasets were divided into training, validation and test sets. For instance, can data from a given participant appear in different categories? In my understanding, the division is the same across all models tested (Brain-JEPA, BrainLM, etc), is that indeed the case? I believe it is crucial to evaluate how models like Brain-JEPA generalise to unseen participants.

The authors write that the temporal resolution of the dataset used for the pre-training phase is 0.735s (line 216). Classical repetition times in fMRI are usually higher. Do the datasets used in downstream tasks have the same TR as the one used for pre-training? If not, how did the authors adapt to this change? In particular, since p, the number of brain volumes concatenated to form patches, is set to 16 (Table 4), patches are approximately 10 seconds long during the pre-training phase but would be about 30 seconds long with a classical TR of 2 seconds in downstream tasks, which seems pretty long. This brings the question: how was the value of p chosen? Does it seem possible to the authors that different downstream tasks would have different optimal values for p?

I am curious about the size of the brain gradients used to derive the positional embeddings. The authors indicate in Table 4 that m, the dimension of the brain gradients, is 30, which seems pretty high to me. I would expect that only the first few dimensions are actually useful to the model (maybe the first 3, as illustrated in Figure 2, are enough to distinguish the most important cortical networks). Moreover, these vectors are actually mapped to a higher-dimensional space of size d/2 (where d depends on the size of the ViT used). I am curious as to what the final functional-anatomical embedding obtained here looks like. Can the authors try to give more intuitions about it?

Additional suggestions:

In my humble opinion, this paper mostly targets the neuroscientific community and should therefore help members of this community who are not deep-learning experts to dive into the current work. In particular, I think the authors could add short paragraphs (maybe in the supplementary materials if they cannot fit in the main text) to explain concepts behind JEPA and linear probing in simple words.

The left and middle panes of Figure 7 could be merged together so that bars for the Caucasian and Asian participants for each network would be side by side.

It would be nice to indicate how much time the pre-training and fine-tuning phases required (lines 556-560).

In my opinion, the positional embedding part of Figure 1 (top right) is rather unclear. I think the figure would benefit from highlighting clearly what the inputs of g_ϕ are (embeddings, positional embeddings)

The authors write: "In the field of brain activity analysis, brain language model (brainLM) is the first and only foundation model to date" (line 32). I would personally be less assertive about this. In my understanding, many existing works have trained models on large fMRI datasets to extract meaningful representations (before these would be called "foundation models"). In the list below, the last two examples are rather close to what is being done in the current paper (self-supervised settings), and the last example even uses masking strategies close to that of BrainLM:

  • Mensch, Arthur, Julien Mairal, Bertrand Thirion, and Gaël Varoquaux. ‘Extracting Representations of Cognition across Neuroimaging Studies Improves Brain Decoding’. PLoS Computational Biology 17, no. 5 (2021): e1008795.
  • Thomas, Armin, Christopher Ré, and Russell Poldrack. ‘Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data’. Advances in Neural Information Processing Systems 35 (6 December 2022): 21255–69.
  • Chen, Zijiao, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. ‘Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding’. arXiv, 14 November 2022. https://doi.org/10.48550/arXiv.2211.06956.

Limitations

N/A

Author Response

Full open-source code, preprocessing.

We appreciate your important suggestions. We have supplemented the following materials/information:

Code and Data Source: We have now supplemented the codebase to include all downstream tasks on public datasets mentioned in the paper. The complete codebase, along with the list of subject IDs for all datasets used, is available through an anonymous link. In accordance with the author guidelines, we have provided this link to the Area Chair in a separate comment.

Data Preprocessing: We utilized open-source preprocessing pipelines as outlined in [1] and [2]. These pipelines are fully open-sourced and have been referenced in our paper. For data normalization, we implemented robust scaling as practiced in BrainLM.
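For reference, a minimal sketch of robust scaling for an ROI-by-time matrix; this reflects our reading of the per-ROI median/IQR normalization practiced in BrainLM, and implementation details may differ:

```python
import numpy as np

def robust_scale(ts: np.ndarray) -> np.ndarray:
    """Robustly scale an (n_rois, n_timepoints) fMRI matrix per ROI.

    Subtracts the median and divides by the interquartile range,
    which is less sensitive to spike artifacts than z-scoring.
    """
    median = np.median(ts, axis=1, keepdims=True)
    q75, q25 = np.percentile(ts, [75, 25], axis=1, keepdims=True)
    iqr = np.maximum(q75 - q25, 1e-8)  # guard against constant ROIs
    return (ts - median) / iqr
```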

By sharing the complete codebase, the list of subject IDs, and using open-source preprocessing pipelines, we believe that our results are highly reproducible.

Data splitting of downstream datasets.

Thank you for your inquiry regarding the splitting of our downstream datasets. We confirm that the data is divided at the participant level, ensuring that each participant appears only once in either the training, validation, or test set. It ensures that all participants in the test set are unseen during training, thereby demonstrating the excellent generalizability of Brain-JEPA. Additionally, the same division is applied consistently across all models tested.
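For illustration, such a participant-level split can be enforced with scikit-learn's GroupShuffleSplit; this is a sketch with synthetic arrays, not the actual pipeline code:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: one row per fMRI scan, possibly several scans per subject.
X = np.random.randn(100, 450)                     # features per scan
y = np.random.randint(0, 2, size=100)             # labels (e.g., NC vs. MCI)
subject_ids = np.random.randint(0, 40, size=100)  # grouping variable

# Grouping by subject guarantees no participant appears in both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
```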

Different TR for different datasets.

Thank you for the insightful question regarding the temporal resolution of datasets used. For fMRI in UKB, we used multi-band data with a high temporal resolution (TR is ~0.7s). In our downstream datasets, HCP-Aging also uses multi-band data with a TR of ~0.7s, while both ADNI and the Asian cohort use single-band data with a TR of ~2s.

To address the differences in TR, we downsampled the multi-band data with a temporal stride of 3, aligning the TR to ~2s. This ensures consistency across different datasets.
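Concretely, a temporal stride of 3 is a simple subsampling of frames (illustrative sketch; the array shape is hypothetical):

```python
import numpy as np

ts = np.random.randn(450, 490)   # (ROIs, frames) at TR ~= 0.7s
ts_downsampled = ts[:, ::3]      # keep every 3rd frame -> effective TR ~= 2.1s
```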

For future work, we plan to use learnable TR embeddings to enable the model to adapt to different TRs dynamically.

Dimension of brain gradient embedding.

Thank you for your insightful feedback on the dimensionality of the brain gradient. We compared the model performance between 3-dimensional (3-dim) and 30-dim brain gradient positioning, shown in the global author rebuttal Table 5. The 30-dim model consistently outperformed the 3-dim model by a large margin. This indicates that higher-dimensional brain gradients may encapsulate finer-grained information on brain network organization, which benefits the learning of brain dynamics.

We leave the investigation of the mapped gradient embedding to future work. Specifically, we will conduct comprehensive experiments to explore how these gradients are organized in the latent space and their relationships with brain network organization.
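To give some intuition in the meantime, here is a toy sketch of one plausible construction: each ROI's 30-dimensional gradient coordinates are linearly projected to half the embedding width and concatenated with a temporal positional code. The linear projection and placeholder temporal encoding are illustrative assumptions, not necessarily the paper's exact mapping:

```python
import torch
import torch.nn as nn

d_model, m, n_rois, n_tpatches = 768, 30, 450, 10   # ViT-Base width assumed
gradients = torch.randn(n_rois, m)                  # per-ROI gradient coordinates
proj = nn.Linear(m, d_model // 2)                   # learnable map to d/2

spatial_pos = proj(gradients)                       # (450, 384) functional position
temporal_pos = torch.randn(n_tpatches, d_model // 2)  # placeholder temporal code

# One positional vector per (ROI, time-patch) token: concat the two halves.
pos = torch.cat([
    spatial_pos.unsqueeze(1).expand(-1, n_tpatches, -1),
    temporal_pos.unsqueeze(0).expand(n_rois, -1, -1),
], dim=-1)                                          # (450, 10, 768)
```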

Explain to neuroscience community.

We appreciate your valuable suggestion. We will include short explanatory paragraphs on concepts like JEPA and linear probing in the supplementary materials to make the paper more accessible to the neuroscientific community.
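As a preview of the kind of explanation intended: linear probing keeps the pretrained encoder frozen and fits only a linear head on its features. A minimal PyTorch sketch (the encoder here is a stand-in, not the actual pretrained ViT):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(450, 768), nn.GELU())  # stand-in pretrained model
for p in encoder.parameters():
    p.requires_grad = False                 # linear probing: encoder stays frozen

head = nn.Linear(768, 2)                    # the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x, y = torch.randn(16, 450), torch.randint(0, 2, (16,))
with torch.no_grad():
    feats = encoder(x)                      # features, no gradients through encoder
loss = nn.functional.cross_entropy(head(feats), y)
loss.backward()
optimizer.step()
```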

Fig. 7.

Thank you for your insightful suggestion regarding Figure 7. However, the absolute attention values are not meaningful in isolation, so it may not be appropriate to directly compare attention values across different cohorts. Our focus is on whether the rankings among different networks are aligned between the two cohorts (Caucasian versus Asian). From the current figure, we can clearly see that the top 4 networks are the same across the two cohorts. Please let us know if you have any further concerns regarding this.
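To make the ranking comparison concrete, attention can be averaged within each network and the networks then ranked per cohort. A toy sketch with made-up attention values (the network count and grouping are illustrative):

```python
import numpy as np

attn = np.random.rand(450, 450)           # hypothetical ROI-to-ROI attention map
labels = np.random.randint(0, 7, 450)     # network assignment per ROI (7 networks)

# Mean attention received by each network's ROIs, then rank networks.
network_attention = np.array([attn[:, labels == k].mean() for k in range(7)])
ranking = np.argsort(network_attention)[::-1]   # compare rankings across cohorts
```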

Training time.

Thank you for your inquiry regarding the training time.

For pre-training (UKB) with the ViT-base model on 4 A100 GPUs (40GB each), using a batch size of 16x4x8, the training time per batch is ~0.3s. The total training time is ~16.5 hours.

For fine-tuning, taking HCP-Aging as an example with the ViT-base model on 1 A100 GPU (40GB) and a batch size of 16, the total training time is ~14 mins.

Fig. 1.

Thank you for the valuable feedback on Figure 1. The positional embedding part is intended to be a 2D schematic view of brain gradient positioning along with temporal positioning. We will revise the figure to explicitly highlight the inputs to the predictor g_ϕ.

Related work.

Thank you for highlighting other related works. While it is true that the mentioned studies have trained models on large fMRI datasets, their downstream applications are limited. [3] and [4] focus on mental state classification, and [5] on brain decoding. A brain foundation model should exhibit broad applicability across diverse brain-related tasks, such as demographic prediction, trait prediction, and disease diagnosis/prognosis. We have added additional comparisons with CSM [4], together with the most recent works suggested by other reviewers. As shown in the global author rebuttal Tables 1 and 2, Brain-JEPA outperforms them on most tasks. We highlight that Brain-JEPA demonstrates the most diverse range of downstream applications, showcasing its powerful generalizability across different cohorts and tasks. We will include the mentioned works in our revised version to better illustrate these distinctions.

References:

[1] Spatial topography of individual-specific cortical networks predicts human cognition, personality, and emotion. Cerebral Cortex 2019.

[2] Global signal regression strengthens association between resting-state functional connectivity and behavior. Neuroimage 2019.

[3] Extracting Representations of Cognition across Neuroimaging Studies Improves Brain Decoding. PLoS Computational Biology 2021.

[4] Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data. NeurIPS 2022.

[5] Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding. CVPR 2023.

Review
Rating: 6

The study introduces Brain-JEPA, an fMRI foundation model that leverages joint-embedding predictive architecture and a novel position-embedding approach based on brain gradient. This model achieves state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction, and excels in off-the-shelf evaluations like linear probing.

Strengths

  1. The idea of using fMRI gradient information to guide the position encoding is very interesting and is proven to be effective.
  2. Overall the paper is nicely organized and written.
  3. The model shows great generalizability.

Weaknesses

  1. I think the effectiveness of JEPA, i.e., predicting the representations of target blocks rather than reconstructing the masked input like MAE, is not well supported by the ablation study:
  • BrainLM uses the AAL-424 atlas instead of the Schaefer functional atlas. Since these two studies use different atlases, it is not a very fair comparison.
  • I think using different atlases is totally fine if there is a very significant performance difference between the current model with anatomical position embedding (incorporating the same position settings as BrainLM) and BrainLM. However, from the ablation study, the performances with anatomical position embedding on three downstream tasks are 0.716, 78.79%, and 74.74%, respectively, while for BrainLM, they are 0.832, 74.39%, and 75.79%, with two of them higher than Brain-JEPA (with anatomical position embedding). With these results, I feel the main contribution would be the novel position embedding based on the fMRI gradient rather than the JEPA framework. Additionally, the author didn’t report other ablation results on downstream tasks like Neuroticism, Flanker, Amyloid, and NC>MCI (Asian), so there is no further evidence that JEPA consistently performs well on other downstream tasks. Nor did the author explore Brain-JEPA without the JEPA framework, i.e., predicting the original signals rather than representations.
  2. Since the model performs prediction at the latent embedding level rather than at the original data level, the current model is unable to reconstruct the original missing time series. However, reconstructing unseen missing brain time series is also an important and valuable application of a brain foundation model.

Questions

  1. In lines 71 to 73, I wonder if the author made it italic on purpose or if it's a formatting mistake.
  2. In line 154, what's the shape/size of the functional connectivity c_i of ROI i?
  3. Why does the author choose to sample 160 frames from the original data with a temporal stride of 3, instead of using the original data frame?
  4. I think BrainLM is not the only self-supervised study on fMRI; there are other works like "Thomas et al., NeurIPS 2022" and "SwiFT" by Kim et al., NeurIPS 2023, etc., that could be discussed in the introduction or even compared with the current model.

Limitations

The author has adequately discussed the limitations.

Author Response

Ablation study on framework.

Thank you for your feedback on our ablation study.

To thoroughly compare the performance of JEPA with anatomical locations (AL) against BrainLM (MAE), we have extended our comparison beyond the three tasks in the current version to cover all tasks, as well as two newly added datasets, OASIS-3 and CamCAN. The results, shown in the table below, demonstrate that JEPA w AL outperforms BrainLM in seven of eleven tasks, supporting the superiority of prediction in latent space. For the tasks where BrainLM performs better, it is likely that JEPA requires gradient positioning for precise ROI placement to achieve optimal performance. In future work, we will further investigate the possible interactions between the self-supervised learning framework and brain gradient positioning.

| | UKB Age (ρ ↑) | UKB Sex ACC (%) ↑ | HCP-Aging Neuroticism (ρ ↑) | HCP-Aging Flanker (ρ ↑) | ADNI Amy+/- ACC (%) ↑ | Asian NC/MCI ACC (%) ↑ | OASIS-3 AD Conversion ACC (%) ↑ | CamCAN Depression ACC (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| BrainLM | 0.632 (0.020) | 86.47 (0.74) | 0.231 (.012) | 0.318 (.048) | 67.00 (7.48) | 61.65 (3.35) | 65.00 (7.75) | 70.00 (6.17) |
| Brain-JEPA w AL | 0.686 (0.013) | 84.11 (0.50) | 0.267 (.003) | 0.374 (.022) | 65.00 (6.32) | 64.33 (1.80) | 67.00 (4.00) | 71.82 (6.03) |

We further compared Brain-JEPA without the JEPA architecture (i.e., BrainLM w contributions) to Brain-JEPA. As shown in the global author rebuttal Table 4, Brain-JEPA (JEPA framework) consistently outperforms BrainLM (MAE framework) w contributions, demonstrating the superiority of the JEPA framework.

Reconstruction of time series.

Thank you for pointing out the underlying difference between the MAE and JEPA frameworks. We would like to emphasize that the primary goal of reconstructing masked fMRI time series is to facilitate self-supervised training rather than to serve a specific clinical application. Our primary focus is on learning representations that directly enhance performance across diverse downstream tasks. Given strong latent representations, powerful decoders such as diffusion models could be used to reconstruct input signals for various applications. We leave this for future work.
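Schematically, the JEPA objective regresses predicted latents against latents from a separate target encoder, instead of reconstructing raw signals as MAE does. A heavily simplified sketch; the EMA momentum, pooling, and loss choice are illustrative assumptions, not the paper's exact settings:

```python
import copy
import torch
import torch.nn as nn

context_encoder = nn.Linear(16, 64)           # stand-ins for ViT encoder/predictor
predictor = nn.Linear(64, 64)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad = False                   # updated by EMA, not by gradients

patches = torch.randn(32, 16)                 # 32 tokens of a masked fMRI sample
context, target = patches[:20], patches[20:]  # visible vs. masked token sets

pred = predictor(context_encoder(context).mean(0, keepdim=True))
with torch.no_grad():
    tgt = target_encoder(target).mean(0, keepdim=True)

loss = nn.functional.smooth_l1_loss(pred, tgt)  # regress latents, not raw signals
loss.backward()

# EMA update of the target encoder (0.996 is a typical momentum value).
with torch.no_grad():
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.mul_(0.996).add_(pc, alpha=0.004)
```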

Italic lines 71-73.

Thank you for your comment. Regarding lines 71 to 73, we intentionally italicized this part to emphasize the importance of these questions. The italicization was meant to highlight the significance of developing a functional coordinate system and a masking strategy for large-scale pretraining with fMRI data, which are crucial interdisciplinary challenges at the intersection of AI and neuroscience.

Line 154, the value of c.

Thank you for your attention to detail. The functional connectivity is represented by an adjacency matrix. Given N ROIs, the adjacency matrix is an N×N symmetric matrix. Each row/column i of this matrix represents the connectivity of ROI i with the other regions. In our paper, the brain is parcellated into N = 450 regions; therefore, the dimension of the connectivity vector for each ROI is 450.
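In code, this connectivity profile is simply one row of the ROI-by-ROI correlation matrix (illustrative sketch with random data):

```python
import numpy as np

ts = np.random.randn(450, 160)   # (N ROIs, timepoints)
fc = np.corrcoef(ts)             # (450, 450) symmetric adjacency matrix
c_i = fc[0]                      # 450-dim connectivity profile of ROI 0
```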

Downsampling.

We thank the reviewer for the insightful question on temporal downsampling. We performed temporal downsampling for some of our datasets because of the variable temporal resolutions among different datasets. For the fMRI data in the UKB (our pretraining dataset), we used multi-band data with a high temporal resolution, where the TR is ~0.7s. In our downstream datasets, HCP-Aging also uses multi-band data with a TR of ~0.7s, while both ADNI and the Asian cohort use single-band data with a TR of ~2s.

To address the differences in temporal resolution, we downsampled the multi-band data with a temporal stride of 3, effectively aligning the TR to approximately 2s. This ensures consistency across datasets.

For future work, we plan to incorporate learnable TR embeddings, enabling the model to handle different temporal resolutions in a learning-driven manner. This approach will enhance the model's flexibility across varied datasets.

Other self-supervised studies on fMRI.

Thank you for your suggestion on discussing other self-supervised studies on fMRI.

We have incorporated CSM [1], SwiFT [2], and BrainMass [3] (a concurrent work suggested by another reviewer) into our comparisons on the HCP-Aging and ADNI datasets, as well as two newly added datasets, OASIS-3 and CamCAN. As shown in the global author rebuttal Tables 1-3, our proposed Brain-JEPA consistently outperforms these models on most tasks.

CSM is domain-specific, trained exclusively for mental state decoding. SwiFT specializes in demographic prediction, and BrainMass is limited to disease diagnosis and prognosis. Brain-JEPA, however, has demonstrated versatility across a broader range of downstream applications, including demographic prediction, trait prediction, cognitive score prediction, and disease prognosis/diagnosis, showcasing its extensive potential.

In our revised version, we will include comparisons with additional models to further demonstrate the outstanding performance of Brain-JEPA. We will also incorporate a discussion of the above-mentioned works into the related work section to better position our contributions.

References:

[1] Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data. NeurIPS 2022.

[2] SwiFT: Swin 4D fMRI Transformer. NeurIPS 2023.

[3] BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning. TMI 2024.

Comment

Thank you for the detailed explanations. The additional experiments have addressed most of my concerns, and I have increased my score accordingly. However, regarding Concern 1, I actually still think that BrainLM and Brain-JEPA with AL perform comparably, given the large standard deviations observed in some tasks and the absence of statistical testing here. Therefore, the effectiveness of JEPA remains uncertain from my perspective.

Comment

Thank you for recognizing that we have addressed most of your concerns in our rebuttal, and we appreciate your decision to increase the score accordingly.

We apologize for not mentioning the statistical testing for the table when addressing your Concern 1. Regarding the comparison between the JEPA and MAE frameworks, particularly with the use of anatomical locations as positional embeddings for ROIs (instead of our contribution of gradient positioning), we would like to add that our t-test results show a significant advantage (p < 0.05) of JEPA over MAE in seven of eleven tasks.
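For concreteness, such a test can be run per task over repeated runs; a sketch with made-up scores, not the actual values:

```python
from scipy.stats import ttest_ind

# Hypothetical per-run accuracies for two frameworks on one task.
jepa_scores = [0.69, 0.68, 0.70, 0.67, 0.69]
mae_scores = [0.63, 0.64, 0.62, 0.65, 0.63]

t, p = ttest_ind(jepa_scores, mae_scores, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```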

We acknowledge that, based on our current experiments, the JEPA architecture performs best when combined with gradient positioning, while in some cases, MAE may outperform when using anatomical positioning. This suggests that the collaborative effect of JEPA and gradient positioning could be the key to achieving optimal performance. In future work, we plan to explore the potential interactions between self-supervised learning frameworks and brain gradient positioning further. We believe that this could lead to even greater enhancements and broader impact of our approach.

Review
Rating: 7

This paper introduces a foundation model for fMRI time series, using a classic vision transformer backbone. It incorporates two main original developments: (1) a positional encoding for brain regions, using a "functional gradient" analysis (aka PCA on a Jacobian matrix derived from a temporal correlation matrix), (2) a masking strategy that explicitly masks different types of spatial and/or temporal interactions. The model is trained on a large, popular public resource (UK Biobank), and evaluated on downstream supervised learning in the same resource (age and sex prediction). Then the model is used for transfer learning in three independent datasets (including both North American and Asian participants) for various supervised downstream tasks, predicting either demographic data (age, sex), clinical diagnosis (mild cognitive impairment), or AD biomarker status (amyloid beta deposition). The performance of the proposed model, Brain-JEPA, is contrasted with other models from the literature with different kinds of architectures, and in particular another "fMRI foundation" model called BrainLM. Brain-JEPA outperforms other models on most tasks, and also retains good performance when a simple linear layer is used for transfer, unlike BrainLM. These results are very promising, as they show self-supervised pretraining can successfully be applied to fMRI time series, and may lead to more robust brain biomarkers for brain disorders.

Strengths

The work is very clear and well constructed. It adapts a now classic vision transformer architecture by proposing compelling solutions for the two main domain-specific ingredients: positional encoding and masking. It includes a fair number of baseline models, and tasks for evaluation.

The code and the checkpoints of the models are shared. I have not done an in depth review of these, but the code is clear and is meant to replicate one of the tasks.

The comparison of transfer with fine-tuning vs linear probing as well as the improvement in downstream accuracy across training epochs all improve confidence in the results and the generalizability of the learned representations by brain-JEPA. The ablation study is also great.

Weaknesses

Although the code is shared, it does not appear to cover all of the downstream tasks in the paper though. Importantly, I could not find detailed information on data sources. ADNI for example is a massive dataset, and depending on the details of data release and other criteria, the sample size and exact list of subjects for “NC vs MCI” may vary dramatically. I am fairly confident I would not be able to reproduce all the results in this paper based on the provided information, and it’s not clear to me which parts of the paper would be easy to reproduce. There are also no details in fMRI data preprocessing, and the normalization applied across subjects is not standard. Were all datasets preprocessed with similar tools? EDIT: the authors shared anonymized code and a list of subject IDs for their analysis. The level of reproducibility is thus adequate.

Surprisingly, the paper lacks a good SVM/SVR or linear regression baseline. Given the regime of limited data size in fMRI, these vanilla models are really tough to beat. Based on reported accuracy, SVM would likely beat Brain-JEPA on sex prediction on UK Biobank. https://www.nature.com/articles/s41467-020-18037-z EDIT: the authors added an SVM baseline (along with other methods).
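For reference, the kind of vanilla baseline meant here fits an SVM on flattened functional connectomes; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, n_rois = 200, 100
fc = rng.standard_normal((n_subjects, n_rois, n_rois))
fc = (fc + fc.transpose(0, 2, 1)) / 2            # symmetrize: fake connectomes
iu = np.triu_indices(n_rois, k=1)
X = fc[:, iu[0], iu[1]]                          # flatten the upper triangle
y = rng.integers(0, 2, n_subjects)               # e.g., sex labels

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X[:150], y[:150])
print("accuracy:", clf.score(X[150:], y[150:]))
```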

More of a stylistic weakness: some of the text uses excessively strong language, in particular the abstract. "This pioneering model sets an unprecedented chapter in brain activity analysis with fMRI, achieving state-of-the-art performance in demographic prediction, disease diagnosis/prognosis, and trait prediction through fine-tuning." This is not accurate: several similar papers have come out in the past two years (see below), at least the BrainLM paper. EDIT: the authors have toned down claims of novelty in the abstract.

The paper fails to acknowledge several closely related works. BrainLM is not the only fMRI foundational model. I have listed a few below. The brainMASS paper in particular is an impressive work very relevant to this submission, using 30 different datasets and seven downstream tasks. It would be important to position the paper compared to some of these models, and tone down the claims of novelty. EDIT: the authors have added more recent models to their evaluation and added a discussion of other models that could not be directly compared.

  • Yang et al., 2024. BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning
  • Ferrante et al., 2024. Towards Neural Foundation Models for Vision: Aligning EEG, MEG and fMRI Representations to Perform Decoding, Encoding and Modality Conversion. https://openreview.net/forum?id=nxoKCdmteM
  • Thomas et al., 2023. Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data. https://arxiv.org/abs/2206.11417

Finally, and related to that last point, there are many public datasets to benchmark downstream tasks. ABIDE I, ABIDE II and ADHD200 in particular are readily available. It is surprising to see "only" three datasets used for downstream tasks. EDIT: the authors have added several datasets for downstream tasks.

Questions

How do the authors handle the massive difference in fMRI temporal resolution between UK biobank and ADNI? EDIT: all required details were provided in the rebuttal below.

Positional encoding would be critical with variable brain region location and temporal sampling, but fMRI only has variable temporal sampling, and it is not clear from the text how this is handled by the model.

The brain representation in Figure 7 is hard to decipher. What are the readers supposed to observe? EDIT: not really resolved, see my comment to the rebuttal.

I believe UK Biobank uses three Siemens 3T scanners, not one. EDIT: manuscript was amended to reflect this.

Could the authors please double check that IRB approval is not required for secondary analysis of human neuroimaging data in their institution. EDIT: the authors have the required IRB approvals.

Limitations

The number of downstream tasks is limited, and the authors did not use simple baseline models (such as SVM on connectomes) despite the known excellent performance of these models at the scale of fMRI datasets. EDIT: the number of baseline models and downstream tasks have been expanded.

It is also unclear based on the current set of results how the Brain-JEPA model handles a diversity of scanner and image acquisition characteristics, considering it was trained on a single protocol. EDIT: downstream tasks include datasets with various protocols, demonstrating robustness.

Author Response

Code, data source, and preprocessing.

Thank you for your valuable feedback.

We have supplemented code of all downstream tasks on public datasets to the codebase. The code and subject IDs are available via an anonymous link provided to AC in a separate comment, per author guidelines.

We utilized open-source preprocessing pipelines as outlined in [1] and [2]. We implemented robust scaling for normalization as practiced in BrainLM.

By sharing the complete codebase, subject IDs, and using open-source preprocessing pipelines, we believe that our results are highly reproducible.

Other related work.

Thank you for your important suggestions.

We have incorporated SVM/SVR, BrainMass [3], CSM [4], and SwiFT [5] (as suggested by another reviewer) into our comparisons. As shown in the global author rebuttal Tables 1 and 2, Brain-JEPA outperforms these models on most tasks.

Although SVM/SVR may perform well in demographic prediction (e.g., sex), previous research has indicated that SVMs perform worse than fMRI deep learning models [6] [7]. Additionally, note that BrainMass is a concurrent work to ours (published after our initial submission). We have demonstrated a broader range of downstream applications, including demographic prediction, trait prediction, cognition score prediction, and disease prognosis/diagnosis. In contrast, experiments in BrainMass are limited to disease diagnosis/prognosis only. Moreover, the work mentioned [8] lacks available code and is applied to a multi-modal setting, making it difficult to reproduce and not suitable for our context.

In the revised version, we will include comparisons with additional models to further demonstrate the outstanding performance of Brain-JEPA. We will also incorporate a discussion of the above works to better position our contributions.

More downstream datasets.

Thank you for the valuable suggestion.

Note that datasets such as ABIDE I, ABIDE II, and ADHD200 are all pediatric cohorts. Given that our model was pretrained on UKB, which primarily includes middle-aged to elderly participants, it may not generalize well to younger populations yet (the same holds for BrainLM).

To further demonstrate the diversity of our downstream applications, we have conducted additional experiments using two aging-related public datasets: OASIS-3 and CamCAN, for AD conversion prediction and depression classification, respectively (global author rebuttal Table 3). By applying Brain-JEPA to five downstream datasets across eight distinct tasks in total, we have demonstrated its versatility in a wider range of applications compared to existing models. Specifically, Brain-JEPA excels in demographic prediction, trait prediction, and disease diagnosis and prognosis. This stands in contrast to the experiments in BrainLM, which is limited to demographic and clinical score prediction, and BrainMass, which focuses solely on disease diagnosis and prognosis.

Temporal resolution.

We thank the reviewer for the insightful question. For fMRI in UKB, we used multi-band data with a high temporal resolution (TR is ~0.7s). In our downstream datasets, HCP-Aging also uses multi-band data with a TR of ~0.7s, while both ADNI and the Asian cohort use single-band data with a TR of ~2s.

To address the differences in TR, we downsampled the multi-band data with a temporal stride of 3, aligning the TR to ~2s. This ensures consistency across different datasets.

For future work, we plan to use learnable TR embeddings to enable the model to adapt to different TRs dynamically.

Clarification on Fig 7.

Brain networks are systems of interconnected regions that work together to perform specific functions, such as the Default Mode Network (DMN) for self-referential and memory tasks and the Control Network (CN) for executive functions [9] [10]. Figure 7 shows attention values among networks. Higher attention values in DMN, CN, and SAN indicate their significant involvement in MCI, consistent across different ethnic groups, showcasing Brain-JEPA's robustness and generalizability.

Claim in abstract.

Thank you for your valuable feedback. While we recognize the contributions of recent works, our aim was to highlight specific advancements introduced by Brain-JEPA. These include the integration of Brain Gradient Positioning and Spatiotemporal Masking. Furthermore, the downstream applications of Brain-JEPA are exceptionally diverse. We will remove the “unprecedented chapter” in the abstract for our revision to better place our work in the context of recent developments in the field.

UKB has three scanners.

We will revise "one scanner" to "three scanners" in the UKB introduction in the revised version.

IRB approval.

We have obtained all necessary IRB approvals for the datasets used for analysis.

References:

[1] Spatial topography of individual-specific cortical networks predicts human cognition, personality, and emotion. Cerebral Cortex 2019.

[2] Global signal regression strengthens association between resting-state functional connectivity and behavior. Neuroimage 2019.

[3] BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning. TMI 2024.

[4] Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data. NeurIPS 2022.

[5] SwiFT: Swin 4D fMRI Transformer. NeurIPS 2023.

[6] Interpretable Graph Neural Networks for Connectome-Based Brain Disorder Analysis. MICCAI 2022.

[7] Beyond the Snapshot: Brain Tokenized Graph Transformer for Longitudinal Brain Functional Connectome Embedding. MICCAI 2023.

[8] Towards Neural Foundation Models for Vision: Aligning EEG, MEG and fMRI Representations to Perform Decoding, Encoding and Modality Conversion. ICLR 2024 Workshop Re-Align.

[9] The organization of the human cerebral cortex estimated by intrinsic functional connectivity. J Neurophysiol 2011.

[10] Correspondence of the brain's functional architecture during activation and rest. PNAS 2009.

Comment

Re reproducibility: thanks for taking these steps. This is strengthening the submission. I am going to update my soundness score to 4 (excellent).

Re prior works: this point is appropriately addressed by incorporating additional models when possible, and adding discussions on the model that cannot be directly compared or implemented.

Re downstream tasks: adding several new datasets and downstream tasks addresses the issue and strengthens the manuscript (this contributes to my upgrade of the soundness score).

Re temporal resolution: thanks for the clarification. Those are critical details that should be added in future revisions of the article.

Re figure 7: I am familiar with the functional organization into intrinsic connectivity networks. My point is that the figure is very hard to read due to the choice of visualisation. The point you are trying to make would be better served by showing distribution of attention in those networks.

Re the tone of the abstract: I understand this paper makes novel contributions. But I maintain that the tone was too dramatic, and I agree with the proposed revision. I disagree that the range of downstream tasks is exceptional. Even after revision, it is smaller than some other works in the field. Check this work for example: https://doi.org/10.1162/imag_a_00222

Scanners in UK biobank: I would encourage you to refer to the documentation of the UK biobank to double check.

IRB: this adequately addresses my concerns.

Based on the substantial improvements made by the authors, I have decided to revise my overall score to 7 (accept).

Comment

We appreciate your recognition that we have addressed most of your concerns in our rebuttal, and we are grateful for your decision to raise the score.

Thank you for your valuable feedback on Figure 7. We acknowledge that displaying the attention values as brain surface maps would enhance clarity, and we will include this additional brain map figure in the revised version.

We also recognize that 'exceptional' may not be the most precise wording when referring to downstream tasks involving a broader range of disease types, etc. In the revised version, we will emphasize that Brain-JEPA can be applied to a diverse array of downstream applications. Additionally, we will include evaluations on more related downstream tasks in the future.

We sincerely appreciate your thorough review and thoughtful suggestions, which have greatly contributed to the improvement of our work.

Global Author Response

We thank the reviewers for their time and effort in reviewing our work. We appreciate the positive feedback and great interest in our work from all reviewers, along with their insightful questions and suggestions. We are pleased to see that the reviewers acknowledge and appreciate the following aspects:

  1. The idea is interesting and effective. The model shows great generalizability and scaling properties, with strong empirical results. (Reviewers GJHn, M77o, iLSi, fytG)

  2. The work is very clear and well constructed. (Reviewers GJHn, M77o, iLSi)

  3. Excellent contribution (Reviewer GJHn) and presentation (Reviewer fytG).

  4. Great and clear ablation study. (Reviewers GJHn, fytG)

  5. The code and the checkpoints of the models are shared. (Reviewer GJHn)

  6. The paper is of interest to the computational neuroscience community. (Reviewer iLSi)

We have addressed each reviewer's questions and suggestions point-by-point. This includes adding experiments with more baselines, evaluating on additional downstream datasets and tasks, and clarifying the temporal resolution. Additionally, we responded to questions related to the ablation study of the framework's contribution, the gradient dimension, and the scaling of the pretraining dataset size.

We would like to note that for the added baselines, BrainMass [1] is a concurrent work published after our initial submission. Additionally, CSM [2] and SwiFT [3] are not time series models; CSM utilizes text-like representations, while SwiFT operates on raw fMRI data. We thank the reviewers for suggesting these works, and we believe that including comparisons with them strengthens our results.

Additional results are shown below:

  • Table 1. Results of additional baselines on HCP-Aging.

| | Age MSE ↓ | Age ρ ↑ | Sex ACC (%) ↑ | Sex F1 (%) ↑ |
|---|---|---|---|---|
| SVM/SVR | 0.586 (.019) | 0.699 (.022) | 76.67 (1.88) | 80.82 (1.15) |
| BrainMass | 0.396 (.002) | 0.831 (.014) | 74.09 (3.87) | 75.78 (3.37) |
| CSM | 0.409 (.012) | 0.733 (.023) | 74.85 (1.11) | 76.23 (0.37) |
| SwiFT | 0.341 (.007) | 0.755 (.063) | 73.48 (2.20) | 74.65 (2.32) |
| Brain-JEPA | 0.298 (.017) | 0.844 (.030) | 81.52 (1.03) | 84.26 (0.82) |
  • Table 2. Results of additional baselines on ADNI.

| | NC/MCI ACC (%) ↑ | NC/MCI F1 (%) ↑ | Amy+/- ACC (%) ↑ | Amy+/- F1 (%) ↑ |
|---|---|---|---|---|
| SVM/SVR | 64.21 (5.16) | 73.06 (4.71) | 62.00 (4.00) | 63.84 (5.44) |
| BrainMass | 74.21 (5.10) | 81.36 (3.56) | 68.00 (7.48) | 69.29 (8.96) |
| CSM | 68.42 (4.99) | 76.74 (4.54) | 63.00 (9.80) | 65.89 (9.79) |
| SwiFT | 73.16 (5.31) | 80.46 (4.16) | 65.00 (6.32) | 67.79 (6.38) |
| Brain-JEPA | 76.84 (1.05) | 86.32 (0.54) | 71.00 (4.90) | 75.97 (3.93) |
  • Table 3. Additional tasks of AD conversion prediction (OASIS-3) and depression classification (CamCAN).

| | OASIS-3 AD Conversion ACC (%) ↑ | OASIS-3 AD Conversion F1 (%) ↑ | CamCAN Depression ACC (%) ↑ | CamCAN Depression F1 (%) ↑ |
|---|---|---|---|---|
| SVM/SVR | 56.00 (2.81) | 52.05 (1.66) | 63.64 (3.07) | 56.79 (2.32) |
| BrainNetCNN | 62.00 (2.45) | 59.53 (0.58) | 62.73 (4.45) | 56.85 (4.47) |
| BrainGNN | 59.00 (2.00) | 56.53 (4.34) | 63.64 (4.98) | 56.68 (3.26) |
| BNT | 68.00 (8.72) | 64.73 (11.29) | 65.45 (4.64) | 55.32 (8.67) |
| BrainLM | 65.00 (7.75) | 62.67 (9.04) | 70.00 (6.17) | 64.18 (3.82) |
| BrainMass | 67.00 (6.00) | 66.53 (6.95) | 70.91 (2.23) | 63.56 (2.93) |
| CSM | 61.00 (4.90) | 61.97 (5.49) | 64.55 (4.45) | 56.08 (6.23) |
| SwiFT | 65.00 (6.32) | 66.80 (4.12) | 69.09 (6.68) | 61.78 (9.26) |
| Brain-JEPA | 69.00 (7.35) | 67.32 (7.92) | 72.73 (2.87) | 67.45 (1.57) |
  • Table 4. Comparisons of different frameworks.

| | HCP-Aging Age (ρ ↑) | HCP-Aging Sex ACC (%) ↑ | ADNI Amy+/- ACC (%) ↑ |
|---|---|---|---|
| BrainLM | 0.832 (.028) | 74.39 (1.55) | 67.00 (7.48) |
| BrainLM w contributions | 0.838 (0.014) | 76.36 (2.58) | 70.00 (11.40) |
| JEPA w contributions | 0.844 (.030) | 81.52 (1.03) | 71.00 (4.90) |
  • Table 5. Comparison of different numbers of gradient components ('bg' means brain gradient).

| | HCP-Aging Age (ρ ↑) | HCP-Aging Sex ACC (%) ↑ | ADNI Amy+/- ACC (%) ↑ |
|---|---|---|---|
| 3-dim bg | 0.819 (0.003) | 76.96 (1.77) | 67.00 (6.00) |
| 30-dim bg | 0.844 (.030) | 81.52 (1.03) | 71.00 (4.90) |

The main results on additional baselines and datasets can also be found in the attached PDF.

If reviewers have any further questions or concerns, please let us know. We are happy to engage in further discussion.

References:

[1] BrainMass: Advancing Brain Network Analysis for Diagnosis with Large-scale Self-Supervised Learning. TMI 2024.

[2] Self-Supervised Learning of Brain Dynamics from Broad Neuroimaging Data. NeurIPS 2022.

[3] SwiFT: Swin 4D fMRI Transformer. NeurIPS 2023.

Final Decision

This submission generated much interest and discussion. The reviewers appreciated the contributed foundation model for fMRI time series, with interesting modeling developments and solid evaluation for several applications such as demographic prediction, disease diagnosis/prognosis, and trait prediction.