Brain Harmony: A Multimodal Foundation Model Unifying Morphology and Function into 1D Tokens
Brain Harmony (BrainHarmonix) unifies T1 MRI + fMRI into 1D tokens and tops six brain-disorder benchmarks.
Abstract
Reviews and Discussion
This paper presents BrainHarmonix, a multimodal brain foundation model that unifies structural (T1-weighted MRI) and functional (fMRI) neuroimaging data into compact 1D token representations. The proposed method comprises a two-stage modular training framework: unimodal encoding (UE) followed by multimodal fusion (MF) via shared brain hub tokens. UE includes two key modules: (a) Geometric Harmonics-Based Pre-alignment of fMRI dynamics with cortical structure to impose structural constraints on functional encoding, and (b) Temporal Adaptive Patch Embedding (TAPE) to handle varying fMRI repetition times (TRs), addressing a major barrier in prior models. Experimental results demonstrate BrainHarmonix's strong performance across neurodevelopmental and neurodegenerative disorder diagnosis and cognitive prediction benchmarks. Ablation studies highlight the critical roles of multimodal fusion, pre-alignment, and TR-aware augmentation. The model shows scalable performance and efficient generalization even with linear probing.
Strengths and Weaknesses
Strengths:
- The paper is well-organized and accessible; each section is clearly structured, and complex concepts are explained with visual aids and equations.
- The work addresses critical gaps in the brain foundation model literature, including the lack of multimodal integration and the inability to generalize across fMRI datasets with variable TRs.
Weaknesses:
- The paper claims: "We demonstrate, for the first time, that complex brain morphology and dynamics can be deeply compressed into unified continuous-valued 1D tokens that serve as holistic representations of the human brain." However, prior foundation models have also aimed to learn unified representations to capture complex brain morphology and dynamics. To clarify the novelty of this work, it would be helpful to distinguish it more clearly from existing approaches. A comparative table analyzing similarities and differences with previous models would strengthen the contribution and help position this work in the broader context of neuroimaging foundation models.
- Despite the large pretraining datasets, the data is primarily sourced from the UK Biobank and ABCD, which focus on specific age groups (middle-aged adults and children, respectively). As acknowledged by the authors, early life (infancy) and young adulthood are underrepresented. Moreover, there is limited evidence of geographical, scanner, or demographic diversity, which may affect generalization.
- To further demonstrate the method’s effectiveness, it would be helpful to present results on additional datasets and tasks, such as the Human Connectome Project for Early Psychosis (HCP-EP) and the Transdiagnostic Connectome Project (TCP).
- The method relies on downstream fine-tuning after pretraining. To better showcase its effectiveness, it would be helpful to include a few-shot setting—for example, using an increasing portion of data (e.g., 5%, 10%, ...100%) from a neuroimaging benchmark dataset for fine-tuning.
- Figure 5 presents results on scaling across different numbers of 1D tokens (32, 64, 128, 256) during fine-tuning and linear probing. The curves show that increasing the token count from 32 to 256 steadily improves accuracy. However, a clear performance plateau is not observed, contrary to what is suggested in the text. To make this claim more convincing, it would be helpful to include results with additional token counts (e.g. 512, 1024) to show whether performance eventually plateaus, continues increasing, or begins to decrease at some point.
- It would be helpful to include training and inference times under varying experimental settings—for example, different token counts. Further, scalability experiments assessing how model performance evolves with increasing model complexity or data availability would strengthen the analysis. Such an evaluation would clarify the effects of scaling and illuminate the trade-offs in runtime and memory usage.
- The paper could benefit from a deeper exploration of where BrainHarmonix underperforms, particularly in fine-grained cognitive prediction tasks or low-sample regimes. Including an analysis of failure modes would strengthen the paper.
Questions
Please refer to the items listed under the Weaknesses section.
Limitations
yes
Final Justification
Most of the concerns are well addressed.
Formatting Issues
no
Distinction from prior foundation models
We appreciate the reviewer’s feedback and would like to clarify this important point. To the best of our knowledge, we are indeed the first to deeply compress both brain morphology and dynamics into unified representations via 1D tokens as a multimodal brain foundation model. Regarding “prior foundation models”, we would greatly appreciate it if the reviewer could point out specific works we might have overlooked.
As acknowledged by the other three reviewers, our proposed BrainHarmonix is a significant conceptual step forward, addressing a clear need in neuroimaging by effectively handling heterogeneous TRs and multimodal data fusion, thus providing novel directions for future research in brain foundation models.
Moreover, Tables 1 and 2 in our paper show how the proposed model differs from previous brain foundation models. For further clarity, we summarize the key distinctions below:
| Model | Modality | Compact 1D latent representation of brain | Capture brain morphology | Capture brain dynamics | Handle heterogeneous TR | Constrained by brain cortical surface |
|---|---|---|---|---|---|---|
| BrainMVP | Structural MRI (volumetric data) | No | Yes | No | N.A. | N.A. |
| BrainMass | fMRI (functional connectivity) | No | No | No | Yes | No |
| BrainLM | fMRI (time series) | No | No | Yes | No | No |
| Brain-JEPA | fMRI (time series) | No | No | Yes | No | No |
| BrainHarmonix | Structural MRI (volumetric data) + fMRI (time series) | Yes | Yes | Yes | Yes | Yes |
We hope this comparison clarifies the unique contributions of our work.
Data diversity
We thank the reviewer for this insightful comment and would like to emphasize that we have explicitly acknowledged the limitations regarding age coverage in our pretraining datasets within the "limitations and future work" section.
Importantly, our current pretraining data already extends the lifespan coverage compared to prior state-of-the-art brain dynamics models such as BrainLM and Brain-JEPA, expanding from middle-aged and older adults to children as well, thereby enabling downstream analyses of neurodevelopmental disorders that previous models did not address. Moreover, our pretraining datasets inherently include scanner diversity, for example, ABCD data was acquired from multiple vendors (Siemens, GE, Philips). To further demonstrate generalizability across different geographical and demographic contexts, we have extended our downstream evaluations to include an Asian clinical cohort collected from a university hospital memory clinic, effectively assessing our model in real-world clinical scenarios. Please refer to our response to Reviewer Eo3b, "Limited generalization exploration beyond public Western datasets" for the detailed results.
Evaluation on additional datasets and tasks
We thank the reviewer for this valuable suggestion. Due to the limited rebuttal time and resources, we apologize that we are unable to complete the downloading, preprocessing, and evaluation of our model on the suggested datasets. Nevertheless, we extended our evaluation to an Asian clinical cohort collected from a university hospital memory clinic, thereby assessing generalizability to non-Western populations and in real-world clinical scenarios, as correctly pointed out by the reviewer.
Specifically, we performed an additional task not included in the current version - classification of amyloid-positive/negative cognitively normal participants, which holds significant clinical value for AD prognosis and early intervention. Please refer to our response to Reviewer Eo3b, "Limited generalization exploration beyond public Western datasets" for detailed results. We will incorporate the suggested HCP-EP and TCP for downstream evaluation in our future work.
Fine-tuning evaluation across data portions
We appreciate the reviewer’s suggestion regarding few-shot evaluation. To address this, we conducted additional analyses by scaling the fine-tuning dataset using increasing proportions (20%, 40%, 60%, 80%, and 100%). The accuracy (%) results are shown in the table below. Our results demonstrate a clear and consistent scaling of performance with increasing data portions. Notably, compared with the prior leading baseline BrainMass (59.35% on ABIDE-II and 65.99% on ADHD-200), BrainHarmonix achieves state-of-the-art performance even when fine-tuned on only 80% of the dataset, highlighting the efficiency and effectiveness of our pretrained representations. We will include this analysis in the revised version.
| Dataset | Metric | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|---|
| ABIDE-II | Accuracy (%) | 55.94 | 56.52 | 60.87 | 63.77 | 66.67 |
| ADHD-200 | Accuracy (%) | 57.14 | 62.72 | 67.44 | 69.39 | 70.09 |
Scaling behavior with increasing token numbers
We appreciate your valuable comments. In our preliminary experiments, we observed that the performance tends to stabilize when the number of tokens reaches around 256. For completeness, we have included results with 512 and 1024 tokens as references. As shown in the table below, the accuracy (%) remains relatively stable beyond 256 tokens, confirming our initial observation.
| Dataset | Method | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|
| ABIDE-II | Fine-tune | 62.61 | 65.21 | 66.67 | 66.96 | 67.53 | 66.96 |
| ABIDE-II | Linear Probe | 61.45 | 61.45 | 61.74 | 62.03 | 62.32 | 62.89 |
| ADHD-200 | Fine-tune | 67.69 | 69.05 | 70.09 | 70.41 | 70.41 | 70.75 |
| ADHD-200 | Linear Probe | 66.33 | 67.69 | 68.37 | 68.71 | 68.37 | 69.05 |
Scalability and efficiency evaluation
We thank the reviewer for the valuable suggestions. We have added scalability experiments by varying the size of the harmonizer model from 22M parameters to 307M parameters. The results regarding accuracy (%) in the table below show performance improvement from 22M to 86M, while the gain from 86M to 307M is only marginal.
| Dataset | 22M | 86M | 307M |
|---|---|---|---|
| ABIDE-II | 64.06 | 66.67 | 66.95 |
| ADHD-200 | 69.39 | 70.09 | 70.40 |
We also investigated the effect of using different portions of the pretraining dataset. Specifically, we applied identical sampling proportions to both the UKB and ABCD datasets for pretraining. The corresponding accuracy (%) results are reported in the table below. We observe that the model’s performance improves as the portion of the pretraining dataset increases.
| Pretrain Portion | 20% | 40% | 60% | 80% | 100% |
|---|---|---|---|---|---|
| ABIDE-II | 59.12 | 62.90 | 64.35 | 65.21 | 66.67 |
| ADHD-200 | 64.96 | 65.64 | 67.01 | 68.37 | 70.09 |
Following the reviewer’s suggestion, we have also included the pretraining time (on 8 NVIDIA H100 GPUs (80GB)), as well as the fine-tuning time (on 1 H100 GPU) and inference time (on 1 H100 GPU) on ABIDE-II, corresponding to different model sizes and token counts, to provide a more comprehensive view of the computational cost in the table below. Larger models or higher token counts lead to longer computation times.
| Metric | 22M (128) | 307M (128) | 86M (32) | 86M (64) | 86M (128) | 86M (256) | 86M (512) | 86M (1024) |
|---|---|---|---|---|---|---|---|---|
| Pretrain Time | 5h 10m | 17h 9m | 9h 20m | 9h 26m | 9h 37m | 9h 45m | 10h 23m | 11h 11m |
| FT Training Time | 0h 21m 52s | 1h 07m 28s | 0h 25m 33s | 0h 26m 54s | 0h 27m 41s | 0h 29m 54s | 0h 30m 17s | 0h 31m 54s |
| Inference Time | 5.02s | 7.19s | 5.90s | 5.89s | 6.36s | 5.77s | 6.11s | 7.47s |
Overall, our further analysis highlights a trade-off: larger models or increased token counts generally enhance performance but result in longer computation times.
Failure mode and limitation analysis
We appreciate the reviewer’s suggestion to analyze potential failure modes and areas of underperformance. In our current evaluations, one notable case where BrainHarmonix underperforms is ADHD-200, where its F1 score is slightly lower than that of BrainHarmonix-F. This is likely due to motion artifacts in T1, as ADHD patients exhibit increased head motion during MRI acquisition. Such motion introduces noise and negatively affects structural data quality, potentially reducing multimodal fusion performance. Future work will explore methods to improve robustness against data-quality issues. Additionally, although we have demonstrated data scaling effects in the above discussion, model performance under low-sample and few-shot learning scenarios remains an area for improvement. Future studies may address few-shot adaptation through approaches such as parameter-efficient fine-tuning or prompt-based tuning.
Thanks for the response. I have changed the score to weak accept.
We truly appreciate the reviewer's decision to increase the score. Thank you for your time and consideration.
The paper introduces BrainHarmonix, a novel multimodal brain foundation model that unifies structural morphology from T1-weighted MRI and functional dynamics from fMRI into compact 1D brain-hub tokens. Unlike prior work that focused on either modality, BrainHarmonix bridges both through a two-stage pretraining pipeline: unimodal encoding (using MAE and JEPA architectures) followed by multimodal fusion using shared learnable tokens.
Key contributions include:
- Multimodal Representation: First to integrate structural and functional data into shared 1D token space.
- Geometric Harmonization: Structural constraints (via geometric harmonics) are embedded into functional encoding, improving alignment.
- Temporal Adaptive Patch Embedding (TAPE): A strategy to handle heterogeneous TRs across fMRI datasets, a known limitation in prior models.
- Effective Data Augmentation: Downsampling-based augmentation for fMRI to enrich temporal dynamics.
- State-of-the-art Results: Demonstrates consistent improvements across diverse downstream tasks including ASD, ADHD, Parkinson’s, Alzheimer’s, and cognition prediction.
The model is trained on over 64,000 sMRI volumes and 70,000 fMRI scans, showing strong scalability and generalization.
Strengths and Weaknesses
Strengths:
- The unification of morphology and dynamics into 1D tokens is a significant conceptual step forward; prior models typically focus on only one modality. The authors address a clear need in neuroimaging: handling heterogeneous TRs and fusing modalities.
- TAPE and the geometric harmonics-based alignment are novel and well-motivated, offering practical solutions to known limitations. The model is evaluated across several challenging datasets, with detailed ablations (Figure 6) and scaling experiments (Figure 5).
- The submission includes detailed implementation and preprocessing steps, openly available data, and code release in the supplementary.
Weaknesses:
- Lack of Baseline Against Dynamic Time Warping: The proposed TAPE method addresses variable TRs but omits comparisons to standard time-alignment methods like dynamic time warping (DTW), which would offer a useful baseline.
- Interpretability: While the compact tokenization is elegant, the interpretability of these 1D tokens—especially with respect to individual behavioral or clinical traits—is underexplored.
- Fusion Mechanism: The harmonizer is somewhat abstract; more insight into how the attention mechanism learns cross-modal representations (e.g., attention maps, token influence) would enhance clarity.
- Evaluation Scope: While six benchmarks are substantial, generalization to clinical settings outside large public datasets (e.g., low-resource or single-site studies) is not explored.
Questions
Why was dynamic time warping (DTW) not considered as a benchmark for aligning fMRI time series across TRs? DTW is a classic, domain-agnostic alignment strategy and could provide a useful contrast to the learned TAPE module. Including such a comparison, or at least discussing why it is inappropriate, would strengthen the argument for TAPE.
How interpretable are the 1D tokens? Are they anatomically localized, or do they represent global embeddings? Could the authors provide visualizations of token contributions to downstream tasks, or any biological interpretation?
What are the limitations of using geometric harmonics across diverse populations or clinical scans? Are there risks of oversmoothing or biases introduced by relying on population-level cortical surfaces? How well does this generalize to atypical brains (e.g., patients with malformations)?
Could the harmonizer be applied independently for zero-shot transfer across new datasets with only one modality? For instance, can it be used to impute missing modalities or perform cross-modal translation? If not, could this be a direction for future work?
Limitations
The authors acknowledge several limitations, such as restricted age coverage and the need for further generalization to other populations. However, a few areas are underdiscussed:
DTW as a baseline: As noted, the omission of classical alignment techniques like DTW weakens the empirical justification for TAPE.
Interpretability: More attention could be paid to understanding what the 1D tokens actually represent, beyond task performance.
Synthetic testing of TRs: Take a dataset with a fixed TR and resample it at different TRs; then compare the original data with the artificially and irregularly resampled versions to test TAPE.
Generalization risks: While datasets are large, they mostly reflect Western populations; generalization to non-Western cohorts, clinical settings, or non-standard acquisition protocols is untested.
Formatting Issues
None
Lack of Baseline Against Dynamic Time Warping
We appreciate the reviewer’s valuable suggestion. However, we believe that DTW is not a suitable baseline in this setting. DTW assumes a meaningful temporal correspondence between sequences, but in the context of resting-state fMRI there is no ground-truth temporal alignment across individuals, as each subject’s brain dynamics evolve independently and asynchronously. Therefore, applying DTW across different scans would impose artificial temporal correspondences not supported by the data.
We thank the reviewer for bringing this method to our attention. We will add the above discussion in the related work and TAPE section.
Interpretability of 1D tokens and attention maps
We appreciate the reviewer’s insightful comments regarding the interpretability of our 1D tokens. Although NeurIPS 2025 policy prohibits us from including new figures at this rebuttal stage, we would like to briefly summarize key insights from our new analyses.
We examined the attention patterns between the 128 learned 1D tokens and the modality-specific tokens (400 fMRI ROI + 1200 T1 tokens) in ASD diagnosis using ABIDE-II data. For the 400 fMRI ROI tokens, each is obtained by averaging all tokens within the corresponding ROI. We found differentiation in modality attention among the 1D tokens: 93/128 tokens attended exclusively to fMRI, 30/128 exclusively to T1, and 5/128 tokens exhibited cross-modal attention. For the cross-modal tokens, we found that they exhibited key structure-function coupling, such as the medial prefrontal cortex in brain morphometry and the default mode network in brain dynamics, which have previously been demonstrated in the literature to be associated with ASD.
For the 93 fMRI-specific tokens, further analysis revealed network-level functional differentiation relevant to ASD behavioral traits. Specifically, 60/93 were network-specific, predominantly focusing on a single brain network (with >70% salient ROIs within one network), while the remaining 33 were identified as “bridge” tokens capturing interactions across multiple networks. Among the most salient network-specific tokens, the temporoparietal network (implicated in social perception and language processing deficits), the somatomotor network (associated with sensorimotor integration impairments), and the default mode network (linked to mentalizing deficits) emerged prominently. The identified “bridge” tokens primarily captured interactions involving the default, limbic, and control networks, reflecting impaired integration across sensorimotor, socioemotional, and higher-order cognitive processes - a mechanism implicated in the pathophysiology of ASD. Please refer to our response to Reviewer wvBG, “Latent space analysis reflecting structural constraints” for the interpretation of the geometry-constrained fMRI latent space.
We will incorporate corresponding visualization into our revised manuscript.
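To make the token-assignment procedure concrete, below is a minimal sketch of how such a modality-attention split could be computed from a hub-token attention map. The exclusivity threshold `excl`, the function name, and the array shapes are illustrative assumptions, not our exact criterion:

```python
import numpy as np

def classify_hub_tokens(attn, n_fmri=400, excl=0.9):
    """Label each 1D hub token by where its attention mass falls.

    attn: (n_hub, n_fmri + n_t1) attention weights from hub tokens to
    modality tokens, with rows summing to 1. `excl` is a hypothetical
    exclusivity threshold (our exact criterion is not stated here).
    """
    fmri_mass = attn[:, :n_fmri].sum(axis=1)
    return np.where(fmri_mass >= excl, "fmri",
                    np.where(fmri_mass <= 1 - excl, "t1", "cross-modal"))

# Toy usage: 128 hub tokens attending over 400 fMRI + 1200 T1 tokens
rng = np.random.default_rng(0)
attn = rng.random((128, 1600))
attn /= attn.sum(axis=1, keepdims=True)
labels = classify_hub_tokens(attn)
print({m: int((labels == m).sum()) for m in ("fmri", "t1", "cross-modal")})
```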
Limited generalization exploration beyond public Western datasets
We thank the reviewer for raising this important point regarding generalization. To address this, we extended our evaluation to an Asian clinical cohort collected from a university hospital memory clinic (single site), thereby assessing generalizability to non-Western populations and in real-world clinical scenarios. Specifically, we performed an additional task not included in the current version - classification of amyloid-positive/negative cognitively normal participants, which holds significant clinical value for AD prognosis and intervention. As shown in the table below, BrainHarmonix achieved state-of-the-art performance in this clinically relevant, in-house setting, underscoring its robustness and cross-population generalizability. We will incorporate these additional results in the revised version.
| Model | ACC (%) | F1 (%) |
|---|---|---|
| BrainMVP | 65.83 | 53.64 |
| BrainHarmonix-S | 67.68 | 56.67 |
| BrainNetCNN | 57.57 | 52.00 |
| BrainGNN | 62.61 | 40.57 |
| BrainNetTF | 63.03 | 57.57 |
| BrainMass | 64.65 | 57.93 |
| BrainLM | 63.64 | 54.03 |
| Brain-JEPA | 66.67 | 59.18 |
| BrainHarmonix-F | 68.69 | 62.50 |
| BrainHarmonix | 74.75* | 65.57* |
Population-level geometric harmonics
We thank the reviewer for highlighting this important consideration. Using geometric harmonics derived from a group-level cortical surface provides robust, stable alignment by capturing common cortical geometry across individuals, thereby facilitating generalization and comparability across diverse datasets. In contrast, subject-specific cortical surfaces can be highly variable and prone to errors. However, we agree that combining group-level and individualized geometric harmonics may further enhance sensitivity and robustness, especially for atypical populations (e.g. patients) or the developing/aging brain. We will explore individualized (or age-specific) geometric harmonics in future studies which could further enhance model performance for clinical applications.
Zero-shot cross-modal transfer with harmonizer
We thank the reviewer for this insightful suggestion. While the current harmonizer requires paired multimodal inputs, a promising future direction would be to adapt it for zero-shot transfer and cross-modal imputation. One possible solution is to train modality-specific adapters, which could learn mappings from single-modality inputs to multimodal latent representations, enabling the harmonizer to impute missing modalities or perform cross-modal translation in scenarios where only one modality is available. We will add this in future work.
Synthetic testing of TRs
We appreciate this insightful suggestion. To further evaluate TAPE’s effectiveness, we tested on both the original HCP-A test set and a version with samples randomly downsampled by factors of 1 and 2 (with equal probability, leading to TR values of 1.6 and 2.4). The comparable performance across conditions demonstrates TAPE's robustness; a minimal sketch of this downsampling follows the table below.
| Test set | MAE | Correlation |
|---|---|---|
| Original test set | 6.56 | 0.42 |
| Synthetic test set | 6.69 | 0.39 |
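For concreteness, here is a minimal sketch of the synthetic-TR construction, assuming HCP-A's native TR of 0.8 s and reading "downsampled by factors of 1 and 2" as keeping every 2nd or 3rd frame (which yields the stated TRs of 1.6 s and 2.4 s); shapes and names are illustrative:

```python
import numpy as np

def simulate_tr(ts, keep_every):
    """ts: (n_rois, n_frames) time series at the native TR. Keeping every
    `keep_every`-th frame simulates acquisition at keep_every * native TR."""
    return ts[:, ::keep_every]

rng = np.random.default_rng(0)
ts = rng.standard_normal((400, 480))   # 400 ROIs; frame count is illustrative
factor = rng.choice([2, 3])            # native 0.8 s -> synthetic TR 1.6 s or 2.4 s
synthetic = simulate_tr(ts, factor)
```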
Thank you for addressing all my concerns and conducting additional experiments to support your points. I am happy to recommend your paper for NeurIPS' acceptance.
Thank you very much for your positive feedback and recommendation. We sincerely appreciate your valuable comments, which have greatly improved our paper.
This paper proposes BrainHarmonix, a multimodal brain foundation model focused on joint encoding of structural morphology (structural MRI) and functional dynamics (fMRI). BrainHarmonix encodes structural information contained in T1-weighted structural MRI 3D volumes, allowing the model to respect inductive biases in the brain where morphological structure is thought to constrain functional dynamics. This addresses limitations in previous brain foundation models, which mainly focus on one modality of MRI data such as fMRI recordings. Inductive priors based on geometric harmonics are incorporated into the embeddings learned by the model, in addition to a novel temporal patch embedding framework for dealing with varying repetition times in different MRI datasets. Downstream results are presented on neurodevelopmental and neurodegenerative datasets, demonstrating that the individual components of BrainHarmonix show strong performance compared to other foundation models, and that the multimodal model further improves downstream performance. Ablation studies and scaling experiments exploring token counts and data augmentation are also presented.
Strengths and Weaknesses
Strengths:
- Motivation & Clarity: The paper is well-written and well-motivated. The shortcomings of previous foundation models for brain activity, which are limited in the modalities they encode, are clearly explained, and the contributions that the authors are trying to make with this work are clear: mainly, a multimodal foundation model for brain representations that captures both structural and functional neuroimaging data with heterogeneous temporal resolution.
- Contributions: The work makes clear contributions to address the two major shortcomings of previous foundation models: a temporal adaptive patch embedding scheme to account for variable time resolution across different MRI datasets, and added priors based on geometric harmonics for injecting structural information into the embeddings learned by the model.
- Significance: This paper addresses a key consideration in the development of brain foundation models: learning generalizable representations for brain activity that account for both structural and functional information. Furthermore, the work makes contributions to increase the architecture’s capability to be trained across multiple datasets despite heterogeneity in time resolution.
- Datasets and Experiments: The work uses large-scale, significant neuroimaging datasets for pretraining and evaluation, including the UK Biobank (UKB), Adolescent Brain Cognitive Development (ABCD), and multiple neurodevelopmental disorder and neurodegenerative datasets. The results shown on downstream datasets are impressive, and demonstrate the strength of the individual architecture components (BrainHarmonix-S and BrainHarmonix-F) while also demonstrating that, for the most part, the multimodal model outperforms individual modality components and goes beyond single-modality performance.
- Scaling & Ablation Studies: Scaling experiments are presented for the number of 1D tokens on neurodevelopmental disorder datasets, along with ablation studies.
Weaknesses:
- Latent Space Analysis: A qualitative analysis of the embeddings learned by each unimodal encoder and/or the full multimodal model could add further results related to the embedding quality of the model. In particular, do the authors have evidence that the latent space of brain representations learned by their model reflects structural constraints more than previous brain foundation models as a result of the incorporation of geometric harmonics?
- Clarity of the Temporal Adaptive Patch Embedding (TAPE) method: A few details of the TAPE method are unclear from Section 3.1.2, for example: (i) is a linear transformation matrix B defined for every MRI dataset that is trained on, or rather per each repetition time s that occurs in the pretraining datasets? And how are these matrices initialized/learned? (ii) The goal of creating tokens with consistent temporal duration is clear; however, how much guarantee is there that the linear transformation matrices create consistent tokens which are comparable across datasets?
- Lack of End-to-End Training: Although the modular training framework allows each modality encoder to be trained independently before Multimodal Fusion (MF) training, the training framework does not include a stage where all architecture components are trained end-to-end. This might hinder the model’s ability to create representations and encode all relevant information between both modalities, since the unimodal encoders are never updated together with the fusion architecture component. This may be difficult to address during a short rebuttal period; however, discussion would be welcome regarding the training of individual components of the architecture: for example, why does the multimodal BrainHarmonix model underperform the individual performance of BrainHarmonix-F on the ADHD-200 dataset (in terms of F1-score), when in theory it has access to more information?
- Parameter Count: Regarding the performance of BrainHarmonix versus individual modality components, additional discussion would be welcome on how much of the performance might stem from the increased parameter count of the full BrainHarmonix model (with the Harmonizer component) versus the performance increase from having additional data modalities for downstream tasks.
Questions
- Do the authors have evidence that the latent space of brain representations learned by their model reflects structural constraints more than previous brain foundation models as a result of the incorporation of geometric harmonics?
- Can the authors clarify details of the TAPE encoding scheme, mainly (i) how the linear transformation matrices are learned and (ii) how much information loss there is (if any) when creating tokens across datasets with varying repetition time?
- Can the authors comment on end-to-end training versus modular training as described in their framework?
- How much of the full multimodal model's performance comes from the inclusion of multiple data modalities versus the increased parameter count from having an additional architectural module (the harmonizer)?
Limitations
yes
Final Justification
The authors addressed my concerns with clear explanations, additional analyses, and ablations, and while some limitations remain, the novelty, technical quality, and strong results justify a borderline accept.
Formatting Issues
none
Latent space analysis reflecting structural constraints
We thank the reviewer for suggesting this valuable qualitative analysis. As recommended, we extracted fMRI embeddings from BrainHarmonix-F and Brain-JEPA (400 ROIs, each represented by a 768-dimensional embedding) and applied t-SNE to project these embeddings onto a 2D plane. Since NeurIPS 2025 policy prohibits us from including new figures at this rebuttal stage, we performed a quantitative analysis comparing our model to Brain-JEPA. Specifically, we correlated each dimension of the t-SNE embedding with each of the 200 geometric harmonic modes across 400 ROIs. Compared to Brain-JEPA, our geometry-constrained embeddings exhibit a greater number of significantly correlated modes (p<0.05), with higher correlation strengths and significance levels on the top 5 most significant modes (see table below). In addition, we applied the Fisher r-to-z transformation to all correlations from the 200 harmonics for each model and conducted a two-sample t-test. Results demonstrate that correlations from our model are significantly higher overall. The correlation strength and statistical significance from the top modes, along with the overall comparison, confirm that our model is constrained by structural information more than Brain-JEPA. A minimal code sketch of this analysis follows the table below, and we will incorporate the corresponding t-SNE visualization in the revised version.
| Dimension | Model | # Significant Modes | Avg. P-value | Avg. Correlation |
|---|---|---|---|---|
| Dim 1 in t-SNE | Brain-JEPA | 7 | 0.00769 | 0.1562 |
| Dim 1 in t-SNE | Ours | 12 | 0.00456 | 0.1717 |
| Dim 2 in t-SNE | Brain-JEPA | 8 | 0.0115 | 0.1506 |
| Dim 2 in t-SNE | Ours | 15 | 0.00477 | 0.1726 |
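A minimal sketch of this correlation analysis, run on synthetic stand-ins for the t-SNE coordinates and harmonic modes; taking absolute correlations before the Fisher transform is an assumption made for illustration:

```python
import numpy as np
from scipy.stats import pearsonr, ttest_ind

def harmonic_correlations(tsne_dim, harmonics):
    """Correlate one t-SNE dimension with each geometric harmonic mode.

    tsne_dim: (n_rois,) coordinate per ROI; harmonics: (n_rois, n_modes)
    mode values sampled at the same ROIs."""
    rs, ps = zip(*(pearsonr(tsne_dim, harmonics[:, k])
                   for k in range(harmonics.shape[1])))
    return np.array(rs), np.array(ps)

# Synthetic stand-ins (illustrative only): 400 ROIs, 200 harmonic modes
rng = np.random.default_rng(0)
harmonics = rng.standard_normal((400, 200))
dim1_ours = harmonics[:, 0] + rng.standard_normal(400)   # geometry-coupled
dim1_jepa = rng.standard_normal(400)                     # uncoupled

r_ours, p_ours = harmonic_correlations(dim1_ours, harmonics)
r_jepa, p_jepa = harmonic_correlations(dim1_jepa, harmonics)

# Fisher r-to-z on all 200 correlations per model, then a two-sample t-test
t_stat, p_val = ttest_ind(np.arctanh(np.abs(r_ours)), np.arctanh(np.abs(r_jepa)))
print(f"significant modes (ours): {(p_ours < 0.05).sum()}, t = {t_stat:.2f}, p = {p_val:.3g}")
```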
Clarification of TAPE
We thank the reviewer for this insightful question, and we appreciate the opportunity to clarify our TAPE implementation:
The matrix $B_s$ is defined for each TR. Specifically, for each TR $s$ with corresponding patch size $p_s$, we construct $B_s$ as a deterministic bilinear interpolation matrix that resizes embedding weights from the base size $p_b$ to $p_s$. This matrix is not learned but rather computed analytically using standard bilinear interpolation. The learnable parameters here are the base kernel weights $\omega$, which are optimized via backpropagation during pretraining. The matrix $B_s$ provides a deterministic transformation to adapt to different patch sizes corresponding to different TRs. Implementation details can also be found in the code provided in the supplementary material.
The consistency of temporal duration is guaranteed both theoretically and empirically.
Theoretical guarantee: the adapted kernel $\hat{\omega} = (B^{+})^{\top}\omega$, obtained from the transformation of $\omega$ with the pseudo-inverse $B^{+}$ of matrix $B$, is the minimum-norm least-squares solution that minimises $\mathbb{E}_{x}\big[(\langle x, \omega\rangle - \langle Bx, \hat{\omega}\rangle)^{2}\big]$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. A full derivation can be found in the PI-resize work [1]; a minimal numerical sketch is given after the table below.
Empirical evidence: To further evaluate TAPE’s effectiveness, we tested on both the original HCP-A test set and a version with samples randomly downsampled by factors of 1 and 2 (with equal probability, leading to TR values of 1.6 and 2.4). The comparable performance across conditions demonstrates TAPE's robustness.
| Test set | MAE | Correlation |
|---|---|---|
| Original test set | 6.56 | 0.42 |
| Synthetic test set | 6.69 | 0.39 |
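To make the guarantee tangible, here is a minimal numerical sketch of the PI-resize principle [1] that TAPE builds on: a bilinear resize matrix $B$ maps a base-length patch to a new length, and the kernel is adapted with the pseudo-inverse so that token values are preserved. The matrix construction and patch lengths below are illustrative assumptions, not our exact implementation:

```python
import numpy as np

def bilinear_resize_matrix(p_base, p_new):
    """B in R^{p_new x p_base}: B @ x linearly resamples a length-p_base
    patch to length p_new (one standard bilinear construction)."""
    B = np.zeros((p_new, p_base))
    for i in range(p_new):
        t = i * (p_base - 1) / max(p_new - 1, 1)  # output sample i in input coords
        lo = int(np.floor(t))
        hi = min(lo + 1, p_base - 1)
        B[i, lo] += 1 - (t - lo)
        B[i, hi] += t - lo
    return B

def pi_resize_kernel(w_base, p_new):
    """Minimum-norm w_new with <B @ x, w_new> ~= <x, w_base> for all x."""
    B = bilinear_resize_matrix(len(w_base), p_new)
    return np.linalg.pinv(B).T @ w_base

# Token consistency check: the same signal embedded at two patch lengths
rng = np.random.default_rng(0)
w16 = rng.standard_normal(16)       # base kernel (length is illustrative)
w24 = pi_resize_kernel(w16, 24)     # adapted kernel for a longer patch
x = rng.standard_normal(16)
print(np.dot(x, w16), np.dot(bilinear_resize_matrix(16, 24) @ x, w24))  # ~equal
```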
Discussion on end-to-end training
We thank the reviewer for the insightful comments regarding end-to-end training. The harmonizer is already pretrained with long input sequences (i.e., 7200 fMRI tokens + 1200 T1 tokens + the number of 1D tokens), resulting in considerable memory requirements during training. Enabling end-to-end training would substantially increase GPU memory consumption and training time, potentially taking more than two weeks per run under our current infrastructure, limiting our capacity for model development and comprehensive evaluation.
However, we agree with the reviewer that jointly optimizing the unimodal encoders and the fusion module could potentially lead to further performance gains. Exploring efficient training strategies for such long-sequence transformer models, particularly in the context of neuroimaging, is a promising direction. We will discuss this important future work in the revised version.
The slightly lower F1 of BrainHarmonix compared to BrainHarmonix-F on ADHD-200 likely stems from motion artifacts in the T1 scans, which are especially prevalent among ADHD patients, who tend to exhibit more head motion during MRI acquisition. Even though we applied quality control to the T1 scans, motion artifacts are unavoidable in T1 data from children with ADHD. This motion negatively impacts structural data quality, introducing noise that affects multimodal fusion. In contrast, the fMRI data has undergone motion correction, so the impact there is smaller. We will investigate data-quality factors more systematically in future work.
Performance improvement: parameter count vs. multimodal integration
We thank the reviewer for the insightful question. We have added additional comparisons to better illustrate the performance gains from incorporating multiple modalities (2nd and 3rd columns of the accuracy (%) table below) and from introducing the harmonizer module for fusion (4th-6th columns). Specifically, we concatenated the embeddings from the frozen T1 and fMRI encoders and passed them through a trainable linear layer for the classification task. Furthermore, we conducted experiments using harmonizers of different sizes, ranging from 22M to 307M parameters (4th-6th columns). The results in the table below clearly demonstrate the performance improvements achieved both by adding modalities and by scaling the harmonizer module. A minimal sketch of the concat baseline follows the table.
| Dataset | Single modality (fMRI) | Concat | 22M | 86M | 307M |
|---|---|---|---|---|---|
| ABIDE-II | 62.90 | 63.19 | 64.06 | 66.67 | 66.95 |
| ADHD-200 | 67.69 | 68.36 | 69.39 | 70.09 | 70.40 |
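For clarity, a minimal sketch of the concat baseline as described (frozen unimodal embeddings, one trainable linear layer); the feature widths and class count are illustrative:

```python
import torch
import torch.nn as nn

class ConcatLinearProbe(nn.Module):
    """Concat baseline: frozen unimodal embeddings are concatenated and
    passed through a single trainable linear layer."""
    def __init__(self, d_fmri=768, d_t1=768, n_classes=2):
        super().__init__()
        self.head = nn.Linear(d_fmri + d_t1, n_classes)

    def forward(self, z_fmri, z_t1):
        # z_fmri, z_t1: (batch, d_*) pooled outputs of the frozen encoders
        return self.head(torch.cat([z_fmri, z_t1], dim=-1))

logits = ConcatLinearProbe()(torch.randn(8, 768), torch.randn(8, 768))
```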
References:
[1] FlexiViT: One Model for All Patch Sizes. CVPR 2023.
Dear Reviewer wvBG,
Thank you very much for your acknowledgment. Just to ensure clarity from our side, could you please confirm whether we have adequately addressed all your concerns, or if there are any remaining issues we should further clarify?
We greatly appreciate your valuable feedback!
Dear Reviewer wvBG,
As the reviewer–author discussion period will conclude in one day, we wanted to kindly check if there are any remaining issues or clarifications needed from our side. Here we would like to briefly summarise our reply for your quick review:
- Added a quantitative t-SNE analysis of the latent space, showing that our embeddings are constrained more by brain geometry than Brain-JEPA's;
- Provided clarification of the TAPE method, including both a theoretical guarantee and empirical evidence demonstrating TAPE's effectiveness in handling varying TRs;
- Discussed the feasibility of end-to-end training and future work;
- Included additional ablation experiments illustrating the performance gains from incorporating multiple modalities and from introducing the harmonizer module for fusion.
We hope these fully address your earlier comments. Please let us know if any further clarification is needed.
Thank you again for your valuable feedback.
The authors propose three modules to improve fMRI foundation models. First, they propose a multimodal fusion module that learns to fuse uni-modal representations into multi-modal tokens. Second, they show that pre-alignment to the geometric harmonics of the structural morphology of the brain improves performance on the ABIDE and ADHD-200 datasets. Third, the authors propose a patch embedding that can be used with multiple temporal sampling rates. This specific module also allows the authors to perform data augmentation during pre-training.
Strengths and Weaknesses
Strengths
The authors compare their model across a wide range of datasets with many ablation studies, which makes me confident that the results are significant. Moreover, many of the modules the authors discuss are original and provide novel future directions in the field of foundational fMRI models. The TAPE module and the ablation studies in Tables 1-3 and Figure 6, in particular, are great examples of the authors' original contributions and careful evaluations.
Major weaknesses
The current manuscript mostly lacks clarity in terms of explaining the different modules that the authors use. Specifically, Section 3.1.2 is quite unclear, especially lines 155-165, but each of the sub-sections could use clarity improvements.
Minor weaknesses
L126: How does the pre-alignment constrain the fMRI representations? I think I may be understanding this incorrectly, but aren't you using the downsampled harmonics in each ROI as a way to learn positional embeddings? If I understand it correctly, I think it is important to mention that Brain-JEPA [1] uses a similar way to encode positional embeddings, but using gradients instead of harmonics. Especially given the critiques/replies [2, 3] to the original work [4] that is the basis of the harmonics, it is important to ablate a few different ways to learn positional embeddings. I am not arguing that these replies/critiques necessarily supersede the original work or the subsequent reply of the original authors [5], but I do think it is worth mentioning the critiques in the limitations section.
To further improve the paper, the authors could also compare their proposed model with the following work [6]. Since this paper was submitted to arxiv less than 6 months ago, however, I have not taken this into account when determining my scores.
Spelling and Grammar
L123: "... will be selected for learning positional embedding." -> are selected to learn the positional embeddings from.
[1] https://arxiv.org/pdf/2409.19407
[2] https://www.biorxiv.org/content/10.1101/2023.07.20.549785v1.abstract
[3] https://www.biorxiv.org/content/10.1101/2023.10.06.561240v1.abstract
[4] https://www.nature.com/articles/s41586-023-06098-1
[5] https://www.biorxiv.org/content/10.1101/2023.10.06.560797v1.abstract
[6] https://arxiv.org/abs/2502.04892
Questions
For overlapping tasks, the results reproduced by the authors are significantly lower than those reported in the original Brain-JEPA paper [1]:
- HCP-A Flanker: 0.406 (original) vs 0.26 (reproduction)
- ADNI: 76.84 ACC, 86.32 F1 (original) vs 59.60 ACC, 60.78 F1 (reproduction)

Is there a specific reason why the authors' fine-tuning reproductions differ so significantly from the performance reported in the original Brain-JEPA paper [1]? Could this be due to the frozen encoders? If so, can the authors include results that follow evaluation frameworks similar to previous work? I think this would significantly strengthen the work.
Can the authors perform an ablation study to look at how the addition of the ABCD dataset affects model performance? I think this is relevant because in the Brain-JEPA paper [1] the authors only used the UK Biobank dataset.
L198-199: Why do the authors sub-sample the UKB dataset 1-3 times, and the ABCD dataset 1-2 times?
L228: Why are the unimodal encoders frozen while training the MF module? Have the authors ablated this and wouldn't steering the functional and structural encoders to learn multi-modal embeddings improve the representations, since this seems to be a central claim in the paper?
Is there a specific reason that the authors chose to perform ablations in Figure 6 on the ABIDE and ADHD-200 datasets? Would it be more representative to use one neurodevelopment disorder and one neurodegenerative disorder dataset?
This is more of a personal observation, but it seems like the authors use a variety of visual styles in their figures (e.g. Figure 5 vs Figure 6 and Figure 1 vs Figure 3), is there any reason for this?
Limitations
Yes
Final Justification
The authors were able to answer most of my concerns well, and I believe that this paper would be a good addition to NeurIPS 2025.
Formatting Issues
None
Clarification on TAPE
We appreciate the opportunity to clarify Section 3.1.2 (lines 155-165).
In Figure 4, we illustrate the intuition behind TAPE. Given two fMRI time series with different TRs ($s_1 \neq s_2$), generating patches of a fixed temporal duration $\tau$ results in varying patch sizes ($p_{s_1} \neq p_{s_2}$). Thus, a fixed-size embedding kernel is insufficient. Lines 155-165 describe how we transform a kernel of the base size $p_b$ to a kernel of an arbitrary length $p_s$ using the pseudo-inverse of the bilinear transformation matrix $B$, as defined in Equation (2). The matrix $B$ is computed analytically using standard bilinear interpolation. The learnable parameters here are the base kernel weights $\omega$, which are optimized via backpropagation during pretraining. The matrix $B$ provides a deterministic transformation to adapt to different patch sizes corresponding to different TRs. Implementation details can also be found in the code provided in the supplementary material.
To further improve the clarity of lines 155-165, we will first reference Figure 4 to convey the intuition, and add the definition of every symbol on first use at the beginning: “Let $\tau$ be the temporal duration each patch should cover; $s$ the TR; $p_s$ an arbitrary patch length; and $p_b$ the base patch length”. A worked example of the patch-size arithmetic is given below. Please refer to our response to Reviewer wvBG, "Clarification of TAPE" for the theoretical and empirical evidence supporting our TAPE method.
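As a worked example of the patch-size arithmetic (the window length tau = 16 s is illustrative, not a value from the paper):

```python
# With a fixed temporal window tau, a patch spans roughly tau / TR samples.
tau = 16.0  # illustrative window length in seconds
for tr in (0.735, 0.8, 1.6, 2.4):
    print(f"TR = {tr:>5} s -> patch length p_s = {round(tau / tr)} samples")
```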
We appreciate the reviewer’s suggestion regarding clarity improvements. Besides Section 3.1.2, please let us know if any other sections require further clarification.
Implementation and discussion of geometric harmonics
We appreciate the reviewer’s insightful questions and pointer to the recent discussion on [1].
We appreciate the reviewer’s observation regarding our positional embedding strategy. While Brain-JEPA utilizes gradient-based positional embeddings, our approach extends this by integrating both brain geometric harmonics and gradients in a complementary manner. Specifically, as detailed in lines 224–226, we average both brain geometric harmonics and gradients after linear projection for positioning. This combination enables our model to capture both multi-scale oscillatory modes and large-scale gradient axes, promoting structure-function coherence in the latent space. It leads to better downstream performance compared to the gradient-only positioning in Brain-JEPA, as evidenced in Figure 6. A minimal sketch of this combination is given below, and we will make these implementation differences clearer in the revision.
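The sketch below illustrates the described averaging of linearly projected harmonics and gradients; the input widths (200 harmonic modes, 3 gradient components) and the embedding width are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn

class GeometryAwarePosEmbed(nn.Module):
    """Sketch of geometry-aware positioning: harmonics and gradients are
    each linearly projected to the embedding width, then averaged."""
    def __init__(self, n_harmonics=200, n_gradients=3, dim=768):
        super().__init__()
        self.proj_h = nn.Linear(n_harmonics, dim)
        self.proj_g = nn.Linear(n_gradients, dim)

    def forward(self, harmonics, gradients):
        # harmonics: (n_rois, n_harmonics); gradients: (n_rois, n_gradients)
        return 0.5 * (self.proj_h(harmonics) + self.proj_g(gradients))

pos = GeometryAwarePosEmbed()(torch.randn(400, 200), torch.randn(400, 3))
```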
The critiques of [1] focus on the paper’s claim that geometric harmonics, by themselves, can serve as a “winner-take-all” solution for brain dynamics reconstruction, thereby diminishing the role of the structural connectome. However, the critiques do not affect the validity of our geometric pre-alignment. The harmonics in our work are only used to provide geometry-aware positioning; we make no claim that they can fully explain or reconstruct brain dynamics. Moreover, the harmonics are averaged with large-scale functional gradients, so there is no winner-take-all basis. We will acknowledge these points in our revised version. Future work can explore how the structural connectome and other biological principles can be encoded into the model and whether they can further improve brain representation learning and generalizability.
Comparison to the arXiv work
We appreciate the reviewer bringing this work [2] to our attention. However, the absence of publicly available code for this work currently prevents a direct comparison. It proposes a single-modality foundation model focusing exclusively on brain dynamics, similar to Brain-JEPA and BrainLM, whereas our work unifies both structural morphology and functional dynamics into compact 1D tokens. We will acknowledge this recent work in the revised version.
Grammar revision
We thank the reviewer for pointing this out. We will revise the sentence accordingly.
Evaluation of Brain-JEPA
We thank the reviewer for this observation. The experiments were conducted using the official Brain-JEPA implementation for both pretraining and fine-tuning, during which the Brain-JEPA encoder was fully fine-tuned. The reported results were averaged over three random data splits, a more robust evaluation than that of the original Brain-JEPA paper, which used one fixed data split. Therefore, the performance gap could stem from the difference in data splits. The three data-split settings were applied consistently in all of our comparisons. We hope this clarifies the observed differences in performance, and we will explicitly state this in the revised version.
Ablation of ABCD dataset
We thank the reviewer for this valuable suggestion. It is important to highlight that previous methods, such as Brain-JEPA, cannot readily integrate datasets with heterogeneous TR values. In contrast, our proposed TAPE module accommodates datasets with heterogeneous TRs, allowing us to add both UKB and ABCD to the pretraining to cover both the developing and the aging brain. Following the reviewer’s recommendation, we conducted an ablation on ADNI for MCI classification by pretraining BrainHarmonix-F without ABCD data (see table below). It still outperformed the original Brain-JEPA pretrained on the same UKB dataset. We would also like to emphasize that the proposed data augmentation and pre-alignment techniques significantly contribute to the observed performance improvements, as shown in Figure 6. We will include the ablation results in the revised version.
| Model | ACC (%) | F1 Score (%) |
|---|---|---|
| Brain-JEPA using UKB only | 59.60 | 60.78 |
| BrainHarmonix-F using UKB only | 60.67 | 63.34 |
| BrainHarmonix-F using both UKB & ABCD | 61.62 | 64.80 |
Clarification on subsampling times
We appreciate the reviewer’s question. Thanks to our proposed TAPE, our model uniquely handles heterogeneous TR, enabling us to introduce a novel data augmentation strategy for fMRI time series by downsampling. This approach enriches the dataset with a broader range of TR values and enlarges the sample size, significantly boosting model performance as demonstrated in Figure 6. Specifically, we downsample UKB data by factors ranging from 1 to 3, covering a TR range from ~0.7s to ~2.9s, and ABCD data by factors of 1 to 2, achieving a TR range up to 2.4s. Since single-band fMRI typically employs TR values between 2-3s and multi-band fMRI typically below 1s, further downsampling ABCD data would result in TR values exceeding 3s, which is uncommon in practical fMRI acquisitions and could lead to lower efficiency (longer training time).
Frozen unimodal encoders
We thank the reviewer for the insightful question. We chose to freeze the unimodal encoders during the fusion stage to allow the fusion module to flexibly integrate various pretrained unimodal representations. This design choice enables greater modularity and scalability, making it easier to adapt to different unimodal encoders trained under diverse settings without requiring full end-to-end retraining. As such, end-to-end training is not the main focus of this work. We would also like to emphasize that our current fusion module, even when trained with frozen unimodal encoders, already achieves significantly better performance than each unimodal model alone, demonstrating the effectiveness of our fusion strategy.
Moreover, the harmonizer is already pretrained with long input sequences (i.e., 7200 fMRI tokens + 1200 T1 tokens + the number of 1D tokens), resulting in considerable memory requirements during training. Enabling end-to-end training would substantially increase GPU memory consumption and training time, potentially taking much longer time per run under our current infrastructure, limiting our capacity of model development and comprehensive evaluation.
However, we agree with the reviewer that jointly optimizing the unimodal encoders and the fusion module could potentially lead to further performance gains. Exploring efficient training strategies for such long-sequence transformer models, particularly in the context of neuroimaging, is a promising direction and we will discuss it in future work in the revised version.
More ablation studies
We appreciate the reviewer’s thoughtful question. We initially chose ABIDE and ADHD-200 for our ablation studies because both datasets contain heterogeneous TR values. Following the reviewer’s suggestion, we additionally performed ablation studies on ADNI (accuracy (%) in the table below), where we observed a similar trend and performance pattern, reinforcing the effectiveness of our proposed model design. We will include this additional ablation in the revised version.
| Setting | BrainHarmonix w/o pre-alignment | BrainHarmonix-F | BrainHarmonix |
|---|---|---|---|
| with data augmentation | 61.35 | 61.62 | 64.65 |
| w/o data augmentation | 60.07 | 60.11 | 62.94 |
Figure styles
We appreciate the reviewer’s observation regarding the style of our figures. Each figure is designed specifically to best convey the particular type of information presented. Figure 1 serves as a conceptual illustration, whereas Figure 3 visualizes the geometric harmonics generated from actual data. For Figure 5, we use circle sizes to clearly indicate the number of trainable parameters, a variable that is not applicable in Figure 6. However, we agree with the reviewer that a more consistent visual design would benefit readability, and we will refine the figures accordingly in the revised version.
References:
[1] Geometric constraints on human brain function. Nature 2023.
[2] A Foundational Brain Dynamics Model via Stochastic Optimal Control. Arxiv 2025.
I want to thank the authors for answering all of my questions, and responding to the weaknesses I have raised. With regards to the figure styles, I think keeping the same color palette for each figure and choosing to use either shadow or no shadow for the shapes used in the figures will make the paper more visually consistent. Given the authors' response, I have decided to increase my score to an accept.
We greatly appreciate that the reviewer has increased the score to an accept. We will ensure a more consistent visualization in our revised version following the recommendations.
This paper proposes a multimodal foundation model for brain imaging data that unifies structural MRI and fMRI into 1D continuous token representations. The method overcomes shortcomings of prior works by allowing for variable temporal resolution and incorporating structural priors. The foundation model is trained on a large-scale dataset of sMRI and fMRI data, demonstrating convincing strengths in downstream performance and scaling. There were some questions regarding the TAPE method -- mostly with respect to clarification -- but these were well addressed. This paper addresses an important problem in multimodal and heterogeneous data in brain imaging, and has answered some important practical questions with convincing results; I believe this work will have value for neuroimaging applications such as diagnosis.
I therefore recommend the paper for acceptance as a poster.