Unsupervised Representation Learning of Brain Activity via Bridging Voxel Activity and Functional Connectivity
Summary
Reviews and Discussion
The paper discusses a neural network encoding methodology for learning representations of brain data described both in terms of voxel activity and functional connectivity. Building on approaches from image or standard data representations, the paper discusses the unique challenges of neuroimaging data and specific trainable processing to handle them. The paper presents results on several datasets, including the contribution of more comprehensive processing of a previously released dataset. Performance is superior across all datasets by a large margin.
Strengths
The problem is well motivated. Numerous related works are discussed. The approach, while taking inspiration from a number of works, proposes tailored steps for brain data. The significant improvements in performance are very encouraging.
Weaknesses
Since the appendices are not attached to the initial submission, and given the paper's page limits, many things are unclear.
Although Figure 1 organizes the many processes involved, the level of detail is not sufficient to grasp how the dimensions or arrangement of the data change through the processing.
Many terms are not precisely defined and the reader is left to guess the meaning (e.g. "temporal graph"). Sometimes jargon, perhaps from other papers, is not explained. "Functional patching" is not clear to me. Another example: "while in the brain the functionality of each token is important and a different set of tokens should be mixed differently"; I don't understand the word token in this context. Are the segments akin to attention heads? Calling them segments has a connotation of time.
The clarity of the description of the methodology needs improvement:
In Section 3.1's "voxel-mixer section" the subscript is not on the interpolated matrix; it appears only on P and W_flat and isn't consistent. Is this a mistake or meaningful? The dimensions should be listed for these variables to help the reader (same for equation 1). It's not clear to me how softmax and flat operate together.
One symbol appears twice and it's not clear it has the same meaning: once as the length of the functional patch's time dimension in the "voxel-mixer section" and once as the previous timestep in the "temporal patching" section.
At the end of Section 3.2, dimensions are missing and subscripts "token" are not clear.
The objective function has variables that are not clear (Z_V is not defined) and doesn't match the description, which states that mutual information between functional connectivity and voxel activity is maximized. H_voxel and Z_V are both about the voxel activity, while Z_F and another undefined variable are about the functional connectivity. H_time, which is a function of H_voxel, is not used in the objective.
In the notation, one variable is not defined and probably should also be a function of another if the windows of a task are not the same length.
In the "functional patching" section, it is not clear if the patches are contiguous voxels or how are voxels in a function system arranged. It is not clear how linear interpretation would work in such a coordinate system unless this uses the 3D location of the voxels/channels.
Some choices in the processing seem arbitrary and the paper doesn't motivate them all. It's not clear to me why the node uses a weighted average of timestamp encodings while the edge uses a single one.
A weakness is the lack of discussion of hyper-parameter selection for the method or baselines. At the end of the methodology, data augmentation is discussed. This has a number of parameters itself, and can itself account for differences in performance if other methods are not trained on augmented data. The paper states "The effect of the walk length on performance peaks at a certain point, but the exact value varies with datasets", which means that a valid hyper-parameter selection algorithm (that does not have access to testing performance) is needed. Without valid hyper-parameter selection, the impressive results are called into question.
Minor:
In abstract "single beta weight" is jargon. Perhaps a "single weight relating the voxel activity to the task". Also patching is not common in graph terminology, perhaps "local subgraphs". "Temporal graph" is also not clear from context or standard usage.
"jointly learn voxel activity and functional connectivity" -> "jointly learn representations of the voxel activity and functional connectivity"
"minuetes"
Questions
How are the set of edges and functional systems obtained for the different modalities?
How does the linear interpolation work (for both fMRI and EEG)? Are voxel/channel locations needed?
In section 3.2 what is meant by a temporal graph? Why is temporal graph used to describe connections between voxels across different time points?
In the temporal patching section, wouldn't it be better to call it a spatiotemporal random walk?
Is the edge set fixed and how is it defined or computed?
The operation of softmax in (2) is not clear. Is it element wise to give as output a vector that lies in the probability simplex (argsoftmax) or does it literally return the softmax scalar, and if so which dimension does it operate along?
How is hyper-parameter selection performed (including for augmentation)?
Are other methods trained with default hyper-parameters or is a fair search done for them? Is the same data augmentation used for all methods?
Thank you so much for your time and constructive review! We really appreciate it! Please see below for our response to your comments:
Since the appendices are not attached to the initial submission, and given the paper's page limits, many things are unclear.
Response: Thank you for mentioning it. Unfortunately, we realized that our anonymous link was not properly linked to our repository, and so the appendix was missing. Our appendix is now available in the main submission file, and we aimed to address all the concerns raised by the reviewers. We have discussed all the details about the experimental setup, background knowledge, additional related work, our contributions, theoretical results, and additional experimental results in the Appendix.
Although Figure 1 organizes the many processes involved, the level of detail is not sufficient to grasp how the dimensions or arrangement of the data change through the processing.
Response: Thank you for your suggestion. We have added the dimensions to the figure. We further added the output dimension of each equation in the main text.
Many terms are not precisely defined and the reader is left to guess the meaning (e.g. "temporal graph")
Response: Temporal graph is a commonly used term in graph literature, which refers to graphs that can change over time, and each connection is associated with a timestamp. In the revised version, we have added all the background concepts we used in the paper and discussed them in detail (Appendix A).
"Functional patching" is not clear to me.
Response: In order to split the voxels into some groups in which voxels have similar functionality, we suggest using the actual functional systems of the brain and treating each as a patch. That is, we split voxels into some groups, each of which includes voxels corresponding to one of the brain's functional systems. To this end, we use the actual brain functional systems that are provided in (Schaefer et al., 2018). In Appendix A.4, we further discussed patching and have provided illustrative figures.
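To make the grouping concrete, here is a minimal sketch of functional patching as described above (the array shapes and the random system labels are placeholders; in practice the labels come from the Schaefer et al. (2018) parcellation):

```python
import numpy as np

# Illustrative shapes: V voxels, T timepoints.
V, T = 1000, 200
voxel_ts = np.random.randn(V, T)          # voxel activity time series
# One functional-system label per voxel, e.g. derived from the Schaefer et al.
# (2018) parcellation (random placeholders here).
system_labels = np.random.randint(0, 7, size=V)

# Functional patching: each patch collects the voxels of one functional system.
patches = {s: voxel_ts[system_labels == s] for s in np.unique(system_labels)}
for s, patch in patches.items():
    print(f"system {s}: patch of shape {patch.shape}")  # (n_voxels_in_system, T)
```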
I don't understand the word token in this context …
Response: The word token refers to the same concept as patch. To improve consistency and avoid confusion, we revised the paper and used "patch" consistently. We further rewrote the parts you mentioned to improve the presentation and clarity.
In Section 3.1's "voxel-mixer section," the subscript is not on the interpolated matrix … It's not clear to me how softmax and flat operate together ...
Response: Yes, this is by design. In fact, the subscripted term is the corresponding row of the interpolated matrix, which is calculated by multiplying one factor with the corresponding row of the other. Please note that the result of this multiplication is a vector. Accordingly, the input of the softmax function is a vector. We have added all the dimensions to the equations in the revised paper.
One symbol appears twice, and it's not clear it has the same meaning.
Response: Thank you very much for mentioning that. We have revised this part and now use two distinct symbols: one to refer to the length of the functional patch's time dimension and one to refer to the previous timestamp in the "temporal patching" section.
At the end of Section 3.2, dimensions are missing and subscripts "token" are not clear.
Response: Thank you very much for mentioning that. We have removed the word token to improve the consistency and also added the dimensions to Section 3.2.
The objective function has variables that are not clear (Z_V is not defined) and doesn't match …
Response: Thank you very much for bringing it to our attention. This was a typo, and it is fixed in the revised paper. We further added a discussion of this variable and its definition to the paper.
In the notation, one variable is not defined and probably should also be a function of another if the windows of a task are not the same length.
Response: Thank you very much for mentioning this issue. We have defined this variable, which is the length of the time window, in the revised paper. Also, as you mentioned, we have made it a function of the window.
In the "functional patching" section, it is not clear if the patches are contiguous voxels or …
Response: Please note that, in functional patching, we split voxels into some groups, each of which includes voxels corresponding to one of the brain's functional systems. To this end, we use the actual brain functional systems that are provided in (Schaefer et al., 2018). Accordingly, we map each voxel to one of the brain's functional systems based on its location in the brain. We further provide more information about patches in time series and graphs in Appendix A.4 with an illustrative example (Table 4).
However, in the interpolation, we do not need any positional information, as we linearly interpolate the signals (voxel activities). That is, given the voxel signals in a patch, we linearly interpolate them to obtain a fixed number of new signals, where this number is the maximum patch size.
Some choices in the processing seem arbitrary and the paper doesn't motivate them all …
Response: We believe that all components and choices are motivated in the paper. However, in the case you mentioned, we want to kindly bring to your consideration that in temporal graphs, each edge (connection) is associated with a single timestamp. Accordingly, for each connection, we have a single timestamp encoding. On the other hand, for each node, we have the timestamps of its connections, and so we have a set of timestamp encodings, which we aggregate using the weighted average. We have revised this part of the paper to make it clearer.
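For illustration, a toy sketch of this distinction, assuming a simple sinusoidal time encoding and uniform (rather than learned) weights; the function names are ours, not the paper's:

```python
import numpy as np

def time_encoding(t, d=8):
    """Toy sinusoidal encoding of a scalar timestamp."""
    freqs = np.arange(1, d + 1)
    return np.sin(freqs * t)

# Temporal edges (u, v, t): each connection carries a single timestamp,
# so each edge gets exactly one timestamp encoding.
edges = [(0, 1, 0.1), (0, 2, 0.5), (1, 2, 0.9)]
edge_enc = {e: time_encoding(e[2]) for e in edges}

# A node participates in several timestamped connections, so it has a *set*
# of timestamp encodings, aggregated here with a (uniform) weighted average.
node = 0
incident = [time_encoding(t) for u, v, t in edges if node in (u, v)]
weights = np.ones(len(incident)) / len(incident)   # learnable in the actual model
node_enc = np.average(incident, axis=0, weights=weights)
```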
A weakness is the lack of discussion of hyper-parameter selection for the method or baselines …
Response: Thank you for your comment. Unfortunately, this was caused by our mistake in linking an anonymous link to our repository and missing the appendix accordingly. We have provided the appendix in the main file of the revised submission.
The details of the experimental setup, baselines, and hyperparameter tuning are available in Appendix E. To ensure a fair comparison, we use the same hyperparameter selection process as BrainMixer. Also, we fine-tune their training parameters (e.g., learning rate, etc.) following their original papers using grid search. For the sake of fair comparison, we use the same training, testing, and validation data for all the baselines (including the same data augmentation and negative sampling). Also, the hyperparameter tuning is performed using only the validation set. All in all, all the methods use the same training, testing, and hyperparameter tuning procedures, and we ensure that the training, testing, and validation sets are separated.
We also have reported the effect of hyperparameters on BrainMixer performance in Appendix F.1.
Minor
In abstract "single beta weight" is jargon. Perhaps a "single weight relating the voxel activity to the task".
"jointly learn voxel activity and functional connectivity" -> "jointly learn representations of the voxel activity and functional connectivity"
"minuetes"
Response: Thank you very much for your suggestions; we have addressed all the above concerns in the revised paper. We also provide a detailed discussion of temporal graphs in Appendix A.
Also, we wanted to kindly bring to your consideration that patching in graph learning methods has been used in several studies (e.g., [1]), and we believe that using patching instead of local subgraph might cause inconsistency in the paper. However, if the reviewer believes that the local subgraph is a better term, we would be happy to change it.
Questions
How are the set of edges and functional systems obtained for the different modalities?
Response: The process for different modalities is the same as for fMRI. For MEG and EEG, we calculate the Pearson correlation of each pair of signals and then construct the edges based on the 90th percentile of positive correlation for each node. We further discuss this in Appendix E.2.
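A minimal sketch of this construction (channel count, signal length, and variable names are illustrative):

```python
import numpy as np

signals = np.random.randn(32, 500)         # (n_channels, n_timepoints), e.g. EEG/MEG
corr = np.corrcoef(signals)                # pairwise Pearson correlation

edges = set()
for i in range(corr.shape[0]):
    pos = corr[i].copy()
    pos[i] = -np.inf                       # ignore self-correlation
    positive = pos[pos > 0]
    if positive.size == 0:
        continue
    thr = np.percentile(positive, 90)      # 90th percentile of positive correlations
    for j in np.where(pos >= thr)[0]:
        edges.add((min(i, j), max(i, j)))  # undirected edge
print(len(edges), "edges")
```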
How does the linear interpolation work (for both fMRI and EEG)? Are voxel/channel locations needed?
Response: In the interpolation, we do not need any positional information, as we linearly interpolate the signals (voxel activities). That is, given the voxel signals in a patch, we linearly interpolate them to obtain a fixed number of new signals, where this number is the maximum patch size.
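For concreteness, a minimal numpy sketch of such voxel-axis interpolation, assuming only the ordering of voxels within a patch is used (the helper name and sizes are illustrative):

```python
import numpy as np

def interpolate_patch(patch, n_max):
    """Linearly interpolate a (k, T) patch along the voxel axis to (n_max, T).

    Only the ordering of the signals inside the patch is used; no 3D voxel
    coordinates are required.
    """
    k, T = patch.shape
    old = np.linspace(0.0, 1.0, k)
    new = np.linspace(0.0, 1.0, n_max)
    return np.stack([np.interp(new, old, patch[:, t]) for t in range(T)], axis=1)

patch = np.random.randn(13, 200)           # 13 voxels in this functional patch
resampled = interpolate_patch(patch, 20)   # resample to the largest patch size
print(resampled.shape)                     # (20, 200)
```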
In section 3.2 what is meant by a temporal graph? Why is temporal graph used to describe connections between voxels across different time points?
Response: We have provided a detailed discussion on temporal graphs in Appendix A. Temporal graphs are graphs that can change over time, and each connection is associated with a timestamp. Temporal graphs (also known as dynamic graphs) have been commonly used in the literature to model brain connectivity networks. The main reason is that fMRI data is dynamic, and the correlation of ROI activity (or voxel activity) can change over time. Accordingly, the edges (connections) in the brain connectivity graph can also change over time. Therefore, temporal graphs are powerful paradigms for such cases.
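As a toy illustration, a temporal brain connectivity graph of this kind can be represented as a list of timestamped edges built from sliding-window correlations (the window length and threshold below are arbitrary placeholders, not the paper's settings):

```python
import numpy as np

signals = np.random.randn(16, 600)           # (n_voxels, n_timepoints)
window, step, thr = 100, 50, 0.4

temporal_edges = []                           # list of (u, v, t): edge + timestamp
for start in range(0, signals.shape[1] - window + 1, step):
    corr = np.corrcoef(signals[:, start:start + window])
    u, v = np.where(np.triu(corr, k=1) > thr)
    temporal_edges += [(int(i), int(j), start) for i, j in zip(u, v)]
# Connections can appear and disappear between windows, i.e. the graph changes over time.
```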
In the temporal patching section, wouldn't it be better to call it a spatiotemporal random walk?
Response: Thank you very much for your suggestion. We have kept it as temporal patching to avoid any misunderstanding and confusion in the reviewing process, but we will change its name, as you mentioned, in the final version.
Is the edge set fixed and how is it defined or computed?
Response: We have discussed this in Appendix A and Appendix E.2. Given a particular timestamp, the edge set is fixed, but for different timestamps, the edge sets are different. Also, a connection in the edge set between two voxels shows a high statistical correlation between their corresponding signals.
The operation of softmax in (2) is not clear.
Response: The softmax used in (2) is applied row-wise, i.e., independently to each row of the input matrix.
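In other words (a two-line numpy illustration):

```python
import numpy as np

def row_softmax(X):
    """Softmax applied independently to each row of X (each row sums to 1)."""
    Z = np.exp(X - X.max(axis=1, keepdims=True))   # subtract row max for stability
    return Z / Z.sum(axis=1, keepdims=True)

print(row_softmax(np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])).sum(axis=1))  # [1. 1.]
```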
How is hyper-parameter selection performed (including for augmentation)?
Response: We have tuned hyperparameters using grid search on the set of potential values for hyperparameters. The search space for each dataset is reported in Table 6.
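A minimal sketch of such a validation-only grid search (the grid values and the eval_on_validation helper are placeholders; the actual per-dataset grids are in Table 6):

```python
import random
from itertools import product

# Hypothetical search space; the actual per-dataset grids are in Table 6.
grid = {"lr": [1e-4, 1e-3], "walk_length": [4, 8, 16], "hidden_dim": [64, 128]}

def eval_on_validation(cfg):
    # Placeholder: train with cfg on the training split and return the
    # validation-set score (random here only to keep the sketch runnable).
    return random.random()

best_cfg, best_score = None, -float("inf")
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = eval_on_validation(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score
# best_cfg is then evaluated once on the hold-out test set.
```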
Are other methods trained with default hyper-parameters, or is a fair search done for them? Is the same data augmentation used for all methods?
Response: The details of the experimental setup, baselines, and hyperparameter tuning are available in Appendix E. To ensure a fair comparison, we use the same hyperparameter selection process as BrainMixer. Also, we fine-tune their training parameters (e.g., learning rate, etc.) following their original papers using grid search. For the sake of fair comparison, we use the same training, testing, and validation data for all the baselines (including the same data augmentation and negative sampling).
Once again, we thank the reviewer for their time and constructive review!
I want to thank the authors for their detailed responses. Like the paper, the responses are very detailed and now that many things have been clarified and appendix/supplement available I'm raising my score.
The paper proposes to use the ideas from the MLPMixer paper for multi-view classification of functional neuroimaging data. One view is the spatial voxel information and the other is functional connectivity. For each of the views a different model is constructed and applied: for the voxel time series, a model very close to the MLPMixer, and for the functional connectivity, a model that is less clear. The models are pre-trained via self-supervised contrastive pre-training. Evaluation is done in classification and anomaly detection settings, where the latter involves introducing a synthetic anomaly into functional neuroimaging data. Classification performance is interestingly high.
Strengths
- The problem of spatiotemporal data analysis is important to the field of neuroscience and brain imaging.
- The paper demonstrates an interesting application of contrastive pre-training of multi-view networks.
Weaknesses
- The paper is unclear and difficult to follow.
- "Dimension reduction" step of the voxel-Mixer should learn time-window specific spatial components in since logically the voxels needs to be grouped into ROIs. However seems to be reducing time, not space. Notation is fairly mixed up making it difficult to follow what has been done exactly.
- "Functional connectivity encoder" starts with a connectivity graph where does that graph come from? This section is also unclear - the source of patches and how they are mixed is very difficult to discern.
- Evaluation tasks cite evaluation in classifying ADHD and ASD but results are not reported in Table 1.
- Experiments are not clearly explained. Specifically, how were the test and train sets split?
- The contribution to either the ML or Neuroimaging is unclear
- For ML: it appears the paper is proposing to train a model on the functional connectivity graph and another model with a comparable number of parameters on time series. The number of parameters relative to the BNT used in the benchmark roughly doubles since the Temporal Graph Mixer in this paper is doing about what BNT is doing. This parameter doubling not surprisingly leads to improved performance. This is not explained.
- For ML: since classification accuracy is of a major concern, why not compare to Logistic Regression on the FNC data as a much simpler model with much fewer parameters?
- For Neuroimaging: The paper does not show what the learned voxel groupings are. Note, all voxel-level analysis papers with which the current submission contrasts its work obtain biologically interpretable voxel maps. The current model supposedly should capture spatial maps like that since it operates on voxel-level input from subjects. However, these maps are neither presented nor discussed.
- For Neuroimaging: What is the significance of Table 2? This problem is almost a toy problem on a synthetic task. This task is barely relevant to the neuroimaging field, and the cited workshop paper is the only one that uses this synthetic data generation approach for testing the method.
- For Neuroimaging: Why are the ROIs in Figure 3 so carefully delineated? The described methods would be unlikely to produce such precisely carved spatial maps. This result is unclear.
- Some relatively minor problems:
- Abstract: "toward revealing the understanding" what do you exactly mean?
- Abstract: "we bridge this gap", which gap? This is strangely phrased
- The first paragraph of the introduction section is awkwardly written and can gain a lot from a rewrite.
- Page 2: Limitations state that most studies focus on either voxel level or functional connectivity. However, to study functional connectivity researchers usually first extract intrinsic networks from the voxel level, by running ICA, RBM, NMF, Dictionary Learning or any other decomposition. The statement does not seem to hold.
- Page 3, last paragraph of Section 2: MEG and EEG do not readily provide localization of the time series to parts of the brain. The sensor-space recordings need to be projected to the brain volume or surface by overcoming the difficulties of an ill-posed inverse problem.
- Note that a model with the characteristics you describe for BrainMixer on page 3 is available:
- Mahmood U, Fu Z, Calhoun V, Plis S. Glacier: Glass-Box Transformer for Interpretable Dynamic Neuroimaging. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023 Jun 4 (pp. 1-5). IEEE.
- and a paper from 2022 by the same author
- On page 4, Notation: a symbol is not defined.
- On page 4, Notation: is one symbol the same as another? It would be simpler to follow the paper if the notation were consistent.
- Page 5: "even after 2 minuetes"
- Appendix is missing
Questions
see weaknesses
Thank you so much for your time and constructive review! We really appreciate it! Please see below for our response to your comments:
Presentation
Regarding the clarity of the presentation, we aimed to address all your suggestions in the revised paper and improve the clarity.
"Dimension reduction" step of the voxel-Mixer should learn time-window specific …
Response: The main goal of the VA encoder is to learn a low-dimensional representation for each voxel. Accordingly, in no part of BrainMixer do we aim to reduce the voxel dimension, as doing so would lose information about the removed voxels. Please note that in the VA Encoder, we do not combine voxels in each patch to obtain ROI encodings. The patching procedure here is used to fuse information across a group of voxels, not to group voxels into ROIs. Accordingly, as you mentioned, this step indeed reduces the time dimension. We further added dimensions to the equations to make them clearer.
"Functional connectivity encoder" starts with a connectivity graph …
Response: Thank you for mentioning that. In the revised version, we have discussed and added all the necessary background knowledge used in the paper in Appendix A and cited it in the main text. We want to kindly bring to your consideration that in Section 3 (Notation), we explain that the adjacency matrix of the connectivity graph is the correlation matrix. However, in the revised version, we further discuss constructing the connectivity graph in the appendix (please see Appendix A.3 and E.2). We followed previous studies and constructed the brain connectivity graph using the pairwise Pearson correlation of voxel time series activity. The only difference between our approach and existing studies in constructing brain connectivity graphs is that we consider the pairwise correlation at the voxel-level activity. Accordingly, in the constructed connectivity graph, each node is a voxel, and each connection shows a high correlation between the corresponding time series activity.
In this part, we do not use any source for the patches. In fact, as we explained in the "Temporal Patching" section, we propose to use temporal random walks to define patches. More specifically, we sample temporal random walks starting from each node and simply consider the union of all sampled walks as a patch.
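As an illustration, a temporal-walk patch could be sampled roughly as follows (the non-decreasing-timestamp rule, walk length, and number of walks are assumptions for exposition, not the paper's exact procedure):

```python
import random
from collections import defaultdict

# Temporal edges (u, v, t); adjacency indexed by node.
temporal_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 2, 2), (1, 3, 4)]
adj = defaultdict(list)
for u, v, t in temporal_edges:
    adj[u].append((v, t))
    adj[v].append((u, t))

def temporal_walk(start, length):
    """Sample a walk whose edge timestamps never decrease."""
    walk, node, last_t = [start], start, -float("inf")
    for _ in range(length):
        candidates = [(v, t) for v, t in adj[node] if t >= last_t]
        if not candidates:
            break
        node, last_t = random.choice(candidates)
        walk.append(node)
    return walk

# Patch of a node = union of the nodes visited by several walks starting from it.
patch = set().union(*(temporal_walk(0, length=4) for _ in range(5)))
```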
Evaluation tasks cite evaluation in classifying ADHD and ASD but results are not reported in Table 1.
Response: Thank you very much for mentioning that. The graph anomaly detection task can be seen as a binary classification task, where 0 means "abnormal" and 1 means "normal." In fact, in the graph anomaly detection task on ADHD and ASD, we aim to classify ADHD and ASD. To avoid confusion, in the revised paper, we changed this definition and referred to Table 1 as multi-class brain classification tasks.
Experiments are not clearly explained. Specifically, how were the test and train sets split?
Response: Thank you for mentioning it. Unfortunately, we realized that our anonymous link was not properly linked to our repository, and so the appendix was missing. The appendix is now available in the main submission file, and we discussed our experimental design in Appendix E. In our experiments, we ensure that the training, testing, and validation sets are separated, and the final performance is reported on the hold-out test set. Also, we split the data into 70% training, 10% validation, and 20% test sets.
Some Relatively Minor Problems:
Abstract: "toward revealing the understanding" what do you exactly mean?
The first paragraph of the introduction section is awkwardly written and can gain a lot from a rewrite.
On page 4 notation is not defined.
On page 4 notation is the same as ?
Page 5: "even after 2 minuetes".
Response: Thank you very much for mentioning them. In the revised version, we re-wrote some parts and have addressed all of them.
Abstract: "we bridge this gap", which gap? This is strangely phrased
Page 2: Limitations state that most studies focus on either voxel level or functional ...
Response: Existing studies have studied the brain either at the level of voxel activity or functional connectivity. Indeed, as you mentioned, functional connectivity networks are constructed from voxel activity. However, after constructing the network, we lose the actual activity of the voxels. Note that two different pairs of signals can have the same correlation while being completely different. The functional connectivity graph only captures the correlation of brain signals, not the actual activity. On the other hand, studying the brain at the level of voxel activity cannot capture higher-order interactions of voxels in functional systems. Accordingly, there is a lack of methods that can encode brain activity at both of these levels.
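A small numerical example of this point (arbitrary signals):

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 200)
a1, b1 = np.sin(t), np.sin(t)                        # low-amplitude pair
a2, b2 = 100 * np.sin(t) + 50, 100 * np.sin(t) - 50  # very different activity levels

r1 = np.corrcoef(a1, b1)[0, 1]
r2 = np.corrcoef(a2, b2)[0, 1]
print(r1, r2)   # both 1.0: identical correlation, completely different voxel activity
```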
Contributions
General response for this part: The main contribution of this paper is to design an effective and powerful model that can learn low-dimensional representations of voxel activity for use in different downstream tasks. We believe that this paper has contributed to both ML and Neuroimaging:
1. We present a novel multivariate timeseries encoder that employs a novel dynamic attention mechanism to bind information across both voxel and time dimensions.
2. We present a novel graph learning method for encoding the FNC, which employs a novel temporal pooling strategy. We further provide a theoretical guarantee for the power of this approach.
3. We present a novel self-supervised pre-training framework. This framework does not rely on labeled data, unique properties of a specific neuroimage modality, or computationally expensive negative sampling, which supports its potential to be a backbone for future foundation models on neuroimaging data.
4. To the best of our knowledge, we performed one of the most extensive experimental evaluations with seven datasets, three neuroimage modalities, five different downstream tasks, and 14 baselines. Most existing studies have focused on one downstream task and one neuroimage modality. The results show the superior performance of our method, and ablation studies on all elements of the BrainMixer suggest that each element is beneficial and helps to improve performance. Moreover, these ablation studies show that (i) each encoder of the BrainMixer alone can outperform all baselines, (ii) BrainMixer, even without pre-training, can outperform baselines (including pre-trained baselines), (iii) Replacing each encoder with state-of-the-art existing methods damages the performance, indicating that FC and VA Encoders are more powerful than their counterparts.
We want to kindly bring to your consideration that coming up with one architecture like BrainMixer that (i) has elements that each achieve competitive performance alone and (ii) is suitable for different tasks and modalities is highly challenging and usually has not been done in existing studies.
For ML: it appears the paper is proposing to train a model on the functional connectivity …
Response: Please note that the larger number of components does not necessarily mean that the model has more parameters. In fact, the BNTransformer in several tasks has at least three times more parameters than the FC Encoder of BrainMixer. Also, we want to kindly bring to your consideration that the FC Encoder alone (without the VA Encoder) outperforms existing methods (Table 3, row 3). Furthermore, replacing the FC Encoder with BNTransformer damages the performance (Table 3, row 6). Accordingly, we believe that our experiments support the claim that the superior performance of BrainMixer is not because of a larger number of parameters but because of its design.
For ML: since classification accuracy is of a major concern, why not compare to Logistic Regression …
Response: Logistic Regression is a simple model whose power is often not comparable to state-of-the-art methods. Accordingly, in our experimental evaluations, we have used 14 state-of-the-art machine learning models as baselines. While BrainMixer significantly outperforms all the baselines, several of these baselines significantly outperform Logistic Regression in classification tasks [1].
For Neuroimaging: The paper does not show what are the learned voxel groupings. Note, all voxel-level analysis papers with ...
Response: We respectfully disagree with the statement that all voxel-level analysis papers provide biologically interpretable voxel maps. [2, 3] are examples of studies that use representation learning of voxel activity to decode the visual cortex. Our experiments on the BVFC datasets are very close to these studies, as we aim to decode fMRI to predict the label of the seen images. In this study, we have evaluated the representations of voxels learned by BrainMixer in various tasks, datasets, and neuroimage modalities, which all show the superior performance of BrainMixer. We believe that all these experiments strongly support our claims and also show the potential of BrainMixer in various downstream tasks. We also have the experiment "How Does BrainMixer Detect GAN-Generated Images?", which we believe shows that the voxel representations learned by BrainMixer are meaningful. Additional details can be found in Appendix F.
However, if the reviewer believes that performing additional experiments on new downstream tasks and providing biologically interpretable voxel maps in those tasks are required, we would be happy to add any other experiments that the reviewer suggests.
Once again, we thank the reviewer for their time and constructive review!
Based on the reviewer’s summary of our paper, we want to kindly bring to your consideration that:
1: Difference with MLPMixer:
The paper proposes to use the ideas from the MLPMixer paper.
For the voxel time series is a model very close to the MLPMixer ...
Response: The main contributions of this paper (i.e., the dynamic attention mechanism, using patches with different sizes, overlapping patches, a new graph pooling mechanism with a theoretical guarantee, functional patching in time series data, and temporal patching in graph-structured data) are all different from the MLPMixer. In fact, inspired by the success of MLPMixer, we designed BrainMixer based on simple MLPs. We have added a new section in Appendix A.4 and extensively discuss the differences between MLPMixer and BrainMixer's encoders.
2: Different Neuroimage Modalities:
for multi-view classification of functional neuroimaging data …
Response: One of the main advantages of BrainMixer is its generalizability to different neuroimage modalities (fMRI, EEG, and MEG). We support this claim by performing several experiments on the BVFC-MEG and TUH-EEG datasets, which consist of the MEG and EEG modalities, respectively. BrainMixer's superior performance stems from the fact that simple brain network-based methods cannot capture the long-range signals of EEG or MEG, and also, some time series-based methods (e.g., MVTS) are specifically designed for the EEG modality and not for fMRI or MEG.
3: Additional Experiments on Regression Tasks:
Evaluation is done in classification and anomaly detection settings
Response: We want to kindly bring to your consideration that in the revised version of the paper, we have also added additional experimental results on regression tasks. The results are reported in Appendix F.4.
4: Soundness
Soundness: 2 fair
Response: To the best of our knowledge, we performed one of the most extensive experimental evaluations with seven datasets, three neuroimage modalities, five different downstream tasks, and 14 baselines. The results show the superior performance of our method, and ablation studies on all elements of the BrainMixer suggest that each element is beneficial and helps to improve performance. In the revised version, we further performed statistical analysis and showed that the results are significant. In the appendix, we have provided nine pages of additional experimental results and discussed the effect of each hyperparameter on the performance of BrainMixer. We further theoretically show that our proposed pooling method is a universal approximator of multi-set functions (i.e., can learn any graph pooling method). We believe that all these experimental and theoretical results excellently support our claims. However, we would be happy to add any additional experimental or theoretical results that the reviewer believes are required to improve the soundness of the paper.
5: Contribution
Contribution: 2 fair
Response: Please see our general response as well as the second part of our response to you. We also have further discussed our contributions in Appendix C.1.
6: Presentation
Presentation: 1 poor
Response: We have revised the paper and have addressed all the concerns raised by the reviewers regarding the clarity and presentation.
Dear reviewer JxFN,
Once again we sincerely thank you for your time and helpful comments. We hope our response and revised paper have adequately addressed your concerns. Since the author-reviewer discussion period ends soon, we would appreciate it if you could let us know about any unsolved concerns you might have about the paper. We are more than happy to answer your further questions.
Contributions (Cont.)
For Neuroimaging: What is the significance of Table 2? This problem is almost a toy problem on a synthetic task …
Response: We first want to kindly bring to your consideration that the brain-level anomaly detection task has ground-truth labeled data, and it does not use synthetic anomalies. A detailed description of the datasets and experimental setup can be found in Appendix E. Also, please note that anomaly detection in the human brain is an important downstream task, which can help identify deviations from normal brain activity that might be associated with a brain disease or disorder. However, this task, by its nature, does not have ground-truth labeled anomalies. That is, understanding abnormal activity in the human brain is still an active research area, and for most diseases, we do not even know specific biomarkers. Accordingly, this makes the evaluation of anomaly detection methods challenging. To this end, for the quantitative analysis of edge- and voxel-level AD, we follow machine learning-based anomaly detection methods [4] and use synthetic anomalies to measure the accuracy of our method and compare it with the baselines. On the other hand, studies of anomaly detection in the human brain usually do not compare their methods to others and report only their findings, followed by a discussion about the consistency of these findings with previous studies [5, 6]. We follow the literature, and in the edge-level and voxel-level AD tasks, where we do not know the ground-truth anomalies, we report our findings and show that they are consistent with previous studies (Figure 3). To the best of our knowledge, we use both types of evaluation that exist in the literature (synthetic data in a part of Table 2 and consistency with previous studies on page 9).
To the best of our knowledge, there is no dataset consisting of ground truth abnormal brain activity at the edge level or voxel level. However, if the reviewer is aware of any such datasets, we would be happy to report the BrainMixer performance on those datasets.
We also want to kindly bring to your consideration that most existing studies only have focused on a single task (e.g., brain classification). BrainMixer, in addition to these two tasks (i.e., edge-level and voxel-level AD), shows promising performance in multi-class brain classification, brain anomaly detection, and regression tasks.
For Neuroimaging: Why the ROIs in Figure 3 are so carefully delineated?
Response: In Figure 3, we have used BrainPainter [7] with the Desikan-Killiany atlas. This visualization software is used in several studies like [8]. We further discussed these details in Appendix E.5. We would be happy to visualize our approaches with any other software that the reviewer suggests.
Some Relatively Minor Problems (Cont.):
Note, that a model that has the characteristics you describe for BrainMixer …
Response: Thank you very much for bringing this work to our attention. Contrary to our study, in which we aim to learn low-dimensional representations for voxels, this paper has focused on learning brain connectivity structure. Moreover, the architectures of approaches are significantly different (i.e., dynamic vs. static attention, MLP-based vs. Transformer-based, patching, etc.). We further discuss the differences between BrainMixer and existing studies in Appendix C.
References
[1] BrainNNExplainer: An Interpretable Graph Neural Network Framework for Brain Network based Disease Analysis. IMLH ICML 2021.
[2] Mind Reader: Reconstructing complex images from brain activities. NeurIPS 2022.
[3] Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors. NeurIPS 2023.
[4] A Comprehensive Survey on Graph Anomaly Detection with Deep Learning. TKDE Journal 2021.
[5] Detecting network anomalies using forman–ricci curvature and a case study for human brain networks. Scientific Reports 2021.
[6] Microstructural abnormalities in the combined and inattentive subtypes of attention deficit hyperactivity disorder: a diffusion tensor imaging study. Scientific Reports 2014.
[7] BrainPainter: A software for the visualisation of brain structures, biomarkers and associated pathological processes. MICCAI 2019
[8] Uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with Subtype and Stage Inference. Nature Communications 2018.
This manuscript proposes a self-supervised framework, BrainMixer, that encodes time series and functional connectivity. For the time series, the authors adapted MLP-Mixer by introducing TimeMixer and VoxelMixer. For functional connectivity, the authors presented a Temporal Graph Mixer with a Temporal Pooling Mixer. For each encoder, they developed additional patching strategies. To train the multi-view model, the authors adapted an objective based on the maximization of mutual information (using InfoNCE estimator) between global embeddings of the time series and its functional connectivity and between local-global embeddings of the time series. The results show improvements in performance across multiple datasets, modalities, and baselines.
Strengths
- The authors presented experiments with multiple datasets and modalities (fMRI, EEG, MEG) and compared the proposed model to various baselines.
Weaknesses
- Experiments:
- The experimental design is poorly described. It is unclear whether the manuscript uses all available data during unsupervised/self-supervised pre-training or only the training set. Otherwise, when all the data is used, the self-supervised model may already remember the data, boosting the performance in downstream tasks on the test subset.
- The cross-validation is unclear. Do you use training, validation, and hold-out sets? Validation sets must be used only to determine hyperparameters and checkpoints. The final performance must be reported on the hold-out test set.
- Furthermore, did you ensure that splits do not contain the same subjects?
- The comparison in Table 2 seems unfair when the setting is not described: the type of encoders and their capacity, the type of modalities used for downstream tasks, and whether pre-training is used. Baselines can use architectures with different capacities or might not utilize pre-training; hence, the performance might be worse due to capacity, use of concatenated embeddings from fMRI and FNC, or warm start.
- Rigor:
- Table 1: There is no description of class distribution; hence, using Accuracy in Table 1 is unjustified, and you need to use additional metrics.
- Table 2: It is unclear whether the models give well-calibrated predictions because AUC's shortcoming is that it does not measure whether a prediction is calibrated. Hence, you might have better performance with worse calibration.
- Table 2: Anomaly detection asks for AUC PR because there are many "normal" cases and very few "anomalous" cases.
- Statistical analysis for Table 2 and Table 3 has not been performed. Please compare statistically the performance of the models (best versus other) and then correct p-values for multiple comparisons.
- Clarity:
- The manuscript should clearly show that, specifically, joint pre-training of fMRI with Functional Connectivity is beneficial as the first contribution. The second contribution is the improvement of MLPMixer to fMRI and functional connectivity. The third contribution is functional patching. Currently, it is not clear whether the main performance comes from the proposed encoder architectures, functional patching, the unsupervised multimodal pre-training itself, or concatenated embeddings from fMRI and FNC. There are a lot of contributions that could be ablated and compared separately with other baselines.
- The section "How Does Brain Detect GAN Generated Images?" has a very cramped discussion, and the experiment design is unclear. What was the performance of the GAN used in synthesized images? Also, the differences are shown only visually, and no numerical values are shown on how these distributions differ globally and locally (region-wise).
- Figure 3 shows the distribution of detected abnormalities. How do you get the distribution? Where is the color bar? How do you test? Do you show the p-value, or do you show the effect size? Have you done FDR corrections based on p-values?
- Missing related work:
- MLP mixer has been applied previously to fMRI data (Geenjaar et al., 2022). Additionally, in the same work, spectral clustering was used to develop a patching strategy for fMRI data. I have not found ablation for the patching approach in this work; it is unclear how the proposed patching is better.
Geenjaar, Eloy, et al. "Spatio-temporally separable non-linear latent factor learning: an application to somatomotor cortex fMRI data." arXiv preprint arXiv:2205.13640 (2022).
Questions
- The appendix is missing. The supplementary material is empty (https://anonymous.4open.science/r/br-CD4D/README.md).
Thank you so much for your time and constructive review! We really appreciate it! Please see below for our response to your comments:
Experiments
The experimental design is poorly described. It is unclear whether the manuscript uses all available data …
The cross-validation is unclear. Do you use training, validation, and hold-out sets? Validation sets must be …
Response: Thank you for mentioning it. Unfortunately, we realized that our anonymous link was not properly linked to our repository, and so the appendix was missing. The appendix is now available in the main submission file, and we discussed our experimental design in Appendix E. In our experiments, we ensure that the training, testing, and validation sets are separated. Also, in the pre-training phase, we only use training and validation sets and keep test sets untouched. The final performance is reported on the hold-out test set. Accordingly, the model does not remember the data in pre-training, and its superior performance comes from its design and architecture.
Furthermore, did you ensure that splits do not contain the same subjects?
Response: Yes, we ensure that our splits are valid and do not leak information about the test data. That is, in downstream tasks that we aim to predict subject labels, we ensure that splits do not contain the same subjects. Also, in BVFC datasets, in which we aim to predict the label of the seen objects based on the fMRI response, we ensure that the same object is not in both training and test sets.
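For illustration, a subject-level split of this kind can be sketched with scikit-learn's GroupShuffleSplit (the data, subject IDs, and proportions below are placeholders matching the 70/10/20 split described above):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.randn(100, 16)                 # 100 scans, toy features
subjects = np.repeat(np.arange(25), 4)       # 4 scans per subject (placeholder)

# First carve out 20% of subjects as the hold-out test set ...
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_val_idx, test_idx = next(gss.split(X, groups=subjects))
# ... then split the remainder into train (70%) and validation (10%) by subject.
gss2 = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=0)  # 0.125*0.8 = 0.10
train_idx, val_idx = next(gss2.split(X[train_val_idx], groups=subjects[train_val_idx]))

# train_idx/val_idx index into the train+val subset; no subject overlaps the test set.
assert not set(subjects[train_val_idx][val_idx]) & set(subjects[test_idx])
```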
The comparison in Table 2 seems unfair when the setting is not described …
Response: We have provided the details of our experimental design in Appendix E. The modality in BVFC, ADHD, ASD, and HCP is fMRI; the modality in BVFC-MEG is MEG; and the modality of TUH-EEG is EEG. Pre-training is used for all the baselines that support pre-training, and the pre-training setting is the same as for BrainMixer. Moreover, we performed hyperparameter tuning for the baselines to ensure a fair comparison. We also want to kindly bring to your consideration the following:
(1) Pre-training is an ability of the model, and it is the advantage of our design that BrainMixer is capable of unsupervised pre-training. This unsupervised pre-training does not use additional data and also does not necessarily mean that the capacity of BrainMixer is more than baselines.
(2) In Table 3, we report the performance of BrainMixer without pre-training. The results show that BrainMixer, even without pre-training, outperforms baselines (including pre-trained baselines!).
(3) In the revision, we further perform an additional ablation study and replace each of the FC Encoder and VA Encoder with the best brain network encoder (i.e., BNTransformer) and time series encoder (Time Series Transformer). The results are reported in Table 3. The results show that these replacements damage the performance, indicating the power of the FC Encoder and VA Encoder compared to other brain network encoders and time series encoders.
(4) We perform an ablation study and remove each fMRI and FNC (Results are in Table 3). The results show that each FC encoder and VA encoder separately can outperform the baselines.
(5) While we have not discussed it in the paper, we observed that most Transformer-based baselines (e.g., BNTransformer) have at least three times more parameters than their counterparts in our model (e.g., FC Encoder).
Rigor
Table 1: There is no description of class distribution; hence, using Accuracy in Table 1 is unjustified ...
Response: We have discussed the statistics of the datasets in Appendix E.2 (Table 5). The downstream tasks in Table 1 are multi-class classification with the almost uniform class distribution. We followed existing studies and used Accuracy as the metric. We further evaluate the Top-1 performance in Table 9. We would be happy to evaluate the performance of BrainMixer with any other metrics that the reviewer believes are more suitable for these tasks.
Table 2: It is unclear whether the models give well-calibrated predictions …
Response: Thank you for mentioning that. We also believe that the reliability, i.e., calibration (along with other metrics like explainability and fairness) of models is important. However, the main goal of this study is to design an accurate and powerful model that can learn the representation of voxels for use in downstream tasks. Accordingly, we followed existing studies in this research area and used Accuracy (for multi-class classification) and AUC-PR (for binary classification). We further showed the potential of BrainMixer in regression tasks. Designing reliable machine learning models is indeed an active area of research, and so we added this discussion in our “Limitation and Future Work” section in Appendix G and left the adoption of BrainMixer to have well-calibrated predictions for future studies.
Rigor (Cont.)
Table 2: Anomaly detection asks for AUC PR because there are many "normal" cases and very few "anomalous" cases.
Response: Thank you very much for mentioning that. By using AUC, we meant AUC-PR, which, as you mentioned, is the most suitable metric for unbalanced binary classification tasks. We revised the paper to avoid misunderstanding and used AUC-PR to make this point clear.
Statistical analysis for Table 2 and Table 3 has not been performed. Please compare statistically the performance of the models (best versus other) and then correct p-values for multiple comparisons.
Response: Thank you very much for your suggestion. We conducted paired t-tests to assess the statistical significance of the results in Table 2. BrainMixer consistently outperformed all competitors across tasks and experiments, with significant p-values in 32 out of 34 cases. We also revised the discussion of results in the paper and highlighted significant results in blue.
Clarity
The manuscript should clearly show that, specifically, joint pre-training of fMRI with Functional Connectivity is beneficial as the first contribution.
Response: We performed an ablation study (Table 3) and reported the performance of BrainMixer without pre-training. The results show that pre-training of BrainMixer by maximizing the mutual information of encodings obtained from FNC and fMRI is beneficial and improves performance.
The second contribution is the improvement of MLPMixer to fMRI and functional connectivity
Response: We performed an ablation study (Table 3) and reported the performance of BrainMixer without (1) dynamic attention, (2) temporal and functional patching, (3) pooling strategy, and (4) time encoding. The results show that all the improvements we proposed are important and beneficial.
The third contribution is functional patching.
Response: In the revision, we performed an additional ablation study on functional and temporal patching and replaced them with eight other baselines. The results are reported in Table 7 and show that Functional and Temporal patching, which we proposed for VA and FC Encoders, are more effective than other patching methods.
The section "How Does Brain Detect GAN Generated Images?" has a very cramped discussion …
Response: Due to the 9-page space limit, we reported the results in the main paper and discussed the details of the experimental setup in Appendix F.6. We want to kindly bring to your consideration that the original data, including GAN-generated images, are all from a previous study [THINGS dataset], and are not designed by us. Also, please note that the GAN is used to generate non-realistic and non-recognizable objects. Accordingly, even a GAN with poor performance can provide us with the images that we need.
In this experiment, we split the test set into two groups based on BrainMixer's prediction: (1) data samples that BrainMixer has detected as normal and (2) data samples that BrainMixer has detected as abnormal. We report the distribution of fMRI responses that BrainMixer found abnormal and the distribution of fMRI responses that BrainMixer found normal. Interestingly, while the distributions share similar patterns in lower levels (e.g., V1 and V2 voxels), higher-level voxels (e.g., V3) are less active. This drop in the V3 activity is ~ 57%. These results are compatible with our expectation about the hierarchical structure of the visual cortex and so support that BrainMixer can learn a powerful representation for voxel activity.
Figure 3 shows the distribution of detected abnormalities. How do you get the distribution …
Response: Thank you for asking about that. We think there might be a misunderstanding about this figure. This figure does not compare the brain networks of subjects in the ADHD group and the healthy control group. In fact, we train our model on the healthy control group and then test it on the ADHD group to detect abnormal brain activity. This figure shows the distribution of the number of times we find a brain region to be an anomaly. We expect repeatedly abnormal brain regions in the brains of subjects in the ADHD group to be correlated with some symptoms of ADHD. Interestingly, we found that these repeatedly abnormal brain regions are also discussed in previous studies, which use different approaches, as brain regions whose abnormal activity might be correlated with some symptoms of ADHD. We believe these results show the potential of BrainMixer and can motivate and help future research studies to understand ADHD.
We revised the paper to make this point clearer and also added the color bar.
Rigor (Cont.)
Response: Thank you very much for your suggestion. We conducted paired t-tests to assess the statistical significance of the results in Table 2. BrainMixer consistently outperformed all competitors across tasks and experiments, with significant p-values in 32 out of 34 cases. We also revised the discussion of results in the paper and highlighted significant results in blue.
For the T-test, you must check the normality assumptions. One can use Wilcoxon when the assumptions are ill-met. Since you compare multiple models in tables, you must correct p-values for multiple comparisons (e.g., Holm correction).
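For instance, one way to implement this suggestion with scipy and statsmodels (the per-fold scores below are placeholders):

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
best = rng.normal(0.90, 0.02, size=10)                               # per-fold scores of best model
others = {f"baseline_{k}": rng.normal(0.85, 0.03, size=10) for k in range(5)}

pvals = []
for name, scores in others.items():
    diff = best - scores
    # Paired t-test only if the paired differences look normal; otherwise Wilcoxon.
    if shapiro(diff).pvalue > 0.05:
        p = ttest_rel(best, scores).pvalue
    else:
        p = wilcoxon(best, scores).pvalue
    pvals.append(p)

reject, p_corrected, _, _ = multipletests(pvals, method="holm")      # Holm correction
```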
Clarity
In fact, we train our model on the healthy control group and then test it on the ADHD group to detect abnormal brain activity. This figure shows the distribution of the number of times we find a brain region as an anomaly. We expect repeated abnormal brain regions in the brains of subjects in the ADHD group to be correlated to some symptoms of ADHD.
How would you guarantee that the model captures exactly the abnormalities of ADHD when you train the model only on healthy subjects? How do you ensure the abnormalities are not due to poor model generalization to unseen groups/domains? What if you apply your model to other unseen data with different diseases and will get similar results?
Missing related work:
MLP mixer has been applied previously to fMRI data …
Response: Thank you very much for bringing this work to our attention. We added a new subsection in Appendix C and discussed this study. We wanted to kindly bring to your consideration that our encoders are significantly different from the MLP-Mixer. MLP-Mixer uses regular grid patches, while FC and VA Encoders need to learn from non-grid data, which is a significant challenge. The diverse length of patches, dynamic attention, graph pooling, etc., are other novel parts of our encoders and are different from MLP-Mixer. To make this point clearer, we added a section in Appendix A.4 with illustrative examples and figures and highlighted the similarities and differences between the FC Encoder, VA Encoder, and MLP-Mixer.
Thank you for your suggestion of adding an ablation study on patching methods. We previously had an ablation study on replacing our functional patching with random patching (Table 3). In the revision, we further added a subsection in Appendix F and performed an ablation study on functional and temporal patching and replacing them with eight other baselines. The results are reported in Table 7 and show that Functional and Temporal patching, which we proposed for VA and FC Encoders, are more effective than other patching methods (including the spectral clustering patching mentioned in (Geenjaar et al., 2022)).
Once again, we thank the reviewer for their time and constructive review!
Experiments
(1) Pre-training is an ability of the model, and the advantage of our design is that BrainMixer is capable of unsupervised pre-training. This unsupervised pre-training does not use additional data and also does not necessarily mean that the capacity of BrainMixer is more than baselines.
I do not think it is a good idea to mix the architecture with the objective and discuss it as the model. The pre-training objective is usually agnostic. Even the Deep InfoMax objective is agnostic and can be applied with both CNN (Bachman et al. 2019) and transformers (Li et al. 2021). For the Deep InfoMax, we only need to ensure we have some form of local features. Hence, I think you need to control the architecture capacity. For example, the work (Fedorov et al., 2021) compared supervised or self-supervised unimodal (AE and DIM) or multimodal objectives (CCA and DIM-based) using the same backbone architecture, while the differences only were in the projection heads or having a decoder.
Furthermore, comparing the model's capacity and checking whether the performance can scale with the architecture capacity is crucial when proposing the new backbone. This would be insightful for a general machine-learning community.
Bachman, Philip, R. Devon Hjelm, and William Buchwalter. "Learning representations by maximizing mutual information across views." Advances in neural information processing systems 32 (2019).
Li, Chunyuan, et al. "Efficient self-supervised vision transformers for representation learning." arXiv preprint arXiv:2106.09785 (2021).
Fedorov, Alex, et al. "Self-supervised multimodal domino: in search of biomarkers for alzheimer’s disease." 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). IEEE, 2021.
(2) In Table 3, we report the performance of BrainMixer without pre-training. The results show that BrainMixer, even without pre-training, outperforms baselines (including pre-trained baselines!).
If the BrainMixer outperforms the baselines without pre-training, what is the benefit of the proposed self-supervised pre-training?
Rigor
Response: Thank you for mentioning that. We also believe that the reliability, i.e., calibration, of models (along with other properties like explainability and fairness) is important. However, the main goal of this study is to design an accurate and powerful model that can learn representations of voxels for use in downstream tasks. Accordingly, we followed existing studies in this research area and used Accuracy (for multi-class classification) and AUC-PR (for binary classification). We further showed the potential of BrainMixer in regression tasks. Designing reliable machine learning models is indeed an active area of research, so we added this discussion to our “Limitation and Future Work” section in Appendix G and leave adapting BrainMixer to produce well-calibrated predictions to future studies.
Sorry for the lack of clarity in my comment. I am not concerned with reliability or uncertainty. I am concerned that when you compare AUC-ROC or AUC-PR, these metrics are not comparable if the logits you use are not well calibrated. Hence, you must report the Brier score, show calibration curves, or use additional metrics to ensure rigor.
We thank the reviewer for their time and reply.
I do not think it is a good idea to mix the architecture with the objective and discuss it as the model …
Response: Please note that, as we stated in the paper, the idea of pre-training here is to use both voxel-level activity and brain connectivity networks and maximize their mutual information. Accordingly, our proposed pre-training strategy and its objective need a model that provides both voxel-level and brain connectivity network-level encodings. That is, we need a model that can provide encodings from two views of the data! In our case, one view is the voxel activity time series, and the second view is the brain connectivity network.
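For concreteness, here is a minimal sketch of a cross-view mutual-information objective of the kind described above, written as a Jensen-Shannon lower bound in the spirit of Deep InfoMax. The encoder names (`voxel_encoder`, `fc_encoder`) and the dot-product critic are illustrative assumptions, not the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def js_mi_lower_bound(z_voxel: torch.Tensor, z_fc: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon MI lower bound between two views of the same subjects.

    z_voxel: (batch, dim) embeddings from the voxel-activity view.
    z_fc:    (batch, dim) embeddings from the functional-connectivity view.
    Matched rows are positive pairs; mismatched rows in the batch act as negatives.
    """
    scores = z_voxel @ z_fc.t()                       # pairwise critic scores, shape (batch, batch)
    pos = torch.diagonal(scores)                      # matched (positive) pairs
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores[off_diag]                            # mismatched (negative) pairs
    # E_pos[-softplus(-T)] - E_neg[softplus(T)]: the JSD estimator used in Deep InfoMax
    return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()

# During pre-training one would maximize the bound, i.e. minimize its negation:
# loss = -js_mi_lower_bound(voxel_encoder(x_voxel), fc_encoder(x_fc))
```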
Similarly, in the studies cited by the reviewer, the contribution is not merely employing Deep InfoMax with a different architecture. For example, Bachman et al. (2019) suggest using contrastive learning to generate different views of the image, enabling the approach to use Deep InfoMax. There are many ways to define local features for different architectures, which is out of the scope of this paper. However, to address the concern raised by the reviewer, we are running several experiments and training our model with different objectives. We will further report the number of model parameters for the different baselines.
Also, we want to kindly bring to your consideration that BNTransformer has more parameters than BrainMixer in most datasets. Furthermore, as we stated in the previous response, if the capacity is the reason for the superior performance of BrainMixer, replacing its encoder with other state-of-the-art encoders would result in the same or better performance. However, our ablation study showed that changing or removing encoders can damage the performance.
If the BrainMixer outperforms the baselines without pre-training, what is the benefit of the proposed self-supervised pre-training?
Response: Please note that achieving state-of-the-art performance is not the goal of the paper. We showed that self-supervised pre-training can improve performance, and we also note that the results are far from perfect scores (e.g., the ACC for BVFC, BVFC-MEG, and HCP-Age is about 68%, 63%, and 58%, respectively). Accordingly, instead of introducing BrainMixer without pre-training in this paper and then beating its performance in a follow-up paper by introducing pre-training, we propose both in this study and show that both are important to achieve good performance.
Sorry for the lack of clarity in my comment. I am not concerned with the reliability or uncertainty …
Response: In addition to AUC-PR, we have reported the performance of the model using the ACC metric in Table 8, which reflects the prediction performance when the threshold is 0.5. This approach does not suffer from the mentioned limitation, and the results show the superior performance of BrainMixer. As the reviewer suggested, we believe that these results with additional metrics ensure rigor. We will further add the PR curves in the final version if needed.
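As an illustration of the kind of calibration reporting the reviewer asks for, a minimal sketch with scikit-learn is shown below; `y_true` and `y_prob` are placeholder labels and probabilities, not values from the paper's experiments.

```python
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                    # placeholder binary labels
y_prob = np.clip(y_true * 0.7 + rng.normal(0.15, 0.2, size=200), 0, 1)   # placeholder scores

brier = brier_score_loss(y_true, y_prob)              # lower values indicate better calibration
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(f"Brier score: {brier:.3f}")
# Plotting frac_pos against mean_pred gives the reliability (calibration) curve.
```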
We will revise the paper based on the above comments and add additional results in the next 5 hours. We sincerely thank you for your constructive comments, which helped us to improve the paper. If our response resolves your concerns, we kindly ask you to consider raising the rating of our work.
For the T-test, you must check the normality assumptions …
Response: Please note that BrainMixer consistently outperformed all competitors across tasks and experiments, with significant p-values in 32 out of 34 cases. Given the extremely low likelihood of a type I error causing these results, we reported raw p-values to maintain consistency and simplicity in our statistics. However, to address the concern raised by the reviewer, we will report the corrected p-values in our next revision, which will be available in the next few hours.
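For reference, a minimal sketch of how such a normality check and multiple-comparison correction could be carried out is given below; the per-fold scores are placeholders and the Holm correction is an illustrative assumption, not necessarily the exact procedure used in the revision.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
ours = rng.normal(0.85, 0.02, size=10)        # hypothetical per-fold scores for BrainMixer
base = rng.normal(0.80, 0.02, size=10)        # hypothetical per-fold scores for a baseline

_, p_norm = shapiro(ours - base)              # normality check on the paired differences
test = ttest_rel if p_norm > 0.05 else wilcoxon   # fall back to a non-parametric test if violated
_, p_raw = test(ours, base)

# Collect the raw p-values of all 34 comparisons, then correct them, e.g., with Holm:
p_raw_all = [p_raw] * 34                      # placeholder for the 34 raw p-values
reject, p_corrected, _, _ = multipletests(p_raw_all, alpha=0.05, method="holm")
```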
How would you guarantee that the model captures exactly the abnormalities of ADHD when you train the model only on healthy subjects?
Response: Similar to most existing studies in the healthcare domain, there is no theoretical guarantee for the findings. However, the main intuition behind this experimental design is that training our model on a healthy control group lets the model learn normal brain patterns. When testing it on the ADHD group, we can then find brain activity patterns that are abnormal with respect to the training data. When considering common patterns in the control group (as we have done in Figure 3), we find patterns in the ADHD group that consistently differ from them, and we expect that these patterns potentially correspond to ADHD symptoms. Please note that we further discuss these results and show that our findings are consistent with existing studies, which is evidence of BrainMixer's effectiveness in finding abnormal patterns.
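To make the counting procedure concrete, the following is a minimal sketch of region-level anomaly counting, under the assumption that the model produces a per-region anomaly score (e.g., a reconstruction or prediction error); the score definition and the z-score threshold are illustrative rather than the paper's exact criterion.

```python
import numpy as np

def anomalous_region_counts(adhd_scores, control_scores, z_thresh=2.0):
    """Count how often each region is flagged as anomalous across ADHD subjects.

    adhd_scores:    (n_adhd_subjects, n_regions) per-region anomaly scores.
    control_scores: (n_control_subjects, n_regions) scores on held-out healthy controls,
                    used to estimate the "normal" score distribution per region.
    """
    mu = control_scores.mean(axis=0)
    sd = control_scores.std(axis=0) + 1e-8
    z = (adhd_scores - mu) / sd            # deviation from the control-group distribution
    flagged = z > z_thresh                 # per-subject anomalous regions
    return flagged.sum(axis=0)             # how often each region is flagged across subjects
```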
Understanding abnormal activity in the human brain is still an active research area, and for most diseases, we do not even know specific biomarkers. This makes the evaluation of anomaly detection methods challenging. Studies on anomaly detection in the human brain usually do not compare their methods to others; they report only the findings obtained with their proposed methods, followed by a discussion of the consistency of these findings with previous studies (e.g., [1]).
[1] Detecting network anomalies using Forman–Ricci curvature and a case study for human brain networks. Scientific Reports, 2021.
How do you ensure the abnormalities are not due to poor model generalization to unseen groups/domains?
Response: As we discussed above, our findings are consistent with previous studies on ADHD that use different methods and datasets. Accordingly, we believe that the found abnormalities are not due to poor model generalization to unseen groups/domains. We have further added a case study on ASD, which shows that several of the abnormal brain regions found by BrainMixer are consistent with existing studies.
If you apply your model to other unseen data with different diseases, would you get similar results?
Response: Designing a model powerful enough to detect abnormal brain activity and transfer its knowledge across different diseases could be an important contribution. However, we do not know if BrainMixer is capable of that. The main challenge in evaluating this is the lack of multi-disease datasets collected with the same data-gathering procedure. In fact, the experimental setups used to record fMRI responses can be very different, which makes it hard to evaluate the model in the setting you mentioned.
Contributions, Rigor, and Experiments:
We wholeheartedly appreciate the reviewer's time and detailed responses. We hope that our effort to improve the paper has addressed your concerns. Here, we want to kindly bring to your consideration that a thoroughly comprehensive evaluation of an approach requires unlimited resources and time. Even well-known machine learning models with tens or hundreds of follow-up studies are still being evaluated and improved. The main goals of this paper are (1) to provide the new insight that joint learning of voxel-level time series and the brain connectivity network can improve performance, (2) to design a novel time series encoder and a novel graph learning encoder that exploit unique properties of the brain, and (3) to provide enough evidence that this idea and model can potentially be useful for future studies. We provided one of the most extensive experimental evaluations (compared to existing studies in this area), covering binary and multi-class classification as well as regression tasks, different neuroimage modalities, and (6 + 1) datasets. We provide different metrics for each evaluation and support our results with two case studies. We have evaluated the characteristics of BrainMixer in the appendix and further performed ablation studies with 18 different cases (Tables 3 and 7) to provide more details for future work. We believe these experiments support that BrainMixer's insights are potentially effective and worth further study in future work.
I want to thank the authors for their efforts. Most of my concerns have been addressed. To reflect it, I raised the score to "6: marginally above the acceptance threshold" from "5: marginally below the acceptance threshold".
Dear reviewers and ACs,
First, we thank the reviewers for their time and careful reading. We are grateful for your constructive comments and suggestions, which have helped us to improve the paper. We revised the paper, improved the clarity, and tackled all suggestions and comments raised by the reviewers.
Appendix
We agree with the reviewers that there were unclear points regarding the experiments and background concepts since the appendix was missing. We had put the appendix on the anonymous link in the paper, but unfortunately, we realized that it was not properly linked to our repository. In the revision, we provided the appendix in the main submission file and made sure that other supplementary materials were on the anonymous link.
Revision
In the following, we outline the major changes we made in the revision. We would appreciate it if the reviewers could note whether we have understood their comments correctly. We hope that our efforts will satisfy the reviewers' questions and hope they may improve their ratings based on the current version.
- We provide the details of all experiments, training, testing, and hyperparameter tuning procedures in Appendix E. All the methods use the same training, testing, and hyperparameter tuning procedures, and we ensure that the training, testing, and validation sets are separated.
- We performed additional ablation studies on the effect of the FNC encoder and the time-series encoder (results are in Table 3). The results show that replacing either of these encoders with state-of-the-art alternatives can damage the performance.
- We further performed additional ablation studies on the effect of patching (Results are in Table 7). The results show the significance of our proposed patching methods.
- We performed additional experiments on regression tasks to show the effectiveness of BrainMixer in various downstream tasks.
- We performed statistical analysis and reported the statistical significance of our results.
- We discuss all the necessary background knowledge that we have used in the paper in Appendix A.
- We added a section in the appendix that extensively discusses the differences between our encoders and MLP-Mixer, with illustrative examples.
- We re-wrote and modified the parts of the paper that reviewers needed clarification on. We hope this effort and changes based on the reviewers' suggestions have improved the presentation of the paper.
Contributions
We are concerned that our unintended mistake, which led to the missing appendix, might have obscured the contributions of this paper. Now, having addressed the reviewers' concerns about the appendix and clarity, we want to respectfully bring to your consideration that:
- This paper presents a novel multivariate time series encoder that employs a novel dynamic attention mechanism to bind information across both the voxel and time dimensions. Our experiments support that this encoder is more powerful than existing time series encoders for the brain and is not limited to a specific neuroimage modality.
- We present a novel graph learning method for encoding the FNC, which employs a novel temporal pooling strategy. Our experiments support that this encoder is more powerful than existing graph machine-learning models for brain networks. Also, we theoretically prove the power of our approach.
- We present a novel self-supervised pre-training framework without using computationally costly contrastive learning, which requires generating many negative samples. Moreover, contrary to most existing studies, this self-supervised framework does not use unique properties of a specific neuroimage modality (e.g., EEG) and so is generalizable to different neuroimage modalities. We believe this framework, which does not rely on labeled data and unique properties of a specific neuroimage modality, can be a backbone for future foundation models on neuroimaging data.
We want to kindly bring to your consideration that coming up with a single architecture like BrainMixer in which (i) each element alone achieves competitive performance and (ii) the whole is suitable for different tasks and modalities is highly challenging and has usually not been done in existing studies. As an example, [1, 2] focus only on our second contribution (a novel graph learning method for the brain), [3] focuses only on our first contribution (a novel time series encoder for the brain), and [4] focuses only on designing an effective pre-training framework. We further discuss related work as well as our contributions in Appendix C.
References
[1] Brain Network Transformer. NeurIPS 2022.
[2] Learning dynamic graph representation of brain connectome with spatio-temporal attention. NeurIPS 2021.
[3] Learning representations from EEG with deep recurrent-convolutional neural networks. ICLR 2016.
[4] PPi: Pretraining Brain Signal Model for Patient-independent Seizure Detection. NeurIPS 2023.
Dear Reviewers, ACs,
Once again, we sincerely thank you for your time and constructive reviews. We hope that our effort to improve the paper has addressed all the concerns raised by the reviewers. We have provided the second revision of the paper. In addition to the previous changes (mentioned in the above comment), in this revision, we have:
- added corrected p-values to measure the significance of the results,
- added a new ablation study on the objective of the pre-training,
- reported the number of parameters,
- evaluated the effect of the number of parameters on the performance of BrainMixer,
- added a new case study on ASD.
We are more than happy to answer your further questions.
This paper introduces BrainMixer, a novel self-supervised framework for learning representations of brain data. Combining time series and functional connectivity information, BrainMixer utilizes adapted versions of MLP-Mixer for time series encoding and a Temporal Graph Mixer with Temporal Pooling Mixer for functional connectivity encoding. Additionally, innovative patching strategies are implemented for each encoder. Trained through maximizing mutual information between global and local-global embeddings, BrainMixer demonstrates significant performance improvements across multiple datasets, modalities, and baseline models, making it a powerful tool for neuroimaging data analysis.
The manuscript suffers from major issues related to experimental design, clarity, and rigor, making it difficult to assess the true contribution of the proposed methods. We appreciate the efforts the authors put in during the rebuttal period. However, one critical concern is that the paper frequently refers to the appendix for essential details and clarifications, yet the actual appendix is absent from the submission. This absence hinders a comprehensive understanding of the work. It was only added at the rebuttal stage, well after the paper submission deadline. This raises questions about the fairness and transparency of the review process.
Why not a higher score
The manuscript suffers from major issues related to experimental design, clarity, and rigor, making it difficult to assess the true contribution of the proposed methods. One critical concern is that the paper frequently refers to the appendix for essential details and clarifications, yet the actual appendix is absent from the submission. This absence hinders a comprehensive understanding of the work. It was only added at the rebuttal stage, well after the paper submission deadline. This raises questions about the fairness and transparency of the review process.
Why not a lower score
N/A
Reject