PaperHub

Overall score: 5.5 / 10 · Poster · 4 reviewers
Ratings: 5, 7, 4, 6 (min 4, max 7, std 1.1)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 3.3

NeurIPS 2024

Neural decoding from stereotactic EEG: accounting for electrode variability across subjects

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

Seegnificant: a framework and architecture for multi-subject neural decoding based on sEEG

Keywords

sEEG · Neural Decoding · Transformers · Multi-Subject Training

Reviews & Discussion

Official Review (Rating: 5)

The work proposes a transformer architecture for decoding behavior from sEEG neural recordings that can account for the inter-subject variability that exists due to different placements of sEEG electrodes, different SNRs, inherent biological differences, etc. The ability of the proposed architecture to deal with inter-subject variability allows it to leverage a much larger multi-subject dataset, as compared to existing works which rely on subject-specific models. The proposed architecture consists of six main components: (i) a CNN temporal tokenizer; (ii) a temporal attention mechanism; (iii) a hand-designed RBF-based spatial encoding of the different electrode locations; (iv) a spatial attention mechanism; (v) a feed-forward MLP to extract low-dimensional neural representations; and (vi) a multi-head regression layer to deal with inherent patient-to-patient variability. The authors test their proposed architecture on a dataset collected from 21 subjects performing a visual color-change task, and show that the architecture's decoding performance improves significantly when using multi-subject datasets as compared to single-subject datasets.

Strengths

  • The paper is well-written and easy to follow.
  • The paper addresses an important issue of data heterogeneity in sEEG recordings among participants, which makes constructing large datasets challenging since data from multiple participants cannot be easily pooled. This particularly affects the applicability of deep learning models, as they typically rely on large datasets to achieve low generalization error.
  • The proposed architecture is clearly defined and uses a multi-head regression layer with a shared trunk (similar to some multi-task architectures) to model inter-patient variability. The authors also employ temporal and spatial attention layers to model temporal dependence in the neural activity and spatial dependence between different electrodes, respectively.
  • The experimental details are clearly provided, and the results clearly show that training transformers on multi-subject datasets significantly boosts their decoding performance.
  • The ablation study is insightful, providing a clear picture of how the different components of the proposed architecture function.

Weaknesses

  1. A main weakness of this paper is that the spatial encoding (which is framed as a main contribution of the work) only seems to marginally improve the performance of the transformer architecture. The results in Section 4.5 show only an average decrease of 0.02 in the $R^2$ coefficient across subjects, which is well within the standard error of 0.05 (line 279) reported by the authors for the average $R^2$ coefficient of the transformer model incorporating spatial encoding. I am surprised the authors do not use this result to conclude that the proposed spatial encoding does not have a significant effect on the model's accuracy and that, consequently, the heterogeneity in the actual positions of sEEG electrodes does not seem to impede the training of multi-subject models. Looking at the authors' ablation study in A.3.1, it seems more plausible that the attention mechanism for space (without which the $R^2$ coefficient decreases by a more significant amount of ~0.1, Figure 6) is the main mechanism by which the proposed architecture deals with electrode heterogeneity.

  2. Furthermore, the ablation study detailed in A.3.1 seems to suggest that the multi-head regression component of the proposed architecture is the main component enabling the better performance of the multi-subject model. I am surprised that this observation is not mentioned in the main manuscript.

  3. Due to the marginal performance boost offered by the positional encoding, the argument that heterogeneity of electrode placement needs to be explicitly accounted for by tokenizing space and time separately seems weak (lines 161-164). Consequently, the separation of the time and space attention mechanisms does not seem adequately justified. I think a comparison against a 2-D attention mechanism would benefit the study.

  4. Comparison against other methodologies is limited. The authors only compare their architecture trained on single-subject data against the same architecture trained on multi-subject data, and show a significant increase in the performance of their transformer when training with the larger multi-subject data. Naively training a transformer architecture without heavy regularization on a small single-subject dataset seems to provide poor performance, as already discussed in Ye & Pandarinath'21. Hence, the gains of using multi-subject data reported in this study might be artificially larger than the ones obtained with proper regularization (such as dropout, $L_2$, and $L_1$ regularization) of the transformer architecture. Furthermore, the authors should also use some classical decoding models (such as PCA with linear regression or Unscented Kalman Filters; see Xu et al. JNE'22) to give a sense of how much gain is obtained in the first place by using the more complicated transformer architecture and to provide a baseline accuracy for comparison. The authors should also compare against existing techniques such as SEEG-Net (Wang et al. Computers in Biology and Medicine'22), which used domain generalization techniques for dealing with heterogeneous sEEG data distributions among different subjects. As a side note, I think the work would benefit from a discussion of the multi-task literature, as the multi-head regression structure is quite similar to many multi-task architectures proposed earlier (shared trunk architecture; see the review Crawshaw'20 on arXiv).
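
For concreteness, a minimal sketch of the kind of classical baseline suggested here (PCA followed by regularized linear regression); the data shapes, split, and hyperparameters are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((175, 120 * 50))   # trials x flattened (electrodes x timepoints)
y = rng.uniform(0.2, 0.8, size=175)        # hypothetical per-trial response times (sec)

# Classical baseline: project to a few principal components, then fit a
# regularized linear decoder (L2 here; an L1 variant via Lasso is analogous).
baseline = make_pipeline(PCA(n_components=20), Ridge(alpha=1.0))
baseline.fit(X[:120], y[:120])
print("held-out R^2:", r2_score(y[120:], baseline.predict(X[120:])))
```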

Minor:

  • Line 210: "long-range dependency" might be more appropriate than "long-term dependency" as the authors are talking about variables representing space.
  • Please provide the test/train/validation splits used in this work. It would facilitate understanding of the reported test $R^2$ coefficients.

Questions

  1. Line 230: What is C? Should it be E?
  2. Section 3.2.2: Spatial Positional Encoding: A couple of questions:
  • I am assuming that an MRI for each subject in the study was available, from which the MNI coordinates of the sEEG electrodes were calculated. In the spirit of being fully data-driven, why not use the whole MRI (with the sEEG shanks) for encoding the spatial latents? A simple CNN or graph convolutional net could be used to directly encode the spatial information into a latent, which could be used instead of the hand-designed coordinate-based spatial features used in this work. A minor issue might be that the spatial latents could also contain information about the electrodes discarded in the preprocessing, but that can possibly be remedied by appropriate masking or by removing the sEEG coordinates from the MRI during pre-processing (whichever is easier). A possible advantage of directly using the MRI data would be that it would also account for differences in brain structure along with the differences in sEEG probe placement.
  • I do not understand the point of using $m$ different 1-D RBFs for modeling the positional encoding. Is the goal to model positional encodings at different spatial scales? Why is information at different spatial scales relevant? How are the different $\sigma_j$ chosen (are they learnt from the data or chosen as hyperparameters)? The choice seems somewhat arbitrary, especially since the gain in the model's performance from using the spatial encoding is marginal (a 0.02 increase in $R^2$), as reported by the authors in Section 4.3.
  • Is the rationale for using 1-D RBFs, i.e., an RBF for each coordinate (eq. (4)), instead of 3-D RBFs (e.g., something like $C\exp(-D((x-\mu_x)^2+(y-\mu_y)^2+(z-\mu_z)^2))$), to also incorporate information about the direction of the coordinate?
  • Lines 205-207: The authors state that the projected positional encodings are added to the latents. I am having trouble fully understanding what this operation entails. Does this mean that the projected positional encodings, denoted by $p \in \mathbb{R}^K$, and the latents, denoted as $z_{int}^{3} = z_{int}[i,j,:]$ (using pythonic notation) for some $0 \leq i \leq E-1$ and $0 \leq j \leq T-1$, are added to produce the output $o \in \mathbb{R}^{E \times T \times K}$ of the operation as $o^{3} = z_{int}^{3} + p$, followed by appropriate stacking? If that is the case, then what are the relative magnitudes of $z_{int}^{3}$ and $p$? Since $p$ is calculated using the 1-D RBFs, depending upon the magnitude of the $\sigma_j$'s, the corresponding projected $p$ could be numerically very small, and the corresponding output $o^{3} = z_{int}^{3} + p$ might be dominated by the term $z_{int}^{3}$. I think checking the relative magnitudes of $z_{int}^{3}$ and $p$, and ensuring that they are roughly comparable, is important to ensure that the positional encoding impacts the final outcome.
  3. Why is Huber loss used for training the multi-subject model whereas MSE loss is used for training the single-subject models? Ideally, both multi-subject and single-subject models should have been trained using both losses and, for each model, the Huber or MSE loss should have been chosen based on validation error. The choice is surprising, as I would have expected the more robust Huber loss on the single-subject models (which would be noisier and more prone to the effects of outliers) and MSE on the multi-subject data, where the much larger dataset can regularize the training process.
  4. Another point of potential unfair comparison is that single-subject models are trained for a much smaller number of total gradient descent steps compared to multi-subject models. On average, the multi-subject models have a 21× larger dataset compared to a single-subject model. Hence, a single epoch while training multi-subject models will go through 21× more gradient descent steps compared to an epoch of a single-subject model. So, a multi-subject model trains for 21× longer than a single-subject model. I think the authors should provide a justification for this large discrepancy in the training periods of multi-subject and single-subject models. For example, try training the multi-subject model for 1000/20 = 50 epochs and see its $R^2$ coefficient, or take a couple of the best-performing single-subject models and train them for 21,000 epochs and see their performance. Edit: This comment is no longer applicable. I missed the part about different batch sizes.

Limitations

Yes, the authors provide a discussion on some limitations in the discussion section.

Author Response

Thank you for a very detailed review. Here, we provide our response to your questions. Due to space constraints, for each question we provide only the beginning of your prompt. We apologize for the inconvenience.

Throughout this rebuttal we refer to single-subject as SS and multi-subject as MS.

Weaknesses

  1. A main weakness of this paper is the spatial encoding...

Thanks for the constructive criticism. Indeed, our proposed PE only marginally boosts decoding performance, suggesting the attention mechanism in space is the main mechanism by which our model deals with electrode heterogeneity. We will highlight this in our revised manuscript.

  2. Furthermore, the ablation study detailed in A.3.1...

This is a valid point. We will highlight this in revisions.

  3. Due to the marginal performance boost offered by positional encoding...

Thanks for suggesting this. We compared the performance of our architecture against a variant where 2D attention is performed over time & space. We include the results of this analysis in Table 4 of our General Response (see also Figure 1E of the pdf). The results suggest that our proposed architecture outperforms its 2D attention variant by $\Delta R^{2} = 0.06$.

  4. Comparison against other methodology is limited...

Thanks for the constructive criticism. To address this, we compared our model with classic and SOTA SS models. The results, shown in Table 1 of our General Response, indicate that our architecture outperforms all baselines across the board.

Unfortunately, UKF and SEEG-Net are missing from those baselines. We have been unable to get good results using the UKF, likely due to the nature of our data, since the trial response times (included in the UKF's state) are not stationary, making state estimation unreliable. We also looked into sEEG-Net, but the code is not public and the method requires substantial modification before use, since it operates on the electrode level while our problem requires operating on the subject level. We are still trying to make those models work and hope to include them as baselines in revisions.

Thank you for suggesting we discuss multi-task literature in revisions. We plan to do so, including Crawshaw (2020), in the related work section of our manuscript.

Minor

  • Line 210: "long-range dependency" might be more appropriate...

We will amend this in revisions.

  • Please provide the test/train/validation splits...

The training/validation/test split is 70/15/15. We will include this in revisions.

Questions

  1. Line 230: What is C? Should it be E?

Yes. We will fix the typo in revisions.

  • I am assuming that an MRI for each subject in the study was available...

This is a great idea. Unfortunately, for this dataset we only have the MNI coordinates of electrodes and not whole MRIs or other imaging data. Therefore, we are unable to test this out. We look forward to exploring this option in the future.

  • I do not understand the point of using $m$ different 1-D RBFs for modeling the positional encoding...

Your intuition is correct; we used RBFs with different variances to encode position at different spatial scales. This is relevant because studies (Meunier, 2009) show that the brain has a hierarchical modular structure; therefore, encoding both short- and long-range electrode connections could be useful. The scales $\sigma_j$ for each RBF were hyperparameters with values [1, 2, 4, ..., 64].
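
As an illustration of this scheme, here is a minimal, hypothetical sketch of a multi-scale 1-D RBF encoding of MNI coordinates; the RBF centers and the exact Gaussian form are assumptions, and only the scale values [1, 2, 4, ..., 64] come from the response above:

```python
import numpy as np

sigmas = [1, 2, 4, 8, 16, 32, 64]       # spatial scales (mm), per the response above
centers = np.linspace(-80, 80, 9)       # assumed RBF centers spanning MNI space

def encode_axis(coord: float) -> np.ndarray:
    """Encode one MNI coordinate with 1-D RBFs at several spatial scales."""
    return np.concatenate(
        [np.exp(-((coord - centers) ** 2) / (2.0 * s**2)) for s in sigmas]
    )

def encode_electrode(mni_xyz) -> np.ndarray:
    """Concatenate per-axis encodings, preserving directional information."""
    return np.concatenate([encode_axis(c) for c in mni_xyz])

print(encode_electrode([12.0, -45.3, 8.7]).shape)  # (3 * 7 * 9,) -> (189,)
```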

  • Is the rationale for using 1-D RBFs, i.e., the RBF for each coordinate...

We used 1D RBFs instead of 3D because 1D RBFs: (1) inform the model of direction (as you correctly identified) and (2) introduce fewer parameters to the model compared to 3D RBFs. Specifically, assuming that sampling an interval in 1D requires $m$ points, sampling a 3D interval separately in each dimension at the same scale requires $3m$ points, whereas sampling the same 3D interval jointly in 3 dimensions requires $m^3$ points.

  • Lines 205-207: The authors state that the projected positional encodings are added to the latents...

Thank you for this insightful suggestion. Your description of the operation is correct. To ensure that $z^{3}$ and $p$ have comparable magnitudes, we computed $\|z^{3}\|$ and $\|p\|$ for 512 random samples from our test set. The norms were $\|z^{3}\| = 0.790 \pm 0.53$ and $\|p\| = 0.013 \pm 0.01$ (mean ± std), suggesting that $\|z^{3}\|$ is greater than $\|p\|$, but not to the extent that the model's output would be unaffected by this operation.

  3. Why is Huber loss used for training multi-subject model whereas MSE loss used for training single-subject...

We optimized SS models using MSE loss because $R^{2}$ is directly related to the MSE loss. Following your suggestion, we trained both SS and MS models using both MSE and Huber loss. The results show that training with Huber loss boosts performance for all models. We have updated the SS model results to reflect this.

Table 5. Comparison of model performance trained with Huber vs MSE loss.

| Model | MSE | Huber |
|-------|-----|-------|
| SS | 0.28 ± 0.05 | 0.30 ± 0.05 |
| MS | 0.35 ± 0.05 | 0.39 ± 0.05 |
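
For reference, the Huber loss interpolates between MSE (near zero) and MAE (in the tails); the definition below is the standard one, with the threshold $\delta$ a hyperparameter whose specific value is not stated in this thread:

$$
L_\delta(y, \hat{y}) =
\begin{cases}
\tfrac{1}{2}\,(y - \hat{y})^2, & |y - \hat{y}| \le \delta,\\
\delta\left(|y - \hat{y}| - \tfrac{1}{2}\,\delta\right), & \text{otherwise.}
\end{cases}
$$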
  4. Another point of potential unfair comparison...

Thank you for this thoughtful concern. Your intuition is correct. However, the MS and SS models were trained with batch size 1024 and 64, respectively (see line 599 of manuscript). Our dataset contains 3685 trials (175 trials per subject). Therefore, during 1K training epochs MS and SS models underwent 4K and 3K gradient descent steps (GDS), respectively.
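
A worked check of these step counts (assuming steps per epoch are the ceiling of trials over batch size):

$$
\text{MS: } \left\lceil 3685/1024 \right\rceil \times 1000 = 4\text{K GDS}, \qquad
\text{SS: } \left\lceil 175/64 \right\rceil \times 1000 = 3\text{K GDS}.
$$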

To ensure that this did not artificially inflate the performance difference between the MS and SS models, we retrained all SS models for 1.5K epochs (4.5K GDS). The average per-subject test set $R^2$ of the SS models was 0.29 ± 0.05 (mean ± sem). The MS model still outperformed those by $\Delta R^{2} = 0.10$.

In practice, we used 1K epochs because we observed that they were enough for the training objective of all models to converge.

Comment

I have read the author rebuttal and increased my overall score to 5.

  • I still think the biggest weakness of the work is comparison with existing baselines. Authors do not compare against any existing state of the art method, e.g., Ye & Pandarinath'21 or SEEG-NET.
  • Authors should also not claim that "This work is the first to train a unified, multi-session, multi-subject models for neural decoding based on sEEG", when SEEG-Net (Wang et al. Computers in Biology and Medicine'22) has done it two years earlier.
  • I think it is a bit of a stretch to claim that $\|p\|$ (which is $\sim$50 times smaller than the standard deviation of $\|z^3\|$) is affecting the output $o$.
  • I do not agree with the authors' assertion that 3-D RBFs require many more parameters. The common RBF kernel is of the form $\exp(-\|x-x_0\|^2/\sigma^2)$, which only requires a single parameter $\sigma$, resulting in exactly the same number of parameters. You could potentially increase the number of parameters by using the Mahalanobis distance instead of the standard distance, but even in that case the number of parameters for each RBF kernel is 6 instead of 1 (overall 6×9 = 54 parameters), which is negligible compared to the number of parameters in the transformer architecture. Also, I do not understand how 1-D RBFs are able to sample a 3-D space using $3m$ points. Note that if I discretize each dimension using $m$ bins, then I have discretized the 3-D space into $m^3$ bins. Consider the example where I discretize each dimension into two bins {0,1}; then I access 8 different bins in 3-D space: {0,0,0}, {0,0,1}, {0,1,0}, {0,1,1}, {1,0,0}, {1,0,1}, {1,1,0}, and {1,1,1}.
Comment

Thank you for your response and for updating your score! We are very happy that our responses and new experiments addressed at least some of your concerns. Please find our answers to your new comments below:

  • I still think the biggest weakness of the work is comparison with existing baselines. Authors do not compare against any existing state of the art method, e.g., Ye & Pandarinath'21 or SEEG-NET.

Based on your initial feedback, we put a lot of effort into comparing our models with baselines. However, you would like us to compare against:

  1. Ye and Pandarinath (2021): The NDT model is designed to "transform sequences of binned spiking activity into inferred firing rates" (this is directly quoted from the paper: line 1 of section 2). NDT operates on single-unit electrophysiology datasets, which can only be recorded with microelectrode arrays. In this work, we use an sEEG dataset and it is impossible to resolve single units from sEEG recordings. Given the nature of our dataset, it is unclear how we could use NDT on our data.

  2. Wang et al. (2022): sEEG-Net is a model designed for pathological activity detection in drug-resistant epilepsy. It accepts univariate timeseries of neural activity from single sEEG electrodes and classifies them as physiologic, pathological, or artifact (it maps $\mathbb{R}^{T} \rightarrow \mathbb{R}^{3}$). In contrast, in this work we are dealing with a problem where multivariate timeseries of neural activity from multiple sEEG electrodes need to be mapped to a response time (a map $\mathbb{R}^{E \times T} \rightarrow \mathbb{R}^{1}$). Therefore, without extensive modification, sEEG-Net cannot work on our data. Adding to this, the code for sEEG-Net is not public, hampering our efforts to reproduce the model.

We hope you understand that we have been unable to provide those comparisons for solid reasons.

  • Authors should also not claim that "This work is the first to train a unified, multi-session, multi-subject models for neural decoding based on sEEG", when SEEG-Net (Wang et al. Computers in Biology and Medicine'22) has done it two years earlier.

We respectfully disagree. We believe our claim is justified because:

  1. sEEG-Net is not a neural decoding model, rather a pathological activity detection model. Neural decoding and pathological activity detection are two very different tasks.
  2. sEEG-Net operates on single-electrodes (univariate timeseries of single sEEG electrodes) while our models operate on single-subjects (multivariate timeseries of many sEEG electrodes).
  • I think it is a bit of a stretch to claim that $\|p\|$ (which is 50 times smaller than the standard deviation of $\|z^3\|$) is affecting the output $o$.

We echo that the contribution of $\|p\|$ is very small. We will make sure to discuss this limitation in our revised manuscript and emphasize the need to identify better spatial positional encoding schemes, such as encoding whole-brain MRI scans or using atlases other than MNI, as you and reviewer o6UF suggested.

  • I do not agree with the authors' assertion that 3-D RBFs require many more parameters. The common RBF kernel is of the form $\exp(-\|x-x_0\|^2/\sigma^2)$, which only requires a single parameter $\sigma$, resulting in exactly the same number of parameters. You could potentially increase the number of parameters by using the Mahalanobis distance instead of the standard distance, but even in that case the number of parameters for each RBF kernel is 6 instead of 1 (overall 6×9 = 54 parameters), which is negligible compared to the number of parameters in the transformer architecture. Also, I do not understand how 1-D RBFs are able to sample a 3-D space using $3m$ points. Note that if I discretize each dimension using $m$ bins, then I have discretized the 3-D space into $m^3$ bins. Consider the example where I discretize each dimension into two bins {0,1}; then I access 8 different bins in 3-D space: {0,0,0}, {0,0,1}, {0,1,0}, {0,1,1}, {1,0,0}, {1,0,1}, {1,1,0}, and {1,1,1}.

We agree that the number of parameters saved by using 1D RBFs instead of 3D RBFs is negligible compared to the number of parameters of the transformer. The main benefit of 1D RBFs is the encoding of directionality.

In terms of how 1-D RBFs are able to sample 3D space using $3m$ points, we would like to clarify that this statement in our previous response was poorly written. We intended to convey that to center 1-D RBFs (separately in the x, y, and z directions) at the $m^3$ points of a 3D grid, you only need to compute $3m$ distinct 1-D RBFs. All $m^3$ RBFs required to map the 3D space can then be computed by appropriately multiplying those $3m$ RBFs. We apologize for the confusion.

Thanks again for your feedback and for the insightful discussion, which we believe has substantially improved the quality of our work.

Comment

On more careful thought based on their response, it seems that the 1-D RBF structure the authors are using is the same as the standard 3-D RBF. Consider the following argument: the standard 3-D RBF has the form $\exp(-((x-x_0)^2+(y-y_0)^2+(z-z_0)^2)) = \exp(-(x-x_0)^2)\exp(-(y-y_0)^2)\exp(-(z-z_0)^2)$, which is the same thing as multiplying 1-D RBFs across the $m^3$ points. So the 3-D RBF should also require the same $3m$ points as the 1-D RBFs. Regardless, it is a minor point, but I wanted to point it out to the authors in case they want to generalize their positional encoding schemes.
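
A quick numerical check of this factorization, with illustrative values (not data from the paper):

```python
import numpy as np

x0 = np.array([1.0, -2.0, 0.5])   # RBF center (illustrative)
p = np.array([0.3, 0.7, -1.2])    # query point (illustrative)
sigma = 2.0

# Isotropic 3-D Gaussian RBF vs. product of the three per-axis 1-D RBFs:
rbf_3d = np.exp(-np.sum((p - x0) ** 2) / sigma**2)
rbf_1d_prod = np.prod(np.exp(-((p - x0) ** 2) / sigma**2))
print(np.isclose(rbf_3d, rbf_1d_prod))  # True, since exp(a+b) = exp(a)exp(b)
```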

Official Review (Rating: 7)

The authors propose a novel training framework and architecture to predict response time in a color-change behavioral task from stereotactic electroencephalography (sEEG) data, focusing on integrating data across multiple subjects despite the variability in electrode placement and count. The model tokenizes neural activity within electrodes using convolutions, extracts temporal dependencies with self-attention, incorporates electrode location with a positional encoding scheme followed by a spatial self-attention layer, and ends with a subject-specific prediction layer. The model is trained on data from 21 subjects using different procedures: single-subject training (baseline), multi-subject training, and multi-subject training plus single-subject finetuning. The proposed method demonstrates an improved ability to decode trial-wise response times compared to the baseline, and transferability of learned representations to new subjects.

Strengths

  • Innovative Framework: The proposed framework effectively addresses the heterogeneity across subjects, a significant challenge in sEEG data processing, by using electrode placement and subject-specific prediction layers. The ability to pretrain the model on a larger multi-subject dataset and fine-tune it for new subjects with minimal training is a valuable feature, enhancing the model's practicality and applicability.
  • Comprehensive Evaluation: The model's performance is validated using a substantial dataset (21 subjects), showing consistent improvement in performance for most subjects. An ablation study is also performed.

Weaknesses

The paper discusses the impact of the positional encoding in the ablation study, but this is the least impactful structure compared to other layers (Fig 6). It would be beneficial to explore other positional encoding mechanisms besides using MNI locations for future work, perhaps based on different brain atlases.

Questions

  1. With the number of electrodes differing across subjects, how did you handle the different input sizes?
  2. Instead of plotting all the sEEG channels, it would be helpful to show the selected channels actually used in training.
  3. What is the (relative) root mean square error for the prediction?

Limitations

The method could provide valuable insights for various BCI studies if the authors can demonstrate its performance on more complex tasks. In the Discussion, the authors mention that developing a multi-task model is planned for future work.

Author Response

Thank you for your thoughtful feedback and question. We were really excited to read that "The proposed framework effectively addresses the heterogeneity across subjects, a significant challenge in sEEG data processing". Here, we provide our response to your questions and the results from new experiments that we ran to address your concerns. If you would like any further clarifications, please let us know.

Weaknesses

The paper discusses the impact of the positional encoding in the ablation study, but this is the least impactful structure compared to other layers (Fig 6). It would be beneficial to explore other positional encoding mechanisms besides using MNI locations for future work, perhaps based on different brain atlases.

Thank you for the insightful suggestion! We echo your observation that our proposed spatial positional encoding only improves decoding performance slightly and agree that future work should identify better ways to encode the location of sEEG electrodes in the brain. Using different brain atlases is a great suggestion that we will definitely look into. Unfortunately, for this dataset, we only have the MNI coordinates of the sEEG electrodes and therefore we are not able to test whether encoding position with some other atlases would achieve better decoding results. However, we certainly look forward to exploring such ideas in the future.

The best we could do to address your feedback was to compare our spatial positional encoding approach against other positional encoding approaches. The results are summarized in Table 2 of our General Response and suggest that our proposed positional encoding works as well as, or better than, other approaches. However, as you pointed out, the explored positional encoding schemes are far from comprehensive, indicating that there is still a lot of margin for improving spatial positional encoding mechanisms. We will make sure to discuss the idea of encoding electrode location using other brain atlases in section 4.5 of our revised manuscript.

Questions

  1. With the number of electrodes differing across subjects, how did you handle the different input sizes?

This is a great question. Our model is capable of handling different input sizes due to the ability of convolutional neural networks and transformers to accept inputs of varying lengths. Prior to processing the latents with our architecture's MLP, we pad the latents of all subjects to a fixed length. For the mathematical details, we refer you to section 3.2 of our manuscript. Here, we try to provide a more intuitive explanation of why our network is capable of handling inputs of different sizes.

The input to our network, $X \in \mathbb{R}^{E \times T}$, is a 2D array of electrodes $\times$ timepoints, where $E$ refers to the electrode dimension and $T$ to the time dimension. The tokenizer performs $K$ 1-D convolutions on the input along the time dimension $T$, while parallelizing the computation across the $E$ dimension (electrodes). This operation returns a latent $z \in \mathbb{R}^{E \times T \times K}$, where $K$ refers to the number of convolutional kernels.

Then, self-attention is performed on $z$ along the time dimension, while computation is parallelized across the $E$ dimension. Self-attention can accept sequences of arbitrary length, so this operation can be performed no matter the number of timepoints in the latent $z$. The output of the self-attention has the same size as the input, returning a latent of the form $z \in \mathbb{R}^{E \times T \times K}$. Then positional encodings are added to the latents, which does not alter the shape of $z$. Another self-attention operation is performed, this time along the $E$ dimension of $z$, parallelized across the $T$ dimension (timepoints), which again preserves the size of the latent $z \in \mathbb{R}^{E \times T \times K}$.

At this point, the latents are unrolled to get a latent of the form $z \in \mathbb{R}^{E \cdot T \cdot K}$. $T$ and $K$ have the same dimensionality across subjects; $E$ (the number of electrodes), however, does not. Therefore, each subject's latent $z \in \mathbb{R}^{E \cdot T \cdot K}$ has a unique dimensionality. This latent is projected to a lower-dimensional space through a multi-layer perceptron, which requires a fixed input length. Therefore, at this stage the latents are padded with zeros to a fixed length $E_{max} \cdot T \cdot K$ that is common to all subjects. For convenience, $E_{max}$ is set to the maximum number of electrodes across subjects in our cohort. The MLP projects the latent to $z \in \mathbb{R}^{F}$, where $F$ represents a small number of features, which are then projected through the subject-specific MLPs to a single number corresponding to the response time of a subject for a given trial. We hope that this explanation answers your question.
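
To make the shape bookkeeping above concrete, here is a minimal, hypothetical PyTorch sketch of this data flow; all sizes, the kernel width, the head count, and the random stand-in for the projected positional encodings are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

E, T, K, E_max, F = 87, 160, 16, 120, 32       # illustrative sizes; E varies per subject

x = torch.randn(E, T)                          # one trial: electrodes x timepoints
tokenizer = nn.Conv1d(1, K, kernel_size=5, padding=2)
z = tokenizer(x.unsqueeze(1)).permute(0, 2, 1) # (E, T, K): K kernels shared over electrodes

time_attn = nn.MultiheadAttention(K, num_heads=4, batch_first=True)
z, _ = time_attn(z, z, z)                      # attend over T, parallelized across E

z = z + torch.randn(E, 1, K)                   # stand-in for projected spatial PE

space_attn = nn.MultiheadAttention(K, num_heads=4, batch_first=True)
zs = z.permute(1, 0, 2)                        # (T, E, K): attend over E, parallel over T
zs, _ = space_attn(zs, zs, zs)
z = zs.permute(1, 0, 2)                        # back to (E, T, K)

flat = z.reshape(-1)                           # (E*T*K,): unique length per subject
padded = torch.zeros(E_max * T * K)            # zero-pad to the cohort-wide maximum
padded[: flat.numel()] = flat

trunk = nn.Linear(E_max * T * K, F)            # shared MLP down to F features
head = nn.Linear(F, 1)                         # subject-specific regression head
print(head(trunk(padded)).shape)               # torch.Size([1]): predicted response time
```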

  2. Instead of plotting all the sEEG channels, it would be helpful to show the selected channels actually used in training.

This is a great suggestion. Please take a look at Figure 2 of the pdf we submitted with our General Response and let us know if you have any further suggestions. We will appreciate your input on this.

  3. What is the (relative) root mean square error for the prediction?

For the multi-subject model, the root mean square error for all predicted response times in the test set, across subjects, is RMSE = 0.082 ± 0.002 sec (mean ± sem). For reference, the mean response time in the test set across subjects is 0.41 ± 0.005 sec (mean ± sem). We will make sure to include this in section 4.3 of our revised manuscript.

Comment

Thanks for the detailed response and additional analysis. The new experiments substantially enhance the quality of the work.

Comment

Thank you for your response. We are happy to read that our new experiments substantially enhanced the quality of the work.

Official Review (Rating: 4)

This paper presents a framework and architecture for decoding behavior across subjects using stereotactic electroencephalography (sEEG) data, addressing the challenge of electrode variability. By tokenizing neural activity with convolutions and employing self-attention mechanisms along with a positional encoding scheme based on MNI coordinates, the model extracts effective spatiotemporal neural representations. The study demonstrates successful decoding of behavioral response times from 21 subjects' data and shows that the pretrained neural representations can be transferred to new subjects with minimal training data. This work offers a scalable approach for integrating and decoding sEEG data across multiple subjects.

Strengths

  1. Multi-Subject Generalization: The framework's ability to generalize across subjects by combining data from multiple individuals and training a unified model is a substantial step forward compared to traditional single-subject approaches.

  2. Methodology: The detailed methodology, including signal preprocessing and a bootstrap randomization test for identifying significant electrodes, ensures the robustness of the proposed approach.

  3. Clarity: The paper is well-organized and clearly written, making it accessible to readers from both neuroscience and machine learning backgrounds.

Weaknesses

Please see the Questions section below.

Questions

  1. Scale and diversity of the dataset: The paper mentions using data from 21 subjects, totaling 3600 behavioral trials and 100 hours of sEEG recordings. Is this amount of data sufficient to represent the neural activity patterns of all subjects? Could the diversity of the dataset be insufficient to support the model's generalization capabilities?

  2. Effectiveness of spatial positional encoding: The paper proposes a spatial positional encoding method based on MNI coordinates to handle electrode placement variability across subjects. Has the effectiveness of this method been thoroughly validated? Are there comparative experiments showing that this encoding method is superior to other possible encoding methods?

  3. Individualized decoder heads: The paper mentions that each subject has a personalized task head for downstream decoding tasks. How effective is this approach when dealing with new subjects? Are there detailed experimental results demonstrating the performance of this method on new subjects?

  4. Computational complexity and scalability: The paper outlines a complex model architecture involving convolutional tokenization, self-attention in both time and electrode dimensions, and individualized regression heads. What are the computational requirements for training and running this model? Is the approach scalable to larger datasets or real-time applications?

Limitations

Please see the Questions section above.

Author Response

Thank you for your thought-provoking feedback. We appreciate reading that "combining data from multiple individuals and training a unified model is a substantial step forward compared to traditional single-subject approaches"! Here, we provide our response to your questions and the results from new experiments that we ran to address your concerns. We are available to provide further clarification, if needed.

  1. Scale and diversity of the dataset: The paper mentions using data from 21 subjects, totaling 3600 behavioral trials and 100 hours of sEEG recordings. Is this amount of data sufficient to represent the neural activity patterns of all subjects? Could the diversity of the dataset be insufficient to support the model's generalization capabilities?

This is a very thoughtful concern. It is likely that the amount of data in our dataset is insufficient to represent the neural activity patterns of all humans in a global sense. However, our dataset is larger than other recent sEEG datasets (Angrick (2021), Petrosyan (2022), Meng (2021), Wu (2022), and Wu (2024) have data from 12 subjects or fewer), suggesting that it is more likely to represent general neural activity patterns than most other sEEG datasets.

In terms of our dataset's diversity: (1) our cohort is composed of 13 females and 8 males, (2) subjects' ages range from 16 to 57 years old, and (3) electrode number & placement for each subject is unique (based solely on clinical needs). Across subjects, electrodes span white and grey matter; cortical, subcortical, and deep structures; and every major structural region of the brain (see Figure 2 of our General Response pdf). Our dataset is very diverse, suggesting that our model, trained on the combined data of all subjects, is likely to generalize.

Thanks for the food for thought, we will include this discussion in our revised manuscript.

  2. Effectiveness of spatial positional encoding: The paper proposes a spatial positional encoding method based on MNI coordinates to handle electrode placement variability across subjects. Has the effectiveness of this method been thoroughly validated? Are there comparative experiments showing that this encoding method is superior to other possible encoding methods?

Thanks for pointing out this limitation. In response to this concern, we validated our spatial positional encoding against other positional encodings. The results are summarized in Table 2 of our General Response and suggest that our proposed positional encoding, based on MNI coordinates, works as well as or better than other approaches.

We believe it is also important to note that the MNI-Fourier and MNI-RBF positional encoding schemes outperform other approaches, indicating that informing models with sEEG electrode location boosts their performance. However, the performance gains are small, indicating that there is still a lot of margin for improving spatial positional encoding mechanisms. We will make sure to discuss these findings in section 4.5 of our revised manuscript.

  3. Individualized decoder heads: The paper mentions that each subject has a personalized task head for downstream decoding tasks. How effective is this approach when dealing with new subjects? Are there detailed experimental results demonstrating the performance of this method on new subjects?

This is a great question. We echo that the ability to deal with new subjects is very important to support our model's generalization capabilities.

In section 4.4 of our submitted manuscript we demonstrate that our approach is effective when dealing with new subjects. Specifically, in that section, we take a leave-one-out cross-validation approach in which we train 21 models (one for each subject in our cohort), each trained on the combined data of all subjects except one. Then, we use the weights of the pretrained model as the basis for finetuning, training a new single-subject model with the data of the left-out subject. Across subjects, the models trained on other subjects and finetuned to a new one achieved an average test set $R^{2}$ score of $0.37 \pm 0.06$ (mean ± sem). Importantly, the performance of those models is superior to that of models trained from scratch and outperforms all baseline models (see Table 1 of our General Response) by $\Delta R^{2} \geq 0.10$.

  4. Computational complexity and scalability: The paper outlines a complex model architecture involving convolutional tokenization, self-attention in both time and electrode dimensions, and individualized regression heads. What are the computational requirements for training and running this model? Is the approach scalable to larger datasets or real-time applications?

The computational resources used to train our model are provided in lines 617-619 of our submitted manuscript. Single- and multi-subject models train within 5 mins and 1 hour, respectively, on a machine with an AMD EPYC 7502P processor and one Nvidia A40 GPU, which is well within the resources of most computational research labs. The memory requirement to train the multi-subject model with a batch size of 1024 is ~8 GiB, making model training tractable on less powerful GPUs as well.

In terms of scalability, assuming that the batch size remains fixed irrespective of the dataset size, the aforementioned computational resources should be enough to train the model. Naturally, the training time would scale linearly with the number of available trials. However, even a dataset of ~80K trials would train in roughly 24 hrs, which is very manageable considering the dataset size.

To investigate whether our model can be used in real time, we measured the model's inference time on a server and a laptop (see Table 3 of our General Response). All inference times are on the order of a few msec, which easily allows for integrating our model with real-time systems. We will make sure to add this result to our revised manuscript.

Comment

Thank you for the authors' efforts and detailed responses. However, I still have concerns regarding the scale of the dataset. While the authors mention that recent sEEG datasets contain fewer than 12 subjects, this does not sufficiently justify that 23 subjects are adequate. In related fields, datasets often include 50, 100, or even more subjects to ensure robustness and generalizability.

I have raised my rating to 4.

Comment

I agree that a larger dataset is always preferable. However, for educational purposes only, I must emphasize that collecting a sizable dataset of 50 to 100 subjects is extremely challenging. The first line of treatment for epilepsy typically involves anti-epileptic drugs. Only when these treatments fail after multiple attempts are patients considered for intracranial EEG monitoring. Depending on the size of the epilepsy center, gathering 50-100 sEEG subjects could take years, if not decades, not to mention the additional difficulty of recruiting them for behavioral studies.

Comment

Thank you for your reminder, and I acknowledge the challenges of data collection from a medical perspective. However, there are still larger, publicly available sEEG datasets. If collecting sEEG data in epilepsy is too difficult, it might be worth considering using other types of brain signals, which could be more suitable for applying machine learning techniques.

Comment

Thank you for your response and for updating your score! We are very happy that our responses and new experiments addressed at least some of your concerns. Please find our answers to your new comments below:

Thank you for the authors' efforts and detailed responses. However, I still have concerns regarding the scale of the dataset. While the authors mention that recent sEEG datasets contain fewer than 12 subjects, this does not sufficiently justify that 23 subjects are adequate. In related fields, datasets often include 50, 100, or even more subjects to ensure robustness and generalizability.

We echo that a larger dataset (50-100 subjects) would provide stronger evidence for the model's generalization capabilities. However, as reviewer o6UF mentioned, "gathering 50-100 sEEG subjects could take years, if not decades, not to mention the additional difficulty of recruiting them for behavioral studies". Importantly, if the scientific community found value in studies containing 12 subjects or fewer, it is likely that it will find value in our 21-subject study as well.

I acknowledge the challenges of data collection from a medical perspective. However, there are still larger, publicly available sEEG datasets.

This is very insightful. If you could point us towards those datasets, by specifying either the publication with which they are associated or the link to the dataset repository, we would be very thankful and we look forward to working with those datasets in the future.

If collecting sEEG data in epilepsy is too difficult, it might be worth considering using other types of brain signals, which could be more suitable for applying machine learning techniques.

Thank you for your suggestion. While there are certainly many different types of brain signals that can be analyzed using machine learning techniques, possibly with larger datasets, we believe that there is a lot of value in applying machine learning techniques to sEEG datasets because:

  1. sEEG is currently the gold-standard invasive neural recording modality used in humans. Therefore, improving neural decoding based on sEEG brings it closer to clinical translation compared to other neural recording modalities, which might provide a lot of data from animal models but have very rarely been used in humans due to safety or other concerns (such as microelectrode array recordings, for example). Edit: We would also like to emphasize that even if another neural recording modality were used, the only clinical population approved for collecting intracranial neural recordings would be epilepsy patients. Therefore, the challenges associated with collecting a large dataset would carry over to any other modality as well.
  2. There is a lot of scientific value in showing that machine learning tools can be used on smaller datasets, as in numerous fields (healthcare & medicine, astronomy, environmental science, to name a few) collecting large datasets can be extremely costly and time-consuming. Our work is especially valuable in that sense, since we show that combining many small datasets, despite data heterogeneity, can lead to better machine learning models compared to training on the smaller datasets individually.

Thanks for engaging in the discussion! We very much appreciate your insights and truly believe that our work has become stronger based on your feedback and suggestions.

Official Review (Rating: 6)

The paper presents a novel approach to sEEG decoding. Authors highlight the benefit of using data from various subjects for training. However, due to the nature of the sEEG technique, collection of such data is difficult. Authors provide a new deep learning based decoding approach which utilizes spatial positional encoding (to provide a model with information on electrodes’ locations), temporal and spatial attention, MLP and subject-specific regression heads. Then, authors train their model in multiple frameworks and highlight the efficacy of multi-subject training.

Strengths

The paper highlights a novel approach to sEEG decoding which enables training of the model on multiple subjects. Generalization across subjects is overall a very important and difficult task to achieve in many modalities and applications of neuroimaging. The authors present a method which has value in itself and provides ideas for future research in this direction. The paper clearly presents the ideas and performed experiments. Visualizations help to understand both the data collection process and the deep learning architecture.

Weaknesses

  1. No comparison with other methods during within-subject training was done. The authors only provide the results for their own architecture in within-subject experiments. Therefore, they only show that their multi-subject trained model is stronger than their single-subject model. However, it is possible that some existing single-subject state-of-the-art models would be more effective than the authors' multi-subject model on the authors' data. The importance of these comparisons is magnified by the fact that the authors' dataset will stay private (therefore only the authors of this paper can provide metrics for other models applied to this dataset), and by the reasonably large size of the authors' model (it might be too big to be trained efficiently on a single person's data, while other models developed for single-subject tasks might be lighter and train better on small datasets).
  2. Decoding performance with and without spatial positional encoding seems very similar. For such a small difference, it would be interesting to see whether it is statistically significant.

Questions

  1. How many trainable and frozen parameters were there during training and finetuning of the model?
  2. How large is the variance in the response time within each subject?
  3. Are all K convolution kernels in the convolution tokenizer the same for all electrodes?
  4. How many hours of recording data were used overall and on average per subject (only electrode-hours are provided)?
  5. What is the statistical significance of the metric increase when using positional encoding compared to not using it?

Limitations

Yes, authors provide a good description of the limitations and address potential topics for future research (larger datasets, self-supervised pre-training).

Author Response

Thank you for your thoughtful feedback and suggestions. We are excited to read that our "method has a value in itself and provides ideas for future research"! Here, we provide our response to your comments/questions and the results from new experiments that we ran to address your concerns. We hope that our answers will clear any lingering concerns. We are also available to provide further clarification, if needed.

Weaknesses

  1. No comparison with other methods during within-subject training was done. The authors only provide the results for their own architecture in within-subject experiments. Therefore, they only show that their multi-subject trained model is stronger than their single-subject model. However, it is possible that some existing single-subject state-of-the-art models would be more effective than the authors' multi-subject model on the authors' data. The importance of these comparisons is magnified by the fact that the authors' dataset will stay private (therefore only the authors of this paper can provide metrics for other models applied to this dataset), and by the reasonably large size of the authors' model (it might be too big to be trained efficiently on a single person's data, while other models developed for single-subject tasks might be lighter and train better on small datasets).

We thank the reviewer for this constructive criticism. To address this, we trained classic and state-of-the-art single-subject models to compare against our model. The results are summarized in Table 1 of our General Response (see also Figure 1 of the attached pdf) and suggest that our architecture outperforms the baselines when trained on either single-subject or multi-subject data. Specifically, our single-subject models outperformed all baselines by $\Delta R^{2} \geq 0.03$, while our multi-subject models outperformed all baselines by $\Delta R^{2} \geq 0.12$. Importantly, our transfer-learned single-subject models also beat the baselines by $\Delta R^{2} \geq 0.10$. We hope that these comparisons are sufficient to address your concerns. We plan to include this analysis in section 4 of our revised manuscript.

  2. Decoding performance with and without spatial positional encoding seems very similar. For such a small difference, it would be interesting to see whether it is statistically significant.

To identify whether the difference in decoding performance with and without spatial positional encoding is statistically significant, we performed a Wilcoxon rank-sum test between the groups: (1) $R^{2}$ of all subjects obtained by training the multi-subject model with spatial positional encoding, and (2) $R^{2}$ of all subjects obtained by training the multi-subject model without spatial positional encoding. The test returned a test-statistic = 0.34 and a p-value = 0.73, indicating that there is no significant difference between the performance of the model with and without spatial positional encoding. We plan to include this result in section 4.5 of our revised manuscript.
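
A sketch of this test using SciPy's rank-sum implementation, on hypothetical per-subject scores (the real per-subject $R^2$ values are not reproduced here):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
r2_with_pe = rng.normal(0.39, 0.05, size=21)     # hypothetical per-subject R^2, with PE
r2_without_pe = rng.normal(0.37, 0.05, size=21)  # hypothetical per-subject R^2, without PE

stat, p = ranksums(r2_with_pe, r2_without_pe)
print(f"statistic={stat:.2f}, p-value={p:.2f}")  # p > 0.05 -> difference not significant
```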

Questions

  1. How many trainable and frozen parameters were there during training and finetuning of the model?

The total number of trainable parameters for the multi-subject model is 797,095. Of those, 753,394 are shared across subjects and the remaining 43,701 are subject-specific (2,081 parameters per subject). When training single-subject models (section 4.2) and multi-session, multi-subject models (section 4.3), all parameters of the model (shared + subject-specific) were trained. When transferring the pretrained multi-subject models to new subjects (section 4.4), for the first 400 epochs the shared parameters were kept frozen (753,394 parameters) while the subject-specific parameters were trained (2,081 parameters per subject). For the remaining 600 epochs, all parameters were unfrozen and trained. We plan to include those numbers in section A.2 of our revised manuscript.
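
An illustrative sketch of this two-stage schedule; the module names `trunk` and `head` are hypothetical, and only the epoch counts come from the response above:

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for param in module.parameters():
        param.requires_grad = flag

def transfer_to_new_subject(model, train_one_epoch, total_epochs=1000, frozen_epochs=400):
    set_requires_grad(model.trunk, False)        # freeze the ~753K shared parameters
    set_requires_grad(model.head, True)          # train the ~2K subject-specific ones
    for epoch in range(total_epochs):
        if epoch == frozen_epochs:
            set_requires_grad(model.trunk, True) # unfreeze for the remaining 600 epochs
        train_one_epoch(model)
```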

  2. How large is the variance in the response time within each subject?

For convenience, the mean of the variance of the response times across all 21 subjects of our dataset is $\sigma^{2} = 0.011 \pm 0.0017$ sec$^2$ (mean ± sem). The variance of the response times for each subject is: (18, 9, 5, 6, 7, 3, 4, 5, 17, 25, 4, 9, 9, 13, 21, 19, 8, 31, 4, 11, 5) $\times 10^{-3}$ sec$^2$.

  3. Are all K convolution kernels in the convolution tokenizer the same for all electrodes?

Yes. The K convolutional kernels that our tokenizer uses are the same for all electrodes of all subjects. This design choice increases the effective number of training samples available to the tokenizer, making it less prone to overfitting and increasing the model's robustness. We will make sure to emphasize this in section 3.2.1 of our revised manuscript.

  4. How many hours of recording data were used overall and on average per subject (only electrode-hours are provided)?

The total recording time used during model training was 1.54 hrs. The average per-subject recording time used for model training was 4.39 ± 0.39 mins (mean ± sem). We will make sure to add those numbers in section A.2 of our manuscript.

  5. What is the statistical significance of the metric increase when using positional encoding compared to not using it?

The increase in the performance due to spatial positional encoding compared to not using it is not statistically significant. Please refer to our response to your comment under Weaknesses (bullet point No. 2) for a detailed description of how we obtained this result.

Author Response (General Response)

We thank the reviewers for their thoughtful feedback. The reviewers pointed out that our unified multi-session, multi-subject modeling approach "is a substantial step forward compared to traditional single-subject approaches" (pdN2) that "provides ideas for future research" (VKBz).

Some highlights from the reviewers:

  • Method: "the paper highlights a novel approach to sEEG decoding which enables training of their model on multiple subjects" (VKBz), "The detailed methodology [...] ensures the robustness of the proposed approach" (pdN2), "The ability to pretrain the model on a larger multi-subject dataset and fine-tune it for new subjects with minimal training is a valuable feature, enhancing the model's practicality and applicability" (o6UF)

  • Impact: "Authors present a method which has a value in itself and provides ideas for future research in this direction" (VKBz), "The paper addresses an important issue of data heterogeneity in sEEG recordings among participants which makes constructing large dataset challenging" (j3GS), "The proposed framework effectively addresses the heterogeneity across subjects, a significant challenge in sEEG data processing" (o6UF), "Generalization across subjects is overall a very important and difficult task to achieve" (VKBz)

  • Experiments & Evaluation: "The model's performance is validated using a substantial dataset (21 subjects), showing consistent improvement in performance for most subjects" (o6UF), "The experimental details are clearly provided" (j3GS), "The ablation study is insightful" (j3GS)

  • Writing & Presentation: "The paper is well-organized and clearly written, making it accessible to readers from both neuroscience and machine learning backgrounds" (pdN2), "well-written and easy to follow" (j3GS), "paper clearly presents the ideas and performed experiments" (VKBz)

Based upon reviewer comments, we ran a number of new experiments, and are currently working on the following revisions to our manuscript:

  • Comparisons with single-subject baselines. As suggested by reviewers j3GS & VKBz, we compared the performance of our model against other traditional and state-of-the-art models (see Table 1). The results indicate that our model outperforms other approaches across the board. When trained on single subjects, our model outperforms the baselines by $\Delta R^{2} \geq 0.03$, while when trained on multi-subject data, our model beats the baselines by $\Delta R^{2} \geq 0.12$. Importantly, the single-subject models obtained by finetuning multi-subject models to new subjects also beat the baselines by $\Delta R^{2} \geq 0.10$. Those results demonstrate the power of multi-subject approaches compared to single-subject ones.

Table 1. Comparison of our model's performance against baselines. Results report mean ± sem.

| Model | $R^2$ |
|-------|-------|
| PCA + Wiener Filter | 0.13 ± 0.18 |
| PCA + $L_2$-Regression | 0.17 ± 0.14 |
| PCA + XGBoost | 0.17 ± 0.06 |
| MLP | 0.23 ± 0.06 |
| CNN + MLP | 0.27 ± 0.07 |
| PCA + $L_1$-Regression | 0.27 ± 0.04 |
| Ours (Single-Subject) | 0.30 ± 0.05 |
| Ours (Multi-Subject) | 0.39 ± 0.05 |
| Ours (Multi-Subject + Finetune) | 0.41 ± 0.05 |
| Ours (Single-Subject + Finetune) | 0.37 ± 0.06 |
  • Comparison of other positional encodings (PEs) against ours: All reviewers pointed out that model performance only benefits slightly from our spatial PE. To address this, we trained variants of our architecture using different PEs and compared them against ours (see Table 2). The results suggest that our proposed PE performs as well as, or better than, other approaches. However, we echo the reviewers' concern that the gains are modest. Therefore, we plan to include these results along with a discussion of other possible PEs, based on whole-brain MRIs (j3GS) or on atlases other than MNI (o6UF), in our revised manuscript.

Table 2. Comparison of different PE schemes. Results report mean ± sem.

| PE Type | $R^2$ |
|---------|-------|
| Vaswani (2017) | 0.16 ± 0.04 |
| MNI-Fourier | 0.39 ± 0.05 |
| No PE | 0.37 ± 0.05 |
| MNI-RBF (ours) | 0.39 ± 0.05 |
  • Real-Time applicability. Reviewer pdN2 suggested we check whether our model can be used in real time. To test this, we measured our model's inference time on two machines (see Table 3). The results show that our model can be used in real time, since its inference time is on the order of msec, while real-time systems operate on the order of 100 msec.

Table 3. Inference time on different hardware.

| Machine | CPU (msec) | GPU (msec) |
|---------|------------|------------|
| AMD EPYC 7502P + Nvidia A40 | 9.1 | 5.1 |
| Intel Core i9 + Nvidia A2000 | 4.0 | 7.9 |
  • Separate attention in space & time vs 2-D attention over space & time. Reviewer j3GS suggested we check whether our model's decoding performance would benefit from employing one 2-D self-attention mechanism over the time & space dimensions of our data, instead of two separate self-attention mechanisms over those dimensions. We identified that our method outperforms the 2-D attention variant by $\Delta R^{2} = 0.06$ (see Table 4). Our proposed architecture trains faster, too.

Table 4. Comparison of our approach vs the 2-D attention variant. Results report mean ± sem.

| Model Variant | $R^2$ | Training Time |
|---------------|-------|---------------|
| Combined attn in time & space | 0.33 ± 0.05 | 141.6 ± 3.23 mins |
| Separate attn in time & space (Ours) | 0.39 ± 0.05 | 25.7 ± 0.03 mins |

Impact of this work: This work is the first to train unified, multi-session, multi-subject models for neural decoding based on sEEG. We demonstrate our approach on a very diverse dataset, with data from 21 subjects whose electrodes are heterogeneously placed in brain locations unique to each subject. We show that pretrained multi-subject models can be transferred to new subjects, demonstrating their practicality and applicability. This work highlights the power of multi-subject approaches for neural decoding based on sEEG.

Final Decision

This manuscript proposes a novel architecture for decoding stereotactic electroencephalography (sEEG) signals to predict response time in a cognitive task. The main innovation is to leverage a much larger multi-subject dataset, in contrast to existing works based on single-subject models, by dealing with inter-individual variability (e.g., location of electrodes, different SNRs, biological differences). The learning strategy can be summarized in three main components: (i) tokenization of the neural signal using convolutions and extraction of temporal dependencies with self-attention, (ii) extraction of effective spatio-temporal neural representations via a positional encoding scheme based on MNI coordinates and a spatial attention mechanism to select the most informative electrodes, and (iii) a subject-specific prediction layer based on a multi-head regression layer to deal with inherent patient-to-patient variability. An empirical evaluation on a dataset collected from 21 subjects performing a visual color-change task shows that the proposed architecture significantly boosts decoding performance when using multi-subject data as compared to single-subject data.

The Authors were very responsive to the criticism of the Reviewers and devoted meaningful effort to improving their work. The rebuttal was very effective in improving the submitted manuscript with several additional analyses.

Two major concerns did not find an answer in the discussion: (i) the limited size of the dataset, and (ii) the negligible effect of the spatial encoding scheme. While the first limitation is inherent to the domain (scaling up with sEEG is hard, which is the reason for the shortage of larger open datasets), the failure to deal with spatial inter-individual differences reveals that a major challenge in across-individual decoding is still open. For this reason, this work is better suited to a poster presentation.