PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 5, 4, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 3.0
Quality: 3.0
Clarity: 2.5
Significance: 2.8
TL;DR

Scaling Transformers for brain-computer interfaces shows gains restricted to low-data regimes.

Abstract

Keywords
Brain-Computer Interfaces · Neuroscience · Motor Cortex

Reviews and Discussion

Review
Rating: 5

This paper builds an improved foundation model for decoding neural activity into estimates of manual behavior. The authors use a combination of human and monkey data, about 2000 hours in total across multiple behavioral tasks. They show improved results on several downstream decoding tasks and also perform a number of ablation and scaling experiments to explore the benefits of this large foundation model approach. It’s a nice paper overall and well-written.

Main Finding: The key takeaway is that pretraining helps, primarily by improving sample efficiency. However, beyond 1.5 hours of data, there's no additional performance gain, which they attribute to data heterogeneity (in recording devices, patterns of dead channels, subjects, etc.). The generalization to out-of-distribution data is also weak in several specific cases that the authors highlight.

Model Choices:

  • They use a causal transformer and acknowledge that the negative results might be partly due to architectural choices.
  • Regarding tokenization: they group spikes from 32 channels into patches per time step (see the sketch below). This may be suboptimal, especially when transferring across devices, where channel groupings lose their meaning. The model also ignores information about brain region, which is probably important for Neuropixels recordings that can span multiple regions in a single experiment. (I’m not sure if this paper uses Neuropixels recordings, but these are becoming more common in both monkey and human recordings.)
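To make the patch tokenization concrete, here is a minimal sketch of the grouping described above (shapes and names are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

# Binned spike counts: (batch, time, channels)
n_channels, patch_size, d_model = 96, 32, 512
binned = torch.randint(0, 4, (1, 100, n_channels)).float()

# Group every 32 channels into one patch per time step:
# (batch, time, n_patches, patch_size)
patches = binned.view(1, 100, n_channels // patch_size, patch_size)

# Linearly project each patch into a token embedding
embed = nn.Linear(patch_size, d_model)
tokens = embed(patches)  # (1, 100, 3, 512): one token per patch per step
```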

Strengths and Weaknesses

Strengths:

  • Well-written and clear
  • Improvement on the SOTA
  • Addresses an impactful and interesting problem
  • Refreshingly honest about the limitations

Weaknesses:

  • Several of the results figure panels were pretty obscure at first read; I’d recommend looking the figures over again and clarifying them for non-expert readers

Questions

  • Obvious question: is it possible to fix the out-of-distribution problems highlighted in Fig 4? (Maybe this will be the topic of the next paper.)
  • Some of the R2 values of the predictions are fairly low. It might be useful to see a decomposition of the errors across different frequency bands (see the sketch after this list); I’d guess that the highest frequencies are impossible to predict (due to sensor noise or unconstrained motor variability) and the R2 values for the lower frequencies may be better
  • In general it would be nice to see some more predicted traces, so we get a clearer qualitative sense of what these models are predicting
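A rough sketch of the frequency-band error decomposition suggested above (sampling rate, band edges, and signals are all assumptions for illustration; this is not from the paper):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo, hi], btype="band", fs=fs)
    return filtfilt(b, a, x)

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

fs = 50.0  # Hz; assumed 20 ms bin rate
t = np.arange(0, 60, 1 / fs)
y = np.sin(2 * np.pi * 0.5 * t) + 0.1 * np.random.randn(t.size)  # stand-in kinematics
yhat = y + 0.3 * np.random.randn(t.size)                         # stand-in prediction

for lo, hi in [(0.1, 1.0), (1.0, 5.0), (5.0, 20.0)]:  # assumed band edges
    print(f"{lo}-{hi} Hz R2: {r2(bandpass(y, lo, hi, fs), bandpass(yhat, lo, hi, fs)):.3f}")
```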

Limitations

adequately addressed

Justification for Final Rating

It's a good paper, and the authors had good responses to the reviews. I maintain my score.

Formatting Issues

n/a

Author Response

Thank you for the positive assessment. We respond to your points below.

  • Re: brain region information. Nearly all devices in our data are placed in hand/arm regions in the motor cortex. Within this scope, even the coarse anatomy varies across humans (and between species), so it is not clear how to use a general coordinate system or discriminative region label. Such labels may be non-critical, thankfully, as [Azabou 25] shows that the neural activity itself is predictive of the brain region it is sourced from.
  • Fixing out-of-distribution failures in Fig 4: We hope so, but do not have a clear path forward. As a thought experiment, we infer that POYO should be more robust to the synthetic shuffles, but likely only with the a priori foresight to tune the unit embeddings first.
  • Error decomposition / traces: Frequency-band error decomposition is an interesting suggestion. In constrained monkey experiments, decoding R2s are quite high, so sensor noise is likely not dominant. We didn’t observe salient qualitative changes in predictions beyond previous decoding models, but will add an example trace prediction for some of our evaluation tasks in the appendix.

Azabou et al, 2025. Multi-session, multi-task neural decoding from distinct cell-types and brain regions


We attach below a global rebuttal summary if you are interested in how your feedback relates to other reviewers’ comments:

4GhD/iNS9/cDht sought more understanding of how NDT’s design might cause the reported limitations. We share this desire but believe addressing it clearly will require scaling alternative models, and we exhausted our resources scaling NDT3 (20K A100 hours). We do believe that NDT3’s design was appropriate as the first choice to scale to 100M+ parameters as 1) NDT2 showed the main choice of patchwise tokenization worked well in the previous order of magnitude, 2) NDT3 outperformed the alternate candidate (POYO) in our hands (Fig 19), and 3) NDT3 closely adheres to the vanilla Transformer that successfully scales in other fields. We are hopeful for future improvements but note that cross-subject transfer is broadly challenging beyond our design and neural modality (Banville 25, Apicella 24). The most promising path forward appears to be scaling to thousands of subjects (Kaifosh 25, Aristimunha 25), which is not yet feasible for implanted BCIs. Instead, we personally are actively exploring richer neural data features and alternate tokenization, as some reviewers have suggested. More generally, we hope that NDT3’s 90 minute saturation mark and limitation analyses have provided clear goalposts for the field.

The reviewers requested various improvements to clarity. We agree the work is dense and unclear at times, which likely caused some reviewers to request analyses or question claims that were explored in the appendix. We will add 1) summary boxes to our main text sections (4GhD) and 2) a table of contents in the appendix. We will also 3) extend the discussion with an explicit limitations section and restore the impact statement to the main text (iNS9), and 4) tweak figures to improve readability (cDht). Finally, we 5) will add a reference to the supplementary analysis that supports the rigor of our cross-subject evaluation (HeMw).


Apicella 24, Toward cross-subject and cross-session generalization in EEG-based emotion recognition: Systematic review, taxonomy, and methods. https://doi.org/10.1016/j.neucom.2024.128354

Banville 25, Scaling laws for decoding images from brain activity. https://arxiv.org/pdf/2501.15322

Kaifosh et al, 2025, A generic non-invasive neuromotor interface for human-computer interaction. https://www.nature.com/articles/s41586-025-09255-w

Aristimunha 25, EEG Foundation Challenge. https://arxiv.org/abs/2506.19141v1

Review
Rating: 4

The paper introduces Neural Data Transformer (NDT) 3, a standard transformer-based model trained on a large amount of neural recordings. The authors explore the extent to which the large-pretraining/foundation-model paradigm applies to the field of neural decoding and BCI. To that aim, they train their model on ~2000 hrs of neural data coming from either human or monkey subjects (in motor cortical areas) and test downstream behavioural decoding performance as a function of model and dataset size. They identify that large pretraining benefits decoding when available subject data is <1.5 hrs. Furthermore, they explore different ID vs. OOD generalisation scenarios: (1) different task configurations, where distribution shifts are due to the passage of time or the presence/absence of spring-load feedback, (2) continuous vs. trialized training, and (3) unseen behaviour (novel angles in 2D center-out reach).

Strengths and Weaknesses

The paper presents a comprehensive study of the benefits (and limitations) of scaling model pretraining in the field of neural decoding, an important and very active area of modern research with several potential downstream applications.

While the proposed model employs standard tools in ML for decoder-only transformer training (e.g. rotary embeddings, custom tokens, etc.), its scaling analysis reveals important, neural-specific limitations that are worthy of consideration and will inform future neural decoding efforts.

Strengths

  • The paper provides extensive evaluations and supplementary experiments.
  • The identified limitations (saturation of scaling performance) are highly relevant for the field and can inform therapeutic applications.
  • The authors provide a model that can be applied to a variety of downstream tasks with good performance.

Weaknesses

  • The authors mention in the main text that full comparisons with POYO were omitted as it underperformed NDT3 on the FALCON dataset. However, Figure 19 in the appendix suggests a more nuanced situation. For example, in the 16D Reach/Grasp EMG setting (FALCON M1; middle panel in Figure 19), the POYO-like model either matched or surpassed NDT3 performance in the 100% data regime. Similar trends emerge in the other panels (e.g. in M2). Most importantly for the claims of the paper, the POYO-like model seems to enjoy more favourable scaling laws (with performance increasing more rapidly with data size), hinting that scaling it to the ~2000 hrs regime could perhaps have yielded better results.
  • In the paper the authors compare NDT3 vs. NDT2 (Figure 11). However, NDT2 was not trained with behavioural labels, and Figure 9 suggests that training with behaviour is beneficial. Hence a fairer and more interesting comparison would have been NDT2 vs. an NDT3-neural-only variant, to disentangle the contribution of NDT3's design choices.

Minor points

  • In Figure 11 there is a mismatch between the legend for NDT2-100hrs (cross marker) and the plots (circle markers), assuming the color is correct.
  • There is no explicit discussion of the current limitations, and the discussion itself is very limited

Questions

  1. How would NDT3-neural only compare against NDT2?
  2. Why do the authors believe that NDT3 suffers high training instabilities in the H2 task (Figure 10)? Also, why does the pretrained version have such a high initial loss? It is also interesting that the runs that exhibit training instabilities actually result in the best performance.
  3. With data quality being a crucial factor when training foundation models, what would the effect of (automated) spike-sorting be on the quality of the final model? The current pipeline employs only filtering + thresholding, yielding noisy data. While perhaps not directly applicable to the field of BCI (efficient online spike sorting is an area of active development), when applied during pre-training it could potentially improve model performance (albeit also widening the ID vs. OOD gap).

Limitations

There is currently no explicit discussion of the limitations of this work, nor a statement about potential societal impact. The authors could, for example, discuss the still-unexplored axes that might influence model quality (e.g. spike-sorted input data, the potential importance of data diversity across other brain regions), or the identified pathological sensitivity to input order and stereotyped outputs.

Formatting Issues

For some reason Figure 8 was not rendered properly in the PDF (with only the right portion visible and the rest partially occluded), but this fact could very possibly be due to some software glitches on the reviewer's end.

Author Response

We thank the reviewer for their careful reading of our manuscript (e.g. the Fig 11 typo catch), and we agree with your summary. We address your concerns individually below.

Weaknesses

  • POYO-scaling: Due to evaluation noise, we do not believe that any model is significantly different in the 100% M1 point or M2 50%/100% points. Nonetheless, we do also see that POYO started worse than NDT3 but is at parity by the larger data scales. You raise an interesting interpretation that this might imply that POYO will scale faster with more data. We had interpreted this trend as meaning that our POYO reproduction trained poorly at lower scales. Overall, Fig 19 is open to interpretation. For example, NDT2 sees similar slopes as POYO in the right half of each of these plots, so by similar argument we could have aimed to scale NDT2. Since we could not afford to scale pretraining on multiple methods, we judged it best to scale NDT3 after designing it for scaled training and verifying it had highest overall performance in scratch settings. We look forward to future work that provides more rigorous comparison. We believe we have made it relatively easy to demonstrate better scaling than NDT3 with clear goalposts (90 minutes, tying with baselines at 100% FALCON scale, Sec 3.2 generalization analyses).
  • NDT2 objectives: You have identified a subtle implementation detail! In the NDT2 codebase, most models are indeed prepared with a neural-only SSL objective, and probed for behavior labels. However, some of NDT2’s experiments (e.g. Fig 5) simplify this 2-step training into a single model that trains with both neural and behavioral objectives. For fairness (as you say), we have used this joint training objective throughout our evaluation of NDT2 in this work. We mentioned this briefly in Lines 1317-1318. To clarify, we now modify Line 1317 to read:

“The only departure we take from NDT2’s default design is that we train NDT2 with both a neural reconstruction loss (25% masking) and a supervised decoding loss. This joint objective was introduced late in the NDT2 paper (Fig 5) and performs comparably to multi-stage training while simplifying large scale evaluation.”

Questions

  • NDT3-Neural against NDT2: See above.
  • H2: In conversation with researchers familiar with the H2 task, we speculate that our irregular training is just a result of overparameterized Transformers with insufficient regularization. H2 only contains about 600 data points in total. Most works in the field use small RNNs, except Card 25, which adopted the Transformer after collecting much larger volumes of data. The pretrained model loss is higher likely because NDT3 pretrains on continuous regression tasks while H2 is a Seq2Seq task. We do not see a reason NDT3 could not perform reasonably on H2 if its pretraining mixed in a larger dataset of Seq2Seq spiking decoding tasks.
  • Spike-sorting: We agree with your intuition that spike-sorted inputs could enrich the representations learned in pretraining. Practically, this choice is often already made for us, as few datasets are released with broadband data or sorted spikes, and some BCI labs do not save the broadband data due to its large storage footprint. We are broadly interested in the development of models that support richer neural data features, but the field needs convincing evidence that the additional overhead of these features is justified.

Limitations: We initially omitted an explicit Limitations section because our results include a section (3.2) analyzing methodological limitations. We acknowledge, though, that there are other limitations in our work unaddressed by 3.2, and will include some of your suggestions in a new explicit Limitations section in the camera-ready. Similarly, we will restore our impact statement from section A in the appendix to the main text.

By contextualizing why we chose to scale NDT3 over POYO, clarifying NDT2’s joint objective, and lengthening the discussion, we hope we have addressed your concerns on the work’s weaknesses, and ask that you might consider a non-borderline score. Thanks for your time!

Card et al, biorXiv 2025. Long-term independent use of an intracortical brain-computer interface for speech and cursor control


We attach below a global rebuttal summary if you are interested in how your feedback relates to other reviewers’ comments:

4GhD/iNS9/cDht sought more understanding of how NDT’s design might cause the reported limitations. We share this desire but believe addressing it clearly will require scaling alternative models, and we exhausted our resources scaling NDT3 (20K A100 hours). We do believe that NDT3’s design was appropriate as the first choice to scale to 100M+ parameters as 1) NDT2 showed the main choice of patchwise tokenization worked well in the previous order of magnitude, 2) NDT3 outperformed the alternate candidate (POYO) in our hands (Fig 19), and 3) NDT3 closely adheres to the vanilla Transformer that successfully scales in other fields. We are hopeful for future improvements but note that cross-subject transfer is broadly challenging beyond our design and neural modality (Banville 25, Apicella 24). The most promising path forward appears to be scaling to thousands of subjects (Kaifosh 25, Aristimunha 25), which is not yet feasible for implanted BCIs. Instead, we personally are actively exploring richer neural data features and alternate tokenization, as some reviewers have suggested. More generally, we hope that NDT3’s 90 minute saturation mark and limitation analyses have provided clear goalposts for the field.

The reviewers requested various improvements to clarity. We agree the work is dense and unclear at times, which likely caused some reviewers to request analyses or question claims that were explored in the appendix. We will add 1) summary boxes to our main text sections (4GhD) and 2) a table of contents in the appendix. We will also 3) extend the discussion with an explicit limitations section and restore the impact statement to the main text (iNS9), and 4) tweak figures to improve readability (cDht). Finally, we 5) will add a reference to the supplementary analysis that supports the rigor of our cross-subject evaluation (HeMw).


Apicella 24, Toward cross-subject and cross-session generalization in EEG-based emotion recognition: Systematic review, taxonomy, and methods. https://doi.org/10.1016/j.neucom.2024.128354

Banville 25, Scaling laws for decoding images from brain activity. https://arxiv.org/pdf/2501.15322

Kaifosh et al, 2025, A generic non-invasive neuromotor interface for human-computer interaction. https://www.nature.com/articles/s41586-025-09255-w

Aristimunha 25, EEG Foundation Challenge. https://arxiv.org/abs/2506.19141v1

Comment

I would like to thank the authors for the careful rebuttal, which clarified the questions I had, especially the one regarding the NDT2 implementation. I also share the intuition that NDT3's 90-minute saturation mark provides a clear goalpost for future work on scaling pretraining to build upon. While a comparative between-architecture scaling analysis would have provided valuable insights and strengthened the paper, I understand that compute limits impose necessary tradeoffs.

Thanks again for your time and your detailed responses.

Review
Rating: 4

Authors introduce Neural Data Transformer 3 (NDT3), a pretrained autoregressive transformer for decoding motor behavior from intracortical neural activity measured from monkeys and humans. Pre-training the model on up to 2000 hours of neural and behavioral data, the authors evaluate the decoding performance of NDT3 on 8 motor datasets. In this analysis, they study model performance when scaling the number of parameters in the model and the amount of pre-training data used. The authors suggest benefits of scaling both pre-training dataset size and model parameter count, but only when the amount of downstream, task-specific data is very limited (up to ~1.5 hours of data).

Strengths and Weaknesses

Strengths:

  • The authors provide a wide breadth of evaluations, and highlight where NDT3 succeeds and fails. These analyses may offer useful insights for practitioners seeking to scale their neural data decoders.
  • NDT3 achieves strong performance as compared to conventional baseline Wiener Filters in most evaluated tasks.
  • Although saturation limits were observed with relatively low quantities of motor behavior pre-training data, such pre-training strategies may be applicable in regimes where data collection is prohibitively expensive or only small amounts of new-subject calibration data can be acquired (e.g., end-user BCI applications).

Weaknesses:

  • A primary motivation of this paper and neural foundation models in general is to support generalization to new tasks and subjects. Minimal improvements are made in this regard (the most notable improvements from pre-training occur when subject data is available in the pre-training data or when downstream data is very limited) and little guidance is provided on how this may be achieved.
  • The benefits of pre-training are limited when there is no data from the test subject in the pre-training data. Outside of laboratory settings, it is rarely the case that we would have training data available for a BCI user (even if from a different task), limiting practicality of this type of strategy.
  • At many points in the paper it is unclear what type of generalization is being accounted for and whether it is actually cross-subject generalization. The authors highlight that their goal is to study "scaling of cross-subject generalization" (Line 125), but in many cases data from the test subject (but not task-specific data) is included in training (unless I am misunderstanding this). I disagree that this is cross-subject generalization, since subject-specific neural data distributions can be learned from the cross-task data.

Questions

  • Clarification: Do all 200 hour and 2000 hour pre-training datasets have data from the test subject (even if not specific to the test task)?
  • If feasible, can the authors provide any numerical results for how generalization scales as a factor of number of subjects (or alternatively number of sessions, number of hours, etc.) in the training dataset while strictly excluding any data from the test subject?
  • How does performance scale with pre-training data when you can only train a linear readout (instead of fine-tuning all parameters of the model)? Would this not be a more stringent test of the generalization of pre-trained features with scaling?

Limitations

yes

Justification for Final Rating

Updating my score to a 4 based on rebuttal discussions with the authors. This discussion helped resolve misconceptions that I had about their cross-subject generalization training and evaluation methodology. What remains unresolved is a more detailed analysis of how cross-subject generalization scales, quantified in terms of amount of training data and number of subjects, when absolutely no data from the subject is available in pre-training and variable amounts of subject data (e.g., 10, 20, ... minutes) are available for fine-tuning or linear decoding.

Regardless of this unresolved aspect, I believe this study offers valuable insights into challenges associated with scaling models for neural data modeling.

Formatting Issues

N/A

Author Response

We thank you for your time and detailed review of our paper. We agree with your summary, with one small clarification: 90 minutes of downstream data is far from trivial. As evidence, prior studies (POYO, NDT2) only showed improvements when tuning on datasets of single experimental sessions lasting ~1-10 minutes, and intracortical neural data releases overall remain short; monkey studies often publish with 100-200 minutes total. In this framing, the fact that NDT3’s benefit saturates at 90 minutes is a non-issue. We stated that the benefit is of limited practical utility because in recent years BCI research has begun adopting multi-session modeling as a norm (Line 132), and in this regime 90 minutes of calibration data could be achieved after, say, 3-5 BCI sessions. In this more realistic regime, unaddressed by prior neural pretraining work, we envision NDT3 as still being useful when onboarding new BCI users (in the lab and the real world), or when testing new behavioral paradigms with monkeys.

With this context, we next address the main body of your critiques and questions, which center on the issue of cross-subject generalization.

Our view is that our work clearly identifies, for the first time, significant cross-subject transfer issues when scaling models on spiking activity. Thus NDT3’s contribution is not to improve cross-subject transfer in this regime, because there is no baseline to improve over at our scale of evaluation, but to definitively establish the challenge of cross-subject generalization in this domain. We agree that the former interpretation would justify much of your critique (i.e. weakness 1), and we believe our miscommunication may have driven much of your subsequent review. Specifically, addressing weakness 3 as to whether our evaluation addresses cross-subject generalization, we state at the start of 2.3 (Line 122):

Our main evaluation uses four human and four monkey datasets [...] All downstream monkeys are held-out of pretraining, and all humans are held-out of < 2 khr models. That is, we evaluate scaling of cross-subject generalization.

That is, only the 2khr model contained data from some evaluation subjects (Addressing Q1). Of course, the 2khr setting is our largest scale, and we realize now that we asserted (Line 125) that the work nonetheless evaluated cross-subject generalization without further justification. The following explanation was omitted:

We did not hold humans out of our full 2khr model since we have so few human datasets to begin with, and pre-determined evaluation datasets overlapped with subjects with significant data. To alleviate concerns about subject leakage, we verified that the 2khr model, with subject leakage, did not outperform other models on the test datasets with leaked subjects (Fig 11, top row), and did outperform other models on a held-out human evaluation (Fig 11, top right). This observation (Line 1157) supports our claim that NDT3’s demonstrated benefits all arise in a setting where no evaluation subject data was seen in pretraining (contrasting with the critique in weakness 2), and that our evaluation reflects scaling of cross-subject gains overall. To summarize, our pretraining data rarely included data from evaluation subjects, and where it did, models did not benefit from this data.

If you now agree that we did evaluate cross-subject generalization, we can address Q2, on scaling with the number of subjects. Our main results (Fig 3C/D) show that performance increases with (cross-subject) pretraining data hours. Individual task results varied greatly, so we did not attempt more precise scaling laws. More precise quantification of the scaling benefit of cross-subject hours is measured in NDT2 Fig 3, 4 [Ye 23], for one downstream evaluation. Unfortunately, your strongest request, identifying scaling with respect to the number of training subjects, is challenging because the data available from different subjects is heterogeneous in volume and quality (Fig 2A) and, more simply, because the number of subjects in intracortical neural data is low overall. We hope that our work motivates future neural data collection efforts with hundreds of subjects, enabling precisely your requested analysis.

Finally, your Q3 is interesting, but unfortunately we have not run analyses on how probes might scale with pretraining. We agree probes would provide a stricter view of pre-trained features. We chose to evaluate E2E fine-tuning to best reflect downstream usage. Downstream BCI deployment will most likely value E2E tuning's greater-or-equal performance relative to probes, because tuning cost is not significant at NDT3's size. It is our understanding that linear probes are most frequently used to evaluate representation learning when model pretraining differs from the downstream application (e.g. ImageNet/CLIP to generic vision tasks). This setting is also relevant for NDT3, so probe-based evaluations are of interest for future work.

We believe your main concern with this work has been around pretraining leakage and our claim of cross-subject generalization. We will add a summary of our response here around Line 125. Thank you for advocating for more clarity on this point. We hope that this addition and our response here have alleviated your concerns, and respectfully request a reconsideration of your score if so.

Ye et al, 2023. Neural Data Transformer 2. https://openreview.net/pdf?id=CBBtMnlTGq


We attach below a global rebuttal summary if you are interested in how your feedback relates to other reviewers’ comments:

4GhD/iNS9/cDht sought more understanding of how NDT’s design might cause the reported limitations. We share this desire but believe addressing it clearly will require scaling alternative models, and we exhausted our resources scaling NDT3 (20K A100 hours). We do believe that NDT3’s design was appropriate as the first choice to scale to 100M+ parameters as 1) NDT2 showed the main choice of patchwise tokenization worked well in the previous order of magnitude, 2) NDT3 outperformed the alternate candidate (POYO) in our hands (Fig 19), and 3) NDT3 closely adheres to the vanilla Transformer that successfully scales in other fields. We are hopeful for future improvements but note that cross-subject transfer is broadly challenging beyond our design and neural modality (Banville 25, Apicella 24). The most promising path forward appears to be scaling to thousands of subjects (Kaifosh 25, Aristimunha 25), which is not yet feasible for implanted BCIs. Instead, we personally are actively exploring richer neural data features and alternate tokenization, as some reviewers have suggested. More generally, we hope that NDT3’s 90 minute saturation mark and limitation analyses have provided clear goalposts for the field.

The reviewers requested various improvements to clarity. We agree the work is dense and unclear at times, which likely caused some reviewers to request analyses or question claims that were explored in the appendix. We will add 1) summary boxes to our main text sections (4GhD) and 2) a table of contents in the appendix. We will also 3) extend the discussion with an explicit limitations section and restore the impact statement to the main text (iNS9), and 4) tweak figures to improve readability (cDht). Finally, we 5) will add a reference to the supplementary analysis that supports the rigor of our cross-subject evaluation (HeMw).


Apicella 24, Toward cross-subject and cross-session generalization in EEG-based emotion recognition: Systematic review, taxonomy, and methods. https://doi.org/10.1016/j.neucom.2024.128354

Banville 25, Scaling laws for decoding images from brain activity. https://arxiv.org/pdf/2501.15322

Kaifosh et al, 2025, A generic non-invasive neuromotor interface for human-computer interaction. https://www.nature.com/articles/s41586-025-09255-w

Aristimunha 25, EEG Foundation Challenge. https://arxiv.org/abs/2506.19141v1

Comment

I'd like to thank the authors for the detailed responses to my questions. These responses helped clarify the proposed advances of their work and framing of the paper.

Although I believe the challenge of cross-subject generalization in neural data modeling is well established in prior work, I agree with the authors that it has not been shown in this level of detail and scale, which will hopefully prompt future work within the field on developing better-generalizing models, making more neural datasets public and accessible, and re-thinking data collection strategies to maximize gains in generalization.

Thank you again for the time and effort that you have put into these responses.

Review
Rating: 5

This paper introduces the Neural Data Transformer 3 (NDT-3) model architecture for motor decoding from intracortical spiking activity, which is a generalist, causal transformer inspired by models such as GATO. NDT-3 builds upon NDT-2 but with some key changes: moving from a masked autoencoding (MAE) architecture to a causal transformer, and accepting behaviour tokens as inputs in addition to neural data. Other components mostly remain the same except for some implementation details and evaluation schemes. The model is pretrained in self-supervised fashion on 1.75k hours of neural data, after which it is finetuned using SSL on a target calibration dataset before a decoder layer is learnt through supervised learning for behavioural decoding. The paper comprehensively evaluates the scaling properties of a "vanilla" (ViT-like) causal Transformer architecture and the cross-subject transferability of such pretrained BCI decoders. While performance results are overall positive, it highlights some concerns with cross-subject transfer and with handling the complexity associated with output dimensionality and generalisation to new behaviours. Overall, it is a performant model for neural decoding, and the paper highlights key avenues for future work in this subfield.

Strengths and Weaknesses

Strengths

  • This is among the largest pretrained models for neural data/neural decoding, especially in the context of intracortical spiking activity, and represents a comprehensive effort at scaling model size and pretraining data.
  • While results are mostly positive for scaling, the paper highlights concerns related to output dimensionality, generalisation to held-out/unseen behaviour, and cross-subject transfer of pretrained models.
  • The experimental evaluation of NDT-3 (and to some extent NDT-2 in comparison) is comprehensive, although there is a concern that only NDT-style models are considered (for the most part; see Weaknesses).

Weaknesses

  • It remains an open question how many of these results, especially negative transfer results, are a consequence of design choices in NDT-style models. Some possible culprits:

    • Patch tokenisation scheme adopted from ViT models
    • Using binned spike counts instead of actual spike timings (as in POYO)
    • Autoregressive transformer architecture which is not entirely "causal" – patches of neural tokens and multiple behaviours are predicted autoregressively even at the same real-world time point
    • Pretraining + "continual" finetuning scheme – SSL finetuning + supervised learning, with the lack of post-training (maybe with RL) causing models to underperform?

    and so on. While this is not meant to be a request to figure all of these questions out in one paper (which is already dense in terms of evaluations), insights would be appreciated.

  • The paper doesn't show results on FALCON H2, which is the speech task based on Willett et al. (2023) (or similar) and can have long trials of 10s+ length. This highlights some problems with Transformer-based approaches and their ability to deal with long context tasks (issues with length generalisation and/or prohibitive memory requirements) especially when dense labels are not available (the task only has trial-level sentence labels if I understand correctly). Concurrent work from Wairagkar et al. (2025) and Ryoo et al. (2025) may be worth looking at for future extensions.

  • Details on the POYO "reproduction" associated with Fig. 19 and Appendix D3 seem sparse/underspecified. Based on what I pieced together, there are still differences with POYO: you consider binned spikes instead of spike times, and it is unclear how exactly tokenisation and training were done. I would think POYO's positive, consistent cross-subject transfer comes from unit embeddings and supervised training as opposed to cross-attention. Also, lines 1363-1364 appear to be inaccurate to me (at least as of 4 months ago): there's a checkpoint at https://github.com/neuro-galaxy/poyo and a codebase at https://github.com/neuro-galaxy/torch_brain. These might be worth looking at to iron out differences between POYO and the NDT-3 cross-attention implementation.

  • Another comment about baselines is that Jiang et al.'s recent work on NDT-MTM scaling may not entirely take into account the right hyperparameter settings at different scales, given that they only tuned them on single session and 10 session models (c.f. Yang et al. (2022)).

  • This might be me, but I found the paper dense at times, though I appreciated the insight and attention to low-level details. Re-reading it at times, I felt the message or summary of results kind of got lost with each read. It might be worth including a concise summary in list form with the key positive and negative results in one place, given the breadth of experiments.

References

  • Willett, Francis R., et al. "A high-performance speech neuroprosthesis." Nature 620.7976 (2023): 1031-1036.
  • Wairagkar, Maitreyee, et al. "An instantaneous voice-synthesis neuroprosthesis." Nature (2025): 1-8.
  • Ryoo, Avery Hee-Woon, et al. "Generalizable, real-time neural decoding with hybrid state-space models." arXiv preprint (2025).
  • Yang, Greg, et al. "Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer." arXiv preprint arXiv:2203.03466 (2022).

Questions

  • On lines 294 and 306, did you mean Fig. 4D instead of 4C? 4C seems to be a schematic as opposed to a result on overfitting.
  • Given the prevalence of post-training in LLMs, how important do you think it is for NDT-3-style models given architectural and objective parallels?
  • Again, given the rising adoption of synthetic data pipelines in LLM pretraining, could these be used to alleviate issues of generalisation to unseen targets or other scaling issues? Are there preliminary explorations or results in this direction by authors?
  • NDT-2 had results on cross-species transfer, more specifically that monkey to human transfer did not work. Did NDT-3 alleviate these issues? Maybe I missed this but I couldn't find an explicit study of cross-species transfer, but maybe because cross-subject transfer was ambiguous/negative.
  • I see in the Appendix that you mention latency is < 20 ms; are there actual values in ms for specific hardware, along with memory requirements? 350M params is quite a lot, so I'm curious what prediction times look like on desktop/workstation-equivalent hardware you'd find in a typical lab.

Limitations

Yes, limitations are discussed overall (the paper explicitly analyses the architecture's scaling and specific failure modes), and broader impact is discussed in Appendix A.

Justification for Final Rating

The authors answered all my questions during the rebuttal and discussion phases. The paper has a lot of interesting analyses that are valuable to the BCI + NeuroFM communities and so I recommend acceptance.

Formatting Issues

None.

Author Response

We thank the reviewer for your time and thoughtful questions. We generally agree with your summary. We would like to clarify that NDT3 is trained with both supervised behavior and self-supervised neural data prediction objectives, in pretraining and tuning. The related analyses include pretraining curves for each objective in Fig 14, an ablation of neural objectives in pretraining in Fig 9, and an ablation in fine-tuning in Fig 16 / C.11.

Weaknesses

NDT design. We agree there is room for innovation beyond NDT3’s design. This work is meant to establish a rigorous baseline that provides clear goalposts for future work (whether by beating the 90 minute threshold, superior generalization in Fig 4’s analyses, or higher FALCON scores). We do not think these innovations will come easily, though (see global response below). Our thoughts on your suggested culprits:

  • Patch tokenisation: We were optimistic that NDT3’s patching would not be limiting because the patching scaled well in NDT2. In retrospect, this is only because those evaluation datasets were small. POYO’s tokenization is promising, but we were not successful with it.
  • Spike timing: This is a keen observation, and we have since observed that spike timing improves NDT, but we do not expect scaling to qualitatively improve from using spike timing over binned spikes, since binning effectively just lightly noises the input. Practically, binned spikes are necessary, as many datasets are packaged only with binned spike data rather than precise spike timing.
  • Autoregressive causality: We do not see this design as a likely culprit since other work has demonstrated scaling in domains with no sense of causality/time with random token dropout (e.g. Vision MAE).
  • Lack of post-training: We don’t believe other domains have required post-training to observe good scaling. We are hopeful post-training can improve NDT3’s utility, however, and would be interested in evaluating this in future efforts.

H2 task: We did evaluate NDT3 on H2 (C.4, Fig 10). While length generalization and memory were not issues for us (thanks to FlashAttention), we did not find transfer from NDT3’s continuous regression pretraining to H2’s seq2seq classification task. The use of Transformers is likely not fundamentally limiting for either discrete or continuous speech tasks, though, as both Wairagkar 25 and Card 25 use Transformers.

POYO comparison: This topic is important to detail. NDT-CrossEnc is so named because NDT’s patching is its most suspect design flaw (Section 3.2), but we tried to compare with POYO more completely than this, and did consider your three raised differences:

  1. Supervised training: NDT2 and NDT3 are both trained with supervised objectives in this work (see the aforementioned Fig 9/14/16 on its importance in NDT3).
  2. Unit embedding vs patch embedding: NDT-CrossEnc reduces patch size to 1, and each patch receives identifying embeddings, so NDT-CrossEnc also essentially receives unit embeddings. One remaining difference is that unit embeddings are distinguished across datasets. In NDT2, dataset-specific parameters were accounted for with a separate session embedding token. We dropped this in NDT3 due to its low benefit (according to NDT2’s ablations) and high cost (45M-parameter NDT3 would incur about 20M new params from ~20K datasets × 1K params/dataset; see the sketch after this list).
  3. Binned spikes vs spike timing: As mentioned, we cannot scale with spike timing as this data is frequently missing in BCI datasets.
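To illustrate the parameter arithmetic in point 2, a dataset-specific unit-embedding table might look like the following sketch (the split of ~1K params/dataset into unit count and embedding width is purely an assumption):

```python
import torch.nn as nn

# ~20K datasets x ~1K params/dataset, as estimated above. The 4-unit x
# 256-dim split per dataset is an illustrative assumption.
n_datasets, units_per_dataset, d_unit = 20_000, 4, 256
unit_embed = nn.Embedding(n_datasets * units_per_dataset, d_unit)
extra = sum(p.numel() for p in unit_embed.parameters())
print(f"{extra / 1e6:.1f}M extra parameters")  # ~20.5M on a 45M base model
```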

In summary, we believe NDT-CrossEnc fairly translates POYO to the scaled BCI setting. We have confirmed our design with the released codebase and in correspondence with a researcher familiar with the POYO model. We did not pursue an exact reproduction as POYO has not shown cross-subject transfer that is qualitatively different from NDT2 (both were evaluated in low-data downstream settings). We certainly agree that our documentation of the NDT3-CrossEnc reproduction (Lines 1354-1362) could be improved; we will restructure it based on your feedback, similar to the bullets above, to improve clarity.

One remaining option would be to directly use the public POYO codebase, by adapting POYO model/dataloading/evaluations to the BCI use case (causal / binned spikes). Our confidence in doing this fairly is low, since there are open issues regarding reproducing reported results in the original POYO codebase, and the new codebase also lacks evaluation scripts and end-to-end examples. At this point, we believe it would be most expedient to seek comparison through external benchmarks like FALCON.

NDT-MTM: We agree NDT-MTM may not have been exhaustively optimized; NDT3 also did not use Mu-parameterization. To clarify, we omitted comparison with NDT-MTM not because of poor performance, but because its main distinction from NDT was its use of new masking strategies to improve neural data reconstruction (as opposed to behavior decoding).

Clarity: Thank you for this note. We had compressed such summaries to fit the content limit, but will add these summaries in the camera-ready.

Questions

294-306: Yes, we meant 4D. Thanks for catching this typo.

Post-training: Thanks for asking! Post-training has not been explored in the neural FM space, but we included the analysis in 4C to specifically highlight a target for post-training, to prioritize a decoder that will generalize to more behaviors when the neural activity would support it. We allude to this in Line 257. One pertinent question will be how to preserve this post-training through the final fine-tuning that all neural models currently require.

Synthetic data: To attempt to improve generalization to unseen channel orderings, we piloted the use of channel shuffling augmentations in pretraining. This helped at small scales but began to harm performance by 200 hours of pretraining data. We did not report this as we think this warrants a more thorough investigation that was beyond the scope and content limits of this paper. Other synthetic augmentations from the BCI literature are definitely worth testing for neural foundation models, but have mostly been used to address small signal nonstationarities over time, so we do not expect it to change cross-subject scaling.
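For concreteness, a channel-shuffling augmentation of the kind piloted here might look like the following sketch (shapes are assumed; this is not the actual pretraining pipeline):

```python
import torch

def shuffle_channels(binned: torch.Tensor) -> torch.Tensor:
    """Apply an independent random channel permutation to each sample.

    binned: (batch, time, channels) spike-count tensor.
    """
    out = torch.empty_like(binned)
    for i in range(binned.shape[0]):
        perm = torch.randperm(binned.shape[-1])
        out[i] = binned[i, :, perm]
    return out

x = torch.randint(0, 4, (8, 100, 96)).float()
x_aug = shuffle_channels(x)  # same statistics, scrambled channel identity
```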

Cross-species: We did not explicitly mention cross-species transfer in this work, but you raise a good point, so we will clarify the following in the Discussion: The 200 hour model does show cross-species transfer, as it only pretrains on monkey data and positively benefits human evaluations. We do not think this is due to differences between NDT2/NDT3, but rather a result of training and evaluating at a larger scale.

Latency: With KV cache enabled and at 1 second context length, we see mean inference times of about 4ms and 9ms for the 45M and 350M models on an NVIDIA 4090. On a 4070, the 350M model slows to ~15-18ms. We add these timings to Line 1412.
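For context on how such numbers are typically obtained, a generic latency-measurement sketch with CUDA events (the model and harness are assumptions, not the authors' benchmarking code):

```python
import torch

@torch.no_grad()
def mean_latency_ms(model, x, n_warmup=10, n_iters=100):
    """Mean forward-pass time in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(n_warmup):  # warm up kernels and the allocator
        model(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters
```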

Thank you for all these suggestions, and we hope that our clarifications and comments (justification of NDT3 and our POYO reproduction, resolution of other weaknesses and questions) address your main critiques and respectfully request reconsideration of the borderline score.


We attach below a global rebuttal summary if you are interested in how your feedback relates to other reviewers’ comments:

4GhD/iNS9/cDht sought more understanding of how NDT’s design might cause the reported limitations. We share this desire but believe addressing it clearly will require scaling alternative models, and we exhausted our resources scaling NDT3 (20K A100 hours). We do believe that NDT3’s design was appropriate as the first choice to scale to 100M+ parameters as 1) NDT2 showed the main choice of patchwise tokenization worked well in the previous order of magnitude, 2) NDT3 outperformed the alternate candidate (POYO) in our hands (Fig 19), and 3) NDT3 closely adheres to the vanilla Transformer that successfully scales in other fields. We are hopeful for future improvements but note that cross-subject transfer is broadly challenging beyond our design and neural modality (Banville 25, Apicella 24). The most promising path forward appears to be scaling to thousands of subjects (Kaifosh 25, Aristimunha 25), which is not yet feasible for implanted BCIs. Instead, we personally are actively exploring richer neural data features and alternate tokenization, as some reviewers have suggested. More generally, we hope that NDT3’s 90 minute saturation mark and limitation analyses have provided clear goalposts for the field.

The reviewers requested various improvements to clarity. We agree the work is dense and unclear at times, which likely caused some reviewers to request analyses or question claims that were explored in the appendix. We will add 1) summary boxes to our main text sections (4GhD) and 2) a table of contents in the appendix. We will also 3) extend the discussion with an explicit limitations section and restore the impact statement to the main text (iNS9), and 4) tweak figures to improve readability (cDht). Finally, we 5) will add a reference to the supplementary analysis that supports the rigor of our cross-subject evaluation (HeMw).


Apicella 24, Toward cross-subject and cross-session generalization in EEG-based emotion recognition: Systematic review, taxonomy, and methods. https://doi.org/10.1016/j.neucom.2024.128354

Banville 25, Scaling laws for decoding images from brain activity. https://arxiv.org/pdf/2501.15322

Kaifosh et al, 2025, A generic non-invasive neuromotor interface for human-computer interaction. https://www.nature.com/articles/s41586-025-09255-w

Aristimunha 25, EEG Foundation Challenge. https://arxiv.org/abs/2506.19141v1

Card et al, biorXiv 2025. Long-term independent use of an intracortical brain-computer interface for speech and cursor control

Comment

Thank you for the response, I really appreciate your efforts here. Some comments and questions:

  1. Spike timings: In my experience with some of these models, having access to raw spike times can actually increase performance a fair bit (say, 0.7 → 0.77, a ≤10% gain in R2, on an RT task). Could you clarify what you mean by this not mattering (or not necessarily enabling better performance) at scale? I'm curious to hear about your NDT3 implementation of spike times as inputs and wonder if you could share what the improvement was by using them.

  2. H2 task: My bad, thank you for pointing this out.

  3. POYO: I acknowledge (1) and (3); the latter, in my opinion, is an important question for the BCI community to address going forward. I'm a bit more sceptical of (2). What if dataset-specific unit embeddings actually matter at scale? I understand this would add several parameters, but using the same embeddings for what could be very different neurons or channels across different sessions seems a bit weird and not equivalent to POYO. Not arguing that POYO is optimal (having to learn so many parameters could be considered inefficient), but it doesn't quite seem like we have a one-to-one comparison here on all/reasonable fronts. I agree it might be more realistic to expect a comparison with POYO on a standard benchmark.

  4. NDT-MtM: I agree that NDT-MtM's goal is to improve reconstruction, but they show gains in some downstream decoding tasks over just temporal masking à la NDT1 (Section 6.2 & Table 2). Interestingly they also show that NDT1-Stitch scales better than NDT2 (which is closer to NDT3). So, I wonder if the authors have a better justification here for the former, and any comments on the latter – my understanding is that tokenisation and masking schemes are the biggest differences between NDT1 and NDT2.

  5. Thanks for the clarifications on post-training, synthetic data, and latency. The cross-species result is cool to know, would be nice to see some numbers.

Comment

We're glad to address further questions.

  1. Spike timings: We implemented spike timings by using a given channel's activity at a smaller time resolution as a dense feature vector (see the sketch at the end of this comment). For example, instead of summarizing 20ms into a single binned spike count scalar, we have a length-20 vector that represents spike times at 1ms resolution, which we then linearly project to a token embedding. In small experiments with RT tasks (the O'Doherty dataset used in NDT2), we saw improvements of, say, 0.6 -> 0.63. You might be interested to know that we see similar 0.02-0.04 bumps from using richer features more common in the BCI literature, namely spike band power; it is unclear if these play the same role. These pretrained NDT3s are incompatible with these richer features and don't currently benefit (we think a medium-scale fine-tuning would enable their integration). A successful tuning of NDT3 to enable these integrations post-hoc to our large-scale pretraining is of very high interest to us. We make two points about scaling with spike timings: 1) We don't think it will overcome cross-subject challenges (e.g. 90 minute saturation), and 2) we don't always have spike timing in public dataset releases, making their required use a challenge.

  2. POYO-unit embeddings: This is a fair point to push back on. We agree dataset-specific parameters will likely help pretraining, so our reproduction falls short there. We think including dataset-specific parameters (of O(model params)) does raise other problems. We can think of two:

  • If dataset-specific parameters performed well, it would be harder to cleanly measure scaling, as scaling data would now simultaneously increase parameter count by a nontrivial amount. Comparisons with other domains would also be harder.
  • It's a bit tricky to be principled about defining unique datasets in our data. A reasonable default would be to try to assign unique parameters for each "subject-day", but the same day can include hardware and software changes that jitter the seen neurons (e.g. headstage swapping or re-thresholding).

So yes, we agree that NDT-CrossEnc's failure does not end the story on POYO's efficacy and we are eager to see whether dataset-unique parameters or some other factor might overcome NDT3's cross-subject challenges. We do think NDT-CrossEnc provides a reasonable control for the POYO-based alternative that we could have pursued to run similar scaling experiments as we sought for this work.

  3. NDT-MTM: For the former (decoding performance): we may be mistaken, but we still do not see differences between MtM and the temporal-masking baseline in 6.2 / Table 2. In Table 2 (Choice decoding and WBE), the metrics differ by 0.01-0.02, with MtM winning in one and Temporal Masking in the other. In Fig 3 the decoding scatter plot metrics appear virtually on the diagonal. Presumably the decoding differences are not statistically significant. Fig 5 has some differences between the two models for mouse brain-region-specific decoding, but we're not sure how to interpret temporal masking's sometimes subtrivial performance in those experiments in the context of our work in primate motor cortex, where temporal masking performs reasonably. For the latter: this is an interesting point on which we do not have much insight. In part, stitching may be particularly performant when the data are both limited in volume and highly heterogeneous. More likely, based on NDT2's poor performance in this work's experiments, some aspect of NDT2's preparation is just not very performant. We do not mean to detract from NDT2's narrative contributions, however, in providing a scaling comparison between different types of neural data.
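To pin down the spike-timing input described in point 1 above, a minimal sketch (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Each 20 ms window is a length-20 vector of 1 ms sub-bins rather than a
# single binned-count scalar, linearly projected to a token embedding.
bin_ms, d_model = 20, 512
proj = nn.Linear(bin_ms, d_model)

# (batch, windows, channels, sub-bins): 0/1 spike indicators at 1 ms
fine = (torch.rand(1, 100, 96, bin_ms) < 0.02).float()
tokens = proj(fine)    # (1, 100, 96, 512)
coarse = fine.sum(-1)  # (1, 100, 96): the binned count it replaces
```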
Comment

Thanks for the quick response. Thoughts:

  1. Totally agree with your points on spike-band powers + post-hoc integration, the Brain2Text paper and '24 benchmark provide clear evidence of the former being instrumental in performance gains (Willett et al. 2023 and 2024). Regarding the spike times implementation, this seems like a good place to start, and I think a bump of 0.03 is pretty decent for the RT tasks. Have you tried combining this smaller time resolution binning with NDT3-CrossEnc (maybe you already did this, just a bit unclear to me)? As to your final points, I can understand (1) and think it warrants more study; ideally for (2) a method should be able to handle heterogeneous inputs, and maybe take advantage of spike timings where available (given evidence of some gains). This is a "soft" multimodality problem that could open a whole other can of worms, so I think the approach so far is again a reasonable starting point.
  2. Agree that quantifying scaling would be harder, a naïve thought would be to exclude embeddings from the parameter counts but this is not ideal. Unique params per "subject-day" is what POYO does iiuc, sounds like a reasonable approach given the tradeoffs. Overall though, I think the POYO-based NDT3 experiment needs some rephrasing, as differences still exist between the two, e.g.:
    1. Fig. 19 legend: NDT3-CrossEnc (POYO) -> maybe POYO-inspired/POYO-based?
    2. A clear list of outstanding differences, as outlined in your comment(s).
  3. Okay, I'm satisfied with your points on MtM. It escapes me why NDT2 doesn't perform well, would be interesting to investigate why exactly.
  4. Quick reminder icymi, to share numbers (if/when you have them) on your cross-species results with NDT3-200h.

Overall, I am still positive about the paper and will recommend acceptance.

Comment

We appreciate the recommendation!

  1. We have not attempted to combine spike timing and NDT3-CrossEnc. It'd be fascinating if that combination unlocked particularly good performance. We believe the POSSM authors will likely pursue some degree of scaling on BCI data with many of the aspects we've discussed.

  2. That is a good summary and suggestion. We will restructure our framing of our POYO-based reconstruction based on this discussion.

  3. We agree a direct study of NDT2's weakness would be informative.

  4. Our point on cross-species transfer is that the 200-hour model is a monkey-only model that boosts human evaluations. The numbers get messier depending on the particular task (plotted in full in Fig 11), but for example, Fig 3B shows that scaling from 1.5 monkey hours to 200 monkey hours boosts performance across downstream human scales. We figured adding a sentence about this point in the text is what you were interested in, not new numbers. Could you clarify what you'd like to see reported?

Comment

To clarify, I was wondering if you'd be able to state something like a percentage improvement (maybe just on certain tasks, e.g., 4D bimanual) in the text along with your mention of showing cross-species transfer. Upon looking at Fig. 11 again, I agree that these can get quite messy depending on the task, where in some tasks all models seem to be quite close. The figure was enough for me to get a sense of the result, would just hope that the framing in the discussion is crystal clear so the reader knows exactly what to expect.

Thanks for engaging, can't overstate how much I appreciate all the analyses in the paper!

Final Decision

In this paper the authors investigate using a pre-trained model to map neural activity to motor behavior. They pre-trained an autoregressive transformer on >2000 hours of intracortical neural and motor data from more than 30 monkeys and humans. Their results show the pre-training of the model is broadly beneficial, improving performance across eight different downstream decoding tasks and generalizing well to shifts in neural data. However, the authors also conclude that simply scaling such models is unlikely to resolve fundamental limitations arising from sensor variability and stereotyped patterns in neural datasets.

The primary concerns of the reviewers were centered on the true extent and limitations of the model's generalization capabilities, as well as the context provided by the baseline comparisons. A major recurring point, raised by multiple reviewers, was the weakness in cross-subject and out-of-distribution generalization. One reviewer questioned the definition of "cross-subject generalization" used in the paper, arguing that including any data from the test subject (even from a different task) in the pre-training set limits the practicality of the approach and does not represent a true zero-shot transfer scenario. This concern was linked to specific architectural choices, such as the patch-based tokenization of neural data, which reviewers suggested might not be robust to heterogeneity across different subjects, tasks, and recording hardware.

Furthermore, reviewers expressed a need for more rigorous comparisons to alternative models to better contextualize NDT-3's performance and limitations. Specifically, the comparison to the POYO model was noted as being insufficient or revealing a more nuanced story than presented in the main text, with one reviewer suggesting POYO might have more favorable scaling properties in some regimes. A desire for a fairer ablation study against the previous NDT-2 model was also mentioned to better isolate the benefits of the new architecture from the benefits of pre-training with behavioral data. Minor points on the paper's density and the clarity of some figures were also raised.

Despite these concerns, the reviewers unanimously recommended acceptance, driven by several common positive assessments. All reviewers lauded the paper as a comprehensive, large-scale, and highly valuable study in the important and challenging field of neural decoding. They recognized it as a significant effort, representing one of the largest pre-trained models for intracortical spiking activity to date. A key strength highlighted was the paper's honesty and transparent analysis of its own limitations. Rather than only focusing on successes, the authors' detailed exploration of where scaling helps and where it fails (e.g., the performance saturation and generalization issues) was seen as a crucial and insightful contribution that will inform future work in the BCI and neural foundation model communities. This combination of achieving strong performance, improving on the state-of-the-art, and providing a nuanced look at the challenges ahead was the clear basis for the positive recommendations.

With the reviewers reaching consensus that this paper should be accepted, and a final average score of 4.5, a decision of Accept (poster) was reached.