PaperHub
Rating: 5.5/10 (Poster, 4 reviewers; lowest 2, highest 4, std 0.7)
Individual ratings: 2, 3, 3, 4
ICML 2025

Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-Tuning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Foundation Models, Large Brainwave Foundation Models, Brain-Computer Interface (BCI), Low-Rank Adaptation (LoRA), Electroencephalogram (EEG)

Reviews and Discussion

Review (Rating: 2)

The authors performed an analysis of current Large Brainwave Foundation Models (LBMs). The results indicate that current LBMs show limited improvement over traditional deep-learning models. The authors further introduced LoRA for fine-tuning LBMs. The LoRA fine-tuning technique can substantially reduce the number of trainable parameters and improve training efficiency when applying LBMs to downstream tasks.

After rebuttal

Thank the authors for their responses. I am inclined to keep my score, considering the technical contribution aspect.

Questions for Authors

Why are the classifier architectures connected to LaBraM and NeuroGPT different? Is there any overlap between the datasets used in pre-training and those used in the downstream tasks?

Claims and Evidence

Yes

Methods and Evaluation Criteria

The preprocessing step makes sense, but what's the variation of the accuracy in Table 1? Does the result show any statistical significance?

These LBMs can be applied to different tasks by training only the classifier. The classifier's parameters would be far fewer than those involved in fine-tuning the whole model. It would be good to show the performance with fixed LBMs but learnable classifiers, which could help demonstrate the necessity of LoRA fine-tuning.
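The linear-probe setup suggested here (a frozen LBM with a trainable classifier head) can be sketched in PyTorch. The tiny `backbone` below is a hypothetical stand-in for a loaded LaBraM or NeuroGPT encoder, not the actual model code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained LBM backbone; in practice this would
# be the loaded LaBraM or NeuroGPT encoder.
backbone = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 128))
classifier = nn.Linear(128, 4)  # new task-specific head (e.g. a 4-class task)

# Freeze every backbone parameter; only the classifier head stays trainable.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, classifier)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # only the head's parameters remain

# The optimizer is given only the trainable subset.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Training then proceeds as usual; because gradients never flow into the frozen backbone's parameters, the optimizer updates only the small classifier head.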

Theoretical Claims

N/A

Experimental Design and Analyses

The authors provide a detailed analysis of the number of ranks, the layer types that apply LoRA, and the dropout parameter. However, the necessity of LoRA itself is not demonstrated, especially when compared with the case where only the classifier is trainable.

Supplementary Material

No supplementary material is attached.

Relation to Broader Scientific Literature

It is important to study how to apply the existing foundation models, such as LaBraM and NeuroGPT, to various downstream tasks. The authors leverage LoRA for fine-tuning, which is an existing technique.

Essential References Not Discussed

Additional models for EEG training, such as BIOT [1] and EEGformer [2], can be included in the study. Newer ones, such as CBraMod [3] and NeuroLM [4], are also worth discussing.

[1] Chaoqi Yang, M. Westover, and Jimeng Sun. BIOT: Biosignal transformer for cross-data learning in the wild. In Advances in Neural Information Processing Systems, volume 36, pp. 78240–78260, 2023.
[2] Chaoqi Yang, M. Westover, and Jimeng Sun. BIOT: Biosignal transformer for cross-data learning in the wild. In Advances in Neural Information Processing Systems, volume 36, pp. 78240–78260, 2023.
[3] Jiquan Wang, et al. "CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding." arXiv preprint arXiv:2412.07236 (2024).
[4] Wei-Bang Jiang, et al. "NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals." arXiv preprint arXiv:2409.00101 (2024).

Other Strengths and Weaknesses

The paper is well-structured and easy to understand.

However, the necessity of LoRA for current LBMs should be demonstrated more clearly. A comprehensive discussion of the different LBMs would also be valuable, and a greater technical contribution is expected.

Other Comments or Suggestions

See above.

Author Response

We would like to thank the reviewer for their thoughtful review and valuable feedback. As mentioned in our response to Reviewer QMtY, we have also decided to extend our analysis to capture one more popular BCI paradigm, namely motor. Therefore, we added a new movement benchmark based on the High Gamma dataset (R. T. Schirrmeister et al., Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017). Please see our response to Reviewer QMtY for the updated results table. This extends our thorough analysis even further, as noted by Reviewer bhSh.

The performance difference between NeuroGPT (the best foundation model in the new analysis) and EEG-Inception (the best baseline network) is 1.2% (up from the previous 0.5%), but with a considerable increase in the number of trainable parameters and a larger std among folds. As highlighted by the reviewer, although we report 10-fold cross-validation performance, the original manuscript does not show whether that small improvement margin of 1.2% is statistically significant. For that reason, we conducted paired t-tests between EEGInception (the best baseline network) and all examined foundation models. The results are demonstrated below, showing clear statistical significance for NeuroGPT in many benchmark tasks:

Table: P-values of paired t-tests between EEGInception and fine-tuned foundation models. Bold values indicate statistically significant results (p < 0.05).

| Models | Motor | ERP | Memory | Sleep | Eyes |
|---|---|---|---|---|---|
| EEGInception / LaBraM | 0.5860 | **0.0123** | 0.1090 | 0.2995 | 0.4468 |
| EEGInception / NeuroGPT (full model) | **0.0314** | **0.0401** | **0.0041** | 0.2226 | 0.8979 |
| EEGInception / NeuroGPT (Encoder) | **0.0072** | **0.0051** | **0.0399** | **0.0495** | 0.1056 |
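As a side note, paired t-tests of this kind can be computed per task from the per-fold accuracies with SciPy. A minimal sketch with illustrative fold accuracies (hypothetical numbers, not the paper's actual fold results):

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative 10-fold accuracies for a baseline and a fine-tuned model
# (hypothetical numbers, not the paper's actual fold results).
baseline_acc = np.array([0.60, 0.58, 0.62, 0.57, 0.61, 0.59, 0.63, 0.58, 0.60, 0.62])
lbm_acc = np.array([0.68, 0.66, 0.71, 0.65, 0.70, 0.67, 0.72, 0.66, 0.69, 0.71])

# Paired test: the folds are matched, so the test operates on per-fold differences.
t_stat, p_value = ttest_rel(lbm_acc, baseline_acc)
significant = p_value < 0.05  # the significance threshold used in the table
print(f"t = {t_stat:.3f}, p = {p_value:.4g}, significant = {significant}")
```

The paired form is the appropriate choice here because both models are evaluated on the same cross-validation folds, so fold-to-fold difficulty variation cancels out in the differences.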

As suggested by the reviewer, we conducted another study: fixed LBMs but learnable classifiers. The results are shown in the table below:

Table 2: Classification accuracy of foundation models with all parameters frozen except the classification heads. Each model was trained for 20 epochs with 10-fold cross-validation. Bold values indicate the best performance per task or overall.

| Model | Motor | ERP | Memory | Sleep | Eyes | Mean Accuracy |
|---|---|---|---|---|---|---|
| LaBraM | 0.297 | **0.884** | **0.670** | **0.608** | 0.717 | 0.635 |
| NeuroGPT (full model) | 0.366 | **0.884** | 0.656 | 0.597 | 0.734 | 0.647 |
| NeuroGPT (encoder) | **0.431** | 0.883 | 0.655 | 0.602 | **0.746** | **0.663** |

From the results above, training just the classification head yields models that lag behind the baselines by a large margin of almost 8-10% (compared with the results in our response to Reviewer QMtY). This further demonstrates that current brainwave foundation models still lack the essential elements needed to fully capture the diverse nature of EEG and to largely outperform current state-of-the-art baselines across tasks with minimal fine-tuning. This makes full-model fine-tuning necessary and, in turn, makes PEFT techniques like LoRA extremely valuable.

The classifiers in both foundation models were designed based on the implementations in the original foundation-model papers (LaBraM and Neuro-GPT). As mentioned in our response to Reviewer 9Has, both foundation models (LaBraM and Neuro-GPT) include in their pre-training datasets paradigms that cover motor-, ERP-, sleep- and eyes-related tasks (though not the specific datasets chosen for the downstream tasks). Interestingly, in the memory task, which was not explicitly included in the pre-training datasets, baseline models slightly outperform foundation models.

We thank the reviewer for the additional references provided. In our studies, we included the two state-of-the-art open-source EEG foundation models available at the time. The trained weights of CBraMod and NeuroLM were released close to the ICML submission deadline, so including these models in our analysis is not possible within the rebuttal's time constraints. However, we will definitely acknowledge these recent works in the background section of our revised manuscript.

In the camera-ready version of the manuscript, we will incorporate the paired t-test analysis, the frozen foundation-model analysis, and the aforementioned discussion points. We sincerely thank the reviewer for their valuable comments, which have helped us improve our work.

Review (Rating: 3)

In this paper, the authors compare state‐of‐the‐art Large Brainwave Foundation Models (LBMs) with traditional deep learning baselines on multiple EEG‐based tasks and find only marginal accuracy gains despite a massive increase in parameters. They then apply Low‐Rank Adaptation (LoRA) to substantially reduce trainable parameters without sacrificing performance and show that adapting multiple model components simultaneously (e.g., convolutional, fully connected, and attention layers) yields the greatest benefit.

Questions for Authors

Could the authors add statistical tests to all the comparisons provided?

Claims and Evidence

The authors make a strong claim that Large Brainwave Foundation Models (LBMs) offer only a marginal improvement over traditional deep nets in EEG tasks, but they never show whether that small margin is statistically significant. Although they do 10‐fold cross‐validation, there is no clear significance testing or confidence intervals for the performance numbers they report.

It’s difficult to conclude if the LBMs are truly better or if the difference is just random variation.

Methods and Evaluation Criteria

The comparison and evaluation in this paper are quite standard. However, significance testing was omitted, which seriously undercuts the authors' key assertions.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analyses

The comparison in the current version is not valid since there is no significance test; this is especially important when the improvement is as marginal as 0.5%.

Supplementary Material

No supplementary material in this paper.

Relation to Broader Scientific Literature

This paper appears to be the first to systematically compare large‐scale, pre‐trained EEG foundation models against conventional deep learning baselines in BCI tasks. While earlier studies have applied large‐model concepts to EEG or introduced individual LBMs, no prior work has comprehensively benchmarked multiple LBMs and standard architectures side by side.

Essential References Not Discussed

The reviewer is not aware of any missing reference.

Other Strengths and Weaknesses

The key problem of this paper is the absence of significance tests. With statistical tests added, this paper could be a good contribution to the field.

The authors are encouraged to revise the title to specify the EEG modality, since there are also brain foundation models for fMRI.

Other Comments or Suggestions

N/A.

Author Response

We would like to thank the reviewer for their thoughtful review and valuable feedback. As mentioned in our response to Reviewer QMtY, we have also decided to extend our analysis to capture one more popular BCI paradigm, namely motor. Therefore, we added a new movement benchmark based on the High Gamma dataset (Robin Tibor Schirrmeister et al., Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017). Please see our response to Reviewer QMtY for the updated results table. This extends our thorough analysis even further, as noted by Reviewer MQLQ.

The performance difference between NeuroGPT (the best foundation model in the new analysis) and EEG-Inception (the best baseline network) is 1.2% (up from the previous 0.5%), but with a considerable increase in the number of trainable parameters and a larger std among folds. As highlighted by the reviewer, although we report 10-fold cross-validation performance, the original manuscript does not show whether that small improvement margin of 1.2% is statistically significant. For that reason, we conducted paired t-tests between EEGInception (the best baseline network) and all examined foundation models. The results are demonstrated below, showing clear statistical significance for NeuroGPT in many benchmark tasks:

P-values of paired t-tests between EEGInception and fine-tuned foundation models. Bold values indicate statistically significant results (p < 0.05).

| Models | Motor | ERP | Memory | Sleep | Eyes |
|---|---|---|---|---|---|
| EEGInception / LaBraM | 0.5860 | **0.0123** | 0.1090 | 0.2995 | 0.4468 |
| EEGInception / NeuroGPT (full model) | **0.0314** | **0.0401** | **0.0041** | 0.2226 | 0.8979 |
| EEGInception / NeuroGPT (Encoder) | **0.0072** | **0.0051** | **0.0399** | **0.0495** | 0.1056 |

We agree with the reviewer that "brain foundation models" could also refer to models for fMRI; therefore, we used the term "Large Brainwave Foundation Models (LBMs)", clearly hinting at the EEG modality.

In the camera-ready version of the manuscript, we will incorporate the paired t-test analysis. We sincerely thank the reviewer for their valuable comments, which have helped us improve our work.

Reviewer Comment

Thanks for adding significance testing and an additional task; I have increased my rating to weak accept. While the results are helpful for the EEG community in reflecting on the current state of the art, the paper could benefit from a deeper and more thorough analysis, including why EEG foundation models' performance is limited and how to improve them.

Review (Rating: 3)

The paper evaluates the performance of two Large Brainwave Foundation Models, LaBraM and NeuroGPT, by fine-tuning them on multiple EEG-based benchmark tasks. The authors compare these LBMs to well-known deep learning baselines (e.g. EEGNet, EEGInception) and investigate parameter-efficient fine-tuning via Low-Rank Adaptation (LoRA). Their results show that LaBraM slightly outperforms baselines but at a much higher parameter cost, while NeuroGPT lags in performance. They also find that LoRA can significantly reduce trainable parameters in LBMs without harming accuracy.

Questions for Authors

  • Are there specific EEG paradigms where LBMs offer bigger gains, or is the 0.5% improvement consistent across tasks?
  • How sensitive are LBMs to the choice of pre-training datasets? Could incorporating additional diverse EEG tasks in pre-training yield larger improvements?

Claims and Evidence

  • The paper claims that existing LBMs do not yield substantially better performance than smaller specialized networks, and the authors provide numerical evidence that LaBraM improves average accuracy by only about 0.5% over baselines.
  • LoRA can retain or even improve performance of LBMs while drastically cutting trainable parameters. Their experiments across rank settings and layer types support this.

Methods and Evaluation Criteria

  • The authors use standard EEG preprocessing steps, including bandpass filtering, notch filtering, and resampling, consistent with typical BCI literature.
  • Fine-tuning is done with 10-fold subject-independent cross-validation, ensuring robust performance estimates.
  • The evaluation focuses on classification accuracy across four EEG tasks, which is a standard metric for BCI tasks.
  • The approach to LoRA is well-detailed, including how the low-rank adapters are integrated into each layer.
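The low-rank adapters mentioned here can be sketched generically in PyTorch: a frozen linear layer augmented with a trainable update W·x + (α/r)·B·A·x. This is an illustrative wrapper under common LoRA conventions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 => no-op at init
        self.scale = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Low-rank branch adds only r * (in_features + out_features) trainable parameters.
        return self.base(x) + self.dropout(x) @ self.A.T @ self.B.T * self.scale

layer = LoRALinear(nn.Linear(256, 256), r=8)
x = torch.randn(2, 256)
out = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # A and B only
```

With r = 8 on a 256×256 layer, only 4,096 adapter parameters are trainable versus 65,792 frozen base parameters, which illustrates the kind of parameter savings LoRA-style adaptation provides.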

Theoretical Claims

  • The paper focuses on empirical evaluations.

Experimental Design and Analyses

  • The paper uses four benchmark datasets (ERP, working memory, sleep staging, and eyes open/closed). This selection strengthens the generalizability of findings.
  • Baselines are well-chosen (EEGNet, EEGInception). These are standard architectures for EEG classification.
  • Fine-tuning setups are described in sufficient detail, including the rank hyperparameters for LoRA, dropouts, and parameter counts. Overall, the experiments are well-structured.

Supplementary Material

N/A

Relation to Broader Scientific Literature

  • The paper positions LBMs alongside established large-scale modeling trends in NLP and computer vision, citing GPT-like training (NeuroGPT) and codebook pre-training (LaBraM).
  • The discussion references prior EEG-based deep learning approaches (EEGNet, EEGInception) and prior BCI tasks. This frames their contributions within a standard BCI pipeline and highlights the difference between specialized vs. foundation models.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  • Thorough experimental design with multiple downstream tasks.
  • Clear demonstration of how LoRA can be systematically applied in EEG foundation models.
  • Useful ablation studies clarifying which layers benefit most from adaptation.

Weaknesses:

  • The performance improvement of LBMs over smaller baselines is minimal, raising questions about practical benefits.

Other Comments or Suggestions

  • It may be helpful to elaborate on interpretability aspects of LBMs vs. smaller EEG models, especially given the importance of model explainability in BCI applications.

Author Response

We would like to thank the reviewer for their thoughtful review and valuable feedback. As mentioned in our response to Reviewer QMtY, we have also decided to extend our analysis to capture one more popular BCI paradigm, namely motor. Therefore, we added a new movement benchmark based on the High Gamma dataset (Robin Tibor Schirrmeister et al., Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017). Please see our response to Reviewer QMtY for the updated results table. This extends our thorough analysis even further, as highlighted by Reviewer 9Has.

As highlighted by the reviewer, explainability is crucial in BCI applications. Therefore, developing interpretable AI models is essential. In recent years, many research papers have developed interpretable EEG models with strong performance. As we head toward large brainwave foundation models, maintaining this requirement remains critical. While Neuro-GPT follows a black-box approach, LaBraM incorporates a neural codebook method, enhancing model interpretability, as discussed in the background section of the manuscript.

As shown in the analysis table in our response to Reviewer QMtY, foundation models slightly outperform baseline models by an average of 1.2%. However, as the reviewer rightly points out, a closer examination of individual tasks and the pre-training datasets used by these models is worthwhile. Both foundation models (LaBraM and Neuro-GPT) include in their pre-training datasets paradigms that cover motor-, ERP-, sleep- and eyes-related tasks. From the results, we observe that foundation models achieve performance comparable to baseline models in the ERP, sleep, and eyes tasks, while NeuroGPT significantly outperforms baseline models in the motor task. Interestingly, in the memory task, which was not explicitly included in the pre-training datasets, baseline models slightly outperform foundation models. This evidence further strengthens our claim that these models have yet to reach their full potential, particularly in achieving significant generalization across new tasks. By integrating domain-specific knowledge, such as leveraging various EEG modalities, and employing tailored training strategies, like brain-inspired masking techniques, these models could fully capture the diverse nature of EEG and largely outperform all current state-of-the-art baselines in various tasks with minimal fine-tuning.

In the camera-ready version of the manuscript, we will incorporate the aforementioned discussion points. We sincerely thank the reviewer for their valuable comments, which have helped us to improve our work.

Review (Rating: 4)

An interesting perspective-style paper comparing state-of-the-art EEG-focused ML models in traditional versus foundation settings. The paper is well written and presents a solid comparison of two methods, resulting in the finding that state-of-the-art LBMs achieve only marginal improvements (0.5%) over traditional deep architectures. The result is an important message to the community: simply transferring methods between domains does not necessarily improve outcomes.

Questions for Authors

The concluding statements are carefully worded and diplomatic. Adding a more pronounced perspective on the future outlook for LBMs in this application could offer valuable insight to readers.

Claims and Evidence

This paper delivers a thorough and insightful comparison of open-source datasets, providing a valuable perspective/review that is highly relevant to the ICML community. The analysis is well-structured and demonstrates a strong understanding of the field. This is a solid contribution that fits the ICML scope perfectly.

Methods and Evaluation Criteria

The authors have meticulously selected benchmark BCI and sleep study datasets, reflecting the current state-of-the-art within the EEG community. Furthermore, the rigorous evaluation of both cutting-edge LBMs and established deep learning architectures underscores the paper's contribution to the field. This is a highly valuable contribution, offering important and insightful results.

Theoretical Claims

The selection of Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT) of large pretrained models is a particularly insightful technical and theoretical contribution. This decision enables a clear and solid comparative analysis across diverse tasks. This is a very well-executed piece of work.

Experimental Design and Analyses

The authors have made excellent choices in selecting datasets and comparing models, reflecting current state-of-the-art practices. This paper provides a comprehensive analysis, yielding significant and well-supported results.

Supplementary Material

Given the nature of this review/perspective submission, supplementary materials are not expected, and their absence does not detract from the paper's overall contribution.

Relation to Broader Scientific Literature

This is a strong contribution that effectively evaluates the current state-of-the-art in the EEG field, specifically comparing foundation models with classical methodologies. This work has the potential for considerable impact.

Essential References Not Discussed

The provided references are comprehensive and relevant to the scope of this work.

Other Strengths and Weaknesses

This is a solid contribution with the potential for significant impact within the field.

Other Comments or Suggestions

None.

Author Response

We would like to thank the reviewer for their thoughtful review and valuable feedback. As the reviewer highlights, we have meticulously selected various benchmark BCI paradigm datasets, reflecting the current state-of-the-art within the EEG community. We also decided to extend these results to capture one more popular BCI paradigm, namely motor. Therefore, we added a new movement benchmark based on the High Gamma dataset (Robin Tibor Schirrmeister et al., Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017). The updated results are presented below:

Table: Classification accuracy of fine-tuned foundation models and standard deep learning architectures, reported as mean (std). Each model was trained/fine-tuned for 20 epochs with 10-fold cross-validation. Bold values indicate the best performance per task or overall; italic values indicate the next best.

| Model | Motor | ERP | Memory | Sleep | Eyes | Mean |
|---|---|---|---|---|---|---|
| EEGNet | 0.657 (0.087) | **0.912** (0.009) | *0.660* (0.022) | 0.624 (0.037) | 0.803 (0.061) | 0.731 (0.024) |
| EEGInception | 0.590 (0.087) | 0.896 (0.007) | **0.669** (0.021) | *0.688* (0.057) | 0.823 (0.038) | 0.733 (0.021) |
| LaBraM | 0.614 (0.096) | *0.911* (0.013) | 0.643 (0.040) | **0.704** (0.025) | *0.840* (0.041) | *0.742* (0.023) |
| NeuroGPT (full) | *0.682* (0.083) | 0.904 (0.012) | 0.610 (0.052) | 0.665 (0.030) | 0.821 (0.052) | 0.736 (0.025) |
| NeuroGPT (encoder) | **0.695** (0.085) | 0.908 (0.012) | 0.634 (0.035) | 0.647 (0.024) | **0.843** (0.045) | **0.745** (0.027) |

The performance difference between NeuroGPT (the best foundation model in the new analysis) and EEG-Inception (the best baseline network) is 1.2% (up from the previous 0.5%), but with a considerable increase in the number of trainable parameters and a larger std among folds.

In terms of a more pronounced perspective on the future outlook for LBMs, we strongly believe that this extensive study highlights critical considerations for the research community. In the camera-ready version of the manuscript, we could also add that we believe the future of Large Brainwave Foundation Models (LBMs) should go beyond merely adopting transfer techniques from other domains. Instead, they should integrate domain-specific knowledge—such as leveraging various EEG modalities—and employ tailored training strategies, like brain-inspired masking techniques. These are essential elements to fully capture the diverse nature of EEG and build an efficient and effective LBM that will largely outperform all current state-of-the-art baselines in various tasks with minimum required fine-tuning.

In the camera-ready version of the manuscript, we will incorporate the aforementioned discussion points. We sincerely thank the reviewer for their valuable comments, which have helped us to improve our work.

Reviewer Comment

The reviewer affirms the acceptance decision, acknowledging the authors' successful integration of revisions and explanations.

Final Decision

This paper presents a mainly empirical contribution by systematically evaluating the capabilities of Large Brainwave Foundation Models (LBMs), in comparison to traditional deep learning models for several EEG-based BCI tasks (e.g., sleep stage classification, memory tasks), revealing only marginal improvements in accuracy despite a substantial increase in parameters. One key observation from a technical aspect follows the use of low-rank adaptation (LoRA) for parameter-efficient fine-tuning of LBMs (specifically for LaBraM and NeuroGPT), which significantly reduces the number of trainable parameters. The study highlights that simultaneously adapting multiple LBM network components provides the greatest benefit across multiple BCI tasks, suggesting that LBMs require domain-specific strategies and architectural redesign to effectively leverage their potential.

The paper went through a productive rebuttal phase. One of the main concerns was the limited methodological contribution of the paper, since LoRA, as used in this work, is a well-known technique, and it was unclear whether its necessity was justified at all. The authors addressed the necessity justification with additional experimental validation, which appeared sufficient. Another discussed limitation was the lack of statistical significance tests for the accuracy comparisons and claims, which the authors clarified with additional results during the rebuttal. The authors also included new results on the High Gamma dataset, successfully extending their findings to another common BCI paradigm (i.e., motor imagery). In general, the AC acknowledges that the rebuttal was effective, although some reviewers maintained diverse ratings. Overall, the paper presents an interesting contribution and is of interest to the EEG-based BCI community. Thus, the AC recommends acceptance of this paper, if there is room in the program.