PaperHub
7.0 / 10
Poster · 3 reviewers
Ratings: 3, 4, 4 (min 3, max 4, std 0.5)
ICML 2025

Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-01

Abstract

Keywords
Electroencephalogram, Graph Learning, Self-supervised Pre-training, Neuroscience, Knowledge Distillation, Contrastive Learning

Reviews and Discussion

Official Review (Rating: 3)

The paper introduces EEG-DisGCMAE, a pre-training framework for EEG-based classification using graph neural networks (GNNs). The method integrates graph contrastive learning and masked autoencoders for self-supervised pre-training, followed by graph topology distillation to transfer knowledge from high-density (HD) EEG to low-density (LD) EEG using a teacher-student structure. The framework is evaluated on four classification tasks using two clinical EEG datasets (EMBARC and HBN). Experiments show EEG-DisGCMAE outperforms existing methods, including GNN-based and pre-training-based approaches.

Questions for Authors

N/A

Claims and Evidence

The construction of positive and negative pairs in graph contrastive learning raises significant concerns. This paper defines positive pairs as EEG nodes (electrodes) that are either directly connected (1-hop neighbors) or indirectly connected (2-hop neighbors), while all other pairs are treated as negative pairs. This approach implicitly assumes that spatially close electrodes should have similar embeddings, whereas spatially distant electrodes should have dissimilar embeddings.
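
A toy sketch of this pairing rule as the review describes it (hypothetical code, not the authors' implementation):

```python
import numpy as np

def hop_based_pairs(adj):
    """Pairs under the rule described above: electrode pairs within 1 or 2
    hops are positives; all remaining (non-self) pairs are negatives."""
    a = (np.asarray(adj) > 0).astype(int)
    within_two_hops = (a + a @ a) > 0            # 1-hop edges plus length-2 paths
    np.fill_diagonal(within_two_hops, False)     # a node is not paired with itself
    off_diag = ~np.eye(len(a), dtype=bool)
    pos_pairs = np.argwhere(within_two_hops)
    neg_pairs = np.argwhere(~within_two_hops & off_diag)
    return pos_pairs, neg_pairs
```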

From a neuroscience perspective, while this assumption may be reasonable for localized EEG tasks—such as motor imagery in Brain-Computer Interfaces (BCI), where adjacent electrodes over the motor cortex are functionally related—it does not generalize well to many other EEG-based tasks, including disease detection and gender classification. In these cases, functional connectivity often extends beyond spatial proximity, making such a rigid spatial constraint inappropriate.

Given that this paper focuses on gender classification tasks and brain disorder detection, the imposed spatial locality assumption is fundamentally flawed and may limit the model’s ability to capture the true functional relationships within EEG data.

Methods and Evaluation Criteria

  1. Lack of a Strict Subject-Independent Setup. The paper does not explicitly follow a subject-independent evaluation strategy, which is crucial for ensuring that models generalize to unseen subjects. If data from the same subject is mixed between training and testing, data leakage can occur, artificially inflating performance. This is especially problematic for tasks like disease detection, where a label is assigned per subject: spurious correlations can form between subject-specific features and the labels, so the model learns nothing about disease-related features (see the split sketch after this list).

  2. Failure to Compare Against Non-GNN Pre-training Methods. The paper only compares EEG-DisGCMAE with GNN-based pretraining models. However, other non-GNN pre-training approaches (e.g., BIOT, LaBram, EEGPT) exist and are not evaluated as baselines. A broader comparison is necessary to justify the advantages of graph-based pre-training over alternative architectures.

  3. Underwhelming Performance Gains. EEGNet, while a widely used baseline, is a relatively outdated model that lacks even residual connections. Given the complexity of EEG-DisGCMAE, including its pre-training strategy and model structure, the reported performance gains remain unimpressive, with AUROC improvements of no more than 10% over EEGNet. Such marginal improvements raise concerns about the actual learning capability of EEG-DisGCMAE, especially given its significantly higher computational cost and architectural complexity.
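
To make point 1 concrete, a minimal subject-independent split using scikit-learn's GroupKFold on toy data (all names and sizes illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

# Toy subject-independent split: `groups` holds one subject id per epoch, so
# GroupKFold never places the same subject in both the train and test folds.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # 200 epochs, 16 toy features each
y = rng.integers(0, 2, size=200)        # binary labels (e.g., patient vs. control)
groups = np.repeat(np.arange(20), 10)   # 20 subjects, 10 epochs per subject

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    print(f"held-out-subject accuracy: {clf.score(X[test_idx], y[test_idx]):.2f}")
```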

Theoretical Claims

See above.

Experimental Designs or Analyses

See above.

Supplementary Material

I reviewed the part of the supplementary material describing the datasets.

Relation to Existing Literature

See above.

Missing Important References

See above.

Other Strengths and Weaknesses

See above.

Other Comments or Suggestions

The paper introduces many mathematical symbols and terms, making it unnecessarily complex. Many symbols are redundant and could be simplified for better readability. Some of the notations for graph construction and distillation are overly complicated, making it harder to assess the method's true contribution.

Author Response

Thank you for all your comments. We respond to each of them below.


🟩 Q1: Claims and Evidence

(1)
We understand your concerns; however, we believe they stem from a misunderstanding of our approach. Our method is purely based on functional connectivity EEG graphs, rather than distance-based spatial EEG graphs.

Graphs based solely on spatial distances lead to spatial bias: positive pairs are close while negatives are far apart. For instance, Ho et al. (AAAI 2023) used hybrid graphs combining spatial and functional connectivity. Though helpful, this introduces spatial bias in contrastive learning.

Our results (see the Table in Q4 response to Reviewer c7XS) show that spatial graphs degrade performance even with our distillation. Hybrid graphs offer slight improvements, but spatial bias remains.

Thus, we use only functional connectivity—nodes use PSD features, edges are defined by Pearson correlation—to avoid spatial bias. This ensures positive/negative pairs are based on functional, not spatial, similarity.

(2)
In EEG, distant electrodes often show strong functional connectivity. Our graph is built on global functional correlation (Pearson), not spatial proximity—tailored for resting-state EEG.

Even electrodes that are spatially far apart can exhibit strong functional connections, and we define such strongly connected pairs as positive sample pairs. Functional connectivity inherently ignores spatial distance. Therefore, the concern about spatial locality applies mainly to distance-based spatial connectivity, not to functionally derived connectivity.

Our method captures global patterns without spatial constraints.
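
For concreteness, a minimal sketch of such a functional-connectivity graph construction (assuming numpy/scipy; the band limits and edge threshold below are illustrative placeholders, not the paper's exact settings):

```python
import numpy as np
from scipy.signal import welch

def build_functional_graph(eeg, fs=250, band=(8.0, 14.0), edge_thresh=0.5):
    """Sketch: alpha-band PSD node features and Pearson-correlation edges.

    eeg: (n_channels, n_samples) array of raw EEG signals.
    Returns node features (n_channels, n_band_bins) and a binary adjacency.
    """
    freqs, psd = welch(eeg, fs=fs, axis=-1)          # per-channel power spectra
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    node_feats = psd[:, in_band]                     # PSD features per electrode

    corr = np.corrcoef(eeg)                          # channel-wise Pearson correlation
    adj = (np.abs(corr) >= edge_thresh).astype(float)
    np.fill_diagonal(adj, 0.0)                       # no self-loops
    return node_feats, adj
```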


🟦 Q2: Methods and Evaluation Criteria

(1) Lack of Strict Subject-Independent Setup
As mentioned in the Q1 table for Reviewer c7XS and shown in the Table below, we conducted both subject-dependent and subject-independent experiments on clinical and SEED datasets. The model was pre-trained only on clinical EEG and fine-tuned on SEED to assess generalization.

Table. Performance on Sex Classification (EMBARC) and Emotion Recognition (SEED)

| Model | EMBARC, subject-dependent | EMBARC, subject-independent | SEED, subject-dependent | SEED, subject-independent |
|---|---|---|---|---|
| Graph Transformer | 71.6% | 68.2% | 86.4% | 75.4% |
| GraphMAE | 73.8% | 70.6% | 88.6% | 78.1% |
| Ours | 76.8% | 74.1% | 93.6% | 84.3% |

(2) Missing Comparison to Non-GNN Pre-training

| Model | Pre-Training | HD | LD | Size | FT Mem |
|---|---|---|---|---|---|
| GraphCL | GNN (Contrastive) | 83.9% | 80.6% | 6.9M | 1.0G |
| GraphMAE | GNN (Reconstruction) | 85.3% | 83.3% | 6.9M | 1.0G |
| LaBraM | Time Series (Reconstruction) | 87.3% | 84.8% | 7.6M | 3.2G |
| Ours-Tiny | GNN (Contrastive+Reconstruction) | 86.8% | 84.3% | 1.4M | 0.7G |
| Ours-Large | GNN (Contrastive+Reconstruction) | 87.8% | 86.9% | 6.9M | 1.0G |

We compared: (1) contrastive pre-trained GNNs (GraphCL), (2) masked-reconstruction pre-trained GNNs (GraphMAE), (3) a masked-reconstruction pre-trained time-series model (LaBraM), and (4) our joint contrastive + reconstruction GNN. Our model outperforms prior pre-trained GNNs and slightly surpasses the time-series pre-trained model LaBraM, while being significantly more efficient.

Note that although LaBraM performs well, it demands far more memory and parameters.

(3) Underwhelming Gains

  1. Performance gains on clinical resting-state EEG are modest, but on SEED (emotion recognition), we observed >10% improvement post fine-tuning—much higher than in disease classification.
  2. Our pre-training data is limited; larger datasets and augmentation will likely enhance gains.
  3. Our goal is boosting low-density EEG performance with lightweight models. Our method enables small LD models to match HD models, showing substantial relative improvements.

🟨 Q3: Other Comments or Suggestions

Thank you for the suggestion. Following advice from Reviewers c7XS and FWdZ, we simplified the notation, removed redundant symbols from the text and figures, and deleted the symbol table from the appendix. This significantly improves clarity and readability.

Official Review (Rating: 4)

This paper introduces a knowledge transfer model based on graph networks and distillation methods, which enables low-density EEG to learn the representation of high-density EEG to better handle downstream tasks. The authors conduct a large number of experiments to demonstrate its effectiveness.

Questions for Authors

  1. Is it possible to use data from multiple frequency bands at the same time? Using only single-frequency-band data may miss valid information.
  2. Can this method be used for other tasks?
  3. Since the graph encoder has a reconstruction function, have the authors tried using low-dimensional data to reconstruct data with more channels, increasing the dimensionality of the data and thus improving classification performance?

Claims and Evidence

I think the experiments and analyses in the paper are sufficient to support the authors' contributions.

Methods and Evaluation Criteria

The method is interesting, but the introduction is not very clear in some places.
  1. How do the teacher and student models adapt to different types of GNNs?
  2. What is the theoretical basis for defining positive and negative sample pairs? When the EEG density is very low, is it possible that there are no negative sample pairs, and how is that situation handled?

Theoretical Claims

There is no theoretical proof in the paper.

Experimental Designs or Analyses

The authors performed extensive experiments in the paper. There are some weaknesses:
  1. Brain topography cannot reflect the reconstruction quality well; the authors should add quantitative reconstruction metrics to report their results.
  2. When testing on VLD data, the authors should randomly select a small number of electrodes multiple times and report the average accuracy.

Supplementary Material

I have reviewed the appendix submitted by the authors.

Relation to Existing Literature

The paper provides a well-balanced discussion of previous work and clearly highlights how it extends the existing literature. The citations are comprehensive and appropriately placed.

Missing Important References

No

Other Strengths and Weaknesses

Strengths: The ablation experiments and analysis are very complete.

Weaknesses:
  1. Figure 1 is somewhat cluttered, and the many annotations can easily confuse readers; it is recommended to simplify the figure.
  2. Table 4 lacks a direct introduction in the text.

Other Comments or Suggestions

The method proposed in the paper is very good, but too many symbols are used when introducing it, and the fonts and colors are confusing, which makes it difficult to read; the notation should be simplified as much as possible.

Author Response

Thank you for all your comments. We respond to each of them below.


🟩 Q1: Methods and Evaluation Criteria

(1)
Our pre-training framework and distillation loss are designed to be general and compatible with both major types of GNNs: local message-passing models (e.g., DGCNN) and global attention-based models (e.g., graph transformers). Both our graph contrastive learning (GCL) and graph masked autoencoding (GMAE) methods work across different GNN architectures, as both types can capture global graph topology from different perspectives.

The goal of both pre-training and distillation is to learn meaningful global structures, regardless of the underlying GNN. The choice of backbone affects performance due to their distinct inductive biases: local GNNs excel at capturing neighborhood-level patterns, while graph transformers are more effective at modeling global interactions.

In practice, we recommend using message-passing GNNs like DGCNN for small graphs due to their ability to capture fine-grained local structures, and graph transformers for large graphs, as they offer better scalability and efficiency.

(2)
Our Graph Topology Distillation (GTD) loss defines positive and negative pairs based on functional similarity in EEG graphs. For a more theoretical analysis, please refer to Joshi et al. (TKDE 2022).

In EEG, graph connectivity reflects functional correlations rather than spatial proximity, so strong links can exist between spatially distant electrodes. However, in low-density (LD) graphs, missing electrodes may break these meaningful connections. For example, two strongly connected nodes in the HD graph may become disconnected in the LD graph due to missing intermediaries. These lost but meaningful links are treated as positive pairs in our distillation.

Conversely, missing electrodes can also lead to spurious connections in the LD graph that are not present in the HD graph. These are considered negative pairs. The distillation objective encourages the LD model to approximate the topology of the HD graph by explicitly distinguishing such positive and negative connections.
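
A minimal sketch of this pair construction and the resulting distillation objective, as we read the description above (PyTorch; the cosine similarity and margin are assumptions, not necessarily the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def gtd_loss_sketch(z_student, a_hd, a_ld, margin=0.5):
    """z_student: (N, d) student embeddings on the N shared electrodes.
    a_hd / a_ld: (N, N) binary adjacencies of the HD (teacher) and LD
    (student) graphs, both restricted to the shared electrodes."""
    sim = F.cosine_similarity(z_student.unsqueeze(1), z_student.unsqueeze(0), dim=-1)
    pos = (a_hd > 0) & (a_ld == 0)   # meaningful links lost in the LD graph
    neg = (a_ld > 0) & (a_hd == 0)   # spurious links introduced in the LD graph
    pos_loss = (1.0 - sim[pos]).mean() if pos.any() else sim.new_zeros(())
    neg_loss = F.relu(sim[neg] - margin).mean() if neg.any() else sim.new_zeros(())
    return pos_loss + neg_loss       # pull lost links together, push spurious apart
```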


🟦 Q2: Experimental Designs or Analyses

(1)
Thank you for the suggestion. We use mean squared error (MSE) to evaluate reconstruction quality.
The MSE losses for the four cases in Figure 4 (b), (c), (d), and (e) are 0.25, 0.31, 0.44, and 0.17, respectively. These values align well with the visual quality of the reconstructions, further supporting the effectiveness of our approach.

(2)
Random selection is not applicable, as reducing high-density (HD) EEG to low-density (LD) EEG follows specific electrode selection rules.
However, to assess model robustness, we simulate extreme conditions by randomly dropping electrodes in multiple trials. Given the same number of remaining electrodes, performance with random drops is generally worse than with structured downsampling based on predefined rules or distributions.


🟨 Q3: Other Strengths and Weaknesses

(1)
We sincerely appreciate your recognition and valuable suggestions. In response, we have revised Figure 1 by removing redundant symbols and replacing them with clearer textual descriptions. This change has notably improved the clarity and readability of the model illustration.

(2)
We have added the following analysis of Table 4 in the revised manuscript:
We compared our proposed GTD loss with several commonly used graph distillation losses. As shown in Table 4, GTD consistently outperforms the others. Moreover, combining GTD with traditional logits distillation achieves the best performance, as it allows the model to distill both semantic information from logits and structural information from the graph topology.


🟪 Q4: Questions for Authors

(1)
Yes, we incorporated multiple frequency bands as input features and observed a noticeable performance improvement.

(2)
Yes, as shown in the Table of the answer of Q1 for Reviewer c7XS, we evaluated our model on the emotion recognition task using the SEED dataset. Despite being pre-trained on a medical resting-state EEG dataset, our model can still be effectively fine-tuned for the emotion recognition task. We expect that pre-training on a task-specific EEG dataset (e.g., emotion recognition) would further enhance the model's performance.

(3)
No, we have not explicitly explored using low-dimensional data to reconstruct high-dimensional (i.e., more-channel) data as a form of data augmentation or upsampling. This is indeed an interesting direction: leveraging the reconstruction ability of the encoder to infer additional channels could enhance the representation capacity and improve downstream performance. However, it is harder and would require more data to pre-train the generative model, so we consider it a valuable direction for future work once more data is available.

Official Review (Rating: 4)

The study presents EEG-DisGCMAE as a novel and effective approach for EEG-based classification tasks, demonstrating that self-supervised graph pre-training combined with topology-aware knowledge distillation significantly improves LD EEG model performance. The findings suggest that LD EEG devices, which are more accessible and cost-effective, can achieve near-HD EEG accuracy using this framework, making EEG-based medical diagnostics more practical and scalable.

Questions for Authors

N/A

Claims and Evidence

A few claims require additional justification.

The paper only evaluates the method on two specific clinical EEG datasets (EMBARC, HBN), which focus on depression and autism spectrum disorder (ASD).

The paper mentions that the model can work with both DGCNN and Graph Transformer, but there is limited discussion on performance differences between these architectures.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria largely make sense for the problem of leveraging high-density (HD) EEG data to improve low-density (LD) EEG models. The graph-based approach, self-supervised pre-training, and knowledge distillation framework are well-motivated given the challenges in EEG classification. However, there are some areas for potential improvement in dataset selection and task diversity:

  1. The results are presented mainly in terms of classification accuracy and AUROC scores. However, EEG models must often be robust to noise, missing electrodes, and subject variability.

  2. EEG graphs are built using Pearson correlation to define edges and PSD values in the α (8-14 Hz) band as node features. However, other functional connectivity metrics (e.g., coherence, mutual information) could be tested. Additionally, other EEG features (e.g., time-domain features, multi-band PSD) might improve performance.
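
For instance, coherence-based edges as an alternative to Pearson correlation could be sketched as follows (illustrative only, using scipy's magnitude-squared coherence; not the paper's implementation):

```python
import numpy as np
from scipy.signal import coherence

def coherence_adjacency(eeg, fs=250, band=(8.0, 14.0)):
    """Mean alpha-band coherence between every pair of EEG channels.

    eeg: (n_channels, n_samples) array; returns a symmetric (n, n) adjacency."""
    n = eeg.shape[0]
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            f, cxy = coherence(eeg[i], eeg[j], fs=fs)
            in_band = (f >= band[0]) & (f <= band[1])
            adj[i, j] = adj[j, i] = cxy[in_band].mean()
    return adj
```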

Theoretical Claims

Yes

Experimental Designs or Analyses

Overall, the experimental setup is well-structured and thorough.

Supplementary Material

N/A

Relation to Existing Literature

The key contributions of the paper build on and extend prior work in EEG analysis, graph neural networks (GNNs), self-supervised learning, and knowledge distillation.

Missing Important References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Thank you for all your comments. We address each comment individually below.


🟩 Q1: Claims and Evidence (Generalization to Emotion Recognition)

To further evaluate the generalization ability of our model, we conducted additional experiments on the SEED dataset, a widely used benchmark for emotion recognition. The results show that our model, pre-trained on resting-state EEG data, can be effectively fine-tuned for this task.

It is worth mentioning that, due to time constraints in collecting non-clinical EEG (e.g., task-related EEG from BCI domains), we used our medical resting-state EEG dataset for pre-training before fine-tuning on SEED. The EMBARC task uses HD-to-LD distillation, and for the emotion recognition task we likewise adopt HD (64 electrodes) to LD (32 electrodes) distillation.

Table 1. Performance on Sex Classification (EMBARC) and Emotion Recognition (SEED)

| Model | EMBARC, subject-dependent | EMBARC, subject-independent | SEED, subject-dependent | SEED, subject-independent |
|---|---|---|---|---|
| Graph Transformer | 71.6% | 68.2% | 86.4% | 75.4% |
| GraphMAE | 73.8% | 70.6% | 88.6% | 78.1% |
| Ours | 76.8% | 74.1% | 93.6% | 84.3% |

🟦 Q2: Claims and Evidence (Model Design Motivation)

DGCNN and Graph Transformer are two representative types of graph neural networks (GNNs). DGCNN is a message-passing-based GNN, while the Graph Transformer is a spatial-attention-based GNN.

In terms of performance:

  • GNNs like DGCNN are relatively lightweight but tend to achieve lower accuracy.
  • In contrast, Graph Transformers generally yield better performance, albeit with higher model complexity and computational cost.

🟨 Q3: Methods and Evaluation Criteria (Robustness to Perturbations and Variability)

We introduced noise into EEG signals and randomly dropped electrodes to assess model robustness. As shown in the table below, our model shows the highest resilience, with the smallest performance drop under noisy and incomplete inputs.

We also evaluated subject variability by comparing performance in subject-dependent and subject-independent settings. Our model demonstrates greater stability, showing the least degradation in the subject-independent scenario compared to other methods.

Table 2. Model Robustness under Perturbations

| Model | Before Perturbation | Add Noise to EEG | Randomly Drop Electrodes |
|---|---|---|---|
| GCN | 76.4% | 72.8% | 71.4% |
| Graph Transformer | 80.4% | 74.6% | 75.7% |
| GraphMAE | 83.3% | 78.5% | 77.8% |
| Ours | 86.9% | 83.7% | 84.0% |
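
For reference, a toy version of the two perturbations (assuming additive Gaussian noise and zero-masking of dropped channels; the authors' exact protocol may differ):

```python
import numpy as np

def perturb_eeg(eeg, noise_std=0.1, drop_frac=0.2, seed=0):
    """Return a noisy copy and an electrode-dropped copy of an EEG array.

    eeg: (n_channels, n_samples); noise is scaled to the signal's std."""
    rng = np.random.default_rng(seed)
    noisy = eeg + rng.normal(0.0, noise_std * eeg.std(), size=eeg.shape)
    keep = rng.random(eeg.shape[0]) > drop_frac   # per-channel keep mask
    dropped = eeg * keep[:, None]                 # zero out dropped electrodes
    return noisy, dropped
```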

🟪 Q4: Methods and Evaluation Criteria (Ablation on EEG Graph Construction Strategies)

We conducted an ablation study on different EEG graph construction strategies. As shown in the table below:

  • Coherence and Mutual Information yielded the best performance.
  • Pearson Correlation ranked next.
  • Spatial Distance-based graphs performed the worst.

We speculate that spatial graphs may lead to overemphasis on spatial locality—as pointed out by Reviewer v1Sd—resulting in degraded performance. In contrast, functional or statistical connectivity better captures neural relationships and leads to improved results.

Table 3. Accuracy with Different EEG Graph Construction Methods

| Model | Pearson Correlation | Coherence | Mutual Information | Spatial Distance |
|---|---|---|---|---|
| Graph Transformer | 71.6% | 73.1% | 71.9% | 67.5% |
| GraphMAE | 73.8% | 74.7% | 73.8% | 68.5% |
| Ours | 76.8% | 78.1% | 77.5% | 72.8% |

Reviewer Comments

N/A

Final Decision

The paper makes a significant contribution to the field of EEG analysis by presenting a promising self-supervised pre-training and knowledge distillation framework aimed at transferring representations from high-density to low-density EEG recordings, a central challenge in EEG-based machine learning. The proposed approach is innovative, combining graph contrastive learning with masked autoencoders, followed by a graph topology distillation loss. The empirical validation is extensive, spanning multiple EEG domains, and is notably strengthened by the additional results provided in the rebuttal. Moreover, the method achieves competitive or superior performance relative to state-of-the-art GNN-based and time-series pretraining methods.

In summary, given the positive consensus among reviewers, the strengthened rebuttal, and the methodological and practical relevance of the contributions, I recommend acceptance of this submission.