EEG2Video: Towards Decoding Dynamic Visual Perception from EEG Signals
We build an EEG-video dataset and propose a framework to generate videos from EEG signals, taking an important step towards decoding dynamic visual perception from EEG.
Abstract
Reviews and Discussion
The manuscript proposes an EEG decoding model that aims to reconstruct the video stimuli presented to participants. To this end, a large dataset covering twenty participants is collected. The dataset is annotated with respect to different features such as the dominant colour of the video or the presence of a human in the video frames. Lastly, the authors also develop an EEG2Video model that can be considered the baseline for this dataset in future research.
Strengths
The problem this manuscript addresses has gained increasing importance in recent years due to numerous potential applications of brain-computer interfaces. Consequently, the dataset developed as part of this work could add value to the research community.
The source code is provided in the supplementary materials, which makes the reproduction of reported results easier.
The reconstructed videos are visually appealing, which helps to showcase the potential of EEG decoding.
Weaknesses
The decoding power that recorded EEG signals offer is questionable with respect to some of the annotations. For instance, the chance level for the "Human" task is 71.43 and the best method (the proposed model) only reaches 73.43. Similar observations can be made for almost all other tasks. In some scenarios, the reported accuracies are even below the chance level, for instance using DE features in the "Numbers" task, the best-performing model reaches 64.2 and the chance level is at 65.64.
Following the point above, it's unclear whether even those cases that are above chance level have any statistical significance as no Student t-test or Wilcoxon Signed-Rank test is conducted.
Questions
Why is SSIM not reported for video-based evaluation in Table 2?
The choices for the nine classes (land animal, water animal, plant, exercise, human, natural scene, food, musical instrument, transportation) seem a bit arbitrary. It would be nice to read the rationale behind these categories.
It would be helpful to include experimental results to support the benefit of global and local data streams in the proposed model. If we perform an ablation study where the global branch is lesioned, how does the performance change?
Limitations
The limitations of the work are sufficiently discussed.
Thanks for your valuable comments. Below we have addressed your questions and concerns point-by-point.
W1. The decoding power ...
We appreciate your careful reading. We fully agree that the decoding power of the recorded EEG signals is questionable for some specific tasks; in fact, in Line 258 we describe them as "difficult or even impossible to classify". However, we deliberately included experiments on these tasks for the following reasons, and we respectfully disagree that this should count as a weakness.
As a pioneering work, our goal is to explore the potential of using EEG signals for reconstructing visual perception. At the time the dataset was created, nobody knew the boundary of EEG's decoding ability. We wanted to determine what kind of visual information can be decoded from EEG and use it as intermediate clues to further enhance the reconstruction ability of the EEG2Video framework.
To this end, we not only studied distinguishing among the 40 concepts, but also investigated other decoding tasks, covering both low-level (Color, Fast/Slow, Number) and high-level (Human, Face) attributes. We conducted comprehensive experiments to validate the decoding performance on each task with 7 machine learning methods on raw EEG data and two hand-crafted EEG features.
As a result, we reached the conclusion that Number, Human, and Face are difficult and probably infeasible tasks. On the other hand, we found that it is possible to decode visual information like Color and Fast/Slow from EEG, which guided us to develop useful modules in our EEG2Video framework for incorporating such information, e.g., the semantic predictor for class information and DANA for fast/slow information.
Of course, we could simply omit the results of the indistinguishable tasks from the paper for a cleaner presentation. However, we chose to keep them in Table 1 and believe they offer helpful empirical findings to the neuroscience community and will facilitate future brain decoding research by focusing attention on the promising semantics.
W2. Following the point above, ...
Thanks for your constructive advice. We conducted a Student's t-test and calculated p-values for the performance of our GLMNet model on raw EEG against the chance level, for all classification tasks. The results are as follows, and we will add them to our paper:
| | 40c t1 | 40c t5 | 9c t1 | 9c t3 | Color | Fast | Numbers | Face | Human |
|---|---|---|---|---|---|---|---|---|---|
| p-value | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.28 | 0.07 | 0.17 |
According to this statistical significance analysis, the classification results in the first 6 columns are significantly above the chance level (p < 0.005), while there is no significant gap for the last 3 columns (p > 0.05). We will take your advice and add these results to strengthen our claims and intuitions.
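For reference, here is a minimal sketch of how such a significance test could be computed, assuming per-subject accuracies are compared against the chance level with a one-sample Student's t-test; the accuracy values and variable names below are placeholders, not our actual results:

```python
# Hypothetical example: one-sample Student's t-test of per-subject accuracies vs. chance level.
import numpy as np
from scipy import stats

accs = np.array([0.31, 0.28, 0.33, 0.29, 0.30])   # placeholder per-subject accuracies for one task
chance = 1.0 / 9                                  # e.g., chance level of a balanced 9-class task
t_stat, p_two_sided = stats.ttest_1samp(accs, popmean=chance)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2  # test "above chance"
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.3g}")
```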
Q1. Why is SSIM not reported for video-based evaluation in Table 2?
SSIM is a metric that reflects the structural similarity between two images. To compute the SSIM between two videos, the algorithm takes each corresponding frame pair from the two videos and computes the SSIM between them.
In our paper, we calculate the SSIM of each frame pair between the ground-truth video and the corresponding reconstructed video. There is no need to add a separate video-based SSIM, since the frame-based SSIM and the video-based SSIM carry the same meaning: both reflect pixel-level similarity.
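As an illustration of this frame-wise computation, here is a minimal sketch assuming grayscale 8-bit frames of shape (T, H, W); the function name is ours, not the paper's code:

```python
# Average per-frame SSIM between a ground-truth video and its reconstruction.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def video_ssim(gt_video, rec_video):
    """gt_video, rec_video: arrays of shape (T, H, W), uint8 grayscale frames."""
    scores = [ssim(gt, rec, data_range=255) for gt, rec in zip(gt_video, rec_video)]
    return float(np.mean(scores))
```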
Q2. The choices for the nine classes seem a bit arbitrary ...
That is a very interesting question. In fact, we spent a lot of time deciding on the classes to use, and before we finalized the list, we had already considered the EEG-VP tasks. The general idea was to use natural videos instead of artificial ones (like anime) and to balance the different types of videos for the EEG-VP tasks as much as possible. Specifically, we referred to several related works, including [1][2][3][4], and obtained the list of classes according to the following guidelines:
- We removed some static classes that are not suitable to be presented as videos, e.g., golf balls, keyboards, etc.
- We would like to involve roughly 1/3 classes with human beings, 1/3 classes with animals and plants, and 1/3 of non-living scenes or objects.
- We would like to have roughly 1/2 videos with rapidly changing scenes, and the other half with relatively static objects.
- We would like to balance the numbers of the main colors.
As a result, we obtained the classes described in Figure 1 and present their statistics in Figure 2. However, it is very hard to control these numbers precisely, so we call for future work to design a more balanced set of videos as visual stimuli.
Q3. It would be helpful ...
Thanks for your constructive advice. Actually, GLMNet's global and local encoders are both simple CNNs or MLPs. We use ShallowNet (for raw EEG) or an MLP (for EEG features) as GLMNet's global encoder, and ShallowNet and MLP (equivalent to the ablated GLMNet) have both been compared as baseline models in Table 1. It can be seen that the local encoder improves performance on the brain decoding tasks. The reason the local data stream works is that it introduces an inductive bias into the network, encouraging it to focus on the visual cortex.
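To make the global/local idea concrete, here is an illustrative sketch (not the authors' exact architecture; the channel indices, feature dimension, and hidden sizes are assumptions) of a two-stream encoder whose local branch only sees occipital channels:

```python
import torch
import torch.nn as nn

class GlobalLocalMLP(nn.Module):
    """Global branch over all channels plus a local branch over visual-cortex channels."""
    def __init__(self, n_channels=62, n_features=5, occipital_idx=(50, 51, 52, 53, 54),
                 hidden=128, n_classes=40):
        super().__init__()
        self.occipital_idx = list(occipital_idx)   # hypothetical indices of occipital electrodes
        self.global_enc = nn.Sequential(nn.Flatten(),
                                        nn.Linear(n_channels * n_features, hidden), nn.ReLU())
        self.local_enc = nn.Sequential(nn.Flatten(),
                                       nn.Linear(len(occipital_idx) * n_features, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, channels, features), e.g., DE features
        g = self.global_enc(x)
        l = self.local_enc(x[:, self.occipital_idx, :])
        return self.head(torch.cat([g, l], dim=-1))
```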
[1] C. Spampinato, et al. “Deep learning human mind for automated visual classification”
[2] H. Ahmed, et al. “Object classification from randomized EEG trials”
[3] H. Wen, et al. “Neural encoding and decoding with deep learning for dynamic natural vision,”
[4] Allen, et al. "A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence."
Thanks a lot for responding to my questions.
I have one further question.
Could you please share with me your thoughts on why the Numbers, Face, and Human classification results are not statistically significant? What differentiates them from the other classification tasks?
Thanks for your response; your question is really interesting and inspiring. Here are some of our thoughts.
Looking back to when we were designing the EEG-VP benchmark, we selected the classification tasks based on our neuroscience knowledge as follows:
- Fine-grained / coarse concepts: The ventral stream of the two-stream hypothesis [1] is associated with object recognition and form representation, running from the V1 sublayers to areas of the inferior temporal lobe.
- Color: the primary visual cortex (V1, in the occipital lobe) processes color information within a very short time window.
- Fast/Slow: the middle temporal visual area (MT or V5) is thought to be highly related to the perception of motion[2].
- Face/Human: the fusiform face area [3] in the fusiform gyrus is the part of our visual system that recognizes human faces.
- Number: the parietal lobe is recognized to be important for counting and numerical cognition [4][5].
However, we did not know how well these neural activities are reflected in EEG signals, so we conducted the experiments. They showed that the model can achieve quite good results on Concepts, Color, and Fast/Slow, while it fails on the other tasks. Here are our hypotheses:
- For Number, the subjects were not told to count the objects when watching the videos, so the counting-related areas may not have been activated.
- For Face/Human, the fusiform gyrus lies deep in the brain rather than on the surface, so its signal may be very weak or even absent in EEG. Moreover, some of the humans and human faces appearing in the stimuli are not conspicuous enough to be noticed by subjects who were focusing on viewing whole scenes.
- For the other classification tasks, which involve the visual cortex and motor cortex in the occipital and temporal lobes, the relevant areas lie right on the surface of the brain, so their activities may be more easily captured by EEG. (Incidentally, this is where our GLMNet takes its inspiration.)
These are only our speculations; a more comprehensive conclusion requires further neuroscience effort. Nevertheless, we would like to present our preliminary results in the paper to inspire future exploration.
[1] Goodale MA, Milner AD (1992). "Separate visual pathways for perception and action". Trends Neurosci. 15 (1): 20–5.
[2] J. H. Maunsell and D. C. Van Essen, “Functional properties of neurons in middle temporal visual area 476 of the macaque monkey. ii. binocular interactions and sensitivity to binocular disparity,” Journal of 477 neurophysiology, vol. 49, no. 5, pp. 1148–1167, 1983.
[3] Kanwisher N, McDermott J, Chun MM (Jun 1, 1997). "The fusiform face area: a module in human extrastriate cortex specialized for face perception". J. Neurosci. 17 (11): 4302–11.
[4] Dehaene, Stanislas, Ghislaine Dehaene-Lambertz, and Laurent Cohen. "Abstract representations of numbers in the animal and human brain." Trends in neurosciences 21.8 (1998): 355-361.
[5] Dehaene, S (2003). "Three parietal circuits for number processing". Cognitive Neuropsychology. 20 (3): 487–506.
Thanks a lot for the detailed discussion.
The authors present a novel annotated dataset of EEG-video pairs and an approach to reconstruct videos from EEG brain activity data. The dataset contains brain responses of 20 subjects watching 2-s videos from 40 general concepts. A total of 7 classification tasks are built based on the metadata available in the dataset (e.g. fine-grained concept, coarse concept, color, etc.) and a video diffusion pipeline is trained to reconstruct videos. Classification results on the different tasks are presented, along with generated video frames and image quality metrics.
Strengths
- Originality: the study of video decoding from EEG has not been the focus of much attention yet, and the presentation of a new dataset is both original and useful to the community.
- Clarity: the manuscript is overall clearly written and the general approach is well motivated.
- Quality: interesting analysis of what information can be decoded (color, optical flow, object number, human face, human) through the different classification tasks described in Section 3.5.
- Significance: the presented dataset and analysis set the stage for more generalizable results in visual decoding from EEG.
Weaknesses
- The dataset contains videos spanning 40 concepts which are seen in both training and test sets as described in Section B.2, i.e. there is "categorical leakage" between the two sets. This makes it very likely that the model learns to mostly (or solely) predict a concept, rather than predict the finer grained visual information contained in a video. Following this hypothesis, the EEG encoder, the seq2seq model and the semantic predictor could all be replaced by a single classifier that outputs the label of one of 40 concepts, followed by a lookup table that returns the corresponding concept-specific conditioning vector to be fed to the video diffusion model, and generation performance would remain similar. To test whether that is indeed the case, it would be interesting to train the model on e.g. 25 concepts, and test it on the remaining left out 5 concepts (also taking into account the next point about finetuning the video model).
- Moreover, as described in Section 4.2, line 244: “[...] all video-text pairs are used for fine-tuning the Stable Diffusion Model [...]”. If that is indeed the case, this means that the video diffusion model has already seen the specific videos it tries to predict later on, which makes the generation task significantly easier.
Questions
- What is the architecture and hyperparameters for the EEG encoder (Section 4.2)?
- In Section 4.1: What is meant by “treating all channels equally”? Most deep learning encoders trained on EEG data have some kind of spatial processing layer, e.g. a convolutional layer, that learns to reweigh different channels end-to-end [1].
- The 40-class classification performance appears very low, however generations are qualitatively very good. Can you describe the process for selecting the generations shown in the paper?
- In Table 2, what does 40-way classification refer to when fewer than 40 classes are used?
[1] Schirrmeister, Robin Tibor, et al. "Deep learning with convolutional neural networks for EEG decoding and visualization." Human brain mapping 38.11 (2017): 5391-5420.
Limitations
Yes.
Thank you for your expert comments. Below we address your questions and concerns point by point.
W1.1 ... "categorical leakage" between the two sets.
We may be misunderstanding your concern, but we argue that "categorical leakage" is not an established concept and should not be considered a problem in machine learning. On the contrary, all learning relies on such shared distributions to achieve in-domain generalization. For instance, a diffusion model must be trained on cat images to generate cat images.
Leakage, on the other hand, describes the situation where information that is inaccessible, or only accessible at test time, is used to construct the model during training. Based on your description, the closest concept may be label leakage, where inaccessible label information is used during training. However, in our EEG2Video framework, the only input is the EEG data collected while the subject watches the video, and the "categorical" information is inferred implicitly by the model through the semantic predictor, so no leakage occurs.
W1.2 This ... mostly (or solely) predict a concept ...
The pixel-level and higher-level decoding recover visual stimuli from two different perspectives, where the trade-off between fidelity and meaningfulness needs to be considered.
Decoding from EEG is challenging for several reasons. The classification results in Table 1 show that even information like Numbers and Face is hard to decode, let alone finer-grained visual information at the current stage. Hence, in this work we prioritize recovering the video stimuli via intermediate semantics decoded from EEG, which is also crucial for understanding the complex mechanisms of human perception. We recognize the contribution of categorical information to generating semantically closer results. Nonetheless, the model exploits more than category information, such as color and fast/slow. All in all, our aim is to establish a foundation that integrates both pixel-level features and visual semantics for this particular task.
W1.3 Following ... would remain similar.
Essentially, our framework is based on the alignment in both semantic and visual features, and the pre-trained diffusion priors. The predicted dynamic information and other latent clues are also introduced in the diffusion process for enhancing decoding performance. It is definitely different from a simple combination of classifier + look-up dictionary.
The best 40-class classification accuracy in Table 1 is 6.23%, which upper-bounds the semantic-level accuracy of such a simple combination. However, our framework achieves a semantic-level accuracy of 15.9%, more than twice that.
W1.4 To test ... , train the model on e.g. 25 concepts, and test it ...
We argue that this is currently infeasible for the neuroscience and AI communities. Training on N classes and testing on M other unseen classes is called zero-shot transfer and is studied with large pre-trained models, e.g., [1], which is pre-trained on image-text pairs. Even so, such models can only generate images of seen concepts and their combinations.
As a pioneering work, we focus on providing the basic logistics and the first batch of data, with only 40 concepts and 1400 videos. This is far from enough for zero-shot learning.
W2 Moreover, ..., the generation task significantly easier.
Thanks for your careful reading! The correct statement should be "all video-text pairs from the training set are used for fine-tuning ...". We guarantee that the whole framework, including the diffusion model, has never seen any of the videos it later tries to predict during any training stage. We apologize for the confusion and will revise the sentence.
Q1 ... the architecture and hyperparameters ...?
We adopt our GLMNet as the EEG encoder due to its outstanding visual decoding ability. The hyperparameters are detailed in Appendix B.
Q2 ... “treating all channels equally”? ...
"Treating all channels equally" means adding no inductive bias upon different channels. While deep models learns to reweigh input, correct inductive bias can prevent model from learning from spurious features and improve the generalization ability [2]. In fact, all the modifications on model structure can be treated as some kind of inductive bias.
In our case, GLMNet adds another data stream that focuses on the vision-related channels, which encourages the network to attend to the visual cortex. As a result, it performs better on all tasks in the EEG-VP benchmark and is therefore selected as the EEG encoder.
Q3 The 40-class classification performance appears very low, ...
To clarify, we assure you that we presented a representative range of qualitative results for the generated videos. More importantly, we also evaluated quantitative performance on all EEG-video pairs in the test set.
For Table 2, we ran EEG2Video on the test set and then selected the most representative qualitative results to demonstrate its effectiveness. We also selected typical failure cases in Fig. 13 in the appendix, including confusions between categories, wrong colors, wrong main objects, etc. Please kindly refer to Appendix F for the failure cases.
Q4 In Table 2, what does 40-way ...
The metric verifies the class of the generated images/videos with a pre-trained image/video classifier. In all cases we adopt the same 40-class classifier to fairly compare reconstruction performance, even though our generative model may be trained on fewer classes. We follow [3] and use the same code for calculating this metric, which has been submitted in the Supplementary.
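As a simplified sketch of how such a metric can be computed, the snippet below assumes a plain top-k check over the 40 classes with a pre-trained classifier; the referenced implementation [3] may differ in details such as distractor sampling:

```python
import torch

@torch.no_grad()
def n_way_topk_accuracy(classifier, generated, labels, k=1):
    """classifier: pre-trained 40-class model; generated: batch of frames/videos; labels: (B,) ints."""
    logits = classifier(generated)                     # (B, 40) class scores
    topk = logits.topk(k, dim=-1).indices              # (B, k) predicted classes
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # is the true class within the top-k?
    return hits.float().mean().item()
```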
[1] R. Alec, et al. "Learning transferable visual models from natural language supervision." in ICML, 2021.
[2] Bo Li, et al., “Sparse Mixture-of-Experts are Domain Generalizable Learners” ICLR 2023, Oral
[3] Z. Chen, et. al. “Cinematic mindscapes: High-quality video reconstruction from brain activity,” in NeurIPS, 2023
Thank you to the authors for the detailed answers. Some follow-up questions:
W1: I think I misunderstood Section B.2 - upon re-reading it seems that the cross-validation split was done on video “blocks”, and that a category is seen in a single block only. Therefore this means there are no shared categories between training and test splits and my original point didn’t hold. Can you confirm this is the case?
Q2. I understand now that there are two independent EEG encoders, one that sees all channels, and one that only sees channels from the occipital lobe (Figure 3B). It would be interesting to include the related ablation in an updated version of Table 1 as this indeed seems to be a novel architectural choice.
Q3. According to Table 1, GLMNet achieves 6.2% in 40-class classification top-1 accuracy. My understanding is the video reconstruction pipeline based on GLMNet should then produce a video from the correct category with a similar ratio of correct categories. However, of the 48 examples shown in Appendix F, only 8 (Figure 13) seem to show videos that are not of the exact same category as the ground truth, which corresponds to 87.5% top-1 accuracy. Can you explain this discrepancy?
Thanks for your follow-up comments. Here are the point-by-point answers to your further concerns.
W1: ... Can you confirm this is the case?
We are not sure whether we are misinterpreting your concern; the short answer is that the training and test splits certainly do share video categories. However, we do not understand why this would be a problem.
Let us first clarify the terminology: the category here refers to the fine-grained concept in our paper.
A full experiment session contains 7 video blocks, each block consisting of 5 different videos for each of the 40 categories. Let's number the blocks 1 to 7. By 7-fold cross-validation we mean that, for the first fold, we train on Blocks 1-5, validate on Block 6, and test on Block 7. The second fold uses Blocks 2-6 for training, Block 7 for validation, and Block 1 for testing, and so on. Therefore, in each fold the model is trained on 5 videos * 5 blocks * 40 categories = 1000 video samples, and validated and tested on 5 other videos * 1 block * 40 categories = 200 video samples each.
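For clarity, here is a small sketch of this block-wise 7-fold rotation (the block numbering and function name are illustrative, not the released code):

```python
def block_folds(n_blocks=7):
    """Yield (train, val, test) block indices; each block serves as the test block once."""
    blocks = list(range(1, n_blocks + 1))              # blocks 1..7
    for k in range(n_blocks):
        rotated = blocks[k:] + blocks[:k]
        yield rotated[:5], rotated[5], rotated[6]      # 5 train blocks, 1 val block, 1 test block

for fold, (train, val, test) in enumerate(block_folds(), start=1):
    print(f"fold {fold}: train={train}, val={val}, test={test}")
```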
For a classification task, the categories should of course be shared across train, validation, and test. Machine learning fits a function $y = f_\theta(x)$, parameterized by $\theta$, on a given dataset, and the set of categories that defines the range of $y$ should be consistent; in our case the range is the 40 categories, i.e., $y \in \{1, \dots, 40\}$. If the range were inconsistent, say the model only saw pairs whose labels come from a subset of the categories during training, it would have no idea how to map an input $x$ to an unseen label outside that subset.
Categorical leakage (perhaps you mean label leakage) would mean that the model accidentally sees the label $y$ during training, i.e., it is effectively trained on inputs of the form $(x, y)$. Then, at test time, the model has no way to obtain $y$ as an input and thus fails to generalize. Another typical error is data leakage, which means the same $(x, y)$ pairs appear in both training and testing. However, the case where the test inputs $x$ differ from the training inputs while the label space is shared is exactly what is needed for the model to generalize. To be clear, $x$ in our case is the EEG signal collected from the subject while watching a video clip of category $y$. We guarantee that while the categories are shared, the videos used for testing have never been exposed to the model during training.
Q2: It would be interesting to include the related ablation ...
Thanks for your constructive advice. We have in fact already included this ablation in Table 1: GLMNet's global and local encoders are both simple CNNs or MLPs. We use ShallowNet (for raw EEG) or an MLP (for EEG features) as GLMNet's global encoder, and ShallowNet and MLP (equivalent to the ablated GLMNet) have both been compared as baseline models in Table 1. It can be seen that the local encoder improves performance on the brain decoding tasks.
We will highlight this comparison and make it clear to readers by adding the description "ShallowNet and MLP are the ablated models, without the local encoder focusing on visual-associated channels, compared to our GLMNet" to Section 5.1.1 of the final manuscript.
Q3: According to Table 1, ... Can you explain this discrepancy?
We would like to clarify that Figure 5 shows some of the successfully reconstructed examples to demonstrate the effectiveness of our reconstruction pipeline, rather than all reconstruction results on the test set. There are 200 video clips in the test set, and most of the reconstructed videos are semantically mismatched (the quantitative semantic accuracy is 15.9%); since many failures share the same cause, we present only representative, not exhaustive, failure examples in the last figure of Appendix F. Naturally, the reported semantic accuracy is computed over all EEG-video pairs in the test set, not over the visual samples shown.
Next, we would like to emphasize that the video reconstruction's semantic accuracy is higher than GLMNet's classification accuracy. Essentially, our framework builds on the alignment of both semantic and visual features and on the pre-trained diffusion priors; the predicted dynamic information and other latent clues are also injected into the diffusion process to enhance decoding performance. Consequently, the semantic accuracy of video reconstruction (15.9%) is more than twice GLMNet's classification accuracy (6.2%).
We will change the title of Figure 5 from "Reconstructed Presentations" to "Some Successfully Reconstructed Presentations" to clarify that these examples are selected. We will also highlight the failure cases in a separate section in Appendix F.
W1. Yes, that makes sense; I apologize for the confusion. I’ll bring it back to my original point, which was specific to the video reconstruction task (which I admit could have been clearer). In the context of image/video reconstruction strictly, i.e. no classification task, there is no need for the encoder to have seen images from every category of the test set. In fact, if we were to train the encoder on 30 categories only, we could expect that the encoder will generalize to (some of) the other 10 categories to some degree, given the properties of the shared embedding space used as target. For instance, learning the mapping from EEG to representations of rabbits and cats should make it possible for the model to approximate the mapping between unseen test EEG and representations of dogs. As for the latent diffusion model, it has likely already been pretrained on the same or similar categories.
My point was: if the encoder has seen examples from the test categories at training time, it becomes significantly easier to produce semantically good generations for these categories. In itself that is completely fine, as long as this is clearly reported. However, since the semantic metrics of Table 2 rely on top-k accuracy, it may well be that a much simpler pipeline that always predicts the same frames/video for a given class performs really well (the encoder could literally just output the exact embedding of one of the training examples of the correct category and the diffusion model would generate a video of the correct category). A model that wasn’t trained on the test categories, however, would likely not perform well according to this metric as the categorical information hasn’t “leaked” in the training set. Hence the suggestion to include a baseline that only relies on high-level class information - how much better is the full model of Table 2 as compared to a model that uses a “simple” semantic classifier?
Q1. I missed this, interesting side result - thanks for the clarification.
Q3. From your answer I understand that the reconstructions of Figures 5, 8-12 were manually selected because their semantics matched with the ground truth. This information would be important to include when describing the results.
Thanks once again to the authors for their answers. I’m increasing my score to reflect the points that were addressed during the discussion period.
Thanks for your further response. If we understand correctly now, your concern actually lies in the task setting and the corresponding evaluation metrics. We appreciate your expertise and would like to share our thoughts on these concerns. In general, our response is two-fold:
- The difficulty of mapping EEG to unseen categories. Your intuition about the pretrained diffusion model is acute; however, we have to argue that this is still impossible for EEG-decoded videos with today's technology. This is because the task is essentially a multi-modal translation task from EEG to video. Zero-shot generalization to new categories therefore requires three capabilities: a comprehensive EEG representation space, a powerful video generation model over representations, and a strong alignment between these two representation spaces. A similar thing is feasible from natural language to videos (or images) because we have not only powerful uni-modal generative models such as diffusion models, but also alignment models such as CLIP, pretrained on massive amounts of paired data, and this took decades of research effort in NLP, CV, and the multimodal area.
As for our task, we have the video generation ability, which can transfer across categories, but the EEG foundation model and the alignment model between the two modalities do not exist yet. From this perspective, we build the dataset also as the first batch of paired EEG-video data, supporting future multimodal alignment pretraining research.
- The evaluation metrics for our setting. We now understand your concern: while the comparison is fair for models under the same setting, it would be unfair in the future when models can perform zero-shot transfer.
In fact, it is quite common for similar tasks under different settings to use the same metrics, even though the raw numbers are not directly comparable across settings. For example, the BLEU score is widely used in neural machine translation (NMT) to compare generated sentences with the ground truth, and BLEU scores are naturally higher for standard translation than for zero-shot NMT. Nevertheless, improving the scores in both settings (standard NMT and zero-shot NMT) matters to the whole community.
Besides the accuracy, we also report SSIM to evaluate the pixel-level similarity with the ground truth. Simply retrieving videos of the predicted category, which carries no information such as color, would fail on such metrics. Naturally, our dataset and benchmark also support future work in designing more suitable metrics for zero-shot settings.
Again, we are the first in this area to explore EEG-to-video with a supporting dataset and benchmark, and we build the first framework to offer the first batch of results. While our final goal is aspirational (please refer to our general response), we would argue that comparing our absolute achievement with a well-studied area such as generating videos from text would be too harsh.
The authors provide an EEG-video paired dataset, addressing the lack of data for decoding dynamic visual perception from EEG signals. They also propose a video reconstruction method with a dynamics-aware noise-adding process for this dataset.
Strengths
1. The authors introduce a new EEG-video paired dataset, providing valuable data support for studying dynamic perception using EEG signals.
2. The dataset includes various classification labels, facilitating the analysis of EEG responses to different shapes, colors, frequencies, and other stimuli.
3. The authors propose an adaptive noise-adding method for image generation, tailored to different OFS.
Weaknesses
1. Method innovation: The video generation method primarily comprises modules from previous methods, limiting its innovation.
2. Comparison methods: The authors should compare their method with other video generation approaches, such as those mentioned in the paper that use fMRI to generate dynamic videos (references [31, 32]), simply replacing fMRI features with EEG features.
3. The article's focus is scattered: According to the title and abstract, the article should primarily focus on EEG-to-video generation. However, the experimental section offers limited analysis of this task, concentrating more on classification performance. While classification performance and analysis can reflect the dataset's quality, presenting more analysis of video generation results in the main text, rather than in the appendix, might be more appropriate.
Questions
In the dataset creation (Fig.1), five different video sequences from the same concept are viewed consecutively, followed by a 3-second hint. Could this approach lead to interference between brain signals from different video sequences? Would having a hint between each video be better? Please explain the rationale for this setup.
Limitations
See the Weaknesses and Questions.
Thanks for your valuable comments, and we'd like to express our appreciation that our contributions of the dataset and the benchmarks are well recognized. Below we address your questions and concerns point-by-point.
Weaknesses
Method innovation: ...
We would like to emphasize that, although we are the first to propose the DANA module, the contribution and novelty of our EEG2Video framework do not lie in implementing these techniques per se. Rather, they lie in applying these techniques in a new and challenging domain: reconstructing visual stimuli from dynamic brain activity.
Our work sits at the intersection of neuroscience and CV, where the focus is not solely on inventing new tricks or models. The main aim is to design a novel framework that tackles the unique challenges of determining what visual perceptions can be decoded from EEG and how, while adapting state-of-the-art generative models to this specific task.
On the one hand, the EEG-to-video method is more than a simple adaptation of fMRI-to-video methods because of EEG's significantly lower spatial resolution and higher temporal resolution; we are thus naturally the first to apply the Seq2Seq framework to this newly proposed neural-signal decoding task.
On the other hand, the signal-to-noise ratio (SNR) of EEG is lower than that of fMRI, so we likely need intermediate visual information to reconstruct high-quality and semantically correct videos. Hence, based on the findings from the EEG-VP benchmark, we design the dynamic predictor and DANA to inject the fast/slow information into the video generation process, the semantic predictor to inject class information, and the general Seq2Seq model for decoding low-level visual information like color.
Our contribution is not confined to the EEG2Video framework; even so, we argue that the framework itself is novel, as the Seq2Seq and DANA modules are new and were designed on a prior experimental basis rather than by randomly combining existing modules.
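As a purely illustrative aside, one way a fast/slow prediction could modulate the noise-adding step is to control how much noise is shared across frames (more shared noise keeps frames temporally consistent). The sketch below is our own toy interpretation of that idea, not the paper's actual DANA implementation; the function name and mixing ratios are assumptions:

```python
import torch

def dynamic_aware_noise(z, is_fast, shared_slow=0.8, shared_fast=0.2):
    """z: (T, C, H, W) frame latents; mixes shared and per-frame Gaussian noise."""
    ratio = shared_fast if is_fast else shared_slow
    shared = torch.randn_like(z[:1]).expand_as(z)      # one noise sample reused across frames
    independent = torch.randn_like(z)                  # frame-specific noise
    noise = ratio ** 0.5 * shared + (1 - ratio) ** 0.5 * independent
    return z + noise
```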
Comparison methods: ...
Actually, we have compared against these fMRI-to-video methods [2,3] in our work: the ablation variant w/o Seq2Seq is a straightforward adaptation of [2,3] to the EEG-to-video task. Denote the video diffusion model as T2V; the fMRI-to-video methods [2,3] can all be decoupled into an fMRI encoder E and T2V, where E maps fMRI data to text embeddings e_t, and the reconstructed video is V = T2V(e_t).
Our EEG2Video has two additional modules besides T2V and the encoder E: the Seq2Seq model that predicts the frame latent vectors, and the dynamic predictor for DANA. As the pre-training methods designed for fMRI data cannot be applied to EEG data, the ablation variant w/o Seq2Seq is effectively a straightforward adaptation of [2,3]; the only difference is that the encoder E is trained without any pre-training.
Table 2 shows that Seq2Seq and DANA both enhance video generation performance on all metrics. In other words, our EEG2Video outperforms the previous SOTA fMRI-to-video methods [2,3] on the EEG-to-video task.
The article's focus is scattered: ... .
Thanks for your careful reading of our paper; however, we respectfully disagree that the paper's focus is scattered. All the contributions detailed in the main content are crafted around the title and are logically inseparable from each other.
The goal of our work is to explore the possibility of using EEG signals for reconstructing visual perception. Due to the low SNR, it is almost impossible to reconstruct a video pixel by pixel directly. An indirect method requires decoding intermediate visual information; however, nobody knew the boundary of EEG's decoding ability when the dataset was newly built.
To this end, we conducted comprehensive experiments on the EEG-VP benchmark to identify the distinguishable attributes, e.g., class, color, and fast/slow, versus the indistinguishable ones, e.g., the number of objects. Only then could we design the corresponding modules (DANA for fast/slow, the semantic predictor for class, etc.) in the EEG2Video framework. Hence, we argue that the dataset, the classification tasks, and the reconstruction task are equally important contributions, and it is fair for them to receive equal presentation space.
We carefully chose the expression "EEG2Video: Towards Decoding ..." for our title because, as a pioneering work, rather than providing a standalone model, we believe it is more valuable to share our reasoning throughout the whole research process with the neuroscience community to facilitate future brain decoding research.
Questions
In the dataset creation (Fig.1), ... the rationale for this setup.
We appreciate your careful reading. In fact, we spent a lot of time discussing the experimental protocol, including the issue you raise.
The interference between brain signals from different video sequences, especially from different classes of videos, should indeed be minimized. Therefore, compared with the dataset [4] used by fMRI-to-video works, we decided to add intervals between different scenes. Nevertheless, given the chosen number of classes (40) and video clips (1400), adding an interval after every clip, even a 1-second one, would make the total length 4200 s = 1 h 10 min without any break or relaxation. No one could tolerate such an experiment, and the resulting fatigue and distraction would harm data quality. As a result, we compromised by adding a 3-second hint only after every 5 videos and inserting rest phases between blocks for relaxation.
[1] J. Wu, et al. “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation”
[2] Z. Chen, et al. “Cinematic mindscapes: High-quality video reconstruction from brain activity”
[3] J. Sun, et al. “Neurocine: Decoding vivid video sequences from human brain activties”
[4] H. Wen, et al. “Neural encoding and decoding with deep learning for dynamic natural vision”
This paper presents a novel framework named EEG2Video for video reconstruction from EEG signals, based on a Seq2Seq architecture to densely utilize the highly dynamic information in brain signals. It also introduces a large EEG dataset named EEG-DV, collected from 20 subjects and offering 1400 EEG-video pairs from 40 concepts for studying dynamic visual information in EEG signals.
Strengths
- A large dataset, called EEG-DV, to reconstruct videos from EEG signals, upon which two benchmarks were generated (i.e., EEG Visual Perception Classification benchmark and the Video Reconstruction benchmark) to support evaluating the advances of EEG-based video reconstruction.
- A novel baseline, called EEG2Video, for video reconstruction from EEG signals that can align visual dynamics with EEG based on the Seq2Seq architecture.
Weaknesses
- The different steps constituting the proposed method in Section 4 are not well highlighted.
Questions
It is suggested to highlight the originality of the proposed method in Section 4. It is also suggested to summarize the steps constituting the proposed method as an algorithm.
Limitations
Yes. The limitations are discussed in Appendix.
Thanks for your valuable comments. Below we have addressed your questions and concerns point-by-point.
Weaknesses
The different steps constituting the proposed method in Section 4 are not well highlighted.
Thanks for your constructive suggestion, we will add an algorithm to demonstrate the process in our paper. To address your question here, we detail the steps of our proposed EEG2Video below:
Training:
- Using the video-text pairs in the training set to fine-tune an inflated diffusion model T2V, which generates videos from text embeddings e_t and (noised) frame latent vectors z;
- Training a Seq2Seq model to map EEG embeddings e_EEG to frame latent vectors z, where z is obtained by feeding the original frames into the VAE encoder of Stable Diffusion;
- Training a semantic predictor to map e_EEG to the corresponding text embedding e_t;
- Training a dynamic predictor, a binary classifier, to predict Fast or Slow from e_EEG.
Inference:
- Using the Seq2Seq model to obtain the predicted frame latent vectors z from the test EEG embeddings e_EEG;
- Using the dynamic predictor to predict Fast or Slow from e_EEG, and applying the Dynamic-Aware Noise-Adding (DANA) process to add noise to z, obtaining the noised latents z_T at time step T;
- Using the semantic predictor to predict the text embedding e_t from e_EEG;
- Using the T2V model to generate the video from the predicted e_t and the noised latents z_T (the output of the DANA process).
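To complement the list above, here is a minimal, hypothetical end-to-end sketch of the inference flow in PyTorch-style modules; all module definitions, dimensions, and the noise-adding placeholder are our own illustrative assumptions rather than the actual EEG2Video implementation:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Maps a sequence of EEG embeddings to a sequence of frame latent vectors."""
    def __init__(self, d_eeg=256, d_lat=1024):
        super().__init__()
        self.rnn = nn.GRU(d_eeg, d_lat, batch_first=True)
    def forward(self, e_eeg):                 # (B, T, d_eeg) -> (B, T, d_lat)
        z_hat, _ = self.rnn(e_eeg)
        return z_hat

class SemanticPredictor(nn.Module):
    """Predicts a text embedding from pooled EEG embeddings."""
    def __init__(self, d_eeg=256, d_txt=768):
        super().__init__()
        self.fc = nn.Linear(d_eeg, d_txt)
    def forward(self, e_eeg):
        return self.fc(e_eeg.mean(dim=1))

class DynamicPredictor(nn.Module):
    """Binary fast/slow classifier on pooled EEG embeddings."""
    def __init__(self, d_eeg=256):
        super().__init__()
        self.fc = nn.Linear(d_eeg, 2)
    def forward(self, e_eeg):
        return self.fc(e_eeg.mean(dim=1))

def reconstruct(e_eeg, seq2seq, sem_pred, dyn_pred, t2v, add_noise):
    """Inference: predict latents, add dynamics-aware noise, condition T2V on the predicted text embedding."""
    z_hat = seq2seq(e_eeg)                    # predicted frame latents
    is_fast = dyn_pred(e_eeg).argmax(-1)      # fast/slow decision
    z_T = add_noise(z_hat, is_fast)           # DANA-style noise-adding (placeholder callable)
    e_t = sem_pred(e_eeg)                     # predicted text embedding
    return t2v(z_T, e_t)                      # video diffusion model generates the clip
```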
We acknowledge that we did not allocate more space to highlighting the proposed EEG2Video method in the main content. However, it is worth mentioning that our contribution is not limited to the proposed algorithm. The goal of our work is to explore the potential of EEG signals for brain decoding. As a pioneering work, we build the dataset to support the area and conduct the EEG-VP benchmark to determine what kind of visual information can be decoded from EEG signals. Based on the empirical finding that color and fast/slow can be decoded from EEG, we develop two modules in EEG2Video: Seq2Seq for color and DANA for fast/slow.
We appreciate that you acknowledge these contributions when summarizing the strengths of our paper. These contributions are logically coherent with each other; hence, we would argue that the dataset, the classification tasks, the reconstruction task, and the method itself are equally important contributions, and it is fair for them to receive equal presentation space in the main content. We truly value your suggestion and will add more details to the paper, likely in the appendix of the final version.
Questions
It is suggested to highlight the originality of the proposed method in Section 4. It is also suggested to summarize the steps constituting the proposed method as an algorithm.
Thanks for your constructive suggestion. We will add an algorithm to the appendix to demonstrate the steps constituting our proposed EEG2Video, as stated above. Please kindly refer to the PDF file in the general response, where we present the EEG2Video algorithm.
Thank you for addressing my comments and for the detailed rebuttal. A discussion will be held with the other reviewers to reach a comprehensive decision.
We are very glad to have addressed your concerns. If you have any further questions, please feel free to ask. Thanks again!
We thank all the reviewers for their valuable and expert comments and suggestions. We are pleased to find that all reviewers have reached the consensus that our dataset is novel and valuable to the research community. Moreover, we are also glad to see that our contributions in building the baselines (nkHY, 9iBh), the interesting analysis (ACPC), and the visually appealing results (Kswp) are well recognized, and that the paper is considered clearly presented (9iBh, ACPC).
Here, we sincerely invite all reviewers to read this general response before diving into the detailed responses to individual concerns. To help readers logically understand our research workflow, we describe our thinking on what the actual problem is, how we reached our solution step by step, and what we are contributing; most concerns about scattered contributions, the indirect prediction methodology, the framework, and the allocation of the 9 pages can then be naturally resolved.
- The roadmap for EEG2Video. Ultimately, standing between neuroscience and AI, we are seeking solutions to reconstruct dynamic visual stimuli from EEG signals, due to their significantly higher temporal resolution and lower latency compared with other brain signals like fMRI. There are two obvious paths: a) directly decoding the pixel-level information of the video from EEG; b) indirectly decoding the video via intermediate semantics. As a first attempt, we discarded the direct way almost immediately. From a neuroscience perspective, only the primary visual cortex (V1) is related to such very low-level perception, so only two channels (O1, O2) would be useful for that kind of decoding; the full visual pathways, however, span almost the entire brain to process information into high-level concepts like colors, motion, and recognition, and are very useful for indirect decoding. From an AI perspective, generating arbitrary pixel combinations with contemporary generative models is essentially impossible, as it would require OOD generalization. As a result, we try our best to exploit intermediate information to help video reconstruction in our work.
- The EEG-VP benchmark and the classification results. Taking the indirect decoding mechanism as the main approach, the next question is what intermediate information to use. Nobody could answer this question before we built the EEG-DV dataset, simply because the data resources for such an analysis did not exist (by the way, this adds to the value of our dataset, and we are glad that most reviewers recognize it). In fact, we had selected the classification tasks of interest based on our knowledge as early as the stage when we were choosing the videos, and we tried to balance the different types of videos as much as possible. The findings from the classification tasks demonstrate that some semantics, such as Fast/Slow and Class, can plausibly help, while others, such as the number of objects, are almost impossible to decode. These experiments greatly guided the design of the EEG2Video framework and will likely guide future attempts, which is why we insist on including the indistinguishable tasks in the paper and the benchmark. In this sense, the EEG-VP benchmark and the classification results are not just the icing on the cake but a logically indispensable part of our contributions.
- The design of EEG2Video. Now we reach the part readers may be most interested in. Our contributions in the framework are two-fold. The first part comes from the task's properties: for the first time, we can model stimulus reconstruction from brain signals as a sequence-to-sequence task, given the modalities of the input and output; also, for decoding visual stimuli, we emphasize the visual cortex by adding an inductive bias to the model, yielding GLMNet. The second part comes from the findings of the classification results: we incorporate the distinguishable information by implementing the semantic predictor, the dynamic-aware noise-adding process, etc. We admit that we are not inventing new neural network blocks, but selecting effective inductive biases for the model should definitely count as a contribution.
We have now laid out the coherent logic among our contributions in collecting the dataset, building the two benchmarks, and designing the EEG2Video framework. They are equally important to achieving the final goal, so we are not going to reallocate pages among them. Meanwhile, we still value all the reviewers' suggestions, which will definitely help us improve the paper's quality. Therefore, we will make the following modifications to the paper:
- We will add more details about the model architecture and hyperparameter selection in the appendix.
- We will add the workflow in the form of an algorithm, as shown in the attached PDF file.
- We will make some expressions more accurate and clear.
- We will add the above discussion, or a link to this page, to the publicly available version to help readers better understand it, regardless of the final decision on this paper.
Finally, we would like to express our greatest appreciation and excitement again that all the reviewers recognize the potential of our work for the whole neuroscience and AI community, especially the BCI community. The word "Towards" in the title expresses our recognition of its position: as a pioneering work, it marks the beginning rather than the end of a new frontier. While the methods and results are still at a very preliminary stage compared with the final desired goal, we hold the deepest belief that with our dataset, our benchmarks, and our framework, the publication of EEG2Video will open immeasurable possibilities to push the area forward.
This paper develops a dataset and first algorithm for decoding videos from EEG. They show that certain features of video (color, motion speed) can be decoded, but there were other aspects (number of objects, presence of human face) they were not able to decode with their network and dataset.
There was a lot of discussion and the authors answered most of the questions. (One of the reviewers kept their score at 4 as they believed the changes would be substantial, but agreed that the work is a valuable contribution - I feel the changes do not alter what has been done, but only clarify some of the presentation.) This area is of great interest, and this appears to be the first work decoding video from EEG. There are interesting findings and algorithmic design choices that others in the field can build on. It is also clearly written. I think this work is worthy of NeurIPS presentation and will receive a lot of interest.