Long-Range Feedback Spiking Network Captures Dynamic and Static Representations of the Visual Cortex under Movie Stimuli
Abstract
Reviews and Discussion
The authors propose a deep spiking network with feedforward and feedback connectivity, trained on natural movies and on static images, and compare the similarity of representations in their artificial network to the similarity of representations evaluated in the mouse visual cortex. Using this measure, they find a good fit of the artificial spiking network to the data. Feedback connections and the spiking nature of neural activity are both important for achieving good performance, potentially because they allow the model to extract temporal features and encode information with spiking sequences. The model outperforms state-of-the-art alternatives.
Strengths
The paper proposes a new model (building on previous research) that both outperforms notable alternatives among artificial deep neural networks and brings plausible insights about the benefits of feedback processing and the spiking nature of neural activity for information processing in biological brains. This work is of interest to the broader audience of NeurIPS.
Weaknesses
In line 318 authors mention the necessity of having a spiking model and of having feedback connections but do not refer to any specific Figure or Appendix. It would be important to provide evidence for these claims.
Questions
The authors report that the spiking nature of neural activity is computationally useful to the model, as it allows spike-sequential coding and extraction of temporal features. Moreover, spiking seems to allow the information to be processed more flexibly, as it is not limited by specific filter sizes. These results seem to be major insights of the model, but they should be backed up by further evidence. Could the authors provide clearer results showing the benefits of spiking? If so, these results should also be better emphasized, e.g., in the abstract.
Authors do not provide much information about how they trained their network in the main part of the paper. While I appreciated the writing, some more details on the training would help to better assess the soundness of results.
To evaluate the TSRSA score, authors evaluate such score in every measured cortical region of the mouse brain and average across regions. An alternative could be to instead compare the artificial network with each region and report which region has the activity that is the most similar to the artificial network. Is there a good reason for averaging? Could some regions be better fitted than others?
Limitations
Limitations are addressed, even though authors could consider discussing the lack of lateral connectivity in their model. Lateral connectivity has a major impact on the neural activity in biological neural networks.
We thank the reviewer for being supportive of our work and for the constructive comments. We will try our best to address the comments. Below are our detailed responses.
1. About line 318.
We apologize for not referring to the corresponding table. The conclusion comes from the results on the right side of Table 2. We will add a reference to that table in the manuscript.
2. About spiking mechanisms.
Our manuscript already contains experiments comparing spiking and non-spiking networks to demonstrate the importance and benefits of the spiking mechanism. Moreover, following the reviewers' suggestions, we conducted additional experiments to provide more evidence and strengthen our conclusions (please refer to response 1 to Reviewer YZqi). We will add and emphasize these results in our manuscript.
3. About the training of our work.
Due to space constraints, we have put the detailed network training procedure and training parameters in the appendix. Please refer to Appendix C.
4. Is there a good reason for averaging? Could some regions be better fitted than others?
We use the average across brain regions as the similarity score for two reasons. First, neurons responsive to movie stimuli are found in all six cortical regions [1]. Second, the network layers with the best similarity scores to most mouse cortical regions lie at the same depth [2]. We show the individual similarity scores of our model for each region in Table R1, suggesting that there is no significant difference across regions.
| | VISp | VISl | VISrl | VISal | VISpm | VISam |
|---|---|---|---|---|---|---|
| Movie1 | 0.5274 | 0.5007 | 0.5049 | 0.5147 | 0.5295 | 0.5438 |
| Movie2 | 0.2223 | 0.2618 | 0.2955 | 0.3003 | 0.3012 | 0.3153 |
Table R1: TSRSA scores of LoRaFB-SNet for each of the six cortical regions.
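As a check, averaging the per-region scores in Table R1 reproduces the overall per-movie scores reported elsewhere in this discussion (0.5202 and 0.2827); a minimal sketch:

```python
import numpy as np

# Per-region TSRSA scores from Table R1 (rounded to 4 decimals),
# averaged into the single per-movie similarity score used in the paper.
regions = ["VISp", "VISl", "VISrl", "VISal", "VISpm", "VISam"]
movie1 = np.array([0.5274, 0.5007, 0.5049, 0.5147, 0.5295, 0.5438])
movie2 = np.array([0.2223, 0.2618, 0.2955, 0.3003, 0.3012, 0.3153])

print(round(movie1.mean(), 4))  # 0.5202
print(round(movie2.mean(), 4))  # 0.2827
```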
[1] Saskia EJ de Vries, et al. "A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex." Nature neuroscience 2020.
[2] Jianghong Shi, et al. "Comparison against task driven artificial neural networks reveals functional properties in mouse visual cortex." NeurIPS 2019.
I thank the authors for replying to my questions.
Additional analysis strengthens the evidence of the benefit of spiking versus non-spiking networks on capturing representational similarity in the mouse visual cortex. However, it remains somewhat unclear why spiking is actually beneficial. Authors make a valuable first step to demonstrate such benefit, even though their datasets are relatively small. Their hypothesis of the membrane potential dynamics being helpful (I suppose authors refer to the subthreshold dynamics of the membrane potential) is an interesting and viable hypothesis, but it is not proven in this paper. Future work could dig deeper to explain the benefit of spiking for the information processing of natural and time-dependent stimuli.
I have a question about the choice of the Pearson correlation coefficient to evaluate the similarity of movie frames and of Spearman rank coefficient to compute the similarity of the model and the data. I suppose that the choice of the Spearman coefficient is motivated by the ability to better capture nonlinear relations. Why is similarity of movie frames computed with Pearson correlation coefficient? Authors could justify these choices also when they introduce their method for computing the representational similarity.
Does the membrane time constant of the LIF neurons influence the TSRSA score? Authors report using tau=0.5, is this in units of milliseconds? Biologically plausible values of the membrane time constant are between 10 and 20 milliseconds. Can authors comment on that? Has the dependence of results on the membrane time constant been tested?
I find that the lack of local recurrent connections creates a major discrepancy between the proposed model and sensory networks in biological brains. It is expected that in biological networks that process sensory stimuli, local recurrent connections have a major impact on signal processing [1] and can support efficient computations on sensory features with biologically plausible neural architectures [2]. I would find it important that authors better describe this limitation of their model as they revise the paper.
[1] Bourdoukan et al. "Learning optimal spike-based representations." NeurIPS 2012.
[2] Koren and Panzeri. "Biologically plausible solutions for spiking networks with efficient coding." NeurIPS 2022.
We thank the reviewer for the feedback. We will provide more clarification on the reviewer's concerns.
- The introduction of spiking mechanisms into deep neural networks to improve representational similarity to the visual cortex is a relatively new topic in the field. As pioneering work, we think our novel model is of interest to the computational neuroscience community. However, we agree with the reviewer that exploring how the properties of the spiking mechanism contribute to brain-like information processing requires further experiments and analyses in future work.
- Our choice of the Pearson correlation coefficient to compute the similarity of neural representations for movie frames is mainly due to computational efficiency. Although the Spearman correlation coefficient can capture nonlinear (monotonic) relationships, computing it requires obtaining the ranks of the features, i.e., sorting high-dimensional neural representations. Given the huge number of features per layer of deep neural networks, the Spearman correlation coefficient incurs a high time cost. Here, we use randomly generated data to test the time cost of the two methods for different feature dimension sizes (the two movies contain 900 and 3600 frames, respectively). The data reported in Tables R1 (Movie1) and R2 (Movie2) are mean±std in seconds. The results show that the computational cost of the Spearman correlation coefficient is tens of times higher than that of the Pearson correlation coefficient, especially for high-dimensional data. Therefore, although the Spearman correlation coefficient might lead to higher scores, we choose the Pearson correlation coefficient for efficiency.
| Method | | | | | |
|---|---|---|---|---|---|
| Pearson correlation coefficient | 0.020±0.001 | 0.036±0.001 | 0.151±0.008 | 1.193±0.008 | 13.632±0.440 |
| Spearman correlation coefficient | 0.033±0.001 | 0.217±0.005 | 2.270±0.017 | 27.257±0.026 | 344.818±0.383 |
Table R1: Time cost (mean±std, seconds) for different feature dimension sizes on Movie1 (900 frames).
| Method | | | | | |
|---|---|---|---|---|---|
| Pearson correlation coefficient | 0.310±0.002 | 0.384±0.003 | 1.003±0.010 | 8.118±0.026 | 88.463±0.220 |
| Spearman correlation coefficient | 0.411±0.006 | 1.081±0.018 | 9.297±0.021 | 111.029±0.184 | 1425.066±5.913 |
Table R2: Time cost (mean±std, seconds) for different feature dimension sizes on Movie2 (3600 frames).
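The comparison can be reproduced in spirit with a toy benchmark (the actual feature dimension sizes are omitted from the table headers above, so the dimension below is illustrative; Spearman is implemented as Pearson on ranks, which makes the extra per-row sorting cost explicit):

```python
import time
import numpy as np

def frame_similarity_pearson(X):
    """Frame-by-frame Pearson correlation matrix (rows = frames)."""
    return np.corrcoef(X)

def frame_similarity_spearman(X):
    """Spearman = Pearson on ranks; the per-row double argsort is the
    extra cost for high-dimensional features (tie handling omitted)."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1).astype(float)
    return np.corrcoef(ranks)

rng = np.random.default_rng(0)
X = rng.standard_normal((900, 4096))   # 900 frames (Movie1-like); the
                                       # feature dimension is illustrative
for fn in (frame_similarity_pearson, frame_similarity_spearman):
    t0 = time.perf_counter()
    C = fn(X)
    print(fn.__name__, f"{time.perf_counter() - t0:.3f}s", C.shape)
```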
- We apologize for the clerical error here; the value should be τ = 2, i.e., in units of simulation time steps rather than milliseconds. We choose this value for the sake of task pre-training and do not match it to a biologically plausible value. In visual task training for deep spiking networks, 2 is a widely used empirical value for the membrane time constant of LIF neurons, and larger values (e.g., τ = 16) can lead to significant degradation in task performance [1]. We will investigate the effect of this value on neural similarity in the future to study its correspondence with real membrane time constants.
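For concreteness, a discrete-time LIF update with such a dimensionless τ can be sketched as follows (a minimal sketch assuming the decay-input form common in SpikingJelly-style implementations; the paper's exact formulation, threshold, reset, and drive values may differ):

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """One discrete-time LIF update (assumed decay-input form): the
    membrane potential decays toward v_reset and integrates input x,
    with tau in simulation time steps, not milliseconds."""
    v = v + (x - (v - v_reset)) / tau
    spike = (v >= v_threshold).astype(float)
    v = np.where(spike > 0, v_reset, v)   # hard reset after a spike
    return v, spike

v = np.zeros(4)
spikes = []
for _ in range(10):                       # constant drive of 1.5 per step
    v, s = lif_step(v, np.full(4, 1.5), tau=2.0)
    spikes.append(s)
print(np.sum(spikes, axis=0))             # prints [5. 5. 5. 5.]
```

With this drive each neuron crosses threshold every second step, illustrating how τ sets the integration speed of the subthreshold dynamics.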
- We strongly agree that local recurrence (e.g., lateral connections) is also important for information processing in the brain, while our model focuses primarily on feedback connections across brain regions. We will mention this limitation in our manuscript, and in future work we will combine our long-range feedback connections with intra-regional recurrent connections to investigate their implications.
We will reflect the above mentioned reasons for the methodology choices as well as the limitations and prospects of our work in the revised manuscript.
[1] Wei Fang, et al. "Incorporating learnable membrane time constant to enhance learning of spiking neural networks." ICCV 2021.
I thank authors for their thorough reply.
Thank you. We feel glad to have resolved your questions.
In this work, the authors propose a long-range feedback spiking network (LoRaFB-SNet) whose architecture mirrors neuronal and synaptic behavior in the cortical regions of the brain. Furthermore, they propose a Time-Series Representational Similarity Analysis (TSRSA) framework to measure the similarity between model representations and the visual cortical representations of mice. The proposed model exhibits the highest level of representational similarity under this framework, outperforming the current baselines.
Strengths
The paper has a strong motivation to analyse the similarity in the dynamic and static representations of movie-based stimuli on neural models and the actual visual cortical representation of mice. The motivation of using feedback connections is also biologically significant. TSRSA as an analysis tool to understand the similarity between the representation of neural models and visual cortex representation of an actual biological brain also seems very interesting.
Weaknesses
1. The model architecture is not particularly novel, since past work on SNNs has shown that long-range feedback plays an important role in processing visual information [1]. TSRSA seems to be one of the novel contributions of the paper. Its explanation could be made more mathematical, since it is currently more of a textual description, which can be hard to follow.
2. The paper's baselines (mainly ResNets and RNNs) could benefit from further expansion. A more comprehensive analysis of these representations could be particularly insightful when compared with state-of-the-art vision models such as VLMs, Video-LMs, or SSM-based architectures. Moreover, providing deeper insights into how these representations are processed within the actual visual cortex would not only be beneficial but would also enhance understanding within the broader machine learning community.
3. Suggestion: Since this work is directed more at understanding the biological implications of how the visual cortex represents video-based information, and not at leveraging the energy-efficiency aspects of SNNs, it might be more relevant to explore bio-plausible learning mechanisms (STDP-based), more sophisticated neuronal models such as the HH model instead of the simplistic LIF, etc., to develop a more comprehensive analysis.
Reference
[1] Mingqing Xiao, et al. "Training feedback spiking neural networks by implicit differentiation on the equilibrium state." NeurIPS 2021.
Questions
- Is there a reason for not using VLMs, etc., as a baseline? Understanding the attention mechanism, which is relevant from the perspective of movie-data processing, might be significant.
- In the experiments, why were neurons firing less than 0.5 spikes/s excluded?
Limitations
Suggestions are added in the weaknesses section.
We thank the reviewer for the clear and thoughtful comments. We will do our best to address the reviewer's concerns and answer the questions in the following.
1. The novelty of our model architecture.
The work of Xiao et al. focuses on deriving the equilibrium state of a spiking neural network by introducing feedback connections, and in turn deriving the gradients of parameters to train the network without having to consider the forward process. However, their algorithm has only been validated on networks with shallow, stacked layers on small datasets like CIFAR10, and it is not yet known whether the equilibrium state is tractable in deeper and more complex network structures (e.g., with skip connections).
In contrast, we introduce multiple long-range feedback connections into a deeper spiking network and train it with a commonly used algorithm (BPTT with surrogate gradient) on large-scale datasets. This provides a more general structure and training paradigm for exploring feedback spiking networks.
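The surrogate-gradient idea mentioned here can be sketched as follows: the forward pass uses the non-differentiable Heaviside spike function, while the backward pass substitutes a smooth surrogate derivative (a sigmoid-derivative surrogate is shown as one common choice; the exact surrogate function and parameters used in the paper are not specified here):

```python
import numpy as np

def heaviside(v, v_threshold=1.0):
    """Forward pass: non-differentiable spike generation."""
    return (v >= v_threshold).astype(float)

def surrogate_grad(v, v_threshold=1.0, alpha=4.0):
    """Backward pass: replace dS/dv with a smooth sigmoid-derivative
    surrogate (one common choice; alpha controls its sharpness)."""
    s = 1.0 / (1.0 + np.exp(-alpha * (v - v_threshold)))
    return alpha * s * (1.0 - s)

v = np.array([0.0, 0.9, 1.0, 1.1, 2.0])
print(heaviside(v))            # [0. 0. 1. 1. 1.]
print(surrogate_grad(v))       # peaks at the threshold v = 1.0
```

During BPTT, the surrogate derivative is used in place of the true (zero-almost-everywhere) spike derivative, which is what makes gradient-based training of deep spiking networks possible.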
2. The TSRSA.
We thank the reviewer for recognizing TSRSA. As suggested by the reviewer, we provide a more mathematical representation of the method here and will add it in our manuscript.
We define the representation matrix $R = [r_1, r_2, \dots, r_N]$ of a given model layer or a given cortical region, where $r_i$ represents the population response of the units/neurons to movie frame $i$ and $N$ is the number of frames. We first compute the Pearson correlation coefficient between the response to a given movie frame $r_i$ and the responses to its subsequent frames to obtain the representational similarity vector $v_i$. The vector is formulated as:
$$v_i = \left[\rho(r_i, r_{i+1}),\ \rho(r_i, r_{i+2}),\ \dots,\ \rho(r_i, r_N)\right],$$
where $\rho(\cdot, \cdot)$ denotes the Pearson correlation coefficient and $i = 1, \dots, N-1$.
Second, we concatenate all vectors $v_1, \dots, v_{N-1}$ to obtain the complete representational similarity vectors $V_{\text{model}}$ for a network layer and $V_{\text{cortex}}$ for a cortical region. Finally, we compute the Spearman rank correlation coefficient between $V_{\text{model}}$ and $V_{\text{cortex}}$ as the similarity score.
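A compact NumPy sketch of this computation (an illustrative implementation on toy data, not the authors' code; the Spearman step ignores tie handling):

```python
import numpy as np

def similarity_vector(R):
    """Concatenated representational similarity vector: Pearson r between
    each frame's population response and every subsequent frame's."""
    n = R.shape[0]
    C = np.corrcoef(R)                       # n x n frame-by-frame Pearson
    return np.concatenate([C[i, i + 1:] for i in range(n - 1)])

def spearman(a, b):
    """Spearman rank correlation = Pearson on ranks (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
model = rng.standard_normal((50, 128))            # 50 frames x 128 model units
cortex = model @ rng.standard_normal((128, 64))   # toy "cortical" responses
score = spearman(similarity_vector(model), similarity_vector(cortex))
print(round(score, 3))
```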
3. Is there a reason for not using VLMs, etc., as a baseline?
We agree that exploring the information processing mechanisms of VLMs with the real visual cortex may lead to new insights for the machine learning and neuroscience communities. However, our work focuses on constructing biologically plausible deep neural network (DNN) models in terms of the structure and dynamics, to improve representational similarity to real biological neural responses, providing new insights into information processing mechanisms of movie stimuli in the visual cortex. Therefore, the baselines we chose for comparison are all bio-inspired DNNs, which are widely used in studies of neural representational similarity to the visual cortex.
SOTA vision models such as VLMs have not been chosen as baselines in our work for the following reasons.
- The core of our work is to build brain-like models, not to achieve better performance on visual tasks. However, VLMs are designed in a performance-oriented manner, without taking biological plausibility (e.g., of attention mechanisms) into account [1].
- Previous studies have shown that visual task performance is not positively correlated with neural representation similarity. Instead, higher task performance may lead to worse brain-like models [2].
- Transformer-based vision models have shown poor similarity (worse than convolutional networks) to real neural responses in the visual cortex [2, 3]. Similarly, transformer-based language models (LLMs) have been questioned for not truly reflecting the core elements of human language processing [4].
4. More bio-plausible learning mechanisms and more sophisticated neuronal models.
Despite the higher biological plausibility of the suggested learning mechanisms and neuronal models, introducing them into deep spiking networks and training on large datasets suffers from problems such as convergence difficulty and untrainability. Therefore, we have not considered them in our current work.
5. In the experiments, why were neurons firing less than 0.5 spikes/s excluded?
When analyzing neuronal spiking activity, researchers often exclude neurons with firing rates below a certain threshold to focus on neurons that are responsive to visual inputs and to more effectively extract stimulus-related neural representations. An empirical threshold of 0.5 spikes/second is commonly used in many studies [5, 6].
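Such a firing-rate criterion amounts to a simple threshold on mean rates; a minimal sketch (the spike counts and recording duration below are illustrative):

```python
import numpy as np

def filter_neurons(spike_counts, duration_s, min_rate=0.5):
    """Keep the indices of neurons whose mean firing rate over the
    recording is at least min_rate spikes/s (0.5 spikes/s is the
    empirical threshold cited above)."""
    rates = spike_counts / duration_s
    return np.where(rates >= min_rate)[0]

counts = np.array([10, 400, 2, 900, 45])      # total spikes per neuron
kept = filter_neurons(counts, duration_s=120.0)
print(kept)                                    # prints [1 3]
```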
[1] Paria Mehrani & John K. Tsotsos. "Self-attention in vision transformers performs perceptual grouping, not attention." Frontiers in Computer Science 2023.
[2] Drew Linsley, et al. "Performance-optimized deep neural networks are evolving into worse models of inferotemporal visual cortex." NeurIPS 2023.
[3] Liwei Huang, et al. "Deep Spiking Neural Networks with High Representation Similarity Model Visual Pathways of Macaque and Mouse." AAAI 2023.
[4] Ebrahim Feghhi, et al. "What Are Large Language Models Mapping to in the Brain? A Case Against Over-Reliance on Brain Scores." arXiv 2024.
[5] Lucas Pinto, et al. "Fast modulation of visual perception by basal forebrain cholinergic neurons." Nature Neuroscience 2013.
[6] Brice Williams, et al. "Spatial modulation of dark versus bright stimulus responses in the mouse visual system." Current Biology 2021.
Thank you for addressing some of my concerns. Though I agree that STDP-like Hebbian learning rules are difficult to scale, some local learning rules, such as Equilibrium Propagation, have been shown to scale to deeper architectures.
There are usually two main motivations of using spiking architectures, one being bio-plausibility and the other one being energy efficiency. Since the authors main motivation is the former, it would have been great to report some results using a bio-plausible learning rule and/or other encoding mechanisms.
Also, though transformer-like architectures might be less bio-plausible, one can explore a state space modeling (SSM, Mamba) based approach for ANN baselines, since such models can process long temporal sequences linearly without computing explicit attention scores.
We thank the reviewer for the insightful feedback. We will try to provide more experimental evidence and address the reviewer's remaining concerns.
First, although recent studies have made efforts to use equilibrium propagation to train spiking networks, these models are still limited to a few (≤5) linear/convolutional layers and small datasets such as MNIST and FashionMNIST [1-3]. The effectiveness of training on deeper and more complex structures, as well as larger datasets, has yet to be validated. However, we agree that exploring the effects of bio-plausible learning rules and other biological encoding mechanisms may provide insights for brain-like modeling with deep spiking networks. We will discuss this limitation in the revised manuscript and investigate it in the future.
Second, as suggested by the reviewer, we perform the representational similarity experiment for VideoMamba [4]. Due to the time constraints of the discussion phase, we directly load the parameters of the model pre-trained on Kinetics-400 (there is no publicly available model pre-trained on UCF101). Here, we choose VideoMamba-Tiny, which has a number of parameters comparable to our model. As shown in Table R1, our model performs slightly worse than VideoMamba on Movie1 but better on the longer Movie2. The results suggest that our model has a pronounced advantage in capturing brain-like neural representations on the longer movie, consistent with the experiments in the manuscript as well as the results discussed with Reviewer YZqi. Besides, since VideoMamba is pretrained on Kinetics-400 (400 classes with 306k samples), it covers a much wider range of data than our model and the baselines used in our work (pretrained on UCF101: 101 classes with 13.3k samples), which may make the comparison unfair in its favor. Since our work focuses on bio-inspired brain-like models for visual cortex analysis, we will add this experiment as an additional comparison between our model and a SOTA model for video tasks.
| | CORnet (UCF101) | VideoMamba (Kinetics-400) | LoRaFB-SNet (UCF101) |
|---|---|---|---|
| Movie1 | 0.5060 | 0.5234 | 0.5202 |
| Movie2 | 0.2230 | 0.2719 | 0.2827 |
Table R1: TSRSA scores of CORnet, VideoMamba, and LoRaFB-SNet on the two movies.
Importantly, we think that our work makes an important step toward better capturing neural representations in the visual cortex under movie stimuli. By introducing spiking mechanisms and long-range feedback connections, our model offers the computational neuroscience community new insights and a useful deep neural network tool. We sincerely hope the reviewer will re-evaluate the contribution of our work.
[1] Jiaqi Lin, et al. "Scaling SNNs Trained Using Equilibrium Propagation to Convolutional Architectures." arXiv 2024.
[2] Erwann Martin, et al. "Eqspike: spike-driven equilibrium propagation for neuromorphic implementations." Iscience 2021.
[3] Peter O'Connor, et al. "Training a spiking neural network with equilibrium propagation." In The 22nd international conference on artificial intelligence and statistics, 2019.
[4] Kunchang Li, et al. "VideoMamba: State Space Model for Efficient Video Understanding." arXiv 2024.
Thank you for the new insights. I have increased my ratings.
We appreciate your recognition of our work and will include the above discussion and additional results in our revised manuscript.
To better understand visual processing in the brain, this paper presents a spiking neural network with top-down connections. It follows the trend over the past few years of building deep neural network models that approximate brain architecture and match brain and behavioral data. Simply put, the goal is to have a network model that can perform visual tasks like the visual neural systems in the brain and align well with the brain in terms of representation. With such a model, we can pose many questions within the model that we typically ask about the real brain. Thus, performance is measured by how well these networks match brain representations. The model introduced here, LoRaFB-SNet, differs from previous models in two main aspects, as I understand it: it differs from previous DNN models such as CORnet in its spiking units versus traditional DNN units (and the authors of CORnet only looked at static, not dynamic, visual processing); and it differs from SEW-ResNet in its top-down connections, as SEW-ResNet is a purely feedforward model. By integrating the good features of these previous works, this work conducts a solid investigation into the effects of dynamic and static information, which provides insight for neuroscientists into this model.
Strengths
The strength of this work is highlighted by the match of the research question and the approaches the authors took. They first identified the representation related to dynamic visual processing in neuroscience, then drew insights from the literature in neuroscience that the top-down connection is important in this processing in the brain, and then built a model incorporating this feature, as well as considering the spiking mechanisms. The question is indeed an important one in neuroscience and the approaches the authors take and their results show that it is a promising direction. Their thorough knowledge of literature and previous works in both the neuroscience and machine learning community is clear, as they integrate the good and important features from previous works, and make it a comprehensive model, and also their experimental design clearly addresses their research questions.
Weaknesses
The primary concern I have with this paper revolves around the impact of spiking mechanisms within the model. In neural processing, there is a longstanding debate on the relevance of timing versus firing rate for information processing. Recent evidence in neuroscience literature has indeed highlighted the importance of spike timing as a crucial source of dynamic information in visual systems[1]. However, the paper does not sufficiently discuss how spiking mechanisms influence the model's performance or represent a significant improvement over traditional mechanisms. Although spiking mechanisms are a key feature of the model, the discussion and experimental design focusing on these mechanisms are limited. The only mention of these aspects is in the second paragraph of section 4.3.3 and Table 2, which do not clearly demonstrate the importance of spiking mechanisms or how they contribute distinctly to the model's capabilities. This is a crucial gap, as a more detailed exploration here could significantly enhance the novelty of the work compared to previous models like CORnet, which lacks spiking mechanisms but incorporates recurrent connections.
Additionally, the comparisons made in the experiments need a more focused analysis on this point. For instance, in Figure 3, panel A shows that in Movie 1, the performance difference between LoRaFB-SNet and CORnet is minimal, despite the latter lacking spiking mechanisms. Panels B and C do not include comparisons with CORnet, and no explanation is provided for Panel D in the main text or the figure descriptions, which might be an oversight. Figure 4 also omits CORnet in the comparisons. A more thorough comparison with CORnet is vital since it too has recurrent connections but does not incorporate spiking mechanisms. Beyond comparisons with CORnet, more effort should be directed towards differentiating LoRaFB-SNet from its non-spiking version, to better illustrate why spiking is necessary and how it significantly impacts both the model and neural systems.
Another minor point is the paper's focus on region-to-region top-down feedback while seemingly neglecting the potential impact of local recurrent connections, which could also be significant as suggested by recent literature[2].
[1] Quintana, Daniel, Hayley Bounds, Julia Veit, and Hillel Adesnik. "Balanced bidirectional optogenetics reveals the causal impact of cortical temporal dynamics in sensory perception." bioRxiv (2024).
[2] Oldenburg, Ian Antón, William D. Hendricks, Gregory Handy, Kiarash Shamardani, Hayley A. Bounds, Brent Doiron, and Hillel Adesnik. "The logic of recurrent circuits in the primary visual cortex." Nature neuroscience 27, no. 1 (2024): 137-147.
Questions
1. Importance of Spiking Mechanisms: The use of spiking mechanisms is a central feature of your model. However, the paper lacks a detailed discussion of how these mechanisms enhance the model's performance compared to non-spiking models like CORnet. Could you elaborate on why spiking mechanisms are critical for your model? How do they improve the representation of dynamic and static visual information? Clarification on this could significantly affect the evaluation of your model's novelty and effectiveness.
2. Data and Model Robustness: The results presented utilize data from only two movies provided by the Allen Institute, with one movie showing no significant difference in performance and the other showing very low similarity scores. Can you discuss the expected robustness of your model's performance across additional movies, animals, or visual regions? How can you justify the robustness and significance of your results with such a limited dataset? Would additional data be necessary to strengthen your conclusions, or do you believe the current results are sufficiently convincing?
3. Dependency on Pretraining Tasks: How dependent are the model's outcomes on the specific pretraining tasks used? Given the unique properties of spiking networks in processing temporal sequences, does the choice of pretraining data significantly influence the results? Understanding this could help in assessing the model's generalizability and applicability to other datasets or tasks.
Limitations
As mentioned in the Weaknesses and Questions sections, this review identifies two primary areas of concern:
1. There is insufficient discussion about how spiking mechanisms enhance the model's performance compared to non-spiking models. Addressing this could significantly clarify the unique contributions of your approach.
2. The paper does not thoroughly validate the robustness and significance of the results across various datasets, which is crucial for substantiating the model's applicability and effectiveness.
Additionally, some minor areas that could improve the paper include:
- Impact of the Pretraining Task: The influence of the pretraining task on the model's performance is not clearly articulated. Clarifying this could help in understanding the adaptability of the model under different conditions.
- Exploration of Local Recurrent Connections: There is potential value in exploring the impact of local recurrent connections, which could provide deeper insights into the comprehensive functionalities of the visual cortex.
Addressing these points would significantly enhance the quality of the paper.
We thank the reviewer for the notable and perceptive comments. We will do our best to address the comments and provide detailed responses point by point, below.
1. Importance of Spiking Mechanisms.
In our work, we design a deep spiking network based on rate coding and pretrain it on large-scale datasets. The training algorithms of temporal coding spiking networks are still not mature [1]. Given that our work is centered around deep neural network models, the rate coding spiking network is a better choice.
To better demonstrate the importance of spiking mechanisms, we clarify our existing results (the comparisons with CORnet in Fig. 3A, 3C, and 3D) and perform more experiments with the non-spiking version of LoRaFB-SNet (termed LoRaFB-CNet), as suggested by the reviewers. The results (see details below) provide strong evidence that spiking mechanisms play a crucial role, allowing our model to better capture neural representations of the visual cortex. Since our spiking network is based on rate coding, we think the following two properties may contribute. First, our network encodes and transfers information exclusively in the form of spikes, just like the brain. Second, the membrane potential dynamics of spiking neurons help to process dynamic information, which complements the long-range feedback connections well.
The clarifications and new results about spiking mechanisms:
- LoRaFB-SNet outperforms CORnet and LoRaFB-CNet on both movies (Movie1: 0.5202, 0.5060, 0.4975; Movie2: 0.2827, 0.2230, 0.2619).
- LoRaFB-SNet consistently outperforms CORnet across different lengths of movie clips (Fig. 3C) and shows an increasing improvement ratio over CORnet (Fig. 3D). These results suggest our model performs significantly better on longer movie clips, echoing the results in Fig. 3A. Since longer movie stimuli increase the diversity of population neuronal response patterns in the visual cortex [2, 3], it is more difficult for models to capture brain-like representations, and our model shows a more pronounced advantage in this case. See the discussion in lines 234-243.
- We added experiments for LoRaFB-CNet on Movie2 in the setting of Figure 3C. As shown in Table R1, the results support the above conclusions. Moreover, LoRaFB-CNet outperforms CORnet for all clip lengths, suggesting that our model structure alone also benefits information processing in long movies.
| Model | 30s | 50s | 70s | 80s | 90s | 100s | 110s | 120s |
|---|---|---|---|---|---|---|---|---|
| LoRaFB-SNet | 0.405 | 0.353 | 0.319 | 0.284 | 0.302 | 0.299 | 0.295 | 0.283 |
| LoRaFB-CNet | 0.384 | 0.332 | 0.297 | 0.249 | 0.270 | 0.278 | 0.277 | 0.262 |
| CORnet | 0.328 | 0.275 | 0.234 | 0.192 | 0.194 | 0.202 | 0.218 | 0.223 |
Table R1: TSRSA scores with different movie clip lengths on Movie2. The standard error is omitted.
- For the experiments with static natural scene stimuli in Fig. 4C, we add the results of CORnet and LoRaFB-CNet (0.4130, 0.3544, 0.3411), showing that our model also outperforms them on this neural dataset.
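For intuition about what the TSRSA scores above measure, here is a minimal, generic sketch of a representational-similarity comparison between time-series responses. This is an illustration only: it assumes a Pearson correlation between the upper triangles of time-by-time similarity matrices, while the paper's actual TSRSA implementation may differ in normalization and correlation type.

```python
import numpy as np

def similarity_matrix(resp):
    """Time-by-time similarity: correlation between population
    response vectors at every pair of time points.
    resp: (T, N) array of T time points x N units/neurons."""
    return np.corrcoef(resp)  # (T, T)

def tsrsa_score(model_resp, neural_resp):
    """Correlate the upper triangles of the two similarity matrices.
    (Pearson here for simplicity; the paper may use a rank correlation.)"""
    T = model_resp.shape[0]
    iu = np.triu_indices(T, k=1)
    m = similarity_matrix(model_resp)[iu]
    n = similarity_matrix(neural_resp)[iu]
    return float(np.corrcoef(m, n)[0, 1])

# Toy example: a "model" that is a random linear readout of the neural data
rng = np.random.default_rng(0)
neural = rng.standard_normal((100, 50))          # 100 frames, 50 neurons
model = neural @ rng.standard_normal((50, 64))   # 64 model units
print(tsrsa_score(model, neural))
```

Because the toy model is a linear mixture of the neural responses, the two time-by-time similarity matrices share structure and the score is positive, while unrelated responses would score near zero.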
The clarifications and new results about feedback connections:
- Fig. 4A compares the changes in similarity scores between the network with feedback and a fully feedforward structure when dynamic information is disrupted, supporting the view that long-range feedback connections allow LoRaFB-SNet to represent dynamic information in a more context-dependent manner. See the discussion in lines 272-275. We add experiments for CORnet and LoRaFB-CNet in Fig. R1 in the PDF of the global rebuttal, which solidifies our conclusion about the effectiveness of feedback connections.
- Fig. 4B does not involve comparisons of model structures or mechanisms, so we do not include CORnet and LoRaFB-CNet there.
We will add the above results and discussion to our revised manuscript.
2. Data and Model Robustness.
As mentioned above, model similarity scores decrease as movie length increases, and LoRaFB-SNet shows a more pronounced advantage in longer-movie experiments. The significant improvement in our model's score over other models, coupled with the overall lower scores of other models on Movie2, underscores the robust advantage of our model across different movies. In fact, when we use randomly selected 30 s clips (the length of Movie1) from the 120 s Movie2, LoRaFB-SNet achieves a score of 0.405 (about 80% of its score on Movie1). The scores of our model in six visual regions also indicate robustness across brain regions (please refer to response 4 to Reviewer rzHc).
In summary, experiments and analysis in a variety of settings effectively demonstrate the performance and robustness of our model. In the future, despite the paucity of publicly available neural datasets under natural movie stimuli, we will try to apply our model to more datasets to solidify our conclusions.
3. Dependency on Pretraining Tasks.
We have preliminarily discussed the influence of pretraining datasets and tasks (Fig. 3B and the first paragraph of Section 4.3.3). In particular, the video recognition task benefits the model's neural similarity most, and the temporal structure of the data contributes more than its static content. Larger datasets and other video tasks may have different impacts, which we will explore in future work.
4. Exploration of Local Recurrent Connections.
We agree that lateral connections within brain regions are useful for brain-like modeling. However, our work focuses on feedback connections across regions to build bio-inspired models. We will explore local recurrent connections in future work.
[1] "Temporal-coded spiking neural networks with dynamic firing threshold: Learning with event-driven backpropagation." ICCV 2023.
[2] "Rapid learning in cortical coding of visual scenes." Nature Neuroscience 2007.
[3] "Representation of visual scenes by local neuronal populations in layer 2/3 of mouse visual cortex." Front Neural Circuits 2011.
Thanks for the authors' rebuttal, and thank you for addressing my questions, concerns, and comments.
- Importance of Spiking Mechanisms.
Here, I am convinced that, for the LoRaFB model architecture and the dataset you are using, the spiking mechanism is beneficial for the TSRSA measurement. This improves my view of the paper. However, I still have two important concerns or critiques that leave room for a higher score for this work.
The first is: why do the spiking mechanisms make the model better here? This is indeed a hard question, which might be difficult to address in this paper. An alternative question concerns the generality of the spiking mechanisms: do spiking mechanisms always make a model better at capturing brain representations? This is also a big question that might be hard to answer here. What I see in this paper is that LoRaFB-SNet is consistently about 2 points higher than LoRaFB-CNet, across both videos and all video lengths. This is an interesting phenomenon, but how is the gap between LoRaFB-SNet and LoRaFB-CNet affected by the length of the video and the baseline score? And could spiking mechanisms also give other models like CORnet a higher TSRSA score? These smaller questions don't settle the bigger questions I mentioned, but they would at least give us a better idea and make me more convinced that the spiking mechanisms are truly important here.
The second concern: I fully understand that the authors' model performs significantly better on longer movie clips. In the longer clips, even without spiking mechanisms, LoRaFB-CNet alone outperforms CORnet significantly. But the TSRSA scores are very low for longer videos (Movie2: 90 s, 100 s, 110 s, and 120 s). How could neuroscientists use a model with such low ability to capture the representation to interpret anything or perform other customized analyses for their research? I am not saying the low score is unacceptable, but because TSRSA is a novel measurement, I don't understand its meaning for neuroscience use cases. I just need insights from the authors: since the improvement is from 0.223 to 0.283, how good is 0.283, and how could neuroscientists potentially use the model? If both 0.223 and 0.283 are unusable, then such a great improvement might point the community toward further improving it? I hope the authors can explain this. Also, for Movie1, LoRaFB-CNet performs similarly to CORnet, which, as the authors have explained, benefits from spiking in longer videos. But why, in the case of Movie2-30 s, does LoRaFB-CNet show a big difference from CORnet? The roles of the recurrent connections and spiking mechanisms here seem unclear to me, which preserves my concern about robustness.
- Robustness
As I said at the end of the last section, the robustness here still does not seem strong to me after the authors' rebuttal. Two videos (only one of which is 120 s) is a really small dataset, and it is hard to prove strong claims from it. For example, Table R1 is a very good result showcasing the effects of video length and the spiking mechanisms, but it uses only one video, without a second for further confirmation. Because of the limits of the Allen Institute data and the difficulty of finding public datasets, I understand the authors' difficulties. Is there another analysis the authors can perform to provide stronger evidence for the claims? Or could the authors share some concrete future plans for improving the robustness of the work, e.g., the number of mice or the number of movies? And since the authors are interested in the visual cortex, monkey data are actually better than mouse data; is there some monkey data the authors plan to use?
3 & 4
Thanks for addressing these two points. They are not the main factor here in this paper, so they do not play an important role in my judgment of the paper.
We thank the reviewer for recognizing our work and providing constructive feedback. We will try to address the reviewer's remaining concerns.
1. The low TSRSA score for long movies.
First, we obtain the neural ceiling by randomly dividing the neural data into two halves and computing the similarity between them [1, 2]. The results have been reported in Table 1 of our manuscript, and we show them here with LoRaFB-CNet added (Table R1). Our model achieves 63.3% and 45.9% of the ceilings, which are comparable to the levels reported in related work [2, 3]. Therefore, although the absolute scores appear low for Movie2, the ratios to the neural ceilings suggest our model effectively captures neural representations of the brain and is meaningfully closer to the mouse visual cortex than other models (i.e., from 36.2% to 45.9%).
| | Neural Ceiling | CORnet | LoRaFB-CNet | LoRaFB-SNet |
|---|---|---|---|---|
| Movie1 | 0.821±0.006 | 0.506 (61.6%) | 0.498 (60.6%) | 0.520 (63.3%) |
| Movie2 | 0.617±0.009 | 0.223 (36.2%) | 0.262 (42.5%) | 0.283 (45.9%) |
Table R1: The neural ceiling of TSRSA score and the ratios to the ceilings.
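The split-half ceiling procedure described above can be sketched as follows. This is a minimal illustration under assumed details: the number of splits, the similarity function (a simple correlation of time-by-time similarity matrices here), and any correction may differ from the authors' exact protocol.

```python
import numpy as np

def rsm_upper(resp):
    """Upper triangle of the time-by-time similarity matrix.
    resp: (T, N) array of T time points x N neurons."""
    T = resp.shape[0]
    iu = np.triu_indices(T, k=1)
    return np.corrcoef(resp)[iu]

def split_half_ceiling(neural_resp, n_splits=20, seed=0):
    """Randomly split neurons into two halves and compute the
    representational similarity between the halves; the mean over
    splits serves as an estimate of the neural ceiling."""
    rng = np.random.default_rng(seed)
    T, N = neural_resp.shape
    scores = []
    for _ in range(n_splits):
        perm = rng.permutation(N)
        a, b = perm[: N // 2], perm[N // 2:]
        s = np.corrcoef(rsm_upper(neural_resp[:, a]),
                        rsm_upper(neural_resp[:, b]))[0, 1]
        scores.append(s)
    return float(np.mean(scores)), float(np.std(scores) / np.sqrt(n_splits))

# Toy neural data: a shared stimulus-driven component plus private noise
rng = np.random.default_rng(1)
signal = rng.standard_normal((80, 1))                    # shared drive
neural = signal @ rng.standard_normal((1, 60)) \
         + 0.5 * rng.standard_normal((80, 60))           # 80 frames, 60 neurons
ceiling, sem = split_half_ceiling(neural)
print(ceiling, sem)
```

The ceiling is below 1 whenever neurons carry private noise, which is why model scores are best interpreted as ratios to the ceiling, as in Table R1.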
In addition, we use a regression-based method widely used [1] to measure the predictability of our model's representations to individual neurons (Appendix E), and we supplement the results of LoRaFB-CNet here (Table R2). As the results show, our model outperforms other models on this metric, while the absolute value is also at a low level.
| | CORnet | LoRaFB-CNet | LoRaFB-SNet |
|---|---|---|---|
| Movie1 | 0.4326 | 0.4252 | 0.4335 |
| Movie2 | 0.1790 | 0.1774 | 0.1836 |
Table R2: The scores of linear regression.
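The regression-based predictivity in Table R2 can be sketched roughly as below. This is a hedged illustration using closed-form ridge regression and a held-out Pearson correlation per neuron; the actual pipeline in Appendix E (e.g., regression type, cross-validation folds, and scoring metric) may differ.

```python
import numpy as np

def ridge_predictivity(features, neural, alpha=1.0, train_frac=0.8):
    """Fit a ridge regression from model features to all neurons at once,
    then score held-out time points with Pearson r, averaged over neurons.
    features: (T, D) model activations; neural: (T, K) neuron responses."""
    T, d = features.shape
    n_train = int(T * train_frac)
    Xtr, Xte = features[:n_train], features[n_train:]
    Ytr, Yte = neural[:n_train], neural[n_train:]
    # Closed-form ridge: W = (X^T X + alpha * I)^-1 X^T Y
    W = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(d), Xtr.T @ Ytr)
    pred = Xte @ W
    rs = [np.corrcoef(pred[:, i], Yte[:, i])[0, 1] for i in range(Yte.shape[1])]
    return float(np.mean(rs))

# Toy example: neurons are noisy linear functions of the model features
rng = np.random.default_rng(2)
feats = rng.standard_normal((200, 30))                    # 200 frames, 30 features
neural = feats @ rng.standard_normal((30, 10)) \
         + 0.3 * rng.standard_normal((200, 10))           # 10 neurons
print(ridge_predictivity(feats, neural))
```

As with TSRSA, a low absolute predictivity can reflect trial-to-trial neural variability rather than a poor model, which is why the comparison across models is the informative part of Table R2.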
In conclusion, the overall low scores (including the neural ceilings) may mainly stem from the diversity and variability of neural representations under long movies. Although there is still room for improvement, we believe that our work takes an important step forward in capturing brain-like representations under movie stimuli and contributes to progress in the neuroscience community.
2. The big difference in the case of Movie2-30s.
Since the mice were shown the entire movie, not isolated clips, during neural response recording, the neural representations of both the mice and the model in the experiments with different clip lengths are derived from the full movie input. In other words, the neural representation for a given movie clip is extracted from the complete representation rather than obtained by inputting the clip into the network on its own. As a result, even though the clip selected from Movie2 has the same length as Movie1 (30 s), its neural representation is still influenced by the entire long movie. Therefore, the performance difference in this case may be due to our model's advantage on the longer movie. Importantly, the improvements show that our long-range feedback connections (from CORnet to LoRaFB-CNet) and spiking mechanisms (from LoRaFB-CNet to LoRaFB-SNet) both contribute to the results.
3. About spiking mechanisms.
As mentioned above, neural representations for movie clips are extracted under the entire movie stimulus, which influences the experiments for all clip lengths. This may explain why LoRaFB-SNet is stably better than LoRaFB-CNet by 0.02-0.03. Moreover, using LoRaFB-CNet as the baseline, we report how the gap-to-baseline ratio changes with movie length (Table R3). The ratio first increases, reaching its maximum at 80 s; it then decreases, but remains higher for longer clips (100 s, 110 s, 120 s) than for shorter ones (30 s, 50 s). For the spiking version of CORnet, we will report the results later, as the pretraining takes more time.
| | 30s | 50s | 70s | 80s | 90s | 100s | 110s | 120s |
|---|---|---|---|---|---|---|---|---|
| (LoRaFB-SNet - LoRaFB-CNet) / LoRaFB-CNet | 5.5% | 6.3% | 7.4% | 14.1% | 11.9% | 7.6% | 6.5% | 8.0% |
Table R3: The ratio between the gap and the baseline.
Although how spiking mechanisms lead the model to achieve better similarity requires further exploration, we think the results of our work show the potential of spiking mechanisms in computational modeling of the visual cortex with deep neural networks.
4. About robustness.
We agree that the size of the dataset is a limitation. We hope that the above, especially the comparison of our scores with the neural ceiling for the long movie, goes some way toward addressing the concerns about our model's robustness. Furthermore, as suggested by the reviewer, in the future we will try to validate our model on more datasets, e.g., a chronic two-photon imaging dataset of mouse V1 in response to movies over sessions spanning weeks [4] and an electrophysiological dataset of macaque IT in response to movies lasting 5 minutes [5].
[1] Martin Schrimpf, et al. "Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?" bioRxiv 2018.
[2] Jianghong Shi, et al. "MouseNet: A biologically constrained convolutional neural network model for the mouse visual cortex." PLOS Computational Biology 2022.
[3] Aran Nayebi, et al. "Mouse visual cortex as a limited resource system that self-learns an ecologically-general representation." PLOS Computational Biology 2023.
[4] Tyler D. Marks, and Michael J. Goard. "Stimulus-dependent representational drift in primary visual cortex." Nature communications 2021.
[5] Brian E. Russ, et al. "Temporal continuity shapes visual responses of macaque face patch neurons." Neuron 2023.
As suggested by the reviewer, to further explore the effect of spiking mechanisms, we build a spiking version of CORnet and pretrain it on UCF101 with the same hyperparameters as CORnet.
We test the representational similarity on the two movies and perform the experiments with movie clips of different lengths on Movie2. As shown in Tables R4 and R5, the spiking version of CORnet yields a lower score on Movie1 but outperforms CORnet on Movie2 for most clip lengths. These results suggest that adding spiking neurons to deep neural networks without considering model structure does not consistently lead to better performance in brain-like modeling, as has been similarly discussed for visual task performance in the machine learning community [6]. In contrast, the recurrent module designed in our work effectively incorporates spiking neurons and exploits the potential of spiking mechanisms, enabling our model to achieve the best representational similarity to the visual cortex.
| Model | Movie1 | Movie2 |
|---|---|---|
| CORnet | 0.5060 | 0.2230 |
| Spiking CORnet | 0.4826 | 0.2326 |
Table R4: TSRSA scores for entire movies on Movie1 and Movie2.
| Model | 30s | 50s | 70s | 80s | 90s | 100s | 110s | 120s |
|---|---|---|---|---|---|---|---|---|
| CORnet | 0.328 | 0.275 | 0.234 | 0.192 | 0.194 | 0.202 | 0.218 | 0.223 |
| Spiking CORnet | 0.344 | 0.292 | 0.246 | 0.199 | 0.196 | 0.200 | 0.221 | 0.233 |
Table R5: TSRSA scores with different movie clip lengths on Movie2.
[6] Wei Fang, et al. "Deep residual learning in spiking neural networks." NeurIPS 2021.
Thanks for the quick and thorough response. I appreciate the extra details, evidence, and explanations you’ve added. They address my questions and concerns better than before. Please make sure to incorporate these improvements into the revised paper. Based on this progress, I’m going to increase my score for your paper.
Thank you for your prompt and active response. We are pleased to have addressed your concerns, and your approval of our work is very encouraging. The clarifications and improvements will be incorporated into our revised manuscript.
We sincerely thank all reviewers for their valuable time and their thoughtful and constructive comments. We have done our best to answer the questions raised in each individual rebuttal.
Since some reviewers are concerned about the importance of the spiking mechanism of our model, we clarify our existing results and add some new experiments (please refer to response 1 to Reviewer YZqi for details). As a result, we demonstrate the validity of the spiking mechanism and discuss its properties that benefit our model:
- Our network encodes and transfers information exclusively in the form of spikes, just like the brain.
- The membrane potential dynamics of spiking neurons help to process dynamic information, which complements the long-range feedback connections well.
This work proposes the long-range feedback spiking network (LoRaFB-SNet), which mimics top-down connections between cortical regions and incorporates the spike-based information processing mechanisms inherent to biological neurons. Furthermore, the authors use Time-Series Representational Similarity Analysis (TSRSA) to measure the similarity between model representations and visual cortical representations of mice. These analyses confirm the critical role of long-range feedback in representing dynamic and static visual information. All reviewers appreciated the novelty of this work, especially the critical role of the spiking format. I am inclined to accept this paper.