PaperHub
Overall score: 7.8/10
Poster · 4 reviewers
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.0 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

Know Thyself by Knowing Others: Learning Neuron Identity from Population Context

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We present NuCLR, a self-supervised framework that learns high-quality, population-aware neuron-level embeddings directly from spike train data using a spatio-temporal transformer and tailored contrastive loss.

Abstract

Keywords
neural identity, cell type identification, brain region identification, computational neuroscience, systems neuroscience, self-supervised learning, contrastive learning

Reviews & Discussion

Review (Rating: 5)

The paper proposes NuCLR, a contrastive self-supervised learning method to obtain time-invariant functional neuronal embeddings. It is followed by an MLP to generate class labels, and the authors test several regimes (Transductive, Transductive zero-shot, and Inductive zero-shot) to emulate whether different cell types were present during NuCLR pretraining or during the classification MLP finetuning. The paper also presents method validation on several datasets coming from different modalities (electrophysiology, calcium data).

Strengths and Weaknesses

Strengths:

  • New contrastive architecture, which seems to better distinguish cell types across several datasets coming from different modalities (electrophysiology, calcium data)
  • Statistically significant improvements; the model standard deviations across seeds are estimated
  • The method is carefully designed (e.g., a decoupled contrastive loss to avoid opposing cells from different mice) and clearly explained
  • Three regimes of model application (Transductive, Transductive zero-shot, and Inductive zero-shot) are tested
  • The ablation to check the impact of cross-neuron interaction is done.

Weaknesses

  • The datasets used are relatively small, and very few cell types are analyzed (max 5 for Bugeon et al.). For brain areas, 10 areas were analyzed, but distinguishing brain areas is supposed to be much easier than distinguishing cell types. Suggestion: the Allen Observatory dataset from [1] with calcium imaging, containing 100,000 neurons in 6 areas and 13 cell types.
  • The baseline of classic clustering (kNN, Leiden) on the learned embeddings is missing (as in NeuPRINT Table 2). Removing the MLP might remove the need for labels altogether.
  • Some literature links and comparisons are missing. For instance, POYO+ [2] also did cell-type classification, NEDS [3] did brain-region classification on the IBL repeated site dataset (Fig 4), Wang et al. [4] predicted cell types and brain regions from time-invariant embeddings as well (Fig 5 d, e), and Weis et al. [5] used contrastive learning for neuronal morphologies (section 4 "Contrastive methods in neuroscience"). Core-readout models like [4], derived from [6], are actively creating optimal stimuli to derive cell types in different species (marmoset, mice, monkeys) and visual areas (retina, brain) [7-11] (section 4 "Cell type classification").
  • Only a single species and brain area is considered in the work, while the mouse retina data from [9] is publicly available.
  • The statement in lines 318-319, *In contrast, our method introduces the first spatio-temporal transformer designed and trained for neuron-level embeddings*, is technically incorrect: the authors missed STNDT [12], another transformer, which accounted for cross-neuronal interactions. Its embeddings could be used for further cell-type classification as well, though they are not time invariant.
  • While I provided a lot of references above, I think this is proof that the authors should redo the literature review more carefully and integrate its results, while also being more careful with the statements.

[1] de Vries, S. E., Lecoq, J. A., Buice, M. A., Groblewski, P. A., Ocker, G. K., Oliver, M., ... & Koch, C. (2020). A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature neuroscience, 23(1), 138-151.
[2] Azabou, M., Pan, K. X., Arora, V., Knight, I. J., Dyer, E. L., & Richards, B. A. Multi-session, multi-task neural decoding from distinct cell-types and brain regions. In The Thirteenth International Conference on Learning Representations.
[3] Zhang, Y., Wang, Y., Azabou, M., Andre, A., Wang, Z., Lyu, H., ... & Hurwitz, C. (2025). Neural encoding and decoding at scale. arXiv preprint arXiv:2504.08201.
[4] Wang, EY, Fahey, PG, Ding, Z., Papadopoulos, S., Ponder, K., Weis, MA, ... & Tolias, AS (2025). Foundation model of neural activity predicts response to new stimulus types. Nature , 640 (8058), 470-477.
[5] Weis, M. A., Hansel, L., Lüddecke, T., & Ecker, A. S. (2021). Self-supervised graph representation learning for neuronal morphologies. arXiv preprint arXiv:2112.12482.
[6] Klindt, D. A., Ecker, A. S., Euler, T., & Bethge, M. (2017). Neural system identification for large populations separating "what" and "where". Advances in Neural Information Processing Systems.
[7] Burg, M. F., Zenkel, T., Vystrčilová, M., Oesterle, J., Höfling, L., Willeke, K. F., ... & Ecker, A. S. (2024). Most discriminative stimuli for functional cell type clustering. ArXiv, arXiv-2401.
[8] Willeke, K. F., Restivo, K., Franke, K., Nix, A. F., Cadena, S. A., Shinn, T., ... & Tolias, A. S. (2023). Deep learning-driven characterization of single cell tuning in primate visual area V4 unveils topological organization. bioRxiv, 2023-05.
[9] Höfling, Larissa, et al. (2024). A chromatic feature detector in the retina signals visual context changes. eLife, 13, e86860.
[10] Tong, R., da Silva, R., Lin, D., Ghosh, A., Wilsenach, J., Cianfarano, E., ... & Trenholm, S. (2023). The feature landscape of visual cortex. bioRxiv, 2023-11.
[11] Ustyuzhaninov, I., Burg, M. F., Cadena, S. A., Fu, J., Muhammad, T., Ponder, K., ... & Ecker, A. S. (2022). Digital twin reveals combinatorial code of non-linear computations in the mouse primary visual cortex. bioRxiv, 2022-02.
[12] Le, T., & Shlizerman, E. (2022). Stndt: Modeling neural population activity with spatiotemporal transformers. Advances in Neural Information Processing Systems, 35, 17926-17939.

Questions

Major:

  • Could you train your model on the Allen Observatory dataset [1] and compare region and cell-type classification results with POYO+ [2], specifically the Fig 4 B and F panels?
  • Could you train your model on the IBL repeated site dataset and make a comparison with NEDS [3] (Fig 4)?
  • While you train an MLP, what would happen in the case of classic clustering (e.g., kNN or Leiden clustering) on top of the neuronal time-invariant embeddings? This might help to completely remove the need for labels in the dataset.
  • For the ablation to check the impact of cross-neuron interaction, have you replaced the removed spatial attention layers with temporal attention layers to make sure that the performance drop is not caused by just reducing the depth of the model? If not, could you please repeat the ablation in the suggested manner?
  • In lines 319-321 you say *Unlike NeuPrint and NeurPIR, which are limited to static statistics or session-tuned encoders, our approach supports dynamic, context-aware, and transfer-capable inference of neuronal identity*. Could you please elaborate on where NeuPRINT depends on static statistics or session-tuned encoders?

Minor:

  • Could you provide confusion matrices for the datasets and predictions made?
  • How would NuCLR perform if we go one level down the Bugeon hierarchy? (e.g., 11 classes at the type level)
  • In Table 3, are the standard deviations also reported across seeds?
  • What exactly is "multiple seeds" for the standard deviations in the tables? Two? Three? More?


Limitations

yes

Final Justification

The authors have addressed my concerns and provided additional experiments during the rebuttal period; hence, I updated my score to be more positive.

Formatting Issues

Some references are duplicated (e.g., 22 and 23; 30 and 32).

Author Response

Thank you for your time and thoughtful review! We appreciate that you highlight that the “method is carefully designed” and “clearly explained”. We provide point-by-point responses to your comments below.

**Q1: Could you train your model on the Allen Observatory dataset [1] and compare region and cell-type classification results with POYO+ [2], specifically the Fig 4 B and F panels?**

Thanks for your question and the chance to clarify this important point. The POYO+ architecture is aimed at behavior decoding, and, thus by design, it compresses the population activity into a smaller set of latent tokens to learn population-level dynamics, losing the identity of the individual neurons in the process. This is very different from the goal and architecture of NuCLR, which is specifically designed to preserve neuron-level tokens throughout the model and output neuron-level embeddings.

While it is true that POYO+ was able to perform cell-type classification, this was because, in the calcium recordings studied, each recording is entirely from a specific Cre line and thus reveals only cells from a specific cell class. Due to this dataset-level correspondence to cell type, it was possible to decode the Cre line from the encoder's latent outputs for each recording and thereby classify the cell type of the neurons included in that recording (not from unit embeddings). This is a special case that does not generalize to the electrophysiology datasets considered in our work, nor to the Bugeon et al. spatial transcriptomics dataset, which contains simultaneous recordings from multiple cell types. We will make this point clearer in the related work of the revised paper.
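
For illustration, here is a minimal, hypothetical sketch of the Perceiver-style latent bottleneck we describe (our paraphrase, not the POYO+ code; all names and sizes are placeholders): a small set of learned latent queries cross-attends to the neuron tokens, so the encoder output has one row per latent rather than one per neuron.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Illustrative only: learned latents cross-attend to neuron tokens."""
    def __init__(self, dim: int = 128, n_latents: int = 32, heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, neuron_tokens: torch.Tensor) -> torch.Tensor:
        # neuron_tokens: (batch, n_neurons, dim) -> output: (batch, n_latents, dim)
        q = self.latents.unsqueeze(0).expand(neuron_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, neuron_tokens, neuron_tokens)
        return out  # population-level latents; per-neuron rows are gone
```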

**Q2: Could you train your model on the IBL repeated site dataset and make a comparison with NEDS [3] (Fig 4)?**

Following your suggestion, we have trained NuCLR on the IBL repeated sites dataset and report the performance observed in the table below. The results are for classifying the following regions: PO, LP, DG, CA1, VISa. These performances are for the transductive evaluation criteria (similar to the evaluation of NEDS) and standard deviations are measured over 3 seeds.

| | Accuracy (%, unbalanced) | Macro F1-score |
| --- | --- | --- |
| IBL Repeated Sites | 87.01% ± 0.61% | 0.8449 ± 0.0032 |

NEDS reports an unbalanced accuracy of 83.0% on this task on a single split. While the results are not directly comparable because we couldn’t gain access to the specific splits reported in NEDS, our results suggest that NuCLR outperforms NEDS, even though NuCLR relies on learning only from the spike trains, while NEDS uses behavior to train the model.

We will make sure to include this result in the appendix of the camera-ready version of our paper, and will include the splits in the released code-base so that comparisons could be made on this dataset by future works.

**Q3: While you train an MLP, what would happen in the case of classic clustering (e.g., kNN or Leiden clustering) on top of the neuronal time-invariant embeddings? This might help to completely remove the need for labels in the dataset.**

Thanks for your suggestion. We want to clarify that there is a single linear layer used to read out the cell-type or brain-region information; it thus measures how well the pretrained model has captured relationships across cells that are biologically meaningful and interpretable.

We agree that clustering is an interesting line for future work, and imagine that, through a harder constraint on collapse or a small number of labeled cell types used as supervised labels for contrastive sampling, our approach could be adapted to this setting.

**Q4: For the ablation to check the impact of cross-neuron interaction, have you replaced the removed spatial attention layers with temporal attention layers to make sure that the performance drop is not caused by just reducing the depth of the model? If not, could you please repeat the ablation in the suggested manner?**

Yes, in the ablation for the spatial-temporal attention layers, we replace those layers with an equal number of temporal layers. In other words, the depth and parameter count of the model is the same in both cases, making it a fair comparison. We will update the text near lines 259-260 to indicate this more clearly.

**Q5: In lines 319-321 you say *Unlike NeuPrint and NeurPIR, which are limited to static statistics or session-tuned encoders, our approach supports dynamic, context-aware, and transfer-capable inference of neuronal identity*. Could you please elaborate on where NeuPRINT depends on static statistics or session-tuned encoders?**

We apologize for the confusion here. We would revise that line in text as:
> Unlike NeuPrint and NeurPIR, which are limited to simple low-dimensional statistics or session-tuned encoders, our approach supports dynamic, context-aware, and transfer-capable inference of neuronal identity.

We meant to say that NeuPRINT relies on simple and low-dimensional statistics of the population (the across-neuron mean and standard deviation of the population activity). The "session-tuned encoders" phrase was meant only for NeurPIR. Specifically, both NeuPRINT and NeurPIR incorporate population activity using (1) the population-wise mean and standard deviation of all the neurons in the recording, and (2) the population mean and standard deviation of the M neurons closest to the target neuron. It is this low-dimensional nature of the incorporated population activity that we are trying to indicate here.
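
For illustration, a sketch of the kind of low-dimensional population context described above (our reading, not the original implementations; defining the "closest" neurons by activity correlation is a placeholder assumption):

```python
import numpy as np

def population_context(activity: np.ndarray, target: int, M: int = 10):
    """activity: (n_neurons, n_timebins); returns four (n_timebins,) series."""
    pop_mean = activity.mean(axis=0)   # (1) population-wise mean
    pop_std = activity.std(axis=0)     #     and standard deviation
    # (2) the same statistics over the M neurons "closest" to the target;
    # closeness here is activity correlation, purely as a placeholder.
    corr = np.corrcoef(activity)[target]
    corr[target] = -np.inf             # exclude the target neuron itself
    neighbors = np.argsort(corr)[-M:]
    return (pop_mean, pop_std,
            activity[neighbors].mean(axis=0), activity[neighbors].std(axis=0))
```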

**Q6: Could you provide confusion matrices for the datasets and predictions made?**

Thank you for pointing this out! Unfortunately, we cannot include confusion matrices in our rebuttal, as NeurIPS does not allow posting images. However, we do have the confusion matrices for all datasets and will make sure to include them in the camera-ready version of our paper.

**Q7: How would NuCLR perform if we go one level down the Bugeon hierarchy? (e.g., 11 classes at the type level)**

Thanks for the suggestion. Unfortunately, due to time and space constraints, we decided to only report on 5 classes, as this was the baseline provided in NeuPRINT. We leave this fine-grained analysis for future work.

**Q9: In Table 3, are the standard deviations also reported across seeds?**

Yes, the standard deviations reported in Table 3 are also across 5 training seeds.

**Q10: What exactly is "multiple seeds" for the standard deviations in the tables? Two? Three? More?**

Thank you for bringing this up! We apologize for our oversight here. For the results in Tables 1 and 2, we trained NuCLR, NEMO, and LOLCAT on 5 seeds, and NeuPRINT on 3 seeds (due to time limitations and longer training times for this method). Similarly, the ablation experiments in Table 3 were also done on 5 seeds. We will ensure that the final paper has 5 seeds for all models and will make sure to clarify the number of seeds in the main text of the camera-ready version of our paper.

**W5: The statement *In contrast, our method introduces the first spatio-temporal transformer designed and trained for neuron-level embeddings* is technically incorrect; the authors missed STNDT [12], another transformer, which accounted for cross-neuronal interactions. Its embeddings could be used for further cell-type classification as well, though they are not time invariant.**

Thanks for pointing out this work! You are correct that STNDT accounted for cross-neuronal interactions in their architecture. We will revise our statement to read “designed and trained for learning neuron-level properties like cell type and brain region”, as STNDT focused on forecasting and was not tested for cell type decoding or other neuron-level decoding tasks. We will include a citation to STNDT in the related work and discuss its connection to our approach in the revised manuscript.

**W6: While I provided a lot of references above, I think this is proof that the authors should redo the literature review more carefully and integrate its results, while also being more careful with the statements.**

Thanks for pointing us to these great references! In our related work, we primarily focused on methods that directly performed cell-type classification (NEMO, NeuPRINT, LOLCAT, etc.). However, you are right that several other approaches include cell-type classification analyses, even if that was not their main focus (POYO+, NEDS, Wang et al., etc.). We missed these works and will add a section in the related work to discuss them. Overall, we definitely agree that our literature review needs to be revised, and we plan to incorporate the papers you mentioned.

Comment

Thank you for your answers and comments.

[Q1] While it's true that the POYO+ architecture is aimed at behavior decoding, I need to correct the authors about the following part: *it compresses the population activity into a smaller set of latent tokens to learn population-level dynamics, losing the identity of the individual neurons in the process*. Both POYO and POYO+ tokenization learn per-neuron embeddings! In the POYO+ paper [2], in Section 2.1 Tokenization, they explicitly write "Each neuron receives its own learned embedding, which is accessed via a look-up table."; additionally, the same is visualized in the original POYO paper in Figure 1 (spike[i].unit_id corresponds to the unique neuron id). As both POYO and POYO+ are trained jointly and not per session, all of the per-neuron embeddings are learnt together; the task of identifying a cell type is not equivalent to decoding a Cre-line.

The answer also does not address the part on comparing regions, where even within a single Cre line up to 3 layers and 6 regions can be available.

[Q2] Thanks for the transparency with the splits!

[Q3] Yes, I understand that you used a single-layer MLP for cell-type / brain-region classification. However, I am specifically interested in how well your embeddings are separable in unsupervised settings, as this might be useful for data-driven discovery of cell types / cell functional properties. The analysis I am asking for is not time consuming: you just need to take the pretrained network and the same embeddings you used as input for the MLP, and do unsupervised clustering on them, to see if there are density modes in the embeddings or if they are uniformly distributed and the MLP just learns to draw a correct line to separate them.

[Q6] I would appreciate it if you could include confusion matrices in the reply as tables. I would like to see the confusion matrices on your dataset and on the IBL repeated site dataset, to compare to NEDS [3].

[Q8] I wonder why the question of *How would NuCLR perform if we go one level down the Bugeon hierarchy? (e.g., 11 classes at the type level)* was skipped, given that the model on this dataset was already trained for the main text.

Comment

We thank the reviewer for their continued engagement, and provide a point-by-point response to the remaining questions below:

**Q1a: While it's true that the POYO+ architecture is aimed at behavior decoding, I need to correct the authors about the following part: *it compresses the population activity into a smaller set of latent tokens to learn population-level dynamics, losing the identity of the individual neurons in the process*. Both POYO and POYO+ tokenization learn per-neuron embeddings! In the POYO+ paper [2], in Section 2.1 Tokenization, they explicitly write "Each neuron receives its own learned embedding, which is accessed via a look-up table."; additionally, the same is visualized in the original POYO paper in Figure 1 (spike[i].unit_id corresponds to the unique neuron id). As both POYO and POYO+ are trained jointly and not per session, all of the per-neuron embeddings are learnt together; the task of identifying a cell type is not equivalent to decoding a Cre-line.**

POYO and POYO+ architectures indeed learn per-unit embeddings, however, the analysis in POYO+ did not use these unit embeddings to do cell-type or region classification. Instead they state that they perform the classification on the session-averaged latents. Specifically, in Section 3.3 ANALYSIS OF LATENT EMBEDDINGS of POYO+, the authors state "we examined the latents at the output of the encoder in POYO+", "To obtain the session-level latent embeddings, we average the latents across randomly sampled 1s context windows in the recording." Additionally, the caption of Figure 4 of POYO+ states "Balanced accuracy for brain area classification based on hand-crafted features versus session-averaged latents from POYO+," and "Balanced accuracy for Cre-line classification based on handcrafted features versus session-averaged latents from POYO+."

This confirms, in POYO+:

  1. Unit embeddings were not used for the Cre-line or brain region classification. Instead, session-averaged latent outputs of the encoder were used.
  2. The cell-type, region, and layer classification problems were treated as session-level tasks.

As far as we are aware, the unit embeddings of POYO and POYO+ have not been evaluated on cell-type or brain region classification tasks in any prior works.

**Q1b: The answer also does not address the part on comparing regions, where even within a single Cre line up to 3 layers and 6 regions can be available.**

In the Allen Brain Observatory, data is only collected in a single Cre-line, at a specific depth (layer), and in a single brain region at a time. Thus, classification of layers and regions are also cast as classification of session-level properties.

**Q2: Yes, I understand that you used a single-layer MLP for cell-type / brain-region classification. However, I am specifically interested in how well your embeddings are separable in unsupervised settings, as this might be useful for data-driven discovery of cell types / cell functional properties. The analysis I am asking for is not time consuming: you just need to take the pretrained network and the same embeddings you used as input for the MLP, and do unsupervised clustering on them, to see if there are density modes in the embeddings or if they are uniformly distributed and the MLP just learns to draw a correct line to separate them.**

We are working on performing this analysis, and will report back with the results in a separate comment soon.

Comment

**Q3: I would appreciate it if you could include confusion matrices in the reply as tables. I would like to see the confusion matrices on your dataset and on the IBL repeated site dataset, to compare to NEDS [3].**

We provide confusion matrices in the form of tables below for the inductive zero-shot results:

Allen VC

| | Pvalb | Sst | Vip |
| --- | --- | --- | --- |
| Pvalb | 88.03 | 2.28 | 9.70 |
| Sst | 1.23 | 73.72 | 25.05 |
| Vip | 8.31 | 37.44 | 54.26 |

Bugeon et al. (E vs I)

| | Excitatory | Inhibitory |
| --- | --- | --- |
| Excitatory | 72.10 | 27.90 |
| Inhibitory | 26.34 | 73.66 |

Bugeon et al. (Subclass)

| | Lamp5 | Pvalb | Sncg | Sst | Vip |
| --- | --- | --- | --- | --- | --- |
| Lamp5 | 51.98 | 20.90 | 6.21 | 0.56 | 20.34 |
| Pvalb | 9.04 | 79.10 | 1.98 | 6.21 | 3.67 |
| Sncg | 50.00 | 12.50 | 29.17 | 0.00 | 8.33 |
| Sst | 22.22 | 38.27 | 0.00 | 24.65 | 14.82 |
| Vip | 18.33 | 27.78 | 12.78 | 2.22 | 38.89 |

IBL

| | CB | CNU | CTXsp | HB | HPF | HY | Isocortex | MB | OLF | TH |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CB | 65.51 | 0.06 | 0.04 | 28.89 | 3.20 | 0.01 | 2.12 | 0.15 | 0.00 | 0.01 |
| CNU | 1.43 | 52.44 | 5.20 | 4.19 | 4.31 | 1.87 | 16.47 | 3.19 | 9.79 | 1.10 |
| CTXsp | 0.00 | 5.41 | 15.11 | 0.00 | 19.66 | 12.53 | 11.85 | 0.43 | 33.74 | 1.29 |
| HB | 16.68 | 0.18 | 0.03 | 62.70 | 4.35 | 0.00 | 0.48 | 15.41 | 0.01 | 0.15 |
| HPF | 0.20 | 2.08 | 1.44 | 0.81 | 59.38 | 4.14 | 13.45 | 10.47 | 1.85 | 6.19 |
| HY | 4.70 | 17.45 | 1.61 | 0.00 | 6.31 | 36.65 | 6.31 | 15.57 | 3.49 | 7.92 |
| Isocortex | 0.60 | 7.29 | 3.44 | 0.64 | 8.27 | 1.99 | 65.37 | 2.92 | 8.17 | 1.30 |
| MB | 5.19 | 0.19 | 0.08 | 10.83 | 2.95 | 4.26 | 1.92 | 72.30 | 0.02 | 2.25 |
| OLF | 0.00 | 10.88 | 17.23 | 0.00 | 9.53 | 1.53 | 11.94 | 0.12 | 48.41 | 0.35 |
| TH | 0.13 | 3.38 | 0.87 | 0.00 | 4.70 | 11.05 | 3.34 | 2.52 | 0.54 | 73.46 |

Steinmetz et al.

| | HPF | MB | TH | VIS |
| --- | --- | --- | --- | --- |
| HPF | 75.73 | 4.84 | 12.97 | 6.45 |
| MB | 12.95 | 46.84 | 30.68 | 9.53 |
| TH | 8.82 | 0.69 | 87.62 | 2.87 |
| VIS | 21.84 | 6.79 | 24.95 | 46.42 |

IBL Repeated Sites

| | PO | LP | DG | CA1 | VISa |
| --- | --- | --- | --- | --- | --- |
| PO | 93.00 | 5.69 | 0.48 | 0.24 | 0.60 |
| LP | 6.94 | 89.66 | 0.68 | 2.04 | 0.68 |
| DG | 0.00 | 3.30 | 77.65 | 14.65 | 4.40 |
| CA1 | 2.02 | 2.02 | 17.93 | 73.23 | 4.80 |
| VISa | 2.55 | 3.82 | 2.12 | 2.33 | 89.17 |
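
For reference, a minimal sketch (with placeholder labels and predictions) of how row-normalized confusion matrices like those above can be computed, each row summing to 100% over the true class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["PO", "LP", "DG", "CA1", "VISa"]  # e.g., the IBL repeated sites labels
rng = np.random.default_rng(0)
y_true = rng.choice(classes, size=500)        # stand-ins for real labels
y_pred = rng.choice(classes, size=500)        # and for model predictions
cm = confusion_matrix(y_true, y_pred, labels=classes, normalize="true") * 100
print(np.round(cm, 2))                        # rows: true class, columns: predicted
```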

**Q4: I wonder why the question of *How would NuCLR perform if we go one level down the Bugeon hierarchy? (e.g., 11 classes at the type level)* was skipped, given that the model on this dataset was already trained for the main text.**

Apologies for omitting this in our rebuttal. We evaluated NuCLR's and NeuPRINT's embeddings on the 11 classes of cell types listed in Figure 1 of [1], and present the results below for all 3 evaluation settings. The label coverage in the 11-class case is very imbalanced, with 6 classes having fewer than 50 labelled neurons each, and the most frequent class having 411 labels. The reported metric is the macro-F1 score.

| Evaluation setting | NuCLR | NeuPRINT |
| --- | --- | --- |
| Transductive | 0.4109 ± 0.0210 | 0.4109 ± 0.0210 |
| Transductive zero-shot | 0.2392 ± 0.0129 | 0.2324 ± 0.0074 |
| Inductive zero-shot | 0.2363 ± 0.0410 | N/A |

[1] Bugeon, Stephane, et al. "A transcriptomic axis predicts state modulation of cortical interneurons." Nature 607, 330–338 (2022).

Comment

Thanks a lot for the update and additional experiments / results presentation.

[Q1a] Thanks for the pointers, I will double-check with the POYO+ paper.
[Q1b] Thanks for correcting me; yes, while several Cre lines span various depths and brain areas, it's only one depth and one brain area per session.

[Q2] I would be looking forward to seeing the results.

[Q4] Yes, it's totally fine; I am aware that the Bugeon 2022 data is imbalanced. Could you please also add the by-chance macro-F1 score, adjusted for the fact that the classes are imbalanced, to see how much better the models are than chance?

Comment

Dear reviewer KHtW, we address the remaining questions below:

**Q2: Clustering analysis**

Based on your feedback, we conducted an unsupervised clustering analysis using the Louvain method on embeddings from NuCLR, NEMO, and NeuPRINT. We followed the same pipeline as NEMO’s Section 6.2, and also measured alignment between recovered clusters and ground-truth labels using the Adjusted Rand Index (ARI). We provide our observations on each dataset below:

IBL: Embeddings show interconnected clusters with strong density modes aligned with brain regions. Notably, a distinct cluster of Thalamus neurons is visible, consistent with the UMAP plot in Figure 2 of our paper. Below, we compare the alignment of clustering with brain-region labels across models and note that NuCLR achieves the highest ARI score among methods.

Table 1: ARI w.r.t. brain-region labels on IBL.

| NuCLR | NEMO | NeuPRINT |
| --- | --- | --- |
| 0.280 | 0.191 | 0.042 |

Allen VC: Clusters again show meaningful structure, with Louvain clustering primarily identifying spatial subdivisions within the visual cortex (e.g., VISal, VISam). NuCLR achieves the highest ARI score among methods in alignment with these brain regions. Interestingly, we also find a distinct cluster with a higher density of Pvalb neurons, despite low overall alignment with cell-type labels (ARI < 0.02 for all methods). This suggests that cell types are not the dominant axis of unsupervised clustering here, and the MLP classification head learns to exploit linearly separable subspaces for classification.

Table 2: ARI w.r.t. brain-region labels on Allen VC.

| NuCLR | NEMO | NeuPRINT |
| --- | --- | --- |
| 0.122 | 0.008 | 0.004 |

Bugeon et al.: There is strong clustering based on subject identities in NuCLR's embeddings. This dataset has fewer subjects overall (only 4), while datasets with higher subject diversity (IBL with 115 and Allen VC with 58) avoid this issue. We see a continuous variation between excitatory and inhibitory cells within the individual subject-level manifolds. When looking at ARI scores, however, we see that NeuPRINT's embeddings lead to clusters more aligned with the E-I and Subclass ground-truth labels. The better performance of NuCLR in the classification tasks would indicate there is more signal to distinguish cell types in its embeddings, just not along the strongest clustering axes.

Table 3: ARI w.r.t. cell types on the Bugeon et al. dataset. (NEMO cannot operate on calcium data and has been excluded.)

| | NuCLR | NeuPRINT |
| --- | --- | --- |
| E-I | 0.020 | 0.044 |
| Subclass | 0.015 | 0.111 |

Steinmetz et al.: This dataset also has relatively few subjects (only 10), and we observe clustering based on subject identities in NuCLR's embeddings. However, within each subject-level cluster, there are density modes corresponding to brain regions. This can be seen in the higher ARI score for NuCLR's embeddings compared to the rest of the methods.

Table 4: ARI w.r.t. brain-region labels on the Steinmetz et al. dataset.

| NuCLR | NEMO | NeuPRINT |
| --- | --- | --- |
| 0.141 | 0.096 | 0.020 |

These observations highlight that, in most cases, NuCLR embeddings do encode meaningful structure, and the structure is more pronounced when there is sufficient data available. We will include ARI scores, visualizations, and this discussion in the appendix of our updated manuscript. Thank you for the suggestion! While our current focus has been on classification, we see strong potential for adapting the NuCLR framework toward fully unsupervised data-driven discovery, potentially by incorporating ideas from clustering-aware self-supervised methods such as SwAV [1].

[1] Caron, Mathilde, et al. "Unsupervised learning of visual features by contrasting cluster assignments." NeurIPS (2020).
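
For concreteness, a minimal sketch of a Louvain-plus-ARI analysis of the kind described above, assuming a scanpy-style kNN-graph workflow (our own sketch with placeholder arrays, not the authors' exact pipeline):

```python
import numpy as np
import scanpy as sc
from anndata import AnnData
from sklearn.metrics import adjusted_rand_score

def louvain_ari(embeddings: np.ndarray, labels: np.ndarray,
                n_neighbors: int = 15, resolution: float = 1.0) -> float:
    """Louvain-cluster neuron embeddings and score the clusters against labels."""
    adata = AnnData(X=embeddings.astype(np.float32))
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, use_rep="X")  # kNN graph
    sc.tl.louvain(adata, resolution=resolution)                   # graph clustering
    return adjusted_rand_score(labels, adata.obs["louvain"].to_numpy())

# e.g., ari = louvain_ari(nuclr_embeddings, brain_region_labels)
```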

**Q4: Chance-level macro-F1 scores for 11-class classification on Bugeon et al.**

Below, we show the chance-level performance (macro-F1) on random embeddings. As you can see, both models perform significantly above chance.

| Evaluation setting | NuCLR | NeuPRINT | Random Embeddings |
| --- | --- | --- | --- |
| Transductive | 0.4109 ± 0.0210 | 0.4109 ± 0.0210 | 0.1009 ± 0.0257 |
| Transductive zero-shot | 0.2392 ± 0.0129 | 0.2324 ± 0.0074 | 0.0871 ± 0.0091 |
| Inductive zero-shot | 0.2363 ± 0.0410 | N/A | 0.0772 ± 0.0128 |
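
For reference, a sketch of how such a chance baseline can be computed, assuming a linear readout fit on random Gaussian embeddings (placeholder data and sizes, not the authors' exact protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_neurons, dim, n_classes = 1000, 64, 11
y = rng.integers(0, n_classes, size=n_neurons)   # imbalanced in practice
X = rng.normal(size=(n_neurons, dim))            # random "embeddings"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("chance macro-F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```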

**Q1a: Regarding comparison with POYO+**

We hope the text we quoted from the original POYO+ paper was sufficient to convey the major differences in the evaluation settings between the two methods and why we are not comparing on the Allen Brain Observatory dataset. In the meantime, we have recently been able to set up POYO+ training on Allen VC. We will analyze its learned unit embeddings and report back with comparisons soon.

Thanks for your valuable feedback! Please feel free to let us know if there are any additional questions.

Comment

We are excited to report that we were able to successfully train POYO+ on the Allen VC dataset, and we provide its performance for cell-type classification based on its unit embeddings below. We note that NuCLR considerably outperforms the unit embeddings of POYO+ on this task. This result is provided for a single seed only, due to the compute-intensive nature of training POYO+; however, we will run multiple seeds and report these accuracies in the final paper.

Table 1: Macro-F1 scores for cell-type classification on Allen VC dataset under transductive zero-shot evaluation.

| POYO+ | NuCLR | NeuPRINT | NEMO |
| --- | --- | --- | --- |
| 0.3762 | 0.7218 ± 0.0113 | 0.3999 ± 0.0312 | 0.4256 ± 0.0114 |

For training POYO+ on Allen VC, we used the behavior tasks listed below. We also report its behavior decoding accuracies upon convergence for all the behavior tasks:

  • Drifting Gratings Orientation: 95.66% balanced accuracy (chance: 12.5%)
  • Drifting Gratings Temporal Frequency: 94.83% balanced accuracy (chance: 20.0%)
  • Gabor Orientation: 57.88% balanced accuracy (chance: 25.0%)
  • Gabor Position: 0.7677 R²
  • Natural Scenes: 68.13% balanced accuracy (chance: 0.84%)
  • Running Speed: 0.7534 R²

We hope we were able to address all your concerns!

Review (Rating: 4)

The authors propose NuCLR, a self-supervised framework to infer individual neuron identities (e.g., cell type and brain region) from large-scale population recordings without requiring anatomical labels. NuCLR models both within-neuron dynamics and across-neuron interactions using a spatiotemporal transformer, and uses a contrastive learning objective that encourages stable and discriminative neuron embeddings over time. Evaluations on multiple open datasets demonstrate that it can be zero-shot generalized to completely new populations, and it shows better results than several baseline models.
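
To make the objective concrete, below is a toy sketch of a decoupled contrastive loss over two temporal views of the same neurons (an InfoNCE-style illustration under our own assumptions, not the paper's exact loss; the cross-session negative masking mentioned in the reviews is omitted):

```python
import torch
import torch.nn.functional as F

def decoupled_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (n_neurons, dim) embeddings of the same neurons from two time windows."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature            # pairwise cosine similarities
    pos = sim.diagonal()                     # same-neuron (positive) pairs
    # "decoupled": the positive term is excluded from the denominator
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = torch.logsumexp(sim.masked_fill(diag, float("-inf")), dim=-1)
    return (neg - pos).mean()
```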

Strengths and Weaknesses

The paper is technically sound overall and presents a compelling approach. However, not all claims are equally well supported by experimental evidence. For instance, while the authors mention the potential for “real-time” use, no experiments are provided to validate latency or deployment efficiency in such a setting.

The claim of zero-shot generalization to new populations is partially supported. The “Inductive zero-shot” setting appropriately evaluates this scenario, but several baselines in Table 1 are marked as N/A, limiting a full comparison. In the “Bugeon et al. (E vs. I)” setting, a baseline even achieves better performance, which raises questions about the robustness of the claim.

While the authors separate “Transductive” and “Transductive zero-shot” settings, these may still involve access to test-distribution features during pretraining, potentially raising concerns about information leakage. A more thorough discussion of this issue would be helpful.

It would be better to see more detailed descriptions and appropriate citations for key components like the temporal transformer layers and spatio-temporal blocks.

In the related work section, some statements could be improved to better reflect prior literature. For example, the paper notes that CEED does not incorporate neuronal dynamics or population structure, but it would be helpful to clarify why this omission makes it less relevant or applicable, especially since this work also relies on spike binning, which can reduce temporal resolution and thus may also abstract away detailed neuronal dynamics.

While the problem of inferring neuron identity from activity is important, the current form of the paper does not clearly demonstrate a level of significance that stands out for a NeurIPS audience. Improvements in experimental presentation and more comprehensive comparisons may help strengthen this aspect, but at present, the contribution feels incremental rather than transformative.

Questions

  1. Can the authors explain why the results from “Transductive zero-shot” and “Transductive” are technically sound to support their claim, or add additional experimental results to support it.
  2. Please provide more detail about the model architecture.
  3. Is there a reason to use temporal attention followed by a spatio-temporal attention layer rather than a single spatio-temporal layer?

Limitations

Yes

Final Justification

After reviewing the authors' comments, I'm willing to raise my score.

Formatting Issues

None

Author Response

Thank you for your time and review! We provide point-by-point responses to your concerns below.

**W1: However, not all claims are equally well supported by experimental evidence. For instance, while the authors mention the potential for “real-time”, no experiments are provided to validate latency or deployment efficiency in such a setting.**

We appreciate the opportunity to clarify our use of the term “real-time”. Our original claim may have been misleading, and we apologize for any confusion this caused. By “real-time,” we intended to convey that NuCLR supports zero-shot cell-type classification, enabling the possibility of performing cell-type inference in vivo without the need for retraining. However, we recognize that “real-time” typically refers to low-latency deployment or online processing speed in machine learning, and our usage may have caused misunderstanding. To avoid confusion, we will remove this claim from the abstract and instead include a brief discussion of the model’s in vivo potential in the Discussion section.

**W2: The “Inductive zero-shot” setting appropriately evaluates this scenario, but several baselines in Table 1 are marked as N/A, limiting a full comparison. In the “Bugeon et al. (E vs. I)” setting, a baseline even achieves better performance, which raises questions about the robustness of the claim.**

We thank the reviewer for highlighting this important point. Existing cell-type classification methods either do not support inductive zero-shot learning (e.g., NeuPRINT) or are specifically designed for electrophysiological data and do not generalize to calcium imaging (e.g., NEMO), which is why we have marked them as N/A in these settings. While direct empirical comparisons are therefore not feasible in this context, the ability of our model to both perform inductive zero-shot cell-type classification and generalize across recording modalities (electrophysiology and calcium imaging) represents a substantial advancement over current approaches. Regarding LOLCAT’s improved performance over NuCLR on the Bugeon et al. dataset for the excitatory vs. inhibitory (E vs. I) classification task, we found that LOLCAT, which is a supervised method, performs well when there is sufficient labeled data and coarse class distinctions. However, across all other settings, particularly those involving fine-grained labels or limited supervision, NuCLR consistently achieves substantially better performance.

**W3: While the authors separate “Transductive” and “Transductive zero-shot” settings, these may still involve access to test-distribution features during pretraining, potentially raising concerns about information leakage. A more thorough discussion of this issue would be helpful.**

We agree that inductive evaluation offers the strongest test of generalization. However, transductive evaluation remains a widely accepted practice in the neuron-type classification literature [1, 2, 3], and we include these settings for completeness and comparability. Both the “transductive” and “transductive zero-shot” setups correspond to realistic use cases where users have access to full neural recordings but only partial labels. In such scenarios, one can pretrain using the full dataset, including test recordings, while withholding all test labels. This setup does not involve any label leakage, and typically improves decodability of representations. That said, we agree this nuance deserves clearer discussion, and we will revise the discussion to more explicitly address the motivations and limitations of transductive evaluations.

[1] Beau, Maxime, et al. "A deep learning strategy to identify cell types across species from high-density extracellular recordings." Cell 188.8 (2025): 2218-2234.

[2] Ye, Zhiwen, et al. "Ultra-high density electrodes improve detection, yield, and cell type identification in neuronal recordings." bioRxiv (2024): 2023-08.

[3] Mi, Lu, et al. "Learning time-invariant representations for individual neurons from population dynamics." Advances in Neural Information Processing Systems 36 (2023): 46007-46026.
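
To make the distinction concrete, a schematic sketch of how the three evaluation regimes partition recordings under our reading of the setup (names and structure are placeholders, not the paper's code):

```python
def build_regimes(all_recordings: list, test_recordings: list) -> dict:
    """Which recordings enter pretraining, and where readout labels may come from."""
    train_recordings = [r for r in all_recordings if r not in test_recordings]
    return {
        # pretrain sees all activity (no labels used); readout may use labeled
        # neurons from any recording, while test labels stay held out
        "transductive": dict(pretrain=all_recordings, readout_labels=all_recordings),
        # pretrain sees all activity; readout labels only from train recordings
        "transductive zero-shot": dict(pretrain=all_recordings, readout_labels=train_recordings),
        # test recordings are excluded from pretraining altogether
        "inductive zero-shot": dict(pretrain=train_recordings, readout_labels=train_recordings),
    }
```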

**W4: It would be better to see more detailed descriptions and appropriate citations for key components like the temporal transformer layers and spatio-temporal blocks.**

Here we provide a detailed description of the components of our model. All transformer layers in NuCLR use multi-head scaled-dot-product attention from [1], with the feedforward network (FFN) being a GEGLU network [2]. LayerNorm is used for normalization before tokens are sent into the attention and FFN layers. For spatial attention layers, no position embeddings are used. For temporal attention, we use Rotary position embeddings [3] to incorporate the relative timing of the patch-tokens. We will make sure to add these details to our paper. Please let us know if you would like us to include any more details!
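
For readers who want a concrete picture, here is a compact PyTorch sketch of the block structure described above, as we understand it (a paraphrase for illustration, not the released code; rotary embeddings on the temporal axis are omitted for brevity):

```python
import torch
import torch.nn as nn

class GEGLU(nn.Module):
    """GEGLU feedforward network."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):
        a, b = self.proj(x).chunk(2, dim=-1)
        return self.out(a * nn.functional.gelu(b))

class AttnBlock(nn.Module):
    """Pre-LayerNorm multi-head self-attention followed by a GEGLU FFN."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = GEGLU(dim, 4 * dim)

    def forward(self, x):                 # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

def temporal_then_spatial(x, t_block: AttnBlock, s_block: AttnBlock):
    # x: (n_neurons, n_time_patches, dim) for one recording
    x = t_block(x)                                   # attend over time, per neuron
    x = s_block(x.transpose(0, 1)).transpose(0, 1)   # attend over neurons, per patch
    return x   # no positional embeddings on the neuron axis (permutation-invariant)
```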

We acknowledge the lack of citations for these components. We will also add a citation for PatchTST [4] for the temporal transformer layer, and STNDT [5] and ViViT [6] for the spatio-temporal transformer layer.

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Shazeer, Noam. "GLU variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).

[3] Su, Jianlin, et al. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.

[4] Nie, Yuqi, et al. "A time series is worth 64 words: Long-term forecasting with transformers." The 11th International Conference on Learning Representations (ICLR), 2023.

[5] Le, Trung, et al. "STNDT: Modeling Neural Population Activity with Spatiotemporal Transformers." Advances in Neural Information Processing Systems 35 (2022).

[6] Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

**W5: In the related work section, some statements could be improved to better reflect prior literature. For example, the paper notes that CEED does not incorporate neuronal dynamics or population structure, but it would be helpful to clarify why this omission makes it less relevant or applicable, especially since this work also relies on spike binning, which can reduce temporal resolution and thus may also abstract away detailed neuronal dynamics.**

Thank you for the feedback. Based on your comments and those of other reviewers, we agree that our discussion of prior work needs to be strengthened. To clarify, CEED relies solely on extracellular waveforms for classification and does not incorporate spike timing information. In contrast, NEMO builds on CEED by introducing contrastive learning to jointly embed both spiking activity and waveforms. This distinction is why we compare against NEMO, rather than CEED, in our manuscript. We will revise the prior work section to clarify these distinctions.

**W6: While the problem of inferring neuron identity from activity is important, the current form of the paper does not clearly demonstrate a level of significance that stands out for a NeurIPS audience. Improvements in experimental presentation and more comprehensive comparisons may help strengthen this aspect, but at present, the contribution feels incremental rather than transformative.**

We respectfully disagree with this assessment. Across multiple open-access datasets, NuCLR introduces a novel architecture for the task of cell type and brain region classification, and consistently outperforms prior methods on both tasks. Moreover, its ability to generalize in a zero-shot setting to new populations of neurons represents a promising step toward making these tools more practical and broadly usable for neuroscientists. To our knowledge, this is also the first work to demonstrate that cell-type classification performance improves with the addition of unlabeled data, and that pretraining on unlabeled sessions yields greater downstream gains than simply doubling the number of labeled examples. Together, these empirical findings and methodological innovations address core machine learning challenges, such as generalization under distribution shift, learning from limited labels, and leveraging unlabeled data for improved transfer, which are highly relevant to the NeurIPS community.

In addition, recent publications such as NeuPRINT (NeurIPS 2023) and NEMO (ICLR 2024) also focus entirely on the problem of cell-type and brain-region classification from neural recordings. The presence of these works at top machine-learning venues suggests that the topic is of active interest to the NeurIPS audience.

**Q2: Please provide more detail about the model architecture.**

We refer to our response to weakness 4 (W4) above.

**Q3: Is there a reason to use temporal attention followed by a spatio-temporal attention layer rather than a single spatio-temporal layer?**

We saw, in early experiments, that including temporal-only attention layers before the spatio-temporal layer slightly improved the performance. Based on your suggestion, we conducted an ablation study where we replace the temporal layers with a spatio-temporal layer and present the results below. While there is a small improvement with our current architecture, we agree that it is reasonable to replace the temporal attention + spatial attention with a spatiotemporal attention layer without losing much performance. Thanks for the valuable insight! This is indeed an important ablation and we will include these results in Table 3 of our updated paper.

Table 1: Ablation for temporal layers in the model in the inductive zero-shot evaluation setup.

| | Allen VC | IBL |
| --- | --- | --- |
| Presented architecture | 0.7200 ± 0.0267 | 0.5295 ± 0.0040 |
| Ablation: spatio-temporal layers only | 0.7184 ± 0.0124 | 0.5259 ± 0.0089 |

Comment

Thank you for your response.

I would like to clarify that I fully agree that the problem of inferring neuron identity from activity is both meaningful and of strong interest to the NeurIPS community, as I acknowledged at the beginning of my review. Rather, my comments are focused on the current form of the submission and how the contributions are presented.

Specifically:

• Model architecture – This is typically one of the most important sections in a machine learning paper, yet it is under-discussed in the current version. A more detailed and structured explanation of the model would help readers better understand the novelty and design rationale behind your approach.

• Related work – this section could be strengthened. A well-developed related work section not only demonstrates the authors' understanding of the field but also helps readers contextualize the contribution within the existing literature. Considering the importance of these two sections and the foundational role they play in communicating the work, I would encourage the authors to invest additional effort in polishing and expanding these parts of the paper.

The results shown in the table for Q3 suggest that a spatio-temporal layer achieves performance comparable to the proposed architecture, while using significantly fewer parameters. However, the paper does not clearly explain the reasoning behind the choice to use temporal attention followed by a spatio-temporal attention layer. What was the underlying hypothesis? Why did you expect this design to perform better? Currently, it appears that a more intuitive alternative (a spatio-temporal layer) was only evaluated post hoc in response to reviewer feedback and performed equally well. I would encourage the authors to conduct more comprehensive ablation studies on architectural choices to better justify the final design.

Given the central importance of the model architecture, the related work, and the points above, I will maintain my score.

Comment

References:

[1] Masland, R. H. (2004). Neuronal cell types. Current Biology, 14(13), R497–R500.

[2] Zeng, H. (2022). What is a cell type and how to define it? Cell, 185(15), 2739–2755.

[3] Tasic, B., Yao, Z., Graybuck, L. T., Smith, K. A., Nguyen, T. N., Bertagnolli, D., ... & Zeng, H. (2018). Shared and distinct transcriptomic cell types across neocortical areas. Nature, 563(7729), 72–78.

[4] Bugeon, S., et al. (2022). A transcriptomic axis predicts state modulation of cortical interneurons. Nature, 607(7918), 330–338.

[5] Weis, M. A., Hansel, L., Lüddecke, T., & Ecker, A. S. (2021). Self-supervised graph representation learning for neuronal morphologies. arXiv preprint arXiv:2112.12482.

[6] Gouwens, N. W., Sorensen, S. A., Berg, J., Lee, C., Jarsky, T., Ting, J., ... & Koch, C. (2020). Integrated morphoelectric and transcriptomic classification of cortical GABAergic cells. Cell, 183(4), 935–953.

[7] de Vries, S. E., Lecoq, J. A., Buice, M. A., Groblewski, P. A., Ocker, G. K., Oliver, M., ... & Koch, C. (2020). A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience, 23(1), 138–151.

[8] Burg, M. F., Zenkel, T., Vystrčilová, M., Oesterle, J., Höfling, L., Willeke, K. F., ... & Ecker, A. S. (2024). Most discriminative stimuli for functional cell type clustering. arXiv preprint arXiv:2401.

[9] Walker, E. Y., et al. (2019). Inception loops discover what excites neurons most using deep predictive models. Nature Neuroscience, 22(12), 2060–2065.

[10] Willeke, K. F., Restivo, K., Franke, K., Nix, A. F., Cadena, S. A., Shinn, T., ... & Tolias, A. S. (2023). Deep learning-driven characterization of single cell tuning in primate visual area V4 unveils topological organization. bioRxiv, 2023-05.

[11] Höfling, L., et al. (2024). A chromatic feature detector in the retina signals visual context changes. eLife, 13, e86860.

[12] Tong, R., da Silva, R., Lin, D., Ghosh, A., Wilsenach, J., Cianfarano, E., ... & Trenholm, S. (2023). The feature landscape of visual cortex. bioRxiv, 2023-11.

[13] Ustyuzhaninov, I., Burg, M. F., Cadena, S. A., Fu, J., Muhammad, T., Ponder, K., ... & Ecker, A. S. (2022). Digital twin reveals combinatorial code of non-linear computations in the mouse primary visual cortex. bioRxiv, 2022-02.

[14] Schneider, A., et al. (2023). Transcriptomic cell type structures in vivo neuronal activity across multiple timescales. Cell Reports, 42(4).

[15] Yu, H., Lyu, H., Xu, Y., Windolf, C., Lee, E. K., Yang, F., ... & Hurwitz, C. L. (2025). In vivo cell-type and brain region classification via multimodal contrastive learning. The Thirteenth International Conference on Learning Representations (ICLR).

[16] Lee, E. K., Gül, A. E., Heller, G., Lakunina, A., Jaramillo, S., Przytycki, P. F., & Chandrasekaran, C. (2024). PhysMAP—interpretable in vivo neuronal cell type identification using multi-modal analysis of electrophysiological data. bioRxiv, 2024-02.

[17] Beau, M., Herzfeld, D. J., Naveros, F., Hemelt, M. E., D’Agostino, F., Oostland, M., ... & Hantman, A. W. (2025). A deep-learning strategy to identify cell types across species from high-density extracellular recordings. Cell, 2025.

[18] Mi, L., et al. (2023). Learning time-invariant representations for individual neurons from population dynamics. Advances in Neural Information Processing Systems, 36, 46007–46026.

[19] Wu, W., et al. (2025). Neuron Platonic Intrinsic Representation From Dynamics Using Contrastive Learning. The Thirteenth International Conference on Learning Representations (ICLR).

[20] Azabou, M., Pan, K. X., Arora, V., Knight, I. J., Dyer, E. L., & Richards, B. A. (2024). Multi-session, multi-task neural decoding from distinct cell-types and brain regions. The Thirteenth International Conference on Learning Representations (ICLR).

[21] Azabou, M., et al. (2023). A unified, scalable framework for neural population decoding. Advances in Neural Information Processing Systems (NeurIPS).

[22] Zhang, Y., Wang, Y., Azabou, M., Andre, A., Wang, Z., Lyu, H., ... & Hurwitz, C. (2025). Neural encoding and decoding at scale. arXiv preprint arXiv:2504.08201.

[23] Liu, R., et al. (2022). Seeing the forest and the tree: Building representations of both individual and collective dynamics with transformers. Advances in Neural Information Processing Systems (NeurIPS), 35, 2377–2391.

[24] Le, T., & Shlizerman, E. (2022). STNDT: Modeling neural population activity with spatiotemporal transformers. Advances in Neural Information Processing Systems (NeurIPS), 35, 17926–17939.

Comment

Parameter count clarification: The “spatio-temporal layers only” ablation in Table 1 of our rebuttal uses exactly the same number of parameters and the same model depth as our proposed model. The only change was to replace two temporal-only attention layers in our main model with spatio-temporal attention layers. Thus, parameter efficiency would not be a reason to choose one architecture over the other.

Intuition/Rationale for Temporal-attention layers: Our spatial attention layers are permutation-invariant across neurons and do not include positional embeddings that distinguish individual neurons. For such layers to be effective, the input tokens must already encode some notion of neuron identity. The temporal-only attention layers provide this by first processing the temporal dynamics of each neuron independently, establishing a partial identity before cross-neuron interactions occur. This identity signal can then be exploited by the subsequent spatial and spatio-temporal layers.

In early experiments, adding these initial temporal-only layers yielded a notable performance improvement over starting directly with spatio-temporal layers, at no additional computational cost. Although the final multi-seed results show the gain is modest, we chose to retain the design because:

  • It is intuitively well-motivated from the architecture’s invariance properties.
  • It incurs no parameter or runtime overhead relative to the alternative.
  • It provided consistent, if modest, benefits during development.

Model details: We are committed to improving the description of our model in the paper. In your review, you asked for detailed descriptions of the temporal and spatio-temporal layers, and we provided all the implementation details in our rebuttal. We will also add the rationale for the temporal layers mentioned above. We are more than happy to include any more details you find lacking! Could you please point out what is still missing?

While the model architecture is one important aspect of our work, we also see the training methodology and the large-scale pretraining analysis as core contributions. Thus, we aimed to balance coverage of all three within the 9-page limit.

Related works: We do agree this section in our paper needs strengthening. In the initial manuscript, due to space limitations, we primarily focused on deep learning methods that directly performed cell-type classification (NEMO, NeuPRINT, LOLCAT, etc.). However, we realized that we failed to capture the broader landscape of work on functional characterization of tuning and stimulus response properties that also help to define functional cell types. We plan to provide a more comprehensive survey of different approaches for functional cell type discovery in our final paper, including work on finding maximally exciting inputs to classify neuronal types [8-9], as well as other architectures that allow access to neuron-level embeddings like POYO+ [20], POYO [21], NEDS [22], EIT [23], and STNDT [24].

Comment

Please see our proposed revisions for the related work subsections on cell type classification below. Please let us know if there are other specific lines of work or other references that you think are missing.

Proposed Revisions to Related Work:

Cell type classification. Cell type classification seeks to assign neurons to meaningful biological or functional classes using structural, molecular, or physiological information [1, 2]. Transcriptomic approaches such as single-cell RNA sequencing [3] and spatial transcriptomics [4] provide high-resolution cell type labels, but these methods require extensive experimental infrastructure and are difficult to apply in chronic, large-scale, or in vivo recordings. Morphology-based approaches, such as self-supervised graph learning on neuronal reconstructions [5], and multimodal in vitro studies that integrate morphology, electrophysiology, and transcriptomics [6], provide complementary views of neuronal identity.

A major goal in recent work has been to infer functional cell types directly from large-scale physiological recordings, bypassing the need for extensive molecular and morphological profiling. Early approaches relied on hand-engineered features derived from electrophysiological signals, such as extracellular waveforms, autocorrelograms (ACGs), and peri-stimulus histograms as proxies for cell identity [7]. Subsequent methods incorporated stimulus-driven tuning properties, either by extracting features from responses to specific stimulus sets or by synthesizing maximally exciting or most discriminative inputs for classification [8, 9]. Large-scale functional characterization studies have further revealed substantial neuron-level diversity relevant to defining functional cell types, including work on tuning organization in primate V4 [10], chromatic feature detectors in retina [11], feature landscapes in visual cortex [12], and combinatorial codes in mouse V1 [13]. These studies show that functional cell types can be distinguished not only by static structural or molecular signatures, but also by patterns of tuning, feature selectivity, and population codes. While effective in controlled paradigms, stimulus-dependent approaches are inherently tied to the availability of specific stimuli and may fail to generalize across diverse experimental contexts.

Recent advances have shifted toward learning stimulus-agnostic, neuron-level embeddings that capture intrinsic activity dynamics. LOLCAT [14] learns trial-level representations and attends to subsets of trials to build a prediction of cell type over many trials, training in a supervised manner to classify individual neurons. NEMO [15] uses a CLIP-style contrastive loss between waveform and ACG views, while PhysMAP [16] and VAE-based models [17] combine multiple physiological signals into shared latent spaces. NeuPRINT [18] and NeurPIR [19] aim to learn time-invariant or intrinsic representations for individual neurons from population dynamics using reconstruction and contrastive objectives, respectively.

Channel-level transformer architectures and functional embeddings. Channel-level transformer architectures and related models such as POYO+ [20], POYO [21], NEDS [22], EIT [23], and STNDT [24] also generate neuron-level embeddings or tokens, but are not explicitly designed for neuron-level cell type or brain region readouts. POYO, POYO+, and EIT are trained primarily on supervised decoding tasks, while NEDS uses both encoding and decoding objectives. STNDT is trained with a masked modeling objective and uses a combination of neuron-level and population-level tokens to demonstrate strong performance on behavioral reach decoding tasks. We note that while POYO+ provides results for classification of different Cre-lines and brain regions, it is performed at the session level on latent representations, rather than directly on embeddings of individual neurons.

Review
5

This paper introduces NuCLR, a self-supervised framework that learns population-aware representations of single neurons directly from large-scale recordings. A spatiotemporal transformer exchanges temporal information within neurons and spatial information across neurons. Training relies on a contrastive loss and neuron dropout. The learned embeddings support zero-shot decoding of both cell type and brain region across four electrophysiology and calcium‐imaging datasets. NuCLR consistently outperforms other baselines and is shown to scale favourably with additional unlabeled data and labeled data.

Strengths and Weaknesses

Strengths

Recasting functional cell typing as a self-supervised contrastive problem is a novel angle. Performance evaluated on four public datasets shows significant improvements over existing baselines. Data-scaling curves are informative and provide actionable insights. Architecture designs are validated by thorough ablations. Writing is well-structured and easy to follow.

Weaknesses

  1. Descriptions of baseline methods are too brief. Readers unfamiliar with methods such as NeuPRINT cannot easily compare the differences in their algorithmic designs.

  2. It is unclear how the spike-bin/patch length might blur fast-spiking interneuron dynamics, potentially affecting subclass classification.

  3. It is unclear whether there exist different stimulus conditions in these datasets; if so, whether the conditions are shared between training and test sets.

  4. It is unclear how the decodability of brain regions and cell types changes in the intermediate transformer layers.

Questions

  1. Could you provide expanded technical descriptions for NeuPRINT, LOLCAT, and NEMO?

  2. Could you analyze/discuss the influence of spike-bin/patch length on model performance, especially for different cell types?

  3. Could you examine the linear decodability of labels from intermediate layers of the trained transformer model?

Limitations

Yes

Final Justification

My concerns are adequately addressed.

Formatting Concerns

None

Author Response

Thanks for your feedback and suggestions! Based upon your suggestions, we ran two new experiments: (i) we examined the role that the bin size has on model performance for different cell types, and (ii) studied the decodability of labels at intermediate layers in the model. We provide these results and a point-by-point response to your concerns below.

**Q1: Could you provide expanded technical descriptions for NeuPRINT, LOLCAT, and NEMO?**

Thank you for pointing out the missing technical descriptions. We will make sure to include detailed technical descriptions of all baseline methods in the appendix of the camera-ready version of our paper. We provide these details below:

NeuPRINT is a self-supervised method that learns neuron embeddings directly by performing causal masked modelling on neural activity. The method consists of an encoder that takes as input (1) the past activity of a neuron, (2) the population-wise mean and standard deviation of all the neurons in the recording, (3) the mean and standard deviation of the M neurons closest to the target neuron, and (4) a learnable neuron embedding corresponding to the target neuron. The encoder, along with the learnable neuron embeddings, is optimized using gradient descent on the task of predicting future activity across many neurons and populations. At the end of training, these learnable neuron embeddings are taken as the output of the method. These embeddings are used to train linear classifiers and MLPs for the downstream objectives of cell-type and brain-region decoding.
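For readers less familiar with this setup, below is a minimal, illustrative PyTorch sketch of a NeuPRINT-style objective. All class names, dimensions, and the simple MLP encoder are our hypothetical choices for illustration, not the original implementation.

```python
import torch
import torch.nn as nn

class NeuPRINTSketch(nn.Module):
    """Illustrative sketch: learnable per-neuron embeddings trained via activity prediction."""
    def __init__(self, n_neurons, t_past, d_embed=32, d_hidden=128):
        super().__init__()
        # One learnable identity embedding per neuron; these are the method's output.
        self.neuron_embed = nn.Embedding(n_neurons, d_embed)
        # Inputs: past activity, global population mean/std, local (M-nearest) mean/std.
        d_in = t_past + 2 * t_past + 2 * t_past + d_embed
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1)
        )

    def forward(self, neuron_id, past, pop_stats, local_stats):
        # past: (B, t_past); pop_stats, local_stats: (B, 2 * t_past)
        z = self.neuron_embed(neuron_id)                     # (B, d_embed)
        x = torch.cat([past, pop_stats, local_stats, z], dim=-1)
        return self.encoder(x)                               # predicted future activity

# Training minimizes e.g. MSE against the true future activity; gradients update
# both the shared encoder and the per-neuron identity embeddings.
```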

NEMO is a CLIP-based self-supervised method. It trains two encoders. One encoder is a convolutional neural network (CNN) that embeds a hand-crafted feature of a neuron's activity: its 3D autocorrelogram. The second encoder is also a CNN that embeds the neuron's spike waveform/template. The CLIP objective trains the two encoders to produce similar embeddings. Finally, after training, a neuron's embedding is produced by concatenating the outputs of both encoders. This concatenated embedding can then be used for the downstream task of cell-type and brain-region classification.
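As a rough illustration of the CLIP objective described above (a sketch under our own naming conventions, not NEMO's code):

```python
import torch
import torch.nn.functional as F

def clip_loss(acg_emb, wf_emb, temperature=0.07):
    """Symmetric CLIP-style loss; row i of each tensor is the same neuron.

    acg_emb: (N, D) embeddings from the 3D-autocorrelogram CNN
    wf_emb:  (N, D) embeddings from the waveform/template CNN
    """
    acg = F.normalize(acg_emb, dim=-1)
    wf = F.normalize(wf_emb, dim=-1)
    logits = acg @ wf.t() / temperature                        # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)  # matches on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# After training, a neuron's final embedding is the concatenation
# torch.cat([acg_emb, wf_emb], dim=-1) of the two encoder outputs.
```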

LOLCAT is a supervised method that aims to directly predict the cell-type of a neuron from its neural activity alone. The neural activity is first separated into trial-based snippets, and then converted to inter-spike-interval (ISI) distributions within each snippet. Each snippet's ISI distribution is first passed through an MLP encoder, then all snippets are aggregated using multi-head attention, and finally the aggregated latent is passed through an MLP classifier to output the cell-type prediction. This model is trained in a supervised manner.
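A compact sketch of this pipeline follows; the dimensions and specific layer choices are hypothetical on our part.

```python
import torch
import torch.nn as nn

class LOLCATSketch(nn.Module):
    """Illustrative sketch: per-snippet ISI encoder + attention pooling + classifier."""
    def __init__(self, n_isi_bins=100, d=64, n_heads=4, n_classes=5):
        super().__init__()
        self.snippet_mlp = nn.Sequential(nn.Linear(n_isi_bins, d), nn.ReLU(),
                                         nn.Linear(d, d))
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d))  # learned pooling query
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, isi_hists):
        # isi_hists: (B, n_snippets, n_isi_bins) per-snippet ISI distributions
        h = self.snippet_mlp(isi_hists)                  # (B, S, d)
        q = self.query.expand(h.size(0), -1, -1)         # (B, 1, d)
        pooled, _ = self.attn(q, h, h)                   # aggregate across snippets
        return self.classifier(pooled.squeeze(1))        # (B, n_classes) logits
```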

**Q2: Could you analyze/discuss the influence of spike-bin/patch length on model performance, especially for different cell types?**

Thank you for this insightful suggestion! Based upon your suggestion, we performed an experiment where we varied the bin size and measured the class-wise F1 scores for the Allen VC dataset. Please refer to the results in the table below. We find a 20ms bin size to be sufficient for distinguishing Pvalb and VIP neurons. Sst neurons (mostly regular spiking) are slightly more distinguishable at a 10ms bin size, though not by much in comparison to a bin size of 20ms.

Table 1: Effect of bin size on classification of different cell types. Reported F1 scores are under the inductive zero-shot evaluation setting, represented as mean ± standard deviation across 3 seeds.

| Cell type | 1ms bins | 5ms bins | 10ms bins | 20ms bins | 40ms bins |
| --- | --- | --- | --- | --- | --- |
| Pvalb | 0.8212 ± 0.0387 | 0.8217 ± 0.0285 | 0.8507 ± 0.0191 | 0.8670 ± 0.0235 | 0.8656 ± 0.0159 |
| Sst | 0.7204 ± 0.0121 | 0.7292 ± 0.0210 | 0.7546 ± 0.0201 | 0.7422 ± 0.0263 | 0.7174 ± 0.0295 |
| Vip | 0.5113 ± 0.0409 | 0.5144 ± 0.0340 | 0.5466 ± 0.0120 | 0.5725 ± 0.0559 | 0.5191 ± 0.0052 |

We will include these results in the appendix of the camera-ready version of our paper. Additionally, we also measured the aggregate macro F1-score for both Allen VC and IBL datasets with different bin-sizes. Please refer to Table 1 in our response to reviewer 6mEF for the same.
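For concreteness, the binning operation swept in this experiment amounts to the following (a minimal NumPy sketch; the function name and window bounds are our own):

```python
import numpy as np

def bin_spikes(spike_times_s, bin_size_ms=20.0, t_start=0.0, t_end=30.0):
    """Bin a neuron's spike times (in seconds) into fixed-width count bins."""
    edges = np.arange(t_start, t_end + 1e-9, bin_size_ms / 1000.0)
    counts, _ = np.histogram(spike_times_s, bins=edges)
    return counts  # one spike count per bin

# Smaller bins (e.g. 1ms) preserve fine spike timing but yield long, sparse
# sequences; larger bins (e.g. 40ms) smooth over fast dynamics.
```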

**Q3: Could you examine the linear decodability of labels from intermediate layers of the trained transformer model?**

While we train only the output of our model in a contrastive manner, we agree it would be interesting to assess how the neuron-level representations evolve across the network. We present the results of this analysis below, and thank the reviewer for this suggestion. It is very interesting to see how the quality of embeddings improves as we go deeper in the model. Specifically, we observe a step-function increase in the linear decodability when going from layer 2 to layer 3, i.e. when the first spatial transformer layer is encountered. We will add this result as a plot in the appendix of the camera-ready version of our paper, and we are sure it will be appreciated by many readers.

Table 2: Linear decoding results from intermediate layers of the model. Macro-F1 score is the reported metric.

| Layer number | Allen VC | IBL |
| --- | --- | --- |
| 1 (Temporal) | 0.4734 | 0.2862 |
| 2 (Temporal) | 0.4896 | 0.3360 |
| 3 (Spatial) | 0.6254 | 0.4945 |
| 4 (Temporal) | 0.6530 | 0.5147 |
| 5 (Spatial) | 0.7488 | 0.4973 |
| 6 (Temporal) | 0.7457 | 0.5197 |

Note: We performed this analysis on one training seed here in the interest of time, however we will make sure to do this analysis on 4 more seeds before adding this result to the paper.
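For reference, the per-layer probing procedure above is simple to reproduce; here is a minimal scikit-learn sketch (function names ours, assuming frozen embeddings collected at each layer):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_layer(train_emb, train_y, test_emb, test_y):
    """Fit a linear probe on frozen per-neuron embeddings; report macro-F1.

    train_emb, test_emb: (n_neurons, d) activations from one transformer layer.
    """
    clf = LogisticRegression(max_iter=2000).fit(train_emb, train_y)
    return f1_score(test_y, clf.predict(test_emb), average="macro")

# Repeating this for each block's output traces how label information
# accumulates with depth (e.g. the jump at the first spatial layer above).
```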

**W3: It is unclear whether there exist different stimulus conditions in these datasets; if so, whether the conditions are shared between training and test sets.**

The stimulus conditions varied minimally between datasets. Neural recordings in the Allen Visual Coding and Bugeon datasets were collected from mice passively viewing visual gratings and natural images while running on a treadmill. The Steinmetz and IBL datasets involved mice performing a decision-making task using left and right drifting gratings with varying contrast levels.

Each dataset has its own train and test sets, and within each dataset, the train and test sets share the same stimulus conditions. We will include descriptions of these conditions for all four datasets in the camera-ready version of the paper.

Comment

Thank you for your thorough responses and for this wonderful manuscript. I consider my concerns to be adequately addressed.

Review
5

The paper aims to infer neuron identity, i.e. cell type/region, purely from population dynamics via self-supervision. They design their own network, NuCLR, which uses a spatio-temporal transformer encoder with a sample-wise contrastive objective to build representations that can be probed for cell type as well as brain region. Their contrastive loss compares two time windows of the same population of neurons with random neurons dropped. They evaluate cell types using Neuropixels datasets and include brain-region decoding on datasets like IBL to show the efficacy of their approach.

Strengths and Weaknesses

Strengths:

  • Scaling Analysis: The authors show that their approach improves as more unlabeled data is added. This is an impressive finding. Pretraining on unlabeled sessions yields greater downstream gains than doubling labeled examples.
  • Architectural Validation: Ablations show spatial-attention layers are essential for capturing population structure, and neuron-dropout provides meaningful regularization on small datasets.
  • Results: Impressive improvement in neuron and brain region identification in comparison to their chosen baselines.

Weaknesses:

  • Additional ablations: From my understanding, the model doesn’t use precise spike-timing patterns but instead uses uniform binning. This makes sense from the perspective of building a network. But this discards subpatch timing. I didn’t see any ablation on patch length and I was thinking this would be important.
  • Additional baseline: I think a good comparison would also come from using POYO+. The encoder in POYO+ [1] has a fairly similar architecture in my view and could also be probed in a similar way to identify cell type or region. If the authors think the comparison is not feasible, I would also be interested in gaining some intuition between the differences in POYO+ and the NuCLR and the benefits of either approach.

[1] Azabou et. al. Multi-session, multi-task neural decoding from distinct cell-types and brain regions. ICLR 2025.

Questions

Temporal Window: Could you provide a discussion of how to choose Tmax? It seems important, and I would appreciate knowing what the design considerations were.

Limitations

Discussed adequately.

Final Justification

Comparisons to POYO+ made up the biggest share of my concerns. The authors also ran the requested ablations. I am satisfied with the final response.

Formatting Concerns

None

Author Response

Thank you for your time, thoughtful review, and suggestions. We appreciate your recognition of our scaling analysis to be "an impressive finding", and highlighting that our method leads to "impressive improvement in neuron and brain region identification." Based upon your suggestions, we have run two new ablations: a sweep of bin size and a comparison with a POYO-like spike timing tokenization. We provide these results and a point-by-point response below.

--

**W1: Additional ablations: From my understanding, the model doesn’t use precise spike-timing patterns but instead uses uniform binning. This makes sense from the perspective of building a network. But this discards subpatch timing. I didn’t see any ablation on patch length and I was thinking this would be important.**

This is a good point as the precision of spike timing provided to the model could indeed be an important factor for overall performance. To study this effect, we performed an additional experiment where we swept the bin-size from 1ms to 40ms and measured model performance on the Allen VC and IBL datasets for the Inductive zero-shot evaluation setting. We provide the results in the table below, and notice that a bin size of 20ms provides the best performance for both datasets.

Table 1: Bin-size sweep. Results reported below are macro-F1 scores, represented as mean ± standard deviation across N seeds (N given per column).

| Dataset | 1ms bins (N=3) | 5ms bins (N=3) | 10ms bins (N=3) | 20ms bins (N=5) | 40ms bins (N=3) |
| --- | --- | --- | --- | --- | --- |
| Allen VC | 0.6843 ± 0.0286 | 0.6883 ± 0.0260 | 0.7173 ± 0.0036 | 0.7200 ± 0.0267 | 0.7007 ± 0.0104 |
| IBL | 0.5134 ± 0.0047 | 0.5231 ± 0.0118 | 0.5262 ± 0.0093 | 0.5295 ± 0.0040 | 0.5120 ± 0.0036 |

Additionally, we tested a spike-tokenization based encoder similar to that used in POYO [1]. In this modified architecture, instead of binning to create a token for a patch, we use a cross-attention transformer layer to encode POYO-like spike tokens within a patch into a single token; a brief illustrative sketch of this idea is included after the reference below. The rest of the temporal and spatio-temporal layers are kept the same.

Table 2: Spike-tokenization ablation

| Method | Allen VC - Cell type decoding | IBL - Brain region decoding |
| --- | --- | --- |
| 20ms binning (used in the paper) | 0.7200 ± 0.0267 | 0.5295 ± 0.0040 |
| Spike-tokenization based encoder | 0.6665 ± 0.0167 | 0.5268 ± 0.0136 |

We will add both a plot for the bin-size sweep and the details and results of the spike-tokenization encoder ablation to the camera-ready version of our paper. Thank you for the suggestion!

[1] Azabou, Mehdi, et al. "A unified, scalable framework for neural population decoding." Advances in Neural Information Processing Systems 36 (2023): 44937-44956.
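For intuition, the cross-attention patch tokenizer referenced above could look roughly like the following; this is a hypothetical sketch with our own names and dimensions, not the exact ablation code.

```python
import torch
import torch.nn as nn

class PatchSpikeTokenizer(nn.Module):
    """Sketch: compress per-spike tokens within a patch into one patch token."""
    def __init__(self, d=128, n_heads=4):
        super().__init__()
        self.time_proj = nn.Linear(1, d)             # embed within-patch spike time
        self.patch_query = nn.Parameter(torch.randn(1, 1, d))
        self.xattn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, spike_times, pad_mask):
        # spike_times: (B, max_spikes, 1) times relative to patch start
        # pad_mask:    (B, max_spikes) True where padded (no spike)
        tokens = self.time_proj(spike_times)                  # (B, S, d)
        q = self.patch_query.expand(tokens.size(0), -1, -1)   # (B, 1, d)
        patch_tok, _ = self.xattn(q, tokens, tokens,
                                  key_padding_mask=pad_mask)
        return patch_tok.squeeze(1)  # (B, d): one token per patch, replacing binning
```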

**W2: Additional baseline: I think a good comparison would also come from using POYO+. The encoder in POYO+ has a fairly similar architecture in my view and could also be probed in a similar way to identify cell type or region. If the authors think the comparison is not feasible, I would also be interested in gaining some intuition between the differences in POYO+ and the NuCLR and the benefits of either approach.**

The POYO+ architecture is aimed at behavior decoding and thus, by design, it compresses the population activity into a smaller set of latent tokens to learn population-level dynamics, losing the identity of individual neurons in the process. This is very different from the goal of the model architecture we propose in NuCLR, which is specifically designed to preserve neuron-level tokens throughout the model and output neuron-level embeddings.

While it is true that POYO+ was able to perform cell-type classification, this was because, in the calcium recordings it analyzed, each recording comes from a specific Cre line and thus contains cells from only a single cell class. Due to this dataset-level correspondence to cell type, it was possible to decode the Cre line from the encoder latent outputs of each recording to classify the cell type of the neurons included in that recording (not unit embeddings). This is a special case that does not generalize to the electrophysiology datasets considered in our work, or to the Bugeon et al. spatial transcriptomics dataset, which contains simultaneous recordings from multiple cell types.

**Q1: Temporal Window: Could a discussion on how to choose Tmax? It seems important but I would appreciate what the design considerations were.**

We performed a manual sweep for all hyperparameters on the IBL dataset and measured performance on a validation subset of insertions (included in the attached code base). This process gave us a best value of $\Delta T_{\max}$ of 30s. We used the same value for the other two electrophysiology datasets (Allen VC and Steinmetz et al.) and found that this choice generalizes well. For the Bugeon et al. dataset, which is recorded using calcium imaging and has a different timescale, we performed a similar sweep to reach a best $\Delta T_{\max}$ of 240s.

Comment

I thank the authors for their response. However, I am not fully convinced by the response to my request for a comparison with POYO+, as mentioned by reviewer KHtW. POYO+ builds per-neuron embeddings, and the distinction drawn around cell-type specificity and Cre-line measurement is not fully clear to me. I think reviewer KHtW has summarized my concerns about this comparison well. I will keep my score.

Comment

Dear reviewer 6mEF: Thanks for your continued engagement! We have just posted a response to reviewer KHtW, and wanted to summarize our response here as well regarding POYO+.

We want to clarify that while POYO and POYO+ architectures indeed learn per-unit embeddings, the analysis in POYO+ did not use these unit embeddings to do cell-type or region classification. Instead they state that they perform the classification on the session-averaged latents. Specifically, in Section 3.3 ANALYSIS OF LATENT EMBEDDINGS of POYO+, the authors state "we examined the latents at the output of the encoder in POYO+", "To obtain the session-level latent embeddings, we average the latents across randomly sampled 1s context windows in the recording." Additionally, the caption of Figure 4 of POYO+ states "Balanced accuracy for brain area classification based on hand-crafted features versus session-averaged latents from POYO+," and "Balanced accuracy for Cre-line classification based on handcrafted features versus session-averaged latents from POYO+."

In summary:

  1. In POYO+, unit embeddings were not used for the Cre-line or brain region classification. Instead, session-averaged latent outputs of the encoder were used.
  2. The cell-type, region, and layer classification problems were treated as session-level tasks.
  3. In the Allen Brain Observatory dataset analyzed in POYO+, each session is from a specific Cre-line, a specific brain region, and from a single depth (layer). There are no sessions with information about labeled cell types that have a mixture of different types present. In other words, there’s only one type of neuron observed at a time.

As far as we are aware, the unit embeddings of POYO and POYO+ have not been evaluated on cell-type or brain region classification tasks in any prior works. We would appreciate it if the reviewer could reconsider their score in light of this clarification.

Comment

Thanks, this clarification was indeed very helpful. You are correct about session-averaged latents; I had not recalled the exact setting for POYO+ and its initial encoder architecture.

I believe my concerns were adequately addressed. Thank you.

Comment

Thank you for your reply! We are glad we were able to resolve the matter.

We still wanted to see how the performance of the unit embeddings of POYO+ compares to NuCLR. We are excited to report that we were able to train POYO+ on the Allen VC dataset, and provide its performance for cell-type classification based on its unit embeddings below. We note that NuCLR considerably outperforms the unit embeddings of POYO+ on this task. This result is only provided for a single seed due to the compute-intensive nature of training POYO+; however, we will run multiple seeds and report these accuracies in the final paper.

Table 1: Macro-F1 scores for cell-type classification on the Allen VC dataset under transductive zero-shot evaluation.

| POYO+ | NuCLR | NeuPRINT | NEMO |
| --- | --- | --- | --- |
| 0.3762 | 0.7218 ± 0.0113 | 0.3999 ± 0.0312 | 0.4256 ± 0.0114 |

For training POYO+ on Allen VC, we used the behavior tasks listed below, and report its decoding accuracy for each upon convergence:

  • Drifting Gratings Orientation: 95.66% balanced accuracy (chance: 12.5%)
  • Drifting Gratings Temporal Frequency: 94.83% balanced accuracy (chance: 20.0%)
  • Gabor Orientation: 57.88% balanced accuracy (chance: 25.0%)
  • Gabor Position: 0.7677 R²
  • Natural Scenes: 68.13% balanced accuracy (chance: 0.84%)
  • Running Speed: 0.7534 R²
Final Decision

This paper introduces NuCLR, a novel self-supervised learning framework for functional cell-typing and neuron localization. The approach, which incorporates fine-grained population dynamics through a spatio-temporal transformer and contrastive learning, is well-motivated and addresses a clear unmet need in neuroscience. The reviewers agree that the paper's core findings are impressive and well-supported. The scaling analysis shows that NuCLR improves with more unlabeled data, a critical property for practical use. The ablations and extra experiments during the rebuttal were convincing. Please follow through on your commitment to update the manuscript by incorporating the suggested references and a revised related works section; this commitment was also a significant factor in the recommendation.