PaperHub
Overall score: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 5, 4, 3 (min 3, max 5, std 0.7)
Confidence: 4.3
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

SimSort: A Data-Driven Framework for Spike Sorting by Large-Scale Electrophysiology Simulation

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
Spike Sorting, Neuroscience

Reviews and Discussion

Review
Rating: 4

The authors propose to pretrain a neural network for spike detection on a large simulated dataset, before using a contrastive learning approach to cluster spikes. They show improvements upon existing spike sorters on hybrid datasets, and show the results of SimSort on a real tetrode dataset.

Strengths and Weaknesses

Strengths: Novel use of pretraining for spike sorting; novel combination of methods for spike sorting (the clustering method is inspired by CEED, which has not been used for spike sorting so far); good evaluation on hybrid datasets; the zero-shot vs. fine-tuning analysis and the data-size scaling-law analysis are interesting.

Weaknesses: Lack of comparisons to existing spike sorters on real-world data; only a low-dimensional real-world dataset (tetrode) is considered; generalization capabilities are not clearly demonstrated (hybrid datasets are not necessarily "out of distribution", and the real-world dataset is fairly simple); limitations are not properly discussed (only one limitation is highlighted).

Questions

One question concerns the evaluation on real data.

I am surprised it does not catch more low-amplitude spikes (Figure 5). It is also surprising that even the lower-amplitude unit does not violate the ISI. These spikes are very hard to cluster accurately, but can still be very informative (multi-unit activity). Have you tried to quantify how many spikes it misses during detection?

Those questions could be addressed by comparing to an existing sorter on the real-world data. If SimSort finds as many spikes as Kilosort, I would find it convincing. You could, for every unit, produce a Venn plot plus ISI and waveform plots comparing the output of SimSort and KS; SpikeInterface provides a way to compare two spike sorters. More generally, I think a comparison to KS on real data is necessary: it would greatly improve the quality of the paper, and I would be willing to improve my score if the authors show this during the rebuttal period.
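As a concrete starting point, a minimal sketch of such a pairwise comparison with SpikeInterface; `sorting_simsort` and `sorting_ks` are assumed to be sorting objects already loaded from each sorter's output.

```python
# Minimal sketch (not from the paper): pairwise sorter comparison with
# SpikeInterface. `sorting_simsort` and `sorting_ks` are assumed objects.
import spikeinterface.comparison as sc
import spikeinterface.widgets as sw

cmp = sc.compare_two_sorters(
    sorting1=sorting_simsort,
    sorting2=sorting_ks,
    sorting1_name="SimSort",
    sorting2_name="Kilosort",
)

# Matched units and agreement scores: the basis for per-unit Venn-style
# summaries of spikes found by both sorters vs. only one.
print(cmp.get_matching())
sw.plot_agreement_matrix(cmp)
```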

Have you checked whether or not there is drift in your recording? Is it possible that units 2 and 6 are the same unit separated by drift? I think running DREDGE (Windolf et al., 2025) to preprocess the data, or just to quantify the amount of drift, would greatly improve your preprocessing and subsequent analysis, especially as no augmentation in your contrastive approach accounts for drift.

I am generally against whitening the recording, as the noise is rarely Gaussian and removing dependencies between channels can hurt downstream performance. Have you tried other preprocessing pipelines and quantified their effects?

Why not try the full spike sorting pipeline on the IBL Neuropixels dataset (rather than just selecting spikes from good units and clustering them with the contrastive approach)? If you have the data and the pipeline, can't you run it on the IBL dataset and compare to KS (whose results are available on the IBL datasets)? I would suggest adding localization features to the contrastive learning features to perform clustering.

Limitations

I believe that the limitations are not properly addressed by the authors. The only limitation mentioned is that SimSort has only been trained and tested on tetrode data. I agree that this is a major limitation, but I have other concerns.

It is not clear whether or not SimSort is able to generalize outside of the training data. More precisely, it may capture only "easy" spikes. Figure S8 is misleading as it shows the "missing GT", but hybrid recordings have more spikes than the injected spikes. Not finding any additional spikes is concerning, especially since, looking at the raw traces, there are other events that could look like small spikes. Again, the evaluation of detection should be complemented by a comparison to other sorters. Finding only the hybrid units would be indicative of bad performance, even though the metrics would be good. Could the authors clarify whether or not the sorter learns additional units, and evaluate the quality of these units? The training dataset also seems to have very high SNR (Figure S10). Why not increase the noise here to match real recordings?

The authors claim to report statistical significance of experiments (checklist), but this is not done in Table S4. I think the authors should indicate there if there is no statistically significant difference. Also, this table should, in my opinion, be moved to the main text.

Justification for Final Rating

I am giving a "borderline accept" as the paper presents a cool, novel framework with rigorous evaluation for spike sorting in tetrode probes, but I have doubts about the generalization capacities of the algorithm to multiple probes and brain regions.

Formatting Concerns

No formatting concerns

Author Response

Thank you for your detailed and thoughtful review. We appreciate your suggestions regarding the evaluation and the need for a more thorough discussion of the limitations. We address your concerns point by point below.

1. Real-data comparison with Kilosort and detection of low-amplitude spikes (W1, Q1)

SimSort may capture only "easy" spikes?

Thank you for raising this important question. We agree that evaluating a spike sorting method’s sensitivity to low-amplitude events is essential. To assess this, we compared the total number of spikes detected by SimSort and Kilosort2.5 across six real tetrode recordings. We report the overall number of detected spikes per recording as a preliminary comparison:

| Recording | Kilosort | SimSort |
|-----------|----------|---------|
| 1         | 11,495   | 12,891  |
| 2         | 8,906    | 16,855  |
| 3         | 16,257   | 18,518  |
| 4         | 5,538    | 8,933   |
| 5         | 16,685   | 13,498  |
| 6         | 19,063   | 16,750  |

SimSort detected a comparable or higher number of spikes in four out of six recordings. In recordings 5 and 6, however, SimSort detected fewer spikes than Kilosort. Upon manual inspection, we found that many of the Kilosort-only units in these recordings were detected primarily on a single channel, with no corresponding activity on adjacent channels.

This detection pattern may reflect the statistical structure of the simulated data used during SimSort pretraining. In our simulation framework, most spikes are generated by neurons whose extracellular potentials are visible across several nearby electrodes. Although we included randomized channel-wise noise augmentation during training, it remains uncommon for simulated spikes to be confined to a single channel while others contain only background fluctuations. This may have led SimSort to be less sensitive to spatially sparse spikes. We consider this a potential limitation of the current framework, which we will discuss in the revised limitations section, and improve in the future version.

To further investigate whether SimSort is biased toward high-amplitude spikes, we analyzed the SNR distribution of all spikes detected by SimSort in the six real recordings:

| Recording | Total Spikes | Mean SNR | Median SNR | Max SNR | Min SNR |
|-----------|--------------|----------|------------|---------|---------|
| 1         | 12,891       | 5.71     | 5.48       | 15.69   | 1.63    |
| 2         | 16,855       | 5.85     | 5.45       | 17.86   | 2.05    |
| 3         | 18,518       | 8.05     | 7.19       | 22.16   | 1.93    |
| 4         | 8,933        | 5.65     | 5.10       | 14.43   | 1.82    |
| 5         | 13,498       | 5.55     | 5.05       | 13.90   | 1.56    |
| 6         | 16,750       | 5.68     | 5.51       | 14.07   | 1.94    |

These results show that SimSort detects a wide range of spikes in terms of SNR, including many low-amplitude events with SNR below 2.0. This indicates that the model is not limited to high-SNR spikes and is capable of recovering lower-amplitude spikes in real recordings.

The training dataset also seems to have very high SNR (figure S10). Why not increase the noise here to match real recordings?

Thank you for pointing this out. The trace shown in Figure S10 was selected primarily for visualization purposes and may not reflect the full range of signal-to-noise ratios (SNR) in our training dataset. To address the concern that our training data might have uniformly high SNR, we clarify that the simulated dataset includes a broad spectrum of SNRs, encompassing a substantial number of low-SNR spikes. SNR was computed using a robust method based on the median absolute deviation (MAD). A summary of the SNR distribution is provided below:

| Metric     | Value         |
|------------|---------------|
| # Spikes   | 144,690,960   |
| SNR Range  | 0.62 – 169.89 |
| SNR Median | 22.52         |
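For clarity, a minimal sketch of a MAD-based SNR computation of the kind described above; the paper's exact definition may differ.

```python
# Minimal sketch of a MAD-based SNR; the paper's exact definition may differ.
import numpy as np

def mad_noise_level(trace: np.ndarray) -> float:
    """Robust noise estimate: MAD scaled to match the std of Gaussian noise."""
    return np.median(np.abs(trace - np.median(trace))) / 0.6745

def spike_snr(waveform: np.ndarray, noise_level: float) -> float:
    """Peak absolute amplitude of a spike waveform relative to the noise level."""
    return float(np.max(np.abs(waveform)) / noise_level)
```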

2. Some questions about the preprocessing steps of real recordings (Q2, Q3)

Is there drift in the real recordings?

The real recordings in this study are relatively short (∼250 seconds), during which significant drift is unlikely to occur. We did not observe noticeable drift patterns in the data and therefore did not apply drift correction. We agree, however, that drift can be a critical factor in longer or high-density recordings. We appreciate the suggestion to consider tools such as DREDGE, and plan to incorporate drift quantification and, if necessary, correction modules into future versions of the pipeline.

SimSort’s current contrastive learning framework does not include augmentation for drift. This was a deliberate design choice: for tetrode data, the concept of drift is inherently ambiguous due to the small number of channels and limited spatial resolution. As a result, we did not apply channel invariances during training. Nonetheless, we see drift modeling as an important direction when adapting SimSort to high-density probes.

Should the whitening step be included in the preprocessing pipeline?

Thank you for the suggestion. In this work, we followed the preprocessing pipeline used in established spike sorting methods such as Kilosort and MountainSort, which includes spatial whitening in signal preprocessing.

To assess its impact, we conducted an ablation study by replacing spatial whitening with data standardization. As shown below, sorting performance decreased consistently across both datasets:

| Method                         | Accuracy (Static) | Recall (Static) | Precision (Static) | Accuracy (Drift) | Recall (Drift) | Precision (Drift) |
|--------------------------------|-------------------|-----------------|--------------------|------------------|----------------|-------------------|
| SimSort (w/ whitening)         | 0.62 ± 0.04       | 0.68 ± 0.03     | 0.77 ± 0.03        | 0.56 ± 0.03      | 0.63 ± 0.03    | 0.69 ± 0.03       |
| No spatial whitening (z-score) | 0.47 ± 0.03       | 0.56 ± 0.03     | 0.64 ± 0.03        | 0.47 ± 0.02      | 0.56 ± 0.02    | 0.62 ± 0.02       |

The results show a consistent decrease in accuracy, recall, and precision when spatial whitening is removed from the preprocessing pipeline.
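For reference, a minimal sketch of the spatial (ZCA-style) whitening step being ablated here, assuming a channels-by-samples array; the pipeline's actual implementation may differ.

```python
# Sketch of spatial (ZCA-style) whitening on a (channels x samples) array.
import numpy as np

def spatial_whiten(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Decorrelate channels via W = C^{-1/2}, with C the channel covariance."""
    cov = np.cov(x)                       # (n_channels, n_channels)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return w @ x
```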

3. Full pipeline evaluation on IBL Neuropixels data (Q4)

Why not try the full spike sorting pipeline on the IBL Neuropixels dataset?

We appreciate your suggestion to evaluate SimSort as a full spike sorting pipeline on IBL Neuropixels recordings.

SimSort was originally developed and pretrained on tetrode recordings, and directly applying it to high-density Neuropixels data would require adapting the detection model to handle broader spatial input. In this submission, we focused on evaluating the performance of the identification model using Neuropixels waveforms extracted by Kilosort2.5, following the CEED protocol for a fair comparison of contrastive learning approaches.

We thank you for the helpful suggestion to incorporate localization features into the contrastive representation. We agree this could further enhance the model's applicability to high-density recordings and will test it in future extensions.

4. Does SimSort find additional units in the hybrid dataset? (L1)

Could the authors clarify whether or not the sorter learns additional units, and evaluate the quality of these units?

Thank you for raising this important point. To examine whether SimSort is limited to detecting only the injected ground-truth spikes, we compared the total number of detected spikes on the hybrid-static dataset. The results are summarized below:

| Recording ID    | SimSort Detected | Ground Truth Spikes | Kilosort Detected |
|-----------------|------------------|---------------------|-------------------|
| rec_4c_1200s_11 | 74,249           | 72,168              | 60,767            |
| rec_4c_1200s_21 | 94,154           | 78,274              | 78,220            |
| rec_4c_1200s_31 | 89,216           | 87,027              | 79,308            |
| rec_4c_600s_11  | 37,173           | 36,120              | 30,551            |
| rec_4c_600s_12  | 37,119           | 36,048              | 30,547            |
| rec_4c_600s_21  | 46,996           | 52,874              | 38,851            |
| rec_4c_600s_22  | 47,142           | 47,556              | 38,837            |
| rec_4c_600s_31  | 44,619           | 43,367              | 40,380            |
| rec_4c_600s_32  | 44,605           | 43,305              | 39,029            |

To better address this question, we respectfully refer you to Figures S11 and S13. Figure S11 shows the identification results based solely on spikes with ground-truth annotations, while Figure S13 shows the output when SimSort performs both detection and identification on the raw hybrid recordings. In Figure S13, more units are identified than in Figure S11, including additional clusters that are not part of the injected ground-truth set. Many of these clusters exhibit well-formed waveforms, suggesting that they likely correspond to meaningful neural activity.

A similar observation can be made from the comparison between Figures S12 and S14. These results indicate that SimSort is capable of discovering additional spike events beyond the injected ground-truth units.

5. Statistical significance in Table S4 (L2)

Thank you for pointing out this important issue, and we apologize for the omission. The comparison between SimSort and CEED yields p = 0.0012, indicating a statistically significant difference (p < 0.05). The comparison between SimSort and CEED_UMAP yields p = 0.0554 and is not statistically significant. We will include these results in Table S4 and move the table to the main text in the final version.
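The response does not state which test produced these p-values; as an illustration only, a paired nonparametric test over matched per-recording scores could be computed as sketched below, with `simsort_scores` and `ceed_scores` as assumed arrays of metric values.

```python
# Illustration only: the test used in the response is not specified.
from scipy.stats import wilcoxon

stat, p = wilcoxon(simsort_scores, ceed_scores)  # paired per-recording scores
print(f"p = {p:.4f} ({'significant' if p < 0.05 else 'not significant'} at alpha = 0.05)")
```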

We sincerely appreciate your thoughtful and constructive review. We understand that a key concern is the comparison with existing spike sorters on real-world data. Due to the format constraints of the rebuttal stage, we have presented as much comparative analysis with Kilosort as possible to address this point. We also plan to include a more comprehensive discussion of these issues in the limitations section of the final version. We hope these efforts help clarify the performance of SimSort. If we have misunderstood any part of your feedback, or if you have further questions or suggestions, we would be happy to continue the discussion.

Comment

I want to thank the authors for the detailed responses to my questions.

  1. Detection of low-amplitude spikes: detecting more spikes than Kilosort is good evidence that SimSort is not biased towards "easy" units. The SNR distribution analysis is a great addition to the paper, and I imagine that with figures / revisions to the manuscript this analysis could be strengthened even further. However, the fact that SimSort struggles with single-channel units tends to indicate that it has a hard time generalizing to waveforms that are not in the training data. I will detail more below.
  2. Preprocessing: The whitening step is generally OK but also completely removes the information in some recordings. I believe that a good spike sorting pipeline should have "flexible" preprocessing in order to easily adapt to the specifics of the recording. This is, in my opinion, one of the reasons that pretrained/supervised deep learning approaches haven't had great success at spike sorting before, and is also one reason I'm concerned about the practical adaptability/generalizability of this approach. For example, when recording from the cerebellum, if you whiten, you lose most of the signal because of the high firing rate of units. If you don't whiten and the noise is then not Gaussian, will the approach work?
  3. Full pipeline evaluation on IBL Neuropixels data. This is unfortunately the main issue with the paper. I will develop further below.
  4. Figures S11-S14 are interesting; thank you for pointing them out to me. However, I see some overmerges in the results. It would be good to run some analysis to quantify oversplits/overmerges. These figures don't really allow much insight into the true performance of SimSort, although they indicate that the found waveforms generally correspond to neural activity.
  5. Thank you for adding the statistical significance.

My main concern about the transferability of this approach to more datasets / various recordings (from different brain regions and different electrodes) remains after seeing the responses and analysis from the authors. Although the approach is novel (even if the clustering algorithm is very similar to CEED and the simulated dataset relies on existing physical models), and the idea of applying large pretrained neural networks to spike sorting is appealing and timely, it is not clear that this approach will be of practical use to the neuroscience community if it doesn't generalize outside of the training domain. The reasons why it is not clear (and not shown by the authors) that it would generalize are the following:

  • Single-channel units, which are not represented in the training data, are not captured well, which "may reflect the statistical structure of the simulated data used during SimSort pretraining". This unfortunately makes me think that SimSort might not capture units that do not have a "standard" shape, and might not work well in different brain regions.
  • By design, I believe the approach is not very modular. However, in the context of spike sorting (which generally requires manual tuning/preprocessing/correction even when running Kilosort), being simple and modular is a strength. The use of whitening in the preprocessing makes me dubious about the capacity of the model to work when the noise is non-Gaussian (a common case in spike sorting).
  • SimSort hasn't been demonstrated on Neuropixels or probes other than tetrodes. More importantly, it has not been shown that the idea developed in the paper is applicable to other probes (i.e., how would you construct such a dataset for other types of probes, how would you deal with the different challenges, etc.).

Because of the lack of evidence that this application-focused paper is ready to be a practical tool for the neuroscience community, I do not wish to upgrade my score and lean towards rejecting the paper.

Comment

Thank you for the careful follow-up and for holding us to a high standard on transferability and practicality. To avoid over-claiming, we first restate the scope: this submission targets tetrodes, which are widely used by neuroscientists (many recent studies in Cell/Nature/Science still use tetrodes [1-5]). Our end-to-end pipeline (detection → identification/clustering) and all claims are calibrated to 4-channel recordings. Results on Neuropixels were identification-only (following the CEED protocol) to show representation transfer, not a full HD-probe pipeline claim.

Transferability beyond tetrodes

While we do not claim a Neuropixels detector in this paper, our identification-only evaluation on IBL waveforms shows that the learned representation performs on par out of domain with CEED (which is trained in domain). Combined with the scaling-with-data analysis in the paper, this supports the thesis that broader training coverage (including new geometries and event types) will improve generalization. If accepted, our immediate next steps are to: (i) extend the simulation to spatially sparse/single-channel events, (ii) add channel-drop/geometry-aware invariances to contrastive training (see the sketch below), and (iii) train on multi-geometry corpora (tetrodes + high-density probes) to support a Neuropixels detector.
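As an illustration of step (ii), a hypothetical channel-drop augmentation could look like the sketch below; the function name and drop probability are assumptions, not the authors' implementation.

```python
# Hypothetical channel-drop augmentation: randomly zero a subset of channels
# so the representation tolerates spatially sparse spikes.
import numpy as np

def channel_drop(waveform: np.ndarray, p: float = 0.25) -> np.ndarray:
    """waveform: (channels, samples). Zero each channel with probability p."""
    keep = np.random.rand(waveform.shape[0]) > p
    if not keep.any():  # always keep at least one channel
        keep[np.random.randint(waveform.shape[0])] = True
    return waveform * keep[:, None]
```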

What the community gets today (within scope).

For tetrodes, SimSort offers (i) zero-shot improvements on Hybrid (best Accuracy & Precision in static; best Accuracy & Recall under drift) and WaveClus-difficult, (ii) physiologically meaningful real-data units (clear refractory structure; tuning), and (iii) a data-driven pretraining recipe that scales with additional, more diverse training data.

Concrete commitments:

  1. Configurable preprocessing in the released code (whitening on/off; alternatives).
  2. Diagnostics for isolation and merge/split in the camera-ready + repo.
  3. Simulator/augmentation extensions to cover single-channel sparsity and multi-geometry data; document this plan in the limitations/future-work.

Clarifying scope (NeurIPS vs. Top Journal)

Our submission meets NeurIPS criteria -- introducing a novel paradigm of simulation-pretraining, demonstrating robust zero-shot transfer, rigorous validation, and reproducibility through public code/data release. While extensive multi-probe evaluation fits top-journal-level work (like Nature Methods, Neuron, Nature Neuroscience etc.), we believe the current submission clearly surpasses the NeurIPS acceptance bar.

Thank you again for strengthening our work through your valuable feedback; we would appreciate it if you could reconsider your evaluation.

[Reference]

[1] Eliav et al., Fragmented replay of very large environments in the hippocampus of bats. Cell (2025).

[2] Harkin et al., A prospective code for value in the serotonin system. Nature (2025).

[3] Chen et al., Brain region–specific action of ketamine as a rapid antidepressant. Science (2024).

[4] Jun et al., Prefrontal and lateral entorhinal neurons co-dependently learn item–outcome rules. Nature (2024).

[5] Aldarondo et al., A virtual rodent predicts the structure of neural activity across behaviours. Nature (2024).

Comment

Thank you for acknowledging that we have clarified the scope of the paper and its contribution to the community. We are glad that our rebuttal helped explain why we chose to evaluate clustering on Neuropixels data.

We also appreciate your recognition of the following contribution:

(i) Zero-shot improvements on Hybrid → This is demonstrated in the paper.

Regarding (ii) and (iii), we fully agree that additional evaluation on published real-world datasets would provide stronger evidence to support adoption by experimental neuroscientists. SimSort is not an incremental extension of existing sorters but a new framework, and we recognize that building community trust takes time and sustained effort. In this submission, we present evidence that SimSort generalizes across simulated, hybrid, and real tetrode recordings (Fig. 5), which we believe is an essential first step. Just like Kilosort, which recently released its fourth version, SimSort will continue to evolve, and your feedback is an important part of that process.

I think a proper spike sorter should be evaluated when run end-to-end on a real dataset (non-hybrid).

We agree. In this submission, we evaluated SimSort end-to-end on tetrode recordings collected in our own lab (Fig. 5), using a classic visual neuroscience paradigm. Please let us know if you have any questions about this real data. At the same time, we recognize that reproducing published findings from existing datasets would further strengthen the practical relevance. We appreciate the suggestion.

We sincerely appreciate your thoughtful feedback and will carefully reflect it in the final version.

Comment

Thank you for your response. The analysis in Figure 5 is very relevant, as it shows applicability to a real recording and that SimSort is able to find visually-responsive units within a "classic visual neuroscience paradigm".

I still have doubts about the generalization capacities of SimSort (especially to different brain regions, as the visual cortex is relatively "simple" compared to other regions), but I now believe that the submission meets the NeurIPS criteria, as it proposes a novel framework for spike sorting with good evaluation.

I am thus bumping my score and won't stand in the way of acceptance. I hope the authors will later show a multi-probe, multi-brain-region evaluation of the sorter, which could, as pointed out, result in a top neuroscience journal publication.

Comment

Thank you for your thoughtful review and active discussion. We appreciate your updated score and will pursue the broader evaluations you suggested.

Comment

I thank the authors for clarifying the scope of the paper and the contribution to the community.

I now understand why the authors chose to evaluate the clustering on Neuropixels data: to show that the learned representation performs on par out of domain with CEED (which is trained in domain).

However, if the contribution is for tetrodes, why not focus the paper on tetrodes and make sure that each of the contributions is clear? More precisely:

(i) zero-shot improvements on Hybrid --> This is shown in the paper

(ii) physiologically meaningful real-data units --> Why not run the sorter on the data from some of the 5 papers cited in the response above and show reproducibility (or even improvement) of the downstream task? This would also address my concerns related to generalization across multiple brain regions / cell types.

(iii) data-driven pretraining recipe that scales with additional, more diverse training data --> Show how the pipeline can be used "end-to-end" on one or more real datasets + downstream task.

To summarize, given the clarified scope and contribution of the paper, I think it is important to show the full output of the pipeline on a real tetrode dataset (i.e., non-hybrid). This can then be evaluated by looking at a downstream task (similar to the papers cited above) or by hand validation. This is similar to the "Full pipeline evaluation" comment in my original response. I think a proper spike sorter should be evaluated when run end-to-end on a real dataset (non-hybrid).

I believe that such an analysis is also needed to convince the neuroscience community of the usefulness of SimSort.

Also, I think showing either (ii) or (iii) should be sufficient, but it seems that the authors tried to show both (either a spike sorter that leads to meaningful units in tetrode recordings, or a recipe that generalizes to many datasets) without properly showing one or the other. I would recommend focusing on one application, either a general spike sorting recipe or a tetrode pipeline that allows efficient and accurate processing of tetrode recordings, to strengthen the paper.

Review
Rating: 5

The paper describes a dataset generated with a biophysical simulation leveraging neuron models published by the BBP in 2011 to generate a synthetic spike-sorting dataset using the NEURON simulator. The dataset simulates 6 cortical layers and a 4-channel tetrode inserted inside the cortex model. In the simulation for generating the data, the cells are not recurrently connected, and they are stimulated with independent noise currents (this is my understanding; it could be wrong because the information is not very clearly stated). I could not find details on the morphology and 3D placement (missing information from the main text). The resulting dataset consists of 8k trials of 10 minutes and a total of 40k units.

The synthetic dataset is then used to pre-train the proposed SimSort algorithm, a deep learning spike sorting algorithm composed of two parts: a Transformer (encoder only), which takes extracellular recordings sampled at 0.1 ms and determines the presence/absence of a spike at each time step via binary classification, and a GRU model, which takes the waveform as input and returns an embedding used to determine the spiking cell identity.

The SimSort algorithm is tested zero-shot (without fine-tuning) on other spike sorting datasets (these are also synthetic datasets, but created differently). Without fine-tuning, the model already outperforms other spike sorting models and algorithms on most metrics, which is remarkable. When fine-tuned, the model is a bit better. Unsupervised clustering of cell types in layer 6 appears possible, even though these synthetic neuron models were not used identically in the training set.

Strengths and Weaknesses

Main comment: The paper is excellent and undoubtedly makes progress for deep spike sorting algorithms, but details and citations are missing from the main text about the synthetic dataset. It should be stated much more prominently that the model does not include recurrent interactions and that the neurons are driven by independent pink noise (no recurrent interaction is simulated; in the absence of details, I also assume that no morphology is simulated). Note that other recent papers (including from the BBP team and elsewhere) have gone deeper to generate realistic data for spike sorting, and this should be acknowledged [1]. I believe being more precise and transparent about the dataset in the main text will not undermine the validity and strength of the paper, which has already achieved a lot by combining a deep spike sorting algorithm with a reasonably simple synthetic dataset. The simplicity of the dataset is rather a strength in this case.

Strengths:

  • This paper is timely and follows a current trend that is likely to change the field. Synthetic data for BCI will become very important.
  • The paper is very clear and well written, and the code appears to be detailed and exhaustive.
  • SimSort outperforms all existing spike sorting algorithms (even without fine-tuning), and everything indicates that the algorithm is ready to be used by wet labs (demonstrated on in-house data, although without ground-truth spike times, which are very difficult to obtain with in vivo data).
  • The model training is explained efficiently but with great clarity; everything is understandable and, given access to the dataset, appears to be reproducible.

Weaknesses: A) A1. The paper is very clear and well written. Even the appendix is well written, from what I saw. Maybe the appendix sections could be referenced more appropriately in the main text, in a relevant place, to inform the reader of their existence. A2. The font size of the figures is often too small, making the figures impossible to read depending on the medium. This is easy to fix, and a pity given the high quality of the paper and the effort invested in everything else.

B) B1. The dataset itself is not the best of its sort, since the BBP group recently published a more advanced model to study spike sorting [1] (important missing citation), which includes, in particular, recurrent interactions and careful 3D placement and connectivity that are not commented on in this model. I assume that the simulation does not even need to simulate all cells together, since there is strictly no interaction between cells. If so, it would be useful to clearly state this, rather than letting the reader find out. To be fair, this is not necessarily a limitation; the simplicity of independent neuron simulation is rather a strength.

B2. The 3D morphology and recurrent connections are important for accurate spike sorting because of spike synchrony and because the waveform is affected by the spatial position of cells [1] (this should be acknowledged as a limitation).

B3. It would be useful for the reader to know that trials in the dataset are 10 minutes long (Table S1 could simply be put in the main text, or at least the trial duration and time step given clearly in the main text, pointing to Table S1 and the appendix for details).

B4. Missing information in the main text: I assume that the neurons are simulated completely independently, and the waveforms are added depending on the 3D placement of the neurons. This could be clarified in the main text. Also related: how many neurons are simulated "in the same 3D volume" around an electrode at the same time?

B5. Is it possible/frequent for two neurons to emit a spike in the same time bin? How is this case handled?

C) I really like the description of the spike sorting algorithm, but I missed a better summary of related deep learning models for spike sorting. For instance, is it standard to do spike detection via binary classification of spike events? Which papers have already used deep learning to classify waveforms, and did any do so without supervision?

[1] Laquitaine et al. 2025 https://www.biorxiv.org/content/10.1101/2024.12.04.626805v2

Questions

Scientific questions:

  • At first sight, the dataset appears extremely satisfactory, even if the morphology and connectivity are not simulated. What additional benefits or difficulties should we encounter with more complex and detailed models?
  • What are the failure modes expected when training with this dataset (interfering spike shapes, distant neurons, or other problems specific to this synthetic dataset)?
  • What are the challenges for validating the model, and what type of data should be included to test the limitations of the model?

In coherence with the weaknesses above, I would suggest putting the missing information about the dataset in the main text:

  • 3D placement of the neuron
  • absence of morphology?
  • absence of recurrent connection? Is the pink noise the only drive?
  • how many neurons are potentially simulated around a single electrode?
  • The 10-minute trial duration should be stated in the main text.
  • An added reference to Laquitaine et al. 2025 would be quite important, in my opinion, to reflect the context of synthetic datasets from the BBP. It would also be relevant to cite more advanced modeling papers that have put a lot of effort into modeling realistic networks (this includes Markram 2015 [2] and work from the Allen Institute [3]; other relevant models are welcome, of course).
  • Adding a bit more on the related work on deep networks for waveform clustering, and deep networks for spike sorting would be useful.

[2] Markram et al. 2015 https://www.cell.com/cell/fulltext/S0092-8674(15)01191-5

[3] Billeh et al. 2019 https://www.cell.com/neuron/fulltext/S0896-6273(20)30067-2

Limitations

See above, no potential negative societal impact.

Justification for Final Rating

During the discussion period, the authors addressed my concerns. Given their comments, I understand that the final manuscript is improved, and I increased my grade to 5.

Formatting Concerns

A minor concern that could be fixed during the review round: the font size for the panel legends and ticks is tiny and not readable.

Author Response

Thank you for your detailed and constructive review. We sincerely appreciate your positive assessment of SimSort’s clarity, reproducibility, and potential impact, as well as your thoughtful comments regarding the simulation dataset. Below we address your concerns point by point:

1. Clarification on morphology and modeling assumptions (B1–B5, Q4)

Thank you for your insightful comments regarding the dataset design. We appreciate your careful reading and the suggestion to clarify several important modeling assumptions in the main text.

We confirm that the morphologies used in our simulations were 2D multi-compartmental neuron models. While the models include detailed dendritic and axonal arbors, these structures were embedded in two-dimensional space. Recurrent synaptic connections were omitted, and each neuron was independently driven by pink noise inputs injected at the soma.

We agree that omitting 3D morphology and recurrent connections limits the biophysical realism of the simulation. As you noted, this simplification may reduce spike synchrony and constrain waveform variability that arises from complex spatial arrangements of neurons. We will explicitly acknowledge this limitation in the revised manuscript.

We will also add the following clarifications to the revised paper:

  • Each 10-minute trial includes five neurons placed in a 100 μm × 100 μm × 100 μm volume surrounding the tetrode, with random rotations and placements.
  • Extracellular signals are generated using the line-source approximation, summing contributions from all neurons (a sketch follows after this list).
  • Neurons were simulated independently with no recurrent connections.
  • Pink noise was the only driving input.
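To make the line-source bullet concrete, here is a hedged sketch of the standard line-source formula (Holt & Koch, 1999) that this approximation typically refers to; the variable names and conductivity default are assumptions, and boundary cases are omitted.

```python
# Hedged sketch of the line-source approximation (Holt & Koch, 1999):
# potential from one segment of length ds carrying transmembrane current i_m,
# at radial distance r from the segment axis, with h the longitudinal distance
# from the electrode to the segment's near end and sigma the extracellular
# conductivity. Boundary cases (electrode alongside the segment) are omitted.
import numpy as np

def line_source_potential(i_m, r, h, ds, sigma=0.3):
    l = h + ds
    num = np.sqrt(h**2 + r**2) - h
    den = np.sqrt(l**2 + r**2) - l
    return i_m / (4 * np.pi * sigma * ds) * np.log(np.abs(num / den))

# The full signal sums this over all segments of all neurons.
```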

We will also move key simulation details from the appendix to the main text to improve accessibility.

2. Resolving overlapping spikes across neurons (B5)

Is it possible/frequent for two neurons to emit a spike in the same time bin?

Good point! In simulation, spike overlap is possible (multiple neurons can emit spikes within the same time bin), particularly under high firing-rate conditions. Since the extracellular voltage is the linear sum of contributions from all active neurons, SimSort's detection module is trained to identify spike events from such composite waveforms, while the task of assigning spikes to their respective source neurons is handled by the subsequent identification module. Overlapping spikes represent one of the key challenges that our contrastive representation learning framework is designed to address.

How is this handled?

Even when two spikes occur within the same time bin, they typically originate from different spatial positions, resulting in distinct waveforms across channels due to spatial variation in the extracellular potential. This difference in waveform shape across channels provides useful cues for distinguishing overlapping events. An example of such spatial variation can be seen in Figure 5c, where spikes from different neurons exhibit different waveform patterns across the 4 channels.

In practice, we observed that SimSort can often detect and separate overlapping spikes when their peak times differ by more than 15 sampling points (~0.5 ms), with both events typically detected and identified correctly. We will include these observations in the revised manuscript.
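As a toy illustration of the linear-superposition point (not the authors' code), the following sketch sums two spikes whose peaks are 15 samples apart; 15 samples corresponds to ~0.5 ms at an assumed 30 kHz sampling rate.

```python
# Toy illustration: two spike waveforms offset in time simply add in the
# extracellular trace (linear superposition).
import numpy as np

t = np.arange(64)

def spike(t0, amp, width=2.0):
    """Crude negative-going spike shape centered at t0."""
    return -amp * np.exp(-0.5 * ((t - t0) / width) ** 2)

trace = spike(t0=20, amp=80.0) + spike(t0=35, amp=50.0)
# With peaks >= ~15 samples apart, both events remain resolvable as
# distinct minima in the summed trace.
```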

3. What additional benefits or difficulties should we encounter with more complex and detailed models? (Q1)

We agree that more biologically detailed models -- such as those with 3D morphologies, synaptic connectivity, and realistic network dynamics -- can offer valuable potential for spike sorting research. For example, recurrent connectivity can lead to spike-time correlations and population synchrony, which increase the difficulty of resolving temporally overlapping events. In addition, incorporating diverse neuronal morphologies and spatial arrangements can result in more complex and variable extracellular waveforms, placing higher demands on the model's ability to distinguish and represent waveform features robustly.

Importantly, features such as spike-time correlations can naturally emerge in large-scale simulations even with simplified neuron models. When many independent units are simulated over extended durations, spikes from different neurons may occur closely in time by chance. This exposes the model to temporally ambiguous cases without requiring explicit connectivity or coordinated input.

On the other hand, highly detailed and complex neuron models present substantial challenges. They are not only computationally expensive, but also make it considerably more difficult to accurately compute and model extracellular potentials because of the increased complexity of morphology and signal propagation. In this work, we adopt a relatively simplified yet biophysically grounded simulation approach as a foundational step toward developing robust and generalizable spike sorting methods using synthetic data.

We appreciate the suggestion to include references to more advanced neuron modeling efforts such as Laquitaine et al. (2025), Markram et al. (2015), and Billeh et al. (2019). We agree that incorporating more biologically realistic 3D morphologies and connectivity patterns holds strong potential to further enhance deep learning methods for spike sorting. Such advances may also benefit broader applications in neuroscience and brain–computer interfaces. We will incorporate these references and expand our discussion accordingly in the revised manuscript.

4. Failure modes and validation challenges (Q2, Q3)

This is a valuable comment. Our model may fail in several challenging scenarios. The detector may misclassify noise events with spike-like waveforms as spikes. Under extremely low SNR conditions, the identification model may struggle to extract stable and discriminative features, even with noise augmentation during training. In long-term recordings with drift, spikes from the same neuron may be split into separate clusters, as the model does not explicitly account for waveform invariance across channels. Additionally, when two spikes occur within a short temporal window (e.g., less than 15 sampling points apart), the smaller one is often ignored to avoid false positives, potentially leading to missed detections. We will add more discussions about failure cases in the revised manuscript.

Validating the model remains challenging, particularly due to the lack of ground-truth labels in real in vivo recordings. While paired intracellular–extracellular recordings or optogenetic tagging provide gold-standard validation, they are difficult to scale. As a practical alternative, we perform indirect validation by analyzing the sensory tuning of sorted V1 neurons, where consistent and selective response patterns support the reliability of the sorting results.

To better evaluate the model’s limitations, diverse datasets will be valuable. Hybrid datasets with real background noise and drift, long-term recordings, paired ground-truth data, and more realistic simulations with recurrent dynamics can expose specific failure cases and challenge the model’s temporal resolution and representational robustness. Such benchmarks will be essential for future improvements in generalizability and reliability.

5. Figure readability and expanded discussion of related work (A1-A2, C)

We apologize for the small font size in Figures 5 and 6. In the revision, we will enlarge label fonts to improve readability. We will also reference the relevant appendix sections more clearly within the main text to guide the reader. Additionally, we will expand the discussion of prior deep learning models for spike detection and waveform representation in the revised version, to better contextualize our approach within the existing literature.

We greatly appreciate the time and effort you put into reviewing our work. Please feel free to raise any additional points during the discussion; we are happy to clarify or expand as needed.

Comment

Thank you for addressing my review. I am still favorable to the publication of this paper. If the authors state how they plan to integrate a better comparison with Kilosort 4 (at least conceptually), I am willing to upgrade my grade to 5.

The response is very useful, and the suggested clarifications regarding the simulation details are very useful and interesting. It could be interesting to review in passing how other works like Kilosort generate synthetic data for training.

Regarding the response to another reviewer about KiloSort 3.5 and 4: I agree with the other reviewer that this should be reported. Plus, KiloSort 4 is a deep-learning-first spike sorting method, and it would be fair and important to highlight that SimSort is not the only one of this type. On this topic, I am not satisfied with the response provided to the other reviewer:

First, KiloSort 3.5 and 4.0 introduced enhancements primarily optimized for high-density Neuropixels recordings. Since our current study focuses on tetrode recordings, we did not include KiloSort 3.5 and 4 in our comparisons.

Therefore, KiloSort 4 should be "rather bad" when applied to tetrodes instead of Neuropixels? First, it would be good to verify this statement. Second, if true, a comparison with KiloSort 4 is still needed. At minimum, provide a summary of the technical differences in the deep learning method, or of the applicability (if KiloSort 4 cannot be applied, should we therefore understand that transferability from tetrodes to Neuropixels is also impossible with your approach?).

Comment

Thank you so much for your prompt follow-up! We truly appreciate your continued support and the insightful points you’ve raised throughout this discussion.

We fully agree that a more direct comparison with KiloSort 4 is important. In response to your suggestion, we have now evaluated KiloSort 4 on the hybrid datasets and will include both the quantitative results and a conceptual analysis in the revised manuscript. We are also in the process of evaluating KiloSort 4 on additional datasets (including the WaveClus dataset, the BBP Layer 6 dataset, and real tetrode recordings) and will report these results in the updated version to provide a more comprehensive comparison. The table below summarizes the current results on the hybrid datasets:

| Method     | Accuracy (Static) | Recall (Static) | Precision (Static) | Accuracy (Drift) | Recall (Drift) | Precision (Drift) |
|------------|-------------------|-----------------|--------------------|------------------|----------------|-------------------|
| KiloSort 4 | 0.42 ± 0.05       | 0.42 ± 0.06     | 0.44 ± 0.06        | 0.25 ± 0.03      | 0.27 ± 0.03    | 0.31 ± 0.04       |
| SimSort    | 0.62 ± 0.04       | 0.68 ± 0.04     | 0.77 ± 0.03        | 0.56 ± 0.03      | 0.63 ± 0.03    | 0.69 ± 0.03       |

These results confirm that while KiloSort 4 performs impressively on high-density Neuropixels data, its performance drops significantly in the sparse, nonlinear channel configuration of tetrode arrays. This observation is consistent with the algorithmic structure described in Pachitariu et al. (Nature Methods, 2024), where KiloSort 4 introduces a graph-based clustering scheme that relies on waveform similarity and spatial adjacency to build inter-channel connectivity. While highly effective in linear, high-density settings, this design becomes under-constrained in tetrode recordings, where spatial continuity is limited, especially under challenging conditions such as drift.

In light of this, we also recognize the importance of generalization across electrode configurations, for example from tetrode recordings to high-density probes like Neuropixels. Although our current study focuses on tetrodes, the simulation-based framework underlying SimSort is conceptually extensible to other probe types. Supporting such generalization in practice will likely require further engineering to reflect the spatial continuity, signal correlations, and noise characteristics of dense linear arrays, an important direction for future work.

We also appreciate your suggestion to situate SimSort within the broader landscape of learning-based spike sorting approaches. In the revised manuscript, we will more clearly acknowledge their contributions and articulate how SimSort builds upon, and differs from, these prior efforts.

We would also like to kindly point out that KiloSort 4 does not incorporate deep learning architectures or neural network modules. Instead, its performance improvements stem from a template-matching pipeline combined with graph-based clustering algorithms, as described in Pachitariu et al. (Nature Methods, 2024). These algorithms are implemented in PyTorch for GPU acceleration, but are not based on trainable models or learned representations.

We see the key novelty of SimSort not merely in the use of deep learning, but in the scientific finding that large-scale, biophysically realistic simulations alone can enable zero-shot generalization to real-world spike sorting tasks. While previous deep learning-based methods such as YASS and CEED have shown promising results, their reliance on relatively small training datasets has limited their generalization ability. To our knowledge, SimSort is the first demonstration that simulation-driven pretraining alone can achieve generalizable and reliable spike sorting, highlighting a new paradigm for scalable and data-efficient modeling in this field.

Thank you again for prompting this clarification and for helping us better frame the contribution of SimSort. We’ve found this exchange to be highly constructive and sincerely appreciate your engagement. If there are any remaining questions or points that are unclear, we would be more than happy to continue the discussion.

[1] Pachitariu et al., Spike sorting with KiloSort 4. Nature Methods, 2024.

Comment

Dear Reviewer 1rAH,

With the discussion period ending tomorrow, we want to confirm whether your concerns have been addressed. Your insights are greatly appreciated, and we are keen to address any outstanding issues to further improve the work. Thank you for your time and effort in reviewing the paper and participating in the discussion.

Review
Rating: 4

The paper introduces SimSort, a deep learning-based framework for spike sorting, which is the process of identifying and classifying electrical signals from individual neurons in extracellular brain recordings. The goal is to overcome limitations of traditional heuristic methods, such as sensitivity to noise and manual parameter tuning, by leveraging biologically realistic neuron models (206 neuron types from the rat somatosensory cortex) and large-scale electrophysiology simulations.

Strengths and Weaknesses

Strengths:

  1. Uses a Transformer-based model trained on simulated continuous signals as a novel method, outperforming threshold-based methods.
  2. Employs contrastive learning to generate robust waveform embeddings, followed by clustering (e.g., UMAP + GMM; see the sketch after this list).
  3. Zero-shot generalization: achieves state-of-the-art performance on real-world datasets without fine-tuning.
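To illustrate point 2, a minimal sketch of the embed-then-cluster step; `embeddings` is an assumed (n_spikes, d) array from the contrastive encoder, and the component counts are illustrative.

```python
# Sketch of the embed-then-cluster step: reduce contrastive embeddings with
# UMAP, then cluster the low-dimensional points with a Gaussian mixture.
import umap
from sklearn.mixture import GaussianMixture

emb_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
labels = GaussianMixture(n_components=5, random_state=0).fit_predict(emb_2d)
```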

Weaknesses:

  1. The generalization of the model could be explored more thoroughly, e.g., on different datasets and recording techniques, such as Neuropixels/tetrode recordings with more channels.
  2. There is no comparison with Kilosort 3.5 and Kilosort 4 in Tables 1-3, even though they are the SOTA spike sorting methods most widely used.
  3. Ablation studies are needed, as it is hard to see whether the good performance comes from the Transformer-based spike detection model, the spike identification model with contrastive learning, or the denoising operation. For example, how would the performance change if the spike identification model were paired with a threshold detector?

Questions

  1. The generalization of the model has not been explored enough, e.g., on different datasets and recording techniques.
  2. There is no comparison with Kilosort 3.5 and Kilosort 4, even though they are the SOTA spike sorting methods most widely used.
  3. Ablation studies are needed.
  4. How do different data types (for example, spikes from different areas) influence the model performance?

Limitations

Yes

Justification for Final Rating

The authors addressed most of my concerns. I will maintain a positive score.

Formatting Concerns

No

Author Response

Thank you for your positive feedback and helpful suggestions. Below, we respond in detail to the concerns you raised:

1. Generalization to other datasets and recording techniques (W1 & Q1)

We agree that evaluating SimSort on a broader range of datasets and recording configurations is an important direction. In this work, we focused on tetrode recordings as an initial step toward validating simulation-pretrained models on real data. Extending SimSort to higher-density recordings like Neuropixels presents additional engineering and algorithmic challenges (e.g., larger input dimensionality, probe geometry encoding), which we are actively working to address in future versions.

2. Comparison with KiloSort 3.5 and 4.0 (W2 & Q2)

Thank you for raising this important point. There are two key reasons why we did not include Kilosort 3.5 and 4.0 in our comparisons.

First, KiloSort 3.5 and 4.0 introduced enhancements primarily optimized for high-density Neuropixels recordings, whereas our current study focuses on tetrode recordings; we therefore did not include them in our comparisons.

Moreover, the goal of our work is to demonstrate, for the first time, that large-scale, simulation-driven pretraining provides a potentially practical and generalizable strategy for robust spike sorting.

3. Ablation study on model architecture and detection method (W3 & Q3)

We appreciate your suggestion and conducted ablation experiments to analyze the contribution of different components of SimSort to the overall performance. First, we replaced the detection module with a simple threshold detector and kept the spike identification module unchanged. The results on the hybrid dataset are summarized in Table 1.

Table 1. Sorting Results with Threshold Detector vs SimSort (Hybrid Dataset)

| Method          | Accuracy (Static) | Recall (Static) | Precision (Static) | Accuracy (Drift) | Recall (Drift) | Precision (Drift) |
|-----------------|-------------------|-----------------|--------------------|------------------|----------------|-------------------|
| Threshold (2.0) | 0.24 ± 0.01       | 0.44 ± 0.01     | 0.26 ± 0.01        | 0.22 ± 0.01      | 0.42 ± 0.01    | 0.24 ± 0.01       |
| Threshold (2.5) | 0.30 ± 0.02       | 0.49 ± 0.01     | 0.33 ± 0.02        | 0.30 ± 0.02      | 0.47 ± 0.02    | 0.33 ± 0.02       |
| Threshold (3.0) | 0.47 ± 0.05       | 0.59 ± 0.03     | 0.53 ± 0.06        | 0.46 ± 0.05      | 0.57 ± 0.04    | 0.51 ± 0.05       |
| Threshold (3.5) | 0.57 ± 0.03       | 0.62 ± 0.02     | 0.69 ± 0.04        | 0.55 ± 0.04      | 0.60 ± 0.03    | 0.65 ± 0.04       |
| Threshold (4.0) | 0.55 ± 0.02       | 0.58 ± 0.02     | 0.76 ± 0.02        | 0.50 ± 0.03      | 0.53 ± 0.03    | 0.70 ± 0.03       |
| Threshold (4.5) | 0.44 ± 0.02       | 0.47 ± 0.02     | 0.71 ± 0.02        | 0.44 ± 0.04      | 0.46 ± 0.04    | 0.68 ± 0.03       |
| Threshold (5.0) | 0.41 ± 0.03       | 0.42 ± 0.03     | 0.68 ± 0.02        | 0.37 ± 0.03      | 0.39 ± 0.03    | 0.65 ± 0.03       |
| SimSort         | 0.62 ± 0.04       | 0.68 ± 0.03     | 0.77 ± 0.03        | 0.56 ± 0.03      | 0.63 ± 0.03    | 0.69 ± 0.03       |

From Table 1, we can see that SimSort consistently performs well, whereas the threshold-based method is sensitive to the choice of voltage threshold.
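For reference, a minimal sketch of a threshold baseline of the kind benchmarked in Table 1, assuming MAD-based noise scaling and negative-going spikes; the exact baseline implementation may differ.

```python
# Minimal sketch of a threshold detector: detect negative crossings of
# k times the MAD-based noise level.
import numpy as np

def threshold_detect(trace: np.ndarray, k: float = 3.5) -> np.ndarray:
    """Return the first sample index of each negative threshold crossing."""
    noise = np.median(np.abs(trace - np.median(trace))) / 0.6745
    below = trace < -k * noise
    return np.flatnonzero(np.diff(below.astype(np.int8)) == 1) + 1
```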

Then, we tested whether a simpler GRU-based detection model can replace the Transformer-based detector in the spike detection stage. As shown in Table 2, the detection performance drops significantly.

Table 2. Detection Performance Comparison Between GRU and Transformer

| Detection Model       | Accuracy (Static) | Recall (Static) | Precision (Static) | Accuracy (Drift) | Recall (Drift) | Precision (Drift) |
|-----------------------|-------------------|-----------------|--------------------|------------------|----------------|-------------------|
| GRU                   | 0.34 ± 0.01       | 0.38 ± 0.02     | 0.75 ± 0.02        | 0.34 ± 0.02      | 0.39 ± 0.02    | 0.75 ± 0.01       |
| Transformer (SimSort) | 0.72 ± 0.03       | 0.84 ± 0.02     | 0.82 ± 0.02        | 0.68 ± 0.03      | 0.82 ± 0.02    | 0.81 ± 0.02       |

Transformers are particularly well-suited for spike detection due to their ability to model long-range dependencies and contextual interactions between time points, without the limitations of recurrence or fixed receptive fields. Furthermore, the self-attention mechanism allows the model to adaptively weigh different parts of the signal for detecting spike events under varying background conditions.
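As an illustration of this framing (not the authors' exact architecture), a minimal per-time-step binary classifier with a Transformer encoder might look as follows; dimensions are assumed and positional encoding is omitted for brevity.

```python
# Illustrative sketch: spike detection as per-time-step binary classification.
import torch.nn as nn

class SpikeDetector(nn.Module):
    def __init__(self, n_channels=4, d_model=64, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # spike / no-spike logit per time step

    def forward(self, x):                  # x: (batch, time, channels)
        h = self.encoder(self.proj(x))
        return self.head(h).squeeze(-1)    # (batch, time) logits

# Trained with nn.BCEWithLogitsLoss against per-time-bin spike labels.
```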

To evaluate the role of the encoder architecture and the representation learning objective in the spike identification model, we compared three configurations:
(1) a GRU encoder trained with contrastive loss (SimSort), (2) a Transformer encoder with the same loss, and (3) a GRU encoder trained with a non-contrastive supervised objective.

Table 3. Identification performance across objectives, encoders, and denoising

| Objective   | Encoder     | Denoiser | ARI ± Std (Hybrid-static) | ARI ± Std (Hybrid-drift) |
|-------------|-------------|----------|---------------------------|--------------------------|
| Contrastive | GRU         | ✓        | 0.91 ± 0.02               | 0.89 ± 0.03              |
| Contrastive | GRU         | ✗        | 0.88 ± 0.03               | 0.85 ± 0.02              |
| Contrastive | Transformer | ✓        | 0.89 ± 0.03               | 0.85 ± 0.03              |
| Supervised  | GRU         | ✓        | 0.24 ± 0.02               | 0.26 ± 0.02              |

As shown in Table 3, the contrastive approach yields substantially better performance, indicating its critical role in robust spike identification. Among the contrastive models, the GRU encoder slightly outperforms the Transformer.
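For readers unfamiliar with the contrastive objective referenced here, a generic NT-Xent-style loss over two augmented views of the same spikes is sketched below; the paper's exact loss and augmentations may differ.

```python
# Generic NT-Xent-style contrastive loss: two augmented views of each spike
# should map to nearby embeddings; other spikes in the batch act as negatives.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (batch, dim) embeddings of two augmentations of the same spikes."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2B, dim)
    sim = z @ z.t() / temperature                 # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-similarity
    b = z1.shape[0]
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)
```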

We also conducted an ablation study to evaluate the impact of the denoising module. By removing the denoiser while keeping the GRU encoder and contrastive learning unchanged, we observed a modest drop in performance, particularly in the drift setting. This indicates that while SimSort’s identification performance is primarily driven by contrastive learning, denoising contributes to robustness under more challenging conditions.

Thanks again for your valuable suggestions; we will include these results in the revised manuscript.

4. How different data types (for example, spikes from different areas) influence the model performance? (Q4)

Thank you for the insightful question; we have also considered this problem. Our answers follow below.

For general spike sorting tasks, the specificity of the brain region or neuron type plays a relatively limited role. Spike detection primarily relies on identifying temporally localized voltage deflections that exceed background noise -- a process more strongly influenced by signal-to-noise ratio and electrode geometry than by regional differences. For spike identification, our contrastive learning framework is designed to distinguish waveforms based on relative similarity, without relying on prior knowledge of cell types or waveform templates.

We have also shown empirical results on the generalizability of SimSort to different brain regions and animals (Figure 5). We constructed our simulation dataset using morphologically detailed neuron models from the rat somatosensory cortex; for real-data evaluation, we successfully applied SimSort to tetrode recordings from the mouse primary visual cortex. This implies that the specificity of the brain region or neuron type plays a relatively limited role.

To further investigate the influence of brain-region variability on model performance, we plan to incorporate additional regions into the simulation dataset and the test dataset in future work.

We sincerely thank you again for your thoughtful and constructive feedback. We hope our response has addressed your concerns, and we would be happy to further discuss any remaining questions during the discussion phase.

Comment

Thank you for addressing part of my concerns.

  1. About the Kilosort comparison: whether the method is general enough cannot be quantified without extending from tetrodes to Neuropixels. As the authors mention, the goal is to provide a practical and generalizable strategy, so the comparison is necessary: "Moreover, the goal of our work is to demonstrate, for the first time, that large-scale, simulation-driven pretraining provides a potentially practical and generalizable strategy for robust spike sorting."

  2. In the reply to reviewer 1rAH, the authors emphasize zero-shot generalization from large-scale, biophysically realistic simulations. The large-scale simulation is the authors' contribution. However, the advantages of the models are not significant if they result from the large simulated dataset, because YASS and CEED could also be trained on the large-scale simulation dataset (the dataset is not limited to use by SimSort). The question raised here is: does the main contribution come from SimSort or from the dataset? If the contribution is from SimSort, the YASS and CEED frameworks should also be trained on the large-scale training dataset for a fair comparison. Moreover, since the training data are from simulation, transferring to Neuropixels data should be achievable. "We see the key novelty of SimSort not merely in the use of deep learning, but in the scientific finding that large-scale, biophysically realistic simulations alone can enable zero-shot generalization to real-world spike sorting tasks. While previous deep learning-based methods such as YASS and CEED have shown promising results, their reliance on relatively small training datasets has limited their generalization ability. To our knowledge, SimSort is the first demonstration that simulation-driven pretraining alone can achieve generalizable and reliable spike sorting, highlighting a new paradigm for scalable and data-efficient modeling in this field."

Comment

Thank you very much for your follow-up. We truly appreciate your constructive feedback throughout the review process, and we are particularly grateful for your thoughtful engagement with both the practical relevance and scientific contributions of our work. Below, we address your comments point by point.

Tetrodes remain scientifically relevant and a valuable evaluation setting

We fully agree that generalization across different recording modalities is an important long-term goal. Our current work focuses on a specific but underexplored question: can deep learning models trained entirely on large-scale, biophysically realistic simulations generalize in a zero-shot manner to real tetrode recordings? To our knowledge, this is the first work to systematically demonstrate the feasibility of this approach, and to validate its effectiveness on both hybrid and real datasets.

While Neuropixels offers higher density, tetrode recordings continue to be widely used in high-profile neuroscience, including in recent studies published in Cell, Nature, and Science [1–5]. Their popularity stems from several practical and scientific advantages: long-term implant stability, compact and lightweight design suitable for freely moving animals, and consistent challenges like limited channel count and waveform variability under drift. These properties make tetrodes not only scientifically meaningful, but also an ideal and technically challenging testbed for evaluating sim-to-real generalization.

We believe that establishing robust performance in this domain is an important step toward broader applicability.

SimSort’s performance is not driven by data scale alone

Thank you for raising the important question of whether our results primarily reflect access to large-scale simulation data. We fully understand the need to distinguish model design from data scale.

To address this, we conducted controlled ablations (detailed in the rebuttal), all using the same simulation dataset, but with simplified components:

  • a threshold-based spike detector instead of the Transformer,
  • GRU encoders instead of Transformers,
  • supervised classification instead of contrastive learning,
  • and the removal of the denoising module.

Across all conditions, SimSort’s full architecture consistently outperformed the alternatives. These results support our claim that the observed improvements are not due to dataset size alone, but arise from the design and engineering of a suitable architecture, representation learning objective, and robustness-enhancing components.

Why YASS and CEED are not directly compatible with our dataset

We sincerely appreciate your suggestion to retrain YASS or CEED on our simulation dataset and have considered this option carefully. However, both frameworks rely on design assumptions that are fundamentally incompatible with our task setting and data format:

  • YASS employs a CNN-based denoiser trained on threshold-detected spikes. It does not perform spike detection itself, and its denoising targets and network behavior are tightly coupled to signal characteristics that do not match our simulated tetrode data.
  • CEED is designed for spike identification and uses contrastive learning with geometry-aware sampling. These augmentations assume dense spatial layouts such as Neuropixels and cannot be meaningfully applied to tetrode recordings.

Ongoing efforts to support generalization to Neuropixels

We would also like to clarify that we are actively working toward broader generalization, including to Neuropixels. Our ongoing efforts include:

  • expanding the simulation corpus to cover high-density geometries,
  • incorporating spatial information and geometry-aware invariances into contrastive training,
  • and evaluating both detection and identification on real Neuropixels data.

While these directions are promising, our current submission focuses on validating simulation-pretrained models in the tetrode setting, where we already show:

  • zero-shot improvements over baselines on hybrid and WaveClus-difficult datasets,
  • physiologically meaningful single-unit recovery in real data (e.g., clear refractory periods and tuning),
  • and a scalable, reproducible training recipe that can be extended to other modalities.

Once again, we thank you for your constructive feedback. Your questions have sharpened the focus and clarity of our work; please let us know if you have any remaining concerns.

References

[1] Eliav et al., Fragmented replay of very large environments in the hippocampus of bats. Cell (2025).

[2] Harkin et al., A prospective code for value in the serotonin system. Nature (2025).

[3] Chen et al., Brain region–specific action of ketamine as a rapid antidepressant. Science (2024).

[4] Jun et al., Prefrontal and lateral entorhinal neurons co-dependently learn item–outcome rules. Nature (2024).

[5] Aldarondo et al., A virtual rodent predicts the structure of neural activity across behaviours. Nature (2024).

Comment

Dear Reviewer jc2X,

With the discussion period ending tomorrow, we want to confirm whether your concerns have been addressed. Your insights are greatly appreciated, and we are keen to address any outstanding issues to further improve the work. Thank you for your time and effort in reviewing the paper and participating in the discussion.

Review
3

This paper addresses a fundamental and crucial challenge in the field of neuroscience: Spike Sorting. The authors utilize highly accurate computational models from biophysics to create a large-scale, high-fidelity simulated extracellular recording dataset and propose a data-driven pre-training framework called SimSort, which achieves zero-shot transfer.

Strengths and Weaknesses

Strengths:

  1. Significant and highly original: the large-scale, high-fidelity dataset created by the authors, which they plan to open-source, is a major contribution to the neuroscience community.
  2. The model architecture is well-designed, with a Transformer for spike detection (a sequence labeling task) and a GRU with contrastive learning for spike identification (waveform representation learning).
  3. The experimental evaluation is rigorous, spanning multiple benchmarks: a self-built simulated test set, publicly available hybrid datasets, and real experimental data.

Weaknesses:

  1. Limited innovation in the deep learning architecture itself: although applying these techniques to spike sorting is novel, from a pure machine learning perspective the components used (Transformer, GRU, triplet loss, etc.) are existing, well-established frameworks.
  2. No evaluation of inference speed. Tools like KiloSort are widely adopted not only for their accuracy but also for their very high processing efficiency; an analysis of SimSort's inference time should be provided.
  3. The discussion of data scaling is not deep enough. Figure 7 shows performance as the training data scale increases, but several metrics (such as identification ARI) show signs of saturation. Does this mean that simply adding more of the same kind of simulated data no longer brings significant improvements? The authors could examine whether more diverse simulation data (different brain regions, different electrode types) or improvements to the model architecture itself are needed to break through this performance bottleneck.

Questions

  1. Regarding the gap between simulation and reality: although zero-shot transfer performs well, SimSort is trained mainly on simulated tetrode data, and its generalization to electrode types with very different geometries and noise characteristics (such as high-density Neuropixels probes) may still be limited. Could you discuss which factors in the simulation process are key to successful zero-shot transfer?
  2. Regarding the choice of model architecture: you use a Transformer for spike detection and a GRU for spike identification. What motivated this design? Does the GRU have particular advantages over the Transformer for representation learning on individual waveforms (short sequences)?
  3. How does SimSort's inference speed compare to traditionally efficient algorithms like KiloSort? For researchers who need to process massive amounts of data, is it a practical choice in terms of speed?

Limitations

See Weaknesses and rebuttal questions.

Final Justification

The authors failed to adequately address the limitations regarding the dataset, specifically:

Despite claiming the use of a large-scale dataset, the actual training data was limited to four-channel tetrode recordings, which may restrict the model's generalizability to higher-density probes (e.g., Neuropixels). The model was also not trained on diverse electrode geometries, which may impair its performance across different experimental configurations. In their rebuttal, the authors acknowledged that simply increasing dataset scale cannot fully compensate for insufficient data diversity, yet they did not propose effective mitigation strategies. Moreover, the authors did not sufficiently acknowledge the dataset limitations concerning electrode geometries and higher-density probes.

Given these reasons, we maintain our original evaluation decision.

Formatting Issues

The font size in the tables in Figures 5 and 6 is too small.

Author Response

We appreciate your thoughtful comments and suggestions, which helped improve the clarity and quality of our work. Below, we address your concerns and questions in detail:

1. Model design and architectural novelty (W1)

The main contribution of our study is to reveal the potential of simulation-driven pretraining to enhance the robustness and scalability of spike sorting in experimental neuroscience. Although our architecture is based on well-established components such as Transformers and GRUs, the novelty of our work lies in how these components are integrated and tailored specifically for spike sorting, particularly in the context of simulation-based pretraining.

2. Inference speed evaluation and practicality (W2 & Q3)

This is a nice point. SimSort is in fact fast, and we will update the paper to clarify its inference speed. As shown in the demonstration video included in our originally submitted supplementary material, SimSort processes a 30-second tetrode recording in 9.5 seconds end to end—including data pre-processing, spike detection, spike identification, and visualization—on a single A100 GPU. This pipeline has not undergone dedicated speed optimization; techniques such as batching, GPU-based preprocessing, and GPU-based clustering could further improve runtime. Even in its current form, SimSort demonstrates practical inference speed on tetrode recordings and shows potential for efficient large-scale spike sorting. Table 1 below details the computational requirements.

Table 1: Computational Requirements of SimSort models

| Quantity | Spike Detection (Transformer) | Spike Identification (GRU) |
|---|---|---|
| Sequence length per sample | 2000 | 60 |
| Number of parameters | 3,160,577 | 795,648 |
| CPU inference time per sample (ms) | 15 | 7 |
| CUDA inference time per sample (ms) | 10 | 5 |
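For reference, per-sample latencies of this kind can be measured with a simple harness such as the sketch below; `model` and `sample` are placeholders for an arbitrary PyTorch module and input, and the warm-up and synchronization steps reflect standard benchmarking practice rather than SimSort-specific code:

```python
import time
import torch

@torch.no_grad()
def time_per_sample(model, sample, n_warmup=10, n_runs=100, device="cuda"):
    """Rough per-sample inference timing for any PyTorch model."""
    model = model.to(device).eval()
    sample = sample.to(device)
    for _ in range(n_warmup):      # warm up kernels and caches
        model(sample)
    if device == "cuda":
        torch.cuda.synchronize()   # drain queued GPU work before timing
    t0 = time.perf_counter()
    for _ in range(n_runs):
        model(sample)
    if device == "cuda":
        torch.cuda.synchronize()   # wait for all timed GPU work to finish
    return (time.perf_counter() - t0) / n_runs * 1e3  # ms per sample
```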

3. Scaling law of the training data size (W3)

discussion on data scale-up laws

Thank you for your careful review. Are you suggesting that detection accuracy is exhibiting signs of saturation? We note, however, that the other two metrics—Sorting Accuracy and Identification ARI (solid green and purple curves in Fig. 7)—do not show comparable saturation.

Consider the Hybrid (static) dataset. When the data size increases from $2^{12}$ to $2^{13}$, Sorting Accuracy rises by ≈ 0.02 and Identification ARI by ≈ 0.04. Although these absolute gains may appear modest, they are meaningful because the metrics are approaching their upper bounds. A similar pattern holds for the Hybrid (drift) dataset.

Regarding detection accuracy (dashed purple curves), we agree that the trend is approaching saturation. Importantly, though, enlarging the dataset never degrades detection performance, so increasing sample size remains advantageous—even if the marginal gains taper off.

We appreciate your suggestion about adding more discussion about the results regarding the scaling law, which will be reflected in the revision.

more diversified simulation data or improvements in the model architecture

Thank you for raising this thoughtful question. We agree that a critical direction for future improvement lies in examining how the composition and diversity of synthetic datasets influence model generalization.

As you suggested, merely increasing the volume of data from a single simulation paradigm may yield diminishing returns, especially if the additional data do not introduce novel variability in waveform structure or network-level dynamics.

Further performance gains are more likely to come from increasing simulation diversity rather than scale alone. This includes:

  • Incorporating neuron models from different brain regions, which can differ in cell-type composition, firing patterns, and background activity.
  • Including recurrent connectivity of neuron models, which can introduce spike synchrony and temporal correlations—important challenges for realistic spike sorting.
  • Utilizing 3D neuron morphologies (rather than 2D) and more realistic tissue conductivity, which can generate more heterogeneous extracellular waveforms.

At the same time, improvements in model architecture, particularly mechanisms that account for temporal overlap or drift (e.g., invariant representations across waveform deformations), are also promising directions.

In short, we see this work as a first step demonstrating the viability of simulation-based pretraining, and we fully agree that diversifying simulation conditions and evolving model design will be key to further advancing generalization. We plan to pursue both directions in future work.

4. Factors in the simulation process that are key to zero-shot transfer (Q1)

This is a thoughtful question. Several aspects of our simulation design contributed to the observed generalization:

  • Electrode geometry and neuron positioning. We explicitly model the 3D spatial configuration of electrodes and surrounding neurons. The relative distance, orientation, and density of nearby cells strongly influence extracellular signals. This spatial variability is essential for training a detection model to distinguish spikes from noise across diverse waveform shapes and amplitudes.
  • Biophysical realism of neuron models. We use morphologically and electrophysiologically realistic neurons with diverse firing properties, generating a wide range of biologically plausible waveforms.
  • Naturalistic noise-driven activity. Temporally correlated stochastic current injection introduces realistic subthreshold fluctuations and irregular spike timing, enhancing robustness to background noise and temporal variability (see the sketch after this list).
  • Large-scale, ground-truth-labeled training data. The volume and diversity of training samples help the model learn generalizable features and avoid overfitting to specific conditions.

Our results suggest that combining spatial modeling, biophysical diversity, and statistical variability can help bridge the sim-to-real gap. Supporting high-density probes is part of our ongoing efforts.
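To make the noise-driven activity point concrete, the following is a minimal sketch of an Ornstein–Uhlenbeck process, a standard way to generate a temporally correlated stochastic current; the parameter values are illustrative assumptions, not the exact settings used in our simulations:

```python
import numpy as np

def ou_current(n_steps, dt=0.1, tau=5.0, mu=0.1, sigma=0.05, rng=None):
    """Ornstein-Uhlenbeck current (nA): temporally correlated noise drive.

    Euler-Maruyama update:
        I[t+dt] = I[t] + (mu - I[t]) * dt/tau + sigma * sqrt(2*dt/tau) * N(0,1)
    so the stationary standard deviation equals sigma.
    """
    rng = rng or np.random.default_rng()
    I = np.empty(n_steps)
    I[0] = mu
    coeff = sigma * np.sqrt(2.0 * dt / tau)
    for t in range(1, n_steps):
        I[t] = I[t - 1] + (mu - I[t - 1]) * dt / tau + coeff * rng.standard_normal()
    return I

# Usage: inject ou_current(100000) into a model neuron during simulation.
```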

5. Model architecture design intention (Q2)

For spike detection, the task is formulated as a sequence labeling problem over long, continuous voltage traces. Here, temporal context is critical—not only to suppress noise and artifacts, but also to accurately detect overlapping or low-amplitude spikes that are temporally modulated. Transformers are particularly well-suited for this setting due to their ability to model long-range dependencies and contextual interactions between time points, without the limitations of recurrence or fixed receptive fields. Furthermore, the self-attention mechanism allows the model to adaptively weigh different parts of the signal for detecting spike events under varying background conditions.
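For illustration, a minimal PyTorch sketch of this sequence-labeling formulation is given below; the layer sizes, sequence length, and learned positional embedding are illustrative assumptions rather than SimSort's exact configuration:

```python
import torch
import torch.nn as nn

class SpikeDetector(nn.Module):
    """Per-timestep spike labeling over a multi-channel voltage trace.

    Input x: (batch, time, channels); output: (batch, time) spike logits.
    """
    def __init__(self, n_channels=4, d_model=128, n_heads=4,
                 n_layers=4, max_len=2000):
        super().__init__()
        self.embed = nn.Linear(n_channels, d_model)
        self.pos = nn.Parameter(torch.randn(1, max_len, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)   # spike/no-spike per time step

    def forward(self, x):
        h = self.embed(x) + self.pos[:, : x.size(1)]  # add position info
        h = self.encoder(h)                            # attention over time
        return self.head(h).squeeze(-1)                # (B, T) logits

# Training pairs these logits with per-timestep labels via BCEWithLogitsLoss.
```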

For spike identification, the input consists of short waveform segments (~2 ms, typically 60 samples), where the objective is to learn compact, discriminative embeddings of spike shapes for clustering. In this case, we chose a GRU-based encoder after empirical comparison with a Transformer: GRUs have significantly fewer parameters and lower computational cost than Transformers. A minimal sketch of such an encoder with a contrastive objective is shown below.
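In the sketch, the hidden size, embedding dimension, and triplet margin are illustrative assumptions, not our exact hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveformEncoder(nn.Module):
    """GRU encoder mapping a ~2 ms waveform snippet to an embedding.

    Input x: (batch, time=60, channels=4); output: (batch, emb_dim).
    """
    def __init__(self, n_channels=4, hidden=128, emb_dim=32):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):
        _, h_n = self.gru(x)           # final hidden state: (1, B, hidden)
        z = self.proj(h_n.squeeze(0))  # (B, emb_dim)
        return F.normalize(z, dim=-1)  # unit-norm embeddings

# Triplet objective: pull two views of the same spike together,
# push a different spike away (margin value is an illustrative choice).
encoder = WaveformEncoder()
triplet = nn.TripletMarginLoss(margin=0.5)
anchor, positive, negative = (torch.randn(8, 60, 4) for _ in range(3))
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
```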

To validate these design choices, we conducted an ablation study comparing different combinations of detection and identification architectures.

First, we replaced the detection module with a simple threshold detector and kept the spike identification module unchanged. Please kindly refer to our response to reviewer jc2X's W3 & Q3 for the results. We observed that SimSort's detection model consistently performs well, whereas the threshold-based method is sensitive to the choice of voltage threshold.

Then, we tested whether a simpler GRU-based detection model can replace the Transformer-based detector in the spike detection stage. As shown in Table 2, the detection performance drops significantly.

Table 2. Detection Performance Comparison Between GRU and Transformer

| Detection Model | Accuracy (Static) | Recall (Static) | Precision (Static) | Accuracy (Drift) | Recall (Drift) | Precision (Drift) |
|---|---|---|---|---|---|---|
| GRU | 0.34±0.01 | 0.38±0.02 | 0.75±0.02 | 0.34±0.02 | 0.39±0.02 | 0.75±0.01 |
| Transformer (SimSort) | 0.72±0.03 | 0.84±0.02 | 0.82±0.02 | 0.68±0.03 | 0.82±0.02 | 0.81±0.02 |

These results reinforce the design rationale described above: the Transformer's ability to model long-range dependencies, together with self-attention's adaptive weighting of different parts of the signal, is critical for detecting spike events under varying background conditions.

To evaluate the role of encoder architecture and representation learning objective in the spike identification model, we compared different encoder backbones and learning algorithms as follows.

Table 3. Identification performance across objectives, encoders, and denoising

| Objective | Encoder | Denoiser | ARI (Hybrid-static) | ARI (Hybrid-drift) |
|---|---|---|---|---|
| Contrastive | GRU | ✓ | 0.91±0.02 | 0.89±0.03 |
| Contrastive | GRU | ✗ | 0.88±0.03 | 0.85±0.02 |
| Contrastive | Transformer | ✓ | 0.89±0.03 | 0.85±0.03 |
| Supervised | GRU | ✓ | 0.24±0.02 | 0.26±0.02 |

As shown in Table 3, the contrastive objective yields substantially better performance than supervised classification, indicating its critical role in robust spike identification. Among the contrastive models, the GRU encoder slightly outperforms the Transformer. We also observed a modest performance drop when the denoiser is not applied.
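For reference, the ARI values reported above can be computed directly from predicted cluster labels and ground-truth unit identities; the sketch below uses synthetic placeholder data, and KMeans stands in for whichever clustering algorithm is applied to the embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Placeholder data standing in for real encoder outputs and ground truth.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))    # (n_spikes, emb_dim)
gt_labels = rng.integers(0, 10, size=500)  # ground-truth unit IDs

pred_labels = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)
print(f"ARI: {adjusted_rand_score(gt_labels, pred_labels):.3f}")
```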

The font size in the tables in Figures 5 and 6 is too small.

We appreciate the note and will enlarge the fonts in the tables of Figures 5 and 6 in the revision.

Thanks again for your valuable suggestions which have helped us improve the clarity and rigor of our work, we will include these results in the revised manuscript. We welcome any further questions and are happy to engage in continued discussion.

Final Decision

This paper introduces a simulation-driven deep learning framework for spike sorting. The authors show that large-scale biophysically realistic simulations can enable zero-shot transfer to real tetrode recordings.

During the discussion, reviewers agreed that the paper makes a timely contribution that could be potentially impactful by combining a large simulated dataset with a deep learning pipeline. The main concerns were (i) the limited scope to tetrode recordings and unclear generalization to other probes (e.g., Neuropixels), (ii) the need for more direct comparisons to state-of-the-art spike sorting methods like KiloSort4, (iii) missing details about the simulation setup, and (iv) additional ablations clarifying what drives performance. The authors responded with further experiments (including comparisons with KiloSort, ablations, runtime analysis, and clearer limitations), which addressed (ii)–(iv). The restriction to tetrodes remained and was acknowledged as such by the authors, who argued that tetrodes are still an important method in the field.

As a result, the paper's scores remained borderline after the rebuttal. The restriction to tetrodes could be perceived as severe one in a time where the field is moving quickly towards large-scale high-density probes like Neuropixels. On the other hand, the authors argue (and some reviewers agree) that tetrodes remain in use in several labs. In addition, the reviewers agree that the paper is technically solid and proposes a new method. As a result of the final discussion among the reviewers, I believe the paper has its merits and recommend accepting it.