ActSort: An active-learning accelerated cell sorting algorithm for large-scale calcium imaging datasets
Abstract
Reviews and Discussion
The authors introduce an active learning framework for cell sorting in two photon imaging analysis. They develop a software interface to it and conduct a large scale benchmark using multiple datasets and involving multiple domain experts. They show that their method can reduce the needed manual human input to a small fraction of what other algorithms require. In contrast to many other works, they use hand-engineered features and simple classification algorithms.
Strengths
- The idea to use active learning for cell sorting in large scale two photon experiments is interesting.
- The interpolation between confidence and discrimination based sample selection is a small but clever addition to the literature.
Weaknesses
- The algorithm is part of a large framework, for which many components have been described in other papers (e.g. the cell extraction framework EXTRACT). This, and the reference to some of the data papers, make the paper poorly anonymized.
- There is extensive use of supplementary figures and the appendix, making the paper quite hard to read.
- The figures are super dense and very hard to comprehend in detail. The line width in several figures is very thin, lines are overlapping, and not all figures are properly annotated.
- The paper spends too much space on the software and the overall framework, but has many very dense figures that are quite hard to take apart.
- It remains somewhat unclear which parts of the software and the overall framework were constructed for this paper and which were already present before.
- A comparison to a deep learning based cell classifier seems missing.
Questions
- Line 113f: The authors discuss features using the number of spikes. How are the spikes counted from the imaging data? This seems surprising.
- Fig 2: Could the authors clarify what is shown here? This is d-prime between which distributions? How were the positive and negative examples defined here?
- Fig 2: What is meant by traditional features?
- Appendix B.1.2: Could the authors give mathematically precise feature definitions where possible?
- Line 140: The authors may want to define their acronym DCAL.
- Fig. 3: In A, what are the different colors?
Limitations
Adequately addressed.
Thank you for your time and effort in reviewing our manuscript. Your feedback was extremely helpful for increasing the clarity of our manuscript, and most importantly in distinguishing ourselves from published material.
Summary: Respectfully, we do not agree with your summary of our work in several key respects. First, we are not designing software for two-photon imaging analysis; our dataset consists mainly of 1p movies (4 out of 5 movies). Second, we are not aware of any published work with deep classifiers used to sort cells in 1p movies. Next, there is no work prior to ours designing an active learning routine for cell sorting, so we are not sure which "many other works" are referred to here. Finally, we believe we have several other contributions left out of your summary; please kindly see our general response above.
Anonymity: ActSort is a standalone quality control pipeline, not part of any previous publication. It is compatible not only with EXTRACT, but also with all other cell extraction algorithms. Moreover, ActSort is not affiliated with EXTRACT and does not endorse it. We simply used EXTRACT because other state-of-the-art alternatives for processing 1p movies (CAIMAN, or ICA in earlier studies) were less efficient and more time consuming. To mitigate this confusion, we added a new experiment using ActSort on ICA-extracted cells, as shown in Fig. N7 in the PDF. We are happy to address our contributions accordingly if you can let us know which specific previously published paper you are referring to.
We were carefully following the guidelines to maintain anonymity. For instance, we had not shared the user manual or links to our lecture videos for this very reason. In the references, we did our best to cover a broad community, including multiple calcium imaging techniques and cell extraction algorithms from multiple groups in the field, using published datasets with permissions from authors. Fortunately, your assumption is simply not correct.
Comments on figures and use of appendices: Respectfully, we disagree with the reviewer that the extensive additional benchmarking and the detailed explanation of the algorithm in the appendices and figures are a weakness. This is, in our opinion, a strength of our work. Moreover, as evident from the other reviewer reports, our work is self-contained in the main text and only uses the appendices for methodological details and additional benchmarking studies that strengthen our claims. If you can let us know which figures you find confusing, we would love to work on them.
Regarding discussion of software and framework: The software, the benchmark, and the active learning framework together form our novel conceptual contributions, in addition to the specific active learning query algorithms. All three are given equal space in the main text. We are the first to introduce this framework to experimental neuroscience, so it is reasonable that we spend time explaining it carefully, as a big fraction of our readership will be divided between active learning researchers and experimental neuroscientists.
Deep classifiers: Amazing point! We added a new experiment to address this (Fig. N8). Please refer to the “New experiments” section in the general response. In short, we tried ResNet-50 for feature extraction directly from movie frames and cell extraction outputs. Not only were the extracted features inferior to the engineered features, but the time required to run the cell data through the network also exceeded our design constraints.
Line 113f, spikes: Good question! We use the peakseek function in MATLAB to obtain the spike count from the cell traces extracted by the cell extraction algorithm (ICA or EXTRACT).
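For readers who want a concrete picture, the sketch below illustrates the general idea of counting trace peaks. It is a Python/scipy stand-in, not the authors' MATLAB peakseek call, and the prominence-based noise threshold is our own assumption.

```python
# Illustrative sketch only (the authors use MATLAB's peakseek): counting
# candidate "events" in an extracted fluorescence trace via peak detection.
import numpy as np
from scipy.signal import find_peaks

def count_events(trace, min_prominence_sd=2.0):
    """Count peaks in a 1D trace; min_prominence_sd is an assumed parameter."""
    trace = np.asarray(trace, dtype=float)
    # Robust noise estimate from the median absolute deviation.
    noise_sd = 1.4826 * np.median(np.abs(trace - np.median(trace)))
    peaks, _ = find_peaks(trace, prominence=min_prominence_sd * noise_sd)
    return len(peaks)
```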
Clarification on Fig. 2: Great point! We apologize for the confusion here. The d-prime is computed between the two distributions of the target feature, conditioned on whether the candidate is a cell or not. Positive examples are candidates that are actual cells, and negative examples are candidates that are not. To clarify, we added an illustration to explain the definition of the absolute discriminability index, a metric for quantifying the distance between these two probability distributions (Fig. N1). Additionally, as requested, we also added a statistical quantification of the effect size of the traditional features and the new features (Fig. N2).
Fig 2: Traditional features: Great question, one we should have made clear! The traditional features represent the features that CAIMAN and CLEAN used for cell classification on 1p imaging movies. Please see our general response to reviewers.
Detailed math for appendix and typos: Thank you for pointing these out. We have updated the math for all features accordingly. We changed L140 to “In this work, we designed an active learning query algorithm, the Discriminative Confidence-based Active Learning (DCAL) query algorithm, and compared its performance with traditional random selection and the two other query strategies.” The Fig. 3A legend was placed under panels B-D; we moved the legend above to make it more accessible for the reader, as you suggested.
Final clarifications: As we conclude our response, we wish to address weaknesses 1 and 5 from your report directly. We want to make it very clear that we have NOT copied any part of ActSort from any published work, nor was any part of ActSort published anywhere before. We apologize if there was any confusion, and we hope the rebuttal, the general response, all the additional experiments shown in the PDF, and the modifications make it clear that ActSort is NOT part of a large framework previously described in other papers.
We are looking forward to your additional comments, if you have any, on how we can improve clarity. Now that we have addressed all the weaknesses you raised and corrected the misconceptions about our contributions (for which we thank you), we hope you will consider increasing your score.
I apologize for misrepresenting the data modality the authors focus on (1p vs. 2p Ca2+ imaging). I also better understand that ActSort is a standalone piece of work.
- DNN: Sure, if you run a really large network on a problem that can be solved with a linear algorithm in ~75 features, you may be slower. I do think a custom-built CNN with a few layers could likely match the speed and performance. Btw, the "amazing point" in your response makes me wonder whether you are being facetious or not.
- Spikes: I would ask the authors to remove their mention of spikes from the paper in this context. Ample work performing simultaneous patch-Ca2+ imaging recordings has shown that the relationship between the traces and actual action potentials is much more complex than finding peaks. Call them "events" or "peaks".
- Is there a way to see the mathematically precise feature definitions? Like, what are bad spikes, a term which occurs in a number of features?
Thank you for your response and careful reading of our rebuttal. We answer your questions below.
Point 1: First, the field has had no real success in generalization across modalities in this direction before. Please see the discussions above regarding CAIMAN’s deep classifiers being suboptimal for 2p movies and not recommended for 1p movies, and Cascade requiring a decade of public benchmark accumulation to achieve this for spike extraction. Also, please note that the imaging and experimental conditions between two 1p movies may be as diverse as those between a 1p and a 2p movie.
Second, after consulting with the experimental neuroscientists, we realized that this path was also inconsistent with a major concern of theirs: reproducibility. A deep network is a non-convex approach and often requires retraining with new data (see Cascade here). Unless experimental groups also publish their retrained networks, the annotation would not be reproducible. This is not true for the model in our work. Here, as long as the set of annotated neurons is provided, a third party can always validate the reproducibility of the annotation.
Finally, what the input to such a deep classifier should look like is an open question, as we discussed in detail above. CAIMAN, for example, only used extracted cell profiles, but we believe (and our Fig. 2 shows) that spatiotemporal information should somehow be incorporated. We already have several other contributions, so we do not wish to open this direction as well. Instead, we did the next best thing to address it, which is considering whether a simple pre-trained network can become a solution.
Overall, building a specialized deep network is out of scope. We believe we properly discussed this now in the paper and above.
Point 2: Agreed. We will do so, and we also think this is more appropriate. This was an oversight.
Point 3: ‘Bad’ events are the ones that had low calcium amplitudes. Mathematically, we computed the 90th quantile of the trace and halved it. Any "event" with an amplitude lower than this was considered a "bad" event. We now realize that ‘bad’ is not the correct term for this; instead, we should call them "weak" events. For the rest of the features, we can share more details about whichever feature you wish to know more about.
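A minimal sketch of this rule (variable names are ours, not ActSort's) could look like:

```python
# Sketch of the "weak event" rule described above: the threshold is half of
# the trace's 90th quantile, and events whose amplitude falls below it are
# flagged as weak.
import numpy as np

def flag_weak_events(trace, event_amplitudes):
    threshold = 0.5 * np.quantile(np.asarray(trace, dtype=float), 0.90)
    return [amp < threshold for amp in event_amplitudes]
```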
To address your comment about our tone
In our response, we made it clear when we disagreed with you and also when we agreed with you. Part of it was also to give you feedback about which comments we felt were quite good, so that we can all come out as better writers and reviewers. We apologize that our writing was not clear. We were serious in our remarks; there was no humor intended.
We do believe that you have provided solid points for improvement, and that particular point was also necessary to at least discuss in this paper, given that our work is being considered for NeurIPS. Sure, you recommended rejection due to some assumptions about our work, but with the current reviewer loads (which we also have) and the scaling of the conference, it makes sense that certain nuances may slip through. This is why the discussion period exists. We do not believe there is any reason to be facetious about this process, and we had not intended to do so either. We apologize once again and look forward to hearing from you.
Thanks for the additional explanations. I do see the merit in the approach and don't oppose acceptance, therefore I have raised my score to 5. I am still somewhat critical regarding some of the points discussed, also with other reviewers.
We appreciate your time and consideration. Thank you for the helpful comments
This paper introduces a new semi-supervised active learning algorithm, ActSort, designed to accelerate cell sorting in large-scale calcium imaging datasets. The method leverages domain expert feature engineering and a novel active learning framework, optimizing the cell sorting process with minimal human input. The paper also presents a user-friendly custom software and validates its performance through a large-scale benchmark study involving six domain experts and approximately 160,000 candidate cells. Empirical results indicate that semi-automation reduces the need for human annotation to only 1%-5% of the candidate cells, while also improving sorting accuracy by mitigating annotation bias. As a robust tool validated under various experimental conditions and applicable across different animal subjects, ActSort addresses the primary bottleneck in processing large-scale calcium imaging videos, paving the way for fully automated preprocessing of neural imaging datasets in modern systems neuroscience research.
Strengths
- Developed a new active learning-accelerated cell sorting algorithm for large-scale calcium imaging datasets, significantly reducing the workload of human annotation.
- Utilized domain expert knowledge for feature engineering, enhancing classification robustness and accuracy across different animals and experimental conditions.
- Created user-friendly custom software, allowing experimental scientists without programming backgrounds to use it easily.
- Constructed a large-scale cell sorting benchmark dataset involving annotations by six experts, five mice, and approximately 160,000 candidate cells, which can be used for algorithm development and testing.
- Demonstrated the method's effectiveness through empirical studies on multiple datasets, reducing the need for human annotation to 1-5%.
Weaknesses
I believe this paper is a solid piece of scientific research, suitable for publication in a Nature sub-journal or a top-tier neuroscience journal (with the addition of more statistical analysis and biological significance studies). However, as a NeurIPS reviewer, I need to focus more on the algorithmic innovation and fair comparisons to help you improve the paper. I see the following shortcomings:
- Lack of Innovation in Active Learning Strategy: While the authors propose a new query strategy, DCAL, it essentially combines existing uncertainty and diversity sampling strategies through simple weighting, without sufficient theoretical justification and discussion on the necessity and effectiveness of this combination. Additionally, there is a lack of systematic analysis and guidance on adjusting the weights, and the ablation study in the experimental section is insufficient.
- Lack of Experimental Comparison with Existing Semi-Automated Cell Sorting Methods: Although the authors mention some prior semi-automated methods in the Related Work section, they do not conduct any quantitative comparative analysis in the experiments. This makes it difficult to assess the performance and efficiency advantages of ActSort over existing methods, rendering the "state-of-the-art" claim less convincing.
- Insufficient Feature Representation Learning: The authors rely heavily on handcrafted features designed by domain experts, lacking data-driven automatic feature learning capabilities. Good performance on the given dataset does not guarantee generalization to new datasets and experimental paradigms (e.g., new calcium indicators, neuron types, brain regions). The authors should consider utilizing pre-trained models such as CNNs to automatically learn visual and dynamic features of cells, reducing the burden of feature engineering.
- Underestimation of Annotation Noise and Error: The accuracy of manual annotations is the foundation of training and evaluation, but the paper does not sufficiently address this. Relying solely on voting to determine the "ground truth" overlooks the statistical properties of annotation errors. Additionally, the active learning process does not consider querying the same sample multiple times to reduce annotation noise, potentially overestimating the actual performance of the current method.
- Unreasonable Pooling Method for Human Performance Comparison: The paper compares the classifier's output with the annotations of a single annotator on the entire dataset in each iteration, while the classifier only uses a small subset of samples. A more reasonable approach would be to compare the classifier and human performance on the currently annotated subset, which might reveal a greater advantage for humans.
- Insufficient Reporting of Experimental Setup Details: The paper lacks descriptions of many critical implementation details, such as hyperparameter selection, training loss convergence, network architecture design, etc. In particular, the technical details of how general features pre-trained on ImageNet are transferred to the cell classification task are not reported, affecting the reproducibility of the results.
- Insufficient Scalability and Robustness Experiments: Although the authors emphasize ActSort's ability to handle large-scale datasets, the "large-scale" in the experiments is only 150GB, which is still far from the current terabyte-scale neural datasets. Moreover, there is no sensitivity analysis regarding different imaging parameters such as resolution, signal-to-noise ratio, and frame rate.
- Lack of Theoretical Analysis on Active Learning Batch Size and Convergence: The authors simply set the batch size to 1 without discussing its relationship with convergence speed and generalization performance. Particularly, the setting of the hyperparameter lacks theoretical basis and sensitivity analysis. The convergence of active learning sampling is challenging both theoretically and practically, but the paper lacks necessary analysis and discussion on this.
- Unknown Applicability to Other Types of Neural Activity Data: The authors only tested on calcium imaging data, but neural electrophysiological data, such as in vitro patch clamp and in vivo multi-channel recordings, differ significantly from calcium imaging in terms of morphology and spatiotemporal resolution. Can ActSort be applied to these data types? Would feature redesign be necessary? These questions need validation.
- Lack of Analysis on Annotator Variability: The differences in knowledge background and experience among annotators can introduce annotation biases, affecting the performance of the trained classifier. The authors use annotations from multiple experts in the experiments but do not analyze the variability among experts and its impact. This variability itself is an important research issue.
Questions
Regarding the algorithmic issues mentioned above, I have some technical questions for your reference:
- Can you provide a detailed statistical analysis of these 76 handcrafted features? How is the discriminative power of these features objectively evaluated?
- What is the label distribution of the 160,000 candidate cells annotated by domain experts? How do you address the class imbalance problem?
- How were the hyperparameters for the comparative methods (Random, CAL, DAL, etc.) chosen? Were they selected in a way that might favor ActSort?
- In Figure 4D, why does ActSort outperform human performance with just 1% of the data? Does this imply poor quality in human annotations?
- Besides logistic regression, what other classifiers were attempted? How did they perform?
- What are the specific details of the pre-training and transfer learning experiments? How significant are the generalization performance differences across different datasets and annotators?
- How was the batch size in active learning chosen? Have you considered querying the same sample multiple times to mitigate annotation noise?
- How is the ground truth defined in the experiments? How are samples with significant disagreement among experts handled?
Limitations
Overall, this paper represents a substantial amount of work and provides executable software and code, which is of significant importance to the neuroscience field. My concerns are detailed in the shortcomings and issues section. Here, I will focus more on the scope of the paper. If I were reviewing for Nature Methods, I would choose to accept this paper. However, I am uncertain whether this paper will attract widespread interest from the NeurIPS community. Therefore, I am giving a marginal score and would like to see the discussion from other reviewers before providing my final score.
Thank you for your detailed report and excellent summary! We truly appreciate your vote of confidence in our work and the excellent suggestions for improving our technical contributions. Thanks to you, our paper now includes several diverse, convincing control experiments. To save space, we refer to weaknesses with “W” and questions with “Q”, please excuse us.
W1 and W2 Please see our general response regarding the fit to venue and the comparison to prior work. To further address this concern, we performed two additional control studies. First, an ablation study removing the adaptive estimation process showed that DCAL was significantly less effective without it (no space in the PDF, but we are happy to elaborate). The second study, shown in Fig. N5, demonstrates the evolution of the weights over time. Also see Fig. N3 for additional evidence supporting the necessity of DCAL.
W3 Great idea! We added a new experiment to address this (Fig. N8). Please refer to the “New experiments” section in the general response. In short, we tried ResNet-50 for feature extraction from movie frames and cell extraction outputs, which resulted in suboptimal features.
W4, Q7, and Q8 Excellent point! We evaluated manual annotation accuracy in Fig. S4, Tables S1, S2, and S3. We used four annotators per dataset, with majority vote as ground truth, to mitigate inconsistencies and annotation noise (See Fig. S3 for the evaluation process). Human annotators could revisit samples multiple times, while ActSort trains on the final annotations.
To address annotator reliability, we performed intraclass correlation analysis. Individual annotators were inconsistent (ICC = 0.56±0.06), but classifiers trained on these annotations mitigate human bias and were more consistent (ICC = 0.64±0.07). Majority votes across annotations were quite consistent (ICC = 0.79±0.05). This shows individual annotators are unreliable, but majority votes (In experiments, only one annotator rates!) approximate ground truths well.
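For transparency, an intraclass correlation of this kind can be computed from a long-format annotation table, for example with the pingouin package. The table layout and the choice of ICC variant below are our assumptions, not necessarily the ones used in the paper.

```python
# Sketch of an annotator-consistency check via intraclass correlation.
# Assumed layout: one row per (cell, annotator) pair with a 0/1 label.
import pandas as pd
import pingouin as pg

def annotator_icc(annotations: pd.DataFrame) -> float:
    """annotations: columns 'cell_id', 'annotator', 'label' (0/1)."""
    icc = pg.intraclass_corr(data=annotations, targets='cell_id',
                             raters='annotator', ratings='label')
    # ICC2 = single rater, absolute agreement (one common choice).
    return float(icc.set_index('Type').loc['ICC2', 'ICC'])
```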
W5 We compare the classifier’s prediction with single annotators across the entire dataset, NOT just a small subset. Also, as evident from the limit of 100% annotation, the classifiers do outperform humans on trained samples.
W6 We used logistic regression for the cell classifier to ensure real-time prediction speed; it converges to the global optimum of a convex objective. The regularization effect is analyzed in Fig. S7, and a new hyperparameter analysis on the classifier threshold is in Fig. N4. AL convergence is depicted in Figs. 4, 5, S6, S8, S9, S10, and Tables S5, S6, S7. We now include a control using ImageNet-pre-trained general features (Fig. N8), in which the engineered features demonstrated significantly higher AUC than the deep learning features.
W7 Good point! The process depends on cell numbers, not movie size. We added: “...(data compression) resulting in approximately 270±90 MB per 1,000 cells (mean ± std over 5 mice) data sizes.” A TB-scale movie with 10,000 cells (for example, the movie from Ebrahimi et al., 2022) can be compressed to less than 3 GB. The five imaging datasets had different imaging conditions (1p and 2p), apparatus, and frame rates (20, 30, and 50 Hz), demonstrating feature robustness, which is illustrated by the fact that ActSort works on both 2p and 1p movies without modification.
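As a quick worked check of the scaling claim above (using the reported mean compression rate and assuming it extrapolates linearly to a 10,000-cell recording):

```latex
0.27\ \tfrac{\text{GB}}{1{,}000\ \text{cells}} \times 10{,}000\ \text{cells} = 2.7\ \text{GB} < 3\ \text{GB}.
```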
W8 and W9 These are outside our scope. We use linear classifiers with instant training, making a batch size of one (a desirable property for mimicking real-world cell sorting) attainable. Other data types have not reached the one-million-neuron scale recently attained in calcium imaging and are thus not of interest here.
W10 This key conclusion motivates ActSort. Human annotators are inconsistent, especially with large datasets. Thus, providing 4 annotations per dataset is a key contribution of our benchmark and evaluation. For instance, for historical benchmarks like Neurofinder (2p, hundreds of neurons, one annotator), it was later shown in Suite2p that the experts had missed many cells.
Q1 We added an illustration explaining the absolute discriminability index (Fig. N1). We also added statistical quantification of traditional and new feature effect sizes (Fig. N2). We also updated the math for all features accordingly.
Q2 The label distribution of the 160,000 cell candidates is shown in Figs. S6, S8, and S10. We tested our algorithm on imbalanced datasets with more true positives (Figs. S6 and S10) and artificially inflated datasets with many false positives (Fig. S8). Our results are consistent, demonstrating our approach’s robustness to imbalance.
Q3 The Random query algorithm has no hyperparameters. CAL, DAL, and DCAL use the same cell (and/or label) classifiers to predict cell probability. We included a sweep over regularization parameters (Fig. S7) and a new experiment on classifier threshold sweeping (Fig. N4).
Q4 This indicates 1) humans are inconsistent over time, 2) humans get tired after sorting 10k cells, and 3) the active learning algorithm selects representative samples. Points 1 and 2 are supported by new intraclass correlation results, and point 3 is illustrated in Fig. 3. In short, linear classifiers primarily care about boundary samples; once those are found, classification is mostly optimal.
Q5 Other algorithms would be harder to judge (e.g., random forest, neural network) and would not allow instant training, which is needed for real-time human annotation. Convexity of LR ensures reproducibility, unlike random forest classifiers or neural networks.
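To make the design choice concrete, here is a minimal sketch (our own illustrative naming, not ActSort's actual code) of the retrain-after-every-annotation loop that a convex linear model keeps cheap and reproducible:

```python
# Minimal sketch of closed-loop retraining with a convex linear classifier:
# after each new annotation, refit on all labeled features and refresh the
# predicted cell probabilities shown to the annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_cell_classifier(labeled_features, labels, C=1.0):
    """labeled_features: (n_labeled, n_features); labels: 0/1 annotations."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(labeled_features, labels)
    return clf

def cell_probabilities(clf, candidate_features):
    # Probability that each remaining candidate is a true cell, reused by the
    # GUI feedback and by the active-learning query strategy.
    return clf.predict_proba(candidate_features)[:, 1]
```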
Q6 We apologize for the confusion. We should not have called this pretraining and will update the text as “prelabeling”. Details of the fine-tuning process are in Algorithm S1 and Appendix C.2.
Thank you for your detailed review. We implemented your feedback whenever applicable, which improved our manuscript tremendously! If you have additional concerns, we look forward to discussing them further. If your concerns are addressed, would you consider increasing your scores?
Thank you for providing such a detailed response to my review. Your reply has addressed most of my questions and concerns, and I'm impressed with the thoroughness of your response.
First, I want to emphasize that I've always highly appreciated the quality and potential impact of this work. My main concern previously was about the fit within the scope of NeurIPS. After reading your response and considering the opinions of other reviewers, I believe this issue has been well addressed.
I'd like to highlight a few points:
- Your additional ablation studies and control experiments effectively demonstrate the necessity and effectiveness of the DCAL method.
- The attempt to use ResNet-50 for feature extraction and compare its performance is a good idea. Although it didn't outperform your method, this comparison is valuable.
- Your detailed analysis of manual annotation accuracy, especially the ICC analysis, is very helpful in understanding the data quality.
- You've clearly explained the human performance comparison issue and provided additional insights into classifier performance.
- The extra details you've provided on experimental setup and hyperparameter choices are also beneficial.
Given the quality of your response, the additional work you've done, and my high regard for the research itself, I've decided to increase my score to 5. I believe these improvements not only enhance the technical depth and contribution of the paper but also further demonstrate its relevance and importance to the NeurIPS community.
Thank you again for your efforts and commitment to improving the paper.
Thank you very much for your kind words, time, and consideration. We appreciate your encouraging words and please let us know if you end up having additional questions!
The paper proposes an active learning framework for improving the accuracy of cell sorting in calcium imaging datasets. The method rests upon three main components: (1) Preprocessing module which uses an already existing cell segmentation algorithm and reduces the size of the dataset using a set of engineered features. (2) Cell selection module which allows the annotator to visualize the features of specific detections and label them as cells or no cells. (3) Active learning module which trains a cell classifier (whether or not a detection is a cell or not) and a label classifier (whether or not a cell is labeled or not) and uses a discriminative confidence-based strategy to select the next cell for annotation. The authors show improved results over random and mere confidence-based strategies and demonstrate that the model surpasses human performance with a small number of annotations. A benchmark cell sorting dataset and software are also accompanied by the paper as other parts of the contribution.
Strengths
- The problem considered is significant for the neuroscience community. Improved cell sorting methods could save significant amounts of time for experimentalists and allow them to spend their time on more specialized tasks.
- The paper is written very clearly. I appreciate the clarity of the introduction and motivation of the paper. The summary of the contributions is fair and straightforward and solves a well-defined and well-motivated problem.
- The figures are informative and professional. They include the necessary information in the main text and, wherever the extra information is not crucial, the information is presented in the supplementary.
- The results are thorough and significant. The methods show clear advantages of different components of the contributions (features used and DCAL active learning strategy) across multiple scenarios.
Weaknesses
- The main weakness of the paper is the technical part which does not offer a novel contribution. In principle, the fact that a simple model leads to the presented improvements is a strength of the paper. But given the simplicity of the technical parts of the paper one might argue that a specialized field journal could be a better venue for such a contribution than NeurIPS.
- There are no comparisons performed against any other method and all the comparisons are against different versions of the same model. What is the state of the art for this problem and what are the competing methods? How do they compare to your proposed method?
Questions
- It appears that the preprocessing step selects a number of cells that are then used in the software for the human annotators to label as cell or not cell. Therefore if the preprocessing step (e.g. EXTRACT algorithm) misses some of the cells there’s no recovery. Can the authors include a discussion of this in the paper?
- In addition to this, if there are imperfections such as residual motion in the videos the preprocessing step might mark a cell as multiple cells across different frames of the video (a common issue in tracking which requires stitching tracklets). Is there anything in the presented framework that allows for the recovery of these mislabelings? Related to this, if there's motion in the videos, other metrics developed in the multi-object tracking community are used to assess the quality of identifying a cell and maintaining its identity throughout the video. How does your framework deal with cell identity losses? Are these issues redirected to the preprocessing step?
- This question might be a high-level yet naive question given that I haven't caught up with the latest advances in cell sorting literature. How does the method compare against, say, a fully supervised approach where you collect multiple datasets and train a large model (say a vision transformer or a UNet) for sorting the cells? Given the advances in AI and the foundation models, this seems to be a natural direction to pursue.
- How does the classifier compare with an architecture well-suited for vision that takes in a zoomed-in crop of the image (or video) of a cell and predicts if it's a cell or not? This control would be important to support the argument for engineered features and a simple classifier.
Limitations
The technical novelty of this paper is limited. The simplicity of the methods is a double-edged sword improving the readability yet making the paper a less than ideal candidate for a technical conference such as NeurIPS.
We sincerely thank you for your time and effort in reviewing our manuscript and providing insightful feedback. We have addressed your concerns regarding where ActSort stands within the entire calcium imaging processing pipeline and explained our fit for the NeurIPS venue in the General Response. Note that we omit citations from text excerpts due to character limits.
Prior work: Great point! We should have done a better job of explaining that 1p movies are non-standard and lack specificity. As a result, there is no published baseline in the field for the problem we are aiming to solve. We are introducing the first benchmark on these types of movies. Current methods for curating out false positives rely entirely on human labor; other pipelines use simple thresholding based on what we call traditional features (CAIMAN, ICA, EXTRACT) or classifiers defined on these features (Suite2p). Please also refer to the “The concerns about comparison to prior work” section in the general response for more detailed explanations.
Imperfections in the data, such as motion correction, cell extraction, stitching, and cell identity: You are absolutely right about all these concerns. To recap, there are two distinct problems when processing calcium imaging movies.
The first problem is cell extraction, in which the motion should be corrected, cells should be properly identified, cells suspected of duplication should be stitched or removed, and/or cell identity should be faithfully tracked across sections. The field has been working on these problems for over two decades.
ActSort comes as a quality control pipeline (the second problem) on top of the existing pipelines designed to address the aforementioned problems. Traditionally, quality control on these outputs was done manually by experimentalists, who went over each cell with custom software. However, with thousands to millions of neurons being recorded in a single session nowadays, automated quality control has become necessary. Yet, we should have addressed this in the discussion for the broader ML community, which we now do as follows:
“The quality control process focuses on identifying the true positive samples correctly while also correctly rejecting the true negative samples that are misidentified by the cell extraction algorithm. [...] However, since ActSort is a quality control algorithm, it requires that movies are motion-corrected and cells are extracted with the experimenter's favorite cell extraction algorithm, therefore, any mistake made in the cell extraction process will propagate to the cell sorting process.” and “One limitation of ActSort is that it relies on correct cell extraction by the cell extraction algorithm used by the experimenter. Future additions to ActSort could mitigate some of the common errors, such as the ability to merge segmented or duplicated cells.”
If the preprocessing step misses some of the cells there’s no recovery: You make an excellent point! Although EXTRACT and ActSort both fall within the calcium processing pipeline, they have different purposes. ActSort is a standalone pipeline for QUALITY control instead of cell finding. We modified the discussion section based on your suggestions as: “ActSort is a novel standalone pipeline for quality control that can be added as a further step after using any cell extraction algorithm such as EXTRACT, Suite2p, ICA, CAIMAN, and many others (Pachitariu et al., 2016; Ren et al., 2021; Giovannucci et al., 2019).”
Using foundation models: Wonderful question! Let us illustrate the shortcomings of foundation models with two examples from different, but related, problems:
Cell extraction: A foundational model, Cellpose, was developed for segmentation from images but is not used for cell extraction from videos. The main reason is that brain recordings, particularly the 1p imaging videos, are very diverse in their backgrounds, imaging and experimental conditions, and cell shapes/types/sizes. Therefore, even for cell extraction, the foundational models have not been successful to date.
Event extraction: Cascade (Rupprecht et al., 2021) utilized deep models to extract events from calcium activity traces in two-photon (but not one-photon) movies. The model was trained from scratch on a massive public dataset, and did not generalize to 1p movies due to aforementioned reasons.
Similarly, for cell sorting, a new model may need to be trained as more public benchmarks (ours being the first at such scale) become available.
Control with a deep classifier: What a great idea! To address this, we added an experiment using ResNet-50 for feature extraction and classification on 1p calcium imaging snapshots. Specifically, we provided the extracted cell profiles and the cropped movie snapshots, averaged over frames with the cells’ activities, to ResNet-50 and collected 2,000 features from the final layer. We repeated our experiment from Fig. 2 for this dataset (Fig. N8). The engineered features demonstrated significantly higher AUC than the deep learning features.
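For reference, the sketch below shows how such a control could be set up with a pre-trained ResNet-50 whose classification head is removed. How the profile and snapshot images are stacked into three channels, and the exact number of retained features, are assumptions on our part.

```python
# Illustrative sketch of the pre-trained-CNN control: embed per-candidate
# image crops with an ImageNet ResNet-50 and keep the penultimate features.
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()  # keep the 2048-d penultimate features
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed_crops(crops: torch.Tensor) -> torch.Tensor:
    """crops: (n_cells, 3, H, W) float tensor in [0, 1] of per-cell images."""
    return backbone(preprocess(crops))
```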
Technical novelty: We understand that the simplicity of the introduced method may not seem technical enough at first, but we kindly disagree. Please refer to the “Regarding fit to the venue and our contributions” section in the general response for further discussion. In short, we believe the impact of our technical contributions, no matter how simple (now that we have also shown the suboptimality of a classifier trained on ResNet-50 features), is significant, as neither CAL nor DAL can reach the success of DCAL, nor does DCAL without adaptive updates reach the same levels (new experiment; data not shown due to page limits, but we are happy to elaborate further).
Thank you for your helpful suggestions. Specifically, the deep net classifier was a very significant addition to our work, thanks to your feedback! If you believe we addressed your concerns, would you consider increasing your scores?
Thank you for the detailed discussion of the points I raised. Your rebuttal was quite helpful for me to reorient myself and place this contribution in the literature on information extraction from calcium videos.
I’m glad that you found my suggestion of replacing the feature extraction with a CNN helpful, and happy that it opened up the opportunity of the new experiment. In addition, I would like to give the authors credit for their constructive engagement in the discussion with all reviewers and for taking the comments to improve their work. Well done!
I’m happy to increase my rating. The one point that I still don’t fully agree with the authors is the appropriateness of the venue. While I agree with the published literature on neural information processing, the question here is whether the technical contribution together with the impact of the work makes it a suitable candidate for publication at NeurIPS. I decided to leave the answer to this question to the AC.
Thank you very much for your kind and constructive words! Your feedback was very helpful and helped us make our work better!
In this paper, the authors develop and open-source a software, ActSort, for cell sorting of large-scale calcium imaging datasets in neuroscience which integrates domain-expert features with an active learning framework. Alongside with the software, the authors provide a new benchmarking dataset which they use to evaluate their newly developed active-learning query algorithm that forms the backbone of ActSort. The active-learning based algorithm seeks to reduce and most efficiently use human annotator's time by interleaving automated cell classification with queries for human labels of most informative outlier boundary cells. The authors provide extensive benchmarking and evaluation of their approach across different real-world experimental conditions. In doing so, they demonstrate that ActSort significantly reduces the number of human-provided cell labels necessary to achieve sufficient true positive/negative rates, and thereby constitutes an important step towards alleviating the effect of this bottleneck in processing large-scale calcium imaging datasets.
Strengths
The presented work is excellent in originality, quality and significance, with sufficient clarity. The main strength of the paper is in its core contribution, the development of DCAL, an active-learning query algorithm that combines the advantages of confidence-based and discriminative active learning. This combination leads to the query algorithm selecting outliers (DAL) near the decision boundary (CAL) to be labeled by the human annotator.
Originality
The paper makes several original contributions to the field of cell sorting, the most valuable of which seems to be the 1) presentation of a novel active learning query algorithm that combines confidence-based and discriminative active learning and 2) the application of this algorithm to the problem of cell sorting in the form of 3) a GUI-based open-source software.
Quality
The described benchmarking datasets and the algorithm, as well as the experimental designs for empirical evaluation are of high quality, in that they are large-scale, well-motivated and well-executed, respectively.
Significance
This work is potentially very impactful in the field of systems neuroscience, if the claimed contributions, particularly the substantial reduction in the required number of human-labeled cells, mitigation of annotator bias, user-friendly design of software, and generalization capabilities hold true in deployment.
Clarity
The paper is well-motivated, clear in writing and provides extensive supplementary material. However, improvements are necessary (see Comments/Questions).
Weaknesses
The main weaknesses of the paper are a somewhat shallow discussion and a lack of accessibility for the putative audience, that is, experimentalists with less experience in reading technical papers.
Discussion:
Instead of a discussion, the authors provide a conclusion that reiterates the contributions of the paper. It would be better to instead discuss potential limitations of the presented methods; interpret some of the more surprising results (see questions); and discuss how much modification would be necessary to extend the approach to multi-class data and to data collected with different types of indicators (e.g. voltage imaging).
Accessibility:
What I particularly like about the paper is that all components of the proposed active-learning based query algorithms are well-motivated and interpretable. It would be valuable to make the components of the algorithm (i.e. the CAL and DAL components and the adaptive estimation of w) more accessible, by adding interpretation aids (verbal and visual) to eqs. 3 and 5.
Questions
Major comments and questions
Fig. 2: Please give more guidance in the legend for interpreting the plots (A: provide definition of absolute discriminability index); B: differences seem minimal. Did the increase in accurate rejection of false-positive candidates gained by inclusion of novel features come at the cost of increased false-negatives?
Fig. 3D: It seems surprising that feature distance does not differ substantially between the different dcal versions. Can this be explained by showing the evolution of w during the adaptive estimation process?
Minor comments and questions
ln. 149 following: this sentence is broken.
Fig. 3A: Provide legend indicating what the colours mean; visualize the decision boundary.
Limitations
As mentioned above, a discussion of limitations of the presented approach (perhaps in comparison to other approaches) is lacking from the discussion.
We appreciate your time and effort in reviewing our manuscript and providing invaluable feedback. You caught everything in the paper, your suggestions were extremely helpful, and your review led to new experiments that increased the readability and impact of our paper! To us, this was an extremely well written, helpful, and insightful review. Thank you!
Accessibility for experimentalists Thank you for highlighting this critical point while evaluating our work! We have prepared a user manual and lecture videos (with tutorials) for ActSort, but due to the anonymity requirements, we cannot share them yet. We are committed to providing a user-friendly introduction to ActSort, which will hopefully ease the experimental neuroscientists into the software. Currently, ActSort has a small user base of 20+ researchers globally (mainly collaborators), and we are actively collecting their feedback to further improve the user experience.
Discussion section The reviewer is absolutely correct. Thanks to the additional page provided after the revisions, we now added a new discussion section. We are happy to share the full text if requested, but for brevity, we will list the discussion bullet points below:
- ActSort is an active learning accelerated cell sorting pipeline for large-scale 1p and 2p calcium imaging datasets, which is compatible with any cell extraction algorithm (CAIMAN, Suite2p, EXTRACT, ICA, etc.).
- Since ActSort is a quality control algorithm, any mistake made in the cell extraction process (for example, missing cells and/or motion artifacts) would automatically propagate to the cell sorting process. Though we have plans to incorporate merging of duplicate cells in the future, the current version does not support this. Thus, we require that movies are motion corrected and cells are properly extracted with your favorite cell extraction algorithm.
- ActSort makes no assumption on the type of behavioral experiment, the Ca2+ indicator, the imaging conditions; and is robust to variations thanks to its standardized features.
- Though currently validated for calcium imaging experiments, ActSort can be minimally modified (perhaps with the addition and subtraction of new features) to be applied to newly emerging technologies such as voltage imaging and/or multi-class datasets (for example, including dendrites).
- Our optimally compressed data format has implications for sharing Ca2+ imaging datasets publicly. To date, only post-processed activity traces have been shared, but with our format, movie snapshots can now be shared too, allowing end users to check the quality of the data.
If you have additional comments, please let us know!
Clarification for query algorithms: Great suggestion! We added an explanation to equation (3) as “The uncertainty-score represents the uncertainty regarding the sample's position relative to the decision boundary. The higher the score, the closer it is to the decision boundary. The discriminative-score represents the uncertainty regarding whether the sample faithfully represents the full dataset. Higher scores indicate the sample is unique and underrepresented by the labeled dataset.”
We added an explanation to equation (5) as “The adaptive weight approaches the predefined weight if the label classifier correctly differentiates between labeled and unlabeled data, indicating that there are unique samples in the unlabeled dataset. The weight approaches zero if there are no underrepresented samples in the unlabeled dataset, resulting in the query algorithm selecting only the boundary cells, as the CAL algorithm does.”
To address your concerns about adaptive updates, we added a new figure (Fig. N5). In the beginning, when only a few samples are sorted, there is an initial drop in the weight values. This drop occurs because the label classifier fails to differentiate between labeled and unlabeled samples. This indicates that initially, the CAL component dominates the sample selection process. As the process continues, the DAL component starts to take over, selecting unique and underrepresented samples from the unlabeled dataset, and finally converging back to CAL again.
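As an illustration of the mechanism only (not the paper's exact equations (3) and (5)), a heavily simplified DCAL-style query could be sketched as follows; the blending form and the exact role of the weight are our assumptions.

```python
# Simplified DCAL-style scoring: blend a boundary-uncertainty score from the
# cell classifier with a "looks unlabeled" score from the label classifier,
# then query the highest-scoring candidate.
import numpy as np

def dcal_scores(p_cell, p_unlabeled, w):
    """p_cell: cell-classifier probabilities for unlabeled candidates;
    p_unlabeled: label-classifier probability that a candidate is still
    underrepresented by the labeled set; w in [0, 1] weights the
    discriminative component."""
    uncertainty = 1.0 - 2.0 * np.abs(np.asarray(p_cell) - 0.5)  # 1 at boundary
    return (1.0 - w) * uncertainty + w * np.asarray(p_unlabeled)

def next_query(p_cell, p_unlabeled, w):
    return int(np.argmax(dcal_scores(p_cell, p_unlabeled, w)))
```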
Questions about Fig. 2: Thank you for pointing out the confusion regarding the absolute discriminability index. We added an illustration to explain its definition in the PDF (see Fig. N1). The absolute discriminability index is a metric used to quantify the effect size between two distributions. If the two distributions can be easily told apart, we will obtain a higher d' value. The two distributions here are feature values conditioned on whether a sample is a cell or not. Additionally, we included a statistical analysis comparing the effect sizes of traditional features and new features (Fig. N2).
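For concreteness, one standard form of such an index for a single feature (whether ActSort pools the variances exactly this way is an assumption on our part) is

```latex
d' = \frac{\left|\mu_{\text{cell}} - \mu_{\text{not cell}}\right|}{\sqrt{\tfrac{1}{2}\left(\sigma_{\text{cell}}^{2} + \sigma_{\text{not cell}}^{2}\right)}},
```

where $\mu$ and $\sigma^{2}$ denote the mean and variance of the feature conditioned on the annotation.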
The illustration in Fig. 2B makes the difference seem minimal; the AUC changes from 0.94 to 0.97. We will update the figure to include a zoomed-in version. We added a sentence explaining that the increase in true negative rate did not sacrifice the true positive rate: “Re-analyzing the dataset from Fig. 2B, we found that the new features increased the rejection accuracy from 67% to 75% (Fig. 2B) without decreasing the accuracy of accepting true positives (97% to 97%), leading to more effective separation between cell and not-cell samples in our benchmarks (Fig. 2A).”
Question about Fig 3D: Wonderful suggestion! To address this point, we changed the feature distance measurement from Euclidean distance to cosine distance, which has a much better dynamic range (Fig. N3).
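For readers unfamiliar with the metric, the cosine distance between feature vectors $u$ and $v$ is

```latex
d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert\,\lVert v \rVert},
```

which is bounded in $[0, 2]$ regardless of feature scale, unlike the Euclidean distance.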
Typos: Thanks for pointing out our typos! We will revise the writing! The Fig 3A legend is placed under Figure B-D. We moved the legend above to make it more accessible for the reader as you suggested.
Thanks to your feedback, we were able to improve the clarity of our work. We hope that you are satisfied with our responses and consider supporting our submission with a very strong acceptance!
I thank the authors for their detailed reply to points raised.
I would like to ask the authors to share their updated Discussion section.
Did the authors visualize the decision boundary in Fig. 3A?
While I appreciate additional figures/analyses N3 & N5, I am not convinced of their added benefit. N3: what is the reasoning behind changing the distance metric? I also don't really see a difference in the previous and new plots. N5: This should maybe be plotted on a log scale.
Dear Reviewer,
Thank you for reading our rebuttal and your continued engagement. Please find below our responses (citations omitted and certain sentences shortened):
Discussion Section:
"In this work, we introduced ActSort, a user-friendly pipeline consisting of 3 modules: a preprocessing module, a cell selection module, and an active learning module.
The preprocessing module efficiently reduces the data size associated with neurons down to a few GBs, which includes not only the cells' Ca activity traces, but also spatial profiles and movie snapshots during \cm events (allowing end users to check the quality of the data), together with the engineered features. This compact representation allows the ActSort pipeline to be run locally on laptops, despite original movie sizes of up to TBs. The joint compression of the movie snapshots and cell extraction information offers another key contribution to the Neuroscience community, with implications for sharing imaging datasets publicly.
The cell selection module features a custom design with an easy-to-use interface that displays temporal, spatial, and spatiotemporal footprints, and incorporates a closed-loop online cell classifier training system. During the annotation process, the software provides real-time feedback by displaying predicted cell probabilities and the progress made by the human annotator as well as the fraction of unlabelled cells that ActSort is confident about.
The active learning module works in the background, strategically selecting candidates for human annotations, and trains cell classifiers with annotated candidates. To make our pipeline easily accessible and used by the Neuroscience community, with this work, we are providing a user manual and video tutorials.
Previous work on quality control for two-photon movies can be used, though less accurately, on one-photon movies. One-photon datasets face challenges like low resolution, high background noise, and reduced specificity. Our solution, ActSort, addresses both brain-wide one-photon and two-photon datasets through standardized features that are robust across various behavioral experiments, \cm indicators, imaging conditions, and techniques.
ActSort can be added as a quality control step after any cell extraction algorithm, such as EXTRACT, Suite2p, ICA, CAIMAN, and others. The quality control process focuses on correctly identifying true positives and rejecting true negatives misidentified by the extraction algorithm. ActSort surpasses human performance in true positive and true negative rates by annotating less than 3% of samples. However, since ActSort is a quality control algorithm, it relies on motion-corrected movies and accurate cell extraction, as any mistakes in extraction propagate to sorting.
To support the development of active learning algorithms in systems neuroscience, we introduce the first publicly available benchmark for cell extraction quality control on both one-photon and two-photon large-scale calcium imaging, comprising five datasets: three one-photon and two two-photon \cm imaging datasets, with approximately 40,000 cells and 160,000 annotations (each dataset annotated independently by four annotators). This dataset is unparalleled in the public domain.
One limitation of ActSort is that it relies on correct cell extraction by the cell extraction algorithm used by the experimenter. Future additions to ActSort could mitigate some of the common errors, such as the ability to merge segmented or duplicated cells. Future directions also include the exploration of oversampling techniques without hurting the true positive rate, more sophisticated cell classifier architectures, and batch sampling by the query algorithm.
Furthermore, while ActSort was validated with 1p and 2p datasets, it is a standing question whether it could be applied to newly emerging technologies such as voltage imaging and/or multi-class datasets (for example, including dendrites) with the current features, or with slight modifications such as addition and subtraction of new features, which can be explored in future work.
Another important aspect requiring further exploration is the expertise level of the annotators. Specifically, the six annotators had different levels of expertise in working with \cm imaging datasets. Hence, a potential moderation relationship may exist depending on the expertise of the annotator, which is left as future work."
Fig. 3A visualization: Yes! We (approximately) visualize the region comprising of the boundary cells, as well as the boundary itself.
Fig N3: We find cosine similarity to be more interpretable, and it has a better dynamic range. We would love to hear your thoughts, though!
Fig. N5: Yes, agreed. We will update the figure to be log-scale in the fraction of annotations.
Once again, thank you for your time, consideration, and support. We look forward to hearing your thoughts on our updated discussion!
Thank you for sharing the discussion. I still would like to see more discussion of what would be necessary to extend ActSort to other data modalities (and maybe less repetition of the strengths of ActSort) and more extensive comparison to existing work (the authors did this in the rebuttal).
Overall, I support acceptance of this piece of work.
Thank you very much for your support of our work and additional feedback. Indeed, we will go over our rebuttal one more time before we finalize this work and make ALL the changes we committed to. We think all suggestions were helpful and necessary. We will share the changes we made to address your remaining concerns, but we do not expect a response from the reviewer if they are content with the rebuttal.
We removed the first three paragraphs of the discussion, so the strengths of ActSort are no longer repeated. For the questions you raised, please find the relevant full (unshortened) paragraphs below:
Existing work
"ActSort comes as a standalone quality control software, which can be used to probe the outputs of cell extraction pipelines such as EXTRACT \cite{inan2021fast}, Suite2p \cite{pachitariu2016suite2p}, ICA \cite{mukamel2009automated}, CAIMAN \cite{giovannucci2019caiman}, and others \cite{zhou2018efficient,cnmf,chen2023hardware,roi}. These cell extraction algorithms take the raw \cm movie as their inputs, correct the brain motion, perform spatial and temporal transformations to standardize the \cm imaging movies, identify putative cells' spatial profiles and temporal activities, and often perform primitive quality controls to output the final set of neural activities.
Historically, additional quality controls on the cell extraction outputs would be performed with manual annotation, which was feasible for the small neural recordings with hundreds of neurons \cite{marzahl2020fast, salvi2019automated,amidei2020identifying, wang2021annotation,schaekermann2019understanding,corder2019amygdalar}. Yet, with the advent of large scale \cm imaging techniques, now recording up to one million cells \cite{manley2024simultaneous}, manual review became unrealistic. Instead, the field of experimental neuroscience direly needs automated quality control mechanisms that would correctly identify the true cell candidates while rejecting true negatives misidentified by the extraction algorithms.
As discussed above, ActSort is the first scalable and generalizable solution in this direction. Yet, previous works (as parts of existing cell extraction pipelines \cite{pachitariu2016suite2p,giovannucci2019caiman,inan2021fast}) have tackled this problem in specific instances: Suite2p designed cell classifiers based on basic features to increase the precision of the algorithm \cite{pachitariu2016suite2p}, CAIMAN pre-trained a deep classifier for two-photon movies \cite{giovannucci2019caiman} (though it is not applicable to 1p \cm imaging movies \cite{caiman_demo_pipeline_cnmfE}), whereas EXTRACT performed thresholding on a set of quality metrics \cite{inan2021fast}. Notably, these existing automated methods with pre-trained cell classifiers often found success only for high-quality 2p \cm imaging movies \cite{pachitariu2016suite2p,giovannucci2019caiman}, and even then underperformed human annotators \cite{giovannucci2019caiman}. One-photon \cm imaging datasets, on the other hand, are quite diverse in their imaging (miniscope vs. mesoscope) and experimental (head-fixed vs. freely behaving) conditions and face additional challenges due to low resolution and high background noise. With ActSort, we sought a generalizable solution that does not target a specific modality or require re-training, but instead uses interpretable features that are robust across various behavioral experiments, \cm indicators, imaging conditions, and techniques.
To provide a baseline for existing methods in our benchmarks, we designed a feature set called "traditional features" (Fig. \ref{fig:fig2}), including features used by classifiers in these prior works (see Appendix \ref{sec:feature_engineering}). Moreover, these methods did not use active learning, instead annotating only random subsets. Thus, in our experiments, these prior methods (or, more precisely, a plausible upper bound on them) are represented by the random sampling query algorithm, which uses the full feature set to allow fair comparison to the active learning approaches."
(Continued below)
Future work:
"Though we performed extensive analysis to highlight the efficiency and effectiveness of ActSort, our work is merely a first step for what may hopefully become a fruitful collaborative subfield comprising of experimental neuroscientists and active learning researchers. There are several future directions that future work could improve upon, which we will briefly summarize below.
Firstly, in this work, we used linear classifiers for rapid online training during cell sorting and for reproducibility of the annotation results. This choice was mainly rooted in the fact that pre-training deep networks requires substantial data and standardization across various \cm imaging movies. We believe that with additional public datasets that may follow our lead, this direction can become a reality (as was the case for a different, yet relevant, problem of spike extraction \cite{rupprecht2021database}). Our results in this work set a strong baseline for such future deep-learning approaches.
One limitation of ActSort is that it comes as a quality control pipeline for existing cell extraction approaches. Therefore, any mistakes in the cell extraction step automatically propagate to cell sorting. Yet, some of these mistakes can be mitigated or highlighted post hoc. For instance, future ActSort versions could offer options for merging over-segmented or duplicated cells, or for identifying motion in activity traces and thereby notifying the user to improve their preprocessing steps.
Another important aspect requiring further exploration is the expertise level of the annotators. To date, each \cm imaging movie is often annotated by a single annotator, who unfortunately can tire after long hours and become inconsistent, as we have discussed above. This was the reason behind our choice to have the movies in our benchmark annotated by multiple researchers. Yet, the human annotators had different levels of expertise in working with \cm imaging datasets. Hence, a potential moderation relationship may exist depending on the expertise of the annotator. Fully exploring this relationship requires further research with more annotators per dataset, having the same annotators sort the same cells at different times, and/or testing sorting effectiveness before and after a standardized sorting training.
Finally, in this work, we validated ActSort on one-photon and two-photon \cm imaging datasets and for sorting binary classes. Yet, our framework is generally applicable to other modalities, provided that additional feature engineering is performed by domain experts in those fields, or that a pre-trained deep network supplies the features. For instance, by replacing the logistic regression with a multinomial version and approximating DAL scores with entropy instead of decoder scores, our work readily applies to multi-class datasets that may jointly include, e.g., dendritic and somatic activities. With appropriate features, the framework we introduced here should also be helpful for the newly emerging voltage imaging technologies, especially as the number of cells in such movies will inevitably increase with technological advances."
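To make the multi-class extension above concrete for this thread (our own minimal sketch for this response, not code from the manuscript; the classifier settings and variable names are assumptions), one could swap the binary classifier for a multinomial logistic regression and score uncertainty with entropy:

```python
# Minimal sketch of the multi-class extension described above; NOT the
# manuscript's implementation. Assumes every class (e.g., soma / dendrite /
# non-cell) appears at least once among the already-labeled candidates.
import numpy as np
from sklearn.linear_model import LogisticRegression

def multiclass_uncertainty(features, labeled_idx, labels):
    # Multinomial logistic regression replaces the binary classifier.
    clf = LogisticRegression(max_iter=1000).fit(features[labeled_idx], labels)
    probs = clf.predict_proba(features)                    # (n_candidates, n_classes)

    # Entropy of the predicted class distribution replaces the binary
    # confidence score; higher entropy marks more informative candidates.
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)
```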
With these additions, our work is now exactly 10 pages. We thank you very much for your time and support; and wish you a great week!
Many thanks to all the reviewers for their comments! We appreciate your time and effort. We have addressed all concerns with written edits to the manuscript, by performing the requested experiments (see attached PDF), and/or by citing relevant literature. Here, we would like to address some of the common concerns raised and further clarify the contributions of our work.
Comparisons to Prior Work ActSort is a novel post-cell-extraction quality control algorithm, not part of existing pipelines. Historically, manual annotation was feasible for small datasets, but with large-scale imaging techniques recording up to one million cells (Manley et al., 2024), automated quality control is essential, as manual review is unrealistic at such volumes. For instance, Manley et al. (2024) appear to have used thresholding based on quality metrics (not even classifiers!) rather than human annotation (https://github.com/vazirilab/MAxiMuM_processing_tools/blob/main/planarSegmentation.m).
Most algorithms, like CAIMAN and Suite2p, focus on 2p movies, whereas our work targets broader solutions, including 1p movies with lower resolution, higher background noise, and reduced specificity. Applying pre-trained classifiers from these algorithms to 1p movies is infeasible; even for 2p movies, these models are suboptimal (CAIMAN, Giovannucci et al., 2019). CAIMAN advises against using CNN classifiers for 1p data (https://github.com/flatironinstitute/CaImAn/blob/main/demos/notebooks/demo_pipeline_cnmfE.ipynb). Suite2p, validated only for two-photon movies, uses features and linear classifiers that are represented in our traditional features. Similarly, EXTRACT appears to use thresholding based on some features.
We designed a feature set called "traditional features," including features used by classifiers in these prior works, such as SNR and spike width. The CLEAN framework (https://bahanonu.github.io/ciatah/) employs predefined features to train cell classifiers, though CLEAN is unpublished and lacks public code. We included these features in our traditional set and compared them with our new features. Like the other prior methods, CLEAN did not use an active learning query algorithm and instead sampled cell candidates randomly. Our baseline therefore represents prior methods as random sampling and compares them with our novel query algorithm.
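For readers of this thread, a rough sketch of two such trace-based features is below. The formulas are simplified stand-ins (the exact definitions in the manuscript's feature-engineering appendix may differ), and the 20 Hz frame rate is a hypothetical default:

```python
# Rough sketch of two "traditional"-style trace features (SNR and event width);
# simplified stand-ins, not the manuscript's exact definitions.
import numpy as np

def trace_snr(trace):
    baseline = np.median(trace)
    noise = np.median(np.abs(np.diff(trace))) / 0.6745     # robust noise scale from first differences
    return (trace.max() - baseline) / (noise + 1e-12)

def event_width(trace, frame_rate_hz=20.0):
    peak = int(trace.argmax())
    baseline = np.median(trace)
    half = baseline + 0.5 * (trace[peak] - baseline)
    left, right = peak, peak
    while left > 0 and trace[left - 1] >= half:             # walk out from the peak
        left -= 1
    while right < len(trace) - 1 and trace[right + 1] >= half:
        right += 1
    return (right - left + 1) / frame_rate_hz               # width at half maximum, in seconds
```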
Fit to the Venue and Contributions Over many years, tools for “Neural Information Processing” (Prank et al., 1998; Pachitariu et al., 2013; Andilla et al., 2014; Inan et al., 2017; Giovannucci et al., 2017; Aitchison et al., 2017; Choi et al., 2020; Dinc et al., 2023) have been welcomed by the NeurIPS conference. As NeurIPS stands at the intersection of AI and neuroscience, it has a rich tradition of embracing both technical and conceptual novelties, as well as new computational tools for neuroscientists. We believe our work upholds this tradition by offering three main contributions:
- First Active Learning Benchmark for Cell Sorting: We introduce the first public benchmarks for cell extraction quality control on 1p and 2p imaging, with ~160,000 annotations. This dataset is unparalleled, and we kindly ask that reviewers consider its value for the active learning community as well.
- Framing Cell Sorting as an Active Learning Problem: We introduce and frame cell sorting as an active learning problem, providing essential tools for further studies, including features, datasets, and software.
- Novel Query Algorithm: We propose a novel query algorithm which, we believe, makes a significant contribution. While some reviewers found the technical novelty limited, we respectfully disagree; we would argue that technical novelty should be evaluated by its impact rather than its complexity. Our algorithm, reducing human effort from 100% to 1%, is a substantial technical advance, and we performed additional control studies to provide evidence for this point (see below). A schematic sketch of the query rule is given right after this list.
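For concreteness, the sketch below shows how a discriminative-confidence query rule can interpolate a confidence score with a discrimination score. It is a schematic for this response only, not the exact ActSort/DCAL code: the weight `w` is fixed here, whereas ActSort estimates it adaptively, and the variable names are hypothetical.

```python
# Schematic discriminative-confidence query rule; NOT the exact ActSort/DCAL
# implementation. Assumes both cell / not-cell labels already appear among
# the labeled candidates.
import numpy as np
from sklearn.linear_model import LogisticRegression

def query_next(features, labeled_idx, labels, w=0.5):
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(len(features)), labeled_idx)

    # Confidence term: uncertainty of the cell / not-cell classifier.
    cell_clf = LogisticRegression(max_iter=1000).fit(features[labeled_idx], labels)
    p_cell = cell_clf.predict_proba(features[unlabeled_idx])[:, 1]
    uncertainty = 1.0 - 2.0 * np.abs(p_cell - 0.5)          # 1 at the boundary, 0 far from it

    # Discrimination term: how unlike the already-labeled pool a candidate is.
    pool = np.zeros(len(features), dtype=int)
    pool[labeled_idx] = 1                                    # 1 = labeled, 0 = unlabeled
    disc_clf = LogisticRegression(max_iter=1000).fit(features, pool)
    unlikeness = 1.0 - disc_clf.predict_proba(features[unlabeled_idx])[:, 1]

    score = w * uncertainty + (1.0 - w) * unlikeness         # interpolate the two criteria
    return unlabeled_idx[int(np.argmax(score))]              # next candidate to annotate
```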
Summary of New Experiments We provide a summary of new figures and experiments addressing reviewers’ concerns. Additional details are in the reviewer responses and figure captions.
- Figs. N1 and N2: Added figures explaining the absolute discriminability index and summarizing effect sizes for all features.
- Fig. N3: Replotted Fig. 3D with cosine distance for better dynamical range.
- Fig. N4: Added sensitivity analysis of classifier thresholds.
- Fig. N5: Recorded DCAL weights throughout the cell sorting process, showing that adaptive estimation decreased sensitivity to initial conditions.
- Figs. N6 and N7: Added experiments on cell candidates from the ICA algorithm. These figures also showcase the necessity of cell sorting: even discarded garbage components can encode speed (perhaps due to brain motion or other contamination), which could lead to incorrect biological conclusions if not culled out.
- Fig. N8: We used ResNet-50 to extract a total of 2,000 features directly from the movie and cell extraction data, which we then used to classify the cell candidates. Our engineered features outperformed this approach, which was nonetheless surprisingly good at rejecting cells (outperforming traditional, but not our, features). As noted above, the fact that deep learning-based feature extraction was less successful on 1p movies is in line with the conclusions of prior research (CAIMAN, Cascade, etc.). Finally, we wish to emphasize that even this simple experiment (gathering features from average movie frames with ResNet-50) took roughly 30 minutes for 10,000 cells with an NVIDIA RTX 4090 GPU, and about 2 hours without one; this waiting time exceeds our design constraints. A minimal sketch of this feature-extraction step is shown just after this list.
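Minimal sketch of the feature-extraction step referenced in the Fig. N8 item (our own illustration for this response, not the exact experiment code; the source and size of the per-candidate crops, and the batching, are assumptions):

```python
# Deep feature extraction with a ResNet-50 backbone; not the exact experiment
# code. `crops` (per-candidate image patches from the average movie frame)
# are hypothetical inputs.
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.fc = torch.nn.Identity()              # drop the classifier head -> 2048-dim features
model.eval()
preprocess = weights.transforms()           # resize / normalize expected by the backbone

@torch.no_grad()
def deep_features(crops):                   # crops: list of (H, W, 3) uint8 numpy arrays
    batch = torch.stack([preprocess(torch.from_numpy(c).permute(2, 0, 1)) for c in crops])
    return model(batch).cpu().numpy()       # (n_candidates, 2048) features for the classifier
```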
The paper introduces a novel approach to cell sorting in large-scale calcium imaging datasets through a semi-supervised active learning framework called ActSort. The proposed method is well-motivated, addressing a significant bottleneck in the processing of such datasets, where manual annotation becomes impractical due to scale. The paper is structured around three primary components: preprocessing, cell selection, and an active learning module that leverages discriminative confidence to optimize cell annotation.
The reviewers noted that the method integrates existing knowledge, particularly through the use of hand-engineered features and simple classification models, and enhances these with an active learning strategy that reduces the need for extensive manual annotation. The authors have effectively benchmarked the method across various datasets and conditions, involving multiple domain experts, which strengthens the empirical validation of their approach.
In the rebuttal, the authors provided additional experiments and clarifications, particularly in relation to comparisons with prior work. They have successfully demonstrated that ActSort offers distinct advantages over existing cell-sorting algorithms, especially in its applicability to both one-photon and two-photon imaging data, addressing limitations where prior methods falter.
The inclusion of user-friendly software and extensive benchmarking supports the potential for broad adoption within the neuroscience community. The authors have also responded adequately to all major concerns raised by the reviewers, providing detailed justifications and expanding their experimental results.
Given the novelty of the approach, its clear practical utility, and the thoroughness of the validation, this paper makes a meaningful contribution to the field.