6.4

/10

Poster5 位审稿人

最低3最高5标准差0.6

3.4

置信度

创新性2.6

质量3.0

清晰度2.4

重要性2.6

NeurIPS 2025

scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery

Yiming Gao,Zhen Wang,Jefferson Chen,Mark Antkowiak,Mengzhou Hu,JungHo Kong,Dexter Pratt,Jieyuan Liu,Enze Ma,Zhiting Hu,Eric P. Xing

OpenReview PDF

提交: 2025-05-12更新: 2025-10-29

TL;DR

We present scPilot, a framework for omics-native reasoning in which a large language model converses, invokes bioinformatics tools, and iteratively explains its decisions to for single-cell analysis.

摘要

关键词

Large Language ModelsScientific ReasoningAutomation of ScienceSingle-cell TranscriptomicsCell Type AnnotationTrajectory InferenceGene Regulatory Network Inference

评审与讨论

审稿意见

评分: 4置信度: 32025-06-23

The work converts omics data into step-by-step reasoning problems for LLMs to solve. The addition of the raw omics data creates interpretable analyses and yields better results than direct LLM use. The work also introduces scBench, which is a benchmark that evaluates biological reasoning.

优缺点分析

The problem domain is one not frequently investigated. The benchmark on its own is a noteworthy contribution. The use of this benchmark to evaluate the reasoning capability of LLMs on biological tasks is important in the effort to create scientific reasoning models.

The scPilot framework, which consists of a problem-to-text converter, tool library, and tool planner, is fairly standard in related domains. The novelty comes from the explicit requirement to force the LLM to record its claims in a transparent trace and in its application to this domain.

There is a lack of examples for the reasoning trace, which is one of the design features of scPilot. Figure 3 provides the reasoning for a PBMC3k annotation task, but it is unclear if this is the full reasoning trace, or if there are additional natural language claims/justifications.

问题

Are the tools within the bio-tool library crafted beforehand, or is the tool agent able to dynamically add new tools into the library? In other words, is scPilot able to be used for other bioinformatics related tasks, or would that require manual modification to the tool library?
What are the data statistics for each of the datasets in the benchmark? I.e. number of data points, number of label types, and input feature distribution like lineage graph stats for the trajectory inference.
Was there any additional filtering performed on the benchmark datasets? For the collected annotations such as in the cell type annotation task, was the only qualification that the label be manually annotated?

局限性

yes

最终评判理由

The authors have provided a detailed response and an example of the reasoning trace. Most of the concerns have been addressed, and I would like to keep my score as 4.

格式问题

作者回复

2025-07-31

We sincerely thank Reviewer VK46 for their positive evaluation and for recognizing the key contributions of our work. We are especially grateful for the acknowledgment of:

Our Novel Domain and Foundational Benchmark: For highlighting that we are addressing a "problem domain... not frequently investigated" and that our scBench benchmark is a "noteworthy contribution" in the "important... effort to create scientific reasoning models".
Our Core Methodological Innovation: For identifying the specific novelty of our framework in its "explicit requirement to force the LLM to record its claims in a transparent trace" and its novel application to this domain.

(a) Novelty and Clarification on Reasoning Trace

We thank you for pointing out the need for more clarity on the reasoning trace. Figure 3 is a high-level schematic, and you're correct that the full trace contains additional, detailed natural language claims and justifications for each step. The value of this transparent trace is best shown through concrete examples of the reasoning process in action.

We kindly refer the reviewer to previous response (d) to Reviewer RBqf, response (c) (d) to Reviewer mMx1.

Cell-Type Annotation: Multi-Gene Logic**

scPilot's reasoning trace for this task records a propose -> filter -> analyze workflow that allows for sophisticated biological interpretation.

How it Works: The trace first records the LLM's initial claim (proposing potential cell types). It then logs the tool call and its output (e.g., noting that the key marker for Plasma cells, SDC1, is absent in the data). Finally, the trace includes the LLM's explicit justification for the final annotation, such as using the combination of NKG7, GNLY, and CD3D expression to distinguish NK cells from CD8 T-cells—a nuance often missed by simpler methods.
Value: This trace provides a fully auditable record of the claims, evidence, and justifications, allowing a scientist to manually check and understand each decision.

Trajectory Inference: Self-Auditing with Tool Diagnostics**

The reasoning trace for trajectory inference showcases an even more complex workflow involving self-correction.

How it Works: The trace logs the initial tree construction, the call to py-Monocle for a diagnostic report, and then the LLM's detailed justification for why it is making specific repairs to the tree (e.g., correcting the root, fixing the hepatic developmental ladder).
Value: This provides an interpretable record of how scPilot uses evidence from traditional tools to improve its own results, leading to a more accurate and biologically faithful output (e.g., reducing graph-edit distance by 6).

GRN Prediction: Contextual Reasoning**

For GRN prediction, the reasoning trace captures how scPilot integrates multiple data types to make a context-aware decision.

How it Works: The trace shows the framework assessing a potential TF-gene relationship by combining functional data from the Gene Ontology, gene expression context from the input data, and information from known regulatory pathway databases.
Value: This allows for true tissue-specific reasoning, and the trace makes it clear how the final prediction was weighted by this contextual evidence, moving beyond generic associations.

We will add these detailed examples of the complete reasoning traces to the supplementary material to fully illustrate this key feature of our work.

(b) Flexibility and Extendibility of Bio-tool Library

This is an excellent question about the framework's flexibility. To clarify, the tools within the bio-tool library are curated beforehand, and the agent does not dynamically add new tools during a run. As we explained in our response to Reviewer RBqf (c), this was a deliberate design choice to ensure our scBench benchmark is rigorous and reproducible, focusing on the LLM's reasoning capability rather than on tool discovery.

Regarding its use for other bioinformatics tasks, scPilot is designed to be modular and easily extendable, though this does require manual modification by a developer. As noted by Reviewer 82g2, the modular design of the "Problem-to-Text Converter," "Bio-Tool Library," and "LLM Planner" keeps each part’s responsibility clear and facilitates extension.

To adapt scPilot for a new task, a developer would simply need to:

Wrap the new bio-tool in a simple Python function that formats its output into the standardized JSON that scPilot expects.
Register the tool by adding a description (e.g., a docstring) to the library so the LLM planner understands its purpose and how to use it.

In summary, while our scBench implementation uses a fixed toolset for evaluation integrity, the underlying scPilot framework is intentionally designed to be an extensible platform, allowing developers to apply omics-native reasoning to a wide range of future bioinformatics challenges.

(c) Detailed Benchmark Dataset Statistics - already in supplementary

Thank you for this question. Detailed statistics for all datasets used in scBench—including the number of cells, number of cell types, and graph statistics for the trajectory inference tasks—are provided in Tables 2, 3, and 4 of our supplementary material.

(d) Data Filtering and Annotation Criteria

Our benchmark datasets were curated from high-quality, peer-reviewed studies without additional computational filtering to ensure they reflect established community standards.

Data Filtering

We did not perform additional computational filtering on the datasets. Our selection process focused on using data directly from high-quality, peer-reviewed publications where the data were already processed and filtered according to the standards of the original study. Instead, all datasets for scBench were curated and processed through a uniform, four-step pipeline to ensure quality and consistency.

Data Standardization and Compression: Raw data from high-quality, published studies is first standardized and stored. It is then compressed into a uniform text format that the LLM can process.
Ground-Truth Harmonization: The ground-truth criteria are stricter than simply "manually annotated." Annotations are mapped to standard databases like the Cell Ontology, and published lineage graphs are standardized to ensure a consistent, high-quality ground truth.
Prompt Sketch Creation: Task-specific converters generate compact JSON "sketches" that summarize key data (e.g., cluster statistics, marker genes) to fit within a single LLM prompt.
Packaging: Each task is bundled with its prompt sketch, data configuration, a programmatic grader, and instructions to allow any method to be run reproducibly.

This systematic pipeline ensures that scBench is built upon well-vetted data and expert-validated, harmonized ground truths.

Ground-Truth Annotation Criteria

Our criteria for ground-truth labels were stricter than simply being "manually annotated" and were tailored to each task:

Cell-Type Annotation: The labels were required to be expert-curated and published in a peer-reviewed study, ensuring they have been vetted by the scientific community.
Trajectory Inference: Datasets were required to include a human-curated lineage tree as ground truth within the original publication.
GRN Prediction: The ground truth is based on experimentally-validated TF-gene interactions from the trusted TRRUST v2 database.

This curation strategy ensures that scBench is a reliable and realistic testbed for evaluating scientific reasoning.

2025-08-06

Thank you for the detailed response. I appreciate the clarification on the tracing trace. Could the authors provide some examples of the trace to give a more intuitive understanding of what it looks like? Besides, how many tools have been used in the work? To improve the integrity of the paper, the authors should move the important information about the dataset to the main text (at least in a shortened form), as supplementary material should not be necessary for understanding the work as a whole during both reviewing and reading.

评论- Response to Follow-up Comment from Reviewer VK46

2025-08-06

Dear Reviewer VK46,

Thank you for your continued engagement with our work and for the follow-up questions. We appreciate the opportunity to provide further clarification on the reasoning trace and the specifics of our framework.

Example of a Reasoning Trace

Thank you for this excellent suggestion. To provide the intuitive understanding you requested, here is one condensed, human-readable example of scPilot's reasoning trace for a cell-type annotation task (due to the character limit). It illustrates how the LLM's biological hypothesis is tested and refined using data from the tool library. We will make such examples crystal clear in the next version of the draft.

Step 1: Establish Initial Biological Hypothesis (LLM Thought)

“Cluster 0 strongly expresses genes Cd79a, Cd79b, and Ighm, which are hallmark markers for B cells. However, we also see moderate expression of Cd52, which can be broadly present in various lymphocyte types. We must confirm the specificity to ensure correct annotation.”

Explanation: The LLM begins by recognizing specific marker genes indicating a certain cell type but remains cautious due to an ambiguous gene (Cd52) expression.

Step 2: Verify Marker Expression Specificity (scPilot Action)

Action: Query the scanpy tool's generated dotplot to precisely quantify the expression of canonical B-cell markers (Cd79a, Cd79b, Ighm) vs. the non-specific lymphocyte marker (Cd52) in Cluster 0.

Tool Output:

{
  "cluster": 0,
  "gene_expression": {
    "Cd79a": 2.35,
    "Cd79b": 2.10,
    "Ighm": 2.50,
    "Cd52": 0.60
  }
}

Explanation: scPilot directly consults quantitative gene-expression data from the bioinformatics tool to obtain evidence on marker specificity.

Step 3: Refine Annotation Based on Data-Driven Evidence (LLM Thought)

“The quantitative data clearly demonstrate strong enrichment of classical B-cell markers (Cd79a, Cd79b, Ighm) with only weak expression of the broader lymphocyte marker (Cd52). Given that Cd52 can occasionally appear in multiple lymphocyte lineages, its lower relative expression does not detract from a definitive B-cell identity.”

Explanation: The LLM integrates the tool's output, logically reasoning to confirm a specific biological identity by comparing relative gene expression levels.

Step 4: Contextual Cross-Validation Using Biological Knowledge (scPilot Action)

Action: scPilot cross-references known immune cell-type marker profiles.

Tool Output:

{
  "matched_profile": "B cells (Cd79a+, Cd79b+, Ighm+; occasional weak Cd52 expression noted in immature/transitional subsets)",
  "confidence": "High"
}

Explanation: This step ensures scPilot’s reasoning aligns with expert biological knowledge, enhancing confidence in the annotation.

Step 5: Final Biological Justification & Stable Annotation (LLM Thought)

“Both quantitative marker expression and biological knowledge confirm Cluster 0 as B cells. This cluster’s annotation can be confidently stabilized and finalized for downstream analysis.”

Explanation: The process concludes by explicitly documenting the confidence and justification for the annotation, ensuring clarity and interpretability.

Number of Tools Used in scPilot

Regarding your question about the tool library, scPilot utilizes three core bioinformatics tools for the scBench tasks:

Scanpy: For data preprocessing, cell clustering, and differential expression analysis.
py-Monocle: As a reference for trajectory inference and for providing diagnostic reports.
Gene Ontology (GO) database: As a reference for functional validation in GRN prediction.

This curated set was deliberately chosen to ensure robust, standardized analyses, allowing for a clear and reproducible evaluation of the LLM’s core reasoning capabilities.

Incorporating Dataset Statistics into the Main Text

Finally, thank you for the excellent suggestion to improve the paper's integrity. We agree completely. We will move the important dataset statistics from the supplementary material to the main text (in a shortened form) in the final version to ensure the paper is self-contained and more easily understood by readers.

We hope this detailed example and additional information fully address your questions. Please feel free to let us know if anything else is needed. We are very happy to leverage this discussion period to further improve the clarity and quality of our manuscript.

Thank you again for your constructive feedback.

Sincerely,

The Authors of Submission 23137

评论- Follow-up on "Response to Follow-up Comment from Reviewer VK46"

2025-08-08

Dear Reviewer VK46,

We would like to sincerely thank you again for your time and continued constructive dialogue regarding our submission.

Following our last response, we wanted to briefly check in to see if the detailed, step-by-step reasoning trace example and the additional clarifications on our tool library have successfully addressed your remaining questions.

Your feedback was very essential in helping us see that a concrete example was essential for conveying the transparency and practical utility of our framework's reasoning process. We hope our detailed response provided this intuitive understanding and fully clarified how the trace makes the model's decisions auditable and interpretable, which was a central point of your review.

We would be very grateful to know if these clarifications have helped resolve your concerns. We are eager to incorporate your feedback to ensure the final version of our paper is as strong and clear as possible, and your perspective is invaluable to that effort.

Thank you once again for your thorough and helpful review.

Sincerely,

The Authors of Submission 23137

评论- Please see authors response.

2025-08-05

Hello, thank you for your review. Please see the authors response, particularly around reasoning traces and tool use, to see whether you would like to change your feedback or scores. Thank you.

审稿意见

评分: 5置信度: 42025-07-01

The scPilot system provides human-like reasoning to orchestrate established single cell transcriptomics analysis methods to complete specific analysis tasks. The system is implemented using a problem-to-text converter to support natural language reasoning, use of Bio-Tool library of available functions, and an LLM planner.

优缺点分析

Strength (major): development of the scBench omics reasoning benchmark targeting 3 biologically relevant tasks.

Strength: extends past established “chat with your data” systems for single cell genomics to cover multi-step reasoning directly on the omics data.

Strength: Paper analyzes error modes and proposes augmentation strategies to address performance gaps.

Strength: performance on diverse tasks is high

Weakness (major): Value of reasoning is unclear (could be better explained). For example, direct use of O1 beats scPilot in cell type annotation.

Weakness (major): based on the arguments in the related work section, more appropriate baselines should be selected from state-of-the-art tools that are cited (e.g. chat front ends or C2S-Scale).

问题

Can the authors better show the value of the reasoning system compared to similar recently published methods (e.g. C2S-Scale or CellAgent).

It would be nice to show an example of how the reasoning works, even in supplementary material. Currently, the main basis of the paper is the addition of reasoning to the model, but other than performance assessments, there is not exposition of the value of reasoning, for example in terms of interpretation, definition of workflows, manual checking of model decisions.

局限性

yes

最终评判理由

Increase based on authors emphasis in the rebuttal on the value of the reasoning trace and the author's intent to emphasize this more in the paper.

格式问题

作者回复

2025-07-31

We sincerely thank Reviewer mMx1 for recognizing several core contributions of our work. We are particularly grateful for the acknowledgment of:

A Major Foundational Contribution: Highlighting the "development of the scBench omics reasoning benchmark" as a major strength.
A Methodological Leap Forward: Recognizing that our framework "extends past established ‘chat with your data’ systems" to enable "multi-step reasoning directly on the omics data."
Thorough Validation and High Performance: Appreciating our comprehensive "analysis of error modes" and, importantly, that scPilot's "performance on diverse tasks is high."

(a) Value of Reasoning: Why scPilot (o1) performs worse sometimes

We thank the reviewer for this critical question. It highlights the value of scPilot's reasoning beyond raw accuracy.

Why 'Direct' Prompting Occasionally Outperforms scPilot: As detailed in our response (b) to Reviewer RBqf, the rare cases where the 'Direct' method outperforms scPilot are on datasets with ambiguous markers (e.g., Liver). Here, scPilot's deep reasoning can cause it to "overthink" complex signals—a trait mirroring expert caution that can lower a strict accuracy score. The 'Direct' method, by forcing a single "best guess," can be incidentally correct in such noisy conditions.
The Scientific Value of "Overthinking": The value of scPilot's reasoning lies in its transparent, scientific process, not just the final label. Its "overthinking" provides crucial insights where direct methods fail. For example, on the Liver dataset:
- Cluster 16 (Ambiguous Identity): Faced with markers for multiple cell types (hepatoblasts, hepatocytes), scPilot's log noted the cluster "may need finer analysis," correctly identifying the biological ambiguity instead of forcing a likely incorrect label.
- Cluster 19 (Insufficient Evidence): Lacking strong markers, scPilot flagged this cluster as un-annotatable and a "prime candidate for subgrouping," correctly identifying the limits of the data and preventing a confident error where a direct method would guess.

In conclusion, scPilot's value is its transparent, cautious, and scientifically rigorous analysis. The reasoning trace reveals why a decision is made—or not made—which is invaluable for human trust and guiding future experiments, a key advantage over opaque methods.

(b) Additional experiments to compare with appropriate STOA tools - check response (c) to the reviewer 82g2

We thank the reviewer for this crucial suggestion. As detailed in our response (c) to Reviewer 82g2, we have conducted new experiments comparing scPilot against the latest generation of LLM-based bioinformatics tools.

Expanded Comparison with State-of-the-Art (SOTA) Baselines

We expanded our evaluation to include new SOTA baselines: Biomni (a powerful LLM agent), LLM4GRN (a specialized pipeline), and BioGPT (a foundation model). Our results, detailed in the response to Reviewer 82g2, show that scPilot consistently outperforms these recent methods.

Regarding C2S-Scale and CellAgent

We made significant efforts to include these specific tools but could not perform a fair benchmark due to technical limitations:

C2S-Scale: This tool is not fully open-source. The only public version is a 1B model, a class which our experiments show lacks the reasoning capability for scBench, making a comparison uninformative.
CellAgent: Evaluation was impossible as its web demo's data upload is non-functional and its source code is unavailable, preventing reproducible analysis.

In summary, we have broadened our comparisons to include relevant and accessible SOTA frameworks. We are committed to benchmarking against tools like C2S-Scale and CellAgent in scBench as they become openly available.

(c) Value of reasoning: qualitative comparison with recent methods (Biomni)

We thank the reviewer for this excellent question. While a direct benchmark against C2S-Scale and CellAgent was not feasible due to their limited accessibility (as explained above), we have conducted a detailed qualitative and quantitative comparison against Biomni, a powerful, state-of-the-art general-purpose biomedical LLM agent (technically, Biomni comes after our submission). This comparison clearly demonstrates the value of scPilot's specialized, reasoning-first approach. (Biomni has a very similar design philosophy to CellAgent, but much powerful).

Contrasting Design: Reasoning-First vs. Tool-First

The core difference is their design philosophy. scPilot is a specialized, reasoning-first agent; Biomni is a general-purpose, tool-first agent.

Dimension	scPilot (Reasoning-First)	Biomni (Tool-First)
Workflow	Hypothesis formulation → Omics-native reasoning → Iterative refinement	Run tool → Keyword lookup → Hard-coded template
Error Handling	Self-diagnoses biologically implausible steps and revises	Falls back to heuristics on Python errors; no biological validation
Output	Transparent Chain-of-Thought: Explanations, confidence scores.	Data Dump: Dictionaries, tables, templated outputs.

The Impact on Scientific Accuracy

This design difference leads to significant accuracy gaps.

Retina Annotation: scPilot correctly distinguishes fine-grained subtypes (e.g., ON- vs. OFF-bipolar cells). Biomni makes clear errors, mislabeling Müller glia as amacrine cells.
Liver Trajectory: scPilot correctly identifies the Epiblast root and reconstructs faithful lineages. Biomni misplaces cell types (e.g., "Cardiac muscle") in the lineage, ignoring gene evidence.

The Value in Efficiency and Cost

scPilot's specialized approach is also vastly more efficient and cost-effective.

Task (vs. Biomni on Gemini-2.5 Pro)	scPilot Time	scPilot Cost	Biomni Cost	Cost Ratio
Cell-type annotation	1–3 min	$0.03	$0.80–$ 1.00	~27–33x
Trajectory inference	1–3 min	$0.04	$0.80–$ 1.00	~20–25x
GRN prediction	3-5 min	$0.12	Unsuccessful	N/A

scPilot is up to 30x cheaper and significantly faster because its curated toolchain and reasoning-first method avoid costly, broad tool searches. It also succeeds on the complex Gene Regulatory Network (GRN) prediction task where the general-purpose agent fails. This comparison proves that while general-purpose agents are flexible, scPilot's specialized reasoning delivers higher scientific fidelity at a fraction of the cost.

(d) Additional Examples of Reasoning in Action

We are happy to provide concrete examples of scPilot's reasoning, demonstrating its value in creating interpretable and verifiable workflows. The value lies not just in quantitative performance, but in its transparent, multi-step biological analysis.

Multi-Gene Logic for Accurate and Interpretable Annotation

Instead of simple marker matching, scPilot uses a propose -> filter -> analyze loop for nuanced interpretation, which we detail with examples in Figure 3 and Supp. Figures 1-2.

How it Works: scPilot proposes cell types from top markers, filters hypotheses against data (e.g., requires key marker presence), and then analyzes multi-gene combinations for a final call.
Value: This process correctly distinguishes NK cells from CD8 T-cells in the PBMC3k dataset by analyzing the conjunction of NKG7, GNLY, and CD3D expression. This explicit logic, recorded in the trace, allows scientists to verify the decision and is powerful enough to find rare cell types (<1% abundance).

Self-Auditing for Deeper System-Level Understanding

In trajectory inference, scPilot uses external tool outputs to self-correct its own predictions.

How it Works: scPilot first constructs a lineage tree, then uses a diagnostic report from a tool like py-Monocle to perform a "reconsideration" step, auditing and repairing its initial tree.
Value: This self-correction significantly improves biological accuracy (graph-edit distance reduced by 6, spectral distance by 0.32), correcting the tree's root and repairing the canonical hepatic developmental ladder. The reasoning trace (detailed in Figure 7 and Supp. Table 6) allows direct comparison of predicted branches against known biology.

Contextual Reasoning for Tissue-Specific Predictions

For Gene Regulatory Network (GRN) prediction, scPilot integrates multiple data types for context-aware analysis.

How it Works: The framework assesses Transcription Factor (TF)-gene relationships by combining Gene Ontology, expression context, and known regulatory pathways.
Value: This enables true tissue-specific reasoning. For example, when predicting the Klf4 → Muc5ac regulation, scPilot leverages knowledge that this interaction is key in stomach tissue, yielding a more accurate prediction than a generic model would make (see Supp. Table 7).

These examples show scPilot’s value is in its interpretable workflows, evidence-based self-correction, and nuanced biological interpretation, all exposed for verification. We will highlight these in the supplement, scuh as figure 3, Supp. Fig1, 2, Supp. Table 6, 7.

2025-08-02

Thanks for the useful response. The authors highlight the value of the reasoning trace in the responses to all reviewers. It would be important to highlight this more in the paper. Currently, the word trace only occurs twice in the main text in the methods. I will increase my score.

审稿意见

评分: 3置信度: 32025-07-03

The core idea of this paper is to use the big language model as a "bioinformatics assistant" so that it can understand the single-cell sequencing data by itself, and then output reliable biological analysis for you step by step. It did three things:

Data condensed into a "easy-to-read version" The gene expression matrix of millions of cells is first squeezed into JSON with key information such as cluster size, marker genes, and differentiation pathways by algorithms, so that the model can grasp the key points at once and not be overwhelmed by massive numbers.
"Dialogue + tool" closed loop Every step the model takes, it will first mutter in its mind "This cluster of cells looks like XXX", and then automatically call the bioinformatics tools (such as Scanpy, Monocle, SCENIC, etc.) to run it, and then correct its own conjecture after getting the results back - just like discussing with colleagues, proposing hypotheses while testing hypotheses.
Realistic evaluation routine - SCBENCH Comparison was made for three classic tasks:

Labeling cell clusters (cell type annotation)
Reconstructing cell development routes (developmental trajectory)
Predicting gene regulatory networks (TF-target gene relationships) The results show that after applying this "dialogue + tool" process, the accuracy of various tasks can be increased, some graph structure errors can be greatly reduced, and the common pitfalls of the model can be seen.

In short, SCPILOT turns the "black box script running" bioinformatics analysis into a traceable, iterative, chat-like scientific workflow, allowing us to truly treat large language models as "partners" in bioinformatics, rather than robots that can only move literature and type code.

优缺点分析

Strengths

Quality
- Fairly thorough experimental comparison
  - They compared SCPILOT against several common baselines on three classic tasks—cell type annotation, trajectory reconstruction, and gene regulatory network prediction—and used metrics like accuracy and graph error to show improvements. The empirical work looks pretty solid.
  - The experiments cover a handful of public datasets, and they also released their code and data-processing scripts, which should help others reproduce the results.
- Modular design
  - Breaking the system into “data summarization,” “model reasoning,” “tool calls,” and “result verification” keeps each part’s responsibility clear. That should make it easier to swap out or upgrade individual pieces down the line.
Clarity
- Intuitive flowcharts and example dialogues
  - Using a flowchart (Figure 2) and multi-turn example conversations (Figure 3) to show the loop makes it easy to follow the overall approach.
  - They even list API snippets and input/output JSON for each tool call, so you can pretty quickly see how to implement it.
- Logical structure
  - The paper walks you through data preprocessing and summarization first, then the dialogue strategy and toolchain, and finally the evaluation results. The roadmap is easy to follow.
Significance
- Filling the “LLM + bioinformatics” gap
  - Most work so far has focused either on pure dialogue or on bioinformatics pipelines; SCPILOT’s idea of combining both in single-cell analysis is quite fresh.
  - Merging auditable tool calls with model reasoning could really help push explainable AI in the single-cell field.
- Potential community value
  - They open-sourced the SCBENCH benchmark suite, giving others a starting point for comparison and hopefully spurring more research.
Originality
- Introducing a “dialogue + tools” paradigm
  - While there are earlier examples of LLMs calling external APIs, this is one of the most complete end-to-end demos in single-cell sequencing, tying together community-favorite bioinformatics tools in a loop.
- Structured data summarization
  - Compressing high-dimensional expression matrices into structured JSON summaries is a neat idea that could guide future LLM applications on large-scale biological data.

Weakness

- - Insufficient transparency of method details
  - The empirical thresholds and hyperparameters used in the "marker gene selection" and "branch point extraction" phase of summary generation do not provide mathematical formulas or selection criteria, and no sensitivity analysis is performed on different settings;
  - The tool call link lacks intermediate verification (such as expression distribution diagrams, statistical tests or p-values), and it is impossible to judge whether the hypothesis verification is sufficient.
  1. Limitations of experimental design and evaluation
  - The baselines used for comparison are mostly older versions (such as Monocle2 and classic SCENIC), and there is no direct comparison with the latest LLM+bioinformatics work of the same period, which may overestimate the performance improvement;
  - The lack of error propagation experiments (such as artificially introducing summary or dialogue errors to observe the fluctuation of the final results) and uncertainty assessment (confidence interval or probability calibration) makes it difficult to measure the reliability of the overall process.
  1. Ambiguity in dialogue strategy and automation process
  - It is not clear when the dialogue stops (fixed rounds or dynamic decision) and how to balance "exploration" and "verification";
  - The method claims to be "fully automatic and reproducible", but it also mentions that "manual review of dialogue output is required" to prevent miscalling, which is inconsistent with the previous and subsequent statements.

问题

As listed in weakness

局限性

As listed in weakness

格式问题

I have no concern on formatting issues in this paper,

作者回复

2025-07-31

We sincerely thank Reviewer 82g2 for their detailed and insightful summary. We are particularly grateful that the reviewer recognized the core contributions of our work on multiple levels:

Paradigm Shift: For highlighting scPilot's success in transforming "black box" analysis into a "traceable, iterative, chat-like workflow," creating a "bioinformatics assistant."
Methodological Innovation: For appreciating the originality of our "complete end-to-end demo" that unites community tools, the "neat idea" of structured data summarization, and the clarity of our "modular design" and "intuitive flowcharts."
Solid Evaluation and Community Value: For acknowledging our "fairly thorough experimental comparison" as "pretty solid" and reproducible, and for noting the "potential community value" of the open-sourced scBench benchmark in spurring future research.

(a) Clarification on empirical thresholds and hyperparameters with sensitivity analysis

We thank the reviewer for this excellent suggestion. In response, we have conducted a new sensitivity analysis for key hyperparameter requested (marker gene selection) and are happy to clarify the criteria used for our experimental thresholds. We will add this analysis to the final manuscript to improve transparency.

Sensitivity Analysis on Marker Gene Selection ( $K$ )

For cell-type annotation, a key hyperparameter is $K$ , the number of top marker genes. We tested performance sensitivity on the PBMC3k dataset for $K=5$ , 10 (our default), and 20.

Table 4. Annotation Accuracy (PBMC3k) vs. Number of Top Genes ( $K$ )

$K$	Model	Avg Accuracy	Variance
5	o1	0.771	0.001
5	gpt-4o-mini	0.500	0.016
10 (Default)	o1	0.792	0.005
10 (Default)	gpt-4o-mini	0.604	0.001
20	o1	0.750	0.016
20	gpt-4o-mini	0.542	0.009

Performance peaks at our chosen $K=10$ . More importantly, the strong results across different values demonstrate our approach is robust and not highly sensitive to this hyperparameter.

Clarification on Trajectory Branch Point Extraction

For trajectory branching, our method uses no tunable threshold for branch point extraction. Instead, we directly compare scPilot's predicted graph against the ground-truth lineage curated in the original publications (Pancreas, Liver, Neocortex). The evaluation is a direct comparison to an established biological standard, not a process dependent on an internal parameter.

(b) Validation of intermediate tool calls

We thank the reviewer for this insightful comment. To validate that scPilot's reasoning is grounded in its tool outputs, we performed perturbation studies on two critical components. We will add these results to the manuscript.

Perturbation of Gene Ontology (GO) Database in GRN Prediction

To test dependency on the GO database for Gene Regulatory Network (GRN) prediction, we shuffled its term associations with random noise.

Table 5. Impact of GO Perturbation on GRN Prediction (AUROC)

Model	AUROC (Original)	AUROC (GO Shuffled)	$Δ$ AUROC
o1	0.873	0.813	-0.060
GPT-4o	0.800	0.710	-0.090
gpt-4o-mini	0.697	0.617	-0.080

Corrupting GO data significantly degrades AUROC (e.g., $p=0.044$ for gpt-4o-mini), confirming that scPilot's reasoning relies on accurate information from this intermediate step.

Perturbation of py-Monocle Output in Trajectory Inference

We tested dependency on py-Monocle by providing scPilot with a corrupted report (randomized cluster relationships and pseudotime order) for the liver dataset.

Table 6. Impact of Monocle Perturbation on Trajectory Inference (o1 model)

Metric	Original Avg	Perturbed Avg	Performance Change
Jaccard	1.000	0.933	Worse
GED-nx (10s)	8.00	11.33	Worse
Spectral Distance	0.567	0.593	Worse

The performance drop across metrics demonstrates that accurate outputs from tools like Monocle are critical for scPilot.

Summary: These experiments prove scPilot's success is fundamentally dependent on the integrity of the data provided by the bioinformatics tools it integrates.

(c) Additional experiments with new baselines (traditional bio tools and latest LLM+bioinformatics works)

We thank the reviewer for this crucial suggestion. In response, we conducted extensive new experiments against state-of-the-art (SOTA) baselines.

Expanded Benchmarking with SOTA Baselines

We added several new baselines:

Updated Traditional Tool: Celltypist (1.7.1).
SOTA LLM Methods: Biomni (biomedical agent), LLM4GRN (GRN pipeline), and BioGPT-Large (foundation model).

The results confirm scPilot's superior performance holds against these newer, challenging baselines.

Table 7. New Baselines for Cell-Type Annotation

Dataset	scPilot (o1)	Celltypist 1.7.1 (new)	Biomni (new)
Liver	0.518	0.464	0.464
PBMC	0.792	0.563	0.646
Retina	0.728	0.368	0.570

Table 8. New Baseline for Trajectory Inference

Model (on Liver Dataset)	Jaccard (Avg)	GED-nx (Avg)	Spectral Dist (Avg)
scPilot (Gemini-2.5 Pro)	1.00	3.33	0.199
Biomni (Gemini-2.5 Pro)	1.00	8.33	0.482

Table 9. New Baseline for GRN TF-Gene Prediction

Model (on Stomach Dataset)	AUROC (Avg)
scPilot (o1)	0.873
LLM4GRN	0.727
BioGPT-Large	0.66

This comprehensive comparison confirms scPilot's superior performance and the advantage of our omics-native reasoning approach. We will include all new results in the new version of the manuscript.

(d) Additional Experiments in Error Propagation and Uncertainty Assessment

We thank the reviewer for these suggestions. In response, we highlight existing work and provide new analyses on error propagation and uncertainty.

1. Error Propagation via Input Perturbation

We highlight the input perturbation experiments from our main submission (Figure 4a), where we removed all contextual metadata (e.g., tissue type) from the input prompt for the PBMC3k annotation task.

Table 10. Results of Context Removal on Annotation Accuracy

Model	Accuracy (Original)	Accuracy (No Context)	Performance Drop
o1	0.792	0.688	↓ 0.104
GPT-4o	0.646	0.583	↓ 0.063
gpt-4o-mini	0.604	0.416	↓ 0.188

This directly shows that removing key input context significantly drops performance, confirming rich metadata is critical for reliability.

2. Uncertainty and Reliability Assessment

We performed a new reliability analysis on the GRN prediction task (Stomach dataset). We calculated 95% confidence intervals (CIs) from 10 trials, performed 1000-bootstrap resampling, and assessed calibration using Expected Calibration Error (ECE) and Brier scores.

Table 11. Uncertainty & Calibration Metrics for GRN Prediction

Model	95% CI (10 runs)	Bootstrap (±)	ECE	Brier Score
o1	0.84–0.90	±0.05	0.05	0.14
gpt-4o-mini	0.65–0.75	±0.10	0.12	0.20

The powerful o1 model is not only more accurate but also more stable (tighter CI) and reliable (lower ECE and Brier scores). These analyses provide the requested quantitative assessment of scPilot's reliability, which we will add to the manuscript.

(e) Stopping Criterion

We thank the reviewer for this question. The dialogue stopping condition is pre-defined for each task to ensure reproducibility for our scBench benchmark. Please see (e) in our rebuttal to reviewer RBqf.

The balance between "exploration" and "verification" is managed by our structured reasoning loop:

Exploration: The LLM first generates broad hypotheses from summarized data (e.g., proposing potential cell types).
Verification: Subsequent tool calls gather evidence to verify these hypotheses (e.g., inspecting data to confirm cell types, as in Figure 3).

This propose -> call tool -> analyze cycle channels the LLM's ideas into data-grounded inquiries, preventing ungrounded tangents and ensuring a scientifically sound workflow.

(f) Consistency and Clarity in Manual Intervention

We thank the reviewer for this opportunity to clarify. We believe there may be a misunderstanding, as our paper does not use the term "fully automatic" nor state that "manual review... is required." We clarify "automatic and reproducible" and the role of human oversight below.

How scPilot is Automatic and Reproducible
- Automatic Execution: "Automatic" means scPilot runs an entire workflow end-to-end without human intervention during execution.
- Reproducible by Design: The process is reproducible because prompts, tools, and stopping conditions are fixed. This ensures the same inputs produce the same outputs, a cornerstone of scBench.
The Role of Human Auditing vs. Required Intervention The confusion may stem from our emphasis on auditability. A key strength is producing a transparent reasoning trace that a scientist can review for validation, much like reviewing a colleague's work. This is fundamentally different from requiring intervention during the process. The system runs automatically; transparency is a feature for trust and insight, not a bug requiring manual patches.

In summary: scPilot is automatic in operation, with outputs that are auditable by design, not dependent on intervention.

评论- Following up on our rebuttal for Submission 23137

2025-08-06

Dear Reviewer 82g2,

We sincerely thank you for your exceptionally detailed and constructive review. We were especially grateful for your thorough summary, which showed a deep understanding of our work's vision. Your actionable feedback was invaluable in helping us significantly strengthen the paper.

We have submitted a comprehensive rebuttal and, following the discussion guidelines, wanted to reach out directly. We took your concerns about experimental rigor and transparency to heart and, in direct response, have conducted substantial new work that we believe addresses every weakness you identified:

We added the new sensitivity analysis you suggested for key hyperparameters.
We performed new perturbation experiments to provide the requested intermediate validation for tool calls.
We expanded our evaluation with several new SOTA baselines, including the latest Celltypist, Biomni, and LLM4GRN.
We conducted new error propagation and uncertainty analyses to quantitatively assess the framework's reliability.

Our rebuttal also provides detailed clarifications on the stopping criteria and corrects a key misunderstanding regarding our framework's design: it is automatic in operation and auditable by a human by design, not dependent on manual intervention.

Given that these extensive additions were performed specifically to address the concerns that informed your initial rating, we are very keen to learn if this new evidence has resolved those issues. Your feedback on whether our response is sufficient is critically important to us, and we are eager for any further discussion.

Thank you once again for your rigorous feedback and guidance.

Sincerely,

The Authors of Submission 23137

审稿意见

评分: 4置信度: 52025-07-04

The authors proposed an agent based cell annotation framework to leverage LLMs to use tools and do inference over raw input data. The experiments are conducted across three applications: cell type annotation, trajectory inference, and GRN prediction. The performances are better than previous LLMs based models.

优缺点分析

Strengths:

It is important to incorporate the raw gene data into the analysis which is important and essential in the bioinformatics pipeline. And this point is ignored by most previous LLMs based single-cell analysis framework.
The experiments are a lot including three different tasks with detailed analysis.

Weaknesses:

This work is more like an application of agent on LLMs, which doesn't propose further machine learning algorithm improvement.
The experiments are conducted solely on LLM based method. It is important to verify that whether the LLMs driven methods can be better than traditionally non-LLMs based methods.

问题

See weeknesses.
It is unclear about how the Problem-to-Text Converter work. The mapping process is unclear and I have tried to find unswer in the following content but it seems missing.

局限性

See weeknesses

最终评判理由

I appreciate authors' effort on providing detailed response. Regarding the Reasoning Paradigm in the innovation, which is claimed as the core contribution in this paper, I don't think the current manuscript successfully verifies whether the reasoning path produced by LLMs is really biologically reasonable. It would be better to provide detailed reasoning paths and examing them by biology experts to see whether it provides biologically plausible reasoning results. Probably, these can be checked by reviewers from computational biology journals. Therefore, I decided to keep my original score.

格式问题

The formatting looks good to me.

作者回复

2025-07-31

We sincerely thank Reviewer 4XCY for highlighting our significant contributions:

Methodological Significance: Recognizing our novel approach of "incorporating raw gene data" as "important and essential"—a critical step that has been "ignored by most previous LLM-based frameworks.
Experimental Rigor: Acknowledging that our "experiments are a lot," spanning three different tasks with "detailed analysis."
Demonstrated Improvement: Confirming that scPilot's performance is "better than previous LLMs based models."

We appreciate the reviewer's questions, which touch upon the nature of our contribution, the breadth of our experimental baselines, and the mechanics of our data-to-text conversion. We are confident that the detailed clarifications below will fully resolve these concerns and reinforce the significance and novelty of our work.

(a) Clarification on Machine Learning Contribution and Originality

We thank the reviewer for this comment, which allows us to clarify our core contribution. While we do not propose a new, standalone ML algorithm in the traditional sense, such as a novel neural network architecture or loss function, our work introduces a novel reasoning paradigm and a foundational benchmark, scBench, which are crucial and non-trivial advancements for applying AI to science.

1. First, A Novel Reasoning Paradigm (Not a Pure Application)

Our primary contribution is Omics-Native Reasoning (ONR), the first systematic framework for an LLM to directly interact with raw omics data in an interpretable reasoning loop.

This is fundamentally different from prior work and is a non-trivial advancement:

Unlike simple tool-calling agents that merely wrap existing bioinformatics tools, scPilot deeply interprets the numerical output of those tools to form and revise hypotheses.
Unlike embedding-based foundation models that produce opaque "black-box" vectors, scPilot generates a fully transparent and auditable reasoning trace, addressing a key limitation in computational biology.

As the reviewer noted, creating a framework where an LLM can directly reason with raw gene data is an "important and essential" challenge, ignored by most previous work. Solving this is our core methodological contribution.

2. Second, Tangible Performance Gains from a New Paradigm

This new paradigm is not merely conceptual; it delivers substantial, measurable improvements in analytical accuracy. Our experiments show that the iterative reasoning enabled by scPilot lifts average cell-type annotation accuracy by 11% and cuts trajectory graph-edit distance by 30% compared to simpler, one-shot prompting methods. This demonstrates that changing the reasoning process is a powerful vector for performance improvement, distinct from changing the model architecture.

3. Last but not least, A Foundational Benchmark for Future ML Improvements

Crucially, our work is foundational for future machine learning algorithm improvements. Before algorithms can be improved for a task, that task must be rigorously defined and benchmarked.

A New Task & Benchmark: Our ONR paradigm defines the task, and scBench provides the first-ever leaderboard-style benchmark to measure performance on this complex scientific reasoning.
Enabling Future Models: The transparent reasoning traces generated by scPilot can serve as high-quality, structured data for fine-tuning the next generation of smaller, more efficient, open-source scientific AI models to become expert "omics-native reasoners."

As a result, our framework further provides the essential blueprint and testbed for these future ML advancements.

(b) Additional Experiments with Non-LLM Baselines

We thank the reviewer for emphasizing the importance of comparing against traditional non-LLM methods. We would like to clarify that our paper already includes these crucial baselines across all three tasks, and our results consistently show that scPilot outperforms them.

In our original submission, we benchmarked scPilot against several established non-LLM methods:

Cell-Type Annotation: CellTypist and CellMarker 2.0 (Table 2, Line 170)
Trajectory Inference: py-Monocle (Monocle3) (Table 6, Line 231)
GRN Prediction: Graph Neural Networks including GCN, GraphSage, and GAT (Table 7, Line 260)

To further strengthen this analysis, we have now added the latest version of Celltypist (1.7.1) as an additional baseline. The results continue to demonstrate scPilot's superior performance.

Cell Type Annotation Accuracy Comparison

Dataset	scPilot o1	scPilot Gemini-2.5 Pro	Celltypist 1.6.3 (original)	Celltypist 1.7.1 (new)
Liver	0.518	0.488	0.429	0.464
PBMC	0.792	0.708	0.500	0.563
Retina	0.728	0.675	0.263	0.368

As this table and the results in the main paper show (e.g., Table 6 for trajectory, Table 7 for GRN), our LLM-driven scPilot framework consistently and significantly outperforms these state-of-the-art traditional methods.

(c) Clarification on Problem-to-Text Converter

We thank the reviewer for this question and are happy to clarify the mechanics of our Problem-to-Text Converter. Its purpose is to create a compact, information-rich "semantic sketch" of the massive raw data matrix, which can contain over 100,000 cells, so that an LLM can process it within its context window.

Key Steps:

Clustering: The converter first groups single cells into clusters (e.g., Leiden algorithm via Scanpy). This reduces the dataset from 10⁵ cells to a manageable number of clusters (typically 10–30).
Feature Selection: For each cluster, the converter identifies the top marker genes (e.g., top 10 genes by differential expression), which define the cluster’s identity.
Biological Metadata: The converter includes essential metadata such as species, tissue, protocol, and timepoints, providing biological context.
Structured Prompt Formation: The output is formatted as a structured prompt summarizing for each cluster:
- Cluster size (number of cells)
- Top marker genes (ranked)
- Other relevant features (e.g., mean expression of key genes, pseudotime, etc.)
- Any available prior annotations

Example:

“Cluster 0: 1,050 cells, Top markers: [Cd14, Lyz, Fcer1g, ...];

Cluster 1: 900 cells, Top markers: [Cd3d, Cd3e, Trac, ...]; ...

Species: Mouse; Tissue: Liver; Timepoints: E9.5, E10.5 ...”

Generalization to All Tasks:

For cell type annotation, clusters and their marker genes are provided.
For trajectory inference, the cluster connectivity (e.g., from py-Monocle) and key developmental genes are included.
For GRN prediction, the converter extracts top TF–target gene pairs (from tools like SCENIC) with supporting NES scores/motifs.

The resulting compressed prompt is small enough to fit in the LLM’s context window, yet detailed enough for meaningful biological reasoning. The LLM then reasons over this digest, proposes hypotheses, and iteratively calls tools for further evidence if needed.

评论- Following up on our rebuttal for Submission 23137

2025-08-06

Dear Reviewer 4XCY,

We sincerely thank you for your expert review and your high confidence in the assessment. We appreciate that you recognized the importance of our approach in incorporating raw gene data and the thoroughness of our experimental analysis.

We have submitted a detailed rebuttal and, following the discussion guidelines, wanted to gently ask for your feedback. We aimed to fully address the critical points you raised regarding our contribution, baselines, and the clarity of our methods.

Specifically, our rebuttal:

Clarifies our core contribution as a novel reasoning paradigm (Omics-Native Reasoning) and a foundational benchmark (scBench), which are essential advancements for enabling future ML algorithm development in this domain.
Highlights that comparisons to non-LLM baselines (like CellTypist, Monocle, and GNNs) were included in the original manuscript, and further strengthens this by adding results against the latest version of CellTypist.
Provides a detailed, step-by-step explanation of the Problem-to-Text Converter to resolve the unclarity you noted.

Your expert opinion is highly valuable to us. We would be very grateful to learn if our rebuttal has sufficiently addressed your concerns, and we are eager to engage in any further discussion.

Thank you again for your time and guidance.

Sincerely,

The Authors of Submission 23137

2025-08-07

评论- Addressing A New Point from Reviewer 4XCY with Our Existing Evidence

2025-08-08

Dear Reviewer 4XCY,

Thank you for the further discussion. We want to address your new point regarding the verification of biological reasoning paths—a question not raised in your original review, but one that is fundamental to the future of AI in science. We have strong existing evidence to fully resolve this concern.

1. The Universal Challenge of Verifying AI Reasoning in Biology

First, we wish to frame the problem accurately. Verifying the internal reasoning of frontier models like GPT-4o/o1/Gemini/... is a grand challenge for the entire AI community. This challenge is magnified exponentially in biology, where a single reasoning process involves navigating the complex interplay of tons of biological concepts.

Immense Complexity: A full, every-fact-checking audit of a complex biological reasoning chain would be a monumental task, potentially requiring hundreds of expert-hours per task.
A Universal Problem: Suggesting this can be simply resolved by "reviewers from computational biology journals" overlooks this reality; they, too, face severe time constraints. This is a field-wide problem, maybe not a unique flaw in our paper.

Given that a perfect, exhaustive audit is infeasible for such reasoning systems today, the critical question is: what is the rigorous and sufficient standard of evidence for a foundational work like ours? We argue that our work not only meets but helps define this standard through our contributions.

2. Our Framework is Designed for Trust and Verification

Before presenting our direct evidence, we highlight that our paper’s core contributions are designed specifically to build trust in a domain where it is hard-won.

Objective Benchmarking of Outcomes: A key contribution, scBench, provides the first-ever benchmark to quantitatively measure the final accuracy of complex biological reasoning tasks. It ensures that no matter how complex the internal path, the outcome is anchored to a solid, measurable ground truth.
Transparency as a Proxy for Trust: Our model’s primary advantage over opaque embedding-based methods is its transparency. The generation of an auditable reasoning trace is, in itself, a core contribution that enables the very scientific scrutiny you are calling for.
A Tool for Debugging and Discovery: This transparency transforms the model from a "black box" into a scientific instrument. It allows researchers to trace and debug incorrect conclusions, turning failures into valuable learning opportunities.

3. Current Evidence of Biologically Plausible Reasoning

Having established this context, we now point to the concrete and undeniable evidence that our model's reasoning is biologically sound and that we have performed our due diligence.

A. Qualitative Proof (Full, Unabridged Reasoning Trace): In our reply to Reviewer VK46, we provided a full, annotated example of scPilot’s chain-of-thought for the cell-type annotation task. This trace clearly demonstrates a sophisticated, multi-step biological logic: formulating hypotheses from canonical markers (Cd79a), quantitatively verifying them with expression data, cross-validating against curated knowledge, and recording a detailed, justified conclusion. We will release all reasoning paths to the public to ensure full transparency and audit.
B. Expert Validation (Meeting a Reasonable Standard): Your suggestion for expert review is excellent, and we did exactly that. Our experienced biology co-authors (multiple of them), who specialize in single-cell analysis, reviewed the key logical steps and biological assertions in our model's reasoning traces. They confirmed the logic aligns with current biological understanding and flagged no implausible steps. This represents a pragmatic and rigorous standard for expert validation.
C. Quantitative Proof (Performance is the Ultimate Validation): Finally, the numbers speak for themselves. A systematically flawed or biologically nonsensical reasoning process could not achieve state-of-the-art results. The fact that scPilot consistently and significantly outperforms specialized tools like CellTypist and Monocle3 across three distinct tasks is the strongest possible empirical proof that its underlying reasoning process is effective and sound.

In summary, we are committed to helping address the grand challenge of trustworthy AI in biology and have validated it with a trifecta of evidence: transparent qualitative traces, rigorous expert review of key steps, and overwhelmingly positive quantitative results. We are confident this far exceeds the standard of evidence for a foundational paper in this area.

We hope this detailed explanation fully addresses your concern and reinforces the significance of our work. We kindly ask you to reconsider your assessment in light of this comprehensive evidence.

Sincerely,

The Authors of Submission 23137

审稿意见

评分: 4置信度: 22025-07-07

This paper introduces SCPILOT, a novel framework that utilizes Large Language Models (LLMs) to automate and enhance the analysis of single-cell RNA-sequencing (scRNA-seq) data. The core concept is "omics-native reasoning" (ONR), where an LLM not only processes natural language inputs but also directly interacts with omics data and bioinformatics tools to perform complex analyses. The framework addresses three key challenges in single-cell analysis: cell-type annotation, developmental-trajectory reconstruction, and transcription-factor targeting.

优缺点分析

Strengths:

Instead of simply using LLMs as wrappers for existing tools, SCPILOT enabling the LLM to directly interact with data and reason through biological problems.
The development of SCBENCH provides a standardized and rigorous platform for evaluating LLMs on complex, real-world biological tasks. This is a valuable contribution to the field, allowing for objective comparisons of different models and methods.
A key advantage of SCPILOT is its ability to produce a transparent "chain of thought," which makes the analysis process auditable and easier for researchers to understand and verify. This addresses the "black-box" nature of many computational biology tools.

Weaknesses:

The best results were achieved using powerful, proprietary LLMs like Google's Gemini 2.5 Pro and OpenAI's models. The paper itself notes that the open-source model tested (Gemma 3 27B) underperformed significantly and had issues with latency and cost, limiting the accessibility of the framework for labs without access to these expensive APIs.
The paper notes that smaller or latency-optimized "mini" models occasionally performed worse with the SCPILOT pipeline, suggesting that the extended reasoning process can lead them to "over-explore" and make mistakes.

问题

Does the LLM in the SCPILOT framework autonomously decide which bioinformatics tools to call upon, or does it follow a pre-determined workflow where the sequence of tool usage is set by human engineers?
What is the paper's assessment of the depth of biological knowledge in state-of-the-art LLMs, and how correct is the "chain-of-thought" reasoning they produce in this specialized domain?
How does SCPILOT determine the termination condition for its iterative loop? Is the decision made autonomously by the model itself or by an external program?
According to the results, even with a powerful model like GPT-4o, the 'Direct' method occasionally yields better results than the SCPILOT pipeline. What are the potential reasons for this?

局限性

The framework relies on a "Problem-to-Text Converter" to create a "semantic sketch" of the massive single-cell data matrix to fit within the LLM's context window. A potential issue, acknowledged by the authors, is that this compression step may discard subtle but important biological signals, particularly from rare cell types.

格式问题

作者回复

2025-07-31

We are very grateful to Reviewer RBqf for highlighting three pillars of our contribution.

Novelty of approach – "enabling the LLM to directly interact with data and reason through biological problems."
Community value of scBench – a "standardized and rigorous platform for evaluating LLMs on complex, real-world biological tasks."
Transparency & auditability – our fully exposed “chain-of-thought,” which explicitly tackles the usual “black-box” critique by making every step "auditable and easier for researchers to understand and verify."

(a) Accessibility of scPilot and Additional Cost Analysis

We acknowledge the reviewer's valid concern about relying on powerful proprietary LLMs. Our novel omics-native reasoning (ONR) paradigm is inherently more demanding than standard text analysis and requires powerful LLMs, a point confirmed by our experiments on Gemma 3 27B (the best open-source model testable on 8xH100s at the time). Our experiments reveal two reasons why ONR is challenging for current open-source models:

Insufficient Domain Knowledge: Weaker models lack the nuanced, pre-trained understanding of biology, such as raw markers and pathways, to interpret omics data correctly.
Poor Instruction Following: Weaker models often fail to follow complex instructions and adhere to biologist-defined formats (e.g., JSON), breaking the reasoning chain.

These insights are a key contribution, offering guidance for future model development. However, the cost of using powerful models via scPilot is negligible. Our detailed cost analysis for Gemini 2.5 Pro shows that a complete run for our most complex tasks costs only a few cents, making the framework accessible.

Table 1. Cost analysis of scPilot on all three tasks (Gemini-2.5 Pro)

Task (single run)	Input tokens	Output tokens	Input cost	Output cost	Total
Task 1 – Cell‑type annotation (retina)	6,155	2,197	$0.008	$0.022	$0.03
Task 2 – Trajectory inference (neocortex)	8,221	2,702	$0.010	$0.027	$0.04
Task 3 – GRN TF→gene prediction (stomach, 46 pairs)	17,877	9,276	$0.022	$0.093	$0.12
Note: Token counts approximated with `tiktoken`. Cost = tokens / 1,000,000 unit price ($1.25 / 1M input, $10.00 / 1M output; no caching).*

The negligible cost allows any lab to leverage state-of-the-art AI without expensive local GPUs. Developing cheaper, specialized open-source models for ONR is a promising future direction; in parallel, our framework provides an immediate, accessible tool for the community.

(b) Additional analysis on occasional suboptimal performance

We thank the reviewer for this sharp question. The observation that a simpler ‘Direct’ prompt can occasionally outperform scPilot is a key finding, not a flaw. These cases are rare, systematic, and informative.

scPilot overwhelmingly outperforms the baseline (wins in 87 of 108 total comparisons). The rare losses are systematic: 13 of 21 (62%) are from less-capable "mini" models. Their limited capacity for sustained logic leads them to "over-explore" and make mistakes with scPilot's extended reasoning.

For the few remaining cases with powerful models, the cause is high dataset nuances, inducing "overthinking." The only two powerful-model losses in cell-type annotation occurred on our most complex dataset, Liver. The characteristics of our datasets explain this pattern:

Table 2: Characteristics of three datasets in Cell Type Annotation

Dataset	# Clusters	# Cell Types	Description	scPilot's Behavior
PBMC3k	8	8	Simple: 1:1 mapping, distinct markers	Good, though mini models may saturate performance
Retina	18	9	Moderate: Clear types, slight ambiguity	Consistently improves accuracy.
Liver	28	31	Complex: Overlapping lineages, noisy expression	Can "overthink" and merge distinct subtypes.

A Concrete Example of "Overthinking": In the complex Liver dataset, scPilot with the powerful o1 model scored 0.518, while the Direct method scored 0.560. A key error was confusing developmentally related hepatocytes and hepatoblasts. Because they share many markers, scPilot's deep reasoning amplified this ambiguity and wrongly merged them, whereas the simpler Direct method was incidentally correct by ignoring this nuance.

This analysis defines the boundaries for applying LLMs to complex biological data and will be added to the manuscript. Future improvements include adaptive reasoning depth and marker clarity assessments.

We provide a more detailed analysis in Reviewer mMx1's rebuttal (a) about why o1's overthinking behaviors in Liver dataset.

(c) Clarification on LLM Tool Selection

This is related to the design philosophy of scPilot. scPilot uses a pre-determined tool workflow as a deliberate design choice to rigorously evaluate the LLM's core omics-native reasoning, not its ability to use tools.

Our primary goal is to test the model's ability to interpret complex, numerical outputs from established bioinformatics tools and formulate a coherent scientific argument.
By standardizing the pipeline with gold-standard tools (e.g., Monocle, SCENIC), we isolate this core reasoning capability. This prevents our evaluation from being confounded by trivial tool-use errors, like incorrect API calls, which is a separate research problem.
Our comparison against a tool-calling agent, Biomni, directly demonstrates the negative impact of such tool-use errors on performance; response (c) to Reviewer

(d) Additional analysis on depth of biological knowledge and reasoning in LLMs

This is an excellent question that gets to the heart of our contribution. First, we should note that the internal "chain-of-thought" of proprietary models (o1) is not accessible.

Second, our assessment is that STOA LLMs possess a surprising depth of latent biological knowledge, and the scPilot framework is crucial for structuring this knowledge into a correct, verifiable, and scientifically rigorous reasoning process. We demonstrate this across the following qualitative analysis

Sophisticated Multi-Gene Logic: In cell-type annotation, scPilot moves beyond simple marker matching to distinguish NK cells from CD8 T-cells using a combination of NKG7, GNLY, and CD3D—a task where single-marker tools often fail. It can also identify extremely rare cell types (~1% abundance).
Deep System-Level Understanding & Self-Correction: In trajectory inference, scPilot reconstructed the liver developmental tree and, more importantly, used diagnostic outputs from Monocle to self-audit and correct its own reasoning. This improved trajectory accuracy significantly (e.g., reducing graph-edit distance by 6 and spectral distance by 0.32).
Contextual, Tissue-Specific Reasoning: For GRN prediction, scPilot applies tissue-specific reasoning, such as understanding Klf4's role specifically in the stomach, to make more accurate predictions.

In conclusion, our framework produces a transparent reasoning trace—a step-by-step record of claims, tool outputs, and decisions—that serves as a fully auditable line of reasoning.

(e) Clarification on Termination Condition

This is an excellent question regarding the mechanics of our framework.

For our benchmark, scBench, the termination conditions are pre-defined for each task to ensure fair, reproducible, and rigorous evaluation, rather than being determined autonomously by the LLM.

The specific conditions are tailored to the nature of each task:

Cell-Type Annotation: The process is set to a maximum of 3 reasoning iterations. This fixed number provides a sufficient window for the LLM to perform self-correction and refine its initial hypotheses without leading to "overthinking" or excessive computational cost.
Trajectory Analysis & GRN Prediction: These are structured as single-pass reasoning tasks. For trajectory analysis, this involves an initial tree construction followed by a controlled self-reconsideration step prompted by Monocle's output. For GRN, all evidence is provided upfront, making a single, comprehensive reasoning pass the most effective approach.

This deterministic design is crucial for a benchmark. It ensures models are evaluated under identical conditions, preventing variability that confounds results, an issue seen with the autonomous agent, Biomni.

(f) Potential Information Loss from Data Compression

We agree with the reviewer on this important point. The "Problem-to-Text Converter" is a necessary trade-off to fit large omics datasets into current LLM context windows. While this compression is inherent, our framework is both highly effective today and designed for future extension.

First, our current "semantic sketch" approach is remarkably effective at preserving critical biological signals. For instance, in our Retina analysis, scPilot successfully identified rare Horizontal Cells that constitute less than 1% of the total cell population. This demonstrates that our converter retains the necessary resolution for nuanced discovery, even for subtle signals.

This trade-off represents an exciting research frontier. We envision several strategies to enhance this component:

Dynamic, RAG-like Retrieval: Allow the LLM to query the full data matrix for fine-grained details on demand.
Hybrid Foundation Models: Integrate scPilot with specialized single-cell models (e.g., scGPT) that can process raw data, with our LLM acting as the high-level reasoning engine.

Our scBench provides the ideal platform to develop and test these advanced strategies. Thus, while data compression is a challenge, our work offers an effective current solution and a foundational testbed for future innovation.

评论- Following up on our rebuttal for Submission 23137

2025-08-06

Dear Reviewer RBqf,

We sincerely thank you for your thoughtful and constructive review of our paper. We were particularly encouraged that you recognized the novelty of our omics-native reasoning approach, the value of our scBench contribution, and the benefit of our framework's transparency.

We have submitted a detailed rebuttal and wanted to gently follow up to ask for your feedback. We aimed to thoroughly address the excellent questions you raised regarding the framework's design and performance.

Specifically, we provide:

A detailed cost analysis to address the valid concern about accessibility, demonstrating that scPilot is highly affordable.
A new analysis on the rare cases where simpler methods outperform scPilot, explaining this systematic behavior as an informative finding related to model capability and dataset complexity.
Direct clarifications on your questions regarding tool selection, the depth of biological knowledge in LLMs, task termination conditions, and potential information loss.

We would be very grateful to learn if our response has helped clarify these points. We welcome any further discussion and are eager to hear your thoughts.

Thank you once again for your time and valuable insights.

Sincerely,

The Authors of Submission 23137

最终决定Accept (poster)

2025-09-17

The authors propose an LLM-driven set of workflows and tool use that automates analysis of single cell transcriptomic data. The method converts core problems such as cell type annotation into a natural language reasonign workflow that the LLM can manipulate. The paper also introduces a novel benchmark (scbench), that evaluate these capabilities across a suite of different tasks and using various different base LLMs. Discussion between authors and reviewers clarified the accessibility of the technique, the role and nature of the reasoning traces, and yielded important additional baseline comparisons to state of the art bioinformatics methods.