PaperHub
Overall rating: 4.6 / 10
Poster · 5 reviewers
Ratings: 3, 6, 5, 6, 3 (min 3, max 6, std dev 1.4)
Average confidence: 3.6
Correctness: 2.6 · Contribution: 2.0 · Presentation: 2.6
ICLR 2025

Contextualizing biological perturbation experiments through language

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-03-01
TL;DR

We propose Perturb-seq predictions as a novel set of real-world tasks for large language models and provide a proof-of-concept method with favorable performance.

Abstract

Keywords
large language models, Perturb-seq, perturbation experiments, knowledge graphs, retrieval-augmented generation, chain of thought prompting

Reviews and Discussion

Official Review (Rating: 3)

This paper proposes an in-silico approach to predicting the results of biological perturbation experiments. The core idea is to textually enrich graph-structure-based cellular responses. In terms of technical details, the authors propose PerturbQA as a pre-processing step and omit minor changes at the gene expression level. The authors then adopt prompt engineering to retrieve structured knowledge spanning gene-gene interaction results and gene descriptions. Following a CoT manner, the LLM generates the final answers.

Strengths

S1. Adopting LLMs in biological analysis is an interesting topic. This study follows a combination of current LLM-related techniques.

S2. The authors developed a PerturbQA dataset to support subsequent research on textual enrichment of perturbation analysis.

S3. Detailed examples and case studies are provided to show the effectiveness of the framework.

Weaknesses

W1. Part of the technical contribution is related to GenePT, which adopts textual descriptions and gene expression values to extract cell representations, and GEARS, which combines prediction over genetic relation graphs (a gene co-expression graph and a GO graph).

W2. The contributions of the proposed PerturbQA dataset are unclear. Why do we need a vague textual description of gene perturbation instead of a formalized graph structure with exactly numeric data?

W3. Missing technical details. I could not find the setting (or a formal definition) of the studied problem. This makes the paper hard to follow.

Questions

Q1. Please describe the distinct contributions of PerturbQA beyond RAG-like or meta-path-walk-like text description generation. How do you ensure the descriptions are correct? For instance, PubMedQA is generated from many articles and certified by human experts.

For the rest of the questions, please refer to the weak points.

Comment

Thank you for your review and suggestions! We hope this response provides clarity regarding your concerns, and we look forward to discussing.

Contribution and data quality

PerturbQA is, first and foremost, a carefully curated benchmark for language-based reasoning about single-cell perturbations.

Compared to existing benchmarks like PubMedQA, the ground truth is not determined by fact checking, but by experimental assays. The quality of our labels can be assessed based on statistical consistencies. Figure 4 illustrates that the Wilcoxon rank-sum test is reasonably calibrated on our datasets (for determining differential expression), and Figures 5+6 illustrate that in two near-biological replicates, there is high agreement. We also employ conservative thresholds for selecting positives and negatives (A.2), and exclude statistically uncertain examples.
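For intuition, here is a minimal sketch of this style of labeling (rank-sum test plus conservative thresholds). The thresholds and the simulated values are ours for illustration only, not the paper's exact pipeline.

```python
# Minimal sketch of deriving discrete DE labels in the spirit of the procedure above.
# Thresholds and data are illustrative, not the paper's actual settings.
import numpy as np
from scipy.stats import ranksums

def label_differential_expression(perturbed, control, p_pos=1e-3, p_neg=0.5):
    """Return 1 (DE), 0 (not DE), or None (statistically uncertain, excluded)."""
    _, p_value = ranksums(perturbed, control)
    if p_value < p_pos:   # conservative positive: strong evidence of a shift
        return 1
    if p_value > p_neg:   # conservative negative: no evidence of a shift
        return 0
    return None           # ambiguous examples are dropped from the benchmark

rng = np.random.default_rng(0)
control = rng.normal(loc=1.0, scale=0.3, size=200)
perturbed = rng.normal(loc=0.6, scale=0.3, size=200)   # a clearly down-regulated gene
print(label_differential_expression(perturbed, control))  # -> 1
```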

The textual descriptions extracted from knowledge graphs are manually curated by large consortia, compiling decades of research. We believe that the identifier mapping tables they maintain offer a much cleaner means of extracting information relevant to each gene, compared to search engine / embedding-based approaches more common in NLP. While we cannot guarantee that the knowledge is correct, we do include context regarding data provenance (cell line, assay type, whatever is available), as LLMs have demonstrated the ability to filter noisy information, when this is provided [1].

Finally, we do not claim that the LLM-generated summaries are correct; only that they are helpful for making the end prediction, on which the framework is evaluated.

[1] Allen-Zhu and Li. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. 2024.

Algorithmic novelty

Please see the general response. To summarize: applying NLP techniques out of the box results in near-random performance on PerturbQA, and they must be adapted in domain-specific ways. To demonstrate that this task is feasible, we introduced a minimal LLM-based example (SUMMER), which draws from common LLM techniques. We do not claim that this method is novel from an NLP perspective. We do claim that this LLM-reasoning based framework is novel in context of perturbation modeling.

Specific questions

why not "formalized graph structure with exactly numeric data":

Both GEARS and GAT operate over formal graph structures. GEARS leverages real-valued expression matrices, while GAT is trained on our discretized labels. Both methods underperform in Table 1, and this can be attributed to two factors (Section 3, "modeling perturbations").

  • "Raw" single cell expression matrices are actually the end product of extensive preprocessing pipelines, and they are subject to high aleatoric and epistemic noise (batch effects) [2]. In fact, even two datasets generated by the same lab, of the same cell line, can be largely inconsistent at the individual gene level [3]. This is why we employed stringent criteria on PerturbQA examples (A.2), and why we choose to interpret single-cell perturbation data at the level of discrete insights, in line with best practices (Section 3, "statistical insights").

  • Biological knowledge graphs are highly heterogeneous, spanning many layers of hierarchy, and are annotated with details regarding the context and provenance of each observation. When current graph-based methods translate annotations into adjacencies, the semantics of each relationship are also lost: edges like "enables," "does not enable," "is part of" are all mapped to "1." Finally, different knowledge graphs operate at differing levels of quality/noise, making their harmonization difficult. Therefore, we only use the graph structure to inform the retrieval and summarization of relevant information, rather than as a strict backbone to constrain modeling.
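To make the point about lost edge semantics concrete, here is a toy sketch (the gene names and relations are invented, not drawn from any specific knowledge graph): a typed annotation survives as text for retrieval, but collapses to an uninformative 1 in a binary adjacency.

```python
# Toy illustration: typed KG annotations vs. a binary adjacency.
annotations = [
    ("GENE_A", "enables", "oxidative stress response"),
    ("GENE_A", "does not enable", "mitochondrial translation"),
    ("GENE_B", "is part of", "small ribosomal subunit"),
]

# Graph baselines typically keep only the structure: every relation becomes a 1.
adjacency = {(src, dst): 1 for src, _, dst in annotations}

# A retrieval step can instead hand the LLM the relation verbatim, as text.
def describe(gene):
    return "; ".join(f"{s} {rel} {d}" for s, rel, d in annotations if s == gene)

print(adjacency[("GENE_A", "mitochondrial translation")])  # 1 -- "does not enable" is lost
print(describe("GENE_A"))
# GENE_A enables oxidative stress response; GENE_A does not enable mitochondrial translation
```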

[2] Luecken and Theis. Current best practices in single‐cell RNA‐seq analysis: a tutorial. Mol Syst Biol (2019) 15: e8746.

[3] Nadig et al. Transcriptome-wide characterization of genetic perturbations. 2024.

formal task definitions: We have included formal definitions for each task in Section 4.1, with motivation in Section 3. Could you please let us know which specific aspects of the notation / setting you find confusing, so we can try to improve the presentation? Thank you!

GEARS and GenePT: We evaluate GenePT and GEARS on our benchmark for completeness, as they represent state-of-the-art approaches for this task. We do not consider their adaptations (for discrete classification) part of our primary technical contribution.

Comment

After checking the rebuttal and the rebuttals to all reviewers, I remain concerned about the application scenario of the proposed dataset, PerturbQA. I insist on keeping my evaluation.

Comment

Thank you for your response and for reading our rebuttals! Could you clarify what aspects of the application remain concerning, and which aspects of the task definitions are unclear? This would be very helpful for improving our paper.

Official Review (Rating: 6)

The paper introduces PERTURBQA, a benchmark for using language models to interpret genetic perturbation experiments. These experiments reveal gene functions but are expensive. The proposed SUMMER framework combines biological text data and experimental results, outperforming knowledge-graph-only models on tasks like predicting gene expression changes and gene set enrichment. SUMMER’s language-based outputs offer clearer insights for biologists, enhancing interpretability and accessibility in modeling biological data.

Strengths

The paper is well-organized and easy to follow.

The paper proposes an interesting problem of using LLMs to predict gene perturbation.

Weaknesses

  1. The biggest concern with the proposed method is the problem itself. For example, SUMMER relies on a biological knowledge graph. However, the biological knowledge graph is built from experiments and analysis. Compared with gene perturbation prediction from scRNA data, the knowledge-building requires updating the knowledge at once to build the possible links among genes. This indicates that the method risks overfitting to known knowledge and being unable to discover new perturbation genes; it is more like a retrieval system for gene interactions we already know.

To further demonstrate how the proposed method generalizes to more situations, my suggestion is to include more datasets to avoid overfitting. For example, in scGPT [1], the evaluation for perturbation spans 3 different perturbation datasets.

  2. In previous studies of gene perturbation, one gene input can output several gene expressions, which are judged at one time. However, the LLM framework suggests that the source and target form a pair of data. This means that if we want to see every gene expression after perturbation, we must go through every combination of gene pairs with LLMs. This is really time-consuming in this setting. Besides, the experiments are limited to a small set of genes.

In GEARS, scGPT, and CellPLM, they set perturbation tasks for both 1 gene unseen and 2 genes unseen, so they can claim that the proposed methods handle different perturbation situations. However, in this paper, there are no such illustrations. Besides, in GEARS, the authors predict 102 genes at one time for novel gene perturbation analysis (Fig. 5a in GEARS). In scGPT, they predict 210 combinations (Fig. 3 in scGPT). In this paper, there is no evidence that the proposed method can do such things.

  3. The authors should include gene-pretrained methods such as scGPT [1] and Geneformer [2]. The text-based methods are pretrained on a great deal of textual information, so the comparison against only MLP, GAT, and GEARS is unfair.

As I mentioned before, the methods pretrained on single-cell data are considered among the most powerful methods for gene perturbation tasks at this moment. In scGPT, they report that scGPT can achieve 50% higher performance than GEARS. Instead of training on lots of textual information, they trained directly on gene expression. If the authors want to claim that the text-based pretrained method is more powerful than predicting genes from scRNA data, they need to discuss these models.

  4. Compared with other ML work on gene expression, such as CellPLM [3], LangCell [4], and Cell2Sentence [5], this paper lacks biological analysis from a biological point of view. This limits further applications.

In scGPT and GEARS, they have a figure of predicted gene expression profiles for the perturbation conditions. The other papers are not perturbation-specific methods, but they are AI-for-biology papers published at top AI conferences. They also show some biological analysis to strengthen their methods and their own contributions, such as marking genes on the cells to show the cell states.

[1] Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI[J]. Nature Methods, 2024: 1-11.

[2] Theodoris C V, Xiao L, Chopra A, et al. Transfer learning enables predictions in network biology[J]. Nature, 2023, 618(7965): 616-624.

[3] Wen H, Tang W, Dai X, et al. CellPLM: Pre-training of Cell Language Model Beyond Single Cells[C]//The Twelfth International Conference on Learning Representations.

[4] Zhao S, Zhang J, Wu Y, et al. LangCell: Language-Cell Pre-training for Cell Identity Understanding[C]//Forty-first International Conference on Machine Learning.

[5] Levine D, Rizvi S A, Lévy S, et al. Cell2Sentence: Teaching Large Language Models the Language of Biology[C]//Forty-first International Conference on Machine Learning.

Questions

see weakness

Comment

Thank you for your review and questions. We hope this response clarifies certain points of confusion, and we look forward to discussing with you!

Retrieval vs. discovery

"retrieval system:" Please see the general response. To summarize: Only ~3% of our testing pairs physically interact in any context, while only ~20% of pairs are directly connected by any annotation (including coarse ones). We find that the presence of physical interactions is minimally predictive of differential expression (DE).

"my suggestion is to include more datasets [...] For example, in scGPT, the evaluation for perturbation is across 3 different perturbation datasets"

In scGPT, the evaluation was conducted over 3 experiments in the same K562 cell line, including the K562 essential screen that we also incorporate. The other two are older (2016, 2019) and much smaller, as single-cell CRISPR technologies have become more scalable and efficient in the past 5 years.

In contrast, our experiments cover 4 cell lines, derived from 5 experiments, which are the largest publicly-available Perturb-seq screens [1,2]. Biologically, cell lines are very distinct, and they are derived from different cancer tumors / other conditions (K562 myelogenous leukemia, RPE non-cancerous, HepG2 liver cancer, Jurkat acute T cell leukemia). In particular, different genes are expressed, and among the same genes, the marginal distributions (of expression) and gene-gene relationships may differ. To ensure the quality and consistency of our benchmark, we have chosen to focus on these 5 screens, which are larger and published more recently.

[1] Replogle et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell. 2022.

[2] Nadig et al. Transcriptome-wide characterization of genetic perturbations. 2024.

Combinatorial perturbations

The lack of combinatorial perturbations is less a limitation of the method than of evaluation.

To craft a "proof of concept" for combinatorial perturbations, it would be relatively straightforward to summarize the neighborhoods of each participating gene, and prompt for any synergy/lack thereof.

However, existing datasets for combinatorial perturbations are limited. The most commonly analyzed are Norman et al. [3] and Wessels et al. [4], each containing around 100 perturbation pairs. These two experiments operate in different modalities (CRISPR activation vs. inhibition), and since there are no alternatives for comparison, it is difficult to quantify the quality of these data. A major goal of this work was to craft a trustworthy benchmark for perturbation modeling, so we chose to focus on single perturbations instead. Therefore we consider combinations currently out of scope, but an opportunity for future work when better datasets are available.

[3] Norman et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science. 2019.

[4] Wessels et al. Efficient combinatorial targeting of RNA transcripts in single cells with Cas13 RNA Perturb-seq. Nat Methods. 2023.

Comment

Additional biological analysis

Thank you for the recommendation! We have updated the draft with several additional analyses regarding the qualitative aspects of our framework (currently in Appendix C), including evaluation from a domain expert. Here is a summary.

  • Clusters that elude manual annotation tend to be smaller or exhibit lower agreement. On these, gene set over-representation analysis focuses on highly specific gene sets, which each cover subsets of these clusters, rather than the whole. The LLM takes the opposite approach, and its summaries tend to "lift" the description to higher levels of hierarchy (Table 8). The two strategies provide orthogonal information, though the LLM outputs may be more readable.

  • We analyzed 300 generations (3 trials of 100 DE examples) to understand common failure modes (detailed examples in C.3). Errors and inconsistencies primarily resulted from deductions backed by overly-generic information. For example, the LLM may list an excessively broad set of influences, e.g. "mitochondrial function, protein synthesis, or transcriptional regulation," which affect nearly everything.

    In several instances, the LLM was also confused between concepts which may be loosely connected, but not in the same context. For example, Gene A is upstream of stress signaling, e.g. "oxidative stress," which is related to the mitochondria. However, Gene A is not responsible for healthy mitochondria function, and should not respond similarly to genes that are.

  • Finally, since this paper focuses on providing value to biologists, we recruited a domain specialist (molecular biologist, trained in wet lab and computational biology) for this task (See C.1).

Overall, the LLM-generated summary was equal to or more informative than the classical gene set enrichment results in 92% of cases, and agrees with the independent annotator in 72% of cases.

  1. In 21/25 cases, the biologist reported that the LLM-generated summary was more informative. In 2/25 cases, they contained the same amount of information; and in 2/25 cases, the gene set contained more information.

  2. In 18/25 cases, the biologist reported that the LLM summary captured the same biology as the original human annotation (our ground truth labels).

  3. In the 2 cases where the gene set contained more information, a list of specific protein complexes was discovered, e.g.

    EIF2AK4 (GCN2) binds tRNA, Aminoacyl-tRNA  binds to the ribosome at the A-site, 80S:Met-tRNAi:mRNA:SECISBP2:Sec-tRNA(Sec):EEFSEC:GTP is hydrolysed to 80S:Met-tRNAi:mRNA:SECISBP2:Sec and EEFSEC:GDP by EEFSEC, UPF1 binds an mRNP with a termination codon preceding an Exon Junction Complex, Translocation of ribosome by 3 bases in the 3' direction, Translation of ROBO3.2 mRNA initiates NMD.
    

    However, this information-rich output is hard to interpret, compared to the LLM output, which the annotator marked as agreeing with the label of "translation."

    Ribosomal Protein Components Involved in Translation: This gene set is comprised of components of the large and small ribosomal subunits, which are essential for protein synthesis and translation. These genes are involved in the assembly and function of the ribosome, facilitating the translation of messenger RNA into protein.
    
  4. In the 7/25 cases where the LLM summary differed from the human annotation, the LLM annotation tended to miss some highly specific terms, e.g. "targets of nonsense-mediated decay" was generalized to "stress response," and "dysregulated lncRNA antisense transcripts" was generalized to "nuclear gene regulation." Related terms tend to be sparsely annotated in Gene Ontology, so this indicates that it would be useful to tune the granularity of generations in the future, or to generate multiple candidates for specific descriptions.

scGPT: We finetuned scGPT + GEARS using their published "perturbation" tutorial. Due to time constraints, we were only able to benchmark K562, but we will add the full results to the final paper. We will also include discussion of these references in our work. Thank you!

scGPT + GEARS averaged 0.52 AUC on K562 (compared to SUMMER 0.6, GenePT 0.57). Similar to other methods, this model predicts "perfectly" on some perturbations (104/267 with AUC=1) while guessing randomly on the remaining, so it appears that either the embeddings are unhelpful, or the GEARS backbone (on which scGPT builds) is suboptimal.
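For reference, the per-perturbation breakdown quoted here (a subset of perturbations scored perfectly while the rest sits near chance) can be computed along these lines; the data below are toy values, not our evaluation outputs:

```python
# Sketch: per-perturbation AUC, used to spot "all-or-nothing" behaviour where a
# model is perfect on some perturbations and near-random on the rest. Toy data only.
from sklearn.metrics import roc_auc_score

def per_perturbation_auc(results):
    """results: {perturbation: (true_labels, predicted_scores)}, both classes present."""
    return {p: roc_auc_score(y, s) for p, (y, s) in results.items() if len(set(y)) > 1}

toy = {
    "PERT_1": ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]),  # separable   -> AUC = 1.0
    "PERT_2": ([1, 0, 1, 0], [0.4, 0.6, 0.7, 0.5]),  # near chance -> AUC = 0.5
}
aucs = per_perturbation_auc(toy)
print(aucs)
print("fraction of perturbations with AUC == 1:",
      sum(a == 1.0 for a in aucs.values()) / len(aucs))
```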

Comment

Thank you for your response. Personally, my main concern is how this method influences perturbation analysis from both a logical and practical perspective. I believe the perturbation problem is a complex process in the biological domain. As you mentioned, even the identified perturbation pairs may not accurately represent the actual perturbations. With this in mind, I am unsure whether the knowledge derived is truly reliable for the perturbation task.

One advantage of using pretrained models in perturbation analysis is their ability to study RNA data without needing to verify the knowledge's accuracy. In practice, I also think the pretrained single-cell models need less data to fine-tune, infer, or predict perturbation combinations. Additionally, it's worth noting that scGPT only extracts matching genes from GEARS for specific datasets rather than utilizing GEARS' structure. Therefore, the results based on the GEARS backbone may not fully align with the implementation.

I appreciate this novel perspective on the perturbation task, particularly how it combines single-cell problems with LLMs, and I will adjust my score because of this novelty.

Comment

Thank you for your response and for updating your score!

We agree that scGPT and other single cell foundation models could capture orthogonal information, and we will update our paper with this discussion / error analysis (when full results are available).

We did run the published scGPT codebase for perturbations. Upon closer inspection, we realized that the "graph" part of each input is discarded and only the expression is retained, so the results should be faithful.

Official Review (Rating: 5)

This paper introduces PERTURBQA, a benchmark suite for evaluating LLMs in reasoning over structured biological data from genetic perturbation experiments. The paper also presents SUMMER, a language-based framework designed to predict differential expression and direction of gene changes as well as to perform gene set enrichment. SUMMER combines knowledge graphs and retrieval-augmented generation to enhance interpretability, matching or surpassing state-of-the-art models on PERTURBQA tasks. The benchmark is designed to help researchers interpret outcomes in high-content genetic perturbation experiments and understand model limitations.

Strengths

  1. Novel Application of Language-Based Reasoning to Perturbation Tasks: The paper offers a novel application of language-based reasoning to biological data, allowing PERTURBQA tasks to be approached in an interpretable way that benefits domain experts.
  2. Comprehensive Benchmark Design: PERTURBQA includes real-world tasks relevant to differential expression, gene direction, and gene set enrichment, providing a holistic assessment of model reasoning on biological data.
  3. Interpretable Model Outputs: SUMMER’s use of knowledge graphs and retrieval-augmented generation produces outputs that domain experts can readily interpret, addressing the limitations of black-box models in biological contexts.

Weaknesses

1. Insufficient Related Work Discussion: The paper could better situate its contribution within the existing literature, especially regarding related work on several aspects: (a) graph-to-text works, such as Zhao et al., 2023 and Chen et al., 2023 (see the survey for details); (b) graph RAG works, e.g. GraphRAG, He et al., 2024, Mavromatis et al., 2024. Discussing these works would strengthen the contextual grounding of the proposed approach.

2. Marginal Technical Contribution: As stated in W1, both ideas of using LLMs for graph tasks and graph RAG have been widely studied before, which renders the contribution of this paper marginal.

3. Limited Baselines: A wider range of recent baselines, especially from the graph domain (e.g. those mentioned above in W1 and W2), should be included to provide a more comprehensive evaluation and enable better comparison of SUMMER's effectiveness.

4. Potential Data Leakage Concerns: Given that LLaMA3 might have been exposed to substantial amounts of biological data, including related gene interactions, there's a risk of data leakage. A clearer evaluation of SUMMER's performance independent of potential pre-trained biases could help clarify whether its good performance stems from model design or from pre-existing knowledge in the model.

Questions

NA

Comment

Thank you for your review and suggestions! We hope this response clarifies some of your confusion, and we look forward to discussing.

Literature review: We are happy to expand our literature review and include a wider range of work on the NLP side. Thank you for the recommendations!

Technical contribution: Please see the general response. To summarize: Our primary contribution is PerturbQA, a carefully crafted benchmark, derived from experimental assays and open knowledge graphs. Applying NLP techniques out of the box results in near-random performance on PerturbQA, and they must be adapted in domain-specific ways. To demonstrate that this task is feasible, we introduced a minimal LLM-based example (SUMMER), which draws from common LLM techniques. We do not claim that this method is novel from an NLP perspective. We do claim that this LLM-reasoning based framework is novel in context of perturbation modeling.

Additional baselines: Please see the general response. To summarize: As Table 1 demonstrates, directly applying NLP methods out-of-the-box on PerturbQA is comparable to random guessing. It is non-trivial to adapt these methods for molecular biology reasoning, especially because the vast majority of biological literature is behind paywalls (vs. in machine learning, where everything is open).

Data leakage: Please see the general response. To summarize: Our "no CoT" and "no retrieve" ablations performed near-random, so it appears that the base Llama model has very poor understanding of experimental outcomes. Furthermore, the gene pairs are also minimally represented in the knowledge graphs, and the presence of a known physical interaction is not predictive.

Comment

Thank you for the effort put into the rebuttal.

However, my primary concerns regarding the discussion of graph literature remain unresolved, as these works were neither compared nor even discussed in the updated manuscript.

As a result, I will maintain my negative score.

Comment

Thank you for your response! We have updated the manuscript with your recommendations and additional relevant work (highlighted in blue on page 3).

Currently we do not have the compute to run additional LLM-based experiments before the paper edit deadline, but please let us know if this addresses your concerns in part.

Comment

Thanks for the prompt response and revision of the paper. However, I checked the updated related work and still believe that these methods weaken the technical contribution of the proposed method. Hence, I will maintain my original score.

Comment

Thank you for your feedback! Please let us know if there's anything else we can do to improve the paper.

Finally, we would like to reiterate that the primary novelty of this work is not the proposed algorithm, but the framework for modeling perturbation experiments. As we write in the Introduction (Paragraph 2) and Background (Section 3), current approaches for modeling perturbations are misaligned with what biologists aim to glean from these large-scale screens. We are the first to propose that predicting differential expression / direction of change and summarizing gene sets are more realistic endpoints, compared to existing regression-based objectives.

Our hypothesis was simply that language can be helpful for modeling perturbations in this setting. There are many approaches in modern NLP that could be adapted towards perturbation modeling, and we have prepared and presented perturbation screens / relevant knowledge sources in approachable formats for this purpose. We hope that this work can encourage and facilitate such exploration.

Official Review (Rating: 6)

The paper introduces a benchmark for evaluating LLM reasoning on biological perturbation experiments, and an LLM-based framework that uses knowledge graphs and prior data to outperform current methods in interpretability and performance on these tasks.

Strengths

  1. The method appears to be sound, combining RAG and CoT prompting with knowledge graphs for handling complex biological relationships.
  2. The proposed framework emphasizes interpretable outputs, which is beneficial to biological research.
  3. The evaluation was thorough, with both graph and language-centric baselines across multiple datasets.

Weaknesses

  1. The model focuses on discrete perturbation outcomes and does not address combinatorial perturbations, which are common in biological studies.
  2. The interpretability of the method comes at the cost of additional complexity in prompt engineering, and it is not known how the performance is sensitive to the prompt design.
  3. The paper acknowledges limitations in current evaluation metrics for enrichment tasks. It would be better to see the utilization of new, domain-specific metrics.

Questions

  1. How well does SUMMER generalize to new or sparsely annotated biological datasets? Are there performance drops when applied to less-studied cell lines or organisms?
  2. Did the authors perform an error analysis to determine whether specific types of perturbations or gene functions are harder for SUMMER to predict accurately?
Comment

Thank you for your review and suggestions! We hope this response clarifies some of your questions, and we look forward to discussing with you.

Combinatorial perturbations

The lack of combinatorial perturbations is less a limitation of the method than of evaluation.

To craft a "proof of concept" for combinatorial perturbations, it would be relatively straightforward to summarize the neighborhoods of each participating gene, and prompt for any synergy/lack thereof.

However, existing datasets for combinatorial perturbations are limited. The most commonly analyzed are Norman et al. [1] and Wessels et al. [2], each containing around 100 perturbation pairs. These two experiments operate in different modalities (CRISPR activation vs. inhibition), and since there are no alternatives for comparison, it is difficult to quantify the quality of these data. A major goal of this work was to craft a trustworthy benchmark for perturbation modeling, so we chose to focus on single perturbations instead. Therefore we consider combinations currently out of scope, but an opportunity for future work when better datasets are available.

[1] Norman et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science. 2019.

[2] Wessels et al. Efficient combinatorial targeting of RNA transcripts in single cells with Cas13 RNA Perturb-seq. Nat Methods. 2023.

Domain specific evaluation

Due to the open-ended nature of the gene set task, automated evaluation methods are limited in their ability to reflect practical utility. Since this paper focuses on providing value to biologists, we recruited a domain specialist (molecular biologist, trained in wet lab and computational biology) for this task (See C.1).

Overall, the LLM-generated summary was equal to or more informative than the classical gene set enrichment results in 92% of cases, and agrees with the independent annotator in 72% of cases.

  1. In 21/25 cases, the biologist reported that the LLM-generated summary was more informative. In 2/25 cases, they contained the same amount of information; and in 2/25 cases, the gene set contained more information.

  2. In 18/25 cases, the biologist reported that the LLM summary captured the same biology as the original human annotation (our ground truth labels).

  3. In the 2 cases where the gene set contained more information, a list of specific protein complexes was discovered, e.g.

    EIF2AK4 (GCN2) binds tRNA, Aminoacyl-tRNA  binds to the ribosome at the A-site, 80S:Met-tRNAi:mRNA:SECISBP2:Sec-tRNA(Sec):EEFSEC:GTP is hydrolysed to 80S:Met-tRNAi:mRNA:SECISBP2:Sec and EEFSEC:GDP by EEFSEC, UPF1 binds an mRNP with a termination codon preceding an Exon Junction Complex, Translocation of ribosome by 3 bases in the 3' direction, Translation of ROBO3.2 mRNA initiates NMD.
    

    However, this information-rich output is hard to interpret, compared to the LLM output, which the annotator marked as agreeing with the label of "translation."

    Ribosomal Protein Components Involved in Translation: This gene set is comprised of components of the large and small ribosomal subunits, which are essential for protein synthesis and translation. These genes are involved in the assembly and function of the ribosome, facilitating the translation of messenger RNA into protein.
    
  4. In the 7/25 cases where the LLM summary differed from the human annotation, the LLM annotation tended to miss some highly specific terms, e.g. "targets of nonsense-mediated decay" was generalized to "stress response," and "dysregulated lncRNA antisense transcripts" was generalized to "nuclear gene regulation." Related terms tend to be sparsely annotated in Gene Ontology, so this indicates that it would be useful to tune the granularity of generations in the future, or to generate multiple candidates for specific descriptions.

Comment

Additional biological analysis

Thank you for the recommendation! We have updated the draft with several additional analyses regarding the qualitative aspects of our framework (currently in Appendix C), including evaluation from a domain expert (above). Here is a summary.

  • Clusters that elude manual annotation tend to be smaller or exhibit lower agreement. On these, gene set over-representation analysis focuses on highly specific gene sets, which each cover subsets of these clusters, rather than the whole. The LLM takes the opposite approach, and its summaries tend to "lift" the description to higher levels of hierarchy (Table 8). The two strategies provide orthogonal information, though the LLM outputs may be more readable.

  • We analyzed 300 generations (3 trials of 100 DE examples) to understand common failure modes (detailed examples in C.3). Errors and inconsistencies primarily resulted from deductions backed by overly-generic information. For example, the LLM may list an excessively broad set of influences, e.g. "mitochondrial function, protein synthesis, or transcriptional regulation," which affect nearly everything.

    In several instances, the LLM was also confused between concepts which may be loosely connected, but not in the same context. For example, Gene A is upstream of stress signaling, e.g. "oxidative stress," which is related to the mitochondria. However, Gene A is not responsible for healthy mitochondria function, and should not respond similarly to genes that are.

Other questions

Sensitivity to prompt design: Due to computational constraints, we were unable to run inference a large number of times to evaluate a diverse set of prompts, so the SUMMER prompts were not particularly tuned for performance. During model development, we did observe that the 8b model tends to have an upper limit on the effective prompt length, and excessively long prompts were harder to follow (this could be resolved with longer context models, presumably).

Less well-studied systems: Of the cell lines we study, RPE1 is perhaps the least well-characterized (though it is not genuinely rare). RPE1 is a non-cancerous cell line, and in the Gene Expression Omnibus database, there are only 5.1% as many RPE1 datasets as K562 datasets. Performance on DE/Dir does not differ noticeably on RPE1 compared to other cell lines.

Comment

We thank the authors for their efforts in answering my questions. While you have addressed most of my concerns, the reliance on manual prompt design and the limited budget for prompt-sensitivity evaluation prevent a higher mark on the method's novelty and the breadth of its research influence. So I have decided to maintain my score.

Comment

Thank you for your time!

Official Review (Rating: 3)

This paper introduces PERTURBQA, focusing on prompting LLMs for gene perturbation and gene set enrichment. It also presents a reasoning method, SUMMER, designed for this task. Experiments demonstrate the effectiveness of the proposed approach.

Strengths

The proposed task of using large language models (LLMs) to address gene-related tasks is innovative and engaging.

The paper is clearly presented, well-written, and easy to follow.

Weaknesses

Limited Experimental Insight: The experiments primarily conclude that SUMMER outperforms baselines but provide few additional insights into the new task. Given the novelty of using LLMs for this type of gene analysis, more extensive experiments and in-depth analysis would be valuable.

Insufficient Baselines and Model Comparisons: The baselines selected are not sufficiently comprehensive. Numerous studies have explored LLM reasoning with graph data or text retrieval, which are closely related to this method. Including these in the comparisons could yield deeper insights into the effectiveness of LLMs for gene-related tasks.

Metrics for Gene Set Enrichment: The suitability of ROUGE-1 recall and BERT Score for measuring the accuracy of gene set enrichment results is questionable. Human evaluation or evaluations with LLMs may offer more reliable assessments.

Questions

Retrieved Content for Reasoning: Is the retrieved content primarily focused on the gene's function? If so, could this lead to information leakage in the question? For example, if asked about the influence of gene A on gene C, and the retrieved content on gene A directly states that A turns on C, wouldn't this turn the task into reading comprehension rather than genuine prediction?

Multi-Hop Reasoning: How does SUMMER handle multi-hop reasoning if it only retrieves information on the one-hop neighbors of the perturbation and target gene?

Comment

Thank you for your review and recommendations! We hope this response provides clarity regarding your questions, and we look forward to discussing.

Experimental insight

Thank you for the recommendation! We have updated the draft with several additional analyses regarding the qualitative aspects of our framework (currently in Appendix C), including evaluation from a domain expert (further below). Here is a summary.

  • Clusters that elude manual annotation tend to be smaller or exhibit lower agreement. On these, gene set over-representation analysis focuses on highly specific gene sets, which each cover subsets of these clusters, rather than the whole. The LLM takes the opposite approach, and its summaries tend to "lift" the description to higher levels of hierarchy (Table 8). The two strategies provide orthogonal information, though the LLM outputs may be more readable.

  • We analyzed 300 generations (3 trials of 100 DE examples) to understand common failure modes (detailed examples in C.3). Errors and inconsistencies primarily resulted from deductions backed by overly-generic information. For example, the LLM may list an excessively broad set of influences, e.g. "mitochondrial function, protein synthesis, or transcriptional regulation," which affect nearly everything.

    In several instances, the LLM was also confused between concepts which may be loosely connected, but not in the same context. For example, Gene A is upstream of stress signaling, e.g. "oxidative stress," which is related to the mitochondria. However, Gene A is not responsible for healthy mitochondria function, and should not respond similarly to genes that are.

Human evaluation

We are happy to provide human evaluation and will update the manuscript accordingly (See C.1). We recruited a domain specialist (molecular biologist, trained in wet lab and computational biology) for this task.

Overall, the LLM-generated summary was equal to or better than the classical gene set enrichment results in 92% of cases, and agrees with the independent annotator in 72% of cases.

  1. In 21/25 cases, the biologist reported that the LLM-generated summary was more informative. In 2/25 cases, they contained the same amount of information; and in 2/25 cases, the gene set contained more information.
  2. In 18/25 cases, the biologist reported that the LLM summary captured the same biology as the original human annotation (our ground truth labels).

Error analysis of human annotation:

  1. In the 2 cases where the gene set contained more information, a list of specific protein complexes was discovered, e.g.

    EIF2AK4 (GCN2) binds tRNA, Aminoacyl-tRNA  binds to the ribosome at the A-site, 80S:Met-tRNAi:mRNA:SECISBP2:Sec-tRNA(Sec):EEFSEC:GTP is hydrolysed to 80S:Met-tRNAi:mRNA:SECISBP2:Sec and EEFSEC:GDP by EEFSEC, UPF1 binds an mRNP with a termination codon preceding an Exon Junction Complex, Translocation of ribosome by 3 bases in the 3' direction, Translation of ROBO3.2 mRNA initiates NMD.
    

    However, this information-rich output is hard to interpret, compared to the LLM output, which the annotator marked as agreeing with the label of "translation."

    Ribosomal Protein Components Involved in Translation: This gene set is comprised of components of the large and small ribosomal subunits, which are essential for protein synthesis and translation. These genes are involved in the assembly and function of the ribosome, facilitating the translation of messenger RNA into protein.
    
  2. In the 7/25 cases where the LLM summary differed from the human annotation, the LLM annotation tended to miss some highly specific terms, e.g. "targets of nonsense-mediated decay" was generalized to "stress response," and "dysregulated lncRNA antisense transcripts" was generalized to "nuclear gene regulation." Related terms tend to be sparsely annotated in Gene Ontology, so this indicates that it would be useful to tune the granularity of generations in the future, or to generate multiple candidates for specific descriptions.

Comment

Other questions

Multi-hop reasoning:

  • Due to the hierarchical nature of the knowledge graphs under consideration (e.g. the GO hierarchy), "one" hop represents connections at many levels of granularity, ranging from physical interactions in small protein complexes, to concepts as generic as "cell surface" (625 genes) or "GPCR signaling pathway" (960 genes). Thus, even though we only retrieve "1-hop" neighbors, the model has sufficient information to perform (effectively) multi-hop reasoning.
  • As a direct result of high-connectivity nodes (large protein complexes, coarse annotations), the size of "multi-hop" neighborhoods increases exponentially. For example, if we only consider physical interactions from STRING and CORUM, the median "1-hop" neighborhood has 4 genes, while the median "2-hop" neighborhood has 4456 genes (relatively uninformative).
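A small sketch of this blow-up on a toy graph (the gene and complex names are hypothetical; the median sizes quoted above come from STRING/CORUM, not from this example):

```python
# Sketch: 1-hop vs. 2-hop neighborhood sizes when a neighbor sits in a large complex.
import networkx as nx

G = nx.Graph()
G.add_edge("GENE_A", "GENE_B")                                # small direct neighborhood
G.add_edges_from(("GENE_B", f"RPL_{i}") for i in range(500))  # GENE_B is in a huge complex

def k_hop_neighborhood(graph, node, k):
    return set(nx.single_source_shortest_path_length(graph, node, cutoff=k)) - {node}

print(len(k_hop_neighborhood(G, "GENE_A", 1)))  # 1   -- just GENE_B
print(len(k_hop_neighborhood(G, "GENE_A", 2)))  # 501 -- GENE_B plus the whole complex
```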

Baselines: Please see the general response. To summarize: As Table 1 demonstrates, directly applying NLP methods out-of-the-box on PerturbQA is comparable to random guessing. It is non-trivial to adapt these methods for molecular biology reasoning, especially because the vast majority of biological literature is behind paywalls (vs. in machine learning, where everything is open).

Information leakage: Please see the general response. To summarize: Only ~3% of our testing pairs physically interact in any context, while only ~20% of pairs are directly connected by any annotation (including coarse ones). We find that the presence of physical interactions is minimally predictive of differential expression (DE).

Comment

Thank you for your detailed response. While some of my concerns, such as multi-hop reasoning, have been addressed, significant issues remain unresolved. These include the lack of baselines, questions about the novelty of the proposed algorithm, the evaluation paradigm, and the limited experimental insights. Therefore, I will maintain my current score.

Comment

Thank you for your response and for reading our rebuttal!

We would like to reiterate that the primary contribution of this work is not the proposed algorithm, but the framework for modeling perturbation experiments.

As we write in the Introduction (Paragraph 2) and Background (Section 3), current approaches for modeling perturbations are misaligned with what biologists aim to glean from these large-scale screens. We are the first to propose that predicting differential expression / direction of change and summarizing gene sets are more realistic endpoints, compared to existing regression-based objectives.

Furthermore, our hypothesis was that language-based reasoning can be helpful for perturbation modeling. To demonstrate this, we developed a lightweight proof of concept, which was evaluated against the state of the art in perturbation modeling. These include graph (GEARS, GAT), language (GenePT), and pretrained single-cell (scGPT, in rebuttal) baselines. We are the first to integrate experimental data alongside textual knowledge graphs towards this application.

There are many approaches in modern NLP that could be adapted towards perturbation modeling, and this work exists to encourage and facilitate such exploration, but we view this as beyond the scope of the current paper. We have prepared and presented perturbation screens / relevant knowledge sources in approachable formats for this purpose.

Finally, we have provided additional insights, analyses, and human evaluation in our revised manuscript, so it would be very helpful to understand what concerns you feel remain unaddressed. Thank you for your time!

Comment

Dear reviewers,

Thank you for reading our paper and providing thorough feedback. We have updated our paper with a number of illustrative examples, which we hope will provide further insight into the questions raised here.

We would also like to clarify the scope and contribution of this work, which we believe are central to its interpretation.

  1. The primary contribution of this work is PerturbQA, a carefully curated benchmark for language-based reasoning and scientific discovery, in the context of single-cell perturbations. Compared to existing benchmarks, which focus on scientific coding tasks [1] or reasoning over known facts [2], PerturbQA is predictive in nature, and its tasks are unsolved. PerturbQA draws upon experimental assays and harmonized knowledge graphs to replicate the reasoning required to "connect the dots" between known biology and unanswered questions.

    Information leakage is the primary argument against this vision. However, we believe that the experimental outcomes from which we derive PerturbQA are minimally represented in existing knowledge bases and LLM pretrained weights. Biological knowledge graphs tend to report relationships that have been well-validated by targeted studies, rather than the outcomes of large-scale screens. The gene pairs whose relationship we query are very much a superset of genes with well-characterized relationships. Specifically, only ~3% of gene pairs in our test sets physically interact (row 2, in any context, including other animals), and only ~20% share any pathway/annotation, including at the coarsest levels (row 4).

    Metric            K562    RPE1    HepG2   Jurkat
    physical, DE      0.094   0.063   0.075   0.106
    physical, total   0.032   0.025   0.027   0.029
    network, DE       0.214   0.204   0.218   0.253
    network, total    0.222   0.209   0.220   0.208

    There is little difference between the positive/negative pairs in terms of higher-level connectivity (row 3 vs. 4). Physically interacting genes are more likely to result in differential expression in our dataset (row 1 vs 2), but having a physical interaction is minimally predictive of DE, as AUC hovers around 0.5, while Ours is consistently better than random guessing (from Table 1).

    Predictor (AUC)   K562    RPE1    HepG2   Jurkat
    physical = 1      0.53    0.52    0.52    0.54
    Ours              0.60    0.58    0.61    0.58

    Finally, Nadig et al. 2024 was published strictly after we downloaded the knowledge graphs. While the cell lines in question have been studied in prior work, in other contexts, Nadig et al. 2024 released the first large-scale Perturb-seq screens in these two cell lines.

  2. Classic LLM reasoning strategies achieve near-random performance out-of-the-box on PerturbQA, and it is non-trivial to adapt them in domain-aware ways. Within the past two years, there has been a plethora of brilliant, inference-time LLM strategies, from in-context learning, to CoT, ToT, (Graph) RAG, and more. However, as we demonstrate for ICL and CoT, naively applying existing templates to biological reasoning leads to near-random performance. This also demonstrates the lack of answers within the pretrained weights themselves. With regards to retrieval-based baselines, a vast amount of biological literature is unfortunately inaccessible behind paywalls, or otherwise subject to terms unfavorable for AI development. As a result, it is difficult to benchmark standard retrieval-based strategies on equal footing.

  3. To demonstrate that language-based reasoning is feasible on PerturbQA, SUMMER integrates standard LLM techniques with domain-specific ways to query structured knowledge. To the best of our knowledge, SUMMER is the first fully LLM-based method for unseen perturbation prediction and rationalization, without relying on any external classifiers or embedding models. Throughout our work, we acknowledge that techniques like CoT and retrieval are common in current LLM systems. The key contribution lies in what information is useful to retrieve, and how to frame the prompts to encourage reasonable reasoning.

[1] Rein et al. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. 2023.

[2] Laurent et al. LAB-Bench: Measuring Capabilities of Language Models for Biology Research. 2024.

AC Meta-Review

This is a new benchmark contributing to understanding the reasoning capability of large language models in structured biological data. This task will be of interest to communities interested in the space of AI and Biology.
However, the recommendation at this point is borderline. Since the contribution is not so much methodological, it is difficult to make a strong case for this paper. On the other hand, ICLR has Datasets as one of the topics in the call for papers, and one could fit this paper under that category.

Additional Comments from Reviewer Discussion

In the rebuttal phase, the authors were able to address some of the concerns. One important point that was not satisfactorily answered is that, because of manual prompt design, such benchmarks may not see much utility.

Final Decision

Accept (Poster)