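PaperHub
Rating: 5.7/10
Poster · 3 reviewers
Lowest: 4 · Highest: 7 · Std. dev.: 1.2
Individual ratings: 7, 4, 6
Confidence: 3.7
Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
NeurIPS 2024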

Gene-Gene Relationship Modeling Based on Genetic Evidence for Single-Cell RNA-Seq Data Imputation

Submitted: 2024-05-14 · Updated: 2025-01-06
TL;DR

We propose a novel scRNA-seq data imputation scheme based on genetic evidence.

Abstract

Single-cell RNA sequencing (scRNA-seq) technologies enable the exploration of cellular heterogeneity and facilitate the construction of cell atlases. However, scRNA-seq data often contain a large portion of missing values (false zeros) or noisy values, hindering downstream analyses. To recover these false zeros, propagation-based imputation methods have been proposed using $k$-NN graphs. However, they model only associating relationships among genes within a cell, while, according to well-known genetic evidence, there are both associating and dissociating relationships among genes. To apply this genetic evidence to gene-gene relationship modeling, this paper proposes a novel imputation method that newly employs dissociating relationships in addition to associating relationships. Our method constructs a $k$-NN graph to additionally model dissociating relationships via the negation of a given cell-gene matrix. Moreover, our method standardizes the value distribution (mean and variance) of each gene to have standard distributions regardless of the gene. Through extensive experiments, we demonstrate that the proposed method achieves exceptional performance gains over state-of-the-art methods in both cell clustering and gene expression recovery across six scRNA-seq datasets, validating the significance of using complete gene-gene relationships in accordance with genetic evidence. The source code is available at https://github.com/daehoum1/scCR.
Keywords
scRNA-seq, imputation, bioinformatics

Reviews and Discussion

Review
Rating: 7

This paper introduces a new scRNA-seq imputation method, scCR, which uses both associating and dissociating gene-gene relationships to improve imputation accuracy, especially on noisy datasets. The method constructs comprehensive k-NN graphs among cells and among genes, and standardizes the value distribution of each gene to enhance gene-gene relationship modeling. Comprehensive analysis on multiple datasets demonstrates the advantage of scCR over state-of-the-art methods in two tasks, gene expression recovery and cell clustering, especially for rare cell types.

Strengths

This paper tackles an important problem, scRNA-seq data imputation, with extra information from negatively correlated gene relationships. The idea itself is straightforward yet effective. The paper includes extensive and solid experiments on multiple datasets. The experiments show solid performance improvement compared to multiple baseline methods. The paper explores the variance on dataset selection and dropout rate. This paper also includes a scalability study to demonstrate its advantage on runtime.

Weaknesses

Although the paper explores variance across datasets and dropout rates, the analysis focuses mainly on the dataset. It would be helpful to include a detailed sensitivity analysis of key hyperparameters, such as $k$ for k-NN, $\alpha$, $\beta$, and $\gamma$. The paper includes a runtime analysis, but not a memory usage analysis. The paper would be more comprehensive if it included a memory usage analysis to show its advantage in scalability.

Questions

The overall paper is quite complete and comprehensive. Only a few questions: 1. On the impact of hyperparameter selection, an additional set of experiments would be ideal. Given the time limit, however, a theoretical analysis would also be very helpful for understanding how sensitive the model is and for making the method easier to use. 2. Reporting the memory usage of this method at different scales would give potential users a better sense of its applicability and resource requirements.

Limitations

Yes.

Author Response

It would be helpful to include a detailed sensitivity analysis of key hyperparameters, such as $k$ for k-NN, $\alpha$, $\beta$, and $\gamma$. The impact of hyperparameter selection, additional set of experiments would be perfect. But due to the time limit, theoretical analysis would also be very helpful to understand how sensitive the model is and make the method easier to use.

We conduct additional experiments to address the reviewer’s concerns and provide a comprehensive analysis of the impact of the hyperparameters $\alpha$, $\beta$, $\gamma$, and $k$ on the performance of scCR. We report ARI in cell clustering on three datasets, varying $\alpha$, $\beta$, $\gamma$, and $k$ over $\{0.01, 0.05, 0.1, 0.5, 0.9\}$, $\{0.1, 0.5, 0.9, 0.95, 0.99, 0.999\}$, $\{0.001, 0.01, 0.05, 0.1, 0.5, 0.9\}$, and $\{1, 2, 3, 5, 10, 15\}$, respectively. When varying a target hyperparameter, we fix the other hyperparameters to their default settings. Tables 4, 5, 6, and 7 in the PDF of the global response show how the choice of hyperparameters affects the performance of scCR. As shown in the tables, $\alpha=0.05$, $\beta=0.99$, $\gamma=0.01$, and $k=2$, which are the values used in this work, generally perform well. In terms of sensitivity, given that the runner-up’s ARI is $0.660\pm0.00$, $0.848\pm0.00$, and $0.677\pm0.00$ on Baron Mouse, Zeisel, and Baron Human, respectively, scCR is robust to hyperparameter variations. Specifically, $\alpha \in \{0.05, 0.1, 0.5\}$, $\beta \in \{0.99, 0.999\}$, and $\gamma \in \{0.001, 0.01\}$ result in state-of-the-art performance across the datasets. In the case of $k$, except on Baron Human, scCR shows outstanding performance regardless of the value of $k$. We also observe that varying a hyperparameter can yield better performance than its default value. Nevertheless, considering the unsupervised nature of single-cell analysis, we fix the hyperparameters of scCR to default settings that generally work well. We will add this discussion and the experimental results regarding the hyperparameter sensitivity of scCR to our final version.

The paper includes a runtime analysis, but not a memory usage analysis. The paper can be more comprehensive to include memory usage analysis to show its advantage on scalability. The memory usage of this method on different scales gives potential users a better sense of its applicability and resource requirements.

We investigate the memory complexity of all methods used in this paper and conduct additional experiments to analyze the memory usage of our scCR. Table 8 in the PDF of the general response compares the inputs and memory complexity of scCR with other state-of-the-art methods. To mitigate the heavy memory usage during $k$-NN graph construction, we utilize a batch-wise $k$-NN graph construction strategy. When constructing $k$-NN graphs among genes, we divide genes into batches with batch size $B$, and compute $k$-nearest neighbors for each batch. Similarly, we apply the same batch-wise strategy when constructing $k$-NN graphs among cells. This strategy reduces the memory requirement because it avoids the need to store distances between all points in the entire dataset at once. Specifically, in the memory complexity of scCR, batch-wise $k$-NN graph construction changes $O(G^2)$ and $O(C^2)$ to $O(BG)$ and $O(BC)$, respectively. Thus, batch-wise $k$-NN graph construction can handle large datasets that would otherwise be infeasible to process due to memory constraints. Additionally, scCR does not require any trainable parameters used by other deep-learning-based models.
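
The batch-wise strategy described above can be sketched as follows. This is a minimal illustration and not the scCR implementation: it assumes cosine similarity as the neighbor criterion, and the function name `batched_knn` and its parameters are hypothetical. For a gene graph, `X` would be the transposed cell-gene matrix (one row per gene), so that only a $B \times G$ similarity block is held in memory at a time.

```python
import numpy as np

def batched_knn(X, k, batch_size):
    """Return an (n, k) array of neighbor indices ranked by cosine similarity.

    Hypothetical helper: rows of X are the points (genes or cells).
    Similarities are computed one batch of rows at a time, so peak memory
    for the similarity block is O(batch_size * n) instead of O(n^2).
    """
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    n = Xn.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Similarity of this batch against all points: shape (batch, n).
        sim = Xn[start:stop] @ Xn.T
        # Exclude each point from its own neighbor list.
        sim[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Select the k most similar points, then sort them by similarity.
        top = np.argpartition(-sim, k - 1, axis=1)[:, :k]
        order = np.argsort(-np.take_along_axis(sim, top, axis=1), axis=1)
        neighbors[start:stop] = np.take_along_axis(top, order, axis=1)
    return neighbors
```

Under these assumptions the result does not depend on the batch size; only the peak memory does, which is the trade-off the response describes.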

We further measure the memory usage of scCR across various datasets, as shown in Table 9 in the PDF of the general response. The results in the table indicate that the advantages of scCR extend beyond its superior performance and time efficiency, showcasing its scalability as well. We will include this detailed memory usage analysis in the final version of our manuscript to provide a more comprehensive evaluation of scCR.

Comment

Dear Reviewer wZRo,

We sincerely thank you for dedicating your time to review our work and for your constructive feedback. We particularly appreciate the positive feedback on the significance of our approach to scRNA-seq data imputation, especially your recognition of the completeness of our work and the thoroughness of our experiments. With only about a day remaining in the discussion period, we are eager to engage further and understand whether our responses have satisfactorily addressed your concerns.

In our rebuttal, we provided point-by-point responses to all your questions and concerns regarding the sensitivity analysis of key hyperparameters and memory usage analysis. In summary:

  • We provided a comprehensive analysis of the impact of different hyperparameters, including $\alpha$, $\beta$, $\gamma$, and $k$.
  • We investigated the memory complexity of all methods used in this paper.
  • We measured the memory usage of scCR across various datasets.

We would greatly appreciate it if you could kindly review our responses. We welcome any further questions and are happy to provide additional clarifications if needed. Thank you for your consideration.

Sincerely,
The Authors

Comment

Dear Reviewer wZRo,

In our rebuttal, we provided point-by-point responses to all the concerns and questions you raised. Given that we have only six hours remaining before the deadline, we are very eager to confirm whether our responses have adequately addressed your concerns. We kindly ask you to take a moment to review our rebuttal and provide any further feedback. If there are any remaining questions or concerns, please be assured that we are ready to respond promptly.

Sincerely,
The Authors

Review
Rating: 4

This paper introduces a novel imputation method for scRNA-seq data that accounts for both associating and dissociating gene relationships by using a k-NN graph and negating the cell-gene matrix. The method standardizes gene value distributions and shows significant performance improvements in cell clustering and gene expression recovery across six datasets, outperforming existing methods.

Strengths

  1. This paper introduces a new imputation method named Single-Cell Complete Relationship (scCR) that addresses the limitations of current propagation-based approaches for scRNA-seq data by modeling both associating and dissociating gene-gene relationships.
  2. The extensive experiments conducted by the authors show that scCR significantly outperforms existing methods in gene expression recovery and cell clustering, highlighting its effectiveness in capturing complete gene-gene relationships and improving the quality of scRNA-seq data analysis.
  3. The paper is well-structured and clearly written.

Weaknesses

  1. The proposed method appears too simple and lacks significant innovation.
  2. The paper does not clearly articulate the motivation behind the proposed method. Specifically, it does not adequately explain what dissociating relationships among genes are or why identifying these relationships is effective for addressing the research problem.
  3. The experiments lack biological validation, and relevant case studies are needed to support the findings.

Questions

  1. There are now many methods using large models for cell clustering, such as SCGPT and GeneFormer. How do these methods compare in terms of effectiveness?

Limitations

None

Author Response

How do large models for cell clustering, such as SCGPT and GeneFormer, compare in terms of effectiveness?

Our method and large-scale models have clearly different objectives: while our method tackles denoising scRNA-seq data, large-scale models (e.g., scGPT and Geneformer) target learning gene and cell embeddings using neural networks pre-trained on large-scale datasets for transfer learning.

Therefore, our method and large-scale models are not in a relationship where their effectiveness can be compared, but rather in a relationship where they can collaborate to create synergy. Our scCR can provide denoised scRNA-seq data to large-scale models. To confirm that scCR can improve the performance of large-scale models, we conduct an additional experiment using scGPT [1]. We measure the cell type annotation performance of an scGPT model on the Multiple Sclerosis dataset [2] when we apply our scCR to the input data of scGPT compared to when we do not apply it (i.e., when using raw data). Table 3 in the PDF of the global response shows the cell type annotation performance of the scGPT model fine-tuned on the dataset, averaged across three independent runs. As shown in the table, our scCR improves the cell type classification performance of the scGPT model. This result demonstrates that our scCR effectively addresses noise contained in scRNA-seq data and can assist the large-scale models.

The experiments lack biological validation and relevant case studies

To verify whether scCR can provide biological insights, we confirm that scCR enriches relevant genes in lupus, a chronic autoimmune disease. Specifically, we conduct an in-depth analysis on the PBMCs dataset [3] obtained from lupus patients. We perform GSEA enrichment tests [4] that identify pathways related to specific conditions. In this case, the condition corresponds to interferon-stimulated CD16 Monocytes. We use both raw data and data imputed by scCR, and compare the results from them. When comparing the top 20 most significantly enriched pathways, scCR newly identifies four SARS_COV-related pathways in interferon-stimulated CD16 monocytes. Since lupus is an autoimmune disease characterized by an overactive immune system that attacks the body's own tissues, the activation of SARS_COV-related pathways may indicate an overactive or abnormally activated immune response in lupus patients, reacting excessively to viral infections. This suggests that the pathways identified through scCR's denoising process provide new and important biological insights that were previously obscured by noise, highlighting the utility of scCR in revealing relevant gene interactions and pathways. We will include this experimental result regarding biological validation in our final version.

Too simple and lacks significant innovation.

We believe that utilizing biomedical evidence and domain-specific knowledge is crucial for the development of machine learning for healthcare. Furthermore, we expect that our work will lead subsequent frameworks to take into account the existence of two types of gene-gene relationships in scRNA-seq data. While designing sophisticated and complex methods is also important, we believe that biomedical evidence can drive significant progress in machine learning for healthcare, and our work can serve as a good example.

Motivation and adequate explanation of what dissociating relationships among genes are or why identifying these relationships is effective for addressing the research problem.

Associating genes refer to genes that co-occur, while dissociating genes refer to genes that avoid each other [6]. Mathematically speaking, associating relationships and dissociating relationships correspond to positive and negative correlation coefficients, respectively. The core idea of previous work, scBFP [7], is to impute false zeros (dropouts) in a gene based solely on associating genes with high cosine similarity. Although scRNA-seq data imputation is a very challenging task due to severe noise, scBFP overlooks the presence of dissociating genes. Within a cell, when considering the value to be imputed for gene Q, the value for its associating gene can assist in inferring the value for gene Q. However, its dissociating gene can also provide crucial information. If its dissociating gene has a high value, the value for gene Q may be low because they avoid each other. Unlike scBFP, our scCR can leverage dissociating gene-gene relationships via the negation of a cell-gene matrix. Despite its simplicity, our scCR successfully models dissociating gene-gene relationships as shown in Figure 8 in the manuscript. Furthermore, scCR significantly outperforms state-of-the-art methods in various downstream tasks, as shown in Table 1, Figure 5, Figure 6, and Figure 7 in the manuscript. We will add this detailed explanation regarding associating and dissociating relationships, as well as the clear motivation behind scCR, to our final version.
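
As a brief worked identity for the negation argument above (assuming neighbors are ranked by cosine similarity, as described for scBFP; $\mathbf{q}$ and $\mathbf{p}$ are illustrative symbols for the expression vectors of gene Q and another gene P across cells, not notation from the manuscript):

$$
\cos(\mathbf{q}, -\mathbf{p}) = \frac{\mathbf{q}^{\top}(-\mathbf{p})}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{p} \rVert} = -\cos(\mathbf{q}, \mathbf{p}).
$$

Hence a gene P with strongly negative similarity to Q appears, after negation, as a highly similar neighbor of Q, so a standard $k$-NN search over the original and negated genes can also retrieve dissociating genes.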

[1] H. Cui et al., “scGPT: toward building a foundation model for single-cell multi-omics using generative AI,” Nature Methods, 2024.
[2] L. Schirmer et al., “Neuronal vulnerability and multilineage diversity in multiple sclerosis,” Nature, 2019.
[3] H. M. Kang et al., “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation,” Nature biotechnology, 2018.
[4] A. Subramanian et al., “Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles,” PNAS, 2005.
[5] E. Rossi et al., “On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features,” LoG, 2021.
[6] F. J. Whelan et al., “Coinfinder: detecting significant associations and dissociations in pangenomes,” Microbial genomics, 2020.
[7] J. Lee et al., “Single-cell RNA sequencing data imputation using bi-level feature propagation,” Briefings in Bioinformatics, 2024.

Comment

Dear Reviewer 49oE,

We sincerely thank you for dedicating your time to review our work and for your thorough feedback. With only about a day remaining in the discussion period, we are eager to engage further and understand whether our responses have satisfactorily addressed your concerns.

In our rebuttal, we provided point-by-point responses to all your questions and concerns regarding (1) large-scale models (e.g., scGPT and Geneformer), (2) the lack of biological validation, (3) the lack of innovation, and (4) insufficient explanation. In summary:

  • For (1), we clarified that our method and large-scale models have clearly different objectives.
  • For (1), we demonstrated that our scCR improves the cell type classification performance of scGPT.
  • For (2), we confirmed that scCR enriches relevant genes in lupus, a chronic autoimmune disease, by newly identifying four SARS_COV-related pathways in interferon-stimulated CD16 monocytes.
  • For (3), we assert that utilizing biomedical evidence and domain-specific knowledge is crucial for the development of machine learning in healthcare.
  • For (4), we provided a clear motivation behind our method, along with a detailed explanation of dissociating relationships and why identifying them is effective.

We would greatly appreciate it if you could kindly review our responses. We welcome any further questions and are happy to provide additional clarifications if needed. Thank you for your consideration.

Sincerely,
The Authors

Comment

Dear Reviewer 49oE,

In our rebuttal, we provided point-by-point responses to all the concerns and questions you raised. Given that we have only six hours remaining before the deadline, we are very eager to confirm whether our responses have adequately addressed your concerns. We kindly ask you to take a moment to review our rebuttal and provide any further feedback. If there are any remaining questions or concerns, please be assured that we are ready to respond promptly.

Sincerely,
The Authors

Review
Rating: 6

The paper proposes an approach for single-cell RNA-seq data imputation. The data comes as a matrix capturing relationship between cells and genes. Zero values in that matrix represent unobserved gene expression that can result from technical omissions (known as dropouts) and true biological absence. The non-zero values also suffer from noise such as cell and batch effects. The goal is to impute and de-noise observed single-cell RNA-seq data.

An effective prior approach is based on kNN graphs, where one first builds an adjacency matrix between cells or genes using cosine similarity and kNN neighborhoods relative to RNA-seq data. In contrast to prior work that focuses on adjacency matrices informed by “association” links, the proposed approach aims at modelling “dissociation” links that have negative cosine similarity. The approach proceeds in multiple stages:

  • Pre-imputation stage, where a kNN graph is built on the input matrix and, similar to Markov chains, starting from a random state (i.e., a feature matrix of dimension $N_{cell} \times N_{gene}$), one “diffuses” to the stationary distribution (see Appendix A for details).
  • The second stage appends the negated pre-imputed matrix, making the feature matrix $N_{cell} \times 2N_{gene}$, which allows kNN graphs accounting for both “association” and “disassociation” relationships to be reflected in the adjacency matrix. This is followed by gene-to-gene and cell-to-cell propagation (i.e., the stationary distribution of the Markov chain given by these matrices), with the resulting matrices convexly combined into the imputed “complete relationship”.
  • The final step is de-noising, where propagation of information now involves a convex combination with the original feature matrix and an adjacency matrix built using a kNN graph on the “complete relationship” from step ii). The final output is a convex combination of steps ii) and iii).
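
The feature construction in the second stage can be illustrated with a small sketch. It is only a reading of the description above, not the authors' code: the per-gene standardization follows the abstract, while the order of operations (standardize first, then append the negation), the use of cosine similarity, and the function names are assumptions made for illustration.

```python
import numpy as np

def complete_relationship_features(X):
    """Map a (cells, genes) matrix to (cells, 2 * genes).

    Hypothetical helper: each gene (column) is standardized to zero mean
    and unit variance, then the negated matrix is appended so that every
    gene also appears as a "negated gene".
    """
    mean = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    Z = (X - mean) / np.maximum(std, 1e-12)
    return np.concatenate([Z, -Z], axis=1)

def cosine_similarity_columns(F):
    """Cosine similarity between all pairs of columns of F."""
    Fn = F / np.maximum(np.linalg.norm(F, axis=0, keepdims=True), 1e-12)
    return Fn.T @ Fn

# Toy example: gene 1 decreases exactly as gene 0 increases.
X = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]])
S = cosine_similarity_columns(complete_relationship_features(X))
# S[0, 1] compares gene 0 with gene 1; S[0, 3] compares gene 0 with negated gene 1.
```

In this toy case gene 0 is maximally dissimilar to gene 1 but maximally similar to the negated copy of gene 1, which is how a kNN graph over the $2N_{gene}$ columns can pick up dissociating pairs.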

Experiments involve cell clustering, recovery of dropout rates, robustness relative to dropouts, identification of rare cell types, modelling disassociation rates, etc.

Strengths

This is an interesting problem that relates nicely to link prediction in graph neural networks. It has been clearly presented with strong motivation by noise and dropouts. The paper is also clearly written and well-organized. I would hope that there will be follow up from the community focusing on link prediction.

The approach is not straightforward and involves several “propagation” steps and, contrary to past work, incorporates the disassociation and negative correlations into kNN adjacency matrices.

Empirical performance in some cell clustering tasks is not incrementally but significantly improved. The method also appears to be more robust relative to dropouts than the alternatives.

Weaknesses

An ablation is missing relative to different steps (pre-imputation, complete relationship, de-noising stage). Hence, it is unclear if they are needed and to what extent they contribute to the performance improvement.

Imputation metrics might be dependent on dropout strategy and it would be good to discuss what kind of “random” strategies have been used and how likely they are to reflect the corruptions specific to single-cell RNA-seq. Is there any way to generate “challenging” splits that are better at reflecting the “generalization” and "robustness"?

Questions

M is of dimension N x F (line 112)?

How do you decide that cell row is “unknown” from observed data?

Dropout rates? How was this done exactly?

Limitations

N/A

Author Response

Imputation metrics might be dependent on dropout strategy and it would be good to discuss what kind of “random” strategies have been used and how likely they are to reflect the corruptions specific to single-cell RNA-seq. Is there any way to generate “challenging” splits that are better at reflecting the “generalization” and "robustness"?

Yes, there is a specific pattern of dropouts in real scRNA-seq data, and we perform additional experiments applying a realistic dropout strategy reflecting this pattern. Existing studies [1, 2] simulate dropout by randomly sampling non-zero values in a cell-gene matrix from a uniform distribution and setting them to zero (i.e., missing completely at random (MCAR)). However, in real scRNA-seq data, dropouts occur more frequently in genes with low expression levels rather than those with high variance [3]. This is because the probability of capturing RNA transcripts of low-expression-level genes during sequencing is lower. Based on this pattern of dropouts, we select the 1000 genes with the lowest expression levels and simulate dropout only in these genes. We randomly sample non-zero values of these genes from a uniform distribution and replace the sampled values with zero (i.e., missing not at random (MNAR)).
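
The MNAR procedure described above might be sketched as follows. This is an illustrative reading rather than the authors' script: ranking genes by mean expression is an assumption (the response says only "lowest expression levels"), and the function name and its parameters `n_low_genes`, `n_drop`, and `seed` are hypothetical.

```python
import numpy as np

def simulate_mnar_dropout(X, n_low_genes, n_drop, seed=0):
    """Zero out non-zero entries, but only within the lowest-expressed genes.

    X is a (cells, genes) matrix. Returns the corrupted copy and the
    indices of the genes that were eligible for dropout.
    """
    rng = np.random.default_rng(seed)
    X_drop = X.copy()
    # Assumption: "lowest expression level" is measured by the per-gene mean.
    low_genes = np.argsort(X.mean(axis=0))[:n_low_genes]
    # Candidate positions: non-zero entries in the selected genes.
    rows, cols = np.nonzero(X[:, low_genes])
    n_drop = min(n_drop, rows.size)
    # Uniform sampling without replacement among the candidates.
    picked = rng.choice(rows.size, size=n_drop, replace=False)
    X_drop[rows[picked], low_genes[cols[picked]]] = 0
    return X_drop, low_genes
```

In the setting of the response, `n_low_genes` would be 1000; the number of dropped values is left as a parameter here.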

Table 1 in the PDF of the global response shows the performance comparison under the aforementioned MNAR settings in terms of data recovery, measured by RMSE. We compare our scCR to the two most competitive baselines, scFP [1] and scBFP [2]. The number of dropouts is set to $20\%$ of the total number of values in a cell-gene matrix. As shown in the table, scCR outperforms the compared methods by significant margins in the realistic dropout settings, demonstrating the robustness of scCR in realistic scenarios. We believe the reviewer has highlighted a very important aspect of dropout recovery in scRNA-seq data research. The consideration of realistic dropout simulation can help pre-assess the generalizability of techniques in real-world scRNA-seq applications. We will include this important discussion and the experimental results in our final version.

An ablation is missing relative to different steps (pre-imputation, complete relationship, de-noising stage).

Although we have conducted an ablation study analyzing the effectiveness of the concatenation and standardization processes in the complete relation stage of scCR, as shown in Table 2 in the manuscript, we conduct an additional ablation study to explore the effectiveness of each stage of scCR. Table 2 in the PDF of the global response shows the results of the ablation study in terms of cell clustering, measured by ARI. As shown in the table, the addition of the complete relation stage and the denoising stage notably enhances the performance compared to when the pre-imputation stage is used alone. We can confirm that the complete relation stage and the denoising stage significantly contribute to the outstanding performance of scCR. This ablation study emphasizes the well-founded design of our scCR.

How do you decide that cell row is “unknown” from observed data?

We do not decide known or unknown values from a given observed cell-gene matrix when using our scCR. In this paper, the terms known and unknown are used solely to explain the process of Feature Propagation (FP) [4], which addresses missing feature imputation on graph-structured data. FP assumes that the locations of both observed (known) and unobserved (unknown) values in a feature matrix are given. FP imputes unobserved values by diffusing observed values while preserving these observed values. In contrast, in scRNA-seq data recovery, all values are observed (i.e., known) in a given cell-gene matrix. Thus, to apply FP to scRNA-seq data, FP-based imputation methods treat zero values as unknown values to be imputed via features diffused from non-zero values. For clarity, we will add this explanation to our final version.
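
To make the known/unknown convention concrete, here is a minimal sketch of Feature Propagation with zeros treated as unknown. It follows the general FP recipe (diffuse with the symmetrically normalized adjacency, then reset the observed entries) and is not the scCR code; the function name, the fixed iteration count, and the dense adjacency matrix `A` are assumptions for illustration.

```python
import numpy as np

def feature_propagation(X, A, n_iters=40):
    """Fill zero entries of X by diffusion over the graph A.

    X: (nodes, features) matrix whose non-zero entries are "known".
    A: (nodes, nodes) symmetric adjacency matrix.
    Known entries are restored after every diffusion step.
    """
    known = X != 0
    deg = A.sum(axis=1)
    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes get weight 0.
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    out = X.astype(float).copy()
    for _ in range(n_iters):
        out = A_norm @ out          # diffuse features to neighbors
        out[known] = X[known]       # preserve observed (non-zero) values
    return out
```

On a three-node path graph with values 1, 0, 3, the middle (zero) entry is filled from its two neighbors while the two observed values are left unchanged.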

Dropout rates? How was this done exactly?

Following conventional dropout recovery research, given a cell-gene matrix, we randomly sampled non-zero values from a uniform distribution at dropout rates of $\{0.2, 0.4, 0.8\}$. We then set these sampled non-zero values to zero, creating false zeros.
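
A minimal sketch of this MCAR simulation is given below. It is an assumption-laden illustration: the dropout rate is interpreted here as a fraction of the non-zero entries, which the response does not state explicitly, and the function name and `seed` parameter are hypothetical.

```python
import numpy as np

def simulate_mcar_dropout(X, rate, seed=0):
    """Set a uniformly sampled fraction of the non-zero entries of X to zero.

    Returns the corrupted matrix and a boolean mask of the dropped
    positions, which can be used to score recovery (e.g., RMSE) only on
    the simulated false zeros.
    """
    rng = np.random.default_rng(seed)
    X_drop = X.copy()
    rows, cols = np.nonzero(X)
    # Assumption: `rate` is relative to the number of non-zero entries.
    n_drop = int(round(rate * rows.size))
    picked = rng.choice(rows.size, size=n_drop, replace=False)
    X_drop[rows[picked], cols[picked]] = 0
    mask = np.zeros(X.shape, dtype=bool)
    mask[rows[picked], cols[picked]] = True
    return X_drop, mask
```

Calling it once per rate in $\{0.2, 0.4, 0.8\}$ yields the three corrupted matrices; all entries outside the mask are left untouched.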

M is of dimension N x F (line 112)?

Yes, we thank you for pointing out the typo. As the reviewer mentioned, $\mathbf{M}$ has the same dimension as a given feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$, where $N$ is the number of nodes and $F$ is the number of feature channels.

[1] S. Yun et al., "Single-cell RNA-seq data imputation using feature propagation," arXiv preprint arXiv:2307.10037, 2023.
[2] J. Lee et al., "Single-cell RNA sequencing data imputation using bi-level feature propagation," Briefings in Bioinformatics, 2024.
[3] Y. Liu, et al., "iDESC: identifying differential expression in single-cell RNA sequencing data with multiple subjects," BMC bioinformatics, 2023.
[4] E. Rossi et al., "On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features," Learning on Graphs Conference, PMLR, 2022.

Comment

Thank you for detailed response and additional experiments. I am satisfied with the author's response and will be increasing my score.

Comment

We appreciate your decision to increase the score and your recognition of our efforts in addressing your concerns. We are pleased that our detailed response and the additional experiments provided were satisfactory.

We recognize the depth of expertise you bring to the review process, which has helped establish a dropout evaluation setting that is more realistic than conventional ones. Thanks to this setting, we were able to demonstrate the superiority of our method even in more realistic scenarios. We will include this important discussion and the experimental results in the final version. Your insightful reviews and efforts have significantly improved our manuscript. We welcome any further questions and are happy to provide additional clarifications.

Author Response

We 1) propose a novel imputation method that newly employs dissociating relationships in addition to associating relationships, 2) standardize the value distribution of each gene to have standard distributions regardless of the gene, and 3) demonstrate that the proposed method achieves exceptional performance gains in both cell clustering and gene expression recovery.

We appreciate the reviewers’ thoughtful comments on our work. In particular, we thank the reviewers for the positive feedback about "clear and strong motivation", "straight forward yet effective approach", and "well organized paper".

Comment

This paper introduces scCR, a novel imputation method for single-cell RNA sequencing (scRNA-seq) data, which is grounded in genetic evidence. Genes are known to have two types of relationships: associating (co-occurring) relationships and dissociating (avoiding) relationships. However, while existing imputation methods utilize associating relationships, they overlook the presence of dissociating relationships. To address this issue, scCR models dissociating relationships via the negation of a given cell-gene matrix. Despite its simplicity, scCR achieves exceptional performance gains over state-of-the-art methods in both cell clustering and gene expression recovery across six real-world scRNA datasets. These results offer a crucial insight: when applying machine learning to the biomedical domain, it is essential to approach the problem with a foundation in biological principles rather than focusing solely on the application of cutting-edge machine learning techniques.

During the rebuttal period, we provided point-by-point responses to all the questions and concerns raised by the reviewers. To address the reviewers' concerns that required additional experimental results, we have attached a PDF to our global response below. This PDF contains extensive experimental results, as follows:

  • Table 1: Missing Not At Random (MNAR) Settings - Reviewer NnmD
  • Table 2: Further Ablation Study - Reviewer NnmD
  • Table 3: Integration of scCR and scGPT - Reviewer 49oE
  • Table 4: Influence of $\alpha$ - Reviewer wZRo
  • Table 5: Influence of $\beta$ - Reviewer wZRo
  • Table 6: Influence of $\gamma$ - Reviewer wZRo
  • Table 7: Influence of $k$ - Reviewer wZRo
  • Table 8: Comparison of Inputs and Memory Complexity - Reviewer wZRo
  • Table 9: Memory Usage of scCR - Reviewer wZRo

As the discussion period deadline approaches, we are eager to engage further and ensure that our responses have satisfactorily addressed your concerns. We welcome further questions and are happy to provide additional clarifications if needed.

Sincerely,
The Authors

Comment

In this study, we propose scCR, a novel single-cell RNA sequencing (scRNA-seq) data imputation scheme based on genetic evidence. While previous studies have overlooked dissociating (avoiding) relationships among genes, scCR models these dissociating relationships, leading to significant performance improvement in both cell clustering and gene expression recovery.

We express our profound gratitude to all reviewers for their time and effort in evaluating our manuscript. We particularly appreciate the positive feedback on our work being “clearly presented with strong motivation” [NnmD], “quite complete and comprehensive” [wZRo], and “well-structured and clearly written” [49oE]. We are delighted that the reviewers have acknowledged the exceptional performance of our scCR, noting that it has “not incrementally but significantly improved performance” [NnmD], “significantly outperforms existing methods” [49oE], and delivers a “solid performance improvement” in “extensive and solid experiments” [wZRo].

We have listed all the reviewers' concerns below and summarized our responses addressing each concern.

  • [NnmD]-1: more challenging missing scenarios
    • We conducted additional experiments applying a realistic dropout strategy, demonstrating the robustness of scCR in realistic scenarios.
  • [NnmD]-2: Lack of step-level ablation study
    • We performed an additional ablation study to explore the effectiveness of each stage of scCR.
  • [49oE]-1: Comparison with large models (e.g., scGPT and GeneFormer)
    • We clarified that our scCR and large-scale models have distinct objectives and are not directly comparable in terms of effectiveness. Instead, they can collaborate to create synergy. We demonstrated that our scCR enhances the cell type classification performance of the scGPT model.
  • [49oE]-2: Lack of biological validation and relevant case studies
    • We confirmed that scCR enriches relevant genes in lupus, a chronic autoimmune disease, by newly identifying four SARS_COV-related pathways in interferon-stimulated CD16 monocytes. Additionally, in Sec 5.1 of the manuscript, we showed that scCR effectively models dissociating gene-gene relationships in real-world scRNA datasets.
  • [49oE]-3: Lack of innovation
    • We emphasize that leveraging biomedical evidence and domain-specific knowledge is crucial for advancing machine learning in healthcare. Our work presents an innovative approach that identifies and addresses aspects overlooked by existing methods.
  • [49oE]-4: Insufficient explanation
    • We provided a clear motivation behind our method, along with a detailed explanation of dissociating relationships and why identifying them is effective.
  • [wZRo]-1: Hyperparameter sensitivity analysis
    • We conducted a comprehensive analysis of the impact of different hyperparameters.
  • [wZRo]-2: Memory usage analysis
    • We investigated the memory complexity of all methods used in this paper and measured the memory usage of scCR across various datasets.

Detailed responses can be found in the rebuttal for each reviewer. We will include these extensive discussions and the experimental results in the final version. Given that we have only one hour remaining before the deadline, we kindly ask you to take a moment to review our rebuttal. If there are any remaining questions or concerns, please be assured that we are ready to respond promptly.

We express our sincere gratitude to Reviewer NnmD for demonstrating increased confidence in our work, backed by their depth of expertise, and for raising the rating during the Reviewer-Author Discussions period. Despite our repeated attempts to seek confirmation from two reviewers, we have not yet received a response. We hope that the reviewers will continue to review the responses even after the author-reviewer discussion period has ended.

Sincerely,
The Authors

Final Decision

The reviewers consider the simplicity and effectiveness of the method, solid empirical evaluation, and biological significance of the task as notable strengths. Some critical comments were also raised related to simplicity of the method and its motivation.

The authors provided thorough rebuttals that seem to address all critical comments, and no further specific criticism was presented during the discussion period, so we conclude that everyone is satisfied.