PaperHub

Overall: 5.8/10 · Poster · 5 reviewers
Scores: 5, 6, 6, 6, 6 (min 5, max 6, std 0.4)
Confidence: 3.2 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.6

NeurIPS 2024

Learning Identifiable Factorized Causal Representations of Cellular Responses

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
Single Cell · Computational Biology · Causality

Reviews and Discussion

Review (Rating: 5)

This paper proposes FCR, a causal VAE-based model that aims to decompose and cluster covariates, treatments, and their interactions in the latent space.

Strengths

This paper has a solid and rigorous mathematical foundation, and the assumptions are validated after the model is trained.

Weaknesses

This paper is limited in readability, mainly because of word redundancy, making it hard for the reader to follow the authors’ ideas. Besides, Figure 2 outlines the FCR model, but the encoders and decoders are disconnected, so it is not intuitive to understand how we train components such as $p(z_t|t)$ and $p(z_x|t)$.

Questions

Is it possible to enhance other models’ interpretability using the factorized $z_x$, $z_t$, $z_{tx}$?

Limitations

I cannot find a section discussing the limitations of this work, but I believe the noise and hidden variables in the single-cell data will cap the model performance.

Author Response

**Readability**:

Thank you for your feedback regarding the readability of our manuscript. We appreciate your honest assessment and understand the importance of clear and concise writing. We will carefully review the manuscript to eliminate any word redundancy and improve the overall clarity, ensuring that our ideas are communicated more effectively.

**$p(\mathbf{z}_t|\mathbf{t})$ and $p(\mathbf{z}_x|\mathbf{x})$ learning**:

Thank you for this question. We’d like to clarify that both $x$ and $t$ are inputs to the neural network, which encodes them to learn the prior mean and standard deviation for $z_t$ and $z_x$. We use the reparameterization trick to sample from $p(z_t|t)$ and $p(z_x|x)$. We will add this explanation under the model figure in the manuscript to provide clearer context.
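The encoding described above can be sketched as follows. This is a minimal illustration of the reparameterization trick, not the authors' implementation; all names and dimensions are hypothetical, and plain NumPy stands in for the actual deep-learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_encoder(c, W_h, W_mu, W_logvar):
    # One-hidden-layer encoder mapping a conditioning variable (t or x)
    # to the mean and log-variance of a Gaussian prior.
    h = np.maximum(c @ W_h, 0.0)            # hidden layer with ReLU
    mu = h @ W_mu                           # prior mean
    logvar = h @ W_logvar                   # prior log-variance
    # Reparameterization trick: the noise eps is the only stochastic part,
    # so gradients can flow through mu and std during training.
    std = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)
    z = mu + std * eps
    return z, mu, logvar

# e.g. p(z_t | t): 32 treatment vectors of dim 8 -> 4-dim latent samples
W_h = rng.normal(size=(8, 16))
W_mu = rng.normal(size=(16, 4))
W_lv = rng.normal(size=(16, 4))
z_t, mu, logvar = prior_encoder(rng.normal(size=(32, 8)), W_h, W_mu, W_lv)
```

The same encoder shape, applied to $x$ instead of $t$, would give the sample for $p(z_x|x)$.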

**Interpretability**:

There are generally two approaches to interpreting latent representations, which we also plan to explore in future work:

  1. Latent Traversal: Since $z_{tx}$ is element-wise identifiable, we can use latent traversal across all dimensions of $z_{tx}$. This involves fixing all latent variables $z_{tx}^i$ to their inferred values for each cell's gene expression and then varying the value of one latent at a time to observe the reconstructed gene expression. This method allows us to see how each dimension of $z_{tx}$ influences gene expression levels.
  2. Interpretable Decoder System: We can also employ an interpretable decoder, such as LDVAE [1], which uses a linear decoder. By analyzing the weights of the latent representations assigned to each gene, we can interpret how different representations influence each gene.
  3. On-going Work: In light of the neural additive model [2], we are developing a non-linear interpretable encoder as an extension of LDVAE. Specifically, for each latent dimension, there is a sub-neural network to capture information specific to each latent dimension. A logistic regression layer is added to the end of the neural network. So, by analyzing the logistic regression weights (loading matrix), we can conclude how much one latent variable contributes to the reconstruction of a specific gene. An illustration of this model can be found in the rebuttal PDF file Figure D.
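The latent-traversal procedure in point 1 can be sketched as follows; the toy linear decoder and all dimensions here are hypothetical stand-ins for the trained generative network.

```python
import numpy as np

def latent_traversal(decoder, z_tx, dim, values):
    # Hold all inferred latents fixed and sweep one dimension of z_tx,
    # recording the decoded (reconstructed) gene expression for each value.
    outputs = []
    for v in values:
        z = z_tx.copy()
        z[dim] = v                    # vary one latent coordinate at a time
        outputs.append(decoder(z))    # reconstructed expression profile
    return np.stack(outputs)

# Toy linear "decoder" standing in for the trained generative network g
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))          # 4 latents -> 10 genes
decoder = lambda z: z @ W
z_inferred = rng.normal(size=4)       # inferred z_tx for one cell
sweep = latent_traversal(decoder, z_inferred, dim=0, values=np.linspace(-3, 3, 7))
# sweep[i] is the expression profile when z_tx[0] is set to values[i]
```

Comparing the rows of `sweep` shows how the traversed dimension influences each gene's reconstructed expression.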

**Limitation discussions**:

Thank you for raising this important question. Indeed, this paper has some limitations that we plan to address in future work.

Firstly, the generating function $g$ used in our model is either deterministic or incorporates Gaussian noise. Future research will investigate more general scenarios with stochastic generating functions, which are more representative of real-world applications. Secondly, while this paper does not include interpretability mechanisms for FCR, we have proposed potential approaches to enhance interpretability, as discussed in the previous section. Lastly, due to the limitations of the available public datasets, we recognize that testing FCR on datasets with a broader range of drugs and more complex covariate scenarios will be an intriguing direction for future work.

[1] Svensson, V., et al. (2020). Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics (Oxford, England), 36(11), 3418–3421.
[2] Agarwal, R., et al. (2021). Neural additive models: Interpretable machine learning with neural nets. Advances in neural information processing systems, 34, 4699-4711.

Comment

Thank you for the authors' reply addressing much of my concern. I will keep my score unchanged.

Comment

Thank you for reading our response. Please feel free to let us know if there is anything that needs to be clarified after the rebuttal. We value your feedback and want to address your remaining concerns.

Review (Rating: 6)

The authors propose a method for disentangling into factorized representations dependent on only covariates, only treatment and the interaction between treatment and covariates. Introducing a set of assumptions, the authors prove the identifiability of these disentangled variables based on previous proofs on non-linear ICA. The authors implement the model using variational inference and a set of regularizers and discriminators to ensure the needed independence and disentangling. The authors validate the clustering performance, conditional independencies and conditional cellular response prediction against commonly used methods for perturb-seq prediction and show generally very good performance.

Strengths

  • Non-linear interactions between covariates and treatment are very likely and have to be modeled
  • Given a set of additional assumptions, the method has some guarantees of disentangling, which is also shown empirically
  • Understanding (e.g. through clustering) interactions between covariates and treatment is important for precision medicine applications
  • Good performance as shown by authors

Weaknesses

  • Complicated implementation with many loss functions and minmax training; a discussion of how easy the method is to train would be good (dependence of performance on hyperparameters, and of hyperparameters on the data set).
  • The ablation study should include a comparison without the different loss terms to be able to see the importance of the different parts of the model
  • Assumption 4.3 could be quite strong in many settings, if for every covariate every drug response has to be available. A comment on sufficient experiments would be good and how common this is.
  • For some of the assumptions (e.g. substantive changes of the distribution) empirically arguing that this is the case in common data sets would make the real world application of the proof (instead of the purely academic pursuit) more convincing.

Questions

  • How were the hyperparameters chosen for the different data sets?
  • Maybe I am misunderstanding "control", but in the conditional cellular response prediction, why is it necessary to use $z_x^0$ instead of $z_x$ in the function $g$ if $z_x$ is independent of $t$?
  • For prediction validation, what is train/val/test protocol? Is the correlation of y in the train set?

Limitations

The authors could discuss the limitations (based on assumptions) more directly.

Author Response

**Hyperparameter selection**: Thank you for raising this important question; we appreciate that it is a critical point for evaluating how well the method works. Please check the General Response to All Reviewers, part b.

**Ablation studies**:

In addition, as the reviewer suggested, we performed the component ablation study (removing the different loss terms) on the clustering task (sciPlex dataset), with the NMI score reported.

**Table 1: Benchmark on $\mathbf{z}_x$**

| Parameters | $z_x$ on $x$ | $z_x$ on $\{x, t\}$ | $z_x$ on $t$ | Mean |
| --- | --- | --- | --- | --- |
| $w_1 = 0$ | 0.583 | 0.606 | 0.061 | 0.417 |
| $w_2 = 0$ | 0.781 | 0.606 | 0.060 | 0.482 |
| $w_3 = 0$ | 0.742 | 0.633 | 0.059 | 0.478 |
| $w_1 = w_2 = w_3 = 1$ | 0.786 | 0.612 | 0.061 | 0.486 |

**Table 2: Benchmark on $\mathbf{z}_{tx}$**

| Parameters | $z_{tx}$ on $x$ | $z_{tx}$ on $\{x, t\}$ | $z_{tx}$ on $t$ | Mean |
| --- | --- | --- | --- | --- |
| $w_1 = 0$ | 0.731 | 0.595 | 0.068 | 0.465 |
| $w_2 = 0$ | 0.725 | 0.556 | 0.037 | 0.439 |
| $w_3 = 0$ | 0.748 | 0.575 | 0.066 | 0.463 |
| $w_1 = w_2 = w_3 = 1$ | 0.732 | 0.594 | 0.068 | 0.465 |

**Table 3: Benchmark on $\mathbf{z}_t$**

| Parameters | $z_t$ on $x$ | $z_t$ on $\{x, t\}$ | $z_t$ on $t$ | Mean |
| --- | --- | --- | --- | --- |
| $w_1 = 0$ | 0.333 | 0.389 | 0.069 | 0.264 |
| $w_2 = 0$ | 0.334 | 0.372 | 0.066 | 0.257 |
| $w_3 = 0$ | 0.331 | 0.401 | 0.081 | 0.271 |
| $w_1 = w_2 = w_3 = 1$ | 0.334 | 0.393 | 0.088 | 0.272 |

When one of $\{w_1, w_2, w_3\}$ is set to 0, the remaining weights are set to 1.0. Based on this setup, we draw the following conclusions:

The results of $z_x$ clustering on $x$, $\{t, x\}$, and $t$ indicate that

  • $w_1$ (the weight for $L_{sim}$) significantly enhances the performance of $z_x$ clustering on $x$ and $z_t$ clustering on $t$. This improvement is due to $L_{sim}$ increasing the similarity among $z_x$ and $z_t$.
  • $w_2$ (the weight for $L_{ct}$) enhances $z_{tx}$ clustering on $\{t, x\}$ and $t$, as well as $z_t$ clustering.
  • $w_3$ (the weight for $L_{dist}$) improves $z_{tx}$ clustering on $\{t, x\}$ and also enhances the separation within the spaces of $z_{tx}$, $z_t$, and $z_x$.

It's important to note that when $\{w_1, w_2, w_3\} = \{1, 1, 1\}$, the performance of $z_{tx}$ clustering on $\{t, x\}$ is not as strong as $z_x$ clustering on $\{t, x\}$. However, by increasing $w_1$, $w_2$, and $w_3$, the performance of $z_{tx}$ surpasses that of $z_x$ (see Appendix Figure 7).
**Is it necessary to use $z_x^0$ instead of $z_x$ in the function $g$, if $z_x$ is independent of $t$?**:

Thank you for raising this question. Yes, directly utilizing $z_x$ might achieve a higher $R^2$ score; here we followed CPA, VCI, and other methods and replaced $z_x$ with $z_x^0$ to perform "counterfactual" prediction, to further evaluate the learned disentangled representations.

**Assumption 4.3 is strong**:

Assumption 4.3 can be better understood through the lens of biological experimental design. If we aim to determine whether there is a covariate-$x_1$-specific response to treatment $t_1$, we need to conduct two reference experiments: $(x_0, t_1)$ and $(x_0, t_0)$, where $x_0$ represents reference cells and $t_0$ represents control/reference treatments.

By comparing the treatment effects between the pairs $\{(x_1, t_1), (x_1, t_0)\}$ and $\{(x_1, t_1), (x_0, t_1)\}$, we can assess whether the observed differences are significant and distinct (as described in Assumption 4.4). Without comparing $(x_1, t_1)$ and $(x_1, t_0)$, we cannot determine if $t_1$ has any effect on $x_1$. Similarly, without comparing $(x_1, t_1)$ and $(x_0, t_1)$, we cannot assess whether $t_1$ has a differential effect on $x_1$ compared to $x_0$. Only when both comparisons are made can we begin to analyze covariate-specific effects. This comparison allows us to confidently identify any $x_1$-specific response to $t_1$.

This is a common approach in biological experimental design. For instance, an increasing number of experiments are being conducted for drug screening across various micro-environment organoid systems (covariates), as well as with different drugs, dosages, and time points [3]. The main goal is to discover these covariate-specific cellular responses.

**Empirically arguing that the assumptions hold in common data sets**:

The substantive-changes assumption (iii) means that for different cell lines under the same drug treatment, gene expression differs significantly (non-trivially); large parts of gene expression are influenced only by cell type, etc. Assumption (iv) means that under different treatments, the same cell (covariates) shows significantly different gene-expression responses.

These phenomena are all observed in the biological literature [1][2][3]. Figures 3.B, C, and D in the Trellis paper [3] (**Figure D in the rebuttal PDF file**) show evidence that these assumptions are common.

**Limitations**: Due to limited space, please check the response to **Reviewer WECd**.

[1] Sanjay R. Srivatsan et al. ,Massively multiplex chemical transcriptomics at single-cell resolution.
[2] McFarland, J.M., Paolella, B.R., Warren, A. et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action.
[3] Ramos Zapatero, María et al. Trellis tree-based analysis reveals stromal regulation of patient-derived organoid drug responses.

Comment

Thank you for the extensive response. With additional information on hyperparameter tuning and ablation study, I have increased the score to weak accept.

Comment

Thank you again for acknowledging our paper's strengths and raising the score. We will further polish our manuscript and include the hyperparameter tuning and ablation study in the final version.

Review (Rating: 6)

The authors present a novel method for causal representation learning in cellular perturbation settings, called Factorized Causal Representation, in which the authors learn disentangled latents for the cellular covariates, the treatment, and the interactions between them. They provide identifiability results following the methods of recent papers, and demonstrate how an implementation of the method, which uses a number of novel regularizers to enforce various causal constraints, achieves top results amongst comparable methods.

Strengths

  • novel causal representation learning method explicitly learning interactions between covariates and treatments
  • novel regularizers for enforcing conditional independence between sets of variables
  • achieves top cellular response prediction ($R^2$) over comparable conditions

Weaknesses

  • the datasets used have a relatively small number of similar treatments (16 at the most), so it is unclear how well the approach will work when applied to large diverse chemical libraries typical of real-life settings
  • the decision not to evaluate on unseen treatments or cell contexts is surprisingly conservative, since the promise of causal models is in generalizing to unseen domains

Questions

  • why not evaluate on unseen treatments or cell contexts? Did you try and find the results to be comparatively poor?

Limitations

NA

Author Response

**Dataset size**:

Due to the limited availability of public datasets and the high cost of single-cell sequencing, single-cell datasets with a large number of drugs are not yet available. However, as demonstrated by the theorems in our paper, incorporating more treatments with well-designed controls can enable the latent representations to capture more meaningful and disentangled information, thereby facilitating the algorithm's learning process. As more datasets become available in the future, we plan to conduct extensive benchmarks to evaluate our approach on a broader scale.

**Prediction for unseen perturbations**:

Predicting responses to novel treatments is a pivotal and fast-evolving field in drug discovery. However, the biological literature indicates that cellular responses are highly context-dependent [1,2,3]. This complexity poses significant challenges for AI-driven drug discovery, which often struggles to achieve success in clinical trials [4].

Our motivation for developing FCR stems from the need to understand how cellular systems react to treatments and identify conditions that can deepen our understanding of these responses. FCR enables the analysis of drug interactions with covariates and contextual variables. Additionally, predicting cellular responses to new treatments necessitates prior knowledge, such as chemical structure and molecular function, and comparisons with known treatments. Without this context, predictions can be unreliable.

In this paper, our primary focus is not on predicting responses to unseen treatments. However, given the relevance of this topic, we conducted pilot experiments to showcase the potential future applications of FCR. The Multiplex-Tram and Multiplex-7 datasets share the same cell lines and the Trametinib-24 hours treatment, along with other different treatments. By utilizing these two datasets, we established the following experimental settings for the unseen prediction scenario:

  • Drug Hold-out Setup: We held out two cell lines, ILAM and SKMEL2, from the Multiplex-Tram dataset, which had been treated with Trametinib for 24 hours. We trained an FCR model using the remaining data; this model is referred to as $M_h$. We denote the dataset (Multiplex-Tram without Trametinib-24h-treated ILAM and SKMEL2) as $D^h$.
  • Prior Knowledge Model: We trained another FCR model, $M_p$, using the Multiplex-7 dataset, which includes the ILAM and SKMEL2 cell lines treated with Trametinib for 24 hours, denoted as $D^p$. We treat this model, $M_p$, as a prior knowledge model.
  • Transfer MLP: We extracted both $z_{tx}^{p}$ from $M_p$ and $z_{tx}^{h}$ from $M_h$ for $D^p$. Then we trained a 1-layer MLP (the same dimension as $z_{tx}^p$) to transfer $z_{tx}^p$ to $z_{tx}^h$ by minimizing the MSE between them.
  • Contextual Prior Representation: We extracted the $z_{tx}^p$ representations from model $M_p$ for ILAM and SKMEL2 in the hold-out set, then transferred $z_{tx}^p$ via the previous MLP to $\hat z_{tx}^h$ as a prior contextual embedding. For the hold-out cell lines ILAM and SKMEL2 in the Multiplex-Tram dataset, we extracted $z_{x}^h$ from model $M_h$, and the Trametinib-24h $z_{t}^h$ from other treated cell lines in $D^h$.
  • Representation Matching: We then matched $\hat z_{tx}^h$ by $z_{x}^p$ similarity on ILAM and SKMEL2 across Multiplex-Tram and Multiplex-7 in the prior knowledge model space.
  • Prediction: We predicted the unseen 24-hour Trametinib responses for the ILAM and SKMEL2 cell lines in the hold-out dataset using the formula $\hat y = g(z_{x}^h, \hat z_{tx}^h, z_{t}^h)$, where $\hat z_{tx}^h$ is the corresponding matched prior contextual representation and $z_t^h$ is the Trametinib-24h average value in the other cell lines.
  • Evaluation: We computed the $R^2$ and MSE for the top 20 differentially expressed genes (DEGs) based on the predicted values and compared these results with those from the VCI model (following that paper's OOD prediction setups).
  • **The illustration of datasets and experiment setups can be found in the rebuttal PDF file, Figures A and B**.
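The transfer-MLP step above can be sketched as follows; for simplicity this hypothetical illustration uses a single linear layer trained by plain gradient descent on the MSE, with toy data in place of the extracted representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_transfer_map(z_src, z_tgt, lr=0.05, epochs=500):
    # A single linear layer trained to map representations from the
    # prior-knowledge model onto the hold-out model by minimizing the
    # MSE between them (a stand-in for the 1-layer transfer MLP).
    d = z_src.shape[1]
    W = np.zeros((d, d))
    b = np.zeros(d)
    n = len(z_src)
    for _ in range(epochs):
        pred = z_src @ W + b
        grad = (pred - z_tgt) / n          # d(MSE)/d(pred), up to a factor
        W -= lr * z_src.T @ grad           # gradient step on the weights
        b -= lr * grad.sum(axis=0)         # gradient step on the bias
    return W, b

# Toy data: z_tgt is a fixed linear transform of z_src, which the map recovers
A = rng.normal(size=(4, 4))
z_src = rng.normal(size=(256, 4))
z_tgt = z_src @ A
W, b = fit_transfer_map(z_src, z_tgt)
mse = np.mean((z_src @ W + b - z_tgt) ** 2)
```

At inference time the fitted map plays the role of the transfer MLP: it converts a representation from the prior-knowledge model's space into the hold-out model's space.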

We have the following results:

**Table: Unseen Prediction Results**

| Methods | $R^2$ | MSE |
| --- | --- | --- |
| FCR | $0.74 \pm 0.03$ | $0.52 \pm 0.11$ |
| VCI | $0.71 \pm 0.05$ | $0.55 \pm 0.08$ |

[1] Sanjay R. Srivatsan et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science.
[2] McFarland, J.M., Paolella, B.R., Warren, A., et al. Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action. Nat Commun.
[3] Ramos Zapatero, María, et al. Trellis tree-based analysis reveals stromal regulation of patient-derived organoid drug responses. Cell, vol. 186, 25 (2023).
[4] Derek Lowe. "AI drugs so far".

Comment

Dear Reviewer yZVk:

We sincerely appreciate your constructive suggestions on the unseen prediction. During the rebuttal stage, we worked diligently to design and execute the experiment to address your feedback (please refer to the response above, as well as Figures A and B in the rebuttal PDF file). We would greatly value your feedback on the experiment and would be eager to discuss it further with you before the discussion period ends. Your insights will be crucial in helping us refine our paper.

Thank you,

All the authors

Comment

Thank you for addressing my review. The unseen cell line experiment is nice, and probably as good as can be done given the datasets. I've raised my score accordingly.

Review (Rating: 6)

The authors develop a modification of the identifiable nonlinear ICA algorithm of Khemakhem et al. 2020, suited for scenarios with two disjoint groups of auxiliary variables x and t and their interactions, with an application to treatment effects on cell gene expression. The authors propose a learning objective designed to enforce the conditional independence assumptions of the model, and evaluate it against baselines from the disentanglement, ICA, and cellular-response-estimation literature.

优点

  • the proposed method is justified theoretically
  • the paper is written in a clear and concise manner
  • extensive selection of baselines

Weaknesses

  • I fail to see the reasoning of comparing the conditional independence results (6.3) with the other baselines, which were not designed to have separate latent subspaces for x,t, and tx, and thus it does not seem surprising that the conditional independence does not hold for random subsets of them?

  • Results of the cellular response tasks are barely better than the baselines - did the authors conduct any tests to see if the differences are statistically significant?

Questions

  • Equation (6) (the constraint for interaction between x and t) - how does that guarantee the interactions? The learned embeddings can have empty dimensions, e.g, k(x) = [k_1(x), …, k_n(x), 1, …, 1] and k(t) = [1, …, 1, k_1(t), …, k_m(t)] so that the Hadamard product becomes [k(x), k(t)]
  • Equation (7) - shouldn’t f_dis also take x as an argument since you want to enforce a conditional independence? Or do you train a separate f_dis for each possible value of x that you condition on (probably not feasible)?
  • If the model is supposed to be fully identifiable wrt. z_xt, then I would suggest evaluating this with the appropriate metrics on synthetically generated data in addition to the clustering objective (e.g., MCC as in Khemakhem et al. 2020, or any of the disentanglement metrics [Locatello et al. 2019])
  • How were the hyperparameters selected for the cellular response task?

Limitations

  • the algorithm has 3 hyperparameters, which might make training difficult. Details on hyperparameter selection were omitted
  • the evaluation is limited - results of the downstream task of predicting cellular response yield limited improvements, and the identifiability of z_{t,x} is not properly evaluated (with e.g., the MCC metric using synthetic data)
Author Response

**Comparing the conditional independence results (6.3) with the other baselines**:

We thank the reviewer for raising this important question and would like to clarify the experimental setup. The goal of this experiment is to demonstrate that FCR is capable of learning conditionally independent spaces for $z_x$, $z_t$, and $z_{tx}$. Our algorithm is the first to specifically address the $x$, $t$, and $tx$ subspaces, which makes it challenging to find direct baselines for comparison. However, baselines such as iVAE, betaVAE, factorVAE, and sVAE are semi-supervised or unsupervised disentangled representation learning methods; they aim to learn representations that should be independent or conditionally independent given $x$ and $t$. Additionally, methods like CPA and VCI enforce certain invariance relations with respect to $t$ and $x$, potentially leading to some level of conditional independence in the learned representations.

However, the challenge in making this comparison lies in not knowing which blocks of latent variables induced by baseline methods satisfy the specific conditional independence relations of interest. To ensure a fair comparison, we exhausted different combinations of the representations for the tests and reported the best results. This comparison is intended to demonstrate that FCR can effectively learn more meaningful and nuanced conditional independence relations in $x$, $t$, and $tx$, compared to the baselines. We will ensure that this clarification is included in the manuscript.

**Differences are statistically significant**:

Thank you for raising this concern. We tested for significance of the difference in means between the results from the FCR method and those of the second-best performing method (paired t-test). The p-values, along with the corresponding cell numbers in the testing set, appear in the table below.

**Table: p-values for $R^2$ performance**

| Dataset | p-value (# cells) |
| --- | --- |
| sciPlex | 0.0006 (295) |
| Multiplex-Tram | 0.0056 (266) |
| MultiPlex-7 | 0.0001 (318) |

The results indicate that FCR significantly outperformed the baselines across cell lines on sciPlex, MultiPlex-Tram, and MultiPlex-7. On MultiPlex-9, where FCR is slightly outperformed by CPA, the differences are not significant. We will make sure to include all p-values in the updated manuscript.
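For reference, the paired t-test used here can be sketched as follows; the per-cell scores below are synthetic, and the two-sided p-value uses a normal approximation (adequate at the ~300-cell sample sizes above; `scipy.stats.ttest_rel` gives the exact t-distribution version).

```python
import numpy as np
from math import erf, sqrt

def paired_t_test(a, b):
    # Paired t-test on per-cell scores from two methods: the statistic is
    # the mean of the paired differences divided by its standard error.
    d = np.asarray(a, float) - np.asarray(b, float)
    n = len(d)
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))      # paired t-statistic
    # Two-sided p-value via the normal approximation to the t distribution
    # (reasonable here because n is large).
    p = 2 * (1 - 0.5 * (1 + erf(abs(t) / sqrt(2))))
    return t, p

rng = np.random.default_rng(0)
baseline = rng.uniform(0.6, 0.9, size=295)           # e.g. per-cell R^2, method B
improved = baseline + rng.normal(0.02, 0.01, 295)    # method A, slightly better
t, p = paired_t_test(improved, baseline)
```

Pairing by cell removes the large cell-to-cell variance from the comparison, which is why even a small mean improvement can be highly significant.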

As an additional metric beyond $R^2$, we also evaluated the mean squared error (MSE) for the top 20 differentially expressed genes (DEGs). These 20 genes show the largest difference for each cell line between drug-treated and control samples (following [1]). Note that we did not compare with CINEMA-OT and scGEN because they only handle binary treatments (case-control data). For detailed results, please check the General Response to All Reviewers, part a. Generally speaking, FCR outperforms the other baselines on 3 of 4 datasets.

**Concerns on Equation (6) (the constraint for the interaction between $\mathbf{x}$ and $\mathbf{t}$)**:

The reviewer asked: "How does that guarantee the interactions? The learned embeddings can have empty dimensions, e.g., $k(x) = [k_1(x), \ldots, k_n(x), 1, \ldots, 1]$ and $k(t) = [1, \ldots, 1, k_1(t), \ldots, k_m(t)]$, so that the Hadamard product becomes $[k(x), k(t)]$."

We apologize for this misleading language. Essentially, we did not want to claim any theoretical guarantee in this case, but instead to motivate the specific architecture choice. We therefore changed the language to "in order to encourage the prior for $z_{xt}$ to capture interactions between $x$ and $t$, we design the functions $f^p_{\mu}$ and $f^p_{\Sigma}$ to be of the form $k(x) \otimes k(t)$, where $\otimes$ denotes the Hadamard product".
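The $k(x) \otimes k(t)$ parameterization can be sketched as follows; the embedding functions, weights, and dimensions are hypothetical, chosen only to illustrate the architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def interaction_prior_mean(x, t, Wx, Wt):
    # Embed x and t separately, then combine them with an elementwise
    # (Hadamard) product so every output coordinate mixes both inputs.
    k_x = np.tanh(x @ Wx)     # embedding of covariates, k(x)
    k_t = np.tanh(t @ Wt)     # embedding of treatment, k(t)
    return k_x * k_t          # Hadamard product k(x) ⊗ k(t)

Wx = rng.normal(size=(5, 4))
Wt = rng.normal(size=(3, 4))
x = rng.normal(size=(10, 5))   # 10 cells, covariate dim 5
t = rng.normal(size=(10, 3))   # 10 cells, treatment dim 3
mu_ztx = interaction_prior_mean(x, t, Wx, Wt)   # (10, 4) prior mean for z_tx
```

As the exchange above notes, this multiplicative form encourages but does not guarantee interactions: a learned embedding could still saturate some coordinates to constants.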

**$x$ as an argument for $f_{dist}$**: We agree with the reviewer that it would be much clearer to put $x$ as an argument of $f_{dist}$. It is already the case that the function $f_{dist}$ implicitly considers $x$ as an argument, because we perform a random permutation of the triplet $(z_x, z_{tx}, x)$ under the condition of the same $x$ (i.e., the permutation or shuffle occurs only within a set of $z_x$ and $z_{tx}$ under the same $x$). As a result, the input to $f_{dist}$ is the set of newly generated $(z_x, \hat z_{tx})$, where we ensure that $z_x$ and $\hat z_{tx}$ correspond to the same $x$. The outputs of $f_{dist}$ are simply the labels indicating whether a permutation occurred or not. Therefore, we can say that $f_{dist}$ effectively takes $x$ into account. We briefly discuss this process in lines 226-230. We will make sure to update it in our final manuscript.
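The within-group shuffling that feeds $f_{dist}$ can be sketched as follows; this hypothetical illustration only constructs the discriminator's training pairs and labels, not the discriminator itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_discriminator_batch(z_x, z_tx, x_labels):
    # Within each group sharing the same covariate x, re-pair (z_x, z_tx)
    # by permuting z_tx; the discriminator's target is whether a
    # permutation occurred (label 1) or not (label 0).
    pairs, labels = [], []
    for lab in np.unique(x_labels):
        idx = np.where(x_labels == lab)[0]
        perm = rng.permutation(idx)            # shuffle only within this x
        for i, j in zip(idx, perm):
            pairs.append(np.concatenate([z_x[i], z_tx[i]])); labels.append(0)
            pairs.append(np.concatenate([z_x[i], z_tx[j]])); labels.append(1)
    return np.stack(pairs), np.array(labels)

z_x = rng.normal(size=(12, 4))
z_tx = rng.normal(size=(12, 4))
x_labels = np.repeat([0, 1, 2], 4)             # three covariate groups
batch, y = make_discriminator_batch(z_x, z_tx, x_labels)
# batch: (24, 8); y[k] == 1 marks a permuted (z_x, z_tx) pair
```

Because both members of every pair share the same $x$, a discriminator trained on these labels can only succeed by exploiting dependence between $z_x$ and $z_{tx}$ given $x$, which is exactly what the adversarial loss penalizes.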

**Simulation studies on synthetic data**: Please check the General Response to All Reviewers, part c.

**Hyperparameter selection**: Thank you for raising this important question; we appreciate that it is a critical point for evaluating how well the method works. Please check the General Response to All Reviewers, part b.

Comment

Thank you for your answers.

While most of my questions were clarified, I would suggest a different choice for the additional metric instead of MSE, e.g., the rank correlation mentioned by the authors in the response. Unless I am missing something, MSE does not bring much new insight beyond $R^2$, as they both measure the squared residuals, with the main difference being that $R^2$ assumes standardized targets.

Nevertheless, in light of the other answers, I raised my score accordingly.

Comment

Thank you very much for raising the score. For the MSE evaluation, we selected the top 20 differentially expressed genes, identified by comparing treated and untreated cells and ranked by t-test p-values, so rank information was considered to some extent in this evaluation. Additionally, we plan to add the Spearman correlation comparison in the final version (the table below shows FCR and VCI results). Again, thank you for your valuable comments, which helped us improve our work a lot!
**Table: Spearman Correlation (top 500 highly variable genes)**:

| Datasets | FCR | VCI |
| --- | --- | --- |
| sciPlex | 0.84 (0.07) | 0.82 (0.08) |
| Multiplex-Tram | 0.87 (0.05) | 0.85 (0.06) |
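For completeness, the Spearman correlation reported here is just the Pearson correlation of ranks; a minimal sketch (ignoring ties, which would need average ranks as in `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman_corr(a, b):
    # Rank each vector (argsort of argsort gives 0-based ranks when there
    # are no ties), then compute the Pearson correlation of the ranks.
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

a = np.array([0.2, 0.9, 0.4, 0.6])   # e.g. predicted expression
b = np.array([1.0, 4.0, 2.0, 3.0])   # e.g. observed expression
rho = spearman_corr(a, b)            # 1.0 for perfectly monotone data
```

Unlike MSE, this depends only on the ordering of the gene-expression values, which is why it complements the squared-residual metrics.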
Review (Rating: 6)

The paper presents a novel method, the Factorized Causal Representation (FCR), which leverages identifiable deep generative models to understand cellular responses to genetic and chemical perturbations across multiple cellular contexts. This method improves upon prior models by learning disentangled representations that delineate treatment-specific, covariate-specific, and interaction-specific factors, which are shown to be theoretically identifiable. The effectiveness of FCR is demonstrated on four single-cell datasets. The performance, measured by the R^2 score of conditional cellular responses, often marginally outperforms state-of-the-art methods, showcasing FCR’s potential in accurately modeling cellular dynamics.

Strengths

  • The paper successfully demonstrates the identifiability of treatment-specific, covariate-specific, and interaction-specific representations.
  • The inclusion of the cell type classifier $f_{ct}$ and the permutation discriminator $f_{dist}$ are innovative loss components.
  • Sections 6.1 and 6.2 are well-designed to rigorously test the proposed method's ability to maintain the integrity of dimension reduction, decomposition, and conditional independence, validating the model's effectiveness in designing and training.

Weaknesses

Choice of Evaluation Metric: A significant concern arises from the choice of the $R^2$ metric in section 6.3 for evaluating the performance of the Factorized Causal Representation (FCR) model. $R^2$ is suitable for assessing whether a model adequately explains the variation in the response variable. However, it is not necessarily an indicator of the model's predictive accuracy. Moreover, the near-identical $R^2$ reported for FCR, VCI, CPA, and sVAE (as shown in Table 1) suggests that $R^2$ gives limited insight into the comparative effectiveness of these methods.

There are some minor misformats, which I noticed while reading.

  • Lines 35, 54: citation format issue
  • Line 294: subscript format issue

Questions

Can you elaborate more about the design choice of the evaluating metric?

Limitations

  • Generalizability: The paper does not show the method's applicability to new datasets. This limitation raises concerns about how well FCR can adapt to other biological datasets or experimental setups that deviate from the four testing datasets, which restricts the method's utility.
  • Interpretability: The model lacks mechanisms for clear interpretation, particularly in linking gene expression profile changes directly to specific treatments. So the decomposition does not provide insights into how gene expression is affected by specific treatments.

I understand that both limitations are significantly challenging and it is unrealistic to tackle both of them in one paper while they highlight critical areas for future research.

Author Response

**Evaluation metrics**: Please check the General Response to All Reviewers, part a.

For the performance of $R^2$, we assessed the statistical significance of the differences between FCR and the second-best methods using a paired t-test, and show the p-values with the corresponding sample size of the test set.
**Table: p-values for $R^2$ performance**

| Dataset | p-value (# cells) |
| --- | --- |
| sciPlex | 0.0006 (295) |
| multiPlex-Tram | 0.0056 (266) |
| multiPlex-7 | 0.0001 (318) |

The results indicate that FCR statistically significantly outperformed the baselines on the sciPlex, multiPlex-Tram, and multiPlex-7 datasets.

We will incorporate the discussion of evaluation and t-test results in our final manuscript.

**Generalizability**:

We concur with the reviewer that generalizability is a very important topic in computational biology. Given the high cost of data collection, we are limited to testing and evaluating our models on the few publicly available large-scale single-cell drug screening datasets. As availability improves, there will be better opportunities for benchmarking, especially in the out-of-distribution scenario.

**Interpretability**:

There are several approaches to interpreting latent representations, which we plan to explore in future work:

  1. Latent Traversal: Since $z_{tx}$ is element-wise identifiable, we can perform latent traversal across all dimensions of $z_{tx}$. This involves fixing all latent variables $z_{tx}^i$ to their inferred values for each cell's gene expression and then varying one latent at a time while observing the reconstructed gene expression. This shows how each dimension of $z_{tx}$ influences gene expression levels.
  2. Interpretable Decoder System: We can also employ an interpretable decoder, such as LDVAE [1], which uses a linear decoder. By analyzing the weights that each latent representation assigns to each gene, we can interpret how the different representations influence each gene.
  3. Ongoing Work: In light of the neural additive model [2], we are developing a non-linear interpretable decoder as an extension of LDVAE. Specifically, each latent dimension has its own sub-neural network that captures information specific to that dimension, and a logistic regression layer is added at the end of the network. By analyzing the logistic regression weights (loading matrix), we can determine how much each latent variable contributes to the reconstruction of a specific gene. An illustration of this model can be found in Figure D of the rebuttal PDF file.
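A minimal sketch of the latent-traversal idea in point 1; the decoder and latent values are toy placeholders, not our trained model:

```python
import numpy as np

def toy_decoder(z_tx):
    # Stand-in for the trained decoder mapping latents to gene expression
    W = np.linspace(-1.0, 1.0, z_tx.size * 6).reshape(z_tx.size, 6)
    return np.tanh(z_tx @ W)

z_inferred = np.array([0.5, -1.0, 0.2, 1.5])   # inferred z_tx for one cell
dim, grid = 2, np.linspace(-3.0, 3.0, 7)       # traverse latent dimension 2

curves = []
for value in grid:
    z = z_inferred.copy()
    z[dim] = value                   # vary one latent, keep the rest fixed
    curves.append(toy_decoder(z))
curves = np.stack(curves)            # (n_grid, n_genes): per-gene responses
print(curves.shape)
```

Each row of `curves` is a reconstructed expression profile; comparing rows shows how the traversed latent dimension drives each gene.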

[1] Svensson, V., et al. (2020). Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics, 36(11), 3418–3421.
[2] Agarwal, R., et al. (2021). Neural additive models: Interpretable machine learning with neural nets. Advances in Neural Information Processing Systems, 34, 4699–4711.

Author Response

**General Response to All Reviewers**:

We sincerely thank all the reviewers for your thorough evaluation of our paper and your recognition of our contributions. We address some common questions and present new experimental results here, with detailed responses to each reviewer below:

**a. Evaluation Metrics**:

The choice of evaluation metrics for the drug response prediction task is indeed a topic of much discussion. We agree that $R^2$ is well-suited to assess how well a model explains the variation in the response variable.

We chose $R^2$ as our evaluation metric for two main reasons:

  • Previous works, including VCI, CPA, and sVAE, have also used $R^2$ as a standard evaluation metric, providing a basis for comparison.
  • Single-cell transcriptomics data is often noisy and sparse, may contain batch effects, and is strongly influenced by library size, making direct evaluation of gene prediction accuracy challenging. Given this noisiness, researchers commonly use relative gene enrichment analysis to compare different cell types.

Other evaluation metrics could be:

  • the Mean Squared Error (MSE) of the top differentially expressed genes (DEGs) post-treatment [1].
  • the Spearman correlation between predicted and ground-truth gene expression across all cells.
  • the cosine similarity between predicted and ground-truth gene expression.
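A sketch of the latter two alternatives, using synthetic predicted/true expression matrices in place of real model outputs:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
y_true = rng.poisson(5, size=(100, 50)).astype(float)  # cells x genes, toy counts
y_pred = y_true + rng.normal(0.0, 1.0, size=y_true.shape)

# Per-cell Spearman correlation between predicted and true expression
rho = np.mean([spearmanr(t, p)[0] for t, p in zip(y_true, y_pred)])

# Per-cell cosine similarity between predicted and true expression
cos = np.mean(
    np.sum(y_true * y_pred, axis=1)
    / (np.linalg.norm(y_true, axis=1) * np.linalg.norm(y_pred, axis=1))
)
print(f"mean Spearman rho = {rho:.3f}, mean cosine = {cos:.3f}")
```

Spearman is rank-based and therefore robust to the monotone distortions common in count data; cosine similarity ignores overall scale, which partially discounts library-size effects.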

We added MSE results for the top 20 DEGs. These genes are selected for showing statistically significant differences in expression between drug-treated and control samples for each cell line; the same procedure is used in [1]. Note that we did not compare with CINEMA-OT and scGEN because they only support binary treatments.

**Table 1: MSE (Top 20 DEGs)**

| Datasets | FCR (ours) | VCI | CPA | sVAE |
| --- | --- | --- | --- | --- |
| sciPlex | 0.12 (0.03) | 0.15 (0.04) | 0.15 (0.04) | 0.16 (0.05) |
| multiPlex-tram | 0.10 (0.07) | 0.13 (0.08) | 0.13 (0.08) | 0.14 (0.07) |
| multiPlex-7 | 0.18 (0.07) | 0.21 (0.08) | 0.22 (0.08) | 0.23 (0.07) |
| multiPlex-9 | 0.24 (0.06) | 0.27 (0.04) | 0.23 (0.06) | 0.26 (0.05) |

FCR still outperforms the competing methods on three of the four datasets, while CPA achieves slightly better results on multiPlex-9.
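A sketch of the metric itself, with synthetic expression matrices and hypothetical DEG indices (in practice the 20 genes come from a differential expression test of treated vs. control cells, per cell line):

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, n_genes, top_k = 200, 1000, 20
y_true = rng.normal(size=(n_cells, n_genes))                 # toy expression
y_pred = y_true + 0.3 * rng.normal(size=(n_cells, n_genes))  # toy predictions

# Hypothetical indices of the top-20 DEGs; in practice selected by a
# differential expression test (treated vs. control)
deg_idx = np.arange(top_k)

mse_top_degs = np.mean((y_true[:, deg_idx] - y_pred[:, deg_idx]) ** 2)
print(f"MSE (top {top_k} DEGs) = {mse_top_degs:.3f}")
```

Restricting the MSE to the top DEGs focuses the evaluation on the genes where the treatment effect is actually detectable.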

**b. Hyperparameter Selection**:

We split the data into four sets: train/validation/test/prediction, following the setup of previous works [2]. First, we hold out 20% of the control cells for the final cellular prediction task (pred). Second, we hold out 20% of the remaining data for the clustering and statistical testing tasks (test). Third, the data excluding the prediction and clustering/test sets are split into training and validation sets at a four-to-one ratio.

For the hyperparameter tuning procedure, we conduct an exhaustive grid search with n_epoch=100, selecting the configuration with the lowest loss on the validation data. The hyperparameter search space is

$w_1 = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]$
$w_2 = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]$
$w_3 = [0.1, 0.3, 0.5, 0.7, 0.9, 1, 3, 5, 7, 10]$

We will make sure to include this in our final manuscript.
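The exhaustive grid search can be sketched as follows; `train_and_validate` is a placeholder (here a toy quadratic) standing in for actually training FCR and evaluating the validation loss:

```python
import itertools

w1_grid = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w2_grid = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w3_grid = [0.1, 0.3, 0.5, 0.7, 0.9, 1, 3, 5, 7, 10]

def train_and_validate(w1, w2, w3, n_epoch=100):
    # Placeholder: train FCR with these loss weights for n_epoch epochs and
    # return the validation loss; a toy quadratic stands in for training here.
    return (w1 - 2) ** 2 + (w2 - 5) ** 2 + (w3 - 0.7) ** 2

best = min(
    itertools.product(w1_grid, w2_grid, w3_grid),
    key=lambda ws: train_and_validate(*ws),
)
print("best (w1, w2, w3):", best)
```

The full grid contains 11 × 11 × 10 = 1210 configurations, each trained and scored on the validation set.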

**c. Simulation Study**:

To address Reviewer vnnM's question on synthetic data experiments demonstrating the identifiability of $z_{tx}$, we follow the simulation protocol of [3,4]. For simplicity, we set the dimensions of $z_x$ and $z_t$ to 1 and the dimension of $z_{tx}$ to 4, and we output $y$ (dimension 96) as real numbers instead of count data, with a sample size of 5000. The setup is as follows.

$t \sim \textrm{Unif}([1,2,3])$, $x \sim \textrm{Unif}([100, 1000, 5000])$, $z_x \sim \textrm{Normal}(x/2, 1)$, $z_t \sim \textrm{Normal}(t/2, 1)$, $z_{tx} \sim \textrm{Normal}(xt, I_4)$, and $g$ is a 2-layer MLP with the Leaky-ReLU activation function [3]. To measure the component-wise identifiability of the changing components, we compute the Mean Correlation Coefficient (MCC) between $z_{tx}$ and $\hat z_{tx}$, i.e., between the original sources and the corresponding latents sampled from the learned posterior. A higher MCC indicates a higher extent of identifiability, reaching 1 when the latent variables are perfectly component-wise identifiable. Following iVAE [3], we first calculate all pairwise correlation coefficients between source and latent components and then solve a linear sum assignment problem to match each latent component to the source component that best correlates with it, thus reversing any permutation of the latent space. The UMAP projections of the ground truth $z_{tx}$ and estimated $\hat z_{tx}$, which show perfect separation, are shown in Figure C of the PDF file.
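A sketch of this MCC computation, with toy sources and permuted noisy latents standing in for $z_{tx}$ and $\hat z_{tx}$:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(sources, latents):
    # Absolute correlation between every (source, latent) pair
    d = sources.shape[1]
    corr = np.abs(np.corrcoef(sources.T, latents.T)[:d, d:])
    # Linear sum assignment reverses any permutation of the latent space
    row, col = linear_sum_assignment(-corr)
    return corr[row, col].mean()

rng = np.random.default_rng(0)
z_true = rng.normal(size=(1000, 4))                              # toy sources
z_hat = z_true[:, [2, 0, 3, 1]] + 0.1 * rng.normal(size=(1000, 4))
print(f"MCC = {mcc(z_true, z_hat):.3f}")
```

Because the latents here are just a permutation of the sources plus small noise, the MCC comes out close to 1, illustrating what component-wise identifiability looks like under this metric.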

**Table 2: Simulation Results on MCC**

| Methods | MCC |
| --- | --- |
| FCR | 0.91 (0.03) |
| beta-VAE | 0.38 (0.12) |
| iVAE | 0.77 (0.07) |
| factor-VAE | 0.37 (0.08) |

**d. Unseen Prediction**: For the unseen prediction experiment setup and results, please see the response to Reviewer yZVk.

**Note: We have new figures in the attached PDF file.**

**References**:

[1] Roohani, Y., et al. (2024). Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nature Biotechnology.
[2] Lotfollahi, M., et al. Predicting cellular responses to complex perturbations in high-throughput screens. Mol Syst Biol.
[3] Khemakhem, Ilyes, et al. "Variational autoencoders and nonlinear ica: A unifying framework." International conference on artificial intelligence and statistics. PMLR, 2020.
[4] Lopez, Romain, et al. "Toward the Identifiability of Comparative Deep Generative Models." Causal Learning and Reasoning. PMLR, 2024.

Final Decision

The paper proposes a method for learning disentangled causal representations of covariates, treatments, and their interactions. The learning objective enforces the conditional independence assumptions and is theoretically well-justified. Performance is tested in an extensive evaluation showing that the method performs significantly better than a large set of baselines. The paper is well-written and tackles a relevant and hard problem. The reviewers criticise minor issues like the complexity of the model and choices in the evaluation scheme.

All five reviewers vote towards acceptance.