DeCaFlow: A deconfounding causal generative model
We propose a causal generative model that accurately estimates a broad class of causal queries, including counterfactuals, in the presence of hidden confounders.
Abstract
Reviews and Discussion
The paper introduces deconfounding causal flows, an extension to causal normalising flows that uses a VAE-style approach to estimate causal queries under hidden confounding. The paper introduces novel identifiability results that promise correct estimates for a wide range of causal queries. The experiments support the performance of the proposed method compared to relevant baselines, making this a promising method for causal practitioners.
Strengths and Weaknesses
- The paper tackles the interesting problem of causal inference under hidden confounding, providing novel theoretical insights as well as a practical method that should be applicable to a wide range of real-world tasks.
- The paper is well written and thoroughly explains the approach and its significance. (I did not, however, check the correctness of the proofs or the appendix.)
- The empirical results could be stronger by testing this method on more real-world datasets without replacing the ground-truth functional relationships. The paper could easily produce additional causal benchmarks by hiding confounders for evaluation, turning a fully observed causal benchmark into one with hidden confounders.
- Could be a bit clearer on the functional limits on CNFs before the conclusion.
- It’s unclear how the hyperparameters of the models are chosen and how a practitioner should go about using this method.
Questions
- Can this deal with differences in which variables are available during training and inference?
- A strong requirement here is to perfectly fit the data distribution. How hard is this in practice? How hard is it to estimate whether you have correctly captured the distribution while only having access to finite samples?
- How do you choose a prior over z? What’s the impact of this?
- L 235 - 4. How do you go about measuring “enough information”?
- How are the hyperparameters chosen? B.3 almost sounds like they were chosen based on test-set performance.
- How come you can achieve almost 0 ATE error but significantly higher CF error if the identifiability of ATEs means identifiability of CFs?
- Why are you replacing the data-generating processes rather than simply taking real-world causal datasets and dropping confounders to estimate the correctness under hidden confounding?
Limitations
mostly
Formatting Issues
n/a
We thank the reviewer for their insightful comments and positive feedback. Below, we clarify the concerns raised and, once they have been satisfactorily addressed, would appreciate a reconsideration of the rating.
results could be stronger by testing this method on more real-world datasets.
We agree and would truly appreciate it if the reviewer could provide some specific examples of such datasets. Unfortunately, we could not find existing datasets that we could easily use to test DeCaFlow. In the end, we opted for an approach similar to the one described by the reviewer, where the oracle has access to the hidden confounders and we simply hide them from the rest of the models. Then, to properly measure the error committed by DeCaFlow (among other reasons, see App. B.4 and B.5), we decided to use semi-synthetic data where we have access to the ground-truth SCM, which is, to the best of our knowledge, the only way to assess counterfactual estimates.
Note, however, that we already test DeCaFlow on real-world data in a real application in Section 5.3. Nevertheless, we find the evaluation of DeCaFlow (and any other CGM) in real-world scenarios with accessible interventional data at scale an exciting research direction, and recent developments such as causal chambers [1] are making this goal more viable.
Could be a bit clearer on the functional limits on CNFs before the conclusion.
We may not fully understand what the reviewer refers to as “functional limits” and would appreciate further clarification. If the reviewer refers to the assumptions on invertibility and smoothness in line 114, these are not overly restrictive since, as we do not observe the exogenous variables, we can reduce the real SCM to a causally equivalent one which fulfills these conditions, as argued in the CNF paper. Moreover, we will add a limitations section discussing other limitations derived from the assumptions made in our work.
It’s unclear how the hyperparameters of the models are chosen and how a practitioner should go about using this method.
We apologize that this is not clear, and would appreciate more detailed feedback in this regard. We tried to be explicit: in App. B.5.3, “Hyper-parameters and splits”, we explain that hyperparameters were selected according to the MMD distance between the validation split and samples generated with DeCaFlow (both using observational data). That is, the best model is the one that best matches the original observational distribution. All metrics are reported on the test split.
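For concreteness, here is a minimal sketch of this selection criterion, an MMD estimate under an RBF kernel (the function below and the `candidate_models` / `x_val` names are illustrative placeholders, not our actual implementation):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two
    sample sets under an RBF kernel with bandwidth sigma."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Pick the hyperparameter configuration whose observational samples best
# match the held-out validation split; test metrics are computed afterwards.
best_model = min(candidate_models,
                 key=lambda m: rbf_mmd2(m.sample(len(x_val)), x_val))
```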
While we tried to report all details, e.g., how we regularized the ELBO loss (App. C.2), how we implement the do-operator (App. D.1), or how to use DeCaFlow at test time (App. F.1), we will revise the manuscript and address what remains unclear.
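To complement App. D.1, here is a rough illustration of the abduction-action-prediction recipe behind the do-operator in a flow-based causal model (a sketch only; `flow.inverse`, `flow.forward_dim`, and the variable names are hypothetical placeholders rather than our actual API):

```python
def counterfactual(flow, x, z, idx, value, topo_order):
    """Sketch of do(x_idx = value) on factual samples x, given a posterior
    sample z of the hidden confounders and a topological ordering."""
    u = flow.inverse(x, cond=z)          # abduction: recover exogenous noise
    x_cf = x.clone()
    x_cf[:, idx] = value                 # action: clamp the intervened variable
    for j in topo_order:                 # prediction: regenerate the rest;
        if j != idx:                     # non-descendants are reproduced as-is
            x_cf[:, j] = flow.forward_dim(u, x_cf, dim=j, cond=z)
    return x_cf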
Can this deal with differences in which variables are available during training and inference?
We assume that all endogenous variables in the causal graph are observed in both training and inference (except the hidden confounders). It is, however, an interesting line of future work to consider a test dataset with a different set of observed variables than the training set, e.g., using tractable probabilistic models that allow marginalizing out the missing variables. This could lead to semi-supervised settings that can be very useful for real-world problems where we have related datasets extracted from different sources. We will add this to our future-work discussion.
A strong requirement here is to perfectly fit the data distribution. How hard is this in practice? How hard is it to estimate whether you have correctly captured the distribution while only having access to finite samples?
Normalizing flows are universal density approximators, so if the causal graph is well specified, a causal normalizing flow should be able to fit the data distribution. However, as with any ML method, its performance improves as the size of the dataset increases. Based on the insightful suggestion by the reviewer, we have conducted an ablation study comparing performance across different data regimes, which confirms this trend and will be included in the revised paper. This will help practitioners better understand the expected performance of our approach.
How do you select a prior over z? What’s the impact of this?
In all experiments we set the prior p(z) to a standard multivariate Gaussian, and we will make sure to add this information in the next revision. Theoretically, the prior over z should not impact the model. To see why, note that we can always rewrite z to a standard Gaussian using cumulative distribution functions without affecting the observed causal model. This is a similar argument to that for the choice of the base distribution p(u) in the original CNF paper, since their roles in the data-generating process are almost identical, see Eq. (1). However, in practice the choice of prior can make the problem easier or harder to learn, and any prior information can be leveraged. To see this effect in practice, we refer the reviewer to App. D.3 of the original CNF paper, where the authors ablate this choice.
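To make this argument explicit, a short derivation (for simplicity, for a scalar or componentwise-independent z, with $F_z$ its CDF and $\Phi$ the standard Gaussian CDF):

```latex
\tilde{z} := \Phi^{-1}\big(F_z(z)\big) \sim \mathcal{N}(0, 1),
\qquad
x = f(u, z) = f\big(u,\, F_z^{-1}(\Phi(\tilde{z}))\big) =: \tilde{f}(u, \tilde{z}),
```

so the reparametrized model $\tilde{f}$ has a standard Gaussian prior yet induces exactly the same distribution over the observed variables.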
L 235 - 4. How do you go about measuring “enough information”?
What we refer to as “enough information” in the main text (to ease readability) is actually the same completeness condition used by [2] and [3]. Paraphrasing [3]: roughly, completeness requires that the distributions of the hidden confounder corresponding to different values of the proxy variables are distinct. Referring to [3], while measuring completeness is non-trivial, many causal models satisfy it by construction, such as exponential families, location-scale families, and nonparametric regression models.
This completeness condition is also employed in nonparametric causal identification and, generally, in statistics (cf. Completeness (statistics) in Wikipedia). We refer the reviewer to the works [4] and [5] for further pointers on sufficient conditions for completeness.
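For reference, the textbook form of the condition reads as follows (stated here for the posterior of the confounder Z given a proxy W; the symbols are illustrative):

```latex
\mathbb{E}\big[\, g(Z) \mid W = w \,\big] = 0 \ \ \text{for all } w
\quad \Longrightarrow \quad
g(Z) = 0 \ \ \text{almost surely},
```

for all suitably integrable functions g; that is, no non-trivial function of the confounder is “invisible” to the proxy.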
How are the hyperparameters chosen? B.3 almost sounds like they were chosen based on test-set performance.
Please refer to the answer above. In short, we use the MMD between the validation split and generated data to select the hyperparameters (observational distribution matching). All metrics are reported on a held-out test set after hyperparameter selection.
How come you can achieve almost 0 ATE error but significantly higher CF error if the identifiability of ATEs means identifiability of CFs?
We assume that the reviewer refers to Figure 7. Here, the difference stems from the fact that the metrics are different and, therefore, not comparable. The ATE error compares the means of the true and estimated distributions, while the CF error measures the error sample by sample and then averages it. These differences are also observed in the oracle, highlighting why we should compare the performance of the models with that of the oracle.
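A toy numerical sketch of why the two metrics can diverge (the definitions below are simplified stand-ins for the exact metrics in the appendix):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=10_000)            # "true" counterfactual outcomes
y_est = y_true[rng.permutation(10_000)]     # same marginal, wrong pairing

ate_error = abs(y_est.mean() - y_true.mean())  # compares means: exactly 0 here
cf_error = np.abs(y_est - y_true).mean()       # per-sample error: ~1.13
print(ate_error, cf_error)
```

Matching the interventional distribution (and hence the ATE) does not require matching outcomes sample by sample, which is what the CF error demands.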
Why are you replacing the data-generating processes rather than simply taking real-world causal datasets and dropping confounders to estimate the correctness under hidden confounding?
We did cover this question in App. B.4, but let us expand further. As mentioned above, we could not find suitable datasets to use and, regarding the Sachs and Ecoli datasets, we found that: i) there were several proposed causal graphs, so the true causal graph could potentially contain cycles; and ii) we did not have access to counterfactuals to assess DeCaFlow. Moreover, we found that the Ecoli dataset has only 9 observational samples. Therefore, we followed the approach of Chao et al. [6] and resorted to semi-synthetic data.
Besides, the interventional data in the Sachs dataset correspond to soft interventions, while DeCaFlow performs hard ones, making it unclear how to compare them. Extending causal generative models to allow for soft interventions is interesting, but out of the scope of this paper and thus deferred to future work. More generally, a real-world evaluation of causal inference approaches remains an open challenge (especially in the context of counterfactuals [7, 8]) in a literature [9, 10] which, like us, relies on (semi-)synthetic datasets. Thus, it is an important but challenging avenue for future work. However, note once again that the experiment in Section 5.3 is based on real-world data.
We hope to have clarified all the concerns raised during the review, and remain at the disposal of the reviewer if further questions come up during the rebuttal period.
References
[1] Gamella, J.L., Peters, J. & Bühlmann, P. Causal chambers as a real-world physical testbed for AI methodology. Nat Mach Intell.
[2] W. Miao, Z. Geng, and E. J. Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika.
[3] Wang, Y., & Blei, D. (2021). A proxy variable view of shared confounding. In International Conference on Machine Learning. PMLR.
[4] D’Haultfoeuille, X. (2011). On the completeness condition in nonparametric instrumental problems. Econometric Theory.
[5] Hu, Y., & Shiu, J. L. (2018). Nonparametric identification using instrumental variables: sufficient conditions for completeness. Econometric Theory.
[6] P. Chao, P. Blöbaum, and S. Kasiviswanathan. Interventional and counterfactual inference with diffusion models. arXiv preprint, abs/2302.00860, 2023.
[7] Holland, P. W. Statistics and causal inference. Journal of the American Statistical Association.
[8] Bareinboim, E., Correa, J. D., Ibeling, D., & Icard, T. , 2022. On Pearl’s hierarchy and the foundations of causal inference. In Probabilistic and causal inference: the works of judea pearl.
[9] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics.
[10] Curth, A., & Van Der Schaar, M. (2023). In search of insights, not magic bullets: Towards demystification of the model selection dilemma in heterogeneous treatment effect estimation. In International conference on machine learning. PMLR.
This work presents deconfounding causal normalizing flow (DeCaFlow), a framework that extends causal normalizing flows (CNFs) to the case where hidden confounders are present. When a causal graph is given, CNFs are autoregressive normalizing flows that parametrically model the generative mechanisms and allow us to answer causal queries, among other applications. DeCaFlows are similar; however, they are also equipped to handle hidden confounders, under some assumptions. The framework assumes access to the underlying causal graph and i.i.d. observations, and essentially builds a generative model and a deconfounding model, which are jointly trained using a variant of the standard ELBO loss.
The paper shows that DeCaFlows satisfy many desiderata, such as identifiability results for causal queries like do-operations or interventional queries (even though hidden confounders are present). However, these results apply under various assumptions on the causal structure of the query and in the infinite-data regime. Finally, the technique is validated via experiments on semi-synthetic data and the Ecoli70 gene dataset, where it outperforms prior techniques on standard metrics such as MAE. The counterfactual fairness-prediction application is also interesting. The target audience is people interested in causal inference.
Strengths and Weaknesses
Strengths:
- The problem of building normalizing flows in the presence of hidden confounders is important and relatively less studied, so this work is a useful step in this direction.
- The paper is well-written, and related work seems to be cited well. The proof techniques are very interesting and generalize prior works, and the experiments probe different aspects such as misspecification.
Weaknesses:
- While the identifiability results are interesting, I'm unable to gauge how restrictive the assumptions are, e.g., in Proposition A.2. It'd be nice if the authors could comment on this.
Questions
None
Limitations
Limitations have been discussed.
Final Justification
I read the other reviews and the authors' responses. They clarified my questions.
Formatting Issues
None
We thank the reviewer for their positive feedback on DeCaFlow, on the practical significance of the paper, and on its presentation. Let us try to help the reviewer better understand the assumptions made in our work so that, if appropriate, they can re-evaluate it after the rebuttal period.
Regarding the assumptions in section 3 (lines 113-114), these are the same assumptions made in the original Causal Normalizing Flow paper [1]. Namely, we assume:
- An acyclic causal graph, which allows us to write the data-generating process in a finite number of steps (i.e., to “unroll” the equations). This assumption is so common that some authors directly define SCMs as inducing acyclic graphs; see [3] for an example.
- Diffeomorphic causal structures. This subsumes three conditions: continuity, bijectivity, and differentiability. The first two, we argue, are not restrictive since, as we do not observe the exogenous variables, we can always reduce a given SCM to a causally equivalent one which is continuous and bijective with respect to the exogenous variables. Differentiability was assumed in the original CNF paper to relate the model Jacobian to the causal dependencies of the SCM.
- Continuous random variables. While this can sound restrictive at first, the original paper (see their App. A.2.1) argues how to adapt CNFs to allow for discrete random variables too.
Regarding the assumptions in Proposition A.2, these are quite similar to those previously made by Miao et al. [4] and Wang and Blei [5]:
- Assumptions i)-iii) concern the conditional independence of different variables in the causal graph, which can be automatically verified given a faithful causal graph. An example of such structural assumptions is actually provided in our real-world use case in Section 5.3, where it is an accepted assumption in the literature (see [7]) to consider knowledge (the hidden confounder) to be independent of the socio-demographic features (the adjustment set in Assumption 1), and the exam scores GPA and LSAT to be conditionally independent of each other and of FYA (the outcome), conditioned on knowledge and the adjustment set. Similar examples can be found in other contexts such as healthcare, where it is common to leverage data on other diseases or comorbidities (such as records of routine check-ups or hospital admissions for causes other than the treatment of interest) as proxies for unmeasured confounders. For example, as shown in [4], when measuring the effect of influenza vaccination on influenza-related hospitalization, health-seeking behaviour is an unobserved confounder, and two proxies can be employed: the number of annual wellness visits, which correlates with health-seeking behaviour but does not affect the risk of influenza hospitalization, and trauma-related hospital admissions, which, conditional on health-seeking behaviour, are independent of influenza hospitalization. Other well-known examples can be found in [8].
- Assumptions iv) and v) concern the quality of the proxy variables of the hidden confounder. Intuitively, they mean that, as we change the proxies' values, the posterior of the hidden confounder varies enough to properly perform inference on it. Perhaps it helps to know that this notion of completeness has been adopted from statistics (cf. “Completeness (statistics)” in Wikipedia). This assumption is harder to verify but, as we show in Figure 6, the greater the number of proxy variables, the better the estimation of the hidden confounder's effect and the more accurate the causal-query estimation.
- Finally, assumptions vi) and vii) are standard regularity conditions [5, 6] that are (almost surely) fulfilled in practice, as long as our random variables are well-behaved (e.g., having finite moments).
Beyond these assumptions, for causal queries in which there is high uncertainty around their validity, one could consider alternative approaches that jointly perform causal identification and estimation for those individual queries, such as [9]. While they also assume knowledge of the causal graph, they perform causal identification for each causal query based on the available data. Such an approach would, however, lose the scalability advantage of DeCaFlow, which with a single model provides estimates for many causal queries in large graphs and handles continuous variables. Consequently, we would encourage practitioners to combine both approaches based on the reliability of their assumptions. We will make sure to clarify this in the discussion of the limitations of our work.
References
[1] Javaloy, A., Sánchez-Martín, P., & Valera, I. (2023). Causal normalizing flows: from theory to practice. Advances in Neural Information Processing Systems, 36, 58833-58864.
[2] Reizinger, P., Sharma, Y., Bethge, M., Schölkopf, B., Huszár, F., & Brendel, W. (2023). Jacobian-based causal discovery with nonlinear ICA. Transactions on Machine Learning Research.
[3] Schölkopf, B., & von Kügelgen, J. (2022, April). From statistical to causal learning. In Proceedings of the International Congress of Mathematicians (Vol. 1). EMS Press.
[4] Shi, Xu, Wang Miao, and Eric Tchetgen Tchetgen. "A selective review of negative control methods in epidemiology." Current epidemiology reports 7.4 (2020): 190-202.
[5] Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika.
[6] Wang, Y., & Blei, D. (2021, July). A proxy variable view of shared confounding. In International Conference on Machine Learning (pp. 10697-10707). PMLR.
[7] Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. Advances in neural information processing systems, 30.
[8] Lipsitch, M., Tchetgen, E. T., & Cohen, T. (2010). Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21(3), 383-388.
[9] Kevin Muyuan Xia, Yushu Pan, and Elias Bareinboim. Neural Causal Models for Counterfactual Identification and Estimation. In The Eleventh International Conference on Learning Representations, ICLR 2023.
I thank the authors for the rebuttal and stand by my rating to accept.
Many thanks to the reviewer for reading our rebuttal and for the positive feedback.
In this paper, the authors extend existing work on Causal Normalizing Flows and introduce DeCaFlow, for "deconfounding causal normalizing flow". The focus of the authors is on scenarios where not all variables of a system are observed and, in particular, where there exists a non-empty set of hidden confounders Z. The goal of DeCaFlow is to be able to answer interventional and counterfactual queries over the observed variables X. To this end, they propose to train two models: a deconfounding network (from X to Z) and a generative network (from the exogenous noise U and Z to X). The authors prove that DeCaFlow correctly solves causal queries between non-confounded variables and provide sufficient conditions for queries between confounded variables.
Strengths and Weaknesses
Strengths:
- The problem of estimating causal effects between confounded variables is known, interesting, and well-motivated.
- As with Causal Normalizing Flows, DeCaFlow also represents a class of models identifiable from the observational distribution.
- The authors provide in the supplementary materials an algorithm to verify whether a causal query can be identified or not.
Weaknesses:
- The method requires the causal graph between observed variables, which is not trivial to recover in general, in particular under hidden confounding. This problem is not discussed in the paper.
Questions
Do the authors have any intuition on how DeCaFlow could be integrated with existing causal discovery algorithms for confounded data?
Limitations
yes
Final Justification
The authors acknowledged the reported weakness on causal discovery under confounding. Their discussion in the rebuttal, to be integrated in the paper, is satisfactory. This was a minor point that does not change my overall positive assessment on the acceptance of the paper.
Formatting Issues
none
We thank the reviewer for their positive review of DeCaFlow and acknowledge that there are important but also very challenging avenues for future work (e.g., the integration of causal discovery and causal inference methods) that could build on our work. Let us respond to the comments raised by the reviewer below:
The method requires the causal graph between observed variables, which is not trivial to recover in general, in particular under hidden confounding. This problem is not discussed in the paper. Do the authors have any intuition on how DeCaFlow could be integrated with existing causal discovery algorithms for confounded data?
We agree that this is an important point that should be discussed in the paper, and thus will discuss it in a new section in the next revision of the manuscript.
For readability, we consider a fully-specified causal graph throughout the paper. However, DeCaFlow is robust to some misspecifications of the causal graph. For example, for all variables not directly affected by hidden confounders, DeCaFlow should be able to capture their causal relationships given only a causal ordering, which is a feature inherited from Causal Normalizing Flows (CNFs). Similarly, we can work with partially specified causal graphs and consider groups of variables, as in App. A.2.2 of the original CNF paper. However, if the given causal graph misspecifies the relations with the hidden confounders, resulting in violations of the assumptions in Prop. 4.1, then our estimations for confounded causal queries may be inaccurate.
Moreover, while there have been some attempts to use CNFs for causal discovery (see [1]), these are still preliminary works. Hence, for causal queries in which there is high uncertainty about the variables affected by the hidden confounders, the practitioner may instead rely on approaches that jointly perform causal identification and estimation for those individual queries, such as [2]. While they also assume knowledge of the causal graph, they perform causal identification for each causal query based on the available data. Such an approach would, however, lose the scalability advantage of DeCaFlow, which with a single model provides estimates for many causal queries in large graphs and handles continuous variables. Consequently, we encourage practitioners to combine both approaches based on the reliability of their assumptions. We will make sure to clarify this in the discussion of the limitations of our work.
We remain at the disposal of the reviewer, in case they have further questions, and we would appreciate it if they could consider re-evaluating our work after the rebuttal period.
References
[1] Xi, Q., Gonzalez, S., & Bloem-Reddy, B. (2023, October). Triangular monotonic generative models can perform causal discovery. In Causal Representation Learning Workshop at NeurIPS 2023.
[2] Kevin Muyuan Xia, Yushu Pan, and Elias Bareinboim. Neural Causal Models for Counterfactual Identification and Estimation. In The Eleventh International Conference on Learning Representations, ICLR 2023.
I thank the authors for their rebuttal and welcome the integration of this discussion in the paper. I can confirm my recommendation to accept the paper.
We want to thank the reviewer for helping us improve the discussion in our paper, and for recommending it for acceptance.
This paper introduces DeCaFlow, a method for causal inference in the presence of latent confounders. The primary contribution is extending causal normalizing flows to the latent confounded case.
Given an SCM with endogenous variables X and latent confounders Z, this work assumes:
- Underlying causal graph G is known and acyclic
- Structural equations f are invertible given Z, and both f and its inverse are continuously differentiable
- Perfect interventions (do-calculus) over X
- Access to proxy variables (CI variables informative of hidden confounders) for identifiability beyond what do-calculus provides
The method is implemented via a latent variable model, where the encoder (X to Z) and decoder (Z to X) are based on normalizing flows.
Evaluations consider 3 settings: two semi-synthetic (real graphs, simulated structural equations) and one real (law school dataset).
Strengths and Weaknesses
Strengths:
- Experiments show strong empirical performance, approaching the oracle baseline in both semi-synthetic experiments.
- The proposed method seems efficient and reasonable -- requiring one training run per dataset (vs. one per query).
Weaknesses:
A primary weakness of this work is the question of whether the setting is indeed realistic (inherited from causal normalizing flows, upon which this work is based).
- Invertibility
In biological datasets (like those used in semi-synthetic experiments), confounders take the form of, e.g. which lab technician conducted the experiment, or the batch index of the cells when being sequenced (later batches are allowed to grow for longer). It is easy to train an inference model to predict these confounders. However, it is not always possible for a generative model to generalize: there may be no predictable pattern to how a new lab technician would act, compared to previous ones.
Could you provide some examples in which invertibility does make sense?
- Knowledge of G
G is often "known" only to a noisy extent. For example, there are multiple versions of the Sachs graph, and feedback loops play a central role in biological systems.
Can you comment on the robustness of your method to errors in G?
Questions
- In 5.2.1, new SCMs were sampled based on the Sachs graph to ensure that "the randomized effect of the hidden confounder is perceptible" [337]. Since the effect size of the original interventions can be quantified (based on the change in target protein activity), is it possible to evaluate the method on (some of) the actual data?
- The paper is somewhat dense. It would be nice if the main paper included more intuitive explanations throughout. For example: what proxy variables are, how the CNFs are implemented, and what do-calculus implies in the context of the illustrative example in Fig 1.
- Is it necessary to include the ablation study as the first part of the results? Moving it to the appendix (and noting that the model is not sensitive to this choice) could free up space for an expanded discussion of the above.
- Overall, the writing can be made more concise to improve readability. Some limited examples:
  - 103-106 can be split into 2 sentences.
  - 136-137 "we take [....] and adapt them [...]" -> "we adapt CNFs to [...]"
  - 159-161 -> "To avoid posterior collapse, we implement [...]"
Limitations
yes
Final Justification
The author's rebuttal has resolved the concerns that I raised during the review. Therefore I am adjusting the score.
Formatting Issues
no
We thank the reviewer for their questions and suggestions, which will help us improve the readability and understandability of our manuscript. We will address all raised concerns in the next revision, and would appreciate it if the reviewer could reconsider their rating if we have properly addressed their concerns. Next, let us discuss the points raised during the review:
Invertibility (and biological datasets example)
There are two different questions to address here. As for the example provided, the reviewer is right: DeCaFlow, just like causal NFs and any other ML model, relies on data, and generalization is always a challenge; if a new technician arrives with highly unpredictable behavior (relative to the previous ones), the model could struggle to generalize. However, existing ML techniques (e.g., a small amount of fine-tuning) should help in this scenario.
Then, note that we assume invertibility from u to x given z, which we argue is not a real limitation in terms of the SCMs we can model. The reasoning is that, as we do not observe the exogenous variables and do not work directly with them (i.e., we do not give them semantics), given a non-invertible SCM we can always use the Knothe-Rosenblatt transport to find an invertible one (again, from u to x given z) which is causally equivalent on all three rungs of Pearl's causal hierarchy, as argued in the original CNF paper.
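For intuition, the construction looks as follows (a sketch, suppressing regularity conditions): for each variable, define a new exogenous noise via the conditional CDF of that variable given its parents and z,

```latex
\tilde{u}_i := F_{x_i \mid \mathrm{pa}_i, z}\big(x_i \mid \mathrm{pa}_i, z\big) \sim \mathrm{Unif}(0, 1),
\qquad
x_i = F^{-1}_{x_i \mid \mathrm{pa}_i, z}\big(\tilde{u}_i \mid \mathrm{pa}_i, z\big),
```

so that each mechanism becomes strictly increasing (hence invertible and, under mild conditions, smooth) in its new noise, while all the induced distributions over x are preserved.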
Can you comment on the robustness of your method to errors in G?
This is an important question to discuss in the manuscript, which we will make sure to add. It is important to note that DeCaFlow is robust to some misspecifications of the causal graph and, to clarify this, consider the following two cases:
- If the variables are not directly affected by the hidden confounders, we inherit the theory from CNFs (see App. A.1), where it is sufficient to know the causal ordering (rather than the causal graph), and where we can use partial knowledge of the causal graph to group variables together and work with blocks of variables (see App. A.2.2 of the original CNF paper).
- If some variables are children of the hidden confounders, our theory relies on the conditions in Prop. 4.1. Thus, if there is a mismatch between the true and assumed causal graphs regarding these assumptions (e.g., proxy variables not being actual proxies), our identifiability results will no longer hold and hence our estimates can be inaccurate. We thus believe that a sensitivity analysis studying causal-graph misspecification is an interesting avenue for future work, which we will add to our discussion.
Since the effect size of the original interventions can be quantified (based on the change in target protein activity), is it possible to evaluate the method on (some of) the actual data?
Besides line 337, we explain why we resort to semi-synthetic data in App. B.4. In short, the true causal graph potentially contains cycles, and we do not have access to counterfactuals; semi-synthetic data solve both problems. While we could attempt to use the original interventional data to evaluate our model, we could not find a reliable way of disentangling the model's prediction error from the error due to potential causal-graph misspecification.
Moreover, interventions in the Sachs dataset are soft interventions, while DeCaFlow performs hard ones, making it unclear how to compare them. We find it interesting to extend causal generative models to allow for soft interventions, or to design metrics to compare different types of interventions, but this is out of the scope of this paper and thus deferred to future work. More generally, a real-world evaluation of causal inference approaches remains an open challenge (especially in the context of counterfactuals, since they are not observable by definition [1, 2]) in a literature [3, 4] which, like us, relies on (semi-)synthetic datasets. Thus, it is an important but challenging avenue for future work.
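For clarity, the distinction in symbols (standard truncated-factorization definitions, written without hidden confounders for readability): a hard intervention do(x_i = a) replaces the mechanism of x_i by a point mass, whereas a soft intervention replaces it by a new conditional $\tilde{p}$,

```latex
p\big(x \mid \mathrm{do}(x_i = a)\big) = \prod_{j \neq i} p\big(x_j \mid \mathrm{pa}_j\big)\Big|_{x_i = a},
\qquad
p_{\tilde{p}}(x) = \tilde{p}\big(x_i \mid \mathrm{pa}_i\big) \prod_{j \neq i} p\big(x_j \mid \mathrm{pa}_j\big),
```

which is why estimates obtained under hard interventions are not directly comparable to data collected under soft ones.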
The paper is somewhat dense. It would be nice if the main paper included more intuitive explanations throughout. For example: what proxy variables are, how the CNFs are implemented, and what do-calculus implies in the context of the illustrative example in Fig 1.
We appreciate the feedback and agree with the reviewer that extra explanations and intuitions can make the paper more accessible. We will thus add to the main paper intuitions on what proxy variables are, a more detailed explanation of what do-calculus implies, and a description of how CNFs are implemented in practice, as well as any other explanation that we find missing while revising the manuscript, such as a limitations section discussing noisy causal graphs and the validity of our assumptions.
Is it necessary to include the ablation study as the first part of the results? Moving it to the appendix (and noting that the model is not sensitive to this choice) could free up space for an expanded discussion of the above.
At submission time, we believed it important to validate the sensitivity of DeCaFlow to this choice. However, we can see how, in its current state, we can defer the ablation to the appendix and free up more space for the changes promised above. We will thus apply the changes suggested by the reviewer in an attempt to make the paper as accessible as possible. We truly appreciate the feedback.
References
[1] Holland, P. W. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
[2] Bareinboim, E., Correa, J. D., Ibeling, D., & Icard, T. (2022). On Pearl’s hierarchy and the foundations of causal inference. In Probabilistic and causal inference: the works of judea pearl (pp. 507-556).
[3] Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240.
[4] Curth, A., & Van Der Schaar, M. (2023, July). In search of insights, not magic bullets: Towards demystification of the model selection dilemma in heterogeneous treatment effect estimation. In International conference on machine learning (pp. 6623-6642). PMLR.
The authors have addressed the concerns that I raised. I will raise my score to 5.
We would like to thank the reviewer for their suggestions, which will improve the understandability of the paper, and for considering raising the score.
The paper introduces DeCaFlow, extending causal normalizing flows to hidden-confounder settings via proxy variables and proximal identifiability, and provides a practical encoder/decoder flow architecture that answers many causal queries from a single trained model. Reviewers acknowledged the importance of the problem, clear theoretical contributions (identifiability conditions and a query-check algorithm), efficiency relative to query-specific estimators, and competitive results on semi-synthetic benchmarks and a real application. The discussion improved the paper with clearer intuitions (proxies, do-calculus, CNF implementation), an explicit limitations section, and a more detailed treatment of assumptions and robustness to graph misspecification. The authors should (i) consolidate the new explanations and limitations, including sensitivity to noisy graphs and proxy quality; (ii) add practical guidance on model selection (e.g., MMD-based validation) and priors; and (iii) outline pathways for broader real-world evaluation (soft interventions, additional datasets) and for integrating with causal discovery.