/10

Rejected4 位审稿人

最低1最高3标准差1.0

ICML 2025

☕ Decaf: A Deconfounding Causal Generative Model

Alejandro Almodóvar,Adrián Javaloy,Juan Parras,Santiago Zazo,Isabel Valera

提交: 2025-01-23更新: 2025-06-18

TL;DR

We propose a causal generative model that solves causal queries, including counterfactuals, in presence of hidden confounders.

摘要

关键词

hidden confounderscausal identifiabilitycausal querylatent variable modelcounterfactuals

评审与讨论

审稿意见

评分: 12025-03-13

To identify interventional and counterfactual queries in the presence of hidden confounders, this work proposes Decaf, an encoder-decoder architecture that combines causal normalizing flow (CNF) (Javaloy et al., 2023) as the decoder with conditional normalizing flow (CdNF) as the encoder. The key idea is to treat hidden confounders $Z$ both as the conditional variables in the CNF and as the latent variables (i.e., the latent code) in the ELBO of variational inference, while exogenous variables $U$ remain the latent variables in the CNF. The identifiability result is derived by combining the backdoor adjustment and the two-proxy approach from Miao et al. (2018). Counterfactual identifiability is established using the twin SCM technique (Balke & Pearl, 1994).

给作者的问题

Algorithm 4, Line 3: Is it necessary to “estimate the mean”? This line can be removed; the algorithm can be run multiple times, and the final output is averaged. In fact, I belive that taking the average of $Z$ first is wrong: the mean of a function is not equal to the function of a mean.

论据与证据

(Critical) In the informal Proposition 6.1, clause (iii) incorrectly claims that Decaf identifies causal queries using proxy variables. While the formal Proposition A.2 correctly states that causal queries are identifiable, Decaf itself does not achieve this identification. Specifically, when computing the distribution of $Z$ using the encoder (Equation 5), it does not solve the Fredholm integral equation proposed in Miao et al. (2018), as is done in [1,2,3,4]. This is similar to the mistake made in the CEVAE paper [5], which cites identifiability results for proxy variables but proposes an estimator that does not leverage them. To illustrate the issue more starkly, consider an analogy: we have the standard adjustment formula (Equation 9), but instead of using the necessary adjustment variables $\bf{b}$ , we simply regress $Y$ on $T$ naively.

Furthermore, as is clear in Proposition 6.1 clause (i), the backdoor adjustment does not work in the presence of hidden confounders. Thus, the method principally fails under hidden confounders, contrary to the paper’s main highlight.

(Over-stretch the significance). The contrast that Decaf enables “training once [and] comput[ing] any causal query on demand” while other methods “are tailored to a specific causal graph” and “train[ed] one model per query” is misleading. Decaf is also “tailored to a specific causal graph” because "the given causal graph" is explicitly embedded into the CNF (Equation 3). While Decaf allows querying multiple causal effects within a large graph, other methods can be used as building blocks to deal with large graphs as well. In fact, Decaf’s encoder would likely need to perform similar operations to those in [1,2,3,4].

Additionally, Proposition A.2 does not appear to be qualified as “a generalization of the results previously presented by Miao et al. (2018) and Wang & Blei (2021)”. The combination and modification seem rather straightforward.

方法与评估标准

In its current form, there is no clear justification for using a CdNF as the encoder—any neural network would do (if the method really works). This suggests the critical problem mentioned earlier: arbitrary NNs, including CdNF, won't do; a more appropriate approach would be to incorporate elements similar to those in [1,2,3,4]. I will revisit this in Experimental Design and Analysis.

理论论述

Despite the critical concern mentioned above, the formal theoretical statements are reasonable, though I did not check the proofs. Below are some non-fatal but important points.

The assumptions and mathematical formulation regarding the invertibility of $\bf{f}$ are unclear. The core requirement, I believe, is that for any given value of $\bf{z}$ , the vectors $\bf{x}$ and $\bf{u}$ must be connected by an invertible function. This is a significant assumption that the paper does not clearly state.

Moreover, the notation is incorrect. According to the paper’s convention, $\bf{f}$ denotes the vector collecting the functions $f_i$ . However, this differs from the whole-system function that connects $\bf{x}$ and $\bf{u}$ given $\bf{z}$ . Stating that the vector $\bf{f}$ is invertible is, at best, ambiguous: on the one hand, we certainly do not mean that each $f_i$ is invertible (which they cannot be); on the other hand, what could we mean by that, then?

Also, (2) should be written as $T_{\theta,z}(\mathbf{x})=\mathbf{u}$ and $\mathbf{x}=T_{\theta,z}^{-1}(\mathbf{u})$ where $T_{\theta,z}(\cdot):=T_{\theta}(\cdot, \mathbf{z}=z)$ , because the invertibility is between $\bf{x}$ and $\bf{u}$ given $\mathbf{z}=z$ . The current notation reads as if $T_{\theta}$ was at same time an invertible function between $(\mathbf{x, z})$ and $\bf{u}$ and an invertible function between $(\mathbf{u, z})$ and $\bf{x}$ , which is very confusing.

实验设计与分析

The experiments themselves appear sound, and I do not suspect bugs. However, I hypothesize that dataset biases contribute to the reported performance.

For instance, in Section 7.1, if the artificial variables follow Gaussian (or more generally, exponential family) distributions and the functions are invertible, these properties align with the inductive bias of normalizing flows, potentially leading to favorable results. Similar biases—particularly those related to exponential family distributions—could be present in other datasets as well.

A useful sanity check would be to replace the CdNF encoder with several types of NNs and observe whether the results change significantly.

补充材料

I reviewed all formal theoretical statements in the Appendix and examined Sections C (Do-operator) and E (Algorithms) in detail.

与现有文献的关系

While I've never recommended removing references, I believe it is justified in this case. The CEVAE paper is better not to be cited, as this submission itself demonstrates how that work can mislead future research. I suggest the authors have a look at [5].

[5] Rissanen, Severi, and Pekka Marttinen. "A critical look at the consistency of causal estimation with deep latent variable models." Advances in Neural Information Processing Systems 34 (2021): 4207-4217.

遗漏的重要参考文献

The following papers are essential for understanding estimation based on Miao et al. (2018):

[1] Shi, Xu, et al. "Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding." Journal of the Royal Statistical Society Series B: Statistical Methodology 82.2 (2020): 521-540.

[2] Cui, Yifan, et al. "Semiparametric proximal causal inference." Journal of the American Statistical Association 119.546 (2024): 1348-1359.

[3] Mastouri, Afsaneh, et al. "Proximal causal learning with kernels: Two-stage estimation and moment restriction." International Conference on Machine Learning. PMLR, 2021.

[4] Kompa, Benjamin, et al. "Deep learning methods for proximal inference via maximum moment restriction." Advances in Neural Information Processing Systems 35 (2022): 11189-11201.

其他优缺点

Despite the critical concern I mentioned, the work is solid for the most part. And, set aside the issues in claims and mathematical formuations, the paper is generally well written. I am willing to raise my score if the authors can show that my critical concern is invalid.

其他意见或建议

The term “amortized” in the abstract is never explained and appears unnecessary. The approach can be understood simply as variational inference.
L267: “We find that an interventional query is identifiable if…” This is not an original finding, as mentioned earlier.
L1540: Editorial error in the title—it should be “counterfactual”.
Algorithm 2, Line 4, and Algorithm 4, Line 6: These steps resemble Abduction, not Action. In particular, Action should not modify exogenous variables. Here, the update of the latent $U$ occurs because it differs from the true exogenous variables by an invertible transformation of the whole system (cf my Theoretical Claims section).
There is another well-cited, but not quite related paper, named Decaf. Van Breugel, Boris, et al. "Decaf: Generating fair synthetic data using causally-aware generative networks." Advances in Neural Information Processing Systems 34 (2021): 22221-22233.

作者回复

2025-04-01

We thank the reviewer for the thorough comments. We will revise our work to include the clarifications below and improve its clarity. We believe that the necessary (and already-implemented) changes are not substantial and thus hope the reviewer will reconsider their assessment.

clause (iii) incorrectly claims that …

We thank the reviewer for pointing out this unfortunate wording. Please refer to the response to reviewer Ui4z for more details.

does not solve the Fredholm integral equation [...] This is similar to the mistake made in the CEVAE paper [5] …better not to be cited…

We acknowledge we take an alternative approach and, akin to Wang and Blei (2021), Decaf implicitly solves the Fredholm integral equation by modeling $p_\theta(x|z)$ consistently with the causal graph (see Eqs. 11-16 in Prop. A.2). Importantly, unlike CEVAE, we do not attempt to recover the true $z$ , and our decoder is a CNF which is identifiable given $z$ , as shown in the original CNF article. We only advocate to use Decaf under specific conditions: we have been very careful with our assumptions, theory, and queries we can accurately estimate (see lines 305-316). We understand the reviewer’s concerns on CEVAE, and will extend the discussion in line 1575 to reflect the criticisms in [5].

…the backdoor adjustment does not work in the presence of hidden confounders.

Note that Decaf can estimate queries for which any of the conditions in Prop 6.1 hold. We believe the reviewer’s example is similar to that of Fig. 11, which we can estimate if there exists proxy variables. Please correct us if we misunderstood the statement.

Decaf is also “tailored to a specific causal graph” …

We meant that Decaf is not designed around any particular causal graph, but that it can be applied to any given DAG, as we extensively show in our experiments. We have experimented with diverse DAGs with tens of variables and multiple confounders (see, e.g., Fig. 1, 7 and 8), demonstrating the flexibility and effectiveness of Decaf in accurately estimating causal queries that are or not affected (see, respectively, Sec. 7 and App. B.3.2) by hidden confounders.

… other methods can be used as building blocks to deal with large graphs as well.

While we agree with the statement, combining such blocks probably requires expert handcrafting, and we are unaware of packages automating this process.Thus, we believe that the reviewer's statement further reinforces the Decaf’s main contribution, i.e., providing a practical and theoretically grounded approach that amortizes parameters and training to estimate any (identifiable) causal query.

Prop. A.2 does not appear to be qualified as “a generalization of [...]”.

Prop. A.2 generalizes existing results as it accommodates for the covariate $c$ and the adjustment set $b$ , which is critical for the results that follow in App. A.3 and was absent in previous works.These are novel results, even if not difficult to prove, which support our main contribution.

there is no clear justification for using a CdNF as the encoder…

Since the encoder has to match $p_\theta(z|x)$ as closely as possible, a conditional NF allows us to approximate any given density. However, we agree that other networks could be used, and thus we run an ablation on Sachs to test different encoder options (see here).

The assumptions and mathematical formulation regarding the invertibility of f are unclear.

We thank the reviewer for pointing out this imprecision. We will clarify in lines 124 and 163 that invertibility of f is conditioned on $z$ , as we already do for Decaf in lines 157-159. We also agree that the reviewer’s notation is more clear, and will adopt it in the revised paper.

I hypothesize that dataset biases contribute to the reported performance.

We politely disagree with this statement. First, we generate the data using the code of Chao et al. (2023) which uses random weights and nonlinear functions (see here). This does not resemble Decaf’s architecture. Second, we can always find a mapping from $x$ to a standard Gaussian using the Knuth-Rosenblatt map, irrespective of whether this was originally the case.

These steps resemble Abduction, not Action…

As detailed in App. C, our abduction-action steps resemble the ones of the original CNF, where the authors showed the equivalence between the implementation of their do-operator (which modifies the u’s) and the standard do-operator on SCMs.

Alg. 4: Is it necessary to “estimate the mean”?

For counterfactual estimation, we decided to generate a single sample by taking the average latent representation, which is a common approach in latent-variable modelling [1-2].

[1] Be more active! understanding the differences between mean and sampled representations of variational autoencoders (2023)

[2] Challenging common assumptions in the unsupervised learning of disentangled representations (2019)

审稿人评论

2025-04-08

Thank you for the clarifications. However, my main concerns remain.

“Decaf provides accurate estimates when a causal query is proven to be identifiable” “Decaf implicitly solves the Fredholm integral equation”

I appreciate the intent, but I still cannot see how Decaf implicitly solves the Fredholm integral equation. Neither the CNF nor the CdNF architecture incorporates the necessary moment restrictions or estimation techniques found in works like Miao et al. (2018) or follow-ups [1–4]. Without such mechanisms, Decaf is not a valid estimator based on the identifiability result.

“We meant that Decaf is not designed around any particular causal graph”

But in your experiments, for each dataset, I believe you build a specific causal graph into Decaf—i.e., into the CNF via Equation (3). So Decaf is tied to a particular graph, and while the graph may be large enough to support multiple queries, the method is still tailored to that structure. This weakens the contrast drawn with other methods.

On Proposition A.2, you seem to agree that “the combination and modification seem rather straightforward”? If so, I suggest making this more transparent in the paper—clearly explaining what is novel and what is inherited or adapted from existing results (e.g., Miao et al., Wang & Blei, etc.).

The use of the Knothe-Rosenblatt map to justify the invertibility of the encoder is reasonable. I recommend adding this explanation—at least in the Appendix—for completeness. Still, the consistent empirical advantage of CdNF over MLP in your experiments deserves further explanation. If CdNF is not theoretically necessary for identifiability, what inductive bias or training behavior explains the performance gain?

作者评论

2025-04-09

We thank the reviewer for engaging with our rebuttal.

I still cannot see how Decaf implicitly solves the Fredholm integral equation [...]. Without such mechanisms, Decaf is not a valid estimator based on the identifiability result.

We politely disagree with the reviewer statement that Decaf is not a valid estimator of identifiable causal queries. Both our theoretical results and empirical evidence demonstrates otherwise. Our proof of Prop. A.2 clearly shows that the estimated causal query by Decaf is equal to the true causal query (Eqs. 17-21). The proof shows that the solution to the integral equation of Decaf (Eq. 10) also solves the equation of the original model (Eqs. 29-32). I.e., modelling the data following the true causal graph does solve the integral equation implicitly, meaning that we do not have access to the solution of the integral equation, $\tilde h$ , but we do not need it as we can compute $\tilde p(y|do(t), c)$ . Please, refer to (Wang and Blei, 2021, Sec 2.2), whose theoretical results follow the same argument.

We have carefully revisited both, our proof and the one in (Wang and Blei, 2021, Sec 2.2), and could not find any error. Also Reviewer Ui4z stated: “I checked the proof of proposition 6.1 (A.2) in the appendix. There are no issues with the proof and I can see the rest of the proofs have the same quality so I don’t worry about their validity.” Therefore, we respectfully invite the reviewer to check our proof, and provide evidence to support their claim about the invalidity of Decaf.

I believe you build a specific causal graph into Decaf—i.e., into the CNF via Eq. (3) [...]

We do not fully understand nor share the reviewer's concern, especially the statement “This weakens the contrast drawn with other methods”.

First, for all the papers referenced by the reviewer [1-4], the causal graph is always the same, and only one specific causal query can be estimated. Therefore, those methods are limited to a specific causal graph and query. If one wants to estimate another causal query with the same data, they need to train another model, even if the query can be estimated through adjustment.

Second, Decaf is not designed with a specific causal graph in mind and it can estimate every possible query in that causal graph. We have clearly stated that we need to train a Decaf model for each specific pair of observed data and causal graph. That said, Decaf is able to: i) model any causal graph as long as it is a DAG; and ii) estimate all queries that are identifiable as long as the true SCM fulfills our assumptions. These queries include those solvable by proximal inference as well as through adjustment. Moreover, the methods in [1-4] do not compute counterfactuals, which is also a main contribution of Decaf.

We find this a significant difference from a practical point of view, which is our main focus, as stated in lines 84-85: “Decaf offers a practical and efficient solution for causal inference in the presence of hidden confounding”.

On Prop. A.2, you seem to agree that “the combination and modification seem rather straightforward”? If so, I suggest making this more transparent in the paper—clearly explaining what is novel and what is inherited or adapted from existing results (e.g., Miao et al., Wang & Blei, etc.).

We appreciate the suggestion. However, note that this is already stated in the main paper (lines 266-268) as well as in the appendices (lines 827-832) and we clearly state how to recover the existing results in lines 937-940. Although the generalization can be regarded as “simple”, it is crucial to model counterfactuals, and we consider to have been honest and transparent already in our manuscript.

The use of the Knothe-Rosenblatt map to justify the invertibility of the encoder is reasonable. I recommend adding this explanation—at least in the Appendix [...] If CdNF is not theoretically necessary for identifiability, what inductive bias or training behavior explains the performance gain?

We will add this discussion in full detail to the Appendix.

We will also clarify that CdNFs present several advantages for our purposes. First, it allows us to factorize the posterior distribution according to the correct dependencies inferred from the causal graph (we refer to the discussion with reviewer fBNE for more details). This can be observed in this figure, where we expand Fig. 2 of the paper. There, we observe how the CdNF correctly handles the dependencies of the posterior distribution. Second, a CdNF is a universal density approximator and, more importantly, does not restrict the posterior to a specific parametric form (e.g. Gaussian), unlike an MLP.

Note about the name of the model

We will rename our model to DeCaFlow.

We hope that this reply answers all the concerns, and that the reviewer updates their assessment accordingly.

审稿意见

评分: 12025-03-13

The paper introduces Decaf, a normalizing-flow based causal generative model that can sample interventional and counterfactual data when given the graph and trained on observational data. Importantly, as opposed to many prior works, Decaf does not assume causal sufficiency (i.e., unobserved confounding may be present), and the paper also shows the identifiability of the queries of interest. Experiments show that Decaf outperforms many competitors.

To summarize my review, I cannot recommend acceptance for the current form of the paper, largely due to the issues I have with the claims and evidence, discussed below. I am open to discussion in case I misunderstood something about the paper.

给作者的问题

Please let me know if there were details about the paper that I missed or misunderstood.

论据与证据

The claims of the paper are as follows:

Claim 1: Decaf is the first causal generative model that accounts for hidden confounders given observational data and the causal graph.

This claim seems to be false; see “Relation to Broader Scientific Literature” section below. There have already been models developed that handle unobserved confounders and are more general than Decaf in many ways.

Claim 1b: Decaf is the first causal normalizing flow-based causal generative model that accounts for hidden confounders.

This would be the natural claim if claim 1 is false, and it is still a strong claim. Indeed, incorporating hidden confounders is not an easy task in normalizing flow-based models. However, I find this dubious too.

First of all, the form of $\mathbf{Z}$ and the way that the model incorporates the graph is not clear. The way that the proposed model jointly performs the inference on $\mathbf{z}$ seems to be problematic. One example of a graph that is particularly challenging for a normalizing-flow design is the graph $A \rightarrow B \rightarrow C$ , with confounding between $A \leftrightarrow B$ and $B \leftrightarrow C$ , but notably not between $A$ and $C$ . In this case, one implied constraint is that $P(c \mid do(a, b)) = P(y \mid do(b))$ . However, given the architecture of Decaf, it seems that both sets of confounding are modeled in $\mathbf{z}$ . $T_{\theta}$ would take $a, b, c$ as input, and $T^{-1}_\theta$ would output $a, b, c$ . It looks like this constraint is lost, and instead, the problem is modeled with joint confounding between all three variables. The paper does not seem to discuss anything about the implied constraints of the graph, and it seems that this is one case where two different graphs implying different identifiable results are not distinguishable through the Decaf architecture.

Second, the model does not seem to be able to handle all possible counterfactual queries, even when identifiable. For example, one may be interested in the query $P(y_x, z_w)$ , where we evaluate $Y$ under an intervention on $X$ while simultaneously evaluating $Z$ under an intervention on $W$ . If these cannot be sampled, that would imply a lack of generality of the Decaf architecture when compared to SCMs.

Claim 2: Decaf identifies all causal queries and counterfactual queries under certain conditions.

This claim is mostly correct but seems to lack generality. See “Theoretical Claims” section.

Claim 3: Empirical results with Decaf outperform existing approaches.

This claim is mostly correct. See “Experimental Designs or Analyses” section.

方法与评估标准

Leveraging normalizing flows for this problem is an interesting concept, but it may have some issues (see above).

理论论述

The paper makes two claims about the identifiability of queries for Decaf.

Prop. 6.1 seems to be sound. However, it is strange that it seems to only prove identifiability for specific queries in specific families of graphs, such as queries that are identifiable through adjustment. There exist queries in graphs that do not satisfy this criterion (e.g, see napkin graph).

Prop. 6.2 is not a property that holds in general, but looking in the proof, it seems that it is specifically about cases where $P(\mathbf{y} \mid do(\mathbf{t}))$ is identifiable from proxy variables. This may be important to state in the proposition. Additionally, it is again strange that only this particular family of counterfactual quantities is shown to be identifiable, as opposed to a more general set of counterfactuals. Is the paper saying that for queries not covered by Prop. 6.1 and 6.2, Decaf makes no claim?

I should also point out that it is somewhat misleading that the paper claims that Decaf identifies queries, since it is not Decaf itself that is performing the inference, but rather the proofs of the paper that show that the query is identifiable.

实验设计与分析

The experiments are extensive and show fairly conclusively that Decaf achieves lower estimation error when compared to other models. It is also impressive that Decaf can be applied to such large graphs. The qualitative results surrounding the inference of the confounders are also interesting. I would be curious to see if there is a task that could leverage the generative capabilities of Decaf, as opposed to simply performing an estimation task.

补充材料

I did not evaluate the supplementary material, except to check the validity of the proofs of Sec. 6.

与现有文献的关系

The paper does not acknowledge the work of some key papers that develop causal generative models that already handle unobserved confounding.

[Goudet’17] develops one of the earliest causal generative models called the Causal Graph Neural Network (CGNN), and Sec. 6 of the paper explicitly discusses how the model could handle unobserved confounders.

[Xia’21] introduces a causal generative model called the Neural Causal Model (NCM) and also discusses theoretical properties such as expressiveness and causal constraints. Not only could NCMs handle general graphs with arbitrary confounding, they could be used to solve the identification problem directly, and later work [Xia’23] shows that they can handle more general problem settings such as with interventional datasets or querying arbitrary counterfactual quantities (even nested ones).

In contrast, the work presented in this paper appears to be less general. Only specific families of queries are discussed, and the model specifically works in the case with observational data. Identifiability appears to be proven in the paper for these specific families of queries, as opposed to providing a method to deduce identifiability.

Sources:

[Goudet’17] “Learning Functional Causal Models with Generative Neural Networks”, Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, Isabelle Guyon, David Lopez-Paz, Michèle Sebag

[Xia’21] “The Causal-Neural Connection: Expressiveness, Learnability, and Inference”, Kevin Xia, Kai-Zhan Lee, Yoshua Bengio, Elias Bareinboim

[Xia’23] “Neural Causal Models for Counterfactual Identification and Estimation”, Kevin Xia, Yushu Pan, Elias Bareinboim

遗漏的重要参考文献

See above.

其他优缺点

I appreciate that the assumptions are stated in the paper. However, there seem to be some assumptions that are missing. For example, some queries are only identifiable given positive probability in the dataset. Also, there seems to be an assumption on which particular identifiable queries Decaf is capable of handling. Another example is that there may be several regularity conditions related to the theoretical results arising from the normalizing-flow architecture.

其他意见或建议

It would be helpful to see an example architecture given a specific graph.

作者回复

2025-04-01

We appreciate all the feedback and references. Due to the space limit, we only respond to the most critical questions below.

issues I have with the claims

Claim 1:

We thank the reviewers for the references. We will better position our contributions with the related work and relax our claim accordingly. Refer to our response to Rev. fBNE for our comparison to [Xia’21] and [Xia’23]. Regarding [Goudet’17], we will include the reference yet, as far as we understand, the main focus of said work is on causal discovery.

Claim 1b:

We would like to stress that the proposed Decaf exploits all the information in the causal graph, including which variables are affected by each hidden confounder (see lines 172-173). For example, in Ecoli70, we separately model the three hidden confounders and their causal dependencies separately.

Like most existing works, while we have not explored simultaneous interventions in multiple variables, we do not see any technical limitation that prevents doing so and, thus, it is an interesting line for future work. Note that Decaf inherits the do-operator from CNFs, and nothing impedes us from applying a do-operation several times.

Claim 2:

Prop. 6.1 requires one of the three conditions to hold in order to identify the interventional query, and thus we can ensure accurate estimations also when proxy variables exist as in Miao et al. For counterfactual queries, Prop. 6.2 clearly stated that they are identifiable if their interventional counterpart is so too (under any of the three assumptions). The proof of Prop. 6.2 refers to Prop. A.2 as this is the most general case, and all the other results in the appendix can be derived from it as special cases (e.g. assuming $t \in ch(z)$ ).

Unfortunately, we fail to understand what the reviewer means by “particular family of counterfactual quantities”. If the reviewer means, e.g., conditioning only on a subset of factual variables or performing multiple interventions, we will add these as lines for future work. Importantly, as shown in our empirical results, Decaf provides a practical, yet theoretically grounded, approach to accurately estimate a large number of causal queries in large graphs, as highlighted by the reviewer.

The Napkin graph is indeed an interesting case for which can also prove identifiability by reducing the query to smaller ones, similar to what we do in the frontdoor example of App. A.2.3. We will include the proof in the revised paper, which we have already empirically corroborated with these figures.

Claim 3:

We firmly believe that our empirical evaluation is thorough and enough to demonstrate both the capabilities (see, e.g., Section 7) and limitations (see, e.g., App. B.3.2) of Decaf. If the reviewer has specific additional experiments of interest in mind, we will happily consider them for the revised version of the paper.

misleading that the paper claims that Decaf identifies queries

We thank the reviewer for pointing out this unfortunate wording, which arose from our efforts to summarize our results and particularize them to Decaf. We will rewrite our Proposition 6.1 (and A.2) to clearly state that, indeed, Decaf provides accurate estimates when a causal query is proven to be identifiable through careful causal analysis and appropriate assumptions (which we already stated in lines 305-316). We have carefully revised the manuscript to clarify this important distinction along the paper.

some assumptions that are missing…identifiable given positive probability in the dataset

The positivity assumption is included in Def. 3 and we assume Decaf to perfectly match the observational distribution in line 185. We will make this assumption more explicit with regards to the available dataset used to train Decaf. We believe no regularity assumptions are missing, but we would appreciate it if the reviewer could be more specific so that, if any, we could properly include them.

I would be curious to see if there is a task that could leverage the generative capabilities of Decaf, as opposed to simply performing an estimation task.

We (partially) evaluate this by checking observational and interventional MMD in the appendices and the fairness use-case, which shows how to build fairer classifiers. Further analysis of the generative capabilities of Decaf are deferred to future work.

It would be helpful to see an example architecture given a specific graph.

We agree it can help and will add some examples in the revised paper. For now, we refer the reviewer to the original CNF paper for details on the architectural design of the networks (minus conditioning).

We believe that the clarifications above help improve the clarity of the paper and thus we will revise the paper accordingly. We believe that the necessary changes in the paper (which we have already implemented) are not substantial and thus hope that the reviewer will reconsider their assessment.

审稿人评论

2025-04-05

Thank you for the clarifications. I appreciate the analysis comparing to the related works, and I think it sheds light on the contributions of Decaf. I also appreciate that the authors are willing to add clarifications regarding assumptions, contributions, and examples in the revised version. That said, I still have some concerns.

Claim 1b

I don't think this addresses my comment. I understand that the intent is the Decaf should exploit all information in the causal graph, but I believe the example I gave (with $A \rightarrow B \rightarrow C$ , $A \leftrightarrow B$ , $B \leftrightarrow C$ ) is a counterexample.

we have not explored simultaneous interventions in multiple variables

Unfortunately, we fail to understand what the reviewer means by “particular family of counterfactual quantities”. If the reviewer means, e.g., conditioning only on a subset of factual variables or performing multiple interventions, we will add these as lines for future work.

It seems that this is being used as a counterargument regarding the lack of generality of Decaf being able to represent the full set of counterfactual queries. I am OK with this being deferred to future work, but I would then expect to see the contributions revised to say that the model only handles a specific family of counterfactual queries.

The Napkin graph is indeed an interesting case for which can also prove identifiability by reducing the query to smaller ones, similar to what we do in the frontdoor example of App. A.2.3. We will include the proof in the revised paper, which we have already empirically corroborated with these figures.

I appreciate the reference to the empirical results. That said, it is unclear to me how Decaf is able to represent this graph, architecturally speaking. There is unobserved confounding between $X \leftarrow W$ , and there is unobserved confounding between $W \leftrightarrow Y$ , but there is no unobserved confounding between $X$ and $Y$ (otherwise it would not be ID). How is this handled?

We believe no regularity assumptions are missing, but we would appreciate it if the reviewer could be more specific

Since Decaf is designed to handle continuous cases, there are many potential cases where the generating model could be poorly behaved. For example, in any query where a hard intervention is performed, say $do(X = 1)$ , it is possible that $do(X = 0.9999)$ and $do(X = 1.0001)$ have completely different behavior. It seems that there may implicitly need to be some kind of smoothness requirement to have any kind of theoretical guarantee.

作者评论

2025-04-08

We appreciate the engagement from the reviewer, and we are certain we can further clarify the reviewer’s questions.

... the example I gave [...] is a counterexample.

Let us provide further details on how Decaf models both graphs differently (and thus it is not a counterexample), now that we are not as limited as in the previous rebuttal by space constraints. Refer as CASE 1 to the graph where the three variables $A,B,C$ are affected by the same hidden confounder $Z$ ; and CASE 2 to the graph where $A \leftrightarrow B$ and $B \leftrightarrow C$ , i.e., there are two (a priori) independent confounders, $Z_1$ and $Z_2$ , affecting respectively $A,B$ and $B,C$ . Next we discuss how Decaf models (and thus distinguishes) both cases at both the encoder and decoder networks:

CASE 1:

Encoder: The posterior of $Z$ depends on all the three observed variables, i.e., $p(Z | A, B, C)$ (and thus can not be factorized). That is, the encoder in Decaf will receive an adjacency matrix that connects all $A,B,C$ to $Z$ . See our response below for further details on the structural constraints of Decaf.
Decoder: As in the original CNF, the decoder needs to model a causally consistent data-generating process. In this case, it means that $p(A, B, C|Z) = p(A|Z) p(B|A, Z) p(C|B, Z)$ . Causal relationships are enforced in the decoder using the corresponding adjacency matrix.

CASE 2:

Encoder: For the posterior of the hidden confounders $Z= \{Z_1, Z_2\}$ , we may factorize it as $p(Z | A, B, C)= p(Z_1, Z_2 | A, B, C)=p(Z_1 | A, B) p(Z_2 | Z_1, B, C)$ , such that the Decaf encoder will connect $A,B$ to $Z_1$ and $B,C, Z_1$ to $Z_2$ .
Decoder: In this case, the decoder factorizes as $p(A, B, C|Z_1, Z2) = p(A|Z_1) p(B|A, Z_1, Z_2) p(C|B, Z_2)$ .

We would like to remark that, as discussed with Reviewer fBNE, we have generalized Eq. 5 to capture more general cases. We refer the reviewer to the discussion with fBNE for more details.

… it is unclear to me how Decaf is able to represent this graph, architecturally speaking... How is this handled?

Architecturally speaking, both the Decaf encoder and decoder build upon the structural constraints of Masked Autoencoder for Distribution Estimation (MADE) [1], already exploited by the original CNF to ensure causal consistency. More specifically, Decaf exploits MADE [1] and the explicit knowledge of the (directed and acyclic) causal graph, to impose a specific functional dependency (given by an adjacency matrix) in two conditional normalizing flows. In that way, the decoder (respectively, encoder) generates each endogenous variable (or hidden variable) using only those variables that we know cause them (or that appear in the corresponding posterior of the new Eq. 5 above). For details on the specifics on the implementation and nuances of this masking we refer the reviewer, if interested, to [2].

For completeness, we provide here an expanded version of Fig. 2 of the paper for the napkin graph, explicitly showing how each module generates each variable, complementing the explanation given above.

[1] MADE: Masked Autoencoder for Distribution Estimation

[2] Structured Neural Networks for Density Estimation and Causal Inference

… implicitly need to be some kind of smoothness requirement to have any kind of theoretical guarantee.

We would like to highlight that, as explicitly stated in our assumptions paragraph (lines 122-125), we assume the true SCM to have $C^1$ -diffeomorphic causal equations which implies that, given $z$ , the data-generating process is assumed to be continuous, differentiable, and invertible with respect to the exogenous variables $u$ . Such an assumption is also made for Decaf in Section 5, as it relies on a conditional CNF. Thus, all our assumptions have already been explicitly written in the paper. That said, we will add an assumption paragraph (similar to the one in lines 122-125) to Section 5 to make sure that the reader does not miss such important information when using Decaf.

... I would then expect to see the contributions revised to say that the model only handles a specific family of counterfactual queries.

Despite the lack of explicit confirmation, we understand from the reviewer’s answer that the examples mentioned in our rebuttal are indeed the more general queries they were referring to. We will revise the contributions to clearly specify what (broad) family of queries we have considered in this work, and which ones we defer for future ones.

We hope that with the above response, the reviewer does not have any outstanding concerns and, as a consequence, they would consider updating their score on our paper accordingly. Thanks again for the feedback!

审稿意见

评分: 32025-03-14

This paper proposes a Causal Generative Model (CGM) that can identify all causal queries under certain conditions. The architecture of the model is an encoder-decoder network where both encoder and decoder are conditional normalizing flows, which is constrained by an assumed causal graph. The authors then informally state that the model is able to identify a given interventional query under 3 possible conditions. Finally, the authors test their model on synthetic and semi-synthetic data.

给作者的问题

No questions beyond my comment in the claims section.

论据与证据

Partially. What I understand of the proposed approach is a plug-in estimator of structural causal models and everything that derives from it; that is, interventions and counterfactuals. So I believe some of the theoretical statements are not necessarily supported by evidence. For example, in proposition 6.1, the authors write

“Decaf is able to identify a given interventional causal query if one of the following exists: i) a valid adjustment set b not containing z, ii) an invalid one where p(b | do(t)) is identifiable, or iii) sufficiently informative proxy and null proxy variables.”

But that is true with any estimation method and not particular to the architecture they are proposing. This is also seen in the proof of proposition A.2. on the appendix and the algorithms for identification on the appendix, which again, don’t depend on the method.

This is not to say that the architecture is not valuable but more to say that there might be some statements stronger than the evidence.

方法与评估标准

Yes, they are consistent with the theory they are presenting.

理论论述

I checked the proof of proposition 6.1 (A.2) in the appendix. There are no issues with the proof and I can see the rest of the proofs have the same quality so I don’t worry about their validity.

实验设计与分析

Yes. Whatever was on the main paper plus appendix B where the authors describe the data generation process.

补充材料

See above.

与现有文献的关系

I think the proposed architecture is an interesting addition to the causality research, expanding the method by Javaloy et.al. (2023) to deal with hidden confounders.

遗漏的重要参考文献

I would say yes. Since they draw a lot of inspiration from Miao et.al. (2018), Wang and Blei (2019) and Javaloy et.al. (2023), I don’t know how they could have missed the discussion on D’Amour (2019). Indeed, D’Amour even proposes using proximal causal methods, like Miao et.al. (2016; 2018) and the authors of this paper, to solve some of the problems in Wang and Blei.

其他优缺点

Strengths:

I particularly liked the combination of the work by Miao et.al. (2018), Wang and Blei (2019), and Javaloy et.al. (2023). In particular, extending the work of Javaloy et.al. to include unobserved confounders is important, given this is a common phenomenon in the real world.
The proofs are of very high quality.

Weaknesses:

See claims and prior work.

其他意见或建议

None.

作者回复

2025-04-01

We thank the reviewer for their thorough review and appreciation of our paper, especially regarding the quality of our proofs. In the following, we clarify the points raised by the reviewer, which are of great help to further improve the presentation of our contributions.

I believe some of the theoretical statements are not necessarily supported by evidence. [...] But that is true with any estimation method and not particular to the architecture they are proposing.

I don’t know how they could have missed the discussion on D’Amour (2019).

We were indeed aware of D'Amour's work, and therefore apologize for missing the reference. We have already revised our paper to include this relevant reference. D'Amour’s work is a key citation for our discussion in Appendix D, as it recognizes the use of proxy variables as an appropriate (alternative to use only active treatments) way to approach causal inference under hidden confounding. In our paper, similar to Wang and Blei (2021), we use this result to build on Miao et al. to propose a practical approach to causal inference under hidden confounding.

We are greatly surprised the reviewer liked our the combination of other three great works we present here with Decaf, and thanks them once again for their feedback. We hope the changes implemented during the rebuttal help improve their assessment on our work as much as they helped improve the quality of the same.

审稿意见

评分: 32025-03-19

The paper proposes a method called Decaf that learns a causal generative model in presence of unobserved confounders. After training the model, the model can perform interventional and counterfactual estimation. Finally, the authors showed empirical evaluation on the Ecoli70 dataset.

给作者的问题

Should there be any cross entropy term in equation 7?
How do the authors obtain $p_{\theta} (z|x)$ in equation 8? Isnt $z$ a conditional input in the $T_{\theta}$ model?

论据与证据

The authors considered a confounded structural causal model where $x_i= f_i(pa(i), u_i, z)$ . Its not clear how the authors are doing the abducting step for $u$ . Even though $u_i$ are independent, during the abduction process, they become dependent due to conditioning on direct children $x_i$ or some descendant. If the authors are using the deconfounding network for the abduction step of $z$ , how are they abducting $u_i$ for specific $x$ ?
The authors mentioned that to obtain posterior of $z$ , they obtain independent hidden confounder $z_k$ by conditioning on their children. However, please consider this graph: $x_1 \leftarrow z_1 \rightarrow x_2 \leftarrow z_2 \rightarrow x_3$ . We can not obtain correct posterior of $z_1$ by only conditioning on $x_1,x_2$ because, conditioning on $x_2$ makes $z_1$ independent on $x_3$ . The authors should defend how they address such case.

方法与评估标准

The authors mentioned that existing works in their relevant area either consider causal sufficiency or tailored for specific causal graph. However there exists multiple works on. causal generative models such as [1, 2, 3] which can estimate causal effects in presence of unobserved confounders for any causal graph when the causal query is identifiable. Also, the authors also claim that they are the first to identify counterfactuals in presence of hidden confounders. However to my knowledge [4] can estimate counterfactuals in presence of unobserved confounders for specific cases.
There are many identifiable causal queries that do not have a valid adjustment set. For example, the frontdoor causal graph. Where does the proposed algorithm fail if there does not exists an adjustment set?

Please check the Essential References section for the citations.

理论论述

The theoretical claims appear to be correct although not checked in detail.

实验设计与分析

I appreciate the author’s extensive experiments on synthetic, semi-synthetic (sachs, ecoli) and real word dataset.

补充材料

Yes, but not in detail.

与现有文献的关系

Proper connection has been established although some important citations are missing.

遗漏的重要参考文献

[1] Rahman, Md Musfiqur, and Murat Kocaoglu. "Modular learning of deep causal generative models for high-dimensional causal inference." arXiv preprint arXiv:2401.01426 (2024).
[2] Xia, Kevin, Yushu Pan, and Elias Bareinboim. "Neural causal models for counterfactual identification and estimation." arXiv preprint arXiv:2210.00035 (2022).
[3] Xia, Kevin, et al. "The causal-neural connection: Expressiveness, learnability, and inference." Advances in Neural Information Processing Systems 34 (2021): 10823-10836.
[4] Nasr-Esfahany, Arash, Mohammad Alizadeh, and Devavrat Shah. "Counterfactual identifiability of bijective causal models." International Conference on Machine Learning. PMLR, 2023.

其他优缺点

Strength: The paper is written in a nice and well-read manner.

其他意见或建议

See above.

作者回复

2025-04-01

We thank the reviewer for their positive feedback on our extensive experiments and questions, whose clarification will help to further improve our paper.

Its not clear how the authors are doing the abducting step for u.

Note that, given $z$ , the generative network becomes a regular CNF and thus we can perform causal inference as in the original CNF. More in detail, since the network is invertible given $z$ , we can simply evaluate $T_\theta(x^\text{factual}, z)$ (Eq. 2). Due to space constraints, these details are omitted from the main paper, but explained at length in Appendix C, as mentioned in line 175.

x1←z1→x2←z2→x3. We can not obtain correct posterior of z1 by only conditioning on x1,x2 because, conditioning on x2 makes z1 independent on x3.

We appreciate the reviewer input and agree that indeed, in such cases, our variational approximation of the posterior in Eq. 5 could be further improved by accounting for the dependencies between confounders, in the example via the factorization $q(z_1, z_2) = q(z_2 | x_3, x_2) q(z_1 | x_1, x_2, z_2)$ . We will update our Eq. 5 to account for more general cases, like the one in the example, as it will only further improve our results. Note that, there is only one variable, LacY, in Ecoli affected by two confounders, and in Sachs both confounders and jointly modeled.

However there exists multiple works on causal generative models such as [1-3] which can estimate causal effects in presence of unobserved confounders for any causal graph when the causal query is identifiable.

We appreciate the suggested references, which we have read and included to the related works. With this new context, we will re-estate our claims and the key differences that still make Decaf a significant contribution to the field, namely: Decaf amortizes parameters and its structure is much more scalable than that of [2,3]. Decaf approximates the posterior of the latent variables, allowing for counterfactual inference, which [1,3] cannot do. [2,3] are restricted to discrete and low-dimensional variables, whereas Decaf handles continuous variables and arbitrarily large graphs. [1] cannot handle confounders that affect more than 2 variables and therefore, the presence of proxy variables cannot be leveraged in the same way as we do.

However to my knowledge [4] can estimate counterfactuals in presence of unobserved confounders for specific cases.

We acknowledge that [2,4] can estimate counterfactuals under hidden counding for some cases, and we will relax our claims accordingly. We still believe that Decaf contributions are significant as it provides a practical, yet theoretically grounded, approach that amortizes parameters and training to estimate any (identifiable) causal query in large graphs with continuous variables (see review VqSV). In contrast, [2] and [4] focus on discrete variables, and require one network per variable. Furthermore, [2] requires a new model for each query and relies on rejection sampling for the abduction step.

Frontdoor causal graph. Where does the proposed algorithm fail if there does not exist an adjustment set?

We believe there is a misunderstanding with the assumptions in Prop. 6.1, as it provides three alternative sufficient conditions for query identifiability. If there is no valid adjustment set, we can still prove identifiability if either condition ii) or iii) holds. Moreover, we analyze the usual frontdoor graph in App. A.2.3 and show that, leveraging the frontdoor criterion and the identifiability of the confounded-outcome case (lines 256-261), we prove identifiability for that query too. We are working on generalizing our theoretical result in Prop. 6.1 to also include the frontdoor criterion.

Should there be any cross entropy term in equation 7?

We are afraid we do not fully understand the question. Eq. 7 expresses the ELBO by taking the prior $p(z)$ term from the KL divergence on $z$ to the first summand in Eq. 6, yielding the joint $p(x, z)$ minus the entropy, $H(q(z|x))$ . We use this formulation, as stated in lines 284-288, to reason about the information encouraged to be used within $z$ .

How do the authors obtain $p_\theta(z|x)$ in equation 8? Isnt $z$ a conditional input in the $T_\theta$ model?

We remark that we optimize the two networks in Fig. 2 using Eq.6. Instead, Eq. 8 aims to illustrate that optimizing the ELBO in Eq. 6 is equivalent to minimizing those two KLs, i.e., learning the data evidence with the generative network, and matching the intractable posterior with the deconfounding network.

We finally would like to thank the reviewer again, as the above answers have helped us further improve our paper and clarify its contributions, which we believe lie in a theoretically grounded, yet practical and efficient, approach for causal inference with hidden confounders.

审稿人评论

2025-04-05

Thanks to the authors for their rebuttal.

in such cases, our variational approximation of the posterior in Eq. 5 could be further improved by accounting for the dependencies between confounders, in the example via the factorization $q(z_1, z_2) = q(z_2 | x_3, x_2) q(z_1 | x_1, x_2, z_2)$ . We will update our Eq. 5 to account for more general cases.

The authors should make it clear how they would address this for more general cases as there might be sequence of confounders instead of just two.

作者评论

2025-04-08

We thank the reviewer for engaging in the discussion on our paper. Below, we answer the reviewer's comment.

The authors should make it clear how they would address this for more general cases as there might be sequence of confounders instead of just two.

In our previous response we had to be concise due to space limitations, but we agree with the reviewer and updated the paper accordingly. Specifically, we have generalized the previous factorization inEq. 5 to reflect these changes. Intuitively, we condition each $z_k$ with its children and the parents of each child (as they form a collider). This equation can be expressed as follows: $q(\mathbf{z} \mid \mathbf{x}) = \prod_k q\left(z_k \mid \text{ch}(z_k) \cup \bigcup_{c \in \text{ch}(z_k)} \left( \text{pa}(c) \setminus \{z_j : j \geq k\} \right) \right)$ . Note that this equation assumes a causal ordering between the different hidden confounders to avoid cyclic dependencies by excluding the posterior latents in the conditional factorization, i.e., either $z_i$ affects $z_j$ or $z_j$ affects $z_i$ , but not both. That ordering is arbitrary and does not affect the results, since the collider associations have no causal direction. Note also that our encoder is an autoregressive normalizing flow where only the specified connections in the above expression, given by an adjacency matrix representing a DAG, are enabled.

In addition, we are happy to share that we have repeated the Ecoli70 experiment with this new factorization for the encoder, and the obtained results have improved as foreseen in the previous response, obtaining comparable results to the Oracle model. We provide an updated figure in this link. Note that, for the Sachs experiment, both factorizations are equivalent, so results do not change.

We hope that with the above response, the reviewer does not have any outstanding concerns and, as a consequence, they would consider updating their score on our paper accordingly.

最终决定Reject

2025-05-01

This submission proposes a causal generative model to, as the authors put it, "provably identify all causal queries with a valid adjustment set or sufficiently informative proxy variables". A key claim is to prove that a confounded counterfactual query is identifiable if its interventional counterpart is. Two reviewers expressed valid concerns about these claims. The authors seem to conflate identification and estimation, which are not the same, and do not seem to be aware of the gap between counterfactual and interventional identification (see the Pearl hierarchy). At the very least, the writing fails to acknowledge these well-known gaps. Finally, some of the results are quite straightforward from existing results (Miao et al., 2018; Wang & Blei, 2021).

The empirical results are promising, and if the authors can (a) acknowledge the gaps above and (b) clarify how Decaf overcomes them, then this could make a nice contribution. It is worth noting that both these gaps are fundamental (no estimator can overcome them in a universal sense), which means that there must be some additional constraints that allow these gaps to be overcome. Clearly articulating these assumptions in a future revision will be crucial.

For these reasons, I cannot recommend acceptance in the current form.