Unified Covariate Adjustment for Causal Inference
We develop an estimation framework that can efficiently estimate a broad class of identifiable causal effects while achieving double robustness.
Abstract
Reviews and Discussion
The paper introduces a new framework for identifying causal estimands, referred to as Unified Covariate Adjustment (UCA).
It demonstrates that the UCA-expressible class (a class of causal estimands identifiable by UCA) is extensive, encompassing estimands identified by the (sequential) back-door adjustment, front-door adjustment, and Tian’s adjustment.
Furthermore, the paper proposes an estimation strategy for the UCA-expressible class using machine learning methods and analyzes the error of such estimators.
Notably, the proposed estimator is scalable and achieves the double robustness property.
Strengths

- The paper provides good examples to illustrate the relationship between the UCA-expressible class and other classes of estimands identified by different adjustments.
- It offers comprehensive identification and estimation strategies, thereby presenting a complete causal inference methodology.
- The theoretical analysis of the estimator's scalability and error characterization is solid.
Weaknesses

- While the paper provides sufficient conditions for an estimand being not UCA-expressible (i.e., necessary conditions for an estimand being UCA-expressible), it lacks necessary conditions for an estimand being not UCA-expressible (i.e., sufficient conditions for an estimand being UCA-expressible). However, addressing this issue might be beyond the scope of this paper.
- In the presence of unmeasured confounders, causal estimands (e.g., average treatment effects) may not be identifiable by UCA since identification requires bridge functions and a specifically designed ID algorithm (e.g., Shpitser et al. 2023, JMLR, The Proximal ID Algorithm). It may be worth mentioning these points at the end of Section 2, along with the "napkin" estimand.
- The authors claimed the estimator is doubly robust (page 8, line 307) under certain conditions, which include a convergence-rate condition on the nuisance estimators. However, I believe this condition may not be achievable, as the error tends to accumulate over the stages; see Questions below.
Questions
page 2, Table 1: Can the authors clarify or provide examples of why UCA does not cover the functionals identified by obsID/gID?
page 3, line 94: There are double commas.
page 4, line 145: Given that the probability is at most 1, does this mean the variable is discrete? If so, can the authors explain why it has to be discrete?
Section 3.1:
- Estimation of the nuisances: I think the error is not only affected by the estimation errors from the $i$-th stage, but also by the estimation errors from the previous stages. In other words, the estimation error accumulates as the index increases. Can the authors provide some comments on this error accumulation?
- Bias structure: The bias structure in Theorem 3 consists of same-stage products [error of $\mu^i$][error of $\pi^i$] and cross-stage products [error of $\mu^i$][error of $\pi^j$] ($i \neq j$). The second, cross-product term seems non-trivial. Can the authors provide an example of causal estimands whose bias structure contains only [error of $\mu^i$][error of $\pi^i$] terms, without cross-product terms?
Sections F.1.4 and F.1.5: Typo: `textttnormal` (an unrendered `\texttt` command).
Limitations
Limitations are not explicitly discussed in the paper.
The authors might consider including the weaknesses mentioned above as limitations.
We thank the reviewer for the time and valuable feedback, and appreciate the positive assessment of our work.
While the paper provides sufficient conditions for an estimand being not UCA-expressible (i.e., necessary conditions for an estimand being UCA-expressible), it lacks necessary conditions for an estimand being not UCA-expressible (i.e., sufficient conditions for an estimand being UCA-expressible). However, addressing this issue might be beyond the scope of this paper.
As mentioned, developing necessary conditions is challenging and beyond the scope of this paper. We believe this question has the potential to open a new direction for enhancing the proposed method. Thank you for the insightful question.
In the presence of unmeasured confounders, causal estimands (e.g., average treatment effects) may not be identifiable by UCA since identification requires bridge functions and a specifically designed ID algorithm (e.g., Shpitser et al. 2023, JMLR, The Proximal ID Algorithm). It may be worth mentioning these points at the end of Section 2, along with the "napkin" estimand.
In Table 1, we summarize the coverage of UCA, which shows that not all obsID/gID functionals are covered. We will further mention this point at the end of Section 2, as suggested.
The authors claimed the estimator is doubly robust (page 8, line 307) under certain conditions, which include a convergence-rate condition on the nuisance estimators. However, I believe this condition may not be achievable, as the error tends to accumulate over the stages; see Question below. Estimation of the nuisances: I think the error is not only affected by the estimation errors from the $i$-th stage, but also by the estimation errors from the previous stages. In other words, the estimation error accumulates as the index increases. Can the authors provide some comments on this error accumulation?
Good point. Due to the nature of nested expectation (and nested regression), the error of the nuisances can accumulate as the index $i$ decreases from $m$ to $1$. The error term in Theorem 3 indeed represents this accumulated error. Even with this accumulation, the error decomposes into the product of the errors of the nuisances; i.e.,

$$\text{[error of DML]} = \sum_{i=1}^{m} \text{[error of } \mu^i\text{]} \times \text{[error of } \pi^i\text{]}$$

still holds. As long as each nuisance estimate converges to the truth (which is likely in practice with flexible ML models), even if errors accumulate, the rate of convergence of the DML estimator outperforms that of competing estimators (OM, PW).
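To make the accumulation concrete, here is a minimal two-stage sketch of nested-regression nuisance estimation. This is not the paper's implementation; all variable names and the data-generating process are illustrative assumptions. The point is that the stage-1 regression is fit on the *fitted* values of stage 2, so stage-2 error propagates downward:

```python
# Minimal sketch of nested-regression nuisance estimation (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
Z1 = rng.normal(size=(n, 2))              # stage-1 history
Z2 = Z1 + rng.normal(size=(n, 2))         # stage-2 history
Y = Z2.sum(axis=1) + rng.normal(size=n)   # outcome

# Stage m (= 2): regress the outcome on the full history.
mu2_hat = GradientBoostingRegressor().fit(
    np.hstack([Z1, Z2]), Y).predict(np.hstack([Z1, Z2]))

# Stage 1: regress the *fitted* stage-2 values on the stage-1 history.
# Any error in mu2_hat is passed down; this is how errors accumulate
# as the stage index decreases.
mu1_hat = GradientBoostingRegressor().fit(Z1, mu2_hat).predict(Z1)
print(mu1_hat[:3])
```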
On the other hand, we note that the rate $n^{-1/4}$ is used to exemplify the debiasedness property because it is the fastest rate that a neural network can achieve [1].
[1] Györfi, László, et al. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
page 2, Table 1: Can the authors clarify or provide examples of why UCA does not cover the functionals identified by obsID/gID?
UCA does not cover all the functionals identified by obsID/gID. We provide an example, the napkin estimand, in lines 202-204, with a causal diagram in Figure 1c. Its identification formula is a ratio of two probabilistic functionals, and UCA cannot handle cases where the functional is given as a ratio of two functions. A detailed discussion on the coverage of UCA is provided in Section C.3.
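For reference, the napkin estimand has the following well-known ratio form (written with the generic variable names common in the literature; the paper's Figure 1c may use different labels):

$$P(y \mid \mathrm{do}(x)) \;=\; \frac{\sum_{w} P(x, y \mid r, w)\, P(w)}{\sum_{w} P(x \mid r, w)\, P(w)},$$

a ratio of two functionals, which falls outside the sum-product form handled by UCA.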
page 4, line 145: Given that the probability is at most 1, does this mean the variable is discrete? If so, can the authors explain why it has to be discrete?
No, it means that the value of the variable is governed by the corresponding probability measure. Whether the variable is continuous or discrete depends on the choice of that measure. For example, if the measure is a uniform distribution over an interval, the variable can take any real number within that interval. However, if the measure is a Bernoulli distribution, the variable is binary.
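A minimal illustration of this point (hypothetical code, not from the paper): the same "draw a value according to a measure" step yields a continuous or a binary variable depending on the chosen measure.

```python
import numpy as np

rng = np.random.default_rng(0)
v_uniform = rng.uniform(0.0, 1.0, size=5)    # uniform measure: any real in [0, 1]
v_bernoulli = rng.binomial(1, 0.3, size=5)   # Bernoulli measure: binary values
print(v_uniform, v_bernoulli)
```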
Bias structure: The bias structure in Theorem 3 consists of same-stage products [error of $\mu^i$][error of $\pi^i$] and cross-stage products [error of $\mu^i$][error of $\pi^j$] ($i \neq j$). The second, cross-product term seems non-trivial. Can the authors provide an example of causal estimands whose bias structure contains only [error of $\mu^i$][error of $\pi^i$] terms, without cross-product terms?
Consider the back-door adjustment, where $m = 1$. Since the error term only contains [error of $\mu^1$][error of $\pi^1$], the second, cross-product term does not exist when $m = 1$.
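For the back-door case, the standard doubly robust (AIPW) estimator exhibits exactly this single-product bias structure. The following is a minimal sketch under an assumed simulated data-generating process; it is not the paper's DML-UCA implementation, and all names are illustrative:

```python
# Minimal AIPW (doubly robust) sketch for the back-door case (m = 1).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
C = rng.normal(size=(n, 3))                       # back-door covariates
X = rng.binomial(1, 1 / (1 + np.exp(-C[:, 0])))   # binary treatment
Y = 2 * X + C.sum(axis=1) + rng.normal(size=n)    # true E[Y | do(X=1)] = 2

pi_hat = LogisticRegression().fit(C, X).predict_proba(C)[:, 1]    # weighting nuisance
mu_hat = LinearRegression().fit(C[X == 1], Y[X == 1]).predict(C)  # outcome nuisance

# AIPW: outcome model plus propensity-weighted residual correction.
psi = mu_hat + (X / pi_hat) * (Y - mu_hat)
print(psi.mean())  # close to 2
```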
Sections F.1.4 and F.1.5: Typo: `textttnormal`. page 3, line 94: There are double commas.
Thank you for catching the typo. These will be fixed.
The authors might consider including the weaknesses mentioned above as limitations.
We will expand the discussion of the limitations based on the provided feedback. Thanks.
Thank you for your response. I believe the paper is well-prepared for acceptance, so I will keep my score as is.
Thank you for taking the time and effort to provide constructive feedback.
This paper describes the estimand framework "unified covariate adjustment" (UCA) and discusses its coverage with multiple examples (front-door, Verma's equation, counterfactual direct effect, and, most importantly, Tian's adjustment). It then develops an estimator for this function class and shows that it is scalable. Lastly, the authors run experiments on simulated data to empirically demonstrate the robustness and the scalability of their estimator.
Strengths

- This work is original, as it introduces - to the best of my knowledge - the only scalable estimator for many function classes. It makes a lot of novel causal inference results applicable in real applications.
- The overall quality of this work is very good; the results are provided as theoretically sound theorems and empirically confirmed through simulated-data experiments.
- This paper is clear, the contributions are clearly stated, and the paper is well structured with some useful examples.
Weaknesses

- All the proofs are provided in the supplementary material, and the intuition for the proofs is not provided in the main paper.
- Some further experiments could be interesting even if they are not necessary. For example: how does DML compare to prior scalable estimators on BD/SBD? In Figure 2a,b,c, how much can the dimension of the summed variables grow before the running time of DML reaches unreasonable values (e.g., 2000)? Similarly, for Figure 2d,f, how small can the sample size be before the errors of DML reach unreasonable values?
- UCA could be more thoroughly delimited. It is well defined and the authors give examples of scenarios that are included; however, they do not clearly discuss which types of Ctf-ID are covered and which are not (this is also true for obsID/gID and transportability). Furthermore, there is no discussion concerning non-covered function classes and what difficulties prevent estimation in those cases.
- Some intuition regarding the meaning of the mathematical objects would have been appreciated (e.g., what are the sets in Def. 1?).
- While pseudo-code is provided, giving access to the code is always appreciated.
Questions

- Could you give intuition concerning the meaning of the sets and functionals in Def. 1 for non-experts?
- Could you discuss the limits of DML, i.e., when the dimension grows and when the sample size shrinks?
- Concerning the experiments, each point corresponds to the average of 100 simulations. Could you provide the variances as well?
Limitations

- The authors do not discuss which function classes are not covered by UCA. Moreover, they do not discuss the limits of their estimator DML when the dimension grows and when the sample size shrinks.
We thank the reviewer for the time and valuable feedback, and appreciate the positive assessment of our work.
Some further experiments could be interesting even if they are not necessary. For example: how does DML compare to prior scalable estimators on BD/SBD? In Figure 2a,b,c, how much can the dimension of the summed variables grow before the running time of DML reaches unreasonable values (e.g., 2000)? Similarly, for Figure 2d,f, how small can the sample size be before the errors of DML reach unreasonable values?
We have provided a set of experimental results in the attached pdf file.
UCA could be more thoroughly delimited. It is well defined and the authors give examples of scenarios that are included; however, they do not clearly discuss which types of Ctf-ID are covered and which are not (this is also true for obsID/gID and transportability). Furthermore, there is no discussion concerning non-covered function classes and what difficulties prevent estimation in those cases.
Some cases where the target estimand cannot be expressed through UCA are discussed in Appendix C.3. A summary of this discussion will be provided in the main paper.
Some intuition regarding the meaning of the mathematical objects would have been appreciated (e.g., what are the sets in Def. 1?). Could you give intuition concerning the meaning of the sets and functionals in Def. 1 for non-experts?
The meaning of the sets in Def. 1 depends on the specific case. In all examples, we specified what these sets mean. For BD/SBD, one set can be viewed as the covariates, one as the treatment, and one as the predecessors of the treatment. The nuisance $\mu$ is a (nested-)expectation functional representing the UCA estimand, and $\pi$ is the probability-weighting-based functional representing the UCA estimand. We will provide more explanation to give an intuition about these mathematical objects in the paper. Thank you.
While pseudo-code is provided, giving access to the code is always appreciated.
We will make the code available after the revision. Thank you.
Could you discuss the limits of DML, i.e., when the dimension grows and when the sample size shrinks?
As shown in the experiment in the PDF of the global response, the proposed DML estimator remains scalable when the dimension is high.
When the sample size shrinks, it is possible that the error of the DML estimator is amplified, because its error decomposes into the product of the errors of the nuisances; i.e., $\text{[error of DML]} = \sum_{i=1}^{m} \text{[error of } \mu^i\text{]} \times \text{[error of } \pi^i\text{]}$. If the sample is small, so that the errors of the nuisances become large, the resulting DML estimator may have a larger error, since the errors are multiplied. However, as the sample size grows, the DML estimator is guaranteed to converge faster whenever the nuisances converge to the truth.
Concerning the experiments, each point corresponds to the average of 100 simulations. Could you provide the variances as well?
In all plots in Figure 2, the confidence intervals of the error are shown as error bars.
The authors do not discuss which function classes are not covered by UCA.
Some function classes where the target estimand cannot be expressed through UCA are discussed in Appendix C.3.
We will add a sufficient criterion to determine which estimands can be represented as a UCA estimand in the revision of the paper. The idea behind the criterion is as follows. If

- the target estimand is expressed as the mean of a product of conditional distributions, and
- the variables that are marginalized and fixed simultaneously (e.g., $X$ in the front-door adjustment) appear only in restricted positions of the expression,

then the proposed methods can be applied. These conditions are sufficient for applying the empirical bifurcation technique (Def 2) that allows scalable estimation; the front-door expression below illustrates the pattern.
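For illustration (this is the textbook front-door formula, not quoted from the paper), the front-door functional is a sum of a product of conditional distributions in which $X$ is simultaneously fixed and marginalized:

$$P(y \mid \mathrm{do}(x)) \;=\; \sum_{z}\sum_{x'} \underbrace{P(z \mid x)}_{X \text{ fixed to } x}\; \underbrace{P(y \mid x', z)\, P(x')}_{X \text{ marginalized as } x'}.$$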
Thank you for all the clarifications. After reading other reviews, I still think it is an interesting paper and I maintain my score.
Thank you for spending the time reading our paper and the positive assessment.
The paper presents a class of adjustment formulas called unified covariate adjustment (UCA) which is shown to be able to express many classes of adjustments known in the existing literature. A scalable and doubly robust estimator for UCA is also presented along with some experimental results.
Strengths
The proposed UCA estimator seems to be very expressive and is able to model many existing estimators in the literature. Examples were given to show how existing estimators can be expressed as a UCA estimator.
The paper also proposes a doubly robust method to obtain UCA estimates.
Weaknesses
- It is my personal opinion that the paper lacks polish. There are many instances of technical notation being used without first being properly defined, making it hard to understand and follow the discussion (see the Questions section for some of them). The reader should not be expected to guess the meaning of certain notations by cross-referencing across subsequent pages (at best, check the preliminaries/notation section) or even across other paper references.
- The paper claims on Lines 57-58 that "while these estimators are designed to achieve a wide coverage of functionals, they lack scalability due to the necessity of summing over high-dimensional variables", but the general definition of UCA in equation (1) also involves summing over potentially many variable values. Please explain clearly why UCA avoids scalability issues.
- Lines 302-304: It is not true that only asymptotic analyses were known for all these estimators. For example, [1] gives non-asymptotic finite-sample guarantees for the Tian-Pearl adjustment. The paper would benefit from a comparison against such prior works and an illustration of how DML-UCA estimators are indeed more sample-efficient compared to existing estimators.
- Why are there no experiments against the Tian-Pearl adjustment?
- No code was released (though some parameters were given in the appendix).
[1] Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, Vedant Raval, and Vinodchandran N. Variyam. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:7531-7549, 2022.
Questions
- How are the symbols in Table 1 determined? "Scalable" is defined as "evaluable in polynomial time relative to number of covariates and capable in the presence of mixed discrete and continuous covariates", but what about "coverage"?
- Consider rephrasing the awkward-sounding sentence "Our work strives maximizing coverage..." on Lines 73-76.
- $\mathbf{S}^b$ first appears in Line 141. How is this defined? How does it differ from $\mathbf{S}$?
- In Line 144, the initial (index-0) sets are referenced, but how are they defined? Also, how do the $\mathbf{S}_i$'s relate to the overall variable set? Is it the c-component of the $i$-th vertex?
- Equation (4): The notation $Q[\cdot]$ is undefined. Is it $Q[\cdot]$ as in the Tian-Pearl 2002 paper?
- Line 312: What is $O_P$? How does it differ from just the usual big-O notation? Also, how do the errors in the estimation scale with the number of samples? Lines 307-309 only say "if the terms converge at a rate of $n^{-1/4}$, then DML-UCA converges at a rate of $n^{-1/2}$". Why do those terms converge at a rate of $n^{-1/4}$?
- Given a general causal graph and causal query, how does one find a suitable and valid UCA expression? What is the procedure?
Potential typos:
- Double comma on Line 94
- Extra ) on Line 153
Limitations
Nil
Thank you for your feedback and for the opportunity to provide further elaboration.
technical notations being used without first properly defining them
We will further proofread the paper and the preliminaries.
Please explain clearly why UCA avoids scalability issues.
Existing BD/SBD estimators avoid scalability issues by replacing marginalization with nested expectation. However, when a variable is both marginalized and fixed simultaneously (e.g., in FD, $X$ is fixed to $x$ in one factor and marginalized over in the other components), representing the marginalization operator as a nested expectation is non-trivial. These challenges lead to potential scalability issues for previous FD estimators (Fulcher et al., 2019; Guo et al., 2023), whose complexity grows exponentially in the dimension of the variables (times the time complexity of learning the nuisances). In contrast, the newly proposed UCA estimator leverages empirical bifurcation to replace marginalization with nested expectation, achieving complexity polynomial in the dimension.
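As a concrete (hypothetical) illustration of the nested-expectation form, the following plug-in sketch estimates the front-door functional $E_{Z\mid x}[E_{X'}[E[Y \mid X', Z]]]$ by regression and sample averages rather than explicit summation over $z$. It is not the paper's empirical-bifurcation estimator; all names and the data-generating process are assumptions:

```python
# Hedged plug-in sketch of the nested-expectation front-door estimate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 4000
U = rng.normal(size=n)                          # unmeasured confounder
X = (U + rng.normal(size=n) > 0).astype(float)  # treatment
Z = X + 0.5 * rng.normal(size=n)                # front-door mediator
Y = Z + U + rng.normal(size=n)                  # true E[Y | do(X=1)] = 1

mu = GradientBoostingRegressor().fit(np.column_stack([X, Z]), Y)  # E[Y | X, Z]

Zx = Z[X == 1.0]                 # in the FD graph, P(z | x) = P(z | do(x))
m = min(len(Zx), 300)            # subsample to keep the double loop cheap
h = np.array([mu.predict(np.column_stack([X, np.full(n, z)])).mean()
              for z in Zx[:m]])  # h(z) = E_{X'}[ E[Y | X', z] ]
print(h.mean())                  # ~1.0: plug-in estimate of E[Y | do(X=1)]
```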
[1] gives non-asymptotic finite sample guarantees for the Tian-Pearl adjustment.
We will cite all the referenced papers. Please note that lines 302-304 discuss DML-style estimators, while the mentioned paper provides finite-sample guarantees for a basic plug-in estimator under the discrete random variable setting.
Why is there no experiments against the Tian-Pearl adjustment?
Figure 2(b,e) shows the experimental results for Verma's graph (Figure 1b), which is an example instance of the Tian-Pearl adjustment.
No code was released (though some parameters were given the appendix).
We will make the code available after the revision. Thank you.
How are the symbols in Table 1 determined? "scalable" is defined to be "evaluable in polynomial time relative to number of covariates and capable in the presence of mixed discrete and continuous covariates", but about "coverage"?
Coverage indicates whether established estimators exist. For obsID/gID classes, estimators like those by Jung et al. (2021a, 2023a), Xia et al. (2021, 2022), and Bhattacharya et al. (2022) are marked in the "Prior" column. The UCA, covering Tian's adjustment, is checked in its respective column.
Consider rephrasing the awkward-sounding sentence "Our work strives maximizing coverage..." on Line 73-76?
Thank you. We will rephrase as follows: “Our work aims to maximize coverage, enabling the effective development of scalable estimators with the doubly robust property.”
$\mathbf{S}^b$ first appears in Line 141. How is this defined? How does it differ from $\mathbf{S}$?
$\mathbf{S}^b$ represents a set of variables fixed to specified values in a conditional distribution (e.g., $X$ fixed to $x$ in FD, Example 1). Meanwhile, $\mathbf{S}$ is a subset of the union of $\mathbf{C}$ and $\mathbf{R}$, excluding the fixed set.
The initial (index-0) sets are referenced, but how are they defined? Also, how do the $\mathbf{S}_i$'s relate to the overall variable set? Is it the c-component of the $i$-th vertex?
- $\mathbf{C}^{(0)}$, $\mathbf{R}^{(0)}$, and $\mathbf{S}^b_0$ are all defined as the empty set. We will add this in the preliminaries.
- We define $\mathbf{S}_{i-1} := (\mathbf{C}^{(i-1)} \cup \mathbf{R}^{(i-1)}) \setminus \mathbf{S}^{b}_{i-1}$ in line 144. This is a subset of the overall variable set.
- Thank you for the good question. It is not a c-component itself; we only used that notation to denote the c-component containing the corresponding variable. We will improve the notation to distinguish them more explicitly.
Equation (4): The notation $Q[\cdot]$ is undefined. Is it $Q[\cdot]$ as in the Tian-Pearl 2002 paper?
Yes, it's defined in line 98.
Line 312: What is $O_P$? How does it differ from just the usual big-O notation? Also, how do the errors in the estimation scale with the number of samples?
Why do those terms converge at a rate of $n^{-1/4}$?
- As written in line 104, $O_P$ denotes stochastic boundedness, called the big-O in probability [van der Vaart, "Asymptotic Statistics", 1998]. The expression $X_n = O_P(a_n)$ means that $X_n / a_n$ remains bounded in probability even when $n$ increases to infinity. This indicates that $X_n$ decreases at least as fast as $a_n$. For example, if the error term is $O_P(n^{-1/2})$, then it decreases at the rate $n^{-1/2}$.
- Theorem 3 and Corollary 3 show that when the nuisances converge at the rate $n^{-\alpha}$ (for some $\alpha > 0$), the estimator can converge at the doubled rate $n^{-2\alpha}$. We demonstrate this with $n^{-1/4}$, since it is the fastest convergence rate for modern ML models like neural networks [1].
[1] Györfi, László, et al. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.
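To spell out the rate arithmetic (the $n^{-1/4}$ example rate follows the response above and is illustrative):

$$X_n = O_P(a_n) \;\Longleftrightarrow\; \text{for every } \varepsilon > 0 \text{ there exist } M, N \text{ such that } \Pr\big(|X_n|/a_n > M\big) < \varepsilon \text{ for all } n > N,$$

$$\underbrace{O_P(n^{-1/4})}_{\text{error of } \mu^i} \cdot \underbrace{O_P(n^{-1/4})}_{\text{error of } \pi^i} \;=\; O_P(n^{-1/2}).$$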
how does one find a suitable and valid UCA expression?
Some causal queries that satisfy known graphical criteria (e.g., BD/SBD, FD, or Tian's adjustment) can be represented as a UCA. On the other hand, we have further developed a sufficient criterion to determine if a given estimand can be represented as a valid UCA expression. The idea behind the criterion is as follows. If
- the target estimand is expressed as the mean of a product of conditional distributions, and
- the variables that are marginalized and fixed simultaneously (e.g., $X$ in the front-door adjustment) appear only in restricted positions of the expression,

then the proposed methods can be applied. This criterion will be added in the revised version of the paper.
Double comma on Line 94, Extra ) on Line 153
We will fix the typos, thank you.
Thank you for the detailed responses. They have addressed my concerns. I look forward to these being incorporated nicely in a future revision. I will increase my score.
We appreciate your constructive input. Thank you for the positive feedback.
The paper introduces a novel framework, unified covariate adjustment (UCA), which covers a broad class of sum-product causal estimands and additionally develops a scalable estimator (via DML-UCA) that ensures double robustness.
Strengths

- The paper presents a well-developed theoretical framework with clear assumptions and derivations. The motivation to extend existing estimands is well articulated, and the comparisons with prior work are thoroughly examined.
- The paper is well written and clearly presents the motivation, methodology, and contributions.
- It is interesting to revisit the coverage and scalability of previous studies and provide comprehensive evaluations.
Weaknesses

- The proposed method is based on structural causal models, which have been extensively studied. The UCA class is an extension of the sequential back-door adjustment (SBD), and there are already existing studies that address similar questions.
- The authors mention in Example 1 that the front-door adjustment (FD) can be represented using the UCA framework. Are there any assumptions required to validate this representation? Similarly, to represent the Verma constraints as UCA in Example 2, are there any criteria that could be followed to verify the representation in practice? What are the requirements and limitations for implementing these representations to ensure the reliability and validity of the estimation in general?
Questions
See Weaknesses
Limitations
yes
Thank you for sharing your thoughts and feedback!
The UCA class is an extension of the sequential back-door adjustment (SBD), and there are already existing studies that address similar questions.
Indeed, UCA is an extension of the SBD, and we appreciate and have cited the papers on estimating the SBD estimand. However, naively applying SBD estimators to the UCA class (e.g., FD, Verma) may lead to biased estimators whenever the values of the SBD estimand and the UCA estimand do not match. Also, modifying the existing SBD estimators for the UCA class is non-trivial, since variables that are fixed and marginalized at the same time (e.g., $X$ in FD) are not properly treated in the SBD estimators. A special method (such as the empirical bifurcation in Def. 6) is required to develop an estimator. Furthermore, for Tian's adjustment, the weighting nuisances do not have the same form as those in SBD. In summary, developing doubly robust estimators for the UCA class is a novel and non-trivial task.
Are there any assumptions required to validate this representation? Similarly, to represent the Verma constraints as UCA in Example 2, are there any criteria that could be followed to verify the representation in practice? What are the requirements and limitations for implementing these representations to ensure the reliability and validity of the estimation in general?
Some causal queries that satisfy known graphical criteria (e.g., BD/SBD, FD, or Tian's adjustment) can be represented as a UCA. Recall that representing Tian's adjustment through the UCA estimand is demonstrated in the paper.
On the other hand, we have further developed a sufficient criterion to determine if a given estimand can be represented as a valid UCA expression. The idea behind the criterion is as follows. If

- the target estimand is expressed as the mean of a product of conditional distributions, and
- the variables that are marginalized and fixed simultaneously (e.g., $X$ in the front-door adjustment) appear only in restricted positions of the expression,

then the proposed methods can be applied. These conditions are sufficient for applying the empirical bifurcation technique (Def 2) that allows scalable estimation.
Thanks for the detailed responses, I will maintain my score.
Thank you for your positive assessment of our paper.
We attached a PDF reporting experimental results in response to the following questions from Reviewer ZBfF:

- In Figure 2a,b,c, how much can the dimension of the summed variables grow before the running time of DML reaches unreasonable values (e.g., 2000)?
- How small can the sample size be before the errors of DML reach unreasonable values?

In summary, the proposed DML-UCA estimator can be evaluated under a high-dimensional setting. The estimator can also be evaluated under a small-sample-size setting, with sample sizes varying from 10 to 100.
The paper introduces a new causal identification framework which covers a wide range of estimands. The authors provided detailed (and ultimately satisfactory) responses to the initial reviews and all reviewers have recommended acceptance.