Generalizing Causal Effects from Randomized Controlled Trials to Target Populations across Diverse Environments
This paper proposes identifiability conditions and a novel method for generalizing causal effects across environments in the presence of distributional shift and missing variables.
Abstract
Reviews and Discussion
This paper studies identification and estimation for generalizing treatment effects from an RCT to an observational dataset. Instead of assuming that the entire separating set is observed in both datasets, it assumes that part of the set is observed in only one dataset. Identification is made possible by introducing shadow variables, a subset of the commonly observed covariates that is conditionally independent of observability given the other covariates. The authors then develop the identification theory for the target treatment effect. In addition, they propose a two-stage doubly robust estimator that involves both the selection of shadow variables and doubly robust estimation of the target treatment effect. The efficacy of the proposed methods is demonstrated by extensive numerical experiments.
Questions For Authors
- Why is Assumption 3.2 testable using observational data, given that is only observed in one dataset?
- On page 13 (the proof of Lemma 3.4), why do Assumptions 2.5 and 3.2 lead to the first equation (lines 664-665)? It seems to use the equivalence between the conditional distributions of and , but I'm not sure this is implied by the two assumptions; some clarification is needed.
- In Lemma 3.5, what do you mean by "identified"? Do you mean these quantities allow you to compute OR( for every value of ? If so, only knowing the conditional expectation wouldn't allow you to get back to a function of and . Please clarify so that I can make a clearer judgement.
- In step 2 in the identification, why "Under Assumption 2.3, we have "? If I understand correctly, represents the observational dataset in setting 2, so this cannot be true? I think this should be conditional on .
- Can you provide a proof for Theorem 4?
- Why do you use mean value imputation for ? Why is this enough for estimating given that Theorem 3.3 uses the density of given ?
- Could you develop consistency and inference guarantees for the resulting estimator?
Claims And Evidence
The experiments are solid, but I have several points of confusion regarding the identification theory and estimation procedures. It is not fully clear to me why the theory goes through under the given assumptions; please see the Questions For Authors section.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
I checked the correctness but cannot arrive at a clear conclusion, and clarification is needed.
Experimental Designs Or Analyses
Yes, the experiments seem solid.
Supplementary Material
Yes, I reviewed the proof in the SM.
Relation To Broader Scientific Literature
Generalizing treatment effects to target populations is an important topic, and this paper proposes methods that relax existing conditions, which contributes to this literature and may inspire future developments.
Essential References Not Discussed
None to my knowledge.
Other Strengths And Weaknesses
If all results are correct, this paper would be interesting. Strengths include rich results, new identification conditions for generalization, and new procedures. But given the points of confusion I list in the Questions For Authors section, I would urge the authors to clarify them so that I can give a fair judgement.
Other Comments Or Suggestions
Please see the Questions For Authors section. Given these points of confusion, my current rating leans towards reject, but I would be glad to adjust it if the authors can address them in the rebuttal.
We sincerely appreciate the reviewer’s great efforts and insightful comments to improve our manuscript. Below, we address these concerns point by point.
[Q1]
Regarding the testability of Assumption 3.2, we would like to clarify the following:
- Assumption 3.2(1), i.e., , is testable because all three variables are observable in the dataset with , where indicates that is observable in this dataset.
- Assumption 3.2(2), i.e., , is testable for the following reason. Although is unobservable in the dataset with , we can still conduct the test based on Theorem 4.1, which does not require the values of in the dataset with since the only expression involving in the equation to be solved, i.e., , is always zero when .
[Q2]
Thank you for pointing out the issue caused by our accidental omission of the factor in the corresponding lines. However, we would like to clarify that this does not affect our final conclusion (the result on lines 674-675). The corrected proof is as follows:
Thank you again for your insightful feedback. We will address this issue in the revised version of the manuscript.
[Q3]
The identification of is guaranteed by Theorem 1 in [1], as referenced in Appendix D. The core idea relies on the completeness condition being satisfied. The detailed proof can be found in the Appendix (Proof of Theorem 1 on Pages 14-15) of [1]. Due to the character limit in the rebuttal stage, we regret that we are unable to provide the full proof here. We kindly ask that the reviewer refer to the original paper for further details. We will include this proof in Appendix D of the revised manuscript. Thank you for the suggestion.
[Q4]
Thank you for pointing out a typo in our manuscript. Indeed, it should be , where is equivalent to in Setting 2. We will correct this typo in the revised version of the manuscript.
[Q5]
As cited in the manuscript, Theorem 4.1 is derived from Theorem 2.3 in [2], and the detailed proof can be found in Appendix A.3 (Page 12) of [2]. Due to the character limit in the rebuttal stage, we regret that we are unable to provide the full proof here. We kindly ask that the reviewer refer to the original paper for further details. We will include this proof in the appendix of the revised manuscript. Thank you for your valuable suggestion.
[Q6]
We employed mean value imputation to reduce the complexity of imputation and improve its practicality and efficiency. Performing distributional-estimation-based imputation would be perfect, but it requires additional estimation of quantities such as variance, higher-order moments, etc., or the use of generative methods, all of which would increase complexity. Mean value imputation is a commonly used technique to reduce complexity [3], and our experimental results demonstrate that it performs well. Of course, Theorem 3.3 supports distributional-estimation-based imputation, and we will explore it in future work. Thank you for your insightful suggestion.
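The trade-off between the two imputation styles mentioned above can be illustrated with a toy sketch. It is not the authors' 2SDR pipeline: the variables `z` and `u`, the linear-Gaussian relation between them, and the two downstream functions are all assumptions made here for illustration. The sketch compares plugging in the conditional mean with drawing from the conditional distribution, and whether the difference matters for 2SDR depends on how the imputed covariate enters the downstream models, which this sketch does not model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                    # observed proxy covariate (hypothetical)
u = 0.8 * z + 0.6 * rng.normal(size=n)    # covariate treated as missing (hypothetical)

# Mean value imputation: plug in E[u | z], known here by construction.
u_mean_imp = 0.8 * z
# Distribution-based imputation: one draw per unit from p(u | z).
u_draw_imp = 0.8 * z + 0.6 * rng.normal(size=n)

def linear(v):
    return 2.0 * v + 1.0

def quadratic(v):
    return v ** 2

def summarize(f):
    """Average of f under the true, mean-imputed, and draw-imputed covariate."""
    return tuple(float(np.mean(f(v))) for v in (u, u_mean_imp, u_draw_imp))

lin_true, lin_mean, lin_draw = summarize(linear)
quad_true, quad_mean, quad_draw = summarize(quadratic)
print(f"linear:    true={lin_true:.3f} mean-imp={lin_mean:.3f} draw-imp={lin_draw:.3f}")
print(f"quadratic: true={quad_true:.3f} mean-imp={quad_mean:.3f} draw-imp={quad_draw:.3f}")
```

In this toy setup the three averages agree for the linear function, while for the quadratic function the mean-imputed average falls short by roughly Var(u | z) = 0.36, the conditional variance that mean imputation discards.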
[Q7]
Due to the character limit in the rebuttal stage, we are unable to include the full proof here. We promise to provide the consistency proofs for both doubly robust imputation and doubly robust ATE estimation in the appendix of the revised version. Thank you for your insightful suggestion.
We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if you could be generous in raising your score. Thank you!
References
[1] Miao, W., Liu, L., Li, Y., Tchetgen Tchetgen, E. J., & Geng, Z. (2024). Identification and semiparametric efficiency theory of nonignorable missing data with a shadow variable. ACM/JMS Journal of Data Science, 1(2), 1-23.
[2] d’Haultfoeuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1), 1-15.
[3] Lin, W. C., & Tsai, C. F. (2020). Missing value imputation: a review and analysis of the literature (2006–2017). Artificial Intelligence Review, 53, 1487-1509.
Thank you for the response! I have updated the score.
We are glad to have addressed your concerns and sincerely appreciate your support for our work, as well as your valuable suggestions for improving it. Thank you!
This paper deals with generalizing treatment effects estimated from RCTs to different environments where environmental shifts exist. Existing methods assume that the covariates common to both the source and target datasets contain the separating set, an assumption that is often violated in real-world settings.
The authors propose a Two-Stage Doubly Robust (2SDR) method to address this. The key idea is to relax the standard assumption. Instead of requiring the separating set to be present in the common covariates, they only require it to be present in at least one of the datasets. They then use shadow variables (covariates correlated with the missing ones but not directly influencing the environmental shift) to impute the missing covariates.
The authors also provide theoretical justification and evaluate the method on synthetic and real-world datasets.
Questions For Authors
Refer to all the weaknesses in the "Claims and Evidence" and "Methods and Evaluation Criteria" sections above.
Claims And Evidence
- The theory assumes that the chosen shadow variables are sufficient to fully determine the distribution of the missing covariates. There's no guarantee that such a set of shadow variables exists or that the proposed selection procedure will find them. If the shadow variables are only weakly predictive of the missing covariates, the imputation will be poor, leading to bias.
- The claim of unbiasedness also depends on the correct specification of the imputation models or the odds ratio function. The paper acknowledges the doubly robust property, but this only protects against misspecification of one model in each pair. It does not guarantee unbiasedness if both models in a pair are misspecified, or if the shadow variable assumption is violated. The empirical evaluation doesn't systematically investigate the impact of model misspecification.
- The selection procedure depends on the reliability of the conditional independence tests (RCIT). I would recommend that the authors discuss these tests' own assumptions and limitations, because the tests are not perfect, especially in high-dimensional settings with limited data.
Methods And Evaluation Criteria
While I know it may be beyond the scope of this rebuttal, simulating scenarios where the shadow variable assumption is partially violated (e.g., by introducing a weak dependence between Z and S) and quantifying the resulting bias in the TATE estimates would significantly strengthen the work.
Theoretical Claims
I quickly went over the proof of Theorem 3.3 and did not notice any significant flaw.
Experimental Designs Or Analyses
The experiments in general demonstrate the effectiveness of the proposed method.
Supplementary Material
I quickly reviewed the Appendix.
Relation To Broader Scientific Literature
Estimating causal effects from multiple environments is important in many applications, in particular experiments with multiple sites.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
Figure 1(a) and the related definitions in its caption need improvement. I am confused about what type of variables "covariates affecting treatment effect heterogeneity" are. Are these variables confounders? Why do the authors state is a covariate affecting neither treatment effect heterogeneity nor environmental shifts in lines 68-69?
Furthermore, the wording "variables affect environmental shifts" is misleading. It seems that the covariates exhibit distributional differences but do not cause the underlying environmental shift.
We sincerely appreciate the reviewer’s great efforts and insightful comments to improve our manuscript. Below, we address these concerns point by point.
[Claims And Evidence 1]
First, we would like to clarify that the method we propose for automatically selecting shadow variables is theoretically guaranteed, based on the testability of Assumption 3.2:
- Assumption 3.2(1), i.e., , is testable because all three variables are observable in the dataset with , where indicates that is observable in this dataset.
- Assumption 3.2(2), i.e., , is testable for the following reason. Although is unobservable in the dataset with , we can still conduct the test based on Theorem 4.1, which does not require the values of in the dataset with since the only expression involving in the equation to be solved, i.e., , is always zero when .
The shadow variables selected by the proposed method should pass the above hypothesis tests, ensuring that they satisfy Assumption 3.2.
Second, the existence of shadow variables can also be assessed through hypothesis testing. Moreover, as discussed in the manuscript, Assumption 3.2 is reasonable because not all variables in with predictive ability for are direct causes of in many real-world scenarios, such as the example of the AIDS study mentioned in the manuscript. Of course, in certain cases, shadow variables may indeed be weak or even absent. We have followed the reviewer's suggestion and conducted experiments in such cases.
[Claims And Evidence 2]
Our method, like most DR methods, requires that at least one of the two models in each stage, either the regression model or the selection score model, be correctly specified to ensure the consistency of the imputation and ATE estimation. However, such a requirement for correct model specification is very common in related work [1, 2], and we can leverage techniques such as neural networks to learn the correct model specification as accurately as possible. Of course, we value the reviewer's suggestion and have also conducted experiments to assess the impact of incorrect model specification due to violations of Assumption 3.2.
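The "one of two models" requirement can be made concrete with a minimal sketch of the textbook augmented inverse-probability-weighted (AIPW) estimator of a mean with outcomes missing at random. This is a generic doubly robust estimator, not the authors' 2SDR method, and the data-generating process, the function names `aipw_mean` and `simulate`, and the deliberately wrong models are all assumptions made here for illustration.

```python
import numpy as np

def aipw_mean(y, r, pi_hat, m_hat):
    """AIPW estimate of E[Y] when Y is observed only where r == 1."""
    y_filled = np.where(r == 1, y, 0.0)  # unobserved outcomes never enter the estimate
    return float(np.mean(m_hat + r * (y_filled - m_hat) / pi_hat))

def simulate(n=200_000, seed=0):
    """Hypothetical data: Y depends on X, and so does the chance of observing Y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)   # true E[Y] = 1
    pi = 1.0 / (1.0 + np.exp(-(0.5 + x)))    # true observation probability
    r = rng.binomial(1, pi)
    return x, y, r, pi

x, y, r, pi_true = simulate()
m_true = 1.0 + 2.0 * x                 # correctly specified outcome regression
m_wrong = np.zeros_like(x)             # deliberately misspecified regression
pi_wrong = np.full_like(x, r.mean())   # deliberately misspecified score (ignores x)

est_outcome_ok = aipw_mean(y, r, pi_wrong, m_true)   # only the regression is right
est_score_ok = aipw_mean(y, r, pi_true, m_wrong)     # only the score is right
est_both_wrong = aipw_mean(y, r, pi_wrong, m_wrong)  # neither is right
print(est_outcome_ok, est_score_ok, est_both_wrong)
```

In this sketch the first two estimates land close to the true mean of 1, while the third inherits the selection bias of the observed subsample, which matches the reviewer's point that double robustness offers no protection when both models in a pair are wrong.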
[Claims And Evidence 3]
Thank you for your insightful suggestion. Our method does not rely on specific CIT methods; any CIT method can be used for our hypothesis testing of Assumption 3.2(1). The choice of the method should be determined based on the specific application context. Due to the character limit in the rebuttal phase, we regret that we cannot provide a detailed discussion of the various CIT methods here. We promise to include the relevant discussion in the appendix of the revised manuscript.
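To illustrate how the limitations of a conditional independence test depend on its own assumptions, here is a minimal sketch of one of the simplest such tests, the Fisher-z partial-correlation test. It is not RCIT and not necessarily the test used in the paper; it is valid only when dependence is roughly linear-Gaussian, and the function name `partial_corr_ci_test` and the simulated variables are assumptions made here for illustration.

```python
import math
import numpy as np

def partial_corr_ci_test(x, y, z):
    """Fisher-z test of X independent of Y given Z (linear-Gaussian assumption).

    Returns the sample partial correlation and a two-sided p-value.
    """
    z1 = np.column_stack([np.ones(len(x)), z])          # add an intercept column
    rx = x - z1 @ np.linalg.lstsq(z1, x, rcond=None)[0]  # residual of x given z
    ry = y - z1 @ np.linalg.lstsq(z1, y, rcond=None)[0]  # residual of y given z
    rho = float(np.corrcoef(rx, ry)[0, 1])
    k = z1.shape[1] - 1                                   # number of conditioning variables
    stat = math.sqrt(len(x) - k - 3) * math.atanh(rho)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))       # two-sided normal tail
    return rho, p_value

rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y_indep = z + rng.normal(size=n)            # independent of x once z is known
y_dep = z + 0.5 * x + rng.normal(size=n)    # still depends on x given z

rho_indep, p_indep = partial_corr_ci_test(x, y_indep, z)
rho_dep, p_dep = partial_corr_ci_test(x, y_dep, z)
print(rho_indep, p_indep, rho_dep, p_dep)
```

A test of this kind has no power against purely nonlinear dependence, which is one reason kernel-based alternatives such as RCIT exist and why the choice of test should follow the application context.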
[Methods And Evaluation Criteria]
Thank you for the suggestion. We have conducted experiments in cases where Assumption 3.2 is violated. Specifically, we conducted experiments on the synthetic dataset by
- Case 1: changing the coefficient of Z on S from 0 to {0.1 (Low), 0.3 (High)} to introduce weak dependence;
- Case 2: reducing the correlation coefficient between Z and from 0.8 to {0.3 (Low), 0.1 (High)}, so that Z only has a weak predictive ability for .
| Extent of violation | Case 1 | Case 2 |
|---|---|---|
| High | 0.355 ± 0.319 | 0.383 ± 0.280 |
| Low | 0.311 ± 0.293 | 0.341 ± 0.281 |
| None | 0.268 ± 0.212 | 0.268 ± 0.212 |
The results demonstrate that while the performance of 2SDR does decline when Assumption 3.2 is violated, the decline is gradual as the extent of the violation increases.
[Other Comments Or Suggestions]
Thank you for your valuable suggestions.
- "Covariates affecting treatment effect heterogeneity" are not equivalent to confounders: Suppose there are three covariates, A, B, and C, and the generation process of Y is . In this case, A is the only variable affecting treatment effect heterogeneity, as the value of changes with variations in A. We will improve the related definitions in the figure and its caption.
- "Variables affect environmental shifts" refers to the variables in the causal graph that point to R (the causes of R). We will revise this statement to make it clearer.
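As a hypothetical illustration of the distinction drawn in the first point, consider the following data-generating process. The linear form, the treatment symbol $T$, the coefficients $\beta_j$, and the mean-zero noise $\varepsilon$ with $\mathbb{E}[\varepsilon \mid T, A, B, C] = 0$ are assumptions made here for exposition and are not taken from the manuscript.

```latex
Y = \beta_0 + \beta_1 T A + \beta_2 B + \beta_3 C + \varepsilon,
\qquad
\mathbb{E}[Y \mid T = 1, A, B, C] - \mathbb{E}[Y \mid T = 0, A, B, C] = \beta_1 A .
```

Under these assumptions B and C shift the level of Y but cancel in the difference, so only A changes the conditional treatment effect, whether or not any of the three is a confounder.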
We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if you could be generous in raising your score. Thank you!
References
[1] Miao, W., Liu, L., Li, Y., Tchetgen Tchetgen, E. J., & Geng, Z. (2024). Identification and semiparametric efficiency theory of nonignorable missing data with a shadow variable. ACM/JMS Journal of Data Science, 1(2), 1-23.
[2] Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., ... & Yang, S. (2024). Causal inference methods for combining randomized trials and observational studies: a review. Statistical science, 39(1), 165-191.
Thank you for your rebuttal. My main concerns are addressed. Thus I will raise my score to 3.
After going through the other reviewers' comments to ensure I give a fair judgment on the work, I also think that exploring distributional-estimation-based imputation in the paper could enhance robustness.
If possible, could you provide the consistency proofs for both doubly robust imputation and doubly robust ATE estimation during the rebuttal period? I understand there is a 5000-character limit, but providing at least proof sketches here would be really helpful for evaluating the work.
We are glad that our previous response has addressed your concerns. We greatly appreciate your support of our work, as well as the constructive suggestions you have provided for improving our paper. Below are our responses to your new suggestions:
[S1] Exploring distributional-estimation-based imputation
Thank you for the valuable suggestion. We use mean value imputation because it is more efficient while remaining effective. We agree that performing distributional-estimation-based imputation could enhance robustness, and our Theorem 3.3 also supports this approach. We will take your suggestion and explore it in future work.
[S2] Sketch consistency proofs
- The DR imputation model is consistent if either the regression model or the selection score model of S is correctly specified, and the model is correctly specified. Proof: We only need to prove the unbiasedness of , and then the consistency of in large samples holds under the conditions specified in [2,3]. Step 1: Unbiasedness of the OR model. Given the conditions specified in [1], is a consistent estimate of . Therefore, based on Eq.(6), if the model is correctly specified, its unbiasedness holds, and thus the unbiasedness of the OR model also holds based on Eq.(4). Step 2: Unbiasedness of . We need to prove that holds, where . It is equivalent to proving that holds. Under Assumption 3.2(2), this further reduces to proving that holds. We provide proofs for the following two cases. (1) The regression model is correctly specified while the selection score model is not. In this case, as the regression model is correctly specified, is an unbiased estimate of . Therefore, given the unbiasedness of the OR model, based on Eq.(2), the unbiasedness of holds. Consequently, holds, and thus the unbiasedness of holds. (2) The selection score model is correctly specified while the regression is not. In this case, as the selection score model is correctly specified, is an unbiased estimate of . Therefore, given the unbiasedness of the OR model, based on Corollary 3.7, the unbiasedness of holds. Consequently, holds, and thus the unbiasedness of holds.
- If the imputation model is consistent, the DR TATE estimator is consistent if either the regression model or the selection score model of R is correctly specified, and the model is correctly specified. Proof: When either or is consistent, the consistency theory of the DR TATE estimator in large samples has already been established in [5]. Therefore, we focus on proving the consistency of and here. (1) If the selection score model and the model are correctly specified, is consistent, as proved earlier. Since equals or based on Corollary 3.7, is also consistent. (2) Given that the imputation model is consistent, if the outcome regression model is correctly specified, estimated using the imputed data still maintains consistency under the conditions specified in [2,5]. Therefore, at least one of or is consistent, and thus is consistent.
We will provide full proofs in the revised manuscript.
We hope the new discussion will fully address your remaining concerns, and we would really appreciate it if you could be generous in raising your score further. Thank you again for your valuable suggestions.
References
[1] Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.
[2] Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of econometrics, 4, 2111-2245.
[3] Miao, W., & Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2), 475-482.
[4] Colnet, B., Josse, J., Varoquaux, G., & Scornet, E. (2022). Causal effect on a target population: a sensitivity analysis to handle missing covariates. Journal of Causal Inference, 10(1), 372-414.
[5] Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data. John Wiley & Sons.
This paper studies the problem of generalizing RCTs under environment shifts, particularly shifts in the distribution and quantity of covariates. It relaxes the assumption in the prior literature where the separating set (variables that simultaneously affect treatment effect heterogeneity and environmental shifts) is present in both the RCT dataset and target observational dataset. Instead it assumes that the separating set belongs to at least one of the two datasets. Leveraging a shadow variable, the authors introduce a novel Two-Stage Doubly Robust (2SDR) estimator for the Target Average Treatment Effect (TATE). They support their solution through identification theory and demonstrate its effectiveness on both synthetic and real datasets.
Questions For Authors
(1) How realistic is it to assume the existence of a shadow variable Z that satisfies Assumption 3.2? Under what conditions can we expect this to hold?
(2) Will results deteriorate smoothly w.r.t. the extent to which this Assumption 3.2 is satisfied?
(3) Typically, under environmental shifts, there may also be Y|X shifts. It would be beneficial if the authors discussed situations where we expect only X-shifts and where both X- and Y-shifts might occur, thereby clarifying where 2SDR can be effectively applied.
Claims And Evidence
(1) The paper extends the TATE identifiability conditions, which traditionally require the separating set to be fully observable in both the RCT and the target observational datasets. Instead, it allows for partial observability, requiring the separating set to be present in at least one of the datasets. This generalization is theoretically supported by a novel identifiability framework.
(2) The authors propose a Two-Stage Doubly Robust (2SDR) TATE estimator that leverages a shadow variable Z. Empirically, the estimator outperforms existing baselines.
Methods And Evaluation Criteria
Yes, the proposed method appears to be novel, and the evaluation seems to be well-suited and thorough.
Theoretical Claims
I have reviewed the proofs of Lemma 3.4 and Lemma 3.5, and they seem to be correct to me.
Experimental Designs Or Analyses
I haven’t reviewed the dataset generation procedures in detail, but the overall experimental design and analysis seem appropriate and suggest that 2SDR outperforms the alternatives.
Supplementary Material
I have reviewed the proofs of Lemma 3.4 and Lemma 3.5.
Relation To Broader Scientific Literature
The paper generalizes the TATE identifiability conditions, which traditionally require the separating set to be fully observable in both the RCT and target observational datasets, to allow for partial observability. In this case, the separating set needs to be a subset of one of the datasets.
In general, generalizing RCT results beyond the sites where they are conducted is an important problem. This paper addresses environmental shifts, particularly changes in the distribution and quantity of covariates, to enhance the generalizability of RCT results.
Essential References Not Discussed
The relevant literature appears to be sufficiently discussed.
Other Strengths And Weaknesses
Strengths
(1) The extended TATE framework appears to be novel and well-motivated. The relaxation of the separating set to belong to just one of the datasets offers practical value, as it accounts for potential covariate shifts or missing data when generalizing to other sites. (2) Experimental results suggest that 2SDR outperforms other baselines on real datasets, particularly in the JTPA dataset experiment, where it improves estimates for sites different from the RCT data.
Weaknesses
(1) Assumption 3.2 seems to be a strong condition that may be difficult to satisfy in practice (see questions below). (2) A more detailed discussion is needed to convey the intuition behind why the 2SDR estimator performs better.
Other Comments Or Suggestions
(1) A more detailed practical example for Settings 1 and 2 would be helpful.
(2) Figure 1 is not very clear or readable unless the entire Section 2 is read. There is room for improving the figure by making some of the definitions more explicit.
(3) Readability of Section 4.1.2 needs some improvement (maybe by explaining what is happening in each step or having a diagram).
We sincerely appreciate the reviewer’s great efforts and insightful comments to improve our manuscript. Below, we address these concerns point by point.
[Other Comments Or Suggestions]
Thank you for your valuable suggestions. We will carefully revise our manuscript based on your recommendations.
- We will provide more detailed descriptions of the practical examples for Settings 1 and 2.
- We will improve the related definitions in the figure and its caption to make them clearer.
- We will add a brief summary at the beginning of each proof step to explain what is happening in that step, and we will also provide a diagram to further illustrate it.
[Q1] How realistic is it to assume the existence of a shadow variable Z that satisfies Assumption 3.2?
Assumption 3.2 requires that among the common covariates shared by the two datasets, there exist variables that are correlated with the covariates missing in one of the datasets but do not directly influence environmental shifts (not a direct cause of R). Assumption 3.2 is reasonable in many real-world scenarios, as, typically, most covariates only indirectly affect R, with relatively few variables being direct causes of R. Moreover, Assumption 3.2 is testable:
- Assumption 3.2(1), i.e., , is testable because all three variables are observable in the dataset with , where indicates that is observable in this dataset.
- Assumption 3.2(2), i.e., , is testable for the following reason. Although is unobservable in the dataset with , we can still conduct the test based on Theorem 4.1, which does not require the values of in the dataset with since the only expression involving in the equation to be solved, i.e., , is always zero when .
Therefore, we can assess whether Assumption 3.2 holds in real-world applications by conducting hypothesis tests on the covariates.
[Q2] Will results deteriorate smoothly w.r.t. the extent to which this Assumption 3.2 is satisfied?
Thank you for your insightful suggestion. We have conducted experiments in cases where Assumption 3.2 is violated. Specifically, we conducted experiments on the synthetic dataset under the following cases:
- Case 1: Assumption 3.2(2) is violated. We changed the coefficient of Z on S from 0 to {0.1(Low), 0.3(High)} to introduce weak dependence between Z and S;
- Case 2: Assumption 3.2(1) is violated. We reduced the correlation coefficient between Z and from 0.8 to {0.3(Low), 0.1(High)}, so that Z only has a weak predictive ability for .
| Extent of violation | Case 1 | Case 2 |
|---|---|---|
| High | 0.355 ± 0.319 | 0.383 ± 0.280 |
| Low | 0.311 ± 0.293 | 0.341 ± 0.281 |
| None | 0.268 ± 0.212 | 0.268 ± 0.212 |
The results demonstrate that while the performance of 2SDR does deteriorate when Assumption 3.2 is violated, it deteriorates smoothly as the extent of the violation increases. We will include the results and analysis of this experiment in the revised manuscript.
[Q3] Typically, under environmental shifts, there may also be Y|X shifts. It would be beneficial if the authors discussed situations where we expect only X-shifts and where both X- and Y-shifts might occur, thereby clarifying where 2SDR can be effectively applied.
Thank you for your valuable suggestion. 2SDR relies on Assumption 3.1, which requires that holds. Therefore, the applicable situations for 2SDR can be summarized as follows:
- X-shifts. As stated in Definition 2.1, the original definition of environmental shift in our work is X-shifts, i.e., .
- Both X- and Y-shifts caused by X. Although we did not explicitly state this in Definition 2.1, due to the fact that variables in X may be causes of Y, the distribution of Y will also shift along with the distribution of X, i.e., , or alternatively, .
- shifts. Given that holds, we have . Therefore, our problem setting essentially assumes the absence of Y|X shifts. However, for the common covariates shared by the two datasets, there still exist shifts, i.e., . Assumption 3.2 required by 2SDR is still satisfied under shifts, and therefore, 2SDR can address shifts. However, existing methods rely on Assumption 2.8, which requires that holds. Under shifts, Assumption 2.8 does not hold, and thus, in contrast to 2SDR, previous methods cannot address shifts.
We will include the above discussion in the revised manuscript.
We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if you could be generous in raising your score. Thank you!
Thanks for your detailed response. My concerns have been appropriately addressed, and I have raised my score to 4.
We are glad to have addressed your concerns and sincerely appreciate your support for our work, as well as your valuable suggestions for improving it. Thank you!
This paper introduces a two-stage method for generalizing treatment effects from RCTs to target populations facing environmental shifts. Unlike common solutions, the proposed procedure relaxes the assumption that the separating set of covariates is fully observable in both the RCT and target data. Instead, it only requires the separating set to be present in at least one of the two data groups and utilizes shadow variables to impute missing covariates in a two-stage process.
Reviewers generally acknowledged the novelty and importance of the problem and found the experimental results promising, particularly on real-world datasets like JTPA. A key concern was the realism regarding the existence and properties of shadow variables, which was partially addressed by the author response. More detailed and practical examples will significantly strengthen the paper further.