Optimal Transport for Reducing Bias in Causal Inference without Data Splitting
Abstract
Reviews and Discussion
The paper proposes a novel method for addressing covariate shift in observational data when estimating treatment effects, based on optimal transport. The benefit of the method is that data does not need to be split into groups and that the method can easily be applied to both binary and continuous treatments. Extensive theoretical and empirical results are provided to validate the efficacy of the method.
Strengths
- The method addresses a clear problem and provides a natural solution.
- The paper provides extensive theoretical and empirical results.
- Despite the complex topic, the paper is generally easy to follow.
Weaknesses
- I found it hard to judge the novelty and contribution of the work with respect to existing work:
- There is earlier work on applying optimal transport for treatment effect estimation, which is not discussed [1, 2]. How does this work compare to your own? Is there any reason why these methods were not included as benchmarks?
- Apart from the data splitting, how exactly is your method different from CFR_Wass in Shalit et al. (2017)? Understanding this would make your own contribution more clear. Did you use CFR's version with Wasserstein distance in your experiments?
- The method introduces significant complexity and I would like to see more discussion of technical details:
- Tuning of the trade-off hyperparameters, as well as model selection in general, is not discussed. As this is a difficult problem in CATE estimation generally, I would like to see how the authors addressed this. Additionally, a sensitivity analysis is provided for one of the hyperparameters but not for the other. Is there any reason why no such analysis is presented for it?
- The efficiency of the method is not discussed. Does the Sinkhorn algorithm make your training procedure much slower?
- The conclusion summarizes the work, but does not include limitations of the proposed method or suggestions for future work.
[1] Wang, H., Fan, J., Chen, Z., Li, H., Liu, W., Liu, T., ... & Tang, R. (2024). Optimal transport for treatment effect estimation. Advances in Neural Information Processing Systems, 36.
[2] Li, Q., Wang, Z., Liu, S., Li, G., & Xu, G. (2021). Causal optimal transport for treatment effect estimation. IEEE transactions on neural networks and learning systems, 34(8), 4083-4095.
Questions
- See weaknesses points 1, 2, and 3.
- The notation used for the potential outcome is a bit confusing, as that symbol is typically reserved for the treatment effect; I would suggest a different one. Related to this, previous work mostly focuses on estimating the treatment effect rather than the potential outcome, i.e., PEHE instead of AMSE. Could your theoretical results be applied to PEHE as well?
- Any reason why evaluation for continuous treatments was not done with the Mean Integrated Squared Error (MISE)?
Minor points:
- Missing space in line
- Missing space in line 845
Thank you for the valuable and insightful comments. We will revise the submission according to the comments. Our responses are as follows.
Q1: Comparison with [1,2]
A1:
- Both [1] and [2] consider the binary treatment setting. Different from them, we propose a unified framework for both binary and continuous treatment settings.
- [1] reduces the distribution shift between the treatment group and the control group. Different from it, we reduce the shift between each conditional covariate distribution and the marginal distribution.
- [2] considers an optimal transport problem between the factual and counterfactual distributions, which is different from our optimal transport model. In addition, [2] relies on a linear potential outcome model, whereas we employ a neural network to train a nonlinear outcome model.
- We have included the methods in [1,2] as baselines in the revision. Our method achieves better performance. The results are as follows.
| Methods | IHDP | IHDP | IHDP |
|---|---|---|---|
| ORIB | 1.1129 ± 1.4290 | 0.2134 ± 0.3488 | 1.1976 ± 1.3822 |
| ESCFR | 1.2443 ± 2.1300 | 0.4112 ± 0.5902 | 1.3498 ± 2.1298 |
| CausalOT | 13.8269 ± 13.5417 | 2.4498 ± 0.8065 | 7.3281 ± 6.2416 |
Q2: Difference from CFR-Wass.
A2:
Apart from data splitting, there are two major differences between our method and CFR-Wass.
- CFR-Wass reduces the Wasserstein distance between the treatment group and the control group. Different from it, we reduce the Wasserstein distance between each conditional distribution and the marginal distribution.
- CFR-Wass assigns equal weights to the samples in one group. Different from it, we introduce the generalized propensity score into the optimal transport model as the sample weights.
Yes, we conduct CFR with the Wasserstein distance as a compared method in Section 5.2.
Q3: Parameter sensitivity.
A3: The trade-off hyperparameter between the outcome prediction loss and the Wasserstein discrepancies is tuned over a predefined range, as is the trade-off hyperparameter of the entropy regularization. The sensitivity results of the entropy regularization hyperparameter on synthetic data are reported as follows.
| Parameter | |
|---|---|
| 0.0002 | 0.1123 |
| 0.0004 | 0.1111 |
| 0.0006 | 0.1095 |
| 0.0008 | 0.1089 |
| 0.001 | 0.1074 |
| 0.0012 | 0.1096 |
| 0.0014 | 0.1108 |
| 0.0016 | 0.1145 |
| 0.0018 | 0.1156 |
Q4: The efficiency of the method.
A4: Running time results are given as follows. Although our method takes longer due to solving the optimal transport problems, it achieves better performance than the compared methods.
- Continuous treatment setting on synthetic data in one realization
| Methods | Times |
|---|---|
| ORIC | 135s |
| VCNet+TR | 23s |
| VCNet | 17s |
| ADMIT | 47s |
| ACFR | 24s |
| DRNet | 26s |
| GPS+MLP | 25s |
| MLP | 18s |
| GPS | 9s |
| BART | 7s |
| KNN | 8s |
- Binary treatment setting on the IHDP-1000 data in one realization
| Methods | Times |
|---|---|
| ORIC | 76s |
| CFRNet | 47s |
| DragonNet | 41s |
| DKLITE | 4s |
| ESCFR | 165s |
| CausalOT | 4s |
| GANITE | 4s |
| BART | 0.2s |
| OLS | 0.2s |
| KNN | 0.3s |
Q5: Limitations and future work.
A5: Thank you for the valuable suggestions. The major limitations are discussed as follows.
- Our approach to confounding bias reduction relies on the ignorability assumption, which means that all confounders are observed. In the future, we will further investigate the situation with unobserved confounders.
- Our method involves multiple optimal transport problems, which have high computational complexity. In the future, we will consider more efficient algorithms for solving the optimal transport problems.
Q6: Regarding the potential outcome and the treatment effect.
A6:
- Thank you for the valuable suggestion. We will revise the notation for the potential outcome accordingly.
- Since we aim to propose a unified framework for both binary and continuous treatments, we consider the potential outcome and AMSE, which are widely used in the continuous setting and can also be used in the binary setting.
- Our theoretical results can be applied to PEHE. More details are given below.
We use the binary treatment setting as an example to illustrate that our theoretical results can be applied to PEHE. Based on the assumptions in Theorem 2, we first bound PEHE as follows:
$
\varepsilon_{PEHE} =E_{x\sim q(x)}[\ell(h_1(x) - h_0(x), \mu_1(x) - \mu_0(x))] \leq E_{x\sim q(x)}[\ell(h_1(x), \mu_1(x))] +E_{x\sim q(x)}[\ell(h_0(x), \mu_0(x))] =\varepsilon_{q} (h_1) + \varepsilon_{q} (h_0)
$
where $\ell$ is a norm-based loss function satisfying the triangle inequality.
We then define the estimation errors of the potential outcome functions $h_1$ and $h_0$ on the treatment and control groups, respectively:
$
\varepsilon_{q_{1}} (h_1)=E_{x\sim q_{1}(x)}\ell (h_{1}(x) ,\mu_{1}(x))
$
$
\varepsilon_{q_{0}} (h_0)=E_{x\sim q_{0}(x)}\ell ( h_{0}(x) ,\mu_{0}(x))
$
According to Eq. (12), we have
$
\varepsilon_{PEHE} \leq \varepsilon_{q} (h_1) + \varepsilon_{q} (h_0) \leq \varepsilon_{q_1} (h_1) + \varepsilon_{q_0} (h_0) + \mathcal{W}(c, q_1, q) + \mathcal{W}(c, q_0, q)
$
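The triangle-inequality step above can be checked numerically. Below is a toy sanity check (illustrative, not from the paper) using the absolute-error loss, which satisfies the triangle inequality; all functions and values are invented for the demonstration.

```python
# Toy numerical check (illustrative, not from the paper) of the decomposition
# PEHE <= eps_q(h_1) + eps_q(h_0) with the absolute-error loss l(a, b) = |a - b|,
# which satisfies the triangle inequality used in the derivation above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

mu0, mu1 = np.sin(x), np.sin(x) + 2.0 * x        # true outcome functions (toy)
h0, h1 = np.sin(x) + 0.1, np.sin(x) + 1.9 * x    # hypothetical estimators

pehe = np.mean(np.abs((h1 - h0) - (mu1 - mu0)))  # effect-estimation error
eps1 = np.mean(np.abs(h1 - mu1))                 # outcome error, treated arm
eps0 = np.mean(np.abs(h0 - mu0))                 # outcome error, control arm

assert pehe <= eps1 + eps0 + 1e-12               # the bound holds pointwise
```

The inequality holds pointwise for any estimators, since $|(h_1 - h_0) - (\mu_1 - \mu_0)| \leq |h_1 - \mu_1| + |h_0 - \mu_0|$.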
- For the experiments of binary treatments, we also report the results of PEHE, which demonstrate the effectiveness of our method.
Q7: Regarding the metric MISE.
A7: As shown in [3], when considering a continuous dosage without multiple treatment types, the MISE criterion coincides with the AMSE criterion we mention. Therefore, we adopt AMSE in our submission.
[1] Wang H, Fan J, Chen Z, et al. Optimal transport for treatment effect estimation[J]. Advances in Neural Information Processing Systems, 2024, 36.
[2] Li Q, Wang Z, Liu S, et al. Causal optimal transport for treatment effect estimation[J]. IEEE transactions on neural networks and learning systems, 2021, 34(8): 4083-4095.
[3] Kazemi A, Ester M. Adversarially balanced representation for continuous treatment effect estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(12): 13085-13093.
Thank you for your detailed response and helpful changes! I have updated my score.
Thank you for the positive comments!
This paper presents a novel approach for measuring the causal effect given a treatment. The model, called ORIC, is designed to directly predict outcomes rather than causal effects and is based on the formulation of optimal transport to reduce confounding bias in observational data. Unlike typical causal inference models that often split the data according to treatments and train separate models (e.g., T-learner), ORIC is designed to be effectively trained without data splitting, which is the most significant contribution of this paper.
Strengths
While there have been a few instances in the literature over the past few years where the formulation of optimal transport has been borrowed to solve the problem of measuring causal effects, this paper presents yet another new perspective, which I find intriguing. The formulation, which naturally combines representation learning with optimal transport, could be widely utilized in other related research in the future.
Weaknesses
The paper lacks a convincing explanation or example of how avoiding data splitting is practically beneficial. In the case of binary treatment, splitting into just two sub-populations means the sample size is roughly halved, and it is hard to see how this significantly worsens the estimation of the conditional covariate distributions. It is unclear whether the assumption is that one population is much smaller than the other, or that both populations are too small, making splitting problematic. This paper targets observational data rather than RCTs, but in real-world observational data, while there may be confounding bias, the sample size is often large. Therefore, it needs to be explained more clearly in which scenarios this methodology is most beneficial.
The approach in this paper that avoids data splitting starts from equations (3) and (4), where AMSE is defined. It is necessary to explain why minimizing AMSE is a valid goal and to provide more details on the mathematical properties of the results produced by minimizing it. AMSE can also be expressed in a form that includes a certain density term twice. Since the observed distribution is biased, including the term once is necessary; however, including it twice is not immediately convincing. For example, when defining PEHE, the corresponding term is included only once. Because AMSE includes the term twice, through Theorem 3 the integral in equation (16) is calculated over the joint distribution rather than the conditional one, which enables the method to avoid data splitting; if the integral in equation (16) were calculated over the conditional distribution, data splitting would still be required. This paper intentionally designed AMSE to avoid data splitting and aimed to reduce it through its upper bound. Therefore, it is necessary to explain why minimizing AMSE guarantees unbiased and accurate causal inference. As it stands, minimizing AMSE might lead to excessive bias depending on the data: for instance, does it sufficiently minimize the error when the treatment distribution is heavily skewed? Additional proofs or demonstrations seem necessary.
Questions
[1] Please provide a rebuttal or explanation for the points raised in the weaknesses section.
[2] Since the computational complexity of the Sinkhorn algorithm used to calculate optimal transport increases quadratically with the number of data points, the algorithm in this paper would have a significant computational load for large datasets. How does it compare to other algorithms?
[3] In Figure 1, the neural network is not depicted. How is the neural network implemented? The explanation of the model is insufficient, making lines 360 to 364 somewhat ambiguous. How does the model depend on the treatment?
[4] Can this paper be understood in the context of S-learner vs. T-learner? In other words, is this paper proposing an S-learner framework?
Thank you for the valuable and insightful comments. We will revise the submission according to the comments. Our responses are as follows.
Q1: Explanation regarding data splitting.
A1:
- Theoretically, more samples are helpful for better distribution estimation and alignment, which is supported by Eq. (15) in Theorem 3, where the sample-size-dependent term indicates that more samples induce a tighter upper bound.
- In practice, fewer samples bring higher variance. This issue becomes even worse when multiple or continuous treatments are considered: more possible treatment values induce fewer samples in each subset, resulting in higher variance and inaccurate distribution estimation and alignment. Therefore, it is important to improve data efficiency by leveraging more samples [1].
Q2: Regarding AMSE, Eq. (16), and data splitting.
A2:
- AMSE is unbiased since the prediction loss is measured on the unbiased marginal distribution. AMSE is usually adopted as the minimization target in causal inference for continuous treatments [2]. Since all combinations of covariates and treatments are considered and the integral is taken over the product of the marginals, the treatment assignment mechanism does not affect AMSE as long as every treatment value has positive probability, which is satisfied because of the positivity assumption.
- AMSE involves counterfactual outcomes and is intractable. Different from AMSE, the factual loss in Eq. (16) is tractable but is measured on the conditional distributions rather than the marginal distribution. We propose to balance the conditional and marginal distributions by minimizing the Wasserstein distance between them, so that a model trained on the balanced representations generalizes well to the whole population. As shown in Theorem 3, minimizing these Wasserstein distances reduces the upper bound of AMSE.
- For the Wasserstein distance involving a conditional distribution, we do not conduct data splitting. Instead of considering only the samples receiving a specific treatment, we model the conditional distribution in Eq. (14), where the weights are estimated based on the generalized propensity scores. As a result, all training samples are involved in the empirical distributions, which avoids the issue of data splitting and enhances the performance of distribution estimation.
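As an illustration of this alignment step, here is a minimal Sinkhorn sketch (illustrative, not the paper's exact model): it computes entropic optimal transport between a weighted empirical distribution, whose weights play the role of the generalized propensity scores, and the uniform empirical marginal over all samples, so no sample is discarded. All variable names and parameter values are assumptions for the demonstration.

```python
# Minimal Sinkhorn sketch (illustrative): entropic OT between a weighted
# empirical distribution a (weights standing in for generalized propensity
# scores) and the uniform marginal b over all n samples.
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=500):
    """Return the transport plan P and the entropic OT cost <P, C>."""
    K = np.exp(-C / eps)            # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)           # column scaling
        u = a / (K @ v)             # row scaling
    P = u[:, None] * K * v[None, :]
    return P, float((P * C).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 2))                        # sample representations (toy)
w = rng.random(50)
a = w / w.sum()                                     # propensity-style weights
b = np.full(50, 1.0 / 50)                           # uniform marginal
C = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cost
C = C / C.max()                                     # normalize for stability
P, cost = sinkhorn(a, b, C)

assert np.allclose(P.sum(axis=1), a, atol=1e-6)     # marginal constraints hold
assert np.allclose(P.sum(axis=0), b, atol=1e-6)
```

Each Sinkhorn iteration costs O(n^2) due to the kernel-matrix products, which matches the efficiency discussion elsewhere in the thread.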
Q3: Computational complexity.
A3: Running time results are given as follows. Although our method takes longer due to solving the optimal transport problems, it achieves better performance than the compared methods.
- Continuous treatment setting on synthetic data in one realization
| Methods | Times |
|---|---|
| ORIC | 135s |
| VCNet+TR | 23s |
| VCNet | 17s |
| ADMIT | 47s |
| ACFR | 24s |
| DRNet | 26s |
| GPS+MLP | 25s |
| MLP | 18s |
| GPS | 9s |
| BART | 7s |
| KNN | 8s |
- Binary treatment setting on the IHDP-1000 data in one realization
| Methods | Times |
|---|---|
| ORIC | 76s |
| CFRNet | 47s |
| DragonNet | 41s |
| DKLITE | 4s |
| ESCFR | 165s |
| CausalOT | 4s |
| GANITE | 4s |
| BART | 0.2s |
| OLS | 0.2s |
| KNN | 0.3s |
Q4: Implementation of the generalized propensity score.
A4: We are sorry for the confusion. The implementation is based on [3] and described as follows. We assume that the conditional distribution of the treatment given the covariates is Gaussian, i.e., $t \mid x \sim \mathcal{N}(\beta^{\top} x, \sigma^{2})$. We estimate the parameters by maximizing the likelihood:
$
(\hat{\beta}, \hat{\sigma}^{2}) = \arg\max_{\beta, \sigma^{2}} \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\Big( -\frac{(t_i - \beta^{\top} x_i)^{2}}{2\sigma^{2}} \Big).
$
After that, the estimated generalized propensity score is given by:
$
\hat{r}(t, x) = \frac{1}{\sqrt{2\pi\hat{\sigma}^{2}}} \exp\Big( -\frac{(t - \hat{\beta}^{\top} x)^{2}}{2\hat{\sigma}^{2}} \Big).
$
We will revise the submission to make this clear.
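For concreteness, the Gaussian-MLE estimation of the generalized propensity score described in [3] can be sketched as follows (an illustration on synthetic data; the linear mean model and all names are assumptions, not the paper's exact implementation):

```python
# Sketch of Hirano-Imbens-style GPS estimation under a Gaussian model
# t | x ~ N(beta^T x, sigma^2). Under this model, maximizing the likelihood
# reduces to ordinary least squares for beta and the mean squared residual
# for sigma^2. Synthetic data; illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -0.5, 0.25])
t = X @ beta_true + 0.3 * rng.normal(size=n)       # continuous treatment

Xb = np.column_stack([np.ones(n), X])              # add an intercept
beta_hat, *_ = np.linalg.lstsq(Xb, t, rcond=None)  # Gaussian MLE = least squares
resid = t - Xb @ beta_hat
sigma2_hat = np.mean(resid ** 2)                   # MLE of the noise variance

def gps(t_val, x_row):
    """Estimated generalized propensity score r_hat(t, x)."""
    mean = np.concatenate(([1.0], x_row)) @ beta_hat
    return np.exp(-(t_val - mean) ** 2 / (2 * sigma2_hat)) / np.sqrt(2 * np.pi * sigma2_hat)

scores = np.array([gps(t[i], X[i]) for i in range(n)])
assert scores.min() > 0 and np.isfinite(scores).all()
```

The density evaluated at each sample's own treatment value then serves as that sample's weight in the downstream model.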
Q5: Regarding S-learner and T-learner.
A5:
- S-learner learns one model to predict potential outcomes, while T-learner learns multiple models to predict the potential outcomes of different treatments separately. In this sense, our method can be regarded as an S-learner.
- The focus of our submission is avoiding data splitting to improve data efficiency, rather than the comparison between S-learner and T-learner. Even for an S-learner, when considering distribution alignment between groups receiving different treatments, reducing the distribution shift after data splitting is still a common choice. Different from these methods, we propose to avoid data splitting and leverage more samples for learning.
[1] Kaddour J, Zhu Y, Liu Q, et al. Causal effect inference for structured treatments[J]. Advances in Neural Information Processing Systems, 2021, 34: 24841-24854.
[2] Nie L, Ye M, Nicolae D. VCNet and functional targeted regularization for learning causal effects of continuous treatments. International Conference on Learning Representations, 2021.
[3] Hirano K and Imbens GW. The propensity score with continuous treatments. In: Gelman A and Meng XL (eds) Applied bayesian modeling and causal inference from incomplete-data perspectives. Oxford, UK: Wiley, 2004, pp.73–84.
Questions 3, 4, and 5 have been resolved with your answers. Thank you. However, I still have questions regarding questions 1 and 2.
Q 1: Of course, the number of samples is important, and if it decreases, the uncertainty of the model increases. The fact that the term "data splitting" is included in the title indicates that it is the most important contribution of this paper. I want to gauge the practical significance of this contribution. For example, in a situation where only binary treatment (0 and 1) is considered, it is understood that the sample size is halved with existing methods and models, whereas this model can prevent the sample size from being halved. However, halving the sample size does not seem to be a major constraint. If there is a model that works properly with 1,000 samples, and we already have 2,000 samples, splitting them in half still gives us 1,000 samples. My question is whether the concept of data splitting is based on the number of treatments as I understand it, and if so, whether the effect of this paper is not significant for binary treatment. If so, what scenario would this methodology be most effective in?
Q 2: I have reviewed the referenced paper [2]. I am a bit confused about whether the error metric used there is equivalent to equation (3) in the author's paper. Is there another paper that uses AMSE? If so, could you share it? What do you think about my comparison with PEHE above? PEHE and AMSE seem to have different properties. Even if the treatment density is positive, if it is very close to 0, AMSE effectively considers only one side. In fact, PEHE can have the density attached twice: since the integrand is symmetric for binary treatment, attaching it twice still results in PEHE. On the other hand, if an integrand has treatment-dependence like the integrand of AMSE, attaching the density once more may cause it to skew towards the side where the density is larger. I am not 100% sure about this and believe you, as someone who has conducted the research and experiments, would know more. The process of calculating the upper bound of AMSE using rigorous and sophisticated mathematical methodology is beautiful. My concern is whether optimizing based on that AMSE ensures unbiased inference, and I want to have a solid understanding of this.
Thank you for the reply and valuable comments. Our responses are as follows.
Q1: Regarding data splitting.
A1: Yes, data splitting is related to the number of treatment values. As the number of treatment values increases, the number of samples in each group decreases, and the issue of data splitting becomes more severe. For the binary treatment setting, the issue is milder than for settings with more treatment values. For continuous treatments, however, the issue becomes much more severe: the treatment is discretized into multiple values, and each group only includes a small fraction of the samples, greatly reducing data efficiency and the precision of distribution estimation.
Q2: Regarding AMSE and PEHE.
A2:
- The error metric in [2] is equivalent to Eq. (3) of our submission, in which the notation for the potential outcome is revised according to the comments of Reviewer E4XK.
- AMSE is also used in [a][b], where it is called EMSE and MISE, respectively.
- AMSE is the expected loss measured on the marginal distribution. Following [a][b], it is defined as follows
$
AMSE = \int_{\mathcal{T}} \int_{\mathcal{X}} \ell(h_t(x), \mu_t(x)) p(x) p(t) dx dt.
$
For the treatment effect, the generalized $PEHE_g$ compares the outcome of each treatment with the outcome under $t=0$ and is given as follows [c]
$
PEHE_g = \int_{\mathcal{T} \setminus \{0\}} \int_{\mathcal{X}} \ell \big( h_t(x) - h_{0}(x), \mu_t(x) - \mu_{0}(x) \big) p(x) p(t \mid t \neq 0) dx dt,
$
and the pairwise $PEHE_p$ compares the outcomes of each treatment pair and is given as follows [d][e]
$
PEHE_p = \int_{\mathcal{T} \times \mathcal{T}} \int_{\mathcal{X}} \ell \big( h_t(x) - h_{t'}(x), \mu_t(x) - \mu_{t'}(x) \big) p(x) p(t) p(t') dx dt dt'.
$
From the above definitions, we observe that in the multiple-treatment setting, if some treatment value $t$ has a very small probability $p(t)$, both the outcome error in AMSE and the causal effect error in PEHE related to $t$ tend to be ignored. In this sense, AMSE and PEHE have similar behaviors in the multiple-treatment setting.
In the binary setting, the treatment takes only two values, i.e., $t=0$ and $t=1$, and both $PEHE_g$ and $PEHE_p$ degenerate to the following form
$
PEHE_b = \int_{\mathcal{X}} \ell \big( h_1(x) - h_{0}(x), \mu_1(x) - \mu_{0}(x) \big) p(x) dx.
$
As there is only one combination of different treatment values, the issue caused by a small $p(t)$ does not exist. In this situation, AMSE and PEHE indeed have different properties.
Since we consider a unified framework for both binary and continuous settings, we mainly focus on AMSE in the submission. According to A6 to Reviewer E4XK, for the binary setting, the upper bound derived for AMSE also upper-bounds PEHE, and our theoretical analysis based on AMSE can be easily extended to PEHE.
This is indeed an interesting question, which prompts deep thinking and enhances the understanding of AMSE and PEHE.
- In causal inference, confounding bias induces estimation bias of the potential outcomes: the loss on the observed conditional distributions is a biased estimate of the desired population loss on the marginal distribution, and a model minimizing the loss on the subsets cannot be well generalized to the whole population [f]. By minimizing AMSE, the expected loss between the predicted and true potential outcomes is minimized on the marginal distribution, which is independent of the treatment assignment and thus free of confounding bias.
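To make the asymmetry concrete, here is a tiny numerical illustration (invented numbers, not from the paper): with discrete treatments, AMSE weights each arm's outcome error by $p(t)$, so a heavily skewed $p(t)$ nearly removes the rare arm's error from the objective, whereas the binary $PEHE_b$ above involves no $p(t)$ weighting.

```python
# Toy illustration (invented numbers): AMSE = sum_t p(t) * E_x[l(h_t, mu_t)]
# weights each arm's outcome error by p(t). With a heavily skewed p(t), the
# rare arm's large error nearly vanishes from the objective, while the binary
# PEHE_b integrates over p(x) only and treats both arms symmetrically.
err = {0: 0.01, 1: 1.0}                       # per-arm expected outcome errors

amse_balanced = 0.5 * err[0] + 0.5 * err[1]   # p(t=1) = 0.5
amse_skewed = 0.95 * err[0] + 0.05 * err[1]   # p(t=1) = 0.05 (heavily skewed)

assert amse_skewed < amse_balanced            # rare arm's error is downweighted
```

Here the skewed AMSE is about 0.06 despite the treated arm's error of 1.0, which is precisely the behavior the reviewer is probing.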
[a] Wang X, Lyu S, Wu X, et al. Generalization bounds for estimating causal effects of continuous treatments. Advances in Neural Information Processing Systems, 2022, 35: 8605-8617.
[b] Kazemi A, Ester M. Adversarially balanced representation for continuous treatment effect estimation. Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(12): 13085-13093.
[c] Guo X, Zhang Y, Wang J, Long M. Estimating heterogeneous treatment effects: Mutual information bounds and learning algorithms. In International Conference on Machine Learning 2023.
[d] Schwab, P., Linhardt, L., and Karlen, W. Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks. arXiv preprint arXiv:1810.00656, 2019.
[e] Kaddour, J., Zhu, Y., Liu, Q., Kusner, M. J., and Silva, R. Causal effect inference for structured treatments. Advances in Neural Information Processing Systems, 34: 24841–24854, 2021.
[f] Johansson FD, Shalit U, Kallus N, Sontag D. Generalization bounds and representation learning for estimation of potential outcomes and causal effects. Journal of Machine Learning Research. 2022;23(166):1-50.
You provided good answers to my questions. However, my remaining concern is that while PEHE is symmetric with respect to the treatment, AMSE is not. In other words, if the treatment distribution is heavily skewed, it is still unclear what results AMSE will produce. Therefore, I will raise my score to 5 instead of 6 or 8. I fundamentally think this is a good paper. If it gets accepted, I suggest including an example in the appendix showing that AMSE provides good results without bias even when the treatment distribution is heavily skewed. If not, this paper might be understood as having a flaw in general application.
Thank you for the reply and valuable comments. Our responses are as follows.
It is difficult to construct a situation where the treatment distribution itself is heavily skewed. Instead, we approximate a skewed distribution by setting the ratio $n_t : n_c$, where $n_t$ and $n_c$ are the numbers of samples in the treated and control groups, respectively; we remove some samples to obtain skewed group sizes. The results on the News data are reported as follows, where t:c denotes $n_t : n_c$. Our method consistently achieves promising performance.
| t:c | 1:10 | 1:10 | 1:10 | 1:5 | 1:5 | 1:5 | 10:1 | 10:1 | 10:1 | 5:1 | 5:1 | 5:1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| metric | | | | | | | | | | | | |
| CFRNet | | | | | | | | | | | | |
| ORIC | | | | | | | | | | | | |
Besides, for the IHDP data, the treatment distribution is also skewed, and our method also performs well.
This work mainly studies the optimal transport method to reduce bias in causal inference. The authors propose a distribution alignment paradigm that leverages all training samples, avoiding the need for data splitting, which is a common limitation of existing methods. The theoretical contributions and empirical results demonstrate the effectiveness of the proposed method in both binary and continuous treatment settings.
Strengths
- The paper is well-structured and flows naturally.
- The paper designs a balancing algorithm that can effectively reduce confounding bias without data splitting.
- The experimental studies are well done, and a sufficient amount of empirical evidence for the proposed method is provided.
Weaknesses
- Theorem 2 holds under the assumption that the hypothesis space is an RKHS. Given this, it is unclear why the authors did not opt to use kernel methods for learning representations, as they seem more aligned with the conditions of the theorem.
- While the experimental results show notable improvements, it is not entirely clear why the method performs so well. The theoretical bounds appear loose, so it would be valuable to explore why the method is still effective in practice. Gaining a deeper understanding would make the approach more reliable.
- It would be beneficial to conduct ablation studies on the loss function involving Wasserstein distances. This could help identify which components contribute most to the performance gains and provide further insight into the tightness of the theoretical bounds.
- What are the key advantages and differences between the authors' debiasing method and other re-weighting approaches? Clarifying this comparison would be helpful.
Questions
- The algorithm relies on the ignorability assumption for its validity. I am curious about its robustness in the presence of unobserved confounders. Could it still perform well under such circumstances?
- If the covariates are high-dimensional, how do you estimate the probability mass? I worry about the curse of dimensionality.
I will raise the score if these items are addressed.
Thank you for the valuable and insightful comments. We will revise the submission according to the comments. Our responses are as follows.
Q1: Regarding the assumption of RKHS.
A1:
- The kernel method requires calculating kernel functions between all pairs of samples, which has high computational complexity.
- The kernel function between two samples $x_i$ and $x_j$ is the inner product of their mapped representations, i.e., $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. In practice, we employ a neural network to approximate the mapping function $\phi$. A similar approach has also been used in [1].
Q2: Explanation of the performance improvement.
A2: The performance improvement mainly comes from two perspectives.
- Existing distribution alignment methods usually split the samples into multiple subsets, each of which only includes the samples receiving a specific treatment. Different from them, we do not split the data into smaller subsets, which means that more samples are leveraged for learning. This advantage is theoretically supported by Eq. (15) in Theorem 3, where the sample-size-dependent term indicates that more samples induce a tighter upper bound.
- For balanced representation learning achieved by minimizing the Wasserstein distances, we further introduce the generalized propensity scores into the model. This approach adaptively assigns weights to the samples, making samples with higher generalized propensity scores contribute more to the model.
Q3: Ablation study on the loss function involving Wasserstein distances.
A3: We have conducted the ablation study on the loss function involving Wasserstein distances in the revision. The results are reported as follows.
| Methods | Synthetic | Synthetic | Synthetic | Synthetic | IHDP | News |
|---|---|---|---|---|---|---|
| ORIC | 0.1098 ± 0.0273 | 0.1234 ± 0.0388 | 0.1313 ± 0.0464 | 0.1168 ± 0.0316 | 0.3595 ± 0.0304 | 0.1507 ± 0.0406 |
| ORIC without Wass | 0.2077 ± 0.0238 | 0.2028 ± 0.0203 | 0.2022 ± 0.0210 | 0.2161 ± 0.0157 | 0.6303 ± 0.0826 | 0.4255 ± 0.2115 |
| ORIC without Wass and GPS | 0.2083 ± 0.0275 | 0.2042 ± 0.0311 | 0.2044 ± 0.0252 | 0.2044 ± 0.0252 | 0.6566 ± 0.0710 | 0.4355 ± 0.2098 |
Q4: Comparison with other re-weighting approaches.
A4:
- Propensity-score-based methods employ propensity scores to re-weight samples. Different from them, we leverage propensity scores to model the conditional distributions and incorporate them into the optimal transport model for balanced representation learning.
- Some re-weighting methods learn sample weights to balance the distributions of different groups, where the distribution shift is measured by a predefined metric. However, for distributions with a very large shift, re-weighting cannot reduce the shift well since the support sets are fixed. Different from them, we characterize the confounding bias via the balancing error between the conditional and marginal distributions, as shown in Eqs. (6) and (7), and connect the confounding bias with optimal transport, which motivates us to learn balanced representations to reduce the confounding bias.
Q5: Regarding unobserved confounders.
A5: Since we characterize the confounding bias by measuring the discrepancy between the conditional and marginal distributions, the ignorability assumption is required. If unobserved confounders exist, the confounding bias cannot be fully captured by considering the observed covariates only. We will investigate the situation with unobserved confounders in the future.
Q6: How to estimate the probability mass for high-dimensional variables?
A6: The probability mass is estimated via the generalized propensity scores, which can be estimated by a regression model. For high-dimensional variables, we can reduce the dimensionality with PCA or neural networks, or introduce a sparsity-inducing norm to select informative variables.
[1] Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. International conference on machine learning. PMLR, 2017: 3076-3085.
The paper suggests the use of a novel estimator of conditional average treatment effects and related causal objects when there is confounding explained by observable covariates. The method can be adjusted to allow for continuous treatments as well as discrete treatments. The proposed methods employ a particular adjustment for covariate imbalance that is based on an optimal transport. The authors also propose a related measure of covariate imbalance which they suggest may be used to assess the threat of confounding bias. The authors benchmark their methods against alternative approaches on established synthetic datasets.
优点
The use of optimal transport methods to adjust for covariate imbalance seems to me a promising area of research. The performance of the proposed methods on synthetic data is very encouraging.
缺点
I found the motivation for the methods unclear. The authors point out that the difference between the feasible loss function and the infeasible loss function can be bounded by a Wasserstein distance between the conditional and marginal distributions. It does not follow immediately that minimizing the sum of the feasible loss and this distance metric should lead to superior estimation performance. Indeed, the objective in equation (20) is minimized over three parameters, but only one of them is shared between the feasible loss estimate and the Wasserstein metric, and the outcome model depends only on the other two. Thus the benefits of including this Wasserstein distance in the optimization problem must result only from improved feature learning (i.e., an improved choice of representation). In fact, I was rather perplexed that the loss did not incorporate the generalized propensity score. It seemed to me from earlier parts of the paper that calculating the generalized propensity score by optimal transport, and using this to recover covariate balance, was likely to be the main point of the paper. But unless there is an error in the description of the methods, the generalized propensity score is not directly used in learning the outcome model, only impacting that problem through its impact on the choice of representation. Incidentally, this in itself is rather complicated because the cost function in the optimal transport is based on the learned representation, and setting the representation to be identically zero would trivially make the Wasserstein metric equal to zero.
Perhaps I am missing something here, particularly given the very good performance on the benchmarking data, but I spent rather a long time trying to understand this, so at the very least I do not think it is well explained.
I also found the description of the algorithm confusing. The algorithm contains a loop whose final step is to minimize the objective in (20) and the loop is iterated until convergence. But if we minimize this objective, then what changes in each iteration of the loop?
Finally, I find the description of the existing literature to be somewhat narrow. Causal inference methods have been developed over many decades and ML methods that incorporate data-splitting represent only a very recent and thin stratum of this much broader literature. In my opinion the authors ought to clarify when they are talking about causal inference methods as a whole, and when they are referring specifically to only recent causal inference methods from the ML literature.
Questions
Why does the loss not incorporate the generalized propensity score? Given that it does not, what is the purpose of the Wasserstein part of the loss? Does it play a role other than improving the choice of $\phi$? If its only purpose is indeed to improve $\phi$, why should it achieve this? I am very curious to better understand the methods, particularly given the apparent superior performance in Section 5, but until I am able to grasp how/why the proposed methods might work I do not feel I can recommend accepting the paper.
I wondered how GPS (without MLP) might perform on the data, and likewise for some other classical or non neural-network-based methods?
Thank you for the valuable and insightful comments. We will revise the submission accordingly. Our responses are as follows.
Q1: Explanation of the method.
A1:
- Following [1], we aim to reduce the AMSE, i.e., the unbiased outcome estimation error over all treatment values. Based on Lines 289-296 and Eq. (12), the AMSE is upper-bounded by the sum of the feasible (biased) loss and the Wasserstein distances between the conditional and marginal distributions of the representations. This analysis motivates us to minimize the upper bound of the AMSE, i.e., the sum of the feasible loss and the Wasserstein distances, and to design the algorithm in Section 4.3.
- The generalized propensity score is modeled as $\pi(t \mid \phi(x))$ on top of the learned representation and is used in the minimization of the Wasserstein distances, as shown in Eqs. (18) and (19), which tends to yield balanced representations by learning $\phi$.
- Without balanced representations, the outcome prediction model will be trained on a biased data distribution: the model could fit the conditional distribution $p(x \mid t)$ (i.e., samples receiving the treatment $t$) while not generalizing well to other subsets (i.e., samples receiving $t' \neq t$). By minimizing the Wasserstein distances based on $\phi$ and the generalized propensity scores, we obtain balanced representations, and the outcome model generalizes to all samples rather than only a subset receiving specific treatments.
- Setting $\phi$ to be identically zero would indeed make the Wasserstein distance zero. On the other hand, it would also induce a very large outcome prediction loss in Eq. (17). Therefore, minimizing the sum of the outcome loss and the Wasserstein distances avoids the degenerate solution where $\phi$ is zero. This is discussed in Lines 290-296, which motivates us to propose Eq. (12).
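Schematically, the upper bound described in the first point above might be written as follows; note that the notation here (a factual loss $\mathcal{L}$, a representation map $\phi$, a trade-off weight $\lambda$) is assumed for illustration and is not taken verbatim from the paper:

```latex
\mathrm{AMSE}
  \;\lesssim\;
  \underbrace{\mathcal{L}_{\text{factual}}(\theta,\phi)}_{\text{feasible biased loss}}
  \;+\;
  \lambda\,
  \underbrace{\mathbb{E}_{t}\!\left[\, W\!\bigl(p(\phi(x)\mid t),\; p(\phi(x))\bigr)\right]}_{\text{balancing (Wasserstein) term}}
```

Only $\phi$ appears in both terms, which is why the Wasserstein term can influence outcome prediction solely through representation learning.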
Q2: Algorithm description.
A2: We are sorry for the confusion. Since Problem (20) involves multiple blocks of variables, it cannot be minimized in a single iteration. Instead, in Step 9 we update the model parameters with one gradient-descent step per iteration, and the loop repeats until the objective converges. We will revise the submission to make this clear.
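To make the distinction concrete, here is a toy sketch of "one gradient step per loop iteration, looped until convergence". The quadratic objective below is purely illustrative and stands in for, but is not, the objective of Problem (20):

```python
# Toy stand-in for Problem (20): a "factual loss" term plus a penalty term,
# minimized by repeating single gradient steps until convergence.
def objective(phi):
    return (phi - 2.0) ** 2 + 0.5 * phi ** 2

def grad(phi):
    return 2.0 * (phi - 2.0) + phi

phi, lr = 0.0, 0.1
for _ in range(1000):       # the outer loop of the algorithm
    g = grad(phi)
    phi -= lr * g           # Step 9: one gradient update, not a full minimization
    if abs(g) < 1e-10:      # convergence check that terminates the loop
        break

# phi converges to the analytic minimizer 4/3 of the composite objective
```

Each pass through the loop changes the parameters by only one gradient step, which is why the loop must be iterated: the objective is only minimized in the limit of many iterations.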
Q3: Description of the existing literature.
A3: Thank you for the valuable comments. In this paper, we mainly focus on machine learning methods for causal inference. We will revise the submission accordingly.
Q4: The performance of classical or non neural-network methods.
A4: We have evaluated non-neural-network methods in the revision. The results are reported as follows.
| Methods | Synthetic | Synthetic | Synthetic | Synthetic | IHDP | News |
|---|---|---|---|---|---|---|
| ORIC | 0.1098 ± 0.0273 | 0.1234 ± 0.0388 | 0.1313 ± 0.0464 | 0.1168 ± 0.0316 | 0.3595 ± 0.0304 | 0.1507 ± 0.0406 |
| KNN | 0.2339 ± 0.0294 | 0.2234 ± 0.0296 | 0.2211 ± 0.0235 | 0.2361 ± 0.0209 | 0.8364 ± 0.0917 | 0.6104 ± 0.4117 |
| BART | 0.2205 ± 0.0248 | 0.2108 ± 0.0312 | 0.2177 ± 0.0259 | 0.2238 ± 0.0212 | 0.6825 ± 0.0715 | 0.5639 ± 0.3125 |
| GPS | 0.2103 ± 0.0319 | 0.2056 ± 0.0345 | 0.2063 ± 0.0264 | 0.2219 ± 0.0238 | 0.7247 ± 0.0582 | 0.4422 ± 0.2033 |
Q5: The reason why the loss does not incorporate the generalized propensity score.
A5: The generalized propensity score is modeled as $\pi(t \mid \phi(x))$ and used in the minimization of the Wasserstein distances, as shown in Eqs. (18) and (19), which tends to yield balanced representations by learning $\phi$. Based on balanced representations, the outcome prediction model can be trained on an unbiased data distribution, making it generalize to all samples rather than only the subset receiving the treatment $t$.
[1] Nie L, Ye M, Nicolae D. VCNet and Functional Targeted Regularization for Learning Causal Effects of Continuous Treatments. International Conference on Learning Representations, 2021.
Thank you for your response. Regarding the motivation for the methods, could you perhaps elaborate on point 3: What does $\phi(x)$ really represent here? What would be the oracle choice for $\phi$? Why does the Wasserstein part of the objective help you obtain a good choice of $\phi$?
Thank you for the reply and valuable comments. Our responses are as follows.
Q1: What does $\phi(x)$ really represent here?
A1: $\phi$ is a feature mapping function implemented by a neural network; $\phi(x)$ is the representation vector of the sample $x$ in the learned feature space.
Q2: What would be the oracle choice for $\phi$?
A2: The oracle choice for $\phi$ is a feature mapping function whose output representation vectors have the following two properties:
i) the outcome prediction loss in Eq. (17) is minimized, which means that the outcome information is well captured in $\phi(x)$;
ii) the Wasserstein distances are minimized, which means that the empirical distributions defined in Eq. (14) are close to each other. As a result, the conditional distribution of $\phi(x)$ given the treatment and the marginal distribution of $\phi(x)$ are similar, and the outcome prediction model trained on the conditional distribution generalizes well to the marginal distribution.
Q3: Why does the Wasserstein part of the objective help you obtain a good choice of $\phi$?
A3:
- By minimizing the Wasserstein distances, we learn balanced representations to reduce the confounding bias. Here, the balanced representations are produced by the learned mapping function $\phi$.
- According to Line 344 and Eq. (18), the Wasserstein distance is based on a cost function defined on the representations $\phi(x)$. A good $\phi$ is one that minimizes the Wasserstein distances in Eq. (18), in which the optimal transport plan is obtained by solving Problem (19). In practice, given a mapping function $\phi$ parameterized by a neural network, we solve Problem (19) to obtain the optimal transport plan. We then evaluate the objective (the Wasserstein distance in Eq. (18)) under this plan, compute the gradient of the objective with respect to the parameters of $\phi$, and apply gradient descent to update those parameters. A similar approach to calculating the Wasserstein distance and learning the corresponding feature mapping function is used in [a].
[a] Shalit, Uri, Fredrik D. Johansson, and David Sontag. "Estimating individual treatment effect: generalization bounds and algorithms." International conference on machine learning. PMLR, 2017.
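The per-iteration computation sketched above (solve the entropic optimal-transport problem between two sets of representations, then evaluate the transport cost) can be illustrated as follows. This is a minimal sketch, not the paper's implementation: the sample sizes, the regularization `eps`, the squared-Euclidean cost, and the variable names (`phi_t`, `phi_all`) are all assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, a, b, eps=0.5, n_iter=300):
    """Entropy-regularized OT plan between weight vectors a, b (Sinkhorn)."""
    K = np.exp(-cost / eps)                 # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)                   # scale to match column marginals b
        u = a / (K @ v)                     # scale to match row marginals a
    return u[:, None] * K * v[None, :]      # plan = diag(u) K diag(v)

rng = np.random.default_rng(0)
phi_t = rng.normal(1.0, 1.0, size=(6, 2))    # representations of units with a given treatment
phi_all = rng.normal(0.0, 1.0, size=(8, 2))  # representations of all units
cost = ((phi_t[:, None, :] - phi_all[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
a = np.full(6, 1 / 6)                        # uniform empirical weights
b = np.full(8, 1 / 8)
plan = sinkhorn_plan(cost, a, b)
w_reg = float((plan * cost).sum())           # regularized transport cost (>= 0)
```

In a training loop, gradients of `w_reg` with respect to the parameters of $\phi$ would be obtained by automatic differentiation through the cost matrix, with the plan held fixed or the Sinkhorn iterations unrolled.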
Perhaps I am still missing something, but I still do not see how the fact that the cost function is defined using $\phi$ should help adjust for confounding. It seems that this feature of the estimator should mean that after applying $\phi$ to $x$, the conditional distribution of $\phi(x)$ given the treatment should not depend too much on the treatment (in the sense that the distribution of $\phi(x)$ conditional on one treatment value is close in the Wasserstein metric to its distribution conditional on another), is that right? But why does that help with causal inference? What are the advantages of controlling for features that are not too strongly dependent on the treatment? I apologize for belaboring this point, but I really want to understand what drives the performance of the estimator.
Thank you for the reply and valuable comments. Our responses are as follows.
For simplicity, we answer the question under the setting of binary treatments in the following. Our conclusions can be easily extended to the setting of continuous treatments.
- Yes, the distribution of $\phi(x)$ conditional on $t = 1$ is close to the distribution conditional on $t = 0$ in the Wasserstein metric, which means that $\phi(x)$ is approximately independent of $t$.
- According to Theorem 1, the confounding bias, characterized by the balancing error, can be upper-bounded by the Wasserstein distances. Motivated by this, we parameterize the Wasserstein distances by the feature mapping function $\phi$ and minimize the distances to reduce the confounding bias.
- According to Eq. (12) and the discussion in Lines 288-295, we reduce the AMSE by minimizing its upper bound, i.e., the sum of the factual loss and the Wasserstein distances. In other words, by minimizing the Wasserstein distances through the feature mapping function $\phi$, the confounding bias is reduced, and the factual loss becomes close to our target loss.
- Intuitively, the confounding bias makes $p(x \mid t) \neq p(x)$, because the treatment assignment $t$ is affected by the covariates $x$. As a result, $x$ is dependent on $t$, and an outcome prediction model trained on $p(x \mid t)$ cannot achieve good performance on $p(x)$. For example, if serious patients tend to receive a better treatment ($t = 1$) while mild patients tend to receive a modest treatment ($t = 0$), then a model trained on serious patients cannot generalize to mild patients to predict their outcomes under $t = 1$, since the two groups follow significantly different distributions. By learning a $\phi$ that makes $\phi(x)$ not dependent on $t$, we enforce the two groups to follow a similar representation distribution, so that a model trained on the representations of serious patients generalizes to the representations of mild patients. At the same time, by minimizing the factual loss in Eq. (17), we make sure that the potential outcome information is well preserved in $\phi(x)$.
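The intuition above can be illustrated numerically. In this sketch, a per-group standardization stands in for a learned balancing map (in practice the map is a single shared network learned from data); the group means, variances, and sample sizes are all hypothetical:

```python
import numpy as np

def w1(a, b):
    # For equal-sized 1-D samples, W1 equals the mean absolute difference
    # of the sorted samples (the quantile coupling is optimal in 1-D).
    return float(np.abs(np.sort(a) - np.sort(b)).mean())

rng = np.random.default_rng(1)
severe = rng.normal(3.0, 1.0, size=500)   # covariates of patients receiving t = 1
mild = rng.normal(0.0, 1.0, size=500)     # covariates of patients receiving t = 0

balance = lambda x: (x - x.mean()) / x.std()  # toy "balancing" transformation
gap_before = w1(severe, mild)                 # large: groups are far apart
gap_after = w1(balance(severe), balance(mild))  # small: groups overlap
```

After balancing, a model fit on one group's representations is evaluated on a distribution much closer to the other group's, which is the generalization effect the response describes.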
Thank you for your response. I feel I better understand your motivation, but unfortunately, I am not convinced that your reasoning works. It is indeed true that the confounding bias can be upper-bounded by the difference between the conditional distributions of the covariates under treatment and under control. But this relies crucially on the fact that conditional on $x$, treatment is independent of the potential outcomes. $\phi(x)$ is typically of lower dimension than $x$, and so there is no guarantee that conditional on $\phi(x)$, treatment will still be independent of the potential outcomes. In fact, the very fact that $\phi(x)$ is chosen so that its conditional distribution does not depend strongly on treatment means that using $\phi(x)$ is likely to be problematic. When we adjust for confounding, we aim to control for observables that can explain the association between treatment and the potential outcomes. Such variables must be correlated with treatment. By choosing $\phi$ so that $\phi(x)$ is not strongly related to treatment, you would seem to select features that are not helpful for adjusting for confounding.
Thank you for the reply and valuable comments. Our responses are as follows.
According to the ignorability assumption, $x$ contains the confounders that simultaneously affect both $t$ and $y$, which need to be adjusted for to avoid the confounding bias induced through the backdoor path $t \leftarrow x \rightarrow y$.
Learning balanced representations is one way to adjust for the confounders. Specifically, by optimizing the factual loss and the Wasserstein distances together, we ensure that $\phi(x)$ is related to $y$ while remaining independent of $t$.
From the perspective of causal graphs, this means that $\phi(x)$ is a mediator on the edge $x \rightarrow y$, i.e., $x \rightarrow \phi(x) \rightarrow y$, and that $\phi(x)$ blocks the backdoor path from $t$ to $y$.
Therefore, when we use the balanced $\phi(x)$ and $t$ to train the model, we can approximately regard it as training on RCT data, avoiding the confounding bias.
In fact, the balanced-representation approach is already a classic paradigm for causal effect estimation, widely used in the machine learning community, e.g., [1-4].
From a broader perspective, the propensity score and the prognostic score are two forms of dimensionality reduction for confounders. The propensity score is a mediator on the path $x \rightarrow t$, while the prognostic score is a mediator on the path $x \rightarrow y$. According to [5] and [6], controlling for either of them is sufficient for causal inference. And, as stated in the related work of [2], our balanced representation can be viewed as a special case of a prognostic score.
[1] Shalit U, Johansson F D, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms[C]//International conference on machine learning. PMLR, 2017: 3076-3085.
[2] Johansson F D, Shalit U, Kallus N, et al. Generalization bounds and representation learning for estimation of potential outcomes and causal effects[J]. Journal of Machine Learning Research, 2022, 23(166): 1-50.
[3] Kazemi A, Ester M. Adversarially balanced representation for continuous treatment effect estimation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(12): 13085-13093.
[4] Guo X, Zhang Y, Wang J, et al. Estimating heterogeneous treatment effects: Mutual information bounds and learning algorithms[C]//International Conference on Machine Learning. PMLR, 2023: 12108-12121.
[5] Rosenbaum P R, Rubin D B. The central role of the propensity score in observational studies for causal effects[J]. Biometrika, 1983, 70(1): 41-55.
[6] Hansen B B. The prognostic analogue of the propensity score[J]. Biometrika, 2008, 95(2): 481-488.
Although this paper contains some potentially interesting ideas about correcting for the effect of confounding variables, in its current form it has too many weaknesses. To me, the most severe criticism concerns the concept of correcting for confounders using variables that are (almost) uncorrelated with the treatment, which appears to be a serious conceptual problem, because the very nature of a confounder is that it jointly influences both treatment and outcome. Unfortunately, this conceptual problem could not be addressed in a clear way during the rebuttal and discussion phase.
Additional Comments from Reviewer Discussion
The main conceptual problem -- namely, the unclear role of variables that are uncorrelated with treatment in correcting for confounding -- could not be addressed in a convincing way during the rebuttal and discussion phase.
Reject