Counterfactual Contrastive Learning with Normalizing Flows for Robust Treatment Effect Estimation
We propose a method for ITE estimation that leverages a derived error bound to ensure fine-grained alignment and robust performance, even with individual heterogeneity.
Abstract
Reviews and Discussion
The paper points out that the prediction of the individual treatment effect (ITE) is crucial for personalized therapy planning and proposes a contrastive learning approach (along the lines of SimCLR) to estimate it. The accuracy of an estimator is measured in terms of the expected squared error of the ITE estimates with respect to the ground truth. However, ground-truth ITEs are not accessible in practice (since the counterfactual outcome is unknown), and existing sample alignment methods are not good enough for applications with high-dimensional covariates and considerable individual heterogeneity. To circumvent this difficulty, the authors derive a tractable contrastive upper bound for the squared loss that justifies their choice of learning method. Experimental comparisons with several baselines demonstrate the superior behavior of the new method.
Crucially, the authors explain that contrastive training examples cannot be generated by standard data augmentations or adversarial training in data space, because these samples are not sufficiently realistic -- they are too far from the manifold of typical covariate samples. Instead, they propose to first train a normalizing flow to represent the data distribution around the manifold, and then perform a gradient search for contrastive samples in the latent space, where the gradient is calculated from the Jacobian of the decoder. The search for contrastive examples thus stays near the manifold, and the generated counterfactual training data are much more realistic.
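For intuition, the latent-space search could be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the interface names (`flow.inverse`, `flow.forward`, `predictor`) are assumed for the sake of the example.

```python
import torch

def latent_counterfactual(x, flow, predictor, steps=100, lr=0.01):
    """Gradient search for a contrastive/counterfactual sample in the
    latent space of a pre-trained normalizing flow, so that decoded
    candidates stay near the data manifold (illustrative sketch)."""
    z = flow.inverse(x).detach().requires_grad_(True)  # encode x into latent space
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x_cf = flow.forward(z)          # decode the candidate back to data space
        loss = -predictor(x_cf).mean()  # push the prediction toward the target
        loss.backward()                 # gradients flow back through the decoder Jacobian
        opt.step()
    return flow.forward(z).detach()
```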
EDIT after rebuttal: The authors appropriately addressed my questions and concerns, so I've increased my score.
Questions For Authors
Please answer the following questions (I'm willing to increase my score):
- relation to debiased ML
- how to fix presentation shortcomings
- experiments demonstrating the superiority of flow-based counterfactual generation
Claims And Evidence
The claims are largely clear and well supported.
However, I am surprised that debiased ("double") machine learning (e.g. [1]) is neither mentioned as related work nor included in the comparison. In my understanding, debiased ML is a relatively simple method that can estimate ITEs in the absence of randomized controlled experiments, where the selection bias of treatment decisions would otherwise contaminate the predictions. I'm curious how the authors assess this possibility.
Methods And Evaluation Criteria
Evaluation follows standard practices in the field and standard benchmarks. Results appear trustworthy.
Theoretical Claims
The claims in Lemma 4.1 and Theorem 4.2 are plausible, but I did not check the proofs.
Experimental Design And Analyses
Evaluation follows standard practices in the field and standard benchmarks. Results appear trustworthy.
It would have been helpful to include an experiment demonstrating that standard data augmentation and adversarial training fail and how the flow-based method fixes this. For example, one could use the Maximum Mean Discrepancy (MMD) to quantify the difference between the distributions of real and synthetic (counterfactual) covariates. Then, the MMD should be considerably smaller for the proposed method.
The latter experiment could be augmented by an ablation study, showing how ITE results deteriorate when flow-based counter-factual generation is not used.
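For concreteness, such an MMD check could look like the following minimal sketch (a standard biased RBF-kernel estimate of MMD²; the variable names `real` and `synthetic` are placeholders, not from the paper):

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased empirical MMD^2 between samples x (n, d) and y (m, d)
    under an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def gram(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

# e.g. mmd_rbf(real, synthetic) should be small if the generated
# counterfactual covariates match the real covariate distribution.
```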
Supplementary Material
10 pages of comprehensive supplement (4 pages of proofs, 1 page of additional explanations, 5 pages of additional experiments)
Relation To Broader Scientific Literature
The literature is reviewed well and compared under standard protocols with the proposed method, except for the possible omission of debiased ML mentioned above.
Essential References Not Discussed
see above.
Other Strengths And Weaknesses
The main weaknesses of the paper are some minor presentation glitches.
- Definition 3.6 (and repeatedly further down the paper, e.g. equations (6)-(8)): the text refers to "the classification decision k(x)", but k(x) is undefined.
- Lemma 4.1, Theorem 4.2: disc(., .) (the distance in representation space) is never concretely defined, nor are the pros and cons of possible choices discussed. How is it related to the similarity sim(., .) in equation (9)?
- Second line of section 5.1: possible typo [rho 1_p 1_{p'}] => [rho 1_p 1_{p'}^T]?
Other Comments Or Suggestions
none
We greatly appreciate your recognition of our work and are glad to share our thoughts with you.
Relation to debiased ML.
Thank you for this insightful suggestion. We would like to clarify that our work primarily focuses on the challenges of deep learning-based causal effect estimation. As debiased machine learning (DML) is a traditional method, we did not explicitly discuss it in the literature review. We will include this discussion in the revision.
Then, we discuss the differences and connections between DML and our FCCL.
First, we explain the different mechanisms that DML and our proposed FCCL use to address confounding bias. DML handles confounding bias through orthogonal residual regression, i.e., regressing the outcome residual Y - m(X) on the treatment residual T - e(X), where m(X) = E[Y|X] eliminates the influence of the confounder on the outcome and e(X) = E[T|X] eliminates the influence of the confounder on the treatment [1]. This residualization creates statistical independence between treatment and confounder to simulate randomization conditions. In contrast, our method captures the characteristics of potential outcomes under different treatments, mitigating distribution shifts through sample-level alignment to emulate RCTs, which better captures individual-level heterogeneity.
Second, DML's way of addressing bias is highly instructive, particularly its unbiasedness. Therefore, we see great potential in incorporating DML into our future work, as R-learners draw inspiration from DML's approach of handling confounding bias via residual-on-residual regression. Thanks again.
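For illustration, a simplified sketch of the partially linear DML estimator described above (scikit-learn based; the random-forest nuisance models echo the RF-based comparison below, but hyperparameters and cross-fitting details are illustrative, not the exact implementation used for Table 1):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def dml_effect(X, T, Y):
    """Partially linear DML: regress outcome residuals on treatment
    residuals, using out-of-fold nuisance predictions (cross-fitting).
    T is assumed binary; the result is a constant-effect estimate,
    unlike FCCL's individual-level estimates."""
    m_hat = cross_val_predict(RandomForestRegressor(), X, Y, cv=5)   # ~ E[Y|X]
    e_hat = cross_val_predict(RandomForestClassifier(), X, T, cv=5,
                              method="predict_proba")[:, 1]          # ~ E[T|X]
    y_res, t_res = Y - m_hat, T - e_hat
    return np.sum(t_res * y_res) / np.sum(t_res ** 2)  # OLS slope of y_res on t_res
```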
Besides, we add comparison experiments with DML (Table 1), and the results show that FCCL achieves lower estimation errors.
Table 1: ITE estimation errors (std) comparison with DML on IHDP.
| Method | √ϵ_PEHE (within-sample) | ϵ_ATE (within-sample) | √ϵ_PEHE (out-of-sample) | ϵ_ATE (out-of-sample) |
|---|---|---|---|---|
| DML | 2.87 (0.09) | 0.30 (0.05) | 2.95 (0.14) | 0.35 (0.03) |
| FCCL | 0.53 (0.04) | 0.09 (0.01) | 0.64 (0.07) | 0.12 (0.02) |
How to fix presentation shortcomings?
Thank you for pointing out the typos; we will correct them in the next version.
(1) We will add a formal definition of k(x) in Definition 3.6.
(2) We thank the reviewer for the suggestion. Indeed, disc(·, ·) is defined via the inner product of vectors in the representation space; detailed proofs are provided in Appendix A. Theorem 4.2 shows that the estimation error is upper bounded by the distance constraints in the representation space. Therefore, we implement the optimization via the contrastive loss, specifically by leveraging the cosine similarity sim(u, v) = u·v / (‖u‖ ‖v‖).
(3) We apologize and will thoroughly go over this for the revision, specifically changing rho 1_p 1_{p'} to rho 1_p 1_{p'}^T, where 1_p denotes the p-dimensional all-ones vector and I_p denotes the identity matrix of size p.
Experiments demonstrating the superiority of flow-based counterfactual generation. For example, one could use the MMD to quantify the difference between the distributions of real and synthetic (counterfactual) covariates.
Following your valuable suggestion, we use the MMD to quantify the difference between the distributions of factual and counterfactual covariates (Table 2), and the results show that the MMD is considerably smaller for our FCCL. Besides, we compare the effect of alternative counterfactual generation strategies on the ITE estimation error in Table 3 in the main text. These results show the superiority of flow-based counterfactual generation.
Table 2: MMD mean (std) comparison on IHDP.
| Method | gradient ascent in data space | GAN | FCCL |
|---|---|---|---|
| MMD | 0.13 (0.16) | 0.52 (0.55) | 0.09 (0.14) |
R1: As debiased machine learning (DML) is a traditional method, we did not explicitly discuss it in the literature review. We will include this discussion in the revision.
The regression components of DML can just as well be implemented by neural networks, so it is not restricted to traditional methods :-)
Otherwise, your answers appropriately address my concern. Please make sure to revise the paper accordingly.
We agree with your comment. The regression components of DML can indeed be implemented using any prediction model, including neural networks. We implemented DML with neural networks for the regression components, and the experimental results are presented in the table below. As shown, our FCCL still demonstrates superior performance compared to DML.
Table 1: ITE estimation errors (std) comparison with DML on IHDP.
| Method | √ϵ_PEHE (within-sample) | ϵ_ATE (within-sample) | √ϵ_PEHE (out-of-sample) | ϵ_ATE (out-of-sample) |
|---|---|---|---|---|
| DML (RF-based) | 2.87 (0.09) | 0.30 (0.05) | 2.95 (0.14) | 0.35 (0.03) |
| DML (NN-based) | 2.45 (0.12) | 0.20 (0.05) | 2.60 (0.14) | 0.33 (0.05) |
| FCCL | 0.53 (0.04) | 0.09 (0.01) | 0.64 (0.07) | 0.12 (0.02) |
We would like to clarify that our categorization of DML as a "traditional method" was based on how it handles confounding bias through orthogonal residual regression, where DML operates directly on the confounder X. In contrast, deep learning methods (also known as representation learning) learn a representation space to obtain Φ(X), and handle the confounding bias in that representation space.
We sincerely hope our responses can resolve your concern. In light of these clarifications, we respectfully invite you to consider raising the score.
This paper presents FCCL, an ITE estimation method. FCCL integrates diffeomorphic counterfactual generation and contrastive learning to align treatment and control groups at a fine-grained, sample level, mitigating distribution shifts and approximating RCT randomization. By ensuring realistic counterfactuals and enforcing semantic consistency, FCCL lowers ITE estimation error. Experiments indicate superior performance in heterogeneous and data-scarce settings.
Questions For Authors
- It is stated that the proposed method is specifically useful for “handling sparse boundary samples.” However, neither a proper definition is provided, nor a discussion of why handling them is challenging.
- Is the optimization end-to-end, or are the diffeomorphic counterfactuals found first, with the representation and prediction modules trained afterwards?
- The evaluation metric “dis” (average distance between boundary samples and corresponding class centers) in Fig. 3 is stated to reflect sample heterogeneity. However, it is not explained why a smaller dis is desirable.
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes
Theoretical Claims
I looked at Lemma 4.1 and Theorem 4.2. They seem to be correct.
Experimental Design And Analyses
- The experiments are valid, although it would have been nice to include more complicated datasets such as ACIC. This is important for evaluating the robustness and generalization of the proposed method in large-scale settings.
- The evaluation metric “dis” in Fig. 3 is not well-motivated (see section "Questions For Authors” for a specific question).
Supplementary Material
Yes, I briefly looked at the appendix.
Relation To Broader Scientific Literature
NA
Essential References Not Discussed
NA
Other Strengths And Weaknesses
Strengths:
- The paper provides strong theoretical grounding.
- Innovative combination of normalizing flows (to maintain semantic meaning in counterfactual samples) and contrastive learning (to ensure robust alignment), addressing core challenges in causal inference.
Weaknesses:
- The paper could improve in terms of clarity and flow. Specifically, theory is presented before providing a clear intuition or practical context, which makes it hard to follow the core logic. As a result, readers might struggle with intuitive understanding without first seeing practical motivation.
- The evaluation could benefit from inclusion of more challenging datasets (e.g., ACIC).
- The metric "dis" used in evaluations lacks clear motivation.
Other Comments Or Suggestions
NA
We thank the reviewer for their time and for recognizing the importance of heterogeneous treatment effect estimation. Please see below answers to the questions.
Provide a proper definition of boundary samples and discuss why handling them is challenging.
We thank the reviewer for raising this point. We provide a formal definition of boundary samples in Appendix D.1 and will include this definition in the main text in the revision.
Furthermore, we analyze the challenges of handling boundary samples from two perspectives:
(1) Many scenarios demand flexible investigation of effect heterogeneity across individuals. However, boundary samples, due to their distinctive characteristics, reside in low-probability density regions distant from class centroids, which poses challenges for accurate ITE estimation.
(2) Existing methods, such as MMD-based balancing, minimize the distribution discrepancy between the treated and control groups by aligning their mean representations. However, such methods overlook samples near the boundaries of the treatment and control distributions.
Our method achieves robust performance via sample-level alignment, especially in scenarios where individual differences are significant. Besides, we provide a boundary-sample analysis by listing the estimation errors of the first five samples from Table 4 in Appendix D, as shown in Table 1. More detailed results are presented in Appendix D.
Table 1: ITE estimation errors of boundary samples on IHDP.
| Sample | CFR-MMD | FCCL |
|---|---|---|
| 1 | 4.2264 | 0.2144 |
| 2 | 3.0977 | 0.0594 |
| 3 | 0.0337 | 0.0805 |
| 4 | 6.1246 | 0.6418 |
| 5 | 2.1014 | 0.2263 |
Is optimization end-to-end? or are diffeomorphic counterfactuals found first and then the representation and prediction modules are trained?
Our approach falls into the latter: we first generate the diffeomorphic counterfactuals, and then train the representation and prediction modules.
Why is a smaller value of the evaluation metric “dis” desirable?
A smaller "dis" value indicates that the dispersion of samples in the latent representation space is reduced, which in turn reflects a smaller distance between boundary samples and their corresponding class centers. This implies that there are fewer boundary samples, enabling better model fitting and lower ITE estimation bias, as demonstrated by our results. We will make this point clearer in the next version.
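To make the metric concrete, a minimal sketch of such a computation follows (assuming latent representations `Z` and binary treatment labels `t`; the paper's exact definition restricts the average to boundary samples, whereas this sketch averages over all samples for simplicity):

```python
import numpy as np

def dis_score(Z, t):
    """Mean distance between each representation and its treatment-group
    centroid; smaller values indicate tighter clusters and fewer samples
    stranded far from the class centers."""
    dists = []
    for group in (0, 1):
        Zg = Z[t == group]
        center = Zg.mean(axis=0)                          # class center
        dists.append(np.linalg.norm(Zg - center, axis=1)) # per-sample distances
    return float(np.concatenate(dists).mean())
```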
The evaluation could benefit from inclusion of more challenging datasets (e.g., ACIC).
Thank you for your valuable suggestions. For validation, we additionally evaluate our method against several representative baselines on ACIC (Table 2). We observe that our method still outperforms the baselines.
Table 2: Within-sample and out-of-sample mean (std) for the metrics on ACIC.
| Method | √ϵ_PEHE (within-sample) | ϵ_ATE (within-sample) | √ϵ_PEHE (out-of-sample) | ϵ_ATE (out-of-sample) |
|---|---|---|---|---|
| CFR-MMD | 1.70 (0.38) | 0.29 (0.12) | 2.36 (0.59) | 0.30 (0.12) |
| SITE | 1.71 (0.39) | 0.38 (0.14) | 2.33 (0.58) | 0.39 (0.15) |
| ABCEI | 1.93 (0.46) | 0.17 (0.07) | 2.49 (0.64) | 0.18 (0.07) |
| CBRE | 1.69 (0.38) | 0.12 (0.05) | 2.31 (0.56) | 0.14 (0.06) |
| DIGNet | 1.66 (0.35) | 0.23 (0.07) | 2.30 (0.54) | 0.24 (0.07) |
| FCCL | 1.51 (0.40) | 0.16 (0.05) | 2.16 (0.56) | 0.17 (0.05) |
I thank the authors for addressing my comments.
Please make sure to clarify R2 in the paper. BTW, would it be possible to design the algorithm to be end-to-end?
Thank you for your suggestion. We aim to generate semantically meaningful counterfactuals, but an end-to-end design would likely introduce non-negligible interfering gradients into training and thus impair the quality of counterfactual generation. Nevertheless, we will explore better ways to make the algorithm end-to-end.
The paper introduces Flow-based Counterfactual Contrastive Learning (FCCL), a novel approach for Individual Treatment Effect (ITE) estimation that integrates normalizing flows for realistic counterfactual generation and contrastive learning for fine-grained sample alignment. It derives a theoretical generalization-error bound linking ITE estimation error to factual prediction error and representation distances. Empirical evaluations on synthetic, semi-synthetic (IHDP), and real-world (Jobs) datasets demonstrate FCCL's superior performance.
Questions For Authors
- How does FCCL compare to diffusion-based counterfactual generators? Diffusion models are gaining traction in structured data. Would FCCL’s theoretical framework extend to them?
- How stable is FCCL under different hyperparameter choices? Particularly, how does the contrastive loss temperature τ affect performance?
- Can FCCL generalize to continuous treatments? The paper focuses on binary treatments, but many real-world applications require handling continuous interventions.
Claims And Evidence
Claim 1: FCCL generates realistic counterfactuals that adhere to the data manifold.
Supported, through the use of normalizing flows, which enforce structure on counterfactual transformations. The authors present visualization results showing that FCCL maintains sample-level semantic consistency better than baseline methods.
Claim 2: FCCL reduces ITE estimation error via sample-level alignment.
Partially supported. While contrastive learning improves alignment, the empirical evidence (particularly in Tables 1 and 2) primarily shows improvement in error metrics rather than a direct measure of sample alignment. Additional ablation studies isolating the contrastive loss would be beneficial.
Claim 3: The proposed theoretical generalization-error bound justifies FCCL’s effectiveness.
Supported. The bound is mathematically derived (Theorem 4.2) and aligns with the contrastive loss objective. However, empirical validation of whether this bound holds in practice is not explicitly tested.
Claim 4: FCCL significantly outperforms state-of-the-art baselines.
Mostly supported. The results show FCCL achieving the best ϵ_PEHE and ϵ_ATE scores across datasets. However, the magnitude of improvement varies, and for some cases (e.g., IHDP out-of-sample ATE error), the advantage is marginal.
Methods And Evaluation Criteria
The use of benchmark datasets (IHDP, Jobs, synthetic data) is appropriate for treatment effect estimation. Baselines (OLS, CFR, GANITE, ABCEI, etc.) are well-chosen, covering both traditional and deep learning-based ITE estimation methods. Evaluation metrics (ϵ_PEHE, ϵ_ATE, ATT) are standard, but additional fairness metrics (e.g., subgroup fairness or bias analysis) could strengthen the evaluation.
The use of latent space visualizations (Figure 3) is insightful, but further quantitative measures of alignment (e.g., KL divergence or propensity score matching quality) would reinforce the claims.
Theoretical Claims
Correctness of Proofs: The theoretical derivations appear correct and align with existing literature on treatment effect bounds (e.g., Shalit et al., 2017).
Missing Considerations: The assumptions regarding the invertibility of representations (Φ) and the geodesic distance formulation could be further justified. Additionally, the bound does not account for sample sparsity effects, which are critical in real-world datasets.
Experimental Design And Analyses
The experiment design is generally well-structured, with appropriate train-test splits and multiple trials.
Missing Analyses: The impact of hyperparameters (especially temperature τ in contrastive loss) is not explored. Additionally, no robustness checks (e.g., sensitivity to noise or dataset shifts) are provided.
Supplementary Material
The supplementary material includes code, which seems technically sound.
Relation To Broader Scientific Literature
FCCL extends prior work on representation learning for ITE estimation (e.g., CFR, SITE, CITE) to contrastive learning for causal inference, aligning with recent trends in self-supervised learning for structured data. The approach could be adapted to multi-treatment and continuous intervention settings, similar to recent developments in continuous treatment estimation (e.g., Kazemi & Ester, 2024).
Essential References Not Discussed
The paper extensively cites prior work but does not discuss alternatives to normalizing flows for counterfactual generation, such as energy-based models (Du et al., 2021) or diffusion-based generative approaches (Song et al., 2021).
Other Strengths And Weaknesses
no
Other Comments Or Suggestions
no
Thank you for your insightful suggestions. We have addressed the comments related to the counterfactual generation and model robustness evaluation. Please see our responses below.
How does FCCL compare to diffusion-based counterfactual generators?
We thank the reviewer for the question. We add experiments on diffusion-based counterfactual generation (Table 1), which show performance comparable to FCCL, but with a slightly inferior result on the out-of-sample √ϵ_PEHE. This may be due to the noise-driven mechanism of the diffusion model [1], which causes counterfactuals to deviate from the sample semantic space, especially in the out-of-sample cases. In contrast, the flow-based model ensures that counterfactuals reside on the same manifold as the original instances, which yields both meaningful and reliable counterfactuals and makes FCCL more dependable for individual-level treatment effect predictions. We will include this discussion in the revision.
Table 1: ITE estimation errors (std) with different generation methods on IHDP.
| Method | √ϵ_PEHE (within-sample) | ϵ_ATE (within-sample) | √ϵ_PEHE (out-of-sample) | ϵ_ATE (out-of-sample) |
|---|---|---|---|---|
| diffusion-based | 0.54 (0.05) | 0.09 (0.01) | 0.72 (0.10) | 0.12 (0.03) |
| FCCL | 0.53 (0.04) | 0.09 (0.01) | 0.64 (0.07) | 0.12 (0.02) |
How does the contrastive loss temperature τ affect performance?
We perform a sensitivity analysis focusing on the temperature coefficient τ (see Table 2). We observe minimal variation in performance with respect to τ, which demonstrates our model's general robustness.
Table 2: ITE estimation errors (std) with different values of τ on IHDP.
| τ | √ϵ_PEHE (within-sample) | ϵ_ATE (within-sample) | √ϵ_PEHE (out-of-sample) | ϵ_ATE (out-of-sample) |
|---|---|---|---|---|
| 0.1 | 0.54 (0.04) | 0.11 (0.02) | 0.64 (0.06) | 0.13 (0.02) |
| 0.3 | 0.56 (0.04) | 0.10 (0.02) | 0.70 (0.08) | 0.14 (0.02) |
| 0.5 | 0.58 (0.06) | 0.10 (0.02) | 0.76 (0.16) | 0.11 (0.02) |
| 0.7 | 0.61 (0.06) | 0.10 (0.02) | 0.79 (0.15) | 0.11 (0.02) |
| 0.9 | 0.56 (0.04) | 0.10 (0.01) | 0.69 (0.07) | 0.12 (0.02) |
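For reference, a generic InfoNCE-style contrastive loss shows where the temperature τ enters (an illustrative sketch, not the paper's exact objective; `anchor`, `positive`, and `negatives` are assumed tensor names):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.5):
    """Generic InfoNCE loss for anchor (B, D), positive (B, D), and
    negatives (B, K, D). Cosine similarities are scaled by tau before
    the softmax: small tau sharpens the distribution, large tau smooths it."""
    pos = F.cosine_similarity(anchor, positive, dim=-1)                # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (B, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / tau           # (B, 1+K)
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)                         # positives at index 0
    return F.cross_entropy(logits, labels)
```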
Can FCCL generalize to continuous treatments?
This is a valuable point. FCCL currently focuses on binary treatments; we can extend it to continuous treatments in future work. The most direct approach would involve discretizing the continuous treatment variable [2]. Specifically, we can split the treatment into k heads, each assigned a dose level from the treatment range, which is divided into k equal-width intervals. We will explore better approaches to handle continuous treatments directly.
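As a toy illustration of this binning idea (assuming doses normalized to [0, 1); the names and the value of k are hypothetical):

```python
import numpy as np

k = 5                                    # number of treatment heads (hypothetical)
doses = np.random.rand(100)              # continuous treatments in [0, 1)
heads = np.minimum((doses * k).astype(int), k - 1)  # head index; each covers width 1/k
```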
Additional ablation studies isolating the contrastive loss would be beneficial.
We have already discussed the impact of the contrastive loss; please refer to Figure 4 in the main text. Thank you.
Empirical validation of whether this ITE bound holds in practice is not explicitly tested.
Our empirical analysis demonstrates that the proposed ITE bound effectively guides ITE model training and becomes tighter than the bound proposed in CFR [3] as iterations increase under identical conditions (Table 3).
Table 3: Generalization-error bound comparison on IHDP.
| Iterations | 400 | 800 | 1200 | 1600 | 2000 |
|---|---|---|---|---|---|
| CFR | 13.24 | 10.84 | 10.06 | 9.64 | 9.91 |
| FCCL | 8.54 | 3.74 | 2.93 | 2.77 | 2.67 |
Further quantitative measures of alignment would reinforce the claims (e.g., KL divergence).
Thank you for your suggestion. We add KL divergence comparisons between the treatment and control distributions for two representative baselines and our FCCL (see Table 4), which demonstrates that FCCL better addresses distribution shifts. We will include this metric in Figure 3 in the next version.
Table 4: KL divergence mean (std) comparison on IHDP.
| Method | CFR | ABCEI | FCCL |
|---|---|---|---|
| KL | 0.14 (0.08) | 0.20 (0.30) | 0.09 (0.05) |
[1] Kotelnikov, A., et al. TabDDPM: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pp. 17564–17579. PMLR, 2023.
[2] Schwab, P., et al. Learning counterfactual representations for estimating individual dose-response curves. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 5612–5619, 2020.
[3] Shalit, U., et al. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pp. 3076–3085. PMLR, 2017.
The paper proposes FCCL framework for the ITE estimation. The proposed method can generate realistic counterfactuals by leveraging normalizing flows to ensure adherence to the data manifold, preserving semantic similarity to factual samples. The authors also derive a new generalization bound connecting ITE estimation error to factual prediction errors and representation distances between factual-counterfactual pairs, providing theoretical grounding for their proposed sample-level alignment method.
Questions For Authors
Please see Strengths And Weaknesses
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes
Theoretical Claims
Yes.
Experimental Design And Analyses
The experimental design and comparisons are proper.
Supplementary Material
Yes. Theoretical parts.
Relation To Broader Scientific Literature
ITE model development.
Essential References Not Discussed
Some papers also discuss new upper bounds of PEHE; these could be included in the literature review.
Other Strengths And Weaknesses
Strengths:
The motivation, presentation, and experimental comparisons are good.
Weaknesses:
I only have one concern about the theoretical parts. One of the key contributions is that the authors claim they propose a new ITE error bound for their sample alignment method. However, anyone can propose an ITE error bound and minimize such a bound to learn the ITE model. The point is, how can you make sure that the proposed ITE bound is tight? If the bound is too loose, such a theoretical bound may give bad guidance for training the ITE model.
I thus suggest the authors give some theoretical or empirical evidence to show that the bound is tight (at least tighter than the bound proposed in CFRWASS).
Other Comments Or Suggestions
None.
We appreciate your time and your thoughtful and encouraging comments. We hope our responses can resolve your concern.
I suggest authors give some theoretical or empirical evidence to show that the bound is tight (at least tighter than the bound proposed in CFR).
Following your helpful suggestion, we conduct an experimental analysis to compare the generalization error bound of our proposed FCCL (Equation (14) in Appendix A) with the CFR bound (Equation (8) in Appendix A.2 of [1]) under the same conditions.
As shown in Table 1, we track the generalization error bounds at different iterations. The results demonstrate that as the number of iterations increases, FCCL achieves a significantly tighter bound than that of CFR.
Table 1: Generalization-error bound comparison on IHDP.
| Iterations | 400 | 800 | 1200 | 1600 | 2000 |
|---|---|---|---|---|---|
| CFR | 13.24 | 10.84 | 10.06 | 9.64 | 9.91 |
| FCCL | 8.54 | 3.74 | 2.93 | 2.77 | 2.67 |
In addition, we would like to clarify that our FCCL provides a different theoretical perspective from CFR. Our generalization error bound links the ITE estimation error to the generalization error of factual predictions and the representation distances, motivating our focus on minimizing these distances via sample-level alignment. In contrast, the bound of CFR shows that the ITE estimation error can be reduced by reducing the discrepancy between the treated and control distributions.
[1] Shalit, U., et al. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pp. 3076–3085. PMLR, 2017.
The reviewers generally appreciated the work in combining normalizing flows and contrastive learning to better estimate individual treatment effects (ITE). They generally felt that both the theory and experiments were useful and were explained well.
The authors should aim to include several points of clarification and explanation brought up during the review process in the final version.