Coupling Generative Modeling and an Autoencoder with the Causal Bridge
We present a new theoretical perspective for the causal bridge that allows a deeper understanding of the impact of its assumptions on causal estimates while yielding an autoencoder formulation that improves causal effect estimation.
Abstract
Reviews and Discussion
This paper derives theoretical bounds on the average error in treatment effect estimation under proximal causal inference when the causal bridge (CB) assumption is violated. The authors also propose an autoencoder-based architecture for estimating the causal bridge function, providing both theoretical guarantees and a practical implementation framework.
Strengths And Weaknesses
Strengths:
- The paper extends existing theory for proximal causal inference by providing theoretical analysis for the scenario where the bridge function independence assumption (relative to the proxy variable) is violated.
- The experimental evaluation includes both synthetic datasets and real-world data with ground truth established through randomized controlled trials, providing comprehensive empirical validation.
Weaknesses:
- The paper fails to discuss or compare against recent neural network approaches for unmeasured confounding, including key works such as Deep Learning Methods for Proximal Inference via Maximum Moment Restriction (Kompa et al., 2022), Causal Effect Estimation using Identifiable Variational AutoEncoder with Latent Confounders and Post-Treatment Variables (Xie et al., 2024), and Density Ratio-based Proxy Causal Learning Without Density Ratios (Bozkurt et al., 2025).
- The synthetic data experiments omit important baselines: (1) CEVAE: the authors' claim that its results are not numerically stable does not justify excluding this established method, and (2) the approach from Kompa et al. (2022).
- The real-world data evaluation lacks comparison with relevant baselines, particularly Ying et al. (2022).
- The authors present their method's proportional hazard assumptions as an advantage over Ying et al. (2022) in survival outcomes, but fail to acknowledge that requiring this assumption could be viewed as a limitation of their own approach.
- The paper provides no justification for choosing autoencoders over diffusion models for sampling, despite diffusion models' growing prominence in generative modeling.
Questions
I might increase or decrease my score based on the authors' responses.
- What motivated the choice of autoencoders over diffusion models for generating samples from the conditional distribution, particularly given diffusion models' demonstrated effectiveness in generative tasks?
- Could you include comparisons with the missing baselines identified in the Weaknesses section (CEVAE, Kompa et al. 2022, and Ying et al. 2022) to provide a more comprehensive experimental evaluation?
Limitations
The authors fail to acknowledge that their experimental evaluation omits several important baselines from recent literature, and do not discuss why they chose auto-encoders instead of diffusion models.
Final Justification
The authors' responses addressed my concerns and questions. As such, I increased my score.
Formatting Issues
NA
We would like to thank the reviewer for their constructive feedback. Below we address the three concerns raised by the reviewer.
- Thanks for pointing us to the three related papers. Below we show results for Kompa et al., 2022 and Bozkurt et al., 2025, showing that the proposed approaches (CB and CB + AE, namely without and with the autoencoder formulation, respectively) outperform the two settings of the former (NMMR-U and NMMR-V) and the Kernel Alternative Proxy method of the latter. In the table below, Q1 and Q3 are the first and third quartiles, reported alongside the median out-of-sample MSE. The third paper (Xie et al., 2024) will be discussed in the revision; however, it is not included in the experiments because it addresses a different setting that requires post-treatment variables.
| Method | Sample Size | Demand Median (Q1, Q3) | dSprite Median (Q1, Q3) |
|---|---|---|---|
| CEVAE | 1000 | 360.05 (205.15, 602.59) | 52.92 (51.45, 53.29) |
| NMMR-V | 1000 | 23.07 (17.27, 28.12) | 24.48 (21.32, 25.73) |
| NMMR-U | 1000 | 21.6 (20.41, 25.38) | 22.22 (18.37, 25.57) |
| Kernel Alternative Proxy | 1000 | 109.86 (79.01, 176.86) | 22.4 (21.75, 23.03) |
| DFPV | 1000 | 43.93 (38.38, 48.34) | 22.08 (19.96, 24.38) |
| DFPV (our generator) | 1000 | 34.15 (29.65, 40.88) | 15.25 (13.97, 18.59) |
| CB (our method) | 1000 | 20.5 (17.52, 24.78) | 14.06 (13.21, 14.94) |
| CB + AE (our method) | 1000 | 16.66 (14.59, 20.32) | 13.68 (13.39, 14.49) |
| CEVAE | 5000 | 262.24 (161.96, 438.63) | 53.89 (53.07, 54.18) |
| NMMR-V | 5000 | 31.03 (24.07, 41.42) | 11.62 (8.77, 14.44) |
| NMMR-U | 5000 | 17.12 (13.19, 24.1) | 19.11 (18.11, 20.3) |
| Kernel Alternative Proxy | 5000 | 153.19 (96.4, 221.03) | 26.53 (25.85, 27.1) |
| DFPV | 5000 | 43.35 (32.65, 53.2) | 16.33 (15.23, 17.44) |
| DFPV (our generator) | 5000 | 27.84 (24.73, 30.71) | 8.79 (8.2, 9.49) |
| CB (our method) | 5000 | 15.84 (14, 17.28) | 7.45 (6.92, 7.96) |
| CB + AE (our method) | 5000 | 14.77 (12.9, 16.89) | 7.27 (7.04, 8) |
- Above we show results for CEVAE, NMMR-U, NMMR-V (Kompa et al., 2022) and Kernel Alternative Proxy (Bozkurt et al., 2025), from which we see that the proposed approach outperforms the alternatives on both datasets (Demand and dSprite) and at both sample sizes (1000 and 5000). These will be shown graphically in the revision as in Figure 3.
- We produced results using the method in Ying et al. 2022 with the same data settings as in our experiment with the Framingham dataset shown in Figure 3 (Right). We use their formulation for the hazard function (described below) and estimate hazard ratios (HRs) at three event-time quantiles, which correspond to event horizons of 1924, 3709, and 5494 days, respectively, consistent with the experiments in Ying et al. 2022. Note that, unlike our approach, they estimate causal effects at specific points in time, while we consider a time-independent causal estimate. The results below indicate that the approach in Ying et al. 2022 fails to capture both the direction and the magnitude of the ground-truth HR=0.75. Summaries for the method of Ying et al. 2022 (mean and standard deviation of the HR) are calculated over 30 runs:
| t | 0.25 | 0.5 | 0.75 |
|---|---|---|---|
| Mean | 1.0454 | 1.1414 | 1.2103 |
| Standard Deviation | 0.0011 | 0.0077 | 0.0007 |
- We understand that the proportional hazard assumption can be seen as a limitation; however, in Ying et al., 2022, they consider a very specific formulation for the hazard function that can be similarly perceived as a limitation. Specifically, they use , i.e., an exponentiated linear function of time (), time time (), time treatment (), time outcome proxy () and additional covariates denoted as . In the revision, we will acknowledge the limitations of both approaches, i.e., using a proportional hazard assumption vs. choosing a specific form of the hazard function a priori.
- We understand the confusion about choosing the autoencoder for sampling, which we clarify as follows. We are not using the autoencoding formulation for sampling, but to learn functions , , , and . Sampling is done through a separate model trained to sample from , which is not implemented with an autoencoder. In fact, we use a conditional GAN and a conditional diffusion probabilistic model for the dSprite and Framingham experiments as described in Appendices E.5 and F.3, respectively. The Demand data is very simple; thus, a simple MLP modeling the (Gaussian) mean and variance of W is sufficient as described in Appendix E.4. In answer to the question from the reviewer, we are effectively using a diffusion model when there is a need for sampling.
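For readers unfamiliar with the Demand setup mentioned above, the idea of modeling the Gaussian mean and variance of W with a small network can be sketched as follows. This is only an illustration under stated assumptions: the one-hidden-layer network, the scalar conditioning input `x`, the hand-set weights in `PARAMS`, and the names `mlp_mean_logvar`, `gaussian_nll`, and `sample_w` are all hypothetical and are not taken from the paper (Appendix E.4 describes the actual model).

```python
import math
import random


def mlp_mean_logvar(x, params):
    """Tiny one-hidden-layer MLP mapping a scalar input to (mean, log-variance)."""
    hidden = [math.tanh(w * x + b) for w, b in zip(params["w1"], params["b1"])]
    mean = sum(v * h for v, h in zip(params["w_mean"], hidden)) + params["b_mean"]
    log_var = sum(v * h for v, h in zip(params["w_logvar"], hidden)) + params["b_logvar"]
    return mean, log_var


def gaussian_nll(w, mean, log_var):
    """Negative log-likelihood of w under N(mean, exp(log_var)); the training loss."""
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (w - mean) ** 2 / math.exp(log_var))


def sample_w(x, params, rng):
    """Draw one sample of W given x from the predicted Gaussian."""
    mean, log_var = mlp_mean_logvar(x, params)
    return rng.gauss(mean, math.exp(0.5 * log_var))


# Hypothetical, hand-set weights (a real model would learn them by minimizing the NLL).
PARAMS = {
    "w1": [0.5, -1.0],
    "b1": [0.0, 0.1],
    "w_mean": [1.0, 2.0],
    "b_mean": 0.0,
    "w_logvar": [0.0, 0.0],
    "b_logvar": 0.0,
}

if __name__ == "__main__":
    rng = random.Random(0)
    print([round(sample_w(0.0, PARAMS, rng), 3) for _ in range(3)])
```

Predicting the log-variance rather than the variance keeps the variance positive without a constraint, which is a common design choice for this kind of model.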
Thanks for providing the results from additional baselines and clarifications on my questions. I increased my score based on the updated results and arguments from the authors.
This paper focuses on causal effect estimation with latent confounders and two additional observed proxies, a framework commonly referred to as proximal causal inference. The paper derives a new error bound on the causal bridge function (more specifically, the outcome confounding bridge function), highlighting that the proxies should serve as low-noise measurements of the latent confounders. Further, the paper proposes a generative model based on an autoencoder architecture to learn the bridge function. Experimental results demonstrate the effectiveness of the proposed approach.
Strengths And Weaknesses
Strengths
- The paper is clearly written and well-organized.
- It proposes an autoencoder-based architecture for learning the causal bridge function within the proximal framework, which performs well empirically.
Weaknesses / Questions
- Corollary is not a surprising result, since is a non-information-loss function. Therefore, the information loss stems solely from the exogenous noise . As , the proxy becomes equivalent to , leading to .
- The authors discuss solely the case of the outcome confounding bridge function (denoted in [1]). Is it possible to derive a similar result for the treatment confounding bridge function [1]?
- Relatedly, is it possible to learn the treatment bridge function using the proposed autoencoder architecture?
- It would be better to clarify why we should access . What is the implication of using a separate dataset versus the same dataset?
- The experimental evaluation could be strengthened by comparing the proposed method with more state-of-the-art proximal baselines.
- The paper discusses how the additional loss terms in Equation (13) improve the learned bridge function h, but an ablation study on these terms is missing.
[1] Semiparametric proximal causal inference. JASA 2024.
Questions
See above.
Limitations
See above.
Final Justification
During the rebuttal, all my concerns were well addressed. Thus, I increased my score from 3 to 4.
Formatting Issues
None
We would like to thank the reviewer for their positive assessment of our paper and constructive feedback. Below we address the concerns raised by the reviewer.
- We completely agree with the reviewer that the Corollary is not surprising, but we decided to include it because it is an interesting special case for structural equation models. To our knowledge Theorem 3 is new, and we hoped to aid understanding of it by demonstrating that it implies the expected result in the Corollary.
- It is certainly possible to derive a similar result for the treatment confounding bridge; this is an excellent question. We have obtained results (experimental and theoretical) for the treatment bridge since we submitted to NeurIPS; however, we consider them a separate contribution. Nevertheless, we will discuss the treatment bridge as a natural extension that will be addressed in subsequent work. Additionally, please note that the treatment bridge is most useful for binary treatments (interventions), and most of our experiments considered real-valued treatments.
- The reviewer is also correct in that it is possible to learn the treatment bridge in an autoencoder setting, so that both the treatment and outcome bridges share statistical strength when learning the function that is used to implicitly sample from using , which for the treatment bridge will be a function of , matched in distribution to that using via an integral probability metric (IPM) or maximum mean discrepancy (MMD).
- We should clarify that is not necessary; however, it is done to weaken the requirement of having a common dataset for which (,,,) are available. This distinction is especially useful in situations where the dataset for which the outcome is available is not the same or is (much) smaller than . We will clarify it in the revision to avoid confusion.
- Following your suggestion and that of Reviewer rsP2, we have produced results for CEVAE, NMMR (Kompa et al., 2022) and Kernel Alternative Proxy (Bozkurt et al., 2025), reproduced below, where Q1 and Q3 are the first and third quartiles, reported alongside the median out-of-sample MSE. In the revision these will be presented graphically as in Figure 3. We see that our methods (CB and CB + AE, namely without and with the autoencoder formulation, respectively) outperform all the baselines considered.
| Method | Sample Size | Demand Median (Q1, Q3) | dSprite Median (Q1, Q3) |
|---|---|---|---|
| CEVAE | 1000 | 360.05 (205.15, 602.59) | 52.92 (51.45, 53.29) |
| NMMR-V | 1000 | 23.07 (17.27, 28.12) | 24.48 (21.32, 25.73) |
| NMMR-U | 1000 | 21.6 (20.41, 25.38) | 22.22 (18.37, 25.57) |
| Kernel Alternative Proxy | 1000 | 109.86 (79.01, 176.86) | 22.4 (21.75, 23.03) |
| DFPV | 1000 | 43.93 (38.38, 48.34) | 22.08 (19.96, 24.38) |
| DFPV (our generator) | 1000 | 34.15 (29.65, 40.88) | 15.25 (13.97, 18.59) |
| CB (our method) | 1000 | 20.5 (17.52, 24.78) | 14.06 (13.21, 14.94) |
| CB + AE (our method) | 1000 | 16.66 (14.59, 20.32) | 13.68 (13.39, 14.49) |
| CEVAE | 5000 | 262.24 (161.96, 438.63) | 53.89 (53.07, 54.18) |
| NMMR-V | 5000 | 31.03 (24.07, 41.42) | 11.62 (8.77, 14.44) |
| NMMR-U | 5000 | 17.12 (13.19, 24.1) | 19.11 (18.11, 20.3) |
| Kernel Alternative Proxy | 5000 | 153.19 (96.4, 221.03) | 26.53 (25.85, 27.1) |
| DFPV | 5000 | 43.35 (32.65, 53.2) | 16.33 (15.23, 17.44) |
| DFPV (our generator) | 5000 | 27.84 (24.73, 30.71) | 8.79 (8.2, 9.49) |
| CB (our method) | 5000 | 15.84 (14, 17.28) | 7.45 (6.92, 7.96) |
| CB + AE (our method) | 5000 | 14.77 (12.9, 16.89) | 7.27 (7.04, 8) |
- We apologize for the confusion regarding equation (13). We do have an ablation study in the sense that we present results both using (12), i.e., just the bridge (CB in Figure 3), and (12) + (13), i.e., the proposed autoencoding setup (CB + AE in Figure 3). These are reproduced above for convenience. Moreover, the weights of the three loss terms (, , ) are also optimized as described in Appendix F. Finally, we do not present ablation results using only and because the autoencoding setup is intended to model the joint distribution following the graphical model in Figure 1.
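As background for the distribution matching mentioned in the response on the treatment bridge, a minimal sketch of a maximum mean discrepancy (MMD) estimate between two samples is given here. It assumes scalar samples, a Gaussian (RBF) kernel with a fixed bandwidth, and the biased V-statistic estimator; the function names are illustrative, and nothing here is claimed to match the kernel, estimator, or IPM the authors actually use.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two scalar samples."""
    k_xx = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in xs) / (len(xs) * len(xs))
    k_yy = sum(rbf_kernel(a, b, bandwidth) for a in ys for b in ys) / (len(ys) * len(ys))
    k_xy = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    # Within-sample similarity minus twice the cross-sample similarity.
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    near = [0.0, 0.1, -0.1]
    far = [3.0, 3.1, 2.9]
    print(mmd_squared(near, near))  # identical samples give a value at (or numerically near) zero
    print(mmd_squared(near, far))   # well-separated samples give a clearly positive value
```

Used as a training penalty, a quantity of this kind is small when two sets of samples are close in distribution, which is the sense in which two samplers can be "matched in distribution".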
I appreciate the authors' detailed responses, especially the additional experimental results, which support their claims well.
I have modified my score accordingly.
This paper addresses the challenge of estimating treatment effects under unobserved confounding by leveraging a causal bridge function within a proxy variable setting. It reformulates the bridge in a way that enables an intuitive deep learning implementation that improves the interpretability and statistical performance. The authors present theoretical conditions under which treatment effects can be identified using two proxy variables and introduce a novel set of assumptions that maintain the same inferential soundness. They further contribute a sensitivity analysis framework to assess robustness to assumption violations and demonstrate how their method enables statistical strength-sharing across variables. The framework consists of a two-step modeling pipeline: training a generative model to sample from proxy variables and then estimating treatment effects using these samples along with simulated unobserved confounders. A joint modeling strategy is employed to refine the confounder sampler. The approach is also extended to survival outcomes and is shown empirically to outperform state-of-the-art methods.
Strengths And Weaknesses
Strengths:
- The main contribution of the new bridge formulation is a novel result that cleverly allows a new practical implementation for estimating treatment effects.
- Assumptions are very clearly stated.
- Both the causal bridge approach and the joint objective are clearly demonstrated to lower estimate errors in out-of-sample settings in the experiments.
Weaknesses:
- The paper is dense and is difficult to read. It would benefit from having figures and examples explaining the intuition of the bridge function, as well as the proven results.
Questions
Full disclosure: I have previously reviewed this paper and gave it an accepting score. In the rebuttal, my questions were answered, and my opinion of the paper remains largely unchanged.
Limitations
Limitations are stated.
Final Justification
I maintain my score and recommend the paper for acceptance.
Formatting Issues
No issues found.
We would like to thank the reviewer for their positive assessment of our paper and general feedback. We understand that the paper is somewhat dense; however, this is necessarily so due to space constraints. We have done our best to balance its contents with extensive supplementary information in the Appendix, which contains an illustrative example of the bridge function as well as experiments using an intuitive structural equation model (SEM) in Appendices C and D. As a complement, we will include a new figure in Appendix C illustrating the SEM in equations (27)-(30), which can provide further intuition about the bridge function.
I thank the authors for the rebuttal. I maintain my positive score.
The paper presents theoretical insights into the causal bridge method, a causal inference approach based on proxy variables, designed to estimate the causal effect of a treatment variable on an outcome in the presence of unobserved confounders. Building on these theoretical analyses, the paper proposes an architecture aimed at improving the quality of causal bridge estimates.
Strengths And Weaknesses
Strengths:
- The paper is well-motivated and addresses an important problem. Causal effect estimation in the presence of unobserved confounding is a challenging yet critical issue, and proposing a framework to handle this is valuable.
- The paper is interesting, particularly the theoretical analysis and how it is used to justify the proposed method. The comparison with CEVAE is also compelling.
- The paper is well-structured and easy to follow and the proposed method demonstrates good performance, outperforming the baselines.
Weaknesses:
- Some contributions of the paper appear incremental. In particular, much of Section 2, including Propositions 1 and 2, is a direct derivation from Equation (1). Additionally, some findings, such as W and Z being noisy mappings of U, have already been discussed in prior work [20].
- Some of the language in the paper needs to be more precise. For example, the introduction should explain more clearly the advantages of proxy variable methods, especially causal bridge, compared to others, such as instrumental variable approaches.
- The paper only uses one real-world dataset in the experiments. Including more real-world datasets would help better demonstrate the performance of the proposed method. Also, given the similarities between the proposed approach and CEVAE, the paper should include it as a baseline. I also strongly suggest that the paper include more ablation studies in the main paper, rather than in the appendix.
Questions
Please refer to Strengths And Weaknesses.
Limitations
Yes.
Final Justification
I believe the paper addresses an interesting problem and proposes both a theoretical analysis and a well-justified methodology. However, the concerns regarding the clarity of the paper remain. Overall, I'd like to keep my positive score.
Formatting Issues
N/A
We would like to thank the reviewer for their positive assessment of our paper and constructive feedback. Below we address the three concerns raised by the reviewer.
- Although some of the contents of the paper, including Propositions 1 and 2, can be derived from (1) as the reviewer points out, they have not been introduced elsewhere. However, the reviewer is correct that the assumption that W and Z are noisy mappings of U has been discussed elsewhere. We would like to point out that beyond re-examining the assumptions underlying the causal bridge in Section 4, which includes new theoretical results in Theorem 3 and Corollary 1, we also introduce a new formulation for the causal bridge leveraging generative models for the outcome proxy (), an autoencoder architecture that enables sharing of statistical strength between observed quantities (, , and ), an extension to survival outcomes, and extensive experimental results with synthetic and real-world data demonstrating the effectiveness of our generative, autoencoding and survival approaches to model the causal bridge.
- The suggestion about clarifying the advantages of proxy variable methods is well taken. In the revision, we will clarify the need for and advantages of proxy variable methods in the unobserved confounding setting. Importantly, we will emphasize that they require weaker assumptions than an instrumental variable approach, which assumes that the instrumental variable and the unobserved confounder are independent and that the instrument influences the treatment but not the outcome directly.
- We agree that having additional real-world datasets will make the experimental results stronger; however, it is difficult to find real-world datasets for which the ground-truth causal effects are known, as is the case for the Framingham dataset used in our experiments. This will be discussed as a limitation of the proposed approach in the Conclusions section. We did not include results for CEVAE because they are not competitive (they did not fit within the range of the vertical axes we used in our figure); however, we will include them in the revision, and they are presented below (in the revision these will be presented graphically as in Figure 3), where Q1 and Q3 are the first and third quartiles, reported alongside the median out-of-sample MSE. We also include results from two additional competitive baselines suggested by Reviewer rsP2, namely NMMR (Kompa et al., 2022) and Kernel Alternative Proxy (Bozkurt et al., 2025). We see that our methods (CB and CB + AE, namely without and with the autoencoder formulation, respectively) outperform all the baselines considered. We agree with the reviewer that including the ablation studies in the main paper is desirable, and we plan to do so in the revision, considering that we will have additional space (1 page).
| Method | Sample Size | Demand Median (Q1, Q3) | dSprite Median (Q1, Q3) |
|---|---|---|---|
| CEVAE | 1000 | 360.05 (205.15, 602.59) | 52.92 (51.45, 53.29) |
| NMMR-V | 1000 | 23.07 (17.27, 28.12) | 24.48 (21.32, 25.73) |
| NMMR-U | 1000 | 21.6 (20.41, 25.38) | 22.22 (18.37, 25.57) |
| Kernel Alternative Proxy | 1000 | 109.86 (79.01, 176.86) | 22.4 (21.75, 23.03) |
| DFPV | 1000 | 43.93 (38.38, 48.34) | 22.08 (19.96, 24.38) |
| DFPV (our generator) | 1000 | 34.15 (29.65, 40.88) | 15.25 (13.97, 18.59) |
| CB (our method) | 1000 | 20.5 (17.52, 24.78) | 14.06 (13.21, 14.94) |
| CB + AE (our method) | 1000 | 16.66 (14.59, 20.32) | 13.68 (13.39, 14.49) |
| CEVAE | 5000 | 262.24 (161.96, 438.63) | 53.89 (53.07, 54.18) |
| NMMR-V | 5000 | 31.03 (24.07, 41.42) | 11.62 (8.77, 14.44) |
| NMMR-U | 5000 | 17.12 (13.19, 24.1) | 19.11 (18.11, 20.3) |
| Kernel Alternative Proxy | 5000 | 153.19 (96.4, 221.03) | 26.53 (25.85, 27.1) |
| DFPV | 5000 | 43.35 (32.65, 53.2) | 16.33 (15.23, 17.44) |
| DFPV (our generator) | 5000 | 27.84 (24.73, 30.71) | 8.79 (8.2, 9.49) |
| CB (our method) | 5000 | 15.84 (14, 17.28) | 7.45 (6.92, 7.96) |
| CB + AE (our method) | 5000 | 14.77 (12.9, 16.89) | 7.27 (7.04, 8) |
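For reference, the "Median (Q1, Q3)" entries in the table summarize a set of per-run out-of-sample MSE values. A minimal sketch of such a summary is shown here, assuming the inclusive quartile convention of Python's `statistics.quantiles`; the convention the authors used is not stated in the thread, and the helper names and input values are hypothetical.

```python
import statistics


def summarize_mse(mse_per_run):
    """Return (median, Q1, Q3) of per-run out-of-sample MSE values."""
    q1, median, q3 = statistics.quantiles(mse_per_run, n=4, method="inclusive")
    return median, q1, q3


def format_summary(mse_per_run, digits=2):
    """Format the summary in the table's 'median (Q1, Q3)' style."""
    median, q1, q3 = summarize_mse(mse_per_run)
    return f"{median:.{digits}f} ({q1:.{digits}f}, {q3:.{digits}f})"


if __name__ == "__main__":
    # Hypothetical MSE values from five runs, not taken from the paper.
    print(format_summary([1, 2, 3, 4, 5]))
```

Reporting the median with quartiles rather than the mean with standard deviation makes the summary less sensitive to occasional runs with very large error.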
I'd like to thank the authors for their response. I'll maintain my positive score.
The paper provides a new error bound for the causal bridge under violated assumptions, introduces a generative + autoencoder formulation that improves estimation, extends to survival outcomes, and shows empirical gains over strong baselines on synthetic and real-world data. In addition, the authors have effectively addressed the reviewers’ concerns. Therefore, I believe this paper is suitable for publication.