Causal Effect Estimation with Mixed Latent Confounders and Post-treatment Variables
Abstract
Reviews and Discussion
This paper investigates the problem of latent post-treatment bias in causal models where there exist proxy variables of the latent confounders and post-treatment variables. The authors first derive a general form of the latent post-treatment bias, which is intractable in most situations (except in special cases such as linear SCMs). The authors state that the latent post-treatment bias can be arbitrarily bad for existing proxy-based causal inference methods. They then propose an identifiable VAE-based causal inference algorithm under the assumption that at least one dimension of each sufficient statistic of the latent prior is invertible. The proposed method is evaluated on both synthetic and real-world datasets to demonstrate its causal effect estimation capability in the presence of both latent confounders and post-treatment variables.
Strengths
• Causal reasoning in the context of latent confounder and post-treatment variables is an important topic especially with observational data.
• The authors clearly state the necessary assumptions for the identifiability of true latent variables, and the logic of determining the dimensions of C and M is well presented.
• The paper has a well-established theoretical basis.
Weaknesses
• For the illustrative example in the introduction, it might be better to explicitly specify what the post-treatment variable is.
• Other existing works [1-3] on identifying latent confounders/mediators based on the iVAE architecture should also be included in the related work.
• The role of post-treatment variables seems to be a bit ambiguous. To be specific, is Theorem 4.1 valid for all types of relationships between M and Y?
• The illustration of (iv) in Assumption 3 is a little confusing, as it assumes one extra degree of freedom on the prior parameters of and is critical to the identifiability of from . More explanation on this point will be appreciated.
• The empirical evaluation consists of only one real-world dataset, which somewhat limits the evidence for the applicability of the proposed method.
References:
[1]. Zhou, D., & Wei, X. X. (2020). Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE. Advances in Neural Information Processing Systems, 33, 7234-7247.
[2]. Sorrenson, P., Rother, C., & Köthe, U. (2020). Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872.
[3]. Jiang, Z., Liu, Y., Klein, M. H., Aloui, A., Ren, Y., Li, K., ... & Carlson, D. (2023). Causal Mediation Analysis with Multi-dimensional and Indirectly Observed Mediators. arXiv preprint arXiv:2306.07918.
Questions
• Is in Eq. 4 also based on Lemma 3.1? If yes, it should be explicitly stated.
• In what cases can the bias in Theorem 3.2 be arbitrarily bad besides the causal model assumed by linear SCM in Corollary 3.3?
• What is the rationale behind the simulation procedure of and in Eq. 11 if the latent confounder represents “job seniority” as elaborated in the introduction? How do you anticipate the estimation error to change if we increase the complexity of the neural network in Eq. 11?
Limitations
The authors do not include a paragraph discussing the limitations and potential societal impact of this work.
The authors deeply appreciate your insightful comments to make our paper better. We hope that we have addressed your concerns in our responses. If you have further questions, we'd be happy to continue the discussions.
Comment 1: Specify the post-treatment variable for the example.
Response: Thank you for your valuable suggestion. The post-treatment variables in the example are the subset of required skills that are causally influenced by the treatment (i.e., switching a job from onsite to online) and that in turn influence people's decisions on applying for the job. For instance, switching to online work might require stronger communication skills, which could affect people's application decisions.
Comment 2: Existing works [1-3] should be included in the related work.
Response: Thank you for pointing out these important works. [2] proposed the GIN network for iVAE, which is invertible and volume-preserving. [1] generalized [2] by modeling spike outputs with a Poisson distribution. [3] adapted iVAE to identify multiple latent mediators from high-dimensional observations. We will include them in the related work.
Comment 3: Is Theorem 4.1 valid for all types of relations between 𝑀 and 𝑌?
Response: Thank you for your valuable feedback. Theorem 4.1 is valid for all types of relations between M and Y. The reason is that the non-factorized part in the sufficient statistics of the conditional prior of the latent variables allows for arbitrary dependence between the latent variables (which include both C and M) and Y (see Assumption 2). Therefore, it implicitly allows for any relation between M and Y.
Comment 4: The illustration of (iv) in Assumption 3 is a little confusing.
Response: Thank you for pointing this out. In this paper, we aim to prove that for CiVAE, if for two we have , the map from to is element-wise bijective. The reason we need points (instead of just ) is that, after plugging in points in the equality and taking diff. with the first equation, we can end up with linearly independent equations (see Eq. (19)), which is necessary to prove the sufficient statistics can be identified up to bijective transformation. The details are in Step I of C.4.1.
In addition, Section B.2.3 of [4] shows that if the exponential family parameters are independent, (iv) can be satisfied with arbitrary different (Y, T) points.
Comment 5: The empirical evaluation consists of one real-world dataset.
Response: Thank you for your constructive feedback. Following your advice, we have conducted experiments on the real-world IHDP [5] and Jobs [6] datasets, where we follow the same dataset generation process as for the Company dataset to simulate the latent confounders and the latent post-treatment variables in the observed covariates X. We normalize the outcome so that the reported errors are comparable. The results are summarized as follows:
| Method | IHDP ATE | IHDP Err | Jobs ATE | Jobs Err |
|---|---|---|---|---|
| CEVAE | -0.463 ± 0.081 | 0.787 | 0.130 ± 0.047 | 0.525 |
| TEDVAE | -0.317 ± 0.074 | 0.641 | 0.185 ± 0.069 | 0.470 |
| CiVAE | 0.178 ± 0.138 | 0.146 | 0.602 ± 0.162 | -0.053 |
| True ATE | 0.324 ± 0.000 | 0.000 | 0.549 ± 0.000 | 0.000 |
The results further demonstrate that CiVAE is more robust to latent post-treatment bias.
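As a quick consistency check, the IHDP error column can be reproduced from the IHDP ATE column, assuming Err denotes the signed difference between the true ATE and the estimated ATE. The sketch below uses only the IHDP point estimates from the table; the variable and function names are illustrative, and the absolute error is printed alongside for comparison.

```python
# IHDP point estimates copied from the table above (standard deviations omitted).
TRUE_ATE_IHDP = 0.324
estimated_ate_ihdp = {"CEVAE": -0.463, "TEDVAE": -0.317, "CiVAE": 0.178}


def signed_error(true_ate, est_ate):
    """Signed error, assumed here to be true ATE minus estimated ATE."""
    return true_ate - est_ate


signed = {
    method: round(signed_error(TRUE_ATE_IHDP, ate), 3)
    for method, ate in estimated_ate_ihdp.items()
}
absolute = {method: abs(err) for method, err in signed.items()}

for method in estimated_ate_ihdp:
    print(f"{method}: signed={signed[method]:+.3f}, absolute={absolute[method]:.3f}")
```

Reporting the absolute value removes sign ambiguity when ranking methods, while the signed value still shows whether a method over- or underestimates the effect.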
Comment 6: Is in Eq. 4 also based on Lemma 3.1?
Response: Thank you for the important question. This step is based on the definition of DEV in Definition 1, where we substitute with .
Comment 7: In what cases can the bias in Theorem 3.2 be arbitrarily bad?
Response: Thank you for the important question. We provide two linear cases only to intuitively show the latent post-treatment bias (which can be calculated with the coefficients of the linear structural equations). The latent post-treatment bias in the general, nonlinear cases is provided in Eq. (4), which can also be arbitrarily bad, but it is abstract and cannot be further simplified without further assumptions on the causal generation process of the dataset.
Comment 8-1: Rationale behind the simulating and in Eq. 11 if represents job seniority.
Response: Thank you for the important question. The example in the introduction aims to provide a concrete example of possible scrambling of latent confounders (i.e., seniority of the job) and latent post-treatment variables (i.e., work-mode relevant job skills) in the observed covariates . However, both and are difficult to quantify in the Company setting. Therefore, given , we simulate and from scratch, whereas Eq. (11) ensures that the information of the real-world data, i.e., the marginal distribution of the observables, i.e., , is preserved in the semi-simulated dataset.
Comment 8-2: How will estimation error change if we increase the complexity of the neural network in Eq. 11?
Response: Thank you for the important question. As we explained in our response to your comment 8-1, Eq. (11) ensures that the marginal distribution of the observable is consistent with the real-world data. A more complicated will make the semi-simulated dataset deviate more from the real-world data (due to overfitting), but it wouldn't affect the estimation step (which is independent of the generation of the semi-simulated dataset).
[4] Variational autoencoders and nonlinear ICA: A unifying framework.
[5] Bayesian nonparametric modeling for causal inference.
[6] Evaluating the econometric evaluations of training programs with experimental data.
Thank you for your response to my comments, though the negative ATE error on the Jobs dataset looks a bit weird. Also, for the second step in Eq. 4 where you replace with , I mean this substitution probably needs the injectivity statement in Lemma 3.1.
I think the authors' response mostly addresses my concerns, and I'm willing to raise my score.
Dear reviewer jEuj,
Thank you for the acknowledgment. We are glad to hear that our responses mostly address your concerns. We will try our best to integrate your valuable comments into the paper. Here, we provide responses to your further questions.
1. In the current version, we report the difference between the true ATE and the estimated ATE as the error. We will also provide the absolute difference to make the comparison between different methods more straightforward.
2. We note that is the ATE estimator when controlling an arbitrary variable . is defined as: where ":=" denotes the RHS is the definition of the LHS. Here, is an arbitrary variable.
For Eq. (4), we are discussing the bias when estimating the ATE when controlling the variable , i.e., the latent variable inferred from via . The bias is defined as the difference between the true ATE and the estimated ATE, i.e., . Therefore, the second step simply uses the definition of , i.e., where no derivation is involved in this step. The actual derivations occur in the 3rd and 4th step, where the 4th step uses Lemma 3.1 to remove the injective in the condition. We will change "=" in the second equation of Eq. (4) to ":=" to avoid confusion. Thank you again for raising this important question to make our paper clearer.
Best,
Authors
Thank you for your clarification. I have no further questions at this point.
Dear Reviewer jEuj,
Thank you for the acknowledgment. We are committed to integrating your valuable comments into the paper. Thank you again for your time and efforts.
Best,
Authors
The authors deal with latent post-treatment bias for proxy-based methods which are employed for causal effect estimation. They show that post-treatment variables can be latent and mixed into the observed covariates along with the latent confounders. The authors transform the confounder-identifiability problem into a tractable pair-wise conditional independence test problem. They prove that the latent confounders and latent post-treatment variables can be identified up to bijective transformations. Finally, they provide experimental analysis for their approach.
Strengths
The paper deals with a very interesting problem. The proposed method appears to be theoretically sound. The method is evaluated with proper experimental analysis on synthetic and real-world datasets and compared with multiple baselines.
Weaknesses
Here I provide some weaknesses of the paper:
- Bi-directed edges in Figure 1 are not defined properly.
- Do-operator in equation 3 is not defined in detail.
- Assumptions in Assumption 2 should be described in more detail.
- The proposed method seems to depend on a lot of assumptions. Assumptions 1,2,3 each contain multiple assumptions. The authors should explain how their assumptions hold for the real-world scenarios they considered in their experiment section.
Questions
Here I provide some questions:
- Why do the authors assume that the models can recover the true latent space up to invertible transformation (line 151) ? How realistic is that assumption?
- Do the proxy-based methods claim to perform well for the causal graphs in Fig 1c?
- How do these assumptions hold when X is high-dimensional?
- What values of K_C and K_M are considered?
Limitations
The authors discuss very few limitations of their paper; more discussion is needed.
The authors deeply appreciate your insightful comments to make our paper better. We hope that we have addressed your concerns in our responses. If you have further questions, we'd be happy to continue the discussions.
Comment 1: Bi-directed edges in Figure 1 are not defined properly.
Response: Thank you for pointing this out. Bi-directed edges in Fig. 1 denote an arbitrary causal relationship between each post-treatment variable and the outcome Y. This will be clarified in the revised paper.
Comment 2: Do-operator in equation 3 is not defined.
Response: Thank you for pointing this out. The do-operator represents an intervention in the causal model, where E[Y | do(T = t)] represents the expected value of Y if we were to intervene and set the treatment T to value t for the entire population, regardless of the natural causes of T. We will add this explanation to the paper for clarity.
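For reference, the interventional quantities described above can be written out as follows. This is a sketch in standard notation, assuming a binary treatment T in {0, 1} and a set of confounders C that satisfies the back-door criterion; the symbols here are assumptions and may differ from those used in the paper's Eq. (3).

```latex
\mathrm{ATE} = \mathbb{E}[Y \mid do(T=1)] - \mathbb{E}[Y \mid do(T=0)],
\qquad
\mathbb{E}[Y \mid do(T=t)] = \mathbb{E}_{C}\big[\, \mathbb{E}[Y \mid T=t,\, C] \,\big].
```

The second identity is the usual back-door adjustment: the interventional expectation is recovered from observational conditionals only when the adjustment set contains the confounders and no post-treatment variables.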
Comment 3: Assumption 2 should be described in more detail.
Response: Thanks for the constructive suggestion! The assumption states the mild condition the prior of , i.e., , should satisfy for individual and bijective identification of from the observables . Specifically, it assumes that belongs to a general exponential family with two-part sufficient statistics : (i) A factorized part , where each component has at least one invertible dimension. This ensures individual/bijective identifiability of latent variables. (ii) A non-factorized part modeled by a ReLU deep neural network, which allows complex dependencies among latent variables. We will add the above explanation to the paper.
Comment 4: How do the assumptions hold for the real-world scenarios in the experiment section?
Response: Thank you for your insightful suggestion. We provide empirical justification for the three assumptions as follows:
For Assumption 1, since the dimension of the observed covariates , i.e., , is larger than the dimension of the latent variables , i.e., , it would be likely that is injective or very close to injective due to the low probability that two distant points in the low dimensional latent space are mapped by to the same point in the high-dimensional space.
Assumption 2, i.e., the conditional prior of latent variables following a general exponential family, would be a reasonable approximation of the true prior, as general exponential family includes the most commonly used distributions, and the non-factorized part of the sufficient statistics parameterized by a ReLU deep neural network allows complex (conditional) dependence among the latent variables.
Assumption 3 ensures that the dataset and model class we choose allow the identification, where (i) states that the noise distribution should not be degenerate, which depends on the dataset quality. (ii) and (iii) can be trivially satisfied by neural networks. For (iv), Section B.2.3 of [1] shows that if the exponential family parameters are independent, (iv) can be satisfied with arbitrary different (Y, T) points. This can be satisfied by most exponential family distributions.
Comment 5: Why do the authors assume that the models can recover the true latent space up to invertible transformation (line 151)?
Response: Thanks for the valuable feedback. We want to clarify that this assumption is made only for existing proxy-based methods, not for our proposed CiVAE. This is actually the most optimistic assumption we could make for these methods as they exactly recover the latent space. We show that even under this optimistic assumption, the ATE/CATE estimation is still arbitrarily biased when latent post-treatment variables are present.
Comment 6: Do the proxy-based methods claim to perform well for the causal graphs in Fig 1c?
Response: Thanks for the valuable feedback. Most proxy-based methods do not explicitly address the causal graph in Fig 1c. They typically rely on the strong ignorability assumption, which implies both that (i) all confounders are captured by observed covariates, and (ii) no post-treatment variables are included. However, these methods often focus more on the first implication and ignore the potential presence of post-treatment variables in the proxies (which leads to violation of (ii)). This can lead to biased ATE/CATE estimates when latent post-treatment variables are mixed with confounders in the observed covariates, and motivates us to design CiVAE to address the latent post-treatment bias.
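For concreteness, strong ignorability is commonly stated as the pair of conditions below. This is the standard textbook form, assuming a binary treatment T and potential outcomes Y(0), Y(1); the notation is an assumption and is not taken from the paper.

```latex
\big( Y(0),\, Y(1) \big) \perp\!\!\!\perp T \mid X,
\qquad
0 < P(T = 1 \mid X) < 1 .
```

The first condition (unconfoundedness given X) is the one at stake in this discussion: it is only plausible when X is a pre-treatment quantity, which is why post-treatment information mixed into X undermines it.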
Comment 7: How do these assumptions hold when X is high-dimensional?
Response: Thanks for raising this important point. Assumption 1 (noisy-injectivity) implies that the dimension of X is larger than or equal to that of the latent space, which is typically satisfied when X is high-dimensional. Assumption 2 puts a general prior on the latent variables, whereas Assumption 3 contains standard regularity conditions; both are unrelated to the dimensionality of X. Therefore, all three assumptions in this paper hold for high-dimensional covariates X.
Comment 8: What values of K_C and K_M are considered?
Response: Thanks for the valuable feedback. In our experiments, we considered various combinations of and : For the simulated datasets, we empirically set and . For the real-world Company dataset, we empirically set and . Additionally, in Section 5.3, for the Company dataset, we conducted a sensitivity analysis where we varied the ratio of to from . This analysis demonstrates the robustness of CiVAE under different latent variable configurations.
[1] Variational autoencoders and nonlinear ICA: A unifying framework.
In this paper, the authors investigated the issue of latent post-treatment bias in causal inference from observational data. They showed that the estimator of existing proxy-of-confounder-based methods, i.e., DEV(f(X)), is an arbitrarily biased estimator of the Average Treatment Effect (ATE) when the selected proxy of confounders X accidentally mixes in latent post-treatment variables (Theorem 3.2). To address this issue, they proposed the Confounder-identifiable VAE (CiVAE), which identifies latent confounders up to bijective transformations under a mild assumption regarding the prior of latent factors. They showed that controlling for latent confounders inferred by CiVAE can provide an unbiased estimation of the ATE. Experiments on both simulated and real-world datasets demonstrate that CiVAE exhibits superior robustness to latent post-treatment bias compared to state-of-the-art methods.
Strengths
Being able to recover latent variables (confounders, post-treatment variables, or others) from observations is challenging and important. Ignoring latent variables or assuming they do not exist is unrealistic and can lead to wrong conclusions and decisions. The authors further motivated the importance of recovering latent confounders and post-treatment variables, and the consequences of not doing so (Theorem 3.2). The solution provided shows originality and quality.
Weaknesses
The presentation can be improved.
Questions
Is Fig. 1(c) general enough? It assumes that all latent variables are either confounders or post-treatment variables. However, there can be other types of latent variables, such as:
- Pre-treatment Variables: These latent variables influence the treatment (T) but do not directly affect the outcome (Y) or the additional observation (X). They exist before the treatment is applied and can introduce selection bias.
- Latent Interaction Variables: These latent variables interact with the treatment (T) to influence the outcome (Y). They are not confounders because they do not influence the treatment directly, nor are they post-treatment variables.
- Latent Mediator Variables: These latent variables mediate the effect of the treatment (T) on the outcome (Y) and are not directly observed.
- Latent Variables Influencing Both Pre-treatment and Post-treatment States: These latent variables influence the state of the system both before and after the treatment but do not fit the typical definition of confounders or post-treatment variables. For example, a latent mental state might affect both a person's initial willingness to undergo treatment and their behavior or responses post-treatment.
Can the proposed method handle these types too (with some extension), or are some of the types quite disruptive to the proposed methodology?
Limitations
n.a.
The authors deeply appreciate your insightful comments to make our paper better. We hope that we have addressed your concerns in our responses. If you have further questions, we'd be happy to continue the discussions.
Comment 1: How does CiVAE (with possible extension) address other types of latent variables?
Response: Thank you for the insightful comments. It would indeed be interesting to discuss the behavior and possible extension of CiVAE when different types of latent variables are scrambled into the observed covariates alongside the latent confounders and the latent post-treatment variables .
First, from Assumption 2, we know that CiVAE allows arbitrary (conditional) dependence among latent variables that generates to individually and bijectively identify them from the observables . Therefore, regardless of the type of , if we denote the inferred latent variables as , we have , where is a bijective function. However, for each , whether corresponds to type , or is unknown.
Therefore, if only C and M exist, to further distinguish C from M among the inferred latent variables, since C are pre-treatment and M are post-treatment, a practical strategy is to select the variables whose pairwise dependence increases after conditioning on T, as only pairs of confounders form a V-structure with T (i.e., C_i → T ← C_j), where their dependence increases after conditioning on T. In contrast, C and M form a chain structure with T (i.e., C_i → T → M_j), and pairs of M form a fork structure with T (i.e., M_i ← T → M_j), where dependence decreases after conditioning on T.
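The conditioning logic above can be checked on a toy simulation. The sketch below is not the paper's procedure: it assumes a linear-Gaussian model with a continuous treatment, uses partial correlation as a stand-in for a conditional independence test, and all variable names and coefficients are hypothetical.

```python
import numpy as np


def partial_corr(a, b, t):
    """Correlation of a and b after linearly regressing t out of both."""
    design = np.column_stack([np.ones_like(t), t])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


def dependence_change(a, b, t):
    """|partial correlation given t| minus |marginal correlation|."""
    marginal = abs(float(np.corrcoef(a, b)[0, 1]))
    conditional = abs(partial_corr(a, b, t))
    return conditional - marginal


rng = np.random.default_rng(0)
n = 5000
c1, c2 = rng.normal(size=n), rng.normal(size=n)   # independent confounders
t = c1 + c2 + 0.5 * rng.normal(size=n)            # treatment: collider of c1 and c2
m1 = t + rng.normal(size=n)                       # post-treatment variables
m2 = t + rng.normal(size=n)

changes = {
    "C1,C2 (V-structure)": dependence_change(c1, c2, t),
    "C1,M1 (chain)": dependence_change(c1, m1, t),
    "M1,M2 (fork)": dependence_change(m1, m2, t),
}
for pair, delta in changes.items():
    print(f"{pair}: change in dependence after conditioning on T = {delta:+.2f}")
```

Only the confounder pair shows a positive change, which is the signature used to place variables in the control set; with a binary treatment or nonlinear mechanisms, a proper conditional independence test would be needed instead of partial correlation.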
We can use similar logic to reason with the case when different types of exist.
Case 1. Pre-treatment variables that do not directly influence the outcome.
If are pre-treatment variables, since they causally influence the treatment , they still form V-structure with , and therefore CiVAE will identify them in and include them in the control set after the pair-wise independence test.
Here, we need to further divide the pre-treatments into two cases.
The first case is that are correlated with . In this case, controlling the identified can reduce both confounding bias and variance.
Another case is that are not-correlated with . In this case, controlling is still unbiased (which achieves the main purpose of removing latent post-treatment bias of the paper), but the estimation variance could increase. A trivial extension of CiVAE to address this issue is to conduct another round of independence test among the identified confounders (with and without the outcome as the condition) and keep the pairs in where dependence increases after conditioning on (as true confounders form V-structure with ). The discussion will be included in the revised paper.
Case 2. Latent interaction variables.
The case where are latent interaction variables is more complicated, as the relation between and the treatment is undetermined. If each is confounded with via an independent unobserved confounder , have the following relationship with , i.e., . Since the dependence among will increase after conditioning on , will be included in the control set. However, if is confounded with via a shared confounder , the relation becomes , controlling would probably decrease the dependence (as contains the confounder information). In this case, won't be included in the control set.
However, regardless of whether are included in the control set, CiVAE remains unbiased, because do not influence the identification of confounders . In addition, are still pre-treatment, such that no post-treatment bias can be introduced in the ATE/CATE estimation.
Case 3. Latent mediator variables.
If are latent mediators, is a special case of post-treatment variables . Since form the fork structure with the treatment (i.e., ), their dependence will decrease after conditioning on , and therefore they will be successfully excluded from the control set to eliminate latent post-treatment bias.
Case 4. Latent variables influencing both pre-treatment and post-treatment states.
If are latent variables that influence both pre-treatment and post-treatment states, since still forms the fork structure with the treatment (i.e., ), their dependence will decrease after conditioning on , and therefore they will be successfully excluded from the control set to eliminate latent post-treatment bias.
Thank you. I've read your rebuttal, responses, and the other reviews. I will keep an eye on the reviewers' discussion phase, if there is one.
Dear reviewer VF9h,
Thank you for the acknowledgment. We will try our best to integrate your valuable comments into the paper. Thank you again for your time and efforts.
Best,
Authors
This paper addresses the challenge of causal inference with observational data, particularly when direct measurement of confounders is infeasible. The authors propose a new method, Confounder-identifiable Variational Autoencoder (CiVAE), to mitigate post-treatment bias using observed proxies for both latent confounders and latent post-treatment variables. The paper provides a theoretical analysis under specific assumptions and validates the proposed approach through experiments on both simulated and real-world datasets.
Strengths
- The paper investigates a critical question concerning the mitigation of post-treatment bias, which is essential in various practical scenarios.
- The ideas presented in the paper are clear and easy to follow, and the theoretical analysis is well-established.
Weaknesses
- In practical scenarios, interactions among latent factors are often present and can significantly impact the estimation. It would be beneficial if the authors could elaborate on how their method addresses these interactions and whether there are any theoretical guarantees regarding their handling in the proposed approach.
- The theoretical guarantees rely on strong assumptions, and the assumptions are hard to verify in practice. In Assumption 1, the paper assumes an injective function of latent confounders and latent post-treatment variables into the observed proxy. This is a strong assumption, and it will be much harder to meet in general when the function is nonlinear. The specific setup with strong assumptions limits the practical applicability of the proposed approach. It would be helpful if the authors could provide examples where these assumptions hold and demonstrate how they can be verified.
- The experiment lacks sufficient details on setup and implementation. Could the authors provide more specific information to enhance understanding of the empirical results?
Questions
See Weaknesses.
Limitations
- The proposed method relies on very strong assumptions to ensure identifiability, which can be challenging to verify in practical applications.
The authors deeply appreciate your insightful comments to make our paper better. We hope that we have addressed your concerns in our responses. If you have further questions, we'd be happy to continue the discussions.
Comment 1: How does CiVAE address interactions among latent variables, and are there theoretical guarantees?
Response: Thank you for raising this important point. We have extended our analysis to interactions among latent variables in Section 4 of the Appendix. Specifically, we consider two cases: (i) Intra-interactions among latent mediators and (ii) Inter-interactions between and latent confounders .
Theoretically, the inferred latent variables via Eq. (10) still individually identify the true latent variables, i.e., , up to bijective map, as Assumption 2 allows arbitrary (conditional) dependence among latent variables. When interactions exist, we can use more general causal discovery methods, e.g., the PC algorithm, to further identify the latent confounders in . The reason is that, since latent post-treatment variables cannot causally influence (otherwise will be post-treatment), and the PC algorithm orients edges in the causal graph via colliders, latent confounders can be properly oriented by the PC algorithm as they form colliders with and therefore be identified from .
Empirically, we simulate two datasets according to the above two cases, and show that CiVAE can be adapted to handle these interactions by adopting the PC algorithm in the second step of confounder identification. Tables 1 and 2 in the Appendix demonstrate that the adapted CiVAE remains more robust to latent post-treatment bias compared to baselines even when interactions exist among latent variables.
Comment 2-1: Assuming an injective function of latent confounders and post-treatment variables into the observed proxy is strong, and it will be harder to meet the assumption in general when the function is nonlinear.
Response: Thank you for your insightful comments. The proposed noisy-injectivity assumption is weaker than a strict injective assumption, as it allows the map from latent variables (including latent confounders and latent post-treatment variables ) to the observed covariates to be non-injective due to the presence of noise.
In addition, since the dimension of the observed covariates , i.e., , is larger than the dimension of the latent variables , i.e., , it would be likely that is injective or very close to injective in practice due to the low probability that two distant points in the low dimensional latent space are mapped by to the same point in the high-dimensional space. We will clarify the above points in the revised paper.
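The dimensionality argument can be illustrated in the linear special case, where injectivity is equivalent to the weight matrix having full column rank. The sketch below is only an illustration under assumed sizes: 52 observed dimensions (matching the 52 job skills used for the Company dataset) and a hypothetical latent dimension of 8. It does not establish anything about nonlinear maps or about the noisy-injectivity assumption itself.

```python
import numpy as np

OBSERVED_DIM = 52  # matches the 52 job skills used for the Company dataset
LATENT_DIM = 8     # hypothetical latent dimension, chosen only for illustration


def is_injective_linear(weight):
    """A linear map z -> weight @ z is injective iff weight has full column rank."""
    return bool(np.linalg.matrix_rank(weight) == weight.shape[1])


rng = np.random.default_rng(0)
trials = 100
injective_count = sum(
    is_injective_linear(rng.normal(size=(OBSERVED_DIM, LATENT_DIM)))
    for _ in range(trials)
)
print(f"{injective_count}/{trials} random linear maps were injective")

# A degenerate counterexample: every latent direction maps to the same output direction.
degenerate = np.ones((OBSERVED_DIM, LATENT_DIM))
print("degenerate map injective:", is_injective_linear(degenerate))
```

Randomly drawn maps from a low-dimensional to a higher-dimensional space fail to be injective only in degenerate configurations such as the constant matrix above, which is the intuition behind the argument.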
Comment 2-2: It would be helpful if the authors could provide examples where assumptions hold and demonstrate how they can be verified.
Thank you for your insightful comments. For the remaining two assumptions, we provide further discussion as follows:
Assumption 2, i.e., the conditional prior of latent variables following a general exponential family, would be a reasonable approximation of the true prior. The reason is that, the non-factorized part of the sufficient statistics of general exponential family defined in Eq. (7) is parameterized by a ReLU deep neural network, which allows complex (conditional) dependence among the latent variables.
Assumption 3 ensures that the dataset and model class we choose allow the identification. Specifically, (i) requires that the noise distribution not be degenerate, which depends on the dataset quality. (ii) and (iii) can be trivially satisfied by neural networks. For (iv), Section B.2.3 of [1] shows that if the exponential family parameters of the factorized part are independent (which is a very weak condition), (iv) can be satisfied with arbitrary different (Y, T) points.
[1] Variational autoencoders and nonlinear ICA: A unifying framework.
Comment 3: The experiment lacks sufficient details on setup and implementation. Could the authors provide more specific information to enhance understanding of the empirical results?
Response: Thank you for your constructive feedback. The detailed setup and implementation are summarized as follows:
For the simulated datasets, we empirically set the dimension of the latent confounders and the latent post-treatment variables to and , which leads to . The dimension of the observed covariates is set to . The dataset generation process for both the mixedLatentMediator and mixedLatentCorrelator cases have been formulated in the paper. For CiVAE, the inference network is implemented as an MLP with one hidden layer with hidden dimension . For the prior network : for the factorized part, we implement and implement as a dense layer of ; for the non-factorized part, we implement as a ReLU neural network with hidden dimension of and output dimension of 1, and implement as a dense layer of . We train the model according to Eq. (10) for ten epochs, conduct ten random runs of the experiment, and report the average and standard deviation.
For the real-world dataset, we select 52 most common job skills as (which leads to ). We set the dimension of the latent space, i.e., , and vary the ratio from to and plot the results in Fig. 3. The implementation and training of CiVAE follow the same setting as the simulated datasets. We will include the above details in the revised paper.
I thank the authors for providing detailed responses, and these partially address the concerns. I will keep my rating.
Dear Reviewer kHjK,
Thank you for the acknowledgment. We are glad that our responses partially address your concerns. We will take your remaining comments seriously and further polish the paper, and we are committed to integrating your valuable comments into the paper. Thank you again for your time and efforts.
Best,
Authors
This paper addresses causal effect estimation with unobserved confounders, focusing on recovering confounding information from auxiliary proxy variables. Specifically, it tackles the challenge of proxies that capture information about both unobserved confounders and latent post-treatment variables, which can introduce post-treatment bias. The paper proposes a VAE approach to individually recover latent confounders and post-treatment variables up to bijective transformations. It then aims to disentangle these components and adjust for latent confounders in causal effect estimation.
While the reviews are overall positive, some concerns were raised, including the plausibility of stringent assumptions, handling of general confounders and post-treatment variables, and the presentation of experimental results. The rebuttals partially addressed these concerns, but fully resolving the issues, especially regarding general confounders and post-treatment variables, requires significant work. The paper emphasizes the VAE component and briefly mentions using pairwise independence tests to disentangle confounders and post-treatment variables under a restrictive independence assumption. The appendix offers brief discussions on more general cases without independence restrictions. However, this is a very cursory treatment of the disentanglement procedure, one key component that differentiates the work from existing literature. A deeper study of the disentanglement process is needed, including clear procedures, validity explanations, and validation through well-designed ablation studies.
During the AC-reviewer discussion, another crucial question arose: Is the problem studied practically relevant? It is unclear when the problem with post-treatment proxies would be encountered in real scenarios. Since confounders are pre-treatment, it's generally more reasonable to use pre-treatment information. The paper lacks a discussion on the practical relevance of the problem, providing only a simplified and somewhat unrealistic example in the introduction that may represent only a narrow use case. Additionally, the example involving job mode switch and applicants' age is confusing, as age is an exogenous attribute likely influencing decisions rather than the other way around, and variables like seniority and skills are also ambiguous.
Given these issues, I recommend rejection at this stage.