PUATE: Efficient ATE Estimation from Treated (Positive) and Unlabeled Units
Abstract
Reviews and Discussion
This paper proposes a novel framework for estimating average treatment effects (ATE) using only treated and unlabeled data. It derives semiparametric efficiency bounds for two common missing data settings (censoring and case-control) and proposes estimators that are provably efficient. Theoretical results are supported by simulations showing improved performance over standard methods like inverse probability weighting.
Strengths and Weaknesses
Strengths:
- The claims made in the paper are well supported by both theoretical derivations and simulation studies
- The writing is very clear and the contributions are well connected to the broader literature on treatment effect estimation (I appreciated the extended discussion in Appendix A)
Weaknesses:
- The introduction does not clearly justify when and why the treated-plus-unlabeled scenario arises naturally in causal inference, beyond a brief and somewhat unclear recommendation-system example. It is not obvious to me how common this setting is in real-world causal inference
- The censoring setting rests on the “selected completely at random” assumption that observation probability is constant within the treated class, an analogue of MCAR that is rarely satisfied
Questions
- Asymptotic normality in this paper requires the true propensity score (or an auxiliary dataset), whereas standard doubly-robust AIPW only needs an estimated propensity score converging at a fast enough rate. What structural aspect of the treated-plus-unlabeled setup makes propensity-score error first-order, and do you see other ways to recover asymptotic normality using only the original data (i.e., without bringing in a separate dataset)?
- Could you provide a concrete, real-world example where only treated and unlabeled units are observed, and where ATE is the relevant causal estimand?
Limitations
yes
Justification for Final Rating
I am maintaining my original score. The authors provided a clear explanation for the lack of double robustness in their setting and justified the need for stronger assumptions to achieve asymptotic normality. They also addressed my concern about real-world relevance with several concrete examples.
Formatting Issues
none
We would like to thank the reviewer for the detailed feedback. Below, we address each point raised by the reviewer.
Q1. Asymptotic normality in this paper requires the true propensity score (or an auxiliary dataset), whereas standard doubly robust AIPW only needs an estimated propensity score converging at a fast enough rate. What structural aspect of the treated‑plus‑unlabeled setup makes propensity‑score error first‑order, and do you see other ways to recover asymptotic normality using only the original data (i.e., without bringing in a separate dataset)?
A1. We believe that in our case, when the propensity score is unknown, guaranteeing asymptotic normality requires additional assumptions, such as access to an auxiliary dataset for estimating the score. The mechanism is as follows.
Double robustness is a key technical tool for establishing asymptotic normality, not only for consistency. However, we note that in our setting, such a doubly robust structure does not hold between the propensity score and the other nuisance parameters (the conditional expected outcome and the observation probability). In our framework, double robustness holds between the observation probability and the conditional expected outcome: given a consistent estimator of the propensity score, if either the observation probability or the conditional expected outcome is consistently estimated, we obtain a consistent ATE estimator.
Note that under the standard setup of ATE estimation, double robustness instead holds between the propensity score and the conditional expected outcome: if either of those two components is consistently estimated, the ATE is consistently estimated. This difference between the PUATE and the standard ATE frameworks stems from the identification of the expected outcomes of the control group. In PUATE, estimating the expected outcome of the control group requires the propensity score itself, whereas in the standard case the two nuisance parameters can be estimated independently; concretely, the control-group mean can only be recovered through an identity that involves the propensity score.
In standard ATE estimation, the $\sqrt{n}$-consistency of the doubly robust estimator comes from the product of the estimation errors of the nuisance parameters. However, because of the above dependence between the propensity score and the expected outcome, a bias-reducing product term involving the propensity-score error does not appear. We can confirm this property from Lemma 4.1: as shown there, the efficient influence function lacks such a doubly robust structure.
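To make the missing product term concrete, recall the standard ATE setting in generic notation (not necessarily the notation of our paper): writing $\pi(X)$ for the propensity score and $\mu_1(X) = \mathbb{E}[Y \mid D = 1, X]$, and plugging working models $\tilde{\mu}_1$ and $\tilde{\pi}$ into the AIPW moment for the treated mean, a standard calculation under unconfoundedness and overlap gives

$$\mathbb{E}\left[\tilde{\mu}_1(X) + \frac{D\{Y - \tilde{\mu}_1(X)\}}{\tilde{\pi}(X)}\right] - \mathbb{E}[Y(1)] = \mathbb{E}\left[\frac{\{\pi(X) - \tilde{\pi}(X)\}\{\mu_1(X) - \tilde{\mu}_1(X)\}}{\tilde{\pi}(X)}\right].$$

The bias is a product of the two nuisance errors, so it vanishes if either model is correct and is second order when both are estimated. It is exactly this product form that does not arise for the propensity score in PUATE.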
Thus, we lack the doubly robust structure that is usually required for $\sqrt{n}$-consistency. Consequently, the estimation error of the propensity score may persist. Specifically,
$$\sqrt{n}(\hat{\tau} - \tau) = G + \mathrm{Bias}_1 + \mathrm{Bias}_2,$$
where $G$ is asymptotically normal, $\mathrm{Bias}_1$ denotes the product of the estimation errors for the observation probability and the conditional expected outcome, and $\mathrm{Bias}_2$ denotes the estimation error of the propensity score. $\mathrm{Bias}_1$ can be $o_p(1)$, but $\mathrm{Bias}_2$ is not unless the propensity score is estimated at a rate faster than the standard parametric rate $n^{-1/2}$. This is why our analysis assumes that the propensity score is known, or is estimated more rapidly using an independent dataset.
Therefore, without stronger conditions, we cannot establish asymptotic normality in our setting. For example, an auxiliary dataset that enables a sufficiently fast first‑stage estimate of the propensity score is a practical solution.
Note that double robustness is an analytical device derived from the efficiency bound and is often formalized via the canonical gradient or Neyman orthogonality (see Schuler and van der Laan, 2021). We also explain this problem in the reply A5 to Reviewer D1gL. We will explain this problem more clearly in the next revision.
- Alejandro Schuler and Mark van der Laan. Introduction to modern causal inference, 2021.
Q2. Could you provide a concrete, real‑world example where only treated and unlabeled units are observed, and where ATE is the relevant causal estimand?
A2. Thank you for the suggestion. In the next revision, as we also mentioned in our reply to Reviewer D1gL, we will incorporate the following application scenarios and concrete examples that match the setting of our study.
- We are interested in how opening an online store affects product sales. Within a recommendation system, a customer who purchases a product through the website is known to have visited the site, whereas a customer who buys the product in a physical store may or may not have visited the website. Treating the launch of the online store as the treatment, customers who purchase in-store constitute the unknown group.
- In a similar vein, consider the effect of distributing coupons on product sales. Because the logging system is imperfect, we observe two data sets: purchase records for customers who used coupons and purchase records that mix customers who did and did not use coupons. Customers who appear only in the mixed data set form the unknown group.
- A company unintentionally sells defective products, for example vehicles with faults or food containing impurities. Consumers who notice the defect report both the defect and any resulting damage, such as health issues or accidents, to the company. Consumers who do not notice the defect file no report. In this setting, consumers who report the defect form the treatment group, whereas those who do not report it belong to the unknown group.
Our method is also useful when the treatment can be defined only by contrasting it with a non-treatment condition. For example, consider examining how excessive work hours affect workers' mental health. Courts and labor inspectors can identify illegal overtime, but it is difficult to determine whether other firms impose similar excessive overtime on their employees. Firms found to have engaged in illegal overtime can therefore serve as the treatment group, whereas all other firms constitute the unknown group; our method then identifies the treated units within the unknown group. Similarly, by taking units with easily observable outcomes as the treatment group and all others as the unknown group, one can define the comparison group simply as control. This idea is widely adopted in anomaly-detection approaches that rely on density-ratio or PU learning.
- Hido, Tsuboi, Kashima, Sugiyama, and Kanamori (2008). "Inlier-Based Outlier Detection via Direct Density Ratio Estimation."
Thank you for your detailed response. These clarifications address my concerns, and I will maintain my original score.
This paper investigates treatment effect estimation in settings where treatment assignments are missing for some observations, including both censoring and case-control scenarios. The authors establish identification and semiparametric efficiency theory, and demonstrate the performance of the proposed doubly robust estimators through theoretical analysis and empirical results.
Strengths and Weaknesses
Overall, this paper addresses an important and interesting problem in causal inference. The proposed methods are supported by solid theoretical guarantees and empirical evidence, and the presentation is clear and well-organized. The manuscript would benefit from a deeper discussion of the practical applications and real-world relevance of the proposed methods. More detailed comments on specific aspects are provided below.
- Quality: The theoretical foundations and proposed methods are solid. I have listed a few questions regarding them in the 'Questions' section below.
- Clarity: The paper is clearly written and well organized. To maintain clarity and avoid excessive length, the details for the case-control setting are appropriately moved to the appendix. I have also included some comments on improving the notation in the 'Questions' section below.
- Significance: While the theoretical development is solid, the authors should place greater emphasis on the practical applicability of their setting and methods.
- Originality: The results in this paper are novel to the best of my knowledge.
Questions
- The significance of the work could be enhanced by incorporating additional application scenarios and concrete examples that align with the settings considered in the paper. This would better demonstrate the practical utility of the proposed methods.
- In Assumption 3.1, the authors should consistently use “o” as the realization of the variable "O” rather than switching to “d,” which is used as the realization of “D.” A similar inconsistency appears in Assumption 3.3. It is recommended that the authors carefully review the manuscript and ensure consistent use of notation throughout to improve clarity.
- Assumption 3.1 posits that the missingness indicator O is independent of X for treated units, which corresponds to a Missing Completely At Random (MCAR) assumption. This is a strong assumption in many practical scenarios. It would be helpful for the authors to discuss why this assumption is necessary in the current framework and to explore whether a relaxation to a Missing At Random (MAR) assumption could be considered.
- Section 3 presents an IPW-style identification formula for the average treatment effect (ATE). It would be valuable to also include an identification formula based on outcome regression, such as the one commonly used in standard causal inference settings. Providing a derivation of the identification results—perhaps in an appendix—would also help readers better understand how the assumptions in Section 3 enable identification.
- The main result (Theorem 4.7) is currently stated under the assumption that the propensity score is known. Section 4.5 shows that the same result holds when the propensity score is estimated from an independent sample. It may improve the flow and clarity to directly present the main results under the sample-splitting framework (e.g., divide the sample into L+1 folds with one fold training the propensity score), which is widely adopted in practice. As long as the number of folds is fixed, the final convergence rates remain valid.
Limitations
Yes.
Justification for Final Rating
This paper presents an interesting and well-developed framework, and I believe it can serve as a valuable contribution to the literature.
Formatting Issues
None.
We appreciate the reviewer's detailed comments. Below, we address each of the points raised in the review.
Q1. The significance of the work could be enhanced by incorporating additional application scenarios and concrete examples that align with the settings considered in the paper. This would better demonstrate the practical utility of the proposed methods.
A1. Thank you for the suggestion. In the next revision, we will incorporate the following additional application scenarios and concrete examples that match the settings of our study.
- We are interested in how building an online store affects product sales. In a recommendation system, a customer who purchases a product through the website is known to have visited the site, whereas a customer who buys the product in a physical store may or may not have visited the website. When the construction of the online store is regarded as the treatment, customers who purchase in‑store belong to the unknown group.
- In a similar vein, consider the effect of distributing coupons on product sales. Because the logging system is imperfect, we observe two datasets: purchase records for customers who used coupons, and purchase records that mix customers who did and did not use coupons. Here, customers who appear only in the mixed dataset constitute the unknown group.
- A company unintentionally sold defective products, for example, vehicles with faults or food containing impurities. Consumers who notice the defect report both the defect and any resulting damage, such as health issues or accidents, to the company. Consumers who do not notice the defect file no report. In this setting, consumers who report the defect form the treatment group, while those who do not notice the defect belong to the unknown group.
Our method is also useful when the treatment can be defined only by contrasting it with a non‑treatment condition. For example, consider examining how excessive work hours affect workers' mental health. Courts and labor inspectors can identify illegal overtime, but it is difficult to decide whether other working hours are excessive. Firms found to have engaged in illegal overtime can therefore serve as the treatment group, whereas all other firms constitute the unknown group. Similarly, by taking units with easily observable outcomes as the treatment group and all others as the unknown group, one can define the comparison group simply as not treated. This idea is widely adopted in anomaly‑detection approaches that rely on density‑ratio or PU learning.
Q2. In Assumption 3.1, the authors should consistently use “o” as the realization of the variable “O” rather than switching to “d,” which is used as the realization of “D.” A similar inconsistency appears in Assumption 3.3. It is recommended that the authors carefully review the manuscript and ensure consistent use of notation throughout to improve clarity.
A2. We appreciate the reviewer’s comment. This is a typographical error, and we will correct it in the next revision.
Q3. Assumption 3.1 posits that the missingness indicator O is independent of X for treated units, which corresponds to a Missing Completely At Random (MCAR) assumption. This is a strong assumption in many practical scenarios. It would be helpful for the authors to discuss why this assumption is necessary in the current framework and to explore whether a relaxation to a Missing At Random (MAR) assumption could be considered.
A3. This assumption is introduced by Elkan and Noto (2008) as an analogue of the MCAR assumption and is widely accepted as a baseline in the PU learning literature. As explained in our draft (line 150), several methods have already been proposed for PU learning without the SCAR assumption. Because our paper does not focus on propensity score estimation, we present the analysis under the standard setting.
In the censoring PU learning setting, Bekker and Davis (2018) demonstrate possible relaxations, while in the case-control setting Kato, Teshima, and Honda (2019) offer related results. We will add a fuller discussion and more references in the next revision. Another key issue is the need to estimate the class prior, which is an additional unknown parameter relative to standard logistic regression. Identifying this parameter requires assumptions.
- Bekker and Davis (2018). "Learning from positive and unlabeled data under the selected at random assumption."
- Kato, Teshima, and Honda (2019). "Learning from positive and unlabeled data with a selection bias."
Q4. Section 3 presents an IPW-style identification formula for the average treatment effect (ATE). It would be valuable to also include an identification formula based on outcome regression, such as the one commonly used in standard causal inference settings. Providing a derivation of the identification results—perhaps in an appendix—would also help readers better understand how the assumptions in Section 3 enable identification.
A4. In the censoring setting, an outcome regression-based identification formula is indeed feasible. In the case-control setting, however, treatment is deterministic for the treated group, so conditioning on the treatment indicator is not directly appropriate. We can still define a conditional expected outcome given the treatment, but its interpretation differs. We will clarify these points and, where possible, supply derivations in an appendix.
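For reference, the outcome-regression identification in the standard setting is, in generic notation (under unconfoundedness and overlap; not necessarily the notation of our paper),

$$\tau = \mathbb{E}\bigl[\mu_1(X) - \mu_0(X)\bigr], \qquad \mu_d(x) = \mathbb{E}[Y \mid D = d, X = x].$$

In the censoring setting, an analogous formula is available once the control-group regression function $\mu_0$ is recovered, which in our setting involves the propensity score.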
Q5. The main result (Theorem 4.7) is currently stated under the assumption that the propensity score is known. Section 4.5 shows that the same result holds when the propensity score is estimated from an independent sample. It may improve the flow and clarity to directly present the main results under the sample-splitting framework (e.g., divide the sample into L+1 folds with one fold training the propensity score), which is widely adopted in practice. As long as the number of folds is fixed, the final convergence rates remain valid.
A5. Thank you for the suggestion. We consider that sample splitting alone does not resolve the issue, because the bottleneck lies in the convergence rate of the propensity-score estimator with respect to the sample size $n$. Even with sample splitting, this rate cannot be improved.
When constructing semiparametric efficient estimators, two ingredients are usually relied upon, namely a doubly robust structure and control of function complexity. In our setting, the doubly robust structure involves the observation probability and the conditional expected outcome, but not the propensity score. Specifically, conditional on a consistent estimator of the propensity score, if either the observation probability or the conditional expected outcome is consistently estimated, we can estimate the ATE consistently.
The doubly robust structure contributes not only to consistency but also to rapid bias reduction, a property often referred to as Neyman orthogonality. To illustrate this, consider a toy example in which our goal is to estimate an unknown parameter $\theta$ under a data-generating process that depends on two unknown functions $f$ and $g$ (these symbols are not those used in our paper). For instance, $\theta$ may represent the ATE, and $f$ and $g$ correspond to the conditional expected outcome and the propensity score, respectively. If we obtain a doubly robust estimator $\hat{\theta}$ for $\theta$, we can bound the bias of $\hat{\theta}$ by the product $\|\hat{f} - f\| \cdot \|\hat{g} - g\|$, where $\hat{f}$ and $\hat{g}$ are estimators of $f$ and $g$, respectively. This product structure enables $\sqrt{n}$-consistency. While $\hat{f}$ and $\hat{g}$ individually may not reach the parametric rate, for example, nonparametric estimators often converge at slower rates such as $n^{-1/4}$ or $n^{-1/3}$ depending on smoothness, their product can converge faster than $n^{-1/2}$. This is the key advantage of the doubly robust structure for bias reduction.
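In this toy notation, the rate argument can be written compactly (purely illustrative):

$$\sqrt{n}\,\bigl|\mathrm{Bias}(\hat{\theta})\bigr| \lesssim \sqrt{n}\,\|\hat{f} - f\|\,\|\hat{g} - g\| = o_p(1) \quad \text{whenever } \|\hat{f} - f\| = o_p(n^{-1/4}) \text{ and } \|\hat{g} - g\| = o_p(n^{-1/4}),$$

whereas a remainder that is linear in a single nuisance error, such as $\sqrt{n}\,\|\hat{g} - g\|$, vanishes only if that nuisance is estimated faster than the $n^{-1/2}$ rate.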
In PUATE, however, although a doubly robust structure exists between the observation probability and the conditional expected outcome, it does not arise between the propensity score and the remaining nuisance parameters. Consequently, the estimation error of the propensity score may persist. Specifically,
$$\sqrt{n}(\hat{\tau} - \tau) = G + \mathrm{Bias}_1 + \mathrm{Bias}_2,$$
where $G$ is asymptotically normal, $\mathrm{Bias}_1$ denotes the product of the estimation errors for the observation probability and the conditional expected outcome, and $\mathrm{Bias}_2$ denotes the estimation error of the propensity score. $\mathrm{Bias}_1$ can be $o_p(1)$, but $\mathrm{Bias}_2$ is not unless the propensity score is estimated at a rate faster than the standard parametric rate $n^{-1/2}$. This is why our analysis assumes that the propensity score is known, or is estimated more rapidly using an independent dataset. When using such an independent dataset, our proof takes the limit with the auxiliary sample size tending to infinity first, followed by $n$. Alternatively, one could consider an asymptotic regime where both sample sizes go to infinity simultaneously; in that case, the auxiliary sample size must diverge at a faster rate than $n$.
I appreciate the authors’ detailed responses to my questions. Most of them have been addressed, and I just have a follow-up point:
Regarding the estimation of the ATE in this PU-setting, it appears that double robustness can only be achieved when the propensity score is correctly specified—unlike in standard ATE estimation, where consistency typically holds if either the propensity score or outcome model is correctly specified. Should this be viewed as a fundamental feature of the PU-setting? In related problems, such as generalizing causal effects (e.g., Dahabreh et al., 2019), the structure of double robustness often involves terms like the product of propensity score error and outcome model error, plus the product of observation probability error and outcome model error. These formulations differ from the structure presented in this work. I would be very interested in better understanding the nature of this difference.
Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J., Stuart, E. A., & Hernán, M. A. (2019). Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75(2), 685-694.
Thank you very much for your thoughtful follow-up. We fully agree with the reviewer's comment.
Indeed, a key structural feature specific to the PUATE setting is that double robustness holds only when the propensity score is correctly specified. That is, in the PUATE framework, double robustness can be stated as follows: given a consistent estimator of the propensity score, if either the observation probability or the conditional expected outcome is consistently estimated, then the ATE estimator is consistent.
This arises because estimating the conditional expected outcome for the control group, $\mu_0(X) = \mathbb{E}[Y \mid D = 0, X]$, requires the propensity score $e(X) = P(D = 1 \mid X)$ (we use this generic notation throughout this reply). In the standard ATE setting, $\mu_0$ can be estimated directly via regression of outcomes on $X$ among units with $D = 0$. In the PUATE setting, however, control units cannot be singled out, so $\mu_0$ must be recovered through an identity that involves $e$; for example, by the law of total expectation, $\mathbb{E}[Y \mid X] = e(X)\,\mu_1(X) + \{1 - e(X)\}\,\mu_0(X)$, and hence $\mu_0(X) = \bigl(\mathbb{E}[Y \mid X] - e(X)\,\mu_1(X)\bigr) / \bigl(1 - e(X)\bigr)$.
In the standard doubly robust estimator, $\mathbb{E}[Y(0)]$ can be estimated by taking the sample mean of
$$\hat{\mu}_0(X_i) + \frac{(1 - D_i)\{Y_i - \hat{\mu}_0(X_i)\}}{1 - \hat{e}(X_i)}.$$
In contrast, for PUATE under the censoring setting, the corresponding term must be built from the identity above and therefore already involves the estimated propensity score $\hat{e}$.
Put differently, in the standard ATE setting, $\mathbb{E}[Y(0)]$ can be estimated either by the sample mean of the inverse probability weighting term
$$\frac{(1 - D_i)\,Y_i}{1 - \hat{e}(X_i)}$$
or by the sample mean of the regression prediction $\hat{\mu}_0(X_i)$. One relies solely on the propensity score $e$, and the other solely on the outcome model $\mu_0$. However, in the PUATE setting, both of the analogous estimators of $\mathbb{E}[Y(0)]$ are obtained from the identity above, and both depend on the estimated propensity score $\hat{e}$.
This highlights the intuition behind the lack of double robustness in the PUATE setting: to estimate (or identify) $\mathbb{E}[Y(0)]$, one must necessarily rely on an estimator of the propensity score $e$. This structural dependency underlies why double robustness, in the usual sense, does not hold in PUATE. Even in cases where the treatment indicator contains missing values, if $\mu_0$ can be estimated without relying on $e$, it is still possible to construct a doubly robust estimator involving the product of the propensity score error and the outcome model error. However, in the PUATE setting, we observe only treated units and a mixture of treated and control units. This makes the estimation of $\mu_0$ particularly challenging. We will clarify this point more explicitly in the revised draft, with reference to Dahabreh et al. (2019).
Thank you for your detailed response! I have increased my score and wish the authors all the best.
This work studies the problem of estimating the average treatment effect in a setting analogous to the positive and unlabeled setting that has been previously studied in the supervised learning literature. Specifically, there is a set of known treated units and a second set of units for which the treatment status is unknown. The authors consider two variants of this problem: in the case-control setting the two groups are independent groups, while in the censored setting there is a random process that dictates membership in the positive and unlabeled groups. The authors develop a semi-parametric estimator and derive the influence functions and asymptotic normality for both of these settings. In each, the authors leverage PU estimators from the literature. The propensity score amongst the unlabeled units is assumed known in the main text for the proof of asymptotic normality. This is relaxed later in the paper by assuming the availability of an auxiliary dataset.
Strengths and Weaknesses
Strengths:
- The problem studied in this paper is interesting, and novel to the causal setting. The authors do a nice job of making the connection between PU learning and semi-parametric estimation.
- The derivation of consistency and asymptotic normality is nice to see.
- The authors do a nice job of centering the proposed approach with the derivation of the IF. The assumptions are clearly laid out, and well stated throughout.
- The proposed approach shows good empirical performance against sensible alternatives.
Weaknesses:
- The largest weakness with this paper is the assumption that the g function is known throughout. The authors do present a solution to this by having an auxiliary data set. However, it's not entirely clear to me how this will affect the results. My intuition is that the convergence rate is now going to be dominated by the smaller of the two data sets, either the one used to estimate the g function or the one used to estimate the rest of the estimator.
- It would be really useful if the authors could contextualize the results and presented methodology within the larger framework of missing data. They do a nice job of setting up the missingness mechanism, SCAR. However, I think the paper would have greater impact if there was some discussion of this method and its relation to other approaches for missing data, such as the Heckman model or, e.g., Robins, Rotnitzky, and Zhao (1994). I do recognize that the proposed setting is unique, however I just think it would help gain intuition for the reader.
Questions
- There is a fairly large set of methods that have been proposed for PU learning. It would be helpful if the authors could provide some intuition here for how the choice of PU learning method affects the proposed method, e.g., employing Elkan and Noto vs. du Plessis et al.
- Can the authors provide some intuition on the effect of having to estimate the g function on the overall performance of the estimator? As I mentioned above, it seems like there would be an interaction between that and the presented results, both simulation and theoretical, but it's not entirely clear to me from reading.
Limitations
Yes.
Justification for Final Rating
I would like to thank the authors for their clarification. In my view, my concerns have been well addressed and I have updated my score to reflect this. This paper provides a nice addition to the literature, addressing a problem of practical importance that, at least to my knowledge, has not been studied in the literature to this point.
Formatting Issues
N/A
We appreciate the reviewer’s insightful comments. Below, we respond to each of the points raised.
Q1. There is a fairly large set of methods that have been proposed for PU learning. It would be helpful if the authors could provide some intuition here for how the choice of PU learning method affects the proposed method, e.g., employing Elkan and Noto vs. du Plessis et al.
A1. We appreciate your thoughtful suggestion. The choice of PU learning method should reflect the underlying data‑generating process. For instance, Elkan and Noto (2008) and du Plessis et al. (2015) tackle essentially different problems: the former studies the censoring setting, whereas the latter considers the case‑control setting. Under the censoring setting we can estimate the conditional class probability, that is, the propensity score, without knowing the class prior and under weaker assumptions than in the case‑control setting.
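To illustrate the difference in generic notation: under the censoring setting with the SCAR assumption, Elkan and Noto (2008) show that the label probability and the propensity score differ only by a multiplicative constant,

$$P(O = 1 \mid X = x) = c \cdot P(D = 1 \mid X = x), \qquad c = P(O = 1 \mid D = 1),$$

so a classifier trained to separate labeled from unlabeled units recovers the propensity score up to the constant $c$, without prior knowledge of the class prior. In the case-control setting, no such direct relation is available, and the class prior enters as an additional unknown.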
Other methods, such as Bekker and Davis (2018) and Kato, Teshima, and Honda (2019), address selection‑bias issues in PU learning. When using neural networks, Kiryo, Niu, du Plessis, and Sugiyama (2017) propose a non‑negative correction that mitigates overfitting bias. Many approaches, especially for censoring PU learning, are summarized in the survey by Bekker and Davis (2020).
For the class prior, for example, Lancaster and Imbens (1996) discuss that it can be estimated simultaneously with the conditional class probability via the generalized method of moments, provided parametric models are used for the propensity score. Ramaswamy, Scott, and Tewari (2016), as well as du Plessis and Sugiyama (2016), study class prior estimation as an independent problem of interest.
We currently explain these methods in the main body and Appendices A and B. In the next revision, we will explain them in more detail.
- Elkan and Noto (2008). "Learning classifiers from only positive and unlabeled data."
- du Plessis, Niu, and Sugiyama (2015). "Convex formulation for learning from positive and unlabeled data."
- du Plessis and Sugiyama (2014). "Class prior estimation from positive and unlabeled data."
- Lancaster and Imbens (1996) "Case-control studies with contaminated controls." Journal of Econometrics.
- Bekker and Davis (2018). "Learning from positive and unlabeled data under the selected at random assumption."
- Bekker and Davis (2020). "Learning from positive and unlabeled data: a survey."
Q2a. The largest weakness with this paper is the assumption that the g function is known throughout. The authors do present a solution to this by having an auxiliary data set. However, it's not entirely clear to me how this will affect the results. My intuition is that the convergence rate is now going to be dominated by the smaller of the two data sets, either the one used to estimate the g function or the one used to estimate the rest of the estimator.
Q2b. Can the authors provide some intuition on the effect of having to estimate the g function on the overall performance of the estimator? As I mentioned above, it seems like there would be an interaction between that and the results both simulation and theoretical presented, but it's not entirely clear to me from reading.
A2. Thank you for your comments. First, we clarify that the convergence rate is not dominated by the smaller of the two datasets. In estimating the $g$ function, we can use the entire set of data, including both the additional dataset and the original dataset used for estimating the other parameters; as we note in our draft, the original data can also be used together with the auxiliary dataset to estimate $g$, and their inclusion can improve empirical performance. The key point is that the $g$ function must be estimated at a faster rate than the other parameters; therefore, we use both the original and the additional datasets. This also reflects a difference in asymptotic frameworks. In our proof, we take the limit with the auxiliary sample size tending to infinity first, followed by $n$. Alternatively, one could consider an asymptotic regime where both sample sizes go to infinity simultaneously; in that case, the auxiliary sample size must diverge at a faster rate than $n$.
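As a rough illustration of the regime we have in mind (here $m$ denotes the size of the auxiliary dataset; this notation is introduced only for this sketch): asymptotic normality requires the first-stage error in $g$ to be negligible after scaling by $\sqrt{n}$,

$$\sqrt{n}\,\|\hat{g} - g\| = o_p(1).$$

If, for example, $\hat{g}$ attains the parametric rate from the pooled sample, $\|\hat{g} - g\| = O_p\bigl((n + m)^{-1/2}\bigr)$, then $\sqrt{n}\,\|\hat{g} - g\| = O_p\bigl(\sqrt{n/(n + m)}\bigr)$, which vanishes precisely when $m$ grows faster than $n$.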
Additionally, we consider the following points regarding the $g$ function:
- We need to impose additional assumptions or use additional datasets to theoretically obtain asymptotic normality.
- We can prove consistency without using additional datasets, but there is no double robustness between the propensity score and the observation probability.
- The empirical performance depends on the correctness of the model specification or the datasets used.

In the next revision, we will explain more clearly why the $g$ function must be known in order to establish asymptotic normality.
To support the above points, we outline below the mechanism for why we need the true $g$, and also refer to our reply A5 to Reviewer D1gL.
We begin by noting the absence of a doubly robust structure linking the propensity score with the other nuisance parameters (the conditional expected outcome and the observation probability) in our setting. It is important to emphasize that double robustness is a key technical tool for establishing asymptotic normality, not only consistency. Double robustness is used as an analytical device derived from the efficiency bound and is often justified via the canonical gradient or Neyman orthogonality (see Schuler and van der Laan, 2024). This structure makes the ATE estimator insensitive to the estimation of the nuisance parameters in the standard ATE estimation setting. Essentially, it forms a product of the estimation errors of the nuisance parameter estimators, which reduces the overall estimation error more quickly.
In our framework, double robustness holds between the observation probability and the conditional expected outcome: given a consistent estimator of the propensity score, if either the observation probability or the conditional expected outcome is consistently estimated, we obtain a consistent ATE estimator. In contrast, in standard ATE estimation, double robustness holds between the propensity score and the conditional expected outcome: if either of those two components is consistently estimated, the ATE is also consistently estimated.
The lack of double robustness in PUATE can be explained as follows. Estimating the expected outcome of the control group requires the propensity score itself, whereas in the standard case the two nuisance parameters can be estimated independently; concretely, the control-group mean is recovered through an identity that involves the propensity score. Because of this dependence, we cannot obtain the doubly robust structure, and a bias-reducing product term involving the propensity-score error does not appear. As shown in Lemma 4.1, the efficient influence function therefore lacks a doubly robust structure, and standard proofs based on Neyman orthogonality cannot deliver asymptotic normality without stronger conditions. An auxiliary dataset that enables a sufficiently fast first-stage estimate of the propensity score is a practical solution.
- Alejandro Schuler and Mark van der Laan. Introduction to Modern Causal Inference, 2024
Q3. It would be really useful if the authors could contextualize the results and presented methodology within the larger framework of missing data. They do a nice job of setting up the missingness mechanism, SCAR. However, I think the paper would have greater impact if there was some discussion of this method and its relation to other approaches for missing data, such as the Heckman model or, e.g. Robins, Rotnitzky, and Zhao (1994). I do recognize that the proposed setting is unique, however I just think it would help gain intuition for the reader.
A3. Thank you for the suggestion. We currently frame our results within causal inference under stratified sampling. Another example of stratified sampling is covariate shift; Uehara, Kato, and Yasui (2021) study causal inference when the training data contain (covariates, treatment, outcomes) but the evaluation data contain only covariates. Concerning identifiability in PU learning, in addition to the studies cited in A1 (Bekker and Davis (2018) and Kato, Teshima, and Honda (2019)), the classical work of Lancaster and Imbens (1996) is also relevant.
The Heckman model and the framework of Robins, Rotnitzky, and Zhao (1994) are important contributions to causal inference with missing data. The Heckman model addresses endogeneity between the outcome and the observation indicator, whereas our analysis does not consider such endogeneity. Thus our setting is simpler in that respect, yet it is more challenging because the relationship between treatment assignment and observation follows the PU learning mechanism. Similarly, Kennedy (2020) allows for endogeneity between treatment observation and outcome, but the observation mechanism itself is simpler than in our work. These approaches are not nested within one another; rather, they are complementary, and combining them may yield more flexible statistical modelling.
We summarize some of these related studies in Appendices A and B and intend to provide additional details in the camera ready version.
- Uehara, Kato, and Yasui (2021). "Off-policy evaluation and learning for external validity under a covariate shift."
- Kennedy (2020). "Efficient nonparametric causal inference with missing exposure information." The International Journal of Biostatistics.
This paper studies a new setup of average treatment effect (ATE) estimation inspired by learning from positive and unlabeled data: we have observations for which we know the treatment was given while we have other observations for which we do not know whether the treatment was given or not (i.e., in the control group). More specifically, the paper studies both censoring and case-control settings.
For each setting, the authors propose estimators based on the Inverse Probability Weighting (IPW), the Direct Method (DM), and the efficient (Eff) influence function. The authors derive an explicit formula for the efficient influence function (Lemma 4.1) and show the efficiency bound (Theorem 4.2). Then, they propose the Eff estimator as the zero of the sample average of the influence function solved for the parameter of ATE (l.203). Under several standard assumptions, they show their simplest form of double robustness result (Theorem 4.4).
Moreover, assuming that the propensity score is known, they propose a cross-fitting-based estimator, for which they prove the asymptotic efficiency (Theorem 4.7) under the assumptions including those on some doubly-robust rates of convergence for the nuisance parameter estimates (Assumptions 4.6). They also illustrate that the IPW and the DM estimators fail to achieve this asymptotic efficiency under those assumptions (Remarks in l.273 and l.282).
The authors then extend the result to the case where we need to estimate the propensity score, for which another dataset is assumed to be available. For this method, they show asymptotic efficiency.
Finally, the authors present simple simulation results confirming their theory.
Strengths and Weaknesses
Strengths
- The setup is interesting and seems useful.
- The theoretical results seem sound. I checked the proofs for the main results. (However, I have a few questions for clarification.)
- The proposed method and the results seem highly non-trivial and relevant.
Weaknesses
- There is no real data experiment.
- Some parts of the proofs are too brief.
- The notation of the math symbols is a bit confusing.
Questions
I will consider raising the score if these questions are addressed.
Major questions:
- l.1300, I could not figure out how to obtain the first equality. Could the author give more details?
- ll.155-156, "the propensity scores can be estimated using PU learning methods": Are they identifiable? I thought we need to know the class prior in PU learning, and it requires some conditions for identifiability.
- l.1250, what is ?
- l.1260, where did come from? Where did and go after the equality?
- l.1261--1262, how do I conclude that?
Minor questions:
- l.1251, Is defined? Is it ?
- l.1265, how did the authors deal with the random quantity in the denominator? A similar question applies to the last equality of l.1023.
- l.1292, do we not need a boundedness of the functional for the Riesz representation theorem? (Do we really need the Riesz representation theorem if you anyway show Eq. (9) concretely?)
- l.1298, did the authors use the fact that the scores have zero mean? Do we not need some regularity condition for this?
Limitations
yes
Justification for Final Rating
During the discussion, we mainly discussed the following points:
- Identification issue
The authors addressed this during the discussion. They provided references and explained when we have the identification result.
- Clarification on the proofs
The authors addressed my concerns during the discussion. Some regularity conditions are missing, but the authors promised that they would add them. I am more convinced that the main theorems of the paper are correct.
- No real-data experiment
This was not addressed in the discussion. However, I believe that the strong methodological and theoretical contributions compensate for this weakness.
My final evaluation: The paper works on an interesting problem, proposes a great method, and provides strong theoretical guarantees. Clarity issues have been addressed in the rebuttal. I would like to update my score from 4 to 5.
Minor feedback: I think the proofs could be made more accessible by adding further details.
Formatting Issues
- l.138, "the density ratio between the covariate densities" --> "the ratio between the covariate densities"
- l.95, --> . There are several similar mistakes like this.
We thank you for your detailed and constructive feedback. We will correct the typos the reviewer points out. Below, we address each of your points.
Q1. l.1300, I could not figure out how to obtain the first equality. Could the author give more details?
A1. We relied on the following from the first to the second equality:
We missed a minus sign in front of , which will be corrected in the next revision. The main ideas are
- using and ,
- recognising that the expectations of the score functions are zero.

These points will be clarified in the next revision.
Q2. ll.155-156, "the propensity scores can be estimated using PU learning methods": Are they identifiable? I thought we need to know the class prior in PU learning, and it requires some conditions for identifiability.
A2. Identifiability differs between the censoring and case-control settings and, in both, further depends on specific assumptions. Elkan and Noto (2008) show that under the censoring setting, the conditional class probability can be learned without the class prior, provided the SCAR assumption holds. Under the case-control setting, some studies demonstrate that both the propensity score and the class prior can be identified simultaneously if parametric models are used for the propensity score (Lancaster and Imbens (1996) and Kato, Xu, Niu, and Sugiyama (2016)). There are also studies that investigate class prior estimation as an independent problem (Ramaswamy, Scott, and Tewari (2016) and du Plessis and Sugiyama (2016)).
We note that while PU learning with an unknown class prior has been rigorously addressed in the censoring setting, unresolved issues may remain in the case‑control setting. Although previous works, such as Lancaster and Imbens (1996) and du Plessis and Sugiyama (2016), have been well received and demonstrate empirical success, the conditions under which the class prior is identifiable have not been fully explored in detail, especially when using nonparametric propensity score models. Compared with standard logistic regression, PU learning introduces the class prior as an additional unknown. Therefore, we believe that the core issue lies in the number of unknown parameters and the information available in the data. From a data‑availability perspective, the censoring setting provides more information than the case‑control setting, which simplifies the problem.
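To spell out the counting argument in generic notation: in the case-control setting, the positive sample informs $p(x \mid D = 1)$ and the unlabeled sample informs the mixture

$$p_{\mathrm{u}}(x) = \pi\, p(x \mid D = 1) + (1 - \pi)\, p(x \mid D = 0),$$

where $\pi$ is the class prior. The pair $\bigl(\pi, p(x \mid D = 0)\bigr)$ is not uniquely determined from $p_{\mathrm{u}}$ and $p(x \mid D = 1)$ without further restrictions (for example, parametric models or irreducibility-type conditions), which is why class-prior estimation requires additional assumptions.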
Our aim here is to present a straightforward solution under the most standard set of assumptions. We will clarify this point and add further references in the next revision.
- Elkan and Noto (2008). "Learning classifiers from only positive and unlabeled data."
- Lancaster and Imbens (1996). "Case-control studies with contaminated controls."
- Kato, Xu, Niu, and Sugiyama (2016). "Alternate estimation of a classifier and the class-prior from positive and unlabeled data."
- du Plessis and Sugiyama (2016). "Class prior estimation from positive and unlabeled data."
- Ramaswamy, Scott, and Tewari (2016). "Mixture proportion estimation via kernel embeddings of distributions"
Q3a. l.1250, what is ?
Q3b. l.1260, where did come from? Where did and $\frac{1}{\overline{n}}\sum^m_{i=1}$ go after the equality?
A3. Thank you for pointing out the typo. Both instances of should be . We will fix this in the next revision.
Q4. l.1261--1262, how do I conclude that?
A4. As the reviewer notes in Q6 and Q7, the result follows from combining the boundedness assumption with the consistency of each nuisance‑parameter estimator. We will clarify this point in the next revision.
Q5. l.1251, Is defined? Is it ?
A5. This is indeed a typo; it should be . We will amend it in the next revision.
Q6. l.1265, how did the authors deal with the random quantity in the denominator? A similar question applies to the last equality of l.1023.
A6. We appreciate the reviewer's comment. In l.1265 and l.1023, boundedness of the estimators is required. We had omitted these boundedness assumptions and will add them in the next revision.
Q7. l.1292, do we not need a boundedness of the functional for the Riesz representation theorem? (Do we really need the Riesz representation theorem if you anyway show Eq. (9) concretely?)
A7. The boundedness of the functional is indeed necessary for the Riesz representation theorem. More concretely, we need to assume that the score function belongs to the $L_2$ space. In the next revision, we will add assumptions about the boundedness of the conditional means and conditional variances of the outcomes.
The Riesz representation theorem itself is also required to guarantee the existence of an efficient influence function. However, as the reviewer commented, if Eq. (9) is shown concretely anyway, we can bypass the use of the Riesz representation theorem, at least in our lemma. Note that we ultimately need the theorem to discuss the lower bound of the asymptotic variance, for which we only cited van der Vaart (1998). We will explicitly assume boundedness of the score functions in l.1285 in the next revision.
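For completeness, the statement we rely on is the Hilbert-space form of the theorem (generic notation): a linear functional $T$ on $L_2(P)$ admits a representer $r \in L_2(P)$ with

$$T(s) = \mathbb{E}\bigl[r(W)\, s(W)\bigr] \quad \text{for all } s \in L_2(P)$$

if and only if $T$ is bounded, i.e., $|T(s)| \le C \|s\|_{L_2(P)}$ for some constant $C$. The boundedness assumptions mentioned above are what guarantee this condition.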
Q8. l.1298, did the authors use the fact that the scores have zero mean? Do we not need some regularity condition for this?
A8. We used the property explained in A1 of this reply. Regularity conditions are not required once we can exhibit a specific score function and verify that it satisfies the definition for this problem.
Thank you for your answers. My concerns have been mostly addressed.
Below are rather minor questions.
the expectations of the score functions are zero
I thought some regularity conditions were necessary for this. Where does the paper show this?
l.1261--1262
I thought we use a type of continuous mapping theorem there. It would be great if the proof explained what result was used with what conditions.
Then, we show that the remaining parts converge to zero in probability. Let us denote the parts as Next, we have
Then, from the parallelogram law, we have
Here, for example, we can bound it by
where the constant is independent of the sample size, and we used the boundedness of the estimators.
Similarly, we can bound each of the remaining terms. Thus, we complete the proof. [Proof ends]
As the reviewer notes, the theorem can also be established via the continuous mapping theorem, while the above approach is more direct.
The proof above remains somewhat simplified due to constraints in the rebuttal and OpenReview environment. In the next revision, we will provide a complete proof that presents these arguments in full detail.
Once again, we thank the reviewer for these thorough comments. They will greatly improve the clarity and rigor of our manuscript, and we will incorporate all suggested changes in the next revision.
Thank you for your answers to my questions.
I suppose this is my last question: where is the proof of Theorem 4.4?
My concern about this theorem is that may behave badly in the denominator of one of the terms in , because Assumption 4.3 covers the case where is not consistent. Perhaps, we need to explicitly assume that is bounded away from zero.
Very minor comments on the proof in the authors' last comment (so it is not necessary to reply to them):
- It might be less confusing if you avoid using in the proof, which is anyway assumed to be the same as (Assumption 4.5).
- I think the boundedness of is an asymptotic one, which follows from Assumption 3.3 and the convergence of to (Assumption 4.6).
In general, I would appreciate it if the proof mentioned the assumptions when they are used.
Thank you for your detailed comment.
Q1. I suppose this is my last question: where is the proof of Theorem 4.4?
My concern about this theorem is that may behave badly in the denominator of one of the terms in , because Assumption 4.3 covers the case where is not consistent. Perhaps, we need to explicitly assume that is bounded away from zero.
A1. Thank you for your comment. As the reviewer correctly points out, we do need the boundedness assumption. As we mentioned in our first reply in this thread, we overlooked this assumption throughout the paper and became aware of it after receiving the first review comments. We will impose the boundedness assumption in the next revision in order to complete the proof.
We omitted the proof of Theorem 4.4 because we considered it to be straightforward. However, as the reviewer notes, it should have been included. In the next revision, we will also add this proof.
We also appreciate the other comments. We agree with the suggestion regarding the notation and will revise the draft accordingly. We also agree with the comment concerning the boundedness assumption. In addition, as we noted in both the above reply and our first reply in this thread, we should have added a boundedness assumption that does not depend on the sample size. (Asymptotic boundedness could still work, but it would require more careful arguments in the proof.)
Thank you for the great discussions. I understand the work better now. I will update my review later.
We sincerely appreciate your detailed feedback.
Q1. "the expectations of the score functions are zero"
I thought some regularity conditions were necessary for this. Where does the paper show this?
A1. Thank you for your comment. As the reviewer correctly points out, some regularity conditions are indeed necessary to guarantee that the expectations of the score functions are zero. In particular, exchangeability between integration and differentiation of the density functions is required. This condition is typically satisfied if the density is integrable and its first derivative is bounded. We acknowledge that we should explicitly state these assumptions in the revised draft.
More precisely, it is standard to say that we construct a regular parametric submodel for the true distribution. Regular parametric submodels are technical tools used to derive the efficiency bound, and can be constructed to satisfy the necessary regularity conditions. For example, for a true density $p_0(w)$, a typical construction is $p_\theta(w) = p_0(w)\{1 + \theta s(w)\}$, where $s$ is a bounded score function and $\mathbb{E}_{p_0}[s(W)] = 0$. This satisfies $\int p_\theta(w)\,dw = 1$ and $\frac{\partial}{\partial \theta} \log p_\theta(w)\big|_{\theta = 0} = s(w)$. If $s$ is unbounded, we can still define a parametric submodel as $p_\theta(w) = c(\theta)\, k(\theta s(w))\, p_0(w)$, where $k$ is a nonnegative function satisfying $k(0) = 1$ and $k'(0) = 1$, and $c(\theta)$ is a normalizing constant ensuring that $\int p_\theta(w)\, dw = 1$. For more details, please see Section 25.16 of van der Vaart (1998).
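As a concrete example of such a $k$ (this particular choice is an illustration; any function with these properties works):

$$k(y) = \frac{2}{1 + e^{-2y}}, \qquad k(0) = 1, \quad k'(0) = 1, \quad 0 < k(y) < 2,$$

so the resulting submodel is a valid density in a neighborhood of $\theta = 0$ even when the score $s$ is unbounded.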
In our study, to verify the regularity condition, we check whether the parametric submodel with the derived score satisfies the necessary properties. In our case, the condition holds under finite first and second moments of the outcomes, together with the common support condition (Assumption 3.3). However, we overlooked the boundedness assumption and will clearly state it in the next revision.
- van der Vaart (1998). "Asymptotic Statistics"
Q2. l.1261--1262: I thought we use a type of continuous mapping theorem there. It would be great if the proof explained what result was used with what conditions.
A2. Thank you for this helpful suggestion. In the omitted proof, we simply showed convergence directly. However, as the reviewer points out, we could also invoke the continuous mapping theorem. In either case, the boundedness of the estimators plays a crucial role.
Below, we provide a sketch of the proof for the argument leading to (8).
[Sketch of the proof] Since mathematical notation cannot be fully rendered in this OpenReview environment, let and denote the first and second terms in the expectation of (8), respectively. Then, we have Here, we have By showing that , we prove the statement.
To show , we use the following concrete form of :
Then, we have
Here, we can show that the following term converges to zero in probability, which follows directly from the convergence in probability of each nuisance-parameter estimator:
We also truly appreciate your detailed comments and discussion. They will significantly improve the quality of our manuscript, and we will incorporate them in the next revision.
This paper presents a novel and valuable contribution to causal inference by developing semiparametric efficient estimators for the Average Treatment Effect (ATE) in a Positive and Unlabeled (PU) learning setting. The reviewers unanimously found the problem interesting. After a thorough discussion period where the authors provided detailed and constructive rebuttals, all reviewers converged on a positive recommendation for acceptance. I agree with the reviewers and recommend acceptance.
For the camera-ready version, the authors should incorporate the following revisions:
- enhance the clarity and completeness of all theoretical results by adding omitted proofs, missing conditions and making the proofs and technical results more accessible;
- correct all identified typos and notational inconsistencies throughout the manuscript to improve readability;
- strengthen motivation and context by integrating real-world examples and expanding the discussion to better contextualize the work within the broader literature on causal inference with missing data.