Efficient Policy Evaluation Across Multiple Different Experimental Datasets
We provide estimators for evaluating the effect of policies leveraging data from related domains.
Abstract
Reviews and Discussion
This paper studies how to evaluate policies when the source and target sites exhibit distribution shift. The authors introduce identification criteria for the effectiveness of policies and develop doubly robust estimators that achieve fast convergence. The results are also generalized to multiple source datasets. Simulation results are provided to show the effectiveness of the method.
Strengths
- This paper is well written and easy to follow. The setup is clearly introduced and motivated, and the assumptions and methodologies are accurately stated.
- The proposed framework is general enough to cover both two domains and multiple domains.
- The empirical results seem good and align well with the theory, for both the synthetic simulations and the real-world dataset.
Weaknesses
I am not an expert in this field and am not familiar with causal inference. I find no major weaknesses. Some minor weaknesses concern the empirical evaluations. For example, the non-synthetic experiments are only conducted on the ACTG 175 clinical trial dataset; experimenting on other datasets would strengthen the empirical results.
Questions
See the weaknesses part.
Limitations
N/A
Thank you for your feedback. We agree with your comment and have added another experiment using a real-world dataset.
For example, the non-synthetic experiments are only conducted on ACTG 175 clinical trial dataset. Experimenting on other different datasets will enhance the empirical results.
We appreciate this feedback. We provide an additional experiment using the real-world "Project STAR" dataset (Krueger & Whitmore, 2001) [1] in the global response (attached PDF). We will add full details about the experiment in the revised version of the paper.
The discussion period is nearing its end, with just about a day remaining. Could you please check and confirm if our rebuttal has addressed your concerns and comments?
The paper develops novel graphical criteria and estimators for evaluating the effectiveness of various policies by integrating data from multiple experimental studies. Through theoretical analysis and empirical validation via simulations, the paper demonstrates the robustness and fast convergence of the proposed estimators.
Strengths
The authors develop nonparametric identification criteria to determine whether the effect of a policy may be expressed through an adjustment formula from two separate distributions induced by policy interventions collected from different populations. Further, they generalize the identification criteria and propose a general multiplier robust estimator applicable to the evaluation of policies from multiple source datasets.
Weaknesses
The paper is difficult to follow; it is very hard to understand. There are some grammatical mistakes and confusing notations, e.g.,
- line 71, ‘variable For example’ -> ‘variable. For example’
- unify the notations if necessary, try to avoid writing and simultaneously. For the experimental section, please give detailed descriptions of the simulated experiment and the empirical experiment (if the descriptions are too long, please move them to the appendix), which would allow readers to understand the advantages of the proposed framework. For the theoretical section, the authors first study the case of combining two experiments, followed by the case of combining multiple experiments. Honestly, the materials in the two studies are similar. The authors could unify the two studies: presenting the study of combining multiple experiments followed by the two examples (Example 2 and Example 3). This would free up more space to discuss the core components.
Questions
Consider Example 3. The proposed framework allows researchers to evaluate a policy from the three experiments. Suppose that all the datasets are synthesized together such that the resulting data is treated as a single dataset obtained from one experiment; presumably, we could evaluate a policy based on that combined dataset as well. Could you conduct a study that compares the two policy evaluation methods?
In some situations, it is necessary to evaluate a policy when the number of possible treatments is large. Consider Example 3 again, but this time with all the big cities in the US instead of three cities. The computational complexity obviously increases. Are there any efficient methods to handle this problem?
The adjustment criterion assumes that the distribution is invariant across the various experiments. To my understanding, the distributions vary even in the examples presented in the paper. How does the proposed methodology handle situations where the underlying assumptions are violated?
Limitations
The paper focuses on combining two experiments for policy evaluation only. It would be better if the authors could also discuss the optimal policy when combining two experiments.
Thank you for your thoughtful review. Please find below a point-by-point answer to all questions and concerns. Please let us know if we can clarify anything further.
(1) line 71, ‘variable For example’ -> ‘variable . For example’; (2) unify the notations if necessary, try to avoid writing and simultaneously.
Thank you for pointing this out. We will fix these.
For the experimental section, please give detailed descriptions on the simulated experiment and the empirical experiment
Thank you for the feedback. We will add further description on the details of the simulations. For further reference, we have provided the details of the dataset leveraged in this paper in the appendix.
Honestly, the materials in the two studies are similar. Authors can unify the two studies: presenting the study when combining multiple experiments followed by the two examples (example 2 and example 3). This can release more space to discuss the core components.
Thank you for your suggestion. We designed the current structure of the paper since our theories are new to both the causal and RL communities. We wanted to introduce the idea starting from a canonical setting (combining two experiments) and then extending it to more general settings.
Suppose that all the datasets are synthesized together such that the resulting data is treated as a single dataset obtained from one experiment
We added a simulation under the suggested scenario, and the result is reported in the attached PDF (Figure 3) in the global response. The key observation in this figure is that the errors do not converge to zero and in fact diverge as the sample size increases. This shows that these estimators are not consistent, unlike the proposed estimators, which converge to the true effect of the policy in the target domain.
The computation complexity increases
The computational complexity of the proposed estimator scales linearly with the number of variables in the causal graph: one nuisance function is learned per variable to construct the proposed estimator, and the cost of training each nuisance with ML models is commonly polynomial in the number of variables and the number of samples. Consequently, even if the number of source domains (which is upper bounded by the number of variables) increases, the time complexity grows only linearly with it. Therefore, the proposed estimator is scalable with respect to the number of source domains.
To my realization, the distributions vary even in the examples presented in the paper.
We were unable to fully parse this question. In our setting, the distributions do vary across domains. However, if some invariances between the source and target domains hold (such as those formalized in Def. 4 and Def. 6), our proposed method is applicable.
How does the proposed methodology handle situations where the underlying assumptions are violated?
If the assumptions are violated, the proposed estimator will not be consistent with the effect of the policy in the target domain and may deviate systematically from the true value. For example, as shown in the attached PDF in the global response, if we ignore the heterogeneity assumption and treat all datasets as if they are from the same population, the resultant estimates do not converge and even diverge.
The discussion period is nearing its end, with just about a day remaining. Could you please check and confirm if our rebuttal has addressed your concerns and comments?
The paper studies off-policy evaluation in a transfer learning setting with multiple source datasets collected from observational and/or randomized studies. The objective is to evaluate the effect of a target policy on a possibly different target population. To achieve this, the author(s) assume that at each time point, at least one source population shares the same distribution of the dynamic variable, given the historical covariate-treatment pair, as the target population. This ensures identifiability of the target population's policy value, even if no source population perfectly matches its entire distribution over time. The author(s) further propose doubly robust estimators and investigate their theoretical properties and finite-sample performance.
Strengths
Originality: To the best of my knowledge, the proposed identification formula and the proposed estimators introduced have not yet appeared in the existing literature.
Quality: The theorems established seem to be correct, and the methodologies proposed are theoretically sound and potentially useful.
Clarity: Overall, the paper is well-organized and easy to follow.
Weaknesses
I have two major concerns, regarding the technical assumptions and numerical experiments, along with several moderate comments. I detail them one by one below:
Assumptions. The assumptions might appear overly restrictive:
- Same number of studies as the horizon. In addressing a T-horizon off-policy evaluation (OPE) problem, the author(s) assume that there are exactly T studies, each corresponding to a time point, where exactly one study per time point shares the same outcome distribution with the target population at that time. These settings seem unrealistic in practical scenarios. It would be beneficial for the author(s) to provide examples of real applications that validate this assumption. The current analysis using the ACTG study seems artificial; I will elaborate on this concern in more detail later.
- Knowledge of matching source population. Additionally, the author(s) seem to require prior knowledge of which source population matches the target population’s outcome distribution at each time point. In practical scenarios, while we may have access to multiple studies, it is typically unknown which one aligns with the target's outcome distribution at each time point. A more realistic approach would adaptively learn which studies are similar to the target distribution at each time based on the data. This adaptive learning scenario, in my opinion, would better fit real-world applications.
- Identical distributions. Although the outcome distributions between the source and target populations may be similar at each time, they are not necessarily identical. Even if the distributions are not exactly the same, as long as the differences are minimal, it remains reasonable to use source data for transfer learning. Recent studies have addressed such distributional shifts using regularization or adaptive weighting.
Numerical experiments. The experiments are overly simplified:
- Single-horizon settings. While the paper studies multi-horizon dynamic treatment regimes, the simulations are conducted in a single-stage setting. I would suggest using D4RL benchmark datasets or OpenAI Gym environments to evaluate the proposed methodologies in multi-stage studies.
- Lack of competing methods. There are some naive estimators to consider in these transfer learning settings. However, the author(s) did not include them in the experiments. For instance, one can ignore the differences in the outcome distributions and assume the multiple datasets come from a single population, with a mixture of multiple behavior policies generating the actions.
- Real data. The ACTG dataset is generated under a single population. To evaluate the proposed methodology, the author(s) manually created a second population and a target population. It would be better if a real "multi-source" dataset could be used.
Related works.
- Under the identical distribution assumption, this paper mainly considers settings with covariate shift. There are some recent works that studied similar problems, although in single-stage settings, e.g., https://onlinelibrary.wiley.com/doi/epdf/10.1111/biom.13583.
- In the related work section, the author(s) argued that their work can be viewed as a bridge between causal inference and OPE by leveraging formal theories in causal inference to solve OPE problems. In that sense, there are prior works that similarly integrated both fields. For instance, standard policy evaluation methods in the RL literature use the backdoor adjustment to learn the Q-function or value function as a function of the state, to address confounding effects (see e.g., https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf). Meanwhile, other studies have applied the front-door adjustment formula for OPE in the presence of unmeasured confounders (https://www.tandfonline.com/doi/full/10.1080/01621459.2022.2110878). Finally, some works have leveraged double negative controls for OPE (https://cdn.aaai.org/ojs/6590/6590-13-9815-1-10-20200520.pdf).
Typo: one symbol is used to denote the dynamic covariate in Section 2, but this notation is replaced by a different one starting from Section 3.
Questions
Would it be possible to establish the semiparametric efficiency of the proposed estimators?
Limitations
The limitations have been discussed in the appendix.
Thank you for your extensive review, we really appreciate the feedback. In the following, we comment on the mentioned weaknesses and address outstanding questions and concerns separately.
(1) Same number of studies to horizon (2) While the paper studies multi-horizon dynamic treatment regimes, the simulations are conducted in a single stage setting.
We appreciate the careful reading of our setting and thank you for raising this point. Given that the paper tackles a novel problem setting, we have presented a canonical case of the T-horizon off-policy evaluation problem that assumes access to T studies. This assumption can be relaxed in practice, depending on the causal structure. For concreteness, in the attached PDF, we provide a simple example where a two-horizon OPE problem can be solved with access to a single source domain. More generally, the number of heterogeneous domains and the horizon are not necessarily matched if a given causal structure satisfies the proposed criterion formalizing the invariance of mediated and outcome variables across domains.
Knowledge of matching source population.
In the worst case, where every variable has arbitrarily different distributions from those of the target domain, the data obtained from the source population may not provide any clues for estimating the quantity defined in the target domain. To transfer knowledge, some form of invariance between sources and the target domain is essential.
In our proposed framework, the required knowledge about domain discrepancy is relatively minimal. Instead of quantitative information, qualitative information on which variables' distributions could differ from the source is sufficient to apply the proposed method. For example, if the age distributions in London and New York are different, we can use this information without needing to know the extent of the difference.
A more realistic approach would adaptively learn which studies are similar to the target distribution at each time based on the data.
Thank you for the intriguing proposition. In the setting we consider, however, we do not have access to data from the target population (except for baseline covariates) – and therefore, adaptive learning, which requires samples from the target population, may not be directly applicable to our problem setting.
We want to remark that our problem setting might be reasonable for safety-critical or costly applications where inference of causal effects is necessary before implementing a candidate policy of interest and collecting data in the target domain.
Although the outcome distributions between the source and target populations may be similar at each time, they are not necessarily identical.
We appreciate the references provided and will cite them. However, we respectfully disagree with the statement. The invariance of outcome distributions across domains stems from the ignorability assumption, which is present in all mentioned papers (e.g., Assumption 1 in the first paper) and in this paper (fourth bullet of Def. 4). This assumption establishes conditional independence between the outcome variable and domain-representing index variables, ensuring consistent outcome distributions across domains.
D4RL benchmark datasets or OpenAI Gym environments
We appreciate these suggestions. However, the suggested benchmarks may not be suitable for our problem setting, as they lack elements like heterogeneous populations, unobserved confounding, or causal structure.
Instead, to further convey the practical utility of the proposed approach, we conducted an additional experiment using a real-world dataset from Project STAR (Krueger & Whitmore, 2001). This experiment includes multiple stages and a semi-synthetic set-up (in which the existing dataset is partitioned to form different environments). We include a full description in the global response and attached PDF.
For instance, one can ignore the differences in the outcome distributions and assume the multiple datasets come from a single population
We added a simulation under the suggested scenario, and the result is reported in the attached PDF (Figure 3) in the global response. The key observation in this figure is that the errors do not converge to zero and in fact diverge as the sample size increases. This shows that these estimators are not consistent, unlike the proposed estimators, which converge to the true effect of the policy in the target domain.
Real data
To enhance the practical implications of the proposed method, we added a real-world data simulation using the dataset from Project STAR (Krueger & Whitmore, 2001) in the attached pdf (Figure 2).
Related works.
Thank you. We will definitely cite the referenced papers.
Would it be possible to establish the semiparametric efficiency of the proposed estimators?
Yes. Since the proposed estimator is composed of heterogeneous datasets from multiple domains, multiple sampling distributions corresponding to each dataset should be leveraged to construct the semiparametric efficiency bound. Specifically, the partial influence function [2], an influence function defined relative to each sampling distribution, can be constructed to provide an efficiency bound for the corresponding component. We will add a detailed discussion of semiparametric efficiency using the partial influence function in the revised version of the paper.
[2] Pires, Ana M., and João A. Branco. "Partial influence functions." Journal of Multivariate Analysis 83.2 (2002): 451-468.
The discussion period is nearing its end, with just about a day remaining. Could you please check and confirm if our rebuttal has addressed your concerns and comments?
I would like to thank the author(s) for their responses. I would keep my score now and vote for acceptance.
This work presents a method for evaluating the effectiveness of policies across multiple domains, using new graphical criteria and estimators that combine data from multiple experimental studies. The authors report an error analysis of the proposed estimators that provides fast convergence guarantees, and additionally share simulation results as well as empirical verification on real data.
Strengths
The targeted problem seems to be of importance, and the proposed method provides theoretical error and convergence analysis, backed by empirical results based on both simulation and real data.
Weaknesses
While the proposed method seems to have important potential in real-world applications, the majority of the paper is dedicated to theoretical analysis and the empirical evaluations are very limited. There is little discussion of how this could impact real-world problems, and there is no detailed analysis of the empirical results on the ACTG 175 dataset.
From Figure 3.c, a seemingly contradictory message can be seen: the proposed DML method does not perform better than OM on all datasets, though there is no discussion of this.
Also, the authors state that "simulations on real and synthetic data are provided for illustration purposes only," which raises the question of to what extent this approach is useful in real-world scenarios.
Questions
- Please provide more empirical analysis of the proposed method, ideally on more real-world datasets if possible.
- Please discuss the results on the ACTG 175 dataset and the different pattern observed in Fig. 3.c.
- Provide more empirical evidence on real-world data.
Limitations
Yes.
Thank you for your valuable feedback and insights. In the following, we aim to answer each one of your concerns in turn. Please let us know if we can provide more details.
(1) the empirical evaluations are very limited. There is little discussion of how this could impact real-world problems, (2) Please provide more empirical analysis of the proposed method, ideally on more real-world datasets if possible. (3) Provide more empirical evidence on real-world data.
We appreciate this feedback. We provide an additional experiment using the real-world "Project STAR" dataset (Krueger & Whitmore, 2001) [1] in the global response (attached PDF). We will add full details about the experiment in the revised version of the paper.
[1] Krueger, Alan B., and Diane M. Whitmore. "The effect of attending a small class in the early grades on college‐test taking and middle school test results: Evidence from Project STAR." The Economic Journal 111, no. 468 (2001): 1-28.
there is no detailed analysis of the empirical results on the ACTG 175 dataset.
Please note that the data generation process relating to ACTG 175 is described in the Appendix. More detailed discussions on the results of the synthetic and ACTG 175 experiments will be added.
From Figure 3.c, a seemingly contradictory message can be seen: the proposed DML method does not perform better than OM on all datasets, though there is no discussion of this.
Please discuss the results on ACTG175 datasets and the different pattern observed in Fig3.c.
Thank you for this comment. This observation is expected. The construction of the DML estimator combines aspects of the OM estimator and the PW estimator. The error of the DML estimator can be represented as a product of the errors of nuisances for the OM estimator and the errors of the PW estimator. As a result, if the PW estimator has a large error, this can be reflected in a correspondingly larger error of the DML estimator. Since the error of the PW estimator is relatively larger than that of the OM estimator in Figure 3c, we expect the DML estimator to underperform compared to the OM estimator when the sample size is small. However, when both the OM and PW estimators converge to the truth, we expect that the DML estimator will outperform the other estimators as the sample size grows.
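As a rough illustration of this double-robustness property, the following toy single-stage sketch (our own illustrative data-generating process and nuisance models, not the paper's estimator) shows an AIPW-style combination of an outcome-model (OM) term and a propensity-weighting (PW) correction. The estimate stays close to the truth when either nuisance is correct, even if the other is badly misspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy data: binary treatment A with propensity e(X), outcome Y.
X = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-X))                  # true P(A=1 | X)
A = rng.binomial(1, e_true)
Y = 1.0 + 2.0 * X + 3.0 * A + rng.normal(size=n)   # E[Y | do(A=1)] = 4.0

def aipw(mu1, e):
    """AIPW estimate of E[Y | do(A=1)]: OM term plus a PW correction."""
    return float(np.mean(mu1 + A / e * (Y - mu1)))

mu1_true = 4.0 + 2.0 * X        # correct outcome model for A = 1
mu1_bad = np.full(n, 10.0)      # deliberately wrong outcome model
e_bad = np.full(n, 0.5)         # deliberately wrong propensity

print(aipw(mu1_true, e_bad))    # OM correct, PW wrong: close to 4.0
print(aipw(mu1_bad, e_true))    # OM wrong, PW correct: close to 4.0
```

Since the estimator's bias behaves like a product of the two nuisance errors, a poorly estimated PW component can dominate in small samples, consistent with the pattern discussed above for Figure 3c.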
Also, the authors state that "simulations on real and synthetic data are provided for illustration purposes only," which raises the question of to what extent this approach is useful in real-world scenarios.
Please consider the context in which this statement is located. The full sentence is in our Broader Impact Statement section as follows:
"Finally, we emphasize that simulations on real and synthetic data are provided for illustration purposes only. These results do not recommend or advocate for the implementation of a particular policy, and should be considered in practice in combination with other aspects of the decision-making process."
With these sentences, we do not question the usefulness of our methods, which we have justified through a series of theoretical and empirical evidence. Rather, we clarify that our paper should not be read as an endorsement of specific policies.
The discussion period is nearing its end, with just about a day remaining. Could you please check and confirm if our rebuttal has addressed your concerns and comments?
I thank the authors for the provided response. In light of the additional experiments, I have updated my scores accordingly.
Thank you again for your time and dedication in reviewing our work. In this global response, we describe additional results, illustrated with figures in the attached PDF, to address outstanding comments and questions. In particular, we attach three figures that are described below.
Figure 1: Relaxation of the Assumption on the number of source datasets
To respond to Reviewer cwjE's concern about "the author(s) assume that there are exactly T studies", we describe a scenario in which a single source domain can be used to evaluate a two-stage policy. This shows that the assumption of having exactly T source datasets for solving a T-stage Offline Policy Evaluation problem can be relaxed in practice, depending on the causal structure of the domains.
Figure 2: Experiment on Project STAR dataset
The second figure describes an additional experiment evaluating the effect of policies determining teacher/student ratios to improve academic achievement. Specifically, we use a semi-synthetic version of the Project STAR dataset (Stock & Watson, 2007) [1]. This study investigated the impact of teacher/student ratios on academic achievement for kindergarten through third-grade students. It was a four-year longitudinal study where students were randomly assigned to one of three interventions with different class sizes each year, following different randomization procedures. The dataset is publicly accessible from the R data repository.
The causal diagram we assume for this setting is provided in Fig. 2 of the attached PDF. Specifically, we consider the evaluation of a 3-stage policy that determines student-teacher ratio across three different grades over time (these are the variables in the target environment). We observe intermediate outcomes that represent academic scores in grades 0 (Kindergarten) and 1, baseline covariates such as ethnicity and gender, and the outcome of interest representing academic scores at the end of grade 3.
To mimic the setting where data at different stages was collected from different domains, we subsample the dataset using different sets of probabilities to induce differences in the distributions of baseline covariates (similar to the procedure conducted for ACTG 175).
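A minimal sketch of this kind of biased subsampling (the covariate, keep probabilities, and helper name below are hypothetical and only illustrate the idea, not the exact procedure used for Project STAR or ACTG 175): rows are retained with a probability that depends on a baseline covariate, so the retained subsamples have different covariate distributions by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pooled dataset with a binary baseline covariate (e.g., gender).
n = 10_000
gender = rng.binomial(1, 0.5, n)

def subsample(keep_prob_by_value):
    """Keep each row with a probability depending on its covariate value,
    inducing a covariate shift in the retained subsample."""
    p = np.where(gender == 1, keep_prob_by_value[1], keep_prob_by_value[0])
    return rng.random(n) < p

# Two semi-synthetic 'domains' with different covariate distributions.
dom_a = subsample({0: 0.9, 1: 0.3})   # mostly covariate value 0
dom_b = subsample({0: 0.3, 1: 0.9})   # mostly covariate value 1
print(gender[dom_a].mean(), gender[dom_b].mean())  # clearly different by design
```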
We consider a similar evaluation setup as in the main body of this paper and evaluate the proposed estimators PW, OM, and DML at different dataset sizes, plotting their absolute errors relative to the ground-truth effect of the candidate policy. The results are shown in Fig. 2 of the attached PDF. The performance pattern is qualitatively similar to the other presented experiments: all methods improve with sample size towards convergence (in the limit), and the DML estimator converges fastest.
[1] Stock, J.H. and Watson, M.W. (2007). Introduction to Econometrics, 2nd ed. Boston: Addison Wesley.
Figure 3. Performance of naive estimators that ignore the differences across domains
To respond to comments from reviewers cwjE and xiN2, we implemented the "naive" estimator that ignores discrepancies across domains. We considered the 2-stage synthetic simulation scenario described in Sec. 5.1 and plotted performance as a function of increasing sample size. The key observation in this figure is that the errors do not converge to zero and in fact diverge as the sample size increases. This shows that the naive estimator is not consistent, unlike the proposed estimators, which converge to the true effect of the policy in the target domain.
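The failure mode of such a naive estimator can be sketched in a toy one-stage example (the Gaussian covariate shift and the closed-form density-ratio weights below are our own illustrative assumptions, not the paper's simulation): the naive pooled mean carries a bias that no amount of data removes, while re-weighting by the (here, known) density ratio targets the correct quantity.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate(n):
    # Source domain: X ~ N(0,1); target domain: X ~ N(1,1) (covariate shift).
    X = rng.normal(0.0, 1.0, n)
    Y = X**2 + rng.normal(size=n)     # E[Y | X] = X^2, invariant across domains
    naive = Y.mean()                  # ignores the shift entirely
    w = np.exp(X - 0.5)               # density ratio N(1,1)/N(0,1) = exp(x - 1/2)
    weighted = np.mean(w * Y)         # re-weights the source toward the target
    return naive, weighted

# Target truth: E[X^2] under N(1,1) is 2; the naive estimand is 1.
for n in (1_000, 100_000):
    print(n, estimate(n))
# The naive estimate settles near 1 (biased), the weighted one near 2.
```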
All the reviewers agree that the problem tackled in the paper, offline policy evaluation in a transfer learning setup, is important. There were concerns about the lack of real-data evaluation, the assumptions/settings of the theory, and missed related works. In addition, the paper seems to suffer from presentation issues.
Most of the concerns have been addressed by the authors' rebuttal, and the reviewers are leaning towards accepting this paper.
I would still urge the authors to improve the presentation and make the paper widely accessible to the NeurIPS community.