PaperHub
NeurIPS 2024 · Poster
Overall rating: 5.0/10 from 4 reviewers (individual ratings: 4, 6, 6, 4; min 4, max 6, std 1.0)
Confidence: 2.8 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 3.0

Externally Valid Policy Evaluation from Randomized Trials Using Additional Observational Data

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We propose a nonparametric method that uses trial data, adjusted with additional covariates from the target population, to provide certifiably and externally valid policy evaluations.


Keywords
Policy Evaluation · Randomized Trials · External Validity

Reviews and Discussion

Review (Rating: 4)

This paper introduces a method for inferring policy decisions based on randomized controlled trial data when applied to a target population with new covariate data. The method is nonparametric, makes no assumptions about the distributional form of the data, and certifies valid finite-sample inferences of the out-of-sample loss.

Strengths

The problem of policy evaluation with covariate shifts studied in this paper is important.

Weaknesses

  1. The motivation for the problem setup is not clear. Why infer L_{n+1} for the additional (n+1)-th data point rather than for the new points 1...n? The latter would be reasonable in a clinical trial, with n new patients coming in with unknown outcomes.
  2. The discussion of theoretical guarantees is not thorough, especially since the method relies heavily on heuristics; e.g., strong assumptions are made about the correctness of the estimates of the unknown distribution p(S|X,U), and the division of D into D' and D'' seems arbitrary.
  3. The experimental setup looks overly simplistic: a basic two-dimensional covariate space and a simple quadratic model for the loss.

Questions

  1. Could you clarify the rationale behind inferring L_{n+1} for a single additional data point rather than considering the distribution of outcomes for multiple new observations (1...n)?
  2. Is there a way in your experiments to evaluate the estimated p(S|X) and hence the odds ratios (Figure 4)? Is the proposed Alg. 1 sensitive to mis-estimation of p(S|X), which is very likely due to unobserved confounding U?

Limitations

The authors have noted some limitations, such as that the method requires independent samples and may not be suitable in scenarios like major virus outbreaks.

Author Response

We thank the reviewer for the comments and provide our response below. We believe they may clarify some potential misunderstandings.

[W1] The motivation for the problem setup is not clear. Why infer L_{n+1} for the additional (n+1)-th data point rather than the new points 1...n?

Bounding L_{n+1} quantifies the out-of-sample performance of π with respect to a future individual drawn from the target population (after having sampled the covariates of n persons); see lines 81-83.

In our assessment, it could be possible to extend the methodology to consider, say, N future individuals, i.e., n+1, …, n+N, by leveraging permutations of independent samples. However, this appears to lead to a combinatorial explosion which quickly becomes intractable. We have therefore opted for clear, intuitive results that are computationally feasible.

[W1, cont’d] This would be reasonable in a clinical trial, with n new patients coming in with unknown outcomes.

It seems this could be a misunderstanding of the problem setting we are considering: The aim of the paper is not to infer outcomes in a clinical trial, but rather outcomes of a policy applied to a target population for which we only have covariate data.

[W2] the method relies heavily on heuristics; e.g., strong assumptions are made about the correctness of the estimates of the unknown distribution p(S|X,U), and the division of D into D' and D'' seems arbitrary.

A central point of the methodology we develop is precisely to avoid strong assumptions about p(S|X,U) (see for instance lines 34-35, 69-80, 131-132, 290-291). We do so using a model p̂(S|X) and the sensitivity specification Γ in eq. (3).
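For readers less familiar with such sensitivity specifications: a generic Rosenbaum-style marginal form reads as follows (an illustration of the general idea, in our notation; the exact expression of eq. (3) is defined in the paper):

\[
\Gamma^{-1} \;\le\; \frac{\hat{p}(S=1 \mid x)\,/\,\bigl(1-\hat{p}(S=1 \mid x)\bigr)}{p(S=1 \mid x,u)\,/\,\bigl(1-p(S=1 \mid x,u)\bigr)} \;\le\; \Gamma \qquad \text{for all } (x,u),
\]

so that Γ bounds the odds ratio between the working model p̂(S|X) and the true sampling probability p(S|X,U); Γ = 1 corresponds to a perfectly calibrated model.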

Sample splitting of D into D' and D'', as described in lines 168-169, is a standard procedure in statistical inference to ensure valid inferences. (In supervised machine learning, this occurs e.g. in train-test splits to evaluate the out-of-sample risk of a learned prediction rule.)
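Schematically, the pattern looks like this (generic code, not the paper's implementation; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic sample-splitting pattern: nuisance models are fit on D',
# while the inference step only ever touches D'', keeping the two
# stages statistically independent.
indices = rng.permutation(1000)   # indices of the pooled sample D
D_prime, D_dbl_prime = indices[:500], indices[500:]
# Fit p_hat(S|X) using D_prime only; compute the certified bounds on D_dbl_prime.
```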

[W3] The experimental setup looks overly simplistic: a basic two-dimensional covariate space and a simple quadratic model for the loss.

The first experimental setup presented in Section 5.1 is indeed intended to be illustrative, as the subheading indicates, so that the reader can obtain some intuition for the method. For this reason we could use either one- or two-dimensional covariates to visualize the setting; we chose two-dimensional covariates, as in Figure 4.

Note that the method uses a nonparametric assumption for the loss distribution, and only uses a parametric model for the sampling pattern, i.e., p̂(S|X), along with Γ. The simplicity of the conditional loss distribution is therefore not relevant when illustrating the method.

The second experimental setup presented in Section 5.2 (and Figure 3) evaluates seafood consumption policies based on real data and involves 8 covariates. This is a considerably more complex problem for policy evaluation.

[Q1] Could you clarify the rationale behind inferring L_{n+1} for a single additional data point rather than considering the distribution of outcomes for multiple new observations (1...n)?

Please see our reply to W1 above.

[Q2] Is there a way in your experiments to evaluate the estimated p(S|X) and hence the odds ratios (Figure 4)? Is the proposed Alg. 1 sensitive to mis-estimation of p(S|X), which is very likely due to unobserved confounding U?

We are unsure about the first part of the question. Figure 4 belongs to the illustrative experiments in Sec. 5.1, in which there is a synthetic ground truth p(S|X). In real experiments, our suggestion is to use ideas from sensitivity analysis (see lines 86-99 and Figures 3a and 5a) to evaluate a credible range for Γ that limits the odds ratios in eq. (3).
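One informal way to benchmark such a range, sketched below under our own assumptions (synthetic data, a leave-one-covariate-out refit; not necessarily the paper's exact procedure), is to check how much the selection odds move when an observed covariate is treated as if it were unobserved:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
m, n = 500, 2000
X_trial = rng.normal(size=(m, 3))               # trial covariates
X_target = rng.normal(loc=0.3, size=(n, 3))     # shifted target covariates
X = np.vstack([X_trial, X_target])
S = np.r_[np.ones(m), np.zeros(n)]              # S = 1 marks trial membership

odds = lambda p: p / (1 - p)

# Selection model using all covariates vs. a refit with one covariate "hidden".
p_full = LogisticRegression().fit(X, S).predict_proba(X)[:, 1]
p_drop = LogisticRegression().fit(X[:, :2], S).predict_proba(X[:, :2])[:, 1]

# The spread of these odds ratios hints at how large Gamma might plausibly
# need to be to absorb an unobserved factor of comparable strength.
ratios = odds(p_full) / odds(p_drop)
print(f"benchmark odds-ratio range: [{ratios.min():.2f}, {ratios.max():.2f}]")
```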

The second part of the question seems to restate the misreading above: a central point of Alg. 1 is that it is robust against mis-estimation of p(S|X), not least when there is unobserved confounding U! Hence the (benchmarked) miscalibration degree Γ required as input.

Comment

Thank you for the rebuttal. However, I am still not satisfied with the responses regarding the problem setup, the motivation, and the sensitivity of the estimated model p(S|X). For instance, the authors use the index m for indexing/summing over the trial-population data and 1...{n+1} for the covariate-only target-population data, which could all be "future", but strangely focus on L_{n+1} without a convincing explanation. I believe a clearer version of this paper might be of interest to the NeurIPS community, but not in its current form. Hence, I maintain my score.

Comment

We believe there are still some fundamental misunderstandings regarding the problem setup. Let us restate it succinctly here. We assume access to two data sets in this scenario:

  1. The first data set comprises m samples drawn from the trial distribution, and typically represents a randomized controlled trial (RCT) setup. This data set includes covariates X, actions A, and losses L. (See lines 64-65.)

  2. The second data set consists solely of covariate data X and contains n (not n+1!) samples drawn from the target distribution. (See lines 65-66.)

The goal is to infer the out-of-sample loss in the target population, that is, to infer L for a new individual who does not exist in our second data set. The future loss is therefore indexed L_{n+1}.

To exemplify this setting, we are currently analyzing data from a randomized controlled trial (RCT) that compares the blood pressure responses to various blood-pressure-lowering drugs. In this RCT data set, we have m samples, where m is less than 500. Additionally, we have access to medical records from a large number of patients, n over 100,000, who are seeking treatment for high blood pressure. This larger data set represents our target distribution, where we aim to evaluate and implement new treatment policies.
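To make the data layout concrete, here is a toy sketch of the two data sets (synthetic numbers; variable names are ours and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 100_000

# Data set 1: RCT samples from the trial distribution (covariates, actions, losses).
X_trial = rng.normal(size=(m, 2))
A_trial = rng.integers(0, 2, size=m)               # randomized actions
L_trial = (X_trial[:, 0] - A_trial) ** 2 + 0.1 * rng.normal(size=m)

# Data set 2: covariates only, drawn from the (shifted) target distribution.
X_target = rng.normal(loc=0.5, size=(n, 2))        # losses are never observed here

# Goal: certify a bound on L_{n+1}, the loss of a policy applied to a
# future individual drawn from the same target distribution.
```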

It is also vital to understand that the method is robust to miscalibrations that occur when the conditional probability p(S|X,U) is estimated as p̂(S|X) (see lines 69-78). The robustness is controlled by the parameter Γ. We also provide a practical approach for determining suitable ranges for Γ (lines 86-96).

Review (Rating: 6)

This paper proposes a method for constructing limit curves (upper bounds on the CDF) of an outcome, under a given policy, in a target population, using data from an experimental study. The general goal is to certify that bad outcomes are unlikely in a target population, given experimental data and some knowledge of the strength of potential biases arising from selection on unobservable characteristics, model misspecification, and so on.

Strengths

Full disclosure: I reviewed a previous version of this paper, and previously recommended rejection. Hence, my review is influenced both by the current version, and by my knowledge of what has changed since the previous version.

Without further ado, the strengths of this paper, as I see them:

  1. This paper considers an important and significant problem at the intersection of several interesting lines of work (e.g., conformal prediction & risk control, generalization of policy evaluation from experimental to observational settings, etc).
  2. With some minor nits (see below), the paper clearly presents their contributions, the motivation for their approach, and the assumptions required. Moreover, the current version does a much improved job putting some of their work in the context of other papers that consider similar bounds on generalizing from experimental to observational data.
  3. The approach is practical, given the incorporation of some informal methods for benchmarking plausible values of the Γ parameter. This is new since the previous version, and greatly improves the practical applicability of the method in my view.

Weaknesses

The main weakness, in my view, is the claimed contribution that Γ can incorporate finite-sample and model-misspecification error (see lines 77-78, "includes all sources of errors...selection bias, model misspecification, estimation error)"), since it is not at all clear how those considerations enter into the selection of Γ. The informal benchmarking approach involves building intuition for plausible impacts of selection bias due to unobserved factors, but it didn't seem to speak to model misspecification or estimation error. This is not necessarily a major issue, but it might be worth softening some of the claims that the proposed approach handles all these other types of error, without some corresponding method for benchmarking them. For instance, for smaller sample sizes, I could imagine these factors being much more influential in the true value of Γ.

The second weakness, which I'm less inclined to weigh heavily, is that the novelty of the technical contribution is slightly unclear: mechanically speaking, there is little difference between sensitivity analysis considering unobserved confounding that impacts selection & outcome versus confounding that impacts treatment & outcome, and there is similarly little difference between policy evaluation and evaluating average treatment effects. It would be helpful if the authors could highlight the technical contributions they think are most notable compared to Jin et al. 2023, Ek et al. 2023, Huang 2024, etc., beyond the simple fact that the setting differs (e.g., considering treatment-outcome vs. selection-outcome confounding, or considering ATE vs. policy evaluation).

I also have some additional minor feedback that might be worth incorporating into a future version:

  1. Regarding the benchmarking, it would be worth giving the caveat that this is an informal approach to benchmarking that can yield unintuitive results; see [0]. Part of the contribution of Huang 2024 (cited here) was to give a more principled approach under a different sensitivity model. However, I don't view this as a major weakness, since this is generally an unresolved challenge, as I understand it, for Rosenbaum-like sensitivity bounds as used here (Huang 2024 & Cinelli and Hazlett 2020 use an R²-based sensitivity model where a more principled approach is possible in the first place).
  2. There's a lot of work on combining observational & experimental data, and it might be worth highlighting differences from the settings considered. For instance, [1] deals with a similar setting, but one where outcome information is available from the observational data. There are other papers that deal with estimating causal effects in target populations from experimental data where the observational data include individuals not represented in the trial, e.g., [2], [3].
  3. It's not entirely clear from the introduction what the "loss" L entails, as the name suggests something like a prediction error. I think it would be useful to highlight upfront that L can represent, e.g., an outcome of interest (more typically rendered as Y), as done in the application. It is more intuitive to me to interpret the setup in light of the application, e.g., the goal is to bound the probability of extreme outcomes occurring.
  4. As an improvement, it seems like it should be fairly straightforward to derive lower bounds as well, no? E.g., replacing L with −L and applying the same machinery. That might be useful in applications where you want to ensure that an outcome stays within a certain range.

[0] Making Sense of Sensitivity: Extending Omitted Variable Bias. Carlos Cinelli, Chad Hazlett. https://doi.org/10.1111/rssb.12348 (JRSS B, 2020)

[1] Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang. https://arxiv.org/abs/2312.03871 (AISTATS 2024)

[2] Removing Hidden Confounding by Experimental Grounding. Nathan Kallus, Aahlad Manas Puli, Uri Shalit. https://arxiv.org/abs/1810.11646 (NeurIPS 2018)

[3] Falsification before Extrapolation in Causal Effect Estimation. Zeshan Hussain, Michael Oberst, Ming-Chieh Shih, David Sontag. https://arxiv.org/abs/2209.13708 (NeurIPS 2022)

Questions

I would appreciate any clarifications or additional comments from the authors on the two main weaknesses I raised above:

  1. How to incorporate model misspecification or estimation error into the choice of Gamma?
  2. Are there any parts of the technical contribution the authors would highlight as particularly interesting from their perspective? Honest question, as I have not read some of the related papers in depth.

Limitations

The authors adequately discuss limitations.

Author Response

We are very grateful for the reviewer’s past comments, which improved the revision substantially.

The informal benchmarking approach involves building intuition for plausible impacts of selection bias due to unobserved factors, but it didn't seem to speak to model misspecification or estimation error. This is not necessarily a major issue, but it might be worth softening some of the claims that the proposed approach handles all these other types of error, without some corresponding method for benchmarking them. [Q1] How to incorporate model misspecification or estimation error into the choice of Gamma?

This is a valid point: the benchmarking method addresses only the plausible impact of unobserved selection factors U. We will add a remark that points the reader to Appendix B.3, where we suggest using reliability diagrams to quantify bounds on model misspecification or estimation errors.

The possible joint impact of U and model misspecification or estimation error is, of course, beyond the scope of our methodology.

The second weakness, which I'm less inclined to weigh heavily, is that the novelty of the technical contribution is slightly unclear. [Q2] Are there any parts of the technical contribution the authors would highlight as particularly interesting from their perspective?

Yes, while the ‘mechanical’ aspects of the employed proof technique do build on Jin et al. 2023 and cited works developed in other problem areas, the contribution to the problem of establishing externally valid policy evaluation is novel. We have tried to break down our proof steps in Appendix A in a transparent manner and highlight the steps that invoke past results (e.g., lines 767, 776, 778). To make this clearer, we will also refer to the relevant theorems in the cited work.

Regarding the minor feedback:

[1] Regarding the benchmarking, it would be worth giving the caveat that this is an informal approach to benchmarking that can yield unintuitive results; see [0]. Part of the contribution of Huang 2024 (cited here) was to give a more principled approach under a different sensitivity model. However, I don't view this as a major weakness, since this is generally an unresolved challenge, as I understand it, for Rosenbaum-like sensitivity bounds as used here.

We agree. While we have tried to be transparent about it, adding a reference could make this even more explicit.

[2] There's a lot of work on combining observational & experimental data, and it might be worth highlighting differences from the settings considered. For instance, [1] deals with a similar setting, but one where outcome information is available from the observational data. There are other papers that deal with estimating causal effects in target populations from experimental data where the observational data include individuals not represented in the trial, e.g., [2], [3].

We have descriptions of the setting both in the introduction (lines 25-31) and the background (lines 109-117), but the differences from other settings may not be clear enough. We will review the references and try to clarify this.

[3] It's not entirely clear from the introduction what the "loss" L entails, as the name suggests something like a prediction error. I think it would be useful to highlight upfront that L can represent, e.g., an outcome of interest (more typically rendered as Y), as done in the application. It is more intuitive to me to interpret the setup in light of the application, e.g., the goal is to bound the probability of extreme outcomes occurring.

We refer to it as “loss” as the opposite of “reward”; either way, it is an ordered decision outcome that we want to bound. However, this is a good suggestion that will help the reader, and we will try to clarify it already in the introduction.

[4] As an improvement, it seems like it should be fairly straightforward to derive lower bounds as well, no? E.g., replacing L with −L and applying the same machinery. That might be useful in applications where you want to ensure that an outcome stays within a certain range.

Replacing L with −L will indeed result in a lower bound. Although we have not considered this type of application, it is a valuable suggestion and we will add a remark about it.
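To spell the reduction out (our notation, assuming a continuous loss for simplicity; B_{−L} denotes the certified upper limit obtained by running the same machinery on −L):

\[
P(L_{n+1} < t) \;=\; P(-L_{n+1} > -t) \;\le\; B_{-L}(-t),
\]

so 1 − B_{−L}(−t) is a certified lower limit on the distribution function of L_{n+1}; combined with the original upper limit, this confines the outcome to a range.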

Comment

Thank you for the response - just chiming in that I've read the response and that all sounds good to me (e.g., clarifying limitations wrt handling both misspecification and confounding from a benchmarking point of view, clarifying the theorems being used from the cited references, etc). Glad to hear that you can easily get lower bounds as well. Regarding the comparison to other settings, feel free to take or leave the feedback - it's clear enough to me what the differences are (as I mentioned in the review), just thought it might help frame the contribution better for a reader.

Review (Rating: 6)

This paper studies the challenge of generalizing randomized controlled trial (RCT) results to a target population, addressing the potential issue of distributional shift from RCT participants to the intended population. Instead of estimating the expected loss on the target population, the paper proposes a nonparametric method that leverages covariate information from the target population and certifies valid finite-sample inferences of the out-of-sample tail loss. The approach is validated using both synthetic and real data, ensuring its applicability to the target population.

Strengths

  1. The paper is well written. As someone who works on causal inference but is not very familiar with the methodology for handling distributional shift, I could easily follow the paper and get the gist of the idea.
  2. I think the method and the technical results are very clean and sound.
  3. I believe the inference on out-of-sample tail risks is of vital importance, yet it is often omitted in policy evaluation.

Weaknesses

  1. If I understand correctly, the policy π is given as an exogenous parameter. However, it is common practice to learn a policy using the RCT data and then seek to know its performance on the target population. This definitely introduces a dependence issue. I suspect some type of cross-fitting may help, but I doubt the empirical performance; is there any chance to analyze the bias in this case?

Questions

Please refer to the weakness part for questions.

Limitations

The authors have adequately addressed the limitations.

Author Response

We thank the reviewer for the question and are pleased that he/she acknowledges the importance of the problem we are tackling.

If I understand correctly, the policy π is given as an exogenous parameter. However, it is common practice to learn a policy using the RCT data and then seek to know its performance on the target population.

Yes, this paper focuses on the evaluation of any given policy π, which could either be proposed by a clinical expert or learned from past data. It is therefore possible to set aside, say, N past samples from an RCT study to learn a policy π_N and then use the proposed methodology to evaluate its out-of-sample performance. We will add a remark in the manuscript to inform the reader about this possibility.
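A minimal sketch of that workflow (toy data and a deliberately trivial learner; the certification step itself is the paper's method and is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy RCT data (illustrative only).
m, N = 500, 200
X = rng.normal(size=(m, 2))
A = rng.integers(0, 2, size=m)                     # randomized actions
L = (X[:, 0] - A) ** 2 + 0.1 * rng.normal(size=m)  # observed losses

# Set aside N samples to learn a policy; reserve the rest for evaluation.
idx = rng.permutation(m)
learn, hold = idx[:N], idx[N:]

# Learn pi_N on the learning split only (here: the action with lower mean loss).
mean_loss = [L[learn][A[learn] == a].mean() for a in (0, 1)]
pi_N = lambda x: int(np.argmin(mean_loss))         # a constant policy, for simplicity

# pi_N can now be evaluated with the certification machinery on the held-out
# split (and the target covariates); evaluation never touches the learning split.
```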

Review (Rating: 4)

This paper aims to use trial data to make valid inferences about policy outcomes for a target population. By incorporating additional covariate data from the target population, the sampling of individuals in the trial study is modeled. The authors develop a nonparametric method that provides certifiably valid trial-based policy evaluations, regardless of model miscalibrations, and ensures validity even with finite samples. The effectiveness of the certified policy evaluations is demonstrated using both simulated and real data.

Strengths

The paper has theoretical analysis.

Weaknesses

  1. The paper is not well organized.

Questions

NA

Limitations

Yes. I have very limited knowledge of this area; my review of this paper should not be counted.

Author Response

We appreciate the reviewer’s honesty and trust that the chair will discount this review.

Final Decision

The paper considers a very important problem: leveraging randomized trial data and partial knowledge of treatment selection to ensure that bad outcomes are unlikely in a target population. As noted by grGf and Cyms, the paper is well-written, the methodology is well-motivated and practical, and the authors have clearly contextualized their contribution within the existing literature. That said, I must still strongly encourage the authors to revise their paper along the lines suggested by grGf. Specifically, please clarify the limitations of handling both misspecification and confounding from a benchmarking point of view and please clarify which theorems are being used from the cited references.

Note: I have ignored the review and score of S2kx, who noted that their review should not count due to their limited knowledge of this area. I believe that xLhe's initial lower assessment was based, at least in part, on a misunderstanding of the problem setting. The authors have clarified this point satisfactorily. So my recommendation is based primarily on the reviews of Cyms and grGf.