PaperHub
NeurIPS 2024 · Poster · 5 reviewers
Overall rating: 5.6/10 (individual scores 4, 5, 6, 7, 6; min 4, max 7, std 1.0)
Confidence: 4.4 · Correctness: 2.8 · Contribution: 2.6 · Presentation: 2.4

When to Act and When to Ask: Policy Learning With Deferral Under Hidden Confounding

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

Learning a treatment recommendation policy under hidden confounding, with the option of deferring the decision to an expert

Abstract

Keywords
policy learning, causal inference, sensitivity analysis, human-algorithm collaboration

Reviews and Discussion

Official Review
Rating: 4

The paper proposes a learning-to-defer method for the policy learning problem with binary actions under unobserved confounding. The proposed method uses the MSM to bound the confounding strength and compares the pessimistic bounds on Y1, Y0 with the identifiable human reward. The final optimization uses the surrogate loss proposed in Mozannar and Sontag [2020]. Overall, the method seems sound. However, the authors overlooked some existing literature on learning to defer for policy learning problems. The authors compare the proposed method with a self-proposed rejection-inference style baseline, and a policy learning method considering the MSM without humans.

Strengths

The paper proposes a sound solution to policy learning with unobserved confounding by conservatively comparing the lower bounds of Y1, Y0 and the identifiable human reward. The method has strong performance compared to the baselines used in the paper by leveraging a stronger direct-method class. It is interesting to see how the direct method can be used in the unobserved-confounding case for learning to defer, since under unconfoundedness both Y1 and Y0 are identifiable and humans may not be needed unless the policy class is limited.

Weaknesses

Missing related work and inaccurate claims: For the related work, it seems that learning to defer for the policy learning problem was first proposed in [1], and learning to defer for policy learning with unobserved confounders was first proposed in [2]. These two papers adopt the Inverse-Propensity-Score framework to address the problem with a similar experimental setup and the same problem setting. The authors should compare with these existing works, and the claim that this is "the first to learn a policy with the option of deferral to an expert while allowing for hidden confounders" does not seem accurate. Also, the assumptions in this paper are stronger, since it requires an additional assumption in Definition 1.

Binary action: [1][2] can work with multiple actions, and it seems the proposed method could also be extended to multiple actions, so why do the authors restrict to binary actions?

Theoretical result: The bound validity assumption (Definition 1) seems to suggest that Y is deterministic, which is rather strong. For example, a simple Gaussian distribution would violate this assumption. What if Y is a random variable; how can Theorem 1 be adapted? I would also expect a regret bound for the total reward in the paper (Theorem 2 is a bound for the approximated loss instead of the reward, if I understand correctly). Similarly, for a general direct method, when would we expect Assumption 1 to hold and not hold?

Experiments - baseline: For the baselines, I would expect a purely pessimistic AI baseline for a fair comparison. For example, Eq. 3 could also use the pessimistic principle the authors used for comparing the human and the pessimistic bounds of the counterfactuals (i.e., compare Y1_lower and Y0_lower in the otherwise condition). The authors should also compare with existing works using the IPS method and discuss the trade-offs.

Experiments: the experimental results are relatively weak, with a synthetic dataset and the IHDP dataset. It would make the paper stronger if the authors could experiment with real human responses and real-world datasets.

Presentation: I feel the authors should expand on the introduction of Mozannar and Sontag [2020] for the paper to be self-contained. Currently it is hard to understand unless readers have read Mozannar and Sontag [2020] separately.

[1] Gao R, Saar-Tsechansky M, De-Arteaga M, et al. Human-AI collaboration with bandit feedback. IJCAI-21.
[2] Gao R, Yin M. Confounding-robust policy improvement with human-AI teams. arXiv preprint arXiv:2310.08824, 2023.

Questions

See weakness.

Limitations

Yes

Author Response

Related work: Thank you for pointing this paper out! As this very relevant paper has also been brought up by another reviewer, we choose to comment on it in detail in the "general" comment at the top. We should definitely have caught it, and we will refer to it, revise our claims, and discuss it in the revised version of our paper, as detailed in the aforementioned comment. We will also make sure to refer to and discuss the earlier work by Gao et al. (2021), which deals with the unconfounded case.

Binary action: Indeed our method can be readily extended to multiple actions. We found the presentation to be simpler (including visualizing the intervals as we do in the appendix) for the binary case.

Theoretical results - the bounds validity assumption: The Theorem's claims hold for any $x$ for which the bounds include the true potential outcome. Thus, if the bounds obey this for a fraction $1-\delta$ of the $x$'s, the Theorem will hold for the same fraction of cases. Bounds that include quantile components (as the B-learner bounds do) can be robust to a fair degree of noise in the random variable. Oprescu et al. show (in their Corollary 2) that the B-learner bounds are valid on average. More generally, the goal of the theorem is to show that the costs we derived make sense and would lead a model to choose the best course of action.

Experiments - baselines and data: Thank you for the suggestions. We wish to point out that we did compare with an IPS-based method: the method of Kallus & Zhou 2020, denoted CRLogit and CRLogit L1 in our experiments. We further attempted to compare with the method of Gao & Yin; however, as we detail in the general comment to all reviewers at the top, we unfortunately could not obtain an implementation of their code or experimental setup, and could not replicate it fully based on the details in the arXiv paper.

We wish to point out that experimenting with real human responses would require an "active" experiment: since this is a causal problem we cannot merely use historical human decisions; rather, we would need to recruit real humans to act as decision makers in a setup where we know the causal ground truth. While we completely agree that this would be an ideal experiment, we believe that most causal inference papers in the community are not required to attain such a high standard, and we humbly ask the reviewer to take this into account. For example, in the Gao & Yin paper, while data of real human responses is used, they synthesize the risks and the hidden-confounding aspects of the data. Similarly, we use the IHDP dataset, which includes real human subjects, and induce hidden confounding in it.

Related work presentation - Expanding on Mozannar et al.'s work: Thank you for this important comment. We will make sure to improve our presentation of Mozannar and Sontag [2020] and ensure the paper is self-contained.

Comment

Thanks authors for the responses.

For the experimental baselines, I was referring to a pessimistic baseline with the PI bound from the B-learner, not the CRLogit baseline, where the bounds are from propensity scores.

Another question is why the B-learner policy is almost unaffected by the confounding strength, even when the assumption is violated.

Comment

Thank you for the baseline suggestion, it is indeed illustrative. Specifically, we have implemented the pessimistic baseline (PB henceforth) as you suggested, where in Eq. 3 we compare $Y(1)_{\text{lower}}$ and $Y(0)_{\text{lower}}$ in the "otherwise" condition, instead of deferring. We ran it on the IHDP experiment, adding a line to Figure 2a. Unfortunately we do not seem to be able to upload a new figure at this stage, so we describe the results qualitatively:

  • For values of $\log(\Lambda)$ up to about 2, the performance of PB is very similar to the B-learner policy (slightly higher mean but CIs strongly overlapping). The performance of PB is significantly lower than CARED for these values.

  • For values of $\log(\Lambda)$ between 2 and 4.5, PB performance increases, reaching a policy value of 15.2 (nearly as good as CARED, but without deferring), and then decreases again.

  • At the very highest values of $\log(\Lambda)$, where the performance of CARED and the B-learner drops as they defer most cases, the performance of PB also drops but remains better than that of CARED at the same high level of $\log(\Lambda)$.

We note that the comparison of this baseline to CARED and the B-learner is not exactly like-for-like, as the latter two are both "encouraged" to defer, especially for high levels of $\Lambda$. Interestingly, it means that at least for this specific dataset, at very high levels of hidden confounding where the B-learner and CARED defer most cases, the lower bounds still maintain some useful information.
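For concreteness, a minimal sketch of the two decision rules compared above (hypothetical function names; not the exact implementation used in the experiments):

```python
# Sketch only: the two decision rules discussed above, with hypothetical names.
DEFER = "defer"

def bounds_policy(y1_lower, y1_upper, y0_lower, y0_upper):
    """Eq. 3 style policy: act only when the estimated CATE interval excludes zero."""
    cate_lower = y1_lower - y0_upper  # lower bound on Y(1) - Y(0)
    cate_upper = y1_upper - y0_lower  # upper bound on Y(1) - Y(0)
    if cate_lower > 0:
        return 1        # effect bounded above zero: treat
    if cate_upper < 0:
        return 0        # effect bounded below zero: do not treat
    return DEFER        # interval crosses zero: defer to the expert

def pessimistic_baseline(y1_lower, y1_upper, y0_lower, y0_upper):
    """PB: identical to the bounds policy, except the 'otherwise' branch
    compares the two lower bounds instead of deferring."""
    decision = bounds_policy(y1_lower, y1_upper, y0_lower, y0_upper)
    if decision == DEFER:
        return 1 if y1_lower > y0_lower else 0
    return decision
```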

Regarding the relative stability of B-learner performance across levels of hidden confounding: In the synthetic data experiment (Figure 1) we do see the performance of the B-learner vary. For the IHDP experimental setup the B-learner indeed shows less sensitivity. One possible explanation is that in the IHDP dataset, decisions where the B-learner bounds are correct (e.g. a positive lower bound on the CATE for a case where $A=1$ is best) are the ones that tend to be deferred as $\Lambda$ increases; this is unlike CARED, which can actually learn when it is beneficial to defer, and thus defer cases where the CATE function is misleading.

Official Review
Rating: 5

This paper studies deferral policy learning when there is unmeasured hidden confounding observed by human experts but not recorded in data. The paper formulates it as a cost-minimization problem and derives a feasible surrogate loss. The method is shown to achieve better policy value on synthetic and semi-synthetic data.

Strengths

  • The paper is generally well-written.
  • Theoretical properties are explored.
  • The proposed method is easy to implement and the performance is validated by the empirical studies.

Weaknesses

The performance of the proposed human-AI collaboration system relies on a properly specified cost function. For the expert deferring cost and the action-assigning costs, the current algorithm sets them in a conservative way. As the authors mention, there are multiple ways to set these costs. Is the conservative cost always the best? If so, please provide theoretical and empirical support. If not, it would be helpful to clarify the situations when the conservative cost is inferior and to evaluate other cost functions in the empirical studies.

For the theoretical analysis, the main condition in Definition 1 requires coverage for all $Y(a)$, which is a strong and maybe infeasible requirement. In practice, the guarantee usually only holds for the expected coverage or the probability of coverage. How would this more practical validity definition influence the Theorem 1 claim?

Lastly, the paper states "there are no previous works that learn a policy with the option of deferral under hidden confounding," but it seems [1] considers a similar scenario of unmeasured confounding in the human-AI system. If so, it would be important to discuss the differences and use this method as a baseline in the simulation studies.

I look forward to the author's responses.

[1] Confounding-Robust Policy Improvement with Human-AI Teams (https://arxiv.org/abs/2310.08824)

Questions

  • In Sec. 4.1, the expert’s action space $\mathcal{Y}$ is not defined. Is it the same as the action space $\mathcal{A}$?

  • Are the observed data $(X_i, A_i, Y_i)$ only generated by human experts? Can they be generated by an algorithm?

  • In Figs 1 and 2, why does the value of the CARED policy not change over a wide range of $\Gamma$ values? And why does it not peak at the true $\Gamma$? It would be better to plot the true $\Gamma$ as a vertical line in these figures.

Limitations

N/A

Author Response

Cost function alternatives - conservative vs. optimistic costs: This is an excellent question. Under the validity assumption, the conservative approach is guaranteed to be correct whenever the expert is correct, while the optimistic costs do not have this guarantee. Thus, intuitively, the conservative costs would be better in cases where the experts are generally correct. More generally, neither the conservative nor the optimistic costs are always superior, as we show now: Let $(x,a,y)$ be a sample where w.l.o.g. $Y(1)>Y(0)$. We have the following sufficient conditions:

  1. When the expert is wrong, the conservative costs are strictly better than the optimistic costs when the following holds: $2\hat{Y}^{-}(x, 0) < \hat{Y}^{+}(x, 1)+Y(1) < \hat{Y}^{-}(x, 0)+\hat{Y}^{+}(x, 0)$.

  2. When the expert is wrong, the optimistic costs are strictly better than the conservative costs when the following holds: $2\hat{Y}^{-}(x, 1) < \hat{Y}^{+}(x, 0)+Y(0) < \hat{Y}^{-}(x, 1)+\hat{Y}^{+}(x, 1)$.

If neither of these conditions holds, both may be wrong or both may be correct.
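As a quick arithmetic check that the chain in condition 1 can indeed be satisfied (illustrative numbers only, not taken from the paper): take $\hat{Y}^{-}(x,0)=0$, $\hat{Y}^{+}(x,0)=4$, $\hat{Y}^{+}(x,1)=2$ and $Y(1)=1.5$, values compatible with valid bounds since $Y(1)\le\hat{Y}^{+}(x,1)$ and any $Y(0)<Y(1)$ inside $[0,4]$ is allowed; then $2\hat{Y}^{-}(x,0)=0 < \hat{Y}^{+}(x,1)+Y(1)=3.5 < \hat{Y}^{-}(x,0)+\hat{Y}^{+}(x,0)=4$, so the condition is non-vacuous.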

Bounds validity assumption: The Theorem’s claims hold for any $x$ for which the bounds are valid. Thus, if the bounds are valid for a fraction $1-\delta$ of the $x$'s, the Theorem will hold for the same fraction of cases. The goal of the theorem is to show that the costs we derived indeed make sense and would lead a model to choose the best course of action.

Related work: Thank you for pointing this paper out! As this very relevant paper has also been brought up by another reviewer, we choose to comment on it in detail in the "general" comment at the top. We should definitely have caught it, and we will refer to it, revise our claims, and discuss it in the revised version of our paper, as detailed in the aforementioned comment.

Expert’s action space: You are right, this is a typo. The action space is the same as the action space $\mathcal{A}$.

The source of observed data: In principle, the only requirement we have is that the process generating the actions in the historical data is the same one generating them at test time; our method is geared toward the case where the process generating these actions has access to extra information that is not reflected in $X$ but which is correlated with the outcome $Y$. Thus, while we are motivated by human experts, the method is not limited to that case.

CARED policy performance across $\Lambda$ values: In Figure 2b we see that for large enough $\Lambda$ the CARED policy does indeed change and its value decreases; this can also be seen to a lesser degree in Figure 2a. We will add a line corresponding to the true $\Lambda$ value in Figures 1 and 2, following your suggestion. Notably, for the IHDP hidden-confounding dataset this $\Lambda$ value will be estimated from the propensity scores, as this dataset was created by hiding a confounder from the original IHDP dataset.


Comment

The author's response addresses some of my major concerns and I update my evaluation accordingly.

Official Review
Rating: 6

This work learns a policy that can abstain from predicting an action in the case where actions are binary. Their idea follows previous work in the learning-to-defer literature for supervised learning. Their proposed surrogate can recover the optimal policy; they design the cost functions so as to satisfy this condition. Their experiments show the effectiveness of the proposed method.

Strengths

  • They show the surrogate loss has the same optimal solution as the original expert-machine loss.
  • Their experiments on synthetic data and IHDP hidden confounding show their method CARED outperforms baselines consistently in terms of policy value.

Weaknesses

  • The authors missed some references on learning to abstain such as [1]

[1] Yin, Tongxin, Jean-Francois Ton, Ruocheng Guo, Yuanshun Yao, Mingyan Liu, and Yang Liu. "Fair Classifiers that Abstain without Harm." In The Twelfth International Conference on Learning Representations.

Questions

  • The cost function $C(x_i,a)$ is the worst-case regret. I wonder what the results would be if an alternative such as mean regret or best-case regret were used.
  • L240: I wonder if the Rademacher complexity varies with the NN architecture, as the paper claims any NN with weight decay or dropout has the same Rademacher complexity.
  • Did the authors consider the fact that the number of samples the expert can process is limited?

Limitations

  • The cost function relies on bounds on the potential outcomes, which are often unknown in real-world applications. This eventually reduces to a dependence on the B-Learner. I wonder if the authors considered alternatives to the B-learner for estimating the bounds.
Author Response

References on Learning to Abstain: Thank you for bringing this paper to our attention. While our paper has been inspired by the work of Mozannar & Sontag on deferral which adopts an end-to-end approach, this work presents an alternative, post-hoc approach to the problem of (non-causal) deferral. We leave it to future work to consider how the approach of Yin et al. above can be adapted to the causal case, where no ground-truth labels are available.

Alternatives to the cost function: This is a very important research direction, which extends to a more general question: how to choose the best set of costs for a specific problem? Each choice (for example, an average cost) would imply different trade-offs, which would be interesting to explore.

Rademacher complexity of NN architectures: Indeed the specific complexity would depend on the architecture. Our point was merely to state that this condition is not vacuous and is fulfilled by a range of widely used approaches. We refer the reader to the following papers which discuss this subject with far more nuance. We will mention this point in the paper.

  • Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in neural information processing systems 30 (2017).
  • Neyshabur, Behnam, et al. "Towards understanding the role of over-parametrization in generalization of neural networks." arXiv preprint arXiv:1805.12076 (2018).
  • Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. "Size-independent sample complexity of neural networks." Conference On Learning Theory. PMLR, 2018

Impact of limited expert sample processing: This is an important point. We believe that in many cases the deferral rate would be set a priori according to such limits on the experts’ time and capabilities. This motivates the comparison in Figure 2b, which asks “given a fixed deferral rate, which policy would perform the best?”: for example, if we know that experts can only deal with 30% of the samples, we can fix the deferral rate accordingly.

Reliance on the B-Learner for estimating bounds: We assume we are working with observational data, which is the case in real-world applications. In this case, for each sample we observe the treatment that was given and the corresponding outcome; to know the right treatment for each sample we therefore need to estimate the other potential outcome. Our theoretical guarantees hold for any model that provides bounds on the CAPO and for which Assumptions 2-5 hold. In practice, the algorithm works with any model that gives bounds on the CAPO even if these assumptions do not hold, though for such a model our theoretical guarantees might not carry over. Thus, the B-Learner is just one possible model that our method can use. Our choice of the B-Learner is due to the strong guarantees it provides, such as validity, sharpness, robustness, and the ability to perform well on moderate amounts of data.
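To illustrate this plug-in structure, a minimal sketch (hypothetical interface and names; the worst-case-regret expression is one plausible instantiation of a conservative cost, not necessarily the exact formula in the paper):

```python
# Sketch: any estimator exposing lower/upper CAPO bounds can play the B-Learner's role.
from typing import Protocol

class CapoBoundEstimator(Protocol):
    """Hypothetical interface: bounds on the conditional average potential outcome Y(a) at x."""
    def lower(self, x, a: int) -> float: ...
    def upper(self, x, a: int) -> float: ...

def worst_case_regret_cost(bounds: CapoBoundEstimator, x, a: int, actions=(0, 1)) -> float:
    """Pessimistic cost of committing to action a at x: the best another action
    could still achieve (its upper bound) minus the worst a could yield (its lower bound)."""
    best_alternative = max(bounds.upper(x, b) for b in actions if b != a)
    return max(0.0, best_alternative - bounds.lower(x, a))
```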

Comment

Thanks for the rebuttal. I will maintain my score.

Official Review
Rating: 7

The paper combines two frameworks: optimizing surrogate losses for learning to defer, and bounds on the CATE under unobserved confounding under the marginal sensitivity model. The challenge with applying the surrogate loss from Mozannar and Sontag to the causal setting is that the costs of the various classification outcomes are unknown and need to be estimated. The paper proposes essentially deferring when the causal effects are not (partially) identified to be strictly positive or negative, and develops conservative/optimistic characterizations of these losses. Deferral consists of reporting the recorded $a_i$. When the unobserved confounding is exactly correlated with potential outcomes (as is the case with simulation specifications of unobserved confounding that are "favorable" to robust-bounds methodology), deferral can be adaptive and improve upon both the expert policy and valid bounds.

优点

Significance:

While the paper connects two prior areas (surrogate losses for learning to defer, and PI bounds), there is some work needed to establish the connection, and the paper does a good job of doing so and of showing that such an approach can obtain improvements upon the current conservatism of robust-only approaches (under implicit assumptions on the expert-ness of the underlying behavior policy). This is well illustrated in the empirics.

Weaknesses

There seems to be an implicit assumption that the confounded behavior policy

Questions

  • Could the authors please describe if there is some relationship between implicit assumptions on regimes where this approach "does well", i.e. can strictly improve upon robust PI bounds policies, and recent work on "superoptimal" treatment regimes?

Superoptimal treatment regimes: Optimal regimes for algorithm-assisted human decision-making. MJ Stensrud, JD Laurendeau, AL Sarvet. Biometrika, 2024.

Limitations

Addressed sufficiently, although studying more explicitly 1) when the policy improvement is strict and 2) how assumptions on the expert-ness of the underlying confounded behavior policy relate to the performance of the conservative/optimistic losses given in the paper would strengthen the work.

Author Response

Implicit assumption on the confounded behavior policy: We would be grateful for a further clarification of this comment, as it seems to have been cut short.

Relationship between implicit assumptions on regimes where this approach "does well": This is a fascinating connection, as both our work and the cited work explicitly consider the case of human experts with access to information which the model does not have. The work by Stensrud et al. considers a different scenario, where using an instrumental variable and some clever identification results one can "override" the human expert. The main difference is the reliance on instrumental variables for identification, which is a serious drawback as such variables do not always exist. However, in future work we are considering merging their approach with MSM-type bounds. We will include a mention of their work and this connection in the revised version.

Exploring policy improvement conditions and expert-ness assumptions impact on loss performance:

  1. In the proof of Theorem 1 we outline conditions for when the policy improvement is strict, which rely on whether inequality (12) holds or not. Unfortunately this condition, which can easily occur in practice (as we see in the experiments), does not have a straightforward interpretation in terms of CATE bounds. Our best attempt at explaining it is as follows: assuming w.l.o.g. that $Y(1)>Y(0)$, this event occurs when the negative of a “narrow” CATE bound (narrow in the sense of taking the lower bound for the higher potential outcome minus the upper bound for the lower potential outcome) is larger than the lower bound minus the actual potential outcome. We will refer to this condition in the main paper.
  2. In general we cannot know for sure whether the experts are mostly correct or not, due to the fundamental problem of causal inference. However, if we assume they are generally correct, then one might wish to use the conservative approach and defer more often to the experts.
Comment

Thanks for the rebuttal and clarifications. Apologies for the cut-off comment. The main question was about connections between assumptions on the "expert-ness" of the underlying behavior policy and the improvement from deferral. E.g. the DGP used is one example of "adversarial unobserved confounding" that is exactly correlated with outcomes, which is also in some sense the best-case scenario for information in the behavior policy. The rebuttal comment about the technical assumptions for strict improvement was helpful here. Outlining these technical conditions in the main text would indeed be helpful.

Comment

Thank you for the explanation and for the support! We will make sure to detail the conditions on strict improvement in the main text, following the comments above. We will also give more space to the discussion of the connections between the hidden confounders and the expert behavior.

Official Review
Rating: 6

This paper extends the work of [Mozannar & Sontag 2020] on "learning to defer" to the causal inference setting with (bounded) hidden confounding. Compared to the original supervised learning setting, the causal setting does not have ground truth labels. Therefore this paper proposes to use estimated bounds on the potential outcomes to drive the deferral decisions, and formulates it as an optimization problem.

Experiments are conducted on synthetic and semi-synthetic (IHDP) datasets.

Strengths

  • Studies an interesting and relevant problem
  • Writing is generally clear.

Weaknesses

  • Although the writing is generally clear, the flow could be improved to better explain some of the baselines / obvious / naive approaches, and to justify why the (arguably more complex) proposed approach is necessary.

Questions

Major:

  • L227 Eqn (3) B-learner policy: this is used as a baseline approach in the experiments. Currently it is presented in the theory section, and for a moment I thought it was your proposed approach. It is such a straightforward definition, a simple approach that also outperforms the baselines. I think this needs to be presented earlier in the text, and you need to make it clear why it is not good enough and how your approach is different.

Others:

  • L37: "human expert typically has access to the hidden confounders" - in what form? Do you mean human decisions are more likely to account for hidden confounders?

  • L181-182: in the definition of the pessimistic and optimistic deferral costs, what is $\max_{a' \neq a_i}$; aren't there only two actions? L190: I assume it is defining the same thing, but different notation is used.

  • L202 Algorithm 1 is never referenced in the methods section. Also it seems to simply repeat the expressions already presented in the text above, not sure this algorithm is necessary.

  • L224 what does bounds are valid on average mean? How does that affect the theoretical results if some bounds are invalid?

  • L227 "This policy defers if and only if the CATE upper bound is positive and the CATE lower bound is negative." You may elaborate that intuitively if bounds are both positive then the CATE is positive (and vice versa) so it's clear what to do, and the deferral situation is when it's unclear the true CATE is >0 or <0 and unclear what to do.

  • L256 Theorem 2: the generalization bound compares the training loss and the expected loss, which are both L_CE, but how is the result of this learning process related to Eqn (1), the true optimization objective?

  • Theorem 2: What does the generalization bound mean, and how does it compare to other results? Does it tell us what we should do in practice when using your approach?

  • experiments should compare to a random deferral strategy as a baseline.

  • Fig 2a "proposed method outperforms both baselines" -> this is not true; for large log(Λ) the proposed approach is overtaken by the B-learner.

  • Fig 2b: as the deferral rate increases beyond 60%, performance drops. I think it is worth commenting on whether, if we allow for up to 80% deferral, the policy learning can figure out that it only needs to defer 60% of the time. Alternatively, what are practical ways of selecting this deferral ratio?

Minor issues:

  • Citation style: there are places where the reference in text should be parenthesized but are not.
  • Typo: There's a slash on L13 of the abstract.
  • L122-124, L138-139: deferral policy defined twice
  • L132 "ho" -> "how"
  • L204 "we see than" -> we see that
  • L249 says $Y^{+}_1(x) - Y^{+}_1(x)$. I think there is a typo.
  • L271 says three variants, but only two are listed
  • L347 "We note there when learning to act" -> we note that

Limitations

Main limitations noted include access to expert actions in observational data, and that experts' behavior does not change when they receive the machine's deferral.

Comment

Comparison to random deferral baseline: Thank you for this excellent suggestion! We attach the new Figure 2b in the general comment at the top. Following your suggestion, we have implemented a random deferral strategy, where the action taken in non-deferred cases is based on a T-learner, using the same outcome base-learners as those used in the B-learner. The comparison with the random baseline shows that, generally speaking, both the B-learner policy and CARED defer "correctly", i.e. defer cases where they might take incorrect actions, with CARED significantly outperforming the B-learner over a wide range of deferral rates.
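A minimal sketch of this random-deferral baseline (hypothetical names; the CATE estimate stands in for the T-learner, i.e. the difference of the two outcome regressions):

```python
# Sketch: defer uniformly at random at a fixed rate; otherwise act on the sign
# of a T-learner style CATE estimate (difference of two outcome regressions).
import numpy as np

def random_deferral_policy(x, deferral_rate, cate_estimate, rng=None):
    rng = rng or np.random.default_rng(0)
    if rng.random() < deferral_rate:
        return "defer"
    return 1 if cate_estimate(x) > 0 else 0
```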

Figure 2a results: You are correct to point out that for large log(Λ) B-learner slightly outperforms our approach, and we will be more careful in our presentation of this result. However, we wish to point out that as seen in Figure 2b, what is actually happening is that for similar log(Λ) values the B-learner defers much less than our approach, which explains the difference in performance. When comparing per deferral rate, our approach outperforms B-learner. In addition, it seems that in order to get the B-learner to defer at high-rates one would need to use extremely high values of log(Λ). We will add this more nuanced presentation of the results.

Figure 2b results: Estimating the policy value of a policy which includes human experts and hidden confounding can only be done using a real-world experiment. In principle, when such an experiment is conducted different deferral rates can be compared and the optimal one chosen. Alternatively, one can make various assumptions about the human expert and try to estimate policy values with hidden confounding for such hybrid policies. This is an interesting question which we leave for future research.
In practice, in many cases the deferral rate could be set or constrained a priori to some narrow interval based on economic and administrative constraints and preferences, for example the capacity of human experts. We will mention this point in the discussion.

Limitation: Static expert behavior: A good deal of observational data includes expert actions; for example, many electronic health care datasets include the treatments and medications prescribed by expert clinicians. These same clinician populations might be assisted by an algorithmic model based on our approach.

However, you raise an excellent point regarding the way humans might react to the presence of a model-based action recommendation system. We believe this is a fascinating avenue for future work in collaboration with experts on human decision making: do human experts change their actions when they know their decisions are those that were deferred to them by a machine?

Author Response

Baselines presentation: Thank you for this useful suggestion; we will make sure to better explain the baselines and justify the merits of our approach. See also our reply to the major question.

Clarification of baseline and differentiation from proposed method: Thank you for this important comment - we indeed should have explained this point better. We will revise and explain in more detail the baseline policy we denote by $\pi_{bounds}^Q$, which defers if the estimated CATE interval crosses 0. This approach is indeed natural and has been used for deferral in many previous papers (including papers which only deal with statistical uncertainty). It was also used in the B-Learner paper.

We will also give a more detailed discussion of the ramifications of Theorem 1: in particular, the Theorem shows that whenever the bounds include the true potential outcome, the action that minimizes our proposed loss function is at least as good as the action implied by the baseline “bounds” policy, as well as the human expert policy. We further show that under very reasonable conditions there will be a non-trivial proportion of cases where the action minimizing our loss function will be strictly better than the one chosen by the bounds policy and expert policy. Finally, our experiments show empirically the added value of our approach compared to these baselines. The reasoning for why our approach might perform better is that it directly optimizes for the deferral classification, as opposed to classifying a sample as “deferred” solely based on some uncertainty measure. This last point has already been made by Mozannar & Sontag 2022 for the non-causal case.

We will make sure to clarify these important points in the revised text.

Human expert access to hidden confounders clarification: When learning how to act based on historical data where the actions in the data were made by human experts, the confounders are by definition factors that affected both the action choice and the outcome. Thus, if at test time we have a human decision maker drawn from the same distribution as those in the historical data, they will by necessity have access to all the confounders. Notably, this is true even if those confounders only affect the decision maker unconsciously: the point is that we have access to actions taken by the “true” policy which generated the historical data. Consider for example the case where a human clinician’s decisions are affected by the demeanor of the patient: this would typically be considered a hidden confounder, as it is not recorded in health records. However, at test time the clinicians can still see the patients’ demeanor and be affected by it - in that sense they have access to a confounder which is hidden from the eyes of any system which is merely based on recorded data.

Deferral costs notation: Indeed, while our approach can be extended to multiple actions using the max operator as in L181, in our paper we focus on the binary case. We will correct this and use the notation for binary actions as in L190.

Algorithm 1: Thank you for your note, we will consider whether we can do without the algorithm, saving the corresponding space.

Bounds validity: This sentence refers to Corollary 2 of Oprescu et al. 2023 [2], which gives a precise statement. Informally it means that with high probability, the average of the estimated upper bounds over the dataset is larger than the average of the true upper bounds given by the MSM. A symmetric result is true for the lower bounds.

[2] Oprescu, Miruna, et al. "B-learner: Quasi-oracle bounds on heterogeneous causal effects under hidden confounding." International Conference on Machine Learning. PMLR, 2023.

CATE bounds' deferral policy explanation: Thank you for the comment. You are correct in pointing this out and we will make sure this point is explained more clearly.

Theorem 2 - Optimization objective: L is a non-convex loss function that is hard to optimize, so we propose optimizing L_CE, which is convex and easier to optimize. We show that minimizing L_CE converges to the optimal solution of L. Then, since L_CE is easier to work with and converges to the desired solution, we show the theoretical guarantees on L_CE.
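As a rough illustration, a generic cost-sensitive cross-entropy surrogate of this kind is sketched below (an assumed form with hypothetical names; the exact weighting used for L_CE in the paper may differ):

```python
# Sketch: a generic cost-sensitive cross-entropy surrogate over K actions plus a
# deferral option. Lower-cost options receive a larger weight on their log-probability,
# so minimizing the loss pushes probability mass toward the lowest-cost option.
import torch
import torch.nn.functional as F

def cost_sensitive_ce(logits: torch.Tensor, costs: torch.Tensor) -> torch.Tensor:
    """logits: (batch, K+1) scores for the K actions and the deferral option.
    costs:  (batch, K+1) estimated cost of each option (e.g. derived from CAPO
            bounds for actions, and from the observed expert action for deferral)."""
    log_probs = F.log_softmax(logits, dim=-1)
    weights = costs.max(dim=-1, keepdim=True).values - costs  # non-negative weights
    return -(weights * log_probs).sum(dim=-1).mean()          # convex in the logits
```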

Theorem 2 - Generalization meaning: Theorem 2 implies that our optimization objective is reasonable in the sense that attaining good performance on the training set by minimizing it can indeed lead to good performance on unseen samples. In principle one could have had an objective which cannot be learned, or where the errors cannot be controlled even for arbitrarily large samples.

The reason this result is not completely standard is the fact that we optimize a cost sensitive objective where the costs themselves are estimated from data. This is similar in spirit (though not at all in the specifics) to cases in causal inference where objectives weighted by the inverse propensity score are considered, and one has to account for the error in estimating the propensity scores themselves. Our approach using estimated costs turns out to be tractable, but still requires some care in its derivation and proof. As is usually the case in generalization bounds, the result does not translate immediately into a practical guide, but it does imply that the method is expected to respond “well” (i.e., similar to standard supervised learning) to changes in sample size and the complexity of the function class.

Comment

Thank you authors for providing responses to reviews. I have updated my rating from 5 to 6.

Author Response

We thank all of our reviewers for their insightful comments and constructive feedback, and we are encouraged by your support. Your comments have helped us clarify and strengthen this work, and we are grateful for them. We will address here a major comment regarding the paper of Gao & Yin, and address additional comments in the individual responses to the reviewers. If there are remaining or new questions please let us know and we will do our best to address them.

We wish to thank reviewers fb21 and r89L for bringing to our attention the paper “Confounding-Robust Policy Improvement with Human-AI Teams” by Gao and Yin, from October 2023 [1]. We were unaware of this paper, which has the same motivation as ours: learning to act in conjunction with a human expert, when the data used for learning has a bounded degree of hidden confounding. Thus, our claim in the Discussion of being the first to study the problem is incorrect, and we apologize for our oversight. We will add a discussion of this very relevant paper, and of the similarities and differences of their approach compared to ours. The main differences between the papers are:

  1. Gao & Yin optimize an inverse propensity weighted objective, while we optimize a cost-sensitive objective. This means that unlike our approach, their approach does not model the outcome directly. Moreover, the weighted objective they employ leads to a situation where only cases where the proposed policy agrees to a high degree with the observed policy would be taken into account. Our approach can be more efficient as we can make use of bounds that merge in a near-optimal manner the propensity score and outcome models such as the B-learner.

  2. Gao & Yin provide a bound on the regret of the policy. Their bound assumes a well-specified model and scales with the square of the smallest inverse propensity score in the historical policy, which could be quite small. Our result does not bound the regret directly: instead, it shows that we can obtain low values of the weighted cost function while taking into account the error in estimating the CAPO bounds. We further show that the pointwise minimum of the loss function is better than both the human and the bounds-policy. Our theoretical results are thus not directly comparable.

  3. Gao & Yin also address the case where there are multiple human experts, each with their own policy. This requires the specific experts who were used to generate the historical data to also be the experts that use the system upon deployment. This is not a case we have examined, and we will look into its ramifications in future work.

Regrettably, no code is available online for replicating the method of Gao & Yin or their experimental setup. We have contacted them, but unfortunately they have said that the code cannot be made available at this time due to constraints on their side. We have attempted to replicate their simulation, which is similar to ours, but several crucial details are not reported in the paper and we could not replicate their numbers.

We believe both papers have their merits and drawbacks, as we approach the same challenge using quite different algorithmic approaches. We hope that in the future a direct experimental comparison will be possible.

In the attached PDF we add, following the suggestion of reviewer Vchf, a version of Figure 2b of our paper with an added random deferral baseline.

[1] Gao, Ruijiang, and Mingzhang Yin. "Confounding-robust policy improvement with human-ai teams." arXiv preprint arXiv:2310.08824 (2023).

Final Decision

This paper proposes a framework for learning to defer under hidden confounding by relying on partial identification of causal effects and deferring (in a cost-sensitive objective setting) when the bounds cover 0 (essentially, when the effect is not guaranteed). This is an improvement over existing work in attempting to account for the presence of hidden confounding. The authors demonstrate that their designed objective has the same optimum as the original human-expert loss. Empirical evaluation on IHDP data demonstrates the utility of their approach.

All reviewers found the work important.

An important clarification raised was the comparison with Gao & Yin, who attempt to address a similar problem; however, they apply the worst-case hidden-confounding assumption under the MSM model to an IPW loss on the causal effect (as opposed to direct partial identification with a cost-sensitive loss). The rebuttal provided a fairly detailed description of the similarities and differences. The authors have indicated that they were not able to reproduce the experimental claims of Gao & Yin and that the code is not available.

Based on reviewer suggestions, the authors have added a random baseline and improved and clarified some specific issues. All things considered, I am recommending acceptance, considering that reviewers found the work of value despite it not being the first to address the problem, with the following updates for camera-ready:

  1. Most importantly, authors should include all noted differences between Gao & Yin in the literature review, including modifying the claim made in the paper on novelty and highlight the complementary aspects of their approach.
  2. I strongly encourage the authors to include a concrete comparison with Gao & Yin on a simpler empirical setup where reproducing results is possible, and more importantly, attempt to reproduce their results in the actual setup used in the paper for empirical comparison. This is necessary for a fair comparison even if the authors were not aware of the work before the reviewers pointed it out.
  3. Reviewers have additionally pointed out several typos as well as some general presentation issues that I would like the authors to deliberately address for the camera-ready.
  4. Please also check the simulation setup and why a policy (CARED specifically) should not be sensitive to the strength of confounding and provide justification of the behavior in discussion of these empirical results.