Counterfactual Delayed Feedback Learning
Abstract
Reviews and Discussion
The paper provides a formalization of heterogeneous treatment effect (HTE) estimation with delayed response. The authors show under what assumptions identifiability is satisfied in their setup. They present an EM algorithm called CFR-DF for this task and compare their method against standard baselines for estimating HTEs in the causal ML literature.
Strengths
Complete piece of work: the notation is consistent, the paper is well structured, and technical proofs are provided.
Weaknesses
- Unclear motivation: In medical studies, a sufficiently large time window is typically considered for a treatment response to be observed. Hence, in all relevant datasets the eventual outcome is known. As the method is trained fully offline, I do not see why there is a need for a method that models the time of delayed treatment response (if the response time is not of direct interest). Instead, this is simply a pre-processing issue.
- Delayed feedback as censoring problem: There is plenty of literature on censoring. I wonder whether the proposed delayed feedback could alternatively be framed as such.
- Binary outcome: since the outcome is binary, I do not see the benefit of this work compared to survival analysis. The paper briefly mentions this literature stream (line 116) but I do not see why this is a completely different setting.
- Straightforward method: the proposed CFR-DF is, to a large extent, similar to existing works and therefore fairly straightforward.
- Restrictive assumptions: both Monotonicity and Principal Ignorability are quite restrictive for identifiability. On a more positive note, the authors state that the assumptions are difficult to verify.
- Short term proxy literature: There is a stream of literature that covers long-term HTEs when short-term HTEs are available. Although this is a slightly different setting, it would be fair to acknowledge this literature stream.
- Integral Probability Metric: The function class of the IPM is not mentioned. Different function classes result in different IPMs. This is an important detail.
- Baseline implementation: There are no details on the baseline implementation (in particular, hyperparameter tuning). Further, I wonder whether the baselines were trained on the same data or whether the authors removed all instances with Y=0 and D=infinity. Based on my understanding, the authors did the former, which may introduce a systematic bias due to incorrect/noisy observations. A simple but important baseline would be to drop the mentioned observations. This would increase the variance of the estimate, but likely with little impact. Likewise, another baseline would be to add D as another covariate to the data (it is unclear whether this was done). Further, were the baselines trained on the eventual outcomes or on the outcome at the time of treatment? This makes a large difference.
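To make the two suggested baselines concrete, here is a minimal sketch of the corresponding pre-processing steps. The row layout (keys `x`, `w`, `y`, `d`), the function names, and the capped-delay-plus-flag encoding of D=infinity are assumptions made for illustration; they are not taken from the paper.

```python
import math

def drop_unresolved(rows):
    """Baseline A: discard rows with y == 0 and d == infinity (possible false negatives)."""
    return [r for r in rows if not (r["y"] == 0 and math.isinf(r["d"]))]

def add_delay_covariate(rows, cap):
    """Baseline B: append the delay d to the covariates x.

    An infinite delay cannot be fed to a model directly, so (as an assumption)
    it is encoded as the cap value plus a binary 'no response seen' flag.
    """
    out = []
    for r in rows:
        unseen = math.isinf(r["d"])
        delay = cap if unseen else min(r["d"], cap)
        # Build a new list so the input rows are left unmodified.
        out.append({**r, "x": r["x"] + [delay, 1.0 if unseen else 0.0]})
    return out

# Hypothetical toy data: covariates x, treatment w, observed outcome y, response time d.
rows = [
    {"x": [0.2], "w": 1, "y": 1, "d": 3.0},
    {"x": [0.5], "w": 0, "y": 0, "d": math.inf},
    {"x": [0.9], "w": 1, "y": 1, "d": 12.0},
]
```

Either transformed dataset could then be passed unchanged to any standard HTE baseline, which would show how much of the reported gap is due to the handling of unresolved observations.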
Minor:
The paper requires proofreading. There are quite a few typos and grammatical errors.
Questions
- Why is this not a censoring problem?
- What is the key difference to the survival analysis literature?
- Why is the estimation task not simply a pre-processing problem? (see weakness)
- Can you report details on baseline implementation?
- What is the exact IPM?
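To illustrate why the function class matters, the sketch below computes two common IPMs on the same one-dimensional samples: the Wasserstein-1 distance (1-Lipschitz functions) and the maximum mean discrepancy (unit ball of an RKHS, here with a Gaussian kernel). The function names, the bandwidth, and the toy samples are assumptions for illustration, and nothing here implies which IPM the paper uses.

```python
import math

def wasserstein1_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1D samples (sorted matching)."""
    assert len(a) == len(b), "this shortcut requires equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def mmd_gaussian(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD with a Gaussian kernel."""
    def k(x, y):
        return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

    def mean_k(u, v):
        return sum(k(x, y) for x in u for y in v) / (len(u) * len(v))

    squared = mean_k(a, a) + mean_k(b, b) - 2 * mean_k(a, b)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors

treated = [0.0, 1.0, 2.0]
control_near = [1.0, 2.0, 3.0]
control_far = [10.0, 11.0, 12.0]

for control in (control_near, control_far):
    print(wasserstein1_1d(treated, control), mmd_gaussian(treated, control))
```

The Wasserstein-1 value grows linearly with the shift between the groups, whereas the Gaussian-kernel MMD saturates (it cannot exceed sqrt(2) for a kernel bounded by 1), so the resulting balancing penalty behaves quite differently depending on the choice.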
Unlike traditional HTE estimation, the paper addresses the challenge of estimating HTEs when treatment effects are delayed. From my understanding, the setting can be described as one in which the true outcome may be censored by the observation time window. The authors propose CFR-DF, a new method that simultaneously predicts potential outcomes and response times, using an EM algorithm to handle latent variables representing the eventual potential outcomes.
Strengths
- The problem, from my own perspective, is interesting, novel, and practically relevant. Delayed feedback censored by the observation time window is actually a very common problem for online retailers.
- The paper is theoretically rigorous, including formulations and proofs of identifiability for both eventual outcomes and response times in specific subgroups, which reinforces the robustness and applicability of the proposed method.
- The computational evaluation is comprehensive.
- The paper is well written and easy to follow. I like Figure 1 a lot, which presents the problem very well.
Weaknesses
I enjoyed reading the problem formulation and theory a lot. However, my main concern is about the technical contribution.
- The identifiability results and analysis seem to be very standard, and I am not sure whether there is any novel part the authors want to highlight.
- The idea of introducing a latent variable for the missing information and adopting the EM algorithm is very elegant, which I really enjoyed. But, at the same time, it is also a bit standard and straightforward. Are there any additional technical challenges that the authors want to point out?
- Generally, is it fair to say that the main contribution of the paper is on the formulation side?
Questions
Please see the weaknesses above.
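For reference, the latent-variable EM construction referred to above can be sketched in a heavily simplified form. The sketch assumes no covariates, no treatment arm, a Bernoulli eventual outcome with probability `p`, and exponentially distributed delays with rate `lam`; it is a generic delayed-feedback EM under these stated assumptions and not the authors' CFR-DF. All names are hypothetical.

```python
import math
import random

def simulate(n, p, lam, seed=0):
    """Draw (observed_label, observed_delay_or_None, window) triples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        window = rng.uniform(0.5, 8.0)       # observation time, varies per unit
        eventual = rng.random() < p          # eventual (uncensored) outcome
        delay = rng.expovariate(lam) if eventual else math.inf
        if delay <= window:
            data.append((1, delay, window))  # response seen within the window
        else:
            data.append((0, None, window))   # true negative OR not yet responded
    return data

def em_delayed_feedback(data, iters=100):
    """Estimate (p, lam) from window-censored observations via EM."""
    p, lam = 0.5, 1.0                        # arbitrary starting values
    pos_delays = [d for y, d, _ in data if y == 1]
    neg_windows = [t for y, _, t in data if y == 0]
    n, n_pos = len(data), len(pos_delays)
    for _ in range(iters):
        # E-step: posterior probability that an observed negative is an
        # eventual positive whose response has simply not arrived yet.
        q = [p * math.exp(-lam * t) / (1 - p + p * math.exp(-lam * t))
             for t in neg_windows]
        # M-step: closed-form maximizers of the expected complete log-likelihood.
        p = (n_pos + sum(q)) / n
        lam = n_pos / (sum(pos_delays) + sum(qi * t for qi, t in zip(q, neg_windows)))
    return p, lam

data = simulate(10000, p=0.6, lam=0.5, seed=0)
naive = sum(y for y, _, _ in data) / len(data)   # treats every observed 0 as a true 0
p_hat, lam_hat = em_delayed_feedback(data)
print(naive, p_hat, lam_hat)
```

The naive positive rate underestimates `p` because not-yet-observed responses are counted as negatives, while the EM estimate reallocates part of that mass. The sketch also shows how little machinery the basic construction requires, which is the point behind the question about additional technical challenges.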
The work studies causal inference in the potential outcomes framework in the super-population setting when outcomes are binary. The authors propose a model for (heterogeneous) treatment effect estimation when positive responses are observed only after some time delay; the causal estimands of interest are (a) effect of treatment on outcome (conditioned on covariate) and (b) for those individuals that would have a positive outcome regardless of treatment, effect of treatment on response time (conditioned on covariate). The work shows that both estimands are identifiable under some assumptions, proposes an EM-style algorithm for estimation, and evaluates the algorithm experimentally.
Strengths
- The work proposes an interesting modeling problem for (heterogeneous) treatment effect estimation that seems to be reasonably motivated by hypothetical real-world applications. The second estimand in particular is well-suited to the model.
- Experimental evaluations compare against a variety of alternate methods and appear to be thorough.
Weaknesses
- I had a lot of trouble understanding the model until the experimental section (see below). Additionally, I understand that delayed feedback (and the risk of 'false negatives') is a real issue in many settings where estimating treatment effect might be of interest. However, it's not obvious that both response time and observation time should be random variables; the latter seems like something that is typically fixed and/or controlled, even in observational data.
- It is surprising to me that only identifiability results are provided -- nothing about whether the proposed algorithm actually achieves identification (e.g. unbiasedness, variance estimation, convergence rates), which means that the strength of the method relies almost entirely on experimental results. In terms of presentation, it also would have been nice for identifiability to be formally defined.
- The experimental setup is somewhat contrived and hard to follow -- it is not clear to me why treatment probabilities, response times and observation times should be simulated this way.
Various presentation notes:
- Model and notation can generally be made clearer: that response times AND observation times are random variables; both taken to be in rather than ; that , , and (apparently?) can depend arbitrarily on . Note that lowercase will be used to denote density of and used for (discrete) probabilities. The "given data point" discussed on L262, i.e. the apparently-canonical data point format , should be introduced in section 2.
- Minor, but probably good to note in preliminaries that super-population setting assumes (iid) samples from an underlying population with well-defined means etc.
- L104, L121: sentence grammar
- Sentences at L138-L140: what conclusions?
- L144: STUVA/SUTVA
- L377 what's ?
Questions
- Is it important that HTE is the end-goal, rather than ATE (for either estimand)? Is the delayed feedback problem solved for ATE?
- L406-407 / Fig 3: is this the parameter to the exponential distribution discussed on L399?
- L398 -- how does control heterogeneity?
The paper deals with a very practical scenario in which the response is delayed. It incorporates the impact of the observation time and the effect of delays into the likelihood estimation of the counterfactual observation. The authors simultaneously learn the response times and the counterfactual response times. Finally, the paper shows experimental validation against several baselines.
Strengths
I believe that there are two key strengths of the paper:
(1) Novel problem (not approach) with clear and important practical application.
(2) Neat formulation (with some caveats)
Originality
The problem is quite new, but the formulation is not very technically challenging and hence did not require much creativity. I have some comments, which I elaborate on in the weaknesses section.
Soundness
The formulation is decent and sound.
Weaknesses
(1) Notation and presentation are not clear: Note that this is not a minor suggestion but requires a major revamp of the paper. To give a few examples, and are used interchangeably (line 236--238). T is a random variable and t is an instantiation of the random variable T; equations barely have numbers; the second equation in line 229 is universally true: P(A,B) = P(A)P(B|A). In Figure 2, \tilde is used on top of D.
(2) I did not find a clear statement of what the observed data are. Is the response time given in the data? Line 263 mentions (x_i, w_i, t_i, y_i ^t), but it does not mention D. However, the paper maximizes the likelihood of D=d as well. How practical is it to obtain the response time in data? Does it come from a survey? In medical treatment, the response time would be noisy. How would one account for this?
(3) Experiments: Are the winning results in Tables 2 and 4 statistically significant? There is quite a bit of overlap in the mean ± STD ranges across methods. I am not sure there is enough investigation in the comparisons in Tables 2 and 4. It is not mentioned how, or whether, the baselines were adapted to take delayed feedback into account. How are the hyperparameters tuned?
(4) Lack of novelty: The technical contribution is a bit limited and lacks creativity, although I believe creativity was not needed for this problem since the derivation is straightforward.
(5) I could not find any discussions on limitations in the paper.
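Regarding point (3), one simple way to check significance from reported means and standard deviations is Welch's unequal-variance t-test. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom from summary statistics only. The function name and the example numbers are hypothetical and are not taken from Tables 2 or 4.

```python
import math

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary stats.

    std1/std2 are sample standard deviations over n1/n2 independent runs.
    """
    v1 = std1 ** 2 / n1          # squared standard error of the first mean
    v2 = std2 ** 2 / n2          # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical example: error of a proposed method vs. a baseline over 10 seeds each.
t_stat, df = welch_t(0.50, 0.05, 10, 0.55, 0.06, 10)
print(t_stat, df)
```

The resulting statistic would be compared against the critical value of a t distribution with `df` degrees of freedom. Reporting this, or a paired test if the methods share the same seeds and data splits, would address the concern about overlapping ranges.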
Questions
See above.
This paper studies heterogeneous treatment effect estimation with delayed response, proposing an EM-style algorithm for estimation. The reviewers appreciate its strength in formulating a novel problem and providing theoretical analysis of the results. However, there are shared concerns about unclear presentation, issues with experimental setup and baselines, and technical novelty. The majority of the reviewers recommend rejection and I agree with the reviewers.
Additional Comments on Reviewer Discussion
Given the weaknesses mentioned above and the lack of author response, I recommend rejection.
Reject