Conformal Prediction for Causal Effects of Continuous Treatments
Abstract
Reviews and Discussion
This paper proposes a conformal prediction method for the estimation of potential outcomes under continuous treatments and known/unknown propensity scores. The authors provide finite-sample validity guarantees and validate their method empirically on synthetic and real-world data.
Strengths and Weaknesses
Strengths:
- Novelty: First to provide conformal prediction intervals for continuous treatments with unknown propensities.
- Structured: The methodological contributions are well-structured (e.g., known propensity → soft intervention, unknown propensity → hard intervention) and carefully motivated (e.g., the discussion of challenges (a)-(c)).
Weaknesses:
- Weak Empirical Evidence:
- Despite the consistent results on Synthetic Datasets 1-2, such a DGP with a single covariate is too simplistic to relate to real-world applications.
- I really appreciate the attempt to complement these experiments with semi-real and real-world data. However, the experiments on the TCGA dataset (semi-synthetic) are poorly documented (unclear outcome of interest, treatment assignment, evaluation procedure, and ground truth) and lack comparison with other approaches.
- Similarly, the experiment on the MIMIC dataset (real data) looks incomplete: Figure 7 only shows the CP-estimated intervals (compared with MC dropout), which a reader can only evaluate by their qualitative width, not their coverage (how can one establish that yours outperforms MC dropout?). It is also unclear whether the counterfactual estimation on this dataset is not just trivial linear interpolation (and what are the dotted lines in Figure 7?). Finally, I do not understand why the intervals are shown per individual if the conditional (over the covariates) coverage is never discussed.
- Section 4: Despite the great high-level structure, I struggled both to follow and to interpret the two theoretical results (I am also not very familiar with the related work). The math sounds correct (though I may have missed some details), but I found it difficult to interpret both the variables (f, l, g, ...) and the formulations. In practice, the missing interpretations make it difficult to understand the limitations and practical validity of the method. Are the propensity score estimation and the calibration data size the only challenges of this approach (as reported in Section 6) to obtaining exact finite-sample guarantees?
Questions
Can you elaborate on the numerical stability of your method, particularly in Scenario 2 where the optimization is non-convex and involves small propensity estimates and Gaussian tilting?
Limitations
Weak (practically relevant) empirical validation and theoretical motivation that is difficult to interpret critically. If I were a physician, I would wish that the potential challenges invalidating the method were stressed more clearly throughout the paper: e.g., I have limited data, I have a prior on how to bound M; (when) can I use your method?
See weaknesses and questions.
Final Justification
The authors have addressed several of my concerns in a detailed and constructive way. They improved the documentation of the TCGA setup and added comparisons to baselines, which partially strengthens the empirical section. However, the experimental validation remains limited overall: the TCGA experiment is still fully synthetic, and the MIMIC evaluation is qualitative, lacking ground truth, coverage assessment, or clarity.
Theoretical sections are clearer after the response, especially the reformulation via quantile regression and dual optimization, though the notation would still benefit from more intuitive interpretation. The explanation of numerical stability is sound and technically competent.
Overall, I appreciate the conceptual and theoretical contributions, but remain disappointed by the very limited empirical evidence, which may fall short of expectations for practitioners interested in applying these methods on real-world applications.
Formatting Issues
- Redundant 'Eq. equation' expressions (lines 198, 213, ...).
- Confusing Figure positioning, mixing captions with text.
Thank you for your constructive feedback. We will add the corresponding improvements labeled as “Action” to our revised manuscript.
Answer to weaknesses
Empirical evaluation:
We are happy to discuss our empirical evaluation and give more background on the TCGA and MIMIC experiments below.
(i) TCGA: We synthetically model a continuous treatment based on the sum of the 10 covariates with the highest variance and assign a treatment effect that is constant in the sum of the covariates. Specifically, we model the treatment to follow a normal distribution centered at 100 times the sum of the 10 covariate values, and the outcome to follow a normal distribution centered at the sum of the treatment and 100 times the covariate sum.
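To make this setup concrete, here is a minimal sketch of the DGP as described above (the noise scales `sigma_a`, `sigma_y` and all other specifics are illustrative placeholders, not the paper's exact values):

```python
# Illustrative sketch of the semi-synthetic TCGA DGP described above
# (noise scales sigma_a, sigma_y are placeholders, not the paper's values).
import numpy as np

rng = np.random.default_rng(0)

def simulate_tcga_like(X, sigma_a=1.0, sigma_y=1.0):
    """X: (n, d) covariate matrix, e.g., TCGA gene-expression features."""
    top10 = np.argsort(X.var(axis=0))[-10:]           # 10 highest-variance covariates
    s = X[:, top10].sum(axis=1)                        # covariate sum
    a = rng.normal(loc=100.0 * s, scale=sigma_a)       # treatment
    y = rng.normal(loc=a + 100.0 * s, scale=sigma_y)   # outcome
    return a, y

X = rng.normal(size=(500, 50))  # stand-in for the real covariates
a, y = simulate_tcga_like(X)
```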
We now also obtained prediction intervals from the other baselines (MC dropout and the ensemble method) and report the empirical coverage (standard deviation in parentheses) below. The results are in line with our findings on datasets 1 & 2.
| Intervention; method | α = 0.05 | α = 0.1 | α = 0.2 |
|---|---|---|---|
| Δ = 0.5; MC: | 0.8280 (0.0795) | 0.8160 (0.0783) | 0.8040 (0.0741) |
| Δ = 0.5; E: | 0.0640 (0.0445) | 0.0600 (0.0379) | 0.0480 (0.0412) |
| Δ = 1.0; MC: | 0.8560 (0.0612) | 0.8520 (0.0614) | 0.8400 (0.0657) |
| Δ = 1.0; E: | 0.0880 (0.0483) | 0.0680 (0.0371) | 0.0520 (0.0348) |
| Δ = 1.5; MC: | 0.7880 (0.0815) | 0.8040 (0.0925) | 0.8280 (0.0786) |
| Δ = 1.5; E: | 0.0720 (0.0411) | 0.0680 (0.0412) | 0.0640 (0.0389) |
(ii) MIMIC: Note that numerically evaluating causal inference methods on real-world data is not possible due to the fundamental problem of causal inference, i.e., only one potential outcome (for continuous treatments out of infinitely many) is observed for each individual. Therefore, we chose to provide “insights plots” (Fig. 7) for this dataset, as (i) assessing overall coverage across the test dataset is impossible, and (ii) aggregating prediction intervals is uninformative as an evaluation method.
As the reviewer correctly noted, merely comparing the intervals based on their width is difficult, as width does not directly map to coverage. Therefore, we interpret Fig. 7 as suggestive of the coverage (see Section 5.3) and perform the semi-synthetic experiment on the TCGA dataset to be able to properly evaluate our intervals on a medical dataset.
Action: We will address the challenges of real-world evaluation and clarify the meaning of Fig. 7 in the camera-ready version of our paper.
Section 4:
We are happy to provide a more detailed explanation of our notation and the theoretical results, acknowledging that our paper may be notationally dense for readers unfamiliar with the related work. Specifically, we will discuss Section 4 based on (i) the reformulation as quantile regression in the high-level outline, (ii) the reformulation of Lemma 4.1 in terms of the dual optimization problem in Theorem 4.2 (and similarly the formulation of Eq. (P_S) in terms of Theorem 4.5), and (iii) the challenges when employing our approach.
(i) Quantile regression (Eq. 3 & 4):
Eq. 3 & 4 represent the general form of a quantile regression as a minimization problem based on the so-called pinball loss $\ell_\alpha$, a standard result in the statistical regression literature (e.g., [1], [2]). Observe that $\ell_\alpha(u) = u\,(\alpha - \mathbb{1}\{u < 0\})$ can equivalently be represented as $\max\{\alpha u, (\alpha - 1)u\}$. This asymmetric linear loss penalizes under-prediction proportionally to $\alpha$ and over-prediction proportionally to $1 - \alpha$, which results in the fitted function (see Eq. 3) targeting the $\alpha$-th conditional quantile.
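In display form, the generic estimator that Eq. 3 & 4 instantiate reads (standard textbook form; the exact objective in the paper may differ):

```latex
\hat{f} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_\alpha\bigl(y_i - f(x_i)\bigr),
\qquad
\ell_\alpha(u) \;=\; \max\{\alpha u,\; (\alpha - 1)\,u\}.
```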
(ii) Dual optimization Theorem 4.2 with primal optimization problem in Lemma 4.1:
Note that directly solving the problem given in Lemma 4.1 is not feasible, as it would require solving Eq. 5 for all possible imputed values of the test outcome $Y_{n+1}$. It is unclear how to feasibly find the imputation of the “true” $Y_{n+1}$ needed for the interval construction in Eq. 6. We thus exploit the efficient computation through dual optimization.
We can rewrite the primal problem in Lemma 4.1 in the standard linear-programming form of quantile regression. This is precisely the formulation of the respective primal problem stated in Eq. (P_S) as well as Theorem 4.5, where $u$ and $v$ represent the primal optimization variables. Forming the Lagrangian of this primal problem with Lagrange multipliers and minimizing over the primal variables then leads to the dual problem with dual optimization variables $\eta$; we sketch the generic primal/dual pair below for intuition.
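The generic (unweighted) quantile-regression primal and its dual take the following textbook form (a sketch for intuition only; the paper's Eq. (P_S) additionally carries the tilting weights):

```latex
% Primal: residuals split into positive parts u_i and negative parts v_i
\min_{\beta,\, u \ge 0,\, v \ge 0} \; \sum_{i=1}^{n} \bigl( \alpha\, u_i + (1 - \alpha)\, v_i \bigr)
\quad \text{s.t.} \quad y_i - x_i^\top \beta = u_i - v_i, \quad i = 1, \dots, n.

% Dual: one box-constrained multiplier \eta_i per data point
\max_{\eta} \; \sum_{i=1}^{n} y_i\, \eta_i
\quad \text{s.t.} \quad \sum_{i=1}^{n} \eta_i\, x_i = 0, \quad \eta_i \in [\alpha - 1,\, \alpha].
```

Here $u_i$ and $v_i$ are the positive and negative parts of the $i$-th residual, and the box constraints on $\eta$ arise from dual feasibility with respect to $u, v \ge 0$.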
For more background on dual (convex) optimization theory, we refer to, e.g., [3], [4]. Specific details on our derivations can be found in our proofs of Theorems 4.2 and 4.5 in Appendix D. For more details on the coverage guarantees/reformulation from Eq. 6 to Eq. 9, we refer to the respective part of our proof of Thm. 4.2 in lines 663 ff. of our manuscript.
(iii) Challenges:
The main challenge of our approach is the need for sufficient prior knowledge to set a bound on the propensity estimation error (see “answer to limitations” below). However, this bound is far less restrictive than the assumption of a known propensity score, and is common in the literature (see Section 2). In practice, one aims for valid but also narrow intervals. If the calibration dataset is not representative of the full population, the resulting intervals are potentially wider than necessary for test samples in segments of the output space with few calibration data. For transparency, we discuss all limitations in Section 6 of our manuscript.
Action: We will provide more background and explain the intuition for our derivations, as well as the challenges of our approach, in more detail in the camera-ready version.
Answer to questions: Numerical stability
We are happy to discuss the numerical stability of our proposed optimization approach. In Scenario 1, the only source of potential instability is a very low propensity score in low-overlap regions of the covariate space. This, however, is only a problem if the intervention targets a treatment value whose propensity is (near) zero. We can thus consider this unlikely in practice. For example, consider a patient who would be treated with a dosage of 10mg of some medication, which is prescribed in the range from 0 to 50mg. A practitioner is likely to be interested in the effect of an increase of the dosage to 15mg (i.e., a small shift) rather than an increase to 50mg. Furthermore, it is reasonable to assume the propensity function to be locally smooth. Therefore, this instability is unlikely to occur.
In Scenario 2, we could additionally face a numerical underflow of the Gaussian tilting terms or a blow-up of the pre-factor. As a remedy, we can reparameterize the problem to work in the log domain and only exponentiate after subtracting a stable (maximum) constant across all data points. This additionally prevents the Jacobian/Hessian of the constraints from becoming nearly singular or wildly varying, which would cause Newton-type steps to blow up or stall.
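A minimal sketch of this log-domain trick (generic, with illustrative names rather than the paper's notation):

```python
# Minimal sketch of the log-domain stabilization (generic, illustrative names).
import numpy as np

def stable_tilting_weights(log_num, log_den):
    """Compute normalized tilting weights w_i proportional to exp(log_num_i - log_den_i).

    log_num: log interventional treatment density at each calibration point
    log_den: log observational propensity at each calibration point
    """
    log_w = log_num - log_den
    log_w -= log_w.max()   # subtract the max before exponentiating: no overflow
    w = np.exp(log_w)      # entries now in (0, 1]; underflow only hits negligible terms
    return w / w.sum()
```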
Action: We will add a discussion on the stability of the optimization approaches to our manuscript.
Answer to limitations: Considerations for practical application
We are happy to discuss the requirements for applying our method in practice. First, our method is designed for single continuous treatments with a univariate outcome, which is common in dosing (e.g., determining the dosage of chemotherapy, insulin, or hypertension drugs).
- In randomized controlled trials (when the propensity score is known), we can directly apply Theorem 4.2 on top of the trained outcome prediction model to construct our CP intervals at target coverage level $1 - \alpha$.
- In observational studies, when the propensity score is unknown, we require the practitioner to have sufficient prior knowledge to set a bound on the propensity estimation error. The practitioner should choose this bound with care, and rather conservatively, to prevent undercoverage. Then, the CP intervals can be derived based on Theorem 4.5.
Our intervals do not suffer from undercoverage due to limited data. Importantly, CP guarantees are valid for any sample size, and our method is fully model-agnostic, enabling the practitioner to make reliable judgements based on limited data and an ML model of their choice.
Action: We will discuss practical considerations of applying our method in medicine.
[1] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica, 46(1):33–50, 1978.
[2] R. Koenker. Quantile Regression. Cambridge University Press, 2005.
[3] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming. Addison-Wesley, 1984.
I thank the authors for addressing several of my concerns in detail. However, I remain unsatisfied with the very limited empirical evidence, which may fall short of expectations for practitioners interested in applying these methods in real-world applications (reason why I don't feel confident in giving a positive rating).
Thank you for your answer. We are glad we could satisfactorily address your theoretical and practical concerns and clarify all questions.
As you suggested in your original review, we have provided new experimental results for the semi-synthetic TCGA dataset and added two more baselines for comparison in our rebuttal. Furthermore, we provided more information on the experiments on the MIMIC dataset.
Following the suggestion of other reviewers, we also added further results, which were initially shared in our responses to individual reviewers. For completeness and visibility, we provide them again here, and we will incorporate these results into the revised version of the paper. Our new experiments in response to the other reviewers are: (i) a comparison to the naive vanilla CP (V-CP), i.e., CP that does not account for the distribution shift, (ii) a Gaussian process UQ method on the synthetic datasets, and (iii) the weighted CP method for causal tasks on binary treatments by [1] on the TCGA dataset. We present the results below:
- (i) We observed that V-CP does not provide any valid prediction interval, across all distribution shifts and confidence levels. This can be explained by the good prediction performance of the underlying model (see Supplement G). Thus, V-CP intervals are extremely small (average width of 0.0003) and can never cover the true potential outcome after the intervention.
- (ii) The GP is only able to capture the true potential outcome in the prediction intervals for small distribution shifts (Δ = 1) on dataset 2, and even there the empirical coverage is extremely low. This result aligns with our expectations, as the aleatoric uncertainty in our experiments is low. Therefore, the GP intervals are small (average width of 0.1293) and barely provide valid intervals for the potential outcomes after the intervention.
- (iii) Both the CP method for binary treatments and our method fulfill the coverage guarantees. However, when comparing the interval width, we see that our method is superior: our intervals have an average width of 0.6255 (sd = 0.1714). In contrast, the intervals on the binarized treatment via the CP method for binary treatments have an average width of 3.2876 (sd = 0.5587). Hence, the intervals produced by our method are by far more informative.
Overall, we have provided extensive experimental results and included relevant baselines and datasets from the literature on causal inference applicable to continuous treatment variables (e.g., as in [2]).
Should you have specific requests or suggestions for additional experimental evaluations (e.g., datasets suited for causal inference with continuous treatments), we would be happy to include them. Otherwise, we kindly ask the reviewer to reconsider their evaluation in light of the comprehensive empirical evidence already provided.
[1] L. Lei and E. J. Candès. Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938, 2021.
[2] J. Schweisthal, D. Frauen, V. Melnychuk, and S. Feuerriegel. Reliable off-policy learning for dosage combinations. In Conference on Neural Information Processing Systems (NeurIPS), 2023.
I thank the author for the further clarifications. I may keep or increase my score accordingly.
This paper proposes a CP framework for constructing finite-sample valid prediction intervals for counterfactual outcomes, under both hard and soft interventions on a continuous treatment. It addresses a key limitation of standard conformal prediction—the violation of exchangeability between calibration and test points due to distribution shifts induced by interventions. To restore valid coverage, the authors introduce a tilting function based on the ratio of interventional to observational treatment densities (i.e., propensity scores). The method is empirically validated on both synthetic data and the MIMIC-III ICU dataset.
Strengths and Weaknesses
- Strong motivation, as the authors clearly identify three challenges and address them under different scenarios. They also provide strong theoretical results with finite-sample validity in Sections 4.1 and 4.2.
- I quite like the idea of causal conformal prediction; the use of a tilting function based on the treatment distribution is clever. However, this also leads to another concern of mine: the method in this paper does not guarantee causal identifiability in the traditional sense. Instead, it more or less assumes that the predictive model (and, in some cases, the propensity model) is already "good enough" and focuses on valid uncertainty quantification under distribution shifts induced by hard or soft interventions. It lies at the intersection of predictive modeling, distribution-shift correction / importance weighting, and finite-sample uncertainty quantification, and touches causal inference only to the extent that it predicts under interventions, but it does not address identification or estimation of causal effects directly.
- The authors provide extensive experimental results to illustrate the superior performance of their method. However, the baseline methods are limited to MC dropout and deep ensembles. Later, they find that ensembles are highly unfaithful and remove them from the comparison. This raises a concern about fairness: MC dropout and ensembles are not designed to handle counterfactual or distribution-shift elements (they rely on the exchangeability assumption), so comparing them to the proposed method is like comparing a bicycle to a jet engine. This might explain why Figures 4 and 5 look too good to be true. It would be more convincing to compare the proposed method against baselines that incorporate causal elements, such as Bayesian causal forests or causal GPs, which are widely used and provide credible intervals for potential outcomes or treatment effects.
- As mentioned above, the proposed method relies heavily on the correctness of the predictive model. It would be helpful to include an ablation analysis showing how performance depends on the accuracy of propensity score estimation, model misspecification, or unmeasured confounding.
Questions
- No comparisons with causal UQ baselines. Additionally, it would be helpful to include a related work section discussing recent progress on causal CP and causal UQ methods.
- The method assumes that outcome exchangeability is violated. Is this violation primarily due to treatment shift (e.g., a patient receiving a different treatment than observed), or due to covariate distribution shift (e.g. a young patient and an old patient)? How does the proposed approach handle these two types of shift, especially when they co-occur?
- This may be slightly tangential, but in line 276 you mention aleatoric uncertainty. Could you elaborate on how aleatoric and epistemic uncertainties are defined in this context, and whether (or how) the proposed method addresses each type?
Limitations
See above.
Final Justification
I quite like the idea of using a causal conformal score. However, the paper does not stand out sufficiently from existing methods due to a weak empirical evaluation and some ambiguity in distinguishing causal interventions from general distributional shifts or conditional prediction problems. Therefore, I have decided on a borderline accept.
Formatting Issues
NA
Thank you for your constructive feedback and the positive evaluation of our manuscript. Below, we provide answers to all questions. We will add the corresponding improvements labeled as “Action” to our revised manuscript.
Answers to weaknesses:
Causal identifiability:
Throughout our work, we make the three standard assumptions in causal inference: positivity, consistency, and unconfoundedness (see Section 3). These assumptions are standard in the literature (e.g., [1],[2],[3],[4]). Under these assumptions, the conditional average treatment effect is identified as a contrast of conditional expectations of the form $\mathbb{E}[Y \mid A = a, X = x]$. The potential outcome estimation thus simplifies to estimating the conditional expectation; in our case, it is estimated through the ML model (see Section 3). In the main paper, we provide conformal intervals for potential outcomes. We derive CP intervals for the treatment effect in Appendix A. Given that these assumptions are standard and generally satisfied by design (e.g., in clinical trials) (e.g., [4],[5],[6]), we do not share the concern raised by the reviewer. However, we are happy to clarify further aspects if needed.
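Explicitly, the standard identification argument proceeds in two steps (first unconfoundedness, then consistency; positivity ensures the conditional expectation is well defined):

```latex
\mathbb{E}[Y(a) \mid X = x]
\;=\; \mathbb{E}[Y(a) \mid A = a,\, X = x]
\;=\; \mathbb{E}[Y \mid A = a,\, X = x].
```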
Causal UQ baselines:
To foster the evaluation of our method, we additionally provide coverage results on two further causal UQ baselines: (i) causal Gaussian processes, and (ii) the causal CP method for binary treatments by Lei & Candès.
(i) The GP is only able to capture the true potential outcome in the prediction intervals for small distribution shifts (Δ = 1) on dataset 2, and even there the empirical coverage is extremely low. This result aligns with our expectations, as the aleatoric uncertainty in our experiments is low. Therefore, the GP intervals are small (average width of 0.1293) and barely provide valid intervals for the potential outcomes after the intervention.
(ii) We compare our method and the method by Lei & Candès on the semi-synthetic TCGA dataset, where both methods fulfill the coverage guarantees. However, when comparing the interval width, we see that our method is superior: our intervals have an average width of 0.6255 (sd = 0.1714). In contrast, the intervals on the binarized treatment via the Lei-Candès method have an average width of 3.2876 (sd = 0.5587). Hence, the intervals produced by our method are by far more informative.
Dependence on propensity score estimation, model misspecification, or unmeasured confounding:
Propensity estimation: Our intervals depend on the quality of the propensity estimation through the user-specified bound $M$ on the estimation error (Assumption 1). A larger $M$ represents more uncertainty in the propensity estimation and thus wider intervals. We conducted an ablation study on the TCGA dataset: the results confirm this pattern, as expected.
Model misspecification: Our CP intervals are exactly designed to account for the bias and uncertainty stemming from model misspecification. Misspecified models lead, on average, to higher non-conformity scores on the calibration dataset and, thus, to wider intervals.
Unmeasured confounding: We highlight again that we make the three standard assumptions in causal inference, including unconfoundedness. In the presence of unmeasured confounders, the potential outcomes are not (point-)identified. Our method is not designed for this setting. For future work, extending our method to partial identification of potential outcomes based on sensitivity analysis could be of interest.
Action: We will add the above discussion to the camera-ready version of our paper, including the results of the new ablation study.
Answers to questions:
(1) Causal UQ baselines and related work:
For your question on causal UQ baselines, please see our answer above.
There exists a variety of work for uncertainty quantification (UQ) in causal effect estimation for continuous treatments. Existing methods for UQ of causal quantities are often based on Bayesian methods (e.g., [1],[2],[7]). However, Bayesian methods require the specification of a prior distribution based on domain knowledge and are thus neither robust to model misspecification nor generalizable to model-agnostic machine learning models. Other methods only provide asymptotic guarantees (e.g. [3],[8]). The strength of conformal prediction, however, is to provide finite-sample uncertainty guarantees. Overall, none of the methods tackles the problem of distribution-free uncertainty quantification for causal effect estimation of continuous treatments in finite sample settings.
For an extensive discussion of related work on CP for causal quantities, we refer the reviewer to our extensive extended literature review in Appendix E.
(2) Exchangeability violation:
Coverage guarantees of existing CP intervals essentially rely on the exchangeability of the non-conformity scores. Exchangeability assures that the non-conformity score of the test point is equally likely to fall anywhere among the calibration scores, i.e., its rank is uniform, and this uniformity is exactly what yields the distribution-free coverage guarantee. Without exchangeability, the rank is not guaranteed to be uniform, and the marginal coverage bound can fail.
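For reference, this is the standard split-CP statement: if the calibration scores $s_1, \dots, s_n$ and the test score $s_{n+1}$ are exchangeable, then, with $s_{(k)}$ denoting the $k$-th smallest calibration score,

```latex
\mathbb{P}\Bigl( s_{n+1} \;\le\; s_{\left(\lceil (1 - \alpha)(n + 1) \rceil\right)} \Bigr) \;\ge\; 1 - \alpha .
```

This is exactly the marginal guarantee that breaks once exchangeability fails.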
However, intervening on the treatment shifts the propensity function and therefore induces a shift in the joint distribution of covariates and treatment between calibration and test data, specifically in the treatment $A$. Therefore, exchangeability is not fulfilled, and the coverage guarantees might fail (at least for existing methods).
To address this, we present a novel and powerful remedy in our work: the overall distribution of the confounders is assumed to stay constant between training, calibration, and test data (as is standard in ML problems). This is completely orthogonal to constructing intervals for different (e.g., young or old) patients. Note that CP intervals are constructed for only one sample/patient at a time. This means that different intervals are constructed for patients with different features $x$. In other words, the intervals are constructed conditionally on $x$, but the coverage guarantee is marginal across the complete population. Overall, the shift from one patient to another does not pose any challenges for CP methods.
Action: We will include a discussion on the role of the exchangeability assumption and the limitations it poses for existing causal CP methods in the camera-ready version of our paper. We will clarify how our proposed method circumvents these limitations, offering a more robust and practical approach under realistic causal settings.
(3) Aleatoric and epistemic uncertainty:
This is an interesting question! First, we would like to highlight that CP guarantees marginal (total) coverage, agnostic to the split between aleatoric and epistemic error.
Aleatoric uncertainty denotes the irreducible noise inherent in the data–generation process, such as measurement error or intra‑subject variability. Epistemic uncertainty is present due to the limited data available and due to model misspecification. This uncertainty is, in principle, reducible by gathering more data or improving the model.
[1] A. Alaa and M. van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[2] K. Hess et al. Bayesian neural controlled differential equations for treatment effect estimation. In International Conference on Learning Representations (ICLR), 2024.
[3] J. Jonkers et al. Conformal Monte Carlo meta-learners for predictive inference of individual treatment effects. arXiv preprint, 2024.
[4] P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[5] S. Feuerriegel et al. Causal machine learning for predicting treatment outcomes. Nature Medicine, 30(4):958–968, 2024.
[6] P. Sanchez et al. Causal machine learning for healthcare and precision medicine. Royal Society Open Science, 9(8), 2022.
[7] A. Jesson et al. Identifying causal-effect inference failure with uncertainty-aware models. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[8] Y. Jin et al. Sensitivity analysis of individual treatment effects: A robust conformal inference approach. Proceedings of the National Academy of Sciences, 120(6), 2023.
Dear Reviewer K7Nc,
We would like to thank you once again for your helpful comments and your thoughtful evaluation. We sincerely appreciate your time and effort in reviewing our work.
If you should have any further questions or concerns, we are happy to provide clarification during the remaining part of the discussion period. Otherwise, as we believe we have sufficiently addressed your main concerns, we would kindly ask you to reconsider your assessment and, if appropriate, revise your score accordingly.
I would like to thank the authors for their response and additional experiments. However, I remain unsatisfied with two points: 1. the proposed method is not causal enough, it's more like a UQ framework under distributional shift. 2. limited empirical evidence. While I appreciate the added experiments, the paper still lacks appropriate and comprehensive baseline comparisons. Therefore, I have decided to maintain my score.
Thank you for your answer. We are happy to provide further clarification for the two remaining concerns:
- We are afraid we do not fully understand what the reviewer means by 'not causal enough'. Our method provides conformal prediction intervals for potential outcomes. Different from standard CP tasks, the treatment assignment (i.e., the intervention) postulates an unknown distribution shift between the calibration and test data, which violates the underlying assumptions of traditional conformal prediction. In contrast, our CP method is carefully targeted at causal tasks. We would like to note that our way of looking at a causal inference problem in terms of shifts between the observational and the interventional distributions is standard in causal conformal prediction, e.g., [1],[2],[3],[4],[5], as well as other causal inference literature, e.g., [6],[7]. In sum, we develop a novel method for CP in causal inference tasks with continuous treatments.
- As you suggested in your original review, we have provided new experimental results for two more baselines. Specifically, we provided results for the causal CP method for binary treatments by [1] and a GP. We observed that the GP intervals are rarely valid. The CP intervals by [1] are valid but uninformative, highlighting the need for developing CP methods for causal tasks of continuous treatments.
Following the suggestion of other reviewers, we also added a comparison to the naive vanilla CP (V-CP) as an additional baseline, i.e., CP that does not account for the distribution shift and is non-causal; these results were initially shared in our responses to individual reviewers. For completeness and visibility, we provide them again here, and we will incorporate them into the revised version of the paper. We observed that V-CP does not provide any valid prediction interval, across all distribution shifts and confidence levels. This can be explained by the good prediction performance of the underlying model (see Supplement G). Thus, V-CP intervals are extremely small (average width of 0.0003) and can never cover the true potential outcome after the intervention. This highlights the need for adjusting for the distribution shift caused by the treatment intervention.
Overall, we have provided extensive experimental results and included relevant baselines from the literature on CP and causal inference.
[1] L. Lei and E. J. Candès. Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938, 2021.
[2] J. Jonkers et al. Conformal Monte Carlo meta-learners for predictive inference of individual treatment effects. arXiv preprint, 2024.
[3] Z. Chen, R. Guo, J.-F. Ton, and Y. Liu. Conformal counterfactual inference under hidden confounding. In Conference on Knowledge Discovery and Data Mining (KDD), 2024.
[4] M. F. Taufiq, J.-F. Ton, R. Cornish, Y. W. Teh, and A. Doucet. Conformal off-policy prediction in contextual bandits. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[5] M. Yin, C. Shi, Y. Wang, and D. M. Blei. Conformal sensitivity analysis for individual treatment effects. Journal of the American Statistical Association, 2022.
[6] B. K. Mlodozeniec, D. Krueger, and R. E. Turner. Position: Probabilistic modelling is sufficient for causal inference. In International Conference on Machine Learning (ICML), Position Paper Track, 2025.
[7] C. Fernández-Loría. Causal inference isn't special: Why it's just another prediction problem. arXiv preprint arXiv:2504.04320, 2025.
This paper develops a conformal prediction framework for estimating uncertainty intervals of potential outcomes under continuous-treatment settings in causal inference, which inherently break exchangeability. The authors propose a CP method that remains valid despite the violated exchangeability assumption, under known or unknown propensity scores, when interventions shift the treatment assignment mechanism. Synthetic and real-world experiments confirm reliable coverage and adaptive interval widths under various intervention scenarios and significance levels.
Strengths and Weaknesses
Strengths: (1) The paper addresses a clear gap in the conformal prediction literature: providing finite-sample valid prediction intervals for counterfactual outcomes under continuous treatments.
(2) The paper provides finite-sample marginal coverage guarantees for the constructed intervals in both known and unknown propensity settings, which is rare in causal CP work.
(3) The method is carefully grounded in the potential outcomes framework, and the authors precisely handle the violation of exchangeability under interventions.
Weaknesses: (1) In the unknown-propensity case, the method uses estimated scores but does not model or quantify the uncertainty in their estimation. This may degrade validity if the estimation is poor.
(2) The experiments do not include baseline methods for conformal prediction under causal settings (e.g., DR-CP, CPC for binary treatments). Though not directly comparable, some kind of ablation or prior method could serve as a sanity check.
(3) Some strong assumptions about the variables and functional form, such as the Gaussian function in Eq. (14).
Questions
(1) The motivation for uncertainty quantification is not clear. As claimed in the paper, a point estimate will neglect that a certain treatment is ineffective for some patient profiles. However, some causal estimands, e.g., the CATE, can also fill this gap.
(2) One of the challenges of conformal prediction for causal quantities lies in the non-exchangeability due to the intervention. But why is this challenge non-trivial? The manuscript is unclear on this point. Why does CP need the exchangeability assumption?
(3) In Eqs. (3) and (4), what does the function $\ell_\alpha$ represent? This part lacks important details about the quantile regression.
(4) In line 208, the vector $\eta^S$ has $n+1$ elements? This does not seem correct.
(5) It is very confusing why one needs to estimate the weight in Eq. (8) instead of using the densities directly. What is the difference between Eq. (9) and Eqs. (3) and (4)?
(6) It also lacks a specific explanation of why the max is used in the max-min objective in Eq. (8).
(7) How is Eq. (P_S) reformulated? What do $u_i$ and $v_i$ represent?
Limitations
yes
Final Justification
Most papers on causal inference, specifically treatment effect estimation or counterfactual estimation, focus on point estimates. This paper nicely fills the gap between effect estimation and uncertainty quantification, and its theoretical framework is relatively clear. However, since my expertise lies primarily in causality, and I am less familiar with conformal prediction, there may be areas where I lack sufficient domain knowledge. Therefore, I am unable to provide a more definitive recommendation, but I lean toward a borderline accept.
Formatting Issues
N/A
Thank you for your constructive feedback and the positive evaluation of our manuscript. Below, we provide answers to all questions. We will add the corresponding improvements labeled as “Action” to our revised manuscript.
Answers to weaknesses:
(1) Uncertainty quantification of propensity estimation:
Thank you for raising this question. Accounting for the propensity estimation error is exactly one of the strengths of our method (see Section 4.2) and differentiates our framework from former work in this field (see Section 2). Under Assumption 1 (by stating a user-specified upper bound on the estimation error), our intervals are valid for all misspecified propensity estimations with estimation error smaller than or equal to the bound.
(2) CP Baselines:
We followed your suggestion and added further CP baselines. Specifically, we now report the performance of (i) the naive vanilla CP (V-CP), i.e., CP that does not account for the distribution shift, and (ii) the method proposed by Lei and Candès for binary treatments (for which we binarize our treatment).
(i) We observe that V-CP does not achieve any valid prediction interval, across all distribution shifts and confidence levels. This can be explained by the good prediction performance of the underlying model (see Supplement G). Thus, V-CP intervals are extremely small (average width of 0.0003) and can never cover the true potential outcome after the intervention. Overall, the results confirm the importance of accounting for the distribution shift induced by the intervention.
(ii) We compare our method and the method by Lei and Candès on the semi-synthetic TCGA dataset, where both methods fulfill the coverage guarantees. However, when comparing the interval width, we see that our method is superior: our intervals have an average width of 0.6255 (sd = 0.1714). In contrast, the intervals on the binarized treatment via the Lei-Candès method have an average width of 3.2876 (sd = 0.5587). Hence, the intervals produced by our method are by far more informative.
Action: We will include our performance results of the V-CP and the binary method by Lei and Candès in the camera-ready version of our manuscript.
(3) Functional form of the Dirac delta in Eq. (14):
We are happy to explain our choice of representation of the Dirac delta function. A common way to represent the Dirac delta is as the limit of a family of probability densities that become infinitely concentrated at zero; one such family is the Gaussian densities. Other options include, e.g., representations through Lorentzian spikes, rectangular pulses, or sinc kernels. For background on these representations, we refer the reviewer to, e.g., [1]. Please let us know if there are other functional forms in our paper that should be explained in more detail.
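Concretely, the Gaussian representation reads (with the limit understood in the distributional sense):

```latex
\delta(x) \;=\; \lim_{\sigma \to 0^{+}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{x^{2}}{2\sigma^{2}} \right).
```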
Answers to questions:
(1) Motivation for UQ:
The reviewer is correct that heterogeneous effects, such as CATE, are designed to capture variability in the treatment effect estimate; however, this only refers to the variability based on the heterogeneity in effects for varying covariates. It does not include the variability in the effects due to noise in the data or modeling error (which is our objective). Therefore, uncertainty quantification (as in our paper) is necessary in causal inference settings in the same way as for other ML applications.
(2) Exchangeability assumption:
Coverage guarantees of existing CP intervals essentially rely on the exchangeability of the non-conformity scores. Exchangeability assures that the non-conformity score of the test point is equally likely to fall anywhere among the calibration scores, i.e., its rank is uniform, and this uniformity is exactly what yields the distribution-free coverage guarantee. Without exchangeability, the rank is not guaranteed to be uniform, and the marginal coverage bound can fail.
However, intervening on the treatment shifts the propensity function and therefore induces a shift in the joint distribution of covariates and treatment between calibration and test data, specifically in the treatment $A$. Therefore, exchangeability is not fulfilled, and the coverage guarantees might fail.
To address this, we present a novel and powerful remedy in our work: our calibration step differs from the standard procedure in that we calibrate the non-conformity scores conditionally on the tilting function to achieve marginal coverage for the interventional, and thus shifted, data. Thereby, we ensure that the exchangeability assumption effectively holds.
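For intuition, here is a minimal sketch of likelihood-ratio-weighted split CP in this spirit (a generic simplification with illustrative names, in the style of covariate-shift CP; not the paper's exact tilted calibration):

```python
# Generic weighted split-CP sketch (illustrative; not the paper's exact method).
import numpy as np

def weighted_cp_threshold(scores, w_cal, w_test, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    scores: nonconformity scores s_i on the calibration set
    w_cal:  likelihood ratios pi*(a_i | x_i) / pi(a_i | x_i)
    w_test: likelihood ratio at the test point
    """
    order = np.argsort(scores)
    s, w = np.asarray(scores)[order], np.asarray(w_cal)[order]
    # Normalized cumulative weights; the test point's mass sits at +infinity.
    p = np.cumsum(w) / (w.sum() + w_test)
    idx = np.searchsorted(p, 1 - alpha)
    return s[idx] if idx < len(s) else np.inf  # infinite threshold if mass is insufficient

# The interval is then {y : score(x_test, y) <= weighted_cp_threshold(...)}.
```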
(3) Quantile regression (Eqs. 3 & 4):
Equations 3 & 4 represent the general form of a quantile regression as a minimization problem based on the so-called pinball loss $\ell_\alpha$, a standard result in the statistical regression literature (e.g., [2],[3],[4],[5]).
Observe that $\ell_\alpha(u) = u\,(\alpha - \mathbb{1}\{u < 0\})$ can equivalently be represented as $\max\{\alpha u, (\alpha - 1)u\}$. This asymmetric linear loss penalizes under-prediction proportionally to $\alpha$ and over-prediction proportionally to $1 - \alpha$, which results in the fitted function (see Eq. 3) targeting the $\alpha$-th conditional quantile.
Action: We will include more background information on the reformulation of quantile regression in the camera-ready version of our manuscript.
(4) Dimension of vector in line 208:
Thank you for noting the typo! Initially, we referred to the sizes of both the training and calibration datasets as $n$, resulting in the dimension $n+1$ stated in line 208; however, we later adapted our formulation to enhance the reader's understanding. The correct dimension of $\eta^S$ is $n_c + 1$, i.e., the size of the calibration set plus one.
Action: We will correct the typo in the camera-ready version of our paper.
(5), (6) & (7) Dual optimization:
We are happy to explain the necessity and the derivations of the dual optimization problem in Theorem 4.2 from the primal optimization problem in Lemma 4.1.
First, observe that the objective function and loss formulation in Eq. 3 & 4 denote a general quantile regression, which does not directly correspond to our CP problem. The precise (primal) optimization problem for our CP setting is given in Lemma 4.1 (Eq. 5).
Note that directly solving the problem given in Lemma 4.1 is not feasible, as it would require solving Eq. 5 for all possible imputed values of the test outcome $Y_{n+1}$. It is unclear how to feasibly find the imputation of the “true” $Y_{n+1}$ needed for the interval construction in Eq. 6. We thus follow prior results in the literature (e.g., [6]) and exploit efficient computation through dual optimization.
We can rewrite the primal problem in Lemma 4.1 in the standard linear-programming form of quantile regression. This is precisely the formulation of the respective primal problem stated in Eq. (P_S) as well as Theorem 4.5, where $u_i$ and $v_i$ represent the primal optimization variables, i.e., the positive and negative parts of the $i$-th residual. Forming the Lagrangian of this primal problem with Lagrange multipliers and minimizing over the primal variables then leads to the dual problem with dual optimization variables $\eta$.
For more background on dual (convex) optimization theory, we refer the reviewer to, e.g., [7], [8], due to space constraints in the rebuttal. Specific details on our derivations can be found in the proofs of Theorems 4.2 and 4.5 in Appendix D.
For more details on the coverage guarantees/reformulation from Eq. 6 to Eq. 9, we refer the reviewer to the respective part of the proof of Thm. 4.2 in lines 663 and following in our manuscript.
Action: We will provide more background and explain the intuition for the dual optimization step in more detail in the camera-ready version of our manuscript.
[1] Gel’fand, I. M. and Shilov, G. E. Generalized Functions, Vol. 1: Properties and Operations. Academic Press, 1964.
[2] Koenker, R., and Bassett Jr., G. "Regression quantiles." Econometrica: journal of the Econometric Society (1978): 33-50.
[3] Koenker, R. Quantile regression. Vol. 38. Cambridge University Press, 2005.
[4] Yu, K., and Moyeed, R. A. "Bayesian quantile regression." Statistics & Probability Letters 54.4 (2001): 437-447.
[5] Belloni, A., and Chernozhukov, V. "ℓ 1-penalized quantile regression in high-dimensional sparse models." (2011): 82-130.
[6] Gibbs, I. et al. Conformal prediction with conditional guarantees. arXiv preprint, 2023.
[7] Boyd, S. P., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
[8] Luenberger, D. G., and Ye, Y. Linear and nonlinear programming. Vol. 2. Reading, MA: Addison-Wesley, 1984.
I would like to thank the authors for the response. I have read them and will maintain my score.
This paper focuses on the task of conformal prediction of causal effects from continuous-valued treatments. The novelty in this paper stems from handling the case of continuous-valued treatments, while previous literature focuses on binary or discrete-valued treatments.
The paper starts by tackling the causal effect estimation setting, making the standard assumptions of positivity, consistency, and unconfoundedness. They also assume access to an ML model to predict potential outcomes. The paper tackles producing CP intervals under either discretized hard or soft interventions, which cause the resulting distribution to violate standard exchangeability assumptions. To address this shift in distribution, the authors perform calibration (conditioned on the propensity shift). They define this shift in Equation (2) and use split conformal prediction with a calibration step based on this shifting function.
The authors empirically compare their approach with MC-dropout (with the same underlying MLP) and deep ensembles, given that there are no baselines that handle the setting of continuous treatments. They provide results on synthetic data, finding that the proposed approach has much higher coverage.
Strengths and Weaknesses
Strengths
- Tackles a new (and well-motivated) setting of CP for causal effect estimation in the setting of continuous-valued treatments
- Technical novelty in deriving a new theoretically sound approach (i.e., with proper coverage guarantees), adapting the approach from split CP
- Develops technical methodology to handle both cases where propensity scores are known and unknown
Weaknesses
- No comparison to a naive CP implementation as a baseline – this would be another valid and interesting baseline to understand the importance of the proposed approach
- The method seems computationally intensive; it requires calls to the optimization solver during each iteration of a binary search. If I’m not mistaken, shouldn’t the algorithm’s complexity be $O(\log(\frac{S_{up} - S_{low}}{\epsilon}) \cdot \sigma_s(n_c))$ (lines 601-602)?
- Minor formatting issues (e.g., the wrapped figures on Page 8)
Questions
Empirically, how does the runtime compare to the alternatives on the MIMIC dataset?
Also, see other weaknesses above.
Limitations
Yes
Final Justification
I remain borderline as the method is computationally intensive and the authors have not empirically demonstrated the runtime comparisons. These would be good improvements to keep in mind for future versions of the work.
Formatting Issues
N/A
Thank you for your feedback and the positive evaluation of our manuscript. Below, we provide answers to all the questions and concerns you raised. We will add the corresponding improvements labeled as “Action” to our revised manuscript.
Comparison to naive CP baseline:
We followed your suggestion and added a comparison to the naive vanilla CP (V-CP), i.e., CP, which does not account for the distribution shift. This may be of interest to the reader to understand the importance of the proposed approach. In our experiments, we observed that V-CP does not provide any valid prediction interval, across all distribution shifts and confidence levels. This can be explained by the good prediction performance of the underlying model (see Supplement G). Thus, V-CP intervals are extremely small (average width of 0.0003) and can never cover the true potential outcome after the intervention. Overall, the results confirm the importance of accounting for the distribution shift induced by the intervention.
To investigate whether simple weighted CP methods for causal tasks on binary treatments sufficiently address this challenge, we compare our method with the method by Lei and Candés. Of note, both methods fulfill the coverage guarantees. However, when comparing the interval width, we see that our method is superior: Our intervals have an average width of 0.6255 (sd = 0.1714). In contrast, the intervals on the binarized treatment via the Lei-Candés method have an average width of 3.2876 (sd = 0.5587). Hence, the intervals by our method are – by far – more informative.
Action: We will include our observations on the performance of both methods in the camera-ready version of our manuscript.
Computational complexity:
You are correct that the overall complexity is $O(\log(\frac{S_{up} - S_{low}}{\epsilon}) \cdot \sigma_s(n_c))$, following from the argumentation in the derivation. Please excuse the typo.
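Schematically, the construction looks as follows (`solve_dual` is a hypothetical stand-in for one dual-optimization call at a candidate score level, costing $\sigma_s(n_c)$ per call):

```python
# Schematic binary search over the nonconformity-score threshold.
# The loop runs O(log((s_up - s_low) / eps)) times, each iteration making
# one solver call, which gives the overall complexity stated above.
def find_threshold(solve_dual, s_low, s_up, eps=1e-6):
    while s_up - s_low > eps:
        mid = 0.5 * (s_low + s_up)
        if solve_dual(mid):   # candidate level feasible / achieves coverage
            s_up = mid        # tighten from above
        else:
            s_low = mid       # raise the lower bound
    return s_up
```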
We acknowledge that our method is computationally rather complex, which we also stated when discussing the limitations of our method. However, this is not unique to our method, but also inherent to the baselines. As a case in point, the complexity of deriving intervals through the baselines (i.e., MC-dropout or ensemble methods) depends on the number of MC samples or models, respectively, both of which must be rather large to offer similar computational performance. In particular, both baselines do not offer valid prediction intervals, which is the contribution of our method. On top of that, the latter baseline (ensemble methods) scales with the precision of the intervals, which might be difficult to control and is thus impractical in real-world applications. In our work, we use optimization as a tool to provide conformal prediction intervals. Future research should focus on developing more efficient optimization algorithms for this task.
Action: We will correct the typo in the camera-ready version of our manuscript.
Runtime on MIMIC dataset:
This is an interesting question. Due to the optimization problem that needs to be solved in our approach, the time needed to compute our CP intervals is naturally higher than the time needed to calculate the MC intervals. While the theoretical runtime of CP may exhibit more complex scaling behavior (e.g., cubic or non-linear), our empirical results demonstrate that CP scales well in practice. Specifically, computing CP intervals takes roughly 10 times longer than computing MC intervals. However, we emphasize that MC intervals are generally not faithful and therefore not directly comparable. To better understand the scaling behavior of our CP method in practice, we examine computation time relative to the average interval width. Under this lens, both methods exhibit a similar time-to-width ratio of approximately one. In other words, our experiments do not indicate that CP intervals scale poorly; in fact, we observe good scaling properties in practice.
Action: We will include a full runtime comparison on the MIMIC dataset in the camera-ready version of our paper.
Dear Reviewer v4YF,
Thank you once again for your thoughtful evaluation. We sincerely appreciate your time and effort in reviewing our work.
If you should have any further questions or concerns, we are happy to provide clarification during the remaining part of the discussion period. Otherwise, as we believe we have sufficiently addressed your main concerns, we would kindly ask you to reconsider your assessment and, if appropriate, revise your score accordingly.
Thanks for the clarifications, which have partially addressed my concerns. I will maintain my score.
Evaluating causal effects with continuous treatments from observational data poses interesting challenges. First of all, there are the usual challenges of causal analysis (such as estimating propensity scores). Second, there is the fact that one rarely if ever sees two sample points that have received the same treatment, which requires some sort of smoothing over "treatment space".
The present paper applies the methodology of conformal prediction to this task. The goal in the present setting is this: after training a model and calibrating CP, one wants to be able to produce, upon seeing a new covariate X_{n+1}, an interval containing the true treatment effect with high probability. The treatment to be applied to X_{n+1} might either be chosen by the user, or assumed to come from the same distribution. The assumption needed here is fairly strong, but quite classical: positivity, consistency, and unconfoundedness. Moreover, the authors assume that the propensity score is either known, or can be estimated up to a constant factor.
The actual proposed method is a kind of CP with covariate shift. However, it is quite subtle, as it requires an appeal to duality theory. As is usual for CP, the method gives no guarantees on the sizes of intervals, but it does give valid coverage at the desired level (at least asymptotically). In any case, experiments (including some reported in the discussion) suggest that it does have good coverage and decent-sized intervals in practice.
The main strengths of the paper are the novel setting and the clearly important contribution of providing uncertainty estimates for treatment effects. On this point I confess that I should have probably pressed the reviewers further: the method, even though limited by its assumptions and intense computational cost, seems like an important contribution. The problem formulation and the proposed solution are quite interesting. (All reviewers acknowledged as much, but in a somewhat lukewarm fashion.)
On the weaknesses of the paper, I concur with the reviewers only partially. The paper does lack more extensive experiments and details on runtimes (though the authors intend to add more info on these points to the final version). On the other hand, it is clear that the "simple baselines" they suggested could never provide appropriate coverage. Some of the reviewers' comments were somewhat misguided, like the questions about $\ell_\alpha$ (the pinball loss that is standard in quantile regression) and about the assumptions (the ones used here are standard in the area).
Overall, I found the discussion of the paper somewhat unproductive and a bit unfair to the paper. I tend to side with the authors insofar as they were able to answer all relevant questions. Whatever they did not answer fully seems better left for future work.
My reading of the paper is that its merits outweigh its shortcomings by some margin. As pointed out above, the approach proposed is limited by its assumptions. These assumptions, however, are standard, and thus fine for a first paper on a topic. The comments on experiments are fine. Then again, I do not think that the first paper on a topic needs to be "complete" in this sense. Reading other papers in the causal area suggests that the present set of experiments is not atypically small compared with what one sees out there.
Therefore, I suggest that the paper be accepted.