Boosted Conformal Prediction Intervals
Abstract
We introduce a boosted conformal procedure designed to tailor conformalized prediction intervals toward specific desired properties, such as enhanced conditional coverage or reduced interval length.
Reviews and Discussion
This paper introduces a gradient boosting-based approach to tailor a conformal score function to better satisfy desirable properties. A more general score function "family" is introduced which adheres to a specific form dependent on parameters (e.g., the location and scale estimates $\hat{\mu}$, $\hat{\sigma}$). Instead of directly using the parameter estimates obtained from training the predictor $f$, a post-hoc boosting stage is introduced in which the parameters are further refined, leveraging additional data to minimize property-specific losses such as conditional coverage deviation or conformal interval length. After boosting for $T$ iterations, the refined parameter estimates can be used for downstream conformal calibration and prediction interval generation. Results are compared to default local conformal scores (Local) and CQR on a range of regression tasks, obtaining slightly more favorable empirical properties.
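A minimal sketch of this pipeline for the Local score follows (illustrative only, assuming a random forest base predictor; the boosting stage is left schematic and all names here are assumptions, not the paper's implementation):

```python
# Minimal sketch of the boosted conformal pipeline: train f, refine the score
# parameters post hoc, then run standard split-conformal calibration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def local_score(y, mu, sigma):
    # Generalized "Local" conformity score: |y - mu(x)| / sigma(x).
    return np.abs(y - mu) / sigma

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 5))
y = X[:, 0] + rng.normal(scale=0.1 + X[:, 0])

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
f = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Default (0-th iteration) parameter estimates: mu from f, constant sigma.
mu_cal, sigma_cal = f.predict(X_cal), np.ones(len(X_cal))

# Boosting stage (schematic): for t = 1..T, update (mu, sigma) by gradient
# boosting against a property-specific loss (interval length or conditional
# coverage deviation) evaluated on the training data -- omitted here.

# Split-conformal calibration with the (boosted) score.
alpha = 0.1
scores = local_score(y_cal, mu_cal, sigma_cal)
n = len(scores)
level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
q = np.quantile(scores, level, method="higher")
# Prediction interval at a new x: [mu(x) - q * sigma(x), mu(x) + q * sigma(x)].
```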
Strengths
- Addressing the practicality of the obtained conformal sets is a relevant problem, since results can sometimes be overly conservative in order to adhere to the conformal guarantees
- The proposed method seems relatively lightweight and has the benefit of not requiring changes to the model weights of $f$, since the boosting derivatives are computed w.r.t. the inputs of the scoring function, which are the model's predictions
- The improvements for the local conformal score function seem substantial at times
Weaknesses
My main concerns are in regards to evaluation and practicality. The authors propose two applications of the boosting procedure to improve conditional coverage and prediction set size. For conditional coverage, the boosting loss is introduced as a relatively complicated group deviation metric leveraging a particular "contrast tree" mechanism, and in practice requires further non-trivial approximations to make it differentiable for boosting. Since the results are very strongly tied to this "contrast tree" approach, I was struggling to see how the boosting loss and conditional coverage target (Eq. 12) connect in practice and how the results (e.g., Fig. 3) are more widely applicable, since the procedure is motivated as an "effective evaluation metric (for target conditional coverage) of independent interest". Is this procedure amenable to any black-box predictor, and can it be considered a particular gradient boosting mechanism?

In general, I found the obtained improvements not overly convincing. For example, in Fig. 3, if we consider the % of data for which conditional coverage ends up being violated (above the red line), we find 33% for Local and 37% for Boosted Local. Granted, the boosting seems to distribute miscoverage somewhat more evenly, but the overall violation rate remains similar. In Table 1, boosting CQR only gives marginal improvements. It would be good to compare the results to other conformal algorithms which claim to empirically (or sometimes even theoretically, under relaxed conditions or varying interpretations of conditionality) improve conditional coverage, e.g., [1,2,3,4]. This would help clarify if the approach is more widely beneficial.

Similarly, all experiments are run leveraging the same random forest and quantile network regressors -- do the results translate across different models as well? Another hurdle might be the fact that a differentiable property-specific loss needs to be defined for every target property, whose amenability may also depend in some way on the underlying model's design, and is fixed. Thus, tackling a more flexible scoring function comes at the cost of introducing another (fixed) functional to optimize over, with required approximations.
On another note, I was somewhat struggling to understand how the boosting procedure connects to the overall conformal procedure. In Fig. 1 it is suggested that boosting is an independent step in between model training and conformalization, but in the experiment description it seems that boosting leverages existing datasets (namely training data). Should I understand that the boosting dataset is a separate set? If so, how do its sampling requirements (e.g., exchangeability) connect to calibration and test data? If not, should I understand that the boosting step actually equates to an iterative conformal calibration procedure leveraging the training data, since the boosting objectives (Eq. 14, 17) actually require computing conformal sets and quantiles? Thank you for clarifying.
Other comments:
- I believe the fold index in the third-to-last line of Algorithm 1 should be omitted, since the final boosted score functions are not taken w.r.t. a particular fold.
- Fig. 4 is missing a legend, making it hard to interpret
- The use of the term "Power Loss" seems odd, since the goal is to reduce prediction set size, and from my understanding a smaller prediction set/interval will equate to higher power, since the permutation testing interpretation of the null hypothesis is on inclusion in the set.
- Theorem 5.1 does not seem like a theorem but more like a remark to me, since it is merely suggesting that the "generalized" score family may contain the optimal oracle solution, but in no way relates any optimality to the practically obtained solutions via the boosting procedure.
- It would be beneficial to use hyper-referencing for Equations and the Algorithm throughout the paper.
References
- [1] Romano, Yaniv, et al. "With malice toward none: Assessing uncertainty via equalized coverage." Harvard Data Science Review 2.2 (2020): 4.
- [2] Gibbs, Isaac, John J. Cherian, and Emmanuel J. Candès. "Conformal prediction with conditional guarantees." arXiv preprint arXiv:2305.12616 (2023).
- [3] Sesia, Matteo, and Yaniv Romano. "Conformal prediction using conditional histograms." Advances in Neural Information Processing Systems 34 (2021): 6304-6315.
- [4] Jung, Christopher, et al. "Batch multivalid conformal prediction." arXiv preprint arXiv:2209.15145 (2022).
Questions
- Could you please comment on some of the points raised above, such as clarifying the data used by the boosting procedure, the applicability of the "contrast tree" method, the practicality of defining new optimization objectives, and the distinction from other existing methods aiming for conditional coverage?
- The approach to obtain a differentiable loss objective from Eq. 14 seems quite involved. Could you comment on the motivation of leveraging "contrast trees" and the tools used to go from Eq. 14 to a differentiable loss?
- Are there any results on how the boosting procedure affects downstream conformal coverage guarantees, or any statements of optimality or improvement of the obtained "boosted" conformity scores over the default (0-th iteration) conformity scores?
Limitations
I did not find the limitations of the method thoroughly discussed anywhere, and I believe the distinction from other related works could be improved. There are some comments made in the discussion on improvements, as well as in L218-L222 on not guaranteeing an optimal solution.
We thank the reviewer for their thoughtful comments. In response to each of the reviewer’s specific points:
Is this procedure amenable to any black-box predictor and can be considered a particular gradient boosting mechanism?
- As an evaluation metric, the contrast tree algorithm applies to any black-box predictor f, as it only requires three inputs: the features, the labels, and the conformalized prediction intervals, which can be derived from the conformity scores computed from the model's predictions. The contrast tree algorithm then automatically detects the regions where coverage is most uneven.
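As a simplified, self-contained stand-in for this evaluation (not Friedman's contrast tree itself, which uses a dedicated splitting criterion), one can fit a shallow tree to the miscoverage indicator and inspect per-leaf miscoverage:

```python
# Simplified stand-in for contrast-tree evaluation: fit a shallow CART tree to
# the 0/1 miscoverage indicator and report per-leaf miscoverage. This is an
# illustrative assumption, not the contrast-tree algorithm itself.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def leaf_miscoverage(X, y, lo, hi, max_leaf_nodes=8):
    miss = ((y < lo) | (y > hi)).astype(float)   # 1 if the interval misses y
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                 min_samples_leaf=50).fit(X, miss)
    leaves = tree.apply(X)
    # Per-leaf empirical miscoverage; a large gap from the target rate flags a
    # region where conditional coverage is uneven.
    return {int(leaf): float(miss[leaves == leaf].mean())
            for leaf in np.unique(leaves)}
```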
In general, I found the obtained improvements not overly convincing. For example, in Fig. 3 if we consider the % of data for which conditional coverage ends up being violated (above the red line) we find 33% for Local and 37% for Boosted Local.
- Assuming exchangeability, conformal prediction ensures that the average miscoverage rate over the test set is approximately 10%. The dream would be to achieve exactly 10% miscoverage within any subset of the feature space. While this might seem counterintuitive, the guarantee provided by conformal prediction implies that if one subset of the feature space has a miscoverage rate smaller than 10%, another disjoint subset must have a miscoverage rate greater than 10%. This results in prediction intervals that sometimes overcover and sometimes undercover. Figure 3 illustrates that before boosting, the conditional coverage in each leaf significantly deviates from the target rate of 10%; it is either too large or too small. After boosting, however, all groups have coverage rates that hover around 10%. Concretely, before boosting, the contrast tree identifies splits in the feature space that deviate from the 10% miscoverage rate by up to 15%. After boosting, the largest deviation identified by the contrast tree is reduced to no more than 6%.
In Table 1, boosting CQR only gives marginal improvements.
- Please refer to point 3 in the global rebuttal.
It would be good to compare the results to other conformal algorithms which claim to empirically (or sometimes even theoretically, under relaxed conditions or varying interpretations of conditionality) improve conditional coverage, e.g. [1,2,3,4].
- In comparison to other works aiming to improve conditional coverage, such as [1] and [2] mentioned by the reviewer, our approach offers a distinct advantage. Those works measure group conditional coverage based on user-specified groups. However, relying on predefined groups may not always be ideal. Specifically, please refer to point 4 in the global rebuttal.
Thus, tackling a more flexible scoring function comes at the cost of introducing another (fixed) functional to optimize over, with required approximations.
We agree with the reviewer that optimizing a targeted property requires a customized objective. We would like to highlight that the conformity score functions we discuss in the manuscript are arguably the two most widely used score functions for regression tasks. Although we do not address classification tasks in this project, it is worth noting that commonly used score functions in that domain, such as the threshold conformal predictor (THR) and adaptive prediction sets (APS), are also differentiable by design.
Should I understand that the boosting dataset is a separate set?
- During the boosting stage, we use the training set to simulate the conformal procedure, assessing the corresponding interval length and conditional coverage. Specifically, at each iteration of cross-validation, we divide the training set into a sub-boosting set and a sub-validation set (the held-out fold). The sub-boosting set serves as both the boosting set and the calibration set, while the sub-validation set acts as the test set. Please refer to Figure 2 for a schematic illustration. We shall work to explain this more clearly in the revised manuscript.
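Schematically, one cross-validation pass might look as follows (a sketch under assumed interfaces, not our actual implementation; `score_fn` and `evaluate_property` are hypothetical placeholders for the fitted conformity score and the property-specific loss):

```python
# Each held-out fold plays the role of a test set, while the remaining folds
# serve as both boosting and calibration data.
import numpy as np
from sklearn.model_selection import KFold

def cv_boosting_loss(X_train, y_train, score_fn, evaluate_property,
                     alpha=0.1, n_folds=5):
    losses = []
    for boost_idx, val_idx in KFold(n_folds, shuffle=True,
                                    random_state=0).split(X_train):
        # The sub-boosting folds double as the calibration set ...
        cal_scores = score_fn(X_train[boost_idx], y_train[boost_idx])
        n = len(cal_scores)
        level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
        q = np.quantile(cal_scores, level, method="higher")
        # ... and the held-out fold simulates the test set.
        losses.append(evaluate_property(X_train[val_idx], y_train[val_idx], q))
    return float(np.mean(losses))
```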
Response to other comments:
- We thank the reviewer for pointing out the typo in the third-to-last line of Algorithm 1. We shall fix this in the revised manuscript.
- We shall add a legend for clarity of presentation in the revised manuscript.
- We agree with the reviewer that a smaller prediction set or interval corresponds to higher power. In the manuscript, by "power loss," we intended to convey "loss of power," meaning that when the interval length is larger, we lose power. If the reviewer finds this phrasing misleading, we will rephrase it in the revised manuscript.
- We agree with the reviewer that Theorem 5.1 only proves that the search space contains the optimal solution, but does not guarantee the optimality of the numerical boosting procedure.
- We shall add hyper-referencing in the revised manuscript.
Response to questions:
- Please see our replies to items 1, 2, 3, 5, and 6 in the Weaknesses section.
- For the motivation behind leveraging contrast trees, please refer to point 4 in the global rebuttal. One of the primary reasons we chose to approximate the empirical quantile using the Harrell-Davis quantile estimator is that it replaces the empirical quantile, a linear combination of at most two order statistics, with a linear combination of all samples, significantly enhancing the robustness of the optimization objective (see the sketch after this list).
- Please refer to point 5 in the global rebuttal.
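For reference, a short sketch of the Harrell-Davis estimator itself (independent of our pipeline): the weights come from a Beta CDF, so every order statistic contributes and the estimate varies smoothly with the samples.

```python
# Harrell-Davis quantile estimator: a smooth weighted average of all order
# statistics, with weights w_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b), where I is
# the regularized incomplete beta function (the Beta CDF).
import numpy as np
from scipy.stats import beta

def harrell_davis_quantile(x, p):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    grid = np.arange(n + 1) / n
    w = np.diff(beta.cdf(grid, a, b))   # one weight per order statistic
    return float(np.dot(w, x))
```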
Thank you for your answer and clarifications. A few more comments:
Please refer to point 3 in the global rebuttal.
It would be nice to see more experimental validation to support your claims on leveraging your boosting procedure as a general tool to optimize for desired properties, and your suggestion for FDR control seems interesting. Naturally, this seems out of scope for the current work but perhaps could be added to future work / discussion.
Those works measure group conditional coverage based on user-specified groups. However, relying on predefined groups may not always be ideal
This is a valid point, but a counterargument can be made: when we work in the group-conditional setting, we often either know (or even define) interesting group partitions and may only be interested in these particular groups or a subset thereof, whereas coverage imbalance across other partitions is acceptable (e.g., in the fairness case). Your contrast tree approach seems to perform its own partitioning in feature space that can (i) produce group partitions that are not really meaningful or are hard to interpret, and (ii) end up suboptimal for particular groups of interest. Thus I still believe it would be useful to compare to such approaches in an experimental setting where groups are known a priori, and demonstrate that the contrast tree approach recovers meaningful partitions. At the very least, a clearer separation from such works should be done.
We agree with the reviewer that optimizing a targeted property requires a customized objective.
I would suggest stressing this limitation for reader clarity, since I do not see it explicitly mentioned anywhere. This seems like a key modelling design step that can prove challenging for anyone wanting to boost for alternative objectives. In particular, it is unclear how the approximations made to obtain a differentiable objective impact the fundamental conformal prediction outcomes (e.g., validity and set sizes).
Please refer to Figure 2 for an illustrative schematic drawing. We shall work to explain this more clearly in the revised manuscript.
Thank you for clarifying. I agree that this needs clarification, because there is no direct connection established between Fig. 1 and 2, and they seem to suggest different conceptual approaches (post-hoc vs. leveraging training data).
Relating to this and my earlier question (If so, how do its sampling requirements (e.g., exchangeability) connect to calibration and test data?), it seems that leveraging the training data during boosting may impose an additional assumption that the training distribution is exchangeable with the calibration and test distributions. If so, this is not in line with "the usual exchangeability assumption". It seems like this data reuse during boosting is a key step that complicates the possibility of providing theoretical guarantees of some form on your procedure; would you agree?
We agree with the reviewer that a smaller prediction set or interval corresponds to higher power. In the manuscript, by "power loss," we intended to convey "loss of power," meaning that when the interval length is larger, we lose power. If the reviewer finds this phrasing misleading, we will rephrase it in the revised manuscript.
I do find it somewhat confusing, especially given that the multiple testing interpretation is not touched upon anywhere else in the paper, and would suggest renaming or omitting this.
We agree with the reviewer that Theorem 5.1 only proves that the search space contains the optimal solution, but does not guarantee the optimality of the numerical boosting procedure.
In my opinion this makes Thm 5.1 a fairly weak statement, and I would suggest renaming it into a "remark" or "proposition". In practice, this does not provide me with anything of real value. Since it follows primarily from your definition of generalized score families, perhaps Thm 5.1 as well as Thm. A.2 can even be rephrased as results following from a theoretical statement on "inclusion of the optimal solution in the search space of generalized score families" under sec 3.
Looking forward to your reply.
We thank the reviewer for taking the time to review our rebuttal and provide additional helpful comments.
It would be nice to see more experimental validation to support your claims on leveraging your boosting procedure as a general tool to optimize for desired properties, and your suggestion for FDR control seems interesting. Naturally, this seems out of scope for the current work but perhaps could be added to future work / discussion.
We shall include this in the discussion of additional objectives in Section A.1.
Thus I still believe it would be useful to compare to such approaches in an experimental setting where groups are known a priori, and demonstrate that the contrast tree approach recovers meaningful partitions. At the very least, a clearer separation from such works should be done.
We agree with the reviewer that there are situations where groups defined a priori from domain knowledge are of interest. In response, we have conducted the following experiment. As discussed in Section A.1, we can readily extend our boosting procedure to the group conditional coverage objective. Accordingly, we run the boosting procedure on the Meps-19 dataset divided into four groups: non-white female, non-white male, white female, and white male.
We observe a significant improvement (a 54% reduction in the maximum within-group coverage deviation) for the Local baseline, and a moderate improvement (15.6%) for the CQR baseline.
| | Marginal coverage | Max within-group deviation from target coverage rate | Average length |
|---|---|---|---|
| Local | 89.6% | 8.0% | 2.16 |
| Boosted Local | 89.8% | 3.7% | 2.67 |
| CQR | 89.4% | 4.2% | 3.26 |
| Boosted CQR | 89.7% | 3.5% | 3.21 |
I would suggest to stress this limitation for reader clarity, since I do not see it explicitly mentioned anywhere. This seems like a key modelling design step that can prove challenging for anyone wanting to boost for alternative objectives.
We shall discuss this limitation in our revised manuscript.
In particular, it is unclear how the approximations made to obtain a differentiable objective impact the fundamental conformal prediction outcomes (e.g., validity and set sizes).
We agree with the reviewer that in general, using an objective that does not directly target set size may lead to undesirable (or more desirable, as shown in the comparison between CQR and Boosted CQR in the experiment above, where the target was group conditional coverage) set sizes. As a result, we would recommend that the user incorporate all the targeted properties when customizing the optimization objective. We shall defer the discussion on validity to the reviewer’s next question.
Relating to this and my earlier question (If so, how do its sampling requirements (e.g., exchangeability) connect to calibration and test data?), it seems that leveraging the training data during boosting may impose an additional assumption that the training distribution is exchangeable with the calibration and test distributions. If so, this is not in line with "the usual exchangeability assumption". It seems like this data reuse during boosting is a key step that complicates the possibility of providing theoretical guarantees of some form on your procedure; would you agree?
We respectfully disagree with the reviewer on this point. The guarantee of marginal coverage validity for our boosted procedure does not require additional assumptions beyond the exchangeability assumption already made for the split conformal procedure. Whether or not boosting is applied, the training set produces a fitted or boosted conformity score based on the training data. As long as the calibration set and test set remain exchangeable, the original proof of validity still holds. This is also one of the reasons we chose to use the training set for boosting. We hope this clarifies our approach and rationale.
I do find it somewhat confusing, especially given that the multiple testing interpretation is not touched upon anywhere else in the paper, and would suggest renaming or omitting this.
We thank the reviewer for clarifying. We shall omit this in the revised manuscript.
In my opinion this makes Thm 5.1 a fairly weak statement, and I would suggest renaming it into a "remark" or "proposition".
We shall rename it as a proposition in the revised manuscript.
Thank you for your response.
As long as the calibration set and test set remain exchangeable, the original proof of validity still holds. This is also one of the reasons we chose to use the training set for boosting. We hope this clarifies our approach and rationale.
I don't see this explicitly mentioned in the paper, so I would suggest to make this more apparent. If this is the case, the fact itself could be formalized further in a proposition that extends the guarantee of split CP to incorporate the boosting stage. This seems, in my opinion, like a much more practical statement to formalize than the current Thm 5.1.
Overall, while I am still not entirely sold on the contrast tree approach and on the intricacies of your guarantees connecting to approximations of your boosting objectives, I think there are some nice ideas that could benefit practitioners in the CP field, in particular leveraging training data but moving away from the conformal training procedures which I find quite subpar. Thus I am raising my score and looking forward to the revised version incorporating your stated improvements.
We sincerely thank the reviewer for their helpful suggestions to improve our paper.
I don't see this explicitly mentioned in the paper, so I would suggest to make this more apparent. If this is the case, the fact itself could be formalized further in a proposition that extends the guarantee of split CP to incorporate the boosting stage. This seems, in my opinion, like a much more practical statement to formalize than the current Thm 5.1.
We agree with the reviewer that this is worth clarifying. We shall include this as a proposition in the revised manuscript.
This paper introduces a methodology for learning a conformal score function after training. Notably, the proposed method does not require model retraining, and instead learns the score function from the trained model's predictions via a cross-validation approach on the training data.
Strengths
Originality: Although the paper is in the same vein as works such as ConfTr [1], the fact that it can work directly on top of trained predictions on the same data is a clever formulation and a nontrivial benefit.
Quality: Overall, the paper is technically sound; I have no concerns regarding the theory. My reservations regarding the empirical results are discussed under weaknesses.
Clarity: The paper is well-written and all necessary ideas regarding conformal prediction for understanding the results in the paper are discussed appropriately.
Significance: The idea of learning a conformal score function after training (using the same data) but before calibration is, to my knowledge, a novel formulation and worth building on.
Weaknesses
Marginal empirical benefit. My main reservation with the proposed method is the seemingly marginal benefit it provides over CQR for conditional coverage, as shown in Table 1. This is especially relevant for large-scale settings, where the overhead of the boosting/CV stage is going to be highly nontrivial. Additionally, conditional coverage here is evaluated using contrast trees, which are baked into the objective of the boosting approach. It would be interesting to see if the improvements hold under, for example, just group-conditional coverage.
Baseline score functions. The baseline score functions as described in Appendix A.8 are relatively weak, untuned models. While I do not expect an expansive evaluation of models to be within the scope of this work, it would be good to know how sensitive the improvements in the paper are to the choice of baseline score model.
Overall, while I have reservations regarding the empirical results, I do find the formulation of learning the conformal score in this paper useful. As a result, I lean accept.
Questions
- The boosting stage uses the same data on which the baseline score is trained. I'm curious whether the authors also considered splitting off some data for this stage (instead of doing the CV approach)? Although this would be less sample efficient, it seems like it may have greater benefit than reusing data for more powerful baseline score models.
- It seems to me that the choice of boosting here for improving the baseline score could be replaced by some alternative means of mapping trained model predictions to the necessary mean/variance parameters for the conformal prediction intervals, and I'm wondering whether the authors considered such alternatives?
Limitations
Although not in a separate section, limitations are adequately discussed.
We thank the reviewer for their time spent reviewing the manuscript. In response to each of the reviewer's specific comments on weaknesses:
My main reservation with the proposed method is the seemingly marginal benefit it provides over CQR for conditional coverage, as shown in Table 1.
Please refer to point 3 in the global rebuttal.
Additionally, conditional coverage here is evaluated using contrast trees, which are baked into the objective of the boosting approach.
We would also like to clarify that while we use contrast trees in our loss function, we execute the contrast tree algorithm at each iteration of gradient boosting. In other words, we re-partition the feature space at each iteration as soon as the conformity score is updated. When we evaluate performance on the test set, we rerun the contrast tree algorithm on the test set, taking as inputs the features, the labels and the conformalized prediction intervals calculated on the test set. Therefore, the partition of the feature space at test time is different from that at boosting time.
It would be interesting to see if the improvements hold under, for example, just group-conditional coverage.
Please refer to point 4 in the global rebuttal.
- We only partially agree with the reviewer that the improvement depends on the accuracy of the underlying model. Imagine we have infinite training data and that the underlying model perfectly identifies the conditional mean $\mu(X) = \mathbb{E}[Y \mid X]$ and the conditional expected absolute deviation $\sigma(X) = \mathbb{E}[\,|Y - \mu(X)| \mid X]$. Then plugging these into the Local conformity score would not be optimal in terms of average interval length or conditional coverage. Below, we illustrate this with a simulated example. Assume that each $X$ follows a uniform distribution Unif(0.1, 1.1), and that $Y$ is generated from $X$ with additive noise $e$, where $e$ follows a beta distribution shifted to have mean zero. We can theoretically calculate $\mu(X)$ and $\sigma(X)$ in closed form in terms of the beta function $B(\cdot, \cdot)$. However, because a beta distribution is not symmetric, even if the training model perfectly fits $\mu$ and $\sigma$, the conformalized prediction intervals will not be of optimal length, as shown in the left panel of the attached PDF. If we take the theoretical mean and MAD as trained, then the boosted Local score effectively adjusts for the non-symmetric nature of the underlying distribution, as shown in the right panel. The respective average conformalized interval lengths are 0.237 and 0.216, about a 10% reduction in size.
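A numerical sketch in the spirit of this example is below; the data-generating formula $Y = X + e$ with Beta(1, 10) noise is a stand-in choice, not the exact construction above, but exhibits the same skewness phenomenon:

```python
# With skewed (beta) noise, the symmetric Local interval mu(x) +/- q*sigma(x)
# built from the *oracle* mean and MAD is longer than the best off-center
# interval at the same 90% coverage.
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 200_000, 0.1
X = rng.uniform(0.1, 1.1, n)
e = rng.beta(1, 10, n) - 1 / 11        # beta noise shifted to mean zero
Y = X + e

mu, sigma = X, np.mean(np.abs(e))      # oracle conditional mean and MAD
q = np.quantile(np.abs(Y - mu) / sigma, 1 - alpha)
sym_len = 2 * q * sigma                # symmetric Local interval length

# Shortest 90%-coverage interval over the residuals, allowed to be off-center.
taus = np.linspace(0, alpha, 201)
lo = np.quantile(Y - mu, taus)
hi = np.quantile(Y - mu, taus + 1 - alpha)
asym_len = (hi - lo).min()
print(f"symmetric: {sym_len:.3f}  off-center: {asym_len:.3f}")  # ~10% shorter
```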
Response to questions:
- We agree with the reviewer that an alternative approach could involve holding out a validation set in addition to the training, calibration, and test sets. However, as noted by the reviewer, this method presents a trade-off: a small validation set can lead to high variability in prediction intervals and performance, while a large validation set may reduce the number of samples available for training, potentially limiting the model's effectiveness. In contrast, our procedure uses a form of cross-validation to avoid this trade-off. To illustrate this, we will include experiments comparing the performance of the two procedures on real datasets in the revised manuscript.
- We agree with the reviewer that the gradient boosting algorithm could potentially be replaced by other machine learning models. In this sense, the gradient boosting algorithm is not central to our contribution. Within our boosting framework, once the appropriate search space and loss function are established, any suitable machine learning model can, in principle, be used to search for an enhanced conformity score.
Thank you for the thorough response, in particular the provided example. I plan to keep my score as I was already favoring acceptance, and in my view raising the score would require more substantive experiments that extend beyond a minor revision.
The paper proposes to utilize the training data and gradient boosting (of the model's predictions) to optimize loss functions related to conformal prediction interval length and conditional coverage with respect to the score function. The optimized score function is then used for "plain" CP usage, via calibration set.
Strengths
- The approach is novel and significantly differs from other methods that utilize training data to improve CP performance.
- The authors derive asymptotic theory for the sufficient expressiveness of the approach.
- Experiments demonstrate that the approach consistently boosts the CP metric associated with the designed loss (interval length or conditional coverage).
Weaknesses
- There is no discussion or investigation of the impact of optimizing the score for conditional coverage on the interval length, and vice versa. In practice, both can be important for a user.
- Even though there is theory on the "asymptotic expressiveness" of the approach, the method in practice is based on differentiable approximations and greedy optimization that do not guarantee optimality.
Questions
- What can you say about the computational complexity of the boosting procedure?
- In practice, both conditional coverage and the interval length are likely to be important for a user. Therefore, when you optimize the score for better conditional coverage, report also the effect on the interval length. Similarly, when you optimize the score for better interval length, report also the effect on the conditional coverage.
- Regarding the previous comment, when discussing these two properties it is worth mentioning the work Dabah, L., & Tirer, T. (2024). "On Temperature Scaling and Conformal Prediction of Deep Classifiers". arXiv preprint arXiv:2402.05806, which shows a simple way to trade between them in CP methods for classification.
- While I am aware that your approach does not require access to the model's parameters, it would be insightful to compare the approach (even in some of the settings) to the approach of [26], namely optimizing the model's parameters to minimize a loss that includes the term that you design.
Limitations
n/a
We thank the reviewer for their positive comments. In response to each of the reviewer’s specific comments on weaknesses:
- Please refer to point 1 in the global rebuttal.
- Please refer to point 5 in the global rebuttal.
Response to questions:
- Please refer to point 2 in the global rebuttal.
- Please refer to point 1 in the global rebuttal.
- We thank the reviewer for pointing us towards this literature. We shall include this in our revised manuscript when discussing the trade-off between conditional coverage and interval length.
- In our revised manuscript, we will include additional results comparing the performance of our method with the approach outlined in [26]. While [26] primarily focuses on reducing prediction set size for classification tasks, our work is centered on regression tasks. Another difference is that [26] considers conformity scores based solely on predicted probabilities for each class. In a regression context, this approach is analogous to using only $\hat{\mu}$ or $\hat{\sigma}$ in the Local conformity score. To bridge this gap, we plan to conduct an experiment using two neural networks to fit $\hat{\mu}$ and $\hat{\sigma}$ in the generalized Local conformity score in an alternating fashion. The objective will be the differentiable approximation of the average prediction interval length, as formulated below line 264. Originally, the training model's objective for fitting $\hat{\mu}$ was the mean squared error of $Y - \hat{\mu}(X)$, and for fitting $\hat{\sigma}$, it was the mean squared error of $|Y - \hat{\mu}(X)| - \hat{\sigma}(X)$.
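Schematically, the alternating scheme could look as follows; the architectures, the soft-quantile surrogate, and the optimizer settings below are illustrative assumptions rather than our final experimental setup:

```python
# Two small networks for mu and sigma, trained in turns against a
# differentiable surrogate of the average Local interval length
# 2 * q * mean(sigma(x)), where q is a smoothed quantile of the scores.
import torch
import torch.nn as nn

def mlp(d_in=5):
    return nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, 1))

mu_net, sigma_net = mlp(), mlp()

def soft_quantile(v, p=0.9, temp=0.05):
    # Smooth surrogate for the p-quantile: softmax weights centered on the
    # hard quantile (an assumption; the paper's exact surrogate differs).
    hard = torch.quantile(v.detach(), p)
    w = torch.softmax(-(v - hard) ** 2 / temp, dim=0)
    return (w * v).sum()

def length_loss(x, y):
    mu = mu_net(x).squeeze(-1)
    sigma = nn.functional.softplus(sigma_net(x)).squeeze(-1)
    q = soft_quantile(torch.abs(y - mu) / sigma)
    return 2 * q * sigma.mean()   # differentiable average interval length

opt_mu = torch.optim.Adam(mu_net.parameters(), lr=1e-3)
opt_sigma = torch.optim.Adam(sigma_net.parameters(), lr=1e-3)
x, y = torch.randn(256, 5), torch.randn(256)   # toy data for illustration
for t in range(200):
    opt = opt_mu if t % 2 == 0 else opt_sigma   # alternate which net updates
    opt.zero_grad()
    length_loss(x, y).backward()
    opt.step()
```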
I thank the authors for their response and keep the score that I gave (which already supports acceptance). As discussed, I am keen to see in the next version the effect of the approach on the trade-off between conditional coverage and interval length, which is currently not presented.
The paper presents a novel method to enhance conformal prediction intervals using gradient boosting. The proposed method focuses on improving specific properties such as conditional coverage and interval length. The key idea is to iteratively refine a predefined conformity score function via boosting, guided by a loss function designed to measure deviations from the desired properties. Experiments demonstrate that starting from conventional conformal methods, the boosted procedure significantly reduces interval lengths and better aligns with target conditional coverage.
Strengths
- Post-Training Enhancement: The boosting process is applied post-training, meaning it doesn't require modifications to the original model, making it applicable to a wide range of pre-trained models.
- Flexibility: The approach can be tailored to optimize various properties beyond interval length and conditional coverage, such as fairness or other specific criteria.
- Robustness: The use of cross-validation to select the number of boosting rounds helps prevent overfitting and ensures the method's robustness across different datasets and applications.
Weaknesses
- A comprehensive description of the experimental setting is necessary. Since the method is post-training, it is crucial to explain how the training samples used for post-training differ from those used for the base model. Additionally, it is important to address potential outcomes if a distribution shift occurs. If the data used for post-training differs from the data used for calibration and testing, can the method still work well? (Or, if the data used for post-training, calibration, and testing differs from the data used for base-model training, can the method still improve CP?)
- For the experimental presentation, why do the tables report conditional coverage and interval length only in isolation? To convincingly demonstrate the method's effectiveness, it is essential to show that the method can improve interval length without significantly compromising conditional coverage, or improve conditional coverage without significantly increasing interval length (rather than showcasing each metric on its own).
- The introduction of a boosting step increases the computational burden, potentially requiring significant resources and time, especially for large datasets. An analysis of the time complexity is required for better illustration.
Questions
- Can your method work well with different error rates $\alpha$?
- If the post-training set is small, can your method still perform effectively?
Limitations
Please refer to the weaknesses.
We thank the reviewer for taking the time to carefully review our manuscript. In response to each of the reviewer’s specific comments on weaknesses:
A comprehensive description of the experiment setting is necessary.
In our revision, we will more clearly describe our experimental setting. For concreteness, we shall make the following clarifications: we use the same data to train the initial prediction model and to boost. Building upon the trained prediction model $f$, we boost for an improved conformity score, and subsequently calibrate with the boosted conformity score on the calibration set. As a result, under the same exchangeability assumption as for the classical split conformal procedure, an expected marginal coverage rate of $1 - \alpha$ is guaranteed on a new test point drawn from the same distribution.
Additionally, it is important to address potential outcomes if a distribution shift occurs.
We of course acknowledge that if the exchangeability assumption is violated (e.g., in the case of a distribution shift), the marginal coverage property may no longer be valid. Robustness vis-à-vis distribution shifts is the subject of an immense literature and outside the scope of this paper; that said, this is an interesting direction to explore. It would be interesting to see how our boosting procedure could be combined with the line of research that relaxes the exchangeability assumption, for instance:
[1] Barber R F, Candes E J, Ramdas A, et al. Conformal prediction beyond exchangeability. The Annals of Statistics, 2023, 51(2): 816-845.
[2] Gibbs I, Candès E J. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 2024, 25(162): 1-36.
- Please refer to point 1 in the global rebuttal.
- Please refer to point 2 in the global rebuttal.
Response to questions:
- We believe that different target coverage rates would not affect the effectiveness of our method. We shall report additional results with different nominal miscoverage rates in our revised manuscript.
- We are somewhat confused about the reviewer's question. As mentioned previously, we use the training data in combination with cross-validation during the boosting stage. Regarding the calibration set, as with any method that relies on a holdout set, an extremely small holdout set will inevitably lead to high variability in the performance of the method. This is not specific to our method, and in general we would not recommend relying on a small holdout set with any procedure, so we do not explicitly consider this question in our work. It is worth pointing out, however, that marginal coverage guarantees do hold regardless of sample size, since these results are "on average" and do not account for variability.
Thank you for the thorough response, in particular the supplementary experiments, I raise my rating to 6.
We are grateful to all the reviewers for their valuable feedback and constructive suggestions for improving our manuscript. In the rebuttals below, we have responded to each reviewer’s point individually. The attached PDF includes additional simulation results. Below, we address several common points raised by the reviewers:
1. Trade-off between conditional coverage and interval length (trTq,tKW7).
We agree with the reviewers that in practice, both conditional coverage and interval length are likely to be important. In the revised manuscript, we will expand Tables 1 and 2 to report additional results on both conditional coverage and interval length to address this concern.
2. Computational complexity of our boosted conformal procedure (trTq,tKW7).
- The computational complexity of our procedure stems from three main components: 1) evaluating the customized loss function at each boosting iteration, 2) using the gradient boosting algorithm to update the conformity score function, and 3) performing cross-validation. We acknowledge the importance of improving the computational efficiency of our algorithm while maintaining or even enhancing its performance. One potential improvement could involve reducing runtime by replacing cross-validation with a simpler approach, namely, splitting the training set into a boosting set and a validation set. However, as discussed in our response to question 1 raised by reviewer D558, this simplification may introduce trade-offs between the variability of the prediction intervals and the effectiveness of the training model. Additionally, in our current implementation, the contrast tree algorithm is executed at every iteration of gradient boosting to identify regions in the feature space that deviate from the target coverage rate. A more computationally efficient strategy might involve re-partitioning the feature space only at specific intervals during the boosting iterations, such as every 5 or 10 iterations; in between these iterations, we would retain the last computed partition and use it to evaluate performance (see the sketch after this list).
- In our revised manuscript, we will report a more detailed comparison of the runtime between the unboosted split conformal procedure and our boosted procedure.
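A sketch of this lazier schedule, with `fit_partition` and `boost_step` as hypothetical placeholders for the expensive and cheap steps, respectively:

```python
# Recompute the contrast-tree partition only every `refresh_every` boosting
# iterations and reuse the last partition in between.
def boost_with_lazy_partition(T, refresh_every, fit_partition, boost_step, state):
    partition = None
    for t in range(T):
        if t % refresh_every == 0:
            partition = fit_partition(state)   # expensive: rerun contrast tree
        state = boost_step(state, partition)   # cheap: one boosting update
    return state
```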
3. Modest improvement on conditional coverage for CQR (D558,3gAf).
We acknowledge the observation from the reviewers that the improvement our procedure affords over CQR is generally somewhat limited. However, we would like to emphasize a few points: first, applying boosting after running CQR does not degrade its performance. Second, the experiments we presented on the selected datasets are by no means exhaustive. It is possible that one might see substantial improvements on datasets where the base CQR learner is less than ideal. Finally, our main goal is to demonstrate the flexibility of our method, highlighting its adaptability to various conformity scores and loss functions. While the improvement in conditional coverage may be modest in the specific context of this paper, we anticipate potential benefits in other applications. For instance, we foresee extending this approach to boost the power of multiple hypothesis testing while controlling the false discovery rate (FDR), particularly in tasks such as drug discovery and candidate screening. Following the framework proposed by [1], we would first calculate conformal p-values from the conformity scores associated with each sample and then apply the Benjamini-Hochberg procedure to control the FDR. P-values depend on the choice of conformity scores, and optimizing these scores would yield lower p-values and hence more rejections. Similarly, conformal methods have been used for outlier detection, and boosting conformity scores to improve the ROC (receiver operating characteristic) curve would equally be of great interest.
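For concreteness, a minimal sketch of this FDR pipeline following [1]; the score sign convention and tie handling are simplifying assumptions:

```python
# Conformal p-values from calibration scores, then Benjamini-Hochberg.
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    # p_j = (1 + #{calibration scores >= test score j}) / (n + 1)
    cal = np.asarray(cal_scores)
    return np.array([(1 + np.sum(cal >= s)) / (len(cal) + 1)
                     for s in np.asarray(test_scores)])

def benjamini_hochberg(pvals, fdr=0.1):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= fdr * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True   # reject the k smallest p-values
    return reject
```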
4. Comparison between the contrast tree algorithm and the group conditional coverage metric (D558, 3gAf).
We would like to argue that one of the key innovations here is the ability to identify regions in the dataset without requiring users to pre-specify them. By employing the contrast tree algorithm, we can automatically detect regions where coverage is most uneven. This feature is particularly beneficial if the user's objective is to achieve even conditional coverage across the feature space. Relying on user-specified groups may not always be appropriate, as it might overlook subgroups within these predefined categories where conditional coverage significantly deviates from the target rate, even if coverage appears even across the broader groups.
5. Clarifications on theoretical guarantees (tKW7,3gAf).
We agree with the reviewers that we currently cannot give any formal guarantee on the improvement our algorithm affords. We can only claim the following: 1) under the usual exchangeability assumption, our boosted procedure has the same coverage guarantees as the split conformal procedure on the test set; and 2) the search space contains the optimal solution. We can of course not guarantee that our algorithm will find this optimal solution.
References:
[1] Jin Y, Candès E J. Selection by prediction with conformal p-values. Journal of Machine Learning Research, 2023, 24(244): 1-41.
This paper presents a novel approach to enhancing conformal prediction intervals through a post-training boosting process, which offers substantial flexibility and robustness. While some concerns were raised regarding the empirical benefits and the use of the contrast tree approach, the authors' thorough responses and proposed revisions have addressed these points adequately. Given the innovative contributions and the positive reception from reviewers, I recommend accepting this paper.