Conformal Prediction for Ensembles: Improving Efficiency via Score-Based Aggregation
Rather than performing conformal aggregation in the space of prediction regions, we propose a vector extension of the conformal score, through which aggregation can be performed more efficiently.
Abstract
Reviews and Discussion
The authors propose conformal score aggregation (CSA), a new method for adapting conformal prediction to ensembles. Instead of averaging the scores of each model, they construct a data-driven ordering of the score vectors. This pre-ordering is achieved using a family of nested sets, each defined by a threshold parameter. Once the conformal scores are ordered, the coverage guarantee can be computed; the authors then run a binary search over a separate parameter to obtain the desired coverage under that ordering. They prove the theoretical consistency of CSA and provide experiments demonstrating the improvements it offers.
Strengths and Weaknesses
Strengths:
S1. The CSA framework is novel and takes a different approach to adapting conformal prediction to ensembles. (Originality).
S2. CSA is experimentally shown to yield significantly less conservative regions than previous methods. (Significance).
S3. Computational efficiency.
S4. The writing is clear and the exposition and methods are sound. (Quality, Clarity).
Weaknesses:
W1. There is no comment on the limitations of the proposed method.
W2. The article lacks a conclusion section.
Minor comments: E1. Line 230 grammatical error.
Questions
Q1. The conclusions section appears as a Discussion; it should be self-contained and more detailed.
Q2. The paper can be improved with a comment on the limitations of CSA.
Q3. We are not sufficiently familiar with this specific field to provide a confident assessment; however, the paper seems technically sound and consistent. This is especially the case for Sections 3.2 and 4.2.
Limitations
No, the authors should comment on limitations of the proposed method.
Final Justification
The concerns raised in the review were properly addressed by the authors. Since these were minor comments, I will keep my rating as it was.
Formatting Issues
No major formatting issues.
We thank the reviewer for their time and questions. We address each of the points successively below.
The conclusions section appears as a Discussion; it should be self-contained and more detailed.
We will expand the discussion in the final submission to include more details about extensions that can be made to the current CSA framework.
The paper can be improved with a comment on the limitations of CSA.
We will extend the conclusions to note some limitations. Some current ones were highlighted in the discussion section, where it would be of interest to see further use cases of CSA being integrated into regression settings beyond the predict-then-optimize setting we considered in the paper. Another limitation is the loss of data efficiency from needing to further split the calibration data into $\mathcal{D}_C^{(1)}$ and $\mathcal{D}_C^{(2)}$ to determine the quantile envelope shape without sacrificing coverage guarantees. Similar to how full conformal prediction augments split conformal use cases when it is infeasible to hold out a separate calibration set, it should be possible to derive a similar "full CSA" procedure, where the calibration set is not split and where computational efficiency is sacrificed for data efficiency.
I appreciate the answer of the authors. My concerns are addressed since the discussion will be expanded and some limitations have been explained. I will keep my score as it was.
The authors propose a method to merge prediction sets from different predictors, each being a valid prediction set with coverage rate at least $1 - \alpha$. They take advantage of the quantile envelope idea but make substantial changes. Most importantly, they develop a nontrivial method to correct the coverage rate for the intersection of many prediction sets, which seems to be a promising alternative to Bonferroni. Performance on real datasets is shown.
Strengths and Weaknesses
Strength: It is impressive to see the idea of splitting the calibration set into two (disjoint) parts, with the first part used to construct the score transformation for coverage correction and the second part providing a further correction for constructing the final prediction set. In my opinion, this is the most important contribution of this work to the study of conformal prediction.
Weaknesses:
- I do not think Theorem 1 precisely covers the setting of Algorithm 1. A complete proof of the correctness of Algorithm 1 must be provided. Theoretical guarantees are perhaps the most valuable feature of conformal prediction. Also, in the statement of the theorem, as well as in its proof (Appendix B) and elsewhere in the paper, the authors should carefully and clearly specify with respect to which random variables the probability is taken. I will raise my rating once these are done satisfactorily.
- The readability of some parts of the paper needs to be improved, especially on page 5. This is important, as this page contains the most essential idea of the paper. For example, $\mathbb{S}^{K-1}$ is not defined; just above Theorem 1, what does it mean by "with this choice of"? Does the one already contain the information of the other?
Questions
- What are the typical values of $\hat{q}$ in your numerical experiments? I guess it might be very close to 1 in many cases, as the random forest is one of the predictors and its performance is almost as good as that of the proposed method (Appendices H and I). What about the other coverage level?
- What are the typical correlations between scores from different predictors in your experiments? In general, the random forest and XGBoost could provide very similar scores.
- Overall, LASSO or OLS is not as good as more advanced algorithms, so if we drop them from the ensemble, does it really matter? In other words, under what situations will including such "simple" predictors help? (In terms of coverage and/or size of the resulting prediction set.)
- Is it possible to establish a proper concentration inequality that quantifies the fluctuation over a finite number of random directions $\{u_m\}$? Or, could you provide some empirical evaluation of the influence of $M$ on the coverage and size of the prediction set? (Similar to Table 4.)
Limitations
Yes.
Final Justification
The authors propose a theoretically justified method for aggregating multiple valid prediction regions, addressing a timely topic that has received considerable attention recently. The approach is novel and interesting. I believe it substantially advances the research in this direction.
Formatting Issues
None.
We thank the reviewer for their time and questions. We address each of the points successively below.
A complete proof of the correctness of Algorithm 1 must be provided. Theoretical guarantees…
We consider the modified form of the theorem statement and proof below, where we make explicit note of the data-splitting procedure and the random variables over which the probabilities are defined. In this presentation, we first provide the general theorem and proof for multivariate scores. We then state a corollary for the choice of $\varphi$ (defined in the theorem below) given by the quantile envelopes used in the paper.
Theorem
Suppose $(X_1, Y_1), \dots, (X_{N_C}, Y_{N_C}), (X_{N_C+1}, Y_{N_C+1})$ are exchangeable, where $(X_{N_C+1}, Y_{N_C+1})$ denotes the test point. Assume further that non-negative score maps $s_1, \dots, s_K : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ have been defined and a composite score $s(x, y) = (s_1(x, y), \dots, s_K(x, y)) \in \mathbb{R}^K$ is defined.
Let $\pi$ be a random permutation of the indices $\{1, \dots, N_C\}$, drawn uniformly and independently of the calibration data and the test point. Let the calibration set be partitioned via $\pi$ into $\mathcal{D}_C^{(1)}$ and $\mathcal{D}_C^{(2)}$, where $|\mathcal{D}_C^{(1)}| = N_1$ and $|\mathcal{D}_C^{(2)}| = N_2$. Let the corresponding score sets be $S^{(1)}$ and $S^{(2)}$. Let $\varphi_{S^{(1)}} : \mathbb{R}^K \to \mathbb{R}$ be a deterministic function for any given realization of $S^{(1)}$.
For some $\alpha \in (0, 1)$, let $\hat{q}$ be the $\lceil (1-\alpha)(N_2+1) \rceil$-th smallest value of the set of transformed scores $\{\varphi_{S^{(1)}}(s(X_i, Y_i)) : (X_i, Y_i) \in \mathcal{D}_C^{(2)}\}$. Assume that ties among the transformed scores occur with probability zero. Then, denoting by $\mathcal{C}(X_{N_C+1}) = \{y : \varphi_{S^{(1)}}(s(X_{N_C+1}, y)) \le \hat{q}\}$,
$$\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1})\big) \ge 1 - \alpha,$$
where the probability is defined over the joint draw of the data $\mathcal{D}_C$, the test point $(X_{N_C+1}, Y_{N_C+1})$, and the random permutation $\pi$.
Proof
The overall probability is taken over the joint distribution of the exchangeable data, $\mathcal{D}_C$ and $(X_{N_C+1}, Y_{N_C+1})$, and the independent random permutation $\pi$. We use the law of total probability by first conditioning on a specific realization of the permutation, $\pi$, and the data in the first split, $\mathcal{D}_C^{(1)}$. Given $\pi$ and $\mathcal{D}_C^{(1)}$, the score set $S^{(1)}$ is fixed. As a result, the function $\varphi_{S^{(1)}}$ becomes a fixed, deterministic transformation.
By the initial exchangeability of all data points, after conditioning on the values of the first split $\mathcal{D}_C^{(1)}$, the remaining calibration points in $\mathcal{D}_C^{(2)}$ and the test point are still an exchangeable sequence. Applying the fixed transformation to their scores yields an exchangeable sequence of scalar values:
$$\big\{\varphi_{S^{(1)}}(s(X_i, Y_i)) : (X_i, Y_i) \in \mathcal{D}_C^{(2)}\big\} \cup \big\{\varphi_{S^{(1)}}(s(X_{N_C+1}, Y_{N_C+1}))\big\}.$$
Under the no-ties assumption, the rank of the test value within this sequence is uniformly distributed on $\{1, \dots, N_2 + 1\}$. The test point is covered if its transformed score is less than or equal to the threshold $\hat{q}$. This occurs if and only if the rank of the test score is at most $\lceil (1-\alpha)(N_2+1) \rceil$. The probability of this event, conditional on $\pi$ and $\mathcal{D}_C^{(1)}$, is:
$$\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1}) \,\big|\, \pi, \mathcal{D}_C^{(1)}\big) = \frac{\lceil (1-\alpha)(N_2+1) \rceil}{N_2 + 1} \ge 1 - \alpha.$$
Since this guarantee holds for any realization $(\pi, \mathcal{D}_C^{(1)})$, the unconditional probability also holds by the law of total probability:
$$\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1})\big) = \mathbb{E}\Big[\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1}) \,\big|\, \pi, \mathcal{D}_C^{(1)}\big)\Big] \ge 1 - \alpha,$$
where the expectation is taken over the joint distribution of $\pi$ and $\mathcal{D}_C^{(1)}$. This completes the proof.
Corollary
The coverage guarantee of Theorem 1 holds for the $\varphi_{S^{(1)}}$ implicitly defined by Algorithm 1. Let $\{u_m\}_{m=1}^M \subset \mathbb{S}^{K-1}$ be the set of projection directions, drawn randomly and independently of the data and the random permutation $\pi$. Given a realization of the first score split, $S^{(1)}$, and the directions $\{u_m\}$, the directional quantile thresholds $\{r_m\}_{m=1}^M$ are determined by the procedure in lines 4-9 of Algorithm 1. The function is then explicitly defined for any score vector $v \in \mathbb{R}^K$ as
$$\varphi_{S^{(1)}}(v) = \max_{1 \le m \le M} \frac{u_m^\top v}{r_m}.$$
Proof
To prove the corollary, we must show that this specific function satisfies the conditions of Theorem 1. The overall probability is taken over the joint draw of the data ($\mathcal{D}_C$ and the test point), the random permutation $\pi$, and the random directions $\{u_m\}$. We use the law of total probability by conditioning on specific realizations of the random elements $\pi$, $\mathcal{D}_C^{(1)}$, and $\{u_m\}$.
Given these fixed realizations, the score set $S^{(1)}$ and the projection directions are fixed. The procedure in Algorithm 1 to find the base quantiles via binary search is a deterministic operation on this fixed data. Therefore, the function $\varphi_{S^{(1)}}$ becomes a fixed, deterministic function of $v$. The conditions of Theorem 1 are met (again assuming no ties among the transformed scores), and its proof implies that the conditional probability of coverage is at least $1 - \alpha$:
$$\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1}) \,\big|\, \pi, \mathcal{D}_C^{(1)}, \{u_m\}\big) \ge 1 - \alpha.$$
Since this guarantee holds for any realization $(\pi, \mathcal{D}_C^{(1)}, \{u_m\})$, the unconditional guarantee follows from the law of total probability:
$$\mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1})\big) \ge 1 - \alpha.$$
Thus, the guarantee holds for the specific procedure in Algorithm 1.
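For concreteness, below is a minimal Python sketch of this two-split procedure under the max-ratio gauge form stated above. All names (e.g., `make_phi`) and the toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_phi(S1, U, beta):
    """Build the deterministic gauge phi from the first score split S1.

    S1: (n1, K) score vectors; U: (M, K) unit directions; beta: envelope
    level. Returns phi(v) = max_m <u_m, v> / r_m, where r_m is the beta-
    quantile of the projections of S1 onto u_m (the assumed max-ratio
    envelope form; a guard avoids division by degenerate quantiles).
    """
    proj = S1 @ U.T                       # (n1, M) directional projections
    r = np.maximum(np.quantile(proj, beta, axis=0), 1e-12)
    return lambda V: (V @ U.T / r).max(axis=1)

# Toy setup: K = 3 nonnegative scores, calibration split into S1 / S2.
K, M, n1, n2, alpha = 3, 200, 100, 400, 0.1
U = rng.normal(size=(M, K))
U /= np.linalg.norm(U, axis=1, keepdims=True)
S1 = np.abs(rng.normal(size=(n1, K)))
S2 = np.abs(rng.normal(size=(n2, K)))

phi = make_phi(S1, U, beta=0.9)           # fit the gauge on the first split only
t2 = np.sort(phi(S2))                     # transformed second-split scores
k = int(np.ceil((1 - alpha) * (n2 + 1)))  # conformal rank
q_hat = t2[k - 1]                         # threshold on the phi-values

s_test = np.abs(rng.normal(size=(1, K)))  # score vector of a new point
print(q_hat, bool(phi(s_test)[0] <= q_hat))
```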
The readability of some parts of the paper needs to be improved…
We apologize for this lack of clarity; we were hoping the Appendix A accompaniment would clarify the exposition. For your specific notational comments, $\mathbb{S}^{K-1}$ denotes the unit sphere in $\mathbb{R}^K$, which we will make explicit where it is introduced in Section 2.2. For the other clarification, yes, these parameters are defined once the first score split $S^{(1)}$ has been determined.
What are the typical values of $\hat{q}$ in your numerical experiments?
In general, $\hat{q}$ should be fairly close to 1: if the quantile envelopes for $\mathcal{D}_C^{(1)}$ and $\mathcal{D}_C^{(2)}$ are similar, then $\hat{q} \approx 1$; $\hat{q} > 1$ if the envelope is overly tight; and $\hat{q} < 1$ if it is overly conservative. We show these values below for the experiments that were run in the paper. We also highlight that, while Appendix I suggests that the random forest achieved comparable performance to aggregation, this was only true in the simple UCI baseline experiments. In the more complex settings considered in the main text (Appendix G), our aggregation strategy outperformed the random forest predictor.
$\hat{q}$ across all tasks, first coverage level:
| Task ID | $\hat{q}$ |
|---|---|
| 361234 | 1.093155 |
| 361235 | 1.818106 |
| 361236 | 1.272447 |
| 361237 | 1.575767 |
| 361238 | 1.172660 |
$\hat{q}$ across all tasks, second coverage level:
| Task ID | $\hat{q}$ |
|---|---|
| 361234 | 1.029694 |
| 361235 | 1.779467 |
| 361236 | 1.297137 |
| 361237 | 1.433094 |
| 361238 | 1.148173 |
What are the typical correlations between scores from different predictors in your experiments?
Here are some examples of score correlations from the experiments. As expected, the linear and LASSO models can be highly correlated; CSA is most useful when predictor scores are uncorrelated.
(Correlation matrices for Task 361234 and Task 361235.)
Under what situations does including such “simple” predictors help?
You are correct that including such simple predictors should not help. However, we wanted to highlight that the method is not “thrown off” by the presence of these extra predictors, which we confirm in the LASSO ablations below. This is because, in a real use case, it may not be clear a priori which predictors are “overly simple”.
Task 361234
| $\alpha$ | With LASSO | Without LASSO |
|---|---|---|
| 0.05 | 8.0534 | 8.1016 |
| 0.025 | 9.0303 | 9.0375 |
| 0.01 | 10.2253 | 10.1544 |
Could you provide some empirical evaluation of the influence of $M$ on the coverage and size of the prediction set?
Intuitively, as $M \to \infty$, we would expect to recover the tightest cover and, thus, that the prediction region size should be roughly decreasing in $M$, with some plateau. This is borne out below. Across all choices of $M$, the coverages were identical, namely 0.981 for the $\alpha = 0.025$ case and 0.962 for $\alpha = 0.05$ (for task 361237).
| $M$ | Length ($\alpha = 0.025$) | Length ($\alpha = 0.05$) |
|---|---|---|
| 50 | 28.037 | 23.017 |
| 100 | 27.111 | 22.071 |
| 500 | 25.880 | 22.621 |
| 1000 | 26.110 | 22.172 |
| 5000 | 24.943 | 22.037 |
I am satisfied with the authors' response. The new proofs are clear. I will raise my score later.
There is only one point: since only the lower bound on coverage is of concern, there is no need to assume there are no ties. 1) Ties are possible even when the response is continuous (e.g., for tree-based predictors). 2) Usually, the no-ties assumption is introduced to derive an upper bound. It is optional, but the authors are encouraged to establish one if they keep this assumption.
We greatly thank the reviewer for raising these additional points. Indeed, as suggested, we can remove the no-ties assumption for establishing the lower bound and will do so in the camera-ready version of the paper. We will also include the upper-bound proof under the additional no-ties assumption, as suggested by the reviewer.
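For reference, a sketch of the form such an upper bound would take (under the no-ties assumption; this is the anticipated statement, not the final camera-ready result):

$$1 - \alpha \;\le\; \mathbb{P}\big(Y_{N_C+1} \in \mathcal{C}(X_{N_C+1})\big) \;\le\; 1 - \alpha + \frac{1}{N_2 + 1},$$

since, conditional on $\pi$ and $\mathcal{D}_C^{(1)}$, the no-ties assumption makes the rank of the test score exactly uniform on $\{1, \dots, N_2 + 1\}$, so the conditional coverage equals $\lceil (1-\alpha)(N_2+1) \rceil / (N_2+1)$.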
Thanks for your efforts. I don’t have any further questions.
Summary:
The paper introduces a novel method called Conformal Score Aggregation (CSA) to enhance uncertainty estimation in ensemble models using conformal prediction. While existing approaches aggregate prediction regions after conformalization, potentially causing overly conservative regions, CSA instead aggregates predictions in the score space itself. The main contribution is using a multivariate score function and defining prediction regions via quantile envelopes. This enables tighter, more data-driven prediction regions. Through empirical evaluations, the authors demonstrate that their method achieves distribution-free coverage guarantees, is less conservative, and improves efficiency in both classification and predict-then-optimize regression tasks.
Strengths and Weaknesses
Strengths:
The introduction of multivariate score functions and quantile envelopes in conformal prediction is innovative and addresses limitations in existing methods.
The method leverages quantile envelopes and concepts of higher-dimensional quantiles for aggregation, enhancing predictive efficiency.
The authors conduct experiments on both classification (ImageNet) and regression benchmarks, providing strong evidence of CSA's practical advantages.
CSA can be applied across both classification and regression tasks, unlike some existing methods restricted to specific scenarios.
Weaknesses:
The approach involves multiple stages (splitting calibration sets, directional quantile computation). The authors should comment on the computational overhead, especially with larger ensemble sizes.
The necessity of tuning (e.g., binary search for optimal quantiles) could limit practical scalability.
Performance is likely sensitive to the calibration set split (e.g., between $\mathcal{D}_C^{(1)}$ and $\mathcal{D}_C^{(2)}$); a suboptimal split could substantially affect predictive efficiency and coverage.
While the paper extensively compares to previous methods, authors should comment extensively on the computational cost relative to these alternatives.
Questions
Questions for Authors:
How does computational complexity scale with increasing ensemble size (K) or calibration dataset size (NC)? Could the method become infeasible at larger scales?
How sensitive is CSA to the choice of projection directions {u_m}? Would alternative direction-sampling methods improve performance further?
How should practitioners optimally select the calibration split ratio (DC^(1):DC^(2)) in practice, and how sensitive are results to variations in this choice?
Limitations
yes
Final Justification
Initially, I mistakenly uploaded a review of another paper and added the review of the submitted paper in an official comment, since I was unable to edit the main review at the time. I am now able to edit the main review. I have not yet received a rebuttal from the authors on the new review and will update this once I do.
Formatting Issues
no
We thank the reviewer for their time and questions. We address each of the points successively below.
How does computational complexity scale with increasing ensemble size (K) or calibration dataset size (NC)? Could the method become infeasible at larger scales?
The computational complexity is largely driven by the increasing number of directions one may need in order to capture complex envelope shapes in higher-dimensional spaces. However, computing the quantile is extremely efficient, since there are two primary aspects of the method: (1) computing the envelope threshold and (2) checking whether future points fall into this envelope. Importantly, both of these computations can be vectorized, which makes them extremely fast. Please see Appendix D for a full discussion of the vectorization, which highlights how the results can be efficiently computed even for large ensembles.
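As a rough illustration of why this is fast, both steps reduce to a single matrix product against the direction matrix. This is a hypothetical NumPy sketch; `U`, `r`, and `q_hat` stand in for the fitted directions, directional quantiles, and conformal threshold, and are not names from the paper's code.

```python
import numpy as np

def envelope_membership(S_test, U, r, q_hat):
    """Vectorized membership check for a batch of score vectors.

    S_test: (N, K) test score vectors; U: (M, K) unit directions;
    r: (M,) directional quantiles; q_hat: scalar conformal threshold.
    One (N, K) x (K, M) matmul covers all points and directions at once,
    so the cost is O(N * M * K) flops.
    """
    gauge = (S_test @ U.T / r).max(axis=1)  # (N,) transformed scores
    return gauge <= q_hat                    # boolean coverage mask
```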
To demonstrate the computational efficiency of this approach, we reran the experiment on the “Parkinsons” UCI task (from Appendix I) with varying numbers of predictors ($K$) and projection directions ($M$); for each combination, we measured the total time taken to compute the quantile (i.e., to run Algorithm 1) and to perform the projection to assess coverage for the test points. The additional predictors were taken to be random forests with different numbers of trees. $K$ is given in the left column and $M$ in each column heading, with the entry for each pair reported in seconds.
| $K$ \ $M$ | 10 | 100 | 1000 | 10000 |
|---|---|---|---|---|
| 6 | 0.111668 | 0.373029 | 2.32803 | 37.2561 |
| 8 | 0.0961056 | 0.327051 | 2.16211 | 36.7216 |
| 10 | 0.117146 | 0.373464 | 2.73875 | 37.2603 |
| 12 | 0.123772 | 0.384527 | 2.35386 | 37.0735 |
This highlights how the computational cost is minimal and is independent of $K$ when isolated from the direction count $M$.
The necessity of tuning (e.g., binary search for optimal quantiles) could limit practical scalability.
We find that the binary search typically converges within 5-10 iterations, with each iteration being extremely fast to compute, as it simply assesses coverage in the fashion described in the previous answer. The overall process therefore completes on the order of ~10 seconds and will scale with $M$ according to the same empirically observed rates as those shown in the previous answer.
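Schematically, the search has the following shape. This is a sketch assuming empirical coverage is monotone in the searched level; `coverage_at` is a hypothetical callable wrapping the vectorized membership check above.

```python
def bisect_level(coverage_at, target, lo=0.0, hi=1.0, iters=10):
    """Binary search for the smallest level whose empirical coverage
    meets the target; each iteration is one vectorized coverage check."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if coverage_at(mid) < target:
            lo = mid  # too little coverage: grow the envelope
        else:
            hi = mid  # enough coverage: try to tighten
    return hi
```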
How sensitive is CSA to the choice of projection directions {u_m}? Would alternative direction-sampling methods improve performance further?
Empirically, we observed that if the number of directions $M$ is large enough, the performance is stable. Intuitively, as $M \to \infty$, we would expect to recover the tightest cover and, thus, that the prediction region size should be roughly decreasing in $M$, with some plateau. This is borne out below. Across all choices of $M$, the coverages were identical, namely 0.981 for the $\alpha = 0.025$ case and 0.962 for $\alpha = 0.05$ (for task 361237).
| $M$ | Length ($\alpha = 0.025$) | Length ($\alpha = 0.05$) |
|---|---|---|
| 50 | 28.037 | 23.017 |
| 100 | 27.111 | 22.071 |
| 500 | 25.880 | 22.621 |
| 1000 | 26.110 | 22.172 |
| 5000 | 24.943 | 22.037 |
Alternative direction-sampling schemes may further help by avoiding the sampling of redundant directions; however, as discussed in the main paper, deterministically placing evenly spaced points on a hypersphere is a classically difficult problem, which is why this stochastic sampling approach was taken. Nonetheless, an interesting improvement could be a sampling method with faster convergence to Unif($\mathbb{S}^{K-1}$), though the empirical results suggest that this would likely have minimal impact on the end results.
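For reference, the stochastic sampling used here is the standard Gaussian-normalization construction (a minimal sketch; the function name is ours):

```python
import numpy as np

def sample_directions(M, K, seed=None):
    """Draw M directions uniformly on the unit sphere in R^K by
    normalizing i.i.d. Gaussian vectors; rotational invariance of the
    Gaussian makes the normalized draws exactly uniform on the sphere."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(M, K))
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```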
How should practitioners optimally select the calibration split ratio (DC^(1):DC^(2)) in practice, and how sensitive are results to variations in this choice?
Empirically, we observed that a 20/80 split works well across our experimental setups. In general, it is more important to have a large $\mathcal{D}_C^{(2)}$ to retain the classical upper-bound guarantees of conformal prediction that restrict conservativeness. The size of $\mathcal{D}_C^{(1)}$ largely dictates how accurately the quantile envelope contour can be reconstructed, meaning it becomes more relevant with increasing $K$. In the same way that conformal prediction typically prescribes an 80/20 training/calibration split, with practitioners varying around this guideline based on how much training accuracy they are willing to sacrifice or how sample-efficient they believe their model is, we prescribe a 20/80 split as a general guideline, with practitioners who have heterogeneous ensembles advised to allocate more to $\mathcal{D}_C^{(1)}$, e.g. a 30/70 split.
Thank you for the rebuttal. I maintain my positive score.
This paper aims to improve conformal prediction for ensemble models. While existing work has already proposed aggregating ensemble prediction regions to retain coverage guarantees, this paper introduces a new framework called Conformal Score Aggregation (CSA). CSA leverages quantile envelopes to enable data-driven uncertainty estimation. The method can be directly incorporated into Predict-Then-Optimize tasks. It is evaluated on both regression and classification tasks and demonstrates strong performance.
Strengths and Weaknesses
Strengths:
The paper is well-written and well-organized.
The method provides theoretical guarantees for convergence.
It is applicable to both regression and classification tasks.
Weaknesses:
I am not very familiar with this line of work, and I found the paper quite difficult to understand. Section 3.1.1 is especially complex, and I am unsure what the main novelty is and why it is important.
I suggest adding illustrative figures or diagrams to help visualize the process of conformal prediction in ensemble models. This would also help clarify the specific contributions and steps introduced in the paper.
The phrase “improving efficiency” in the title is unclear to me. Does it refer to optimization speed or something else? Clarification would be helpful.
I also found the evaluation results hard to interpret. For instance, what evaluation metrics are used for the regression and classification tasks?
While the abstract mentions uncertainty quantification, the paper does not clearly demonstrate how uncertainty is estimated, nor does it provide an evaluation of the uncertainty quantification quality.
I may improve my score based on further discussion.
Questions
Please see above for the weaknesses. All the questions are there.
Limitations
yes.
Final Justification
Most of my concerns are addressed. I will raise my score.
Formatting Issues
no.
We thank the reviewer for their time and questions. We address each of the points successively below. Two of the questions are related, so we address these concerns first and then address the remaining ones.
- The paper does not clearly demonstrate how uncertainty is estimated, nor does it provide an evaluation of the uncertainty quantification quality.
- The phrase “improving efficiency” in the title is unclear to me.
In the conformal prediction community, the goal is to construct prediction regions $\mathcal{C}(X)$ that achieve some pre-determined level of marginal coverage $1 - \alpha$, i.e. $\mathbb{P}(Y \in \mathcal{C}(X)) \ge 1 - \alpha$. This is the sense in which the conformal prediction community does “uncertainty estimation”: the replacement of point predictions with $\mathcal{C}(X)$ gives a sense of how that prediction could be wrong (i.e., by any deviation that remains within the prediction region). To make such uncertainty estimates maximally informative, we wish to achieve this coverage guarantee while minimizing the average size of such regions, which is referred to as the “predictive efficiency” of the procedure. This is precisely the “efficiency” referenced in the title. Evaluating different methods of uncertainty quantification therefore occurs in two stages: demonstrating that the coverage guarantee holds and, if so, comparing the average prediction region sizes.
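As a concrete illustration, this two-stage evaluation reduces to a few lines (a generic sketch, not the paper's evaluation code; `pred_sets` is a hypothetical list of prediction sets):

```python
import numpy as np

def evaluate(pred_sets, y_test):
    """Two-stage conformal evaluation: (1) check that empirical marginal
    coverage meets the nominal 1 - alpha level; (2) among methods that
    do, compare average prediction-set size (predictive efficiency)."""
    coverage = np.mean([y in C for C, y in zip(pred_sets, y_test)])
    avg_size = np.mean([len(C) for C in pred_sets])
    return coverage, avg_size
```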
I am unsure what the main novelty is and why it is important.
As elaborated in our Related Works discussion, there is now significant interest in producing these prediction regions for model ensembles. The naive strategy is simply to conformalize the ensembled prediction; however, several previous works have demonstrated that improvements (in the sense of creating smaller prediction regions) are possible by instead conformalizing the models first and then aggregating the individual models' prediction regions.
These aggregation strategies, however, perform poorly (producing overly large prediction regions) when the uncertainty across models varies significantly. Our method takes a completely different route for aggregation to tackle this issue. Specifically, our core contribution was to define a multivariate score $s(x, y) \in \mathbb{R}^K$ for an ensemble with $K$ models and to demonstrate both that the probabilistic coverage guarantees of conformal prediction can be extended by introducing a multivariate quantile envelope in place of the classical scalar threshold (from which a partial ordering over $\mathbb{R}^K$ can be defined and used to generalize the definition of $\mathcal{C}(X)$) and that the resulting prediction regions are more “efficient” than those of alternate aggregation strategies, in both classification and regression settings. Extending the coverage guarantees was highly non-trivial and is what is presented in Section 3.1.1, which involved defining a partial ordering using ideas from quantile envelopes and a non-obvious data-splitting step to ensure this partial ordering produces prediction regions that are as tight as possible (confirmed to be necessary for coverage in the ablation shown in Section 4.2).
Additionally, high-dimensional prediction regions in regression settings often cannot be used directly. One common use case is performing robust decision making downstream of a conformalized predictor. Section 3.2, therefore, was another significant contribution, highlighting that our approach can be used in decision-making frameworks. This relies critically on some of the modeling choices in how we constructed the quantile envelopes above and would not follow trivially if these details changed, in particular in ensuring the constraints can be expressed within the framework of convex optimization. Notably, none of the other aggregation strategies lend themselves to use in this fashion, as they do not take on a similarly explicitly decomposable form, rendering it impossible to efficiently solve the resulting problems using convex solvers.
For instance, what evaluation metrics are used for the regression and classification tasks?
As discussed above, the metric of interest in the conformal prediction literature is, assuming a method achieves the desired coverage, how small the average prediction region size is, although how this is measured in regression settings is sometimes non-obvious. In the classification case, one can simply count, for a given $x$, how many classes comprise $\mathcal{C}(x)$. This is what is shown in Table 1: we demonstrate that, for three distinct choices of $\alpha$ (0.1, 0.05, and 0.01), our method attains the theoretically demonstrated coverage guarantee and produces significantly smaller prediction regions than the competitor aggregation strategies and the strategy of conformalizing the ensembled predictor.
In scalar regression settings (i.e. $\mathcal{Y} \subseteq \mathbb{R}$), prediction region sizes can similarly be assessed by discretizing the output space and checking which subset of the discretized points falls into $\mathcal{C}(x)$. The coverages and region sizes, as measured by summing over the covered discretized space, are again what are shown in Table 2, across a wide collection of benchmark tasks and for two different values of $\alpha$ (0.05 and 0.025).
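For instance, in one dimension the size computation amounts to the following (a hypothetical sketch; `in_region` stands in for the membership test $y \in \mathcal{C}(x)$):

```python
import numpy as np

def region_size_1d(in_region, y_grid):
    """Approximate the size of a 1-D prediction region by summing the
    measure of covered cells on a uniform discretization of the output
    space; in_region is the membership test y -> [y in C(x)]."""
    mask = np.array([in_region(y) for y in y_grid])
    return mask.sum() * (y_grid[1] - y_grid[0])
```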
As mentioned earlier, however, many regression problems have multi-dimensional output spaces. In these cases, two issues arise. First, estimating the volume of the prediction region (outside of some simple, analytically computable cases) becomes infeasible due to the exponentially large grids required. More importantly, it is often not of practical interest to enumerate a discrete collection of points lying in the prediction region in these high-dimensional cases. These two points, therefore, suggest indirect measures of the prediction region size as being of greater interest in high-dimensional settings.
As discussed in the first answer, these high-dimensional regression regions are often useful when employed in robust decision-making pipelines. The standard conformal decision-making framework seeks to solve $\min_z \max_{y \in \mathcal{C}(x)} c(z, y)$ as a robust proxy to solving the problem under the true, but unknown, problem parameter $y^\star$, i.e. $\min_z c(z, y^\star)$. In robust decision making, a larger prediction region leads to a more conservative decision, which results in a higher suboptimality gap relative to the oracle decision. A normalized version of this suboptimality gap is precisely the indirect measure of interest for comparing methods in high-dimensional settings and is what we use to assess and demonstrate improvements in the extremely high-dimensional problem of Section 4.3.
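To illustrate why a decomposable region matters here, consider the special case of a linear cost with a generic polytope $\{y : Ay \le b\}$ standing in for the envelope-induced region: LP duality collapses the inner maximization, leaving a single convex program. The sketch below uses cvxpy with fabricated toy data; the simplex constraint on $z$ and all matrices are illustrative assumptions, not the paper's formulation.

```python
import cvxpy as cp
import numpy as np

# Toy polyhedral prediction region {y : A y <= b}, bounded via a box.
n = 4
rng = np.random.default_rng(0)
G = rng.normal(size=(8, n))
A = np.vstack([G, np.eye(n), -np.eye(n)])
b = np.concatenate([np.abs(G).sum(axis=1), np.ones(n), np.ones(n)])

# Robust problem: min_z max_{y : A y <= b} y^T z with z on the simplex.
# By LP duality, max_{y : A y <= b} y^T z = min_{lam >= 0, A^T lam = z} b^T lam,
# so the min-max becomes a single joint minimization over (z, lam).
z = cp.Variable(n)
lam = cp.Variable(A.shape[0], nonneg=True)
problem = cp.Problem(
    cp.Minimize(b @ lam),
    [A.T @ lam == z, cp.sum(z) == 1, z >= 0],
)
problem.solve()
print(z.value, problem.value)
```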
I suggest adding illustrative figures or diagrams to help visualize the process of conformal prediction in ensemble models.
We were hoping the visual accompaniment provided in Appendix A would clarify the textual description in Section 3.1.1, but we would gladly extend it to further clarify parts of the method that remain opaque.
I think most of my concerns are addressed. I will raise my score.
This paper presents a method for aggregating prediction regions based on scores produced by the individual models in an ensemble. Earlier work merges prediction sets, each of which is generated by an individual model. The current paper instead considers a vector of scores and leverages a quantile envelope to construct the final prediction set. All of the reviewers feel that the paper is well written and that the method, referred to as CSA, enhances predictive efficiency. The authors did a good job of responding to the reviewers' comments, leading two of the reviewers to raise their overall scores. I believe this work is a promising line of research toward conformal prediction for ensembles.