PaperHub
Score: 6.6 / 10
Poster · 4 reviewers (min 3, max 4, std 0.5)
Ratings: 4, 3, 3, 4
ICML 2025

Volume Optimality in Conformal Prediction with Structured Prediction Sets

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

This work provides conformal prediction methods with restricted volume optimality in both the unsupervised setting and the supervised setting.

Abstract

Conformal Prediction is a widely studied technique to construct prediction sets of future observations. Most conformal prediction methods focus on achieving the necessary coverage guarantees, but do not provide formal guarantees on the size (volume) of the prediction sets. We first prove the impossibility of volume optimality where any distribution-free method can only find a trivial solution. We then introduce a new notion of volume optimality by restricting the prediction sets to belong to a set family (of finite VC-dimension), specifically a union of $k$-intervals. Our main contribution is an efficient distribution-free algorithm based on dynamic programming (DP) to find a union of $k$-intervals that is guaranteed for any distribution to have near-optimal volume among all unions of $k$-intervals satisfying the desired coverage property. By adopting the framework of distributional conformal prediction (Chernozhukov et al., 2021), the new DP based conformity score can also be applied to achieve approximate conditional coverage and conditional restricted volume optimality, as long as a reasonable estimator of the conditional CDF is available. While the theoretical results already establish volume-optimality guarantees, they are complemented by experiments that demonstrate that our method can significantly outperform existing methods in many settings.
Keywords
Conformal Prediction · Volume Optimality · Structured Prediction Sets

Reviews and Discussion

Official Review
Rating: 4

The submission studies volume minimization in conformal prediction. It proposes a dynamic programming algorithm to construct prediction sets with marginal (and approximate conditional) coverage. The sets returned by the method have provably optimal volume within the class of union of k intervals. Experiments on synthetic datasets support the claim that the proposed method minimizes the interval length compared to existing alternatives.

Questions for Authors

I do not have any further questions.

Claims and Evidence

Evidence supports claims.

Methods and Evaluation Criteria

The experiments support the theoretical claims made in the paper. Experiments on real-world datasets would additionally support the practical utility of the proposed method.

Theoretical Claims

I reviewed the proofs of Theorem 2.1 and Proposition 2.3.

Experimental Design and Analysis

Experiments use synthetic distributions published in existing literature.

Supplementary Material

The proofs discussed above.

Relation to Broader Scientific Literature

Volume optimization is a topic that has gained significant attention in conformal inference literature. The submission bridges concepts of dynamic programming with distributional conformal prediction, which are well-established in their respective sub-areas.

Essential References Not Discussed

All essential references are discussed.

Other Strengths and Weaknesses

Strengths

  • The paper is well-written
  • The topic of volume optimization in conformal inference is timely
  • Being able to construct sets that are union of intervals is important for multimodal distributions

Weaknesses

  • Experiments on real-world datasets would strengthen the utility of the proposed method

I will expand on a few comments and questions and I am looking forward to discussing with the authors!

Other Comments or Suggestions

Prior works on volume minimization

Kiyani et al, 2024 "Length Optimization in Conformal Prediction" also studies optimality of prediction sets with an efficient algorithm for hypothesis classes with bounded complexity. It would be important to discuss the relation between the theoretical findings in this submission and this previous work. Furthermore, comparing the two algorithms would strengthen evidence.

Teneggi et al, 2023 "How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control" also considers mean interval length minimization for conformal risk-control in high-dimensional settings.

Connections with hypothesis testing

Could the authors expand on this connection?

Proposed algorithm

A brief high-level description of the intuition behind the DP algorithm in the main text would increase accessibility of the submission to a broader audience.

Is this algorithm optimal in terms of complexity? Or could, for example, the last step of selecting the solution with minimum volume be avoided?

In Proposition 2.3, what is the role of point 1? How does this property of the algorithm play a role in the rest of the contributions?

Experiments

A comparison of runtimes would help practitioners understand the tradeoffs between the different methods.

--

Minor Comments

  • Lines 27-30, right column: CP also provides a coverage upper-bound, which guarantees the sets are not trivial.
  • Lines 119-127, right column: I understand $P^{n+1}$ is the measure of the test point. For $P^{n} \times \lambda$, does $P^{n}$ mean the measure of all previously observed points?
  • Lines 292-294, right column: it is true that the construction is only needed for the calibration points, but it also needs to be performed for each new point at inference, which can be expensive.
Author Response

We thank the reviewer for the detailed suggestions and for pointing us to additional related work. We will make these changes to improve the paper and include discussions of the related works.

  • Experiments on real-world datasets would strengthen the utility of the proposed method

We implement our method and baselines on real-world datasets (see the response to Reviewer UBed for details). We also include the runtime comparison.

  • A comparison of runtimes would help practitioners understand the tradeoffs between the different methods.

We will include the runtime comparison in the revision.

  • Kiyani et al, 2024 "Length Optimization in Conformal Prediction" also studies optimality of prediction sets with an efficient algorithm for hypothesis classes with bounded complexity. It would be important to discuss the relation between the theoretical findings in this submission and this previous work. Furthermore, comparing the two algorithms would strengthen evidence.

We have a discussion about Kiyani et al, 2024 in Appendix A. We will include the comparison with them in the revision.

  • Teneggi et al, 2023 "How to Trust Your Diffusion Model: A Convex Optimization Approach to Conformal Risk Control" also considers mean interval length minimization for conformal risk-control in high-dimensional settings.

We will discuss Teneggi et al, 2023 in the related work section.

  • Connections with hypothesis testing: Could the authors expand on this connection?

Since both coverage and volume are two measures of sets (one is the data generating probability and the other is the Lebesgue measure), the volume optimality problem subject to coverage property is naturally connected to hypothesis testing between the two measures. This is exactly how we prove the impossibility result in the unsupervised setting.
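To make this connection concrete, the unsupervised problem can be written as a Neyman–Pearson-type program between the data-generating measure $P$ and the Lebesgue measure $\lambda$ (our paraphrase of the connection, not a statement from the paper):

```latex
\min_{C}\; \lambda(C)
\qquad \text{subject to} \qquad P(C) \ge 1 - \alpha .
```

When $P$ has density $p = dP/d\lambda$, the Neyman–Pearson lemma identifies the optimum as a density level set $C^* = \{x : p(x) \ge t_\alpha\}$ with $t_\alpha$ chosen so that $P(C^*) = 1 - \alpha$; finding such level sets distribution-free is therefore as hard as testing between $P$ and $\lambda$.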

Such a connection also holds when one considers conditional coverage and conditional volume optimality, and a similar impossibility result would hold.

More generally, one could consider (conditional) coverage and (conditional) restricted volume optimality. This is exactly the conformal prediction framework that we consider in the paper. While we proved that the proposed DP method satisfies both guarantees, it would be interesting to also study the sample complexity of the problem. To be more specific, one could ask if the coverage slack of order $\sqrt{(k+\log n)/n}$ in Theorem 2.5 is optimal or not. We believe the lower bound approach for this problem could also be tackled via the hypothesis testing connection. This will be a very interesting direction to explore in the future.

  • Is this algorithm optimal in terms of complexity? Or could, for example, the last step of selecting the solution with minimum volume be avoided?

The final step of selecting the minimum volume solution can indeed be avoided by maintaining the best solution directly within the DP table. However, this does not improve the overall time complexity, since the last step takes only $O(kn/\gamma)$ time. In practice, it is still possible to implement this algorithm approximately but more efficiently through various tricks, for example by bucketing the samples.
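To make the flavor of such an interval DP concrete, here is a small self-contained sketch. It is an illustrative reconstruction under our own simplifying choices (function name, state layout, and exact-coverage formulation are ours), not the paper's Algorithm 1: it computes the minimum total length of a union of at most $k$ intervals, with endpoints at sample points, covering a prescribed fraction of a one-dimensional sample.

```python
import math

def min_volume_k_intervals(points, k, coverage):
    """Minimum total length of a union of at most k intervals (endpoints at
    sample points) covering at least ceil(coverage * n) of the n points.

    Illustrative dynamic program over the sorted sample; a point inside an
    open interval is always counted as covered.
    """
    xs = sorted(points)
    n = len(xs)
    m = math.ceil(coverage * n)          # number of points that must be covered
    INF = float("inf")
    # State: (intervals opened so far, points covered (capped at m), inside?)
    dp = {(0, 0, 0): 0.0}                # first point skipped
    if k >= 1:
        dp[(1, 1, 1)] = 0.0              # an interval starts at the first point
    for i in range(1, n):
        gap = xs[i] - xs[i - 1]
        nxt = {}
        def relax(state, val):
            if val < nxt.get(state, INF):
                nxt[state] = val
        for (t, c, inside), val in dp.items():
            if inside:
                relax((t, min(c + 1, m), 1), val + gap)    # extend across the gap
                relax((t, c, 0), val)                      # close; skip point i
                if t < k:
                    relax((t + 1, min(c + 1, m), 1), val)  # close; reopen at point i
            else:
                relax((t, c, 0), val)                      # skip point i
                if t < k:
                    relax((t + 1, min(c + 1, m), 1), val)  # open an interval at point i
        dp = nxt
    return min((v for (t, c, _), v in dp.items() if c >= m), default=INF)
```

Because the state space has size $O(kn)$ per scanned point, the best feasible solution can be read off the final table directly, which is the "maintain the best solution within the DP table" idea above.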

  • In Proposition 2.3, what is the role of point 1? How does this property of the algorithm play a role in the rest of the contributions?

Points 1 & 2 jointly guarantee that the output of the DP algorithm achieves the coverage and the restricted volume optimality with respect to the empirical distribution. Then, together with the uniform concentration, it leads to the restricted volume optimality with respect to the distribution of the testing sample in Point 2 of Theorem 2.5.

  • Lines 27-30, right column: CP also provides a coverage upper-bound, which guarantees the sets are not trivial.

You are correct. We will change the wording here in the revision.

  • Lines 119-127, right column: I understand $P^{n+1}$ is the measure of the test point. For $P^{n} \times \lambda$, does $P^{n}$ mean the measure of all previously observed points?

We have training data $X_1, \dots, X_n$ and testing data $X_{n+1}$. $P^{n+1}$ is the joint distribution of all $n+1$ samples. In $P^{n} \times \lambda$, $P^{n}$ means the joint distribution of the training data with $n$ samples.

  • Lines 292-294, right column: it is true that the construction is only needed for the calibration points, but it also needs to be performed for each new point at inference, which can be expensive.

The computational bottleneck is the construction of the nested system $\{S_j(x)\}_{j\in[m]}$. It needs to be constructed for all calibration points $\{X_{n+1}, \dots, X_{2n}\}$, together with the testing point $X_{2n+1}$. These $n+1$ nested systems are sufficient to evaluate the conformity score at each $y \in \mathbb{R}$. On the other hand, if one has more than one testing point, say $X_{2n+1}, \dots, X_{2n+r}$, then the nested systems will need to be computed for each individual one, and you are correct that this may be expensive when $r$ is large.
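For readers unfamiliar with the split-conformal mechanics discussed above, here is a minimal sketch of a prediction set built from a generic DCP-style conformity score $|\hat F(y \mid x) - 1/2|$, in the spirit of Chernozhukov et al. (2021). This illustrates only the generic framework, not the paper's DP-based score; the function name and arguments are our own.

```python
import math

def dcp_prediction_set(cdf_hat, calib, x_test, y_grid, alpha=0.1):
    """Split-conformal prediction set from a DCP-style conformity score
    |F_hat(y | x) - 1/2| (framework of Chernozhukov et al., 2021).

    cdf_hat(x, y): any estimator of the conditional CDF P(Y <= y | X = x).
    calib: list of (x, y) calibration pairs, disjoint from the training data.
    """
    # Conformity scores on the calibration set: distance of the estimated
    # conditional rank of y from 1/2 (small score = "central" label).
    scores = sorted(abs(cdf_hat(x, y) - 0.5) for x, y in calib)
    n = len(scores)
    # Finite-sample corrected (1 - alpha) empirical quantile of the scores.
    rank = math.ceil((n + 1) * (1 - alpha))
    tau = scores[min(rank, n) - 1]
    # The prediction set: all candidate labels within the score threshold.
    return [y for y in y_grid if abs(cdf_hat(x_test, y) - 0.5) <= tau]
```

With a monotone CDF estimator, the level set of this simple score is always a single interval (the $k = 1$ case); the point of the paper's DP-based score is to produce level sets that are unions of $k$ intervals while keeping the same split-conformal calibration step.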

Reviewer Comment

I sincerely thank the authors for their thoughtful consideration of the comments raised by all reviewers!

The real-data results strengthen evidence in support of the proposed method, and the authors' response clarified my questions.

I am happy to recommend acceptance, I kindly ask the authors to address the outstanding concerns by Reviewers hJ91 and UBed regarding title and relation with existing works in the revised version of the paper, which are important points.

Official Review
Rating: 3

This paper studies the problem of providing guarantees regarding the efficiency (i.e., small size/volume and thereby informativeness) of conformal prediction (CP) sets. The paper first presents an impossibility result for volume optimality where any CP method can only find a trivial solution. Then, the paper introduces a definition of "restricted volume optimality" where the CP sets are restricted to belong to a set family of finite VC-dimension, specifically sets that are a union of $k$ intervals. The main algorithm presented leverages dynamic programming and a conformity score based on distributional conformal prediction (Chernozhukov et al., 2021) and nested prediction sets; analysis is given regarding approximate conditional coverage and conditional restricted volume optimality when a 'good' estimator of the conditional CDF is available.

Update after rebuttal

My primary concern was the lack of real-data results; the authors addressed this in their rebuttal with experimental results on standard UCI tabular datasets, so I raised my score to weak/borderline accept. Unfortunately, the authors did not respond to the remaining secondary questions/concerns I had, so I'll restate those here to highlight that they were not resolved. While I decided to raise my score before hearing back from the authors, because I overall think the reasons to accept outweigh those to reject, it would have been nice to get clarity on these points to feel more certain about my recommendation.

A couple follow-up questions: (a) Can you confirm what the target coverage was for the supervised setting UCI datasets? It looks like it was 70%, but just checking. (b) If so, why are the new experiments being run with different target coverage levels? One might worry that multiple target coverage levels were tried for each, and only the best results are reported. In a final version, I think it'd be helpful/important to provide at least a couple examples where the experiments are repeated with a grid of different target coverages (or, even a calibration plot over these values), to see how results vary (or not) accordingly. Especially since end-users are typically most interested in higher target coverage levels, eg, 90-95%.

My last remaining concerns were regarding the title (see Claims and Evidence) and references (see Essential References Not Discussed and Relation To Broader Scientific Literature). These have not been addressed by the authors' response, though perhaps due to character limits. I hope that the authors will (1) consider updating the title as I have suggested, especially as reviewer hJ91 had a similar concern about the title being misleading, and (2) incorporate further discussion of some related work on other approaches to improving the efficiency of prediction sets.

Questions for Authors

Overall, as I've mentioned, my main concern for this paper is that some experiments on real datasets (even simple tabular UCI datasets) should be provided, in addition to the synthetic-data experiments. Secondarily, addressing my comments on mentioning other related literature on combining CP sets, as well as revising certain informalities in the writing, would improve it in my view.

If some real-data experiments (e.g., on say 3 UCI datasets) were added with results that supported those of the current experiments, I would consider improving my evaluation of the paper; addressing the secondary concerns around mentioning related literature and revising the writing would further help.

Claims and Evidence

The main claims appear to be reasonable and supported by evidence. However, I think that the title (which can be viewed as a claim itself) is perhaps too broad for the scope of the paper, in two ways:

(1) Given that a number of other works have studied optimizing for CP efficiency (e.g., by selection, designing conformity scores, aggregating prediction sets, analyzing efficiency, etc.), and given that the main contributions of the paper focus on the notion of "restricted volume optimality," I think it would be more accurate if "Restricted" were added to the title, e.g., either as "Restricted Volume Optimality in …" or "Volume Optimality in Conformal Prediction with Restricted Prediction Sets."

(2) Secondly, I'd suggest removing "Structured" from the title, or at the very least clarifying in the paper what is meant by structured (i.e., it seems to mean "restricted," in the sense defined). That is, the word "structure/structured" appears only in the title and in one subheading; it appears never to be used in the main text of the paper, even to explain what is meant by it. When I first saw the paper's title, I thought that "Structured" meant multidimensional labels as in structured-prediction tasks, and I think that others could similarly have misaligned expectations.

Methods and Evaluation Criteria

A main limitation of the paper's evaluation is that the experiments appear to evaluate only on synthetic datasets (e.g., Gaussian data, Censored Gaussian, Mixture of Gaussians, ReLU-Transformed Gaussians). The paper's evaluation would be strengthened by also including real datasets, even simple tabular ones, e.g., from the UCI ML repository. Including evaluation on real data is arguably especially important for this paper, as some of the methods rely on kernel density estimation (KDE), which is expected to perform well on synthetic Gaussian or mixture-of-Gaussian data but could perform worse on data with more complex distributions.

Theoretical Claims

I briefly looked over the proofs for the main theoretical claims, i.e., the impossibility result (Theorem 2.1), marginal coverage and restricted volume optimality (Theorem 2.5), and approximate conditional coverage/conditional restricted volume optimality (Theorem 3.3). They seem reasonable, leveraging some standard CP arguments and concentration inequalities.

Experimental Design and Analysis

I reviewed the main experiment settings. As I mentioned previously, a main limitation of the paper is that the experiments appear to evaluate only on synthetic datasets (e.g., Gaussian data, Censored Gaussian, Mixture of Gaussians, ReLU-Transformed Gaussians). Evaluation on real datasets would strengthen the evaluation.

Supplementary Material

I briefly looked over most of the supplementary materials, though did not closely check details. I.e., I briefly looked over the related work (Appendix A), additional experiments (Appendix B), and Additional Proofs (Appendix D).

Relation to Broader Scientific Literature

Whereas the majority of the conformal prediction literature focuses on proving guarantees for coverage with efficiency (volume optimality) only evaluated empirically, this paper focuses on providing theoretical analysis for a restricted form of volume optimality. There are several prior works that study the question of volume optimality in CP, which the authors mention in the introduction and discuss more thoroughly in the Related Work (Appendix A).

It should be noted that the authors focus on split conformal prediction, and thus their algorithms and main results (Theorem 2.5) and (Theorem 3.3) should be noted to provide optimality only within the split conformal framework, although as they note it seems that much of the analysis could extend to full conformal. (I.e., the restricted volume optimality results include the size of the calibration set nn in the guarantee; in full conformal, the data are used more efficiently and so the calibration set would effectively be larger; relatedly, cross-validation-style CP methods also make efficient use of the data, but would likely require further analysis.)

It may be relevant for the authors to mention in the “Other Related work” section (or may simply be of interest to the authors for future work) the literatures on (1) aggregating prediction sets to achieve greater efficiency, which often takes a form of cross-validation-style CP, and on (2) selecting prediction sets based on efficiency. The following are relevant references that the authors may be interested in looking at or mentioning to acknowledge this broader literature on improving the efficiency of CP sets:

Papers on cross-validation-style aggregation of split CP sets (which empirically tend to be more efficient than split CP sets):

  • Vovk, V. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1):9–28, 2015.
  • Vovk, V., Nouretdinov, I., Manokhin, V., and Gammerman, A. Cross-conformal predictive distributions. In Conformal and Probabilistic Prediction and Applications, pp. 37–51. PMLR, 2018.
  • Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. Predictive inference with the jackknife+. The Annals of Statistics, 49(1), 2021.
  • Kim, B., Xu, C., & Barber, R. (2020). Predictive inference is free with the jackknife+-after-bootstrap. Advances in Neural Information Processing Systems, 33, 4138-4149.
  • Prinster, D., Liu, A., and Saria, S. Jaws: Auditing predictive uncertainty under covariate shift. Advances in Neural Information Processing Systems, 35:35907–35920, 2022.
  • Prinster, D., Saria, S., and Liu, A. Jaws-x: addressing efficiency bottlenecks of conformal prediction under standard and feedback covariate shift. In International Conference on Machine Learning, pp. 28167–28190. PMLR, 2023.

Other papers on aggregating/selecting CP sets based on efficiency:

  • Liang, R., Zhu, W., & Barber, R. F. (2024). Conformal prediction after efficiency-oriented model selection. arXiv preprint arXiv:2408.07066.
  • Yang, Y., & Kuchibhotla, A. K. (2024). Selection and aggregation of conformal prediction sets. Journal of the American Statistical Association, 1-13.

Miscellaneous other paper with length optimality result:

  • Teneggi, J., Tivnan, M., Stayman, W., & Sulam, J. (2023, July). How to trust your diffusion model: A convex optimization approach to conformal risk control. In International Conference on Machine Learning (pp. 33940-33960). PMLR.

Essential References Not Discussed

In my view, a more comprehensive related work section would cite many of the references I provided in the “Relation To Broader Scientific Literature” section to at least acknowledge this related literature on designing CP sets for efficiency (i.e., either by aggregating, selecting, or optimizing). However, among those references, perhaps the most relevant is Liang et al. (2024), as well as a couple references on cross-validation-style CP/aggregating prediction sets, to acknowledge that literature.

Other Strengths and Weaknesses

Other strengths: Overall, the paper provides an interesting perspective on the efficiency of CP sets, with potentially promising methods.

Other weaknesses: Occasionally the language is a bit informal in a way that can distract or be ambiguous; for instance, the authors state that they "believe" XYZ when perhaps "conjecture" would be more appropriate. For instance, they state "Though not explicitly stated in (Chernozhukov et al., 2021), we believe that the DCP procedure essentially achieves (12) for k = 1" and "We believe the comparison between the full conformal versions of the two methods will lead to the same conclusion."

Other Comments or Suggestions

NA

Author Response

We sincerely thank you for the detailed and constructive feedback and the additional references that we missed. We will implement these changes in the revision of the paper.

We evaluated our method and different baseline methods on several real-world datasets in both unsupervised and supervised settings, and observed improved or competitive volumes in both settings.

For the unsupervised setting, we implement our conformalized DP method and the baseline conformalized KDE method on two real-world datasets (Acidity and Enzyme) used in the density estimation literature (Richardson and Green, 1997). We set the target coverage at 80% and repeat each experiment 50 times, reporting the mean and standard deviation of the volume and empirical coverage. We observe that our conformalized DP outputs a smaller-volume prediction set than KDE with the best bandwidth for almost all $k \ge 2$.

Acidity Dataset

Conformalized KDE Results

| Bandwidth | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| --- | --- | --- | --- | --- | --- |
| Volume | 2.4927 ± 0.1960 | 2.3934 ± 0.2044 | 2.5749 ± 0.2571 | 2.7617 ± 0.2038 | 2.8196 ± 0.2587 |
| Empirical Coverage | 0.8401 ± 0.0226 | 0.8092 ± 0.0261 | 0.8133 ± 0.0315 | 0.8013 ± 0.0256 | 0.8021 ± 0.0294 |

Conformalized DP Results

| k | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Volume | 2.5999 ± 0.1937 | 2.3627 ± 0.1755 | 2.4099 ± 0.1832 | 2.3615 ± 0.1452 | 2.2172 ± 0.1349 |
| Empirical Coverage | 0.8121 ± 0.0476 | 0.8507 ± 0.0224 | 0.8552 ± 0.0245 | 0.8582 ± 0.0196 | 0.8341 ± 0.0253 |

Enzyme Dataset

Conformalized KDE Results

| Bandwidth | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| --- | --- | --- | --- | --- | --- |
| Volume | 0.9826 ± 0.0969 | 1.4640 ± 0.0940 | 1.5218 ± 0.1332 | 1.4526 ± 0.1809 | 1.3867 ± 0.1799 |
| Empirical Coverage | 0.8081 ± 0.0217 | 0.8056 ± 0.0231 | 0.7999 ± 0.0268 | 0.7962 ± 0.0272 | 0.7971 ± 0.0236 |

Conformalized DP Results

| k | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Volume | 1.2207 ± 0.1188 | 0.8997 ± 0.1414 | 1.0118 ± 0.1975 | 1.0131 ± 0.1878 | 1.0520 ± 0.2002 |
| Empirical Coverage | 0.8101 ± 0.0461 | 0.8144 ± 0.0353 | 0.8416 ± 0.0390 | 0.8458 ± 0.0379 | 0.8558 ± 0.0353 |

For the supervised setting, we implement our method and the baseline methods on four UCI datasets: Abalone, AirFoil, WineQuality, and Superconductivity. These four real-world datasets are used in conformal prediction by Dhillon, Deligiannidis, and Rainforth (2024). We observe that our method outputs prediction sets with smaller volumes than the baselines on two datasets, WineQuality and Superconductivity, and provides competitive results on the other two, Abalone and AirFoil. The most prominent example showing the advantage of using unions of several intervals is WineQuality, whose label space is the integers from 0 to 10 rating the quality of wine. The conformalized DP with 5 intervals clearly dominates all other methods, including our own with $k = 1$, due to the multimodality of the predictive distribution.

Abalone

| Method | Emp Cov | Avg Vol | Runtime (s) |
| --- | --- | --- | --- |
| CQR | 70.694 | 3.820 | 44.119 |
| DCP-QR | 69.737 | 3.788 | 46.851 |
| DCP-QR* | 68.780 | 3.579 | 48.201 |
| ConformalizedDP (k = 1) | 69.737 | 3.692 | 166.875 |
| ConformalizedDP (k = 5) | 68.182 | 3.611 | 659.637 |

AirFoil

| Method | Emp Cov | Avg Vol | Runtime (s) |
| --- | --- | --- | --- |
| CQR | 74.751 | 5.895 | 10.509 |
| DCP-QR | 74.086 | 5.388 | 9.667 |
| DCP-QR* | 72.425 | 5.416 | 9.205 |
| ConformalizedDP (k = 1) | 70.764 | 5.289 | 51.588 |
| ConformalizedDP (k = 5) | 74.086 | 6.275 | 224.675 |

WineQuality

| Method | Emp Cov | Avg Vol | Runtime (s) |
| --- | --- | --- | --- |
| CQR | 71.000 | 1.130 | 104.084 |
| DCP-QR | 69.538 | 0.826 | 109.458 |
| DCP-QR* | 69.538 | 0.732 | 109.746 |
| ConformalizedDP (k = 1) | 72.154 | 0.786 | 302.068 |
| ConformalizedDP (k = 5) | 73.385 | 0.186 | 1031.165 |

Superconductivity

| Method | Emp Cov | Avg Vol | Runtime (s) |
| --- | --- | --- | --- |
| CQR | 70.397 | 13.722 | 3160.658 |
| DCP-QR | 71.197 | 13.803 | 2998.426 |
| DCP-QR* | 71.197 | 12.070 | 3206.956 |
| ConformalizedDP (k = 1) | 70.938 | 11.808 | 3809.463 |
| ConformalizedDP (k = 5) | 70.774 | 12.632 | 6163.017 |
Official Review
Rating: 3

Conformal prediction is a framework to construct label sets such that the marginal probability of coverage is guaranteed to be above a desired level. This paper studies the conformal label sets for unidimensional regression problems, where the conformal label sets are restricted to be a union of $k$ intervals. The motivation is to minimize the set sizes while maintaining the conformal coverage guarantee.

The paper first considers the unsupervised setting. It defines the optimal size as the minimum achievable size for label sets restricted to be a union of $k$ intervals. It proposes using dynamic programming to empirically approximate label sets close to the optimal size. Together with the conformal framework, the paper proposes a non-conformity score that achieves close to optimal label set sizes with high probability (quantifying the distance from the optimal and the high probability).

Next, the paper extends similar arguments to the supervised setting. It defines the conditional optimal sizes (conditioned on the covariate) with the same restriction. It proposes using dynamic programming (similar to before) to empirically approximate label sets close to the conditionally optimal sizes. Together with the conformal framework, the paper proposes a non-conformity score that achieves close to conditionally optimal label set sizes with high probability (quantifying the distance from the conditionally optimal and the high probability). It also achieves conditional coverage with high probability.

Lastly, the paper includes experimental results to validate their proposed method.

Questions for Authors

  1. Theorem 2.5 and 3.3 both assume i.i.d. data. How does the statement change if the calibration and test data are exchangeable and not i.i.d.?

  2. How is the index that satisfies the second requirement in Assumptions 2.4 and 3.1 chosen? Or, how is the index for step 1 in Section 4.1 chosen?

  3. When is the bound in Eq. 8 satisfied?

  4. Why is $P \ll \lambda$ required for the initial analysis in Section 2.1 (lines 76-93, column 2)?

Claims and Evidence

The claims are supported for the most part. I have questions about the experiments and the results (see below).

Methods and Evaluation Criteria

  1. It would help to see a similar analysis done on real-world data.

  2. How does the proposed method compare to the baselines on computational cost?

Theoretical Claims

I briefly checked the proofs.

Experimental Design and Analysis

  1. The analysis does not include the standard deviations of the reported quantities. Since "outperforming" is a strong word (lines 407-408, column 2), one should consider the statistical significance of the results.

  2. Figures 1, 2, and 3 are not easy to see in the main paper.

  3. Since Fig. 1a depicts the best proposed model, Fig. 1b should illustrate the best baseline model (I believe $\rho = 0.25$ is the optimal hyperparameter from the search space chosen).

Supplementary Material

I briefly reviewed the Appendix.

Relation to Broader Scientific Literature

This paper adds to the previous work on optimal conformal prediction sets. Specifically, this paper looks at the unidimensional setting and studies optimal prediction sets where the sets are a union of $k$ intervals.

Essential References Not Discussed

There are works on the marginal size of the conformal label sets, termed inefficiency. Most show that conformal inefficiency asymptotically converges to that of an oracle under different settings: unsupervised learning [Lei et al., 2013, 2015], regression [Lei and Wasserman, 2013], binary classification [Lei, 2014], and multi-class classification [Sadinle et al., 2019]. Similarly, Vovk et al. [2014, 2016] and Sadinle et al. [2019] provide results under per-class/label coverage. Additionally, Dhillon et al. [2024] quantify conformal inefficiency in the finite-sample setting.

The paper includes some but not all references.

References

G. S. Dhillon, G. Deligiannidis, and T. Rainforth. On the expected size of conformal prediction sets. In S. Dasgupta, S. Mandt, and Y. Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 1549–1557. PMLR, 02–04 May 2024.

J. Lei. Classification with confidence. Biometrika, 101(4):755–769, 10 2014.

J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):71–96, 07 2013.

J. Lei, J. Robins, and L. Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.

J. Lei, A. Rinaldo, and L. Wasserman. A conformal prediction approach to explore functional data. Annals of Mathematics and Artificial Intelligence, 74(1):29–43, Jun 2015.

M. Sadinle, J. Lei, and L. Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, 114(525):223–234, 2019.

V. Vovk, I. Petej, and V. Fedorova. From conformal to probabilistic prediction. In L. Iliadis, I. Maglogiannis, H. Papadopoulos, S. Sioutas, and C. Makris, editors, Artificial Intelligence Applications and Innovations, pages 221–230, Berlin, Heidelberg, 2014.

V. Vovk, V. Fedorova, I. Nouretdinov, and A. Gammerman. Criteria of efficiency for conformal prediction. In A. Gammerman, Z. Luo, J. Vega, and V. Vovk, editors, Conformal and Probabilistic Prediction with Applications, pages 23–39, Cham, 2016.

Other Strengths and Weaknesses

Strengths:

  1. The area of research is well-motivated for practical impact.

  2. The theoretical and empirical results demonstrate the benefits of the proposed algorithm.

Weaknesses:

  1. The paper is not easy to parse. For instance, the dynamic programming algorithm, one of the key features of the proposed method, is not described well.

  2. The experiments are missing real-world data and reporting of statistical significance.

  3. Several important references are missing.

Other Comments or Suggestions

  1. Explain the proposed dynamic programming algorithm (Algorithm 1) in the main paper. It is not easy to follow without any explanations.

  2. Mentioning that the paper deals with unidimensional regression tasks in the abstract and the introduction would help.

  3. Using "structured prediction sets" in the title is misleading as the structure is limited to the union of $k$ intervals.

  4. The coverage, as detailed in lines 193-198, column 2, depends on the exchangeability of the calibration and test data while being independent of the training data.

  5. The threshold in Eq. 9 is not defined.

  6. Similar to Eq. 6, it would help to write the definition of conditional restricted optimal volume with $\inf_{C \in \mathcal{C}_{k}}$ (in Section 3.1).

  7. The paper denotes the constructed label set as a strict subset of the label space, $\hat{C}(X_{n+1}) \subset \mathcal{Y}$. However, it is not a strict subset and $\hat{C}(X_{n+1}) \subseteq \mathcal{Y}$.

  8. The set value setting is not defined (lines 57-58, column 1).

  9. \citet is for textual citations, and \citep is for parenthetical citations.

Typos:

  1. "...score based on dynamic programming, the proposed method..." → "...score based on dynamic programming. The proposed method..." (lines 102-103, column 1)

  2. "Among all data-dependent set that satisfies (3), ..." → "Among all data-dependent sets that satisfy (3), ..." (line 83, column 2)

  3. "...conformalized KDE is highly densitive to the choice..." → "...conformalized KDE is highly sensitive to the choice..." (caption of Fig. 1d)

  4. The citation for Barber et al. [2021] is incorrectly written as Foygel Barber et al. [2021]

Author Response

We thank the reviewer for the constructive suggestions and for providing comprehensive references. We will make sure to implement these changes and include all related works in the revision.

  • It would help to see a similar analysis done on real-world data.

For the rebuttal, we also evaluated our method on various real-world datasets; please see our response to Reviewer UBed for details.

  • How does the proposed method compare to the baselines on computational cost?

We will include a running time comparison. Due to the time complexity of dynamic programming, our method is naturally slower than the baselines. However, as we have shown in the additional real-dataset experiments, the running time of our method, while larger, remains comparable to that of the baselines. Furthermore, the running time of our method is dominated by the method used to produce the conditional CDF estimate.

  • The analysis does not include the standard deviations of the reported quantities. Since "outperforming" is a strong word (lines 407-408, column 2), one should consider the statistical significance of the results.

You're absolutely right that statistical significance is crucial when claiming "outperformance," and we will incorporate the averages and standard deviations of repeated tests to validate the results. For the unsupervised experiments on real-world datasets, we include the averages and standard deviations in the results; please see our reply to Reviewer UBed for details. The standard deviations for volume and empirical coverage are at most 10% and 2.5%, respectively. We will make sure to report standard deviations in all numerical experiments in the revision.

  • Since Fig. 1a depicts the best proposed model, Fig. 1b should illustrate the best baseline model (I believe $\rho=0.25$ is the optimal hyperparameter from the search space chosen).

We agree that it makes sense for Fig. 1b to show the KDE method with the best bandwidth. We implement the baseline KDE method with $\rho=0.25$, and we will update the figure accordingly. We want to point out that for $\rho=0.25$, KDE still uses a relatively large interval to cover the left Gaussian, resulting in a total volume of 3.7049, much larger than that of our method. While the KDE method is very sensitive to the choice of bandwidth, our proposed DP is very stable with respect to the choice of $k$ as long as it exceeds a certain threshold.

  • Theorems 2.5 and 3.3 both assume i.i.d. data. How does the statement change if the calibration and test data are exchangeable and not i.i.d.?

If the data are only exchangeable, we still have the marginal coverage guarantees in Theorems 2.5 and 3.3. The volume optimality and the conditional coverage will not hold, since volume optimality is not well-defined in this case and conditional coverage is generally impossible without additional assumptions.

  • How is the index that satisfies the second requirement in Assumptions 2.4 and 3.1 chosen? Or, how is the index for step 1 in Section 4.1 chosen?

In Step 1 of Section 4.1, the index $j^*$ depends on $m$, $\alpha$, $n$ and $\delta$. The choices of $n$ and $\alpha$ are clear. For the other two parameters, $m$ stands for the discretization of $[0,1]$ in the nested system, and we typically need $m > 1/\alpha$. Larger $m$ provides finer discretization for the nested system, and $m \leq n$. The number $\delta$ stands for the statistical error of estimating the (conditional) CDF. In later numerical experiments, we chose $m=50$ and $\delta=\sqrt{(k+\log n)/n}$ with $k$ being the number of intervals.

  • When is the bound in Eq. 8 satisfied?

Eq. 8 is always satisfied with high probability, since the VC dimension of $\mathcal{C}_k$ is $O(k)$.
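For context, the argument here is the standard VC uniform convergence bound (stated in its textbook form; the exact constants in the paper's Eq. 8 may differ): for a set family $\mathcal{C}$ of VC dimension $d$ and an i.i.d. sample of size $n$, with probability at least $1-\gamma$,

$$\sup_{C \in \mathcal{C}} \left| P_n(C) - P(C) \right| \leq c \sqrt{\frac{d + \log(1/\gamma)}{n}}$$

for a universal constant $c$. Taking $d = O(k)$ for the family $\mathcal{C}_k$ of unions of $k$ intervals yields a deviation of order $\sqrt{(k + \log(1/\gamma))/n}$, consistent with the choice $\delta = \sqrt{(k+\log n)/n}$ mentioned above.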

  • Why is $P \ll \lambda$ required for the initial analysis in Section 2.1 (lines 76-93, column 2)?

Here, we intend to provide an example of the optimal-volume solution (the display after (3)) under the condition $P \ll \lambda$, which means $P$ is absolutely continuous w.r.t. $\lambda$. When we introduce the notion of volume optimality in the next display, we emphasize that the condition $P \ll \lambda$ is no longer needed.

Review
4

Conformal prediction is a technique that produces prediction sets with marginal $(1-\alpha)$-coverage guarantees; in general there are many subsets of the label space that may satisfy this coverage guarantee, and conformal methods do not necessarily produce the smallest such set (by measure) that satisfies this guarantee, or even a set close in measure to the optimal one. This paper studies the problem of obtaining prediction sets that achieve these coverage guarantees while also being the best in terms of size – formally, this is any prediction set whose volume equals the infimum of volumes over sets that satisfy the coverage guarantee – called volume-optimality.

Firstly, the authors prove that there is no distribution-free method that can achieve fully volume-optimal prediction sets; that is, for any conformal predictor $\hat{C}$ that achieves the marginal coverage guarantee on all distributions, there is some distribution for which it does not produce a volume-optimal prediction set. They move to trying to achieve an approximate version of volume optimality, and restrict themselves to achieving this over the collection $\mathcal{C}_k$ of unions of $k$ intervals. The high-level idea is to achieve approximate volume-optimality over the empirical distribution defined by the calibration set (which converges in total variation to the true distribution since $\mathcal{C}_k$ has finite VC dimension). This can be done by using a dynamic programming approach to efficiently find a union of $k$ intervals that includes at least $(1-\alpha)$ of the empirical probability weight while satisfying the approximate volume-optimality property. The approach can be generalized in settings with context to achieve conditional coverage guarantees with approximate volume-optimality if one can obtain a good estimate of the conditional distribution over the label space. Experiments show that the paper's approach achieves smaller prediction set size than existing approaches.
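The DP idea summarized above can be sketched in a few lines. The following is a simplified illustration, not the paper's actual Algorithm 1: interval endpoints are restricted to the sorted calibration points, and we minimize total length subject to covering at least `need` of the $n$ points (take `need` roughly $\lceil (1-\alpha)n \rceil$); the function name and state layout are our own.

```python
def min_length_k_intervals(points, k, need):
    """Minimum total length of a union of at most k intervals, with
    endpoints at data points, covering at least `need` of the points.

    dp[j][i][c] = min total length using at most j intervals among the
    first i sorted points, covering exactly c of them (points may be
    skipped).  Runs in O(k * n^3) time: a didactic sketch, not the
    optimized recursion from the paper.
    """
    xs = sorted(points)
    n = len(xs)
    INF = float("inf")
    dp = [[[INF] * (n + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    for j in range(k + 1):
        for i in range(n + 1):
            dp[j][i][0] = 0.0  # covering nothing costs nothing
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for c in range(1, i + 1):
                best = dp[j][i - 1][c]  # option 1: skip point i
                for a in range(1, i + 1):
                    # option 2: an interval [xs[a-1], xs[i-1]] ends at point i
                    covered = i - a + 1
                    if covered <= c:
                        cand = (dp[j - 1][a - 1][c - covered]
                                + xs[i - 1] - xs[a - 1])
                        best = min(best, cand)
                dp[j][i][c] = best
    return min(dp[k][n][c] for c in range(need, n + 1))
```

For two well-separated clusters, e.g. points `[0, 0.1, 0.2, 10, 10.1, 10.2]` with `k=2` and `need=5`, the optimal union has total length about 0.3, whereas a single interval (`k=1`) must span the gap, giving length about 10.1. This mirrors why unions of $k$ intervals can dramatically shrink the volume for multimodal distributions.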

Questions for Authors

  1. What is the issue (if any) generalizing this approach to prediction sets in $\mathbb{R}^n$? Does it blow up the DP space too much?

Claims and Evidence

The theoretical claims all make sense, and the experiments show improvement in terms of prediction set size over existing methods.

Methods and Evaluation Criteria

Yes, it seems fine.

Theoretical Claims

I did not check the proofs in the appendix, but the DP algorithm works and the arguments in the main paper look sound.

Experimental Design and Analysis

The experiments looked good. Since the guarantees are in expectation over the calibration set and the data is being simulated anyway, it would also be good to report results averaged over multiple draws of the calibration data.

Supplementary Material

I looked very briefly at some of the additional experiments.

Relation to Prior Literature

This work is concerned with theoretical guarantees / optimality on the size of prediction sets generated by conformal prediction, a relatively less-explored part of the CP literature. It is related to work like Lei et al. 2013, Izbicki et al. 2020, and Kiyani et al. 2024, which all similarly consider conformal prediction for regression, though it defines a slightly different kind of volume-optimality and uses a unique dynamic programming approach to optimize for the prediction set's volume and define a non-conformity score.

Essential References Not Discussed

Not to my knowledge.

Other Strengths and Weaknesses

Strengths:

A good deal of work in the CP literature is aimed essentially at making incremental improvements in terms of prediction set size (through empirical testing) without providing theoretical guarantees, so I think the topic of this paper is very relevant and interesting. The technique seems general enough in the sense that, if you can come up with an expressive enough collection of prediction sets $\mathcal{C}$ that you can also efficiently optimize over and use to construct a non-conformity score, one could perhaps construct other kinds of useful convex prediction sets in higher dimension.

Weaknesses:

Due to considering a more expressive collection of potential prediction sets and optimizing through the DP algorithm, there is probably a tradeoff in terms of runtime during the calibration process compared with more lightweight procedures, and it's not clear how this runtime performance compares against the other approaches with volume-optimality guarantees.

Other Comments or Suggestions

A couple of typos:

Line 083: should be “sets” instead of “set”

Figure 1 (d): should be “sensitive”

Author Response

Thank you for your constructive suggestions and insightful questions.

  • Since the guarantees are in expectation over the calibration set and the data is being simulated anyway, it could be good to get also the results averaged over multiple draws of the calibration data.

We appreciate this suggestion. You are right that it is important to get results averaged over multiple draws. In the revision, we will ensure that all numerical experiments include both averages and standard deviations. In the additional experiments, we show the averages and standard deviations of repeated tests (see our response to Reviewer UBed for details).

  • Question: What is the issue (if any) generalizing this approach to prediction sets in $\mathbb{R}^n$? Does it blow up the DP space too much?

This is an excellent question. The DP approach can potentially be extended to low-dimensional spaces, such as dimension $d=2$ or $d=3$, to construct prediction sets that are unions of disjoint rectangles. However, as the dimension $d$ increases, the time and space complexity of DP grow exponentially, making it impractical for high-dimensional settings. Furthermore, this is unavoidable: the problem is NP-hard in high dimensions even for simple generalizations of intervals like balls (even for a single ball). However, we can use our framework with any algorithm that can find low-volume confidence sets (from a prescribed family) in high dimensions. More broadly, finding low-volume confidence sets in high-dimensional settings is an important and challenging task in high-dimensional statistics. Naturally, this is out of the scope of this work. In a follow-up project, we are currently working on computing approximately small-volume prediction sets in high dimensions, such as balls and ellipsoids, and unions of them.

Final Decision

Conformal prediction is a general uncertainty quantification framework which can produce prediction sets from a given black-box predictive model. Coverage (whether the true output is present within the set or not) and prediction set size (efficiency) are two important metrics. Although there is a lot of theoretical work on coverage guarantees, there is relatively less work on guarantees for prediction set size. The focus of this paper is on this latter challenge for single-target regression tasks where prediction sets are restricted to be a union of intervals. It gives an impossibility result for volume optimality followed by an algorithm based on dynamic programming to achieve restricted volume optimality. The paper also presents analysis for approximate conditional coverage and conditional restricted volume optimality when a good estimator for the conditional CDF is available.

All the reviewers were positive about the paper but also asked constructive questions. The author rebuttal answered their questions satisfactorily. Therefore, I recommend accepting this paper.