/10

Poster4 位审稿人

最低3最高4标准差0.4

ICML 2025

Direct Prediction Set Minimization via Bilevel Conformal Classifier Training

Yuanjie Shi,Hooman Shahrokhi,Xuesong Jia,Xiongzhi Chen,Jana Doppa,Yan Yan

提交: 2025-01-23更新: 2025-07-24

TL;DR

We propose the Direct Prediction Set Minimization (DPSM) approach by integrating the principle behind the standard quantile regression with a differentiable measure of the prediction set size that is conditioned on the learned quantile.

摘要

Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subset of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the {\em Direct Prediction Set Minimization (DPSM)} algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We analyze that DPSM has a learning bound of $O(1/\sqrt{n})$ (with $n$ training samples), while prior conformal training methods based on stochastic approximation for the quantile has a bound of $\Omega(1/s)$ (with batch size $s$ and typically $s \ll \sqrt{n}$). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline with $20.46\%\downarrow$ in the prediction set size and validates our theory.

关键词

conformal predictionuncertainty quantificationdeep classifiersconformal training

评审与讨论

审稿意见

评分: 42025-03-06

This paper introduces a conformal training algorithm called Direct Prediction sEt Minimization (DPSM). Existing training method suffers from a learning bound depending on the batch size. DPSM is formulated as a bilevel optimization that minimizes the prediction set size (upper level) conditioned on the learned quantile of conformity scores (lower level). This paper introduces algorithm to solve DPSM objective and show that it improves the learning bound to be dependent on the number of training samples instead.

给作者的问题

N/A

论据与证据

Theoretical
- The theoretical claims in Theorem 4.1 show that the learning bound is improved, which is standard ML generalization analysis.
- However, no theoretical results on the convergence of bilevel optimization is obtained as the authors pose it as an open challenge.
Numerical
- DPSM is shown to outperform other conformal training methods with significant reduction in prediction set sizes. It also has better predictive efficiency.
- DPSM is shown to converge effectively.
- Assumption 3.1 and 3.2 are empirically validated.
- The numerical experiments are not very extensive in terms of datasets. Also the number of trials 10 is a bit few. Visualizations with confidence intervals could better illustrate the robustness of algorithm performance.

方法与评估标准

The proposed method to resolve the dependence of learning bound on batch size is principled and well explained. The evaluation metrics of coverage guarantee and prediction set sizes make sense.

理论论述

The proof of Theorem 4.1 appears to be correct.

实验设计与分析

The experimental design makes sense but could benefit from a more extensive comparison on a wider range of datasets.

补充材料

Proof of Theorem 4.1 in Appendix C.

与现有文献的关系

Improve conformal training will contribute to the active field of uncertainty quantification, with meaningful implications to high-stake applications.

遗漏的重要参考文献

N/A

其他优缺点

N/A

其他意见或建议

N/A

作者回复

2025-04-01

We thank the reviewer for the insightful review. We provide an external PDF (click this link to access) to present the empirical results needed to answer some of your questions.

Q1: No theoretical results on the convergence of bilevel optimization are obtained as the authors pose it as an open challenge.

A1: As we discussed in Section 2 and 4.2, in the current bilevel optimization literature, the convergence guarantee requires some restrictive conditions that do not hold in our problem. E.g., implicit gradient methods require a twice differentiable and strongly convex lower-level function. Iterative differentiation methods require an iterative subroutine for the lower-level problem with more computational cost. Penalty-based methods typically require continuously differentiable upper and lower-level functions, but our lower-level QR loss is nonsmooth.

Although we do not provide a theoretical convergence result for our simple stochastic gradient method, we link it to a most relevant bilevel study by showing that the lower-level QR loss satisfies Hölderian error bound condition (Lemma 4.5), which, to some extent, justifies that using the simple first-order stochastic gradient updates in our algorithm can possibly achieve the $\epsilon$ -optimal solution for the original bilevel problem (Corollary 4.6). Finding an affirmative answer to this challenge is part of our future work.

Additionally, we empirically verify convergence behavior in experiments:

Figure 1(a) and 1(b): upper and lower loss.

Figure 1(c): error of the learned quantile in DPSM vs. batch quantile in ConfTr.

Figure 1(d): convergence conformal loss (average soft set size) in DPSM vs. ConfTr.

We further plot the convergence of optimization error to 0 in Fig 1(b) and (c) of external PDF (See also A1 to Reviewer j13f). These empirical results demonstrate the stable convergence of DPSM for the bilevel problem.

Q2: The numerical experiments are not very extensive in terms of datasets.

A2: We would like to highlight that conformal training is expensive in practice, and our experiments are extensive compared with the existing conformal training literature. We will try to add results with more datasets in the final paper.

Why is conformal training expensive?

(i) Requires fine-tuning several hyperparameters (e.g., learning rates, batch size, regularization parameter, $\tau_{{\text{sigmoid}}}$ , etc.) and selecting appropriate options (backbone model, conformal scoring function, etc.) from both ML and CP (see Appendix E)

(ii) Typically, requires additional operations such as computing and sorting conformity scores in each iteration, which is more expensive than standard SGD based training of deep models..

Comparing our datasets with previous conformal training literature. We conducted experiments on three widely-used benchmark datasets (CIFAR-100, Caltech-101 and iNaturalist) that span diverse difficulty levels and scales, e.g., 341 classes of data with 224x224 resolution for iNaturalist (see Table 2). In contrast, prior conformal training papers [r1, r2, r3, r4] worked on MNIST, CIFAR-10, or CIFAR-100, which are smaller than ours, e.g., up to 100 classes of data with 32x32 resolution for CIFAR-100.

Q3: The number of trials 10 is a bit few. Visualizations with confidence intervals could be better

A3: Our experiments followed common practice in the CP and conformal training literature [r1, r2, r5, r6], where 10 random splits are widely adopted and considered sufficient to evaluate algorithm performance (in our Table 1).

To study the robustness of DPSM, we show additional experiments with 20 trials on CIFAR-100 using HPS score in the table below.

Methods	CE	CUT	ConfTr	DPSM
DenseNet	$2.59\pm 0.075$	$2.28 \pm 0.070$	$2.27 \pm 0.054$	$2.17 \pm 0.047$
ResNet	$3.39\pm 0.13$	$3.00 \pm 0.71$	$3.77 \pm 0.10$	$2.94 \pm 0.072$

These results are very similar to those obtained with 10 trials (Table 1), confirming that the improvement of DPSM in predictive efficiency is robust.

Additionally, we plot the visualizations with confidence intervals to further illustrate the stability and robustness of DPSM in Fig 5 in the external PDF. We will include results with more trials in the final paper.

[r1] Sharma, et al. PAC-Bayes generalization certificates for learned inductive conformal prediction. NeurIPS 2023

[r2] Correia, et al. An information theoretic perspective on conformal prediction. NeurIPS 2024

[r3] Einbinder, et al. Training uncertainty-aware classifiers with conformalized deep learning. NeurIPS 2022

[r4] Stutz, et al. "Learning Optimal Conformal Classifiers. ICLR 2022

[r5] Ding, et al. Class-conditional conformal prediction with many classes. NeurIPS 2024

[r6] Huang, et al. Conformal Prediction for Deep Classifier via Label Ranking. ICML 2024

审稿人评论

2025-04-06

I appreciate the authors' thoughtful response. I am maintaining my positive score.

审稿意见

评分: 32025-03-12

This paper introduces a novel training strategy for deep classifiers, leveraging insights from conformal prediction. Specifically, it proposes a new training loss function that is formulated as the sum of two components: a conventional loss function—aiming to enhance predictive accuracy—and a (regularized) conformal alignment loss designed to reduce the size of the conformal prediction set.

Compared to the existing work [1], which had the same idea but calculated the conformal alignment loss using minibatch data and an empirical quantile derived from the minibatch, this work parameterize the quantile as a variable and propose a simultaneous optimization of both the model weights and the quantile. This is achieved by formulating the problem as a bilevel optimization task: at the upper level, the objective is to minimize the sum of the conventional loss and the conformal alignment loss, while at the lower level, the goal is to estimate the $(1-\alpha)$ -th quantile of the nonconformity score distribution. The authors demonstrate that, under certain assumptions, their proposed conformal alignment loss—parameterized by both the model and the quantile—achieves an improved learning bound for the conformal alignment loss. Specifically, they achieve an $O(1/\sqrt{n})$ upper bound, where $n$ is the training sample size, while the existing works [1,2] only achieve an upper bound of $O(1/\sqrt{b})$ , where $b<<n$ represents the minibatch sample size.

[1] "Learning Optimal Conformal Classifiers", Stutz et. al., 2022. [2] "Training Uncertainty-Aware Classifiers with Conformalized Deep Learning", Einbinder et. al., 2022.

给作者的问题

My apologies if I have overlooked anything, but I found the following two points unclear and would appreciate further clarification from the authors.

In the second line of Equation (5), why is $\hat{\ell}(f, q)$ defined as an empirical estimation over all $n$ training samples? Shouldn't $\hat{\ell}(f, q)$ be computed using only the sampled batch?
If I understand correctly, the main proposed algorithm, DPSM, aims to solve the optimization problem formulated in Equation (8), where $q$ is estimated by minimizing the average pinball loss over the nonconformity scores of all $n$ training samples. This suggests that the quantile loss is trained on the full batch of training data in each epoch after updating the model $f$ . Furthermore, the improved learning bound established in the main theorem also assumes that the quantile function is trained using all $n$ samples. However, in Algorithm 1, the gradient of the quantile loss is computed using mini-batches. Could you clarify the reasoning behind or if I misunderstood? Additionally, why is the same subset $\mathcal{D}_1$ used for both the conformal alignment loss and the quantile loss? Would there be any benefit in performing a further data split?

论据与证据

The authors assert that (1) the proposed algorithm achieves an improved learning bound compared to existing methods, and (2) their approach empirically results in smaller conformal prediction set sizes. Both claims are substantiated through theoretical justifications and empirical evidence from numerical experiments.

方法与评估标准

The datasets used for evaluating their proposed method and the benchmark methods selected for comparison are appropriate.

Regarding the evaluation criteria, the authors employ marginal coverage and average set size, both of which are standard metrics in the conformal prediction literature. However, this paper does not evaluate conditional coverage, which may be essential for fully assessing the improvements of an uncertainty-aware classifier trained with the proposed loss function. A more detailed discussion on this point can be found in the Other Strengths and Weaknesses section.

理论论述

Yes, I have reviewed the proof of Theorem 4.1 and did not notice any apparent issues.

实验设计与分析

Yes, I have reviewed the experimental setup in Section 5.1 and the corresponding analysis in Section 5.2.

补充材料

The main proof of Theorem 4.1.

与现有文献的关系

The key contribution of this paper is the introduction of a training loss function that explicitly incorporates the objective of reducing conformal prediction set sizes. The innovation lies in the use of a bi-level optimization framework, which jointly learns both the model and the quantile of nonconformity scores. However, the primary motivation of consistently favoring smaller prediction sets may limit the practical utility and broader significance of the approach, as discussed in detail in the weaknesses section.

遗漏的重要参考文献

To the best of my knowledge, the most relevant works essential for understanding the paper’s key contribution have been appropriately cited.

其他优缺点

Strengths: The paper is generally well-organized and well-structured. The main message and key contributions of this work are clearly articulated and easy to follow.
The idea of simultaneously optimizing the quantile and model weights in the conformal alignment loss is novel.

Weaknesses: One of my primary concerns regarding this work is the underlying motivation—specifically, the emphasis on minimizing the prediction set size as the main objective, irrespective of the characteristics of the underlying data. In uncertainty quantification, the goal is to accurately capture and represent the uncertainty associated with predictions. However, reducing the prediction set size alone does not necessarily equate to improved uncertainty quantification. For example, in practical scenarios, datasets often contain a mix of both easy-to-classify and hard-to-classify samples. Ideally, an uncertainty quantification method should be able to distinguish between these cases—producing small prediction sets for easy samples while allowing larger prediction sets for more challenging samples to reflect their inherent difficulty. Prioritizing the minimization of prediction set size across the board may not align with this fundamental principle. For instance, if 90% of the data consists of easy samples while the remaining 10% are difficult, one could achieve a very low average prediction set size by assigning small sets to the easy samples while returning empty sets for the hard ones. However, such an approach is not particularly practical or useful, as it fails to provide meaningful predictions for more challenging cases.

In the context of conformalized training, optimizing for a classifier that solely aims to minimize prediction set size seems less convincing. A more meaningful objective would be to develop a classifier that can offer fair and adaptive predictions, differentiating samples based on their intrinsic difficulty. Indeed, prior work, such as [2], has pointed out that training with the ConfTr loss function, which also seeks to minimize the smallest set size, does not necessarily improve conditional coverage, particularly when the data exhibits heterogeneity. That being said, while achieving a smaller average prediction set size can be beneficial as a secondary advantage, it is not entirely convincing to frame this as the primary goal in conformalized training. Instead, a more robust approach would be to balance prediction set size minimization with a framework that ensures fair and informative coverage across different types of samples.

[2] "Training Uncertainty-Aware Classifiers with Conformalized Deep Learning", Einbinder et. al., 2022.

其他意见或建议

In the second line of Theorem 3.5, there is an extra right parenthesis between $(1-\alpha)$ and $(s+1)$ .

Additionally, $\frac{1}{n}\sum_{i=1}^n \left[ \sum_{y\in \mathcal{Y}} \tilde{1}[S_f(X_i, y) \leq q] \right]$ appears to be repeatedly defined in Equation (9) and the second line of Equation (5), which may lead to potential confusion.

作者回复

2025-04-01

We thank the reviewer for the insightful review. We provide an external PDF (click this link to access) to present the empirical results

Q1: Evaluation: No conditional coverage

A1: Please refer to A2 for Reviewer aEdV and Fig 4 in external PDF for updated conditional coverage measures. DPSM shows improved class-conditional coverage

Q2. Minimizing average prediction set size (APSS): motivation not convincing since:

1) it is a secondary advantage of conformal training. The primary objective is adaptive prediction sets to hardness and conditional coverage across heterogeneous data;

2) it alone does not imply meaningful UQ. Predictions with small APSS may not be adaptive and practically useful, e.g., small prediction sets for easy inputs and empty sets for hard inputs fail to provide informative predictions on difficult ones

A2: 1) Predictive efficiency is one of the primary desiderata in CP, rather than a secondary advantage

For practitioners, smaller prediction sets of CP are found practically useful to improve downstream decision making from humans [r1, r2]. Therefore, recent work in CP explicitly aims to improve predictive efficiency of CP [r3, r4]. Others target training the underlying model to encourage predictive efficiency in CP [r5, r6]. Therefore, it is a central and practically important goal. DPSM contributes to this line of work by providing provable improvements in predictive efficiency that is verified by experiments, e.g., 20% reduction of APSS over the best baseline

Clarification: DPSM can also be used for conditional coverage. The idea of DPSM is to learn the quantile, rather than estimating it in batches. It is agnostic to the considered coverage type and thus highly compatible with different conditional coverages. For class-conditional coverage, we can use the same principle to learn the quantile for each class and regularize fair distributions of conformity scores over heterogeneous classes

2) Small prediction sets for easy inputs and empty sets for difficult ones are informative and practically useful

Under nominal coverage, a small prediction set (e.g., singleton) implies high confidence of the model. On the other hand, an empty prediction set indicates selective rejection, which attracts increasing focus in large language models, e.g., abstention over unreliable prediction (e.g., hallucination) on difficult inputs [r7, r8]. In addition, conformity score is a more fine-grained adaptive measure

Q3: $1/n \sum^n_{i=1}[\sum_{y \in \mathcal Y} \tilde 1 [S_f(X_i, y) \leq q]]$ is repeatedly defined in Eq (9) and (5)

A3: This was intentional because it is a shared part to define SA and DPSM conformal losses in Eq (5) and (9), respectively (as stated in L247 below Eq (9)). This highlights the key difference between the two conformal losses: the input quantile of SA conformal loss is from a random batch, while DPSM uses a learnable quantile as input in Eq (9). Their definitions are used in their corresponding learning bound results

Q4: Why is $\hat \ell (f,q)$ defined over all training samples rather than batch?

A4: We aim to compare learning bounds for SA and DPSM conformal losses, so we ensure their learning error is defined over the entire training set, rather than a random batch. Although we compute the batch-level quantile for SA method in practice, to make the comparison agnostic to the randomness of batches, we derive an in-effect conformal loss over all training data (Prop 3.4) used for learning bounds (Thm 3.5)

Q5: Why is QR loss defined over all data in Eq (8), but Algo 1 computes batch gradient? Any benefit for this further data split?

A5: An optimal solution of QR loss over all samples is the ( $1-\alpha$ )-quantile over all data. In Algo 1, we use stochastic gradients to iteratively update the learnable quantile until QR loss converges. It is analogous to applying SGD to optimize a loss defined over all training samples: iterative SGD update can train the deep model to convergence

We clarify that $D_1$ is employed to train classification loss and QR loss, rather than conformal loss. Data split helps prevent overfitting and has been in the conformal training literature (see [r9]).

[r1] Conformal Prediction Sets Improve Human Decision Making. ICML 2024

[r2] On the Utility of Prediction Sets in Human-AI Teams. IJCAI 2022

[r3] Uncertainty sets for image classifiers using conformal prediction. ICLR 2021

[r4] Conformal Prediction for Deep Classifier via Label Ranking. ICML 2024

[r5] The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks. ICML 2024

[r6] Enhancing Trustworthiness of Graph Neural Networks with Rank-Based Conformal Training. AAAI 2025

[r7] Large language model validity via enhanced conformal prediction methods. NeurIPS 2024

[r8] Conformal language modeling. ICLR 2024

[r9] Training uncertainty-aware classifiers with conformalized deep learning. NeurIPS 2022

审稿人评论

2025-04-05

Thank you to the authors for thoughtfully addressing my previous comments and for providing additional experimental results. These responses addressed my main concerns, and I have accordingly raised my score.

I have one remaining question regarding the conditional coverage results: does the dataset used in the experiments exhibit heterogeneity?

作者评论

2025-04-08

FQ1: Do the datasets used in the experiments exhibit heterogeneity

FA1: Thank you for raising your score and the thoughtful follow-up question. Yes, the used datasets exhibit heterogeneity.

We originally conducted experiments on three widely-used benchmark datasets (Caltech-101, CIFAR-100, and iNaturalist). All three datasets exhibit heterogeneity in prediction difficulty as demonstrated by the following results:

(i) Standard deviation of prediction set sizes over each testing data samples on three datasets*

To quantify this heterogeneity, we report the standard deviation of prediction set sizes evaluated over all testing data by all methods (using the HPS score with DenseNet model) as a proxy for variability in prediction uncertainty (their mean values reported in Table 1 in the original paper):

Methods	Caltech-101	CIFAR-100	iNaturalist
CE	$1.48$	$2.72$	$59.12$
CUT	$0.75$	$2.29$	$46.95$
ConfTr	$2.99$	$1.86$	$44.35$
DPSM	$0.30$	$2.16$	$33.93$

These non-zero and varying standard deviations suggest the sample-level heterogeneity in predictive uncertainty across datasets.

(ii) Visualization of datasets’ heterogeneity via histograms of sample-level HPS (non-conformity scores) and prediction set sizes.

To further illustrate the sample-level heterogeneity, we provide two additional experiments on three datasets (Caltech-101, CIFAR-100, iNaturalist) with DenseNet in an additional external PDF (click this link to access).

Specifically, in the first experiment, Fig 1, 2 and 3 demonstrate the histograms of HPS scores over all testing samples on three datasets, in which we compare DPSM with the three baselines (CE, CUT, ConfTr). We find that the sample-level HPS scores of all four methods are distributed across the full range $[0, 1]$ (i.e., the domain of the HPS scores), which highlights the heterogeneity of datasets.

In the second experiment, Fig 4 shows the histograms of prediction set sizes produced by DPSM over all testing samples on three datasets. We find that the sample-level prediction set sizes of DPSM are distributed widely in the domain (a prediction set size can be an integer taking value from $\\{0, 1, …, K\\}$ for $K$ classes), especially for harder CIFAR-100 and iNaturalist. This figure also demonstrates the heterogeneity of datasets.

审稿意见

评分: 32025-03-14

This paper proposed a new method, called DPSM, for minimizing the size of the prediction set for conformal prediction (CP) via bi-level optimization. The main idea is to reformulate the quantile estimation in CP into an optimization problem, such that it can be treated as a lower-level optimization, while the set size minimization as the upper-level one. The authors claimed and theoretically showed that such a formulation can improve the learning bound of previous methods that are also devoted to minimizing the prediction set's size. Experimental results demonstrate the effectiveness of the proposed method, that it reduces the average size of the prediction sets over existing methods in most cases.

update after rebuttal

I would keep my initial rating after reading the authors' responses.

给作者的问题

No further questions.

论据与证据

The main claim by the authors is that the proposed DPSM has better learning bound with respect to sample complexity than existing methods, which is theoretically verified. Experiments also demonstrate the effectiveness of DPSM. However, a concern is that the learning bound was not empirically tested. Due to the approximation in the optimization, it is a little questionable that the theoretical bound can be indeed achieved.

方法与评估标准

Since the proposed method focuses on the prediction set's size, the average prediction set's size is directly used as the performance measure, which is reasonable.

For the method itself, the motivation is clear, and the theoretical results seem good (though I haven't checked the proofs in detail). However, I am a little confused that, since the stochastic approximation is also adopted in DPSM implementation for optimizing the quantile as that in SA-based methods, why it can achieve much better learning bound.

理论论述

I haven't checked all the details of the proofs, but I have a concern about the proof for Theorem 3.5. Specifically, I notice that the proof for Theorem 3.5 relies on some properties of Beta distribution, which are only available in the asymptotic sense according to Proposition 3.4.

实验设计与分析

The overall experimental designs and analyses are promising. A small issue:

In Figures 1 and 3, it seems that the algorithm did not sufficiently converge with 40 epochs. So, I suggest more iterations.

补充材料

I did not fully check the details of the proofs in the supplementary material but looked through the overall logic. I also reviewed the additional experimental results provided.

与现有文献的关系

The existing SA-based conformal training methods [1, 2] compute the empirical batch-level quantile during each training iteration. The proposed DPSM tries to improve the theoretical learning bound and practical effectiveness by a bi-level optimization-based formulation.

[1] Stutz, D., Cemgil, A. T., Doucet, A., et al. Learning optimal conformal classifiers. arXiv preprint arXiv:2110.09192,2021.

[2] Einbinder, B.-S., Romano, Y., Sesia, M., and Zhou, Y. Training uncertainty-aware classifiers with conformalized deep learning. Advances in Neural Information Processing Systems, 35:22380–22395, 2022.

遗漏的重要参考文献

Related references have been properly cited and discussed.

其他优缺点

Strength:

The bi-level optimization for conformal training is new and inspirative.
Theoretical analysis is provided to support the reasonability of the proposed method.
Experimental results are promising.

Weakness:

The gap between the theory and implementation of the proposed method was not discussed.

其他意见或建议

In the caption of Table 3, "above" should be "below".
Figure 4 shows the assumption verification using different networks but the same conformity score, compared with that in Figure 2. I suggest trying other conformity scores.

作者回复

2025-04-01

We thank the reviewer for the insightful review. We provide an external PDF (click this link to access) to present the empirical results needed to answer some of your questions.

Q1: Learning bound not empirically tested? Questionable if achievable due to optimization approximation?

A1. Learning bound in Thm 4.1 cannot be empirically computed (conformal loss ${\mathcal L}_c$ is defined over population, as per the standard generalization error [r1]). However, we can verify it via a strategy in ML literature that approximates the generalization error by the absolute gap between the train and test error [r2, r3]. For our CP case, we use the average prediction set size (APSS) evaluated on train and test sets to approximate our learning bound. Specifically, at iteration t, we:

compute APSS on train set using $f_t$ and $q_t$ (as the threshold for CP), denoted by $APSS^{tr}(f_t, q_t)$ . It includes opt. error since $q_t$ is not optimal (true quantile on train set);
compute $APSS^{te}(f_t,q^{te}(f_t))$ on testing data, where $q^{te}(f_t)$ is the true quantile

Then we compute $|APSS^{tr}(f_t, q_t)-APSS^{te}(f_t, q^{te}(f_t))|$ as an approximation of the learning bound. We follow the same strategy to approximate the learning bound for the SA-based ConfTr, where we use the batch-level quantiles from batches randomly sampled from train set for train APSS, and from testing data for test APSS, respectively. We include this comparison in Fig 1 (a) in the external PDF to verify that the approximated learning error is improved (and converges to 0) by DPSM despite optimization approximation.

Opt. approximation: We showed the convergence of losses and est. error of learned quantile empirically in Fig 1(a)-(c) of original paper. To investigate how the learned $q_t$ impacts the opt. error on conformal and QR losses, we compute these two losses with the learned $q_t$ and optimal $q^{tr}(f_t)$ . On train set, we further show the gap between the conformal loss (and QR loss resp.) with $q_t$ vs. $q^{tr}(f_t)$ (we denote as opt. error) in Fig 1 (b) and (c) of external PDF. Both opt. error converges to nearly 0

Q2: Why can DPSM achieve much better learning bound as SA is also adopted?

A2: DPSM mainly differs from the SA-based method in how the quantile is instantiated

DPSM: the quantile is a learnable parameter in lower-level QR

SA-based: uses batch-level quantile at each iteration

Although they share the same SA-type update (SGD), the bottleneck of SA-based method is the large deviation of the batch-level quantile w.r.t the true quantile (at least $\Omega(1/s)$ , Thm 3.5). Instead, DPSM decouples the quantile estimation from stochastic batches via learning a quantile parameter. By releasing the bottleneck of estimating batch-level quantiles, it allows a much smaller learning bound (at most $O(1/\sqrt{n})$ , Thm 4.1) when DPSM learns the optimal quantile. We further show that DPSM learns a very accurate quantile (the gap w.r.t true quantile converges to 0 as in Fig 1(c) of original paper). Thus, it secures the improved learning bound

Q3: The proof for Thm 3.5 relies on some properties of beta distribution which are only available in the asymptotic sense according to Prop 3.4.

A3: Thank you for pointing this out. We will revise Thm 3.5 to explicitly include the assumption used by Prop 3.4. For the asymptotic result, following some standard analysis techniques in ML literature, e.g., [r4, r5, r2], we use it to formally construct the in-effect conformal loss over $n$ training data characterized by Beta distribution. Based on it, we derived the learning bounds for the SA-based method that are compared with DPSM in Section 4.

Q4: The algorithm did not sufficiently converge with 40 epochs.

A4: Thank you for your suggestion. We have extended training epochs to 100 to show the sufficient convergence of upper and lower loss in Figure 2 of external PDF.

Q5: Verify assumption with other conformity scores.

A5: We originally selected HPS score to verify assumptions following the recent conformal training literature [r6].

We add the verification experiments with APS and RAPS scores in Figure 3 of external PDF, where Assumption 3.1 and 3.2 are still empirically valid. We will report the results in the revised paper.

[r1] Mohri, et al. Foundations of machine learning. MIT press 2018

[r2] Yang, et al. Exact gap between generalization error and uniform convergence in random feature models. ICML 2021

[r3] Yuan, et al. Stagewise Training Accelerates Convergence of Testing Error Over SGD. NeurIPS 2019

[r4] Emami, et al. Generalization error of generalized linear models in high dimensions. International conference on machine learning. ICML 2020

[r5] Velikanov, et al. Generalization error of spectral algorithms. ICLR 2024

[r6] Sharma, et al. PAC-Bayes generalization certificates for learned inductive conformal prediction. NeurIPS 2023

审稿意见

评分: 32025-03-14

This paper introduces the Direct Prediction Set Minimization (DPSM) algorithm, a novel bilevel optimization approach that minimizes prediction set sizes in conformal classifier training with a learning bound of $O(1/\sqrt{n})$ , surpassing prior methods limited by $\Omega(1/s)$ . Experimental results on benchmark datasets demonstrate a significant 20.46% reduction in prediction set size compared to the best baselines, validating its theoretical and practical effectiveness.

给作者的问题

N/A

论据与证据

DPSM is a novel conformal training algorithm using bilevel optimization to minimize prediction set size.
The proposed method is theoretically robust, as the authors provide comprehensive theoretical analysis and an upper bound on the learning error.

方法与评估标准

N/A

理论论述

The theoretical analysis is interesting and solid. The bilevel approach and pinball loss integration make it intriguing, while rigorous proofs and clear assumptions ensure solidity.

实验设计与分析

Assessing the score function with fixed hyper-parameter settings provides little insight; a parameter-tuning approach would more effectively demonstrate DPSM's advantages.
Including conditional coverage metrics such as WSC, SSCV, or CovGap is essential to evaluate how DPSM’s set size reduction affects conditional coverage.
The authors should report model accuracy across different loss functions, as higher accuracy, which significantly influences set size, could result in smaller prediction sets and needs to be documented.

补充材料

I have checked the supplementary material.

与现有文献的关系

N/A

遗漏的重要参考文献

N/A

其他优缺点

N/A

其他意见或建议

N/A

作者回复

2025-04-01

We thank the reviewer for the insightful review. We provide an external PDF (click this link to access) to present the empirical results needed to answer some of your questions.

Q1: Assessing the score function with fixed hyper-parameter settings provides little insight.

A1: In our original experiments, we use the score functions (HPS, APS, and RAPS) and set their hyper-parameters following the established practice in recent CP and conformal training papers [r1, r2, r3, r4]. Specifically, only RAPS involves explicit hyperparameters, i.e., $\lambda$ and $k_{reg}$ , and we used standard fixed hyperparameter settings commonly adopted in the CP literature, e.g., [r1, r2, r3]. HPS and APS do not have hyperparameters.

To investigate the impact of the hyperparameters of RAPS over DPSM, we additionally evaluated the average prediction set size based on all conformal training methods with different values for the two hyperparameters in RAPS, where the values follow the original setting in the RAPS paper [r5]. We selected the best combination with the smallest prediction set size for each conformal training method and reported their average prediction set sizes as follows.

Methods	CE	CUT	ConfTr	DPSM
DenseNet	$2.73\pm 0.043$	$2.14 \pm 0.055$	$2.69 \pm 0.053$	$2.34 \pm 0.055$
ResNet	$3.25\pm 0.14$	$2.93 \pm 0.60$	$4.02 \pm 0.11$	$2.93 \pm 0.05$

These results are the same as those obtained originally with the fixed RAPS hyperparameters (Table 6 of original paper). We will include the additional results with the tuned RAPS score in the revised paper.

Q2: Including conditional coverage metrics such as WSC, SSCV, or CovGap is essential.

A2: To investigate the impact of DPSM’s set size reduction over the conditional coverage, we report a table of WSC ( $\uparrow$ better), SSCV ( $\downarrow$ better) and CovGap ( $\downarrow$ better) of all conformal training methods on CIFAR-100 with HPS score and DenseNet backbone below:

Mesures	CE	CUT	ConfTr	DPSM
WSC	$0.88\pm 0.016$	$0.90\pm 0.022$	$0.88\pm 0.018$	$0.89\pm 0.011$
SSCV	$0.12\pm 0.024$	$0.09\pm 0.019$	$0.21\pm 0.061$	$0.17\pm 0.034$
CovGap	$4.54 \pm 0.49$	$5.20 \pm 0.29$	$4.56 \pm 0.28$	$4.43 \pm 0.41$

These results show that DPSM achieves the best performance for class-conditional coverage (the smallest CovGap, with 2.4% $\downarrow$ over the best baseline CE). For size-stratified coverage (SSCV), DPSM has a worse measure compared with CE and CUT, but is better than ConfTr. For WSC, CUT and DPSM have comparable performance.

Visualized class-conditional coverage and prediction set size

We further plot Fig 4 in the external PDF to show the distribution of class-conditional coverage and class-wise average prediction set size as fine-grained measures for class conditional coverage. DPSM shows a bit more concentration in terms of class-wise coverage and smaller class-wise prediction set size. This result is supported by the above table, where CovGap is a measure for the violation of class-conditional coverage and DPSM has 2.4% $\downarrow$ over the best baseline (CE) as per the above table.

Q3: The authors should report model accuracy across different loss functions.

A3: We agree that explicitly reporting model accuracy across different training loss functions would provide a more comprehensive understanding of the benefits in DPSM. Below, we report the testing accuracy on CIFAR-100 of all methods in the following table:

Methods	CE	CUT	ConfTr	DPSM
DenseNet	$0.775\pm 0.006$	$0.752 \pm 0.006$	$0.739 \pm 0.006$	$0.751 \pm 0.005$
ResNet	$0.739\pm 0.005$	$0.701 \pm 0.004$	$0.648 \pm 0.004$	$0.699 \pm 0.005$

These results show that while DPSM achieves slightly lower accuracy than CE, it maintains comparable accuracy to CUT and significantly outperforms ConfTr. Combined with the best performance of DPSM in terms of average prediction set size reported in Table 1 of original paper, DPSM demonstrates an effective balance between model accuracy and predictive efficiency of CP. We will include these accuracy results in the revised paper.

[r1] Ding, et al. Class-conditional conformal prediction with many classes. NeurIPS 2023

[r2] Tawachi, et al. Multi-dimensional conformal prediction. ICLR 2025

[r3] Correia, et al. An information theoretic perspective on conformal prediction. NeurIPS 2024.

[r4] Lu, et al. Federated conformal predictors for distributed uncertainty quantification. ICML 2023.

[r5] Angelopoulos, et al. Uncertainty Sets for Image Classifiers using Conformal Prediction. ICLR 2021

最终决定Accept (poster)

2025-05-01

The paper introduces Direct Prediction Set Minimization (DPSM), a novel conformal training algorithm that leverages bi-level optimization to minimize prediction set sizes in conformal prediction (CP). DPSM presents a compelling contribution to conformal training through its bi-level optimization framework, offering a theoretically grounded improvement in learning bounds and strong empirical performance in reducing prediction set sizes. The authors’ rebuttal effectively addresses concerns about conditional coverage, empirical robustness, and implementation clarity, reinforcing the method’s potential. However, the singular focus on set size minimization raises valid concerns about its alignment with broader uncertainty quantification goals, and the lack of theoretical convergence guarantees limits its theoretical completeness. The reviewers’ consensus leans positive, reflecting confidence in its contributions to conformal prediction. The suggested improvements can be incorporated in the final version to enhance clarity and impact.