Conformal Prediction for Class-wise Coverage via Augmented Label Rank Calibration
A provable conformal prediction method that produces small prediction sets with class-conditional coverage for imbalanced classification problems using augmented label rank calibration.
Abstract
Reviews and Discussion
This paper introduces the Rank Calibrated Class-conditional Conformal Prediction (RC3P) algorithm, designed to address the issue of large prediction sets in conformal prediction (CP), especially in imbalanced classification tasks. The RC3P algorithm enhances class-conditional coverage by selectively applying class-wise thresholding based on a label rank calibration strategy. This method reduces the size of prediction sets compared to the standard class-conditional CP (CCP) method while maintaining valid class-wise coverage. The authors demonstrate that RC3P achieves significant reductions in prediction set sizes across multiple real-world datasets.
Strengths
- The introduction of a rank calibration strategy to the class-conditional CP framework is a novel contribution. It offers a practical solution to the challenge of large prediction sets in CP, especially for imbalanced datasets.
- The paper provides comprehensive experimental results across various datasets and imbalance types, showing a significant reduction in prediction set sizes while maintaining valid class-wise coverage.
- The authors provide rigorous theoretical guarantees for the class-wise coverage and improved predictive efficiency of the RC3P algorithm, demonstrating its robustness and general applicability.
- The paper is well-structured and clearly written, with detailed explanations of the algorithm, theoretical analyses, and experimental setups.
Weaknesses
- While the paper provides theoretical guarantees for coverage, the theoretical analysis of the predictive efficiency could be expanded. Specifically, a deeper exploration of the conditions under which RC3P outperforms CCP in terms of predictive efficiency would strengthen the paper.
- Although the experiments are thorough, additional experiments on more diverse and larger-scale datasets could further validate the generalizability of the RC3P method.
- One additional comment is to verify the predictive efficiency using approximate conditional coverage metrics suggested in the literature, for example, by "Cauchois, M., Gupta, S., and Duchi, J. (2021). Knowing what you know: valid and validated confidence sets in multiclass and multilabel prediction. Journal of Machine Learning Research, 22(81):1–42", to ensure that approximate conditional coverage is not negatively impacted.
Questions
- Can the authors provide more details on the computational complexity of the RC3P algorithm compared to the CCP and Cluster-CP methods?
- How does the RC3P algorithm perform on larger and more diverse datasets beyond the ones tested in this paper?
Limitations
Yes
We thank the reviewer for the constructive feedback. Below we provide our rebuttal for the key questions from the reviewer.
Q1: The theoretical analysis of the predictive efficiency could be expanded.
A1: We agree that a deeper exploration of the conditions under which RC3P outperforms CCP would strengthen our paper. However, predictive efficiency analysis is challenging and there is no existing theoretical framework for thoroughly analyzing the predictive efficiency of CP methods (most of them focus on coverage guarantees), especially for the class-wise coverage setting. In an attempt to fill this gap in the current state of knowledge, we provide Lemma 4.2 and Theorem 4.3 to show under what conditions on the model the predictive efficiency can be improved (see also GR2).
We highlight that Lemma 4.2 is not our main intellectual contribution for the improved predictive efficiency of RC3P. Instead, it showcases a condition number that parametrizes whether the target event "the predictive efficiency is improved by RC3P" happens or not.
Then Theorem 4.3 further analyzes the transition from the parameters of RC3P to the parameter of the target event, to indicate how to configure RC3P to improve predictive efficiency. Specifically, as pointed out in its Remark, if we set the class-wise rank threshold of RC3P as small as possible while still satisfying the condition required by the model-agnostic coverage guarantee (Theorem 4.1), then we have a higher probability of guaranteeing the improved-efficiency condition. This overall transition of parameters is equivalent to (7) in the Remark and helps us configure RC3P to most likely improve the predictive efficiency over the baseline CCP method.
In addition to serving as a guideline for configuring RC3P to improve the predictive efficiency, we can further investigate how to interpret the involved condition, e.g., the defined terms D and B involve properties of the underlying model. Further analyzing how to ensure this condition holds can help guarantee the improved predictive efficiency of RC3P over CCP. For instance, according to the definitions in Theorem 4.3, the condition is easier to satisfy when the model has high top-k accuracy. Given the challenging nature of analyzing predictive efficiency, we will continue to expand the theoretical analysis as part of immediate future work.
Q2: Additional experiments on more diverse and larger-scale datasets
A2:
We add balanced experiment on diverse datasets (CIFAR-100, Places365, iNaturalist, ImageNet) in Experiment (1) of GR1 (see Table1 in the attached PDF). The models are pre-trained. UCR is controlled to . RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average, with on iNaturalist and on ImageNet. We get similar improvements with APS and THR scoring functions.
In addition to the large-scale experiments on balanced data in Experiment (1) of GR1, in Experiment (2) Large-scale imbalanced classification experiment (Table 2 in PDF), we also tested baselines and RC3P on a large-scale dataset (ImageNet-LT) with a deep model trained from scratch using the imbalanced training method LDAM [r1]. UCR is controlled to . RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average.
Q3: Adding approximate conditional coverage metrics?
A3: We highlight that our goal is class-conditional coverage, which is a parallel notion to conditional coverage. Class-conditional coverage conditions on each class in the output space, while conditional coverage conditions on the input-space features, e.g., on membership in "pre-specified groups". We will add other approximate conditional coverage metrics in the revised version based on the recent work of Gibbs and Candès (2024).
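For concreteness, the two targets can be written side by side as follows (a standard formulation; the prediction-set notation $\mathcal{C}(X)$ and the group notation $G$ are assumed here rather than taken from the paper):

$$\Pr\{\,Y \in \mathcal{C}(X) \mid Y = y\,\} \ge 1-\alpha \ \text{ for each class } y, \qquad \text{versus} \qquad \Pr\{\,Y \in \mathcal{C}(X) \mid X \in G\,\} \ge 1-\alpha \ \text{ for pre-specified groups } G.$$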
Q4: The computational complexity of the RC3P, CCP and Cluster-CP
A4: Suppose , where is the number of calibration samples in each class , is the total number of calibration samples, and is the number of candidate class labels. CCP needs to sort the conformity scores of each class to compute class-wise quantiles. The computational complexity for CCP is
Compared with CCP, the additional step in RC3P is to find the rank threshold for each class, whose computational complexity is $O(nK)$ with brute-force search or $O(n \log K)$ with binary search over the $K$ candidate ranks. Therefore, by combining the computation costs of rank calibration and score calibration in RC3P, we get the total complexity as $O(\sum_{y} n_y \log n_y + nK)$ or $O(\sum_{y} n_y \log n_y + n \log K)$.
Cluster-CP first runs a $k$-means clustering algorithm, where $M$ is the number of clusters. The computational complexity of the clustering algorithm is $O(KMT)$, where $T$ is the number of iterations. Then, Cluster-CP computes cluster-wise quantiles with time complexity $O(n \log(n/M))$, where we assume the number of samples in each cluster is on the order of $n/M$. Therefore, the total computational complexity of Cluster-CP is $O(KMT + n \log(n/M))$.
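To make the comparison concrete, below is a minimal Python sketch of the two calibration steps whose costs are counted above: class-wise quantile calibration (the CCP-style cost) and a brute-force rank-threshold search (the extra RC3P-style cost). This is an illustrative sketch under our own assumptions, not the authors' implementation; the function names, the top-k error criterion used for the rank threshold, and the toy data are all hypothetical.

```python
import numpy as np

def class_wise_quantiles(scores, labels, alpha, num_classes):
    # CCP-style calibration: sort each class's conformity scores and take a
    # conservative (1 - alpha) empirical quantile per class.
    # Cost: O(sum_y n_y log n_y) for the per-class sorts.
    q = np.zeros(num_classes)
    for y in range(num_classes):
        s = np.sort(scores[labels == y])
        n_y = len(s)
        idx = int(np.ceil((1 - alpha) * (n_y + 1))) - 1  # finite-sample index
        q[y] = s[min(idx, n_y - 1)]
    return q

def rank_thresholds(true_label_ranks, labels, beta, num_classes):
    # Illustrative rank calibration: for each class, pick the smallest rank k
    # whose empirical top-k error on that class is at most beta.
    # Cost: O(n * K) with this brute-force scan over candidate ranks.
    k = np.full(num_classes, num_classes)
    for y in range(num_classes):
        r = true_label_ranks[labels == y]        # true-label ranks start at 1
        for cand in range(1, num_classes + 1):
            if np.mean(r > cand) <= beta:        # empirical top-cand error
                k[y] = cand
                break
    return k

# Toy usage with synthetic calibration data.
rng = np.random.default_rng(0)
num_classes, n = 5, 500
labels = rng.integers(0, num_classes, n)
scores = rng.random(n)                           # stand-in conformity scores
ranks = rng.integers(1, num_classes + 1, n)      # stand-in true-label ranks
print(class_wise_quantiles(scores, labels, alpha=0.1, num_classes=num_classes))
print(rank_thresholds(ranks, labels, beta=0.1, num_classes=num_classes))
```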
I thank the authors for the detailed response. I maintain my positive score of weak accept.
This paper introduces the Rank Calibrated Class-conditional CP (RC3P) algorithm, which reduces prediction set sizes while ensuring valid class-conditional coverage by selectively applying class-wise thresholding. The RC3P algorithm achieves class-wise coverage regardless of the classifier and data distribution, and shows empirical improvements.
Strengths
The idea is simple but effective, supported by solid theoretical and empirical evidence.
Weaknesses
- Some notations, especially Y and y, are confusing. Similarly, k is used both for a class label and for the different notion of label rank.
- More illustrations may be helpful for understanding. For example, regarding the toy example in para 1 of Section 4, try to give a concrete figure example. Also, give a table of important dataset statistics, e.g., total sample size, number of classes, imbalance ratio.
Questions
- For fairness evaluation, especially with many classes, what is the worst under-coverage ratio? Only the mean is provided.
Limitations
Yes
We thank the reviewer for the constructive feedback. Below we provide our rebuttal for the key questions from the reviewer.
Q1: Some notions, especially the Y and y, are confusing. Similarly, k is for label class, is for a different notion of label rank, but both uses k.
A1: Our notations are consistent, but we will try to make them clearer. We denote Y as the ground-truth label of a data sample (see Line 103) and y as a realized class label (see Line 107). K is the number of candidate classes, so the maximum rank is also K (see Line 103). k is used to describe the rank threshold of the top-k class-wise error (see Line 108). Thus, we use k_y as the rank threshold of class y.
Q2: Try to give a concrete figure example.
A2: We plot the class-wise score distribution as a concrete example of the explanation in Section 4.1 (Table 6 in PDF).
The true labels of all data from class are ranked within , which indicates low uncertainty in class . However, nearly half of APS scores with high ranks (i.e., rank 2, 3) are larger than class quantile (the dashed straight line), and the corresponding label will not be included in prediction sets through CCP. On the contrary, the maximum true label ranks of data in class are 9, which indicates high uncertainty in the model's predictions, but most of the APS scores in class are smaller than the class quantile (the dashed straight line). Thus, the uniform class-wise iteration strategy of CCP includes more uncertain labels in the prediction sets and degenerates the predictive efficiency.
Q3: Give a table of important dataset statistics?
A3: We show the description of the datasets in the following table. Thank you for your suggestion. We will add it in the revised paper.
| Dataset | CIFAR-10 | CIFAR-100 | mini-ImageNet | FOOD-101 |
|---|---|---|---|---|
| Number of training samples | 50000 | 50000 | 30000 | 75750 |
| Number of classes | 10 | 100 | 100 | 101 |
| Each class calibration samples | 500 | 50 | 150 | 125 |
| Imbalanced ratios | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} |
Q4: What's the worst under coverage ratio?
A4: We add the experiments to report the worst class-conditional coverage (WCCC) as a new metric (Table 5 in PDF) on imbalanced mini-ImageNet datasets under the same setting from Table 16 in the Appendix. In WCCC, RC3P significantly outperforms CCP and Cluster-CP with improvement. Similar improvements of RC3P can be found on other datasets.
Thank you for the response. I will keep my positive rating.
This paper aims to reduce the prediction set sizes of conformal predictors while achieving class-conditional coverage. The authors identify that class-wise conformal prediction (CCP) scans all labels uniformly, resulting in large prediction sets. To address this issue, they propose Rank Calibrated Class-conditional Conformal Prediction (RC3P), which augments the label rank calibration strategy within CCP. Theoretical analysis and experimental results demonstrate that RC3P outperforms CCP.
Strengths
- The proposed method is well-motivated, and the paper is easy to follow. The authors find that CCP scans all labels uniformly, resulting in large prediction sets.
- The proposed method introduces a novel conformal predictor for class-conditional coverage, achieving smaller set sizes than other classical methods on imbalanced datasets.
Weaknesses
- The proposed methods and experiments appear logically incoherent. The authors provide an explanation for the large prediction sets of CCP, but the interesting conclusion is not related to imbalanced data. Why do the authors only consider imbalanced datasets? Additionally, could the authors report the performance of different methods on balanced data?
- The datasets used in the experiments are small, which is impractical for real-world applications. As shown in Section 5, the number of classes in the four datasets is at most 101. Moreover, Cluster-CP was proposed for datasets with many classes, so it may be unfair to compare these methods only on small-scale datasets. Furthermore, the classical method THR [1] should be considered in the experiments.
- The contribution of Lemma 4.2 is limited. The assumption in Lemma 4.2 is very strong, from which the final results can be directly inferred.

[1] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association.
Questions
- What is the ratio of the training set, calibration set, and test set? Does RC3P still outperform other methods when the size of the calibration set is small? I think an ablation experiment on the size of the calibration set would be better.
- How do you control the UCR of RC3P? Please provide some details.
Limitations
They are adequately discussed.
We thank the reviewer for the constructive feedback. Below we provide our rebuttal for the key questions from the reviewer.
Q1: Why only consider imbalanced datasets? Report the performance on balanced data?
A1: RC3P is a general class-conditional CP method and works for models trained on both imbalanced and balanced classification data. We originally did imbalanced experiments in the main paper, since the imbalanced setting is more common and challenging in practice, and it also better differentiates the performance of RC3P from the baselines.
The example in Section 4.1 only gives an intuitive motivation, but it is valid for both balanced and imbalanced settings. Additionally, since the standard CCP iterates over all class-wise quantiles, the prediction set sizes can be more diverse and exhibit large variance under the imbalanced setting, which further degenerates the predictive efficiency.
We add balanced experiment on diverse datasets (CIFAR-100, Places365, iNaturalist, ImageNet) in Experiment (1) of GR1 (see Table1 in the attached PDF). The models are pre-trained. UCR is controlled to . RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average, with on iNaturalist and on ImageNet. We get similar improvements with APS and THR scoring functions.
Q2: It may be unfair to compare these methods only on small-scale datasets.
A2: In addition to the large-scale experiments on balanced data in Experiment (1) of GR1, in Experiment (2) Large-scale imbalanced classification experiment (Table 2 in PDF), we also tested baselines and RC3P on a large-scale dataset (ImageNet-LT) with a deep model trained from scratch using the imbalanced training method LDAM [r1]. UCR is controlled to . RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average.
Q3: Furthermore, the classical method THR[1] should be considered in the experiments.
A3: We have added the experiments with the THR scoring function [r2] in Experiment (4) of GR1 (Table 4 in PDF) for baselines and RC3P. UCR is controlled to on CIFAR-10 and on other datasets. RC3P significantly outperforms best baselines with on mini-ImageNet reduction compared with the best baseline. Similar improvements of RC3P can be found on other datasets.
Q4: The contribution of Lemma 4.2 is limited. The assumption in Lemma 4.2 is very strong, which can directly infer the final results.
A4: (See also GR2)
We highlight that Lemma 4.2 is not our main intellectual contribution for the improved predictive efficiency of RC3P. Instead, it showcases a condition number that parametrizes whether the target event "the predictive efficiency is improved by RC3P" happens or not.
Then Theorem 4.3 further analyzes the transition from the parameters of RC3P to the parameter of the target event, to indicate how to configure RC3P to improve predictive efficiency. Specifically, as pointed out in its Remark, if we set the class-wise rank threshold of RC3P as small as possible while still satisfying the condition required by the model-agnostic coverage guarantee (Theorem 4.1), then we have a higher probability of guaranteeing the improved-efficiency condition. This overall transition of parameters is equivalent to (7) in the Remark and helps us configure RC3P to most likely improve the predictive efficiency over the baseline CCP method.
Q5: What is the ratio of training set, calibration set and test set?
A5: We show the description of datasets in the following table. Thank you for your suggestion. We will add it in the revised paper.
| Dataset | CIFAR-10 | CIFAR-100 | mini-ImageNet | FOOD-101 |
|---|---|---|---|---|
| Number of training samples | 50000 | 50000 | 30000 | 75750 |
| Number of classes | 10 | 100 | 100 | 101 |
| Each class calibration samples | 500 | 50 | 150 | 125 |
| Imbalanced ratios | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} | { 0.5, 0.4, 0.3, 0.2, 0.1} |
Q6: Does RC3P still outperform other methods when the size of calibration set is small?
A6: We have added the experiments with various sizes for the calibration sets in Experiment (3) of GR1 (Table 3 in PDF): Baselines and RC3P are calibrated with calibration sets of various numbers of samples. UCR is controlled to for CCP and for Cluster-CP and RC3P. Following Cluster-CP, we set each class calibration samples . RC3P significantly outperforms the best baseline with reduction in APSS ( better) on mini-ImageNet. Similar improvements of RC3P can be found on other datasets.
Q7: How do you control the UCR of RC3P?
A7: (See also GR3*)
[r1] Sadinle et al., 2019. Least ambiguous set-valued classifiers with bounded error levels.
[r2] Ding et al., 2024. Class-conditional conformal prediction with many classes.
Thank you for your response. I still have concerns about the controlled UCR. I think tuning the hyper-parameter on the calibration data breaks the basic assumption of CP, i.e., exchangeability. Tuning the hyper-parameters on fresh/hold-out data is reasonable.
FQ1: "I think tuning the hyper-parameter on the calibration data breaks the basic assumption of CP, i.e., exchangeability. Tuning the hyper-parameters on fresh/hold-out data is reasonable."
FA1: Thank you for your feedback. In the following response, we
(1) clarify that tuning the hyper-parameter will not violate the exchangeability assumption.
(2) elaborate our experiment setting and show additional experimental results, where RC3P significantly outperforms the best baseline with (four datasets) or (excluding CIFAR-10) reduction in APSS ( better). It is similar to the improvements of RC3P that we report in our main paper ( (four datasets) or (excluding CIFAR-10) reduction).
(3) explain that limited calibration samples may result in inaccurate quantile and thus inaccurate coverage/efficiency measures.
(4) elaborate the experiment where RC3P still outperforms baselines without tuning in Table 7 of rebuttal PDF. Conditioned on similar APSS of all methods, RC3P significantly outperforms the best baselines with reduction in UCG on average.
(1) Adding inflation on the calibration dataset will not violate the exchangeability assumption: Exchangeability is defined [fr1] by the invariance of the joint distribution of the variables $Z_1, \dots, Z_n$ under every permutation $\pi$ of the indices $\{1, \dots, n\}$, i.e., $(Z_1, \dots, Z_n) \overset{d}{=} (Z_{\pi(1)}, \dots, Z_{\pi(n)})$.
Exchangeability is a property of the data distribution. In our experiments, adding inflation on the nominal coverage is an algorithm design choice and will not change the data distribution; the calibration and test datasets still consist of i.i.d. samples from the same distribution. The same algorithm design (i.e., inflating the coverage during calibration) is applied in previous CP papers, such as ARCP (see Theorem 2, Equation 8 in [fr2]), which uses an inflated nominal coverage.
(2) New experiment on hold-out data for tuning the hyper-parameter: RC3P achieves the same level of improvement in the reduction of APSS (from the best baseline).
Below we report the experiments where the hyper-parameter is tuned on hold-out samples on four datasets with the APS scoring function and imbalance EXP, 0.1. We first split the calibration dataset into calibration and hold-out samples (50-50 split), where the hold-out samples are used to tune the hyper-parameter and the calibration samples are used to compute class-wise quantiles. UCR is controlled to on CIFAR-10 and on other datasets.
| Dataset | CIFAR-10 | CIFAR-100 | mini-ImageNet | FOOD-101 |
|---|---|---|---|---|
| # hold-out samples per class | 250 | 25 | 75 | 62 |
| CCP | 1.940 ± 0.020 | 38.937 ± 0.335 | 32.249 ± 0.419 | 33.253 ± 0.367 |
| Cluster-CP | 2.212 ± 0.019 | 34.125 ± 0.990 | 36.372 ± 0.270 | 40.762 ± 0.343 |
| RC3P | 1.940 ± 0.020 | 19.668 ± 0.005 | 17.736 ± 0.002 | 21.565 ± 0.007 |
These results show that RC3P significantly outperforms the best baseline with (four datasets) or (excluding CIFAR-10) reduction in APSS ( better). A similar order of improvement of RC3P can be found in our main paper ( on four datasets or excluding CIFAR-10 APSS reduction from the best baseline, Table 1).
(3) Limited calibration samples may result in inaccurate quantile and thus inaccurate coverage/efficiency measures:
There are few training data samples for the tail classes and a limited number of calibration samples (e.g., 50 calibration samples of each class in CIFAR-100, see A5 for Q5) in our setting. If we split the calibration datasets into calibration and hold-out samples, the limited samples may cause inaccurate class-wise quantiles compared to the true quantiles, and thus inaccurate coverage and efficiency measures. In practice, our experiment in the main paper used as many samples as possible for calibration. For instance, comparing the results in the above table with Table 1 in our main paper, there are minor perturbations in APSS measures.
(4) Without tuning the hyper-parameter, RC3P still outperforms the baselines (35.18% reduction in UCG on average): We added the experiments without controlling UCR in Experiment (7) of GR1 (Table 7 in PDF) under the same setting as the main paper.
The model is trained from scratch using LDAM [fr3]. UCR is not controlled. We then use the total under-coverage gap (UCG; lower is better) between the class-conditional coverage and the target coverage of all under-covered classes. We choose UCG as the fine-grained metric to differentiate the coverage performance in our experiment setting. Conditioned on similar APSS for all methods, RC3P significantly outperforms the best baselines with a 35.18% reduction in UCG on average.
[fr1] Shafer, G. and Vovk, V., 2008. A tutorial on conformal prediction
[fr2] Gendler et al., 2021. Adversarially robust conformal prediction
[fr3] Cao et al, 2019. Learning imbalanced datasets with label-distribution-aware margin loss
Thank the authors for the detailed response. Most of my concerns have been addressed. I have increased my score to 6.
Moreover, in Table 1 of the manuscript, how do you choose the number of clusters for Cluster-CP? (I cannot find this experiment setting in the manuscript.) In the original paper on Cluster-CP, the number of clusters is determined by hold-out data.
Thanks for your feedback and questions.
We employed the Cluster-CP code from the authors and the same methodology to choose the number of clusters.
The paper proposes a new algorithm called Rank Calibrated Class-conditional CP (RC3P) that augments the label rank calibration to the conformal classification calibration step. It theoretically proves that RC3P achieves valid class-wise coverage.
Strengths
- Overall, the idea is clearly presented, and the motivation behind the problem—improving efficiency in CCP for imbalanced data—is compelling.
- The idea of only using top-k classes in the conformal calibration is intuitive.
- The experimental results are adequate, covering different datasets/settings.
Weaknesses
- RC3P heavily relies on the ranking of candidate class labels. If the classifier's ranking is not reliable, the result could be more conservative.
- Lemma 4.2 is not very convincing, since the assumption is basically the conclusion itself. Theorem 4.3 tries to give a condition on when Lemma 4.2 holds, but it is still not very informative. It is not clear to me that RC3P is better than CCP.
Questions
In the simulation result, you "set the UCR of RC3P the same as or smaller (more restrictive) than that of other methods under 0.16 on CIFAR-10 and 0.03 on other datasets". I thought UCR is a metric, so why would you be able to do that?
Limitations
See above.
We thank the reviewer for the constructive feedback. Below we provide our rebuttal for the key questions from the reviewer.
Q1: RC3P heavily relies on the ranking of candidate class labels?
A1: RC3P does not heavily rely on model’s label ranking to guarantee the class-conditional coverage, and is instead agnostic to the model’s ranking performance, as shown in Theorem 4.1.
The improved predictive efficiency of RC3P relies on certain conditions on the model, as specified in Lemma 4.2 and Theorem 4.3. However, these conditions (e.g., the parameterized condition number) very likely hold in practice: we empirically verified them, and they hold in all experiments. See Figure 3 in the main paper and Figures 15-25 in the Appendix.
Q2: Lemma 4.2 not convincing and Theorem 4.3 not informative. Not clear why RC3P is better than CCP
A2: (See also GR2)
We highlight that Lemma 4.2 is not our main intellectual contribution for the improved predictive efficiency of RC3P. Instead, it showcases a condition number that parametrizes whether the target event "the predictive efficiency is improved by RC3P" happens or not.
Then Theorem 4.3 further analyzes the transition from the parameters of RC3P to the parameter of the target event, to indicate how to configure RC3P to improve predictive efficiency. Specifically, as pointed out in its Remark, if we set the class-wise rank threshold of RC3P as small as possible while still satisfying the condition required by the model-agnostic coverage guarantee (Theorem 4.1), then we have a higher probability of guaranteeing the improved-efficiency condition. This overall transition of parameters is equivalent to (7) in the Remark and helps us configure RC3P to most likely improve the predictive efficiency over the baseline CCP method.
Practical implications for the analysis of predictive efficiency in experiments: We conducted extensive experiments across multiple datasets to compare the predictive efficiency of each method and to verify the condition number. See Figure 3 in the main paper and Figures 15-25 in the Appendix.
Our empirical results consistently show that RC3P achieves better predictive efficiency than CCP and the condition holds across all experimental settings. This empirical validation demonstrates that the practical implementation of RC3P aligns with the theoretical improvements shown in Lemma 4.2 and Theorem 4.3.
Q3: UCR is a metric, so why would you be able to set it?
A3: (See also GR3*)
Why: While UCR is a coverage metric, fixing it to a certain value (e.g., 0) across methods allows us to ensure fair comparisons to make valid conclusions about the predictive efficiency of all CP methods.
Because coverage and predictive efficiency are two competing metrics in CP, e.g., achieving better coverage (resp. predictive efficiency) degenerates predictive efficiency (resp. coverage). If we do not fix one of them, then we cannot conclude which method has more predictive efficiency by comparison. We add the experiments on balanced CIFAR-100 datasets without controlling the UCR. See the table below.
| Methods | CCP | Cluster-CP | RC3P |
|---|---|---|---|
| UCR | 0.444 | 0.446 | 0.228 |
| APSS | 11.376 | 9.7865 | 10.712 |
As shown, RC3P achieves the best coverage metric (UCR), while Cluster-CP achieves the best metric for prediction set sizes (APSS). So this comparison does not conclude which one is more efficient.
Therefore, we fix UCR to first guarantee the class-conditional coverage, conditioned on which we can perform a meaningful comparison of predictive efficiency. This strategy is also discussed in [r1].
How: We add an inflation quantity to the nominal coverage level, which depends on a tunable hyper-parameter and the number of calibration samples in each class (for CCP and RC3P) or each cluster (for Cluster-CP). The structure follows the format of a generalization error bound when setting the empirical quantile, as in [r2]. We select the minimal value of the hyper-parameter such that UCR is under 0.16 on CIFAR-10 and 0.03 on other datasets.
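As a rough illustration of this step, the sketch below computes a more conservative class-wise miscoverage level from a tunable hyper-parameter `delta` and the per-class calibration size `n_y`. The square-root form mirrors a standard finite-sample generalization bound; it is an assumption for illustration here, not the exact inflation formula used in the paper, and the function name is hypothetical.

```python
import numpy as np

def inflated_miscoverage(alpha, n_y, delta):
    """Return a more conservative miscoverage level for a class (or cluster)
    with n_y calibration samples; delta is a tunable hyper-parameter.
    The inflation term shrinks as n_y grows, so classes with more
    calibration data are penalized less."""
    inflation = np.sqrt(np.log(1.0 / delta) / (2.0 * n_y))
    return max(alpha - inflation, 0.0)

# e.g., nominal alpha = 0.1 with 50 calibration samples per class
# (the CIFAR-100 setting in the tables above):
print(inflated_miscoverage(alpha=0.1, n_y=50, delta=0.5))
```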
We added the experiments without controlling UCR in Experiment (7) of GR1 (Table 7 in PDF) under the same setting as the main paper.
The model is trained from scratch using LDAM (Cao et al., 2019). UCR is not controlled. We then use the total under-coverage gap (UCG; lower is better) between the class-conditional coverage and the target coverage of all under-covered classes. We choose UCG as the fine-grained metric to differentiate the coverage performance in our experiment setting. Conditioned on similar APSS for all methods, RC3P significantly outperforms the best baselines with a 35.18% reduction in UCG on average.
[r1] Fontana et al., 2023. Conformal prediction: a unified review of theory and new challenges
[r2] Vladimir Vovk, 2012. Conditional validity of inductive conformal predictors
Thank you for the detailed rebuttal. I acknowledge that I have read it and it addresses most of my concerns. I'll maintain my score for now.
We thank all reviewers for their constructive feedback. Below we provide a summary of our rebuttal for key questions from reviewers’ as a global response.
GR1. We added 7 new experiments for all CP baselines and RC3P (we train the models from scratch in imbalanced settings using LDAM [r1]).
(1) Balanced experiment on diverse datasets (CIFAR-100, Places365, iNaturalist, ImageNet) (Table 1 in PDF), as suggested by Reviewer Bzsm.
In the table, RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average, with on iNaturalist and on ImageNet. We get similar improvements with APS and THR scoring functions.
(2) Large-scale imbalanced classification experiment (Table 2 in PDF), as suggested by Reviewer Bzsm and 2LUb.
RC3P significantly outperforms the best baseline with reduction in APSS ( better) on average.
(3) Various sizes for the calibration sets (Table 3 in PDF), as suggested by Reviewer Bzsm.
RC3P significantly outperforms the best baseline with reduction in APSS ( better) on mini-ImageNet. Similar improvements of RC3P can be found on other datasets.
(4) Calibration with the THR scoring function [r2] (Table 4 in PDF) for baselines and RC3P, as suggested by reviewer Bzsm.
RC3P significantly outperforms best baselines with on mini-ImageNet reduction compared with the best baseline. Similar improvements of RC3P can be found on other datasets.
(5) Worst class-conditional coverage (WCCC) as a new metric under the same setting from Table 16 in the Appendix (Table 5 in PDF) on imbalanced mini-ImageNet, as suggested by Reviewer vcMw.
In WCCC, RC3P significantly outperforms CCP and Cluster-CP with improvement. Similar improvements of RC3P can be found on other datasets.
(6) Class-wise score distribution as a concrete example of the explanation in Section 4.1 (Table 6 in PDF), as suggested by Reviewer vcMw.
(7) Comparison without controlling UCR under the same setting with the main paper (Table 7 in PDF), as suggested by Reviewer yaTg and Bzsm
We then use the total under-coverage gap (UCG; lower is better) as a fine-grained metric for coverage. Conditioned on similar APSS for all methods, RC3P significantly outperforms the best baselines with a 35.18% reduction in UCG on average.
GR2. Theoretical significance of Lemma 4.2 and Theorem 4.3?
We highlight that Lemma 4.2 is not our main intellectual contribution for the improved predictive efficiency of RC3P. Instead, it showcases a condition number that parametrizes whether the target event "the predictive efficiency is improved by RC3P" happens or not.
Then Theorem 4.3 further analyzes the transition from the parameters of RC3P to the parameter of the target event, to indicate how to configure RC3P to improve predictive efficiency. Specifically, as pointed out in its Remark, if we set the class-wise rank threshold of RC3P as small as possible while still satisfying the condition required by the model-agnostic coverage guarantee (Theorem 4.1), then we have a higher probability of guaranteeing the improved-efficiency condition. This overall transition of parameters is equivalent to (7) in the Remark and helps us configure RC3P to most likely improve the predictive efficiency over the baseline CCP method.
GR3. Why control UCR? How to control UCR?
Why: While UCR is a coverage metric, fixing it to a certain value (e.g., 0) across methods allows us to ensure fair comparisons to get valid conclusions about the predictive efficiency of all CP methods.
Because coverage and predictive efficiency are two competing metrics in CP, e.g., achieving better coverage (resp. predictive efficiency) degenerates predictive efficiency (resp. coverage). If we do not fix one of them, then we cannot conclude which CP method has better predictive efficiency by comparison. We add the experiments on balanced CIFAR-100 datasets without controlling the UCR. See the table below.
| Methods | CCP | Cluster-CP | RC3P |
|---|---|---|---|
| UCR | 0.444 | 0.446 | 0.228 |
| APSS | 11.376 | 9.7865 | 10.712 |
As shown, RC3P achieves the best coverage metric (UCR), while Cluster-CP achieves the best metric for prediction set sizes (APSS), so this comparison does not conclude which one has higher predictive efficiency. Therefore, we fix UCR to first guarantee the class-conditional coverage, conditioned on which we can perform a meaningful comparison of predictive efficiency. This strategy is also discussed in [r3].
How: We add an inflation quantity to the nominal coverage level, which depends on a tunable hyper-parameter and the number of calibration samples in each class (for CCP and RC3P) or each cluster (for Cluster-CP). The structure follows the format of a generalization error bound when setting the empirical quantile, as in [r4]. We select the minimal value of the hyper-parameter such that UCR is under 0.16 on CIFAR-10 and 0.03 on other datasets, due to the variability of the data.
[r1] Cao et al, 2019. Learning imbalanced datasets with label-distribution-aware margin loss
[r2] Sadinle et al., 2019. Least ambiguous set-valued classifiers with bounded error levels
[r3] Fontana et al., 2023. Conformal prediction: a unified review of theory and new challenges
[r4] Vladimir Vovk, 2012. Conditional validity of inductive conformal predictors
Dear Authors and Reviewers!
Thank you for your reviews, rebuttal, additional comments, questions and responses!
We have the last two days of the discussion phase! Please use this time as efficiently as possible :)
The submission is borderline. The authors have made an effort to address all the issues raised in your comments.
@Reviewers: Please read each other's reviews, the authors' rebuttal, and the discussion. Please acknowledge the authors' efforts and actively participate in the final discussion!
Thank you,
NeurIPS 2024 Area Chair
The paper concerns the problem of conformal prediction. The Authors introduce a new algorithm that reduces the size of prediction sets while achieving class-conditional coverage, using an additional label rank calibration step. Theoretical analysis and experimental results are provided to justify the proposed solution in comparison to the baseline.
The paper is clearly presented and the introduced algorithm is novel, well-motivated and sound, supported by solid theoretical and empirical evidence. The reviewers raised several critical remarks, including an inconsistency between the motivation of the method and the empirical studies (the absence of experiments on balanced datasets in the original submission), limitations in the theoretical analysis, unclear notation in some places, and the choice of benchmark datasets. The Authors prepared a rebuttal with additional empirical results and clarifications regarding the theoretical and empirical results. Overall, the reviewers are satisfied with the Authors' response.