PaperHub
Average rating: 5.7/10 — Rejected (3 reviewers)
Ratings: 6, 3, 8 (min 3, max 8, std 2.1)
Confidence: 4.3 | Correctness: 3.0 | Contribution: 2.7 | Presentation: 3.3
ICLR 2025

On Temperature Scaling and Conformal Prediction of Deep Classifiers

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We theoretically and empirically analyze the impact of temperature scaling beyond its usual calibration role on key conformal prediction methods.

Abstract

Keywords
classification, temperature scaling, conformal prediction, conditional coverage, prediction sets

Reviews and Discussion

Official Review
Rating: 6

This paper delves into the impact of Temperature Scaling (TS) on conformal prediction methods. Traditionally, researchers have applied TS before conducting conformal prediction on the resulting classifiers. The paper argues that while it enhances group coverage, it negatively affects set size. I find this paper offers valuable experiments and intriguing insights into the interplay between conformal prediction and temperature scaling, and I recommend an acceptance. My hesitation to give a higher score stems from (a) the paper's inherently limited scope to classification tasks, as regression tasks do not typically encounter this issue; (b) it is somewhat intuitive that TS might not be the best choice for the set size, which is not particularly surprising.

Strengths

  1. The paper focuses on an interesting question: the influence of temperature scaling on conformal prediction outcomes, concerning group coverage and band length.
  2. Empirical evidence is provided to suggest that TS can detrimentally harm the set size of conformal prediction measures such as APS and RAPS, while simultaneously improving group coverage.
  3. The authors demonstrate that TS facilitates a trade-off between set size and group coverage.
  4. Theoretical insights into the empirical findings are offered, along with practical guidelines for practitioners.

Weaknesses

I have few major concerns with this paper. As previously mentioned, my decision not to assign a higher score is based on:

  1. The limited scope of the paper to classification tasks, as regression tasks do not typically present this issue.
  2. The somewhat expected nature of the finding that TS might impact set size negatively.

Questions

See above.

Comment

2. "The somewhat expected nature of the finding that TS might impact set size negatively."

Please note that our empirical and theoretical study shows that TS can also affect the prediction set size positively (at the price of degradation in conditional coverage) and not only negatively.

We believe that our discoveries on the effect of TS on CP methods are not intuitive.

When initiating our research, we hypothesized that CP algorithms would overall benefit from a well-calibrated model, as it seemed reasonable to assume that softmax outputs aligned more closely with true correctness probabilities would enhance performance. Presumably, this has also been the intuition of all the works (cited in our paper) that applied TS calibration before applying CP without any examination of the effect of this procedure on the CP methods (Angelopoulos et al., 2021; Lu et al., 2022; Gibbs et al., 2023; Lu et al., 2023).

However, upon further investigation, we realized that the behavior for APS and RAPS differs from LAC, and that the primary factor that affects APS and RAPS is not the calibration level itself but rather the temperature value, $T$.

Specifically, as presented in our paper, we discovered that when $T<1$, TS positively affects the prediction set size of APS and RAPS, but at the price of a negative effect on their class-conditional coverage. And vice versa for $1 < T < T_{critical}$ (negative effect on prediction set size but positive effect on conditional coverage). All the more so, the phenomenon where there exists $T_{critical}$ such that for TS with $T > T_{critical}$ the trends (degradation/improvement) in the prediction set size and conditional coverage are swapped is also very non-intuitive.

Comment

We thank the reviewer for their overall positive feedback and are pleased that they recognize the importance of our work and the novel insights we provide regarding the effect of TS on CP. Below we provide a detailed response to the reviewer's comments.

1. "The limited scope of the paper to classification tasks, as regression tasks do not typically present this issue."

We believe that exploring and understanding the interplay between calibration and CP in regression is a great idea for future research, separate from this paper. Below we present justifications for our focus on classification:

  • Importance of classification tasks. The classification task is extremely prevalent in practice and is an impactful area of research, with countless significant studies dedicated solely to classification. In fact, many seminal machine learning ideas [1],[2],[3] were originally presented in the context of classification and only later applied to other tasks, such as regression, in separate works.
    Our work focuses on two post-hoc uncertainty quantification methods: TS calibration and CP. Seminal studies in these topics, such as [4],[5],[6], have also been dedicated exclusively to classification, underscoring the relevance and importance of our focus.

  • Addressing regression tasks in addition to classification tasks is beyond the scope of this paper. Regression constitutes a fundamentally different task than classification, and as such, uncertainty quantification methods are very different in regression and classification. For example, CP outputs a prediction set (of discrete labels) in classification but a continuous interval in regression.
    Our study delivers a comprehensive investigation into classification tasks, featuring extensive empirical analysis and a rigorous theoretical exploration of our findings. Tackling regression tasks would require a fundamentally different set of empirical experiments and theoretical approaches. Thus, extending our work to encompass regression tasks would be beyond the feasible scope of a single paper.

We would like to emphasize again that we fully acknowledge the significance of regression problems. We view the investigation of the interplay between calibration and CP in regression as a promising direction for future research, as we mention in the conclusion of the revised paper in line 538.

============================

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems 25 (2012).

[2] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

[4] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. "On calibration of modern neural networks." In International conference on machine learning, pp. 1321–1330. PMLR, 2017.

[5] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. "Uncertainty sets for image classifiers using conformal prediction." In International Conference on Learning Representations, 2021.

[6] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. "Classification with valid and adaptive coverage." Advances in Neural Information Processing Systems 33 (2020): 3581-3591.

Comment

Dear Reviewer yXD1,

We would greatly appreciate receiving your feedback on our responses before the discussion period ends (in less than two days).
Have we addressed your comments?

Thank you again for the review.

Comment

I read the responses carefully and decided to keep my score unchanged for the time being. I still believe that since TS is not end-to-end training, TS is hardly the best choice for conformal prediction. As an intuitive case: if the neural network can consistently predict 95% correct labels, the best confidence band (with alpha = 0.05) result is to set T to be infty (which returns a band with size = 1). However, if we use TS, the best T cannot be infty. So if the original T is pretty large (might be infty), after TS, the results of CP should be worse. Such a simple example could tell us that TS might not always be the best choice for conformal prediction. **The authors could correct me if the above example is wrong.**

For the first point on the extensions, I do agree that classification is an important task. I just want to mention for other readers that this limits the scope of this paper (and I find it difficult to extend the ideas to regression tasks), and this does not influence much on my positive evaluation.

Moreover, I am closely tracking the discussions between the authors and Reviewer eNPR.

Comment

We sincerely thank Reviewer yXD1 for their response, and are glad to hear about their acknowledgment regarding the importance of classification tasks.

Let us begin with some comments on end-to-end training and post-hoc methods for improving CP. End-to-end and post-hoc methods are fundamentally different, each with distinct advantages and limitations. For example, modifying an objective, e.g. increasing the importance of prediction set size or conditional coverage, in an end-to-end method requires access to the training set and re-training the entire model (both may not be feasible for the user). On the other hand, post-hoc methods do not require re-training the network. Furthermore, while end-to-end training of deep neural networks benefits from their overparametrization when minimizing the loss objectives, typically, optimality/guarantees cannot be ensured theoretically. Indeed, [Stutz et al. (2022)], which proposes end-to-end training for optimizing CP properties, still includes a post-hoc application of CP in order to ensure marginal coverage.

Our work focuses on TS with temperature that is modified post-hoc, not as an assertion that it is better than end-to-end training, but rather as a recognition that no single ideal approach exists. We position our work as a valuable complement to end-to-end methods, such as [Stutz et al. (2022)].

Regarding the example you provided, given such a model with 95\% accuracy and $\alpha = 0.05$, you suggest that taking the predicted label to be the prediction set is the best and that this can be done by taking $T \to \infty$ --- we assume your intention is taking $T \to 0$, which raises the highest post-softmax entry even further and decreases the entropy of the post-softmax vector. Indeed, taking $T \to 0$ will cause that; however, as discussed in our paper, "best confidence band" can be interpreted with respect to two objectives: prediction set size and proximity to conditional coverage. Your suggestion is correct when considering the prediction set sizes as the primary objective. In Section 5 of our paper, we discuss this trade-off in detail and demonstrate how using TS allows for easy tuning of the CP performance between these objectives. Specifically, if minimizing prediction set sizes is the sole objective, we suggest taking $T \to 0$. Conversely, if ensuring class-conditional coverage is the priority, choosing $T = T_{critical}$ may be more appropriate.
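To make the effect of $T \to 0$ on the post-softmax vector concrete, here is a minimal numerical sketch (our own illustration, not from the paper); the logit values are arbitrary.

```python
import numpy as np

z = np.array([3.0, 1.0, 0.5])  # arbitrary illustrative logits
for T in [2.0, 1.0, 0.5, 0.1]:
    p = np.exp(z / T)
    p /= p.sum()
    # As T decreases, the largest entry approaches 1 and the entropy shrinks.
    print(f"T={T}: probs={np.round(p, 4)}")
```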

Interestingly, one of our experiments closely aligns with your example. Consider the CIFAR10-ResNet34 dataset-model pair, which achieves a Top-1 accuracy of 95.3\%. Figures 12 and 14 (right) show AvgSize and TopCovGap as functions of $T$ at an error level of $\alpha = 0.05$. As you noted, AvgSize approaches 1 as $T \to 0$, but importantly, TopCovGap also increases, illustrating the trade-off.

Therefore, in this case of a very accurate model (with respect to $\alpha$), we do see that if the user prioritizes the prediction set sizes (over conditional coverage) they can simply set an extremely small $T$ that effectively behaves like $T \to 0$ and reach minimal set sizes, which is in agreement with our proposed guidelines.

If at any point in our response we have misinterpreted your intention, we would be grateful for any clarification and will gladly adjust our response accordingly. Thank you again for your important comment.

Official Review
Rating: 3

In this paper, the authors investigate the interplay between conformal prediction and calibration. Firstly, the paper empirically shows that while temperature scaling improves the class-conditional coverage of adaptive CP methods, it can negatively affect their prediction set sizes. Then, the authors establish a mathematical theory that explains this phenomenon. Finally, the paper offers a guideline to effectively combine adaptive CP with calibration.

Strengths

  1. The work is well-written. Basically, the paper is written in a good manner and I believe readers can easily grasp the core idea.

  2. This paper provides a theoretical analysis to show how temperature values influence the properties of prediction sets. With the theoretical results, researchers can understand why temperature scaling affects conformal prediction.

  3. The authors provide empirical validation of their theoretical framework.

Weaknesses

  1. The paper presents an inconsistency in its mathematical derivations. In Eq. (3), the analysis is mainly based on the relationship between $\sum_{i=1}^M\pi_i-\sum_{i=1}^M\pi_{T,i}$ and $q-q_T$ (hats on $\pi$ and $q$ omitted due to TeX support), which represents the difference of accumulated probability and threshold value before and after applying a temperature. Later, in Eq. (5), the problem above is translated to investigating the relationship between $$\sum_{i=1}^M\exp(z_i)-\sum_{i=1}^M\exp(z_i/T)\quad\text{and}\quad\sum_{i=1}^M\exp(z_i^q)-\sum_{i=1}^M\exp(z_i^q/T).$$ However, based on the definition of $z^q$, we know that if $M$ is not the true label of $z^q$, then $\sum_{i=1}^M\exp(z_i^q/T)\neq\hat{q}_T$. Therefore, it is unclear how the problem in Eq. (3) can be equivalent to the analysis in Eq. (5).

  2. The assumptions in Theorem 4.4 appear to be unreasonable. Theorem 4.4 in the paper states that if $\Delta z > b(T)$, then raising $z_1$ leads to an increase in the set size. However, the assumption that '$\Delta z$ is preserved as $z_1$ increases' (line 458) is not natural for me because counterexamples exist where $\Delta z$ increases as $z_1$ increases. Furthermore, even if we accept the condition that $\Delta z$ remains constant, the paper's claim that '$z^q$ has a larger dominant entry than typical $z$' lacks proper justification.

  3. The theoretical bounds and empirical results show inconsistency. The paper reports an empirical critical temperature of $T^{*}=1.524$. However, this value falls between the theoretical temperature ranges $(0,0.813)$ and $(1.25,4.81)$, suggesting a gap between theory and practice. This inconsistency challenges the paper's claim that 'the bounds in Theorem 4.4 do not require unreasonable values of $\Delta z$ and $T$' (line 463). Moreover, it indicates that using the median of $\Delta z$ to estimate $T_{critical}$ may not be a reliable approach.

  4. The proposed approach of using $T_{critical}$ to enhance conditional coverage (as proposed in Section 5) has limitations. As discussed in Weaknesses[3], the theorem may fail to provide an accurate estimation of $T_{critical}$. Even if an accurate estimation can be achieved, simply applying $T_{critical}$ falls short of achieving the group-wise coverage that Mondrian conformal prediction provides. Furthermore, the paper does not present empirical evidence demonstrating how their proposed guideline enhances conformal prediction performance. Overall, these limitations restrict the practical applicability of the theoretical results.

Minor Comments:

  1. The paper's analysis is limited to Temperature Scaling, while leaving out other important calibration methods such as histogram binning.
  2. The mathematical proofs lack clarity. For example, in the proof of Theorem 4.1, the first equation (lines 750-755) is stated without proper mathematical justification.
  3. typo in Section 5: "guidelinse" should be "guidelines".

I would be open to reconsidering my recommendation if these issues could be addressed in the rebuttal.

Questions

  1. While Theorem 4.4 provides valuable theoretical insights, it is hard to understand how to use the theoretical results for estimating $T_{critical}$. I would greatly appreciate it if the authors could provide an explicit expression for computing $\hat{T}_{critical}$.
Comment

We sincerely thank the reviewer for this detailed review. Below we provide our point-to-point response to each of their concerns. We hope that our detailed response will lead to reconsidering the score.

1. On the link between Eq. (3) and Eq. (5)

We sincerely thank the reviewer for this comment. Indeed, there was a missing part in the derivation in the original version which relates $g(z^q,T,M)$ (associated with Eq. 5) to $g(z^q,T,L^q)$ (associated with $\hat{q}-\hat{q}_T$ in Eq. 3), where by $L^q$ we denote the rank of the true label of the quantile sample $z^q$.

We empirically observed that $|g(z^q,T,M) - g(z^q,T,L^q)|$ is negligible in our experiments (where $\Delta z^q \gg 1$). Moreover, we also managed to show theoretically that $|g(z^q,T,M) - g(z^q,T,L^q)|$ decays exponentially with $\Delta z^q$.

Accordingly, in the revised version, before Eq. 5, we added this clarification which refers the reader to Proposition A.6 in the appendix, where we formally show (under a minor assumption which is fully empirically justified) that
$$|g(z,T,M) - g(z,T,L)| < \frac{(C-1)\exp(-\Delta z)}{(C-1)\exp(-\Delta z) + 1} \quad \text{if} \quad 0 < T < 1,$$
$$|g(z,T,M) - g(z,T,L)| < \frac{(C-1)\exp(-\Delta z / T)}{(C-1)\exp(-\Delta z / T) + 1} \quad \text{if} \quad T > 1.$$

This formally implies that $|g(z^q,T,M) - g(z^q,T,L^q)|$ is small since $\Delta z^q \gg 1$.

To conclude, our proof for the bound of $|g(z^q,T,M) - g(z^q,T,L^q)|$, as well as the empirical values presented in Appendix A.1, link Eq. (3) and Eq. (5). Furthermore, the relevance of our theoretical analysis is further reinforced by the fact that the analysis of Eq. (5) provides fine-grained explanations for the behavior observed in the experiments (as we explain in the responses to the comments below --- the theory indeed agrees with the experiments).

Again, we are grateful to the reviewer for pointing out this issue, and as mentioned, we addressed this in the revised version in lines 428–432 in the main paper and in Appendix A.1.

 
2(a). On the assumptions in Theorem 4.4 - line 458 in the original version

It seems there might have been some misunderstanding regarding the sentence in line 458 in the original version (now associated with lines 451-452 in the revision).

Theorem 4.4 indeed states that if $\Delta z > b(T)$, then increasing $z_1$ leads to an increase in the prediction set size. However, contrary to the reviewer's comment, in line 458 there was no additional constantness assumption on $\Delta z$.

The complete sentence in line 458 of the original version, which the reviewer refers to, is:
"indeed, the condition on Δz\Delta z is preserved as z1z_1 increases."

We do not require that $\Delta z$ is preserved or constant if $z_1$ increases (which cannot be true). On the contrary, for $T>1$, we explain that if $z_1$ increases then $\Delta z$ increases, and therefore $\Delta z > b(T)$ continues to hold, i.e., the condition on $\Delta z$ in the theorem still holds.

Following this comment, in the revision we rephrased the sentence for better clarity (lines 451-452 in the revised version):
"indeed, when z1z_1 increases then Δz\Delta z increases, and thus the inequality Δz>bT>1(T)\Delta z > b_{T>1}(T) remains satisfied."

 
2(b). On the assumptions in Theorem 4.4 - larger $\Delta z$ for $z^q$

Regarding the justification of "$z^q$ has a larger dominant entry than typical $z$": As stated in lines 418-431 of the original version (406-420 in the revised version), the score of the quantile sample $z^q$ exceeds the scores of $(1-\alpha)\%$ (e.g. 90%) of the other samples. Figure 3 (Figure 4 in the original version with a minor fix) illustrates a strong correlation between the score and $\Delta z$. Since $z^q$ has a higher score than the median score, associated with some $z'$, this empirically supports the condition that it has a larger $\Delta z$. In numbers, in Figure 3, the quantile sample has $\Delta z^q \approx 11$ while the median sample has $\Delta z' \approx 8$, and a larger $\Delta z = z_{(1)}-z_{(2)}$ implies a more dominant highest entry. The difference is even enhanced after the softmax due to the relation $\pi_i \propto \exp(z_i)$.

Comment

3. On the agreement of the theoretical bounds and the empirical results

We believe there may have been a misunderstanding of the reviewer regarding the content of the paragraph in lines 463-470 in the original version (lines 458-465 in the revised version).

We demonstrate that our theory not only provides intuition for the phenomenon but also offers reasonable bounds that explain the empirical observations for a "typical" sample in the CIFAR100-ResNet50 setting. For this typical sample, which we interpret as the sample with the median score, we have $\Delta z \approx 8$. Based on Theorem 4.4 (when we plug $\Delta z \approx 8$ and $C=100$ into it), the conditions on $\Delta z$ of the theorem hold for temperatures in the ranges $(0, 0.831)$ and $(1.25, 4.81)$.

The optimal calibration temperature for CIFAR100-ResNet50 is $T^* = 1.524$. The reviewer incorrectly referred to it as the critical temperature (which we defined differently - $T_{critical}$ is the temperature where the trend in the curve of the prediction set size vs. $T$ changes from increase to decrease).

$T^* = 1.524$ indeed falls within the range where the theorem is valid. The theorem predicts that for this temperature the prediction set size of a typical (median-score) sample increases. This is aligned with the increase in the mean prediction set size that we see in Table 1 for APS for CIFAR-100, ResNet50 after performing TS calibration (which resulted in using $T^* = 1.524$).

To conclude, there is no inconsistency here.

 
4. Regarding the proposed approach in Section 5

In continuation with our response to the previous comment, it seems that the reviewer confused $T^*$ and $T_{critical}$. Accordingly, their claim "As discussed in Weaknesses[3], the theorem may fail..." is not justified.

The proposed approach in Section 5 suggests applying TS separately for calibration and for CP, and explains how the samples that are used for calibration can be utilized to estimate the curves of the CP method (AvgSize and TopCovGap vs. $T$), as demonstrated in Appendix C. As we explain in Section 5, the range of $T$ for exploration can be bounded by $T_{critical}$, which can be approximated by our theory. Again, this is not the temperature $T^*$ that is optimal for calibration but rather the intersection of the two functions $\frac{T}{T-1}\ln(4T)$ and $\frac{T-1}{T+1}\ln(4T(C-1)^2)$ in the $T>1$ branch of Eq. 6.
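To illustrate how a user could carry out such an exploration, below is a minimal sketch (our own illustration, not the authors' code) of the non-randomized APS procedure with a temperature parameter; the function name, array shapes, and default error level are assumptions for the example.

```python
import numpy as np

def aps_avg_size(cal_logits, cal_labels, test_logits, T, alpha=0.1):
    """Average prediction-set size of (non-randomized) APS when the logits
    are temperature-scaled by T before conformal prediction. Illustrative only."""
    def softmax_T(z):
        z = z / T
        z = z - z.max(axis=1, keepdims=True)   # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    # APS conformity score: cumulative probability mass down to (and including) the true label.
    p_cal = softmax_T(cal_logits)
    order = np.argsort(-p_cal, axis=1)
    cum = np.cumsum(np.take_along_axis(p_cal, order, axis=1), axis=1)
    rank_of_true = np.argmax(order == cal_labels[:, None], axis=1)
    scores = cum[np.arange(len(cal_labels)), rank_of_true]

    # Conformal quantile with the usual finite-sample correction.
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, level, method="higher")

    # Prediction sets on test points: add labels (by decreasing probability) until the mass reaches q_hat.
    p_test = softmax_T(test_logits)
    cum_test = np.cumsum(np.sort(p_test, axis=1)[:, ::-1], axis=1)
    sizes = np.minimum(1 + (cum_test < q_hat).sum(axis=1), p_test.shape[1])
    return sizes.mean()

# Sweeping T (e.g. over np.linspace(0.5, 3.0, 26)) on held-out data traces the
# AvgSize-vs-T curve discussed above; a TopCovGap curve can be computed analogously per class.
```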

Since our guideline allows the user to set a temperature given their preferences between AvgSize and TopCovGap, it has practical significance.

Regarding Mondrian conformal prediction [1], this CP method is based on obtaining CP thresholds separately per class. We observed that our approach outperforms it in our experiments, which is aligned with its inferior performance reported in a previous paper when the number of samples used for calibrating CP per class is small [2].

Specifically, in our experiments, we consider CIFAR-100, which has $C=100$ classes and a CP set (used to calibrate the CP) of size up to 2000, and ImageNet, which has $C=1000$ classes and a CP set of size up to 10000. This means that we have only approximately 20 samples per class to calibrate CP for CIFAR-100 and 10 samples per class to calibrate CP for ImageNet.

Our metric TopCovGap measures the worst-case class-coverage violation. For a user that prioritizes class-conditional coverage, our detailed research and guidelines in Section 5 suggest using TS with $T=T_{critical}$, where AvgSize is high but, by the trade-off, TopCovGap is low.
The properties of RAPS with such TS outperform Mondrian CP in both TopCovGap and AvgSize.

We thank the reviewer for mentioning the Mondrian CP. We discuss and compare our approach to it in the revision in lines 523-526 and in Appendix D. These results, which show that our approach outperforms it, further strengthen the practical significance of our contribution.

=========

[1] Vladimir Vovk. "Conditional validity of inductive conformal predictors." In Asian Conference on Machine Learning, 2012.

[2] Ding, Tiffany, et al. "Class-conditional conformal prediction with many classes." Advances in Neural Information Processing Systems, 2024.

Comment

Minor comment 1: "The paper's analysis is limited to Temperature Scaling"

We raise several justifications for our focus on the TS confidence calibration method:

  • Popularity and efficiency of Temperature Scaling. As stated in lines 44-48, "Guo et al. (2017) demonstrated the usefulness of a simple Temperature Scaling (TS) procedure (a single parameter variant of Platt scaling (Platt et al., 1999)). Since then, TS calibration has gained massive popularity (Liang et al., 2018; Ji et al., 2019; Wang et al., 2021; Frenkel & Goldberger, 2021; Ding et al., 2021; Wei et al., 2022). Thus studying it is of high significance."
  • Utilization of Temperature Scaling as a pre-processing step for Conformal Prediction. As we mentioned in line 60, the previous papers (Angelopoulos et al., 2021; Lu et al., 2022; Gibbs et al., 2023; Lu et al., 2023) that apply initial calibration before applying CP methods use the TS calibration algorithm rather than any other calibration method.
  • The ability to develop a theory on Temperature Scaling. The simplicity of the TS procedure enabled us to develop a theoretical framework for its effect on CP, which is a notable achievement given the complexity of the problem.
  • Temperature Scaling can be used beyond calibration. Since TS is based on a single parameter (the temperature), as discussed in Section 5 based on our discoveries, it paves the way for practitioners to use TS beyond the value associated with calibration in order to trade prediction set sizes and conditional coverage of adaptive CP methods (APS and RAPS) according to their specific needs.

Therefore, it is fair to focus on TS in this paper and defer the study of other calibration methods to future research.

 
Minor comment 2: Clarity of the mathematical proofs

Following this comment, in the revised version we added explanations to steps in the proofs.

Regarding the specific question of the reviewer on lines 750-755 in the original version (proof of Theorem A.1): according to the lemma in Theorem A.1, for all $z_i \geq z_j$ and $T \geq \tilde{T} \geq 0$, the following holds:

$$\exp(z_i/\tilde{T}) \cdot \exp(z_j/T) \geq \exp(z_i/T) \cdot \exp(z_j/\tilde{T})$$

Note that we denoted the sets of indices $I = [1, 2, \ldots, L]$ and $J = [L+1, L+2, \ldots, C]$. In the theorem, we deal with a sorted logits vector, and therefore, as stated in the original version: "Because $z$ is sorted, $\forall i \in I, j \in J$ we have $z_i > z_j$." So, we have the above inequality for each $i \in I, j \in J$. To obtain the inequality in line 750 in the original version, we merely sum the inequality over all $i \in I, j \in J$ combinations:
$$\sum_{i=1}^L \sum_{j=L+1}^C \exp(z_i/\tilde{T}) \cdot \exp(z_j/T) \geq \sum_{i=1}^L \sum_{j=L+1}^C \exp(z_i/T) \cdot \exp(z_j/\tilde{T}).$$
To get to line 755 in the original version, we separated the summation over the indices $i \in I, j \in J$:
$$\sum_{i=1}^{L}\exp(z_{i}/\tilde{T}) \cdot \sum_{j=L+1}^{C}\exp(z_{j}/T) \geq \sum_{i=1}^{L}\exp(z_{i}/T) \cdot \sum_{j=L+1}^{C}\exp(z_{j}/\tilde{T}).$$
In the revision, we stated verbally the summation and separation of the sums.
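As a quick numerical sanity check of this summation step (our own sketch; the values of $C$, $L$, $T$, and $\tilde{T}$ are arbitrary choices), one can verify the inequality on random sorted logit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
C, L = 100, 5            # number of classes and split index (arbitrary choices)
T, T_tilde = 2.0, 1.2    # any T >= T_tilde > 0

for _ in range(1000):
    z = -np.sort(-rng.normal(size=C))    # sorted logits, descending
    lhs = np.exp(z[:L] / T_tilde).sum() * np.exp(z[L:] / T).sum()
    rhs = np.exp(z[:L] / T).sum() * np.exp(z[L:] / T_tilde).sum()
    assert lhs >= rhs - 1e-12            # the separated-sums inequality holds
```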

 
Minor comment 3: typo in Section 5

Thank you for bringing this to our attention. We have carefully reviewed the paper for language and typographical errors and made corrections where necessary.

 
Question 1: Computing $T_{critical}$

In our paper (and as emphasized in the revision in lines 477-480 and 491-492), $T_{critical}$ is the temperature where the trend of the prediction set size, as $T$ increases, changes from increasing to decreasing. Our theory allows us to approximate it by computing the intersection of the two functions $\frac{T}{T-1}\ln(4T)$ and $\frac{T-1}{T+1}\ln(4T(C-1)^2)$ in the $T>1$ branch of Eq. 6. This intersection has no analytical solution but can be easily computed numerically. We now mention this in lines 477-480 in the revised version.
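For concreteness, a minimal numerical sketch of this step is given below (ours, not the authors' code). It assumes only the two functions quoted above and a class count $C$; the function name and bracketing interval are illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

def estimate_t_critical(C, t_max=100.0, eps=1e-6):
    """Approximate T_critical as the intersection, on the T > 1 branch,
    of the two bounding functions from Eq. 6 (illustrative sketch)."""
    f1 = lambda T: T / (T - 1.0) * np.log(4.0 * T)
    f2 = lambda T: (T - 1.0) / (T + 1.0) * np.log(4.0 * T * (C - 1.0) ** 2)
    # f1 - f2 is positive just above T = 1 and negative for large T,
    # so a bracketed root finder locates the intersection numerically.
    return brentq(lambda T: f1(T) - f2(T), 1.0 + eps, t_max)

print(estimate_t_critical(C=100))   # e.g. CIFAR-100
```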

Comment

Thank you for the detailed response. This addressed some of my concerns about the theoretical results. However, after carefully reading the paper and discussions, I will keep my score as reject due to the limited contribution of this work. As commented by Reviewer yXD1, the scope of the main analysis is limited to temperature scaling in classification. Although it is commonly used in deep learning, previous works in conformal prediction only use the temperature scaling as a small trick and do not promote it as an effective tool to improve the efficiency or conditional coverage. The guideline mentioned by the authors is also trivial, as one can easily drop the temperature scaling (as a post-hoc method) for conformal prediction. Therefore, the main contribution of this work is showing how the trick affects conformal prediction, which is insufficient for this conference. It might improve the quality of this work if the authors can extend the analysis to other methods of confidence calibration and propose an effective method to handle their conflicts.

Comment

Dear Reviewer eNPR,

We would greatly appreciate receiving your feedback on our responses before the discussion period ends (in less than two days).
Have we addressed your comments?

Thank you again for the review.

Comment

We thank Reviewer eNPR for their comment and are glad to hear we addressed their concerns regarding our theoretical analysis. We respectfully disagree with their claims on the "limited contribution of the work". Before we respond to their comment point-to-point, we would like to emphasize that we addressed all the main concerns raised by the reviewer in the first round. In this round, the reviewer focuses on one minor comment (minor comment number 1 from their first post) and presents new claims on the contribution of the work. Below we rebut each of their claims.

The scope of the main analysis is limited to temperature scaling in classification

Regarding the focus on Classification.

As we commented in the first round, classification constitutes an important task, and we justify our focus on it with the following:

  • Importance of classification tasks.

The classification task is extremely prevalent in practice and is an impactful area of research, with countless significant studies dedicated solely to classification. In fact, many seminal machine learning ideas [1],[2],[3] were originally presented in the context of classification and only later applied to other tasks, such as regression, in separate works.

Our work focuses on two post-hoc uncertainty quantification methods: TS calibration and CP. Seminal studies in these topics, such as [4],[5],[6], have also been dedicated exclusively to classification, underscoring the relevance and importance of our focus.

  • Addressing other tasks in addition to classification tasks is beyond the scope of this paper.

The only other task that may be as prevalent in machine learning as classification is regression. Let us explain why it is not reasonable to include it together with classification in our paper.

Regression constitutes a fundamentally different task than classification, and as such, uncertainty quantification methods are very different in regression and classification. For example, CP outputs a prediction set (of discrete labels) in classification but a continuous interval in regression.

Our study delivers a comprehensive investigation into classification tasks, featuring extensive empirical analysis and a rigorous theoretical exploration of our findings. Tackling regression tasks would require a fundamentally different set of empirical experiments and theoretical approaches. Thus, extending our work to encompass regression tasks would be beyond the feasible scope of a single paper.


Note that Reviewer yXD1 acknowledged the importance of the classification task in the second round and noted that this does not influence their evaluation.

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems 25 (2012).

[2] Sergey Ioffe and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167, 2015.

[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

[4] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. "On calibration of modern neural networks." In International conference on machine learning, pp. 1321–1330. PMLR, 2017.

[5] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. "Uncertainty sets for image classifiers using conformal prediction." In International Conference on Learning Representations, 2021.

[6] Yaniv Romano, Matteo Sesia, and Emmanuel Candes. "Classification with valid and adaptive coverage." Advances in Neural Information Processing Systems 33 (2020): 3581-3591.

Regarding the focus on Temperature Scaling.

As we answered in the first round to Minor comment 1, TS calibration constitutes an important confidence calibration method, and we provided several justifications for our focus on it in this paper.

Comment

Although it is commonly used in deep learning, previous works in conformal prediction only use the temperature scaling as a small trick and do not promote it as an effective tool to improve the efficiency or conditional coverage.

As we discussed in our paper, previous works use TS calibration from the basic premise that it will benefit CP, without actually studying its effect. The fact that they missed/overlooked the phenomenon that TS significantly affects the performance of adaptive CP in a trade-off manner, and that we discovered it, is a strength of our work, not a weakness.

We carefully studied this effect, both empirically and theoretically, for a wide range of temperatures (beyond the value that is optimal for calibration). Based on our novel findings, which are surprising and non-intuitive, and on the theory that we developed, we proposed guidelines for using TS with CP, which differ from the current art and can dramatically improve the efficiency or the conditional coverage. For instance, for CIFAR100-ResNet50, tuning the temperature can lead to more than a 50\% decrease in AvgSize and TopCovGap of APS.

Our guidelines provide the ability to trade between these objectives. This kind of post-hoc ability has not been known before. In addition, we do not see our method as a "trick", but as an actual approach that can benefit practitioners, as explained above.

The experiments that have been added to the revision, where our approach outperforms Mondrian (class-wise) CP, further strengthen the practical significance of our work.

The guideline mentioned by the authors is also trivial, as one can easily drop the temperature scaling (as a post-hoc method) for conformal prediction.

The reviewer's intention in this comment is not clear. We assume that by "drop" they mean "omit".

Clearly one can omit the usage of TS as a post-hoc step and achieve marginal coverage by the chosen CP method, as well as prediction set sizes and proximity to conditional coverage as if TS had been used with $T=1$. In our paper we extend this point of view into an ability to continuously tune the trade-off between prediction set sizes and proximity to conditional coverage to better fit the user's requirements.

It might improve the quality of this work if the authors can extend the analysis to other methods of confidence calibration and propose an effective method to handle their conflicts.

In addition to the justifications we presented for the focus on TS, our work is full of content, from deep empirical examination to rigorous theoretical understanding and practical implications. The theoretical analysis of TS requires its own deep mathematical foundation. Developing new analyses for more calibration methods is beyond the scope of a single paper. Yet, we believe it can constitute an interesting future research direction.

Comment

Therefore, the main contribution of this work is showing how the trick affects conformal prediction, which is insufficient for this conference.

We disagree with this claim which attempts to undermine our contributions.

We emphasize that our study offers several notable contributions spanning empirical discoveries, theoretical understanding, and practical implications.

  • The novel exploration of the consequences of combining TS calibration and CP, two popular and fundamentally distinct uncertainty quantification methods, represents a significant contribution to the field of trustworthy machine learning. In particular, we have discovered surprising and non-intuitive phenomena: TS calibration can have a negative effect on properties of adaptive CP, and beyond calibration, modifying the temperature enables trading the prediction set sizes and the class-conditional coverage through a surprising non-monotonic pattern.

  • Beyond the empirical exploration of how TS influences CP, we developed a rigorous theoretical analysis that explains the non-intuitive results. We believe our theoretical framework is a contribution to the field, which may be useful also for future theoretical studies. As Reviewer 3FBk stated: "Specifically, the theory about how set size is affected by temperature is fairly general-purpose."

  • We expand on the observed phenomenon by translating the empirical results into practical insights for practitioners. The ability to continuously tune a single parameter and control the trade-off between efficiency and proximity to conditional coverage represents a significant contribution to uncertainty quantification. Until now, practitioners seeking to control these objectives had limited options, typically constrained to selecting a CP algorithm, with few popular methods available. Notably, adaptive methods are known to improve conditional coverage, but beyond this, guidance is sparse. This limitation underscores the importance of utilizing our guidelines as a valuable contribution.

  • Focusing solely on either prediction set size or class-conditional coverage, the use of TS as a post-hoc method outperforms existing adaptive CP methods. For example, in ImageNet-ViT, setting $T \to 0$ reduces prediction set sizes by 90% compared to the vanilla APS method. Furthermore, as detailed in Appendix D of our revised version, using TS with $T = T_{critical}$ surpasses Mondrian CP in all aspects. This superiority highlights the significant contribution of our method.


To conclude, our work makes significant contributions across empirical discoveries, theoretical understanding, and practical implications.

Official Review
Rating: 8

The paper provides a very detailed theoretical and empirical study of the effect of temperature scaling on prediction set sizes and conditional coverage in conformal prediction. It finds that temperature scaling can (sometimes drastically) increase set sizes for common classification tasks. It explains this phenomenon theoretically and provides guidelines to practitioners about how to set the parameter moving forward.

Strengths

The paper is strong and I recommend acceptance.

I was impressed and surprised by the insights in this paper. Temperature scaling has a huge effect, the experiments bear this out, and the theory provides some explanation as to why, and might be useful to others in the field. (Specifically, the theory about how set size is affected by temperature is fairly general-purpose.)

  • The empirical experiments are painstakingly detailed and very scientific.
  • The theory is useful and correct.

Weaknesses

  • The theorems are somewhat weak, and of limited practical value in terms of being applied directly. They seem to be useful mostly for the purpose of post-hoc explanation of why this phenomenon happens.

Questions

I have not much to ask or say. The paper was clear! A typo/English language check would improve it before the next round.

Comment

We are grateful to the reviewer for the positive review, which acknowledges the importance of our work, and appreciates our empirical experiments and theoretical analysis. Below is our response to the comments.

  • Values of the theorems

Indeed, the core goal of our theoretical study is to provide mathematical reasoning for the surprising phenomena that we discovered empirically. We believe that the "why" question is important when the observations are non-intuitive. Yet, at the same time, practitioners can utilize Theorem 4.4 to approximate the temperature $T_{critical}$ where the monotonicity in the CP properties breaks, and utilize it when defining a range for tuning $T$ to trade between prediction set sizes and class-conditional coverage according to their specific requirements, as described in Section 5. For example, if class-conditional coverage is a key priority for the practitioner, selecting $T = T_{critical}$ might be an appropriate choice. Indeed, in Appendix D of the revised version we show the advantages of this approach over an existing approach.

  • Typo/English language check

We have carefully reviewed the paper for language and typographical errors and made corrections where necessary.

Comment

I understand your perspective, thank you!

Comment

We sincerely thank the reviewers for their positive evaluation of our work, valuable suggestions, and insightful questions. We provided a point-to-point response to each of the comments and addressed every concern. In particular, apart from minor adjustments, the revised version includes:

  • Justifying, both empirically and theoretically, the move from the events in Eq. (3) to those in Eq. (5). See Appendix A.1.
  • Strengthening the practical significance of our work, by showing that our novel guidelines yield a new method that outperforms the Mondrian (classwise) CP in benchmark settings. See Appendix D.

We would be happy to answer any other questions that the AC or the reviewers have.

Comment

We sincerely thank all the reviewers for their time and efforts and their valuable feedback.

We would like to reiterate the key contributions of our study.

  • We discovered surprising empirical phenomena regarding the impact of TS (beyond calibration) on CP, revealing non-monotonic trade-off effects that depend on the temperature.

  • We introduce a novel theoretical framework that explains our findings and formalizes the impact on prediction set sizes, which can be useful for future research.

  • We provide practical insights, introducing a novel approach for trading prediction set sizes and conditional coverage by tuning a single temperature parameter (unrelated to the one used for calibration).

  • When prioritizing prediction set sizes or class-conditional coverage, TS with our theoretically and empirically backed temperature values outperforms existing adaptive CP techniques.

 
These contributions have been recognized by the majority of the reviewers.

Reviewer 3FBk stated that "the paper is strong" and that they were "impressed and surprised by the insights in this paper". They praised also our experiments as "painstakingly detailed", and recognized our theoretical analysis as "useful and correct" and "might be useful to others in the field."

Reviewer yXD1 stated that "the paper focuses on an interesting question" and that they "find this paper offers valuable experiments and intriguing insights".

All three reviewers praised the presentation, stating that the paper is "well written" and "clear".

Reviewer eNPR is the only reviewer with a negative recommendation. In the rebuttal we fully addressed all their comments. Notably, out of the weaknesses raised, only one was a valid concern, which we have fully addressed (see the first point in our previous general response). We also strengthened our practical contribution (see the second point in our previous general response).
When responding to our rebuttal, Reviewer eNPR based their negative score on claims about the contribution of our work, which we have rebutted above, and argued that studying TS does not suffice because:
"previous works in conformal prediction only use the temperature scaling as a small trick and do not promote it as an effective tool to improve the efficiency or conditional coverage."
However, the fact that these previous works missed/overlooked the phenomenon that TS significantly affects the performance of adaptive CP in a non-intuitive trade-off manner, and that we discovered it, is a strength of our work, not a weakness.

In the rebuttal (and in the paper itself) we also detailed the reasons that justify the focus of the paper on TS, which include:

  • Enormous popularity and efficiency of TS calibration
  • Existing CP works did not use any other calibration method
  • TS can be used beyond calibration
  • The ability to develop a theory on TS

More details for each of these points can be found below in the response to Reviewer eNPR. Therefore, we believe that it is fair to focus on TS in this paper (which is already full of content) and defer the study of other calibration methods to future research.

To conclude,
given all the above: 1) significant contributions (from deep empirical examination and discoveries to rigorous theoretical understanding and practical implications), 2) justification of focusing on TS, and 3) detailed rebuttal (which addresses all the points raised by the single negative reviewer), we sincerely hope that our paper will be accepted.

AC Meta-Review

The paper investigates the impact of Temperature Scaling (TS) on Conformal Prediction (CP) methods for deep neural network classifiers. Its key scientific claims and findings include: TS, typically used for softmax calibration, influences CP methods beyond its calibration role. Specifically, it improves class-conditional coverage for adaptive CP methods like APS and RAPS but can negatively impact prediction set sizes. The effect of TS on CP exhibits a non-monotonic trend. By tuning the temperature parameter, practitioners can trade off smaller prediction sets against better class-conditional coverage.

The authors provide a theoretical framework explaining the non-monotonic behavior of CP methods under TS. They derive critical temperatures where the trade-off flips and use mathematical bounds to quantify the phenomenon. The paper offers actionable insights for practitioners. Extensive empirical results on benchmark datasets (CIFAR-100, ImageNet) show that TS can yield great reductions in prediction set sizes while maintaining valid coverage (this is quite normal because the coverage is automatic as long as exchangeability is maintained).

The decision to weakly reject the paper is based on the perception that the core discussion revolves around how Temperature Scaling (TS) serves as a regularization parameter, affecting classifier accuracy, which in turn influences the efficiency (size of prediction sets) in conformal prediction. While TS is recognized as a calibration method, its impact on prediction set sizes essentially reflects its effect on the underlying model’s accuracy. Since conformal prediction inherently maintains marginal coverage regardless of TS, the work is seen as focusing on a fairly indirect and expected outcome of TS. Consequently, the contributions are perceived as incremental, and the paper could benefit from a deeper engagement with and contextualization within the broader literature. Addressing this gap would clarify its novelty and relevance.

Additional Comments on Reviewer Discussion

Some reviewers argued that the results of the paper were incremental and trivial, particularly highlighting that the impact of temperature scaling (TS) on conformal prediction (CP) seemed intuitive and unsurprising. For instance, Reviewer eNPR noted that previous works treated TS as a minor calibration step and not as a substantial tool for improving efficiency or conditional coverage, suggesting that the paper merely formalized a phenomenon that was already evident. Similarly, Reviewer yXD1 mentioned that it was somewhat expected that TS might negatively impact prediction set sizes while improving conditional coverage. However, the authors contended that their discoveries—particularly the non-monotonic trade-offs introduced by TS—were novel and non-intuitive, and they supported this claim with both empirical evidence and theoretical analysis. They also emphasized the practical significance of their guidelines for trading off efficiency and conditional coverage in CP using TS. Despite addressing these concerns in their responses, some reviewers maintained skepticism about the overall novelty and impact of the work.

Final Decision

Reject

Public Comment

Please be aware that the meta-review by the AC contained several factual errors and overlooked the positive reviews.

Specifically, the AC stated:
"The decision to weakly reject the paper is based on the perception that the core discussion revolves around how Temperature Scaling (TS) serves as a regularization parameter, affecting classifier accuracy, which in turn influences the efficiency (size of prediction sets) in conformal prediction."
This is factually incorrect. TS does not affect classifier accuracy because it does not change the ranking of the softmax output.

The AC also claimed:
"While TS is recognized as a calibration method, its impact on prediction set sizes essentially reflects its effect on the underlying model’s accuracy."
Again, this statement is incorrect.

The AC used these false arguments to claim that increasing TS has an “expected” effect on CP. However, our work directly contradicts such claims, as we clearly demonstrate non-monotonic behavior that we theoretically explain.
Furthermore, the AC appears to have disregarded positive reviews and heavily relied on a reviewer whose claims were fully rebutted during the response period.

Comment

Dear authors,

Rescaling the entries of a vector (by the same scalar) obviously does not change the order of its elements, so it does not change the argmax used to attribute the classes. Here, the term "accuracy" obviously refers to "accuracy of the distribution of the score" (that is, how well the model's predicted probabilities reflect the true likelihood), which is an output of the model itself (predicted classes, predicted confidence).

Sorry if the wording was ambiguous, but here there is no reclassification involved anyway, so the context is very clear.

Temperature scaling recalibrates the model outputs, i.e., aligns the frequency of correct predictions with the predicted confidence. Since the conformity scores of LAC, APS, and RAPS are directly based on these (potentially recalibrated) confidence values, any change in the calibration is naturally ("expectedly") reflected in the CP outcomes. This was a point discussed during the review process and reported in my meta-review.

Regards,

AC

Public Comment

Please note that even with this (uncommon) usage of the term accuracy, the results are neither "expected" nor "natural" as the AC seems to claim.

If the AC believes otherwise, please answer the following and explain how:

  1. Could you predict that TS (beyond calibration) has a trade-off effect on performance metrics of APS/RAPS while having only a negligible patternless effect on LAC — We discovered this in our work.
  2. Could you anticipate a non-monotonic trend in the performance metrics of APS/RAPS as the temperature changes? — We revealed this empirically and then provided a theoretical explanation.
  3. Could you approximately predict the temperature at which the mean prediction set size of APS reaches its peak? — Our theory does, and we illustrated how practitioners can leverage this through the trade-off we identified.
Comment

Hi authors,

I'd like to clarify that my concerns are not addressed in the rebuttal and I hope this reply can help you in the next version. One of the main concerns is the practical value of the theoretical results (Section 5). The theory shows how to find the worst T that leads to the largest sets. However, it does not show how to find the sweet spot (best T) of the tradeoff between efficiency and conditional coverage. In Section 5, the authors claim that one can select different T according to the target (size/covGap/ECE):

  1. For the smallest size, we have to conduct a grid search to find the optimal T (should be in (0,1], instead of (1, \infty)), which is irrelevant to the theory.
  2. For the smallest CovGap, the authors suggest using the worst T for the size, which cannot guarantee the best CovGap but the largest Size. This is not convincing as a meaningful guideline in practice.

Therefore, I believe this paper should be improved in the next version. About the theory, the non-monotonic trend in (1, \infty) is interesting but might be meaningless as it does not have practical value. Finally, the writing should also be improved as it is hard for readers to understand the notation and logic (such as using the worst T ($T_{critical}$) to prioritize class-conditional coverage).

Best wishes,

Reviewer eNPR

Public Comment

We thank Reviewer eNPR for their engagement and new comments (which do not appear below). 

We encourage visitors of the page to read the previous discussion where we rebutted the reviewer's earlier comments.

Below are our responses addressing each of the new comments.

First, note that CovGap (the marginal coverage) is ensured, so we assume that the reviewer means TopCovGap, which reflects class-conditional coverage.

Second, our work includes both empirical (discoveries, practical guidance) and theoretical contributions.
We believe that equipping surprising, nonintuitive empirical discoveries with mathematical theory that explains them is very important in exact sciences.

Since there is a trade-off between AvgSize and TopCovGap for APS/RAPS, there is no single value of $T$ that is optimal in terms of both. (The term "sweet spot" used by Reviewer eNPR is not well defined.)

In Section 5 and Appendix C we present a practical way in which each user can set $T$ based on their personally preferred balance between AvgSize and TopCovGap.

  1. If one cares only about AvgSize and wants to use APS/RAPS, the $0<T<1$ branch of our theory suggests picking the smallest $T$ (which is aligned with the empirical observations, up to numerical issues that arise when approaching 0). Furthermore, grid search has negligible complexity compared to training DNNs (in the considered settings, it takes no more than a few minutes).
  2. Section 5 and Appendix C provide meaningful guidance that goes beyond the theory and allows directly setting a value of $T$ that is approximately optimal for TopCovGap. We emphasize and demonstrate this in a newer version. Typically, due to the trade-off that we discovered, this value is close to the value of $T$ where AvgSize is the largest.

In a newer version we further clarify the notation and the discussion of the critical temperature. Yet, this is a minor issue that did not justify rejecting the paper. Overall, the reviewers complimented our writing/presentation or gave it a good grade (including Reviewer eNPR in their first review).

To conclude, Reviewer eNPR's new comments have also been addressed, and the suggested modifications are minor. As such, in our opinion, they do not justify the disappointing decision. We hope for a better experience in the future.

Comment

This does not address my concerns, and my concern about the practical guidelines is reflected in the rebuttal.