PaperHub
Overall rating: 6.0/10 (Poster; 4 reviewers, min 5, max 8, std 1.2)
Individual ratings: 5, 8, 6, 5
Confidence: 3.0 | Correctness: 3.0 | Contribution: 2.3 | Presentation: 2.5
NeurIPS 2024

An Information Theoretic Perspective on Conformal Prediction

OpenReview | PDF
Submitted: 2024-05-14 | Updated: 2024-11-06
TL;DR

We link conformal prediction to information theory, and thus derive a principled way to use side information in conformal prediction, and new upper bounds on the intrinsic uncertainty of the data-generating process, which improve conformal training.

Abstract

Keywords

conformal prediction, information theory, uncertainty quantification

Reviews and Discussion

Review
Rating: 5

The paper provides a series of upper bounds for the conditional entropy of the labels given the attributes. The right-hand side of the bounds depends on the coverage probability of the corresponding Conformal Prediction sets. The authors propose to use the bound as an alternative objective function for conformal training.

Strengths

  • The link between Conformal Prediction (CP) and List Decoding is inspiring, even if it is partly unclear how it is used in the bounds.
  • Using CP's distribution-free and finite-sample properties to compute empirical bounds is interesting and may have an impact beyond the CP community.
  • I like the idea of using the bounds for training CP algorithms.

Weaknesses

  • The authors may explain more intuitively the role of the arbitrary distribution $Q$. Later in the paper, $Q$ seems to be linked to the underlying point-predictor. An explicit construction (in the Introduction) would help motivate the proposed approach.
  • The practical consequences of the authors' findings could be clarified better.
  • A few practical examples would be helpful. For example, the authors may add a concrete way to define $Q$ at the end of Section 3.1.
  • The technical contribution seems a relatively straightforward consequence of the inequality $H(Y|X) < H(w_Y(Y,X) \mid w_X(Y,X))$.
  • The challenging part of optimizing a CP algorithm is smoothing the sorting operation. The proposed training method does not seem to solve or address that problem. The authors should better justify the use of conditional entropy instead of $\mathrm{E}(|\mathcal{C}|)$. In particular, how could an upper bound be more efficient than the original objective function?

Questions

  • Is the prediction set in Equation 3 calibrated using $P$?
  • What do you mean by "to learn machine learning models that are more amenable to SCP"?
  • Where is the conditional entropy defined? Is the right-hand side of Equation 3 guaranteed to be positive?
  • Is $Q$ supposed to be an approximation of $P$ learned from data?
  • Is $W$ required to be a stochastic map? What happens if $W$ is deterministic?
  • Would it be possible to add a baseline trained in a standard way, i.e. without conformal training?

Limitations

I appreciate that the authors addressed their work's limitations in a dedicated section. Perhaps the paper misses a theoretical and intuitive justification for using the bounds for conformal training.

Author Response

We thank the reviewer for the insightful comments. We address the main concerns below.

The authors may explain more intuitively the role of the arbitrary distribution $Q$.

One can think of our results as a way to upper bound the (irreducible) data aleatoric uncertainty $H(Y|X)$ of the true data distribution $P$ by leveraging conformal prediction and another distribution $Q$. We present our results in full generality, as they hold for any distribution $Q_{X,Y} = P_X Q_{Y|X}$, but clearly $Q$ will be more informative and lead to tighter bounds if it is a good approximation of $P$.

Is $Q$ supposed to be an approximation of $P$ learned from data?

In short, yes. In the conformal training experiments, we directly learn $Q_{Y|X}$ from data to tighten our bounds.

[...] the authors may add a concrete way to define $Q$ at the end of Section 3.1.

In Section 3.2, specifically around line 156, we comment on two good choices for $Q_{Y|X}$: the predictive model itself (the very same used to construct the prediction sets) or a uniform distribution, which lead to the model-based (eq. 5) and simple Fano (eq. 6) bounds, respectively. We will make this point more evident in the text.

The technical contribution seems a relatively straightforward consequence of the inequality $H(Y|X) < H(w_Y(Y,X) \mid w_X(Y,X))$.

Could you clarify what you mean by this inequality on conditional entropies? Unless we are missing something, it does not hold in general; naively taking $w_Y(Y,X) = Y$ and $w_X(Y,X) = Y$, we get $H(w_Y(Y,X) \mid w_X(Y,X)) = 0$. Our results are proven using canonical information-theoretic inequalities like Fano's inequality and the DPI, but they require a tailored choice of distributions as well as attention to the mathematical subtleties of incorporating the marginal coverage guarantees of conformal prediction into the bound.

The paper misses a theoretical and intuitive justification for using the bounds for conformal training.

We provide a motivation in Section 4, lines 187 to 190, which can be summed up as follows. Our results upper bound the irreducible uncertainty of the true distribution $P$, as given by the conditional entropy $H(Y|X)$. Thus, if we fit a model $Q_{Y|X}$ to minimize our upper bounds, we can argue that we push $Q_{Y|X}$ closer to the true conditional distribution $P_{Y|X}$, using the training data to decrease the "reducible" uncertainty as much as possible. Although we are not aware of previous work leveraging the same motivation, the same argument could also be used to justify the cross-entropy, which also upper bounds $H(Y|X)$, with equality when $Q_{Y|X} = P_{Y|X}$.
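For intuition, the cross-entropy case of this argument is simply the standard decomposition (a textbook identity, restated here for convenience):

$$\mathrm{E}_{P_{XY}}\left[-\log Q_{Y|X}\right] = H(Y|X) + \mathrm{E}_{P_X}\left[\mathrm{KL}\left(P_{Y|X} \,\|\, Q_{Y|X}\right)\right] \geq H(Y|X),$$

with equality if and only if $Q_{Y|X} = P_{Y|X}$, so minimizing the bound over $Q_{Y|X}$ drives it toward the true conditional distribution.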

The challenging part of optimizing a CP algorithm is smoothing the sorting operation. The proposed training method does not seem to solve or address that problem.

Our main contribution is a new connection between notions of uncertainty coming from information theory and conformal prediction. Conformal training is simply a useful application of that connection, where we demonstrate that our methods lead to equivalent or better results than the original ConfTr objective (we use the exact same relaxation of the sorting operator for all methods). The fact that we do not address what is perceived as the main challenge in conformal training does not mean our results are not relevant in that context.

The authors should better justify the use of conditional entropy instead of $\mathrm{E}|C(X)|$. In particular, how could an upper bound be more efficient than the original objective function?

We provide a justification in Section 4 and clarify further in the answers above. The last point seems to be a misunderstanding of our results. We provide upper bounds on the conditional entropy $H(Y|X)$ and not on $\mathrm{E}|C(X)|$. We also show, at the end of Section 4 and in Appendix F.1, that the ConfTr objective, i.e. $\log \mathrm{E}|C(X)|$, is also an upper bound on $H(Y|X)$, but a looser one than the simple Fano bound. Therefore, if we adhere to the motivation of minimizing an upper bound on $H(Y|X)$, our upper bounds should in theory give better results.

Is the prediction set in Equation 3 calibrated using $P$?

Yes. Yet, many recent works have extended conformal prediction to more general settings where calibration and test points are not identically distributed (see e.g. [1,2]), and in principle our results can also be extended to this setting if the conformal prediction method in question still provides lower and upper bounds for $P(Y_{\text{test}} \in C(X_{\text{test}}))$.

What do you mean by "to learn machine learning models that are more amenable to SCP"?

Conformal prediction sets ultimately depend on the characteristics of the predictive model used to construct them. For example, two models with the same accuracy could lead to different prediction sets, and in that sense, one could say the model which produces more informative (narrower) sets is more amenable to conformal prediction than the other. The idea of conformal training is exactly to steer the training process towards those models that are more amenable to conformal prediction.

Where is the conditional entropy defined?

We will explicitly define $H(Y|X) = -\mathrm{E}_{P_{XY}}[\log P_{Y|X}]$ in the text.
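For concreteness, here is a plug-in computation of this definition on a small toy joint distribution (an illustrative Python snippet with made-up numbers, not part of the paper):

import numpy as np

# Toy joint P(X, Y) with 2 values of X and 3 labels Y (entries sum to 1).
P_xy = np.array([[0.20, 0.10, 0.10],
                 [0.05, 0.25, 0.30]])
P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P(X)
P_y_given_x = P_xy / P_x                # conditional P(Y|X)
# H(Y|X) = -E_{P_XY}[log P_{Y|X}], in nats
H = -(P_xy * np.log(P_y_given_x)).sum()
print(f"H(Y|X) = {H:.4f} nats")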

Is the right-hand side of Equation 3 guaranteed to be positive?

Yes; this is established by the proof in Appendix D.1.

Is $W$ required to be a stochastic map? What happens if $W$ is deterministic?

The result (Theorem C.3) holds for deterministic maps as well.

Would it be possible to add a baseline trained in a standard way, i.e. without conformal training?

Such a baseline is already included in all our results, namely models trained purely via the cross-entropy, which we identify by the acronym CE. We will make this clearer in the text.

References

[1] Barber, Rina Foygel, et al. "Conformal prediction beyond exchangeability." The Annals of Statistics 51.2 (2023): 816-845.

[2] Prinster, Drew, et al. "Conformal Validity Guarantees Exist for Any Data Distribution (and How to Find Them)." ICML 2024.

Comment

The rebuttal clarifies my doubts. Sorry for the strict inequality typo in my question; I meant $\leq$ instead of $<$. After reading the authors' answers and the other reviewers' comments, I tend to confirm my score, but I would not vote against acceptance.

Comment

We are grateful that the reviewer went through our answers and happy that we could address the reviewer's doubts. We are also happy that the reviewer is not voting against acceptance. In that light, we would like to kindly ask what needs to be changed in the paper for it to pass the acceptance bar. We are under the impression that the current requests are addressable with minimal changes in the camera-ready version, but we would be happy to learn of any bigger changes the reviewer has in mind.

Review
Rating: 8

This paper observes a connection between the efficiency of CP (the tightness of its prediction sets) and information theory. It uses this to derive three bounds on a CP's efficiency depending on the entropy of the label given a data record, $H(Y|X)$. It then uses these bounds as optimization objectives for the underlying model (the nonconformity scorer). The bounds are empirically shown to be valid (i.e., reasonable) training objectives.

Strengths

  • This paper is truly well written: it explains prior literature, as well as the concepts being presented, extremely well. The appendix is also well curated.
  • The idea behind this paper (to connect information theory and CP) is very interesting, and I'm sure it'll lead to more research.
  • The theoretical analysis is non-trivial and insightful.

Weaknesses

  • Bounds based on Fano's inequality may be loose.
  • Empirically, the method does not outperform SOTA; however, I do think there's more value in this work than just improving CP numerically.

Questions

Thank you for your submission!

I'm afraid I could not understand how the connection to list decoding was used when deriving your results. I can see this is being discussed in the appendix, but perhaps it'd be best to mention in the main body which ideas are borrowed from that theory, and how they're being used here. (Can you please clarify, in your rebuttal?)

I did not quite understand why federated learning was introduced in this paper; I didn't feel like it was necessary, and it almost feels like an afterthought. Perhaps it'd be best to motivate it better in the introduction.

"To the best of our knowledge, Bellotti [8] were the first to advance the idea of conformal training." Note that, concurrently to Bellotti, this idea was also discovered by Colombo and Vovk in "Training conformal predictors".

While your work is the first one to deal with the interesting intersection between information theory and CP efficiency, please note that other works have looked at:

  1. connecting CP efficiency to the goodness of the underlying algorithm; e.g.:
  • (Dhillon et al.) On the Expected Size of Conformal Prediction Sets
  • (Cherubin) How do the performance of a Conformal Predictor and its underlying algorithm relate?

and also

  2. how to best characterize the efficiency of a CP: (Vovk et al.) Criteria of efficiency for conformal prediction.

Limitations

These were appropriately described by the paper.

Author Response

We appreciate the reviewer's insightful comments and are encouraged by the positive feedback and recognition of our work's significance. We answer the remaining questions below.

Bounds based on Fano's inequality may be loose.

That is definitely true of the simple Fano bound, where we assume an uninformative uniform distribution over the labels. However, the model-based Fano bound should be tight for a model $Q$ that approximates the true distribution $P$ well enough.

Empirically, the method does not outperform SOTA; however, I do think there's more value in this work than just improving CP numerically.

While there was no clear winner among our bounds, they still improve on the ConfTr variants in two ways: (i) the DPI and model-based Fano bounds proved more effective than ConfTr at training complex models from scratch (e.g., the CIFAR results in Table 1), and (ii) our bounds also seem to be more robust than ConfTr, since even after fine-tuning hyperparameters as described in Section G.1, ConfTr often failed to converge properly (e.g., ConfTr performance on MNIST in Table 1 was worse than expected due to training instability).

I'm afraid I could not understand how the connection to list decoding was used when deriving your results. I can see this is being discussed in the appendix, but perhaps it'd be best to mention in the main body which ideas are borrowed from that theory, and how they're being used here. (Can you please clarify, in your rebuttal?)

Indeed, the main results can be derived without resorting to list decoding. We mostly derive our results directly using the data processing or Fano's inequality, as shown in Appendix D. We highlight the connection to list decoding because it is quite natural and might bring more interesting results to both fields in the future. In fact, a slightly looser version of our simple Fano bound can be obtained directly from Fano's inequality for list decoding, see Theorem C.6 and Proposition C.7 in the appendix.

I did not quite understand why federated learning was introduced in this paper; I didn't feel like it was necessary, and it almost feels like an afterthought. Perhaps it'd be best to motivate it better in the introduction.

The information theoretic perspective also facilitates the usage of side information in conformal prediction, as we discuss in Section 5. Federated learning just happens to be a natural and elegant application of both conformal training and side information, as the global objective factorizes nicely into local objectives that can be computed locally in each device with the device ID as side information. This scenario can be quite useful in practice, and is perhaps a more convincing use case for side information than assuming access to superclass information like in the experiments in Table 2.
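To make the factorization concrete, here is a minimal PyTorch sketch (illustrative code under simplifying assumptions, not our actual implementation; the cross-entropy stands in for any of our upper bounds on $H(Y|X)$, and the attribute names on `devices` are hypothetical):

import torch
import torch.nn.functional as F

def local_upper_bound_loss(model, x, y, z, alpha):
    # Placeholder for any of the upper bounds on H(Y|X); the cross-entropy
    # is used here for brevity (itself an upper bound, see Section 4).
    # The side information z (the device ID) can condition the bound.
    return F.cross_entropy(model(x), y)

def global_conformal_loss(devices, model, alpha=0.01):
    # The global objective factorizes into local objectives: each device
    # evaluates the bound on its own data, with its ID as side information.
    losses = [local_upper_bound_loss(model, d.x, d.y, z=d.device_id, alpha=alpha)
              for d in devices]
    return torch.stack(losses).mean()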

Note that, concurrently to Bellotti, this idea was also discovered by Colombo and Vovk in "Training conformal predictors".

Thanks for bringing that work to our attention. We will definitely cite and discuss it in the final version of the paper.

Related work

Thanks for the extra references. We will discuss these contributions in the paper, and the work of Dhillon et al. in particular is also relevant to our experiments in Appendix G.3, where we also use our results to lower bound the expected prediction set size.

Comment

Thank you for your response.

Indeed, the main results can be derived without resorting to list decoding.

I do appreciate the intellectual honesty in the above comment. But if that's the case, you may want to:

  1. remove the list decoding discussion (sorry!), or put it in a separate section that isn't part of the main discussion (e.g., a section before conclusions),
  2. or, try and rephrase your proofs to actually use it. (If you do go down this route, you should of course mention that your results don't actually need the list decoding theory.)

The latter would of course be preferable, especially because the connection to list decoding is indeed quite interesting. Please treat the above as a recommendation, not a requirement.

I will keep my positive score.

Comment

Thanks for the suggestions. We will move the discussion on list decoding to the final stages of the paper to improve readability.

Review
Rating: 6

The paper aims to relate the uncertainty measured by conformal prediction to the uncertainty measured by the conditional entropy $H(Y|X)$. Towards this end, they make the following contributions:

  1. They observe that the miscoverage quantity from conformal prediction is the same as the decoding-error quantity from list decoding in information theory.
  2. They exploit this observation to upper bound the conditional entropy $H(Y|X)$ in terms of quantities from conformal prediction.
  3. They use these bounds as objectives for conformal training in order to maximize efficiency, i.e., keep the prediction sets as small as possible.
  4. They empirically verify that their modifications to conformal training using their upper bounds improve efficiency.

Strengths

Originality: The problem of relating the uncertainty measured in information theory to that from conformal prediction is a very interesting avenue of exploration.

Weaknesses

Clarity and Quality

In general, it is confusing to interpret what the results are really saying. In particular, I have the following confusion about the theoretical result that remains after reading the paper:

  1. $H(Y|X)$ only measures aleatoric uncertainty, whereas the conformal interval captures both aleatoric and epistemic uncertainty. So then, it is not surprising that the conditional entropy would be upper bounded by something in terms of the conformal set size $|C(X)|$. Is this the correct interpretation of the Propositions, or is there something else going on here?

For the empirical result:

  1. In Section 4, the authors state that minimizing the RHS of the Fano bound is equivalent to the ConfTr objective from [7]. However, in the tables, the rows corresponding to ConfTr and Fano admit different efficiency - what is the cause of this?

  2. What is the intuition for why minimizing the RHS of the bounds is better than minimizing $|C(X)|$ directly?

Questions

  1. Could you answer W1 above, and in general connect the theoretical results to epistemic/aleatoric uncertainty?

I am happy to change my assessment/confidence if the authors are able to alleviate my concerns through discussion.

Limitations

Yes


EDIT POST REBUTTAL:

After discussing with the authors, I have increased my score to 6. The authors have done a great job in the discussion of contextualizing the results in terms of epistemic and aleatoric uncertainty. My hope is that the authors in the final version will:

  1. Modify the exposition of the paper to discuss the role of these categories of uncertainty.
  2. In the discussion section, be upfront about the looseness of the bounds when epistemic uncertainty is high.
  3. In the discussion, mention the challenges to quantifying the epistemic uncertainty theoretically.
Author Response

We thank the reviewer for the insightful comments and interesting questions, which we discuss below.

$H(Y|X)$ only measures aleatoric uncertainty, whereas the conformal interval captures both aleatoric and epistemic uncertainty. So then, it is not surprising that the conditional entropy would be upper bounded by something in terms of the conformal set size $|C(X)|$. Is this the correct interpretation of the Propositions, or is there something else going on here?

This intuition is absolutely correct. Yet, even though the cardinality of the prediction set provides an intuitive notion of uncertainty, quantitatively connecting it to the aleatoric or even total uncertainty is non-trivial. Importantly, the prediction set size varies with the user-defined miscoverage rate $\alpha$, whereas arguably any notion of uncertainty should not. Our upper bounds account for that, and $\alpha$ comes up quite naturally in the derivation of all three bounds. Moreover, it is worth noticing that $|C(X)|$ only appears explicitly in the simple Fano bound, which is expected to be the least tight among our three bounds, further illustrating the difficulty of connecting the cardinality of the prediction set to other established notions of uncertainty.

In Section 4, the authors state that minimizing the RHS of the Fano bound is equivalent to the ConfTr objective from [7]. However, in the tables, the rows corresponding to ConfTr and Fano admit different efficiency - what is the cause of this?

Equation (8) is equivalent to the ConfTr objective but not to the simple Fano bound of equation (6), which is why the results differ. We will update the wording in Section 4 to make this more evident. In Section F.1, we actually prove that the simple Fano bound is a tighter upper bound on the conditional entropy than the ConfTr objective, i.e., $H(Y|X) \leq \text{(6)} \leq \text{(8)}$.

What is the intuition for why minimizing the RHS of the bounds is better than minimizing $|C(X)|$ directly?

From a theoretical perspective, we argue in Section 4 that minimizing an upper bound on the data aleatoric uncertainty, as captured by $H(Y|X)$, is a reasonable optimization objective. This motivation also justifies the cross-entropy and $\log |C(X)|$ as learning objectives, but from this angle, our upper bounds are actually tighter bounds on $H(Y|X)$ and thus likely to be more effective: the DPI bound is provably tighter than the cross-entropy (see Remark D.1 in the appendix), and the simple Fano bound is tighter than the ConfTr objective that minimizes $\log |C(X)|$ directly, as mentioned in the answer above.

From a practical perspective, $\log |C(X)|$ alone might not provide a strong enough learning signal to fit complex models (see e.g. ConfTr results for CIFAR in Table 1), and an extra hyperparameter needs to be introduced to balance between optimizing $\log |C(X)|$ and some other loss function like the cross-entropy, which gives the ConfTr class objective. In contrast, our upper bounds, especially the DPI and model-based Fano ones, already capture both the efficiency and the predictive power of the model, with no need for extra hyperparameters. Moreover, in our experiments, our upper bounds proved to be more robust, whereas ConfTr and ConfTr class failed to converge in some cases even after hyperparameter fine-tuning (see, for example, ConfTr results on THR/MNIST in Table 1).

Could you connect the theoretical results to epistemic/aleatoric uncertainty?

That is an important point, and one of our objectives in this paper was indeed to improve the current understanding of how conformal prediction relates to other forms of uncertainty quantification, where the separation between epistemic and aleatoric uncertainty is already more established (albeit still not solved [1]). To the best of our knowledge, this separation has never been made in the conformal prediction literature and remains an entirely open question [2]. The main challenge in the context of conformal prediction is in characterizing the epistemic uncertainty without further information or assumptions on the model. Since epistemic uncertainty is a property of the model itself and not the data, it requires an understanding of the model's hypothesis space, for example via a distribution over the model parameters in the Bayesian setting; something that conformal prediction ignores or abstracts away via the scoring function. That is why in this paper we focus on the data aleatoric uncertainty $H(Y|X)$, as its connection to conformal prediction is more tangible, as evidenced by our upper bounds.

References

[1] Wimmer, Lisa, et al. "Quantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures?" Uncertainty in Artificial Intelligence. PMLR, 2023.

[2] Hüllermeier, Eyke, and Willem Waegeman. "Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods." Machine Learning 110.3 (2021): 457-506.

Comment

I thank the authors for clarifying my questions. I think that the goal of relating conformal intervals to epistemic and aleatoric uncertainty is a very well-motivated goal, and conceptually very useful for the ML community -- given the application of UQ in safety-critical applications. Additionally, it is a theoretically challenging pursuit, so I commend the authors on their efforts.

However, I still think the results are confusing to interpret and do not shed much light on this motivating question. Additionally, the authors' discussion in the rebuttal around AU and EU appears a bit confused. In particular, the authors mention that "epistemic uncertainty is a property of the model itself and not the data". However, even the paper [1] that the authors link mentions that:

EU then refers to a lack of knowledge about the discrepancy between $p$ and $\hat{p}$ that can – in contrast to AU – be alleviated by gathering more training samples.

This is the more standard definition of EU that I am aware of. So either the authors are using the terminology incorrectly or defining things differently, but these new definitions are not clear.

For these reasons, I will stick with my score. I would encourage the authors to motivate their paper from the perspective of AU, EU and frame the results explicitly in these terms as well, by addressing these confusions.

Comment

We would like to thank the reviewer for engaging in the discussion.

First of all, we must clarify that we share the same notion of epistemic uncertainty; it is indeed given by the gap between $p$ and $\hat{p}$, and can be alleviated with more training samples. In that sense, for a fixed $p$, the epistemic uncertainty is only a function of the model $\hat{p}$. We apologize if stating that epistemic uncertainty is a property of the model created confusion, but such a characterization is also common in the literature (see e.g. the discussion in the introduction of [3]).

While the general definition of epistemic uncertainty might be simple on the surface, actually quantifying this gap does involve assumptions about the model. The same reference quoted by the reviewer [1] states in Section 2.1:

To capture EU [epistemic uncertainty], the learner must be able to express uncertainty about $\theta$ [the model parameters], which can be accomplished by a (second-order) probability distribution over (first-order) distributions $\theta$.

Conformal prediction methods pose no assumptions on the underlying machine learning model and typically leverage a single predictor/scoring function. In that setting, there is no clear way to represent the uncertainty over the model parameters (or, more generally, over the model's hypothesis space) and capture epistemic uncertainty, as also highlighted in [2]:

Machine learning methods for probability estimation, i.e., for training a probabilistic predictor, often commit to a single hypothesis learned on the data, thereby ignoring epistemic uncertainty.

As we mentioned in our rebuttal, that is why we focused on the data aleatoric uncertainty, since quantifying epistemic uncertainty would require assumptions about the model hypothesis space that, at least for now, are not directly connected to any conformal prediction method, and thus are outside the scope of this work.

We share the reviewer's appreciation for the goal of connecting conformal prediction to other notions of uncertainty, and we would argue that relating it to the data aleatoric uncertainty, as we have done in this paper, is a valid contribution and a solid first step towards that goal. Further, we would like to emphasize that there are important contributions to uncertainty quantification without an epistemic/aleatoric decomposition, and conformal prediction is a notable example. We see our work within this line of research, and as we showed in the paper, our results have already proved useful in practical applications, namely conformal training and the incorporation of side information.

We hope to have addressed the reviewer's concerns around this topic.

Extra references

[3] Bengs, Viktor, Eyke Hüllermeier, and Willem Waegeman. "Pitfalls of epistemic uncertainty quantification through loss minimisation." Advances in Neural Information Processing Systems 35 (2022): 29205-29216.

Comment

Thank you for the ongoing discussion.

To clarify, the confusion was stemming from the fact that you mentioned that "epistemic uncertainty is not a property of the data", whereas the learned model $\hat{p}$ depends both on the model family, as you mention, and also on the training data - which you later seem to agree with.

Here is my remaining question:

If the learned model is bad (due to small or biased data, for example), the conformal interval will increase in size, since the residuals used in constructing the interval will increase. So it is capturing some notion of epistemic uncertainty. Even if the aleatoric uncertainty were $0$ but the data is bad, the conformal interval would not shrink to 0.

Having small aleatoric uncertainty is pretty common in ML settings where the labels are assumed to be a deterministic function of the inputs (perhaps with some small additive noise), e.g., in image classification.

In such a case, the bounds appear to be saying that the noise level is at most the log of the conformal set size, which could be an arbitrarily loose inequality.

Could you comment on the tightness of the provided bounds in different settings? And how the level of epistemic uncertainty may play into the looseness of the bound.

Comment

Thanks again for the interesting discussion.

Those are good questions. We start with the questions around the tightness of the bounds when the aleatoric uncertainty is small, and whether this leads to an arbitrarily loose inequality. It is important to distinguish among our three bounds.

Simple Fano bound. The reviewer is correct that the simple Fano bound (eq. 6) can be loose in general, since the expected prediction set size would not shrink to zero even when $Y$ is a deterministic function of $X$. It would not be arbitrarily loose, because all the terms are bounded, but in the limit, for a completely uninformative model with $|C(X)| = |\mathcal{Y}|$, it essentially converges to $H(Y|X) \leq H(Y) \leq \log(|\mathcal{Y}|)$. Conversely, in the case of lower epistemic uncertainty, the bound would be tighter, since the more capable model would lead to narrower prediction sets, as the reviewer pointed out.

DPI bound. This bound directly leverages the model, as represented by the distribution $Q$. It is easy to verify that the DPI bound is tight in the case of zero epistemic uncertainty, i.e. $P = Q$. This is evident in eq. 18, where the KL term would be zero and the cross-entropy would equal $H(Y|X)$ by definition. Epistemic uncertainty plays a role here, since for $P \neq Q$ the cross-entropy is only an upper bound on $H(Y|X)$, which in principle can be arbitrarily loose for a very bad $Q$. Curiously, the KL term in eq. 18 actually tightens the bound in comparison to the cross-entropy alone, which is an interesting aspect of this result. Note that this argument is general and holds for any $H(Y|X)$, even when $H(Y|X) = 0$.
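Schematically, combining this with Remark D.1 (our paraphrase; the exact KL term is spelled out in eq. 18):

$$H(Y|X) \;\leq\; \underbrace{\mathrm{CE}(P_{Y|X}, Q_{Y|X}) - [\text{KL term of eq. 18}]}_{\text{DPI bound}} \;\leq\; \mathrm{CE}(P_{Y|X}, Q_{Y|X}),$$

with both inequalities becoming equalities when $P = Q$.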

Model-based Fano bound. The model-based Fano bound follows a similar pattern, and the role of epistemic uncertainty is clear in eq. 21, where it is captured by the two KL divergence terms in the second equality. For $P = Q$, these KL terms are exactly zero, tightening the bound.

Now, we address the second question, around the role of epistemic uncertainty in our bounds. In all three cases, epistemic uncertainty definitely plays a role in the tightness of the bounds, either via the expected prediction set size or directly via the model distribution $Q$. While the role of epistemic uncertainty might be clear qualitatively, what we tried to convey in the rebuttal is that quantifying it exactly is quite challenging using conformal prediction alone (it is already complicated in other contexts, as discussed in the references we shared), and that is why we refrained from touching on the aleatoric/epistemic decomposition in the paper. However, we are happy to include the discussion above in the text if the reviewer thinks it would add to the paper.

Review
Rating: 5

This work develops upper bounds on the conditional entropy $H(Y|X)$ of the data-generating process via information-theoretic inequalities, yielding a new objective for conformal training and an upper bound on the inefficiency (prediction set size) of a trained model. Including side information to improve the efficiency of CP is also introduced via the proposed upper bounds.

Strengths

(1) The detailed background information about conformal prediction helps readers easily build a basic understanding.
(2) Extensive theoretical work is presented, demonstrating the soundness of the proposed upper bound on the inefficiency of CP for a given model.
(3) Five datasets and three baselines are included in the experiments, and standard deviations of the experimental results are reported as well.

Weaknesses

(1) Lack of clear notation clarification. For example, the subscripts of $Q$ in Eq. (5) are not explained, making it hard to understand. Also, even though line 136 explains the meaning of $H(Y|X)$, I suggest writing it out explicitly for clarity.
(2) Lack of a full algorithm description. The work only states that the upper bound is included as an additional loss term (line 186), but there is not enough explanation of how the proposed algorithm really works.
(3) The proofs are not self-contained. For instance, the paper directs readers to Appendix D.1 for the proof of the DPI bound, and in Appendix D.1, line 806 further directs readers to Theorem C.4, which is not proved in detail.

Questions

The reason for limiting $\alpha$ to $(0, 0.5)$ is not explained in Proposition 3.1, as Theorem 2.1 works for any $\alpha \in (0, 1)$.

Setting $\alpha = 0.01$, as stated in the caption of Table 1, is a very limited experimental setup.

Limitations

N/A

Author Response

We thank the reviewer for the comments and useful feedback. We address the main concerns below.

Weaknesses

Lack of clear notation clarification. For example, the subscripts of $Q$ in Eq. (5) are not explained, making it hard to understand.

The subscripts of $Q$ (and $P$) indicate the random variable described by the distribution, as explained in the first paragraph of Section 2. Following the same logic, we also include conditioning in the notation. For example, $Q_{Y \mid X, C(X), Y \in C(X)}$ is the conditional distribution of $Y$ given the input variable $X$, the prediction set $C(X)$, and the coverage event $Y \in C(X)$. We will clarify this in the text.

I suggest writing it [$H(Y|X)$] out explicitly for clarity.

We will explicitly define $H(Y|X) = -\mathrm{E}_{P_{XY}}[\log P_{Y|X}]$ in the text.

Lack of a full algorithm description. The work only states that the upper bound is included as an additional loss term (line 186), but there is not enough explanation of how the proposed algorithm really works.

There is a detailed discussion of conformal training and of our experimental setup in Appendices F and G, respectively. Although we have tried to include enough details on how our algorithms work, we agree that a full algorithmic description of the conformal training and federated learning procedures could be helpful, and we will add those to the appendix. At the end of this answer, we provide a sketch of how conformal training processes each batch, which is similar to [2]. We must highlight, however, that the main contribution of the paper is the new connection between notions of uncertainty given by conformal prediction and information theory. Conformal training is simply a notable application where we demonstrate this connection to be quite useful in practice.

The proofs are not self-contained. For instance, the paper directs readers to Appendix D.1 for the proof of the DPI bound, and in Appendix D.1, line 806 further directs readers to Theorem C.4, which is not proved in detail.

We separated the theoretical results into different sections, concentrating the most important and new results in Appendix D. The remaining results in Appendices B and C are well known in the information theory and machine learning communities, and we state them mostly for the sake of completeness. We appreciate the feedback, though, and are happy to state Theorem C.4 just before Proposition 3.1 if that improves readability.

Regarding Theorem C.4 itself, it is a direct application of the classical data processing inequality for $f$-divergences (Theorem C.3) to the conformal prediction procedure, which is why the proof there is succinct. However, we are happy to expand the proof for clarity, as we show at the end of this answer.

Questions

The reason for limiting $\alpha$ to $(0, 0.5)$ is not explained in Proposition 3.1, as Theorem 2.1 works for any $\alpha \in (0, 1)$.

This is explained in equation (19) in the appendix and lines 813 to 817, and is due to the concavity of the binary entropy function $h_b(p) = -p \log p - (1-p)\log(1-p)$. Note that this is not restrictive, since $\alpha$ values are typically small, and even in the off chance of the user picking $\alpha > 0.5$, we can still get an upper bound by replacing $h_b(\alpha)$ with $h_b(\alpha_n)$. To be precise, for $\alpha > 0.5$ the binary entropy is non-decreasing on the relevant range and preserves the inequality, so we can apply the same argument as in (19) to the upper bound of conformal prediction: $P(Y_{\text{test}} \in C(X_{\text{test}})) \leq 1 - \alpha_n \Rightarrow h_b\big(P(Y_{\text{test}} \in C(X_{\text{test}}))\big) \leq h_b(1-\alpha_n) = h_b(\alpha_n)$.
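As a quick numerical sanity check of the monotonicity and symmetry of $h_b$ used in this argument (an illustrative snippet, not part of the paper):

import numpy as np

def h_b(p):
    # binary entropy in nats: h_b(p) = -p log p - (1-p) log(1-p)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

a = np.linspace(0.01, 0.49, 50)
assert np.all(np.diff(h_b(a)) > 0)      # increasing on (0, 0.5)
assert np.allclose(h_b(1 - a), h_b(a))  # symmetric: h_b(1-a) = h_b(a)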

Setting $\alpha = 0.01$, as stated in the caption of Table 1, is a very limited experimental setup.

We follow the same setup as the main reference on conformal training [2], where the models were also trained only with $\alpha = 0.01$. Note that we also evaluate our models under different $\alpha$ values at test time in Table 10 in the appendix, where we also see improved results for the other common values $\alpha = 0.05$ and $\alpha = 0.1$.

Sketch of Conformal Training Algorithm

for each batch B 
    1. Randomly split B into B_cal and B_test
    2. Use B_cal to compute the empirical 1-alpha quantile using differentiable sorting
    3. Construct a prediction set C(x) for each x in B_test using a smooth assignment (see eq. 29)
    4. At this point we have all we need to compute any of the upper bounds to H(Y|X) in a differentiable manner
    5. Update model parameters to minimize the upper bound to H(Y|X) via gradient descent
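
For completeness, a minimal runnable rendering of this sketch in PyTorch (illustrative code, not our exact implementation: the soft quantile and sigmoid assignment stand in for the smooth sorting and assignment of eq. 29, `temperature` is a hypothetical smoothing hyperparameter, and the loss shown is the $\log \mathrm{E}|C(X)|$ surrogate; any of our upper bounds can be computed from the same smooth sets in step 4):

import torch
import torch.nn.functional as F

def conformal_training_step(model, x, y, alpha=0.01, temperature=0.1):
    log_probs = F.log_softmax(model(x), dim=-1)
    n = x.shape[0]
    # 1. Randomly split the batch into calibration and test halves.
    perm = torch.randperm(n)
    cal, test = perm[: n // 2], perm[n // 2:]
    # 2. Conformity score: log-probability of the true label. tau is the
    #    empirical alpha-quantile of the calibration scores; torch.quantile
    #    interpolates between order statistics, so gradients flow through.
    cal_scores = log_probs[cal].gather(1, y[cal].unsqueeze(1)).squeeze(1)
    tau = torch.quantile(cal_scores, alpha)
    # 3. Smooth set membership: soft indicator of score >= tau per label.
    soft_sets = torch.sigmoid((log_probs[test] - tau) / temperature)
    # 4.-5. Differentiable surrogate: log of the expected prediction set
    #    size; backpropagate and update the model parameters as usual.
    return torch.log(soft_sets.sum(dim=-1).mean() + 1e-8)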

Detailed proof of Theorem C.4

Consider the conditional distribution $P_{E|X,Y}$ mapping distributions $P_{X,Y}$ and $Q_{X,Y}$ to $P_E$ and $Q_E$, respectively. From here, it suffices to apply Theorem C.3, but we can break it down further as follows. Consider the joints induced by this mapping, $P_{E,X,Y} = P_{E|X,Y} P_{X,Y}$ and $Q_{E,X,Y} = P_{E|X,Y} Q_{X,Y}$, and the following two basic properties of $f$-divergences as applied to this specific case (see e.g. [1]): $D_f(P_{E,X,Y} \,\|\, Q_{E,X,Y}) = D_f(P_{X,Y} \,\|\, Q_{X,Y})$ and $D_f(P_{E,X,Y} \,\|\, Q_{E,X,Y}) \geq D_f(P_E \,\|\, Q_E)$. The proof then follows directly from these two results, with $D_f(P_{X,Y} \,\|\, Q_{X,Y}) \geq D_f(P_E \,\|\, Q_E)$.

References

[1] Polyanskiy, Y. and Wu, Y. "Lecture notes on information theory." Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012-2016.

[2] Stutz, David, et al. "Learning Optimal Conformal Classifiers." ICLR 2022.

Comment

The limitation of using just $\alpha = 0.01$ is addressed by Table 10. I thank the authors for pointing this out.

I raised my rating to 5.

Comment

Thank you again for your thoughtful review. We're glad our clarifications were helpful and appreciate you raising your score and taking a more positive outlook on the paper.

Final Decision

The main idea of the paper is to show how the uncertainty measured by conformal prediction relates to the uncertainty measured by the conditional entropy. Towards this end, the authors show that miscoverage in CP corresponds to the decoding error in list decoding, and using this observation they provide an upper bound on the conditional entropy in terms of quantities from conformal prediction. Such bounds/insights are then used in conformal training to improve efficiency.

All the major concerns of the reviewers have been resolved after the rebuttal process, and based on the reviewers' recommendations and my reading of the paper, I would recommend acceptance. The paper provides a novel perspective on CP from the lens of information theory.