PaperHub
Overall: 7.3/10 · Poster · 4 reviewers
Ratings: 5, 3, 5, 5 (min 3, max 5, std 0.9)
Confidence: 3.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

Non-exchangeable Conformal Prediction with Optimal Transport: Tackling Distribution Shift with Unlabeled Data

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We express the coverage gap from distribution shift in terms of an optimal transport objective between the calibration and unlabelled test distribution, which we then use to optimize weights for non-exchangeable conformal prediction.

Abstract

Keywords
conformal prediction · optimal transport · non-exchangeability · stochastic dominance · unsupervised adaptation

Reviews and Discussion

Review (Rating: 5)

The authors study non-exchangeable conformal prediction via a new perspective, through the lens of optimal transport, and show that it is possible to estimate the loss in coverage and mitigate it in case of distribution shift.

Strengths and Weaknesses

The paper is of very high quality: It is very well-written, it tackles a very interesting problem, and it does so in an innovative manner. No real weaknesses to signal.

Questions

See Limitations.

Limitations

The only two very minor comments are the following. First, CP is not strictly speaking an uncertainty quantification technique, but rather an uncertainty representation one. This is because it does not attach a number to (different types of) uncertainties faced by the user, but rather it represents them through the prediction set: Larger sets represent higher (total) uncertainty. It would be nice to have a footnote about this.

Second, do the authors think that their OT method can generalize to the ambiguous ground truth case of https://openreview.net/forum?id=L7sQ8CW2FY? It would be nice to see a small comment on this in the future research section.

Justification for Final Rating

This is a good paper that clears NeurIPS's publishing threshold. I also appreciated the interaction with the authors.

Formatting Issues

No

Author Response

Thank you for taking the time to review our work. We sincerely appreciate the encouraging feedback and the thoughtful comments, which we address in detail below.

Uncertainty representation vs quantification

That is a good point. We agree that conformal prediction is more accurately described as an uncertainty representation technique, as it conveys uncertainty through the size of the prediction set rather than assigning a numerical value. We will add a brief footnote to clarify this distinction in the revised version.

Ambiguous ground-truth

Thanks for highlighting another interesting scenario where our methods are applicable. We will also include a discussion of ambiguous ground truth in the paper.

Our approach is designed to operate directly on (non)conformity scores. This is convenient because these scores are typically unidimensional, which greatly facilitates computation. More importantly, it allows us to handle arbitrary distribution shifts. This advantage also extends to cases with ambiguous ground truth. For instance, we can directly apply our method to the unidimensional score proposed by Stutz et al., which is computed by averaging the conformity scores of each class, weighted by their plausibilities.

Formally, assuming $K$ possible classes and plausibilities $\lambda$ in the simplex $\Delta^{K-1}$, our method also applies to score functions of the form $s'(x, \lambda) := \sum_{k=1}^{K} \lambda_k s(x, y_k)$. This enables us to fit a new set of weights over the calibration data to obtain a threshold that adapts under distribution shift. Once this threshold is established, one can proceed with the construction of predictive sets as in Stutz et al. (2023), or following the approach of Caprio et al., as you referenced.
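As a minimal sketch of how such a plausibility-weighted score could be computed (variable names are ours and the per-class scores are assumed to be given; this is not the reference implementation of Stutz et al.):

```python
import numpy as np

def ambiguous_score(class_scores, lam):
    """Plausibility-weighted score s'(x, lambda) = sum_k lambda_k * s(x, y_k).

    class_scores: array of shape (K,) with the per-class scores s(x, y_k).
    lam:          array of shape (K,) with plausibilities on the simplex.
    """
    lam = np.asarray(lam, dtype=float)
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0), "lam must lie on the simplex"
    return float(np.dot(lam, np.asarray(class_scores, dtype=float)))
```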

The main challenge in applying our methods to ambiguous ground truth lies not in the theoretical construction, but in the design of the auxiliary distributions. This is because, rather than working with a discrete set of $K$ classes, we must consider the full simplex $\Delta^{K-1}$. Although the min-max construction remains straightforward to derive, it may be suboptimal, as the resulting auxiliary distributions can be "too far apart" to be effective. Identifying more effective strategies for constructing auxiliary distributions in this setting is a promising direction for future work. We thank the reviewer again for pointing us in this direction.

References

Stutz et al. "Conformal prediction under ambiguous ground truth". TMLR, 2023

Caprio et al. "Conformalized Credal Regions for Classification with Ambiguous Ground Truth." TMLR, 2024.

Comment

I thank the authors for their answer, and I'm happy to confirm my positive score.

Review (Rating: 3)

This paper studies conformal prediction when the examples in the calibration set and the test set are no longer exchangeable. First, the paper upper bounds the total coverage gap between the calibration distribution $s_\sharp P$ and the test distribution $s_\sharp Q$ through the maximal density value of $s_\sharp P$ and the $W_1$ distance between $s_\sharp P$ and $s_\sharp Q$. Then, the paper utilizes two auxiliary distributions to further upper bound the above upper bound, which allows us to bound the total coverage gap using only unlabeled data from the test distribution. Next, the paper proposes to learn the weights in weighted conformal prediction by minimizing the obtained upper bound and then uses the learned weights to run the weighted conformal prediction algorithm. Experiments show that the proposed method achieves promising performance.

Strengths and Weaknesses

Strengths

  • As shown in the experiments, the proposed method seems to work well in the non-exchangeable case.

Weaknesses

  • The total coverage gap is not well motivated. It seems that we usually care about the performance when $\alpha$ is small and close to $0$; for example, in traditional statistics, we usually use $\alpha = 0.05, 0.1$, while $\alpha = 0.9$ is not used. So it seems that using the total coverage gap as a measure is not appropriate.

  • There is no theoretical guarantee of the coverage [main issue]. As a statistical framework, the goal of conformal prediction is to minimize the prediction set size under the constraint that the coverage is above a level of $1-\alpha$. Although the upper bound of the total coverage gap can somehow control the coverage, there is no explicit result showing how we can control the coverage over a predefined level.

  • A mismatch between the theory and the algorithm [main issue]. The upper bounds in this paper are for the total coverage gap of the conformal prediction algorithm. However, equation (8) is minimized to learn weights for the weighted conformal prediction algorithm. Furthermore, a similar upper bound for weighted conformal prediction algorithms is not trivial, because now the weighted empirical distribution function is not an unbiased estimate of the weighted distribution function, due to the dependence between the learned weights $w_i$ and the calibration scores $s(x_i, y_i)$.

  • A mismatch between the theory and experiments. The theory considers the total coverage gap, while the experiments only consider a specified coverage level.

  • The proposed method needs to use all test examples simultaneously, which means that it cannot be used in the common situation where the test examples come sequentially and we cannot see $x_{n+k}$ until $\mathcal{C}(x_{n+k-1})$ is predicted. This issue also implies that the comparison in the experiments may not be fair if the baselines only use the information of $x_i$ while the proposed method uses the information of $x_1, \dots, x_m$ in the test set.

  • Writing issues. The mathematical formulations are not rigorous enough.

    • In the equation on line 103, $\mathcal{S}_c$ should be used instead of $S_c$. Moreover, it should be $\mathcal{S}_c \sim s_\sharp P^n$, not $S_c \sim s_\sharp P$.
    • In equation (16) in Appendix A.1, $s_\sharp P(s_c)$ is used. However, as defined in line 92, $s_\sharp P$ is a pushforward measure. As a measure, the input of $s_\sharp P$ should be a measurable set, not a value $s_c$. So here you should use the density function of $s_\sharp P$ with respect to the Lebesgue measure.
    • In line 63, the separable metric space is defined on $\mathcal{X}$. However, the $\mathcal{X}$ notation here is the same as the input space, which may cause a misunderstanding.

Questions

Please refer to the weaknesses.

Limitations

The proposed method cannot be used in the case of sequential test examples, which is a common situation in practice.

Justification for Final Rating

After the discussion, the authors resolved some of the issues I raised. However, one of these issues cannot be resolved, since the paper focuses on the practical performance of conformal prediction without a theoretical guarantee, which, in my opinion, is not acceptable in the conformal prediction field. So I increased my score, but I still lean towards rejection.

Formatting Issues

No.

Author Response

We thank the reviewer for their detailed feedback and constructive criticism. We address each concern systematically below.

1. Motivation for Total Coverage Gap

The total coverage gap serves two key purposes:

  1. General Optimization Objective: The total coverage gap captures the overall effect of distribution shift, being zero when the distributions are identical. Similarly, the likelihood ratios of Tibshirani et al. (2019) do not target a specific $\alpha$, yet learning them remains both reasonable and effective for mitigating covariate shift effects. Both methods aim to learn a set of weights only once using some unlabeled test data, such that applying standard split CP (with any $\alpha$) to the weighted calibration data approximates calibration on the test distribution.

  2. Theoretical Insight: The total coverage gap also establishes a novel connection between conformal prediction and optimal transport, specifically through the 1-Wasserstein distance. This provides a broader theoretical perspective, even though it is not directly tied to a particular $\alpha$.

That said, we agree that some values of $\alpha$ are more relevant in practice. For this reason, we derive a bound for specific $\alpha$ values (Theorem A.6 in the appendix) and consider the coverage gap over a given range of $\alpha$ values (see discussion with reviewer XezP). However, in our experiments, these more targeted objectives have so far been less effective for optimization. We believe improving them is a promising direction for future work.

2. Lack of Theoretical Coverage Guarantees

We agree that statistical guarantees are central to conformal prediction. However, our work specifically addresses the setting of arbitrary distribution shifts, where standard conformal guarantees no longer hold. That is, our method does not invalidate any existing guarantees but rather operates in a regime where those guarantees are already compromised.

Moreover, unlike prior approaches, we make no assumptions about the nature of the distribution shift. For example, Tibshirani et al. (2019) rely on specific shift types, and Barber et al. (2023) require prior knowledge of the shift mechanism. In the general setting we consider, deriving formal coverage guarantees is extremely challenging. Nevertheless, we show that it is still possible to bound the coverage gap. Our key contribution is to use these bounds, even if they are not always tight, as a principled and effective objective for optimization. This allows us to approach the desired coverage level using only unlabeled samples from $Q$.

3. A Mismatch Between the Theory and the Algorithm

We thank the reviewer for highlighting this important point and will clarify it in the paper. By optimizing the calibration weights, we are directly minimizing the total coverage gap, but indeed this introduces dependence between the weights and the calibration scores. However, this can be easily addressed by fitting a weight function $w: s(x, y) \rightarrow \mathbb{R}$ on a separate set of calibration samples from $P$, thereby restoring independence. This is similar to the setup in Tibshirani et al. (2019), where weights are functions of covariates only, i.e., $w: x \rightarrow \mathbb{R}$.

To address this concern, we conducted additional experiments comparing our original method with the weight function approach trained on held-out data. The optimization remains essentially the same: we use a set of labeled samples from $P$ and a set of unlabeled samples from $Q$ to learn the weight function $w: s(x, y) \rightarrow \mathbb{R}$ via our objectives. Once this function is learned, we apply it to another set of samples from $P$, the calibration set proper, before applying weighted split CP.

The results for iWildCam and ImageNet-C with severity 3 are shown in Table 1 below. We observe almost no difference in performance between the two approaches. The only downside of learning the weight function is that one needs another set of samples from $P$ (we use 300 samples for calibration in both cases, and another 300 to learn the weight function). However, once more, this is the same scenario considered in Tibshirani et al. (2019).

Table 1: Comparison of Original Method vs. Weight Function Approach

| Dataset | Method | Coverage at 90% | Coverage at 95% | Coverage at 99% |
|---|---|---|---|---|
| ImageNet-C 3 | Original (min,max) | 89.4 ± 6.9 | 93.8 ± 4.8 | 97.9 ± 4.2 |
| ImageNet-C 3 | Weight Function (min,max) | 89.1 ± 11.3 | 94.4 ± 9.1 | 98.7 ± 8.1 |
| ImageNet-C 3 | Original (f,U) | 93.5 ± 4.4 | 95.0 ± 3.9 | 99.5 ± 0.9 |
| ImageNet-C 3 | Weight Function (f,U) | 90.0 ± 6.8 | 95.0 ± 4.2 | 99.3 ± 1.1 |
| iWildCam | Original (min,max) | 85.2 ± 4.1 | 94.3 ± 3.2 | 99.1 ± 0.5 |
| iWildCam | Weight Function (min,max) | 87.7 ± 1.1 | 95.0 ± 1.8 | 99.2 ± 0.7 |
| iWildCam | Original (f,U) | 88.2 ± 3.4 | 94.9 ± 2.3 | 99.1 ± 0.5 |
| iWildCam | Weight Function (f,U) | 88.1 ± 1.7 | 95.0 ± 1.9 | 99.3 ± 0.6 |

4. Mismatch Between Theory and Experiments

This apparent mismatch is by design and reflects practical considerations. We optimize the total coverage gap (which does not require specifying a particular $\alpha$) but evaluate at specific coverage levels (e.g., 90%) because that aligns with how conformal prediction is typically used in practice, as you rightly pointed out.

To provide a more complete picture, we also report the total coverage gap itself in Tables 2, 3, and 4 in the appendix. We are also happy to include coverage results for additional $\alpha$ values if that would help strengthen the paper.

5. Using All Examples Simultaneously

There seems to be a misunderstanding regarding this point. Similarly to Tibshirani et al. (2019), we assume access to a small unlabeled dataset from the test distribution $Q$, denoted $\mathcal{D}_Q^{(1)}$, which we use to learn the weights (or weight function). Once the weights are learned, we have standard split conformal prediction but with a weighted calibration set, and it does not depend on how the test samples $\mathcal{D}_Q^{(2)}$ are observed. Our methods are applicable to sequential test examples to the same extent that standard split CP is, and they do not use all examples simultaneously. Note that we use disjoint subsets of $Q$ for weight learning ($\mathcal{D}_Q^{(1)}$) and evaluation ($\mathcal{D}_Q^{(2)}$).
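For illustration only, a minimal sketch of this weighted split CP step, assuming the weights have already been learned; the usual finite-sample correction of split CP is omitted for brevity and the variable names are ours:

```python
import numpy as np

def weighted_quantile(cal_scores, weights, alpha):
    """(1 - alpha)-quantile of the weighted empirical score distribution s#P_hat_w."""
    cal_scores, weights = np.asarray(cal_scores, float), np.asarray(weights, float)
    order = np.argsort(cal_scores)
    s_sorted, w_sorted = cal_scores[order], weights[order]
    cdf = np.cumsum(w_sorted)                     # weighted empirical CDF at the sorted scores
    idx = np.searchsorted(cdf, 1.0 - alpha)       # first score whose CDF mass reaches 1 - alpha
    return s_sorted[min(idx, len(s_sorted) - 1)]

def prediction_set(score_fn, x, labels, threshold):
    """Standard split-CP set construction: keep labels whose score is below the threshold."""
    return [y for y in labels if score_fn(x, y) <= threshold]

# Usage: tau = weighted_quantile(cal_scores, learned_weights, alpha=0.1)
#        C_x = prediction_set(score_fn, x_test, candidate_labels, tau)
```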

In case test samples come sequentially, we can always wait until we have enough samples before learning the weights. As shown in Appendix B.3.1, we do not need many unlabeled test samples to learn the weights effectively. That means we use standard split CP for the first $m$ samples, learn the weights, and then use these weights when applying split CP on the following samples. It should also be possible to refine the weights on the fly as new test samples arrive using the same objectives, but we leave that for future work.

6. On the Fairness of the Comparison with Other Baselines

  1. Tibshirani et al. (2019): As mentioned previously, our method is directly comparable to the density ratio approach proposed by Tibshirani et al. (2019), and we follow a similar setup for a fair comparison.

  2. Kasa et al. (2024): The method of Kasa et al. (2024) is designed for sequential test samples and does not use the unlabeled test samples in $\mathcal{D}_Q^{(1)}$. However, we adapted their method to the split CP setting as described in Appendix B.1.3 such that the comparison remains informative. In our experiments, we estimate the entropy required by their method using all test samples in $\mathcal{D}_Q^{(2)}$, which means, in our experiments, their method effectively uses the entire test set simultaneously, unlike ours.

  3. Gibbs et al. (2023): The method of Gibbs et al. (2023) targets conditional coverage and also does not use the unlabeled test samples in $\mathcal{D}_Q^{(1)}$. As discussed in Appendix B.1.4, while their method offers stronger guarantees, it comes with significant computational cost and relies on stronger assumptions. Still, we believe the comparison is informative, especially in highlighting the difficulty of prespecifying the class of distribution shifts, as required by their method, in large-scale classification tasks like those we consider.

7. Mathematical Formulation Issues

Thanks for pointing this out. We will improve the notation accordingly. Specifically, in Equation (16), we will use the density function notation $p_{s_\sharp P}(s_c)$ instead of the measure notation $s_\sharp P(s_c)$, since $s_\sharp P$ is a pushforward measure and should take measurable sets as input, not point values.

8. Overall Assessment and Significance

We believe our work merits higher scores based on several key contributions:

Novel Theoretical Framework: We derive new bounds on the coverage gap via a new connection between conformal prediction and optimal transport theory, establishing new theoretical insights through the 1-Wasserstein distance. This represents a significant theoretical advance in understanding distribution shifts in conformal prediction.

Practical Algorithm: Our method handles arbitrary distribution shifts using only unlabeled test data, without requiring prior knowledge of the shift type or mechanism. Strong empirical results across diverse datasets (ImageNet-C, iWildCam, synthetic regression) and shift types demonstrate its practical relevance.

Methodological Innovation: The use of auxiliary distributions to derive bounds without labeled test data is a novel and flexible approach that could inspire future work in non-exchangeable conformal prediction.

Broad Impact: Our framework applies to both classification and regression tasks and offers a principled solution to a fundamental challenge in deploying conformal prediction under real-world conditions.

Comment

Thank you for the reply.

After the rebuttal, I am clearer about what this paper does. Some issues are solved, while some new issues arise.

1. Unsolved main issues

  • Theoretical results. I understand that the proposed method performs well in practice. However, conformal prediction is a topic that was born with theoretical guarantees. I cannot imagine conformal prediction methods without a coverage guarantee, just as I cannot imagine hypothesis testing methods that cannot control the type-1 error. Maybe the topic itself is not good enough. I am not saying all papers should have theoretical guarantees; however, for conformal prediction, they are essential.
  • Mismatch between theory and algorithm. Maybe the authors misunderstood my point in my review. The dependence is not the main issue here. My point is that the theory is an upper bound on the difference in performance of the conformal prediction algorithm between $P$ and $Q$. However, the proposed algorithm is not the original conformal prediction algorithm, but the conformal prediction algorithm with weighted scores. It means that the upper bounds may not hold for the proposed method.

2. New issue

It seems I misunderstood the concrete setting in this paper. I thought it was a setting like test-time adaptation, where we adapt the model using the observed batch of test data. After the clarification, I understand that the setting is more like domain adaptation, where we have access to unlabeled target data at the training stage but not at the test stage. I recommend that the authors provide a full version of the algorithm, not just the procedure to train the weighted scores (not a main issue). Now I understand why the proposed method works well in practice: it is because of the unlabeled target data, just like the success of domain adaptation. Since there is a line of work on reweighting methods in domain adaptation, it seems that this paper is a combination of ideas from conformal prediction and domain adaptation. This makes the results not as strong as I thought at the beginning.

Comment

Thank you for engaging in the discussion. We are glad the rebuttal clarified key aspects of our paper and resolved several concerns. Below, we address the remaining points.

Theoretical Results

We fully agree that theoretical guarantees are critical in CP. Instead of dismissing them, we view our work as a significant step towards addressing this fundamental challenge in the non-exchangeable setting where standard guarantees break down due to distribution shift.

A relevant but perhaps overlooked aspect of our contribution is that, in theory, our bounds offer guarantees. In particular, Theorem A.6 bounds $\Delta_{P, Q}(\alpha)$, the coverage gap for a specific $\alpha$, which can be used to achieve guarantees of the form $Q(Y_t \in C(X_t)) \geq 1 - \alpha - \Delta_{P, Q}(\alpha)$. While in practice these bounds may not yet be tight enough for this purpose, they establish a principled framework for understanding and quantifying coverage gaps under arbitrary shifts. Tightening these bounds is a clear direction for future work.

Still, our bounds are already practically relevant as they serve as principled optimization objectives that demonstrably improve coverage across diverse datasets and shift types. This dual theoretical and practical contribution advances the field by providing both mathematical insights and tools for improving coverage under shift.

Mismatch Theory and Algorithm

We appreciate the clarification on this point. Below we show our bounds also apply to weighted empirical distributions. Importantly, note that the bounds hold empirically too (see appendix, Table 5).

Assume $n$ samples $(x_i, y_i)$ from $P$ with scores $s_i$ and weights $w_i$, where $w_i \geq 0$ and $\sum_j w_j = 1$. The weighted empirical distribution is $\hat P_w = \sum_{i=1}^n w_i \delta_{(x_i, y_i)}$ and its pushforward by the score function is $s_\sharp \hat P_w = \sum_{i=1}^n w_i \delta_{s_i}$.

The coverage gap with respect to $\hat P_w$ is:

$$\Delta_{\hat P_w, Q}^w = \int_0^1 \left| \hat P_w (S_t \leq Q_\alpha^w) - Q (S_t \leq Q_\alpha^w) \right| d\alpha$$

where $Q_\alpha^w$ is the $(1-\alpha)$-quantile of the weighted empirical CDF $F_{s_\sharp \hat P_w}(s) = \sum_{i=1}^n w_i \mathbf{1}(s_i \leq s)$.

We can also express this in terms of CDFs:

$$= \int_0^1 \left| F_{s_\sharp \hat P_w}(Q_\alpha^w) - F_{s_\sharp Q}(Q_\alpha^w) \right| d\alpha$$

Similarly to the original proof, a key insight is that the weighted quantile $Q_\alpha^w$ takes values in $\{s_1, \dots, s_n\}$. For each $\alpha$, there exists some $j$ such that $Q_\alpha^w = s_j$. In this case, each $s_j$ takes up $w_j$ of the $[0,1]$ range, instead of $1/n$ as in the original proof. Therefore:

$$= \sum_{i=1}^n w_i \left| F_{s_\sharp \hat P_w}(s_i) - F_{s_\sharp Q}(s_i) \right|$$

Note that we work directly with the weighted empirical distribution rather than sampling from it. The weights $w_i$ appear naturally in the weighted sum, making this equivalent to the weighted-CDF bound from Theorem 3.2. As in the original proof, we can obtain the 1-Wasserstein bound by taking the maximum over the weighted density.

This shows there is no mismatch between theory and algorithm, as the bounds of Theorem 3.2 also apply to weighted empirical distributions.
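As an illustration only (not the authors' code), the final weighted sum above can be estimated directly from the calibration scores, the learned weights, and a sample of unlabeled test scores, with $F_{s_\sharp Q}$ replaced by the empirical CDF of the test scores:

```python
import numpy as np

def weighted_coverage_gap(cal_scores, weights, test_scores):
    """Estimate sum_i w_i | F_{s#P_w}(s_i) - F_{s#Q}(s_i) | from samples."""
    cal_scores = np.asarray(cal_scores, float)
    weights = np.asarray(weights, float)
    test_scores = np.asarray(test_scores, float)
    # Weighted empirical CDF of the calibration scores, evaluated at each s_i.
    F_P = np.array([weights[cal_scores <= s].sum() for s in cal_scores])
    # Empirical CDF of the test scores, evaluated at each s_i.
    F_Q = np.array([(test_scores <= s).mean() for s in cal_scores])
    return float(np.sum(weights * np.abs(F_P - F_Q)))
```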

Domain Adaptation

Our work targets arbitrary distribution shifts, where split CP guarantees do not hold. To be effective in mitigating the effects of such shifts, we need some information about the test distribution. Prior work in non-exchangeable CP assumed restrictive prior knowledge about the shift type or mechanism. Instead, we make a much less restrictive assumption: access to unlabeled samples from the test distribution. The only similarity between our methods and domain adaptation is this use of unlabeled samples. This superficial connection does not constitute a meaningful similarity and certainly does not diminish the novelty or value of our contribution.

Unlike domain adaptation methods, our goal is not to improve model performance but to improve coverage under distribution shift. The use of unlabeled data in our method is guided by a theoretical framework specific to CP, not by domain adaptation principles. For these reasons, we strongly disagree with the characterization of our method as a simple combination of CP and domain adaptation; this framing overlooks both the distinct objectives and the theoretical foundations of our approach.

That said, we do see potential for synergy between our methods and domain adaptation. They are orthogonal and could be combined: one could first adapt the model using domain adaptation techniques, then apply our approach to improve coverage with respect to the adapted model.

We will include an algorithm in the final version. It is identical to split CP, except that the threshold is computed on $s_\sharp \hat P_w$ instead of $s_\sharp \hat P$. We hope these clarifications, along with the strong empirical performance noted by the reviewer, reinforce the value of our contribution.

Comment

Thank you for your reply. Most of my questions are answered. I am willing to increase the score accordingly.

Comment

Thank you again for your thorough review and thoughtful engagement. You've raised several interesting insights that we’ll be incorporating into the final version of the paper. We're glad to hear that most of your questions have been addressed, and we appreciate your intention to increase the score.

Review (Rating: 5)

The paper revisits the problem of non-exchangeability for conformal prediction, and proposes an optimal-transport-based conformal framework with rigorous guarantees that addresses the preservation of coverage even in the presence of distribution shifts via what they dub Total Coverage Gap: the gaps in coverage at all coverage levels.

Starting off with a simple TV-distance bound on the TCG (which is hard to estimate), the paper then replaces this bound with two further upper bounds, which lead to estimable and/or learnable quantities that can serve as the optimization target for the algorithm. They provide such an algorithm (or rather a meta-algorithm) that needs to be given a pair of "sandwiching distributions" such that, as long as the sandwiching is approximately satisfied, the method will exhibit robustness to distribution shift.

The paper then moves on to a suite of experiments designed to show that the proposed method is indeed less sensitive to shifts/perturbations in the tested settings (both regression and classification).

Strengths and Weaknesses

Overall, the paper has several strengths which will make it interesting to the conformal community. First, its upper bound that lets one learn weights based on unlabeled data is quite relevant and may/can help the unlabeled data setting in the CP literature become more well-studied.

The fact that the optimal transport view of the problem results in the to-be-learned objects being sandwiching distributions is also quite simple/memorable, and may lead to some productive investigations around how to choose such a pair of distributions depending on any available prior (but see more on this below).

Additionally, while one might say that the paper's theory is not very challenging to derive, I believe the value is contained in the formulation of the bounds themselves, which are shown to be good surrogate optimization objectives in the presence of data shift (with its own set of empirical tradeoffs compared to prior work).

As for the weaknesses of the paper, one reasonably immediate question that was left empirically unaddressed is: Suppose I only care about certain coverage levels; say I know I will be enforcing coverage over 90%, but am unsure which $\alpha \geq 0.9$ I will pick; then what can I guarantee about the total coverage gap just at those levels? In other words, this considers the more general (and arguably more natural in applications) setting where we have a weighting function over the coverage levels that determines which ones are relevant. This setting would also help with the positioning of the paper.

Another question that skews in a more modeling/theory direction is --- how does one choose the sandwiching distributions? This question has many facets, some best left for future work, but in the context of this paper I would have liked to see a discussion of

--- (possibly stylized) conditions under which the uninformative-informative proposal would be provably stochastically dominating;

--- which stochastically dominating pairs of distributions could one provably choose, given varying amounts of prior knowledge on the sandwiched distribution? (e.g. when nothing is known, we have the proposed min and max pair);

--- what happens empirically when we intentionally misspecify the pair of distributions and it doesn't really sandwich the true distribution.

These are all relatively straightforward but very informative questions that would strengthen the paper's message as well.

Questions

Several questions, if addressed in the rebuttal productively, can help me increase my score. (Also, see the Weaknesses section above.)

(1) A deeper discussion of the possible choices of sandwiching distributions. Both in theory as well as in empirics. (See also Weaknesses section above.)

(2) Investigation (empirically) of the above proposal to restrict the total coverage gap to a "natural" range such as above 85% or 90%. Does it help control coverage better or worse than in the total case considered now? (Both could in principle occur for different reasons.)

(3) Further investigation, and discussion in more detail, of the experiments from the perspective of: (i) How robust are the empirics to the degree of misspecification in the sandwiching distributions? (Can insert, and test for the consequences of, this misspecification in a controlled manner.) (ii) How come the looser 1-Wasserstein bounds overperformed the other upper bound in regression tasks? I'd like to see a deeper dive than in the paper. (iii) Learning weights via the proposed procedure --- is it possible to visualize by how much more efficient/robust/etc the weights learned via this procedure are vs. the Tibshirani et al brute-force density ratio proposal?

Typo: remove "takes" in line 77.

Limitations

Yes.

Justification for Final Rating

I remain positive about this paper after discussion with the authors and after reading the discussion the authors had with the other reviewers. I believe the method is both practical, and also opens up a number of empirical and theoretical challenges for the future (such as sandwiching bounds; range-of-alpha coverage and other topics; see the review points above). This is a good sign, and thus I believe the paper is a worthy contribution in its current form, even without engaging with all these gaps yet. While there is a concern from some reviewers about the coverage guarantees, in a sense the distribution shift setting justifies a corresponding shift in the coverage amount, and I currently believe it's also fair to relegate bounding that amount in theory and practice to future work.

Formatting Issues

N/A

Author Response

We thank the reviewer for their thoughtful and constructive feedback. We address each of the main concerns below.

Coverage Level Weighting and Restricted Total Coverage Gap

That is a very interesting point. Thanks for bringing that up.

One could extend the definition to accept a range $[\alpha^{-}, \alpha^{+}]$ such that

$$\Delta_{P,Q}(\alpha^{-}, \alpha^{+}) := \int_{\alpha^{-}}^{\alpha^{+}} \frac{\Delta_{P,Q}(\alpha)}{\alpha^{+} - \alpha^{-}} \, d\alpha.$$

The proof of Theorem 3.2 in Appendix A.1 would follow in a similar fashion, except now we only consider score values in a restricted range. We would lose the connection to the 1-Wasserstein distance, but for the weighted-CDF version of the bound we would get

$$\Delta_{P,Q}(\alpha^{-}, \alpha^{+}) \leq \frac{1}{\alpha^{+} - \alpha^{-}} \int_\mathbb{R} s_\sharp P(s_c) \, \left| F_{s_\sharp P}(s_c) - F_{s_\sharp Q}(s_c) \right| \, \mathbf{1}\!\left(s_c \in [F_{s_\sharp P}^{-1}(\alpha^{-}), F_{s_\sharp P}^{-1}(\alpha^{+})]\right) ds_c,$$

where $\mathbf{1}$ is an indicator function.

We tried to optimize this weighted-CDF bound during the rebuttal phase, but it underperformed the total coverage gap version of the bound. In the table below, we see the improvement over the unweighted version is small when optimizing $\Delta_{P,Q}(0.0, 0.15)$. We see two possible reasons for this. First, optimizing the bound for a specific coverage gap range may be inherently more difficult; for example, the upper bound for a fixed $\alpha$ derived in Appendix A.3 also failed to yield better empirical performance. Second, the total coverage gap may already serve as a strong objective, leaving limited room for improvement through alternative formulations. Yet, these more targeted objectives are a very promising avenue for future work, and we will include this discussion in the paper.

Empirical Comparison: Total Coverage Gap vs. High Coverage Gap ($\alpha \leq 0.15$)

| Method (iWildCam) | Coverage at 90% | Coverage at 95% | Coverage at 99% |
|---|---|---|---|
| Uncorrected | 79.6 ± 3.1 | 88.5 ± 2.8 | 98.5 ± 0.6 |
| (min,max) Total | 85.2 ± 4.1 | 94.3 ± 3.2 | 99.1 ± 0.5 |
| (min,max) $\alpha \leq 0.15$ | 78.5 ± 3.1 | 89.0 ± 2.8 | 98.6 ± 0.5 |
| (f,U) Total | 88.2 ± 3.4 | 94.9 ± 2.3 | 99.1 ± 0.5 |
| (f,U) $\alpha \leq 0.15$ | 78.6 ± 3.0 | 89.1 ± 2.9 | 98.6 ± 0.5 |

Empirical Robustness to Sandwiching Distribution Misspecification

That is an interesting point. As mentioned in the paper, $(s_\sharp\hat{Q}_f, s_\sharp\hat{Q}_U)$ might already fail to satisfy the stochastic dominance relationship in practice. However, our experiments show that coverage often improves even in such cases. This suggests that the sandwiching approach can still yield meaningful bounds, even when the sandwiching distributions are somewhat misspecified. More generally, one can expect improved coverage whenever these distributions better approximate the true test score distribution than the calibration scores do.

Performance Analysis of 1-Wasserstein vs. Weighted CDF Bounds

One might expect the weighted-CDF bound to outperform the 1-Wasserstein bound, given that it is provably tighter. However, in some experiments, particularly the regression task, the 1-Wasserstein bound performed better. As discussed in Section B.3.2, we hypothesize that this is because the 1-Wasserstein bound is easier to optimize. It only requires identifying the maximum value of the score density, while the weighted-CDF bound depends on accurately estimating the entire density. It is not entirely clear why regression tasks are more sensitive to this difference. One possibility is that, in conformal regression-as-classification, the output space must be discretized into bins somewhat arbitrarily. This binning may introduce additional noise into the score function, especially under distribution shift, making density estimation more challenging.

Comparison with Likelihood Ratio Methods

Learning weights via the proposed procedure --- is it possible to visualize by how much more efficient/robust/etc the weights learned via this procedure are vs. the Tibshirani et al brute-force density ratio proposal?

Could you clarify what you mean by robustness in this context and what sort of visualization you have in mind? In our experiments, we found that our method consistently outperforms the likelihood ratio approach proposed by Tibshirani et al. across a variety of settings. We believe this is largely because learning accurate likelihood ratios is inherently difficult, especially in high-dimensional spaces. In contrast, our method focuses on unidimensional scores, which are much easier to estimate and optimize.

Additionally, our approach is not limited to covariate shift scenarios, which makes it more broadly applicable and potentially more robust in diverse real-world conditions.


We thank the reviewer again for the thoughtful and constructive feedback. These questions helped us clarify both theoretical and practical aspects of our method. We hope our responses address the main concerns, but we are happy to discuss further if anything remains unclear.

Comment

Thank you for your detailed and constructive response.

For the guarantees for a range of values of alpha, thanks for considering how this would work in your framework and running preliminary experiments. The improvement over the unweighted version being small certainly sounds interesting. Possibly the reason may be the more difficult estimation for this new object, as you point out, but this does deserve its own future study to know why.

For Wasserstein vs weighted CDF bounds, the binning could be the reason, and if so, this would again be productive for a follow-up investigation (and indeed, binning-related issues also commonly present an issue in distribution-free calibration settings).

For the learned weights study/visualization, this is a relatively minor point, but I was referring to possibly visualizing how the weights themselves, which you learn in your procedure, are a "smoother" or more well-conditioned object than standard density ratio weights such as in Tibshirani et al. Again, this is not a major point, and I do get the various overall performance benefits of your approach compared to Tibshirani et al.

Furthermore, I wanted to encourage you for follow-up investigations to further look into the sandwiching bounds, not just from the empirical standpoint but also theoretically: as I wrote in the original review, depending on the amount of prior information, provable sandwiching bounds could intuitively turn out more or less conservative, and rigorously quantifying that would be interesting.

Overall, I continue to support this paper, and thank you for your constructive engagement with my points (and those of the other reviewers).

Comment

Thank you for your continued support of our work and for your thoughtful engagement with our responses. We greatly appreciate your constructive feedback and suggestions for future research directions. The points discussed here add to the paper, and we will definitely include them in the final version.

Theoretical Framework for Sandwiching Distribution Design

Indeed, leveraging prior information to design tighter sandwiching distributions is a promising idea. In our context, where we leverage unlabeled test data, designing the sandwiching distributions can be seen as a label selection problem: for each test instance $x_i$, we must choose two labels $y_i^{\downarrow}, y_i^{\uparrow} \in \mathcal{Y}$ to construct:

$$s_\sharp\hat Q_{\downarrow} = \frac{1}{m}\sum_{i=1}^m \delta_{s(x_i, y_i^{\downarrow})}, \quad s_\sharp\hat Q_{\uparrow} = \frac{1}{m}\sum_{i=1}^m \delta_{s(x_i, y_i^{\uparrow})}$$

A natural way to improve this is by constraining the set of candidate labels at each test point using prior knowledge. That is, we select $y_i^{\downarrow}, y_i^{\uparrow}$ from a feasible subset $\mathcal{Y}_i^{\text{feasible}} \subset \mathcal{Y}$. Provided the prior knowledge is accurate, i.e., $y_i \in \mathcal{Y}_i^{\text{feasible}}$, this approach leads to provably tighter bounds. For instance, we could take

$$y_i^\downarrow = \arg\min_{y \in \mathcal{Y}_i^{\text{feasible}}} s(x_i, y), \quad y_i^\uparrow = \arg\max_{y \in \mathcal{Y}_i^{\text{feasible}}} s(x_i, y)$$

Here, there is a clear relationship between the amount of information and the tightness of the sandwiching distributions: the smaller the feasible set, the tighter the bounds.
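A short sketch of this feasible-set construction (the `feasible_sets` input encoding the prior knowledge is hypothetical; the sketch simply takes the extreme scores within each feasible set, as in the equations above):

```python
import numpy as np

def sandwich_scores(score_fn, test_inputs, feasible_sets):
    """Per-test-point extreme scores defining the auxiliary distributions:
    s#Q_down puts mass 1/m on each min score, s#Q_up on each max score."""
    lo, hi = [], []
    for x, feasible in zip(test_inputs, feasible_sets):
        scores = np.array([score_fn(x, y) for y in feasible])
        lo.append(scores.min())   # score at y_i_down
        hi.append(scores.max())   # score at y_i_up
    return np.array(lo), np.array(hi)
```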

In some cases, we may have direct knowledge about the feasible sets themselves. For instance, the labels might be structured in a hierarchy that we could exploit. For a simple example in the context of the ImageNet dataset, we might have some side information about super classes, e.g., we might know the correct class is a dog, which excludes most of the possible classes, except for the ones identifying different dog breeds.

We may also be able to derive the feasible sets from the characteristics of the distribution shift (e.g., bounded perturbations) using, for example, an optimization procedure. Such cases are interesting directions for future work to improve our bounds and optimization objectives.

Comment

Thank you for the follow-up response. I do like this discussion on sandwiching bounds, and believe it would (in some form) present a valuable addition to the paper, so as to point readers to it as a future work direction.

My concerns stated in the review have been addressed to a sufficient degree, and I have updated my score correspondingly.

Review (Rating: 5)

This paper introduces a new conformal prediction algorithm in settings with distribution shift. The authors leverage optimal transport theory to quantify and mitigate the coverage gap, by learning a distribution that minimizes a novel coverage gap upper bound. The method results in smaller coverage gaps and more efficient prediction sets compared to baselines.

Strengths and Weaknesses

Strengths

  • the paper's perspective and theoretical contributions are very novel, breathing new life into CP's applicability to distribution shifts.
  • the paper explicitly addresses the issue of unlabeled test data, a common scenario in real-world applications. The construction of auxiliary distributions from unlabeled data for bounding the coverage gap is a clever and practical solution.
  • overall writing is very good. the math is well-explained and the proofs look correct to me.
  • discussion and comparison with existing work is thorough and done with care. I appreciate the details in experiment implementation in Appendix B.1.
  • empirical performance that corroborates the theory. It is cool to see although the upper bound using auxiliary distributions can be loose, the prediction intervals show large improvements in terms of calibration. The studies in appendix B.3 are also informative of the properties of the algorithm.

Weaknesses

  • I think within the scopes of a conference paper this submission does not have significant weakness.

Questions

These are just nice-to-haves if the authors have time, since my score is already high.

  • Can you provide some qualitative examples of the UQ provided by your algorithm vs baselines? It will help us understand the experiment setting and your results better. As someone who is not in CV I do not know what ImageNet-C looks like.

  • I am curious about the computational costs of this method. The paper fits a Gaussian KDE to the weighted calibration scores ($\mathcal{O}(n^2)$). How long does it take (and on what machine) on your datasets? Would increasing the number of calibration samples beyond 300 help, for better calibration quality? Would any approximation method for density estimation or computing the 1-Wasserstein distance work in place of full fitting?

Limitations

Yes.

Justification for Final Rating

The authors have answered my questions and I maintain my positive recommendation.

Formatting Issues

na

Author Response

Thanks for your review and for the encouraging comments regarding the novelty of our approach and the thoroughness of our experiments. We address your questions below.

Can you provide some qualitative examples of the UQ provided by your algorithm vs baselines?

We agree that qualitative examples are helpful for understanding the behavior of uncertainty quantification methods, and we plan to include them in the final version of the paper. Since we cannot share images during the rebuttal phase, we refer to Angelopoulos et al. (2021), who also apply conformal prediction to ImageNet and provide relevant visual examples.

To help clarify our experimental setting for readers unfamiliar with ImageNet-C: it applies 15 corruption types (e.g., Gaussian noise, motion blur, frost, fog) at 5 severity levels to create realistic distribution shifts that models encounter in deployment. In our setting, we observe that prediction sets increase in size with the degree of distribution shift, as expected. For instance, sets are larger on ImageNet-C with severity five than with severity one, as shown in Table 1.

Regarding other baselines, our primary focus is on improving conformal prediction under distribution shift, so comparisons to non-conformal baselines are not the central aim of this work. That said, we’d be happy to consider including such comparisons if they could help clarify or strengthen the paper. Are there particular baselines you think would be especially valuable to include, or aspects of the comparison that you feel would add meaningful insight?

How long does it take (and on what machine) on your datasets?

To be precise, the computational complexity of evaluating the KDE is given by $\mathcal{O}(n \cdot k)$, where $n$ is the number of datapoints where the KDE is defined ($n$ samples from $P$) and $k$ is the number of points where the KDE is evaluated. While $n$ is fixed for a given sample from $P$, the value of $k$ can be tuned according to the compute budget. It directly affects the accuracy of downstream computations: a higher $k$ improves the precision of the approximate integral in the case of the weighted-CDF bound, and enhances the estimation of $\max s_\sharp P$ in the context of the 1-Wasserstein bound.
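A plain-NumPy sketch of that $\mathcal{O}(n \cdot k)$ evaluation (a weighted Gaussian KDE over the calibration scores, evaluated on a grid of $k$ points; the bandwidth choice here is ours, not the paper's):

```python
import numpy as np

def weighted_gaussian_kde(samples, weights, grid, bandwidth=0.1):
    """Evaluate a weighted Gaussian KDE at k grid points; cost is O(n * k).

    samples: (n,) calibration scores; weights: (n,) non-negative, summing to 1;
    grid:    (k,) evaluation points.
    """
    samples, weights, grid = (np.asarray(a, float) for a in (samples, weights, grid))
    diffs = (grid[None, :] - samples[:, None]) / bandwidth   # (n, k) pairwise differences
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return weights @ kernels                                 # (k,) density estimate
```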

In practice, fitting the weights is relatively fast. On a standard commercial GPU (RTX 2080 Ti), the process takes under a minute for 300 evaluation points and less than 10 minutes for 10,000 points. These runtimes were sufficient for our use case, but there is still room for optimization. Several techniques exist to accelerate KDE evaluation, including the fast Gauss transform and tree-based approximations, which could be incorporated to reduce computational overhead. See, for example, Yang, Duraiswami, and Gumerov (2003) and the references therein.

Would increasing the number of calibration samples beyond 300 help, for better calibration quality?

Yes, we have observed that larger numbers of calibration samples are correlated with better coverage. See Tables 2 and 3 in Appendix B for an analysis of how the number of calibration samples from $P$ and unlabeled samples from $Q$ affects coverage.

This trend is expected. More calibration data typically improves the reliability of coverage estimates, and in our setting, it also facilitates learning, as it increases the number of points we can use to approximate $Q$ via the weighted empirical distribution over calibration scores. More unlabeled samples from $Q$ are also beneficial, since they increase the amount of information we have about the test distribution. However, we chose to focus our experiments on smaller calibration sets to demonstrate that our method remains effective even in more challenging and realistic scenarios where both labeled calibration data and unlabeled test samples from $Q$ are scarce.

Would any approximation method for density estimation or computing the 1-Wasserstein distance work in place of full fitting?

Yes, alternative methods are applicable, provided the resulting bound remains differentiable with respect to the learnable weights. For density estimation, other approaches can be used, including the efficient KDE variants mentioned above. Similarly, various methods exist for approximating the 1-Wasserstein distance. However, since the scores are typically one-dimensional, the 1-Wasserstein distance can be computed exactly in closed form, leaving little room for improvement through approximation.
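For reference, a sketch of the exact one-dimensional computation mentioned here: for equal-size empirical samples with uniform weights, $W_1$ reduces to the mean absolute difference of the sorted values (the general case instead integrates $|F_P - F_Q|$ over a merged grid):

```python
import numpy as np

def wasserstein_1d(a, b):
    """Exact 1-Wasserstein distance between two equal-size 1-D empirical samples."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    assert len(a) == len(b), "this shortcut assumes equal sample sizes"
    return float(np.mean(np.abs(a - b)))
```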

References

Angelopoulos et al. "Uncertainty Sets for Image Classifiers using Conformal Prediction." ICLR, 2021.

Yang, Duraiswami, and Gumerov. "Improved fast gauss transform and efficient kernel density estimation." ICCV, 2003.

Comment

Thank you to the authors for your detailed response; I have no further questions. I'm happy to confirm my positive score.

Final Decision

This paper proposes a weighted conformal prediction algorithm for non-exchangeable data, leveraging optimal transport theory to quantify and mitigate the coverage gap. The method results in smaller coverage gaps and more efficient prediction sets compared to baselines. Most reviewers liked this paper, praising the clarity, novelty, and the possibility of leveraging unlabeled data for bounding the coverage gap. The main weakness is the lack of statistical coverage guarantees (which are central to conformal prediction methods); however, this is a price that has to be paid for extending the theory to arbitrary distribution shifts (non-exchangeable data), where standard conformal guarantees no longer hold, and it is my opinion that the results of this paper are interesting to the community.