PaperHub
Overall score: 5.5/10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Stochastic Online Conformal Prediction with Semi-Bandit Feedback

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Online Conformal Prediction, Semi-bandit Feedback

Reviews and Discussion

Review
Rating: 3

The paper studies the problem of conformal prediction with semi-bandit feedback. In this setting, it formulates an online learning problem where the algorithm must generate conformal sets that maintain valid coverage guarantees in each round. The authors propose an algorithm that satisfies this requirement and demonstrate that it achieves sublinear regret under a regret definition based on score thresholds.

Questions for Authors

What is $\epsilon_t$ in Lemma 4.2? I assume it corresponds to the $\epsilon_t$ defined before Equation (4), but it should be explicitly referenced to avoid ambiguity.

Additionally, why is Theorem 3.1 stated twice in the paper?

In the proof of Theorem 3.1, immediately after the line “By a union bound and by Lemma 4.1, we have”, what is $\delta$ here?

Assumption 2.1 seems to appear without sufficient justification. The authors state that “A natural way to formalize the continuity assumption is based on the cumulative distribution function (CDF) of the scoring function” and then assume $\phi(\tau) = \psi(G^{*}(\tau))$. Why is this considered natural? A more intuitive assumption, in my view, would be to require that $\phi$ is a Lipschitz continuous function of $\tau$.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes, I checked all the proofs in the paper.

I believe there is a minor issue in the proof of Theorem 3.1. In the transition from line 285 to 287, the notation switches from $\overline{G}_t$ to $G_t$. If we use the equality $\overline{G}_t(\tau) = G_t(\tau) + \epsilon_t$, an additional application of the triangle inequality is required, which introduces an extra contribution of $\epsilon_t$.

Going through the proof, it seems to me that it easily generalizes to the case where $\psi$ is $\alpha$-Hölder for $\alpha \leq 1$. The regret should still be sublinear, but with a rate slower than $\sqrt{T}$, perhaps approaching $T$ as $\alpha \to 0$.

Experimental Design and Analysis

Yes, I think the experiments are good and comprehensive. One additional aspect I would like to see is a log-log plot of regret along with the corresponding slope parameter to verify whether the rate is indeed smaller than $\sqrt{T \log T}$. From the results, it appears that the rate is approximately $\sim \sqrt{T}$, but a formal verification would be helpful. Alternatively, the authors could include a reference plot of $f(T) = \sqrt{T \log T}$ in Figure 1 for comparison.
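For concreteness, here is a minimal sketch of the slope check I have in mind; the regret values below are synthetic placeholders and would be replaced by the regret curve produced in the experiments.

```python
# Illustrative only: estimate the growth exponent of cumulative regret from a
# log-log fit. The regret curve here is synthetic; substitute the curve
# produced by the experiments.
import numpy as np

T = 10_000
t = np.arange(1, T + 1)
regret = np.cumsum(np.sqrt(np.log(t + 1) / (t + 1)))  # placeholder ~sqrt(T log T)-like curve

# Fit log(regret) = slope * log(t) + intercept on the tail of the curve;
# a slope near 0.5 is consistent with a sqrt(T) rate (up to log factors).
tail = t > T // 10
slope, intercept = np.polyfit(np.log(t[tail]), np.log(regret[tail]), deg=1)
print(f"estimated slope: {slope:.3f}")
```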

Supplementary Material

N/A

Relation to Existing Literature

The paper seems to improve on existing results.

Missing Important References

Not that I am aware of.

Other Strengths and Weaknesses

The paper is generally well-written and is easy to follow. The theoretical results are nice and verified with experiments.

Regarding potential weaknesses, I am not entirely convinced by the notion of regret as defined in the paper. Why is $\tau$ given such emphasis? It seems more like a means to an end rather than the ultimate objective. What truly matters is the size of the conformal sets. Would it be a big deal if the conformal set remains nearly the same for $\tau = \tau^{\star}$ and $\tau = 2\tau^{\star}$? A more natural metric might be the size of the conformal set in classification with finite labels or, for regression, its Lebesgue measure.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the thoughtful comments. We address the questions and major concerns below:

Concern 1: Proof, Lines 285–287

We apologize for the confusion. We will fix the typo in the revised draft. We also note that the final regret bound is correct because we add the additional $\epsilon_t$ in line 290.

Concern 2: Reference Plot

We thank the reviewer for the helpful advice. We updated Figure 1 to add f(T)=TlogTf(T) = \sqrt{T \log T} as a reference. We also take the log of the y-axis to improve readability. The figure is uploaded here:
https://imgur.com/a/j05QwzS

Concern 3: Natural Metric

It is correct that $\tau$ serves as a surrogate for prediction set size, as the prediction set size is monotonic in $\tau$. In practice, we find that this surrogate performs reasonably well, and we use a variant of it in our experiments. The reason we do not directly use prediction set size is explained in lines 148–150 (right column): specifically, because prediction set size is a discontinuous function, it does not satisfy Assumption 2.1. We have also conducted experiments on the convergence behavior of the prediction set size and would be happy to share the results.

Alternatively, a guarantee that existing conformal prediction algorithms provide is that the coverage rate is also not too much higher than desired, which shows that the algorithm is not overly conservative. Indeed, our regret guarantee can be applied to obtain such a bound. Specifically, consider the loss function $\phi(\tau) = 1-\min\{G^*(\tau),G^*(\tau^*)\}$ (the min is to ensure the step-wise regret is non-negative). On the event that $\tau_t\le\tau^*$ for all $t$ (which holds with high probability, see Lemma 4.2), we have $\min\{G^*(\tau_t),G^*(\tau^*)\}=G^*(\tau_t)$, so the step-wise regret is
$$r_t := \phi(\tau_t) - \phi(\tau^*) = G^*(\tau^*) - G^*(\tau_t) = \mathbb{P}[\tau_t\le f(x_t,y_t^*)\le\tau^*].$$

In other words, the step-wise regret $r_t$ equals the overcoverage probability, i.e., the probability that our algorithm covers $y_t^*$ when the "oracle" algorithm (which knows $\tau^*$) does not. Then, by Theorem 3.1, we have $r_t = \mathbb{P}[\tau_t\le f(x_t,y_t^*)\le\tau^*] = O(\sqrt{\log t/t})$, implying that the overcoverage probability converges to zero. We are happy to add a discussion of this result to our paper.
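We note that summing this per-step bound over $t$ (using $\sum_{t=1}^{T}\sqrt{\log t/t} = O(\sqrt{T\log T})$) also gives a cumulative overcoverage bound:
$$\sum_{t=1}^{T} \mathbb{P}[\tau_t \le f(x_t,y_t^*) \le \tau^*] = \sum_{t=1}^{T} r_t = O\big(\sqrt{T \log T}\big),$$
so the average overcoverage probability over $T$ rounds is $O(\sqrt{\log T/T})$ and vanishes as $T \to \infty$.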

Question 1: $\epsilon_t$ in Lemma 4.2

It is correct that $\epsilon_t$ is defined in Equation (4). We will include this definition in the statement of the lemma in the revised draft to avoid any ambiguity.

Question 2: Theorem 3.1 Stated Twice

We apologize for the confusion. We re-stated the theorem in the proof section to improve its readability. We are happy to remove it.

Question 3: $\delta$ in Lemma 4.1

We define $\delta = 1/T^2$ in lines 199–200 (right column). We will include this definition in the revised proof to avoid future confusion.

Question 4: Assumption 2.1

We meant natural in the sense that it naturally captures relevant structure in our problem, not necessarily that it is a natural assumption. Specifically, this assumption is a natural way to avoid the pathological cases discussed in lines 143–150. To summarize, consider a CDF such that $G^*(\tau_0) = G^*(\tau^*) = 1 - \alpha$ with $\tau_0 < \tau^*$. Recall that $\tau^* = \sup\{\tau \in \mathbb{R}: G^*(\tau) \le 1 - \alpha\}$. Since we almost surely never observe any scores between $\tau_0$ and $\tau^*$, it becomes extremely difficult to identify $\tau^*$ within $[\tau_0,\tau^*]$ from a finite sample. Even a small error in estimating the empirical CDF (e.g., $G_t(\tau^*) = G^*(\tau^*) + \epsilon$) can result in significant regret. We apologize for the confusion, and will clarify our explanation of this assumption.

Reviewer Comment

I thank the authors for their response. Including a brief discussion of your reply to my Concern 3 in the main text would be helpful and appreciated.

Regarding Assumption 2.1, thank you for the clarification. That said, the assumption still feels somewhat abrupt and lacks sufficient context. I would recommend adding a discussion on whether similar assumptions have been used in prior work. If possible, it would also be valuable to outline what an ideal structural assumption on $\phi$ might look like, acknowledge the limitations of the current formulation, and perhaps pose this as an open question for future work.

I would like to retain my original score.

Review
Rating: 3

The paper introduces a novel algorithm for online conformal prediction in settings where only partial feedback is available, specifically semi-bandit feedback. Conformal prediction is a framework for uncertainty quantification that outputs prediction sets with guaranteed coverage probabilities. The key challenge addressed in this work is the lack of a large calibration dataset, which is typically required for conformal prediction. Instead, the authors consider an online setting where examples arrive sequentially, and the true label is only observed if it is contained in the prediction set.

The proposed algorithm, Semi-bandit Prediction Set (SPS), dynamically constructs prediction sets over time. It uses a high-probability upper bound on the cumulative distribution function (CDF) of the scoring function to ensure that the prediction sets maintain the desired coverage rate. The algorithm guarantees sublinear regret, providing strong theoretical guarantees and empirical performance. The algorithm is applicable to a wide range of real-world tasks, including document retrieval, image classification, and auction price-setting.
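To check my understanding of this update rule, here is a minimal simulation written from the descriptions in this thread: maintain an empirical CDF of the observed scores, inflate it by a DKW-style width $\epsilon_t=\sqrt{\log(2/\delta)/2t}$ with $\delta=1/T^2$ (the quantities quoted in the authors' responses below), and only ever move the threshold up to the largest value whose inflated CDF stays below $1-\alpha$. The score distribution, horizon, and coverage level are invented for illustration, and the snippet reflects my reading of the procedure rather than the authors' code.

```python
# A toy simulation of an upper-confidence-CDF threshold update under
# semi-bandit feedback (my reading of the procedure, not the authors' code).
import bisect
import numpy as np

rng = np.random.default_rng(0)
T, alpha = 10_000, 0.90          # horizon and desired coverage rate (made up)
delta = 1.0 / T**2               # failure probability, as quoted in the thread
scores = rng.normal(size=T)      # stand-in for the conformity scores f(x_t, y_t^*)

tau = -np.inf                    # tau_1 = -infinity: start with the full prediction set
observed = []                    # sorted scores revealed under semi-bandit feedback
n_censored = 0                   # rounds whose label was missed (we only learn score < tau)
covered = 0

for t in range(1, T + 1):
    s = scores[t - 1]
    if s >= tau:                             # label in the prediction set -> score revealed
        bisect.insort(observed, s)
        covered += 1
    else:                                    # label missed: we only learn that s < tau
        n_censored += 1

    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * t))   # DKW-style confidence width

    # Largest number of observed scores that may lie below the new threshold while
    # keeping the inflated CDF <= 1 - alpha. Censored scores lie below the current
    # threshold, so at any candidate >= tau they always count toward the CDF.
    k = int(np.floor((1.0 - alpha - eps) * t)) - n_censored
    if 0 <= k < len(observed):
        tau = max(tau, observed[k])          # ties/interpolation ignored in this sketch

print(f"final threshold: {tau:.3f}  (empirical coverage: {covered / T:.3f})")
```

In this simulation the threshold stays conservative (coverage above the target) and creeps upward as the confidence width shrinks, matching the qualitative behavior described above.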

Questions for Authors

See the Theoretical Claims above.

Claims and Evidence

The claims made in the submission are well-supported by both theoretical proofs and empirical results. The authors provide clear evidence for the effectiveness of their algorithm in maintaining the desired coverage rate, achieving sublinear regret, and ensuring zero undercoverage. The empirical evaluation is thorough, and the theoretical analysis is rigorous, making the claims convincing and credible.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-suited for the problem of online conformal prediction with semi-bandit feedback. The algorithm is designed to handle the challenges of this setting, and the evaluation criteria align with the goals of the application. The benchmark datasets and baselines are appropriate and provide a thorough evaluation of the algorithm’s performance.

Theoretical Claims

The inductive step in Lemma 4.2 could be more detailed. Specifically, it would be helpful to explicitly state how the inductive hypothesis is used, which is not entirely clear to me.

Overall, the theoretical analysis is rigorous and supports the claims made in the paper.

Experimental Design and Analysis

The experimental design and analyses are sound and valid. The tasks, datasets, baselines, and metrics are well-chosen and relevant to the problem of online conformal prediction with semi-bandit feedback.

Supplementary Material

Yes, the whole part.

Relation to Existing Literature

The key contributions of the paper are closely related to the broader scientific literature in conformal prediction, online learning, and bandit feedback.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths: Providing the first regret guarantee for online conformal prediction with semi-bandit feedback.

Weaknesses: The writing should be improved and better organized.

Other Comments or Suggestions

See the Theoretical Claims above.

Author Response

We thank the reviewer for the thoughtful comments. We address the questions and major concerns below:

Concern 1: Lemma 4.2

We apologize for the confusion. A more detailed proof is included below:

Base case: when $t = 1$, we set $\tau_1 = -\infty$. Thus, $\tau_1 \le \tau^*$.

Inductive hypothesis: assume $\tau_k \le \tau^*$ for some $k \ge 1$.

We prove $\tau_{k+1} \le \tau^*$ under our assumption $\sup_{\tau \in \mathbb{R}}|G_t(\tau) - G_t^*(\tau)| \le \epsilon_t$ for all $t \in [T]$. Note that $\tau_k \le \tau^*$ by the inductive hypothesis, so we have $G_k^*(\tau) = G^*(\tau)$ for all $\tau \ge \tau^*$. Thus, we have

$$|G_k(\tau^*) - G^*(\tau^*)| = |G_k(\tau^*) - G_k^*(\tau^*)| \le \sup_{\tau \in \mathbb{R}}|G_k(\tau) - G_k^*(\tau)| \le \epsilon_k,$$

where the last inequality follows from our assumption.

Thus, we have $\overline{G}_k(\tau^*) = G_k(\tau^*) + \epsilon_k \ge G^*(\tau^*) = 1 - \alpha$. Recall we define

$$\tau_{1-\alpha, k} = \sup\{\tau \in \mathbb{R} \mid \overline{G}_k(\tau) \le 1 - \alpha\};$$

thus, we have $\tau_{1-\alpha, k} \le \tau^*$. Since $\tau_{k+1} = \max\{\tau_{1-\alpha, k}, \tau_k\}$ with $\tau_{1-\alpha, k} \le \tau^*$ and $\tau_k \le \tau^*$, we have $\tau_{k+1} \le \tau^*$.

We are happy to further elaborate on any portion of the proof.

Review
Rating: 3

The paper proposes a novel stochastic online conformal prediction algorithm designed for constructing prediction sets under semi-bandit feedback. It addresses the challenge of sequential decision-making where feedback is limited to partial information (e.g., only observing certain bids or labels). The algorithm ensures desired coverage rates while achieving sublinear regret and maintaining zero undercoverage count. The authors demonstrate the algorithm's effectiveness across three tasks: image classification (ImageNet), document retrieval (SQuAD), and second-price auctions. The key contributions include a new approach to conformal prediction that is robust to semi-bandit feedback and does not require hyperparameter tuning. Experimental results show that the proposed method outperforms existing strategies in terms of regret, coverage rate, and undercoverage.

Questions for Authors

How does your proposed algorithm handle extreme cases where the semi-bandit feedback is highly noisy or sparse? A more detailed explanation of its robustness in such scenarios would clarify its practical applicability.

Can you provide a more in-depth explanation of the trade-off between computational efficiency and the desired coverage rate in your method, especially when dealing with large-scale datasets?

Claims and Evidence

Yes, the claims made in the submission are supported by convincing evidence. The paper provides experimental results across multiple tasks (ImageNet, SQuAD, second-price auctions) demonstrating that the proposed algorithm outperforms existing methods (e.g., ACI, greedy, and DLR) in terms of cumulative regret, coverage rate, and undercoverage count. The results are backed by quantitative metrics and comparisons, including regret, coverage rate, and undercoverage count, which show the algorithm’s superior performance.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria are well-suited for the problem. The use of benchmark datasets like ImageNet, SQuAD, and second-price auctions aligns with the tasks being evaluated (image classification, document retrieval, and auction reservation price prediction). The evaluation criteria, including cumulative regret, coverage rate, and undercoverage count, effectively measure the algorithm's performance in achieving reliable prediction sets under stochastic semi-bandit feedback.

Theoretical Claims

At a high level, the proofs presented, particularly the bounds on cumulative regret and coverage rate, appear mathematically sound based on the assumptions and lemmas provided, but I want to acknowledge that I did not check every detail of the proofs for correctness.

Experimental Design and Analysis

Based on the description, the experimental setup appears appropriate for evaluating the proposed method, with tasks like image classification, document retrieval, and second-price auction reservation price prediction used to assess the algorithm's performance across different domains. The metrics (cumulative regret, coverage rate, and undercoverage count) are well-suited for the problem.

Supplementary Material

No.

Relation to Existing Literature

The paper's key contributions—developing a stochastic online conformal prediction algorithm with semi-bandit feedback—build upon and extend prior work in conformal prediction, online learning, and semi-bandit feedback. Specifically, it relates to prior methods like Adaptive Conformal Inference (ACI), Decaying Learning Rate (DLR), and greedy strategies, highlighting the limitations of these approaches (e.g., failure to maintain coverage rate and suboptimal regret). The paper builds on existing work in these areas by proposing a method that guarantees coverage rate, sublinear regret, and zero undercoverage, addressing issues like biased updates and slow convergence in existing algorithms. This contributes to improving prediction set reliability and uncertainty quantification in various machine learning tasks.

Missing Important References

The paper sufficiently cites foundational works such as ACI, DLR, and greedy algorithms, but it could benefit from further discussion of more recent advancements in conformal prediction and semi-bandit feedback. Specifically, recent work on the adaptation of conformal prediction methods to deep learning models (such as those applied to large-scale image or text datasets) could further strengthen the background context.

Other Strengths and Weaknesses

Strengths:

  • The paper presents an original approach by combining conformal prediction with semi-bandit feedback, addressing a gap in the existing literature.
  • It demonstrates a practical application to real-world tasks (image classification, document retrieval, second-price auctions) which adds significant relevance.
  • The experimental results are thorough, showing that the proposed method outperforms several baselines, particularly in maintaining coverage and achieving sublinear regret.

Weaknesses:

  • While the proposed method is effective, the paper could benefit from additional comparative analysis against a broader set of methods or more challenging scenarios.

Other Comments or Suggestions

  • The paper would benefit from a clearer explanation of the key assumptions and how they relate to the experimental setup.
  • It would be useful to provide a more detailed discussion on the limitations of the proposed method, especially in edge cases or different types of feedback.

Author Response

We thank the reviewer for the thoughtful comments. We address the questions and major concerns below:

Concern 1: Additional Comparative Analysis

We appreciate the reviewer’s suggestion. In both our discussions and experiments, we have considered a comprehensive set of popular online conformal prediction algorithms from recent literature that operate in similar contexts, and we demonstrate that our algorithm outperforms them in the semi-bandit setting. We believe these scenarios already demonstrate challenging and practical applications of our techniques, but we are happy to consider additional scenarios the reviewer might have in mind.

Concern 2: Key Assumptions

Assumption 2.1 rules out cases where the reward varies substantially in regions where the CDF $G^*$ is very flat. In such regions, the number of observed samples is typically small. Therefore, if $\tau^*$ lies in a flat region of $G^*$ and the reward fluctuates significantly, the accrued regret can be large. This type of regularity condition is common in regret analyses within the bandit literature, where some assumptions on the reward mechanism are required to ensure meaningful guarantees (e.g., see [1]).

The bounded reward assumption in Assumption 2.2 typically holds in practice. First, since the CDF $G^*$ is a bounded function, it is natural to assume that the reward $\phi(\tau) = \psi(G^*(\tau))$ is also bounded. Alternatively, if we consider $\phi$ as representing the prediction set size, the assumption still holds because prediction set sizes are inherently bounded. The bounded reward condition is also standard in regret analyses (e.g., see [2]).

Next, we show that Assumption 2.3 can be removed with some additional technical work. We begin by redefining $\tau^*$ as $\tau^* = \sup\{\tau: G^*(\tau) \le 1 - \alpha\}$, and let $1 - \alpha' = G^*(\tau^*)$. The proof then proceeds almost identically as before, with $1 - \alpha'$ substituted for $1 - \alpha$. The only complication is that $\tau_t$ is still computed using $\alpha$. However, since the DKW inequality applies to arbitrary CDFs, we can still show that $\tau_t \le \tau^*$ with probability at least $1 - 1/T^2$. As a result, we obtain that $G(\tau_t) \to 1 - \alpha'$, i.e., our algorithm maintains and converges to an $\alpha'$-coverage rate. With these modifications, our regret guarantee continues to hold.

In the experiments section, we choose a loss function (lines 374–377, right column) that satisfies Assumptions 2.1 and 2.2. Since relaxing Assumption 2.3 does not impact our regret or coverage guarantees, we use the observed empirical distribution of scores as the ground-truth distribution.

[1] Kleinberg, Robert, Aleksandrs Slivkins, and Eli Upfal. "Multi-armed bandits in metric spaces." In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 681-690. 2008.
[2] Auer, P. "Finite-time Analysis of the Multiarmed Bandit Problem." (2002).

Concern 3: Limitations

One limitation of our algorithm is that it currently operates under the i.i.d. assumption, whereas several existing online conformal prediction approaches are designed to work in the adversarial setting; however, these existing techniques cannot handle semi-bandit feedback. With respect to different feedback mechanisms, our algorithm also functions in the full-information feedback setting without modification; however, in this case, there may be more effective strategies than our algorithm. Finally, one might also consider bandit (instead of semi-bandit) feedback, but it is not clear what that would look like in the conformal prediction setting. We are happy to add a discussion to our paper.

Question 1: Noisy and Sparse Labels

In general, conformal prediction naturally handles label noise, since the ground truth label $y^*$ is assumed to be a random variable even conditioned on $x$. In terms of "extreme cases", our coverage and regret guarantees still hold, but the prediction sets may be large. Regarding sparsity, we note our conformal prediction setting is special, and our semi-bandit feedback cannot be sparse: we only fail to observe the true label when it is miscovered, which happens with probability at most $\alpha$. We are happy to add a discussion to our paper.

Question 2: Computation Efficiency

As our algorithm only needs access to the empirical score distributions, the computational efficiency does not depend on the choice of the coverage rate $\alpha$. For the same reason, the scaling of our algorithm does not depend on the data dimensionality except through how long it takes to evaluate the predictive model. Since it is an online algorithm, its computation time scales linearly in the size of the dataset (with constant compute per iteration) and uses constant memory. These scalability properties are similar to existing conformal prediction algorithms. We are happy to add a discussion to our paper.

Review
Rating: 3

This paper introduces an online conformal prediction method with semi-bandit feedback (i.e., we only observe the true label if it is contained in the prediction set). The authors show that, under the iid setting, their method controls the expected cumulative regret and that $\tau_t$ (the thresholding value at time $t$) converges to $\tau^*$, the optimal value, as $t$ goes to infinity. More importantly, they show that $\tau_t \leq \tau^*$ for all $t$ with high probability. They finally evaluate the algorithm on several tasks to empirically show that it achieves good performance compared to several baselines.

Questions for Authors

1\ My first concern is about the paper [1]. It seems to me that because you assume that the data are iid, the method of [1] is better, but maybe I am wrong. Can you elaborate on this?

2\ Throughout the paper, $f$ is fixed at the beginning. Do you think it is possible to “relearn” the score function as new points are observed?

3\ Line 213: "As a consequence, our algorithm converges to the true $\tau^*$". Why? Can you elaborate on this? Furthermore, do you think that it is possible to obtain a rate of convergence?

4\ In general, papers in online CP control the FCP. Do you think it could be possible to have such a result here? (Perhaps it is trivial due to the fact that $\tau_t \leq \tau^*$.)

5\ Can you give some important examples of $\phi$?

6\ In the experiments, it seems that the method is relatively conservative, with coverage around 0.95 even with $T=10000$ points. Can you elaborate on this?

7\ Minor question: In the online literature, they also control dynamic regret. Do you think it is possible to do that here?

Claims and Evidence

Yes, the claims are clear. However, in my opinion, the iid assumption is an important one (especially in online CP) and should appear in the theorem.

Methods and Evaluation Criteria

Yes, the evaluation makes sense.

Theoretical Claims

Yes, I partially checked the proofs and they seem to be fine.

Experimental Design and Analysis

Yes, the experimental design seems correct to me.

Supplementary Material

.

Relation to Existing Literature

The article assumes an iid framework, which is a bit different from previous articles on online CP. Furthermore, my main concern on this point is that I don't understand how this article compares to [1]. My first impression is that [1] does the same thing as this article but better (I think the semi-bandit feedback can be easily handled by [1]'s technique). But maybe I am wrong and I would like to hear the authors' opinion on this point.

[1] “On-Line Confidence Machines Are Well-Calibrated” by Vladimir Vovk.

Missing Important References

No, everything is OK (except maybe [1]; see previous point).

Other Strengths and Weaknesses

Strengths:

1\ The subject is of huge interest for the CP community.

2\ The paper is well written and quite easy to follow.

Weaknesses:

1\ The paper only considers the iid setting.

2\ The $f$ function is not time-dependent. In a real-life scenario, we probably want to use the new points to improve our $f$ function.

3\ This point could also be a strength (depending on your point of view), but the method is in the end an application of the DKW inequality (reason why I think that [1] is better -- but I could be wrong).

4\ The figures are not very easy to read. Maybe a logarithmic scale or something like that would be better.

5\ The code is not provided.

Other Comments or Suggestions

Minor:

1\ Line 75: "Angelopoulos et al. (2024b) points out that ACI still works if we take the “quantile function” to be the identity function ..." I do not understand this remark. Do you mean that the update in ACI can be made on $q_t$ and not on $\alpha_t$?

2\ Line 84: there is a problem in the sentence "we show that a standard variation the scoring function".

3\ Line 104: "convergence of $\tau_t$ to $\tau^*$". They are never defined before this paragraph. Therefore, on first reading, we cannot understand this statement. Overall, the text sometimes refers to mathematical objects that are not yet defined.

4\ Line 433: "hyperparamters"

Author Response

Thank you for the thoughtful comments; we address major concerns below.

Concern 1: Comparison to [1]

By our understanding, the algorithm in [1] cannot be straightforwardly modified to handle semi-bandit feedback. Specifically, they use a strangeness measure $A_t:(z_1,...,z_t)\mapsto(\alpha_1,...,\alpha_t)$ (Eq. (8) in [1]) to construct prediction sets, where $z_i=(x_i,y_i)$ is a calibration example; e.g.,
$$\alpha_i=\frac{\min_{j\neq i:y_j=y_i}d(x_i,x_j)}{\min_{j\neq i:y_j\neq y_i}d(x_i, x_j)}.$$

Clearly, computing $\alpha_i$ requires the true label $y_i$. Next, given a new input $x_t$, they return the prediction set of all labels $y$ such that $\alpha$ satisfies
$$\frac{\#\{i=1,...,t:\alpha_i\ge\alpha\}}{t} \ge\delta,$$
where
$$(\alpha_1,...,\alpha_{t-1},\alpha)=A_t((x_1,y_1),...,(x_{t-1},y_{t-1}),(x_t,y))$$
is the strangeness measure assuming the true label for $x_t$ is $y$ (Eqs. (9) to (11) in [1]). Thus, to construct $\alpha$, they require $\alpha_i$ for all $i$, so all true labels $y_i$ are required. This requirement persists in their modifications (Eqs. (16), (20)–(21) in [1]).

Our main contribution is precisely to handle semi-bandit feedback. We will add a discussion.

Concern 2: i.i.d.

We believe the i.i.d. assumption is reasonable in our active learning setting. We agree that extending our work to the adversarial setting is an important direction, but it is beyond the scope of our paper.

Concern 3: Re-learning $f$

We believe there are practical settings where training is decoupled from calibration, especially when training is costly (e.g., LLMs). One of our applications involves using a pretrained document retriever in a new domain where fine-tuning may be infeasible. However, we agree that extending to settings where the model $f$ is updated is important and will add a discussion.

Concern 4: Simple use of DKW

As discussed above, we do not believe [1] can easily be adapted to semi-bandit feedback. Also, we emphasize that a naïve application of DKW is insufficient; we have introduced additional novel techniques to handle semi-bandit feedback.

Concern 5: Image Readability

We have updated Figure 1 to use a log scale:
https://imgur.com/a/j05QwzS

Q3: Line 213, Convergence of $\tau_t$

Our convergence guarantee is based on the fact that the empirical CDF converges to the true CDF, i.e., $\overline{G}_t$ converges to $G^*$ for $\tau\in[\tau^*,+\infty)$. For intuition, consider full feedback. Let $G'_t$ denote the empirical CDF. Similar to Eq. (4), define
$$\overline{G}'_t(\tau)=G'_t(\tau)+\epsilon_t,$$
where $\epsilon_t=\sqrt{\log(2/\delta)/2t}$, and choose
$$\tau_t=\sup\{\tau\in\mathbb{R}\mid\overline{G}'_{t-1}(\tau)\le1-\alpha\}.$$
By DKW, $\overline{G}'_t(\tau^*)$ converges to $G^*(\tau^*)$, so $\tau_t$ converges to $\tau^*$. The convergence rate $O(t^{-1/2})$ comes from DKW.
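For completeness, the width $\epsilon_t$ above is the standard choice obtained from the DKW inequality (with Massart's constant): for an empirical CDF $G'_t$ built from $t$ i.i.d. samples with true CDF $G^*$,
$$\mathbb{P}\Big[\sup_{\tau\in\mathbb{R}}|G'_t(\tau)-G^*(\tau)|>\epsilon\Big]\le 2e^{-2t\epsilon^2},$$
and setting the right-hand side equal to $\delta$ and solving for $\epsilon$ yields $\epsilon_t=\sqrt{\log(2/\delta)/2t}$.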

With semi-bandit feedback, convergence is not guaranteed a priori. However, Lemmas 4.1 and 4.2 show our CDF upper bound $\overline{G}_t(\tau^*)$ still converges to $G^*(\tau^*)$. Thus, $\tau_t$ converges to $\tau^*$, with the same convergence rate.

Q4: FCP

Apologies, we are not sure what "FCP" means. The false negative rate (FNR) is controlled by the coverage guarantee (assuming a false negative is a miscovered example). We are not aware of an analog of the false positive rate (FPR) in conformal prediction. One guarantee of existing conformal prediction algorithms is that the coverage rate is also not too much higher than desired, i.e., they are not overly conservative. Our regret guarantee provides such a bound. Consider the loss $\phi(\tau)=1-\min\{G^*(\tau),G^*(\tau^*)\}$ (the min ensures the step-wise regret is non-negative). On the event $\tau_t\le\tau^*$ for all $t$ (which holds with high probability by Lemma 4.2), the step-wise regret is
$$r_t=\phi(\tau_t)-\phi(\tau^*)=G^*(\tau^*)-G^*(\tau_t)=\mathbb{P}[\tau_t\le f(x_t,y_t^*)\le\tau^*],$$
i.e., $r_t$ is the overcoverage probability that we cover $y_t^*$ but the oracle does not. By Theorem 3.1, $r_t=O(\sqrt{\log t/t})$, so the overcoverage probability converges to zero. We will add a discussion.

Q5: Examples of $\phi$

Note that Algorithm 1 does not depend on $\phi$; it is introduced solely to evaluate the algorithm's performance. The example in our experiments is a practical choice; the overcoverage rate above is another.

Q6: Conservativeness

This is a consequence of our PAC guarantee and of the semi-bandit feedback. Our algorithm ensures $\alpha$-coverage at every $t$, which implies $\tau_t\le\tau^*$ for all $t$. However, as shown in our regret analysis, $\tau_t$ is guaranteed to converge to $\tau^*$, with the convergence rate given by DKW.

Q7: Dynamic Regret

Since we consider the i.i.d. setting, the optimal policy is static (i.e., $\tau_t=\tau^*$), so our regret equals dynamic regret. We believe extensions to the adversarial setting (where static and dynamic regret are distinct) are beyond the scope of our paper.

Final Decision

This paper addresses online conformal prediction with semi-bandit feedback, where calibration is performed in an online setting and feedback is observed only when the true label is contained in the prediction set. A novel conformal prediction algorithm is presented for this setting and is shown to achieve sublinear regret. Empirical evaluation demonstrates its validity on image classification, document retrieval, and second-price auction reservation price prediction. All reviewers feel that online conformal prediction under the semi-bandit feedback setting is original and that the method is well supported by both theory and experiments. Although there were some concerns about clarity and writing, the author response resolved them to some extent.