Localized Adaptive Risk Control
A novel scheme for online calibration that offers both worst-case deterministic long-term risk control and statistical localized risk guarantees.
Abstract
Reviews and Discussion
This paper proposes a novel localized adaptive risk control algorithm that provides not only average-case risk guarantees but also worst-case guarantees. Simulations in several different applications are provided, demonstrating improved performance compared with adaptive risk control.
Strengths
The paper is very well written. The illustrative figures are very helpful in explaining the localization effects of different kernel choices.
The problem is also well motivated: the worst case risk guarantee is indeed very important in applications such as medical imaging.
The proposed algorithm is novel.
Weaknesses
- My major concern is with the simulation results. Though I appreciate the improved performance compared with ARC in different settings, the paper does not provide any comparisons with state-of-the-art approaches in these applications. For example, for electricity demand prediction, the SOTA methods are based on quantile regression, and many published papers try to provide better demand predictions. How does the proposed L-ARC compare with these SOTA methods? If L-ARC does not outperform the existing methods, what are the benefits of using it? Similarly, tumor image segmentation is a standard task in medical imaging, so it would be interesting to compare L-ARC with the SOTA methods in that field too.
- It seems that the worst-case guarantee is only provided for i.i.d. data; is that correct? If so, what is the challenge of obtaining a worst-case guarantee for an arbitrary data sequence?
In addition, the paper mentions that the major improvement of this paper over [Angelopoulos et al., 2024b] is the online setting. However, i.i.d. data is usually obtained by sampling uniformly from an offline data set. So what is the significance of the theoretical results if the worst-case guarantee can only be provided for the i.i.d. case?
Questions
See above
Limitations
NA
We thank the reviewer for the insightful comments. Below, we address each comment point by point:
- The focus of this work is on producing calibrated predictions, i.e., predictions with risk control guarantees, rather than on providing more accurate predictions. To this end, we compare L-ARC against standard ARC with decaying step sizes [Angelopoulos et al. 2024] (presented at ICML 2024), which, to the best of our knowledge, is the state-of-the-art online calibration scheme that offers statistical guarantees.
- The worst-case guarantee in Section 2.3.2 is valid for any sequence, not necessarily i.i.d. The statistical localized guarantee in Section 2.3.1 is obtained under the same data model considered in [Angelopoulos et al. 2024b].
- The worst-case deterministic guarantee (Theorem 2) is valid for any choice of data sequence, even an adversarial one. Moreover, if the data is generated in an i.i.d. fashion, such as by sampling with replacement from a dataset or through online interaction with a stationary environment, L-ARC also provides the statistical guarantee stated in Theorem 1.
Thanks for your responses. I will keep my score.
Thank you for considering our responses and for your prompt reply! Given the overall positive assessment of the paper and the comments, we would appreciate it if you could elaborate on what would have been necessary to achieve a higher score.
This paper addresses the design and analysis of the localized version of adaptive risk control (L-ARC). In the first section, the problem of classical ARC is nicely introduced, showing the threshold updating mechanism and the convergence analysis of the resulting loss. Then, the problem setting, design, and analysis of ARC are naturally generalized to those of L-ARC. Finally, L-ARC is applied to three practical problems of electricity demand forecasting, tumor segmentation, and beam selection. The result of the experiments supports the usefulness of L-ARC.
Strengths
The application scope of L-ARC is broad, and it can be utilized in various application examples. The reviewer believes L-ARC has high potential for broad impact. The update rules (17) and (18) developed for L-ARC in this paper are novel. The theoretical guarantees, in particular the convergence analysis of the localized loss, are also given under standard assumptions.
Weaknesses
Both Theorems 1 and 2 state that the localized loss is bounded by \kappa * B, i.e., the product of the maximum value of the kernel and an upper bound on the loss. In the reviewer's understanding, the theorems are technically correct, but the reviewer cannot see their value. They seem to derive only an obvious upper bound from the assumptions.
Questions
Related to Theorems 1 and 2, could the authors comment on the value of this upper bound on the localized loss?
L-ARC updates the threshold function, denoted by g_t. The setting seems restrictive. Could the authors extend it to the case of multiple prediction sets? In other words, multiple scoring functions s_1, s_2, ... would be considered, and their thresholds adaptively updated in a manner similar to (17) and (18).
Limitations
The limitation of L-ARC is stated in Section 5: it requires memory for storing the input data.
We thank the reviewer for the insightful comments. Below, we address each comment point by point:
- The terms in the bound incorporate factors that relate the algorithm's guarantees to domain-dependent quantities, such as the upper bound B on the loss and the maximum value \kappa of the kernel. These quantities naturally appear in kernel-based algorithms [Kivinen et al. 2004, Gibbs et al. 2023]. The values of these terms depend on the specific problem. For example, B = 1 for the miscoverage loss and the FNR loss (experiment in Section 3.2); for the SNR regret in the experiment in Section 3.3, B is given by the maximum regret value; and for the electricity forecast, B is given by the maximum value of the label variable. The value of \kappa is a positive quantity controlled by the designer; it can be chosen arbitrarily, and it determines the level of localization of L-ARC.
- L-ARC can be generalized to multiple prediction sets and experts. We thank the reviewer for this suggestion, and note that this direction can be pursued by combining the results of L-ARC with those from the recent work “Improved Online Conformal Prediction via Strongly Adaptive Online Learning” by Bhatnagar et al. This paper focuses on localization in time using multiple experts active at different time instants. Combining this approach with L-ARC would provide a calibration method with localized guarantees both in the input space and in time. We will include this interesting research direction in our future work.
Thanks for the reply comments. They convinced the reviewer.
Thank you for taking the time to review our paper and for your useful comments!
The paper introduces Localized Adaptive Risk Control (L-ARC), an enhancement of Adaptive Risk Control (ARC). L-ARC updates a threshold function within a Reproducing Kernel Hilbert Space (RKHS) to provide localized risk guarantees while maintaining ARC's worst-case performance. Experiments show that L-ARC improves fairness across different data subpopulations in tasks like image segmentation.
Strengths
The introduction of Localized Adaptive Risk Control (L-ARC) is a significant advancement over traditional ARC. By focusing on localized risk guarantees, L-ARC addresses the critical issue of uneven risk distribution across different data subpopulations, which is a well-known limitation of ARC.
Weaknesses
- The introduction of a threshold function within an RKHS and the associated online adaptation process may complicate the implementation. Practitioners might find it challenging to understand and apply the method without a significant background in RKHS and online learning algorithms.
- The paper primarily compares L-ARC with traditional ARC. Including comparisons with other state-of-the-art risk control or calibration methods would provide a more comprehensive evaluation of L-ARC's strengths and weaknesses.
Questions
Refer to Weaknesses.
Limitations
N/A
We thank the reviewer for the insightful comments. Below, we address each comment point by point:
- We aim to ease the implementation of the algorithm by providing code to reproduce the experiments. In the revised manuscript, we will also clarify the connection with online learning algorithms, enabling practitioners to leverage this connection when implementing L-ARC.
- We compare L-ARC against ARC with decaying step sizes [Angelopoulos et al. 2024] (presented at ICML 2024), which, to the best of our knowledge, is the only online calibration scheme that offers statistical guarantees. Another recent algorithm, “Improved Online Conformal Prediction via Strongly Adaptive Online Learning” by Bhatnagar et al., proposes localizing predictions in time rather than in the input space. Since the two approaches are complementary, a direct comparison may not be particularly insightful. Nonetheless, L-ARC can be combined with this approach to achieve both time and input-space localization. We consider this an interesting direction for future work.
Thanks for the response. I will keep my rating.
Thank you for taking the time to review our paper! We hope that the above has addressed your concerns. If not, please let us know.
This paper introduces an online calibration method to enhance the Adaptive Risk Control (ARC) framework. ARC traditionally adjusts prediction sets based on a scalar threshold to ensure long-term risk control and marginal coverage guarantees. However, as mentioned in the paper, it may unevenly distribute risk guarantees across different subpopulations. L-ARC addresses this by updating a threshold function within a reproducing kernel Hilbert space (RKHS) to provide localized statistical risk guarantees, ranging from conditional to marginal risk, while preserving the worst-case performance. My additional comments are as follows.
Strengths
- The paper is easy to read. It could have been better structured, but it is okay.
- The authors have proposed a new technique for calibration and provided theoretical and experimental results.
Weaknesses
- The concept of the input space is not clearly explained in the introduction.
- Is the pre-trained model trained on all the data sources that will be used during the calibration process, if they differ?
- Is the loss function in (2) convex? And what is the intuitive explanation for why (6) should hold? Does it somehow follow from the regret analysis of the online gradient descent algorithm, which converges at the same rate?
- What is the crucial reason for using an RKHS function instead of a scalar? How is this useful?
- Ideally, the introduction of the RKHS should have helped with the rates, but the rates in (12) and (13) appear worse than the rates in (9) and (6); in what sense, then, is localized ARC better?
- On the other hand, keeping track of a function g_t in an RKHS for each t is far more expensive in practice than keeping track of a scalar. It is not clear why the proposed method makes sense.
- What is the reason for considering the functional form in (15)?
- How are the updates in (17)-(21) derived? Equation (19) is fine, as it is the standard representation of a function in an RKHS.
- As mentioned before, the additional challenge with the RKHS is that we now need to store a^i_t for i = 1 to t because we need to update them, and for large T this could be problematic in practice. And if the dictionary storing them is capped at a constant size, additional regret is incurred in the final bounds.
- The experimental results are weak. They are not descriptive enough to convey the advantages of the proposed technique over existing methods. The long-term coverage plots look almost identical, so where does the benefit shown in the other figures come from? Also, the authors should report the downside of the proposed RKHS-based approach, namely that it requires more memory to implement than ARC does. So the comparisons are also not fair.
Questions
Please check the weaknesses.
Limitations
N/A
We thank the reviewer for the insightful comments. Below, we address each comment point by point:
- The input space of the prediction model is commonly referred to as the feature space in the machine learning literature. We will be happy to clarify the terminology in the revised manuscript.
- As typically assumed in the conformal prediction literature [Angelopoulos et al. 2022, Angelopoulos et al. 2024, Feldman et al. 2022], L-ARC is agnostic to the data distribution used to pre-train the prediction model. This means that L-ARC does not assume that the model is trained on the same population used for calibration and testing. We will clarify this point in the revised manuscript.
- The loss function in (2) is not necessarily convex; for example, it could be the miscoverage loss, which is non-convex. The assumptions on the properties of the loss function used to prove Theorems 1 and 2 are stated in Assumptions 3, 4, and 5. As referenced in the main text, the result reviewed in (6) is known from conformal inference [Angelopoulos et al. 2024, Gibbs and Candès 2021]. We will specify that the loss in (2) is not necessarily convex in the revised manuscript.
- As mentioned in the introduction and shown in Figure 1, using a single scalar threshold, “ARC may distribute such guarantees unevenly in the input space, favoring a subpopulation of inputs at the detriment of another subpopulation.” In contrast, as described in Section 1.3, L-ARC uses a threshold function in an RKHS, allowing it to “produce prediction sets with localized statistical risk control guarantees as in (9), while also retaining the worst-case deterministic long-term guarantees (6) of ARC.” As demonstrated in the experiments, which encompass both standard [Angelopoulos et al. 2022, Angelopoulos et al. 2024] and new benchmarks, L-ARC not only retains the long-term coverage of ARC but also provides fairer and more homogeneous risk control across different subpopulations of the input space, which is an essential requirement for many applications.
- As mentioned at the beginning of Section 1.2, ARC does not offer any form of localized risk guarantee. As elaborated in Section 1.3, by replacing the scalar threshold with a member of the RKHS, L-ARC can instead provide localized risk control, a more general statistical guarantee. As stated in the bullet points in Section 1.3, the convergence results are the same as those for ARC, with an additional term that depends on the level of localization of the kernel function. This function dictates the degree of localization of the guarantee, as explained in Figure 2. These guarantees recover (6)-(7) when a non-localized threshold is used.
- As mentioned in the conclusion, memory efficiency is the main drawback of using localized thresholds, which motivates future work. However, the benefit of localized risk control may justify the increased computational and memory cost in some applications for which fairness is critical. While we have decided to leave the derivation and analysis of a memory-efficient version of L-ARC for future work, the attached PDF file demonstrates how to obtain a simple variant of L-ARC with constant memory requirements that still exhibits improved localized risk control. We plan to include these new empirical results in the supplementary material for the revised manuscript.
- The functional form (15) highlights how the current approach generalizes ARC. The proposed threshold function combines a scalar component, as in ARC, with a varying component from an RKHS that makes it possible to localize the risk control guarantee. We will elucidate the relation to ARC and the functional representation in the revised text.
- Equations (17)-(21) can be interpreted as the update rules of (regularized) online gradient descent. In the revised manuscript, we will refer to the work of [Kivinen et al., 2004] for a derivation.
- See the previous point and the attached PDF for a solution to this problem with constant, rather than linear, memory requirements.
- In our paper, we provide experimental results that include standard and recent benchmarks [Angelopoulos et al. 2024b, Angelopoulos et al. 2022b], as well as a new beam selection problem with engineering significance. Across this variety of experiments, as the reviewer notes, L-ARC not only enjoys the same worst-case deterministic guarantees as ARC (the curves overlap), but also substantially improves coverage across different subpopulations of the data, as shown in the right panels of Figures 3, 4, and 5. The price to pay for localized risk control and increased fairness is an increased memory requirement. This cost is justified in all applications for which fairness and conditional coverage are necessary, as in the proposed examples. To address the memory requirement, the attached PDF provides a variant of L-ARC that allows the memory and computational requirements to be controlled and reduced. We will be happy to include these additional experiments in the supplementary material of the paper.
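The responses above describe the updates (17)-(21) as regularized online gradient descent over kernels in the style of [Kivinen et al., 2004], with an optional truncation trick for constant memory. The following is a minimal Python sketch of that style of update applied to online threshold calibration; the names and constants (`rbf`, `eta`, `gamma`, `alpha`, `budget`) and the exact sign conventions are illustrative assumptions, not the paper's precise equations.

```python
import numpy as np

def rbf(x, z, scale=1.0):
    """RBF kernel; the scale parameter controls the degree of localization."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2 * scale ** 2)))

class LocalizedThreshold:
    """Sketch of a NORMA-style online kernel update for a localized threshold."""

    def __init__(self, eta=0.1, gamma=0.01, alpha=0.1, scale=1.0, budget=None):
        self.eta, self.gamma, self.alpha, self.scale = eta, gamma, alpha, scale
        self.budget = budget          # optional cap on stored coefficients (constant memory)
        self.lam = 0.0                # scalar, ARC-like component of the threshold
        self.centers, self.coefs = [], []

    def threshold(self, x):
        # Threshold = scalar part + kernel expansion over past inputs.
        g = sum(a * rbf(x, z, self.scale) for a, z in zip(self.coefs, self.centers))
        return self.lam + g

    def update(self, x, loss):
        # err > 0 when the realized loss exceeds the target risk alpha,
        # which pushes the threshold up (larger prediction sets) near x.
        err = loss - self.alpha
        # Shrink existing RKHS coefficients (regularization step).
        self.coefs = [(1 - self.eta * self.gamma) * a for a in self.coefs]
        self.centers.append(np.asarray(x))
        self.coefs.append(self.eta * err)
        self.lam += self.eta * err
        if self.budget is not None and len(self.centers) > self.budget:
            # Drop the oldest coefficient to keep memory constant.
            self.centers.pop(0)
            self.coefs.pop(0)
```

Repeated updates at an input where the loss exceeds the target raise the threshold locally around that input while leaving distant regions mostly governed by the scalar component, which is the localization effect discussed in the thread.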
Thank you for your response. However, I remain unclear about the claim that the non-sublinear regret term in equation (13) is beneficial. Additionally, addressing the tradeoff between finite memory size and regret performance seems crucial for this paper, rather than leaving it to future work. Without accounting for finite memory in the theoretical analysis, the practical applicability of the proposed approach is limited, and without establishing this connection theoretically, I am not sure the theoretical contributions of this work are strong enough.
While the paper includes a set of experimental results, the key takeaways are not immediately clear. The presentation needs significant improvement to facilitate understanding. For example, the reduction in average miscoverage error in Figure 3b is not sufficiently explained, leaving uncertainty about its adequacy. Similarly, the results for beam selection in Figure 5 are difficult to interpret, and the benefit is unclear. Although the authors mention that L-ARC requires less time, the actual time measurements are not reported in the figures, and the percentage gains remain unclear.
Overall, the paper shows promise, but I cannot recommend it for acceptance in its current form. To acknowledge the efforts, I will slightly increase the scores, but major revisions are necessary before this paper can be considered for publication. I encourage the authors to continue refining and improving the paper.
Thank you very much for considering our response and for your valuable reply!
We never claimed that the additional term in (13) is beneficial; rather, this term is a consequence of using an RKHS threshold instead of a constant. As elaborated in the text, it gauges the impact of RKHS localization on the final bound. The implicit benefit of accepting (13) is that, by allowing a “controllable” suboptimality gap, we can target the statistical localized guarantees in Theorem 1, which are not achievable using other schemes such as ARC. We also note that this suboptimality gap is practically non-existent; as you previously observed, the long-term coverage curves of ARC and L-ARC overlap. Nevertheless, the conditional risk control properties of L-ARC are substantially superior, as highlighted in the experiments.
The theoretical contributions of the paper include the derivation of a novel algorithm for online calibration based on an RKHS, which we prove enjoys both long-run and localized statistical guarantees. To do so, we introduce several novel technical results (see our response to reviewer LXXA). We believe the paper offers substantial theoretical advancements to the field of online calibration. We would be happy to include the memory-efficient variant of L-ARC in the additional material, although we will reserve its analysis for future work, as we believe the current work already provides sufficient contributions.
This paper proposes the Localized Adaptive Risk Control (L-ARC) scheme for learning to perform conformal prediction from online data. In the setting under consideration, the data is potentially non-i.i.d. and the scalar threshold parameter typically used by existing methods in the construction of the prediction set is replaced by a function learned online from the data stream via functional stochastic gradient descent over a reproducing kernel Hilbert space (RKHS). It is shown that the selection of the RKHS -- i.e., the properties of the associated kernel, such as the scale parameter of a radial basis function (RBF) kernel -- yields "localization" of the resulting conformal predictor to the data set in the sense that prediction performance (miscoverage error) remains consistent across distinct subpopulations within the data. This is demonstrated through experiments on electricity forecasting, medical image segmentation, and beam selection problems that illustrate the localization property, while theoretical results characterize the price paid for this localization in terms of looser bounds on suboptimality and convergence rate.
The key algorithmic contribution of the paper is the use of online kernel learning, which is well studied (compare eqs. (19)-(21) in the present submission with (10)-(12) in [Kivinen et al., 2004]), to learn a function that replaces the scalar threshold typically used in conformal prediction. Existing methods for online conformal prediction [Gibbs and Candès, 2021; Bhatnagar et al., 2023; Feldman et al., 2023; Angelopoulos et al., 2024] consider a scalar threshold in the construction of the prediction set, while recent work on conformal prediction with a threshold function [Gibbs et al., 2023] does not directly apply to the online setting. The primary theoretical contributions are upper bounds characterizing the localization effects of the choice of RKHS kernel in terms of a certain notion of suboptimality and the rate of convergence to a neighborhood of optimality, where the size of the neighborhood is shown to depend on the choice of kernel. The experimental results illustrate that L-ARC outperforms the existing, non-localized ARC method [Gibbs and Candès, 2021; Feldman et al., 2022] at achieving the desired level of miscoverage error across distinct subpopulations.
Strengths
While the existence of an online method for conformal prediction over non-i.i.d. data that uses threshold functions appears to be an open problem in the recent conformal prediction literature, the significance and potential utility of taking the threshold function from an RKHS, instead of using a scalar or another class of function, are not immediately obvious. However, the theoretical and experimental results of this paper indicate that a useful notion of "localization", where the choice of RKHS kernel allows reliable conformal prediction on data with distinct subpopulations, results from considering this class of threshold functions -- this is a very interesting and original insight that is likely to draw attention in the conformal prediction community, and is a major strength of the paper. The theoretical results also provide useful insight into the effect of kernel choice in L-ARC, and the main steps in the analysis appear to be sound (I read but did not thoroughly check all details in the appendix). Finally, the experimental results provide strong support for the utility of the localization effect of the kernel choice, without which the theoretical results would lose much of their force and the significance and potential utility of L-ARC would remain unclear.
Weaknesses
The primary weaknesses of this work arise from lack of context with previous work and lack of motivation and discussion of the technical results. These issues make it difficult to accurately judge its significance and contribution. Specifically:
- Important context with previous work is missing. First, the L-ARC method proposed in Sec. 2.2 is essentially an adaptation of online kernel learning (see [Kivinen et al., 2004] and its many citers) to the conformal prediction setting (as mentioned in the summary above, eqs. (19)-(21) in the present submission are very similar to (10)-(12) in [Kivinen et al., 2004]), yet this connection is not mentioned. This information is important for clarifying the limits of the present paper's contribution.
- The relationship between the technical results presented and previous analyses is unclear. In particular, it is unclear from the text (including the proofs in the appendix) what parts of the analysis draw on previous analyses of conformal prediction methods -- if the results are entirely independent and original, this can be highlighted -- and what key technical innovations were required in the theoretical analysis.
- Motivation of the technical results and discussion and clarification of their meaning is generally lacking. As a result, the technical meaning and effect of "localization" characterized in Sec. 2.3, especially in the Assumptions and Theorem 1, remain unclear. Specific questions regarding these issues are included in Questions 3, 4, and 5 below.
Questions
- What are the main technical innovations in the proofs of the main results?
- What parts of the analysis draw on previous work?
- How reasonable are the assumptions presented in Sec. 2.3 and when do they hold (especially Assumptions 4 and 5)?
- What is the significance of the weighting inside the expectation on the LHS of eq. (22), and why is this relaxed, reweighted expectation meaningful?
- What are the role and importance of the weighting function and the corresponding terms containing it in eq. (22)? Its presence seems like an artifact arising from considering the covariate shift in lines 418-419 in the appendix, which might be expected to disappear in the proof of Lemma 1 in Sec. A; why do these terms persist in the statement of Theorem 1, and how do we interpret them?
- The present paper uses the term Adaptive Risk Control (ARC) to refer to the methods proposed in [Gibbs and Candès, 2021] and [Feldman et al., 2022] on lines 18-19, but these works call their methods Adaptive Conformal Inference (ACI) and Rolling Risk Control (Rolling RC), respectively; which of these does ARC refer to, and which is implemented in the experiments?
Limitations
Aside from the issues raised above, the limitations have been adequately addressed.
We thank the reviewer for the insightful comments. Below, we address each comment point by point:
- From a methodological point of view, the main innovation lies in using functions from an RKHS to define prediction sets that are calibrated online and ensure localized risk control. The RKHS function is optimized online using gradient descent steps applied to kernels. As the reviewer noted, this kernel-based optimization was originally studied in [Kivinen et al., 2004], and we apply this framework to online calibration. To this end, we had to develop several technical innovations. In Theorem 1, we analyze the functional derivative of the update and show that the stationary point satisfies the localized risk guarantee (41). This allows us to prove that, under Assumptions 4 and 5, we asymptotically achieve localized coverage. In Theorem 2, we prove that, for the problem of online calibration, the threshold function defined by the RKHS has a bounded infinity norm across all iterations (Proposition 3). This allows us to bound the first term in (76) and prove the worst-case guarantees. All these intermediate results, as well as the final theorems, are novel contributions. In the revised manuscript, we will clarify the connection between the L-ARC update rules and [Kivinen et al., 2004], and emphasize the technical novelties introduced by the proofs.
- The parts of the proof that draw on previous work are in Theorem 1, which leverages, as intermediate steps, the regret bounds (47) and (49) from [Kivinen et al. 2004] and [Cesa-Bianchi et al. 2004, Theorem 2], respectively. These works are referenced in the Appendix.
- Assumptions 2 and 3 are standard in the conformal risk control literature (see [Angelopoulos et al. 2022, Angelopoulos et al. 2024, Feldman et al. 2022]). They are reasonable and easily met in practice, as they state that the risk functional is bounded and that it decreases as the set grows larger, which is true for typical losses. Assumption 4 is a stronger version of Assumption 3, similar to (6) in [Angelopoulos et al. 2024]. Like the previous assumption, it states that if the set size increases, the expected risk is strictly decreasing. Finally, Assumption 5 states that the loss function is left-continuous in the threshold value, and it is automatically satisfied by many popular losses. For all other losses, we note that this assumption can easily be satisfied by replacing ≤ with < in the definition of the set in (14). We plan to make this modification in the definition of the set predictor to remove this assumption.
- As discussed in Section 1.2, the weighting term in the localized risk equation amounts to a shift in the covariate distribution. By proving that the inequality is guaranteed for every shift, we prove that L-ARC provides risk control for all potential distribution shifts in the set. Specifically, inequality (22) states that for any shift it is possible to bound the shifted average risk by a shift-dependent quantity, hence the presence of the corresponding terms. For example, if there is no shift, this quantity equals zero. We thank the reviewer for carefully reading the Appendix and spotting the typo in lines 418-419. The term is in fact a leftover from a previous proof attempt and should be removed in the current version. We plan to correct this typo in the revised manuscript.
- Adaptive Risk Control refers to a generalization of ACI from conformal prediction to risk control, and it is instantiated in our experiments with a decaying step size [Angelopoulos et al. 2024] to guarantee statistical coverage. We chose [Angelopoulos et al. 2024] as a benchmark because it is the state of the art for online calibration and, at the time of writing, the only online calibration scheme with statistical guarantees.
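The response above describes ARC as an ACI-style scalar recursion run with a decaying step size. As a point of reference, the following is a minimal sketch of that kind of recursion; the function name, constants, and sign conventions (`arc_thresholds`, `alpha`, `eta0`, the `1/sqrt(t)` decay) are illustrative assumptions, not the benchmark's exact specification.

```python
import math

def arc_thresholds(losses, alpha=0.1, eta0=0.5):
    """Yield threshold iterates for a stream of realized losses.

    A single scalar threshold lam is nudged by the gap between the
    realized loss and the target risk alpha, with a decaying step
    size eta_t = eta0 / sqrt(t).
    """
    lam = 0.0
    for t, loss in enumerate(losses, start=1):
        eta_t = eta0 / math.sqrt(t)      # decaying step size
        lam += eta_t * (loss - alpha)    # raise lam when risk exceeds alpha
        yield lam
```

With a constant stream of losses above the target, the threshold rises monotonically; L-ARC replaces this single scalar with a threshold function, so the analogous correction is applied locally in the input space rather than globally.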
Thanks to the authors for their response. It has mitigated several of my concerns and I have raised my score accordingly.
We are glad that we were able to address your concerns, and we thank you again for your valuable feedback!
We thank the reviewers for their insightful comments. Multiple reviewers raised concerns about the linear memory requirement of L-ARC. In the original submission, we acknowledged this limitation and planned to leave the derivation and analysis of memory-efficient variants for future work. Nonetheless, in response to these comments, we have decided to include a PDF document that presents a variant of L-ARC with constant memory requirements and provides empirical results demonstrating how this scheme balances memory efficiency with localized risk control. We hope this addresses the reviewers’ concerns and plan to include these new results as part of the supplementary material.
The paper develops a Localized Adaptive Risk Control (L-ARC) scheme for learning an online conformal predictor. Here a reproducing kernel Hilbert space (RKHS) is used to learn (via stochastic gradient descent) a function for an adaptive threshold, enhancing baseline fixed-threshold ARC. The threshold is used to determine the prediction set.
The RKHS kernel choice enables reliable conformal prediction on data with distinct subpopulations, an open problem prior to this paper and important in some applications. The reviewers noted that this is an interesting and original insight that is likely to draw attention in the conformal prediction community and is a major strength of the paper.
The authors show theoretically that the stationary point of their learning algorithm satisfies the localized risk guarantee, and connections are made with kernel learning theory. It seems likely that the use of a functional form for the threshold with applicability to online use will lead to further interest in the community as the adaptive risk control methods are further studied and developed.
The reviewers also emphasized that the experimental results provide strong evidence and validate the utility of the localization effect of the choice of kernel. The experiments in different applications show improvement in comparison to online risk control method benchmarks. The paper emphasizes the improvement over fixed threshold ARC algorithms. (It may be true that other ML methods will be superior in any given application.)
Any online method may involve tradeoffs. Online kernel methods have linearly growing memory requirements unless some (typically ad hoc) method is used to control the memory. In the rebuttal, the authors acknowledged the issue and demonstrated such a method on a simulation example. The reviewers noted that the analysis in the paper is useful and provides insight, while also emphasizing the importance of a theoretical characterization of the tradeoffs due to finite memory in the RKHS-based approach (reserved for future work).
The reviewers have emphasized that the insights regarding the connection between kernel choice and the localization effect provided by the existing analysis are of primary interest. The truncated version of the algorithm and the associated experiments furnished during the discussion phase also help mitigate the memory issue. Perhaps of interest to the authors are theories that consider balancing memory use (model size) against performance, e.g., Koppel, IEEE SP Magazine, May 2020.
Reviewers have also emphasized that the presentation of the paper would benefit from additional discussion of context as well as improved motivation and interpretation of the technical results, and recommend these issues be addressed in the final version.