Avoiding Catastrophe in Online Learning by Asking for Help
If a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
Abstract
Reviews and Discussion
This paper formulates the event of catastrophe as a product of payoff functions, where each payoff can be seen as the probability of avoiding catastrophe at each time step. In addition to the multiplicative objective, which differs from the standard regret in online learning, the agent observes feedback only when asking for help from a mentor. The goal is to maximize the product of payoffs relative to the mentor's performance while keeping the number of queries non-trivial (sublinear).
This paper first establishes a worst-case impossibility result, showing that an agent cannot avoid catastrophe with sublinear mentor queries, even when the mentor follows a safe policy. For the case where catastrophe avoidance is possible, the paper identifies a connection to Littlestone and VC dimensions and leverages a variant of Hedge [Russo et al. (2024)], which enables achieving sublinear regret with sublinear queries.
Questions for Authors
None
Claims and Evidence
Yes
Methods and Evaluation Criteria
The metric is analogous to regret in online learning, with the mentor's performance as the comparator, which makes sense.
Theoretical Claims
Checked Theorem 4.1.
Experimental Design and Analysis
Not applicable.
Supplementary Material
Checked Appendix A.
Relation to Prior Work
The first provable algorithm for learning with irreversible costs without Bayesian inference.
Missing Important References
Well referenced.
Other Strengths and Weaknesses
Strengths:
- The paper does well in motivating the considered problem, presenting a sensible formulation and iterative feedback setup, and drawing distinctions between the novel formulation and standard online learning; it is thus clear that the formulation itself is not a trivial problem.
- The results are fairly complete, covering both the negative case and the solvable cases.
Weakness: for the solvable cases, the approach generally relies on the VC/Littlestone dimension as a parameter of the Hedge algorithm, which can be impractical to compute.
Other Comments or Suggestions
None
We thank the reviewer for the thoughtful review. We believe the reviewer articulates only a single weakness:
Weakness: for the solvable cases, the approach generally relies on the VC/Littlestone dimension as a parameter of the Hedge algorithm, which can be impractical to compute.
We agree that this is a weakness of our current algorithm. We offer two responses:
- We are the first to show that avoiding irreversible mistakes without Bayesian inference is possible at all. We would of course eventually like an efficient algorithm, but we believe that establishing solvability is an important first step toward finding one.
- We are optimistic about the prospect of a more efficient algorithm based on techniques from Block et al. (2022). In the context of standard online learning (i.e., without irreversible mistakes), they show how to explore a hypothesis class without explicitly computing the VC dimension or constructing an ε-cover. We will add a mention of this to the conclusion, as we think this is an important avenue for future work.
We would also like to clarify the regret bound. The reviewer writes:
...which enables achieving sublinear regret with sublinear queries.
However, our algorithm achieves subconstant regret with sublinear queries.
The paper addresses a novel problem of avoiding catastrophe in online learning with the help of a mentor policy. Specifically, for each input and action, there is a probability of catastrophe. The objective is to minimize the probability of catastrophe over T rounds of play while keeping the number of queries to the mentor relatively low. The mentor policy is assumed either to belong to a class with finite VC dimension while the input distribution is "smooth", or to have finite Littlestone dimension. Additionally, the authors assume a kind of Lipschitz continuity in the mentor's policy, which they refer to as "local generalization."
The authors propose an algorithm with sub-constant regret and sub-linear queries to the mentor. The algorithm discretizes the policy class and, at a high level, queries the mentor with a fixed probability while running the Hedge algorithm on the discretized policy set over queried rounds. Furthermore, if an input is too far from a previously queried input, the algorithm also queries the mentor. The local generalization essentially ensures that if the input is close to an already queried input, the mentor's action will be approximately the same.
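The high-level loop described above can be sketched as follows. This is a simplified illustration, not the authors' exact pseudocode: the policy representation, distance test, and 0/1 loss are placeholder assumptions for a 1-D input space.

```python
import math
import random

def run_agent(inputs, mentor, policies, query_prob, dist_threshold):
    """Sketch: Hedge over a discretized policy set, plus
    distance-triggered mentor queries (illustrative only)."""
    weights = [1.0] * len(policies)      # Hedge weights over discretized policies
    eta = math.sqrt(math.log(len(policies)) / max(len(inputs), 1))
    queried_inputs = []                  # inputs at which the mentor was asked
    n_queries = 0

    for x in inputs:
        hedge_query = random.random() < query_prob
        # Ask for help if x is far from every previously queried input;
        # local generalization makes nearby queried inputs informative.
        far = all(abs(x - xq) > dist_threshold for xq in queried_inputs)

        if hedge_query or far:
            n_queries += 1
            a = mentor(x)                # observe the mentor's action
            if far:
                queried_inputs.append(x)
            if hedge_query:
                # Multiplicative-weights update: penalize policies that
                # disagree with the mentor on x (0/1 loss).
                for i, pi in enumerate(policies):
                    weights[i] *= math.exp(-eta * (0.0 if pi(x) == a else 1.0))
        else:
            # No feedback observed: act with the current highest-weight policy.
            best = max(range(len(policies)), key=lambda i: weights[i])
            _ = policies[best](x)
    return n_queries
```

Note that, matching the review's description, the set of remembered inputs and the Hedge weights are updated by separate mechanisms.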
Post Rebuttal
Overall, my assessment remains unchanged: I still find it odd that the regret compared to the mentor becomes somewhat vacuous unless the mentor avoids catastrophe with a probability close to 1. Additionally, I suggest the authors include a more detailed discussion regarding their packing argument, specifically addressing why current tools fail and how their approach overcomes these challenges. Currently, this aspect is quite vague, both in the main text and in their rebuttal.
I will keep my score unchanged.
Questions for Authors
In line 229, what do you mean by feature embedding? Isn't already the "state" embedding?
It seems that whenever hedgeQuery is true, then X is not updated, and that whenever the algorithm "asks for help," the policy is not updated. This is not very intuitive; why is it that you don't need that?
Is there a specific policy class for which you can make the algorithm efficient (not exponential in the VC dimension)?
Is there a difference between (as mentioned in Algorithm 1) and in the rest of the paper, or is that just a typo?
Claims and Evidence
I did not find problematic claims; the analysis combines a few known techniques, and the analysis overview in the main text made sense to me.
Methods and Evaluation Criteria
Overall, the evaluation criterion is not standard, but this is part of the paper's novelty. The requirement that the number of queries be sublinear (i.e., that the querying rate vanishes with T) makes perfect sense, as this essentially means that the algorithm becomes self-sufficient over time.
The multiplicative objective that quantifies the overall probability of avoiding catastrophe also makes sense to me.
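A quick numeric illustration of this product objective (toy numbers; treating the per-round payoffs as independent avoidance probabilities is an assumption of this sketch, not a claim from the paper):

```python
import math

# Toy per-round probabilities of avoiding catastrophe.
payoffs = [0.999, 0.998, 0.9995, 0.997]

# The multiplicative objective: overall probability of avoiding
# catastrophe across all rounds.
overall = math.prod(payoffs)

# Equivalently, the sum of log-payoffs: maximizing the product is
# maximizing an additive objective in log space.
log_sum = sum(math.log(p) for p in payoffs)
assert abs(overall - math.exp(log_sum)) < 1e-12
```

Even per-round payoffs very close to 1 compound: here the overall avoidance probability drops to roughly 0.9935 after only four rounds.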
The regret compared to the mentor makes sense on one hand; however, the fact that it becomes somewhat vacuous unless the mentor avoids catastrophe with a probability close to 1 is a bit odd. I understand the authors' discussion regarding that in principle, but more technically, it implies that the mentor's probability of avoiding catastrophe approaches 1 as T grows, yet the mentor's policy is fixed, so it kind of assumes that the payoff functions are such that the mentor's policy avoids catastrophe with a probability that becomes arbitrarily close to 1. If the mentor's policy were not fixed, it would make more sense.
Theoretical Claims
I did not verify the appendix's proofs, but as mentioned, the analysis overview in the main text made sense to me.
Experimental Design and Analysis
There are no experiments in the paper.
Supplementary Material
I have not read the supplementary material.
Relation to Prior Work
This is somewhat related in motivation to safety in online learning (say, in constrained MDPs), and also somewhat related to imitation learning, as here the algorithm learns by querying the mentor. In a sense, it combines motivation from these two lines of work, as in safety there is no component of requiring a mentor, and in imitation, there is no explicit safety requirement.
Missing Important References
Not that I'm aware of.
Other Strengths and Weaknesses
Overall, the paper is well-written and easy to follow.
The local generalization assumption is not a standard one and is not adequately justified in the paper.
The analysis mainly combines known techniques, which I find okay in this case since the setting is novel and well-motivated. Properly combining known techniques to solve a new problem is definitely not a trivial task. The main claim of technical novelty by the authors lies in the packing argument, which they only briefly discuss in the last paragraph of Section 5. However, I think that this paragraph does not adequately convey the technical challenge and how it is addressed.
Other Comments or Suggestions
N/A
We thank the reviewer for the thoughtful review. We respond to each concern and question:
Concern 1
The regret compared to the mentor makes sense on one hand; however, the fact that it becomes somewhat vacuous unless the mentor avoids catastrophe with a probability close to 1 is a bit odd…If the mentor's policy were not fixed, it would make more sense.
To clarify, the mentor's policy is not fixed across values of T. When we stated in the paper that the mentor policy is fixed, we meant that it does not vary across time steps within a given problem instance, i.e., we do not deal with non-stationarity. It is crucial that the mentor policy can vary across values of T (and across problem instances) for the exact reason stated by the reviewer. Our original writing was unclear and we will fix this.
We also mention that our subconstant bound on standard additive regret (Theorem 5.3) is unaffected by this issue entirely.
Concern 2
The local generalization assumption is not a standard one and is not adequately justified in the paper.
We agree on the importance of justifying assumptions and will expand this justification when revising.
To recap, local generalization states that |μ(x, π^m(x)) − μ(x, π^m(x'))| ≤ L‖x − x'‖ for any inputs x and x'. We make the following claims:
A. One can transfer knowledge between similar situations, with the reliability of the transfer depending on the degree of similarity. This is well-known in the education and psychology literature (e.g., Esser et al. 2023, Hajian 2019).
B. Local generalization captures the idea of Claim A. Specifically,
B1. μ(x, π^m(x)) is the effect of taking the right action for the current input x
B2. μ(x, π^m(x')) is the effect of using what you learned in x' for the current input x
B3. L‖x − x'‖ captures the similarity between x and x'
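As a toy numeric illustration (with a made-up 1-D payoff function and mentor policy, not from the paper), a smooth payoff combined with a Lipschitz mentor policy satisfies a bound of the form |μ(x, π^m(x)) − μ(x, π^m(x'))| ≤ L‖x − x'‖:

```python
# Toy check of local generalization: |mu(x, pi(x)) - mu(x, pi(x'))| <= L*|x - x'|
# for a made-up payoff and mentor policy (illustrative assumptions only).

def mentor_policy(x):
    return 0.5 * x                     # a 0.5-Lipschitz mentor policy

def payoff(x, a):
    # Payoff peaks at the mentor's action and decays with the action error.
    return 1.0 - 0.1 * abs(a - mentor_policy(x))

L = 0.1 * 0.5   # payoff Lipschitz constant (in a) times policy Lipschitz constant

grid = [i / 100 for i in range(101)]
for x in grid:
    for xp in grid:
        gap = abs(payoff(x, mentor_policy(x)) - payoff(x, mentor_policy(xp)))
        assert gap <= L * abs(x - xp) + 1e-12
```

Here following the mentor's action for a nearby input x' costs at most L‖x − x'‖ in payoff, which is exactly the knowledge transfer described in Claim A.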
To effectively focus our revision, it would help us to understand which of these claims felt weakest to the reviewer, or if their skepticism is based on something else.
Concern 3
The main claim of technical novelty by the authors lies in the packing argument, which they only briefly discuss in the last paragraph of Section 5. However, I think that this paragraph does not adequately convey the technical challenge and how it is addressed.
To recap, we perform a packing argument with respect to an arbitrary categorization of the data instead of considering all data in aggregate as is typical in packing arguments. In our case, the categorization is based on the algorithm’s action. However, we should have emphasized in the paper that our technique can be applied to any categorization of the data.
In fact, we think our technique has promise as a general-purpose way to bound how well categorized data covers an input space. It’s often insufficient to simply have lots of data for each category: to ensure good generalization, each category should be represented well across the input space. For example, suppose one has plenty of data on healthy and sick patients, but all the sick patient data is from hospital visits: models trained on this data may make inaccurate predictions about sick patients outside of the hospital, despite plenty of sick patient data. We will expand this discussion.
Question 1
In line 229, what do you mean by feature embedding? Isn't already the "state" embedding?
Yes, we just mean state embedding. The idea is that the algorithm is agnostic to how the states are represented, so local generalization just has to be satisfied for some way of representing states. We will clarify this.
Question 2
It seems that whenever hedgeQuery is true, then X is not updated, and that whenever the algorithm "asks for help," the policy is not updated. This is not very intuitive; why is it that you don't need that?
Those updates are not necessary because the algorithm already learns enough from the existing updates. Omitting those updates from the algorithm allows us to fully separate the Hedge updates and the ask-for-help updates, which slightly simplifies the analysis. However, we see why this is unintuitive, and we will add an explanation of this. We are also open to including those updates in the algorithm.
Question 3
Is there a specific policy class for which you can make the algorithm efficient (not exponential in the VC dimension)?
In general, the size of a cover of the policy class must be exponential in the VC dimension d. Intuitively, this is because a VC dimension of d means that the policy class can realize all labelings of some set of d points.
However, we are optimistic about reducing the dependence on in future work via techniques from Block et al. (2022), who show how to bypass explicitly constructing covers. We will mention this in the conclusion.
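For intuition on that exponential dependence (a standard counting fact, not an argument from the paper): the Sauer–Shelah lemma bounds the number of labelings a class of VC dimension d can induce on n points, and on d shattered points that count is exactly 2^d, so any exact cover needs at least 2^d policies:

```python
from math import comb

def sauer_bound(n, d):
    """Sauer-Shelah bound on the number of labelings of n points
    induced by a hypothesis class of VC dimension d."""
    return sum(comb(n, i) for i in range(min(n, d) + 1))

# On d shattered points, all 2^d labelings are realized, so a cover of
# the policy class must have size at least 2^d.
for d in range(1, 10):
    assert sauer_bound(d, d) == 2 ** d

# For n much larger than d, the growth is only polynomial in n (degree d).
assert sauer_bound(10, 2) == 1 + 10 + 45
```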
Question 4
Is there a difference between ( as mentioned in Algorithm 1) and in the rest of the paper, or is that just a typo?
This is simply a typo. We apologize for the confusion.
The paper studies how to design learning algorithms that avoid catastrophe. In particular, the goal of the learner is to minimize the chance of a catastrophe. This is modeled as an online learning problem in which the learner aims to maximize the product of the payoffs. The learner is equipped with a mentor that can be queried to obtain the best action for the current round. The problem is in general unlearnable: any algorithm must either query the mentor a linear number of times or cause catastrophe with high probability. The authors show that when the policy class is learnable, it is possible to design a learning algorithm that queries the mentor a sublinear number of times and whose regret approaches 0.
Questions for Authors
None.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Not Applicable.
Theoretical Claims
I only checked the main paper.
Experimental Design and Analysis
Not applicable.
Supplementary Material
No.
Relation to Prior Work
The paper introduces a new model. However, it borrows some techniques used for other online learning problems.
Missing Important References
No.
Other Strengths and Weaknesses
The paper introduces a new problem, and the problem of avoiding catastrophe is well-motivated. However, how the problem is modeled as an online learning problem with a mentor is a bit arbitrary. I'm not sure whether the right way to sell the paper is yours, i.e., modeling the problem of avoiding catastrophe as an online learning problem, or focusing more on a general online learning setting in which the goal is to maximize the product of the payoffs, with avoiding catastrophe as an application.
From a technical perspective, it seems that the paper is mostly putting together results from previous works. Moreover, it is unclear whether the paper introduces general techniques or models that can be applied to broader and more fundamental problems.
Other Comments or Suggestions
None.
We thank the reviewer for the thoughtful review. We respond to each of the reviewer’s concerns:
Concern 1
However, how the problem is modeled as an online learning problem with a mentor is a bit arbitrary. I'm not sure whether the right way to sell the paper is yours, i.e., modeling the problem of avoiding catastrophe as an online learning problem, or focusing more on a general online learning setting in which the goal is to maximize the product of the payoffs, with avoiding catastrophe as an application.
We appreciate the suggestion and are happy to make the paper’s framing more general if the reviewer thinks that would strengthen the paper. In fact, this seems more like a strength of our work than a weakness: if we understand correctly, the reviewer suggests that our work is actually more general than our current framing.
Concern 2
From a technical perspective, it seems that the paper is mostly putting together results from previous works. Moreover, it is unclear whether the paper introduces general techniques or models that can be applied to broader and more fundamental problems.
We do agree that technical novelty is not our primary contribution. However, we offer some mild pushback:
- The existing results we utilize come from a variety of sources and topics. We think that combining previously disparate results in a new way can also be a form of technical novelty.
- We believe our packing argument is novel, as discussed at the top of the right column of page 8. To recap, we perform a packing argument with respect to an arbitrary categorization of the data instead of considering all data in aggregate as is typical in packing arguments. In our case, the categorization is based on the algorithm’s action. However, we should have emphasized in the paper that our technique can be applied to any categorization of the data.
In fact, we think our technique has promise as a general-purpose way to bound how well categorized data covers an input space. It’s often insufficient to simply have lots of data for each category: to ensure good generalization, each category should be represented well across the input space. For example, suppose one has plenty of data on healthy and sick patients, but all the sick patient data is from hospital visits: models trained on this data may make inaccurate predictions about sick patients outside of the hospital, despite plenty of sick patient data. We will expand this discussion when revising.
This work introduces an online learning framework that avoids catastrophic mistakes by allowing an agent to query a mentor when uncertain, rather than relying solely on trial-and-error. The approach maximizes the product of payoffs, where each payoff represents the chance of avoiding catastrophe, and leverages local generalization to apply knowledge from similar inputs. The authors prove that without mentor assistance, any algorithm with sublinear queries incurs high regret, but with help, one can achieve subconstant regret and a sublinear query rate under standard learnability assumptions (finite VC or Littlestone dimensions). Their results offer theoretical guarantees for safe online learning in high-stakes scenarios where irreversible errors must be avoided.
Questions for Authors
On the bottom of page 3 the authors state "sublinear regret is possible iff the hypothesis ...". Is this supposed to be an "if and only if" or is this a typo?
Claims and Evidence
Yes, the claims made in this work are supported by clear and convincing evidence, as detailed in Section 1.4.
Methods and Evaluation Criteria
Yes
Theoretical Claims
I briefly checked the correctness of Theorem 4.1.
Experimental Design and Analysis
N/A
Supplementary Material
I briefly reviewed the proof of Theorem 4.1
Relation to Prior Work
The paper extends classical online learning frameworks, which typically assume that all errors are recoverable by leveraging concepts like finite VC/Littlestone dimensions and algorithms such as Hedge, by incorporating a mentor query mechanism and local generalization to safely handle irreversible, catastrophic mistakes.
Missing Important References
Not to my knowledge.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
Page 4: "Also, that work" ---> Also, their work
We thank the reviewer for the thoughtful review.
Page 4: "Also, that work" ---> Also, their work
We appreciate the correction and will make this change.
On the bottom of page 3 the authors state "sublinear regret is possible iff the hypothesis ...". Is this supposed to be an "if and only if" or is this a typo?
This is indeed “if and only if”. See, e.g., Corollary 21.8 in Shalev-Shwartz and Ben-David, Understanding Machine Learning: From Theory to Algorithms (2014). We will expand “iff” to “if and only if” to avoid reader confusion.
We thank the authors for their submission. Some reviewers mentioned limited novelty compared to previous works. Others had reservations about the formulation itself and the fact that the mentor can always avoid catastrophe. That said, the reviewers did appreciate the overall formulation of the problem and the results therein.