ROSARL: Reward-Only Safe Reinforcement Learning
Abstract
Reviews and Discussion
The paper tackles the problem of solving tasks safely using reinforcement learning. The main idea of the paper is to introduce a term to appropriately penalise unsafe behaviour, the analysis of the paper seeks to bound the magnitude of this term to minimise the probability of arriving at an unsafe goal state while achieving the task. To develop this term, the authors establish several objects required to quantify the complexity and "size" of the problem. These two measures are then used to construct the additional reward term to adhere to the safety criterion. It is then shown theoretically that adding a reward term that adheres to certain inequality constraints ensures a high degree of safe behavior.
Strengths
The paper addresses an important challenge with an intriguing idea. Beyond supplying solely theoretical results, the paper provides a practical method suggested by the theoretical analysis. The results suggest that the method leads to a notable reduction in safety costs, which is one of the main stated goals of the paper.
The intuition of the paper is largely clear, most of the non-technical parts are nicely written, and overall the concepts are well explained. The authors have included a practical example midway through the paper, which is useful for discussing the ideas in a concrete setting and explaining some of the concepts.
Weaknesses
===Analysis===
A1. Some parts of the paper aren't written with a high level of rigour. For example, the experiments of Section 5 are divided into parts depending on whether the "theoretical assumptions are satisfied", but there is no clear statement as to what exact assumptions this refers to. Additionally, there appear to be a few typos in the mathematical expressions, e.g., in the arguments of the min and max operators in the definitions. There are other examples, and I have some more technical concerns, such as the existence of the maximum in Definition 3 - I have given more details of this concern below. Other more minor points are as follows:
- In Definition 2 the case of equality is not covered.
- Theorem 1 is a bit pointless - perhaps the authors could consider writing it as a note within the text.
A2. I am finding it difficult to understand how the object in Definition 3 could be computed. Although the restriction to proper policies ensures the goal state is reached in a finite number of time-steps, in a given environment, for any deterministic policy I may be able to increase the number of time steps it takes to reach the goal by, for example, asking the agent to go back on itself before going forward again. In this case, it is unclear whether such a maximum exists (although the function T may be upper bounded on the set of proper policies, its image is not a closed set).
A3. I found it slightly troubling that many of the key results and characterisations are derived using the solvability parameter, and then, because calculating this is impractical as the authors acknowledge, the algorithm put forward uses a replacement for it without rigorous explanation as to whether the resulting calculations are good proxies. Indeed, although some intuition is provided as to why all is not lost by omitting the solvability in the calculation, the effect of this omission on the computations and overall behaviour is not properly studied.
===Experiments===
B1. A possible concern is that although from the empirical evaluation the method does reduce the total costs by an appreciable amount, this seems to come at a heavy price in the returns. Is there a configurable parameter that can control this trade-off? I would also like to see a comparison to the baselines that allow this trade-off to be configured, specifically considering a range of values for the parameter that controls this trade-off.
B2. Overall, I did not find the experimental results to be extremely compelling: firstly, the results themselves suggest that the method reduces costs at a significant price to the rewards, but also the environments are quite simple and the method has not been tested on a range of complex environments, e.g., Safety Gym.
Questions
- On lines 267-268, is the method considering proper policies? If so, why do the expressions use estimated quantities rather than the corresponding true bounds?
- How can we be sure that introducing the minmax penalty preserves the original solution of the MDP (and hence, we are still solving the original problem)?
We really appreciate the reviewer's time and effort spent reviewing our paper. We hope the reviewer’s concerns have been addressed below.
===Experiments===
We believe the reviewer’s concerns here are the most important, since they highlight that we weren’t sufficiently clear on our problem statement and hence the purpose of these experiments. We have moved the safety definition to the background section to make our problem statement clearer (the ROSARL paragraph). We have also added additional safety-gym experiments in the appendix, in addition to the ones that were already there. Please see the general comment for a more detailed explanation.
===Analysis===
A1: The experiments of Section 5 are divided into parts depending on whether the "theoretical assumptions are satisfied" but there is no clear statement as to what exact assumptions this refers to.
- We assume the stochastic shortest path setting (Bertsekas & Tsitsiklis, 1991) with finite state and action spaces, as stated and described in the background section.
- In experiment 5.1, this was stated on line 291 before paper updates. It satisfies our theoretical setting described in the background since it is a shortest path gridworld. For better clarity, we have updated that line to reference our theoretical setting from the background.
- In experiment 5.2, it was stated on line 319 and explicitly on lines 341-343 before paper updates. It does not satisfy our theoretical setting because it is not a shortest path domain. For better clarity, we have updated that line to reference our theoretical setting from the background.
A1: Additionally, there appear to be a few typos in the mathematical expressions, e.g., in the arguments of the min and max operators in the definitions.
- We hope the following clarifies why those arguments in definitions 3 and 4 are correct.
- As stated in Definition 1, the first quantity denotes the final state of a trajectory starting from a given state, and the second is the probability of reaching that final state from the starting state when following a proper policy. Hence the arguments used there are the intended ones.
- Similarly, as stated in Definition 3, the quantity is the number of timesteps taken to reach the final state from the starting state when following a proper policy. Hence those arguments are also as intended.
A1: In Definition 2 the case of equality is not covered.
- We choose to define the Minmax penalty this way to represent the boundary where on one side no reward function has an optimal policy that is unsafe, and on the other there exists a reward function with an optimal policy that is unsafe.
- In the equality case, there may exist both safe and unsafe optimal policies simultaneously---hence an RL algorithm with such rewards will not be guaranteed to converge to safe optimal policies.
- For example, consider the chain-walk example in Figure 2 with a reward of -2 for the unsafe transition (i.e. the transition into the unsafe absorbing state) and a reward of -1 for all other transitions. In the state adjacent to the unsafe absorbing state, both the safe action (transitioning back towards the goal) and the unsafe action (transitioning into the unsafe state) are then optimal. Hence we can’t guarantee that an RL algorithm in this example will converge to a safe optimal policy (a worked sketch of this tie is given after this list).
- For better clarity, we have added this explanation to the text (right after Def 2 and in Section 3.1)
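To make the tie concrete, here is a minimal value-iteration sketch of a chain-walk of this kind. The specific layout (states G, s1, s2, U) and the step rewards are illustrative rather than copied from Figure 2 of the paper:

```python
# Illustrative sketch (not the paper's exact Figure 2 chain-walk): a deterministic
# chain with a safe absorbing goal G, an unsafe absorbing state U, and two
# intermediate states. Every transition gives reward -1, except the transition
# into U, which gives the boundary penalty -2 discussed above.
states = ["G", "s1", "s2", "U"]   # G and U are absorbing
actions = ["left", "right"]

# transitions[(state, action)] = (next_state, reward)
transitions = {
    ("s1", "left"):  ("G", -1.0),    # reach the safe goal
    ("s1", "right"): ("s2", -1.0),
    ("s2", "left"):  ("s1", -1.0),
    ("s2", "right"): ("U", -2.0),    # unsafe transition, penalised by -2
}

# Undiscounted value iteration; absorbing states keep value 0.
V = {s: 0.0 for s in states}
for _ in range(50):
    for s in ["s1", "s2"]:
        V[s] = max(r + V[ns] for ns, r in (transitions[(s, a)] for a in actions))

q_s2 = {a: transitions[("s2", a)][1] + V[transitions[("s2", a)][0]] for a in actions}
print(q_s2)  # {'left': -2.0, 'right': -2.0}: the safe and unsafe actions tie, so
             # convergence to a safe optimal policy is not guaranteed at this penalty.
```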
A2: I am finding it difficult to understand how the object in Definition 3 could be computed. For any deterministic policy I may be able to increase the number of time steps it takes to reach the goal by for example asking the agent to go back on itself before going forward again.
- Kindly note that this will require changing that deterministic policy or changing the environment dynamics. Given our proper policy assumption, and our assumption that the state and action spaces are finite, the maximum of the number of timesteps exists.
- See (Jaksch et al., 2010, page 2, Definition 1) for a similar definition of Timesteps and diameter.
A3: Indeed, although some intuition is provided as to why all is not lost by omitting the solvability in the calculation, the effect of this omission on the computations and overall behaviour is not properly studied.
- An important point to note here is that Algorithm 2 is only concerned with learning a sufficient penalty for the current task, since the theoretical results show that learning one for all tasks (i.e. the exact Minmax penalty) is guaranteed to be computationally inefficient.
- Our theoretical results showed that the penalty one should give for unsafe transitions should be inversely proportional to the solvability.
- This effect is what we aimed to capture by the calculations in Algorithm 2, as explained in lines 281-287. To investigate it, we ran Experiments 5.1 and 5.2.
- Both experiments demonstrate that the learned penalty from Algorithm 2 is indeed inversely proportional to the solvability, since the probability of reaching safe goals decreases with increasing stochasticity (Figures 4.b and 5.b).
- Experiment 5.1 also shows that there is a gap between the learned penalty and the derived lower bound minmax penalty. It also shows that despite this gap, at least for a single task, the penalty learned in Algorithm 2 is sufficient to learn optimal safe policies. Figure 4.a illustrates the actual policies that were learned.
- Similarly, Experiment 5.2 shows that despite the theoretical assumptions not holding, the learned penalty is still sufficient for the agent to prioritise learning safe policies over maximising rewards. It demonstrates that the agent using the learned penalty is still able to maximise rewards when it doesn’t come at the cost of safety, as shown by the graphs for noise=0. It also demonstrates that the agent doesn’t completely ignore the rewards when the noise is not too large, as shown by the graphs for noise=2.5.
- To the best of our knowledge, the empirical results we provided across our experiments (learned penalties, failure rates, success rates, length of policies, sample trajectories) are the most relevant for analysing the effect of omitting the solvability. We are happy to report any other empirical result the reviewer would like us to add.
- We also believe that investigating the omission theoretically is not trivial, and hence is an interesting direction for future work by leveraging the theoretical foundations laid out by this work.
Questions
1: On lines 267-268, is the method considering proper policies? If so, why do the expressions use estimated quantities rather than the corresponding true bounds?
- Please note that we do not have access to the diameter nor the reward bounds in Algorithm 2. The expressions are there to estimate the reward bounds, and to use those estimates to obtain the value bounds, as described. For better clarity, we have moved the pseudocode of Algorithm 2 to the main text.
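For intuition, the following is a rough sketch of the kind of computation involved: maintain running bounds on the observed rewards and on the critic's value estimates, and derive a penalty from them. The function name and the exact combination of bounds are illustrative assumptions rather than the exact pseudocode of Algorithm 2:

```python
import numpy as np

def update_penalty_estimate(rewards, value_estimates, stats):
    """Hypothetical sketch: track running bounds on observed rewards and on the
    critic's value estimates, and derive a penalty for unsafe transitions from
    those bounds. All names and the exact formula are illustrative assumptions."""
    stats["r_min"] = min(stats["r_min"], float(np.min(rewards)))
    stats["r_max"] = max(stats["r_max"], float(np.max(rewards)))
    stats["v_min"] = min(stats["v_min"], float(np.min(value_estimates)))
    stats["v_max"] = max(stats["v_max"], float(np.max(value_estimates)))
    # Make an unsafe termination worse than the lowest value any visited state is
    # estimated to have, so that it is never preferable to continuing the task.
    penalty = min(stats["r_min"], stats["v_min"]) - max(stats["r_max"], stats["v_max"])
    return penalty, stats

# usage (illustrative):
# stats = dict(r_min=np.inf, r_max=-np.inf, v_min=np.inf, v_max=-np.inf)
# penalty, stats = update_penalty_estimate(batch_rewards, critic_values, stats)
```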
2: How can we be sure that introducing the minmax penalty preserves the original solution of the MDP (and hence, we are still solving the original problem)?
- We hope the problem statement clarification above answers this question.
- Our theoretical results show that any penalty larger than the derived lower bound of the minmax penalty will guarantee optimal policies that are safe (See Definition 2 and Theorem 2).
- Our empirical results demonstrate that introducing the learned penalty in Algorithm 2 also leads to safe policies with different reward performances depending on the level of stochasticity of the environment.
References:
- Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Thanks for your responses and for clarifying some points. Some other questions of mine remain unresolved.
- For completeness, I think the equality case should be covered since to me the definition looks incomplete without it.
- My issue with Definition 3 is that although each policy in the set of proper policies will arrive at the absorbing state in finite time, the set of proper policies is not guaranteed to be a finite set. Therefore, for any proper policy that requires T time steps to arrive at the absorbing state, I can always construct another proper policy that requires T' timesteps where T' > T. I think this may be resolved by imposing compactness on the policy space.
- My concern about the omission of the solvability within the protocol remains. While I understand the difficulties of computing it in practice, it seems that the theoretical guarantees established in the paper cannot be assured by the algorithm, nor do we have any assurance of near guarantees when using the protocol of the algorithm. This means there is a significant disconnect between the theory and the method.
Thank you to the reviewer for taking the time to read our rebuttal and for their fast response. Regarding the outstanding concerns:
- We are happy to add the equality case following the earlier explanation we gave.
- Please remember that the set under consideration is the set of deterministic proper policies over a finite set of states and actions. Hence it is finite and therefore compact. Compactness is a useful (and usual) assumption in continuous MDPs, but we are assuming finite MDPs for our theory.
- Please note that we make no theoretical claims about our practical algorithm (now labeled Algorithm 1 in the updated paper), only empirical ones. We agree that theoretical guarantees for that algorithm will be useful, but we hope the reviewer understands that it is beyond the scope of this work. We used our theoretical results to motivate the design of Algorithm 1. We used experiments to validate it, demonstrating and analysing its behaviour and performance compared to state-of-the-art baselines. Hence we hope the reviewer will evaluate Algorithm 1 on the merit of our strong empirical results.
We are happy to provide any further clarifications, and to run any additional experiment that the reviewer requests, time permitting.
Thanks for the clarifying comments, they have helped me understand the contribution better.
As a side-remark, I do believe however, for Point 2, that there is some confusion about the compactness of the space of functions (policies) versus the compactness of the spaces on which they operate. Specifically, I don't think the reasoning about the compactness of the policy space is correct. Consider, for example, an infinite family of continuous functions defined on a compact set. The image of each function is contained within a compact set (this follows simply because continuous functions map compact sets to compact sets), but clearly the function space itself is an infinite set. I think the issue here is nevertheless a side-note, since it seems to me that the underlying issue can be resolved by assuming the compactness of the policy space.
We appreciate the reviewer's score increase and are glad most of the reviewer's concerns have been addressed.
Regarding the remark on the size and compactness of the policy space, the given example helped us understand the source of the misunderstanding. To clarify this:
- Please note that we are considering only stationary deterministic policies. That is, we define a policy as a pure function that takes only a state as input and outputs an action, and does not change with time.
- So the set of all policies is finite (since the state and action spaces are finite), containing one policy for every way of assigning an action to each state. For example, in the Chain-walk MDP with 4 states and 2 actions, there exist only 2^4 = 16 possible policies (the general count is written out below).
- Hence the set of all proper policies is also finite, and therefore compact.
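Written out explicitly, with generic notation for the state and action spaces, the counting argument is

$$|\Pi| \;=\; |\mathcal{A}|^{|\mathcal{S}|}, \qquad \text{e.g. } |\mathcal{A}| = 2,\; |\mathcal{S}| = 4 \;\Rightarrow\; |\Pi| = 2^{4} = 16,$$

and the set of deterministic proper policies, being a subset of this finite set, is itself finite and hence compact.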
Please let us know if this clarification helped and if the reviewer has further concerns.
The paper discusses reward-only safe RL as an alternative to reward-shaping RL and constraint-based RL for safe RL. The main idea is to compute the minmax penalty, which is used to re-cast the problem so that a safe policy can be computed using standard RL methods. The method is evaluated against classical methods such as TRPO-Lagrangian and CPO, and a newer method called Saute RL with a TRPO backbone. The environments are quite challenging and come from Safety Gym.
Strengths
- This is a different and possibly new way to view safe RL
- The paper is quite well-written; however, I believe the algorithm is not described in the best way (see weaknesses)
- The authors present novel theoretical results and then discuss their strengths and limitations. In particular, they discuss convergence, computational complexity, applicability of the results
Weaknesses
- The algorithm is not well-described. It’s hard to understand the mathematical formulation of the problem and the solution.
- Experimental results are limited in scope and some of the conclusions are slightly misleading.
  a. For example, Saute RL solves a problem with probability-one constraints and cannot be directly compared to TRPO Lagrangian, which solves the problem with constraints on average, especially when the environment is highly stochastic.
  b. There are other baselines that the authors could compare to, to make their case. I recommend looking into implementations in https://github.com/PKU-Alignment/safety-gymnasium
  c. There’s no ablation study
- Presentation of the paper focuses on the theoretical results, which are hard to judge without an explicit mathematical problem formulation.
- The experimental part is hard to evaluate due to the lack of details, for example the algorithm is not even in the main paper (it’s in appendix).
Questions
- The authors look for a minimax penalty. Could the authors comment on the similarities and the differences with Lagrangian methods, where the algorithm performs an automatic search for a Lagrangian coefficient that weighs the constraint violation values.
- What is the problem that this algorithm aims to solve? I would like to see a precise mathematical formulation.
We really appreciate the reviewer's time and effort spent reviewing our paper. We hope the reviewer’s concerns have been addressed below.
The algorithm is not well-described. It’s hard to understand the mathematical formulation of the problem and the solution. Presentation of the paper focuses on the theoretical results, which are hard to judge without an explicit mathematical problem formulation.
- We have moved the safety definition to the background section to make our problem setting clearer in the background and given a precise mathematical problem formulation.
- Briefly, given the set of all safe policies, the goal of the agent in this work is to learn an optimal safe policy that maximises the value function over that set in all states.
- Hence, the optimal safe policy according to our definition is the one that maximises rewards only over safe policies, similarly to prior works.
- Hence, it is optimal to sacrifice returns by taking the longer paths to the goal, if that ensures the policy is safer. This is especially true in stochastic environments. It is also especially true in environments with dense rewards like safety-gym where the rewards incentivise the agent to go straight to the goal ignoring all the hazards along the way. Also note how when the environment is not noisy, the policy that maximises rewards is also a safe policy. Figures 4 and 5 are perfect examples of these, and demonstrate how TRPO Minmax indeed achieves that desired behaviour.
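For concreteness, with generic notation (a safe-policy set and value functions, not copied verbatim from the paper), the objective can be written as: a safe policy minimises the probability of reaching an unsafe absorbing state, and the optimal safe policy maximises value among safe policies,

$$\Pi_{\text{safe}} \;=\; \arg\min_{\pi}\; \Pr_{\pi}\!\big(\text{reach an unsafe absorbing state}\big), \qquad \pi^{*} \;\in\; \arg\max_{\pi \in \Pi_{\text{safe}}} V^{\pi}(s) \quad \text{for all states } s.$$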
Experimental results are limited in scope and some of the conclusions are slightly misleading. a. For example, Saute RL solves a problem with probability one constraints and cannot be directly compared to TRPO Lagrangian which solves the problem with constraints on average. Especially, when the environment is highly stochastic.
- To the best of our understanding, the problem definition of Saute RL is CMDPs (See Definitions 2 and 3 in Sootla et al., 2022), the same as in TRPO-Lagrangian and the other baselines (Ray et al., 2019). None of these works restrict their problem setting to deterministic dynamics. Also Sootla et al. (2022) explicitly mention that they also handle stochastic environments (see their page 1 and page 4).
- Sootla et al. (2022) themselves use the exact same baselines as us (Ray et al., 2019) to compare with their approach (which includes TRPO-Lagrangian).
- Sootla et al. (2022) uses a similar safety-gym task to our safety-gym PILLAR task (Figure 1.d in Sootla et al., 2022). See their description on page 16. The only differences are:
- We use a pillar instead of a hazard for the safety constraint.
- We require that the policies must be maximally safe by setting the cost threshold to 0. As a control, we see that Saute RL and all the other algorithms (including ours) are able to satisfy this constraint when there is no noise.
- We add different degrees of noise to the environment, since we still want policies that are safe even in stochastic environments. The results demonstrate that Saute RL as implemented by (Sootla et al., 2022) is simply terrible at this in this environment. The additional experiments using Omnisafe's Saute RL implementation give much better performance (see Table 2 on page 26).
There are other baselines that the authors could compare to, to make their case. I recommend looking into implementations in https://github.com/PKU-Alignment/safety-gymnasium c. There’s no ablation study
- Due to space constraints, our original submission included several additional experiments in more complex safety-gym domains (Figure 13).
- We have now also added several experiments using the Omnisafe baselines https://github.com/PKU-Alignment/omnisafe, which includes safety-gymnasium
The experimental part is hard to evaluate due to the lack of details, for example the algorithm is not even in the main paper (it’s in appendix).
- We have added Algorithm 2 to the main paper. We are happy to provide any additional details the reviewer requests.
References:
- Aivar Sootla, Alexander I Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David H Mguni, Jun Wang, and Haitham Ammar. Sauté RL: Almost surely safe reinforcement learning using state augmentation. In International Conference on Machine Learning, pp. 20423–20443. PMLR, 2022.
- Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking Safe Exploration in Deep Reinforcement Learning. 2019.
The authors look for a minimax penalty. Could the authors comment on the similarities and the differences with Lagrangian methods, where the algorithm performs an automatic search for a Lagrangian coefficient that weighs the constraint violation values.
- In deterministic environments with a cost threshold of 0, the set of safe policies for Lagrangian methods is the same as ours.
- However, in stochastic environments, Lagrangian methods require the correct choice of inequality constraints to even be well defined. If the cost threshold is not carefully chosen, there may exist no policy that satisfies the CMDP constraints, implying there would exist no optimal safe policy to converge to.
- For example, in the Lava or the Safety Gym Pillar domains with nonzero slip probability or noise, a cost threshold of 0 can never be satisfied by any policy for all states, making these approaches theoretically ill-defined in these environments with that cost threshold.
- That said, we found in practice that a cost threshold of 0 gave them the best performance in the Safety Gym experiments (compared to 1 and the default of 25).
- In contrast, we showed the existence of a Minmax penalty irrespective of the stochasticity of the environment.
- Additionally, while Lagrangian methods define and learn a Lagrange coefficient for each reward function even when the cost function and cost threshold remain unchanged, our minmax penalty is defined (and learned) once for all reward functions (see the sketch after this list).
- We have added this explanation to the related works section.
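For concreteness, the contrast can be sketched as follows; the notation is generic (standard CMDP Lagrangian on the left, a reward-only penalty objective of the kind used here on the right) and is not copied verbatim from either formulation:

$$\max_{\pi}\,\min_{\lambda \ge 0}\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r_{t}\Big] - \lambda\Big(\mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} c_{t}\Big] - d\Big) \qquad \text{vs.} \qquad \max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} \tilde{r}_{t}\Big], \quad \tilde{r}_{t} = \begin{cases} \bar{R} & \text{on unsafe transitions} \\ r_{t} & \text{otherwise.} \end{cases}$$

Here the multiplier λ is re-tuned whenever the task reward changes (even if the cost function and threshold d stay fixed), whereas a penalty at or below the Minmax penalty is defined once for all task rewards.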
Thank you for the detailed response, updated manuscript, and additional experiments.
My remaining concerns are
- Baselines.
- There are other newer baselines in OMNIsafe that were not used and that probably have better performance than TRPO. The authors did demonstrate that using OMNIsafe was not hard.
- Re Saute RL. Sootla et al 2022 is called "Sauté RL: Almost Surely Safe Reinforcement Learning Using State Augmentation", and therefore they do have a different formulation to a CMDP, as highlighted by the title and the theoretical results. This setting would be equivalent to your formulation with the definition of a safe policy that reaches the unsafe absorbing state with probability equal to 0. There's no surprise that Saute RL would be more conservative, although considering that Omnisafe works much better there may be different explanations. Nevertheless, you could strengthen the claim by correctly saying that Saute RL is intrinsically more conservative than ROSARL and demonstrate that this indeed happens on the pillar environment. Furthermore, the large reward penalties could potentially lead to poor training performance.
- There's no ablation on the way r_min is set. You could try a similar strategy to Sootla et al 2022 and set a fixed negative value, for example.
Regarding "terrible performance" in comparison to Omnisafe implementation, I would caution of using such language in a scientific context without explaining the reason for this behavior. Saute RL is effectively a wrapper on an already existing implementation of an RL algorithm (in this case safety baselines) with a few new hyperparameters.
I would recommend making the algorithm the center of the presentation and explaining why it works later. This simple modification seems to be working well and you have presented a motivation for using it. It would work in the discounted cost setting as well if the state augmentation by Sootla et al is used to define the absorbing state.
Also there was a similar work published recently that could be of interest to the authors:
Solving Richly Constrained Reinforcement Learning through State Augmentation and Reward Penalties https://arxiv.org/pdf/2301.11592
I will keep my score, but it is an interesting idea that results in a simple and effective algorithm.
Thank you to the reviewer for taking the time to read our rebuttal and for their fast response. Regarding the outstanding concerns:
1- Baselines.
- Which additional OMNIsafe baseline would the reviewer like to see (and in which environment)? We are happy to report on it, time permitting, given that it is computationally expensive (we will need to run 10 million steps for 10 random seeds to be consistent).
- We used TRPO to be consistent with prior works (since it is the baseline RL algorithm used in most prior works), and also because the benchmark results provided by OMNIsafe show that it is in general competitive with or better than the other baselines (Policy Gradient, Natural PG, and PPO). For example, looking at SafetyPointGoal1-v0 in Table 2 of https://github.com/PKU-Alignment/omnisafe/tree/main/benchmarks/on-policy , we see that the TRPO baseline has the highest rewards and lowest costs.
2- Please note that for our approach, r_min is learned. Assuming the reviewer means the unsafe rewards for Saute RL, we used the default fixed negative value of -0.2 from OMNIsafe, since it is what they used for all their Saute RL benchmark results: https://github.com/PKU-Alignment/omnisafe/tree/main/benchmarks/on-policy. For the base environment (i.e. not Sauté'ed), we used a fixed negative value of -1 (i.e. this is the reward TRPO gets for unsafe transitions).
- Indeed, using our approach to learn a sufficient unsafe reward for Sauté'ed environments is an interesting direction for future work.
- We thank the reviewer for the suggested paper (https://arxiv.org/pdf/2301.11592), which we will add to our related works.
Again, thank you to the reviewer for taking the time to engage with us in this discussion phase. After thoroughly reading Sootla et al 2022 again, we would like to address the "Re Saute RL" concern in detail here.
Sootla et al 2022 is called Sauté RL: Almost Surely Safe Reinforcement Learning Using State Augmentation, and therefore they do have a different formulation to a CMDP as highlighted by the title and the theoretical results.
To the best of our understanding, this is not correct:
- Sootla et al. (2022) use a similar definition of CMDP (Definition 2, Page 3) to the original one by Altman (1999) (Sections 2.1-2.2, Pages 21-23) and also prior works like Ray et al. (2019) (Definition 2, Page 5). We have also included a similar definition on page 2 of our paper (labeled Safe RL).
- This similarity is not only recognised by us, it is also recognised by Omnisafe (Ji et al. 2024). We quote Ji et al. 2024 page 2 "The Adapter component serves to reconcile structural differences between algorithms for various classes of Constrained Markov Decision Processes (CMDPs) (Altman, 1999; Sun et al., 2021; Sootla et al., 2022b)".
- Hence, we hope the reviewer agrees that at the very least, our additional Omnisafe experiments (Appendix F pages 26-34) are as fair as one can reasonably expect.
For reference, here is the CMDP definition of Sootla et al (2022) (Page 3):
Definition 2. The constrained MDP (c-MDP) is a tuple where the additional terms are the safety cost c and the safety discount factor γ_c. The associated optimization problem is

$$\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, r(s_{t}, a_{t})\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\sum_{t} \gamma_{c}^{t}\, c(s_{t}, a_{t})\Big] \le d,$$

where d is the safety budget.
Sootla et al. (2022) then define their Sauté RL problem setting (Definition 3, page 4) by defining how they convert a regular CMDP into a Sauté MDP. Finally, they provide Theorem 3 (their final theorem, page 5), which says that if there exists an optimal policy for a Sauté MDP, then the policy is almost surely safe.
This setting would be equivalent to your formulation with the definition of safe policy that reaches the unsafe absorbing state with probability equal to 0
To the best of our understanding, this is not correct:
- Such a definition would mean there exist no safe policies in many stochastic environments, making the problem of finding an optimal safe policy in such environments ill-defined.
- For example, in the Lava gridworld (with nonzero slip probability) and the Safety-Gym Pillar environment (with nonzero noise), there exists no policy that reaches the unsafe absorbing state with probability equal to 0.
- In contrast, our formulation defines a safe policy as one that minimises the probability of reaching an unsafe absorbing state. Hence, the optimal safe policy need not have a probability 0 of reaching an unsafe absorbing state.
There's no surprise that Saute RL would be more conservative, although considering that Omnisafe works much better there may be different explanations.
- Unfortunately, we are unsure what the reviewer means by conservative here, which also makes it difficult to understand the follow-up statement.
- If by more conservative the reviewer means that Saute RL has higher costs than our approach, then it is also more conservative than all the other Safety-gym baselines since it also has a higher cost than them.
- However, we apologise if we have misunderstood the reviewer's statement here.
References:
- Eitan Altman. Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
- Jiaming Ji, Jiayi Zhou, Borong Zhang, Juntao Dai, Xuehai Pan, Ruiyang Sun, Weidong Huang, Yiran Geng, Mickel Liu, and Yaodong Yang. Omnisafe: An infrastructure for accelerating safe reinforcement learning research. Journal of Machine Learning Research, 25(285):1–6, 2024. URL http://jmlr.org/papers/v25/23-0681.html.
Sootla et al. (2022) then define their Sauté RL problem setting (Definition 3, page 4) by defining how they convert a regular CMDP into a Sauté MDP.
That's incorrect. They first perform a conversion in the deterministic case in Section 3.1 and then say in Section 3.2: let's formulate the same thing in a stochastic environment. They get a standard MDP that they can solve easily and then they ask in Section 3.3: what are we actually solving? Turns out that the problem they're solving is an MDP with constraints almost surely if n goes to minus infinity.
Finally, they provide Theorem 3 (their final theorem, page 5) which says that if there exists an optimal policy for a Sauté MDP, then the policy is almost surely safe
Yes, which means that the constraint has to be fulfilled with probability 1! In the standard CMDP formulation, we take the average over the accumulated cost. Sootla et al 2022 enforce the constraint with probability 1. This is what is meant by almost surely. The introduction dives into why the constraints on average may not be preferable in some applications. Please refer to Definition 4 for the problem formulation (page 5 on arXiv). They use the accumulated discounted cost at time t in the constraint. I appreciate that their presentation may be lacking clarity since they're building the problem bottom-up. Please refer to their follow-up paper https://arxiv.org/pdf/2206.02675, where they discuss the formulations in Section 2.1.
Hence, we hope the reviewer agrees that at the very least, our additional Omnisafe experiments (Appendix F pages 26-34) are as fair as one can reasonably expect.
For clarity, I don't think your experiments were flawed. I don't doubt that Saute RL doesn't perform as well as your approach. I think your conclusions are slightly inaccurate.
This setting would be equivalent to your formulation with the definition of safe policy that reaches the unsafe absorbing state with probability equal to 0
To the best of our understanding, this is not correct
- Such a definition would mean there exist no safe policies in many stochastic environments, making the problem of finding an optimal safe policy in such environments ill-defined.
- For example, in the Lava gridworld (with nonzero slip probability) and the Safety-Gym Pillar environment (with nonzero noise), there exists no policy that reaches the unsafe absorbing state with probability equal to 0.
That's exactly my point. In the formulation by Sootla et al 2022, they consider the setting where the constraints have to be fulfilled with probability 1. Therefore, in highly stochastic environments their algorithm will struggle. By modifying the reward one can get sensible looking policies, but this is not the theory behind it.
Again this concern is solved easily by pointing out the conservatism of their formulation, problems with connecting theory and practice (choosing the minimal reward), and the fixed nature of the minimal reward (which you learn).
Unfortunately, we are unsure what the reviewer means by conservative here, which also makes it difficult to understand the follow-up statement.
Apologies for the lack of clarity. I mean that the set of policies satisfying the constraints (the set of safe policies) in Sootla et al 2022 will be smaller than in your case. To be more specific, there will be safe policies (in some problems) under your definition which do not satisfy the constraints of Sootla et al 2022.
This is not the reason for my score though.
I am in between 5 and 6, but due to the overall presentation points I lean closer to 5. I must admit, though, that the sheer number of new experiments is hard to ignore.
In terms of additional baselines, I think the issue is that it's hard to know which experiments you need without experimenting. What do you want to show? What do you want to study? Without going far, Sootla et al 2022 showed that their algo:
- It can work with different on-policy and off-policy model free methods
- It can work with model-based methods: policy based and planning based
- It can generalize across different constraints
- It can solve some environments, where the average constrained methods do not perform well
This concern is hard to solve during the rebuttal and it's not fair to expect it.
Thank you to the reviewer for the clarifications. We are glad that the reviewer is happy with the Saute RL experiments, and we are happy to update our conclusions regarding its performance.
- Precisely, we will remark that its poor performance is possibly because it is focused on learning almost surely safe policies, but such policies do not exist in the noisy Safety Gym Pillar tasks.
Regarding the main aims of our experiments, they naturally follow from our theory and proposed practical algorithm. We used our theoretical results to motivate the design of Algorithm 1, but gave no theoretical guarantees around it. Hence, we used experiments to validate it, demonstrating and analysing its behaviour and performance compared to state-of-the-art baselines. For example,
- Our theoretical results showed that the penalty one should give for unsafe transitions should be inversely proportional to the solvability. Experiments 5.1 and 5.2 demonstrate that the learned penalty from Algorithm 1 is indeed inversely proportional to the solvability, since the probability of reaching safe goals decreases with increasing stochasticity.
- Experiment 5.1 also shows that there is a gap between the learned penalty and the derived lower bound minmax penalty. It also shows that despite this gap, at least for a single task, the penalty learned in Algorithm 1 is sufficient to learn optimal safe policies.
- Similarly, Experiment 5.2 shows that despite the theoretical assumptions not holding, the learned penalty is still sufficient for the agent to prioritise learning safe policies over maximising rewards. It demonstrates that the agent using the learned penalty is still able to maximise rewards when it doesn’t come at the cost of safety, as shown by the graphs for noise=0. It also demonstrates that the agent doesn’t completely ignore the rewards when the noise is not too large, as shown by the graphs for noise=2.5.
Similarly to the examples the reviewer gave for SauteRL, we demonstrated the following for our algorithm:
- It can learn optimal safe policies (Fig 4 page 7)
- It can work in undiscounted (Fig 4 page 7) and discounted settings (Fig 5 page 8)
- It can work with sparse rewards (Fig 4 page 7) and dense rewards (Fig 5 page 8)
- It can work in tabular (Fig 4 page 7) and function approximation settings (Fig 5 page 8)
- It can work across different levels of stochasticity (Fig 5 page 8, Table 2 page 26)
- It can work when safe and unsafe goals are terminal and non-terminal (Tables 3-7 pages 30-31)
- It can work with different implementations and hyperparameters of RL algorithms. E.g Safety Gym and Omnisafe implementations of TRPO.
- It can work with different RL algorithms. E.g on-policy (TRPO and PPO, Tables 3-7 pages 30-31) and off-policy (Q-Learning, Fig 4 page 7)
- It can solve some environments, where the average constrained and probability one constrained methods do not perform well (Table 2 page 26). That is, it achieves the lowest failure rate, highest success rate, and highest returns.
- It has the lowest failure rate in some environments (all tested environments in the main paper and ablations), where the average constrained and probability one constrained methods do not perform well
We hope this helps to resolve the reviewer's outstanding concerns.
This paper introduces Reward-Only Safe RL (ROSARL), a novel safe RL approach that focuses on learning appropriate penalties for unsafe states rather than using explicitly constructed constraints or human-designed penalties. The key technical contribution is the concept of a "Minmax penalty" - the smallest penalty for unsafe states that guarantees safe optimal policies regardless of task rewards. The authors derive theoretical upper and lower bounds on this penalty based on environment diameter and solvability. A practical algorithm is designed for estimating the Minmax penalty in a model-free RL setting. The authors empirically demonstrate that the resulting algorithm yields stronger and safer performance on grid world and PILLAR environments, compared to selected baseline methods like CPO and TRPO-Lagrangian.
Strengths
- The theoretical proofs are clearly and correctly stated.
- The motivation is clear and well-written.
- The proposed practical algorithm is simple (in a good way) and can be integrated with existing RL methods.
Weaknesses
- The practical algorithm uses a simplified estimate that does not explicitly consider solvability. Moreover, the algorithm leverages value function estimates as approximations without theoretical guarantees. More analysis for addressing the gap between the theoretically constructed objectives and the actual practical objective used would significantly improve the paper.
- There are two main issues with the experimental study. Firstly, only two environments are used for empirical evaluation of the proposed algorithm, and one of them is a grid world. Secondly, there is limited analysis of failure cases, which I believe might contain interesting insights.
- Some minor issues. It would be useful to include some crucial baseline comparisons, such as PID-Lagrangian (Stooke et al. 2020). Also, there is no analysis or ablation studies on the computational complexity of the proposed algorithm, hindering its practical applicability.
Questions
- Why does the practical algorithm work despite ignoring solvability? What are the theoretical guarantees?
- What happens when the environment has delayed consequences for unsafe actions rather than immediate terminal states?
We really appreciate the reviewer's time and effort spent reviewing our paper. We hope the reviewer’s concerns have been addressed below.
Why does the practical algorithm work despite ignoring solvability? What are the theoretical guarantees?
- In brief, it is because Algorithm 2 is only concerned with learning a sufficient penalty for the current task, since the theoretical results show that learning one for all tasks (i.e. the exact Minmax penalty) is guaranteed to be computationally inefficient.
- Our theoretical results showed that the penalty one should give for unsafe transitions should be inversely proportional to the solvability. This effect is what we aimed to capture by the calculations in Algorithm 2, as explained in lines 281-287.
- To investigate its effect, we ran Experiments 5.1 and 5.2. Both demonstrated that the learned penalty is indeed inversely proportional to the solvability, since the probability of reaching safe goals decreases with increasing stochasticity (Figures 4.b and 5.b). They also demonstrate that at least for a single task, the penalty learned in Algorithm 2 is sufficient to learn optimal safe policies. Figures 4.a and 4.b illustrate the actual policies that were learned.
- To the best of our knowledge, the empirical results we provided across our experiments (learned penalties, failure rates, success rates, length of policies, sample trajectories) are the most relevant for analysing the effect of omitting the solvability. We are happy to report any other empirical result the reviewer would like us to add.
- We also believe that investigating this omission theoretically is not trivial, and hence it is an interesting direction for future work that can leverage the theoretical foundations laid out by this work.
What happens when the environment has delayed consequences for unsafe actions rather than immediate terminal states?
- If the delayed consequences for unsafe actions are still Markov, then our theory still holds. The experimental results should also still demonstrate similar results since in the stochastic experiments (and also the safety-gym ones since the agent has velocity), choosing the wrong action earlier in the trajectory can irreversibly change the probability of the agent terminating. Hence, such transitions can also be considered unsafe transitions with a delayed consequence.
- Figure 21 (page 24) also demonstrates the effect of completely removing termination.
Only two environments are used for empirical evaluation of the proposed algorithm, whilst one of them is grid world. Secondly, there are limited analysis of failure cases.
- The 2 environments were chosen to have controlled experiments in the main paper, while the original submission also included more experiments in the Appendix in a number of more complex safety-gym environments (See Figure 13, page 19).
- If we went straight to function approximation without the gridworld experiment, it would be unclear if the results obtained are caused mainly by the penalty learning approach in Algorithm 2, or mainly by the use of function approximation, or a mixture of both. Hence it will not be a controlled experiment for investigating the ability of Algorithm 2 to learn optimal safe policies.
- To have a controlled experiment to investigate the effect of noise on the performance Algorithm 2 relative to the baselines, we used the PILLAR task because:
- It is similar to the safety-gym task used by Sootla et al. (2022) (See Figure 1.d in Sootla et al., See their description in page 16)
- It has the same high-dimensional continuous observation space, continuous action space, and continuous dynamics, as any of the other default safety-gym tasks.
- The control is noise=0, where we see that all the other algorithms (including ours) are able to fairly quickly converge to optimal (or near-optimal) safe policies.
- Without this experiment, if we observe poor performance of any of the algorithms in a more complex environment, it is unclear what is the cause (since there is no control for that).
- Beyond the quantitative results and sample trajectories we provided to analyse and better understand the success and failure cases, we are happy to include and discuss any additional ones the reviewer requests to maximise clarity and insight.
This paper addresses the difficulty of reward/cost engineering in Safe RL by proposing a minimax penalty. Particularly, it uses notions of environment diameter and solvability to bound on both sides and estimate the minimum penalty for unsafe states that leads to optimally safe policies independent of task rewards.
Strengths
- well-motivated
- extensive experimentation
- good theoretical justification
- good study of success/failure cases of proposed approach and baselines
Weaknesses
- consider adding up/down arrows to indicate if higher/lower values are good in the graphs (Average Returns(↑) and Failure Rate(↓))
- consider citing some works on feasibility/reachability in RL
Minor errors: -Definition 1 typo “sThen”
Questions
- what are the limitations of this approach?
We really appreciate the reviewers' positive outlook on our paper, and their time and effort spent in reviewing it.
Consider adding up/down arrows to indicate if higher/lower values are good in the graphs (Average Returns(↑) and Failure Rate(↓))
- Thank you to the reviewer for the suggestion. We have added the arrows to all the graphs.
Consider citing some works on feasibility/reachability in RL
- Thank you for the suggestion.
- We have added a brief description of CMDPs in the background, and cited Russell & Norvig (2016) and Ray et al. (2019) for it (since they are about learning feasible policies)
- We have cited Tasse et al. (2022) for reachability in temporal-logic RL---where there may or may not exist a policy that satisfies a task specification (i.e that reaches an accepting state).
- Since the suggestion is also relevant for the diameter in RL, we have also cited Auer et al. (2008). Their definition is similar to ours, except that we are maximising over deterministic proper policies while they are minimising over all deterministic policies.
- We are happy to add any other helpful citations.
What are the limitations of this approach?
- This approach is currently only applicable to safety critical tasks where no unsafe transition is tolerated, and hence the agent must try its best to solve given tasks while minimising the probability of those unsafe transitions. However, there are several other notions of safety in the literature, such as keeping the average total costs below a threshold, minimising a conditional value at risk (CVaR) measure, and also minimising chance constraints (Chow et al., 2017).
- This approach is currently unable to handle different degrees of safety. It may be desirable for an agent to accommodate different degrees of safety---for example "breaking a vase" is less unsafe than "hitting a baby". Our focus on scalar rewards at unsafe states may lead to a natural future extension here. Since the Minmax reward is the least negative reward that guarantees safety, we may assign weights to it corresponding to different levels of unsafety.
References:
- Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 1994. ISBN 0-13-103805-2.
- Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking Safe Exploration in Deep Reinforcement Learning. 2019.
- Geraud Nangue Tasse, Devon Jarvis, Steven James, and Benjamin Rosman. Skill machines: Temporal logic skill composition in reinforcement learning. In The Twelfth International Conference on Learning Representations, 2022.
- Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008.
- Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.
Thank you, I maintain my score.
We would like to thank all the reviewers for their time and effort spent reviewing our paper and providing useful feedback. A number of reviewers have expressed concerns about:
- Not knowing what our precise mathematical problem definition is.
- The lack of complex environments and lack of ablation experiments.
- The use of only 2 environments in the experiments in the main paper.
We believe these are extremely important points that are at the source of their other concerns, and hence merit a general response. We hope the following clarifications and additional experiments convincingly address all these concerns. We have also marked in red all the main changes in the main paper, and moved the related works section to the appendix for additional space (we also include a brief version in the main paper).
Our precise mathematical problem definition
- We have moved the safety definition to the background section to make our problem setting clearer in the background and given a precise mathematical problem formulation:
- Given the set of all safe policies, the goal of the agent in this work is to learn an optimal safe policy that maximises the value function over that set in all states.
- Hence, the optimal safe policy according to our definition is the one that maximises rewards only over safe policies, similarly to prior works.
- Hence, it is optimal to sacrifice returns by taking the longer paths to the goal, if that ensures the policy is safer. This is especially true in stochastic environments. It is also especially true in environments with dense rewards like safety-gym where the rewards incentivise the agent to go straight to the goal ignoring all the hazards along the way. Also note how when the environment is not noisy, the policy that maximises rewards is also a safe policy. Figures 4 and 5 are perfect examples of these, and demonstrate how TRPO Minmax indeed achieves that desired behaviour.
The lack of complex environments and lack of ablation experiments
- The original submission included ablations in 4 complex safety-gym environments (see Figures 13-22 in pages 19-25). We investigated the effect of removing safe goals termination on the performance of all the algorithms, and the effect of different cost thresholds on the baselines.
- We have now also added several experiments using Safety-Gymnasium (as requested by Reviewer SFyW) and with additional baselines using OmniSafe (as requested by Reviewer PeFw). Instead of PID-Lagrangian mentioned by Reviewer PeFw, which is an older algorithm, we included P3O. Precisely:
- We evaluate the performance of trained models in the Pillar domain (see Table 2 in page 26). We also include the training curves. We observe that only TPRO-Minmax prioritises minimising the probability of unsafe transitions, consistently achieving the lowest cost while trading off the rewards. Interestingly, by using Algorithm 2 with OmniSafe’s implementation of TRPO, TPRO-Minmax achieves the lowest cost, highest success rate, and highest returns across all noise levels.
- We also investigate the effect of termination on the performance of trained models in 2 Safety-Gymnasium default environments (PointGoal1 and PointPush1). See Tables 3-7 in pages 30-31. We also include the training curves. Consistent with previous results, we observe that our approach consistently achieves the lowest cost while trading off the rewards.
- We are also happy to run any additional ablation and any of the baselines included in OmniSafe that the reviewers believe is necessary, time permitting.
The use of only 2 environments in the experiments in the main paper
- Due to space constraints, the 2 environments were chosen to have controlled experiments in the main paper, while the original submission also included more experiments in the Appendix in 4 more complex safety-gym environments (see Figures 13-22 in pages 19-25).
- If we went straight to function approximation without the gridworld experiment, it would be unclear if the results obtained are caused mainly by the penalty learning approach in Algorithm 2, or mainly by the use of function approximation, or a mixture of both. Hence it will not be a controlled experiment for investigating the ability of Algorithm 2 to learn optimal safe policies.
- To have a controlled experiment to investigate the effect of noise on the performance of Algorithm 2 relative to the baselines, we used the PILLAR task because:
- It is similar to the safety-gym task used by Sootla et al. (2022) (See Figure 1.d in Sootla et al., See their description in page 16)
- It has the same high-dimensional continuous observation space, continuous action space, and continuous dynamics, as any of the other default safety-gym tasks.
- The control is noise=0, where we see that all the other algorithms (including ours) are able to fairly quickly converge to optimal (or near-optimal) safe policies.
- Without this experiment, if we observe poor performance of any of the algorithms in a more complex environment, it is unclear what is the cause (since there is no control for that).
- We have now also included an additional experiment in the Appendix where we evaluate the performance of the trained models in the Pillar domain (see Table 1 in page 17). We are happy to do the same for all the other experiments if desired, time permitting.
All these additional experiments show that our approach consistently prioritises minimising cost, consistently achieving the lowest cost while only sacrificing rewards when the environment is too complex or noisy.
The paper proposes a safe RL algorithm, Reward-Only Safe RL (ROSARL), that learns penalties for unsafe states (behaviors) rather than using predefined penalties. The main technicality is the concept of the minmax penalty, i.e., the smallest penalty for unsafe states that guarantees safe optimal policies (independent of task rewards). The authors derive theoretical bounds on this penalty based on environment diameter and solvability. They propose a method for estimating the minmax penalty in a model-free RL setting. Finally, they empirically evaluate their algorithm and compare it with CPO and TRPO-Lagrangian.
(+) The paper addresses an important challenge with a novel idea.
(+) The motivation behind the work is well-explained. The paper is well-written, although reviewers think the algorithm could be explained better and parts of the paper, especially in the experimental section, are not written rigorously.
(+) The paper has a good balance of theory and algorithmic implementation.
(-) The reviewers found the scope of the experiments limited and some conclusions from them slightly misleading. They also feel comparison to newer baselines would be necessary. Overall, they did not find the experimental results compelling.
(-) The practical algorithm is obtained using a rough estimate from the theoretical results. There is not much in the paper to identify the gap between the theoretically constructed objectives and the one used by the practical algorithm.
(-) The reviewers are not satisfied with the way that the algorithm is explained. Moreover, they found parts of the paper, especially in the experimental section, not written rigorously enough.
I see this as a borderline paper. Thus, I strongly recommend that the authors revise their work using the reviewers' comments and prepare it for an upcoming venue.
Additional Comments from Reviewer Discussion
Although the authors answered some of the reviewers' questions and clarified some issues, it seems they still have concerns that remained unaddressed. Some of these issues were listed in the meta-review.
Reject