Safety-Polarized and Prioritized Reinforcement Learning
Abstract
Reviews and Discussion
Main findings:
- The paper introduces MAXSAFE, a chance-constrained bi-level optimization framework for safe reinforcement learning that aims for hard-constraint satisfaction and near-zero costs in sparse-cost settings. In particular, MAXSAFE minimizes the unsafe probability and then maximizes the return among safe policies.
Main results:
- The paper tests the algorithm on diverse autonomous driving and safe control tasks. MAXSAFE achieves a comparable reward-safety trade-off on both benchmarks.
Algorithmic ideas:
- The algorithm in this paper combines Q-learning, a reachability estimation function (REF), safety polarization, and prioritized experience replay (PER). In particular, the paper extends the REF to a state-action REF (SA-REF) and constructs optimal action masks based on a polarization function, removing the need for the gating operator. It also uses the TD error of the REF to drive PER, addressing the sparse-cost setting.
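To make the masking idea above concrete, here is a minimal illustrative sketch of greedy action selection under a state-action safety mask. It is not the paper's actual implementation; `q_values`, `ref_values`, and `threshold` are hypothetical names, and the paper argues against a single fixed threshold, learning optimal masks instead.

```python
import numpy as np

def masked_greedy_action(q_values, ref_values, threshold):
    """Pick the highest-Q action among those whose estimated
    state-action unsafe probability (SA-REF) is below a threshold.

    q_values, ref_values: 1-D arrays over the discrete action set.
    threshold: maximum tolerated unsafe probability (hypothetical knob).
    """
    safe = ref_values <= threshold
    if not safe.any():
        # Fall back to the least-unsafe action if every action is masked.
        return int(np.argmin(ref_values))
    masked_q = np.where(safe, q_values, -np.inf)
    return int(np.argmax(masked_q))

# Toy usage with made-up numbers.
q = np.array([1.0, 2.5, 2.0])
ref = np.array([0.05, 0.60, 0.10])
print(masked_greedy_action(q, ref, threshold=0.2))  # -> 2
```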
Questions for Authors
Q1: How does the off-policy setting affect the theoretical analysis, given that the analysis follows a single policy π while off-policy RL has distinct behavior and target policies? (RESPPO is an on-policy algorithm, so it does not have this problem.)
Q2: What is the purpose of introducing the gating operator in Section 5.1, only to discard it in Equation (17)?
Claims and Evidence
This paper identifies (1) highly consequential costs and (2) sparse cost settings as challenges. I agree that claim (1) is important; it is equivalent to the commonly studied hard-constraint problem. However, I am confused about why sparse cost is a challenge. For rewards, we study sparse-reward settings because we want to maximize the cumulative reward and sparsity means the environment provides few useful reward signals. For costs, a sparse cost setting indicates that most cases are safe, which enlarges the feasible region of safety. In my view, this setting actually simplifies the problem.
- Full cost -> no feasible region; dense cost -> small feasible region; sparse cost -> large feasible region; no cost -> full feasible region (standard RL).
Methods and Evaluation Criteria
- Yes, the proposed methods and evaluation criteria make sense for the problem. However, the benchmarks are somewhat dated. For autonomous driving, reference [1] surveys open-source simulators; highway-env is categorized there as a "driving policy simulator", but many newer simulators from 2020 to 2023 are also listed.
- [1] Li, Y., Yuan, W., Zhang, S., Yan, W., Shen, Q., Wang, C., & Yang, M. (2024). Choose your simulator wisely: A review on open-source simulators for autonomous driving. IEEE Transactions on Intelligent Vehicles. https://ieeexplore.ieee.org/abstract/document/10461065.
Theoretical Claims
In Appendix B.1, |max Q_1 - max Q_2| ≤ max |Q_1 - Q_2| is not obvious. I think it is better to give more details or hints.
Experimental Design and Analysis
The baselines used in the experimental design are not fully sound/valid. This paper makes claims about sparse-cost, highly consequential (safety-critical) tasks. For this kind of task, many hard-constraint methods have been proposed in safety-critical RL. However, most of the baselines in this paper (except for RESPPO in the Appendix) are soft-constraint RL methods, whose objective is to satisfy the constraint in expectation rather than as a hard constraint. I think this paper should add some hard-constraint RL methods (such as [1] and [2]) for comparison.
[1] Yang, Y., Jiang, Y., Liu, Y., Chen, J., & Li, S. E. (2023). Model-free safe reinforcement learning through neural barrier certificate. IEEE Robotics and Automation Letters, 8(3), 1295-1302. Code link: https://github.com/jjyyxx/srlnbc
[2] Zhao, W., He, T., & Liu, C. (2023, June). Probabilistic safeguard for reinforcement learning using safety index guided gaussian process models. In Learning for Dynamics and Control Conference (pp. 783-796). PMLR.
Supplementary Material
Yes. Appendix A shows very well why a state-agnostic masking threshold is not enough. Appendix C explains all details of the experiments and implementations.
Relation to Prior Literature
- This paper introduces CMDP-based and action-correction-based safe RL, which are the two types of baselines in the experiments. In addition, action masking for safe RL is introduced as the foundation for the proposed method.
- Prioritized experience replay: this paper introduces PER, which is the foundation for the proposed safety PER.
Missing Essential References
Yes, there are many related works that are essential for this paper but are not cited or discussed.
- Hamilton-Jacobi reachability and the reachability estimation function (REF) in safe RL. Although this is the most important contribution of the paper, these concepts are not introduced or discussed until SA-REF is proposed. Relevant literature includes, but is not limited to, [1], [2], [3], [4], and [5]. It would be better to discuss this topic in the related work and then explain the concept in the Preliminaries.
[1] Ganai, M., Gao, S., & Herbert, S. (2024). Hamilton-jacobi reachability in reinforcement learning: A survey. IEEE Open Journal of Control Systems.
[2] Wang, Y., & Zhu, H. (2024, July). Safe Exploration in Reinforcement Learning by Reachability Analysis over Learned Models. In International Conference on Computer Aided Verification (pp. 232-255). Cham: Springer Nature Switzerland.
[3] Zhu, K., Lan, F., Zhao, W., & Zhang, T. (2024). Safe Multi-Agent Reinforcement Learning via Approximate Hamilton-Jacobi Reachability. Journal of Intelligent & Robotic Systems, 111(1), 7.
[4] Dong, Y., Zhao, X., Wang, S., & Huang, X. (2024). Reachability Verification Based Reliability Assessment for Deep Reinforcement Learning Controlled Robotics and Autonomous Systems. IEEE Robotics and Automation Letters, 9(4), 3299-3306.
[5] Zheng, Y., Li, J., Yu, D., Yang, Y., Li, S. E., Zhan, X., & Liu, J. (2024). Safe offline reinforcement learning with feasibility-guided diffusion model. arXiv preprint arXiv:2401.10700.
- Hard-constraint RL: by now, many works have studied hard-constraint RL (also called safety-critical RL, near-zero-constraint RL, or persistently safe RL):
[1] Yang, Y., Jiang, Y., Liu, Y., Chen, J., & Li, S. E. (2023). Model-free safe reinforcement learning through neural barrier certificate. IEEE Robotics and Automation Letters, 8(3), 1295-1302.
[2] Zhao, W., He, T., Li, F., & Liu, C. (2024). Implicit Safe Set Algorithm for Provably Safe Reinforcement Learning. arXiv preprint arXiv:2405.02754.
[3] Suttle, W., Sharma, V. K., Kosaraju, K. C., Seetharaman, S., Liu, J., Gupta, V., & Sadler, B. M. (2024, April). Sampling-based safe reinforcement learning for nonlinear dynamical systems. In International Conference on Artificial Intelligence and Statistics (pp. 4420-4428). PMLR.
[4] Zhao, W., He, T., & Liu, C. (2023, June). Probabilistic safeguard for reinforcement learning using safety index guided gaussian process models. In Learning for Dynamics and Control Conference (pp. 783-796). PMLR.
[5] Wei, H., Liu, X., & Ying, L. (2024, March). Safe reinforcement learning with instantaneous constraints: the role of aggressive exploration. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 19, pp. 21708-21716).
[6] Gu, S., Sel, B., Ding, Y., Wang, L., Lin, Q., Knoll, A., & Jin, M. (2025). Safe and balanced: A framework for constrained multi-objective reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[7] Tan, D. C., McCarthy, R., Acero, F., Delfaki, A. M., Li, Z., & Kanoulas, D. (2024, August). Safe Value Functions: Learned Critics as Hard Safety Constraints. In 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE) (pp. 2441-2448). IEEE.
Other Strengths and Weaknesses
- Strengths
- It extends REF to a state-action formulation, which provides a finer-grained signal for reinforcement learning.
- The theoretical proofs are detailed.
- The appendix is very detailed.
- Weaknesses
- Although the theoretical proofs are detailed, the derivation process lacks explanation.
- The sparse cost setting is not convincing. A sparse cost indicates that most cases are safe, which enlarges the feasible region of safety; this setting actually simplifies the problem.
Other Comments or Suggestions
- Some key terms, such as "maximal safety", "safety polarization", and "REF", are not explained, or at least not explained clearly.
- It would be better to add a section on "Hamilton-Jacobi reachability and the reachability estimation function (REF) in safe RL" to the related work.
- Add a section about hard-constraint safe RL to the related work.
- Explain the formulation of reachability and the REF in the Preliminaries.
- The spacing in lines 101-102 on page 2 is incorrect.
We would like to sincerely thank the reviewer for the thorough reading of our paper. We address your valuable questions in the following responses.
Q1: Sparse cost challenge
In our setup, episodes terminate immediately upon safety violations, as we treat safety as a hard constraint. The learner receives no cost feedback until a violation occurs, resulting in at most one non-zero cost per episode—what we refer to as sparse safety cost. This sparsity makes it difficult to learn safety estimations, as the agent must infer early indicators of unsafe outcomes without intermediate cost signals. These challenges motivate our use of PER, which prioritizes unsafe transitions to improve policy safety.
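As a rough illustration of how such prioritization could be set up (a sketch only, not the authors' exact scheme; `ref_td_errors`, `alpha`, and `beta` are assumed names and hyperparameters):

```python
import numpy as np

def per_sample_probs(ref_td_errors, alpha=0.6, eps=1e-4):
    """Proportional prioritization: transitions whose safety-critic (REF)
    TD error is large -- typically the rare unsafe ones -- are replayed
    more often, counteracting the sparsity of cost signals in the buffer."""
    priorities = (np.abs(ref_td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def per_importance_weights(probs, beta=0.4):
    """Importance-sampling weights that correct the bias introduced
    by non-uniform sampling (standard PER correction)."""
    n = len(probs)
    w = (n * probs) ** (-beta)
    return w / w.max()

# Toy usage: one large safety TD error among mostly-safe transitions.
td = np.array([0.01, 0.02, 0.9, 0.01])
p = per_sample_probs(td)
idx = np.random.choice(len(td), size=2, p=p)
weights = per_importance_weights(p)[idx]
```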
Q2: Question about proving the γ-contraction
A brief proof here: https://anonymous.4open.science/r/rebuttal-6740/proof_details.png.
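For readers who want the argument inline, the inequality follows from a standard derivation (a generic argument, which may differ from the linked note): for every action $a$, $Q_1(a) = Q_2(a) + (Q_1(a) - Q_2(a)) \le \max_{a'} Q_2(a') + \max_{a'} |Q_1(a') - Q_2(a')|$. Taking the maximum over $a$, and then swapping the roles of $Q_1$ and $Q_2$, gives

```latex
\begin{align*}
\max_a Q_1(a) - \max_a Q_2(a) &\le \max_a \bigl|Q_1(a) - Q_2(a)\bigr|, \\
\max_a Q_2(a) - \max_a Q_1(a) &\le \max_a \bigl|Q_1(a) - Q_2(a)\bigr|, \\
\text{hence}\quad
\Bigl|\max_a Q_1(a) - \max_a Q_2(a)\Bigr| &\le \max_a \bigl|Q_1(a) - Q_2(a)\bigr|.
\end{align*}
```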
Q3: Two more hard-constraint baselines
We conduct experiments with PPOBarrier on highway benchmarks and find it performs conservatively, with consistently low rewards across all tasks. In contrast, our SPOM and SPOM_PER achieve a better tradeoff. Detailed SWU scores, crash rates, and episode rewards are provided in https://anonymous.4open.science/r/rebuttal-6740/ppo_barrier_table.png, along with training curves in https://anonymous.4open.science/r/rebuttal-6740/ppo_barrier_training_curve.png. As for the second referenced work, its setting assumes access to analytical violation functions, while our SA-REF is learned without any domain-specific knowledge.
Q4: Literature on HJ reachability and REF
We will include all relevant literature on HJ reachability and REF in a new section of our Related Works. Briefly, early works employed HJ reachability value functions to assess state feasibility, relying on known system dynamics and numerical methods. Some studies approximate unknown dynamics using Gaussian Processes (Zhang, W. et al. 2023) or symbolic regression (Wang, Y. et al. 2024). Once reachability is computed, the state space is partitioned into feasible/infeasible regions to guide policy optimization (Zheng, Y. et al. 2024). Other works focus on formal safety verification of DRL systems (Dong, Y. et al. 2024). However, HJ-based value functions are not well-suited for stochastic MDPs during RL training. To address this, REF estimates unsafe probabilities via backward reduction (Ganai, M. et al. 2024). Our work extends REF to the state-action level, enabling state-dependent action masking to reduce safety violations.
Q5: Literature on hard-constraint safe RL
Thanks for the suggestions. We indeed discuss hard-constrained safe RL in our related works. Based on policy optimization strategies, we categorize prior works into: action correction and action masking. Action correction modifies actions after they are proposed (e.g., via projection), while action masking alters the action distribution directly. The papers mentioned by the reviewer largely fit within this framework and will be incorporated into our Related Works. For instance, Tan, D. et al. 2024 uses value functions as control barrier functions in a shielding setup, aligning with action correction. Yang, Y. et al. 2023 learns certificates that softly penalize unsafe actions, guiding policies toward safer regions—similar in spirit to action masking. Suttle, W. et al. 2024 introduces truncated sampling from state-dependent safe action sets, which we view as a general form of action masking. We will include all the literature in our paper for completeness.
Q6: How does the “off-policy” affect the theoretical analysis?
We assume that actions are sampled from a behavior policy in Section 4.2, which is a common assumption in Q-learning to ensure sufficient exploration of the state-action space. Our theoretical analysis follows the classical Q-learning framework: we prove the γ-contraction property of our operator and demonstrate convergence in tabular MDPs. This directly parallels traditional Q-learning theory.
Q7: Gating operator
The gating operator in Section 5.1 represents our hard masking rule. In Theorem 4.4, we prove its convergence in tabular MDPs. When extending this method to deep Q-learning, we found that using Equation (17) as the objective yields better empirical performance. We report the ablation results in Section 6.3, where ‘OAM’ refers to Optimal Action Masking, alongside other polarization function choices.
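As an illustration of the difference (a hypothetical sketch, not the paper's Equation (17)): hard gating assigns masked actions a value of negative infinity, whereas a polarization-style objective pushes unsafe actions toward a very low but finite value, which behaves better with function approximation. The sigmoid penalty form and its hyperparameters below are assumed stand-ins.

```python
import numpy as np

def hard_gated_q(q_values, ref_values, is_unsafe):
    """Hard gating: masked (unsafe) actions receive -inf. This is fine for
    tabular analysis but is awkward as a deep-learning target."""
    return np.where(is_unsafe(ref_values), -np.inf, q_values)

def polarized_q(q_values, ref_values, thresh=0.5, sharpness=20.0, big=100.0):
    """A smooth polarization-style stand-in (illustrative assumption):
    a steep sigmoid pushes the penalty toward 0 for low-REF actions and
    toward a large finite value for high-REF actions, avoiding -inf."""
    gate = 1.0 / (1.0 + np.exp(-sharpness * (ref_values - thresh)))
    return q_values - big * gate

q = np.array([2.0, 2.5, 1.0])
ref = np.array([0.05, 0.70, 0.10])
print(hard_gated_q(q, ref, is_unsafe=lambda r: r > 0.5))  # unsafe action -> -inf
print(polarized_q(q, ref))  # unsafe action -> very low, but finite
```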
Q8: Key terms
• Maximal safety: a policy that achieves the lowest probability of safety violations.
• Safety polarization: the use of our polarization function to mask actions with relatively high SA-REF values.
• REF: the probability that the current state reaches an unsafe region under the current policy.
We hope our responses have addressed your concerns and we welcome any further suggestions to improve the paper. If you find our clarifications satisfactory, we would greatly appreciate it if you could consider increasing your score accordingly.
Thank you for answering my questions. While some concerns have been addressed, the following ones remain:
Q1: I agree with treating hard-constraint safety as "episodes terminate immediately upon safety violations". Do you mean that the cost is not sparse in the environment but is sparse in the replay buffer due to the immediate termination?
Q2: I listed the hard-constraint RL methods because such methods should be selected as baselines, since your claim concerns hard constraints. However, you picked mostly soft-constraint RL methods as baselines. I think this is a fatal issue of this paper: you study field A but select field B for the baselines.
Q3: The benchmarks are from 5-8 years ago. I think this could be OK because benchmarks may not update rapidly (although there is a survey of open-source simulators, as noted in my comment). However, the selected baselines are also about five years old (before 2020, except for RESPPO). I am curious why methods from 2020-2024 were not chosen for comparison.
Q4: Besides, it is confusing that the hard-constraint baselines (such as RESPPO, PPO-barrier) cannot outperform soft-constraint baselines in hard-constraint settings. I am wondering if there is any reasonable justification.
Thank you for your reply. We are happy to address the questions you raise.
Q1: Sparse cost challenge
If we were to force a distinction between whether the cost signal lies in the environment or in the replay buffer, we would argue that it is sparse in the replay buffer. This means that within the training data used by our algorithm, unsafe transition pairs are significantly less frequent than safe ones, making it difficult to accurately estimate the current safety situation using temporal difference learning.
Q2: Baselines about hard-constraint safe-RL
First, we clarify that the term hard-constraint in our work primarily refers to the evaluation setting, where an episode is immediately terminated upon any safety violation. At the algorithmic level, all safe RL methods inherently aim to balance the trade-off between reward and safety. In other words, the use of hard constraints in evaluation does not invalidate comparisons with techniques such as reward shaping or PPOLag. In fact, the PPOBarrier baseline could also be considered a variant of the PPOLag method, differing only in the way it learns a barrier certificate. Here, we want to emphasize that the distinction between hard-constraint and soft-constraint methods at the algorithmic level is often blurred in practice, making it difficult to draw rigid boundaries when selecting baselines. From our point of view, the key difference lies in how the control policy is optimized safely. We categorize existing approaches into action masking and action correction, as previously clarified in our response.
In addition, we have carefully reviewed all the hard-constraint RL literature referenced. Nearly half of these works are grounded in a control-theoretic perspective, where the safety specification, such as those used in Zhao, W. et al. 2024 and Zhao, W. et al. 2023 or in Suttle, W. et al. 2024, is manually defined for each individual task. This is notably different from our setting, where we do not assume access to any domain-specific analytical safety functions. Instead, we only rely on cost signals obtained from the simulator, which aligns closely with the standard RL paradigm.
Regarding the paper Wei, H. et al. 2024, we argue that its setting falls outside the scope of deep RL. It assumes a linear expression of the Q-function and evaluates on a simple Frozen Lake environment, where both the state and action spaces are finite. The work Tan, D. C. et al. 2024 is indeed equivalent to SafeQ mentioned in our paper. Both approaches use binary costs to learn a safety value function and adopt rejection sampling to enforce constraints—an idea that can be interpreted as a form of action masking. In our work, we demonstrate that simply truncating actions based on a fixed masking threshold across all state-action pairs is insufficient under our MAXSAFE framework, as illustrated by the motivating example in Appendix A. Finally, the paper Gu, S. et al. 2024 focuses on managing multiple objectives. We argue that their setting still relies on predefined cost budgets and does not fall under the category of hard-constraint methods, if one insists on making such a distinction.
To conclude, we have provided a thorough explanation for why we do not include the aforementioned works as baselines in our study. Nevertheless, we warmly welcome suggestions for additional baselines that are suitable for our problem setting, and we are willing to reproduce their results for a fair comparison.
Q3: Why are the hard-constraint baselines not as good as the soft-constraint baselines?
PPOBarrier and RESPO have been previously implemented in Safety-Gymnasium. However, their inclusion in our safety-critical benchmarks does not necessarily imply that they outperform simpler approaches such as reward shaping. There may be multiple reasons for this. One possible explanation is that directly applying a Lagrangian method may not be an effective way to manage the reward-safety trade-off in our benchmarks. Additionally, the learning of the barrier function in PPOBarrier may be suboptimal due to the sparse cost setting.
We truly appreciate the reviewer’s constructive feedback. If you find our clarifications satisfactory, we would greatly appreciate it if you could consider increasing your score accordingly.
The paper proposes MAXSAFE, a safe RL algorithm that aims to maximize return while minimizing (reducing to near zero) the probability of visiting an unsafe state. MAXSAFE builds on the Q-learning algorithm and is thus applicable to discrete-action MDP problems. The major contribution of this paper is that it lays out the proposed process for masking unsafe actions using the safety Q-critic and provides a theoretical guarantee in the tabular MDP setting.
Questions for Authors
Please refer to earlier sections.
Claims and Evidence
- One assumption is that the policy space is sufficiently large that the unsafe probability can be minimized to near zero. Although the paper cites a prior work that also makes this assumption, I would argue that this is not generally applicable to most safe RL problems, where the transition dynamics can be stochastic and it might therefore not be feasible to reduce the probability of visiting unsafe states to near zero.
- The proposed method is limited to the discrete-action setting, but most safe RL problems and benchmarks handle continuous actions in general.
Methods and Evaluation Criteria
- The CMDP-based RL baseline selected is based on RCPO (Tessler et al., 2019), which (a) is an on-policy algorithm and (b) dynamically adjusts the safety-reward trade-off coefficient based on the specified level of safety. Neither feature is present in MAXSAFE, which is an off-policy algorithm and aims to absolutely minimize the probability of visiting unsafe states. This makes me think the baseline selection might not be valid, considering that an on-policy algorithm may need more samples to converge and RCPO also needs to learn the optimal trade-off. Perhaps an SAC-Lagrangian baseline (modified to handle discrete actions) with different trade-off settings would be a better baseline?
- I have some doubt about whether condensing both the safety and reward performance figures into the single evaluation metric SWU is useful. It seems to overly simplify the trade-off into one number without any statistical confidence value. Perhaps the more conventional approach of reporting reward and safety separately, with both mean and standard deviation, would be better.
Theoretical Claims
The theoretical part of the main paper mainly covers the steps to derive and learn the optimal action masks, which has some resemblance to the cited baseline SafeQ (Srinivasan et al., 2020). The steps look sound; note that I did not carefully check the detailed step-by-step proof in the appendix.
Experimental Design and Analysis
Most of the safe RL literature uses Safety Gym / Safety-Gymnasium or Bullet-Safety-Gym for experiments, and the environments chosen in this paper (highway-env and ACC) are a little unfamiliar to me.
Perhaps running experiments in these well-tested benchmark domains would make the results easier for reviewers to assess.
Supplementary Material
The supplementary material includes additional explanation, theoretical proofs, implementation details and experiment domain descriptions. Code is not submitted.
As mentioned earlier, do note that I did not carefully check the detailed step-by-step proof in the appendix.
Relation to Prior Literature
MAXSAFE is relevant to the safe RL literature as it provides an algorithm that can learn a safe policy while optimizing reward, although it has some inherent restrictions that might limit its applicability to the general safe RL setting.
Missing Essential References
NA
Other Strengths and Weaknesses
Using prioritized experience replay is an interesting strength of MAXSAFE, as it enhances its capability to handle the sparse, catastrophic safety-violation setting.
Other Comments or Suggestions
NA
We would like to sincerely thank the reviewer for the valuable feedback of our work.
Q1: The soundness of our assumption that there exists a sufficiently large policy space with minimal unsafe probability.
We assume that in the environments we consider, there exists a sufficiently large policy space in which the probability of unsafe events is minimal. Such environments are common in autonomous driving and robotic control. For instance, there are many feasible driving policies that allow a vehicle to operate safely on the road without collisions, as well as a sufficiently large set of policies enabling a robot to navigate without falling or colliding with obstacles—indicating that a large policy space with minimal unsafe probability indeed exists in these environments. This assumption aligns well with many practical scenarios. Building on this assumption, our MAXSAFE policy selects from within this space and aims to optimize reward performance while ensuring safety.
Q2: The proposed method is limited to discrete action spaces.
Our action masking method is currently only suitable for RL tasks with discrete action spaces, as simple action masking strategies may not perform well in continuous settings. We view this as an interesting direction for future work, which would require more advanced techniques to better optimize the trade-off between safety and performance.
Nevertheless, the proposed safety Prioritized Experience Replay (safety PER) is a general approach and can also be applied to the learning of the cost critic. To validate this, we run experiments on two environments—SafetyCarCircle and SafetyPointGoal—from the Omnisafe benchmark, https://github.com/PKU-Alignment/omnisafe. The results suggest that the simple SAC-Lag algorithm can benefit from being combined with safety PER to achieve better performance. We use 6 random seeds for evaluation, and we provide both the training curves and the mean/std of episode return and episode cost. The results are available at: https://anonymous.4open.science/r/per_test/results.png, https://anonymous.4open.science/r/per_test/training_curve.png.
Q3: Is a SAC-Lagrangian baseline (modified to handle discrete actions) with different trade-off settings a better baseline?
We include a variety of off-policy baseline RL algorithms. Since our tasks involve discrete action spaces, we adopt DQN-based off-policy methods. For example, our baseline RCDQN is an off-policy algorithm, and all algorithms with names ending in “DQN” are off-policy methods designed for discrete action settings.
Q4: Usefulness of the SWU score.
We adopt the evaluation setting from previous literature to demonstrate the trade-off between safety and reward. In our paper, we also report the training results for episode reward and crash rate in Table 5.
Q5: Choice of benchmarks
Regarding the concern about why we do not use Safety-Gym as our benchmark suite, the primary reason is that our setting fundamentally differs from that of Safety-Gym. In our framework, safety is treated as the highest priority. Specifically, we assume that an episode terminates immediately upon the execution of any unsafe action (i.e., entering an unsafe region). In contrast, Safety-Gym employs a scalar cost signal and allows temporary safety violations as long as the expected cumulative cost remains within a predefined budget. For example, the “circle” environment in Safety-Gym’s safe navigation task is conceptually similar to the “circle” environment in our benchmarks. However, the underlying assumptions differ: in Safety-Gym, safety is treated as a constraint, and episodes continue despite violations, provided the overall cost remains within budget. In contrast, our formulation enforces strict safety—any violation leads to immediate episode termination. This setting better reflects real-world scenarios where safety is non-negotiable and must take precedence over performance. A formal comparison of the two formulations is provided in the Preliminaries section of our paper. We refer the reviewer to that section for detailed definitions. Due to our strict safety assumption, the cost signal in our setting only appears at the end of an episode, resulting in significantly sparser cost feedback. To address this challenge and improve learning efficiency, we introduce Prioritized Experience Replay (PER) in our SPOM_PER method. This design is motivated by the need to balance reward maximization with strong safety guarantees.
We hope our responses have addressed your concerns and we welcome any further suggestions to improve the paper. If you find our clarifications satisfactory, we would greatly appreciate it if you could consider increasing your score accordingly.
I thank the authors for the detailed response and it clarified some of the points I raised. I do have some comments remaining:
- While I noted that prioritized experience replay can be applied to learning a cost critic in the continuous-action setting, I understand this is not the core part of this paper, and SPOM is still targeted at discrete-action domains. That said, I note that the authors mentioned this as part of interesting future work.
- I understand that a number of the baselines are DQN-based. I specifically mentioned SAC-Lag because it can be applied to the discrete-action setting and might explore the safety space better than DQN (epsilon-greedy), due to the entropy term built into the objective. Maybe SAC-Lag or another SAC-based baseline would be helpful in this case?
- In Table 5, it would be helpful to also list the statistical deviation (e.g., standard deviation).
- For the Safety-Gymnasium comparison, I understand that the safety setting is different. However, I think the Safety-Gymnasium environments can still be lightly extended and used, with an episode terminated after incurring a cost.
- It would be great for assessing reproducibility if the code were shared.
Thank you for your reply. We are happy to address the questions you raise.
Q1: SAC-based baseline
We implement the SAC-Lag baseline on four autonomous driving benchmarks. Since our task involves a discrete action space, we follow the setup of Soft Actor-Critic for Discrete Action Settings (https://arxiv.org/abs/1910.07207) to compute the TD target and policy loss. We employ a double Q-network for the reward critic and a single network for the cost critic, and apply automatic tuning, as is standard in SAC. To align with our hard-constraint cost violation setting, we set the cost limit to zero, using it as the Lagrangian penalty term in the actor loss, consistent with the PPO-Lag baseline. The results (episode rewards, crash rates and SWU scores) are available at the following link: https://anonymous.4open.science/r/per_test/SACLag.png.
The results show that our SPOM and SPOM_PER achieve higher SWU scores than the SAC-Lag baseline. The entropy-based exploration strategy adopted by the SAC-Lag baseline leads to conservative behavior across our benchmarks, meaning it is suboptimal and no better than an epsilon-greedy exploration policy in our hard-constraint, sparse-cost setting. However, we believe that identifying the optimal exploration strategy under sparse cost conditions is orthogonal to our main contribution and represents an interesting direction for future research.
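For reference, a minimal sketch of the discrete-action SAC-Lagrangian actor loss described above (an illustration consistent with this description, not our exact code; tensor names such as `q_reward_min`, `q_cost`, and `lagrange_mult` are assumed):

```python
import torch
import torch.nn.functional as F

def discrete_sac_lag_actor_loss(probs, log_probs, q_reward_min, q_cost,
                                alpha, lagrange_mult):
    """Actor loss for discrete-action SAC with a Lagrangian cost penalty.

    probs, log_probs:  [batch, n_actions] policy distribution and its log.
    q_reward_min:      [batch, n_actions] min over the double reward critics.
    q_cost:            [batch, n_actions] cost critic.
    alpha:             entropy temperature (auto-tuned elsewhere).
    lagrange_mult:     Lagrange multiplier; with a zero cost limit it simply
                       scales the expected cost of the policy.
    """
    # The expectation over actions is computed exactly in the discrete case.
    penalized_q = q_reward_min - lagrange_mult * q_cost
    loss = (probs * (alpha * log_probs - penalized_q)).sum(dim=-1).mean()
    return loss

# Toy usage with random tensors.
batch, n_actions = 4, 5
logits = torch.randn(batch, n_actions)
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)
qr = torch.randn(batch, n_actions)
qc = torch.rand(batch, n_actions)
print(discrete_sac_lag_actor_loss(probs, log_probs, qr, qc,
                                  alpha=0.2, lagrange_mult=5.0))
```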
Q2: Add standard deviation results
Thank you for your valuable reminder. We also believe that including standard deviations provides a more rigorous and comprehensive reflection of the experimental results. All standard deviation results have now been included in the table, available at the following link: https://anonymous.4open.science/r/per_test/mean_std_results.png. Table 5 will be updated accordingly in the next version of the paper.
Q3: Safety-Gymnasium
Thank you for your suggestion. We did not initially consider directly using Safety-Gymnasium, as our evaluation setting involves hard constraints and many tasks in Safety-Gymnasium are designed for continuous action spaces. Given the current time constraints, adopting Safety-Gymnasium would require us not only to rewrite the benchmark interfaces but also to re-implement all of our existing DQN-based baselines for the new environments, which demands substantial time and effort. Nevertheless, we find the reviewer's suggestion very insightful and consider re-implementing our methods to be compatible with Safety-Gymnasium as a direction for future work.
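For completeness, the kind of light extension the reviewer suggests could look like the following hypothetical Gymnasium-style wrapper. It assumes a standard five-tuple step with the cost reported in `info["cost"]`; Safety-Gymnasium's native step API typically returns the cost as a separate element, so the unpacking would need to be adapted there.

```python
import gymnasium as gym

class TerminateOnCost(gym.Wrapper):
    """Hypothetical wrapper: end the episode as soon as any cost is incurred,
    turning a budgeted-cost task into a hard-constraint one."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Assumption: the per-step cost is exposed under info["cost"].
        if info.get("cost", 0.0) > 0.0:
            terminated = True
        return obs, reward, terminated, truncated, info
```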
Q4: Code availability
Yes, we would be happy to share our code upon acceptance of the paper.
We truly appreciate the reviewer’s constructive feedback. If you find our clarifications satisfactory, we would greatly appreciate it if you could consider increasing your score accordingly.
The paper has a clear motivation to improve safe RL through action masking. To avoid directly assigning infinite values to Q, the paper introduces polarization functions. To improve the learning of the REF, the paper uses prioritized replay. The results show significant improvement in both reward and safety compared to the baselines.
Update after rebuttal
It seems that the authors have addressed most of the other reviewers' concerns well. I am happy to raise my score to 5.
Questions for Authors
- Is there any particular reason why Safety Gym is not included as a benchmark? Or why did the authors choose to run the methods on a new benchmark?
Claims and Evidence
Yes, the claim made by the submission is supported clearly. There are clear improvements in the experiments.
Methods and Evaluation Criteria
The proposed method is a step in the right direction for tackling the key challenge of safe RL: balancing the two objectives.
The evaluation criteria seem correct to me. Hopefully, the authors can add Safety Gym as an additional benchmark, but I am okay without it.
Theoretical Claims
I checked the convergence proof and there is no issue.
Experimental Design and Analysis
The proposed experiments use a new experimental setting. I double-checked, and the settings look good overall, including the metrics and the reward design.
Supplementary Material
Appendix A explains well why threshold masking does not work.
Appendix B is very detailed.
Appendix C includes PPO-Lag and RESPO, and explains their failure due to high variance.
Relation to Prior Literature
The key contribution of the paper is the introduction of the polarization function, which I feel is the correct way to learn an action-masked policy.
Typically, people handle the two modes, optimizing an objective and minimizing the cost, by adding them together into one objective. These approaches often suffer from a sensitive Lagrangian hyperparameter or other issues in balancing the two modes.
The paper makes a clear improvement by directly injecting the safety signal into the learning of the Q-function, which I feel is the correct way to inherently ensure safety when learning a policy. The introduction of the polarization function also feels novel and tackles the key challenge of learning the correct Q-function.
Missing Essential References
NA
Other Strengths and Weaknesses
The paper is written very well. It is overall very easy to read and to grasp the authors' key idea.
Other Comments or Suggestions
I wish the impact statement had more text. Since this is about safe RL, any improvement may have a positive impact on society if the algorithm is deployed at scale in the real world.
We would like to sincerely thank the reviewer for the positive recognition of our work.
Regarding the concern about why we did not choose Safety-Gym as our benchmark suite, the primary reason is that our setting fundamentally differs from that of Safety-Gym. In our framework, safety is treated as the top priority. Accordingly, we assume that an episode terminates immediately upon the execution of any unsafe action (i.e., entering into any unsafe region). In contrast, Safety-Gym employs a scalar cost signal and allows safety violations as long as the expected cost-to-go remains within a predefined budget. For example, the “circle” environment in Safety-Gym’s safe navigation task is conceptually similar to the “circle” environment in our benchmarks. However, the underlying assumptions are different: in Safety-Gym, safety is treated as a constraint, and episodes continue even when temporary violations occur, as long as the total cost remains within the safety budget. In contrast, our formulation assumes strict safety—any safety violation immediately terminates the episode. This setting better reflects scenarios where safety is non-negotiable and takes precedence over performance. The mathematical comparison between these two setups is provided in the Preliminaries section of our paper, and we refer the reviewer to that part for detailed formulation. As a result of our strict safety assumption, the cost signal in our setting only appears at the end of an episode, making it significantly more sparse. To address this challenge and improve learning efficiency under such sparse cost feedback, we propose the use of Prioritized Experience Replay (PER) in our SPOM_PER method. This design decision is also motivated by the need to balance reward maximization with strong safety guarantees.
We truly appreciate the reviewer’s positive feedback and encouragement. We hope our responses have addressed your concerns, and we welcome any further suggestions to improve the paper.
Thank you so much for the explanation. I now understand that the new benchmark better reflects the assumption of immediate termination in unsafe states. I maintain my recommendation for acceptance.
This work introduces a chance-constrained bi-level optimization framework called MaxSafe for the maximal-safety RL problem. MaxSafe first minimizes the unsafe probability and then maximizes the return among the safest policies.
Questions for Authors
Could the authors answer points (2) and (3) in Strengths And Weaknesses?
Claims and Evidence
- The authors assume that there is a sufficiently large policy space with minimal unsafe probability, and justify this assumption by noting that it has already been explored in the literature (e.g., Ganai et al., 2023).
However, I would say that this assumption depends very much on the environment at hand. What class of environments would justify such an assumption?
Methods and Evaluation Criteria
Yes, the experimental evaluation is rather thorough, with several relevant RL environments considered, and different related approaches compared.
Theoretical Claims
The theoretical results appear to be sound. Proofs are provided in the appendix, but I did not check them.
The authors claim that "Extensive experiments demonstrate that our method achieves an optimal trade-off between reward and safety, delivering near-maximal safety." Is this meant in terms of the SWU metric introduced in Sec. 6?
Experimental Design and Analysis
The authors present experiments on autonomous driving and safe control tasks, demonstrating that the proposed algorithms, SPOM and SPOM_PER, achieve the best reward-safety trade-off (as measured by the safety-weighted utility, SWU) among state-of-the-art safe RL methods.
Supplementary Material
No, I did not.
Relation to Prior Literature
The state of the art is discussed in depth in the related work section. In particular the authors refer to (Ganai et al., 2023) and their reachability estimation function (REF) to capture the probability of constraint violation at state s under a given policy. In this work, the authors extend the definition of REF to be state-action dependent.
Missing Essential References
Nothing relevant not discussed to my knowledge.
Other Strengths and Weaknesses
- The paper makes both theoretical (Sec. 4) and practical (Sec. 5) contributions. In Sec. 4, the authors define a methodology to obtain optimal policies with provably minimal safety violations. Then, in Sec. 5, they adapt the methodology to a deep RL implementation using polarization functions and safety prioritized experience replay.
- As a weakness, the experimental evaluation does not appear to be conclusive; this may be due in part to the fact that the trade-off between safety and performance is elusive. As Fig. 1 shows, the proposed methods, SPOM_PER and SPOM, do not always guarantee the lowest crash rate.
- Also, related to the point made above, it is not clear how justified the assumption is that there is a sufficiently large policy space with minimal unsafe probability. Consider a stochastic exploration task where there is a non-zero probability of ending up in an unsafe state. It might be that no policy has zero unsafe probability, as there is always some chance of reaching an unsafe state. It looks like the method described in the paper would fail in such a case.
Other Comments or Suggestions
Nope
We would like to express our sincere gratitude to the reviewer for the valuable feedback and constructive comments. To better quantify the trade-off between reward and safety, we adopt the SWU score introduced by Yu, H., Xu, W., and Zhang, H. in Towards Safe Reinforcement Learning with a Safety Editor Policy (NeurIPS 2022b) as our final evaluation metric to compare the performance of different algorithms. The following sections address the two main questions raised by the reviewer.
Q1: The soundness of our assumption that there exists a sufficiently large policy space with minimal unsafe probability.
We assume that in the environments we consider, there exists a sufficiently large policy space where the probability of unsafe events is minimal. Such environments are common in autonomous driving and robotic control. For instance, there are many possible driving policies that allow a vehicle to operate safely on the road without collisions, and a sufficiently large set of policies that enable a robot to navigate without falling or colliding with obstacles, indicating that a large policy space with minimal unsafe probability indeed exists in these environments. This assumption aligns well with many practical scenarios. Building on this assumption, our MAXSAFE policy selects policies from this space and aims to optimize reward performance while ensuring safety.
Q2: The proposed methods SPOM_PER and SPOM do not always guarantee the lowest crash rate.
In our experiments, we use the SWU score to evaluate the trade-off between safety and reward. As shown in Table 1 of our paper, SPOM_PER and SPOM achieve the best overall performance across all six benchmark environments. Although the Recovery baseline achieves the lowest crash rates in Roundabout and Intersection, its reward performance is significantly worse, indicating that it suffers from an overly conservative policy in these two environments. In contrast, our SPOM_PER and SPOM attain the highest SWU scores across all benchmarks. Also, the results demonstrate that applying Prioritized Experience Replay (PER) to our proposed SA-REF effectively reduces safety violations while preserving strong reward performance.
We sincerely appreciate the reviewer’s constructive feedback, which has been invaluable in improving the clarity and rigor of our work. We hope our detailed responses have addressed your concerns, and we welcome any further suggestions to strengthen our paper.
The MAXSAFE framework introduces a promising chance-constrained bi-level optimization approach for safe reinforcement learning with novel components including optimal action masks, safety polarization, and safety prioritized experience replay. While reviewers noted concerns about the policy space assumption (BcoJ, urEV) and baseline selection methodology (bCNL), the authors provided adequate justifications and implemented improvements including additional baselines and statistical validation. The paper focuses on specific application domains where the assumptions are reasonable, and the SWU score demonstrates strong overall performance in reward-safety trade-offs. The current restriction to discrete action spaces represents a reasonable scope limitation rather than a fundamental flaw. Though one reviewer maintained a weak reject position (urEV), another recommended acceptance (dGwp), and a third moved to weak accept after rebuttals (bCNL). On balance, the paper's novel bi-level optimization approach makes a valuable contribution to safe reinforcement learning despite limitations that could be addressed in future work.