Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees
We propose a safe RL algorithm with spectral risk constraints, which shows convergence to an optimum in tabular settings.
Abstract
Reviews and Discussion
This paper considers a spectral-risk safe RL algorithm with a convergence guarantee. In particular, the paper takes a constrained MDP approach in which the objective is to maximize the expected cumulative reward subject to the constraint that the spectral risk measure stays below a certain threshold. The main challenge lies in solving the optimization problem, as it requires solving a bilevel optimization framework. The paper first proposes a policy-gradient-based approach for solving the inner problem, and then proposes a gradient-based approach to sample according to the optimal spectral risk measure.
Strengths
- Developing a safe RL policy with a spectral risk measure is an important research problem.
- The paper provides convergence guarantees and effective policies.
- The paper also provides practical approaches for implementation. The empirical performance is good.
Weaknesses
- The paper is somewhat dense, which makes it hard to parse everything.
- While the paper provides a convergence guarantee, it does not provide any non-asymptotic convergence rates. In particular, no sample-complexity guarantee is given.
Questions
- In order to solve the inner optimization problem, the paper considers a primal-dual-type algorithm built around the feasibility function. What types of feasibility functions have been considered? Can the authors outline how the dual-variable updates shown in Appendix B ensure feasibility?
- For the outer optimization, the paper relies on minimizing a loss function. However, finding the optimal policy that maximizes the objective is difficult because of the non-linearity in the feasibility function. Hence, the reviewer does not see how the algorithm solves this inner optimization problem, which seems different from the inner optimization problem described in Section 5.
- The convergence result (Theorem 6.3) relies on the Robbins-Monro condition. Is it satisfied (in particular, for the optimization framework considered in Section 6, which differs from that in Section 5)?
Limitations
Not as such.
Thanks for the detailed review and valuable feedback on our paper. We appreciate the effort to review our work. Below is our response to the reviewer's comments.
Weakness (convergence rate):
Thanks for pointing out the need for convergence rate analysis. We have addressed the convergence rate of the proposed method in the general response above. Please check the general response.
Q1 (feasibility):
The feasibility function is defined piecewise (taking one value when its argument satisfies the constraint and another otherwise), as noted in the footnote of line 114. This function is used to express the constrained optimization problem as an unconstrained one in equation (6), since the constrained problem and this unconstrained reformulation are equivalent. When solving the risk-constrained RL problem, the feasibility function is not used directly; instead, the constraints are used as they are, as shown in equation (7).
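For concreteness, a common form of such a reformulation (a sketch only; the symbols $F$, $J_R$, $J_C$, and $d$ follow standard constrained-RL notation and may not match the paper's exact definition) is

$$
F(x) = \begin{cases} 0, & x \le 0,\\ -\infty, & x > 0, \end{cases}
\qquad\text{so that}\qquad
\max_{\pi} J_R(\pi)\ \text{s.t.}\ J_C(\pi) \le d
\;\;\Longleftrightarrow\;\;
\max_{\pi}\ \big[ J_R(\pi) + F\!\big(J_C(\pi) - d\big) \big].
$$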
According to Theorem 5.3, an optimal policy for the inner problem can be obtained if the policy is updated using the proposed update rule described in line 169. As mentioned in lines 171-172, the proposed update rule requires the two quantities that the reviewer refers to as "dual variables" to satisfy the following conditions: 1) they must lie within the prescribed range, and 2) they must take the prescribed values when the constraints are not satisfied. Various methods for choosing them so that these conditions hold are presented in Appendix B. By using one of these methods, an optimal policy can be obtained according to Theorem 5.3, ensuring that the constraints are satisfied. A brief insight into the proof of Theorem 5.3 is as follows: the update rule is designed to give more weight to the objective function when the constraints are satisfied and more weight to the constraints when they are not. As a result, through repeated policy updates, the policy converges to a point within, or on the boundary of, the feasible region.
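To illustrate the weight-switching intuition above, here is a minimal CRPO-style sketch in Python (not the authors' exact update rule, which additionally involves the conditions on the dual variables; the `estimate_*` helpers are hypothetical and assumed to return differentiable scalar tensors):

```python
def switching_policy_update(policy, optimizer, batch, thresholds, slack=0.0):
    """One CRPO-style switching step: ascend the reward objective when all estimated
    cost constraints are (approximately) satisfied, otherwise descend the most
    violated cost. The estimate_* helpers are hypothetical stand-ins that return
    differentiable scalar tensors (e.g., PyTorch surrogate objectives)."""
    j_r = estimate_reward_objective(policy, batch)   # surrogate for J_R(pi)
    j_cs = estimate_cost_objectives(policy, batch)   # surrogates for J_{C_i}(pi)

    violations = [float(j_c) - d for j_c, d in zip(j_cs, thresholds)]
    if all(v <= slack for v in violations):
        loss = -j_r                                  # maximize the reward objective
    else:
        worst = max(range(len(violations)), key=lambda i: violations[i])
        loss = j_cs[worst]                           # minimize the most violated cost

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```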
Q2 (outer optimization):
The reviewer seems to think that computing the sampler's loss requires solving another inner problem, different from the one in Section 5; in other words, that training the sampler requires finding a policy that maximizes the indicator. However, this is not correct. The policy is only updated using the method introduced in Section 5, and only the sampler is trained in Section 6; there is no other inner problem. As seen in the equation below line 240, the sampler's loss function is composed of the indicator evaluated at the policy obtained by solving the inner problem from Section 5 over the iterations so far. This indicator can be calculated using the value functions corresponding to that policy, and the sampler is trained to output higher probabilities for the outer variables that yield a higher indicator value.
To be more specific, policies are trained for each of several outer variables using the method introduced in Section 5, giving a set of trained policies. Then, we use these policies to calculate the indicator values and use these values to train the sampler. To implement the set of policies, the outer variable is added, along with the state, as an input to the actor network, as detailed in Appendix C.
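For illustration only, the sampler update described in this and the previous paragraph might look like the following sketch (hypothetical names; the actual loss is the one given below line 240 of the paper, and `indicator` stands in for the feedback signal):

```python
import torch


def sampler_update(sampler, sampler_optimizer, policies, betas, batch, thresholds):
    """Score each candidate outer variable by the indicator of its trained policy,
    then push the sampler's probability mass toward high-scoring candidates.
    `indicator` and the estimate_* helpers are hypothetical stand-ins."""
    scores = []
    for beta, policy in zip(betas, policies):
        j_r = estimate_reward_objective(policy, batch)
        j_cs = estimate_cost_objectives(policy, batch, beta)
        scores.append(indicator(j_r, j_cs, thresholds))   # higher = RCRL solved better
    scores = torch.stack(scores).detach()

    log_probs = torch.stack([sampler.log_prob(beta) for beta in betas])
    loss = -(log_probs * scores).mean()                   # REINFORCE-style weighting

    sampler_optimizer.zero_grad()
    loss.backward()
    sampler_optimizer.step()
```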
Additionally, the explanation on lines 237-238 does not mean that we need to find such a maximizing policy, but rather demonstrates that the indicator can be used as a measure of how well the RCRL problem is solved. If there is any misunderstanding in our answer, please provide further clarification.
Q3 (Robbins-Monro condition):
Since the learning rate can be chosen freely, the Robbins-Monro condition can be satisfied. Furthermore, as mentioned in the response to Q2 above, there is no additional inner problem involved in the computation. Therefore, the method described in Section 6 does not itself need to satisfy the Robbins-Monro condition.
Thanks for your response. I will go over the comments carefully.
As for the CRPO-like proof that you provided in the general response, I have one question: since this is a constrained problem, how are you ensuring that inequality? The intermediate policies may be infeasible and may therefore have a better objective value than the constrained optimum.
Thanks for the response. The question on the inequality seems to have arisen from the proof of the convergence rate in the general response. In the proof, we said " for ", which means that the policy at that iteration satisfies all constraints. Therefore, among the constraint-satisfying policies, the optimal policy has the maximum objective value, which yields the inequality.
Thanks for the quick explanation. However, it seems that (even in the CRPO case) there is a slack parameter, so those policies may not exactly satisfy the constraints. For example, it may happen that the slack-relaxed constraint holds while the original constraint does not. Thus, the policy may not satisfy the constraint and may be infeasible.
Thanks for clarifying the raised question with an example. As the reviewer pointed out, we confirmed that there was an error in the proof. We have uploaded the revised proof to a comment on the general response above. Please refer to the comment.
I do not have any more comments. I have adjusted my score.
This paper proposes a spectral risk measure-constrained RL algorithm, called spectral-risk-constrained policy optimization (SRCPO). This algorithm leverages the duality of spectral risk measures and treats the risk-constrained RL problem as a bilevel optimization. In this bilevel optimization problem, the outer problem is to optimize dual variables derived from the risk measures, while the inner problem is to find an optimal policy given these dual variables. The proposed algorithm is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the authors conduct experiments on continuous control tasks, and show that the proposed algorithm achieves the best performance among the compared RCRL algorithms.
Strengths
- The studied problem, i.e., spectral risk measure-constrained RL, is well-motivated and can be applied to safety-critical scenarios, e.g., healthcare and finance.
- This paper is very well-written and easy to follow.
- The authors design a general algorithm framework and provide the convergence guarantee for a family of spectral risk-constrained RL problems.
- Empirical evaluations are also provided to demonstrate the performance superiority of the proposed algorithm in practice.
Weaknesses
- The idea of handling risk-sensitive RL by solving an inner problem (i.e., finding an optimal policy under a fixed dual variable) and an outer problem (i.e., optimizing dual variables) is not new. This idea appears in prior risk-sensitive RL works (although these prior works may not consider constraints), e.g.,
[1] Wang, Kaiwen, et al. "Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR." International Conference on Machine Learning, 2023.
[2] Chen, Yu, et al. "Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback." International Conference on Learning Representations, 2023.
- The current format of Algorithm 1 is not clear enough. The specific algorithm steps should be included, e.g., the policy update, the distribution update (how to find it), and the sampling.
- It seems that Theorems 5.3 and 6.3 are the theoretical guarantees for the inner and outer problems, respectively. Can the authors give the theoretical guarantee of the overall performance for the spectral risk measure-constrained RL problem?
- The provided results only state that the algorithm will converge to the optimal policy. Can the authors comment on the convergence rate or sample complexity?
Questions
Please see the weaknesses above.
Limitations
Please see the weaknesses above.
Thanks for the thorough review and insightful comments on our paper. We appreciate the effort to review our work. The following is our response to the reviewer's comments.
Weakness 1 (inner and outer problems)
As the reviewer commented, handling risk by separating the given problem into inner and outer problems is itself not new; rather, it is the proposed method for solving each problem that we consider novel. Therefore, we will modify lines 50-51 to emphasize the proposed procedures for the inner and outer problems rather than the bilevel optimization framework itself.
Weakness 2 (algorithm details)
Algorithm 1 only provides a brief overview of the proposed method, but Algorithm 2 in Appendix C shows the details of the method, including information on the policy and distribution updates. To improve clarity, we will change the title of Algorithm 1 to "Overview of SRCPO," and at the end of Section 4, we will add a note indicating that a detailed algorithm is provided in Appendix C.
Weakness 3 (overall performance)
Through Section 5, we can find an optimal policy of the inner problem for various outer variables; that is, we can find a set of optimal policies, one for each outer variable. Additionally, through Section 6, we can obtain an optimal sampler that samples only the optimal outer variable. Thus, by solving the inner and outer problems, we obtain both the policy set and the sampler. Sampling from the sampler and selecting the corresponding policy in the set yields the optimal policy of the RCRL problem. In conclusion, by solving the inner and outer problems respectively, we obtain an optimal policy for the original RCRL problem defined in equation (5), thus ensuring overall performance.
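As a minimal sketch of this final step (hypothetical interfaces; the conditioned actor is the one described in Appendix C, which takes the outer variable as an extra input):

```python
def deploy_optimal_policy(sampler, conditioned_actor):
    """Draw an outer variable from the trained sampler and fix it as the extra
    input of the conditioned actor network, yielding the deployed policy."""
    beta_star = sampler.sample()                 # (near-)optimal outer variable
    return lambda state: conditioned_actor(state, beta_star)
```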
Weakness 4 (convergence rate)
We have addressed the convergence rate of the proposed method in the general response above. Please check the general response.
I thank the authors for their response. Since there exists a technical error in the proof, I think that this paper needs a major revision, and thus I have decreased my score.
Thanks for the response. However, we would like to clarify that there are no technical errors in our paper. The error mentioned in the general response arose from an additional analysis of the convergence rate, which was conducted in response to the reviewers' requests. For more details, there was an error in the proof regarding the convergence rate in the general response, specifically the part " for ," which has now been corrected to "one of the following must hold: (1) or (2) ." The revised proof has corrected all errors, and the reviewer DPN9 has also increased the score from 6 to 7. We apologize for the error, which has now been corrected, and would like to reiterate that there are no errors in the paper itself.
Thank you for your explanation!
I misunderstood something before. I will maintain my original score 7, and support acceptance.
In this paper, the authors consider the framework of risk-constrained reinforcement learning (RCRL) by tackling risk-measure-based constraints. The non-linearity of risk measures makes the convergence of RL schemes challenging. In this paper, a spectral risk measure-constrained RL algorithm is proposed. The proposed algorithm is based on a bilevel optimization: the outer problem deals with the optimization of dual variables, while the inner one finds the optimal solution given a set of dual variables. The authors claim that the proposed scheme is the first method that converges to optimality for tabular MDP problems. Simulation results are presented in which the proposed method is tested on continuous control tasks.
Strengths
The paper is well-written and easy to follow. The results and claims presented in the paper appear to be correct.
Weaknesses
There are some limitations with respect to the constructed loss function and scalability of the adopted linear interpolation mechanism.
Questions
- The proposed approach approximates the spectrum by a discretized spectrum. What is the gap in performance with respect to the optimal solution of the original system? Although Theorem 6.2 derives that the performance gap reduces to zero as the discretization becomes finer, how does this scale as the number of states and actions increases?
- The intuition behind the sampling process and why it works is not clear.
- What is the rationale behind the loss function proposed by the authors? Why is the loss function novel? Is there any other loss function for which the scheme works well? If yes, can the authors come up with a more general loss function? If no, then it is a limitation of the work that it works only for a specific type of loss function (constructed by the authors).
Limitations
Yes
Thanks for the valuable feedback on our paper. We appreciate the effort in reviewing our work. We have carefully considered the reviewer's comments. Below, we address the concerns the reviewer has raised:
Q1 (performance gap):
In the general response above, we analyzed the performance gap caused by the spectrum approximation; please refer to the general response. Briefly, we first bounded the gap by the difference in performance of the original problem at different thresholds. Analyzing this performance difference in a constrained optimization problem is challenging due to discontinuous changes in the feasible set or the optimal points. Thus, we made some assumptions to analyze the gap and found that it is bounded by a quantity that depends on the scale of the rewards and costs rather than on the state and action spaces. Therefore, we conjecture that the number of discretization points can be set independently of the state and action spaces.
Q2 (intuition behind the sampling process):
Thanks for pointing out the need for clarification. Section 6.2 introduced 1) why we need to learn the distribution of the outer variable (the sampler), 2) the training method, and 3) the technique for sampling from it. Then, in Section 7, we presented the practical implementation method, which seems to be what the reviewer is pointing out. As mentioned in lines 230-231, the outer variable is sampled within a bounded range. To implement this, distributions with finite support, such as the beta or truncated normal distribution, can be used. However, defining the support of the distribution in terms of the outer variable itself complicates the gradient computation graph in PyTorch, increasing the likelihood of computational errors. To address this, we decided to sample the differences between consecutive components of the outer variable, motivated by the fact that the original parameterization is equivalent to one in terms of these differences. To further remove the dependence in the interval, we instead sampled within a fixed range, but this can cause the variable to exceed its original bound. Nevertheless, if it does exceed the bound, the value of the conjugate function in (2) also increases, which reduces the indicator. As a result, the distribution is trained to output low probabilities for such cases. We will add this explanation to the appendix.
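A toy sketch of the increment-based parameterization described above (simplified and hypothetical: it assumes the components of the outer variable should form a non-decreasing sequence, which is what sampling their differences suggests; the paper's actual sampler, bounds, and distribution family may differ):

```python
import torch


def sample_outer_variable(increment_dist, num_points):
    """Sample nonnegative increments and accumulate them, so that the resulting
    outer variable is non-decreasing by construction; values exceeding the original
    upper bound are not clipped here but are discouraged by the training signal."""
    increments = increment_dist.rsample((num_points,)).clamp(min=0.0)
    beta = torch.cumsum(increments, dim=0)  # non-decreasing sequence of components
    return beta
```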
Q3 (loss function):
First, we want to clarify the meaning of line 235, which seems to be what the reviewer is pointing out. The meaning is that the process of solving the outer problem is novel, rather than a claim that the proposed loss function itself is novel. Specifically, the proposed method enables the inner and outer problems to be solved concurrently rather than sequentially. Thus, we will change the phrase to "novel update process."
Next, the rationale for structuring the loss as indicated in lines 236 and 240 is as follows: the goal of the sampler is to produce a high probability for the optimal outer variable. To achieve this, we need an indicator that can provide feedback on how well the RCRL problem is solved for a given outer variable. Inspired by the paper [R2], which converts constrained RL to unconstrained RL using penalty functions, we defined the indicator as shown in line 236.
Additionally, as observed in the proof of Theorem 6.3, any function that satisfies $\max_\pi J(\pi; \beta)= \max_{\pi \in \{ \pi \mid J_{C_i}(\pi;\beta) \leq d_i,\ \forall i \}} J_R(\pi)$ can be used. One such choice is a function that equals $J_R(\pi)$ when all constraints are satisfied and takes a sufficiently low value otherwise. Alternatively, instead of providing feedback, a method that uses the policy's value functions to compute the optimal distribution in closed form could be proposed. As mentioned earlier, since we have proposed a new approach to solving the outer problem, it does not seem necessary to propose a more general loss function. Developing more efficient and reliable loss functions could be done in future work.
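Written out explicitly (with the unspecified "otherwise" value taken, purely for illustration, to be $-\infty$; the authors' actual choice may differ):

$$
J(\pi;\beta) =
\begin{cases}
J_R(\pi), & \text{if } J_{C_i}(\pi;\beta) \le d_i \ \ \forall i,\\
-\infty, & \text{otherwise,}
\end{cases}
$$

which satisfies $\max_\pi J(\pi;\beta) = \max_{\pi \in \{\pi \mid J_{C_i}(\pi;\beta) \le d_i,\ \forall i\}} J_R(\pi)$.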
References
- [R2] Zhang, Linrui, et al. "Penalized proximal policy optimization for safe reinforcement learning." arXiv preprint arXiv:2205.11814 (2022).
Thank you for the detailed response which helped in having a better understanding of the paper. I have read all the responses and discussions. Some of my concerns are answered now. However, since the paper needs a revision in the technical part, I will keep my score as it is. I have no further questions.
Thanks for the response, and we are glad to hear that some of the reviewer's concerns have been addressed. We would like to clarify that the error was not in the submitted manuscript but in the global response (convergence rate analysis), which has now been resolved and does not affect the main theoretical results of the manuscript.
Thanks for the clarification. I have read the paper and response again, and I am happy to raise the score accordingly.
The paper provides a new spectral-risk-constrained policy optimization algorithm that uses the duality of spectral risk measures and a bilevel optimization approach to address the risk-constrained reinforcement learning problem, solving the inner primal problems and the outer dual problem. The paper provides global convergence guarantees in the tabular setting. The paper supports the results with experimental comparisons to other methods.
Strengths
The paper is well written; despite the many complex notions used throughout, they are all clearly introduced. The theoretical result seems to be solid and important to the community. The idea of solving the dual problem of risk-constrained RL using a sampler distribution seems to be novel too. The proofs are provided; however, I did not check them carefully. The experiments demonstrate a good performance of the proposed method.
Weaknesses
There are several things which are not very clearly defined and a few minor typos that I noticed.
- line 88: F_X is not defined; I believe it is a probability function?
- line 168: where does the log in eq. (9) come from?
Conflicting notations:
- Currently, alpha is used to denote both the learning rate and the risk parameter. Maybe the authors could use different notations for these notions.
- F is probably a probability function, and then a feasibility function (line 114).
- In Theorem 6.2, i is used to iterate over the discretization 1:M-1, but it is later used to iterate over the constraints 1:N, while j is used for discretizations. Maybe use j consistently in Theorem 6.2 and the other places too?
Typos:
line 102: an -> a
Questions
All questions are above in the weaknesses section.
Limitations
The limitations are discussed. There is no potential negative societal impact.
Dear Reviewer,
Thanks for the positive review and valuable feedback on our paper. We appreciate the effort in reviewing our work. Below, we respond to each of the points mentioned.
Weakness 1 (line 88 and 168):
F_X is the cumulative distribution function (CDF) of the random variable X. We will add the definition of F_X to the paper after line 88.
The logarithmic operator in eq. (9) comes from the derivation of the policy gradient of the sub-risk measure, obtained by differentiating the objective in eq. (8).
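The exact derivation is not reproduced here, but the mechanism is the standard likelihood-ratio (score-function) identity; as a generic sketch, with $G(\tau)$ standing in for the sub-risk-measure objective of a trajectory $\tau$:

$$
\nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[ G(\tau) \big]
= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ G(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],
\qquad\text{since}\qquad
\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta .
$$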
Weakness 2 (conflicting notations):
Thanks for the comments on the conflicting notations. We will change the notations as follows:
- Learning rate: from alpha to a different symbol.
- Feasibility function: from F to a different symbol.
- Iterator of the discretization in Theorem 6.2: from i to j.
We will also correct the typo on line 102.
I thank the authors for their response. I keep my score and support acceptance.
General Response
We appreciate all the reviewers for their insightful comments and suggestions. In this response, we will address the common concern raised by the reviewers: convergence rate analysis and performance gap.
Convergence rate
We will analyze the convergence rate of the proposed method in Section 5 through finite-time analysis, similar to the approach used in CRPO [R1]. To achieve this, we make a few minor modifications to the policy update rule in line 169 as follows:
- "if constraints are satisfied" is changed to "if ," where is a positive number.
- The time-varying step size is changed to a fixed learning rate.
Then, the following is satisfied:
Let and , where and .
Then, , and for .
Proof:
By substituting and into Eq. (30),
.
Using the definition of ,
.
Also, due to for , .
Since , we can get .
Then,
.
Also, .
Performance Gap
Let { }, { }, where . Then, the performance gap is . According to Lemma 6.1, , where , so . Thus, .
As a result, the gap is bounded by the performance difference between the two thresholds. However, in constrained optimization, this type of problem is challenging due to discontinuous changes in the feasible set or optimal points. Thus, we will use some assumptions to analyze the gap practically, which will be introduced below.
Let . Then, . The performance gap is expressed as , where and . As a result, we need to find .
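For context, the step above presumably uses the standard occupancy-measure representation of returns (a sketch with assumed notation: $d^\pi$ is the discounted state-action occupancy measure, $\gamma$ the discount factor, and $R_{\max}$ a bound on the reward magnitude):

$$
J_R(\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim d^\pi}\big[ R(s,a) \big]
\quad\Longrightarrow\quad
\big| J_R(\pi) - J_R(\pi') \big| \le \frac{R_{\max}}{1-\gamma}\, \big\| d^\pi - d^{\pi'} \big\|_1 .
$$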
Since the one-norm above is the maximum distance between the two nearest policies in set and , we only need to evaluate the boundaries and . If on is given, we need to find the closest policy on : s.t. . Let . Then, . Here, we assume that there exists an occupancy measure that satisfies the equality in the above equation. Then, . Finally, the performance gap is bounded by .
Without the assumption, it is not guaranteed that the performance gap will always be bounded by . Nevertheless, an insight can be gained that the gap is influenced by the scale of rewards and costs rather than by the state and action spaces. Therefore, it seems reasonable to conjecture that the proposed approximation approach is scalable.
References
- [R1] Xu et al. "CRPO: A new approach for safe reinforcement learning with convergence guarantee." ICML, 2021.
As the reviewer DPN9 pointed out, the proof of the convergence rate provided in the general response above has an error: " for " does not hold. We apologize for the confusion. We modify the proof as follows, starting from the statement "Also, due to for " in the original proof above.
Given that , one of the following must hold: (1) or (2) .
Assume neither condition is satisfied. Then, , which leads to a contradiction. Therefore, one of the two conditions must hold.
Furthermore, if we assume , it follows that , which is also a contradiction. Hence, .
If ,
, and this inequality also holds for .
Consequently, .
Also, .
Thanks to the reviewers for reading and responding to the rebuttal, and we would like to clarify the following points:
- [No modification of technical parts] There have been no modifications to the technical parts of the manuscript and the supplementary material during the rebuttal period. Our technical contributions are summarized as follows:
- Show the linear property of the performance difference in RL problems with sub-risk measure constraints.
- Ensure convergence to an optimal policy of the inner problem.
- Analyze the approximation error of the discretized spectrum.
- Ensure convergence to an optimal outer variable of the outer problem.
- [Convergence rate analysis] In response to questions raised by Reviewers tq31 and DPN9 regarding the convergence rate, we have provided the convergence rate and proof in the general response above. After discussions with Reviewer DPN9, an issue in the proof (provided in the general response, not in the manuscript) has been corrected, and there are no errors now.
- [Performance gap caused by spectrum discretization] In response to queries about the performance gap caused by the spectrum discretization, we found that the performance gap is bounded by the performance difference of the original problem at different thresholds. Additionally, with some approximations, it is bounded by a quantity that does not depend on the state and action spaces, suggesting that the spectrum discretization is scalable.
The paper proposes a spectral risk measure-constrained RL algorithm, as well as a bilevel optimization approach. The proposed algorithm comes with optimality guarantees in the tabular setting, and good performance is also demonstrated via numerical experiments. The reviewers appreciated the importance of the problem and the strength of the results (both theoretical and experimental), and they praised the clarity of the paper. The rebuttal from the authors addressed the reviewers' concerns, and there is a consensus towards accepting the paper. I concur with this evaluation: this will be a nice addition to the NeurIPS program.