PaperHub
Overall rating: 7.0 / 10 · Poster · 4 reviewers (min 7, max 7, std 0.0)
Individual ratings: 7, 7, 7, 7
Confidence: 3.0 · Correctness: 3.3 · Contribution: 3.3 · Presentation: 3.0
NeurIPS 2024

Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose a safe RL algorithm with spectral risk constraints, which shows convergence to an optimum in tabular settings.

Abstract

Keywords
Safe Reinforcement Learning · Risk Constraint · Spectral Risk Measure

Reviews and Discussion

Review
7

This paper considers a spectral-risk safe RL algorithm with a convergence guarantee. In particular, the paper takes a constrained MDP approach in which the objective is to maximize the expected cumulative reward subject to the constraint that the spectral risk measure is below a certain threshold. The main challenge lies in solving the optimization problem, as it requires solving a bilevel optimization framework. The paper first proposes a policy-gradient-based approach for solving the inner problem, and then proposes a gradient-based approach to sample according to the optimal spectral risk measure.

Strengths

  1. Developing a safe RL policy under spectral risk measures is an important research problem.

  2. The paper provides convergence guarantees and effective policies.

  3. The paper also provides practical approaches for implementation. The empirical performance is good.

Weaknesses

  1. The paper is a bit dense, which makes it hard to parse everything.

  2. While the paper provides a convergence guarantee, it does not provide any non-asymptotic convergence rates. In particular, a sample-complexity guarantee has not been provided.

Questions

  1. In order to solve the inner optimization problem, the paper considers a primal-dual type algorithm involving the feasibility function. What types of feasibility functions $F$ have been considered? Can the authors provide some outline of how the dual-variable updates shown in Appendix B ensure feasibility?

  2. For solving the outer optimization, the paper relies on minimizing the loss function. However, finding the optimal policy that maximizes $J(\pi,\beta)$ is difficult because of the non-linearity in the feasibility function. Hence, the reviewer did not get how the algorithm solves this inner optimization problem, which seems to be different from the inner optimization problem described in Section 5.

  3. The convergence result (Theorem 6.3) relies on the Robbins-Monro condition. Is it satisfied (in particular, for the optimization framework considered in Section 6, which is different from that in Section 5)?

Limitations

Not as such.

Author Response

Thanks for the detailed review and valuable feedback on our paper. We appreciate the effort to review our work. Below is our response to the reviewer's comments.

Weakness (convergence rate):

Thanks for pointing out the need for convergence rate analysis. We have addressed the convergence rate of the proposed method in the general response above. Please check the general response.

Q1 (feasibility):

The feasibility function is defined as $\mathcal{F}(x)=0$ if $x\leq0$, else $\infty$, which is commented in the footnote of line 114. This function is used to express the constrained optimization problem as an unconstrained one in equation (6), since the two problems "$\max_\pi J_R(\pi)$ s.t. $J_{C_i}(\pi)\leq d_i$" and "$\max_\pi J_R(\pi) - \sum_{i=1}^N\mathcal{F}(J_{C_i}(\pi) - d_i)$" are equivalent. When solving the risk-constrained RL problem, the feasibility function is not used directly; instead, the constraints are used as they are, as shown in equation (7).
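As a small illustration of this equivalence (a hedged sketch with hypothetical helper names, not code from the paper), the indicator-style feasibility function can be written as follows:

```python
import numpy as np

def feasibility(x):
    """Indicator-style feasibility function: 0 if x <= 0, +infinity otherwise."""
    return 0.0 if x <= 0 else np.inf

def unconstrained_objective(J_R, J_C, d):
    """J_R - sum_i F(J_{C_i} - d_i): equals J_R on the feasible set, -inf outside it."""
    return J_R - sum(feasibility(jc - di) for jc, di in zip(J_C, d))

# A feasible policy keeps its reward objective, while an infeasible one is pushed to -inf,
# so maximizing the unconstrained objective recovers the constrained optimum.
print(unconstrained_objective(1.5, [0.8, 0.3], [1.0, 0.5]))  # 1.5 (feasible)
print(unconstrained_objective(2.0, [1.2, 0.3], [1.0, 0.5]))  # -inf (violates the first constraint)
```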

According to Theorem 5.3, an optimal policy for the inner problem can be obtained if the policy is updated using the proposed update rule, which is described below line 169. As mentioned in lines 171-172, the proposed update rule requires satisfying the following conditions on $\lambda_t$ and $\nu_t$, which the reviewer refers to as "dual variables": 1) $\lambda_{t,i}$ and $\nu_t$ should lie within $[0, \lambda_\mathrm{max}]$, and 2) $\lambda_{t,i}(J_{C_i}(\theta)-d_i)\geq0$ and $\sum_i \lambda_{t,i}=1$ when the constraints are not satisfied. Various methods for choosing $\lambda_t$ and $\nu_t$ that satisfy these conditions are presented in Appendix B. By using one of these methods to calculate $\lambda_t$ and $\nu_t$, an optimal policy can be obtained according to Theorem 5.3, ensuring that the constraints are satisfied. A brief insight into the proof of Theorem 5.3 is as follows: the update rule is designed to give more weight to the objective function when the constraints are satisfied and more weight to the constraints when they are not. As a result, through repeated policy updates, the policy converges within or on the boundary of the constraint set.
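For concreteness, here is a minimal sketch of one way to choose $\lambda_t$ and $\nu_t$ that satisfies the two conditions above, assuming $\lambda_\mathrm{max}\geq 1$; this is an illustrative choice and not necessarily one of the schemes given in Appendix B:

```python
import numpy as np

def choose_dual_variables(J_C, d, lam_max=1.0):
    """Return (lambda, nu) satisfying: each entry lies in [0, lam_max], and, when some
    constraint is violated, lambda_i * (J_Ci - d_i) >= 0 with sum_i lambda_i = 1."""
    violations = np.maximum(np.asarray(J_C, dtype=float) - np.asarray(d, dtype=float), 0.0)
    if violations.sum() == 0.0:
        # All constraints satisfied: put the full weight on the reward objective.
        lam = np.zeros_like(violations)
    else:
        # Weight only the violated constraints, normalized so that sum_i lambda_i = 1;
        # lambda_i is zero wherever J_Ci <= d_i, so lambda_i * (J_Ci - d_i) >= 0 holds.
        lam = violations / violations.sum()
    nu = 0.0
    return np.clip(lam, 0.0, lam_max), nu
```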

Q2 (outer optimization):

The reviewer seems to think that calculating $J(\pi; \beta)$ requires solving another inner problem, different from the one in Section 5; in other words, that training the sampler requires finding a policy that maximizes $J(\pi; \beta)$. However, this is not the case. The policy is only updated using the method introduced in Section 5, and only the sampler is trained in Section 6; there is no other inner problem. As seen in the equation below line 240, the sampler's loss function is composed of $J(\pi_{\beta, t}; \beta)$, where $\pi_{\beta, t}$ is the policy obtained by solving the inner problem from Section 5 for $\beta$ over $t$ iterations. $J(\pi_{\beta, t}; \beta)$ can be calculated using the value functions corresponding to $\pi_{\beta, t}$, and the sampler is trained to output higher probabilities for $\beta$ values that yield a higher $J(\pi_{\beta, t}; \beta)$.

To be more specific, policies are trained for each of several $\beta$ using the method introduced in Section 5, and these policies are expressed as $\{ \pi_{\beta,t} \mid \beta \in B^N \}$. Then, we use these policies to calculate $J(\pi_{\beta,t};\beta)$ and use these values to train the sampler. To implement the set of policies, $\beta$ is added, along with the state, as an input to the actor network $\pi_\theta(a|s; \beta)$, as detailed in Appendix C.
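A minimal PyTorch-style sketch of such a $\beta$-conditioned actor (hypothetical class name and layer sizes; the actual architecture is described in Appendix C) simply concatenates $\beta$ with the state:

```python
import torch
import torch.nn as nn

class BetaConditionedActor(nn.Module):
    """Gaussian policy pi_theta(a | s; beta) that receives beta as an extra input,
    so one network represents the whole family of policies { pi_beta | beta in B^N }."""

    def __init__(self, state_dim, beta_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + beta_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, beta):
        # Conditioning on beta: concatenate it with the state before the shared layers.
        h = self.body(torch.cat([state, beta], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())
```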

Additionally, the explanation in lines 237-238 does not mean that we need to find $\max_\pi J(\pi; \beta)$; rather, it demonstrates that $J(\pi; \beta)$ can be used as a measure of how well the RCRL problem is solved. If there is any misunderstanding in our answer, please provide further clarification.

Q3 (Robbins-Monro condition):

Since it is possible to adjust the learning rate freely, the Robbins-Monro condition can be satisfied. Furthermore, as mentioned in the above response to Q2, there is no additional inner problem in calculating $J(\pi;\beta)$. Therefore, the method described in Section 6 does not need to satisfy the Robbins-Monro condition.

Comment

Thanks for your response. I will go over the comments carefully.

As for the CRPO-like proof that you provided in the general response, I have one question. Since this is a constrained problem, how are you ensuring that $J_R(\pi^*)-J_R(\pi_t)\geq 0$? In particular, $\pi_t$ may be infeasible and may have a better value than $J_R(\pi^*)$.

Comment

Thanks for the response. The question about the inequality seems to arise from the proof of the convergence rate in the general response. In the proof, we said "$J_R(\pi^*) - J_R(\pi_t) \geq 0$ for $t \in \mathcal{N}$", where $t \in \mathcal{N}$ means that the policy at iteration $t$, $\pi_t$, satisfies all constraints. Therefore, among the constraint-satisfying policies, $\pi^*$ has the maximum value of $J_R$, which results in $J_R(\pi^*) \geq J_R(\pi_t)$.

Comment

Thanks for the quick explanation. However, it seems that (even in the CRPO case) there is a slack parameter $\eta$; hence, those policies may not exactly satisfy the constraints. For example, it may happen that $J_{c,i}^{\pi_t}>d_i$ but $J_{c,i}^{\pi_t}<d_i+\eta$. Thus, $\pi_t$ may not satisfy the constraint and may be infeasible.

Comment

Thanks for clarifying the raised question with an example. As the reviewer pointed out, we confirmed that there was an error in the proof. We have uploaded the revised proof to a comment on the general response above. Please refer to the comment.

Comment

I do not have any more comments. I have adjusted my score.

Review
7

This paper proposes a spectral risk measure-constrained RL algorithm, called spectral-risk-constrained policy optimization (SRCPO). This algorithm leverages the duality of spectral risk measures and treats the risk-constrained RL problem as a bilevel optimization. In this bilevel optimization problem, the outer problem is to optimize dual variables derived from the risk measures, while the inner problem is to find an optimal policy given these dual variables. The proposed algorithm is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the authors conduct experiments on continuous control tasks, and show that the proposed algorithm achieves the best performance among the compared RCRL algorithms.

Strengths

  1. The studied problem, i.e., spectral risk measure-constrained RL, is well-motivated and can be applied to safety-critical scenarios, e.g., healthcare and finance.
  2. This paper is very well-written and easy to follow.
  3. The authors design a general algorithm framework and provide the convergence guarantee for a family of spectral risk-constrained RL problems.
  4. Empirical evaluations are also provided to demonstrate the performance superiority of the proposed algorithm in practice.

Weaknesses

  1. The idea of handling risk-sensitive RL by solving an inner problem (i.e., finding an optimal policy under a fixed dual variable) and an outer problem (i.e., optimizing dual variables) is not new. This idea appears in prior risk-sensitive RL works (although these prior works may not consider constraints), e.g.,

[1] Wang, Kaiwen, et al. "Near-minimax-optimal risk-sensitive reinforcement learning with cvar." International Conference on Machine Learning, 2023.

[2] Chen, Yu, et al. "Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback." International Conference on Learning Representations, 2023.

  2. The current format of Algorithm 1 is not clear enough. The specific algorithm steps should be included, e.g., the policy update, the distribution update (how to find $\xi$), and the sampling.
  3. It seems that Theorems 5.3 and 6.3 are the theoretical guarantees for the inner and outer problems, respectively. Can the authors give a theoretical guarantee on the overall performance for the spectral risk measure-constrained RL problem?
  4. The provided results only state that the algorithm will converge to the optimal policy. Can the authors comment on the convergence rate or sample complexity?

Questions

Please see the weaknesses above.

Limitations

Please see the weaknesses above.

Author Response

Thanks for the thorough review and insightful comments on our paper. We appreciate the effort to review our work. The following is our response to the reviewer's comments.

Weakness 1 (inner and outer problems)

As the reviewer commented, handling risk by separating the given problem into an inner and an outer problem is indeed not new; rather, it is the proposed method for solving each problem that is novel. Therefore, we will modify lines 50-51 to emphasize the proposed procedures for solving the inner and outer problems rather than the bilevel optimization framework itself.

Weakness 2 (algorithm details)

Algorithm 1 only provides a brief overview of the proposed method, while Algorithm 2 in Appendix C shows the details of the method, including information on the policy and distribution updates. To improve clarity, we will change the title of Algorithm 1 to "Overview of SRCPO," and at the end of Section 4, we will add a note indicating that a detailed algorithm is provided in Appendix C.

Weakness 3 (overall performance)

Through Section 5, we can find an optimal policy of the inner problem for various $\beta$. It means that we can find a set of optimal policies, $\Pi^*=\{ \pi_\beta^* \mid \beta \in B^N \}$, where $\pi_\beta^*$ denotes the optimal policy for $\beta$. Additionally, through Section 6, we can obtain an optimal sampler $\xi^*$ that samples only $\beta^* = \arg\max_\beta J_R(\pi_\beta^*)$. Thus, by solving the inner and outer problems, we obtain $\Pi^*$ and $\xi^*$. Sampling $\beta^*$ from $\xi^*$ and finding the corresponding policy in $\Pi^*$ leads to the optimal policy $\pi_{\beta^*}^*$ of the RCRL problem. In conclusion, by solving the inner and outer problems respectively, we can obtain an optimal policy for the original RCRL problem defined in equation (5), thus ensuring overall performance.
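To make the composition of the two levels explicit, here is a hedged sketch with hypothetical function names (`train_inner`, `train_sampler`), not the authors' implementation:

```python
def solve_rcrl(betas, train_inner, train_sampler):
    """Compose the inner and outer problems: solve the inner problem for each beta,
    train the sampler on the resulting policies, and return the policy for the beta
    on which the trained sampler concentrates (i.e., beta*)."""
    policies = {beta: train_inner(beta) for beta in betas}   # Pi* = { pi_beta* }
    sampler = train_sampler(policies)                        # outer problem (Section 6)
    best_beta = sampler.sample()                             # concentrates on beta*
    return policies[best_beta]                               # pi_{beta*}*, optimal for RCRL
```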

Weakness 4 (convergence rate)

We have addressed the convergence rate of the proposed method in the general response above. Please check the general response.

Comment

Thank the authors for their response. Since there exists a technical error in the proof, I think that this paper needs a major revision, and thus I decreased my score.

Comment

Thanks for the response. However, we would like to clarify that there are no technical errors in our paper. The error mentioned in the general response arose from an additional analysis of the convergence rate, which was conducted in response to the reviewers' requests. More specifically, there was an error in the proof of the convergence rate in the general response, namely the part "$J_R(\pi^*) - J_R(\pi_t) \geq 0$ for $t \in \mathcal{N}$," which has now been corrected to "one of the following must hold: (1) $|\mathcal{N}| \geq \frac{T}{2}$ or (2) $\sum_{t \in \mathcal{N}}(J_R(\pi^*) - J_R(\pi_t)) \leq 0$." The revised proof has corrected all errors, and reviewer DPN9 has also increased their score from 6 to 7. We apologize for the error, which has now been corrected, and would like to reiterate that there are no errors in the paper itself.

Comment

Thank you for your explanation!

I misunderstood something before. I will maintain my original score 7, and support acceptance.

Review
7

In this paper, the authors consider the framework of risk-constrained reinforcement learning (RCRL) by tackling risk-measure-based constraints. The non-linearity of risk measures makes the convergence of RL schemes challenging. In this paper, a spectral risk measure-constrained RL algorithm is proposed. The proposed algorithm is based on a bilevel optimization. The outer problem deals with the optimization of dual variables, while the inner one finds the optimal solution given a set of dual variables. The authors claim that the proposed scheme is the first method that converges to optimality for tabular MDP problems. Simulation results are presented in which the proposed method is tested on continuous control tasks.

Strengths

The paper is well-written and easy to follow. The results and claims presented in the paper appear to be correct.

Weaknesses

There are some limitations with respect to the constructed loss function and scalability of the adopted linear interpolation mechanism.

Questions

  1. The proposed approach approximates the spectrum by a discretized spectrum. What is the gap in performance with respect to the optimal solution of the original system? Although Theorem 6.2 shows that the performance gap reduces to zero as $M$ becomes large, how does this $M$ scale as the number of states and actions increases?

  2. The intuition behind the sampling process and why it works is not clear.

  3. What is the rationale behind the loss function proposed by the authors? Why is the loss function novel? Is there any other loss function for which the scheme does work well? If yes, can the authors come up with a more general loss function? If no, then this is a limitation of the work that it works for only a specific type of loss function (constructed by the authors).

Limitations

Yes

Author Response

Thanks for the valuable feedback on our paper. We appreciate the effort in reviewing our work. We have carefully considered the reviewer's comments. Below, we address the concerns the reviewer has raised:

Q1 (performance gap):

In the general response above, we analyzed the performance gap caused by the spectrum approximation; please refer to the general response. Briefly, we first bounded the gap by the difference in performance of the original problem at different thresholds. Analyzing the performance difference at different thresholds in a constrained optimization problem is challenging due to discontinuous changes in the feasible set or optimal points. Thus, we made some assumptions to analyze the gap and found that the gap is bounded by $KR_\mathrm{max}/(C_\mathrm{max}M)$. This suggests that the gap depends on the scale of the rewards and costs rather than on the state and action spaces. Therefore, we conjecture that the value of $M$ can be set independently of the state and action spaces.

Q2 (intuition behind the sampling process):

Thanks for pointing out the need for clarification. Section 6.2 introduced 1) why we need to learn the distribution of $\beta$ (the sampler $\xi$), 2) the training method, and 3) the technique for sampling $\beta$ from $\xi$. Then, in Section 7, we presented the practical implementation, which seems to be what the reviewer is pointing out. As mentioned in lines 230-231, $\beta[j]$ is sampled within the range $[\beta[j-1], C_\mathrm{max}/(1-\gamma)]$. To implement this, distributions with finite support, such as the beta or truncated normal distribution, can be used. However, defining the interval of the distribution in terms of $\beta$ complicates the gradient computation graph in PyTorch, increasing the likelihood of computational errors. To address this, we decided to sample the difference of $\beta$, motivated by the fact that $\beta[i] \sim [\beta[i-1], C_\mathrm{max}/(1-\gamma)]$ is equivalent to $\beta[i] = \beta[i-1] + \Delta$, where $\Delta \sim [0, C_\mathrm{max}/(1-\gamma) - \beta[i-1]]$. To further remove $\beta$ from the interval, we instead sampled $\Delta$ within $[0, C_\mathrm{max}/(1-\gamma)]$, which can result in $\beta[i] > C_\mathrm{max}/(1-\gamma)$. Nevertheless, if $\beta$ increases, the value of $J_C(\pi; \beta)$ also increases due to the conjugate function in (2), which in turn reduces $J_R(\pi)$. As a result, the distribution of $\beta$ is trained to output low probabilities for such cases. We will add this explanation to the appendix.
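A minimal sketch of this increment-based sampling (hypothetical function name, shown for a single cost with $M$ discretization points; the increment distribution is an arbitrary example, not the sampler architecture from the paper):

```python
import torch

def sample_beta(delta_dist, M, c_max, gamma):
    """Sample beta by accumulating nonnegative increments Delta so that
    beta[j] = beta[j-1] + Delta[j] is nondecreasing by construction."""
    upper = c_max / (1.0 - gamma)
    deltas = delta_dist.sample((M,)) * upper   # each Delta[j] lies in [0, upper]
    beta = torch.cumsum(deltas, dim=0)         # beta[j] = sum of the first j increments
    # beta[j] may exceed upper; as explained above, such draws yield a larger
    # J_C(pi; beta) and hence receive low probability as the sampler is trained.
    return beta

# Example: increments drawn from a Beta(2, 2) distribution rescaled to [0, upper].
beta = sample_beta(torch.distributions.Beta(2.0, 2.0), M=4, c_max=1.0, gamma=0.99)
```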

Q3 (loss function):

First, we want to clarify the meaning of line 235, which seems to be what the reviewer is pointing out. The meaning is that the process of solving the outer problem is novel, rather than a claim that the proposed loss function itself is novel. Specifically, the proposed method enables the inner and outer problems to be solved concurrently rather than sequentially. Thus, we will change the phrase to "novel update process."

Next, the rationale for structuring the loss as indicated in lines 236 and 240 is as follows: the goal of the sampler is to produce a high probability for the optimal $\beta$. To achieve this, we need an indicator that can provide feedback on how well the RCRL problem is solved for a given $\beta$. Inspired by the paper [R2], which converts constrained RL to unconstrained RL using penalty functions, we defined the indicator as shown in line 236.

Additionally, as observed in the proof of Theorem 6.3, any function $J$ that satisfies $\max_\pi J(\pi; \beta)= \max_{\pi \in \{ \pi \mid J_{C_i}(\pi;\beta) \leq d_i \}} J_R(\pi)$ can be used. One such $J$ could be $J(\pi;\beta) = J_R(\pi)$ if $J_{C_i}(\pi;\beta)\leq d_i$ for all $i$; otherwise, $-R_\mathrm{max}/(1-\gamma)$. Alternatively, instead of providing feedback, a method that uses the policy's value function to compute the optimal distribution in closed form could be proposed. As mentioned earlier, since we have proposed a new approach to solving the outer problem, it does not seem necessary to propose a more general loss function. Developing more efficient and reliable loss functions could be done in future work.
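As a concrete sketch of the alternative indicator mentioned above (a hypothetical helper for illustration, not the loss actually used in the paper), feasible policies are scored by their return and infeasible ones by the lowest attainable value:

```python
def alternative_indicator(J_R_pi, J_C_pi, d, R_max, gamma):
    """Alternative J(pi; beta): the reward objective if every risk constraint is met,
    and -R_max / (1 - gamma) otherwise."""
    feasible = all(jc <= di for jc, di in zip(J_C_pi, d))
    return J_R_pi if feasible else -R_max / (1.0 - gamma)
```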

References

  • [R2] Zhang, Linrui, et al. "Penalized proximal policy optimization for safe reinforcement learning." arXiv preprint arXiv:2205.11814 (2022).

Comment

Thank you for the detailed response which helped in having a better understanding of the paper. I have read all the responses and discussions. Some of my concerns are answered now. However, since the paper needs a revision in the technical part, I will keep my score as it is. I have no further questions.

Comment

Thanks for the response, and we are glad to hear that some of the reviewer's concerns have been addressed. We would like to clarify that the error was not in the submitted manuscript but in the global response (convergence rate analysis), which has now been resolved and does not affect the main theoretical results of the manuscript.

Comment

Thanks for the clarification. I have read the paper and response again, and I am happy to raise the score accordingly.

Review
7

The paper provides a new spectral-risk-constrained policy optimization algorithm that uses the duality of spectral risk measures and a bilevel optimization approach to address the risk-constrained reinforcement learning problem, solving inner primal problems and an outer dual problem. The paper provides global convergence guarantees in the tabular setting. The paper supports the results with experimental comparisons to other methods.

Strengths

The paper is well written; despite the many complex notions involved, they are all clearly introduced. The theoretical result seems to be solid and important to the community. The idea of solving the dual problem of risk-constrained RL using a sampler distribution seems to be novel too. The proofs are provided, though I did not check them carefully. The experiments demonstrate good performance of the proposed method.

Weaknesses

There are several things which are not very clearly defined and a few minor typos that I noticed.

Line 88: $F_X$ is not defined; I believe it is a probability function? Line 168: where does the log in eq. (9) come from?

Conflicting notations:

  • Currently, alpha is used both for denoting the learning rate and the risk parameter. Maybe the authors could use different notations for these notions.
  • $F$ is probably a probability function, and then a feasibility function (line 114).
  • In Theorem 6.2, $i$ is used to iterate over the discretization $1{:}M-1$, but it is later used to iterate over constraints $1{:}N$, while $j$ is used for discretizations. Maybe use $j$ in Theorem 6.2 and other places consistently too?

Typos:

line 102: an -> a

Questions

All questions are above in the weaknesses section.

Limitations

The limitations are discussed. There is no potential negative societal impact.

Author Response

Dear Reviewer,

Thanks for the positive review and valuable feedback on our paper. We appreciate the effort in reviewing our work. Below, we respond to each of the points mentioned.

Weakness 1 (lines 88 and 168):

$F_X$ is the cumulative distribution function (CDF) of the random variable $X$. We will add the definition of $F_X$ to the paper after line 88.

The logarithmic operator in eq. (9) comes from the following derivation of the policy gradient of the sub-risk measure (by differentiating $\pi'$ in eq. (8)): $\nabla_\theta \mathcal{R}_\sigma^g(G^{\pi_\theta}) = \nabla_\theta\mathbb{E}_{d_\rho^{\pi_\theta},\pi_\theta}[A_g^\pi(s,a)]/(1-\gamma) = \mathbb{E}_{d_\rho^{\pi_\theta}}[\sum_a \nabla_\theta \pi_\theta(a|s) A_g^\pi(s,a)]/(1-\gamma) = \mathbb{E}_{d_\rho^{\pi_\theta}, \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) A_g^\pi(s,a)]/(1-\gamma)$.
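In sampled form, this log-derivative step corresponds to the standard score-function estimator; a minimal PyTorch-style sketch (hypothetical tensor names, assuming `log_probs` carries gradients and `advantages` holds the sub-risk advantage values) is:

```python
import torch

def policy_gradient_step(log_probs, advantages, optimizer, gamma):
    """Monte Carlo estimate of eq. (9): E[ grad log pi_theta(a|s) * A_g^pi(s,a) ] / (1 - gamma).
    The objective is negated so that a gradient-descent step performs ascent on R_sigma^g."""
    loss = -(log_probs * advantages.detach()).mean() / (1.0 - gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```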

Weakness 2 (conflicting notations):

Thanks for the comments on the conflicting notations. We will change the notations as follows:

  1. Learning rate: from $\alpha$ to $\omega$.
  2. Feasibility function: from $\mathcal{F}$ to $\chi$.
  3. Iterator of the discretization in Theorem 6.2: from $i$ to $j$.

We will also correct the typo.

Comment

I thank the authors for their response. I keep my score and support acceptance.

Author Response

General Response

We appreciate all the reviewers for their insightful comments and suggestions. In this response, we will address the common concerns raised by the reviewers: the convergence rate analysis and the performance gap.

Convergence rate

We will analyze the convergence rate of the proposed method in Section 5 through finite-time analysis, similar to the approach used in CRPO [R1]. To achieve this, we make a few minor modifications to the policy update rule in line 169 as follows:

  1. "if constraints are satisfied" is changed to "if JCi(πt)di+ηJ_{C_i}(\pi_t)\leq d_i + \eta i\forall i," where η\eta is a positive number.
  2. The time varying step size αt\alpha_t is changed to a fixed learning rate α\alpha.

Then, the following is satisfied:

Let $\alpha=(1-\gamma)^{2.5}/\sqrt{T}$ and $\eta = (2D + 20K^2)/((1-\gamma)^{2.5}\sqrt{T})$, where $D:=\sum_s d_\rho^{\pi^*}(s)D_\mathrm{KL}(\pi^*(\cdot|s)\,\|\,\pi_0(\cdot|s))$ and $K:=\max(N\lambda_\mathrm{max}C_\mathrm{max}, \lambda_\mathrm{max}R_\mathrm{max}, R_\mathrm{max}, 1)$.
Then, $J_R(\pi^*) - \mathbb{E}_{t\sim \mathcal{N}}[J_R(\pi_t)] \leq (2D + 20K^2)/((1-\gamma)^{2.5}\sqrt{T})$, and $\mathbb{E}_{t\sim\mathcal{N}}[J_{C_i}(\pi_t)] - d_i \leq (2D + 20K^2)/((1-\gamma)^{2.5}\sqrt{T})$ for all $i$.

Proof:

By substituting $|J_R(\pi^*)-J_R(\pi_t)|\leq 2R_\mathrm{max}/(1-\gamma)$ and $|J_{C_i}(\pi^*)-J_{C_i}(\pi_t)|\leq 2C_\mathrm{max}/(1-\gamma)$ into Eq. (30),
$\sum_{t\in\mathcal{N}}(\alpha(J_R(\pi^*)-J_R(\pi_t))-2\alpha^2N\lambda_\mathrm{max}C_\mathrm{max}/(1-\gamma))+\sum_{t\notin\mathcal{N}}(\alpha\eta-2\alpha^2\lambda_\mathrm{max}R_\mathrm{max}/(1-\gamma))$
$\leq D + \sum_{t\in\mathcal{N}}2\alpha^2(R_\mathrm{max}+\alpha N\lambda_\mathrm{max}C_\mathrm{max})^2/(1-\gamma)^5+\sum_{t\notin\mathcal{N}}2\alpha^2(\alpha\lambda_\mathrm{max}R_\mathrm{max}+N\lambda_\mathrm{max}C_\mathrm{max})^2/(1-\gamma)^5$.

Using the definition of $K$, $\sum_{t\in\mathcal{N}}\alpha(J_R(\pi^*)-J_R(\pi_t))+\sum_{t\notin\mathcal{N}}\alpha\eta\leq D+2T\alpha^2K^2(\alpha+1)^2/(1-\gamma)^5+2T\alpha^2K/(1-\gamma)$.
Also, due to $J_R(\pi^*)-J_R(\pi_t)\geq0$ for $t\in\mathcal{N}$, $(T-|\mathcal{N}|)\alpha\eta\leq D+2T\alpha^2K^2(\alpha+1)^2/(1-\gamma)^5+2T\alpha^2K/(1-\gamma)$.

Since $\alpha\eta T/2\geq D+2T\alpha^2K^2(\alpha+1)^2/(1-\gamma)^5+2T\alpha^2K/(1-\gamma)$, we can get $(T-|\mathcal{N}|)\alpha\eta\leq T\alpha\eta/2\Rightarrow |\mathcal{N}|\geq T/2$.

Then, $J_R(\pi^*)-\mathbb{E}_{t\sim\mathcal{N}}[J_R(\pi_t)]=\sum_{t\in\mathcal{N}}(J_R(\pi^*)-J_R(\pi_t))/|\mathcal{N}|$
$\leq2(D+2T\alpha^2K^2(\alpha+1)^2/(1-\gamma)^5+2T\alpha^2K/(1-\gamma))/(\alpha T)\leq(2D+20K^2)/((1-\gamma)^{2.5}\sqrt{T})$.

Also, $\mathbb{E}_{t\sim\mathcal{N}}[J_{C_i}(\pi_t)]-d_i\leq\eta=(2D + 20K^2)/((1-\gamma)^{2.5}\sqrt{T})$.
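To make the modified update rule concrete, here is a hedged sketch of the tolerance-based switching step used in this analysis (hypothetical interface; the exact rule is the one in line 169 of the paper):

```python
def switching_update(policy, J_C, d, eta, alpha, reward_grad, cost_grad):
    """One iteration of the CRPO-style rule above: ascend on the reward objective
    if J_{C_i}(pi_t) <= d_i + eta for all i, otherwise reduce a violated constraint."""
    violated = [i for i, (jc, di) in enumerate(zip(J_C, d)) if jc > di + eta]
    if not violated:
        policy.step(alpha * reward_grad())        # this iteration t belongs to the set N
        return True
    policy.step(-alpha * cost_grad(violated[0]))  # descend on J_{C_i} for a violated i
    return False
```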

Performance Gap

Let $A(d)=\{ \pi \mid \mathcal{R}_{\sigma}(X)\leq d \}$ and $\tilde{A}(d)=\{ \pi \mid \mathcal{R}_{\tilde{\sigma}}(X)\leq d \}$, where $X=G_C^\pi$. Then, the performance gap is $G:=\max_{\pi\in A(d)}J_R(\pi)-\max_{\pi\in\tilde{A}(d)}J_R(\pi)$. According to Lemma 6.1, $|\mathcal{R}_{\sigma}(G_C^\pi)-\mathcal{R}_{\tilde{\sigma}}(G_C^\pi)|\leq K/M$, where $K=C_\mathrm{max}\sigma(1)/(1-\gamma)$, so $A(d-K/M) \subset \tilde{A}(d)$. Thus, $G\leq\max_{\pi\in A(d)}J_R(\pi)-\max_{\pi\in A(d-K/M)}J_R(\pi)$.

As a result, the gap is bounded by the performance difference between the two thresholds. However, in constrained optimization, this type of problem is challenging due to discontinuous changes in the feasible set or optimal points. Thus, we will use some assumptions to analyze the gap practically, which will be introduced below.

Let $d_\rho^\pi(s,a):=(1-\gamma)\sum_t \gamma^t\mathbb{P}(s_t=s, a_t=a)$. Then, $(1-\gamma)(J_R(\pi)-J_R(\pi'))=\mathbb{E}_{(s,a)\sim d_\rho^\pi}[A_R^{\pi'}(s,a)]=\langle d_\rho^\pi,A_R^{\pi'}\rangle=\langle d_\rho^\pi-d_\rho^{\pi'},A_R^{\pi'}\rangle\leq\|d_\rho^\pi-d_\rho^{\pi'}\|_1 R_\mathrm{max}/(1-\gamma)$. The performance gap is expressed as $G=\max_{\pi'\in A'}\min_{\pi\in A}(J_R(\pi')-J_R(\pi))\leq\max_{\pi'\in A'}\min_{\pi\in A} \|d_\rho^{\pi'}-d_\rho^\pi\|_1 R_\mathrm{max}/(1-\gamma)^2$, where $A=A(d-K/M)$ and $A'=A(d)$. As a result, we need to find $\max_{\pi'\in A'}\min_{\pi \in A}\|d_\rho^{\pi'}-d_\rho^\pi\|_1$.

Since the one-norm above is the maximum distance between the two nearest policies in the sets $A$ and $A'$, we only need to evaluate the boundaries $\partial A$ and $\partial A'$. If $\pi'$ on $\partial A'$ is given, we need to find the closest policy $\pi$ on $\partial A$: $\min_{\pi\in\partial A}\|d_\rho^{\pi'}-d_\rho^\pi\|_1=\min_\pi\|d_\rho^{\pi'}-d_\rho^\pi\|_1$ s.t. $J_C(\pi')-J_C(\pi)=K/M$. Let $J_C(\pi)=\inf_g\mathcal{R}_\sigma^{g}(\pi)=\mathcal{R}_\sigma^{g^*}(\pi)$. Then, $J_C(\pi')-J_C(\pi)\leq\mathcal{R}_\sigma^{g^*}(\pi')-\mathcal{R}_\sigma^{g^*}(\pi)=\langle d_\rho^{\pi'}-d_\rho^{\pi},A_{g^*}^\pi\rangle/(1-\gamma)\leq\|d_\rho^{\pi'}-d_\rho^\pi\|_1 C_\mathrm{max}/(1-\gamma)^2$. Here, we assume that there exists an occupancy measure $d_\rho^{\pi}$ that satisfies the equality in the above equation. Then, $\min_{\pi\in\partial A}\|d_\rho^{\pi'}-d_\rho^\pi\|_1=K(1-\gamma)^2/(C_\mathrm{max}M)\Rightarrow\max_{\pi'\in A'}\min_{\pi\in A}\|d_\rho^{\pi'}-d_\rho^\pi\|_1=K(1-\gamma)^2/(C_\mathrm{max}M)$. Finally, the performance gap is bounded by $KR_\mathrm{max}/(C_\mathrm{max}M)$.

Without the assumption, it is not guaranteed that the performance gap will always be bounded by $KR_\mathrm{max}/(C_\mathrm{max}M)$. Nevertheless, an insight can be gained that the gap is influenced by the scale of rewards and costs rather than the state space and action space. Therefore, it seems possible to conjecture that the proposed approximate approach is scalable.

References

  • [R1] Xu et al. "CRPO: A new approach for safe reinforcement learning with convergence guarantee." ICML, 2021.

Comment

As reviewer DPN9 pointed out, the proof of the convergence rate provided in the general response above has an error: "$J_R(\pi^*) - J_R(\pi_t) \geq 0$ for $t \in \mathcal{N}$" does not hold. We apologize for the confusion. We modify the proof as follows, starting from the statement "Also, due to $J_R(\pi^*) - J_R(\pi_t) \geq 0$ for $t \in \mathcal{N}$" in the original proof above.


Given that $D + \frac{2T\alpha^2K^2(\alpha + 1)^2}{(1-\gamma)^5} + \frac{2T\alpha^2K}{1-\gamma} \leq \frac{\alpha\eta T}{2}$, one of the following must hold: (1) $|\mathcal{N}| \geq \frac{T}{2}$ or (2) $\sum_{t \in \mathcal{N}}(J_R(\pi^*) - J_R(\pi_t)) \leq 0$.
Assume neither condition is satisfied. Then, $\frac{\alpha\eta T}{2} < (T - |\mathcal{N}|)\alpha\eta = \sum_{t \notin \mathcal{N}}\alpha\eta < \sum_{t \in \mathcal{N}}\alpha(J_R(\pi^*) - J_R(\pi_t)) + \sum_{t \notin \mathcal{N}}\alpha\eta \leq D + \frac{2T\alpha^2K^2(\alpha + 1)^2}{(1-\gamma)^5} + \frac{2T\alpha^2K}{1-\gamma} \leq \frac{\alpha\eta T}{2}$, which leads to a contradiction. Therefore, one of the two conditions must hold.
Furthermore, if we assume $|\mathcal{N}| = 0$, it follows that $\alpha\eta T \leq \frac{\alpha\eta T}{2}$, which is also a contradiction. Hence, $|\mathcal{N}| \neq 0$.

If $|\mathcal{N}| \geq T/2$, then $J_R(\pi^*) - \mathbb{E}_{t \sim \mathcal{N}}[J_R(\pi_t)] = \frac{\sum_{t \in \mathcal{N}}(J_R(\pi^*) - J_R(\pi_t))}{|\mathcal{N}|} \leq \frac{2(D + \frac{2T\alpha^2K^2(\alpha + 1)^2}{(1-\gamma)^5} + \frac{2T\alpha^2K}{1-\gamma})}{\alpha T} \leq \frac{2D + 20K^2}{(1-\gamma)^{2.5}\sqrt{T}}$, and this inequality also holds when $\sum_{t \in \mathcal{N}}(J_R(\pi^*) - J_R(\pi_t)) \leq 0$.
Consequently, $J_R(\pi^*) - \mathbb{E}_{t \sim \mathcal{N}}[J_R(\pi_t)]\leq\frac{2D + 20K^2}{(1-\gamma)^{2.5}\sqrt{T}}$.
Also, $\mathbb{E}_{t \sim \mathcal{N}}[J_{C_i}(\pi_t)] - d_i \leq \eta = \frac{2D + 20K^2}{(1-\gamma)^{2.5}\sqrt{T}}$.

Comment

Thanks to the reviewers for reading and responding to the rebuttal, and we would like to clarify the following points:

  1. [no modification of technical parts] There have been no modifications to the technical parts of the manuscript and the supplementary during the rebuttal period. Our technical contributions are summarized as follows:
    • Show the linear property of the performance difference in RL problems with sub-risk measure constraints.
    • Ensure convergence to an optimal policy of the inner problem.
    • Analyze the approximation error of the discretized spectrum.
    • Ensure convergence to an optimal outer variable of the outer problem.
  2. [Convergence rate analysis] In response to questions raised by Reviewers tq31 and DPN9 regarding the convergence rate, we have provided the convergence rate and proof in the general response above. After discussions with Reviewer DPN9, an issue in the proof (provided in the general response, not in the manuscript) has been corrected, and there are no errors now.
  3. [Performance gap caused by spectrum discretization] In response to queries about the performance gap caused by the spectrum discretization, we discovered that the performance gap is bounded by the performance difference of the original problem at different thresholds. Additionally, with some approximations, it is bounded by $KR_\mathrm{max}/(C_\mathrm{max}M)$, which does not depend on the state and action spaces, suggesting that the spectrum discretization is scalable.
Final Decision

The paper proposes a spectral risk measure-constrained RL algorithm, as well as a bilevel optimization approach. The proposed algorithm comes with optimality guarantees in the tabular setting, and the good performance is also demonstrated via numerical experiments. The reviewers appreciated the importance of the problem, strength of the results (both theoretical and experimental) and they praised the clarity of the paper. The rebuttal from the authors addressed the reviewers' concerns, and there is a consensus towards accepting the paper. I concur with this evaluation: this will be a nice addition to the NeurIPS program.