Rectified Robust Policy Optimization for Robust Constrained Reinforcement Learning without Strong Duality
Strong duality does not generally hold for robust constrained RL; we address this with a primal-only algorithm.
Abstract
Reviews and Discussion
This paper studies robust constrained reinforcement learning (RCRL). It first presents a counterexample illustrating that strong duality does not generally hold in RCRL. The paper therefore proposes the Rectified Robust Policy Optimization (RRPO) algorithm, built on the earlier CRPO algorithm, and provides the sample complexity of converging to an approximately optimal and safe policy. The empirical results support the algorithm.
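For reference, a standard statement of the duality gap in constrained policy optimization, in our own notation (symbols and sign conventions are ours, not necessarily the paper's):

$$
P^\star \;=\; \max_{\pi}\ \Big\{\, V_0^{\pi} \;:\; V_i^{\pi} \ge b_i,\ i=1,\dots,m \,\Big\},
\qquad
D^\star \;=\; \min_{\lambda \ge 0}\ \max_{\pi}\ \Big[\, V_0^{\pi} + \textstyle\sum_{i=1}^{m} \lambda_i \big( V_i^{\pi} - b_i \big) \Big].
$$

Weak duality always gives $D^\star \ge P^\star$; strong duality means $D^\star = P^\star$. According to the paper's counterexample, once the value functions are replaced by their robust (worst-case over an uncertainty set) counterparts, the gap $D^\star - P^\star$ can be strictly positive, which is what undermines primal-dual methods in this setting.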
Questions For Authors
Please see the weaknesses.
Claims And Evidence
The challenges of combining robust RL and constrained RL are not explicitly elaborated on. Although the paper emphasizes that the introduction of uncertainty causes the strong-duality guarantee to fail, rendering the primal-dual methods used in constrained RL ineffective, there is no discussion of the challenges that arise when primal-only methods (such as CRPO) are combined with robust RL.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes. I sketched the proof of the main theorem.
Experimental Design And Analysis
Yes.
Supplementary Material
Yes. I sketched the proof of the main theorem.
Relation To Broader Scientific Literature
The authors list scenarios where machines may experience unforeseen transitions due to equipment aging or failure, and discuss how to ensure robot safety in these worst-case situations. Additionally, the authors could enumerate more application scenarios of Robust Constrained RL (such as autonomous driving, smart healthcare, etc.) to enable readers to fully recognize the practical research value and strong motivation of Robust Constrained RL.
Essential References Not Discussed
This paper has a relatively comprehensive discussion of the relevant literature on robust RL and constrained RL, and also discusses prior work on strong duality in some robust settings.
Other Strengths And Weaknesses
Weaknesses
- The algorithm RRPO seems to merely incorporate robust RL methods for value-function approximation on top of CRPO. The authors do not elaborate on the challenges of moving from CRPO to RRPO, that is, of combining robust RL and constrained RL. Therefore, the technical contribution is somewhat marginal.
- Assumption 4.2 contradicts the counterexample, in which the state is transient under any policy. In other words, under Assumption 4.2, could the duality gap be zero, so that primal-dual-type methods are applicable and have strong guarantees?
- The theory (Theorem 4.3) does not seem to suggest any relationship between the sample complexity and the model mismatch.
- In the experiments, the paper only uses a CMDP method as the baseline; it would be more convincing to also consider robust CMDP methods (e.g., Wang et al. 2022 and Ghosh et al. 2024).
Other Comments Or Suggestions
Please see the weaknesses.
We appreciate the reviewer's constructive suggestions. Here are our point-by-point responses:
Relation To Broader Scientific Literature:
The reviewer's suggestion on providing more application scenarios is indeed helpful. We will revise the introduction section to include these applications.
Other Strengths And Weaknesses:
- RRPO is similar to CRPO: Given that strong duality may not hold in robust constrained RL, it is natural to explore whether existing algorithms in constrained RL can still be effective. Our contribution is not in proposing an entirely new algorithm but in demonstrating that an existing classical constrained RL approach can achieve optimal sample complexity without relying on strong duality assumptions. This direction is no less challenging than proposing a new algorithm, for two reasons: (1) Much of the constrained RL literature relies on the well-known paper "Constrained Reinforcement Learning Has Zero Duality Gap"; if strong duality does not hold, most of those theoretical analyses become invalid. (2) Even when the convergence analysis does not rely on a zero duality gap, in our empirical examples CRPO still converges to an infeasible result. We successfully address these two challenges and show that it is not necessary to introduce additional complicated structures.
- Assumption 4.2 contradicts the counterexample: We would like to clarify that Assumption 4.2 does not contradict the counterexample, because
- The stationary distribution at the state in question is not zero unless the policy takes the specific form constructed in the counterexample. In that case, every time the agent arrives at the trapping state it stays there, which is what makes the state in question transient. More explicitly, by plugging in the corresponding policy parameters, one can solve for the stationary distribution and verify the transience.
- In Assumption 4.2, we use the discounted visitation measure, which is slightly different from the stationary distribution. For any ergodic Markov chain, the stationary distribution is independent of the initial distribution, whereas the discounted visitation measure depends on it. We therefore comment under Assumption 4.2 that a more uniform initial distribution can be used to obtain a strictly positive visitation measure; this is also valid for our counterexample and does not change the absence of strong duality in robust constrained RL (a minimal numerical illustration of this distinction appears at the end of this response).
- In Theorem 4.3, the dependence on the model mismatch is included in the Policy Evaluation Accuracy assumption. Usually, the larger the model mismatch, the more computation is required to achieve the desired accuracy. The exact dependence is determined by the structure of the uncertainty set and the robust policy evaluation algorithm, which is beyond the scope of our study.
- We appreciate the reviewer's valuable suggestions. We will add these baselines in our revision.
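To illustrate the distinction drawn above between the stationary distribution and the discounted visitation measure, here is a minimal numerical sketch on a hypothetical two-state chain (not the paper's counterexample); the transition matrix, discount factor, and all names are our own assumptions:

```python
import numpy as np

# Hypothetical two-state chain (not the paper's counterexample): state 1 is
# absorbing, so state 0 is transient and has zero stationary mass.
gamma = 0.9
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
stat = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
stat = stat / stat.sum()

# Discounted visitation measure d_rho = (1 - gamma) * rho (I - gamma P)^{-1},
# which depends on the initial distribution rho.
rho = np.array([0.5, 0.5])  # uniform initial distribution
d_rho = (1 - gamma) * rho @ np.linalg.inv(np.eye(2) - gamma * P)

print("stationary distribution:", stat)   # approx [0, 1]
print("discounted visitation:  ", d_rho)  # approx [0.05, 0.95], strictly positive
```

The transient state receives zero stationary mass but strictly positive discounted visitation mass whenever the initial distribution puts weight on it, which is the point made in the response regarding Assumption 4.2.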
This paper studies efficient algorithms for solving constrained robust MDPs (RMDPs). In light of the lack of strong duality for constrained RMDPs, a primal-only algorithm, Rectified Robust Policy Optimization (RRPO), is proposed with theoretical convergence-rate guarantees. The performance of the algorithm is further supported by numerical simulations.
Questions For Authors
See above.
Claims And Evidence
The claims are supported by both theoretical and empirical results, though some results seem suspicious.
Methods And Evaluation Criteria
The evaluation method looks reasonable to me.
Theoretical Claims
After a careful check of the proofs (esp. those in Section C), I find two major concerns that may jeopardize the validity of the theoretical results.
- The proofs fail to handle the multiple update timescales well.
- The current analysis looks very similar in flavor to that of standard NPG. However, the NPG update rule at a given iteration with respect to the i-th value function (Theorem C.5, and where it is cited on line 973 and line 1089) only holds when i is the index chosen to be updated in that round.
- More specifically, those equations only hold for the index actually selected at that iteration, not for an arbitrary index, which makes it hard to form the telescoping sum.
- I personally expected to see how the authors would handle the interaction of multiple update timescales when reading the algorithm, but obviously this key technical challenge is "circumvented" using bad notation.
- Besides the above issue, Lemmas C.6 and C.7 also use Lemma C.4 in an incorrect way.
- Note that in the statement of Lemma C.4, all initial distributions involved are identical. From a very high level, if this lemma is the only tool used, we should not expect different initial distributions in the results.
- However, this pattern is broken in the proofs of Lemma C.6 (inequality (i) around line 1033) and Lemma C.7 (first inequality around line 1075).
- It is possible to show a performance difference lemma that involves different initial distributions, but that lemma probably has a different form, which may significantly change the form of the final results. Specifically, I would expect an additional term characterizing the difference between the two initial distributions.
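For reference, the standard performance difference lemma in the non-robust, discounted setting, stated in our own notation (a textbook result, not the paper's Lemma C.4, whose robust form may differ):

$$
V^{\pi}(\rho) - V^{\pi'}(\rho)
\;=\;
\frac{1}{1-\gamma}\,
\mathbb{E}_{s \sim d_{\rho}^{\pi}}\,
\mathbb{E}_{a \sim \pi(\cdot \mid s)}
\big[ A^{\pi'}(s,a) \big].
$$

Note that the same initial distribution $\rho$ appears on both sides, both directly in the values and through the visitation measure $d_{\rho}^{\pi}$; this is the pattern referred to in the comments above.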
Experimental Design And Analysis
Since this is mostly a theory-oriented paper, the numerical simulations only act as supporting evidence. For this purpose, the experimental design and results look good to me.
Supplementary Material
See the "Theoretical Claims" section.
Relation To Broader Scientific Literature
N/A
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Strengths:
- I appreciate Appendix A.2, which compares this paper against a few very recent works and justifies its novelty.
Weaknesses:
- The writing of the paper can be further improved.
- Algorithm 1 is written in a very ambiguous way. Variables are sometimes marked with the time index, but usually not. A parameter is used before it is explained (only in a later sentence in Section 4.2). Some variables are never used.
- In the last section on numerical examples, most of the effort is spent explaining the experimental setting, leaving only a couple of sentences to explain the observations and their implications.
- Some assumptions are hidden deeply in the proof details.
- It seems from the problem formulation that the algorithm can handle general uncertainty sets, but in fact it only works with those uncertainty sets equipped with efficient Q-function approximators.
- In Lemma C.8, a quantity is assumed to be Lipschitz without proof.
- Proofs may be invalid due to the issues mentioned above.
Other Comments Or Suggestions
- There are a few typos and notational inconsistencies.
- In equations (6) and (7): the robust value function is denoted by a symbol different from the one used elsewhere.
- On page 14, step 3 (below line 750): the partial derivative is written incorrectly.
- In the appendix: two notations are used interchangeably.
Ethics Review Issues
N/A
Regarding Major Issues
We deeply appreciate the reviewer's careful reading. We fully understand that these two concerns are critical, and we believe they are caused by misunderstandings stemming from our unclear presentation. Here we provide additional discussion to clarify and explain why the results are correct:
- Invalid telescoping: It is correct that all inequalities (Lemma C.6 and Lemma C.7) only hold for the index sampled at that step. However, we would like to highlight that we can always sum them over the iterations by carefully handling the mismatch in the indices. To make this clearer, for each step we record which index is sampled; when we sum over the steps at which a given index is sampled (as an example), the right-hand side does not depend on that index, so the summation is always valid (a schematic of this summation, in our own notation, is sketched below, after the second point). We spent considerable effort to make this work: (1) we use a Lipschitzness argument to remove the dependence on the sampled index in Line 1165, (2) we replace $\hat{Q}_i^{\pi_i} - Q_i^{\pi_i}$ with the policy evaluation error (Assumption 4.1), and (3) we follow the standard NPG analysis to decompose the KL divergence and construct the desired crossing structure (Lemma C.7).
When handling the left-hand side, we require the tolerance to be "not too small" in Lemma C.8. Under this setting, we ensure that the relevant set of iterations is always non-empty; that is, during the whole training process there always exists a step at which the corresponding index is sampled. As a result, in Line 1169, we can keep this term and lower-bound the other terms using the trivial bound obtained in Lines 1157-1166.
- Misuse of Lemma C.4 in Lemma C.6 and Lemma C.7: We would like to clarify that we are using Lemma C.4 correctly. It does not require the initial distribution to be fixed; rather, it is an inequality that depends on the initial distribution. When we change the initial distribution to another distribution, the inequality still holds, with two differences: (1) the visitation measure changes to the one induced by the new initial distribution, and (2) the robust value function is evaluated at the new initial distribution. As a result, we use Lemma C.4 in Lemmas C.6 and C.7 as follows:
- In Lemma C.6, we use Lemma C.4 in Lines 957-958, which faithfully follows the structure of Lemma C.4. Line 1033, mentioned by the reviewer, follows another result, which applies a change-of-measure trick to switch from the current measure to another measure. Lemma C.4 is not involved there.
- In Lemma C.7, we use Lemma C.4 in Line 1075. Here, we have defined a shorthand notation in Line 1073. As the subscript is aligned with the argument on the left-hand side, our application of Lemma C.4 here is still correct.
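A schematic of the summation described in the first point, written in our own notation (the indicator $\mathbb{1}\{i_t = i\}$ and the placeholder quantities $f_i(t)$, $g(t)$ are illustrative, not the paper's exact terms): if a per-step inequality $f_i(t) \le g(t)$ is only available at the steps $t$ where index $i$ is the one sampled, it can still be summed as

$$
\sum_{t=0}^{T-1} \mathbb{1}\{i_t = i\}\, f_i(t)
\;\le\;
\sum_{t=0}^{T-1} \mathbb{1}\{i_t = i\}\, g(t)
\;\le\;
\sum_{t=0}^{T-1} g(t),
$$

where the last step uses that $g(t) \ge 0$ does not depend on $i$, so the right-hand side is index-free and the mismatch in indices disappears.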
Regarding Other Issues
We thank the reviewer again for providing these constructive suggestions. We will revise our manuscript based on these valuable comments, and we will clearly state that our method can only handle uncertainty sets equipped with efficient Q-function approximators.
I appreciate the authors' timely response. However, the responses do not fully answer my questions.
- Invalid telescoping. Given the above explanations, I'm still skeptical about the validity of the proof. For a concrete example, in the proof of Lemma C.6, the index in the statement of the lemma is a running index. However, the index on line 974 actually refers to the specific one defined above, since in each iteration there is only one index used for the NPG update. Similar issues exist with Lemma C.7. I acknowledge that the issue is (kind of) circumvented on line 1169, but line 1165 still seems problematic to me.
- At the very least, the current system of notation is disastrous and needs to be revised significantly.
- Misuse of Lemma C.4. The explanation regarding the shorthand notation is fair, but I'm really surprised to see a shorthand notation used as the initial distribution (appearing in the subscript), which makes me doubt the stated upper bound. On the other hand, the issue of misuse still exists on lines 1021-1039, where two different measures appear in the same equation for no apparent reason, and I do think Lemma C.4 cannot account for it this time.
Given the number of issues that exist in the paper, I would suggest a major revision of the paper that could potentially appear in other conferences. I decide to keep my rating for now.
We appreciate the reviewer's comment, but we politely disagree with the reviewer's assessment and evaluation of our work.
- Invalid telescoping. We have answered Reviewer ywXF's question, and our answer indicates that this concern stems from the reviewer's misunderstanding and incorrect evaluation of our derivation steps. We follow the standard notation used in the existing literature [ref1], which is clear in context.
[ref1] Xu, Tengyu, Yingbin Liang, and Guanghui Lan. "Crpo: A new approach for safe reinforcement learning with convergence guarantee." International Conference on Machine Learning. PMLR, 2021.
- Misuse of Lemma C.4. We have clearly stated that the shorthand notation is defined in Line 1075, which should not be surprising. Our derivation on lines 1021-1039 directly follows Lemma 6 of the peer-reviewed work [ref1], which has been carefully verified and shown to be correct.
As a result, we hope the reviewer will kindly re-evaluate our work.
This paper first uncovers, via a toy example, that strong duality does not generally hold in robust constrained RL, which indicates that traditional primal-dual methods may fail to find optimal feasible policies. To address this limitation, it proposes a primal-only algorithm called RRPO, which introduces a three-stage policy update: Threshold Updates, Constraint Rectification, and Objective Rectification. A corresponding convergence analysis is also presented in this paper.
Questions For Authors
see Other Strengths And Weaknesses.
Claims And Evidence
No.
- It is not convincing to argue that primal-dual methods will fail in robust constrained RL based on a simple toy example. In fact, even in some constrained RL (non-robust) cases with nonconvex objectives and constraints, strong duality does not hold either, yet primal-dual methods can still obtain reasonable performance. It is therefore hard to say that the primal-only method must surpass primal-dual methods.
- The grid-world and mountain-car control tasks are too trivial. The proposed method needs to be compared with more methods, such as [1,2], on more tasks (e.g., MuJoCo locomotion tasks) to demonstrate its efficiency.
[1] Zhou R, Liu T, Cheng M, et al. Natural actor-critic for robust reinforcement learning with function approximation[J]. Advances in neural information processing systems, 2023, 36: 97-133.
[2] Kumar N, Derman E, Geist M, et al. Policy gradient for rectangular robust Markov decision processes[J]. Advances in Neural Information Processing Systems, 2023, 36: 59477-59501.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes. I have checked the correctness of the provided toy example and related proofs of NPG convergence.
Experimental Design And Analysis
Yes, I have checked the two experiments in this paper, including grid-world and mountain car.
Supplementary Material
Yes. I have reviewed the proof and experimental settings in the supplementary materials.
Relation To Broader Scientific Literature
- The paper uncovers that strong duality does not hold in the robust constrained RL setting.
- The paper proposes a primal-only method to address the limitation and presents a detailed proof.
Essential References Not Discussed
Yes, this paper ignores some literature like [1,2,3] in the field of constrained RL.
[1] Wang Y, Zhan S S, Jiao R, et al. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments[C]//International Conference on Machine Learning. PMLR, 2023: 36593-36604.
[2] Ding S, Wang J, Du Y, et al. Reduced policy optimization for continuous control with hard constraints[J]. Advances in Neural Information Processing Systems, 2023, 36: 38642-38667.
[3] Yang L, Ji J, Dai J, et al. Constrained update projection approach to safe policy optimization[J]. Advances in Neural Information Processing Systems, 2022, 35: 9111-9124.
Other Strengths And Weaknesses
Strengths:
- The theoretical analysis of this paper is complete and detailed.
Weaknesses:
- It seems RRPO does not propose anything particularly novel. The core idea of alternating updates is similar to CRPO, and the technique
- How does RRPO update its value function? Although it may not be the key component of this paper, the reviewer thinks the authors should illustrate it clearly in their pseudocode.
Other Comments Or Suggestions
see Other Strengths And Weaknesses.
We sincerely appreciate the reviewer’s detailed comments and valuable insights. Below, we provide our point-by-point responses.
Claims And Evidence:
- Primal-dual methods will fail in robust constrained RL:
We would like to clarify that we never claimed that primal-dual methods will fail in all cases. Instead, we carefully used the phrase "may fail" to emphasize its potential limitations, motivating the need for a non-primal-dual approach. Specifically, we stated:
"... it indicates that primal-dual methods may fail to find optimal feasible policies in robust constrained settings." (Page 2).
- Trivial experiments:
Our experiments were carefully designed to highlight scenarios where a non-robust algorithm may violate constraints in the worst case. We also appreciate the reviewer's suggestion to include experiments in the MuJoCo environments to demonstrate cases where the robust algorithm outperforms the non-robust one under perturbations. We will incorporate this addition in our revision.
Essential References Not Discussed:
We sincerely appreciate the reviewer bringing these important references to our attention. We will add these missing references to our work.
Other Strengths And Weaknesses:
- RRPO is similar to CRPO: Given that strong duality may not hold in robust constrained RL, it is natural to explore whether existing primal-only algorithms can still be effective. Our contribution is not in proposing an entirely new algorithm but in demonstrating that this classical constrained RL approach can achieve optimal sample complexity without relying on strong duality assumptions, which differs from the existing robust constrained RL literature that additionally assumes strong duality (a schematic of this alternating update is sketched after this list).
- Solving the robust value function: Rather than proposing a specific solution method for the robust value function, we assume it can be computed to the desired accuracy. In Appendix C.3, we describe two valid robust policy evaluation approaches. Additionally, a particular update rule for the p-norm rectangular uncertainty set is given in Eq. (14).
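For concreteness, here is a minimal sketch of the CRPO-style alternating update that the first point refers to, in our own pseudocode with hypothetical names (`estimate_robust_value`, `npg_step`, `limits`, `tol`); the actual RRPO procedure additionally rectifies the constraint thresholds and uses robust value estimates, and is not reproduced here:

```python
def rrpo_style_iteration(policy, estimate_robust_value, limits, tol, npg_step):
    """One alternating update in the spirit of CRPO/RRPO (sketch only).

    estimate_robust_value(policy, i): estimated robust value for index i,
        where i = 0 is the objective and i >= 1 are constraint costs.
    limits: dict mapping constraint index -> threshold; tol: tolerance.
    npg_step(policy, i, sign): one natural policy gradient step on value i.
    """
    violated = [i for i in limits
                if estimate_robust_value(policy, i) > limits[i] + tol]
    if violated:
        # Constraint rectification: descend on one violated constraint cost.
        return npg_step(policy, violated[0], sign=-1)
    # Objective rectification: all constraints are satisfied within the
    # tolerance, so ascend on the objective instead.
    return npg_step(policy, 0, sign=+1)
```

The design choice mirrored here is the primal-only alternation: no dual variables are maintained, and whichever quantity (objective or a violated constraint) is selected at the current iteration receives the policy update.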
This submission aims to demonstrate that strong duality does not hold in robust constrained RL, and proposes a primal-only RL algorithm called RRPO (rather than a primal-dual method). The submission provides theoretical convergence guarantees for RRPO alongside experiments on simple illustrative domains. The reviewers expressed several concerns with the submission in its current state, including issues in the theoretical results and notation, as well as insufficient scale and realism of the empirical domains tested. In general, there was some confusion about what the core contribution of the paper is, given the similarity of the proposed method to CRPO. The authors clarified in the response that their intent was not to propose a new algorithm but rather demonstrate the implications of lack of strong duality (countering the paper "Constrained Reinforcement Learning Has Zero Duality Gap" in this setting). In future versions, in order to better get this across, the submission would benefit from significant reframing, alongside larger-scale experiments on e.g. MuJoCo or other more realistic domains.