PaperHub
Score: 2.5/10 · Rejected · 3 reviewers
Ratings: 2, 1, 2 (min 1, max 2, std 0.5)
ICML 2025

Rectified Robust Policy Optimization for Robust Constrained Reinforcement Learning without Strong Duality

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-06-18
TL;DR

The strong duality for constrained robust RL doesn't hold. We use a primal-only algorithm to address it.

Abstract

Keywords

reinforcement learning · robust reinforcement learning · constrained reinforcement learning

Reviews and Discussion

Review
Rating: 2

This paper studies robust constrained reinforcement learning (RCRL). It first presents a counterexample illustrating that strong duality does not generally hold in RCRL, and then proposes the rectified robust policy optimization algorithm based on a previous algorithm, CRPO. The paper provides the sample complexity of converging to an approximately optimal and safe policy, and the empirical results justify the algorithm.

Questions For Authors

Please see the weaknesses.

Claims And Evidence

The specific challenges of combining robust RL and constrained RL are not elaborated explicitly. Although the paper emphasizes that the introduction of uncertainty causes the strong duality guarantee to fail, rendering the primal-dual methods used in constrained RL ineffective, there is no discussion of the challenges that arise when primal-only methods (such as CRPO) are combined with robust RL.

Methods And Evaluation Criteria

Yes.

Theoretical Claims

Yes. I sketched the proof of the main theorem.

Experimental Design And Analyses

Yes.

Supplementary Material

Yes. I sketched the proof of the main theorem.

Relation To Broader Scientific Literature

The authors list scenarios where machines may experience unforeseen transitions due to equipment aging or failure, and discuss how to ensure robot safety in these worst-case situations. Additionally, the authors could enumerate more application scenarios of Robust Constrained RL (such as autonomous driving, smart healthcare, etc.) to enable readers to fully recognize the practical research value and strong motivation of Robust Constrained RL.

Essential References Not Discussed

This paper has a relatively comprehensive discussion of the relevant literature on robust RL and constrained RL, and also discusses prior work studying strong duality in some robust settings.

Other Strengths And Weaknesses

Weaknesses

  • The algorithm RRPO seems to merely incorporate robust RL methods for value function approximation on top of CRPO. The authors do not elaborate on the challenges of extending CRPO to RRPO, that is, of combining robust RL and constrained RL. Therefore, the technical contribution is somewhat marginal.

  • Assumption 4.2 contradicts the counterexample, where the state $s_1$ is transient and $d(s_1) \to 0$ under any policy. In other words, under Assumption 4.2, could the duality gap be zero, so that primal-dual methods are applicable and enjoy strong guarantees?

  • The theory (Theorem 4.3) does not seem to suggest any relationship between the sample complexity and the model mismatch.

  • In the experiments, the paper only uses a CMDP method as the baseline; it would be more convincing to also consider robust CMDP methods (e.g., Wang et al. 2022 and Ghosh et al. 2024).

Other Comments Or Suggestions

Please see the weaknesses.

Author Response

We appreciate the reviewer's constructive suggestions. Here are our point-by-point responses:

Relation To Broader Scientific Literature:

The reviewer's suggestion on providing more application scenarios is indeed helpful. We will revise the introduction section to include these applications.

Other Strengths And Weaknesses:

  1. RRPO is similar to CRPO: Given that strong duality may not hold in robust constrained RL, it is natural to explore whether existing algorithms in constrained RL can still be effective. Our contribution is not in proposing an entirely new algorithm but in demonstrating that it is possible to apply an existing classical constrained RL approach to achieve optimal sample complexity without relying on strong duality assumptions. This direction is no less challenging than proposing a new algorithm, for the following reasons: (1) much of the constrained RL literature relies on the well-known paper "Constrained Reinforcement Learning Has Zero Duality Gap"; if strong duality does not hold, most of those theoretical analyses become invalid. (2) Even when the convergence analysis does not rely on a zero duality gap, in our empirical examples CRPO still converges to an infeasible result. We successfully address these two challenges and show that it is not necessary to introduce additional complicated structures.

  2. Assumption 4.2 violates the counterexample: We would like to clarify that Assumption 4.2 does not contradict the counterexample (a numerical check is sketched at the end of this response), because

    • The stationary distribution at $s_1$ is not zero unless the policy satisfies $\pi(a_1|s_0) = 1$. In that case ($\pi(a_1|s_0) = 1$), every time the agent arrives at state $s_0$ it is trapped there, which makes the state $s_1$ transient. More explicitly, if we let $\pi_0 := \pi(a_0|s_0)$ and $\pi_1 := \pi(a_1|s_0)$, we can solve for the stationary distribution to check transience: $\mu = \begin{pmatrix} \dfrac{1}{1 + \pi_1(1-p)} \\ \dfrac{\pi_1(1-p)}{1 + \pi_1(1-p)} \end{pmatrix}$.
    • In Assumption 4.2, we use the discounted visitation measure, whose definition is slightly different from that of the stationary distribution. For any ergodic Markov chain, the stationary distribution is independent of the initial distribution, whereas the discounted visitation measure depends on it. Therefore, we comment under Assumption 4.2 that we can use a more uniform initial distribution to obtain a strictly positive $p_{\min}$, which is also valid for our counterexample and does not change the absence of strong duality in robust constrained RL.
  3. In Theorem 4.3, the dependence on the model mismatch is included in the policy evaluation accuracy assumption. Usually, the larger the model mismatch is, the more computation is required to achieve the desired accuracy. This depends on the structure of the uncertainty set and the robust policy evaluation algorithm, which is beyond the scope of our study.

  4. We appreciate the reviewer's valuable suggestions. We will add these baselines in our revision.
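
To make point 2 above concrete, the following is a minimal numerical sketch of the two-state chain. The transition structure, the parameter values p, pi1, and gamma, and the initial distributions are illustrative assumptions inferred from the closed-form stationary distribution quoted above (they are not the paper's exact construction); the sketch simply checks that the stationary distribution matches the closed form and that a more uniform initial distribution yields a strictly positive discounted visitation measure.

```python
import numpy as np

# Assumed dynamics (inferred from the closed-form stationary distribution above):
# from s0, action a1 moves to s1 with probability (1 - p) and stays at s0 otherwise;
# action a0 stays at s0; from s1 the chain returns to s0 deterministically.
p, pi1, gamma = 0.3, 0.6, 0.9   # illustrative values only

# Policy-induced state-to-state transition matrix (rows/cols ordered s0, s1).
P = np.array([
    [1.0 - pi1 * (1.0 - p), pi1 * (1.0 - p)],
    [1.0,                   0.0            ],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()
closed_form = np.array([1.0, pi1 * (1.0 - p)]) / (1.0 + pi1 * (1.0 - p))
print("stationary mu:", mu, " closed form:", closed_form)

# Discounted visitation measure d_rho = (1 - gamma) * rho (I - gamma P)^(-1):
# unlike the stationary distribution, it depends on the initial distribution rho.
for rho in (np.array([1.0, 0.0]), np.array([0.5, 0.5])):
    d = (1.0 - gamma) * rho @ np.linalg.inv(np.eye(2) - gamma * P)
    print("rho =", rho, "-> discounted visitation d =", d)
```

Since $d_\rho \geq (1-\gamma)\rho$ entrywise, a uniform initial distribution keeps both entries of the discounted visitation measure bounded away from zero, consistent with the comment under Assumption 4.2 about obtaining a strictly positive $p_{\min}$.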

Review
Rating: 1

This paper studies efficient algorithms for solving constrained RMDPs. In light of the lack of strong duality for constrained RMDPs, a primal-only algorithm, Rectified Robust Policy Optimization (RRPO), is proposed with theoretical convergence rate guarantees. The performance of the algorithm is further justified by numerical simulations.

Questions For Authors

See above.

Claims And Evidence

The claims are supported by both theoretical and empirical results, though some results seem suspicious.

Methods And Evaluation Criteria

The evaluation method looks reasonable to me.

Theoretical Claims

After a careful check of the proofs (esp. those in Section C), I find two major concerns that may jeopardize the validity of the theoretical results.

  1. The proofs fail to handle the multiple update timescales well.
    • The current analysis looks very similar in flavor to that of standard NPG. However, the NPG update rule at time $t$ with respect to the $i$th value function (Theorem C.5, and where it is cited on line 973 and line 1089) only holds when $V_i^{\pi_t}$ is chosen to be updated in the $t$th round.
    • More specifically, those equations only hold for $t \in \mathcal{N}_i$, where $\mathcal{N}_i := \{ t \mid V_i^{\pi_t} \text{ is sampled to be updated} \}$, but not for any $t \in \mathcal{N}_j$ where $j \neq i$. This makes it hard to do the telescoping sum.
    • I personally expected to see how the authors would handle the interaction of multiple update timescales when I saw the algorithm, but obviously this key technical challenge is "circumvented" using bad notation.
  2. Setting aside the above issue, Lemmas C.6 and C.7 also use Lemma C.4 in an incorrect way.
    • Note that in the statement of Lemma C.4, all initial distributions involved are identically $\mu$. At a very high level, if this lemma is the only tool used, we should not expect different initial distributions in the results.
    • However, this pattern is broken in the proofs of Lemma C.6 (inequality (i) around line 1033) and Lemma C.7 (first inequality around line 1075).
    • It is possible to show a performance difference lemma that involves different initial distributions, but that lemma probably has a different form, which may significantly change the form of the final results. Specifically, I would expect an additional term characterizing the difference between $\mu$ and $\nu$, the two initial distributions.

Experimental Design And Analyses

Since this is mostly a theory-oriented paper, the numerical simulations only act as supporting evidence. For this purpose, the experimental design and results look good to me.

Supplementary Material

See the "Theoretical Claims" section.

Relation To Broader Scientific Literature

N/A

Essential References Not Discussed

N/A

Other Strengths And Weaknesses

Strengths:

  1. I appreciate Appendix A.2, which compares this paper against a few very recent works and justifies its novelty.

Weaknesses:

  1. The writing of the paper can be further improved.
    • Algorithm 1 is written in a very ambiguous way. Variables are sometimes marked with time $t$ (e.g., $d_0^{t+1}$), but usually not. The parameter $\theta_t$ is used before it is explained (in a later sentence in Section 4.2). Some variables are never used (e.g., $\mathcal{N}_0$).
    • In the last section on numerical examples, most of the effort is spent explaining the experimental setting, leaving only a couple of sentences on the observations and their implications.
  2. Some assumptions are hidden deeply in the proof details.
    • It seems from the problem formulation that the algorithm can handle general uncertainty sets, but in fact it only works with those uncertainty sets equipped with efficient Q-function approximators.
    • In Lemma C.8, $V_i^{\pi}(\nu^*)$ is assumed to be Lipschitz without proof.
  3. Proofs may be invalid due to the issues mentioned above.

Other Comments Or Suggestions

  1. There are a few typos and notational inconsistencies.
    • In equations (6) and (7): the robust value function is denoted by $\widetilde{V}$ instead of $V$.
    • On page 14, step 3 (below line 750): the partial derivative should be $\frac{\partial \mathcal{L}}{\partial \pi_1}$.
    • In the appendix: $d_{KL}$ and $D_{KL}$ are used interchangeably.

Ethics Review

N/A

Author Response

Regarding Major Issues

We deeply appreciate the reviewer's careful reading. We fully understand that these two concerns are critical, and we believe they are caused by misunderstandings stemming from our unclear presentation. Here we provide additional discussion to clarify these points and explain why the results are correct:

  1. Invalid telescoping: It is correct that all inequalities (Lemma C.6 and Lemma C.7) only hold when $t \in \mathcal{N}_i := \{ t : V_i^{\pi_t} \text{ is sampled to be updated} \}$. However, we would like to highlight that we can always sum over $t$ by carefully handling the mismatch in the indices. To make this clear, we use $i_t$ to indicate that at the $t$-th step, $V_{i_t}^{\pi_t}$ is sampled. Then, summing $V_{i_t}^{\pi^*}(\mu) - V_{i_t}^{\pi_{t+1}}(\mu) \leq a[d(\pi^*, \pi_t) - d(\pi^*, \pi_{t+1})] + b$ over $t$ (as an example) yields $\sum_{t=1}^T \left[ V_{i_t}^{\pi^*}(\mu) - V_{i_t}^{\pi_{t+1}}(\mu) \right] \leq a [d(\pi^*, \pi_1) - d(\pi^*, \pi_T)] + Tb$. Since the right-hand side does not depend on the index $i$, this summation is always valid. We spent considerable effort to make this work: (1) we use the Lipschitzness of $V_i^{\pi}$ in $\pi$ to remove the dependence on $i$ in Line 1165, (2) we replace $\hat{Q}_i^{\pi_t} - Q_i^{\pi_t}$ with the policy evaluation error (Assumption 4.1), and (3) we follow the standard NPG analysis to decompose the KL divergence and construct the desired telescoping structure $a[d(\pi^*, \pi_1) - d(\pi^*, \pi_T)]$ (Lemma C.7).

    When handling the left-hand side, we set $\delta$ to be "not too small" in Lemma C.8. Under this setting, we ensure that the set $\mathcal{N}_0$ is always non-empty; that is, over the whole training process $t = 1, 2, \dots, T$, there always exists $t$ such that the index $0$ is sampled. As a result, in Line 1169, we can keep the term $\sum_{t \in \mathcal{N}_0}$ and lower bound the remaining terms $\sum_{t \notin \mathcal{N}_0}$ using a trivial bound obtained from Lines 1157-1166.

  2. Misuse of Lemma C.4 in Lemma C.6 and Lemma C.7: We would like to clarify that we are using Lemma C.4 correctly. It does not require the initial distribution to be fixed as $\mu$; instead, it is an inequality that depends on the initial distribution. When we change the initial distribution from $\mu$ to another distribution $\nu$, the inequality still reads $a \, E_{d_\nu \otimes \pi'} [A^\pi(s,a)] \leq V^{\pi'}(\nu) - V^{\pi}(\nu) \leq b \, E_{d_\nu \otimes \pi'} [A^\pi(s,a)]$. The differences are: (1) the visitation measure changes from $d_\mu$ to $d_\nu$, and (2) the robust value function changes from $V(\mu)$ to $V(\nu)$. Accordingly, we use Lemma C.4 in Lemma C.6 and Lemma C.7 as follows:

    • In Lemma C.6, we use Lemma C.4 in Lines 957-958, which faithfully follows the structure of Lemma C.4. In Line 1033 mentioned by the reviewer, we instead use the result $d_\nu = \frac{d_\nu}{\nu} \nu \leq (1-\gamma) \nu$, which applies a change-of-measure trick to pass from the measure $d_\nu$ to the measure $\nu$. Lemma C.4 is not involved here.
    • In Lemma C.7, we use Lemma C.4 in Line 1075. Here, we have defined the shorthand notation $\nu^*$ to represent $d_\mu^{\pi^*, P^*}$ in Line 1073. Since the subscript $\mu$ is aligned with the argument in $V^\pi(\mu)$ on the left-hand side, our application of Lemma C.4 here is still correct.

Regarding Other Issues

We thank the reviewer again for these constructive suggestions. We will revise our manuscript based on these valuable comments, and we will clearly state that our method only handles uncertainty sets equipped with efficient Q-function approximators.

Reviewer Comment

I appreciate the authors' timely response. However, the responses do not fully answer my questions.

  1. Invalid telescoping. Given the above explanations, I am still skeptical about the validity of the proof. As a concrete example, in the proof of Lemma C.6, the index $i$ in the statement of the lemma is a running index, $\forall i \in [I]$. However, the $i$ on line 974 actually means the $i_t$ defined above, since in the $t$th iteration there is only one $Q_i$ used for the NPG update. Similar issues exist with Lemma C.7. I acknowledge that the issue is (kind of) circumvented on line 1169, but line 1165 still seems problematic to me.
    • At the very least, the current system of notation is disastrous and needs to be revised significantly.
  2. Misuse of Lemma C.4. The explanation regarding the shorthand notation $\nu^*$ is fair, but I am really surprised to see a shorthand notation used as the initial distribution (appearing in the subscript as $d_{\nu^*}^{\pi}$), which makes me doubt the upper bound on $C_{\textrm{approx}}$. On the other hand, the issue of misuse still exists on lines 1021-1039, where $\mu$ and $\nu$ appear in the same equation for no reason, and I do think Lemma C.4 cannot account for it this time.

Given the number of issues that exist in the paper, I would suggest a major revision of the paper that could potentially appear in other conferences. I decide to keep my rating for now.

Author Comment

We appreciate the reviewer's comment, but we respectfully disagree with the reviewer's evaluation of our work.

  1. Invalid telescoping. We have answered Reviewer ywXF's question above, and it indicates that the concern arises from the reviewer's misunderstanding and incorrect evaluation of our derivation steps. We follow the standard notation used in the existing literature [ref1], which is clear in context.

[ref1] Xu, Tengyu, Yingbin Liang, and Guanghui Lan. "Crpo: A new approach for safe reinforcement learning with convergence guarantee." International Conference on Machine Learning. PMLR, 2021.

  2. Misuse of Lemma C.4. We have clearly stated that the shorthand notation $\nu^*$ is defined in Line 1075, which should not be surprising. Our derivation on lines 1021-1039 directly follows Lemma 6 of the peer-reviewed literature [ref1], which has been carefully verified and shown to be correct.


As a result, we hope the reviewer will kindly re-evaluate our work.

Review
Rating: 2

This paper first uncovers, via a toy example, that strong duality does not generally hold in robust constrained RL, which indicates that traditional primal-dual methods may fail to find optimal feasible policies. To address this limitation, it proposes a primal-only algorithm called RRPO, which introduces a three-stage policy update: threshold updates, constraint rectification, and objective rectification. The corresponding convergence analysis is also presented in this paper.
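
For readers unfamiliar with this style of primal-only update, the following is a schematic sketch of the alternating logic described above (threshold updates, constraint rectification, objective rectification). All names, the violation-selection rule, and the update step are illustrative assumptions based on this summary and the CRPO-style alternation discussed elsewhere on this page; this is not the authors' actual algorithm.

```python
from typing import Callable, Sequence

def rrpo_sketch(
    npg_step: Callable[[object, Callable[[object], float], bool], object],  # one (robust) NPG step; bool = maximize?
    objective: Callable[[object], float],               # robust reward value V_0(pi)
    constraints: Sequence[Callable[[object], float]],   # robust cost values V_i(pi), i = 1..I
    limits: Sequence[float],                             # constraint limits d_i (feasibility: V_i(pi) <= d_i)
    delta: float,                                         # rectification tolerance ("threshold update")
    policy,                                               # initial policy
    num_iters: int = 100,
):
    """Schematic CRPO-style alternation: if some constraint value exceeds its
    rectified threshold d_i + delta, take a policy step that decreases that
    constraint; otherwise take a step that increases the objective."""
    for _ in range(num_iters):
        violated = [i for i, c in enumerate(constraints) if c(policy) > limits[i] + delta]
        if violated:
            i = violated[0]                                   # selection rule is a placeholder
            policy = npg_step(policy, constraints[i], False)  # constraint rectification: minimize V_i
        else:
            policy = npg_step(policy, objective, True)        # objective rectification: maximize V_0
    return policy
```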

Questions For Authors

See Other Strengths And Weaknesses.

Claims And Evidence

No.

  1. It is not convincing to argue that primal-dual methods will fail in robust constrained RL based only on a simple toy example. In fact, even in some constrained RL (non-robust) settings with nonconvex objectives and constraints, strong duality does not hold either, yet primal-dual methods can still achieve reasonable performance. Hence it is hard to claim that a primal-only method must surpass primal-dual methods.

  2. The grid-world and mountain-car control tasks are too simple. The proposed method needs to be compared with more methods, such as [1, 2], on more tasks (e.g., MuJoCo locomotion tasks) to demonstrate its efficiency.

[1] Zhou R, Liu T, Cheng M, et al. Natural actor-critic for robust reinforcement learning with function approximation[J]. Advances in Neural Information Processing Systems, 2023, 36: 97-133.

[2] Kumar N, Derman E, Geist M, et al. Policy gradient for rectangular robust markov decision processes[J]. Advances in Neural Information Processing Systems, 2023, 36: 59477-59501.

Methods And Evaluation Criteria

Yes.

Theoretical Claims

Yes. I have checked the correctness of the provided toy example and related proofs of NPG convergence.

Experimental Design And Analyses

Yes, I have checked the two experiments in this paper, including grid-world and mountain car.

Supplementary Material

Yes. I have reviewed the proof and experimental settings in the supplementary materials.

Relation To Broader Scientific Literature

  1. The paper uncovers that the strong duality does not hold in the robust constrained RL setting.

  2. The paper proposes a primal-only method to address the limitation and presents detailed proof.

Essential References Not Discussed

Yes, this paper ignores some literature like [1,2,3] in the field of constrained RL.

[1] Wang Y, Zhan S S, Jiao R, et al. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments[C]//International Conference on Machine Learning. PMLR, 2023: 36593-36604.

[2] Ding S, Wang J, Du Y, et al. Reduced policy optimization for continuous control with hard constraints[J]. Advances in Neural Information Processing Systems, 2023, 36: 38642-38667.

[3] Yang L, Ji J, Dai J, et al. Constrained update projection approach to safe policy optimization[J]. Advances in Neural Information Processing Systems, 2022, 35: 9111-9124.

Other Strengths And Weaknesses

Strengths:

  1. The theoretical analysis of this paper is complete and detailed.

Weaknesses:

  1. It seems RRPO does not propose anything particularly novel. The core idea of alternating updates is similar to CRPO, and the technique

  2. How does RRPO update its value function? Although it may not be the key component of this paper, the reviewer thinks the authors should illustrate it clearly in their pseudocode.

Other Comments Or Suggestions

See Other Strengths And Weaknesses.

Author Response

We sincerely appreciate the reviewer’s detailed comments and valuable insights. Below, we provide our point-by-point responses.

Claims And Evidence:

  • Primal-dual methods will fail in robust constrained RL:

    We would like to clarify that we never claimed that primal-dual methods will fail in all cases. Instead, we carefully used the phrase "may fail" to emphasize its potential limitations, motivating the need for a non-primal-dual approach. Specifically, we stated:

    "... it indicates that primal-dual methods may fail to find optimal feasible policies in robust constrained settings." (Page 2).

  • Trivial experiments:

    Our experiments were carefully designed to highlight scenarios where a non-robust algorithm may violate constraints in the worst case. We also appreciate the reviewer's suggestion to include experiments in the MuJoCo environment to demonstrate cases where the robust algorithm outperforms the non-robust one under perturbations. We will incorporate this addition in our revision.

Essential References Not Discussed:

We sincerely appreciate the reviewer for bringing these important references to our attention. We will add these missing references into our work.

Other Strengths And Weaknesses:

  • RRPO is similar to CRPO: Given that strong duality may not hold in robust constrained RL, it is natural to explore whether existing primal-only algorithms can still be effective. Our contribution is not in proposing an entirely new algorithm but in demonstrating that this classical constrained RL approach can achieve optimal sample complexity without relying on strong duality assumptions, which differs from the existing robust constrained RL literature that additionally assumes strong duality.
  • Solving the robust value function: Rather than proposing a specific solution method for the robust value function, we assume it can be computed to the desired accuracy. In Appendix C.3, we describe two valid robust policy evaluation approaches. Additionally, a particular update rule for the p-norm $(s,a)$-rectangular set is given in Eq. (14). (A generic illustration of robust policy evaluation is sketched below.)
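
To illustrate what "robust policy evaluation with the desired accuracy" means in practice, here is a generic sketch that evaluates the worst-case value of a fixed policy over a finite set of candidate transition kernels. The finite uncertainty set, the function name, and the usage example are assumptions made purely for illustration; this is not the paper's p-norm $(s,a)$-rectangular update rule in Eq. (14), whose inner minimization has its own form.

```python
import numpy as np

def robust_policy_evaluation(P_candidates, r, pi, gamma=0.9, n_iters=500):
    """Worst-case policy evaluation over a finite, (s,a)-rectangular uncertainty
    set given as K candidate kernels (an illustrative assumption, not Eq. (14)).

    P_candidates: array (K, S, A, S), candidate kernels P_k(s'|s,a)
    r:            array (S, A), reward table
    pi:           array (S, A), policy pi(a|s)
    Returns the robust value V^pi as an array of shape (S,).
    For cost/constraint signals, replace the min below with a max.
    """
    V = np.zeros(P_candidates.shape[1])
    for _ in range(n_iters):
        next_vals = np.einsum("ksat,t->ksa", P_candidates, V)  # E_{P_k}[V(s') | s, a]
        worst = next_vals.min(axis=0)                           # adversarial kernel per (s, a)
        Q = r + gamma * worst                                   # robust Bellman backup
        V = (pi * Q).sum(axis=1)                                # value under the policy
    return V

# Tiny usage example with two candidate kernels on a 2-state, 2-action MDP.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(2), size=(2, 2, 2))   # shape (K=2, S=2, A=2, S=2)
    r = rng.random((2, 2))
    pi = np.full((2, 2), 0.5)
    print(robust_policy_evaluation(P, r, pi))
```
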
Final Decision

This submission aims to demonstrate that strong duality does not hold in robust constrained RL, and proposes a primal-only RL algorithm called RRPO (rather than a primal-dual method). The submission provides theoretical convergence guarantees for RRPO alongside experiments on simple illustrative domains. The reviewers expressed several concerns with the submission in its current state, including issues in the theoretical results and notation, as well as insufficient scale and realism of the empirical domains tested. In general, there was some confusion about what the core contribution of the paper is, given the similarity of the proposed method to CRPO. The authors clarified in the response that their intent was not to propose a new algorithm but rather demonstrate the implications of lack of strong duality (countering the paper "Constrained Reinforcement Learning Has Zero Duality Gap" in this setting). In future versions, in order to better get this across, the submission would benefit from significant reframing, alongside larger-scale experiments on e.g. MuJoCo or other more realistic domains.