PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.0
Novelty: 2.5
Quality: 3.0
Clarity: 3.0
Significance: 2.8
NeurIPS 2025

Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

Global Convergence with Order-Optimal rate for Average Reward Constrained MDPs with Primal-Dual Natural Actor Critic Algorithm

Abstract

This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) under general parametrized policies with smooth and bounded policy gradients. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}\left(\tau_{\mathrm{mix}}^{2/\epsilon}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
Keywords
Reinforcement Learning · Actor-Critic Methods · Constrained Markov Decision Processes · Policy Gradient · Global Convergence

Reviews and Discussion

Review
Rating: 4

This paper studies infinite-horizon average reward constrained MDPs with general parametrization. It proposes a primal-dual natural actor-critic approach that achieves global convergence and constraint violation of $\tilde{O}(1/\sqrt{T})$ if the mixing time is known, and $\tilde{O}(1/T^{0.5-\epsilon})$ if the mixing time is unknown, improving the state of the art for general policy parametrization in both cases.

Strengths and Weaknesses

Strengths:

  • The paper is well-written, and the problem is well-motivated
  • This result improves upon previous bounds in the same setting and is the first to propose and analyze an approach for the infinite-horizon average-reward C-MDP problem when the mixing time is unknown.
  • The paper does a good job on discussing the difficulties of applying a primal-dual actor-critic approach in a setting with constraints, in particular the difficulties in estimating the critic parameters.

Weaknesses:

  • The final regret and constraint bounds depend on $\epsilon_{\text{bias}}$, which reflects the expressiveness of the policy class $\pi_\theta$, and on $\epsilon_{\text{app}}$, which captures the quality of the feature map. Both terms contribute linearly in $T$ to the bounds due to this dependency. While the authors argue that $\epsilon_{\text{bias}}$ can be made small with sufficiently expressive neural networks, the assumption that $\epsilon_{\text{app}}$ is negligible appears less justified. It would be helpful if the authors explicitly noted in Table 1 or in the main text that the quality of the regret bounds has this dependency, which may be linear in $T$ if the feature map is not good; otherwise the claims are a little misleading. This raises two important questions that I think should be addressed in the paper:

    • How realistic is it to obtain a good feature map in practice?
    • If the linear value function approximation is relaxed, can similar bounds still be obtained?
  • Another claim of the paper is that one of the proposed approaches does not require prior knowledge of the mixing time and is the first to address this challenge in the context of infinite-horizon average-reward C-MDPs. This is achieved through the use of Multi-Level Monte Carlo (MLMC). However, the analysis closely resembles that of [31] and Patel et al. (2024), yet neither is acknowledged. This omission gives the impression that the authors are the first to introduce these ideas for handling unknown mixing times. Unless I am mistaken, I strongly recommend that the authors cite these related works when discussing MLMC and clearly explain how their analysis is extended or adapted to address the constrained setting.

Patel et al., Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles, 2024

Questions

Minor issues and clarifications:

  • In equation (13), why does $\nu_g^{\pi_{\theta_k}}$ depend on $g$? Isn't this just the state-action distribution under policy $\pi_{\theta_k}$?
  • Page 5, line 158: "feaure" → "feature"
  • Page 7, line 223, item (b): $\pi_\theta$ → $\pi_{\theta_1}$

Limitations

yes.

Final Justification

I decided to maintain my positive score, as the authors have addressed all my concerns regarding the dependence of the regret bound on $\epsilon_{\text{bias}}$ and $\epsilon_{\text{app}}$, and I hope they will consider including these dependencies in Table 1. They also responded to my comments on related work, and I trust they will incorporate this discussion to help clarify the comparison for future readers.

Formatting Concerns

The paper does not seem to have a formatting issue.

Author Response

Dependence of the bounds on $\epsilon_{\mathrm{bias}}$ and $\epsilon_{\mathrm{app}}$: Thank you for pointing this out. We will mention the dependence of the convergence rates on $\epsilon_{\mathrm{bias}}$ and $\epsilon_{\mathrm{app}}$ in the main text as well as in Table 1 in the revised manuscript.

On good feature maps: We would like to emphasize that the concept of linear function approximation is not only a common assumption for the actor-critic algorithms, but it is also central to the rich literature of linear MDPs [1]. Although there is no consensus, to the best of our knowledge, on how to build feature maps for a generic task, several task-dependent features have been reported in the literature [2], [3] that perform well in experiments. Hence, it is quite practical to assume that the learner has access to a good feature map.

On nonlinear value function approximation: The benefit of linear critic approximation is that both the NPG and the critic updates can be described as stochastic linear recursions (SLRs). This allows us to utilize Theorem B.1, which is crucial in obtaining the optimal convergence rates. Extending these results to non-linear critic approximation is extremely challenging since the convergence analysis of SLRs can no longer be applied. Very recently, [4] has provided a regret analysis with non-linear critic approximation in the unconstrained discounted reward setup. However, they introduce a data-drop technique and do not use the MLMC-based estimation, which is central to our work. Whether their approach can be extended to the constrained average reward scenario is currently an open question.
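For readers unfamiliar with the term, the generic template below illustrates what we mean by a stochastic linear recursion; it is only a schematic (the exact matrices and step sizes are those defined in the paper, not reproduced here), of which TD-style critic updates with a linear feature map are a special case:

x_{t+1} = x_t + \alpha_t \left( b_t - A_t x_t \right), \qquad \mathbb{E}[A_t] = \bar{A}, \quad \mathbb{E}[b_t] = \bar{b}, \quad x^{\star} = \bar{A}^{-1}\bar{b},

where $A_t$ and $b_t$ are noisy samples generated from Markovian data and $x^{\star}$ is the target fixed point. Convergence results for such recursions (e.g., Theorem B.1 mentioned above) hinge on the update being linear in $x_t$, which is precisely what is lost with a non-linear critic.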

Relation to [31] and (Patel et al.): We agree that there are some similarities between our algorithm and those proposed in the cited papers. For example, MLMC-based estimation and linear critic approximation are central to all of these papers. Despite these apparent similarities, there are some crucial differences. First, both of the cited papers analyze unconstrained setups, liberating them from the nuances of dual projection and its parameter optimization. Second, unlike our paper, [31] merely guarantees convergence to a first-order stationary point. Although (Patel et al.) guarantees global convergence, their established rate $\tilde{\mathcal{O}}(T^{-1/4})$ is nowhere near the theoretical lower bound $\Omega(T^{-1/2})$. Third, the set of assumptions used in [31] is different from ours. For example, Assumption 4.2(3) used in [31] is not needed in our analysis. Finally, the analysis of stochastic linear recursions (SLRs), which is central to our paper (Theorem B.1), is present in neither of the cited papers. Hence, although we are not the first to remove dependence on the knowledge of $\tau_{\mathrm{mix}}$ using MLMC (though we are the first to do so in the CMDP setup), our analysis is not an adapted version of that presented in either of the cited papers. We will include a summary of the above comparison in the revised manuscript.

Typos: Thank you for pointing out the typos. We will correct those in the revised manuscript.

References:

[1] ''Provably efficient reinforcement learning with linear function approximation'', COLT 2020.

[2] ''Making Linear MDPs Practical via Contrastive Representation Learning'', ICML 2022.

[3] ''Achieving sub-linear regret in infinite horizon average reward constrained MDP with linear function approximation'', ICLR 2023.

[4] ''Order-optimal global convergence for actor-critic with general policy and neural critic parametrization'', UAI 2025.

Comment

I would like to thank the authors for their responses. I maintain my positive score about the paper.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Review
Rating: 4

This paper studies infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with a smooth policy with bounded gradients. It proposes Primal-Dual Natural Actor-Critic with a global convergence rate of $1/\sqrt{T}$, which significantly improves the existing convergence rates for the considered setting and matches the lower bound.

Strengths and Weaknesses

Strength

The paper provides a solid theoretical result in safe RL with flexible policy parameterization. The proposed algorithm has a significantly better global convergence rate compared to existing results in terms of the number of steps $T$.

Weakness

It is not clear what the technical novelty is beyond tuning the problem parameters and a sharper analysis. For example, it is written that "Achieving this challenging goal is made possible due to the use of Multi-level Monte Carlo (MLMC)-based estimates in the inner loop subroutines and a sharper analysis of these update", but is there any technical novelty or challenge in the sharper analysis? The current techniques seem to be mainly borrowed from Ganesh et al. 2024.

Furthermore, although the convergence rate is much better than previous results, it is still unclear how close the obtained results are to a lower bound.

Because of these points, the significance of the paper is difficult to evaluate for me.

Questions

It is not clear what the technical novelty is beyond tuning the problem parameters and a sharper analysis. Would you elaborate on it?

Limitations

It is nice to discuss the constraint violation of the proposed method. Currently, the proposed method violates the constraint, and it is not practical to directly use this method for safety critical applications.

Also, by "general" policies, the paper refers to a policy with smooth and bounded gradients. It would be nice to be more specific and call the considered policy class as smooth policies with bounded gradients in the title and abstract.

Final Justification

While the paper provides a theoretically solid analysis and algorithm, the analysis is not fully convincing to me in that it uses a definition of constraint violation that is not really appropriate for safety-critical situations. Therefore, I keep my score of 4.

Formatting Concerns

No.

Author Response

Technical novelty: Our primary novelty lies in achieving an optimal-order convergence rate for average reward Constrained Markov Decision Processes (CMDPs) with general policy parameterization, which was previously unresolved in the literature.

We indeed leverage the natural actor-critic framework in conjunction with the MLMC-based estimation technique, which recently has obtained an order-optimal convergence rate in the unconstrained setup [1]. However, directly applying these techniques to constrained MDPs by integrating a primal-dual approach does not yield the same optimal convergence rate. The key challenge arises from introducing the dual variable, which significantly complicates the convergence analysis. Specifically, as highlighted in Eq. (79, 82) in our paper, the dual learning rate $\beta$ appears in both the numerator and denominator of the convergence bounds. Therefore, merely following the unconstrained setting by choosing $\beta$ inversely proportional to $T$ fails to replicate the optimal convergence rate achieved in the unconstrained case.

This fundamental challenge has also been observed in the literature. For example, [2] addresses the unconstrained average reward RL and achieves a convergence rate of $\tilde{O}(T^{-1/4})$, while [3] deals with the constrained counterpart, achieving a slower rate of $\tilde{O}(T^{-1/5})$.

Our work addresses these critical challenges through a sharper and more refined analytical approach. We specifically modify both the primal and dual learning rate to scale inversely with the square root of $T$, a significant departure from the constant primal learning rate used in the unconstrained analyses ([1] and [2]). Additionally, we introduce a refined error bound for the Lagrangian (last expression on Page 30) and strategically set the inner-loop complexity to be nearly logarithmic in $T$. These innovative technical modifications enable us to overcome the identified challenges and achieve the first optimal-order convergence rate for the average reward constrained RL setting.
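To make the described tension concrete, consider the following schematic caricature with dual step size $\beta$ (this is not the actual expression from Eqs. (79) and (82); $C_1, C_2$ are placeholder constants):

\text{error}(T) \;\lesssim\; \frac{C_1}{\beta T} + C_2\,\beta + (\text{terms independent of } \beta).

Taking $\beta \propto 1/T$, as one would in the unconstrained case, leaves the first term of constant order, while balancing the two $\beta$-dependent terms gives $\beta \propto 1/\sqrt{T}$ and an overall $\mathcal{O}(1/\sqrt{T})$ contribution, which is the tuning adopted above.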

Relation to lower bound: As mentioned in Table 1 in the paper, the theoretical lower bound on the global convergence rate is $\Omega(T^{-1/2})$, where $T$ indicates the horizon length. Theorems 4.9 and 4.10 in our paper establish convergence rates of $\tilde{\mathcal{O}}(T^{-1/2})$ and $\tilde{\mathcal{O}}(T^{-1/2+\epsilon})$ for arbitrarily small $\epsilon$. Therefore, not only do our results vastly improve upon the state-of-the-art convergence rates, but they nearly match the theoretical lower bound as well.

On constraint violation: Our algorithm can achieve a small average constraint violation, provided that the horizon length $T$ is sufficiently large and the errors due to policy and critic approximations are negligible. That being said, we agree with the reviewer that in some safety-critical applications, even a small constraint violation is undesirable. To force the constraint violation to zero, we can resort to a standard ``trick'' commonly used in constrained RL literature. The main idea is to consider a stringent constraint of the form $J_c^{\pi} \geq \eta > 0$ in the algorithm, instead of the formulated constraint $J_c^{\pi} \geq 0$. Now, the hyperparameter $\eta$ can be appropriately chosen so that our desired constraint violation is forced to zero. Please see Section H.5 of [4] for a detailed exposition. Similar ideas are also discussed in [5]. Hence, one can extend a small constraint violation result to a zero one with some additional assumption, and we leave it as future work.
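For intuition, a sketch of how the tightening argument typically works, with a generic constant $C$ standing in for the actual problem-dependent constants and ignoring the Slater feasibility requirement $\eta \le \delta$ and the resulting $\mathcal{O}(\eta)$ loss in the optimality gap: if the algorithm applied to the tightened constraint guarantees

\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\bigl[\eta - J_c(\theta_k)\bigr] \;\le\; \frac{C}{\sqrt{T}},
\qquad\text{then}\qquad
-\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\bigl[J_c(\theta_k)\bigr] \;\le\; \frac{C}{\sqrt{T}} - \eta \;\le\; 0
\quad\text{whenever } \eta \ge \frac{C}{\sqrt{T}},

i.e., the violation of the original constraint $J_c^{\pi} \geq 0$ is driven to zero.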

On general policies: The class of policies considered in the paper is called ``general'' because it contains various commonly used policy classes such as tabular, softmax, log-linear, neural, etc., as special cases. This terminology is common across the RL literature [6], [7]. We believe that renaming the general policy class as policies with smooth and bounded gradients would de-emphasize the generic applicability of our results. Therefore, instead of renaming it in the title, we will mention the smoothness and boundedness properties of the policy gradients in the abstract and introduction of the revised paper.

References:

[1] "A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach", ICML 2025.

[2] "Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes", AAAI 2024.

[3] "Learning general parameterized policies for infinite horizon average reward constrained MDPs via primal-dual policy gradient algorithm", NeurIPS 2024.

[4] ''Achieving sub-linear regret in infinite horizon average reward constrained MDP with linear function approximation'', ICLR 2023.

[5] ''Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach'', AAAI 2022.

[6] ''Sample-Efficient Constrained Reinforcement Learning with General Parameterization'', NeurIPS 2024.

[7] ''A novel framework for policy mirror descent with general parameterization and linear convergence'', NeurIPS 2023.

Comment

Thank you very much for the response.

Technical novelty

Understood. My concerns regarding the technical challenges are now resolved. Is there a similar discussion on technical challenges included in the paper? If not, I highly recommend that the authors incorporate one, as it would help clarify the contributions.

Relation to lower bound

I should have been more specific about what I meant by the lower bound. I was referring to a lower bound that directly corresponds to the considered setting, with explicit dependence on constants such as $G_1$. Since the lower bound in the paper is derived for average-reward MDPs with a tabular policy parameterization, its connection to such constants remains unclear. For example, in Eq. (76), $G_1^6$ appears to be very large to me.

On constraint violation

I am sorry. This was also my fault for not clearly stating what I meant by constraint violation. The paper defines it as

- \frac{1}{K} \sum_{k=0}^K \mathbb{E} \left[ J_c (\theta_k) \right].

However, I see two concerns with this definition in practical settings:

  1. First, it is possible that, for example, $-\mathbb{E}\left[J_c(\theta_k)\right] = 0.5$ for odd $k$ and $-\mathbb{E}\left[J_c(\theta_k)\right] = -0.5$ for even $k$. In this case, the average constraint violation converges to zero, but such oscillatory behavior would likely be undesirable in practice.

  2. Second, it seems possible that

- \sum_{k=0}^K \mathbb{E} \left[ c (S_k, A_k) \right] = K,

which is used in other papers such as [12, 34] and makes more sense to me, while

- \mathbb{E} \left[ J_c (\theta_k) \right] = 0

for all $k$. This discrepancy raises concerns about whether the definition used here accurately captures meaningful constraint violation.

To illustrate this, consider an environment with four states $s^1, s^2, s^3, s^4$ and two actions: $+1$ and $-1$. Taking action $a$ in state $s^i$ transitions the agent to state $s^{\min(\max(i+a,0),4)}$. The cost is $1$ for $s^1$ and $s^4$, and $-1$ for $s^2$ and $s^3$. Suppose the initial state is $s^2$, and an algorithm alternates between two policy parameters: $\theta$, which always chooses $+1$, and $\theta'$, which always chooses $-1$, depending on whether the timestep is even or odd. The resulting trajectory oscillates between $s^2$ and $s^3$, leading to

- \sum_{k=0}^K \mathbb{E} \left[ c (S_k, A_k) \right] = K.

However, $J_c(\theta) = J_c(\theta') = 0$ for all $k$.

On general policies

Understood. My comment was only a suggestion. I believe being as concrete as possible can be helpful for readers, even if prior works use ambiguous terms.

Comment

Thank you for letting us know your concerns and for the detailed reading of the manuscript.

On lower bound: We are unaware of any theoretical lower bound that directly corresponds to our general parameterized setting and captures the dependence of various parameters, such as G1G_1. We have reported a lower bound in Table 1 that dictates the dependence on TT. This result emerges from analyzing a tabular setup. Since tabular policies are special cases of general parameterized policies, the stated result also serves as a lower bound for our case (as far as TT is concerned).

On constraint violation: The intuition behind the definition of constraint violation, as stated in our paper, is as follows. Suppose that an algorithm generates a sequence of policy parameters $\{\theta_1, \cdots, \theta_K\}$ during the *training phase*. Now, if one of these policy parameters is chosen at random and deployed in the *evaluation phase*, what is the constraint violation one is expected to see? Note that each policy parameter has an equal chance $1/K$ of being selected. Moreover, if the parameter $\theta_i$ is chosen, then the expected constraint violation is $-\mathbf{E}[J_c(\theta_i)]$. Therefore, the expected constraint violation during the evaluation phase is:

-\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}[J_c(\theta_k)]

The above definition of constraint violation is also used in many prior works [1, 2]. In contrast, if one is interested in the constraint violation during the training phase itself, then the expression of expected constraint violation should be of the following form, as stated by the reviewer.

-\dfrac{1}{K}\sum_{k=1}^K \mathbf{E}[c(s_k, a_k)]

In summary, the two definitions of expected constraint violations have different purposes and application scenarios. In some RL applications, a reliable simulator is available for training. In such cases, the algorithm's performance during training is not of much importance, although the performance of the deployed policy during the evaluation phase is. However, if a simulator is unavailable or the available simulators are not reliable, then in-training performance guarantee takes centre stage. We hope this clarifies the confusion regarding the definition of constraint violation.
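To make the distinction tangible, below is a minimal numerical sketch of the two quantities on a small synthetic CMDP. This is only an illustration, not part of the paper or its implementation: the transition kernel `P`, the cost `c`, and the random policy sequence are placeholders standing in for the environment and the iterates $\theta_1, \cdots, \theta_K$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic MDP: S states, A actions, random transitions and costs in [-1, 1].
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
c = rng.uniform(-1.0, 1.0, size=(S, A))      # per-step cost

def stationary_cost(pi):
    """Average cost J_c(pi) under the stationary distribution of policy pi (pi[s] is a distribution over actions)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)    # state transition matrix induced by pi
    evals, evecs = np.linalg.eig(P_pi.T)     # stationary distribution = left eigenvector for eigenvalue 1
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = mu / mu.sum()
    return float(np.einsum('s,sa,sa->', mu, pi, c))

# A sequence of K random policies standing in for the iterates theta_1, ..., theta_K.
K = 200
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(K)]

# (a) Evaluation-phase criterion: -(1/K) * sum_k J_c(theta_k).
eval_violation = -np.mean([stationary_cost(pi) for pi in policies])

# (b) Training-phase criterion: -(1/K) * sum_k c(s_k, a_k) along a single trajectory
#     in which the k-th policy is deployed at step k.
s = 0
train_costs = []
for pi in policies:
    a = rng.choice(A, p=pi[s])
    train_costs.append(c[s, a])
    s = rng.choice(S, p=P[s, a])
train_violation = -np.mean(train_costs)

print(f"evaluation-phase violation: {eval_violation:+.3f}")
print(f"training-phase violation:   {train_violation:+.3f}")
```

The reviewer's oscillation example corresponds to a policy sequence for which these two numbers diverge; the sketch simply makes it easy to plug in such a sequence and compare them.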

We will expand on the technical novelty, and mention the different definitions of constraint violation in the Table to clarify the difference.

References:

[1] ''Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach'', AAAI 2022.

[2] ''Sample-efficient constrained reinforcement learning with general parameterization'', NeurIPS 2024.

Comment

We are unaware of any theoretical lower bound that directly corresponds to our general parameterized setting and captures the dependence of various parameters

I am sorry for the confusion. What I wrote is that because of the loose lower bound, it is difficult to evaluate how significant the result is.

On constraint violation

I am not fully convinced by the discussion. Firstly, the argument that other papers use this criterion does not justify its use. Secondly, safety during training is important, and even if one only cares about the final policy, the considered objective function is inappropriate. Indeed, picking and deploying a policy as suggested in the response can result in violating a constraint with probability 0.5 in the worst case.

Therefore, I keep my score of 4, which is weak accept.

Review
Rating: 5

This paper studies the problem of learning optimal policies for infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs) under general policy parameterizations. The authors propose a Primal-Dual Natural Actor-Critic (PDNAC) algorithm designed to manage constraints effectively while achieving global convergence guarantees. The algorithm reaches $\tilde{O}(1/\sqrt{T})$ rates for both optimality gap and constraint violation when the mixing time $\tau_{\text{mix}}$ is known. Without this knowledge, the algorithm achieves rates arbitrarily close to the theoretical lower bound, improving significantly over the previous best $\tilde{O}(1/T^{1/5})$ rate. The results derive from a multi-timescale nested-loop design, careful tuning of learning rates, and an application of Multi-Level Monte Carlo (MLMC) methods.

Strengths and Weaknesses

Strengths

  1. The paper presents a model-free algorithm that achieves a convergence rate of $\tilde{O}(1/\sqrt{T})$ in both reward suboptimality and constraint violation. This matches the known information-theoretic lower bound for average-reward CMDPs and substantially improves upon existing results, which only attain rates like $\tilde{O}(1/T^{1/5})$.
  2. Even when the mixing time $\tau_{\text{mix}}$ is unknown, the algorithm still achieves a near-optimal rate of $\tilde{O}(1/T^{0.5-\epsilon})$, under mild assumptions on the time horizon. This makes the method more practically applicable, as $\tau_{\text{mix}}$ is rarely known in real-world settings.
  3. The analysis is conducted under general policy parameterization, as opposed to the more tractable linear or tabular cases considered in prior work. This broadens the applicability of the results to large-scale or continuous-state environments and makes the theoretical contributions more meaningful.

Weaknesses

  1. The proposed method primarily combines existing components—natural actor-critic updates, primal-dual optimization, and MLMC estimation—without introducing fundamentally new algorithmic primitives.
  2. The algorithm requires prior knowledge of the Slater constant $\delta$ to project the dual variable $\lambda$ into a bounded region $[0, 2/\delta]$. This is a relatively strong assumption in practice.
  3. The paper assumes the cost function is bounded in $[-1, 1]$, while the reward function may differ in scale. Could the authors explain this?
  4. Some typos: Line 101: remove the extra "the". Line 141: "occupation" → "occupancy". Line 180: "utilizes" → "utilize".

Questions

see weakness

Limitations

yes

Final Justification

The paper is well-written, well-motivated and has a strong theoretical result. Therefore, I believe this paper should be accepted.

Formatting Concerns

no

Author Response

We thank the reviewer for the constructive review. Please find our responses below.

Algorithmic novelty: Our primary novelty lies in achieving an optimal-order convergence rate for average reward Constrained Markov Decision Processes (CMDPs) with general policy parameterization, which was previously unresolved in the literature.

We indeed leverage the natural actor-critic framework in conjunction with the MLMC-based estimation technique, which recently has obtained an order-optimal convergence rate in the unconstrained setup [1]. However, directly applying these techniques to constrained MDPs by integrating a primal-dual approach does not yield the same optimal convergence rate. The key challenge arises from introducing the dual variable, which significantly complicates the convergence analysis. Specifically, as highlighted in Eq. (79, 82) in our paper, the dual learning rate $\beta$ appears in both the numerator and denominator of the convergence bounds. Therefore, merely following the unconstrained setting by choosing $\beta$ inversely proportional to $T$ fails to replicate the optimal convergence rate achieved in the unconstrained case.

This fundamental challenge has also been observed in the literature. For example, [2] addresses the unconstrained average reward RL and achieves a convergence rate of $\tilde{O}(T^{-1/4})$, while [3] deals with the constrained counterpart, achieving a slower rate of $\tilde{O}(T^{-1/5})$.

Our work addresses these critical challenges through a sharper and more refined analytical approach. We specifically modify both the primal and dual learning rate to scale inversely with the square root of $T$, a significant departure from the constant primal learning rate used in the unconstrained analyses ([1] and [2]). Additionally, we introduce a refined error bound for the Lagrangian (last expression on Page 30) and strategically set the inner-loop complexity to be nearly logarithmic in $T$. These innovative technical modifications enable us to overcome the identified challenges and achieve the first optimal-order convergence rate for the average reward constrained RL setting.

Knowledge of Slater's constant: We would like to point out that almost all primal-dual algorithms in the literature, to the best of our knowledge, assume access to Slater's constant $\delta$ (see, e.g., [3], [4]) to define the dual projection operator. Nevertheless, we agree with the reviewer's concern that the value of $\delta$ might be unknown for most practical scenarios. In such a case, $\delta$ can be taken as a hyperparameter of the algorithm, which needs to be fine-tuned during the training period. This is in line with many RL algorithms in the literature that also require fine-tuning of several unknown hyperparameters, such as the Lipschitz constant, mixing time, or hitting time for effective implementation [2]. In other words, the issue of fine-tuning is not unique to $\delta$, but common across the RL literature. Since $\delta$ is such that $J_c^{\pi} \geq \delta$ for *some* policy $\pi$, a coarse estimate of $\delta$ can be obtained by computing an estimate of $J_c^{\pi}$ for some arbitrarily chosen policy $\pi$.

Scaling of reward and cost functions: The reward function is typically assumed to be bounded, which allows us to choose $[0, 1]$ as its range without loss of generality. Note that the constraint in our optimization problem is of the form $J_c^{\pi} \geq 0$. Therefore, if the range of the cost function were chosen as either $[0, 1]$ or $[-1, 0]$, then the constraint would either be obeyed by all policies or by no policies at all. Both of these cases would lead to trivial solutions. To avoid this problem, we have chosen the range of the cost function to be $[-1, 1]$.

Typos: Thank you for pointing out the typos. We will correct those in the revised manuscript.

References:

[1] "A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach." ICML 2025.

[2] "Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes." AAAI 2024.

[3] "Learning general parameterized policies for infinite horizon average reward constrained MDPs via primal-dual policy gradient algorithm." NeurIPS 2024.

[4] "Sample-efficient constrained reinforcement learning with general parameterization." NeurIPS 2024.

Comment

Thank the authors for their response. I would like to clarify whether PDNAC, as proposed in this paper, requires the exact value of the Slater constant, or if an upper or lower bound would suffice.

Comment

Thank you for your concern. We do not require the exact value of Slater's constant $\delta$; any positive lower bound on $\delta$ would suffice for the analysis.

Comment

Thanks to the authors for their response. I keep my positive score for this paper.

Review
Rating: 4

The paper introduces Primal-Dual Natural Actor-Critic (PDNAC), a model-free algorithm for infinite-horizon average-reward constrained MDPs (CMDPs) with general policy parametrisation. By combining natural policy gradients with a carefully tuned primal-dual update and multi-level Monte-Carlo (MLMC) estimators, PDNAC attains a global convergence and constraint-violation rate of $\tilde{O}(T^{-1/2})$ when the mixing time is known, and $\tilde{O}(T^{-0.5+\varepsilon})$ otherwise, matching the information-theoretic lower bound and outperforming the previous $\tilde{O}(T^{-1/5})$ state of the art. Key ingredients are (i) a nested loop that behaves like a single-timescale update, (ii) MLMC critics that keep the bias low even with very short inner loops, and (iii) a tight finite-time analysis that isolates approximation error terms ($\varepsilon_{bias}$, $\varepsilon_{app}$) so they do not dominate asymptotics.

Strengths and Weaknesses

Strengths

  • Rigorous finite-time proofs that reach the lower bound for both objective gap and constraint violation (Theorems 4.9-4.10).
  • Clear comparison against a wide range of baselines (Table 1).
  • Solid justification of why naïve primal-dual tuning fails and how natural gradients fix the $\beta$-variance trade-off.
  • Closes the long-standing gap between achievable and optimal rates for general-param CMDPs.
  • Removes the need for mixing-time knowledge at only an $\varepsilon$-slack cost, which is important for large, unknown environments.
  • Techniques (nested MLMC + natural gradients) could influence broader safe-RL literature.

Weaknesses

  • When $\tau_{mix}$ is unknown, the horizon requirement $T \gtrsim \tilde{O}(\tau_{mix}^{2/\varepsilon})$ can be prohibitive.
  • Assumes ergodicity and a linear critic with bounded approximation error; real tasks may violate these.

Questions

How sensitive is PDNAC to mis-specifying the inner-loop length $H$ when $\tau_{mix}$ is unknown?

Limitations

The authors did not include a Limitations section.

Formatting Concerns

no

Author Response

On prohibitive horizon length: We agree that, for small $\epsilon$ and unknown $\tau_{\mathrm{mix}}$, the required horizon length $T$, as stated in Theorem 4.10, can indeed be very large. In such a scenario, one can switch to the parameter setting of Theorem 4.9, which does not impose any such restriction on $T$. However, since this setup requires the knowledge of $\tau_{\mathrm{mix}}$, one can treat $\tau_{\mathrm{mix}}$ as one of the unknown hyperparameters of the algorithm and fine-tune it during the training phase. This is in line with other RL algorithms in the literature that also require fine-tuning of several unknown hyperparameters, such as the Lipschitz constant or hitting time [1]. Despite these practical solutions, we acknowledge that a systematic theoretical investigation is needed to improve the requirement on the horizon length $T$ (in the absence of knowledge of $\tau_{\mathrm{mix}}$). We will include it as one of the future directions in the revised paper.

On the assumption of ergodicity: The assumption of ergodicity ensures that any state of the MDP is achievable from any other state via some policy. This ensures that a well-designed algorithm can sufficiently explore the state space and obtain a near-optimal policy exploiting the convergence of the policy-induced Markov chain to a stationary distribution. Therefore, it is no surprise that ergodicity is one of the most commonly employed assumptions in the RL literature (see the discussion following Assumption 2.2 in our paper). Extending the analysis to beyond-ergodic MDPs is extremely difficult, and a few works that are available in this direction primarily target tabular policies in the unconstrained setup. Designing algorithms achieving optimal convergence rates for beyond-ergodic CMDPs with general parameterized policies is currently an open question.

On the linear critic approximation error: Bounded linear critic approximation error is one of the commonly employed assumptions in the actor-critic literature. Please refer to the discussion following Assumption 2.2 for an overview of the relevant works. Apart from being important from a theoretical perspective, such a framework also yields good numerical results in experiments, as demonstrated in [2]. Despite these successes, we acknowledge that it is important to extend the analysis to non-linear critic to widen the applicability of the algorithm. Although some recent results are available in this direction [3], they primarily target unconstrained discounted reward setups. Moreover, their algorithms and analysis techniques are vastly different from ours. Whether similar analyses can be performed for constrained average reward settings is currently an open question.

Sensitivity of PDNAC to misspecifying $H$ with unknown $\tau_{\mathrm{mix}}$: Theorem 4.10 states that if $\tau_{\mathrm{mix}}$ is unknown, then one can choose $H = T^{\epsilon}$ for some $\epsilon > 0$, which results in objective convergence and constraint violation rates of $\tilde{\mathcal{O}}(T^{-1/2+\epsilon})$ for sufficiently large $T$. Hence, if $H$ is misspecified as $H = T^{\epsilon + \Delta\epsilon}$ for some small error $\Delta\epsilon$, then the convergence rates are modified by a factor of $\tilde{\mathcal{O}}(T^{\Delta\epsilon})$.
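The arithmetic behind both the misspecification factor and the horizon requirement is short (up to the logarithmic factors hidden inside $\tilde{\mathcal{O}}$, which we gloss over here):

H = T^{\epsilon + \Delta\epsilon} \;\Rightarrow\; \tilde{\mathcal{O}}\bigl(T^{-1/2+\epsilon}\bigr) \longrightarrow \tilde{\mathcal{O}}\bigl(T^{-1/2+\epsilon+\Delta\epsilon}\bigr),
\qquad
T \geq \tau_{\mathrm{mix}}^{2/\epsilon} \;\Longleftrightarrow\; H = T^{\epsilon} \geq \tau_{\mathrm{mix}}^{2},

i.e., a small exponent error in $H$ translates directly into the same additive exponent loss in the rate, and the horizon condition of Theorem 4.10 is essentially the requirement that the inner-loop length dominate the squared mixing time.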

References:

[1] ``Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes'', AAAI 2024.

[2] ``Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic'', ICML 2023.

[3] ``Order-optimal global convergence for actor-critic with general policy and neural critic parametrization'', UAI 2025.

Comment

I have read all the comments and they left me with a positive impression. Thank you for your thoughtful response.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Final Decision

This paper proposes Primal-Dual Natural Actor-Critic (PDNAC) for infinite-horizon average-reward CMDPs, achieving order-optimal convergence rates for both optimality gap and constraint violation. Reviewers agreed it is technically solid, closes a key gap in constrained RL, and improves substantially over prior work. Concerns about novelty, assumptions (e.g., Slater constant, ergodicity, linear critic approximation), and the definition of constraint violation were partly resolved in rebuttal. Overall, the work makes a nice theoretical advance and is recommended for acceptance.