PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.0
Novelty: 2.5
Quality: 3.0
Clarity: 3.0
Significance: 2.8
NeurIPS 2025

Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

Global Convergence with Order-Optimal rate for Average Reward Constrained MDPs with Primal-Dual Natural Actor Critic Algorithm

Abstract

This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) under general parametrized policies with smooth and bounded policy gradients. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$ provided that $T \geq \tilde{\mathcal{O}}\left(\tau_{\mathrm{mix}}^{2/\epsilon}\right)$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
Keywords
Reinforcement Learning · Actor-Critic Methods · Constrained Markov Decision Processes · Policy Gradient · Global Convergence

Reviews and Discussion

Review
Rating: 4

This paper studies infinite-horizon average reward constrained MDPs with general parametrization. It proposes a primal-dual natural actor-critic approach that achieves global convergence and constraint violation of $\tilde{O}(1/\sqrt{T})$ if the mixing time is known, and $\tilde{O}(1/T^{0.5-\epsilon})$ if the mixing time is unknown, improving the state of the art for general policy parametrization in both cases.

Strengths and Weaknesses

Strengths:

  • The paper is well-written, and the problem is well-motivated
  • This result improves upon previous bounds in the same setting and is the first to propose and analyze an approach for the infinite-horizon average-reward C-MDP problem when the mixing time is unknown.
  • The paper does a good job on discussing the difficulties of applying a primal-dual actor-critic approach in a setting with constraints, in particular the difficulties in estimating the critic parameters.

Weaknesses:

  • The final regret and constraint bounds depend on $\epsilon_{\text{bias}}$, which reflects the expressiveness of the policy class $\pi_\theta$, and on $\epsilon_{\text{app}}$, which captures the quality of the feature map. Both terms contribute linearly in $T$ to the bounds due to this dependency. While the authors argue that $\epsilon_{\text{bias}}$ can be made small with sufficiently expressive neural networks, the assumption that $\epsilon_{\text{app}}$ is negligible appears less justified. It would be helpful if the authors explicitly noted in Table 1 or in the main text that the quality of the regret bounds has this dependency, which may be linear in $T$ if the feature map is not good; otherwise the claims are a little misleading. This raises two important questions that I think should be addressed in the paper:

    • How realistic is it to obtain a good feature map in practice?
    • If the linear value function approximation is relaxed, can similar bounds still be obtained?
  • Another claim of the paper is that one of the proposed approaches does not require prior knowledge of the mixing time and is the first to address this challenge in the context of infinite-horizon average-reward C-MDPs. This is achieved through the use of Multi-Level Monte Carlo (MLMC). However, the analysis closely resembles that of [31] and Patel et al. (2024), yet neither is acknowledged. This omission gives the impression that the authors are the first to introduce these ideas for handling unknown mixing times. Unless I am mistaken, I strongly recommend that the authors cite these related works when discussing MLMC and clearly explain how their analysis is extended or adapted to address the constrained setting.

Patel et al., Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles, 2024

Questions

Minor issues and clarifications:

  • In equation (13), why does $\nu_g^{\pi_{\theta_k}}$ depend on $g$? Isn't this just the state-action distribution under policy $\pi_{\theta_k}$?
  • Page 5, line 158: "feaure" → "feature"
  • Page 7, line 223, item (b): $\pi_\theta$ → $\pi_{\theta_1}$

Limitations

yes.

Final Justification

I decided to maintain my positive score, as the authors have addressed all my concerns regarding the dependence of the regret bound on $\epsilon_{\text{bias}}$ and $\epsilon_{\text{app}}$, and I hope they will consider including these dependencies in Table 1. They also responded to my comments on related work, and I trust they will incorporate this discussion to help clarify the comparison for future readers.

Formatting Concerns

The paper does not seem to have a formatting issue.

Author Response

Dependence of the bounds on $\epsilon_{\mathrm{bias}}$ and $\epsilon_{\mathrm{app}}$: Thank you for pointing this out. We will mention the dependence of the convergence rates on $\epsilon_{\mathrm{bias}}$ and $\epsilon_{\mathrm{app}}$ in the main text as well as in Table 1 in the revised manuscript.

On good feature maps: We would like to emphasize that the concept of linear function approximation is not only a common assumption for the actor-critic algorithms, but it is also central to the rich literature of linear MDPs [1]. Although there is no consensus, to the best of our knowledge, on how to build feature maps for a generic task, several task-dependent features have been reported in the literature [2], [3] that perform well in experiments. Hence, it is quite practical to assume that the learner has access to a good feature map.

On nonlinear value function approximation: The benefit of linear critic approximation is that both the NPG and the critic updates can be described as stochastic linear recursions (SLRs). This allows us to utilize Theorem B.1, which is crucial in obtaining the optimal convergence rates. Extending these results to non-linear critic approximation is extremely challenging since the convergence analysis of SLRs can no longer be applied. Very recently, [4] has provided a regret analysis with non-linear critic approximation in the unconstrained discounted reward setup. However, they introduce a data-drop technique and do not use the MLMC-based estimation, which is central to our work. Whether their approach can be extended to the constrained average reward scenario is currently an open question.
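For readers unfamiliar with the term, the generic template below illustrates what we mean by a stochastic linear recursion; it is only a schematic (the exact matrices and step sizes are those defined in the paper, not reproduced here), of which TD-style critic updates with a linear feature map are a special case:

x_{t+1} = x_t + \alpha_t \left( b_t - A_t x_t \right), \qquad \mathbb{E}[A_t] = \bar{A}, \quad \mathbb{E}[b_t] = \bar{b}, \quad x^{\star} = \bar{A}^{-1}\bar{b},

where $A_t$ and $b_t$ are noisy samples generated from Markovian data and $x^{\star}$ is the target fixed point. Convergence results for such recursions (e.g., Theorem B.1 mentioned above) hinge on the update being linear in $x_t$, which is precisely what is lost with a non-linear critic.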

Relation to [31] and (Patel et al.): We agree that there are some similarities between our algorithm and those proposed in the cited papers. For example, MLMC-based estimation and linear critic approximation are central to all of these papers. Despite these apparent similarities, there are some crucial differences. First, both of the cited papers analyze unconstrained setups, liberating them from the nuances of dual projection and its parameter optimization. Second, unlike our paper, [31] merely guarantees convergence to a first-order stationary point. Although (Patel et al.) guarantees global convergence, their established rate $\tilde{\mathcal{O}}(T^{-1/4})$ is nowhere near the theoretical lower bound $\Omega(T^{-1/2})$. Third, the set of assumptions used in [31] is different from ours. For example, Assumption 4.2(3) used in [31] is not needed in our analysis. Finally, the analysis of stochastic linear recursions (SLRs), which is central to our paper (Theorem B.1), is present in neither of the cited papers. Hence, although we are not the first to remove dependence on the knowledge of $\tau_{\mathrm{mix}}$ using MLMC (though we are the first to do so in the CMDP setup), our analysis is not an adapted version of that presented in either of the cited papers. We will include a summary of the above comparison in the revised manuscript.

Typos: Thank you for pointing out the typos. We will correct those in the revised manuscript.

References:

[1] ''Provably efficient reinforcement learning with linear function approximation'', COLT 2020.

[2] ''Making Linear MDPs Practical via Contrastive Representation Learning'', ICML 2022.

[3] ''Achieving sub-linear regret in infinite horizon average reward constrained MDP with linear function approximation'', ICLR 2023.

[4] ''Order-optimal global convergence for actor-critic with general policy and neural critic parametrization'', UAI 2025.

Comment

I would like to thank the authors for their responses. I maintain my positive score about the paper.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Review
Rating: 4

This paper studies infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with a smooth policy with bounded gradients. It proposes Primal-Dual Natural Actor-Critic with a global convergence rate of $1/\sqrt{T}$, which significantly improves the existing convergence rates for the considered setting and matches the lower bound.

Strengths and Weaknesses

Strength

The paper provides a solid theoretical result in safe RL with flexible policy parameterization. The proposed algorithm has a significantly better global convergence rate compared to existing results in terms of the number of steps $T$.

Weakness

It is not clear what the technical novelty is beyond tuning the problem parameters and a sharper analysis. For example, it is written that "Achieving this challenging goal is made possible due to the use of Multi-level Monte Carlo (MLMC)-based estimates in the inner loop subroutines and a sharper analysis of these update", but is there any technical novelty or challenge in the sharper analysis? The current techniques seem to be mainly borrowed from Ganesh et al. 2024.

Furthermore, although the convergence rate is much better than previous results, it is still unclear how close the obtained results are to a lower bound.

Because of these points, the significance of the paper is difficult to evaluate for me.

Questions

It is not clear what the technical novelty is beyond tuning the problem parameters and a sharper analysis. Would you elaborate on it?

Limitations

It is nice to discuss the constraint violation of the proposed method. Currently, the proposed method violates the constraint, and it is not practical to directly use this method for safety critical applications.

Also, by "general" policies, the paper refers to a policy with smooth and bounded gradients. It would be nice to be more specific and call the considered policy class as smooth policies with bounded gradients in the title and abstract.

Final Justification

While the paper provides a theoretically solid analysis and algorithm, the analysis is not fully convincing to me in that it uses a definition of constraint violation that is not really appropriate for safety-critical situations. Therefore, I keep my score of 4.

Formatting Concerns

No.

Author Response

Technical novelty: Our primary novelty lies in achieving an optimal-order convergence rate for average reward Constrained Markov Decision Processes (CMDPs) with general policy parameterization, which was previously unresolved in the literature.

We indeed leverage the natural actor-critic framework in conjunction with the MLMC-based estimation technique, which recently has obtained an order-optimal convergence rate in the unconstrained setup [1]. However, directly applying these techniques to constrained MDPs by integrating a primal-dual approach does not yield the same optimal convergence rate. The key challenge arises from introducing the dual variable, which significantly complicates the convergence analysis. Specifically, as highlighted in Eq. (79, 82) in our paper, the dual learning rate $\beta$ appears in both the numerator and denominator of the convergence bounds. Therefore, merely following the unconstrained setting by choosing $\beta$ inversely proportional to $T$ fails to replicate the optimal convergence rate achieved in the unconstrained case.

This fundamental challenge has also been observed in the literature. For example, [2] addresses the unconstrained average reward RL and achieves a convergence rate of $\tilde{O}(T^{-1/4})$, while [3] deals with the constrained counterpart, achieving a slower rate of $\tilde{O}(T^{-1/5})$.

Our work addresses these critical challenges through a sharper and more refined analytical approach. We specifically modify both the primal and dual learning rate to scale inversely with the square root of $T$, a significant departure from the constant primal learning rate used in the unconstrained analyses ([1] and [2]). Additionally, we introduce a refined error bound for the Lagrangian (last expression on Page 30) and strategically set the inner-loop complexity to be nearly logarithmic in $T$. These innovative technical modifications enable us to overcome the identified challenges and achieve the first optimal-order convergence rate for the average reward constrained RL setting.
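To make the described tension concrete, consider the following schematic caricature with dual step size $\beta$ (this is not the actual expression from Eqs. (79) and (82); $C_1, C_2$ are placeholder constants):

\text{error}(T) \;\lesssim\; \frac{C_1}{\beta T} + C_2\,\beta + (\text{terms independent of } \beta).

Taking $\beta \propto 1/T$, as one would in the unconstrained case, leaves the first term of constant order, while balancing the two $\beta$-dependent terms gives $\beta \propto 1/\sqrt{T}$ and an overall $\mathcal{O}(1/\sqrt{T})$ contribution, which is the tuning adopted above.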

Relation to lower bound: As mentioned in Table 1 in the paper, the theoretical lower bound on the global convergence rate is $\Omega(T^{-1/2})$, where $T$ indicates the horizon length. Theorems 4.9 and 4.10 in our paper establish convergence rates of $\tilde{\mathcal{O}}(T^{-1/2})$ and $\tilde{\mathcal{O}}(T^{-1/2+\epsilon})$ for arbitrarily small $\epsilon$. Therefore, not only do our results vastly improve upon the state-of-the-art convergence rates, but they nearly match the theoretical lower bound as well.

On constraint violation: Our algorithm can achieve a small average constraint violation, provided that the horizon length $T$ is sufficiently large and the errors due to policy and critic approximations are negligible. That being said, we agree with the reviewer that in some safety-critical applications, even a small constraint violation is undesirable. To force the constraint violation to zero, we can resort to a standard ``trick'' commonly used in constrained RL literature. The main idea is to consider a stringent constraint of the form $J_c^{\pi} \geq \eta > 0$ in the algorithm, instead of the formulated constraint $J_c^{\pi} \geq 0$. Now, the hyperparameter $\eta$ can be appropriately chosen so that our desired constraint violation is forced to zero. Please see Section H.5 of [4] for a detailed exposition. Similar ideas are also discussed in [5]. Hence, one can extend a small constraint violation result to a zero one with some additional assumption, and we leave it as future work.
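For intuition, a sketch of how the tightening argument typically works, with a generic constant $C$ standing in for the actual problem-dependent constants and ignoring the Slater feasibility requirement $\eta \le \delta$ and the resulting $\mathcal{O}(\eta)$ loss in the optimality gap: if the algorithm applied to the tightened constraint guarantees

\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\bigl[\eta - J_c(\theta_k)\bigr] \;\le\; \frac{C}{\sqrt{T}},
\qquad\text{then}\qquad
-\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\bigl[J_c(\theta_k)\bigr] \;\le\; \frac{C}{\sqrt{T}} - \eta \;\le\; 0
\quad\text{whenever } \eta \ge \frac{C}{\sqrt{T}},

i.e., the violation of the original constraint $J_c^{\pi} \geq 0$ is driven to zero.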

On general policies: The class of policies considered in the paper is called ``general'' because it contains various commonly used policy classes such as tabular, softmax, log-linear, neural, etc., as special cases. This terminology is common across the RL literature [6], [7]. We believe that renaming the general policy class as policies with smooth and bounded gradients would de-emphasize the generic applicability of our results. Therefore, instead of renaming it in the title, we will mention the smoothness and boundedness properties of the policy gradients in the abstract and introduction of the revised paper.

References:

[1] "A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach", ICML 2025.

[2] "Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes", AAAI 2024.

[3] "Learning general parameterized policies for infinite horizon average reward constrained MDPs via primal-dual policy gradient algorithm", NeurIPS 2024.

[4] ''Achieving sub-linear regret in infinite horizon average reward constrained MDP with linear function approximation'', ICLR 2023.

[5] ''Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach'', AAAI 2022.

[6] ''Sample-Efficient Constrained Reinforcement Learning with General Parameterization'', NeurIPS 2024.

[7] ''A novel framework for policy mirror descent with general parameterization and linear convergence'', NeurIPS 2023.

Comment

Thank you very much for the response.

Technical novelty

Understood. My concerns regarding the technical challenges are now resolved. Is there a similar discussion on technical challenges included in the paper? If not, I highly recommend that the authors incorporate one, as it would help clarify the contributions.

Relation to lower bound

I should have been more specific about what I meant by the lower bound. I was referring to a lower bound that directly corresponds to the considered setting, with explicit dependence on constants such as $G_1$. Since the lower bound in the paper is derived for average-reward MDPs with a tabular policy parameterization, its connection to such constants remains unclear. For example, in Eq. (76), $G_1^6$ appears to be very large to me.

On constraint violation

I am sorry. This was also my fault for not clearly stating what I meant by constraint violation. The paper defines it as

- \frac{1}{K} \sum_{k=0}^K \mathbb{E} \left[ J_c (\theta_k) \right].

However, I see two concerns with this definition in practical settings:

  1. First, it is possible that, for example, $-\mathbb{E}\left[J_c(\theta_k)\right] = 0.5$ for odd $k$ and $-\mathbb{E}\left[J_c(\theta_k)\right] = -0.5$ for even $k$. In this case, the average constraint violation converges to zero, but such oscillatory behavior would likely be undesirable in practice.

  2. Second, it seems possible that

- \sum_{k=0}^K \mathbb{E} \left[ c (S_k, A_k) \right] = K,

which is used in other papers such as [12, 34] and makes more sense to me, while

- \mathbb{E} \left[ J_c (\theta_k) \right] = 0

for all $k$. This discrepancy raises concerns about whether the definition used here accurately captures meaningful constraint violation.

To illustrate this, consider an environment with four states $s^1, s^2, s^3, s^4$ and two actions: $+1$ and $-1$. Taking action $a$ in state $s^i$ transitions the agent to state $s^{\min(\max(i+a,0),4)}$. The cost is $1$ for $s^1$ and $s^4$, and $-1$ for $s^2$ and $s^3$. Suppose the initial state is $s^2$, and an algorithm alternates between two policy parameters: $\theta$, which always chooses $+1$, and $\theta'$, which always chooses $-1$, depending on whether the timestep is even or odd. The resulting trajectory oscillates between $s^2$ and $s^3$, leading to

- \sum_{k=0}^K \mathbb{E} \left[ c (S_k, A_k) \right] = K.

However, $J_c(\theta) = J_c(\theta') = 0$ for all $k$.

On general policies

Understood. My comment was only a suggestion. I believe being as concrete as possible can be helpful for readers, even if prior works use ambiguous terms.

Comment

Thank you for letting us know your concerns and for the detailed reading of the manuscript.

On lower bound: We are unaware of any theoretical lower bound that directly corresponds to our general parameterized setting and captures the dependence of various parameters, such as G1G_1. We have reported a lower bound in Table 1 that dictates the dependence on TT. This result emerges from analyzing a tabular setup. Since tabular policies are special cases of general parameterized policies, the stated result also serves as a lower bound for our case (as far as TT is concerned).

On constraint violation: The intuition behind the definition of constraint violation, as stated in our paper, is as follows. Suppose that an algorithm generates a sequence of policy parameters $\{\theta_1, \cdots, \theta_K\}$ during the *training phase*. Now, if one of these policy parameters is chosen at random and deployed in the *evaluation phase*, what is the constraint violation one is expected to see? Note that each policy parameter has an equal chance $1/K$ of being selected. Moreover, if the parameter $\theta_i$ is chosen, then the expected constraint violation is $-\mathbf{E}[J_c(\theta_i)]$. Therefore, the expected constraint violation during the evaluation phase is:

-\dfrac{1}{K}\sum_{k=1}^{K}\mathbf{E}[J_c(\theta_k)]

The above definition of constraint violation is also used in many prior works [1, 2]. In contrast, if one is interested in the constraint violation during the training phase itself, then the expression of expected constraint violation should be of the following form, as stated by the reviewer.

-\dfrac{1}{K}\sum_{k=1}^K \mathbf{E}[c(s_k, a_k)]

In summary, the two definitions of expected constraint violations have different purposes and application scenarios. In some RL applications, a reliable simulator is available for training. In such cases, the algorithm's performance during training is not of much importance, although the performance of the deployed policy during the evaluation phase is. However, if a simulator is unavailable or the available simulators are not reliable, then in-training performance guarantee takes centre stage. We hope this clarifies the confusion regarding the definition of constraint violation.
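To make the distinction tangible, below is a minimal numerical sketch of the two quantities on a small synthetic CMDP. This is only an illustration, not part of the paper or its implementation: the transition kernel `P`, the cost `c`, and the random policy sequence are placeholders standing in for the environment and the iterates $\theta_1, \cdots, \theta_K$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small synthetic MDP: S states, A actions, random transitions and costs in [-1, 1].
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
c = rng.uniform(-1.0, 1.0, size=(S, A))      # per-step cost

def stationary_cost(pi):
    """Average cost J_c(pi) under the stationary distribution of policy pi (pi[s] is a distribution over actions)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)    # state transition matrix induced by pi
    evals, evecs = np.linalg.eig(P_pi.T)     # stationary distribution = left eigenvector for eigenvalue 1
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = mu / mu.sum()
    return float(np.einsum('s,sa,sa->', mu, pi, c))

# A sequence of K random policies standing in for the iterates theta_1, ..., theta_K.
K = 200
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(K)]

# (a) Evaluation-phase criterion: -(1/K) * sum_k J_c(theta_k).
eval_violation = -np.mean([stationary_cost(pi) for pi in policies])

# (b) Training-phase criterion: -(1/K) * sum_k c(s_k, a_k) along a single trajectory
#     in which the k-th policy is deployed at step k.
s = 0
train_costs = []
for pi in policies:
    a = rng.choice(A, p=pi[s])
    train_costs.append(c[s, a])
    s = rng.choice(S, p=P[s, a])
train_violation = -np.mean(train_costs)

print(f"evaluation-phase violation: {eval_violation:+.3f}")
print(f"training-phase violation:   {train_violation:+.3f}")
```

The reviewer's oscillation example corresponds to a policy sequence for which these two numbers diverge; the sketch simply makes it easy to plug in such a sequence and compare them.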

We will expand on the technical novelty, and mention the different definitions of constraint violation in the Table to clarify the difference.

References:

[1] ''Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach'', AAAI 2022.

[2] ''Sample-efficient constrained reinforcement learning with general parameterization'', NeurIPS 2024.

Comment

We are unaware of any theoretical lower bound that directly corresponds to our general parameterized setting and captures the dependence of various parameters

I am sorry for the confusion. What I wrote is that because of the loose lower bound, it is difficult to evaluate how significant the result is.

On constraint violation

I am not fully convinced by the discussion. Firstly, the argument that other papers use this criterion does not justify its use. Secondly, safety during training is important, and even if one only cares about the final policy, the considered objective function is inappropriate. Indeed, picking and deploying a policy as suggested in the response can result in violating a constraint with probability 0.5 in the worst case.

Therefore, I keep my score of 4, which is weak accept.

Review
Rating: 5

This paper studies the problem of learning optimal policies for infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs) under general policy parameterizations. The authors propose a Primal-Dual Natural Actor-Critic (PDNAC) algorithm designed to manage constraints effectively while achieving global convergence guarantees. The algorithm reaches $\tilde{O}(1/\sqrt{T})$ rates for both optimality gap and constraint violation when the mixing time $\tau_{\text{mix}}$ is known. Without this knowledge, the algorithm achieves rates arbitrarily close to the theoretical lower bound, improving significantly over the previous best $\tilde{O}(1/T^{1/5})$ rate. The results derive from a multi-timescale nested-loop design, careful tuning of learning rates, and an application of Multi-Level Monte Carlo (MLMC) methods.

Strengths and Weaknesses

Strengths

  1. The paper presents a model-free algorithm that achieves a convergence rate of $\tilde{O}(1/\sqrt{T})$ in both reward suboptimality and constraint violation. This matches the known information-theoretic lower bound for average-reward CMDPs and substantially improves upon existing results, which only attain rates like $\tilde{O}(1/T^{1/5})$.
  2. Even when the mixing time $\tau_{\text{mix}}$ is unknown, the algorithm still achieves a near-optimal rate of $\tilde{O}(1/T^{0.5-\epsilon})$, under mild assumptions on the time horizon. This makes the method more practically applicable, as $\tau_{\text{mix}}$ is rarely known in real-world settings.
  3. The analysis is conducted under general policy parameterization, as opposed to the more tractable linear or tabular cases considered in prior work. This broadens the applicability of the results to large-scale or continuous-state environments and makes the theoretical contributions more meaningful.

Weaknesses

  1. The proposed method primarily combines existing components—natural actor-critic updates, primal-dual optimization, and MLMC estimation—without introducing fundamentally new algorithmic primitives.
  2. The algorithm requires prior knowledge of the Slater constant $\delta$ to project the dual variable $\lambda$ into a bounded region $[0, 2/\delta]$. This is a relatively strong assumption in practice.
  3. The paper assumes the cost function is bounded in $[-1, 1]$, while the reward function may differ in scale. Could the authors explain this?
  4. Some typos: Line 101: remove the extra "the". Line 141: "occupation" → "occupancy". Line 180: "utilizes" → "utilize".

Questions

see weakness

Limitations

yes

Final Justification

The paper is well-written, well-motivated and has a strong theoretical result. Therefore, I believe this paper should be accepted.

Formatting Concerns

no

Author Response

We thank the reviewer for the constructive review. Please find our responses below.

Algorithmic novelty: Our primary novelty lies in achieving an optimal-order convergence rate for average reward Constrained Markov Decision Processes (CMDPs) with general policy parameterization, which was previously unresolved in the literature.

We indeed leverage the natural actor-critic framework in conjunction with the MLMC-based estimation technique, which recently has obtained an order-optimal convergence rate in the unconstrained setup [1]. However, directly applying these techniques to constrained MDPs by integrating a primal-dual approach does not yield the same optimal convergence rate. The key challenge arises from introducing the dual variable, which significantly complicates the convergence analysis. Specifically, as highlighted in Eq. (79, 82) in our paper, the dual learning rate $\beta$ appears in both the numerator and denominator of the convergence bounds. Therefore, merely following the unconstrained setting by choosing $\beta$ inversely proportional to $T$ fails to replicate the optimal convergence rate achieved in the unconstrained case.

This fundamental challenge has also been observed in the literature. For example, [2] addresses the unconstrained average reward RL and achieves a convergence rate of $\tilde{O}(T^{-1/4})$, while [3] deals with the constrained counterpart, achieving a slower rate of $\tilde{O}(T^{-1/5})$.

Our work addresses these critical challenges through a sharper and more refined analytical approach. We specifically modify both the primal and dual learning rate to scale inversely with the square root of $T$, a significant departure from the constant primal learning rate used in the unconstrained analyses ([1] and [2]). Additionally, we introduce a refined error bound for the Lagrangian (last expression on Page 30) and strategically set the inner-loop complexity to be nearly logarithmic in $T$. These innovative technical modifications enable us to overcome the identified challenges and achieve the first optimal-order convergence rate for the average reward constrained RL setting.

Knowledge of Slater's constant: We would like to point out that almost all primal-dual algorithms in the literature, to the best of our knowledge, assume access to Slater's constant $\delta$ (see, e.g., [3], [4]) to define the dual projection operator. Nevertheless, we agree with the reviewer's concern that the value of $\delta$ might be unknown for most practical scenarios. In such a case, $\delta$ can be taken as a hyperparameter of the algorithm, which needs to be fine-tuned during the training period. This is in line with many RL algorithms in the literature that also require fine-tuning of several unknown hyperparameters, such as the Lipschitz constant, mixing time, or hitting time for effective implementation [2]. In other words, the issue of fine-tuning is not unique to $\delta$, but common across the RL literature. Since $\delta$ is such that $J_c^{\pi} \geq \delta$ for *some* policy $\pi$, a coarse estimate of $\delta$ can be obtained by computing an estimate of $J_c^{\pi}$ for some arbitrarily chosen policy $\pi$.

Scaling of reward and cost functions: The reward function is typically assumed to be bounded, which allows us to choose $[0, 1]$ as its range without loss of generality. Note that the constraint in our optimization problem is of the form $J_c^{\pi} \geq 0$. Therefore, if the range of the cost function were chosen as either $[0, 1]$ or $[-1, 0]$, then the constraint would either be obeyed by all policies or by no policies at all. Both of these cases would lead to trivial solutions. To avoid this problem, we have chosen the range of the cost function to be $[-1, 1]$.

Typos: Thank you for pointing out the typos. We will correct those in the revised manuscript.

References:

[1] "A sharper global convergence analysis for average reward reinforcement learning via an actor-critic approach." ICML 2025.

[2] "Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes." AAAI 2024.

[3] "Learning general parameterized policies for infinite horizon average reward constrained MDPs via primal-dual policy gradient algorithm." NeurIPS 2024.

[4] "Sample-efficient constrained reinforcement learning with general parameterization." NeurIPS 2024.

Comment

Thank the authors for their response. I would like to clarify whether PDNAC, as proposed in this paper, requires the exact value of the Slater constant, or if an upper or lower bound would suffice.

Comment

Thank you for your concern. We do not require the exact value of Slater's constant $\delta$; any positive lower bound on $\delta$ would suffice for the analysis.

Comment

Thanks to the authors for their response. I keep my positive score for this paper.

Review
Rating: 4

The paper introduces Primal-Dual Natural Actor-Critic (PDNAC), a model-free algorithm for infinite-horizon average-reward constrained MDPs (CMDPs) with general policy parametrisation. By combining natural policy gradients with a carefully tuned primal-dual update and multi-level Monte-Carlo (MLMC) estimators, PDNAC attains a global convergence and constraint-violation rate of $\tilde{O}(T^{-1/2})$ when the mixing time is known, and $\tilde{O}(T^{-0.5+\varepsilon})$ otherwise, matching the information-theoretic lower bound and outperforming the previous $\tilde{O}(T^{-1/5})$ state of the art. Key ingredients are (i) a nested loop that behaves like a single-timescale update, (ii) MLMC critics that keep the bias low even with very short inner loops, and (iii) a tight finite-time analysis that isolates approximation error terms ($\varepsilon_{bias}$, $\varepsilon_{app}$) so they do not dominate asymptotics.

Strengths and Weaknesses

Strengths

  • Rigorous finite-time proofs that reach the lower bound for both objective gap and constraint violation (Theorems 4.9-4.10).
  • Clear comparison against a wide range of baselines (Table 1).
  • Solid justification of why naïve primal-dual tuning fails and how natural gradients fix the $\beta$-variance trade-off.
  • Closes the long-standing gap between achievable and optimal rates for general-param CMDPs.
  • Removes the need for mixing-time knowledge at only an $\varepsilon$-slack cost, which is important for large, unknown environments.
  • Techniques (nested MLMC + natural gradients) could influence broader safe-RL literature.

Weaknesses

  • When $\tau_{mix}$ is unknown, the horizon requirement $T \gtrsim \tilde{O}(\tau_{mix}^{2/\varepsilon})$ can be prohibitive.
  • Assumes ergodicity and a linear critic with bounded approximation error; real tasks may violate these.

Questions

How sensitive is PDNAC to mis-specifying the inner-loop length $H$ when $\tau_{mix}$ is unknown?

Limitations

The authors did not include a Limitations section.

Formatting Concerns

no

Author Response

On prohibitive horizon length: We agree that, for small $\epsilon$ and unknown $\tau_{\mathrm{mix}}$, the required horizon length $T$, as stated in Theorem 4.10, can indeed be very large. In such a scenario, one can switch to the parameter setting of Theorem 4.9, which does not impose any such restriction on $T$. However, since this setup requires the knowledge of $\tau_{\mathrm{mix}}$, one can treat $\tau_{\mathrm{mix}}$ as one of the unknown hyperparameters of the algorithm and fine-tune it during the training phase. This is in line with other RL algorithms in the literature that also require fine-tuning of several unknown hyperparameters, such as the Lipschitz constant or hitting time [1]. Despite these practical solutions, we acknowledge that a systematic theoretical investigation is needed to improve the requirement on the horizon length $T$ (in the absence of knowledge of $\tau_{\mathrm{mix}}$). We will include it as one of the future directions in the revised paper.

On the assumption of ergodicity: The assumption of ergodicity ensures that any state of the MDP is achievable from any other state via some policy. This ensures that a well-designed algorithm can sufficiently explore the state space and obtain a near-optimal policy exploiting the convergence of the policy-induced Markov chain to a stationary distribution. Therefore, it is no surprise that ergodicity is one of the most commonly employed assumptions in the RL literature (see the discussion following Assumption 2.2 in our paper). Extending the analysis to beyond-ergodic MDPs is extremely difficult, and a few works that are available in this direction primarily target tabular policies in the unconstrained setup. Designing algorithms achieving optimal convergence rates for beyond-ergodic CMDPs with general parameterized policies is currently an open question.

On the linear critic approximation error: Bounded linear critic approximation error is one of the commonly employed assumptions in the actor-critic literature. Please refer to the discussion following Assumption 2.2 for an overview of the relevant works. Apart from being important from a theoretical perspective, such a framework also yields good numerical results in experiments, as demonstrated in [2]. Despite these successes, we acknowledge that it is important to extend the analysis to non-linear critic to widen the applicability of the algorithm. Although some recent results are available in this direction [3], they primarily target unconstrained discounted reward setups. Moreover, their algorithms and analysis techniques are vastly different from ours. Whether similar analyses can be performed for constrained average reward settings is currently an open question.

Sensitivity of PDNAC to misspecifying $H$ with unknown $\tau_{\mathrm{mix}}$: Theorem 4.10 states that if $\tau_{\mathrm{mix}}$ is unknown, then one can choose $H = T^{\epsilon}$ for some $\epsilon > 0$, which results in objective convergence and constraint violation rates of $\tilde{\mathcal{O}}(T^{-1/2+\epsilon})$ for sufficiently large $T$. Hence, if $H$ is misspecified as $H = T^{\epsilon + \Delta\epsilon}$ for some small error $\Delta\epsilon$, then the convergence rates are modified by a factor of $\tilde{\mathcal{O}}(T^{\Delta\epsilon})$.
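The arithmetic behind both the misspecification factor and the horizon requirement is short (up to the logarithmic factors hidden inside $\tilde{\mathcal{O}}$, which we gloss over here):

H = T^{\epsilon + \Delta\epsilon} \;\Rightarrow\; \tilde{\mathcal{O}}\bigl(T^{-1/2+\epsilon}\bigr) \longrightarrow \tilde{\mathcal{O}}\bigl(T^{-1/2+\epsilon+\Delta\epsilon}\bigr),
\qquad
T \geq \tau_{\mathrm{mix}}^{2/\epsilon} \;\Longleftrightarrow\; H = T^{\epsilon} \geq \tau_{\mathrm{mix}}^{2},

i.e., a small exponent error in $H$ translates directly into the same additive exponent loss in the rate, and the horizon condition of Theorem 4.10 is essentially the requirement that the inner-loop length dominate the squared mixing time.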

References:

[1] ``Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes'', AAAI 2024.

[2] ``Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic'', ICML 2023.

[3] ``Order-optimal global convergence for actor-critic with general policy and neural critic parametrization'', UAI 2025.

Comment

I have read all the comments and they left me with a positive impression. Thank you for your thoughtful response.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Final Decision

This paper proposes Primal-Dual Natural Actor-Critic (PDNAC) for infinite-horizon average-reward CMDPs, achieving order-optimal convergence rates for both optimality gap and constraint violation. Reviewers agreed it is technically solid, closes a key gap in constrained RL, and improves substantially over prior work. Concerns about novelty, assumptions (e.g., Slater constant, ergodicity, linear critic approximation), and the definition of constraint violation were partly resolved in rebuttal. Overall, the work makes a nice theoretical advance and is recommended for acceptance.