PaperHub
NeurIPS 2024 · Poster
Overall rating: 6.2/10 (5 reviewers; scores 7, 6, 5, 7, 6; min 5, max 7, std dev 0.7)
Confidence: 2.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.0

Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Regret and constraint violation analysis of infinite horizon average reward constrained MDPs.

Abstract

This paper explores the realm of infinite horizon average reward Constrained Markov Decision Processes (CMDPs). To the best of our knowledge, this work is the first to delve into the regret and constraint violation analysis of average reward CMDPs with a general policy parametrization. To address this challenge, we propose a primal dual-based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a global optimal policy. In particular, our proposed algorithm achieves $\tilde{\mathcal{O}}({T}^{4/5})$ objective regret and $\tilde{\mathcal{O}}({T}^{4/5})$ constraint violation bounds.
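For context, one common convention for these two quantities in the average-reward CMDP literature (the paper's exact definitions may differ in details such as the sign convention of the constraint) is

$$
\mathrm{Reg}(T) = \sum_{t=0}^{T-1}\bigl(J^{\pi^*}_{r} - r(s_t,a_t)\bigr), \qquad
\mathrm{Viol}(T) = \sum_{t=0}^{T-1}\bigl(-c(s_t,a_t)\bigr),
$$

where $J^{\pi}_{r}$ denotes the long-run average reward of policy $\pi$, $\pi^*$ is an optimal policy of the CMDP, and the constraint is $J^{\pi}_{c} \geq 0$.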
Keywords
Average Reward MDP, Constraint Violation, Regret

Reviews and Discussion

Review
Rating: 7

The paper explores RL in infinite-horizon ergodic average-reward constrained MDPs under general policy parametrization. In this setting, the authors propose a primal-dual based policy gradient algorithm that simultaneously optimizes the policy parameter and a constraint violation penalty term $\lambda$. They also equip their method with techniques to address challenges due to the average-reward MDP setup, constraints on the MDP, and general policy parametrization. They prove global convergence of their resulting method in terms of the Lagrange function and from this derive bounds on the expected regret as well as constraint violation. Furthermore, when the policy class is expressive enough to approximately contain an optimal policy of the constrained MDP so that the un-expressivity parameter $\varepsilon_{bias}$ is zero or negligibly small, the authors prove that the expected average regret and constraint violation of their method decrease at a rate of $O(T^{-1/5})$.
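To make the primal-dual structure concrete, here is a schematic sketch of such an update loop (an illustration of the general idea only, not the authors' Algorithm 1; the gradient and constraint estimators, step sizes, and projection bound are hypothetical placeholders):

```python
import numpy as np

def primal_dual_step(theta, lam, est_policy_grad, est_constraint,
                     alpha=1e-2, beta=1e-2, lam_max=100.0):
    """One schematic primal-dual update for a constrained policy-gradient method.

    est_policy_grad(theta, lam): estimated gradient of the Lagrangian J_r + lam * J_c w.r.t. theta
    est_constraint(theta): estimated constraint value J_c (constraint: J_c >= 0)
    """
    # Primal ascent on the policy parameter.
    theta = theta + alpha * est_policy_grad(theta, lam)
    # Dual descent on the Lagrange multiplier, projected onto [0, lam_max].
    lam = np.clip(lam - beta * est_constraint(theta), 0.0, lam_max)
    return theta, lam
```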

Strengths

  1. The authors consider the task of reinforcement learning in average-reward constrained MDPs and propose novel techniques to address underlying challenges.
  2. This is a complete piece of work that explores provably efficient RL in average reward constrained MDPs. Their claims are backed by theoretical analysis, and they highlight the strengths as well as the weaknesses of their work.
  3. This submission is clearly written. The authors highlight the required assumptions and adequately discuss the nature of most relevant parameters.
  4. This work significantly contributes to existing theory on reinforcement learning when it is more desirable to optimize the average return, rather than the typical discounted return, and in addition to return optimization, the policy is required to adequately adhere to additional constraints on the MDP.

Weaknesses

See questions.

Questions

  1. In Lines 243-246, the authors claim that there are scenarios under which the un-expressivity parameter $\varepsilon_{bias}$ is zero or negligible. I could reason about this claim when $\pi^{*}$ is an optimal policy in the unconstrained MDP, but am unsure that the same holds when $\pi^{*}$ is an optimal policy in the constrained MDP (see Line 240). Can the authors kindly provide exact pointers in the referenced papers that verify this claim, or a proof in the appendix?

  2. In Lines 323 and 324, the authors claim that the primal-dual approach to policy optimization is known to give worse results than in the unconstrained version, even for the discounted setting. Could this be related to the fact that the method is policy, rather than value based? In the offline setting, [1] takes a value-based primal-dual approach and their result implicitly highlights that the constrained setting may not be worse off than the unconstrained setting.

  3. Can the authors kindly elaborate on the bias of single trajectory-based estimations in this setting? Reference [13] in the paper seems to achieve better performance even with this biased estimation strategy.

  4. How strong is Assumption 5?

[1] Hong, K., & Tewari, A. (2024). A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Low-Rank MDPs. arXiv preprint arXiv:2402.04493.

Comments

  1. Line 265 makes reference to equation 21 twice rather than the expressions after lines 263 and 264.
  2. Expectation is sometimes expressed without parentheses. For example, see Equations 20, 22, as well as Lines 265, 266, etc.

Limitations

None. This is a purely theory paper.

Author Response

Response to Question 1: Note that a constrained MDP can be considered as an unconstrained MDP where the (equivalent) reward function is $r+\lambda c$ and the optimization is performed over all policies $\pi\in\Pi$ and $\lambda\geq 0$. In other words, the original constrained optimization is equivalent to the following primal formulation: $\max_{\pi\in\Pi}\min_{\lambda\geq 0} J^{\pi}_{r+\lambda c}$. Note that the solution to this max-min problem, $\pi^*$, is precisely the solution to the original CMDP. Therefore, it is natural that $\epsilon_{\mathrm{bias}}$, in this case, is defined via $\pi^*$ and $A_{\mathrm{L},\lambda}$, where $A_{\mathrm{L},\lambda}=A_r+\lambda A_c$ denotes the advantage function corresponding to $r+\lambda c$. A similar definition of $\epsilon_{\mathrm{bias}}$ can also be found in earlier works (e.g., see Assumption 4 in ref. [4]).
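For concreteness, the transferred compatible function approximation error typically takes a form such as the following (a sketch of the usual definition from the policy-gradient literature; the paper's exact weighting distribution and normalization may differ):

$$
\mathbb{E}_{(s,a)\sim \nu^{\pi^*}}\Bigl[\bigl(A^{\pi_\theta}_{\mathrm{L},\lambda}(s,a) - (\omega^{*})^{\top}\nabla_\theta \log \pi_\theta(a|s)\bigr)^{2}\Bigr] \;\leq\; \epsilon_{\mathrm{bias}},
$$

where $\omega^{*}$ minimizes the same squared error under the occupancy measure of $\pi_\theta$ itself, and $\nu^{\pi^*}$ is the occupancy measure of the comparator policy $\pi^*$.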

Response to Questions 2 and 3: The remark that the primal-dual approach performs worse than the unconstrained setup was mainly targeted at general parameterization. Table 1 in the paper shows that a similar conclusion does not hold for either tabular or linear MDPs. The paper by Hong and Tewari cited by the reviewer assumes a linear MDP setup and thus is not a focus of this discussion. To understand why a gap arises between the constrained and unconstrained cases under general parameterization, one can look at Lemma 6. An extra $\beta$ term (dual learning rate) appears in the convergence of the first-order stationarity result, which is accompanied by an $\mathcal{O}(\beta^{-1})$ term during the segregation of the objective optimality gap and the constraint violation rate (Theorem 1). Note that the second $\mathcal{O}(\beta^{-1})$ term is absent in the unconstrained case, which allows us to choose $\beta = 0$ and obtain a convergence rate of $\tilde{\mathcal{O}}(T^{-1/4})$ as in [13]. However, such a choice is infeasible for the constrained case, leading to a worsening of the final result.
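Purely as an illustration of this mechanism (the exponents and constants below are not the paper's; they only show why an $\mathcal{O}(\beta^{-1})$ term forces a non-zero dual step size and hence a slower rate):

$$
\text{bound} \;\lesssim\; \beta + \frac{\epsilon_K}{\beta}
\quad\Longrightarrow\quad
\beta^{\star} \sim \sqrt{\epsilon_K}, \qquad \text{bound} \;\lesssim\; \sqrt{\epsilon_K},
$$

so one pays roughly the square root of the unconstrained error $\epsilon_K$ instead of $\epsilon_K$ itself; in the unconstrained case the $\epsilon_K/\beta$ term is absent and $\beta = 0$ is admissible.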

Response to Question 4: Assumption 5 is also common in the policy gradient analysis literature (Liu et al., 2020) and is not a very restrictive assumption. We corroborate this claim by quoting Liu et al. (2020), where this assumption is Assumption 2.1:

"This is a common (and minimal) requirement for the convergence of preconditioned algorithms in both convex and nonconvex settings in the optimization realm, for example, the quasi-Newton algorithms [40, 41, 42, 43], and their stochastic variants [44, 45, 46, 47, 36]. In the RL realm, one common example of policy parametrizations that can satisfy this assumption is the Gaussian policy [2, 48, 19, 21], where $\pi_\theta(\cdot|s) = N(\mu_\theta(s), \Sigma)$ with mean parametrized linearly as $\mu_\theta(s) = \phi(s)^T\theta$, where $\phi(s)$ denotes some feature matrix of proper dimensions, $\theta$ is the coefficient vector, and $\Sigma > 0$ is some fixed covariance matrix. In this case, the Fisher information matrix at each $s$ becomes $\phi(s)\Sigma^{-1}\phi(s)^T$, independent of $\theta$, and is uniformly lower bounded (positive definite sense) if $\phi(s)$ is full-row-rank, namely, the features expanded by $\theta$ are linearly independent, which is a common requirement for linear function approximation settings [49, 50, 51].

For $\mu_\theta(s)$ being nonlinear functions of $\theta$, e.g., neural networks, the positive definiteness can still be satisfied, if the Jacobian of $\mu_\theta(s)$ at all $\theta$ uniformly satisfies the aforementioned conditions of $\phi(s)$ (the Jacobian in the linear case). In addition, beyond Gaussian policies, with the same conditions mentioned above on the feature $\phi(s)$ or the Jacobian of $\mu_\theta(s)$, Assumption 2.1 also holds more generally for any full-rank exponential family parametrization with mean parametrized by $\mu_\theta(s)$, as the Fisher information matrix, in this case, is also positive definite, in place of the covariance matrix $\phi(s)\Sigma^{-1}\phi(s)^T$ in the Gaussian case [11].

Indeed, the Fisher information matrix is positive definite for any regular statistical model [28]. In the pioneering NPG work [24], $F(\theta)$ is directly assumed to be positive definite. So is in the follow-up works on natural actor-critic algorithms [39, 7]. In fact, this way, $F_\rho(\theta)$ will define a valid Riemannian metric on the parameter space, which has been used for interpreting the desired convergence properties of natural gradient methods [3, 32]. In sum, the positive definiteness on the Fisher preconditioning matrix is common and not restrictive."

We will expand on this in the revision to explain Assumption 5.
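As a minimal numerical illustration of the quoted linear-Gaussian argument (the feature dimensions and values below are arbitrary, hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

d_param, d_act = 3, 5                          # hypothetical parameter / action dimensions
phi = rng.standard_normal((d_param, d_act))    # feature matrix phi(s); full row rank almost surely
Sigma = np.eye(d_act)                          # fixed positive-definite covariance

# For a Gaussian policy N(phi(s)^T theta, Sigma), the Fisher information at state s is
# F(s) = phi(s) Sigma^{-1} phi(s)^T, independent of theta.
F = phi @ np.linalg.inv(Sigma) @ phi.T

print("smallest eigenvalue of F:", np.linalg.eigvalsh(F).min())  # > 0 when phi(s) has full row rank
```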

Comment

I thank the authors for the rebuttal.

I am satisfied with their response and have read other reviews and responses.

Comment

Thank you for your feedback. We are pleased to hear that our response addressed your key comments.

Review
Rating: 6

This paper tackles the infinite horizon average reward Constrained MDP setting and proposes an analysis of regret and constraint violation under a general policy parametrization. They devise a primal-dual policy gradient algorithm achieving global optimality while ensuring sublinear bounds on the regret and constraint violation.

Strengths

The paper is interesting and well-written and places itself well in the related literature by filling the gap concerning infinite horizon average reward constrained MDPs under general parametrized policies. They employ techniques similar to those found in the related unconstrained setting but also highlight and overcome the challenges encountered in the constrained, average reward setting. In particular, they take inspiration from existing techniques developed for the discounted setting and adapt them to cope with the average reward MDP.

Weaknesses

From the theoretical perspective, I do not recognize specific weaknesses.

However, I would like to highlight that the authors assume knowledge of the mixing and hitting times of the MDP. I recognize that this is a standard assumption in this type of work; however, I wonder whether this knowledge is necessary to achieve these results, or whether the authors are aware of similar techniques that can provide sublinear guarantees without it. Recently, the work of [1] highlighted some techniques that allow achieving global convergence results without this knowledge. Would it be possible to apply these techniques to this setting as well?

From the simulation perspective, I believe that some experimental results showing the validity of the presented approach would be beneficial.

[1] Patel B. et al., Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles, ICML 2024.

Questions

See the weakness section.

Limitations

N/A

Author Response

Response to Weakness 1: Thank you for suggesting the paper by Patel et al., where the algorithm works without knowledge of $t_{\mathrm{mix}}$. We would like to point out that such a feat is achieved by introducing an additional assumption and weakening the convergence result. In particular, the paper solves the average-reward (unconstrained) MDP problem via an actor-critic approach where the value function of the critic is taken to be a linear function of a feature vector. No such assumption is needed in our paper. Secondly, their approach introduces an additional critic error $T\sqrt{\epsilon_{\mathrm{critic}}}$ in the final regret, in addition to the usual approximation error $T\sqrt{\epsilon_{\mathrm{bias}}}$. Despite these drawbacks, we agree that their approach is worth exploring in the future. However, we believe that if the linear critic assumption in that paper were relaxed, the regret bounds would become even worse (a similar effect has recently been shown in the sample complexity results for discounted MDPs). We are currently not sure about the applicability of this technique to our CMDP problem.

We would also like to mention that the suggested paper became public on arXiv on March 18th for the first time, which is within two months of the NeurIPS 2024 abstract submission deadline, and thus could be considered as concurrent research. We will cite this work in the final version, and suggest extending its approaches to CMDPs as a future work.

Response to Weakness 2: We agree that empirical evaluations will be a nice corroboration of our established results. However, considering this is a theory-oriented paper, our goal is to propose a policy gradient-type algorithm in an infinite horizon average reward setting and establish its theoretical guarantees. A detailed evaluation will be the subject of future work.

Comment

I thank the authors for their detailed responses; I have no further questions for the moment.

Review
Rating: 5

The paper presents a primal-dual policy gradient algorithm for infinite horizon average-reward Constrained Markov Decision Processes with general policy parametrization. It uniquely addresses minimizing regret and managing constraints simultaneously, providing the first analysis showing sub-linear bounds of O(T^{4/5}) for both regret and constraint violations. This approach extends the applicability of reinforcement learning to complex, large-scale decision-making problems under practical constraints.

Strengths

  • The paper is the first to apply a primal-dual policy gradient framework to this particular setting of CMDPs, addressing a gap in the literature.
  • The problem is well-motivated.
  • The paper offers theoretical sub-linear bounds for both regret and constraint violations.

Weaknesses

  • The assumption of ergodicity is quite strong and may limit the applicability of the proposed algorithm in practical scenarios where such conditions are not met. This contrasts with the Upper Confidence Reinforcement Learning (UCRL) framework, which operates under the more relaxed condition of weakly communicating MDPs.
  • The literature review table omits the MDP modeling assumptions. Specifically, the regret bound offered by the proposed algorithm is O(T^{4/5}), while tabular constrained RL algorithms under more general multichain MDP (communicating/weakly communicating) assumptions in [6] achieve O(T^{2/3}). Related facts were not mentioned in the paper.
  • The mixing time, $t_{\mathrm{mix}}$, tends to depend unfavorably on the state space S and action space A, which is reflected in the overall diameter of the MDP.

Questions

  • Can you explain in more detail the dependence of the mixing time on the state space S and action space A? This is critical, especially in large-scale problems. Clarification on how $t_{\mathrm{mix}}$ scales with S and A, and its impact on the algorithm’s performance, would provide a better understanding of the method’s applicability to complex environments.
  • Error propagation for the estimated $t_{\mathrm{mix}}$: Information on the numerical performance of the algorithm, particularly how $t_{\mathrm{mix}}$ influences runtime and convergence in practical scenarios, would substantiate the theoretical claims. Experimental results or simulations that highlight these dynamics could significantly enhance the manuscript.

Limitations

The authors touched upon future work and limitations regarding theoretical complexity.

Author Response

Response to Weakness 1: Note that the framework of UCRL that works for Weakly Communicating (WC) MDPs is a model-based method that cannot be extended to large state spaces. Designing model-free algorithms, especially for Constrained MDPs (CMDPs), is an extremely difficult problem. This is evident from the fact that no model-free algorithms are available in the literature that yield provable guarantees for WC CMDPs. Only a single paper in the literature (ref. [8] in the paper) provides a model-free algorithm that works for ergodic CMDPs. Unfortunately, this paper has a worse $\tilde{\mathcal{O}}(T^{5/6})$ regret compared to our result and adopts the tabular setup. In contrast, our paper provides the first regret bound for ergodic CMDPs with general parameterization (which subsumes the tabular case) and improves the regret to $\tilde{\mathcal{O}}(T^{4/5})$. Extension of our result to the WC case will be considered in the future.

Response to Weakness 2: Thanks for providing the suggestion. We will add an extra column in the table that states the model assumptions used in the associated works for a fairer comparison between the algorithms.

Response to Weakness 3: Intuitively, $t_{\mathrm{mix}}$ indicates how fast the $t$-step transition probability converges to the stationary distribution. To the best of our knowledge, there is no known lower bound on $t_{\mathrm{mix}}$ in terms of $S$ and $A$. Therefore, it is theoretically possible that, even in an MDP with infinitely many states, $t_{\mathrm{mix}}$ could be finite.
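Two standard examples (not from the paper) illustrate both directions of this point: a lazy random walk on a cycle of $n$ states has $t_{\mathrm{mix}} = \Theta(n^2)$, so the mixing time can grow with the state space; conversely, any chain satisfying a Doeblin-type minorization $P(s'|s) \geq \tfrac{1}{2}\mu(s')$ for all $s$ obeys $\|P^t(s,\cdot) - \pi\|_{TV} \leq 2^{-t}$, giving $t_{\mathrm{mix}} = O(1)$ even over a countably infinite state space.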

Response to Question 1: Please refer to the response to Weakness 3 above.

Response to Question 2: We agree that empirical evaluations would help characterize the impact of $t_{\mathrm{mix}}$ estimation on the performance of the algorithm. However, since ours is a theory-oriented paper, our main goal is to establish provable guarantees for the proposed algorithm. A detailed evaluation will be the subject of future work.

Review
Rating: 7

This paper studies learning in constrained MDPs with the long-run average-reward objective. It gives the first sample complexity, regret bound and constraint violation bound in this setting with general parameterization, whereas all prior work is restricted to either tabular or linear parameterizations.

Strengths

The result and analysis look solid. The presentation is also clear.

Weaknesses

1. The assumption that all policies induce an aperiodic, irreducible Markov chain is a bit strong. To what extent can you weaken the assumptions such that the current analysis in the paper still goes through? Is it enough if we only assume the Markov chain is weakly communicating?

2. Claiming that the regret is $O(T^{4/5})$ in Table 1 is a bit misleading, because in the regret and constraint violation bounds in Theorem 2, there are actually terms linear in $T$ whose coefficients depend on the transfer error $\epsilon_{bias}$. I understand that the linear-in-$T$ error terms due to the transfer error are common in the analysis of RL with general parameterizations, and the transfer error is zero when the parameterization is tabular.
However, it is still worth discussing whether the coefficients of the linear-in-$T$ terms in Theorem 2 are tight and optimal, how they compare to the prior work, and whether they are truly negligible when the policy is parameterized by a neural net of a reasonable scale. At the very least, how the regret and constraint violation are actually calculated should be clarified in a prominent part of the paper.

Questions

See my questions in the weaknesses section.

Limitations

There is no negative social impact. There is some discussion of limitations in the conclusion section, but it is not sufficient.

Author Response

Response to Weakness 1: Unfortunately, the proposed algorithm may not be extended to the weakly communicating (WC) setting. In general, it is difficult to propose a model-free algorithm with provable guarantees for Constrained MDPs (CMDPs) without considering the ergodic model. WC CMDPs impose extra challenges in learning compared to the ergodic case. For example, there is no uniform bound for the span of the value function for all stationary policies. It is also unclear how to estimate a policy’s bias function accurately without the estimated model, which is an important step for estimating the policy gradient. The above-stated difficulty is evident from the fact that no model-free algorithm is available in the literature that works for WC CMDPs with provable guarantees. Moreover, only a single model-free algorithm (ref. [8] in the paper) is available for ergodic CMDPs. In other words, all existing algorithms that work for WC CMDPs are model-based in nature. These approaches cannot be extended to the large state space scenario which is the premise of our paper.
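For reference, the bias (relative value) function discussed here solves the average-reward Bellman (Poisson) equation:

$$
J^{\pi} + h^{\pi}(s) \;=\; \mathbb{E}_{a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)}\bigl[r(s,a) + h^{\pi}(s')\bigr],
$$

where $J^{\pi}$ is the long-run average reward and $h^{\pi}$ is determined up to an additive constant; the difficulty mentioned above is that, outside the ergodic case, the span of $h^{\pi}$ is not uniformly bounded over all stationary policies, and estimating $h^{\pi}$ (a key ingredient of the policy gradient) without a model is unclear.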

Response to Weakness 2: Thanks for providing the suggestion. We will modify the table to point out the dependence on the linear term $T\sqrt{\epsilon_{\mathrm{bias}}}$. The transferred compatible function approximation error is a common assumption in policy-based algorithms (see refs. [10], [11] in the paper). From the sample complexity result (Theorem 1), we can see that the proposed algorithm converges to a neighborhood of the global optimum within a distance of $\sqrt{\epsilon_{\mathrm{bias}}}$, which matches the result in the literature.
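Schematically, combining the $\tilde{\mathcal{O}}(T^{4/5})$ rate from the abstract with the $T\sqrt{\epsilon_{\mathrm{bias}}}$ term above, the bounds have the structure (a sketch of the structure only, not the exact constants)

$$
\mathrm{Reg}(T) \;=\; \tilde{\mathcal{O}}\bigl(T^{4/5}\bigr) + \mathcal{O}\bigl(T\sqrt{\epsilon_{\mathrm{bias}}}\bigr),
$$

and similarly for the constraint violation, so the sublinear term dominates only when $\sqrt{\epsilon_{\mathrm{bias}}}$ is small compared to $T^{-1/5}$.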

Comment

I appreciate the authors' informative response to my comments.

Review
Rating: 6

This paper proposes a primal-dual policy gradient method for solving average-reward constrained MDPs. The primal problem is optimizing the usual RL objective with an average reward, plus a penalty-induced constraint violation term. The dual problem is to find an appropriate Lagrangian multiplier that balances the two objectives to reach an equilibrium. The proposed method achieves the claimed regret and constraint violation bounds with general policy parameterization.

Strengths

The paper is well written and its components are clearly explained. The presentation is good and easy to understand. The investigated topic is interesting and important for real-life applications, and the average reward setting is harder to study in general compared to its discounted counterpart due to the absence of contraction of the Bellman operator under the infinity norm. The analysis is easy to follow and plainly laid out; the basic flow of thought is nicely connected. The analysis itself is solid in terms of the assumptions made in the paper.

Weaknesses

There are a few points that limit this paper's technical contribution.

  1. The claimed general parameterization seems to rely on an accurate policy model such as a neural network. In the paper, the authors assume that somehow we can obtain an accurate policy for free. In reality, this is far from the truth. In this sense, the claimed contribution is not of much importance.
  2. Also related to the first point, it seems that for policy evaluation (Algorithm 2), the value functions are still in tabular form. Additionally, this closely follows the policy evaluation method proposed in [17] (referenced in the paper), which seems to be the hardest part in average-reward MDPs. Existing literature shows that policy optimization alone is not so different from the discounted setup. See

Li, Tianjiao, Feiyang Wu, and Guanghui Lan. "Stochastic first-order methods for average-reward markov decision processes." arXiv preprint arXiv:2205.05800 (2022).

  3. There are no numerical experiments to validate the proposed algorithm. It is increasingly important for RL researchers to bridge the gap between theory and practice.

Questions

It seems uniform ergodicity is crucial in the paper. I wonder whether one can relax the assumption so that the MDP only needs to be unichain.

Limitations

Yes

Author Response

Response to Weakness 1: We would like to clarify that we do not assume access to accurate knowledge of the optimal policy since that would contradict our learning objective. We merely assume that the policy belongs to a class where each member is indexed by a parameter $\theta$. The goal is to find the parameter $\theta^*$ that corresponds to the optimal policy. In practice, such a parameterized class can be modeled via a neural network whose weights act as the index parameter. We start with an arbitrary $\theta_0$ (which can be far away from $\theta^*$) and slowly move towards optimality using our proposed primal-dual steps.
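As a minimal sketch of what "indexed by a parameter $\theta$" can mean in practice (a hypothetical two-layer softmax policy; not the architecture used in the paper):

```python
import numpy as np

def softmax_policy(theta, s_feat):
    """pi_theta(. | s): softmax over the outputs of a tiny two-layer network.

    theta: dict of weight matrices W1 (hidden x feature) and W2 (action x hidden);
    together these weights play the role of the index parameter theta.
    """
    h = np.tanh(theta["W1"] @ s_feat)   # hidden representation of the state features
    logits = theta["W2"] @ h            # one logit per action
    z = np.exp(logits - logits.max())   # numerically stable softmax
    return z / z.sum()

# Start from an arbitrary theta_0 and query the action distribution at a random state.
rng = np.random.default_rng(0)
theta0 = {"W1": rng.standard_normal((8, 4)), "W2": rng.standard_normal((3, 8))}
print(softmax_policy(theta0, rng.standard_normal(4)))
```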

Response to Weakness 2: We agree with the reviewer that the policy evaluation part is inspired by [17]. However, we would like to point out some major differences.

(1) The value estimation (Algorithm 2) is not tabular in the sense that it need not be evaluated for all state-action pairs. This routine only needs to be invoked for the pairs that are encountered by Algorithm 1 on the go.

(2) The parameter $H$ in Algorithm 2 differs from that in [17]. With the same parameter choice, our proposed policy gradient algorithm would diverge.

Additionally, the overall approach in our paper differs significantly from that of [17].

(3) In particular, our paper considers a constrained setting, which includes a primal-dual approach that is not included in [17].

(4) Finally, [17] considers a tabular setting which enjoys the strong duality. However, the general parameterization setting considered in our paper does not have the same benefit. The convergence result for the dual problem, therefore, does not automatically translate to that for the primal problem. Our novel introduction of Theorem 1 allows us to separate the objective and constraint violation rates.

Response to Weakness 3: We agree that empirical evaluations will be a nice corroboration of our established results. However, considering this is a theory-oriented paper, our goal is to propose a policy gradient-type algorithm in an infinite horizon average reward setting and establish its theoretical guarantees. A detailed evaluation will be the subject of future work.

Comment

I thank the authors for their response. Although I am not entirely satisfied by the authors' rebuttal, the overall quality of this paper is acceptable.

Final Decision

This paper proposes a primal-dual-based policy gradient method for solving infinite-horizon average-reward CMDPs with general function approximation. The paper proves that the proposed algorithm achieves an $O(T^{4/5})$ bound on the regret and constraint violation.

The reviewers unanimously agree that the paper's contributions merit acceptance. Please incorporate the reviewers' feedback. In particular, addressing the following concerns will help strengthen the paper:

  • Include a fairer comparison between algorithms, especially those that achieve a stronger $O(T^{2/3})$ regret (response to Rev. dyHb)
  • Mention the limitation of the derived results: $t_{\mathrm{mix}}$ is typically unknown and, in some cases, may have a bad dependence on the size of the state space.
  • Include the appropriate reference to [17], and explain why the assumption can/cannot be relaxed to handle unichain/weakly-communicating MDPs.
  • Justification for Assumption 5 (response to Rev. UyCM)