PaperHub

NeurIPS 2024 · Poster
Overall rating: 6.0/10 (4 reviewers; scores 6, 6, 6, 6; min 6, max 6, std 0.0)
Confidence: 3.3 · Soundness: 2.8 · Contribution: 2.8 · Presentation: 3.0

Sample-Efficient Constrained Reinforcement Learning with General Parameterization

Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

New state-of-the-art sample complexity for CMDPs.

Abstract

We consider a constrained Markov Decision Problem (CMDP) where the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm that ensures an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}((1-\gamma)^{-7}\epsilon^{-2})$ sample complexity for general parameterized policies where $\gamma$ denotes the discount factor. This improves the state-of-the-art sample complexity in general parameterized CMDPs by a factor of $\mathcal{O}((1-\gamma)^{-1}\epsilon^{-2})$ and achieves the theoretical lower bound in $\epsilon^{-1}$.
Keywords
Constrained MDP, Sample Complexity, Constraint Violation, Global Optimality.

Reviews and Discussion

Review
Rating: 6

The paper studies constrained Markov decision processes. Using momentum acceleration, the paper develops a primal-dual natural policy gradient-based algorithm and shows that it achieves an $\epsilon$ global optimality gap with $\mathcal{O}(\epsilon^{-3})$ samples. The policy class considered in the paper is the class of generic Fisher non-degenerate policies.

Strengths

  • The paper is the first in the literature to derive an $\mathcal{O}(\epsilon^{-3})$ sample complexity for constrained MDPs under more general policy parameterizations than softmax.
  • The presentation of the theoretical results is clear and easy to follow. I appreciate that the authors present a complete flow of the proof in the main body of the paper.

Weaknesses

I mainly have the following concern, and if it can be addressed, I am willing to adjust my rating. In the literature, [25] derives a sample complexity of $\mathcal{O}(\epsilon^{-2})$ for unconstrained MDPs. It seems to me that it may be possible to translate this result to the constrained setting by adding a primal-dual component, and if this can be done, I think the resulting sample complexity should be $\mathcal{O}(\epsilon^{-3})$ or so. I wonder whether the authors could discuss the feasibility of this generalization (and the underlying challenges if it is not generalizable).

Questions

  • I hope the authors could discuss the concern I raised in the Weaknesses section.
  • Since the paper actually studies Fisher non-degenerate policies, I think a more common (and precise) way is to directly use the name "Fisher non-degenerate policies". Using the name "general parameterization" is a little misleading.

Limitations

The authors have briefly discussed the limitations of the paper in the conclusion section.

Author Response

On the Generalizability of [25]: The approach explored in [25] does not generalize to CMDPs. To understand why, note that one of the preliminary steps in [25] is establishing a descent-like inequality. For example, Lemma 6 in [25] reads as follows.

$$\delta_{t+1} \leq (1-\kappa \gamma_t)\,\delta_t + R_t, \qquad R_t=\mathcal{O}\!\left(\gamma_t\,\mathbf{E}\left[\Vert e_t\Vert\right]+ \gamma_t^2 + \sqrt{\epsilon_{\mathrm{bias}}}\,\gamma_t\right)$$

where $\delta_t = J^* - J(\theta_t)$ is the optimality gap, $e_t$ is the difference between the gradient estimate and its actual value, $\gamma_t$ is a time-dependent learning rate, and $\kappa<1$ is some problem-dependent constant. Next, using recursion and an appropriate choice of $\gamma_t$, the following result is derived (Lemma 12 in [25]).

$$\delta_T \leq \mathcal{O}\!\left(\frac{\delta_0}{(T+1)^2}\right)+ \frac{\sum_{t=0}^{T-1}R_t(t+2)^2}{(T+1)^2}$$

The authors of [25] could prove the convergence of $\delta_T$ up to a factor of $\sqrt{\epsilon_{\mathrm{bias}}}$ precisely because each term in $R_t$ is accompanied by $\gamma_t$, which is chosen to decrease sufficiently fast with $t$. Following Lemma 6 of [25], we can prove the following bound for a primal-dual algorithm.

$$\bar{\delta}^{L}_t \leq (1-\kappa\gamma_t)\,\delta_t^L + R_t^L$$

where $\gamma_t$ is the time-dependent primal learning rate, $\delta_t^L = J_r^* - J_{\mathrm{L}}(\theta_t, \lambda_t)$ is the Lagrange optimality gap, $R_t^L$ has a dependence on $\gamma_t$ similar to that of $R_t$, and $\bar{\delta}_t^L = J_r^* - J_{\mathrm{L}}(\theta_{t+1}, \lambda_t)$. To obtain a recursion on $\delta_t^L$, one can use the following.

$$\delta_{t+1}^L - \bar{\delta}_t^L = J_{\mathrm{L}}(\theta_{t+1}, \lambda_t) - J_{\mathrm{L}}(\theta_{t+1}, \lambda_{t+1}) = (\lambda_t - \lambda_{t+1})\,J_c(\theta_{t+1}) = \mathcal{O}(\zeta_t)$$

where $\zeta_t$ is the time-dependent dual learning rate. Combining the above two results, we have the following recursion.

$$\delta_{t+1}^{L} \leq (1-\kappa\gamma_t)\,\delta_t^{L} + R_t^L + \mathcal{O}(\zeta_t)$$

Note that, unlike $R_t^L$, the term $\zeta_t$ has no dependence on $\gamma_t$. Thus, to establish convergence via the recursion, one must choose $\zeta_t$ to decrease faster than $\mathcal{O}(t^{-1})$. On the other hand, the process of decomposing the optimality-gap and constraint-violation rates from the Lagrange convergence introduces a term of the form $\mathcal{O}(1/(t\zeta_t))$ (see the proof of Theorem 1 in our paper), which requires $\zeta_t$ to decrease slower than $\mathcal{O}(t^{-1})$ to obtain meaningful results. Therefore, any choice of $\zeta_t$ will either disrupt the convergence of the Lagrange function or blow up the constraint-violation rate. This rules out generalizing the approach of [25] to CMDPs via primal-dual algorithms.
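To make the tension concrete, here is a minimal sketch under the illustrative assumption of a polynomial dual rate $\zeta_t = \Theta((t+1)^{-\alpha})$ with $\alpha\in(0,3)$ (the exponent $\alpha$ is our notation, not the paper's):

$$
\frac{\sum_{t=0}^{T-1}\zeta_t\,(t+2)^2}{(T+1)^2}\;\asymp\;T^{\,1-\alpha},
\qquad
\mathcal{O}\!\left(\frac{1}{T\,\zeta_T}\right)\;\asymp\;T^{\,\alpha-1}.
$$

The first term vanishes as $T\to\infty$ only if $\alpha>1$, while the second vanishes only if $\alpha<1$, so no single choice of $\alpha$ drives both error terms to zero.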

On Fisher Non-degenerate Policies: Thank you for this comment. We will modify the terminology in the revised paper accordingly.

Comment

I thank the author for the explanations. I am happy to increase my score to 6 : )

Comment

Thank you for your feedback. We are pleased to hear that our response addressed your key comments. Please feel free to let us know if there are any further comments or concerns that we can address.

Review
Rating: 6

The paper considers the constrained MDP setting. Given a target error of $\epsilon > 0$, the paper provides a primal-dual algorithm that returns an $\epsilon$-optimal policy with at most $\epsilon$ constraint violation. The claim is that the algorithm solves this problem with $\mathcal{O}(\epsilon^{-3})$ sample complexity.

Strengths

The idea of incorporating momentum acceleration into a primal-dual algorithm seems interesting.

Weaknesses

Slater's constant $c_{\mathrm{slater}}$ is missing from the big-O notations in your $H$ and $K$ terms. Slater's constant is a quantity unique to CMDPs and should be clearly stated in your sample complexity. In the relaxed-feasibility problem, where the algorithm returns a policy that is allowed to have some small constraint violation (such as an $\epsilon$ constraint violation), Slater's constant can be smaller than the target error $\epsilon$. In this case, Slater's constant can be set to $\epsilon$. Your paper is in this relaxed-feasibility setting.

By writing Slater's constant $c_{\mathrm{slater}}$ explicitly in the big-O notations, we see that $H = \mathcal{O}((1-\gamma)^{-3} \epsilon^{-1} c_{\mathrm{slater}}^{-1})$ and $K = \mathcal{O}((1-\gamma)^{-4} \epsilon^{-2} c_{\mathrm{slater}}^{-2})$. If Slater's constant equals $\epsilon$, then $H \propto \epsilon^{-2}$ and $K \propto \epsilon^{-4}$, making the total sample complexity proportional to $\epsilon^{-6}$. Then the result is independent of Slater's constant.
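Spelling this out as a worked substitution (taking the $H$ and $K$ expressions quoted above at face value):

$$
c_{\mathrm{slater}}=\epsilon:\qquad
H=\mathcal{O}\!\big((1-\gamma)^{-3}\epsilon^{-2}\big),\quad
K=\mathcal{O}\!\big((1-\gamma)^{-4}\epsilon^{-4}\big),\quad
H\cdot K=\mathcal{O}\!\big((1-\gamma)^{-7}\epsilon^{-6}\big).
$$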

Questions

Your result stated in Lemma 6 is based on a previous result provided by [6]. In [6], the result relies on an oracle that, when given an input, provides a noisy unbiased gradient of a function. However, your algorithm uses a sampling procedure to obtain a gradient estimate, which appears to rely on access to a generative model (Kakade (2003), On the sample complexity of reinforcement learning). Are you considering the setting of having access to a generative model, or are you assuming access to an oracle as in [6], on which your Lemma 6 is based? If you are considering the setting of having access to a generative model, the paper should address the statistical challenges associated with sampling, and the sample complexity should include the number of queries to the simulator, incorporating $T$ into the sample complexity. Therefore, the result from [6] in Lemma 6 does not immediately apply. If you are considering the setting of having access to an oracle, it is not clear why Algorithm 1 is necessary.

Limitations

By stating Slater's constant in the sample complexity, the sample complexity is on the order of $\epsilon^{-6}$ instead of $\epsilon^{-3}$, so as to be independent of Slater's constant. See comments in the Weaknesses section.

Author Response

On Slater's Constant: We would like to clarify that, for a given problem instance, Slater's constant is fixed and outside the control of the learner. On the other hand, the target optimality error $\epsilon$ can be chosen by the learner to be arbitrarily small. Therefore, it would be inappropriate to assign $c_{\mathrm{slater}}=\mathcal{O}(\epsilon)$. Note that, in all previous works on CMDPs, $c_{\mathrm{slater}}$ is treated as a constant rather than as a function of $\epsilon$; e.g., see [1], [2].

On Query Complexity of the Simulator: Our algorithm does not assume access to a gradient oracle. Rather, we assume access to a generative model that yields a sample of the next state $s'$ given the current state-action pair $(s, a)$, following $s'\sim P(\cdot\mid s, a)$. We use this generator to obtain an unbiased sample of the gradient via Algorithm 1. It is easy to check that the expected number of queries to the generator needed to generate one gradient sample is $T=\mathcal{O}((1-\gamma)^{-1})$. We do account for this factor in computing the overall sample complexity of our algorithm. In particular, note that the lengths of the inner and outer loops in Algorithm 2 are $H = \mathcal{O}((1-\gamma)^{-3}\epsilon^{-1})$ and $K=\mathcal{O}((1-\gamma)^{-4}\epsilon^{-2})$, respectively, which leads to an overall sample complexity of $\mathcal{O}(THK) = \mathcal{O}((1-\gamma)^{-8}\epsilon^{-3})$. We shall elaborate on the impact of $T$ on the sample complexity in the revised manuscript.
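The corresponding arithmetic, simply multiplying the three factors stated above:

$$
T\cdot H\cdot K
\;=\;\mathcal{O}\!\big((1-\gamma)^{-1}\big)\cdot\mathcal{O}\!\big((1-\gamma)^{-3}\epsilon^{-1}\big)\cdot\mathcal{O}\!\big((1-\gamma)^{-4}\epsilon^{-2}\big)
\;=\;\mathcal{O}\!\big((1-\gamma)^{-8}\epsilon^{-3}\big).
$$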

References:

[1] Bai, Qinbo, Amrit Singh Bedi, and Vaneet Aggarwal. "Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 6. 2023.

[2] Ding, D., Zhang, K., Basar, T. and Jovanovic, M., 2020. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, pp.8378-8390.

Comment

Thank you for the comment. We would like to clarify the following points.

P1: Theorem 1 says that the expected objective optimality gap and the constraint violation are of the following form:

$$\mathrm{OptimalityGap}\leq \sqrt{\epsilon_{\mathrm{bias}}} + \epsilon, \qquad \mathrm{ConstraintViolation} \leq (1-\gamma)\,c_{\mathrm{slater}} \sqrt{\epsilon_{\mathrm{bias}}} + \epsilon \overset{(a)}{\leq} \sqrt{\epsilon_{\mathrm{bias}}} + \epsilon,$$

where inequality (a) follows from the fact that $c_{\mathrm{slater}}\in (0, \frac{1}{1-\gamma}]$ (see Assumption 1). Note that both $\epsilon_{\mathrm{bias}}$ and $c_{\mathrm{slater}}$ are problem-dependent constants and have no functional dependence on $\epsilon$. We believe that our current phrasing "our algorithm achieves $\epsilon$ optimality gap and $\epsilon$ constraint violation" is the source of this confusion. If we instead modify it to "our algorithm achieves $\epsilon$ optimality gap and $\epsilon$ constraint violation up to a factor of $\sqrt{\epsilon_{\mathrm{bias}}}$", the impacts of the problem-dependent constants and of $\epsilon$ are acknowledged separately, and one need not force them to be of the same order. It also emphasizes that, due to the incompleteness of the parameterized policy class, i.e., $\epsilon_{\mathrm{bias}}>0$, it is impossible to get arbitrarily close to the optimal point, no matter how small $\epsilon$ is chosen. This is one of the well-known disadvantages of general parameterization compared to complete policy classes such as the direct tabular parameterization (where $\epsilon_{\mathrm{bias}}=0$).

P2: We also acknowledge that the choices of $H$ and $K$, and hence the overall sample complexity, depend on $c_{\mathrm{slater}}$. Currently, this dependence is shown only in the appendix. We will explicitly state it in the main text (Theorem 1) and make it visible to the readers.

P3: To make a fair comparison, we will modify Table 1 to show the impact of $c_{\mathrm{slater}}$ on the sample complexities provided by other existing works (where such explicit dependence is available in the respective papers).

Comment

To clarify, my intended point is as follows. As you mentioned, $\mathrm{ConstraintViolation} \leq \sqrt{\epsilon_{\mathrm{bias}}} + \epsilon$. Slater's constant can be much smaller than this suboptimality bound of $\sqrt{\epsilon_{\mathrm{bias}}} + \epsilon$, in which case it is reasonable to set $c_{\mathrm{slater}} = \epsilon$. The sample complexity then becomes independent of Slater's constant, but it would be $\propto \epsilon^{-6}$, since both $H$ and $K$ also depend on Slater's constant. It is this distinction that I would like to emphasize.

Comment

Thanks a lot for this. We will provide the explicit dependence of the result on Slater's constant in the final version, as suggested. That way, the result can be read in different asymptotic regimes, including when $\epsilon$ is much smaller than Slater's constant, or vice versa.

Comment

I thank the author for addressing my concern. I will increase the score to 6.

Comment

The problem that you are solving is relaxed feasibility, where the returned policy, parameterized by $\theta_0,\ldots,\theta_{K-1}$, is allowed to make some constraint violation, i.e., $\mathbf{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} J^{\pi^*}_{c,\rho} - J_{c,\rho}(\theta_k)\right] \leq \mathcal{O}(c_{\mathrm{slater}}) + \epsilon$ (this is stated in your Theorem 1). In this case, the target error $\epsilon$ can be chosen to be large, and because you are allowing some constraint violation, it is appropriate to set $c_{\mathrm{slater}}$ to $\epsilon$. In doing so, the bound becomes independent of Slater's constant, but it inflates to $\epsilon^{-6}$. The effect of Slater's constant should be reflected in the bound, not treated as a constant. If other papers treat Slater's constant as a constant, that does not mean it is appropriate to do so. Your $H$ and $K$ are missing Slater's constant and ignore its effect. Having Slater's constant also differentiates CMDPs from MDPs; for more details, please see Vaswani, S., Yang, L., and Szepesvári, C. (2022). Near-optimal sample complexity bounds for constrained MDPs. I ask the authors to state these distinctions more clearly.

Review
Rating: 6

The paper studies the sample complexity of constrained Markov decision processes (CMDPs) and derives an algorithm that attains an $\mathcal{O}(\epsilon^{-3})$ sample complexity, improving the SOTA bound for CMDPs by a factor of $\mathcal{O}(\epsilon^{-1})$.

Strengths

CMDPs with general parameterization are a very challenging problem, and thus there have not been many papers making progress in this setting compared to the tabular CMDP setting. The authors modify the classic primal-dual natural policy gradient (NPG) framework by leveraging an accelerated stochastic gradient descent (ASGD) method to improve the SOTA sample complexity bound by a factor of $\mathcal{O}(\epsilon^{-1})$.

Weaknesses

Even though the authors claim they have improved the SOTA sample complexity bound for CMDPs with general parameterization, I still think it is hard to say the authors have "beaten" the previous results. $\epsilon$ is not the only factor that matters in the sample complexity bound for discounted infinite-horizon MDPs; $\gamma$ is another significant factor. Unfortunately, it seems that the authors' bound has a dependence of $\mathcal{O}((1-\gamma)^{-8})$, which makes the sample complexity bound read $\mathcal{O}\left(\frac{1}{(1-\gamma)^{8}\epsilon^{3}}\right)$. Compared to a prior work [1], which has a sample complexity bound of $\mathcal{O}\left(\frac{1}{(1-\gamma)^{6}\epsilon^{4}}\right)$, it is clear that the two bounds are not directly comparable, since one is superior in terms of the order of $\epsilon$ and the other in terms of the order of $\gamma$.

Also, the algorithm of [1] has zero constraint violation, while the authors' algorithm permits $\mathcal{O}(\epsilon)$ constraint violation, which is a worse result.

Hence, unless the authors can clarify that their result improves in $\epsilon$, $\gamma$, and constraint violation (or at least provide stronger justification that their algorithm requires fewer samples to achieve near-optimality), it is hard to evaluate the value of the authors' result.
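To see the trade-off concretely (a worked comparison taking the two quoted rates at face value):

$$
\frac{1}{(1-\gamma)^{8}\epsilon^{3}}\;\leq\;\frac{1}{(1-\gamma)^{6}\epsilon^{4}}
\quad\Longleftrightarrow\quad
\epsilon\;\leq\;(1-\gamma)^{2},
$$

so the present bound is tighter in the small-$\epsilon$ regime, while the quoted bound of [1] is tighter when $\epsilon$ exceeds $(1-\gamma)^{2}$.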

[1] Bai, Qinbo, Amrit Singh Bedi, and Vaneet Aggarwal. "Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 6. 2023.

Questions

As mentioned, the major weakness of the paper is that you have overlooked the sample complexity bound's dependence on $\gamma$, and it seems you have a worse constraint-violation result. Can you update Table 1 with sample complexity bounds showing the dependence on $\gamma$ and the constraint violations? More importantly, can you justify why your result has seemingly worse dependence on $\gamma$ and worse constraint violation?

Limitations

The major limitation of the paper is the lack of proper comparison of the sample complexity bounds with prior works.

Author Response

Fairer Comparison of Sample Complexities: We would like to point out that the sample complexity of $\tilde{\mathcal{O}}((1-\gamma)^{-6}\epsilon^{-4})$ reported by [1] in their AAAI version is erroneous. The authors have subsequently corrected their result, and the sample complexity has been revised to $\tilde{\mathcal{O}}((1-\gamma)^{-8}\epsilon^{-4})$ in the arXiv version (updated May 2024). Please see Table 1 in [2] to verify our claim. Therefore, in comparison to the state-of-the-art (SOTA), we improve the sample complexity in terms of the optimality gap $\epsilon$ without sacrificing powers of $(1-\gamma)$. We also note from Table 1 in [2] that other existing results on general parameterized CMDPs are worse than ours both in terms of $\epsilon$ and $(1-\gamma)$. Following the reviewer's suggestion, we will revise the comparison table in our manuscript to exhibit the dependence of the sample complexity on both $\epsilon$ and $(1-\gamma)$. The updated table is provided below for quick reference.

| Algorithm | Sample Complexity | Parameterization |
| --- | --- | --- |
| PMD-PD [6] | $\mathcal{O}(\epsilon^{-3})$ | Softmax |
| PD-NAC [7] | $\mathcal{O}(\epsilon^{-6})$ | Softmax |
| NPG-PD [5] | $\mathcal{O}((1-\gamma)^{-5}\epsilon^{-2})$ | Softmax |
| CRPO [4] | $\mathcal{O}((1-\gamma)^{-7}\epsilon^{-4})$ | Softmax |
| NPG-PD [5] | $\mathcal{O}((1-\gamma)^{-8}\epsilon^{-6})$ | General |
| CRPO [4] | $\mathcal{O}((1-\gamma)^{-13}\epsilon^{-6})$ | General |
| C-NPG-PDA [2] | $\tilde{\mathcal{O}}((1-\gamma)^{-8}\epsilon^{-4})$ | General |
| PD-ANPG (This Work) | $\tilde{\mathcal{O}}((1-\gamma)^{-8}\epsilon^{-3})$ | General |
| Lower Bound [8] | $\Omega((1-\gamma)^{-5}\epsilon^{-2})$ | — |

The dependence of the sample complexities of PMD-PD and PD-NAC on $\gamma$ is not reported in the corresponding papers.

Zero Constraint Violation: Transforming an $\epsilon$ constraint violation (CV) result into a zero CV result is a relatively straightforward exercise that is routinely employed in the literature. The main idea is to choose a slightly modified cost function $c'=c-(1-\gamma)\delta$ and demonstrate an $\epsilon$ CV result for $c'$. This can be done by following the same procedure stated in our paper for producing the analogous result for $c$. Observe that $J^{\pi}_{c'}=J^{\pi}_c-\delta$ for any policy $\pi$. Therefore, by choosing an appropriate value of $\delta$, one can show that the obtained $\epsilon$ CV result for $c'$ implies a zero CV result for $c$. Application of this "conservative constraint" technique is abundant in the CMDP literature, e.g., see [1], [3]. In the revised manuscript, we will provide an outline explaining how to derive the zero CV result.
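For concreteness, a minimal sketch of the cost-shift arithmetic (assuming, purely for illustration, that the constraint is normalized as $J^{\pi}_{c}\geq 0$):

$$
J^{\pi}_{c'} \;=\; \mathbf{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\big(c(s_t,a_t)-(1-\gamma)\delta\big)\right] \;=\; J^{\pi}_{c}-\delta,
$$

so if running the algorithm on $c'$ guarantees $J_{c'}\geq -\epsilon$ (an $\epsilon$ CV result for $c'$), then choosing $\delta=\epsilon$ gives $J_{c}=J_{c'}+\delta\geq 0$, i.e., zero constraint violation for the original cost $c$.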

References:

[1] Bai, Qinbo, Amrit Singh Bedi, and Vaneet Aggarwal. "Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 6. 2023.

[2] https://arxiv.org/pdf/2206.05850

[3] Ding, D., Wei, C.Y., Zhang, K. and Ribeiro, A., 2024. Last-iterate convergent policy gradient primal-dual methods for constrained MDPs. Advances in Neural Information Processing Systems, 36.

[4] Xu, T., Liang, Y. and Lan, G., 2021, July. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning (pp. 11480-11491). PMLR.

[5] Ding, D., Zhang, K., Basar, T. and Jovanovic, M., 2020. Natural policy gradient primal-dual method for constrained Markov decision processes. Advances in Neural Information Processing Systems, 33, pp.8378-8390.

[6] Liu, T., Zhou, R., Kalathil, D., Kumar, P.R. and Tian, C., 2021. Policy optimization for constrained MDPs with provable fast global convergence. arXiv preprint arXiv:2111.00552.

[7] Zeng, S., Doan, T.T. and Romberg, J., 2022, December. Finite-time complexity of online primal-dual natural actor-critic algorithm for constrained Markov decision processes. In 2022 IEEE 61st Conference on Decision and Control (CDC) (pp. 4028-4033). IEEE.

[8] Vaswani, S., Yang, L. and Szepesvári, C., 2022. Near-optimal sample complexity bounds for constrained MDPs. Advances in Neural Information Processing Systems, 35, pp.3110-3122.

Comment

Thank you for your explanations, which address my concerns completely. I have raised my score to 6.

Comment

Thank you for your feedback. We are pleased to hear that our response addressed your comments.

Review
Rating: 6

This paper studies the sample complexity of learning a discounted-reward MDP with a discounted cost constraint under a general policy parameterization. It proposes a policy-gradient-type algorithm that combines natural policy gradient with an accelerated stochastic gradient descent algorithm proposed in [6].

The paper then proves that with $\mathcal{O}(1/\epsilon^3)$ samples, the policy learned by this algorithm violates the constraint by at most $\epsilon$ and is optimal up to $\epsilon_{\mathrm{bias}} + \epsilon$ error, where $\epsilon_{\mathrm{bias}}$ is a constant coming from the policy parameterization.

This sample complexity improves upon the prior SOTA, which was $\mathcal{O}(1/\epsilon^4)$.

Strengths

The paper is well-written. The main contribution, the algorithm, the key idea of the proof and the proof structure are clearly presented.

The improvement of the sample complexity from $\mathcal{O}(1/\epsilon^4)$ to $\mathcal{O}(1/\epsilon^3)$ is non-trivial, following from a clever observation given in Lemma 3.

The analysis looks solid, though I do have questions about a few minor details; see Questions 1 and 2 below.

Weaknesses

1. The term "sample-efficient" in the title seems to overclaim the contribution of the paper, since there is still an order gap between the sample complexity of the algorithm in this paper, $\mathcal{O}(\epsilon^{-3})$, and the lower bound, $\Omega(\epsilon^{-2})$.

2. It is not entirely accurate to say that the algorithm achieves an "$\epsilon$ global optimality gap" and "$\epsilon$ constraint violation", as the actual bounds in Theorem 1 contain constant terms involving $\epsilon_{\mathrm{bias}}$. Although $\epsilon_{\mathrm{bias}}$ is common in the analysis of RL algorithms with general parameterizations, it is crucial to discuss whether these terms are tight and optimal, how they compare to prior work, and whether they are truly negligible when the policy is parameterized by a neural net of a reasonable scale. At the very least, the sense in which your policy is $\epsilon$-optimal should be clarified in a prominent part of the paper.

Questions

1. I believe the definition of $L$ in (9) has a typo: the square should be inside the expectation rather than outside, so that its gradient with respect to $w$ is given by (10). Does this typo affect the proof?

2. The logic from (75) to (76) is not very clear. How did you get rid of the first term on the LHS of (76), and where does $\lambda^*$ come from?

Limitations

There is no negative social impact. A discussion of limitation is included in the submission.

Author Response

Response to Weakness 1: The term "sample-efficient" simply acknowledges that the sample complexity of our algorithm is better than that of all existing algorithms in the general parameterized CMDP setup. Note that we refrain from using the term "sample-optimal", since there is still a gap between our bound of $\mathcal{O}(\epsilon^{-3})$ and the theoretical lower bound of $\Omega(\epsilon^{-2})$.

Response to Weakness 2: Thank you for pointing out the imprecise use of the phrases "$\epsilon$ optimality gap" and "$\epsilon$ constraint violation". In the revised paper, we will change these phrases to emphasize the presence of the $\epsilon_{\mathrm{bias}}$ term. Moreover, we will also discuss the precise meaning of optimality in the introduction of our paper.

Response to Question 1: We apologize for the confusion. The notation $\mathbf{E}[\cdot]^2$ in (9) was intended to denote $\mathbf{E}[(\cdot)^2]$. We will add parentheses in the revised manuscript to make the notation unambiguous.

Response to Question 2: The equations (75)-(76) use the fact that the average of occupancy measures is itself an occupancy measure. Therefore, the average of the occupancy measures corresponding to the policies $\{\pi_{\theta_k}\}_{k=1}^K$ is the occupancy measure of some policy, say $\bar{\pi}$. Since the value function $J_g^{\pi}$ can be written as a linear function of the occupancy measure of the policy $\pi$, the transition from (75) to (76) is immediate. Moreover, the optimal dual variable $\lambda^*$ enters the picture due to the use of Lemma 10. We shall expand these arguments in the revised manuscript.
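A brief sketch of the linearity argument, in illustrative notation (here $\nu^{\pi}_{\rho}$ denotes the discounted state-action occupancy measure under initial distribution $\rho$, and $g$ stands for the reward or cost function; these symbols are ours, not necessarily the paper's):

$$
J_g^{\pi} \;=\; \frac{1}{1-\gamma}\sum_{s,a}\nu^{\pi}_{\rho}(s,a)\,g(s,a)
\quad\Longrightarrow\quad
\frac{1}{K}\sum_{k=1}^{K} J_g^{\pi_{\theta_k}}
\;=\;\frac{1}{1-\gamma}\sum_{s,a}\bar{\nu}(s,a)\,g(s,a)
\;=\; J_g^{\bar{\pi}},
\qquad \bar{\nu}:=\frac{1}{K}\sum_{k=1}^{K}\nu^{\pi_{\theta_k}}_{\rho},
$$

where $\bar{\pi}$ is any policy whose occupancy measure equals $\bar{\nu}$; such a policy exists because the set of discounted occupancy measures is convex.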

Comment

I thank the authors for answering all of my questions.

Final Decision

This paper introduces a novel algorithm called Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) for Constrained Markov Decision Processes (CMDPs). The algorithm achieves state-of-the-art results in terms of sample complexity, proving that it can reach an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\mathcal{O}(\epsilon^{-3})$ samples.

Four reviewers reviewed the paper, unanimously agreeing to accept it. They acknowledged the algorithm's innovation and effectiveness, particularly highlighting its improvements in sample complexity. However, some concerns were raised, including the sample complexity's dependence on the discount factor and Slater's constant, and the need for more detailed comparisons with previous work.

Overall, the paper presents a promising algorithm that makes progress in sample complexity for CMDPs. However, the authors need to further clarify some details and provide more comprehensive comparisons with existing works.