A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach
Order-optimal global convergence rate for an actor-critic approach in average reward RL with general policy parametrization
Abstract
Reviews and Discussion
This paper investigates average-reward reinforcement learning with general policy parametrization. The authors propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm and provide a finite-sample analysis.
Update after rebuttal:
The authors have satisfactorily addressed all of my concerns in the rebuttal, including the clarification regarding the single-loop versus double-loop structure. Based on their responses, I am increasing my score to 3.
Questions for Authors
My main concern is about the high-level idea. Typically, addressing Markovian noise requires either a projection step or the assumption of a finite action space. However, it is unclear to me how the authors handle Markovian noise without either of these approaches. Could the authors provide a high-level explanation of their methodology?
Besides, some other concerns are:
-
To establish Theorem 1, the objective function $J(\theta)$ must be Lipschitz w.r.t. $\theta$. Can this property be directly inferred from existing assumptions? If not, this constitutes an additional assumption, and the authors should justify its reasonableness. Additionally, what does the Lipschitz constant depend on?
-
What step size is used in Theorem 1? Is it a constant or a decaying step size? Furthermore, how does the step size scale with the problem parameters and the mixing time? This is a crucial aspect and should be explicitly stated in or before the theorem.
-
What are the definitions of the two quantities appearing in Eq. (35)? Can they take arbitrary values?
-
The proposed algorithm operates in a two-timescale framework, as is evident from Algorithm 1, where multiple critic updates (line 12) occur before a single actor update (line 27). Two-timescale methods are generally easier to analyze since the outer loop can proceed after the inner loop converges. However, prior works such as NAC-CFA and MAC already focus on single-timescale algorithms, which are more commonly used in practice. In the comparison table, I suggest adding a column indicating whether each method follows a single- or two-timescale approach.
-
Given the previous point, the novelty of this work may be questionable unless the authors can demonstrate that the sample complexity analysis remains valid for a single-timescale version of Algorithm 1.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes
Experimental Design and Analysis
N/A
Supplementary Material
I read the supplementary material but did not carefully check the math.
Relation to Existing Literature
N/A
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths: The authors propose a Multi-level Monte Carlo-based Natural Actor-Critic algorithm that achieves an order-optimal global convergence rate. The algorithm eliminates the need for knowledge of mixing times. In addition, the finite-sample analysis provided in the paper extends to infinite state and action spaces.
Weaknesses: Please refer to the Questions section.
Other Comments or Suggestions
N/A
Markovian Noise: We agree that applying projections or invoking the assumption of finite action space is common in the literature. For example, the MAC algorithm derives bounds using MLMC estimates from (Dorfman and Levy 2022), which assume a finite state space and a bounded domain, thereby necessitating a projection step in the critic update. Our analysis models the NPG and critic updates as stochastic linear recursions with Markovian noise, which we show to converge without invoking the above preconditions. A similar result can be found in (Beznosikov et al. 2023). However, their update differs from ours since we deal with biased estimates, which are not considered in the cited paper.
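Schematically, the recursions we analyze take the following form (generic notation; see the paper for the precise definitions of the matrices and vectors involved): both the critic and the NPG iterates evolve as
$$x_{t+1} \;=\; x_t \;-\; \alpha\big(\bar{A}(o_t)\,x_t - \bar{b}(o_t)\big),$$
where $o_t$ is an observation drawn along a single Markovian trajectory, and the estimates $\bar{A}(o_t)$, $\bar{b}(o_t)$ are biased in the sense that their conditional expectations differ from the corresponding stationary quantities. Our analysis bounds the error of such a recursion directly, without projecting $x_t$ onto a bounded set.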
(Q1) Lipschitz Property: Lipschitz continuity and smoothness of the objective function $J(\theta)$ can be easily proven within the framework of discounted-reward MDPs (Lemma 2, Mondal and Aggarwal 2024). In this case, the Lipschitz constant depends only on the discount factor. However, to the best of our knowledge, no such result is known in the average-reward case. We want to emphasize that nearly all papers analyzing general parameterized policy gradient methods for average-reward MDPs assume the stated properties of $J$. These properties are expected to hold in practice since they ensure that a slight change in the parameter does not result in a large change in the objective. In our assumption, the Lipschitz constant does not depend on other model-specific parameters.
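For reference, the properties under discussion take the following standard form (stated here in generic notation with constants $L$ and $L_g$; see the assumption in the paper for the precise constants):
$$|J(\theta_1) - J(\theta_2)| \;\le\; L\,\|\theta_1 - \theta_2\|, \qquad \|\nabla_\theta J(\theta_1) - \nabla_\theta J(\theta_2)\| \;\le\; L_g\,\|\theta_1 - \theta_2\| \quad \text{for all } \theta_1, \theta_2,$$
i.e., Lipschitz continuity and smoothness of the average-reward objective $J$ in the policy parameter.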
(Q2) Step Sizes: The values of the two step sizes are given in Theorems 3 and 4, respectively, and the value of the remaining parameter is stated at the top of Page 20, line 1045. Observe that none of these parameters depend on the mixing time, which is consistent with our claim that our algorithm does not require knowledge of it. In the revised version, we will explicitly state the values of these parameters in Theorem 1.
(Q3) Regarding the quantities in Eq. (35): The two terms in (35) are part of Theorem 2, which analyzes the convergence of a general stochastic linear recursion (SLR). Specifically, Eq. (35) forms a precondition for Theorem 2. We recognize that both the NPG and critic updates can be formulated as SLRs. Moreover, these updates satisfy a precondition similar to (35). In particular, (53) and (54) allow us to choose these quantities for the critic update, while (55) dictates their values for the NPG update. Note that, although one of these quantities for the NPG update depends on the mixing time, it is not used as an input to our algorithm, thereby maintaining the claim that our algorithm does not require knowledge of the mixing time.
(Q4) On the Two-Timescale Approach: We emphasize that no algorithm (be it single- or two-timescale) exists in the literature that achieves the order-optimal convergence rate for average-reward MDPs with general parameterized policies via Markovian sampling. Our natural actor-critic-based algorithm achieves this goal, and that itself is a contribution worth highlighting.
Secondly, the algorithms mentioned by the reviewer are not single-timescale. We note that the NAC-CFA paper (https://arxiv.org/pdf/2406.01762) mentions (page 5) that "Here, the AC and NAC algorithms in Algorithm 1 are single-loop, single sample trajectory and two timescale." Similarly, the MAC algorithm is also two-timescale, since its critic loop uses an order-larger step size. Thus, both of the mentioned algorithms are two-timescale. We will mention this in the paper and list designing a single-timescale algorithm as future work.
(Q5) Novelty: Our main contribution is designing an algorithm for average-reward MDPs that achieves the order-optimal rate for general parameterized policies with Markovian sampling. This is a contribution worth highlighting because no algorithm, be it single- or two-timescale, achieves this feat. Moreover, as previously pointed out, all known actor-critic algorithms for average-reward MDPs with general parameterized policies are, to the best of our knowledge, two-timescale. In summary, we respectfully disagree with the reviewer's stance that not being a single-timescale algorithm is a good enough reason to discount the merit of our work.
The optimization literature teaches us that single-timescale algorithms are typically developed based on the intuitions provided by two-timescale algorithms. We are, therefore, hopeful that the intuitions provided by our algorithm will pave the way to designing the first single-timescale actor-critic algorithm for average-reward MDPs with general parameterized policies. We will mention this as future work in the revised paper.
Thank you for the authors' response. The clarifications on Markovian noise, the Lipschitz and smoothness property, and the definitions of the quantities in Eq. (35) make sense to me. I encourage the authors to explicitly include the clarification on the Lipschitz and smoothness properties in the paper. Additionally, I have noticed a slight difference in previous works regarding this assumption, as PHPG assumes the smoothness of a related quantity rather than of the objective itself. I believe this distinction is worth highlighting in the paper as well.
Regarding the distinction between single- and two-timescale algorithms, I apologize for the confusion in my initial comment. My intent was to distinguish between single-loop and double-loop algorithms rather than single- and two-timescale methods. It is evident that the algorithm proposed in this paper follows a single-loop structure, as multiple critic updates occur before a single actor update. However, I believe that comparing the analysis of single-loop and double-loop algorithms may not be entirely fair, as in double-loop algorithms, the outer loop proceeds only after the inner loop has converged, which simplifies the theoretical analysis. This distinction should be explicitly noted in the comparison table for clarity.
Thank you for the clarification. We will definitely incorporate the suggested discussion on the smoothness assumption in the revised paper. Regarding the algorithmic novelty, our point still stands. There exists no algorithm in the literature (be it single- or double-loop) that achieves the order-optimal global convergence rate with general parameterized policies via Markovian sampling. Our proposed algorithm accomplishes this feat, making it a significant contribution worth highlighting.
Additionally, the paper https://openreview.net/pdf?id=jh3UNSQK0l on single-timescale Actor-Critic (AC) highlights that single-loop two-timescale AC still shares the same drawbacks as double-loop AC. Specifically, even with a single loop, the paper notes that such methods may be inefficient in practice ("However, it is still considered inefficient as the actor update is artificially slowed down") and rely on a decoupled analysis of the policy and critic updates ("The two-timescale allows the critic to approximate the desired Q-value in an asymptotic way, which enables a decoupled convergence analysis of the actor and the critic"). That said, we acknowledge that analyzing a double-loop structure might be easier than its single-loop counterpart. We will include this point in the paper and outline the design of a single-loop algorithm that achieves the optimal convergence rate as a direction for future work.
We hope our rebuttal has adequately addressed all of your concerns. Given these clarifications, we would greatly appreciate it if you could reconsider your evaluation and adjust the score accordingly to reflect your updated perception of the paper.
The paper establishes a sample complexity bound for the MLMC-NAC algorithm, claiming to significantly improve the existing result.
A state-space-independent result is not surprising, as NAC (in the exact-gradient case) is also state-space-independent.
Questions for Authors
Question: How can this MLMC-NAC algorithm be used in large-scale problems? Does this paper have any message for practitioners?
Claims and Evidence
Yes, seems so.
Methods and Evaluation Criteria
Yes, seems so.
Theoretical Claims
I didn't verify the proofs.
Experimental Design and Analysis
No.
Supplementary Material
No
Relation to Existing Literature
In my personal opinion, these NAC algorithm guarantees are of theoretical and conceptual interest. I am not sure this flavour of NAC algorithm is used in large-scale problems.
Essential References Not Discussed
The paper doesn't cite [2], which establishes a 1/T convergence rate for the average-reward case (with exact gradients).
This recent work [1] also addresses the convergence of dynamic programming techniques in robust average-reward MDPs and thus should be included in the related-work discussion.
[1] Murthy, Y., Moharrami, M., and Srikant, R. Modified policy iteration for exponential cost risk sensitive MDPs. Learning for Dynamics and Control Conference, pp. 395–406, PMLR, 2023.
[2] Kumar, N., Murthy, Y., Shufaro, I., Levy, K. Y., Srikant, R., and Mannor, S. Global convergence of policy gradient in average reward MDPs. The Thirteenth International Conference on Learning Representations (ICLR), 2025. https://openreview.net/forum?id=2PRpcmJecX
Other Strengths and Weaknesses
Strengths: Good theoretical result.
Other Comments or Suggestions
Suggestion: Presentation can be improved.
(a) Related Works: Thanks for mentioning references [1] and [2]. We will include them in the revised version of our paper. In addition to [1], we will also include some other works on robust MDPs.
(b) Message for Practitioners: Our algorithm is designed for large state and action spaces, with parameterized representations for both the actor and critic, making it well-suited for large-scale problems. While there are multiple practical insights, we highlight a few key ones below.
-
Sampling: Many existing policy gradient-type algorithms assume access to a simulator capable of generating independent trajectories at will from a given state distribution (e.g., see [1], [2]). Typically called i.i.d. sampling, this procedure of generating independent and identically distributed trajectories greatly simplifies algorithm design and analysis. However, for many practical applications, building an accurate simulator is itself a difficult problem. Fortunately, our algorithm works on a single trajectory, eliminating the need for a simulator.
-
Dependence on Unknown Parameters: As stated in the paper, many previous works assume knowledge of mixing time and hitting time [3], [4], which are difficult to obtain for most applications. Our algorithm does not require the knowledge of the above parameters, thereby making it easier to adopt in practice.
-
Memory/Space Complexity: Our algorithm is memory-efficient. One can verify that the memory complexity of our algorithm is linear in the sizes of the policy and critic parameters. It is to be noted that although the Fisher matrix (and its estimates) would require space quadratic in the policy parameter dimension, one only needs to store vector-valued quantities (score vectors and their inner products with the current iterate), which require only linear space.
-
Computational Complexity: One way of computing the natural policy gradient (NPG) is to first obtain an estimate of the Fisher matrix and then directly use its pseudo-inverse to compute the NPG direction. Such a method was adopted in [5]. We, instead, pose the problem of computing the NPG as a stochastic least-squares problem with Markovian noise. This eliminates the computationally expensive process of inverting a regularized Fisher matrix estimate, whose cost grows rapidly (cubically, for standard inversion) with the size of the policy parameter. It can be checked that the per-iteration computational complexity of our algorithm scales only linearly with the sizes of the policy and critic parameters (a minimal sketch of this matrix-free computation is given after this list).
-
Convergence Rate: Finally, the convergence rate of our algorithm is order-optimal in the horizon length. Practically, this indicates that our algorithm takes a relatively small number of training iterations to reach a given accuracy.
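To make the memory and computation points above concrete, below is a minimal illustrative sketch (not the actual implementation used in the paper) of computing the NPG direction as a stochastic least-squares solve using only score vectors, so that the Fisher matrix is never formed or inverted. The score vectors and advantage estimates here are synthetic placeholders.

```python
import numpy as np

def npg_direction(samples, dim, step_size=0.05, n_epochs=5):
    """Estimate the NPG direction w ~ F^{-1} grad J by running SGD on the
    least-squares objective E[(g^T w - adv)^2] / 2, where g is the score
    grad_theta log pi_theta(a|s) and adv is an advantage estimate.  Only
    O(dim) memory is used: the dim x dim Fisher matrix F = E[g g^T] is
    never formed."""
    w = np.zeros(dim)
    for _ in range(n_epochs):
        for g, adv in samples:
            # Stochastic gradient of the least-squares loss: (g^T w - adv) * g
            w -= step_size * (g @ w - adv) * g  # O(dim) work per sample
    return w

# Toy usage with synthetic score vectors and advantages (placeholders).
rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)
data = []
for _ in range(2000):
    g = rng.normal(size=dim)
    data.append((g, float(g @ true_w) + 0.01 * rng.normal()))
print(np.linalg.norm(npg_direction(data, dim) - true_w))  # small residual
```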
[1] Liu, Y., et al., An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33:7624–7636, 2020.
[2] Mondal, W. U. and Aggarwal, V. Improved sample complexity analysis of natural policy gradient algorithm with general parameterization for infinite horizon discounted reward Markov decision processes. International Conference on Artificial Intelligence and Statistics, 2024.
[3] Bai, Q., et al., Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
[4] Ganesh, S. et al., Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite Horizon Average Reward MDPs. In The 28th International Conference on Artificial Intelligence and Statistics, 2025.
[5] Xu, T., et al., Improving sample complexity bounds for (natural) actor-critic algorithms. Advances in Neural Information Processing Systems, 2020.
This paper studies the convergence of the actor-critic algorithm in the average-reward setting of reinforcement learning. The paper proposes a Multi-level Monte Carlo-based natural actor-critic algorithm which achieves global convergence to a neighborhood of the optimal policy at a rate expressed in terms of the horizon length of the sampled trajectory.
Update after rebuttal: I appreciate the authors' effort in the rebuttal. Although some of my concerns have been addressed, the remaining concern is with regard to the trajectory length. Throughout the paper and most of the rebuttal it appears to be $\max\{2^{Q_{kh}}, T_{\max}\}$. However, it was pointed out in the last reply to a rebuttal comment that this was a typo. Various places (lines 6, 11, 23) of Algorithm 1 in the paper use the max operator rather than the min operator. It appears the paper could benefit from greater clarity in conveying the idea.
Questions for Authors
Please see the weakness point.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Proposed method makes sense. However, there are no empirical evaluations in the current version.
Theoretical Claims
The reviewer didn’t check the correctness of the proofs.
Experimental Design and Analysis
NA
Supplementary Material
The reviewer didn’t review the supplementary material.
Relation to Existing Literature
The key contributions relate to the convergence performance of the actor-critic algorithmic framework in RL, especially in the average-reward setting. The paper proposes a model-free approach with a convergence result order-wise comparable to that of model-based approaches.
Essential References Not Discussed
The work lacks a discussion of an important branch of the sample-efficiency literature, for example [A] and references therein. Instead of the convergence rate, which seems to only measure critic estimation and NPG estimation, the sample complexity is a more practical measure as it represents the true resource consumption in terms of samples. In the context of the current paper, this should include the policy update component.
[A] Xu T, Wang Z, Liang Y. Improving sample complexity bounds for (natural) actor-critic algorithms[J]. Advances in Neural Information Processing Systems, 2020, 33: 4358-4369.
Other Strengths and Weaknesses
Strength: The paper studies the challenging average-reward setting and provides a finite-time convergence result. The analysis for this setting is in general non-trivial.
Weakness:
Sample complexity perspective: the convergence result seems to measure the sample trajectory length for each iteration pair. But in fact, the algorithm relies on a large number of trajectories for the entire NAC algorithm to work. Based on Theorem 1, this number is large as well. This potentially poses sample inefficiency. Please elaborate, or correct me if I have misunderstood.
Other Comments or Suggestions
Potential typo: Line 320, left column, left-hand side of the inequality — the subscript appears to be incorrect.
(a) Sample Complexity: We note that there is a misunderstanding here. The convergence rate and the sample complexity are two faces of the same coin, and one can be derived from the other, as long as the error metric is taken to be the same. For example, consider our algorithm, which for each iteration uses a trajectory whose length is determined by a randomly drawn MLMC level together with the truncation parameter $T_{\max}$; the expected length of such a trajectory is easy to bound.
Given the number of actor and critic iterations chosen in the paper, this yields the expected number of state-transition samples used by our algorithm. Theorem 1 exhibits that, for such choices of parameters, our algorithm achieves a global error that decays with the number of samples (up to some additive bias and approximation-error terms). This establishes the convergence rate of our algorithm. Alternatively, if we want to ensure a target global error (up to the additive factors stated before), the required expected number of state-transition samples can be obtained from the same bound. This expresses the same result in terms of sample complexity.
Although both notions are the same, the literature on average-reward MDPs typically expresses results in terms of the convergence rate, while the literature on discounted-reward MDPs typically adopts the sample-complexity framework. The cited paper [A] analyzes a discounted-reward setup, where the common metric is the sample complexity. The same article has also been cited in our paper (line 118), where we mention its global convergence rate; this is obtained directly from their equivalent sample complexity result. It is to be noted that even in the discounted-MDP setup, there is no actor-critic algorithm that achieves the corresponding order-optimal sample complexity for general parameterization.
(b) Typo: Thanks for pointing out the typo. We will correct it in the revised version.
I would like to thank the authors for the response.
-
However, the order-wise result in the response should be $\mathbb{E}\big[\max\{2^{Q_{kh}}, T_{\max}\}\big] \ge T_{\max} = \Theta(T_{\max})$ instead of the stated result.
-
In the following literature [A], the sample complexity result is for the discounted setting.
[A] Xu, Tengyu, Zhe Wang, and Yingbin Liang. "Improving sample complexity bounds for (natural) actor-critic algorithms." Advances in Neural Information Processing Systems 33 (2020): 4358-4369.
I would like to maintain the score.
- We apologize for the typo. The length of the trajectory should be $\min\{2^{Q_{kh}}, T_{\max}\}$, not $\max\{2^{Q_{kh}}, T_{\max}\}$. In this case, one can see that the expected length of the trajectory is bounded as claimed (a short simulation illustrating this min-versus-max distinction is included after the next point).
Therefore, our conclusions remain unchanged.
- We note that the cited paper [A] reports an incorrect sample complexity for NAC, which was subsequently corrected in the arXiv version (https://arxiv.org/pdf/2004.12956). In that paper, the sample complexity established via the vanilla actor-critic (AC) algorithm only ensures a first-order local (stationary-point) convergence error (please see the footnotes following the comparison table in the mentioned paper). In contrast, the same paper establishes a sample complexity via a natural actor-critic (NAC) algorithm that ensures a global error guarantee. We reported the global convergence result in our paper.
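Returning to the first point above: as a sanity check of the min-versus-max distinction, here is a short simulation. For illustration, it uses a truncated Geometric(1/2) level distribution with $Q_{\max} = \log_2 T_{\max}$ (see Algorithm 1 for the exact specification). Under this assumption the expected per-draw trajectory length is logarithmic in $T_{\max}$ with the min operator, but of order $T_{\max}$ with the max operator, consistent with the reviewer's observation.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_length(T_max, op, n_samples=200_000):
    """Monte Carlo estimate of E[op(2^Q, T_max)], where Q ~ Geometric(1/2)
    truncated at Q_max = log2(T_max) (an assumed MLMC level distribution)."""
    Q_max = int(np.log2(T_max))
    Q = np.minimum(rng.geometric(0.5, size=n_samples), Q_max)
    return op(2.0 ** Q, T_max).mean()

T_max = 2 ** 16
print(expected_length(T_max, np.minimum))  # ~ log2(T_max) + 1, i.e. O(log T_max)
print(expected_length(T_max, np.maximum))  # = T_max here, i.e. Theta(T_max)
```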
We hope the above clarification resolves all of your remaining concerns.
This work considers the average-reward RL setting with general policy parametrization. The authors improve over the state-of-the-art global convergence guarantee, sharpening the rate without requiring knowledge of the mixing or hitting times. This is done by adopting a multi-level MC procedure to estimate the relevant quantities and through a tighter analysis of the derived estimates.
Questions for Authors
See Sections above
Claims and Evidence
The claims made in the submission are clear and supported by theoretical results.
Concerning the table, the authors state that their approach works for infinite state and action spaces, and similar claims are made for some of the related works. However, by inspecting those works, it appears that their results are defined with respect to large but finite spaces. Can the authors comment on this?
Methods and Evaluation Criteria
No experiments were presented in the work.
Theoretical Claims
I checked the proof outline in Section 5 which seems correct to me.
Experimental Design and Analysis
No experiments.
Supplementary Material
I reviewed the supplementary material but I did not carefully check the proof of all the results.
Relation to Existing Literature
The paper presents an improvement in terms of global convergence guarantees with respect to state-of-the-art approaches. A similar convergence result was achieved in Ganesh et al. (2024), but in a slightly different setting with finite state and action spaces and knowledge of the mixing time. A work closer in terms of assumptions and setting is the one from Patel et al. (2024), but it achieves a worse convergence rate.
Essential References Not Discussed
In my opinion, the related literature has been thoroughly discussed.
Other Strengths and Weaknesses
Among the strengths of the work, I mention the newly achieved convergence result for the considered setting without knowledge of the mixing time and with infinite state and action spaces.
Concerning the weaknesses, the authors do not properly highlight the technical novelty in terms of theoretical analysis. Indeed, this work shares many similarities both in terms of assumptions and in terms of methods with the one of Patel et al. (2024). I believe that the work may benefit from a clearer comparison between the current work and the one just mentioned.
Another weakness is the absence of numerical simulations.
Other Comments or Suggestions
See above
Concerning the table, the authors state that their approach works for infinite state and action spaces, and similar claims are made for some of the related works. However, by inspecting those works, it appears that their results are defined with respect to large but finite spaces. Can the authors comment on this?
We agree that some related works (NAC-CFA and MAC) assume large but finite state spaces, and we have mistakenly given them more credit by stating that their conclusions extend to infinite state spaces. We will make changes in the revised paper accordingly. In this context, we want to emphasize that our work does apply to infinite states and action spaces, and there is no mistake in that claim.
Concerning the weaknesses, the authors do not properly highlight the technical novelty in terms of theoretical analysis. Indeed, this work shares many similarities both in terms of assumptions and in terms of methods with the one of Patel et al. (2024). I believe that the work may benefit from a clearer comparison between the current work and the one just mentioned.
The assumptions used in this work are commonly used in the literature and thus are similar to those used in multiple works including (Patel et al., 2024).
Although our methods bear some similarities with those of (Patel et al., 2024), there are multiple novel components in the analysis that allow us to establish an order-optimal convergence rate for average-reward MDPs without knowledge of the mixing time, a feat achieved by no other work in the literature. While existing AC analyses (including that in the cited paper) apply relatively loose bounds, we refine the analysis to get sharper results. The first step towards this goal is to establish that the global convergence error is bounded by terms proportional to the bias and the second-order error in the NPG estimation (Lemma 1). In prior AC works (including the cited paper), a coarser bound was used instead of the NPG bias, leading to an error term in the global convergence bound that involves the expected distance between the critic estimate at each iteration and its true value. Utilizing Lemma 1 and Theorem 3, our analysis refines this term to one involving only the bias of the estimate, which can be significantly smaller than the previous quantity. It is to be noted that bounding this bias remains challenging due to Markovian sampling. Interestingly, we observe that both the critic and NPG updates can be interpreted as linear recursions with Markovian noise. An efficient analysis of such a linear recursion is another novelty of our paper. In Theorem 2, we obtain a convergence rate for a generic stochastic linear recursion. This, along with the improved error estimates, forms the basis for Theorems 3 and 4, which finally lead to our desired result in Theorem 1.
In summary, improving the error estimates, recognizing the NPG and critic updates as linear recursions, and performing a sharp convergence analysis of a general stochastic linear recursion are the novel cornerstones of our analysis.
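Schematically, the decomposition described above takes the following form (constants and logarithmic factors suppressed; see Lemma 1 for the precise statement):
$$J^{*} - \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[J(\theta_k)\big] \;\lesssim\; \sqrt{\varepsilon_{\mathrm{bias}}} \;+\; \frac{1}{K}\sum_{k=1}^{K}\big\|\mathbb{E}[\hat{\omega}_k \mid \theta_k] - \omega_k^{*}\big\| \;+\; \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big\|\hat{\omega}_k - \omega_k^{*}\big\|^{2} \;+\; \frac{1}{\eta K},$$
where $\omega_k^{*}$ is the exact NPG direction at $\theta_k$, $\hat{\omega}_k$ is its estimate, $\eta$ is the actor step size, and $\varepsilon_{\mathrm{bias}}$ is the function-approximation bias. The refinement discussed above replaces a first-moment error of the form $\frac{1}{K}\sum_k \mathbb{E}\|\hat{\omega}_k - \omega_k^{*}\|$ by the (typically much smaller) bias term.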
Another weakness is the absence of numerical simulations.
The focus of the paper is obtaining the first theoretical result establishing a state-action-space-size-independent, order-optimal global convergence rate for general parameterized policies in average-reward infinite-horizon MDPs, using a practical algorithm that does not require knowledge of mixing times. Evaluation of this work on practical applications is left as future work.
This manuscript identifies the main issue in previous analyses of natural policy gradient via the actor-critic approach, and proposes new technical tools that view both the critic update and the natural policy gradient computation as a stochastic linear recursion, to overcome the issue and obtain a minimax-optimal convergence rate in terms of the horizon length. Although most of the theoretical analysis does not deviate far from existing work, most of the reviewers do find that the proposed technical improvement is solid.
As multiple reviewers pointed out, the presentation should be further improved. The authors should highlight the key components of the technical improvements so that the reader can easily distinguish the manuscript from other existing works. There are also some typos which made one of the reviewers confused about the total sample complexity. I do hope the authors can make the necessary adjustments based on the reviewers' comments.