PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 5, 5, 4, 4 (lowest 4, highest 5, std. dev. 0.5)
Confidence: 3.5
Novelty: 3.3 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.
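A minimal, hypothetical sketch of the batched natural actor-critic loop described above; the env/policy/critic interfaces, the advantage-based gradient estimate, and the step size are illustrative assumptions, not the paper's exact NAC-B procedure.

```python
import numpy as np

def nac_b_sketch(env, policy, critic, T, B, eta=0.01):
    """Batched natural actor-critic: each outer iteration collects B Markovian
    samples, refits the critic, and takes one natural policy gradient step."""
    state = env.reset()
    for _ in range(T // B):                      # roughly T / B policy updates
        batch = []
        for _ in range(B):                       # inner loop: one batch of size B
            action = policy.sample(state)
            next_state, reward = env.step(action)
            batch.append((state, action, reward))
            state = next_state
        critic.fit(batch, policy)                # TD-style critic update (placeholder)
        g = np.mean([policy.grad_log(s, a) * critic.advantage(s, a)
                     for s, a, _ in batch], axis=0)
        F = np.mean([np.outer(policy.grad_log(s, a), policy.grad_log(s, a))
                     for s, a, _ in batch], axis=0)
        direction = np.linalg.solve(F + 1e-6 * np.eye(len(g)), g)
        policy.theta = policy.theta + eta * direction   # natural gradient ascent step
    return policy
```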
Keywords
Actor-Critic · Regret · Policy Gradient · Average-reward MDP · Unichain MDP

Reviews and Discussion

Official Review
Rating: 5

This work proposes a natural actor critic algorithm with batching for average-reward Markov Decision Processes, and for the first time weakens the existing strong ergodicity assumptions to the unichain assumption and still preserves the order-optimal regret.

Strengths and Weaknesses

Strengths: This work studies average-reward Markov Decision Processes, an important problem, and makes a fundamental breakthrough in the regret analysis by reducing the existing strong ergodicity assumptions to the unichain assumption while still preserving the order-optimal regret. The non-trivial challenges posed by transient states are addressed through novel proof techniques, such as using the convergence rate of the transition-kernel running average (Lemma 2) and conditioning on a high-probability event of bounded stopping time (Theorem 2). The general proof outline looks reasonable (I did not check details). The presentation is in general very clear.

Weaknesses: A few points may need clarification or correction, as noted in the questions below. This work could be stronger with an experimental comparison against existing algorithms for average-reward MDPs.

Questions

(1) Why are some actor-critic algorithms [26, 34, 14] and unichain settings [3, 20] excluded from Table 1? You could mention the properties of these works that do not fit the caption of Table 1, such as "model-free", "infinite-horizon" and "average-reward".

(2) At the beginning of Section 4.1, is $b_v(\theta, z)$ defined in (14)? (16) defines $b_u$ with 3 inputs.

(3) Is the outer product defined as $a \times b = ab^{\top}$ or $ba^{\top}$? You could define it in the paper to be more clear.

(4) In the proof of Lemma 2, should $T_{\theta}$ be replaced with $T_{\theta} - 1$, since $s_{T_{\theta}} \in \mathcal{S}_{\mathcal{R}}^{\theta}$? Should $\mathbb{E}_{\theta}$ be conditioned on $\mathcal{F}_{T_{\theta}}$ based on the strong Markov property?

Limitations

Linear function approximation for the critic is assumed, which could be extended to more general function approximation. Also, the conclusion section mentions future directions of extending this analysis to other useful settings such as constrained MDPs, multi-agent RL, etc. These indicate that the current analysis focuses only on a very basic setting.

Final Justification

I have just found that all reviewers are satisfied with the authors' rebuttal. I would like to keep my current rating. (Reviewer i9Td)

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for their thoughtful feedback and comments. Below we address the questions raised.

Response to Question 1: We agree that providing more context on the works both within and outside the table would help improve the clarity of our paper, which we will make sure to do. For the specific papers mentioned:

  • [26] Provides only local convergence, giving a bound on the norm of the policy gradient rather than the suboptimality in average reward relative to the optimal policy.

  • [34] Uses a model-based approach with a generative model, whereas our setting is entirely model-free.

  • [14] Does not provide a regret bound, but establishes a closely related global convergence guarantee. We will include it in the table for completeness. A regret bound may not follow directly from this result due to the use of varying trajectory lengths across iterations.

  • [3] This is already included in the table. If you are referring to the other unichain work [21], it is excluded because its unichain result assumes access to a simulator, unlike the other works listed in the table.

  • [20] Assumes access to exact gradients, which significantly simplifies the setup and enables the authors to obtain a logarithmic regret bound.

We will definitely expand details on these related works in the final version.

Response to Question 2: Yes, that is correct. The reason $b_v$ is defined with 2 inputs and $b_u$ with 3 is that $b_u$ additionally depends on the estimate of the value-function parameter $\xi$, so that it can form an estimate of the policy gradient. Thus, $b_v$ and $b_u$ are different functions.

Response to Question 3: Thank you for pointing this out. In the context of our paper, either definition is acceptable, as we only use it to define the Fisher information matrix, where we take $a = b = \nabla \log \pi_\theta(a|s)$. We will revise the expression to omit the outer-product notation and simply write it as $[\nabla \log \pi_\theta(a|s)\, \nabla \log \pi_\theta(a|s)^\top]$.
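As a small illustration (ours, not the paper's) of this expression, the Fisher term can be estimated as a plain sample average of score outer products, here for a toy linear-softmax policy with arbitrary features and dimensions.

```python
import numpy as np

# Toy estimate of E[ grad log pi (grad log pi)^T ] for a linear-softmax policy
# pi_theta(a|s) proportional to exp(theta . phi(s, a)).
rng = np.random.default_rng(0)
d, A, N = 4, 3, 1000
theta = rng.normal(size=d)

F = np.zeros((d, d))
for _ in range(N):
    phi = rng.normal(size=(A, d))          # per-action features at a sampled state
    logits = phi @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(A, p=probs)             # a ~ pi_theta(.|s)
    score = phi[a] - probs @ phi           # grad_theta log pi_theta(a|s)
    F += np.outer(score, score) / N        # [grad log pi][grad log pi]^T, averaged
```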

Response to Question 4: Thank you for pointing this out. We will update $T_\theta$ to $T_\theta - 1$ accordingly.

Response to Limitations: We note that this work provides the first order-optimal regret analysis for general policy gradient methods in non-ergodic MDPs under the average-reward setting. By showing that unichain MDPs can be effectively handled within the policy gradient framework, we open several promising research directions outlined in the future work section. While empirical validation is valuable and planned for future work, our focus here is on establishing provable guarantees for policy gradient methods, which are widely used in practice but lack comparable theoretical guarantees in this setting.

Comment

My questions are well addressed. I would like to keep my current acceptance rating.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Official Review
Rating: 5

The authors study infinite-horizon average-reward reinforcement learning under the assumption that the learner has access to a policy class and a feature mapping that enables linear approximation of the value functions for all policies in the class. A closely related prior work proposed an actor-critic algorithm for this setting but relied on a strong ergodicity assumption. This paper relaxes that requirement to a weaker unichain assumption by introducing a novel batch algorithm for gradient estimation and providing the associated analysis. The key to the analysis is handling the set of transient states.

Strengths and Weaknesses

Strengths: The paper presents its contributions clearly. The proposed batching strategy for gradient estimation, along with the accompanying analysis under the unichain assumption, is novel and well-articulated.

Weaknesses:

  • While the paper offers a novel contribution, its overall novelty and significance appear limited, as the algorithmic structure closely resembles that of MLMC-NAC ([14], ICML 2025).
  • The exposition could be improved for clarity. Understanding the proposed algorithm currently requires substantial prior knowledge of policy gradient methods in the average-reward setting. In particular, elaborating on the phrase "follows standard constructions in Temporal Difference" (Line 201) and incorporating relevant content from Appendix B into the main text would enhance accessibility.

Questions

Q1) The proposed algorithm resembles MLMC-NAC [14], with a difference in how the inner-loop size is determined. Specifically, MLMC-NAC samples the inner-loop size from $2^{\mathrm{Geom}(1/2)}$, while the proposed algorithm in this submission fixes the size to $B$. What is the intuition behind the difference? Is the difference necessary for the analysis?

Q2) How challenging would it be to extend this work to settings where only the Bellman optimality equation is assumed? Since not all policies are exploratory under this weaker assumption, I would expect that some non-trivial algorithmic techniques are required to ensure sufficient exploration. Have there been any prior attempts to address this within the class of policy gradient algorithms?

Limitations

Yes

Final Justification

The authors have addressed my major concern on the novelty of the submission. As clarified in their response to my question, although the algorithm may appear similar to prior work at a surface level, they introduce necessary modifications without increasing its complexity and provide an accompanying analysis that removes the need for the strong ergodicity assumption required by previous studies.

Generally, relaxing the ergodicity assumption requires significant work in the average-reward setting, and I believe this submission is a good step toward relaxing this assumption for the class of policy-based algorithms.

Formatting Issues

None.

Author Response

We thank the reviewer for their feedback. Below we address the points raised.

Response to Weakness 1/Question 1: While our algorithm shares surface-level similarity with MLMC-NAC in the use of nested sampling, there are critical differences in both methodology and analysis objectives.

First, we intentionally adopt a simpler design in which the inner-loop size is fixed to a batch size $B$ rather than being sampled from a geometric distribution as in MLMC-NAC. This choice is deliberate: our goal is to demonstrate that even a simple averaging scheme, without elaborate techniques such as MLMC, can achieve order-optimal regret guarantees. We believe this yields a stronger statement, as it shows that optimal performance can be guaranteed using only simple sample averaging, even without ergodicity.

More specifically, MLMC-NAC in [14] uses multilevel Monte Carlo to achieve an $O(1/T)$ bias with only $O(\log T)$ samples in expectation, leveraging exponential mixing of the MDP. Since we do not assume such fast mixing, we resort to simple averaging, which requires $\Theta(\sqrt{T})$ samples to control bias. However, the reduced variance from averaging allows us to limit the number of outer iterations to $O(\log T)$, preserving order-optimal complexity through a more refined analysis.
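For intuition, here is a stylized side-by-side of the two inner-loop-size choices discussed above; neither function reproduces the exact construction in [14] or in this paper, and the truncation level is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlmc_inner_loop_size(t_max):
    """MLMC-style draw, roughly 2^Geom(1/2) truncated at t_max:
    O(log T) samples in expectation, relying on fast mixing to control bias."""
    j = int(rng.geometric(0.5))
    return min(2 ** j, t_max)

def fixed_batch_size(T):
    """NAC-B-style choice: a fixed batch of order sqrt(T) samples,
    controlling bias by simple averaging instead of fast mixing."""
    return int(np.ceil(np.sqrt(T)))

T = 10_000
print("example MLMC draw:", mlmc_inner_loop_size(T))   # varies run to run
print("fixed batch size :", fixed_batch_size(T))       # always 100 here
```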

Finally, while the reviewer only mentions the algorithmic comparison and not the analysis, our analysis is significantly different from [14]. Beyond the differences arising from the algorithm, the bulk of our analysis lies in handling challenges due to the unichain assumption, as opposed to the well-studied ergodic case. Analyses in the ergodic setting typically rely on exponentially fast mixing, a property that does not generally hold in the classical unichain setting. For example, in periodic MDPs, the limit $\lim_{t \to \infty}(P_{\pi_\theta})^t(s_0, \cdot)$ may not exist. To overcome this, we introduce a large-batch averaging technique and show that the time-averaged distribution converges to the stationary distribution at a rate of $O(1/t)$ (Lemma 2), even without exponential mixing. This averaging approach is a key novelty that enables us to prove decay rates for the bias and variance of our estimators (Lemma 3), which in turn facilitates bounds on the value and Q-functions. Another technical contribution is our treatment of transient states, which remain problematic due to their limited long-term relevance. To isolate their effect, we establish a rapid entry time into the recurrent class (Lemma 10), allowing the use of the strong Markov property to restrict the analysis to the recurrent component.
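A small numerical illustration (ours, not from the paper) of this point: for a deterministic 2-cycle, the time-$t$ distribution never converges, while the average of the first $t$ step distributions approaches the stationary distribution at roughly a $1/t$ rate.

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])            # deterministic 2-cycle: period 2, stationary = [0.5, 0.5]
mu = np.array([1.0, 0.0])             # start deterministically in state 0
pi = np.array([0.5, 0.5])

running_sum = np.zeros(2)
for t in range(1, 9):
    mu = mu @ P                       # distribution at time t: oscillates, never converges
    running_sum += mu
    avg = running_sum / t             # time-averaged distribution over the first t steps
    tv_single = 0.5 * np.abs(mu - pi).sum()
    tv_avg = 0.5 * np.abs(avg - pi).sum()
    print(f"t={t}: TV(P^t, pi)={tv_single:.2f}  TV(time-average, pi)={tv_avg:.3f}")
```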

Response to Weakness 2: Thank you for pointing this out. We will move the details of the TD learning procedure from the appendix to the main paper.

Response to Question 2: When only the Bellman optimality equation is assumed, the classic policy gradient theorem is not applicable, as it relies on the unichain assumption. To the best of our knowledge, no policy gradient methods with guarantees exist for average-reward, infinite-horizon settings beyond this case. Our work is the first to establish such guarantees under the general unichain assumption, the weakest condition under which the policy gradient theorem is known to hold.

While value-based methods achieve regret bounds under only the Bellman optimality assumption, policy gradient methods require fundamentally different analysis. Policy gradient methods scale far better with large state and action spaces, making them the practical default. Given their empirical success, extending theoretical guarantees to more general MDPs is a compelling goal. One possible direction, though purely conjectural at this stage and likely requiring significant effort, is to restrict policy search to unichain-inducing policies by maintaining strong exploration early and then annealing it as the algorithm converges. Weakly communicating MDPs, the most general class solvable with a single stream of experience, always admit an optimal unichain policy. This conjectured approach could bridge the gap between theory and practice but remains to be rigorously investigated. We will point to this future work direction in the final version.

Comment

Thank you for the detailed explanation. All my questions have been addressed. I will take your response into consideration when making the final recommendation.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Official Review
Rating: 4

This paper investigates average-reward Markov decision processes under the unichain assumption. The authors introduce a model-free, policy-gradient algorithm called Natural Actor-Critic with Batching. A regret bound of $\tilde{O}(\sqrt{T})$ is proved, with two parameters identified: $C_{\text{hit}}$, the worst-case expected time to enter the recurrent class, and $C_{\text{tar}}$, the expected time to reach a state drawn from the stationary distribution.

Strengths and Weaknesses

Strengths:

  1. This paper proposes the algorithm NAC-B and provides the first order-optimal regret analysis for general PG in non-ergodic MDPs, which allows transient states and periodicity, for the average-reward RL problem.
  2. Using large-batch averaging, they prove a novel total-variation bound for the time-averaged state distribution (Lemma 2).
  3. The authors propose the new techniques of building a “hit-the-recurrent-class” high-probability event and projecting updates onto the orthogonal complement of the shifting kernel to handle singular critics.

Weaknesses:

  1. This paper comes with no experiments, especially considering the very large batch $B = O(\sqrt{T})$ in their algorithm, which needs to be stored before performing the NPG update. Experiments are needed to verify the practical feasibility of this design.
  2. Two properties of the unichain MDP, $C_{\text{hit}}$ and $C_{\text{tar}}$, might be very large. In large or poorly connected MDPs, these may scale with the size of the state space or with the intermediate policies. A more detailed analysis and comparison with prior work would clarify their impact on sample complexity.

Questions

Please refer to the weaknesses above.

Limitations

This manuscript omits a societal impact discussion, but this is not a blocking issue given the paper’s theoretical focus and negligible immediate societal risk.

Final Justification

I think this paper's contribution is significant in providing the first order-optimal regret analysis for general PG methods in non-ergodic MDPs. The writing is clear and the theoretical contribution is evident; my only concern is the lack of real-world empirical experiments, so I will keep my original assessment.

Formatting Issues

None

Author Response

We thank the reviewer for their feedback and comments. We hope to clarify and address the points raised.

Regarding experiments and practicality of large-batch sizes: As noted in the manuscript, this work focuses on establishing theoretical regret guarantees, with detailed experimental evaluations left for future work.

While our algorithm makes use of a large batch size $B = \mathcal{O}(\sqrt{T})$, this does not require storing all $\sqrt{T}$ samples. In fact, we only need to maintain an online average of the policy gradient estimate, which can be updated incrementally as new samples are collected. This design avoids the need for storing the entire batch, making the implementation efficient and scalable in practice.
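A minimal sketch of this incremental averaging, under assumed interfaces: sample_stream is any iterator of transitions and grad_estimate is a hypothetical per-sample gradient estimator, not the paper's exact expression.

```python
import numpy as np

def batched_gradient(sample_stream, B, grad_estimate, dim):
    """Average B per-sample gradient estimates without storing the batch:
    only the running mean (O(dim) memory) is kept."""
    g_avg = np.zeros(dim)
    for n in range(1, B + 1):
        sample = next(sample_stream)       # one Markovian transition
        g = grad_estimate(sample)          # per-sample policy gradient estimate
        g_avg += (g - g_avg) / n           # online mean update
    return g_avg
```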

In our algorithm, large batches are theoretically justified: collecting a substantial number of samples per iteration improves the accuracy of both the critic estimate and the resulting natural policy gradient direction. This leads to fewer policy updates and ultimately enables order-optimal regret.

To further illustrate the practicality of using a large number of samples per policy update, consider Trust Region Policy Optimization (TRPO), which is based on the Natural Actor-Critic framework. Although TRPO differs from our method, for instance by being formulated in the discounted-reward setting, it reflects standard design choices in practice. It is typical for TRPO to use thousands of samples per policy iteration, with values commonly ranging from 2,000 to 20,000 depending on the environment. For instance, the Spinning Up implementation sets the default number of samples per iteration to 4,000, as specified in both their documentation and source code. Although the notion of "batch size" differs due to how gradients are computed in both works, this highlights that even TRPO relies on a large number of state-action samples to estimate the natural policy gradient direction.

On the Scalability of $C_{\text{hit}}$ and $C_{\text{tar}}$: We appreciate the reviewer’s concern regarding the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which capture the difficulty of exploration under a given policy class. We agree that these constants may scale with structural properties of the MDP such as its connectivity or the number of states. However, we note that some dependence on state-space size or connectivity in discrete-state MDPs is unavoidable in general, since any algorithm must visit all states to avoid incurring constant regret at each iteration. We note that even the mixing time $t_{\text{mix}}$ in ergodic MDPs can scale with the size of the state space.

We emphasize that our result extends existing guarantees from ergodic MDPs to the more general unichain setting without any degradation in the underlying constants. For instance, the state-of-the-art complexity for actor-critic methods in ergodic MDPs is $O(t_{\text{mix}}^3 \sqrt{T})$ [14]. Although this result does not provide a regret bound, it may be interpreted as a pseudo-regret guarantee. In contrast, our result achieves a bound of $O(C^3 \sqrt{T})$, where the constant $C$, defined in our paper, can be viewed as a characterization of the Cesàro mixing time as established in Lemma 2. The Cesàro mixing time (Section 6.6 in [A]) captures time-averaged convergence behavior and is at most $7 t_{\text{mix}}$ in ergodic chains (see Exercise 6.11 in [A]).

Furthermore, in certain ergodic Markov chains, the Cesàro mixing time can be substantially smaller than the standard mixing time. For instance, in the biased random walk on the $n$-cycle (Example 24.2 in [A]), the standard mixing time is of order $n^2$ (see Section 5.3.2 in [A]), whereas the Cesàro mixing time is only $O(n)$ (see Example 24.2 in [A]). This improvement arises from bounds on the expected stopping time to reach a randomly chosen state from the stationary distribution, which scales linearly with $n$. By averaging the distributions over the first $t$ steps, Cesàro mixing mitigates the effects of periodicity and accelerates convergence to the stationary distribution compared to relying on the distribution at a single time step.
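A rough numerical check of this comparison (our illustration, with an arbitrary odd cycle length and bias): it prints the total-variation distance to the uniform stationary distribution for the time-$t$ distribution and for its Cesàro average.

```python
import numpy as np

n, p = 21, 0.7                               # odd cycle length (aperiodic), clockwise bias
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = p                    # step clockwise with probability p
    P[i, (i - 1) % n] = 1 - p                # step counter-clockwise otherwise
pi = np.full(n, 1.0 / n)                     # uniform is stationary (P is doubly stochastic)

mu = np.zeros(n)
mu[0] = 1.0
running_sum = np.zeros(n)
for t in range(1, 201):
    mu = mu @ P
    running_sum += mu
    if t in (10, 50, 200):
        tv_single = 0.5 * np.abs(mu - pi).sum()
        tv_avg = 0.5 * np.abs(running_sum / t - pi).sum()
        print(f"t={t}: TV(P^t, pi)={tv_single:.3f}  TV(Cesaro average, pi)={tv_avg:.3f}")
```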

In addition, we note that the computational complexity of our algorithm is independent of the cardinalities of the state and action spaces, offering a substantial advantage over tabular or value-based approaches. While our algorithm is applicable to continuous state and action spaces, our theoretical analysis presently focuses on the finite setting.

We will revise the paper to clarify the interpretation of these constants and include further discussion comparing them with mixing-time-based assumptions used in prior work.


[A] Levin, D.A., & Peres, Y. (2017). Markov Chains and Mixing Times (2nd ed.). American Mathematical Society.

Comment

I thank the authors for addressing my concerns. All my questions are addressed, and I will keep my positive evaluation for this work.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Official Review
Rating: 4

This paper presents the natural actor-critic algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) under the unichain assumption, which allows for both periodicity and transient states. Unlike prior works that rely on ergodicity for theoretical analysis, this paper derives an $\tilde{O}(\sqrt{T})$ regret bound under substantially weaker assumptions. Key technical innovations include the use of batching to mitigate variance and periodicity, and the introduction of constants $C_{\text{hit}}$ and $C_{\text{tar}}$ to characterize convergence behavior. The algorithm employs general function approximation for both actor and critic components, enhancing scalability.

Strengths and Weaknesses

Strengths:

  • Novelty: The paper makes a novel and substantial contribution by analyzing regret for actor-critic methods under the unichain assumption; this represents a significant step forward in understanding policy gradient methods in more general settings.
  • Analysis: The regret bound is carefully derived under general function approximation, with comprehensive proofs and clear articulation of assumptions. The techniques used to handle the challenges of periodicity and transient states are thoughtful and well-executed.
  • Clarity: Despite the technical depth, the writing is generally clear, and the theoretical insights are well-motivated.

Weaknesses: While I do not consider the absence of experiments a major concern for a theoretical paper, it would still be valuable to evaluate the practical behavior of NAC-B—particularly its sensitivity to function approximation quality, step sizes, and batch sizes. For instance, although the use of large batches is theoretically well-justified, it could be costly in practice. A discussion or guidance on batch size selection in practical settings would help strengthen the paper. In addition, although the unichain assumption is more general than the commonly studied ergodic setting, the paper would benefit from a more detailed discussion of its practical relevance. Specifically, it would be helpful to clarify in what types of real-world applications this assumption is likely to hold, as many practical environments may not satisfy the unichain condition.

Questions

See the weaknesses above.

Limitations

A theoretical work may not need to treat societal impact as a high priority.

Formatting Issues

no

Author Response

We thank the reviewer for their comments and feedback. We address the weaknesses raised below.

Regarding experimental evaluation: While empirical validation is valuable, our focus here is on establishing theoretical guarantees, which remain limited for policy gradient methods beyond ergodicity. However, we would like to clarify the rationale for using large batch sizes in policy gradient estimation. In our algorithm, large batches are theoretically justified: collecting a substantial number of samples per iteration improves the accuracy of both the critic estimate and the resulting natural policy gradient direction. This leads to fewer policy updates and ultimately enables order-optimal regret.

To further illustrate the practicality of using a large number of samples per policy update, consider Trust Region Policy Optimization (TRPO), which is based on the Natural Actor-Critic framework. Although TRPO differs from our method, for instance by being formulated in the discounted-reward setting, it reflects standard design choices in practice. It is typical for TRPO to use thousands of samples per policy iteration, with values commonly ranging from 2,000 to 20,000 depending on the environment. For instance, the Spinning Up implementation sets the default number of samples per iteration to 4,000, as specified in both their documentation and source code. Although the notion of "batch size" differs due to how gradients are computed in both works, this highlights that even TRPO relies on a large number of state-action samples to estimate the natural policy gradient direction.

Regarding the unichain assumption: We present two examples below: one in which the MDP is periodic and another in which it is reducible. In both cases, the optimal policy does not induce an ergodic chain, implying that one cannot restrict the search to policies that induce ergodic chains. This demonstrates that periodicity and transient states may be unavoidable. Nonetheless, the underlying themes in these examples are commonly encountered in practice.

An autonomous drone operates in three modes: active delivery ($s_1$), standby hovering ($s_2$), and low-battery shutdown ($s_3$). At each step it may choose to continue deliveries (action $a_1$) or return to standby (action $a_2$). In state $s_1$, action $a_1$ leads to $s_1$ with probability $0.8$, to $s_2$ with probability $0.15$, and to $s_3$ with probability $0.05$. In $s_2$, action $a_1$ returns deterministically to $s_1$, while action $a_2$ keeps the drone in $s_2$ with probability $0.9$ and moves it to $s_3$ with probability $0.1$. Once in $s_3$, the drone remains there indefinitely. Rewards are $1$ in $s_1$, $0.5$ in $s_2$, and $0$ in $s_3$. All policies induce a single recurrent class, either the cycle $\{s_1, s_2\}$ or the absorbing state $\{s_3\}$, so the MDP is unichain but not ergodic.
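For concreteness, the drone example can be written out as explicit transition matrices (our transcription of the numbers above; the effect of $a_2$ in state $s_1$ is not specified in the text, so a deterministic return to standby is assumed as a placeholder).

```python
import numpy as np

# States 0, 1, 2 correspond to s1 (delivery), s2 (standby), s3 (shutdown).
P_a1 = np.array([[0.80, 0.15, 0.05],   # s1 under a1: mostly keep delivering
                 [1.00, 0.00, 0.00],   # s2 under a1: return to delivery
                 [0.00, 0.00, 1.00]])  # s3: shutdown is absorbing
P_a2 = np.array([[0.00, 1.00, 0.00],   # s1 under a2: assumed deterministic return to standby
                 [0.00, 0.90, 0.10],   # s2 under a2: keep hovering, small chance of shutdown
                 [0.00, 0.00, 1.00]])  # s3: shutdown is absorbing
r = np.array([1.0, 0.5, 0.0])          # per-state rewards

# The response argues that every stationary policy induces a single recurrent class
# (the cycle {s1, s2} or the absorbing {s3}), i.e., the MDP is unichain but not ergodic.
```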

In robotic gait control, particularly for bipedal or quadrupedal robots, the system transitions through discrete movement phases such as left stance, right stance, or flight. States include these phases along with joint configurations and velocities, while actions correspond to motor commands. The dynamics are stochastic yet controllable, and most policies lead to all phases being visited, making the system irreducible. However, optimal locomotion, minimizing energy or maximizing stability, relies on repeating a fixed sequence of movements, such as alternating stances, inducing a periodic policy. This reflects the inherent cyclic nature of locomotion, where balance and rhythm are essential for efficient and stable motion. More broadly, periodic optimal policies can naturally arise in other settings where the environment or reward structure strongly incentivizes strict alternation or repetition over a fixed cycle, such as traffic signal control, task scheduling, or maintenance operations.

Comment

Thank you for the detailed explanation! My concerns are well explained.

Comment

Thank you for your feedback. We are glad to hear that our response addressed your comments.

Final Decision

This paper develops a NAC-with-batching algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) under the unichain assumption. The reviewers are all positive about this paper, and its techniques for dealing with the unichain case are of potential interest for analyzing other settings in the future. It is recommended for acceptance.