PaperHub
NeurIPS 2025 · Poster · 4 reviewers
Overall score: 7.3/10 (ratings 5, 4, 5, 4; min 4, max 5, std 0.5)
Confidence: 3.5 · Novelty: 3.3 · Quality: 3.3 · Clarity: 3.5 · Significance: 3.5

Offline Actor-Critic for Average Reward MDPs

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Reinforcement Learning · Offline · Actor-Critic

Reviews and Discussion

Review
Rating: 5

This paper proposes a pessimistic actor-critic algorithm for offline reinforcement learning (RL) in infinite-horizon average-reward Markov Decision Processes (AMDPs), using linear function approximation of the value function. The key technical contribution is a novel, computationally efficient critic step that avoids solving a sequence of successive regression problems by directly solving a fixed-point Bellman equation via a linear optimization formulation with convex quadratic constraints. The algorithm is proven to be sample-efficient, with a near-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$, under a weak notion of data coverage (the feature coverage ratio). This is the first such result in the offline average-reward setting that matches the best known rate and only requires the MDP induced by every policy to be unichain.

Strengths and Weaknesses

Strengths:

  • The paper addresses offline RL in the infinite-horizon, average-reward setting, an important gap in the literature
  • The analysis requires mild assumptions compared to prior literature - it requires only the unichain condition (instead of uniform ergodicity) and a relatively weak notion of the feature coverage ratio
  • The algorithm cleverly constructs a pessimistic estimate of the average reward by solving a single fixed-point Bellman equation and addresses computational efficiency through the reformulation as a linear optimization with convex quadratic constraints
  • Significantly improves upon the sample complexity of prior offline RL literature and matches the best known rate in the online RL domain
  • Well-written paper with a clear presentation of the proof sketch and a thorough discussion of the assumptions used

Weaknesses:

  • A discussion on the effects of scaling state and action space might be nice to include
  • Insight into how the current work may be extended to the general non-linear function approximation setting would make the paper comprehensive
  • While weaker than uniform ergodicity, the unichain assumption may not always hold. As mentioned in the section on future directions, it would be interesting to see how this affects algorithm design and analysis.

Questions

N/A

Limitations

Yes, the authors have adequately addressed the limitations and potential negative societal impact of their work.

Justification for Final Rating

Accept because the paper considers an important problem (offline RL in the infinite-horizon, average-reward setting), presents a clever algorithm, significantly improves the sample complexity over the previous SOTA, and is well written.

Not Strong Accept because the problem considered is a logical next step in the offline RL line of work and the proof techniques presented are significant but not groundbreaking.

Formatting Issues

N/A

Author Response

Thank you for your time and detailed review. Please see our responses below.

A discussion on the effects of scaling state and action space might be nice to include

The bound in Theorem 5.1 only depends on the size of the action space, and only scales logarithmically with this number. Importantly, the bound does not depend directly on the size of the state space at all, only on the dimension $d$ of the feature map. We will be sure to add this discussion in the revision.

Insight into how the current work may be extended to the general non-linear function approximation setting would make the paper comprehensive

The main obstacle in generalizing our algorithm in its exact form to more general function approximation is the confidence ellipsoid, $\lVert \xi \rVert_{\hat{\Lambda}} \leq \beta$. This is based on martingale concentration techniques from the linear MDP literature (Lemma B.1). These techniques allow us to very precisely quantify the level of uncertainty in the dataset. It is not yet clear how to do this for more general function approximation like neural networks.

However, a promising avenue may be to adapt techniques from papers such as [Xie et al., 2021] and [Cheng et al., 2022]. Specifically, one can consider removing the pessimism parameter $\xi$ and instead taking a Lagrangian relaxation of the critic's problem. In episode $k$ we may ask the critic to find a function $q_k \in \mathcal{Q}$ minimizing $$J + \frac{\lambda}{N} \sum_{i = 1}^N \left(q(s_i, a_i) + J - r(s_i, a_i) - E_{a \sim \pi_k}[q(s_i', a)]\right)^2,$$ where $\mathcal{Q}$ is some suitable class for function approximation, $\lambda$ is a Lagrange multiplier which can be tuned to control the degree of pessimism, and $\frac{1}{N} \sum_{i = 1}^N \left(q(s_i, a_i) + J - r(s_i, a_i) - E_{a \sim \pi_k}[q(s_i', a)]\right)^2$ is the average squared TD error. The main challenge we see here, which is not present in the discounted setting, is the presence of the additional parameter $J$ meant to estimate the average reward of policy $\pi_k$. Ideally, $J$ should depend on $q$ in such a way that $J(q_k)$ is a pessimistic estimate of the average reward of policy $\pi_k$. We leave this interesting problem for future work.
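
To make the shape of this relaxed objective concrete, below is a minimal PyTorch sketch, assuming discrete actions and a hypothetical `q_net(s)` interface that returns Q-value estimates for all actions; the function and argument names are ours, not the paper's.

```python
import torch

def relaxed_critic_loss(q_net, J, batch, pi_k, lam):
    # Sketch of the Lagrangian-relaxed critic objective described above.
    # q_net(s) -> (N, |A|) tensor of Q-value estimates (hypothetical interface)
    # pi_k(s)  -> (N, |A|) tensor of action probabilities of the current actor
    # J        -> scalar tensor estimating the average reward of pi_k
    # lam      -> Lagrange multiplier controlling the degree of pessimism
    s, a, r, s_next = batch                               # offline transitions (s_i, a_i, r_i, s_i')
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # q(s_i, a_i)
    v_next = (pi_k(s_next) * q_net(s_next)).sum(dim=1)    # E_{a ~ pi_k}[q(s_i', a)]
    td = q_sa + J - r - v_next                            # average-reward TD error
    return J + lam * td.pow(2).mean()                     # J + (lam / N) * sum of squared TD errors
```

Minimizing this jointly over $(q, J)$ would play the role of the critic step; as noted above, the open question is how to couple $J$ to $q$ so that $J(q_k)$ remains a pessimistic estimate.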

While weaker than uniform ergodicity, the unichain assumption may not always hold. As mentioned in the section on future directions, it would be interesting to see how this affects algorithm design and analysis.

There are two main difficulties we come across without the unichain assumption. The first is that, in this case, $J^{\pi}$ need not be constant and could depend on the initial state. The formulation of the optimization problem (6) may need to change to reflect this. The second is that, without Assumption 3.1, the true Q-functions may be unbounded. The stability of our algorithm relies on the members of the function class $\mathcal{Q}(B_w)$ being bounded. If the true Q-functions are unbounded, but the elements of $\mathcal{Q}(B_w)$ are bounded, it may be unreasonable to expect the completeness and approximate realizability assumptions to hold. We will add this discussion in the revision.

Comment

Thank you for the response!

Review
Rating: 4

The authors proposed an actor-critic offline reinforcement learning algorithm for average-reward MDPs, which is the first to achieve $O(\epsilon^{-2})$ sample complexity. As a paper focusing on theoretical guarantees, the authors provided detailed proofs and a discussion of the relationship with previous work.

Strengths and Weaknesses

Quality: 3. The main paper quality is good. The authors provide an actor-critic framework for the offline RL problem and decompose the sub-optimality into actor and critic parts. For the actor part, the authors adopt the natural policy gradient method and achieve an optimization error convergence of order $O(1/\sqrt{K})$. For the critic part, the authors apply a Bellman-equation regression with pessimism parameters. The proof sketch and details are solid. Moreover, the authors discuss the assumptions and previous work in depth, which is helpful.

Clarity: 2. Although the algorithm and proofs in this paper are solid, some errors or typos may lead to misunderstanding.

(1) In Section 4.2 (policy update), it would be clearer to state that the update follows mirror ascent with KL divergence (natural policy gradient) rather than general gradient ascent.
(2) In the proof of Lemma B.3, the notations $\bar{\omega}_k$ and $\omega_k^*$ appear to be mixed up, which makes the proof of Lemma B.3 hard to read.
(3) In the proof of Theorem 5.1, line 610, the optimization error term is missing.

Significance: 3. The authors discuss the convergence rate of offline RL in the average-reward MDP setting, an important topic in RL theory, and close an existing gap.

Originality: 2. Although the discussed problem is important to the community, the novelty seems somewhat limited. The actor-critic framework follows [Zanette 2021]. I agree that [Zanette 2021] discusses the episodic setting and that the induction technique is invalid in the average-reward setting. However, the idea of the proof procedure (Lemmas 6.2 and 6.3) seems similar to the proof of the FOPO algorithm in [Wei et al., 2021]. The only difference is that [Wei et al., 2021] discusses online problems with optimism, whereas this paper discusses offline problems with pessimism.

It would be great to highlight the contribution of the technical part of this paper if the authors could answer the following question.

The authors mentioned that [Gabbianelli et al., 2024] is the only paper that also discusses offline RL under average-reward MDPs. However, it only achieves $O(\epsilon^{-4})$. Could the authors explain what new idea or technique is adopted to improve the result to $O(\epsilon^{-2})$?

Questions

Please refer to the Clarity and Originality parts for the questions. This paper is technically solid. I would increase the score if the authors could address the questions mentioned above.

Limitations

yes

Justification for Final Rating

Reason to borderline accept: closes a gap in RL theory in the offline setting, which is important to the community. Reason not to accept or strong accept: although the proposed algorithm achieves a faster convergence rate, the proof technique is not groundbreaking.

Formatting Issues

None

Author Response

Thank you very much for your time and detailed review. Please see our responses to your questions below.

Clarity

There are a few typos in the proof. Thank you for pointing these out!

The $w_k^*$ appearing in the proof of Lemma B.3 is a typo. It should be $\bar{w}_k$ throughout the proof. We have corrected this.

The second inequality in line 610 should read $$\frac{1}{K} \sum_{k = 1}^{K} J^{\pi} - \hat{J}_k + \kappa_{B_w, B_{\theta}} \leq 2B_w \sqrt{\frac{\log A}{K}} + 2\beta \lVert\phi^{\mu^{\pi}}\rVert_{\hat{\Lambda}^{-1}} + \varepsilon_{B_w, B_{\theta}} + \kappa_{B_w, B_{\theta}}.$$ We have corrected this as well.

We will also update section 4.2 to be clear that the policy update follows mirror ascent with KL divergence regularization.
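
For reference, one step of mirror ascent with KL regularization and gradient $\hat{q}_k(s, \cdot)$ can be written explicitly as (a standard identity; the step size $\eta$ is our notation here) $$\pi_{k+1}(\cdot \mid s) = \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta \langle p, \hat{q}_k(s, \cdot) \rangle - \mathrm{KL}\big(p \,\|\, \pi_k(\cdot \mid s)\big) \Big\}, \qquad \text{i.e.} \qquad \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \exp\big(\eta\, \hat{q}_k(s, a)\big).$$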

Originality

However, it seems that the idea of the proof procedure (Lemmas 6.2 and 6.3) is similar to the proof of the FOPO algorithm in [Wei et al., 2021]. The only difference is that [Wei et al., 2021] discusses online problems with optimism, and this paper discusses offline problems with pessimism.

Although our analysis is partly inspired by [Wei et al., 2021], we would like to point out some notable differences below.

  1. Unlike [Wei et al., 2021], we do not require the perfect linear MDP assumption. Specifically, we only assume the approximate realizability and restricted closedness properties. Additional work is required for the analysis to handle this level of generality. For instance, in the proof of Lemma 6.2, we cannot assume that $w_k^{\star}$ parameterizes the true Q-function $q^{\pi_k}$ exactly. So the function $q_k^{\star}(s, a) = \phi(s, a)^{\top}w_k^{\star}$ need not be a solution to the Bellman equation. This results in the additional term $\hat{\Lambda}^{-1} \sum_{i = 1}^{N} \phi(s_i, a_i) \Delta_{k}(s_i, a_i)$ in equation (17), where $\Delta_k(s, a) = q_k^{\star}(s, a) + J_k^{\star} - T^{\pi_k}q_k^{\star}(s, a)$ is the irreducible Bellman error due to realizability holding only approximately. When arguing that $(w_k^{\star}, J_k^{\star})$ is feasible for the optimization problem (6) with high probability, this term needs to be accounted for. This is why the additional factor of $\kappa_{\mathcal{Q}(B_w), \Pi} \sqrt{N}$ appears in the definition of the confidence parameter $\beta$. A similar error term also appears in Lemma 6.3 due to the restricted closedness assumption.

  2. Because our algorithm is policy-based, our analysis is built on the extended performance difference lemma (Lemma 6.1). This differs from the proof structure in [Wei et al., 2021] since their algorithm takes greedy actions. More specifically, the way we decompose the sub-optimality in step 3 of the proof sketch is quite different. We decompose the sub-optimality into an optimization term and a bias term. Our Lemma 6.3 focuses on controlling the bias in the estimate of the Q-function for each policy $\pi_k$ encountered during the algorithm's execution. Such a statement does not appear in [Wei et al., 2021] because their algorithm is designed to approximate the solution to the Bellman optimality equation in each episode, rather than the solution for a fixed policy. They then need to argue that the bias in their estimate of the optimal Q-function is not too large. Therefore, their regret analysis is based entirely on the Bellman optimality equation and martingale difference concentration rather than a performance difference lemma.

The authors mentioned that [Gabbianelli et al., 2024] is the only paper that also discusses offline RL under average-reward MDPs. However, it only achieves $O(\epsilon^{-4})$. Could the authors explain what new idea or technique is adopted to improve the result to $O(\epsilon^{-2})$?

Thank you for the question. To explain our new technique, we first briefly summarize the main ideas of [Gabbianelli et al., 2024] using their notation. In their work, they assume a linear MDP structure, realized by $\Psi \in R^{d \times |S|}$ and $w \in R^d$ satisfying $$r(s, a) = \langle \phi(s, a), w \rangle, \qquad P(s'|s, a) = \langle \phi(s, a), \psi(s') \rangle,$$ where $\psi(s')$ is the $s'$ column of $\Psi$. To adapt to the average-reward case, they also assume there is some $\varrho \in R^d$ with $\langle \phi(s, a), \varrho\rangle = 1$. Assuming that state-action pairs are drawn i.i.d. from some distribution $\mu_{B}$, the population covariance is $\Lambda = E_{\mu_B}[\phi(s, a)\phi(s, a)^{\top}]$. Under their linear MDP assumption, they observe that the average reward (which they call $\rho^{\pi}$) can be written as $\rho^{\pi} = \langle \lambda^{\pi}, w \rangle$, where $\lambda^{\pi} = E_{(s, a)\sim \mu^{\pi} \otimes \pi}[\phi(s, a)]$. They then formulate the problem as a linear program with Lagrangian $$f(\beta, \pi; \rho, \theta) = \rho + \langle \beta, \Lambda[w + \Psi v_{\theta, \pi} - \theta - \rho \varrho]\rangle.$$ Here $v_{\theta, \pi}(s) = \sum_{a}\pi(a|s)\langle \phi(s, a), \theta \rangle$ and $\beta = \Lambda^{-1}\lambda$ is a reparameterization used so that knowledge of $\Lambda$ is not necessary to run the algorithm. Their main idea is to solve the saddle-point problem $$\min_{\rho, \theta}\max_{\beta, \pi} f(\beta, \pi; \rho, \theta).$$ Note, importantly, that this is a population-level problem. Since they only assume access to i.i.d. samples from $\mu_{B}$, they solve the problem via stochastic gradient descent-ascent.

The main bottleneck in the analysis of [Gabbianelli et al., 2024] is their algorithm's double-loop structure: in each outer loop $t$, an inner loop solves $\min_{\rho, \theta} f(\beta_t, \pi_t; \rho, \theta)$ nearly exactly using $O(\varepsilon^{-2})$ samples. Since they also need $O(\varepsilon^{-2})$ outer-loop iterations in total, they can only obtain the $O(\varepsilon^{-4})$ upper bound.
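
In other words, the sample cost compounds multiplicatively: $O(\varepsilon^{-2})$ outer iterations, each consuming a fresh batch of $O(\varepsilon^{-2})$ samples for its inner loop, gives a total of $O(\varepsilon^{-2}) \times O(\varepsilon^{-2}) = O(\varepsilon^{-4})$ samples.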

The key reason for the sample complexity improvement in our paper is the efficient use of the entire dataset for the critic's optimization problem (6). Namely, the data to construct this problem only needs to be sampled once and is then re-used in each iteration of our algorithm. This avoids the inner loop that uses additional samples. The only issue that needs to be dealt with is the statistical correlation between iterates of the algorithm arising from data re-use. This correlation is handled in the analysis using the $\epsilon$-net covering technique.

We would also like to point out that the linear program structure of [Gabbianelli et al., 2024] crucially relies on the linear MDP structure. Our use of the constrained optimization approach is the key reason we can relax the assumptions to cover restricted closedness and approximate realizability.

Comment

The authors have clearly answered my questions: (1) the differences from [Wei et al., 2021] that I had not noticed, and (2) the reason why the proposed algorithm achieves a much faster convergence rate than [Gabbianelli et al., 2024].

I highly encourage the authors to explain the advantage over [Gabbianelli et al., 2024] in their revised version. I will keep my positive score.

Review
Rating: 5

The authors propose an actor-critic method for efficient offline RL in average-reward MDPs with an intractably large or infinite number of states. They consider the linear function approximation framework, and make assumptions on the expressivity of their function approximator including approximate $q^{\pi}$-realizability and Bellman completeness under any policy. Inspired by the actor-critic algorithm of Zanette et al. [2021] for the finite-horizon setting, for steps $k = 1, \dots, K$ their algorithm obtains the critic's weight $w_k$ as the solution of an optimization problem, then updates the actor parameter $\theta_k$ such that the resulting actor policy in any state is equivalent to one step of mirror ascent with KL regularization and the critic evaluation $\hat{q}_k(s, \cdot) = \langle \phi(s, \cdot), w_k \rangle$ as the gradient. Similar to Zanette et al. [2021], their optimization problem is constructed in each step using samples from the offline dataset, and constrains the critic weights to be equivalent to the perturbed solution of a ridge regression problem, with the sample estimate of $q^{\pi_k}$ as the target. Furthermore, in the absence of the backward induction property enjoyed by finite-horizon MDPs, and the presence of the cumulative return $J^{\pi}$ in the Bellman expression of Q-functions in the average-reward MDP setting, they propose two modifications to the optimization problem: 1) a new decision variable $J$ to estimate $J^{\pi_k}$ and 2) a direct parametrization of an estimator for the state-value function $v^{\pi_k}$ based on the critic's weights. With these modifications, they prove an $O(\varepsilon^{-2})$ sample complexity guarantee under the weakest known data collection and coverage conditions. Precisely, they do not require that the dataset is generated i.i.d. or by a single behaviour policy, and they only require the dataset to properly cover a single direction in the feature space. Their sample complexity guarantee also scales linearly (rather than quadratically) with the coverage parameter.

Strengths and Weaknesses

As is evident in the long summary, I did enjoy reading the paper. The problem of offline RL in infinite-horizon MDPs with intractably many states, under weaker conditions like $q^{\pi}$-realizability and Bellman completeness, is indeed a relevant open problem in the offline RL literature. The paper addresses this problem with an $O(\varepsilon^{-2})$ sample complexity guarantee, and their bound scales linearly with the single-direction coverage parameter, rather than quadratically, which is more common in related works.

There is no notable weakness. However, I would like to highlight that the algorithm is not fully computationally efficient. It is true that Algorithm 1 can be efficiently implemented with the help of an interior-point method to solve the optimization problem in step 5. While interior-point methods are highly precise, they are not exact. That said, the theoretical guarantees provided in the paper hold under the assumption that there exists an optimization solver which provides an exact solution to step 5. I strongly recommend that the authors add a comment to clarify this, especially in the abstract and other parts of the main text with explicit claims about computational efficiency.

Questions

I have just one question:

  • Do the authors assume that the initial state is fixed?

Some typos:

  1. L286: “However, their algorithm’s...” should be “However, their algorithm...”
  2. L299: “...while we study the more general function...” should be “...while we study the more general linear function...”
  3. After L335, the second inequality: $(s, a) \sim \mu^{\pi}$ should be $s \sim \mu^{\pi}$

Limitations

NA

Justification for Final Rating

I maintain my positive score of the paper due to the strengths listed above and a satisfactory discussion with the authors. The paper addresses an important problem in offline RL, with adequate theoretical justification and very few typos. Furthermore, the authors acknowledged my comments on, and agreed to clarify, potentially misleading claims about the computational efficiency of their method.

Formatting Issues

None

Author Response

We thank the reviewer for their time and positive review. Please see our response below.

However, I would like to highlight that the algorithm is not fully computationally efficient. It is true that Algorithm 1 can be efficiently implemented with the help of an interior-point method to solve the optimization problem in step 5. While interior-point methods are highly precise, they are not exact. That said, the theoretical guarantees provided in the paper hold under the assumption that there exists an optimization solver which provides an exact solution to step 5. I strongly recommend that the authors add a comment to clarify this, especially in the abstract and other parts of the main text with explicit claims about computational efficiency.

Thank you for raising this point. While we assumed the use of exact solutions, inexact solutions still only incur an additive error proportional to the accuracy of the solution. Specifically, suppose that the optimization problem (6) is solved to within accuracy $\delta$, so that $J_k$ is within $\delta$ of the optimal solution and the constraint violation is also at most $\delta$. The computational complexity of many existing algorithms to obtain such a solution scales as $\log \delta^{-1}$ [Nesterov and Nemirovskii, 1994]. Additional error will then be incurred in Lemma 6.2, where we will instead have $$J_k \leq J_k^* + \delta \le J^{\pi_k} + \kappa_{\mathcal{Q}(B_w), \Pi} + \delta, \quad \forall k \in [K].$$ There will also be an error incurred in Lemma 6.3 due to the constraint violations, which can be updated to read $$\left|E_{(s, a) \sim \mu} [q_k(s, a) - \phi(s, a)^{\top}\bar{w}_k]\right| \leq 2(\beta + \delta) \lVert\phi^{\mu}\rVert_{\hat{\Lambda}^{-1}} + \delta.$$ Here the factor of $2\delta$ multiplying $\lVert\phi^{\mu}\rVert_{\hat{\Lambda}^{-1}}$ is due to the violation of the quadratic constraints, and the additive factor is due to the violation of the linear constraint. We will add a clarifying statement in the main text accordingly.
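
As a rough illustration (this is an informal extrapolation from the two modified lemmas above, not a formal restatement of Theorem 5.1), substituting these corrections into the decomposition from line 610 would yield a bound of the form $$\frac{1}{K} \sum_{k = 1}^{K} J^{\pi} - \hat{J}_k + \kappa_{B_w, B_{\theta}} \leq 2B_w \sqrt{\frac{\log A}{K}} + 2(\beta + \delta) \lVert\phi^{\mu^{\pi}}\rVert_{\hat{\Lambda}^{-1}} + \varepsilon_{B_w, B_{\theta}} + \kappa_{B_w, B_{\theta}} + O(\delta),$$ so an inexact solver only shifts the guarantee by terms vanishing with the solver accuracy $\delta$.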

Do the authors assume that the initial state is fixed?

We do not need to assume that the initial state is fixed. It may be random. Under Assumption 3.1, the average reward $J^{\pi}$ does not depend on the initial state. Therefore, the identity of the initial state, or its distribution if it is random, does not change the problem or our results.

Thank you for pointing out the typos! We will correct these in the revision.

[Yurii Nesterov and Arkadii Nemirovskii, 'Interior-Point Polynomial Algorithms in Convex Programming', Society for Industrial and Applied Mathematics, 1994]

Comment

Re: Discussion on Computational Efficiency of Alg. 1: Thanks for the concise response. However, my comment is regarding the authors' remark "To our knowledge, this is the first result with optimal $\varepsilon$-dependence in the offline average reward setting via a computationally efficient algorithm." in the abstract, and similar remarks on Algorithm 1 being "fully computationally efficient" in the main text. While the authors are free to discuss error propagation of an inexact (and fully computationally efficient) version of Algorithm 1, the above remarks are presently misleading due to my earlier comments.

Comment

Thank you for the clarification. We will remove the phrase "via a computationally efficient algorithm" from the last sentence of the abstract. We will also modify the main text to be clear that an implementation of the algorithm with exact solutions to the critic’s optimization problem is not fully computationally efficient. We will add an additional discussion about error propagation for an inexact, computationally efficient version in the revision.

Review
Rating: 4

The paper proposes a pessimistic actor–critic algorithm for offline reinforcement learning in infinite-horizon average-reward MDPs (AMDPs) with linear function approximation.

  • The critic obtains a lower-bound (pessimistic) estimate of the average reward and its policy gradient by solving a single linear program with convex quadratic constraints that represents the fixed-point Bellman equation, avoiding the step-by-step regressions common in episodic settings.
  • The actor performs a conservative policy-improvement step on a soft-max policy class. Repeating the alternation $K$ times and outputting a mixture of intermediate policies yields an $\varepsilon$-optimal policy (up to model misspecification error) with sample complexity $O(\varepsilon^{-2})$, the first optimal-rate result for offline AMDPs.
  • The theory relies on (i) unichain dynamics with bounded Q-span, and (ii) approximate realizability + Bellman restricted closedness of the linear feature class.
  • Compared with the only prior provably efficient offline AMDP method, PDOR, the new algorithm improves the $\varepsilon$-dependence from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-2})$ and removes the need for exact linear MDP structure.

Strengths and Weaknesses

Strengths

  • Rigorous proofs culminating in Theorem 5.1 with explicit finite-sample bounds
  • Clear comparison against prior rates (PDOR, discounted-setting results).
  • Tackles the average-reward offline regime, important for continuing tasks (inventory control, admission control) where resets are impossible
  • Optimal-rate guarantees were missing in this setting.

Weaknesses

  • Results hinge on linear approximation
  • Pessimism paradigm already popular in episodic/discounted RL

Questions

The conclusion suggests extending to “general function approximation.” What obstacles prevent directly replacing the linear critic with a neural network trained by convex-constrained regression?

Limitations

yes

Formatting Issues

no

Author Response

We thank the reviewer for the positive feedback.

The main obstacle in directly replacing the linear critic with a neural network is the method we use to enforce pessimism. Specifically, the variable $\xi$ in (6) is constrained to a confidence ellipsoid $\lVert \xi \rVert_{\hat{\Lambda}} \leq \beta$. The parameter $\beta$ quantifies the uncertainty in the dataset. This is determined precisely using martingale concentration results from the linear MDP literature. It is not yet clear how to extend this technique to more general models such as neural networks.
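
For readers less familiar with this construction, here is a minimal NumPy sketch of the quantities involved; the ridge term and the function name are our own illustrative choices, not the paper's exact definitions.

```python
import numpy as np

def within_confidence_ellipsoid(phi_data, xi, beta, ridge=1.0):
    """Check the constraint ||xi||_{Lambda_hat} <= beta appearing in problem (6).

    phi_data : (N, d) array of features phi(s_i, a_i) from the offline dataset
    xi       : (d,) candidate pessimism perturbation
    beta     : confidence radius quantifying the uncertainty in the dataset
    ridge    : regularizer keeping Lambda_hat invertible (illustrative assumption)
    """
    d = phi_data.shape[1]
    Lambda_hat = ridge * np.eye(d) + phi_data.T @ phi_data  # empirical covariance matrix
    return float(np.sqrt(xi @ Lambda_hat @ xi)) <= beta     # ellipsoid membership test
```

The point of the linear analysis is that $\beta$ can be set via martingale concentration so that the relevant parameters lie in this ellipsoid with high probability; for a neural-network critic, no analogous data-dependent radius is currently known.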

There are methods in the literature that implement pessimism for more general function approximation. However, they need to sacrifice the precise uncertainty quantification that can be achieved in the linear setting. Two relevant papers mentioned in the related work section are those of [Xie et al., 2021] and [Cheng et al., 2022], which focus on the discounted setting. Algorithm 1 in [Xie et al., 2021] alternates between policy optimization steps and a critic which chooses a pessimistic approximation model from a general function class. When specialized to the case of linear function approximation, their algorithm can be viewed as solving a Lagrangian relaxation of a constrained problem similar to ours, but with only the linear constraint. Since they work with more general function approximation, they are not able to quantify the uncertainty in the dataset to the same level of precision as we are in our paper. Instead, the amount of pessimism is determined by the Lagrange multiplier, which is a tunable hyper-parameter. The algorithm is very general, but when specialized to the linear setting, the lack of precise uncertainty quantification results in convergence rates of only $N^{-1/3}$ for $N$ data samples.

It would be interesting to see if similar ideas work for average-reward MDPs, but it does not seem trivial due to the presence of the additional variable $J$ in the average-reward setting. We leave these interesting questions for future work.

Comment

Thank you for the clarification. I will maintain my positive score.

Final Decision

This paper studies an offline actor-critic algorithm for policy optimization on average-reward MDPs with linear (value) function approximation using the pessimism principle. All the reviewers were positive about the contributions of the paper, despite noting some limitations (e.g. reliance on linear approximation, lack of computational efficiency of the procedure). The authors are recommended to include the minor clarifications that they provided to reviewers during the rebuttal/discussion phase in the camera-ready version.