PaperHub
6.6 / 10
Poster · 4 reviewers (ratings: 4, 4, 3, 3; min 3, max 4, std 0.5)
ICML 2025

Actor-Critics Can Achieve Optimal Sample Efficiency

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Actor-critic algorithms with general function approximation can achieve $\sqrt{T}$ regret and $1/\epsilon^2$ sample-complexity without assuming reachability or coverage.

Abstract

Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample-complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/ \epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, and $\mathcal{A}$ is the action space. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, where we show that initializing the critic with offline data yields sample efficiency gains, and also provide a non-optimistic provably efficient actor-critic algorithm, addressing another open problem in the literature. Numerical experiments support our theoretical findings.
Keywords

actor-critic, policy gradient, strategic exploration, optimism, reinforcement learning, sample efficiency

Reviews and Discussion

Review (Rating: 4)

Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. In this paper, the authors introduce a novel actor-critic algorithm that attains near-optimal sample-complexity and regret under MDPs with low Bellman eluder dimension. The algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. They also extend their algorithm to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL.

Questions for Authors

NA

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The proof looks correct.

Experimental Design and Analysis

NA.

Supplementary Material

No.

Relation to Prior Work

This paper is the first to achieve near-optimal sample complexity and regret for actor-critic algorithms under MDPs with low Bellman eluder dimension.

Missing Important References

NA

Other Strengths and Weaknesses

Strengths:

  1. The motivation for actor-critic algorithms under general function approximation is strong.

  2. The paper is well-written in general.

Weaknesses:

  1. It would be better if there were some discussion of lower bounds. Are there any lower bounds for this setting? How large is the gap between the upper bounds in this paper and the lower bounds?

  2. The algorithm is not computationally efficient.

Overall, I think the strengths outweigh the weaknesses. I lean towards acceptance.

Other Comments or Suggestions

NA

Author Response

We thank the reviewer for the comments and suggestions! We are glad that the reviewer thinks our study is well motivated and our paper is well written. We provide answers to the questions below. Please let us know if you have further questions or concerns!

Revisions

We have revised our paper to mitigate the computational inefficiency issue by leveraging an offline dataset with good coverage to bypass the need for optimism, and we have included numerical experiments based on the suggestions from reviewer TW2G at this link (https://github.com/anonymoosemoose/actorcriticicml2025/blob/main/actor_critic_figs.pdf).

Response to Weaknesses and Questions

Q1: It would be better if there were some discussion of lower bounds. Are there any lower bounds for this setting? How large is the gap between the upper bounds in this paper and the lower bounds?

Unfortunately, there do not appear to be lower bounds for general function approximation. This has affected other works in a similar way: for example, both Agarwal et al. (2022) and Zhao et al. (2024) appeal to lower bounds for linear MDPs within their papers on online RL with general function approximation. The $\sqrt{T}$ regret is optimal in both papers, as well as in ours, even though in the linear case our algorithms have a worse dependence on $H$ by a factor of $H^2$ (though this is not our focus; our focus is on the dependence on $T$). Achieving a rate that is minimax-optimal in the case of linear MDPs can be done by adapting the algorithm of Zhao et al. (2024) and making the necessary adjustments, though this is challenging and best left for future work.

Q2: The algorithm is not computationally efficient.

Thanks for raising this concern! We have managed to address this within the hybrid RL setting. More specifically, recent work (Song et al., 2023) shows that in hybrid RL settings with good offline coverage, optimism (which can often be computationally inefficient) isn’t necessary, and FQI is provably efficient. We extend this to the actor-critic setting with Algorithms 1H and 2H, corresponding to Algorithms 1 and 2 in our paper, but without computing confidence sets. When ERM is efficient, these become computationally practical. We show that:

  • Algorithm 1H achieves a regret of
$O\left(\sqrt{H^4 T \log|\mathcal{A}|} + \sqrt{\beta^\pi H^4 c_{\text{off}}^* T^2/N_{\text{off}}} + \sqrt{\beta^\pi H^4 T\,\text{SEC}(\mathcal{F}, \Pi, T)}\right),$
  • Algorithm 2H achieves a regret of
$\text{Reg}^*(T) = O\left(\sqrt{d H^5 T \log T \log|\mathcal{A}|} + dH^3 \log T + \sqrt{\beta^* H^4 c_{\text{off}}^* T^2/N_{\text{off}}} + \sqrt{\beta^* H^4 T\,\text{SEC}(\mathcal{F}, \Pi, T)}\right),$

where $c_{\text{off}}^*$ is the Bellman-type single-policy concentrability, $\beta^\pi$ and $\beta^*$ are confidence-set widths scaling with the log-covering number, and $N_{\text{off}}$ is the number of offline samples. If $N_{\text{off}} = \Omega(T)$, we have $\sqrt{T}$ regret, and we incur only an additional factor of $\sqrt{c_{\text{off}}^*}$ as the price of omitting optimism. This improves upon Zhou et al. (2023), who required $1/\epsilon^6$ samples.

Experimental results. (https://github.com/anonymoosemoose/actorcriticicml2025/blob/main/actor_critic_figs.pdf)

  1. Optimism in linear MDPs. We compare Alg. 1 and Alg. 2 to a rare-switching version of LSVI-UCB on a linear MDP Tetris task. The condition required for Alg. 1 to work holds here, as we do not clip the Q-function estimates. We see that Alg. 1 performs better than LSVI-UCB (!), and Alg. 2 achieves $\sqrt{T}$ regret even though it performs slightly worse than LSVI-UCB.
  2. Deep hybrid RL. We compare Algs. 1H & 2H (with offline pretraining) to Cal-QL (Nakamoto et al., 2023). Alg. 2H takes a max over 10 randomly sampled actions from the policy to enhance exploration, as Cal-QL does (this can be viewed as a computationally efficient version of optimism, or alternatively as approximately targeting $\pi^*$); a brief sketch of this action-selection step is given below. It outperforms Cal-QL, which outperforms 1H. Neither uses pessimism. This supports the claim that hybrid RL can avoid both pessimism and full optimism while retaining strong performance.
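
For concreteness, here is a minimal sketch of the "max over sampled actions" step mentioned above. The `policy_sample` and `q_value` callables are hypothetical stand-ins for the trained actor and critic; this is an illustration of the trick, not the paper's implementation.

```python
import numpy as np

def select_action(policy_sample, q_value, state, num_candidates=10, rng=None):
    """Approximate-greedy action selection: draw a few candidate actions from
    the current policy and execute the one with the highest critic value."""
    rng = rng or np.random.default_rng()
    candidates = [policy_sample(state, rng) for _ in range(num_candidates)]
    values = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(values))]
```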

We thank the reviewer for their comments. If you believe that the revisions we have made have improved the paper, we would greatly appreciate it if you decided to increase your score.

  1. Agarwal et al. (2022), VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation.
  2. Zhao et al. (2024), A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation.
  3. Song et al. (2023), Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient.
  4. Nakamoto et al. (2023), Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
Review (Rating: 4)

This paper makes the following three key contributions in the analysis of actor-critic algorithms with general function approximation:

  1. Proposes an actor-critic algorithm, DOUHUA, which operates under the closure under truncated sum assumption. The regret bound is established as

    $O\left(\sqrt{H^4 T\log |\mathcal{A}|} + \sqrt{\beta H^4 T\,\text{SEC}(\mathcal{F},\Pi, T)}\right).$
  2. Introduces an alternative actor-critic algorithm, NORA, which does not rely on the closure under truncated sum assumption. The regret bound is derived as

    $O\left(\sqrt{dH^5 T\log T\log |\mathcal{A}|} + dH^3\log T + \sqrt{\beta H^4 T\,\text{SEC}(\mathcal{F},\Pi, T)}\right).$
  3. Extends NORA to the hybrid RL setting and establishes a regret bound that is slightly tighter than that in the online setting.

Questions for Authors

As I mentioned in the weaknesses, could the authors attempt to discuss the estimation of the lower bound, potential estimation methods, or the connections between actor-critic algorithms and traditional online RL algorithms?

Claims and Evidence

To the best of my knowledge, the claims in the paper are well-supported.

Methods and Evaluation Criteria

This paper is purely theoretical and therefore does not involve evaluation criteria.

Theoretical Claims

I did not verify the proofs in the paper line by line, but after going through the proofs of the main theorems (Thm 1, Thm 2), I believe there are essentially no errors. The paper provides clear guidance on the proof ideas, making it easy to understand.

As for the other parts, including the proof of Thm 3 and the related discussions, I have also reviewed them and find the reasoning to be correct. The mathematical derivations are solid.

Experimental Design and Analysis

This paper is purely theoretical and therefore does not involve experiments or analyses.

Supplementary Material

I read all the supplementary material.

I sincerely appreciate that, in addition to the proofs of the main theorems, the authors have provided highly detailed discussions in the appendix. While most of the proofs are not particularly complex, they offer valuable insights that extend beyond the main theorems themselves, providing significant additional value to the reader.

I believe Appendix E plays an important role in the paper and could be appropriately referenced in the main text. It offers a glimpse into the authors' thought process in designing the algorithm and also clarifies some questions that arose while reading the main text.

Relation to Prior Work

The primary contribution of this paper is establishing the complexity under the general function approximation setting, which closely aligns with the complexity under the linear MDP setting (with $\log |\mathcal{F}|$ corresponding to $d^2$), thereby filling a gap in this research area.

Additionally, the paper introduces an analytical approach that constrains the overall complexity by limiting the number of critic updates, laying a solid foundation for future research.

Missing Important References

To my knowledge, no.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-structured. In addition to presenting the main theorems, it includes many insightful discussions, making it very reader-friendly.
  2. The paper provides a proof sketch of the main theorems in the main text, with a clear presentation and well-structured reasoning.
  3. Beyond purely theoretical discussions, the paper also attempts to use the main theorems to explain empirical results. Although no validation experiments were conducted for these inferences, the effort to integrate theoretical analysis with experiments is appreciated.

Weaknesses:

From my perspective, there are no significant flaws in the paper as a theoretical study. One possible improvement would be to include discussions on lower bounds and provide more intuitive insights into the relationship between general function approximation and the linear case. Additionally, it would be helpful to clarify the key differences or connections between actor-critic algorithms and general online RL algorithms.

Other Comments or Suggestions

N/A.

Author Response

We thank the reviewer for the helpful comments and suggestions on our paper! We also thank the reviewer for finding our paper well-structured and clearly presented, and for noting the detailed discussions we have provided in the appendix in addition to the main theorems. We have addressed your questions below. Please feel free to check and let us know if you have any additional questions!

Experiments

We have revised our paper and included numerical experiments based on the suggestions from reviewer TW2G in this link (https://github.com/anonymoosemoose/actorcriticicml2025/blob/main/actor_critic_figs.pdf). Given the comments you’ve made on the effort to integrate theoretical analysis with experiments, we thought that you’d appreciate them. Here is a quick summary:

  1. Optimism in linear MDPs. We compare Alg. 1 and Alg. 2 to a rare-switching version of LSVI-UCB on a linear MDP Tetris task. The condition required for Alg. 1 to work holds here, as we do not clip the Q-function estimates. We see that Alg. 1 performs better than LSVI-UCB (!), and Alg. 2 achieves $\sqrt{T}$ regret even though it performs slightly worse than LSVI-UCB.
  2. Deep hybrid RL. We compare Algs. 1H & 2H (with offline pretraining) to Cal-QL (Nakamoto et al., 2023). Alg. 2H takes a max over 10 randomly sampled actions from the policy to enhance exploration, as Cal-QL does (this can be viewed as a computationally efficient version of optimism, or alternatively as approximately targeting $\pi^*$). It outperforms Cal-QL, which outperforms 1H. Neither uses pessimism. This supports the claim that hybrid RL can avoid both pessimism and full optimism while retaining strong performance.

Response to Weaknesses and Questions

Q1: One possible improvement would be to include discussions on lower bounds and provide more intuitive insights into the relationship between general function approximation and the linear case.

Thanks for raising this point! Lower bounds are sparse for classes with general function approximation. To illustrate, both Agarwal et al. (2022) and Zhao et al. (2024) appeal to lower bounds for linear MDPs within their papers on online RL with general function approximation. The $\sqrt{T}$ regret is optimal in both papers, as well as in ours, even though in the linear case our algorithms have a worse dependence on $d$ and $H$ (though this is not our focus; our focus is on the dependence on $T$).

The linear MDP case is a special instance of general function approximation where the Q-functions are almost closed under truncated sums. This allows Sherman et al. (2024) and Cassel and Rosenberg (2024) to achieve $\sqrt{T}$ regret by bypassing the truncation and applying rare-switching to account for the bonus functions not being closed under truncated sums. We will include this discussion in the paper.

Q2: Additionally, it would be helpful to clarify the key differences or connections between actor-critic algorithms and general online RL algorithms.

Actor-critic methods are a specific subset of online RL algorithms that separately learn a policy (the "actor") and a value function (the "critic") in real time. In contrast, general online RL encompasses any algorithm that updates its learning parameters from streaming experience; such algorithms can have actor components, critic components, both, or neither.
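
To make the distinction concrete, here is a purely illustrative sketch of a generic online actor-critic loop (not the algorithm from the paper); `env`, `policy`, and `critic` are hypothetical objects whose methods stand in for the environment interface, the critic regression, and the mirror-ascent actor update.

```python
def online_actor_critic(env, policy, critic, num_episodes, eta=0.1, lr=0.1):
    """Generic online actor-critic loop (illustrative, not the paper's algorithm):
    the critic is regressed toward value targets from fresh data, and the actor
    is improved against the current critic via a mirror-ascent step."""
    for _ in range(num_episodes):
        trajectory, s, done = [], env.reset(), False
        while not done:
            a = policy.sample(s)                  # actor: draw an action
            s_next, r, done = env.step(a)         # interact with the environment
            trajectory.append((s, a, r, s_next))
            s = s_next
        critic.update(trajectory, lr=lr)          # critic: value-function fit
        policy.mirror_ascent_step(critic, eta)    # actor: policy improvement
    return policy, critic
```

A general online RL method need not have this two-component structure; a purely value-based method, for instance, would keep only the critic update and act greedily with respect to it.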

Other Remarks

We also appreciate that the reviewer read Appendix E and found it helpful. We wanted to include it in the main paper, but had to omit it due to a lack of space. Given your comments, it seems that was a short-sighted decision, and we will include it in the main paper in the revision.

If you believe that these revisions and changes have improved the paper, we would greatly appreciate it if you decided to increase your score.

  1. Agarwal et al. (2022), VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation.
  2. Zhong and Zhang (2023), A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes.
  3. Zhao et al. (2024), A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation.
  4. Sherman et al. (2024), Rate-Optimal Policy Optimization for Linear Markov Decision Processes.
  5. Cassel and Rosenberg (2024), Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes.
  6. Nakamoto et al. (2023), Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning.
  7. Song et al. (2023), Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient.
Review (Rating: 3)

This paper mainly concerns the theoretical analysis of actor-critic-type algorithms in finite-horizon episodic MDPs. Beyond the linear MDP scenario, the work focuses on regret and sample-complexity analysis for the general function approximation case. To begin with, an algorithm named Double Optimistic Updates for Heavily Updating Actor-critics (DOUHUA) is proposed, and its regret and sample-complexity bounds are shown. However, these bounds depend on a corresponding log-covering number that may blow up linearly. Motivated by this gap, a second algorithm named No-regret Optimistic Rare-switching Actor-critic (NORA) is proposed, which is shown to achieve $\sqrt{T}$ regret under a specific assumption (only considering distributions that are Dirac deltas on a single state-action pair).

Questions for Authors

  • What are the assumptions on the state and action spaces?
  • For Equation 1, is the superscript $(t)$ missing for the functions?
  • Definition 1: is the sequence of Bellman errors a specific element of the auxiliary function class that is claimed to exist in Assumption 2?
  • Theorem 1: when there are two classes in the subscript of the covering number, i.e. $\mathcal{N}_{\mathcal{F}, (\mathcal{T}^{\Pi})^T\mathcal{F}}$, how is it defined?

Claims and Evidence

  • Assumption 3: assuming only Dirac deltas as candidate distributions can be a strong assumption.
  • Generalizability: as discussed in the part 'Quality of the regret bound', the SEC is further bounded by the Bellman eluder dimension, which may scale with $T$. The main regret result of $\sqrt{T}$ is not universal. Further evidence beyond linear/kernel MDPs would be preferable to support the statement 'function classes with low Bellman eluder dimension are ubiquitous', as the paper claims to solve general function settings.

Methods and Evaluation Criteria

Methods

  • Algorithm 1: as pointed out in the paper, DOUHUA is an extension of the earlier algorithm GOLF. According to the reviewer's understanding, in GOLF, at episode $t$, the behaviour policy $\pi^{(t)}$ is the greedy policy w.r.t. $f^{(t)}$, which was obtained by the argmax over the confidence set. Then $\pi^{(t)}$ is executed to collect online data. However, the order of updating policies and functions is different in DOUHUA. In particular, after computing the current function approximation $f^{(t)}$, the policy is not updated; instead, $\pi^{(t)}$ was already derived via the mirror-descent idea from the previous function approximation $f^{(t-1)}$. In a word, the current policy that collects new data is not derived from the current function. I wonder whether this is a misspecification (due to asynchronous updates) or a deliberate design choice in DOUHUA for its theoretical properties?

Theoretical Claims

  • I roughly checked the proof sketch in the main paper, but not the Appendix.
  • Given the confusion over several notations mentioned below, the reviewer is not sure about the unchecked parts.

Experimental Design and Analysis

N/A

Supplementary Material

Not yet reviewed.

Relation to Prior Work

  • Compared to former work on linear MDPs, the paper tries to solve the setting with general function approximation.

Missing Important References

None.

Other Strengths and Weaknesses

Significance

  • As mentioned above, 1) how ubiquitous function classes with low Bellman eluder dimension are and 2) whether the assumption of Dirac delta distributions is mild largely determine the significance.

Clarity

  • In the last sentence of Section 1.2, the citation format needs adjustment (similar issues appear several times).
  • In Line 140, it is recommended to explicitly point out that $t$ is the episode index.
  • Line 149: it can be confusing to use the superscript $f$ to denote optimality, as it is just a generic function.
  • Lines 149-150: this definition of the greedy policy is ambiguous due to the existence of multiple optimal policies.
  • Page 3: the definition of the covering number requires a metric that is not specified.
  • Definition 1: the $\epsilon, \epsilon'$ mentioned in the conditions never appear in the subsequent inequalities.
  • Lemma 1 & Theorem 1: to the reviewer's understanding, the term $\mathcal{N}_{\mathcal{F}_h^{(t_k)}}$ is never defined (although $\mathcal{N}_{\mathcal{F}_h^{(t, \pi^{(t)})}}$ is defined as part of the confidence set). In addition, although the lemma refers back to Theorem 1, $\mathcal{N}_{\mathcal{F}, (\mathcal{T}^{\Pi})^T\mathcal{F}}$ is not defined there either, making the insight of the 'log covering number blowing up' (an essential observation and motivation throughout the paper) not fully accessible.
  • Assumption 3: $d_{\text{BE}}(\cdot, \cdot, \cdot)$ is not explicitly defined, if based solely on the description in Definition 1.

Other Comments or Suggestions

  • It is recommended to include Algorithm 3 in the main paper.
Author Response

We thank the reviewer for the helpful and insightful comments! We are glad you recognize our work as achieving $\sqrt{T}$ regret for general function approximation under specific assumptions, extending beyond the linear setting considered in prior work. We have addressed your questions and incorporated your suggestions into our revision.

Computational Efficiency and Numerical Experiments

We have also constructed computationally efficient versions of Algorithms 1 and 2 that require only an offline dataset with good coverage, and we have conducted a set of numerical experiments at the request of reviewer TW2G. The guarantees for Algorithms 1H and 2H, as well as a summary of the numerical experiments, can be found in our response to reviewer TW2G, and the results of the numerical experiments can be found at this link (https://github.com/anonymoosemoose/actorcriticicml2025/blob/main/actor_critic_figs.pdf).

Response to Weaknesses and Questions

Weaknesses

  1. W1: Assumption 3 – Dirac deltas as distributional candidates

Thanks for this thoughtful observation. The Bellman Eluder (BE) dimension was introduced by Jin et al. (2021), who primarily considered two types of distributions:

  1. $\mathcal{D}_{\mathcal{F}}$: distributions induced by greedy policies $\pi^f$, and
  2. $\mathcal{D}_{\Delta}$: Dirac delta measures over state-action pairs.

They suggest an RL problem has low BE dimension if either variant is small. Our work aligns with this perspective:

  • Alg. 1 does not require rare-switching and scales with the SEC, a more general and often weaker measure than BE dimension.
  • Alg. 2 currently assumes low BE dimension under $\mathcal{D}_{\Delta}$, which is aligned with prior frameworks (e.g., Xiong et al., 2023). That said, this can be weakened to the $\ell_2$ eluder condition (Xiong et al., 2023) with nothing more than a change in notation. We present results in the BE framework for familiarity and generality (capturing linear MDPs, low-eluder-dimension classes, etc.). We'll clarify these distinctions in the revision, and are happy to make the generalization to the $\ell_2$ eluder condition.
  2. W2: Generalizability of regret bounds

We appreciate this concern. While the SEC can depend on $T$, it typically scales better than the BE dimension and is more general. Importantly, the BE dimension is known to scale as $\sqrt{T}$ at worst (not linearly), so the regret result for Alg. 1 remains meaningful. Though Alg. 2 assumes low BE dimension under $\mathcal{D}_{\Delta}$, this can be relaxed to the $\ell_2$ eluder condition with minimal modification. This class includes:

  1. tabular and linear MDPs
  2. low Eluder dimension classes
  3. linear mixture MDPs
  4. kernelized nonlinear regulators
  5. the SAIL condition
  6. undercomplete POMDPs.

In all these settings, Alg. 2 achieves $\sqrt{T}$ regret.

Methods

  1. M1: Policy not derived from the current function. This is mostly a notational choice for cleaner proofs. As in GOLF (Xie et al., 2022), the critic is chosen from the prior confidence set; in other works (e.g., Zhong & Zhang, 2023), the actor update is based on the current critic. We follow these conventions for simplicity, though users need not do so in practice.

Questions

  1. Q1: Assumptions on state and action spaces. There are no specific assumptions here: the state and action spaces can be continuous or large. General function approximation accommodates this flexibility.

  2. Q2: Superscript $(t)$ in Equation 1. Thanks for catching this! The summation should read $(s, a, r, s') \in \mathcal{D}_h^{(t)}$, not $\mathcal{D}$. We'll fix this in the revision.

  3. Q3: Definition 1 – Bellman error and auxiliary function class. Yes: the Bellman error $f_h - \mathcal{T} f_{h+1}$, with $f \in \mathcal{F}$, belongs to the auxiliary class $\mathcal{G}$. We'll clarify this in the text.

  4. Q4: Covering number notation in Theorem 1. The term $\mathcal{N}_{\mathcal{F}, (\mathcal{T}^{\Pi})^{T} \mathcal{F}}$ denotes the covering number of the union of the two function classes $\mathcal{F}$ and $(\mathcal{T}^{\Pi})^T \mathcal{F}$. We agree this should be stated more clearly and will revise accordingly.

We also thank the reviewer for the very thorough list of errata provided in the remarks on clarity, and will make the requested revisions. Let us know if further clarification is needed; we sincerely appreciate your thoughtful review! If our responses have successfully addressed your concerns, we would be sincerely grateful if you considered adjusting your score.

  1. Xiong et al. (2023), A General Framework for Sequential Decision-Making under Adaptivity Constraints.
  2. Jin et al. (2021), Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms.
  3. Zhong and Zhang (2023), A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes.
Reviewer Comment

Thank you for the detailed response! Please see the following follow-up questions:

SEC dependence on $T$

  • May I confirm, for all the example scenarios in the answer to W2, whether the SECs are (nearly) independent of $T$ or not?
  • And in general, given the dependence on $T$, how should one understand the regret in Theorem 1, as the second term in the sum can then dominate?

Asynchronous policy and evaluation update

  • Since the response suggests that the difference in the order of policy updates is only a notational issue, does this mean that the theoretical analysis is the same in both cases?

Assumptions on state/action spaces

  • Are the theoretical results verified in the continuous case?
  • For some steps in the algorithms, e.g. Algorithm 1 Step 7, how can the mirror descent update be carried out exactly? (Or is an approximation still needed in practice?)

Clarity

  • A further revision in terms of notations or auxiliary definitions would be helpful.
Author Comment

Thank you for the follow-up questions! We appreciate your sincere effort throughout this review process.

SEC dependence on $T$

  • That is correct for the first two categories. In tabular MDPs, this means that the SEC is no worse than $SA\log T$, in linear MDPs of dimension $d$ it is no worse than $d \log T$, and in MDPs with low eluder dimension it is no worse than the eluder dimension times a logarithmic factor in $T$.
  • Linear mixture MDPs are an easier case, where DOUHUA suffices. Cai et al. (2020), Provably Efficient Exploration in Policy Optimization, show that there is no blowup in the covering number, and a variant of LSVI-UCB can achieve $\sqrt{T}$ regret (which of course means that DOUHUA does too).
  • Within the remaining three, the most direct way is not to use the SEC to bound the regret. Instead, it is easier to use the $\ell_2$ eluder condition to do so. With this analysis (found in Equations G.11 to G.14 in Xiong et al. (2023), A General Framework for Sequential Decision-Making under Adaptivity Constraints), one can achieve a $\sqrt{T}$ bound that has no hidden factors in $T$ worse than a logarithmic factor.

The reason for such disparate approaches is largely the zoo of different complexity classes available in the general function approximation literature. The SEC and the $\ell_2$ eluder condition are among the most general conditions we know of. The former is better known in the literature, and also covers the case of bounded-coverability MDPs that the latter may not, while the latter subsumes the last three (somewhat exotic) settings and is the condition that allows us to control the switch cost in the proof of Theorem 2.

  • Regret in Theorem 1. The second term in the sum can dominate if the log-covering number of the policy class grows with $T$. This bound is best understood as one that is $\sqrt{T}$ in certain cases, such as tabular and linear mixture MDPs, but can be vacuous in general. As such, it serves as motivation for Algorithm 2. We will refine the writing in these two sections to make this clearer.

We hope that this resolves your questions regarding complexity classes and any residual dependence on $T$. We emphasize that the bound in Theorem 2 indeed achieves $\sqrt{T}$ regret for essentially all cases within the $\ell_2$ eluder condition class (after making the extensions mentioned in Equations G.11 to G.14 of Xiong et al. (2023)). Any cases where this does not occur are not due to the SEC, but to the (somewhat pathological) ones pointed out in Proposition 11 of Xie et al. (2023), The Role of Coverage in Online Reinforcement Learning, which correspond to the failure of the Bellman eluder dimension. As the rare-switching uses the $\ell_2$ eluder condition, this is not easy to resolve.

Asynchronous policy and evaluation update

Yes, the analysis is essentially the same when the actor update is based on the current critic. However, this involves some index-chasing that readers familiar with the GOLF analysis would not be very happy with.

Assumptions on state/action spaces

  • The theoretical results are verified in the continuous case. We implement the case with continuous states in our linear MDP experiment. The only difference with continuous actions is a notational change, where we would replace the sum in the normalizing constant of the policy with an integral. In practice, said integral can be computed via Monte Carlo sampling.
  • It is possible to implement this (almost) exactly in practice! One would store the sequence of past Q-functions in the case of DOUHUA, and the last Q-function in the case of NORA. Upon receiving a query to evaluate or sample from $\pi_h^{(t)}(s,a)$ for a given $(h,s,a)$ tuple, we would first compute $\exp\left(\eta \sum_{t=1}^T f_h^{(t)}(s,a)\right)$ for DOUHUA, and $\exp\left(\eta \cdot (t - t_{\text{last}})\cdot f_h^{(t)}(s,a)\right)$ for NORA. To generate a sample from this density, the normalizing constant can be computed exactly when there is a finite number of actions; otherwise a sample can be generated via MCMC, importance sampling, or rejection sampling (a minimal sketch of this procedure is given after this list).
  • Most practitioners prefer to work with parameterized policy classes, for which one needs either to perform on-policy sampling or to make an off-policy adjustment via the importance weights $\pi_\theta / \pi_b$. The former incurs a sample-complexity penalty, and the latter incurs instability. The mirror descent update we adopt in Step 7 bypasses this issue.
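
A minimal sketch of the stored-critic procedure described above, assuming a finite action set; `critics` is a hypothetical list of stored critics (each indexable by the step $h$ and callable on a state-action pair) and `eta` is the mirror-descent step size. This is an illustration, not the authors' code.

```python
import numpy as np

def policy_probs(critics, eta, h, s, actions):
    """Evaluate the DOUHUA-style softmax policy pi_h(. | s) from stored critics.

    critics: past critics f^(1), ..., f^(t); critics[i][h] maps
             (state, action) -> estimated Q-value.
    actions: finite list of candidate actions (with continuous actions, the
             normalizing sum below would be replaced by MCMC / importance sampling).
    """
    scores = np.array([sum(f[h](s, a) for f in critics) for a in actions])
    logits = eta * scores
    logits -= logits.max()            # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()    # exact normalizing constant (finite actions)

def sample_action(critics, eta, h, s, actions, rng=None):
    rng = rng or np.random.default_rng()
    probs = policy_probs(critics, eta, h, s, actions)
    return actions[rng.choice(len(actions), p=probs)]
```

For NORA, the sum over stored critics would be replaced by $(t - t_{\text{last}})$ times the last critic, as described above.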

Clarity

Absolutely. Thank you for your effort in pointing out improvements that can be made in terms of presentation and clarity! We will make the revisions requested.

If these responses have managed to address your remaining concerns, we would be very grateful if you considered increasing your score. Thank you for the effort you've put into the review process, we sincerely appreciate it!

Review (Rating: 3)

This paper introduces the first provably efficient actor-critic algorithms utilizing general function approximation. For cases where the policy class does not grow rapidly due to critic updates (the "easy case"), the authors propose a novel algorithm that combines the GOLF algorithm with mirror ascent from policy gradient methods. Under specific assumptions, the authors establish that the proposed algorithm achieves a sample complexity of $O(1/\epsilon^2)$ and a $\sqrt{T}$-regret guarantee. Furthermore, in scenarios where the policy class expands significantly with critic updates, the authors design an algorithm that limits critic updates to $O(\log T)$ while still attaining the same $O(1/\epsilon^2)$ sample complexity and $\sqrt{T}$-regret bound. The paper also extends this approach to the hybrid reinforcement learning (RL) setting, demonstrating that initializing the critic with offline data can enhance sample efficiency compared to methods relying solely on either offline or online data.

Questions for Authors

  1. Unlike Algorithm 1, Algorithm 2 assumes a bounded Bellman-Eluder dimension for the Dirac delta distribution. Despite this, the regret bound still includes an SEC term. Could you clarify why this term appears, even under the bounded Bellman eluder dimension?

  2. For the policy class $\Pi^T$, is there an intuitive explanation for the lower bound on the number of critic updates? If the critic does not change significantly, the current bound in Lemma 1 seems rather crude. Could a tighter bound be derived under such conditions?

  3. Algorithm 2 appears to be strictly superior to Algorithm 1 in terms of performance. What are the limitations of Algorithm 2 compared to Algorithm 1, if any? Is there a particular setting where Algorithm 1 is preferable?

Claims and Evidence

The main claim of this paper is that an actor-critic algorithm can be designed using general function approximation for the value function class while achieving $O(1/\epsilon^2)$ sample complexity or $\sqrt{T}$ regret. The authors propose two algorithms and provide theoretical results to support their claims. While I have not rigorously verified all proofs, the theoretical justification appears well-supported. Notably, unlike prior actor-critic approaches that rely on coverage or reachability assumptions, this work achieves sample efficiency without such assumptions, making it a significant contribution. However, the computational cost of the proposed methods and the lack of empirical comparisons raise concerns about their practical feasibility and efficiency in real-world implementations.

Methods and Evaluation Criteria

The evaluation criteria used in this paper, regret and sample complexity, are well-established metrics in the reinforcement learning literature and are commonly used to assess an algorithm's sample efficiency. However, it would be beneficial to include numerical experiments to further validate the practical applicability of the proposed algorithm. Empirical results would help demonstrate its real-world feasibility and provide insights into its computational efficiency and performance in practical scenarios.

Theoretical Claims

I have not rigorously verified the proofs, but the results appear to be reasonable and well-justified.

Experimental Design and Analysis

This paper does not include any experimental results.

Supplementary Material

I have not reviewed the supplementary material.

Relation to Prior Work

From a theoretical perspective, the proposal of an actor-critic algorithm achieving $O(1/\epsilon^2)$ sample complexity for general function classes is a highly promising result. Additionally, the extension to hybrid RL is particularly valuable, as it demonstrates the potential to leverage offline data to enhance sample efficiency in scenarios where interactions with the real environment are costly.

Missing Important References

I believe the authors have appropriately and thoroughly discussed the related work.

Other Strengths and Weaknesses

[Strengths]

  1. This paper introduces the first actor-critic algorithm that achieves $O(1/\epsilon^2)$ sample complexity and $O(\sqrt{T})$ regret under general function approximation settings. Notably, unlike prior works that rely on coverage and reachability assumptions, the proposed method does not require such assumptions while maintaining sample efficiency.
  2. The paper effectively categorizes the problem into easy and hard cases, clearly outlining the potential challenges in each scenario and explaining how the proposed approach overcomes them.

[Weaknesses]

  1. The results are entirely theoretical, with no empirical validation to demonstrate how the algorithm performs in real RL environments. Including experimental results would strengthen the claims. The proposed algorithm, while theoretically sound, appears to be computationally inefficient, which may limit its practicality for large-scale or real-world applications.

Other Comments or Suggestions

It appears that the definitions of $\text{OPT}^*$ and $\text{OPT}^f$ in Corollary 2 are missing. Could the authors clarify their precise definitions?

Author Response

We thank the reviewer for providing helpful comments and suggestions to our paper! We are glad that the reviewer believes our theoretical justification to be well-supported. We have revised our paper and included numerical experiments based on the suggestions from the reviewer in this link (https://github.com/anonymoosemoose/actorcriticicml2025/blob/main/actor_critic_figs.pdf).

Comments Regarding Corollary 2: $\text{OPT}^*$ and $\text{OPT}^f$ are not formal definitions, but placeholders for user-defined bounds on error terms for the chosen optimizer. We'll clarify this in the revision.

Response to Weaknesses and Questions

W1: Lack of experiments and computational inefficiency. These are very valid concerns. We address them below.

Computational inefficiency. Optimism is a powerful yet computationally expensive approach in RL. Recent work (Song et al., 2023) shows that in hybrid RL settings with good offline coverage, optimism isn’t necessary, and FQI suffices for provable efficiency. We extend this to the actor-critic setting with Algorithms 1H and 2H, corresponding to Algorithms 1 and 2 in our paper, but without computing confidence sets. When ERM is efficient, these become computationally practical. We show that:

  • Algorithm 1H achieves a regret of
$O\left(\sqrt{H^4 T \log|\mathcal{A}|} + \sqrt{\beta^\pi H^4 c_{\text{off}}^* T^2/N_{\text{off}}} + \sqrt{\beta^\pi H^4 T\,\text{SEC}(\mathcal{F}, \Pi, T)}\right),$
  • Algorithm 2H achieves a regret of
$\text{Reg}^*(T) = O\left(\sqrt{d H^5 T \log T \log|\mathcal{A}|} + dH^3 \log T + \sqrt{\beta^* H^4 c_{\text{off}}^* T^2/N_{\text{off}}} + \sqrt{\beta^* H^4 T\,\text{SEC}(\mathcal{F}, \Pi, T)}\right),$

where $c_{\text{off}}^*$ is the Bellman-type single-policy concentrability, $\beta^\pi$ and $\beta^*$ are confidence-set widths scaling with the log-covering number, and $N_{\text{off}}$ is the number of offline samples. If $N_{\text{off}} = \Omega(T)$, we have $\sqrt{T}$ regret, and we incur only an additional factor of $\sqrt{c_{\text{off}}^*}$ as the price of omitting optimism. This improves upon Zhou et al. (2023), who required $1/\epsilon^6$ samples.
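
For intuition, here is a minimal fitted-Q-iteration-style critic update on pooled offline and online transitions, using ridge-regularized least squares over a feature map as a stand-in for ERM over a general critic class. The function and argument names are illustrative; this is not the paper's Algorithm 1H/2H, only a sketch of how a confidence-set-free critic update can look.

```python
import numpy as np

def fqi_critic_update(datasets, featurize, H, num_actions, reg=1e-3):
    """One sweep of fitted Q-iteration on pooled offline + online data.

    datasets:  datasets[h] is a list of (s, a, r, s_next) transitions at step h.
    featurize: maps (state, action) -> feature vector (stand-in for a general class).
    Returns per-step weights w[h] with Q_h(s, a) ~ w[h] @ featurize(s, a).
    """
    dim = featurize(*datasets[0][0][:2]).shape[0]
    w = [np.zeros(dim) for _ in range(H + 1)]           # w[H] = 0: terminal value
    for h in reversed(range(H)):                        # backward induction over steps
        X, y = [], []
        for (s, a, r, s_next) in datasets[h]:
            # Bellman target for Q*: reward plus max over next-step estimates.
            q_next = max(w[h + 1] @ featurize(s_next, a_next)
                         for a_next in range(num_actions))
            X.append(featurize(s, a))
            y.append(r + q_next)
        X, y = np.array(X), np.array(y)
        # Ridge-regularized least squares: the ERM step, with no confidence sets.
        w[h] = np.linalg.solve(X.T @ X + reg * np.eye(dim), X.T @ y)
    return w
```

The actor update would then proceed against this critic as in the online algorithms.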

Experimental results.

  1. Optimism in linear MDPs. We compare Alg. 1 and Alg. 2 to a rare-switching version of LSVI-UCB on a linear MDP Tetris task. The condition required for Alg. 1 to work holds here, as we do not clip the Q-function estimates. We see that Alg. 1 performs better than LSVI-UCB (!), and Alg. 2 achieves $\sqrt{T}$ regret even though it performs slightly worse than LSVI-UCB.
  2. Deep hybrid RL. We compare Algs. 1H & 2H (with offline pretraining) to Cal-QL (Nakamoto et al., 2023). Alg. 2H takes a max over 10 randomly sampled actions from the policy to enhance exploration, as Cal-QL does (this can be viewed as a computationally efficient version of optimism, or alternatively as approximately targeting $\pi^*$). It outperforms Cal-QL, which outperforms 1H. Neither uses pessimism. This supports the claim that hybrid RL can avoid both pessimism and full optimism while retaining strong performance.

Note: In the plots, the first 1000 evaluation steps correspond to offline pretraining. Alg. 1H and 2H naturally underperform initially before fine-tuning begins. All plots are smoothed following WandB defaults.

We hope that these experiments address your concerns regarding real-world feasibility.

Q2: Why does the regret bound for Algorithm 2 still include the SEC term, despite assuming a bounded Bellman-Eluder (BE) dimension?

We thank the reviewer for this remark, which led us to notice a subtlety. We use the SEC rather than the BE dimension because it is generally tighter. However, Xie et al. (2022, Prop. 14) show that the SEC is upper bounded by the BE dimension under the greedy policy's occupancy measure, but not by the BE dimension for Dirac deltas (Jin et al., 2021, App. H). Still, we can use Xiong et al. (2023, Eqs. G.13–G.14) to bound the term containing the SEC by the Bellman eluder dimension for Dirac deltas. We're happy to clarify or include this in the revision.

Q3: Can the number of critic updates be bounded more tightly when updates are small?

Yes – thank you for this insightful question. One can use an $\epsilon$-net and collapse nearby critics when $\exp(\eta(f - f')) < \epsilon/2$, which can reduce the bound significantly. This is a great observation that we would be more than happy to include in the paper.
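
As a toy illustration of this collapsing idea (our own sketch, using an illustrative sup-norm test over a finite probe set of state-action pairs rather than the exact condition above):

```python
import numpy as np

def collapse_critic(new_critic, representatives, probe_points, tol):
    """Reuse an existing representative critic if it is within `tol` of the new
    critic on a finite probe set of (state, action) pairs; otherwise register
    the new critic. Fewer distinct critics means fewer distinct softmax policies."""
    new_vals = np.array([new_critic(s, a) for (s, a) in probe_points])
    for rep in representatives:
        rep_vals = np.array([rep(s, a) for (s, a) in probe_points])
        if np.max(np.abs(new_vals - rep_vals)) <= tol:
            return rep                     # collapse: reuse the nearby critic
    representatives.append(new_critic)     # no nearby critic; keep the new one
    return new_critic
```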

Q4: When is Algorithm 1 preferable to Algorithm 2?

Though Alg. 2 often performs better, it may underperform when the Q-function sum has a low-dimensional structure (e.g., linear MDP Tetris). This is due to policy resets during rare-switching critic updates, which disrupt learning. Also, as shown in Table 1, Alg. 2 incurs an extra $dH$ factor in settings favorable to Alg. 1.

We truly appreciate the reviewer’s helpful and encouraging comments. If our revisions address your concerns, we’d be grateful if you would consider raising your score.

Final Decision

This paper addresses a central and long-standing open problem in reinforcement learning: the design of provably efficient actor-critic algorithms with general function approximation. It introduces two theoretically grounded methods—DOUHUA and NORA—that demonstrate sample efficiency and low regret under carefully stated assumptions, and significantly, without relying on traditional reachability or coverage conditions. The work is theoretically deep, builds on contemporary complexity measures like the Bellman-Eluder dimension and SEC, and offers rigorous proofs supported by supplementary discussions. While initially limited by a lack of empirical results and computational practicality, the authors have responded thoroughly in the rebuttal, offering both efficient variants and meaningful experimental comparisons. Reviewer concerns were addressed in depth, with clarification on theoretical assumptions, notation, and regret guarantees. Overall, this submission makes a substantial theoretical contribution to the RL literature, pushing the frontier of sample efficiency in actor-critic methods.