Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms
Global convergence of actor-critic method with sample complexity of $O(\epsilon^{-3})$ compared to existing rates of $O(\epsilon^{-4})$.
Abstract
Reviews and Discussion
This paper studies the convergence rates of the actor-critic algorithm for solving reinforcement learning problems. The authors establish a sample complexity of $O(\epsilon^{-3})$, claiming it improves on the current state of the art.
Strengths
The writing and organization of this paper are clear.
Weaknesses
I have two major comments: (1) the results, and (2) the assumptions.
(1) It seems that the algorithm presented in this work is not vanilla policy gradient but rather natural policy gradient. Specifically, Algorithm 1, Line 3, appears to have the same update as in Lemma 15 of [1]. For the natural policy gradient, several results in the literature [2,3,4,5] have shown that it achieves geometric convergence when using increasing step sizes. Consequently, the natural actor-critic algorithm attains an $\tilde{O}(\epsilon^{-2})$ sample complexity for global convergence.
Could the authors explain the main differences between the proposed algorithm and the natural policy gradient? If they are indeed the same algorithm, what are the main improvements of this work compared to those mentioned above?
[1] Agarwal, A., Kakade, S. M., Lee, J. D., & Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98), 1-76.
[2] Lan, G. (2023). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1), 1059-1106.
[3] Xiao, L. (2022). On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282), 1-36.
[4] Yuan, R., Du, S. S., Gower, R. M., Lazaric, A., & Xiao, L. (2022). Linear convergence of natural policy gradient methods with log-linear policies. arXiv preprint arXiv:2210.01400.
[5] Chen, Z., & Maguluri, S. T. (2022, May). Sample complexity of policy-based methods under off-policy sampling and linear function approximation. In International Conference on Artificial Intelligence and Statistics (pp. 11195-11214). PMLR.
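For concreteness regarding point (1), here is a schematic of the two updates being contrasted, written in the standard form from the policy-gradient literature rather than copied from the paper's Algorithm 1. With step size $\eta_t$, Fisher information matrix $F(\theta)$, and an estimated policy gradient $\widehat{\nabla J}(\theta_t)$:
$$\text{vanilla PG:}\quad \theta_{t+1} = \theta_t + \eta_t\,\widehat{\nabla J}(\theta_t), \qquad\qquad \text{NPG:}\quad \theta_{t+1} = \theta_t + \eta_t\,F(\theta_t)^{\dagger}\,\widehat{\nabla J}(\theta_t).$$
For softmax (tabular) policies, the NPG update is known to admit the multiplicative-weights closed form $\pi_{t+1}(a\mid s) \propto \pi_t(a\mid s)\exp\!\big(\eta_t A^{\pi_t}(s,a)/(1-\gamma)\big)$ (Agarwal et al., 2021); the question is essentially whether Algorithm 1, Line 3 coincides with this form.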
(2) The i.i.d. sampling assumption seems to be a strong one. One of the main challenges in analyzing the convergence rates of coupled stochastic iterative algorithms (such as the one in this work) is dealing with the noise. Realistically, one would implement the sampling process described in the top paragraph on page 7 while performing the update, making the noise sequence a time-inhomogeneous Markov chain. The i.i.d. assumption seems to greatly simplify the analysis. Could the authors discuss relaxing the i.i.d. assumption to strengthen the practical relevance?
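For reference, a minimal sketch of the kind of i.i.d. sampler this concern refers to, assuming the standard construction in which a draw from the discounted visitation distribution is obtained by rolling out a fresh trajectory and terminating at each step with probability $1-\gamma$; the environment interface (`env.reset()`, `env.step()`) and the policy's `sample_action` are hypothetical placeholders, not taken from the paper:

```python
import random

def sample_from_discounted_visitation(env, sample_action, gamma):
    """Draw one (state, action) pair approximately distributed according to the
    discounted state-action visitation distribution of the current policy.

    Each call restarts the environment and rolls out a fresh trajectory,
    terminating at every step with probability 1 - gamma.  This is what makes
    the samples i.i.d. across calls, and also what makes the scheme costly:
    an entire rollout is spent to produce a single sample.
    """
    state = env.reset()
    action = sample_action(state)
    while random.random() < gamma:      # continue the rollout w.p. gamma
        state, _reward, done = env.step(action)
        if done:
            state = env.reset()
        action = sample_action(state)
    return state, action
```

The Markovian alternative the comment points to would instead feed consecutive (state, action) pairs from a single ongoing trajectory into the updates, so the noise is correlated across iterations.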
(3) It is not entirely clear to me why Assumption 1, Parts 2 and 3, are considered weaker than Part 1, as claimed by the authors. Could a proof be provided to show that Part 1 is automatically satisfied under either Part 2 or Part 3? An extended discussion on the relationships between these parts of the assumption and their implications for the analysis would be beneficial.
Questions
See the weaknesses section.
This paper shows an improvement from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ in sample complexity for one kind of actor-critic algorithm.
Strengths
This paper improves the finite-time analysis bound from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ for a certain kind of actor-critic algorithm.
Weaknesses
First, I would like to point out that this paper is poorly written in several respects.
- Inconsistent Notation: The notation is inconsistent throughout the paper. For instance, in line 660, there is a mix-up between capital C and lowercase c.
- Mathematical Errors: There are several mathematical errors. For example, in line 297, a square seems to be missing.
- Undefined Notations: Certain notations are used in the algorithm without being defined.
- Lemmas Relabeled in the Appendix: If the proofs refer to the same lemma, it should keep a consistent label throughout the paper. For instance, Lemma 4 in the main body is relabeled as Lemma 9 in the appendix.
It is the authors' responsibility to ensure that the paper is easy to follow and free from critical errors. The authors need to thoroughly review and revise the manuscript to address these issues.
Regarding the content, I have some additional concerns:
- The algorithm analyzed in the paper is a tabular AC method, since the critic update is entirely tabular. Comparing the results in the tabular case with those obtained using a linear function approximator is not meaningful. I consider this the key weakness of the paper.
- The algorithm assumes that the state-action pairs are sampled i.i.d., which requires sampling an entire trajectory each time and is impractical in real-world scenarios. The typical way of sampling is from a single trajectory. This is also a significant limitation.
Questions
Please see my comments in the Weaknesses section.
This paper studies the actor-critic (AC) algorithm in the discounted reinforcement learning setting. The paper proposes a variant of AC in which the critic uses a constant step size while the actor's step size is diminishing in a given form. The paper claims to achieve a sample complexity of $O(\epsilon^{-3})$ for an $\epsilon$ sub-optimality gap target. Such a result reduces the gap to policy gradient with exact gradients. In addition, the constant step size in the critic component enhances the practicality of the algorithm.
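To make the step-size structure described above concrete, here is a minimal sketch of a generic two-timescale tabular actor-critic loop of this flavor. It is an illustrative stand-in, not the paper's Algorithm 1: the environment interface, the TD(0) critic, and the decay form `beta0 / (t + 1) ** sigma` are all assumptions made for the sketch.

```python
import numpy as np

def actor_critic_sketch(env, n_states, n_actions, gamma,
                        alpha=0.1, beta0=1.0, sigma=0.5, T=100_000):
    """Generic two-timescale tabular actor-critic (illustrative only).

    Critic: TD(0) value estimates with a CONSTANT step size alpha.
    Actor:  softmax policy-gradient step using the TD error as the advantage
            estimate, with a DIMINISHING step size beta_t = beta0/(t+1)**sigma.
    """
    theta = np.zeros((n_states, n_actions))  # softmax policy logits
    V = np.zeros(n_states)                   # critic's value estimates

    state = env.reset()
    for t in range(T):
        logits = theta[state] - theta[state].max()
        probs = np.exp(logits) / np.exp(logits).sum()
        action = np.random.choice(n_actions, p=probs)
        next_state, reward, done = env.step(action)

        # Critic update: one TD(0) step with a constant step size.
        td_error = reward + (0.0 if done else gamma * V[next_state]) - V[state]
        V[state] += alpha * td_error

        # Actor update: grad of log pi(a|s) w.r.t. the logits of state s
        # is (one_hot(a) - pi(.|s)); the step size diminishes with t.
        beta_t = beta0 / (t + 1) ** sigma
        grad_log_pi = -probs
        grad_log_pi[action] += 1.0
        theta[state] += beta_t * td_error * grad_log_pi

        state = env.reset() if done else next_state

    return theta, V
```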
Strengths
- A claimed sample complexity of $O(\epsilon^{-3})$ for an $\epsilon$ sub-optimality gap target.
- The use of constant critic step size enhances the practicality of the algorithm.
Weaknesses
- The paper misses a body of bi-level actor-critic literature that is complementary to the line of approach in the current paper, for example:
[1] Xu, T., Wang, Z., and Liang, Y. (2020). Improving sample complexity bounds for (natural) actor-critic algorithms. Advances in Neural Information Processing Systems, 33, 4358-4369.
[2] Chen, Z., Zhou, Y., Chen, R. R., and Zou, S. (2022). Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In International Conference on Machine Learning (pp. 3794-3834). PMLR.
[3] Hairi, F. N. U., Liu, J., and Lu, S. (2022). Finite-time convergence and sample complexity of multi-agent actor-critic reinforcement learning with average reward. In International Conference on Learning Representations (ICLR), Virtual Event, April 2022.
- The presentation is subpar, which makes it difficult to get the gist of the analysis. In particular, two ideas seem to be critical in the analysis of the algorithm:
- Slow change of policy, essentially decoupling the actor and critic steps: this idea is not given clear intuition or sufficient explanation. How does one see the "slowness" in the given (actor) step-size choice? "Slow" relative to which critical time scale? Furthermore, what is the intuition behind the decoupling, given that "slowness" does not amount to complete decoupling in the technical derivations? (A generic two-timescale heuristic is sketched after this list for reference.)
- How does the "adversarial" component enter the analysis? What does it have to do with the "slow" change of policy?
- Please rephrase the two sentences between Lines 71-73; it is not clear what the authors are trying to convey.
- There are many typos and instances of incomplete writing; a non-comprehensive list:
- Line 66: n -> In.
- Caption under Table 1: missing an "as" after "such".
- Last paragraph in section 1 is not complete.
- Define the notation introduced in Line 207.
- Line 268, the same expression appears twice.
- Missing a square after the first inequality in Line 297.
- Line 346, Lets -> Let's.
- Line 380, "=" -> "-", also missing a coefficient factor.
- Equations (4) and (5) appear to be the same.
- Line 419 is missing an "is".
- Line 463 is missing parentheses.
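For reference on the "slowness" question above, here is a standard two-timescale heuristic (generic, and not a description of this paper's actual argument; the step sizes $\alpha$ and $\beta_t$ below are placeholders): a critic running with a constant step size $\alpha$ needs on the order of $1/\alpha$ updates to track the value function of a fixed policy, while over any window of that length the policy parameters move by at most
$$\sum_{k=t}^{t+\lceil 1/\alpha \rceil} \beta_k \;=\; O\!\left(\frac{\beta_t}{\alpha}\right) \;\longrightarrow\; 0 \quad \text{as } \beta_t \to 0,$$
assuming $\beta_k$ is nonincreasing. The policy therefore looks quasi-static ("slow") on the time scale over which the critic equilibrates; the actor and critic are never literally decoupled, but the critic error induced by the policy drift can be controlled through this separation of time scales.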
Questions
See the weaknesses section. In addition:
- Line 143 mentions that (Xu 2020) requires additional computation and is highly challenging. However, it is not obvious why it is "highly challenging", or even "challenging", in the first place. Please elaborate.
- Line 293 states that samples are drawn uniformly. Is this uniformly over the state space or over the samples collected so far?
- What does the last paragraph on page 7 have to do with Item 3 of Assumption 1?
- In the same paragraph, why would it lead to a deterministic policy rather than a stochastic one?
- In the same paragraph, why would converging to a deterministic policy potentially lead to a better error?
In my opinion the paper is well written, well structured, and easy to read. However, given the current state of the literature, its results have already been obtained in a more general setting. Therefore, I cannot recommend this work for acceptance.
Strengths
The paper has a very clear and readable presentation. Additionally, the techniques used to prove convergence seem novel.
Weaknesses
The key shortcoming of this paper is that Gaur et al. (2024) have already proven last-iterate convergence with a sample complexity of $O(\epsilon^{-3})$, and they do so for an infinite state and action space. Additionally, that work uses a decreasing actor step size and a constant critic step size, and it also incorporates Markovian sampling.
References:
Mudit Gaur, Amrit Bedi, Di Wang, and Vaneet Aggarwal. Closing the gap: Achieving global convergence (last iterate) of actor-critic under Markovian sampling with neural network parametrization. In Proceedings of the 41st International Conference on Machine Learning, 2024.
Questions
Is there a way to extend this paper to an infinite state and action space? If that can be achieved, it would bring parity between this work and Gaur et al. (2024), at least in terms of the type of MDP considered. It might then be possible to argue that this work has some advantages over Gaur et al. (2024) and may be fit for acceptance.
We thank the reviewer for the time spent reviewing this work. We are glad that the reviewer finds the techniques we use to prove convergence novel.
The work of Gaur et al. 2024 (see its Algorithm 1) is very different from our vanilla actor-critic. Although they collect new samples, those samples are reused multiple times via a buffer. As their Theorem 1 states, to achieve an $\epsilon$-close policy they run a number of iterates, generating many new samples in each iterate (hence their stated sample complexity); however, at every iterate the gradient estimation reuses the samples stored in the buffer multiple times. To summarize, they require many new samples, and each sample is used multiple times.
In contrast, our algorithm uses each new sample only once and does not require memory to store a sample buffer. This makes our algorithm very different from, and more efficient than, that of Gaur et al. 2024.
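To illustrate the distinction being drawn, here is a schematic, self-contained contrast between buffer-based reuse and streaming single-use updates. All quantities and update rules are toy placeholders; neither loop reproduces Gaur et al.'s Algorithm 1 or our Algorithm 1:

```python
import random

def collect_new_sample(policy):
    """Toy stand-in for one fresh environment interaction (hypothetical)."""
    return policy + random.gauss(0.0, 1.0)

def gradient_estimate(policy, samples):
    """Toy stand-in for a gradient estimate built from a batch of samples."""
    return sum(samples) / len(samples) - policy

# Buffer-based reuse: each iterate collects one fresh sample, but the gradient
# estimate revisits the whole stored buffer many times per iterate.
policy, buffer = 0.0, []
for t in range(100):
    buffer.append(collect_new_sample(policy))        # one new sample
    for _ in range(10):                               # reused 10 times here
        policy += 0.01 * gradient_estimate(policy, buffer)

# Streaming single-use: each sample is consumed exactly once by the critic and
# actor updates and then discarded, so no buffer memory is needed.
policy, critic = 0.0, 0.0
for t in range(100):
    s = collect_new_sample(policy)
    critic += 0.1 * (s - critic)                      # critic uses s once
    policy += 0.01 * (s - critic)                     # actor uses s once
```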
Extending this work to an infinite state space can be challenging and is an interesting direction; we leave this for future work.
We thank the reviewers for their invaluable time spent reviewing this paper. We will gladly incorporate the reviewers' suggestions, especially on related work. We are withdrawing the paper for this reason.