PaperHub
Average rating: 5.7 / 10
Decision: Poster · 3 reviewers
Ratings: 3, 6, 8 (min 3, max 8, std 2.1)
Average confidence: 3.3
ICLR 2024

On Stationary Point Convergence of PPO-Clip

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-03-06
TL;DR

We provide a comprehensive analysis that shows the stationary point convergence of PPO-Clip and the convergence rate thereof.

Abstract

Keywords
PPO, PPO-Clip, stochastic optimization

Reviews and Discussion

Review (Rating: 3)

The paper provides a theoretical analysis of the convergence of a variant of the popular PPO algorithm where the objective function is clipped.

Strengths

The paper provides formal guarantees for the convergence of the clipping variant of PPO that has been used widely by reinforcement learning practitioners. This probably adds some reassurance that this is a sound RL method to use.

Weaknesses

The paper appears to be a good contribution to the field of RL, but has little, if anything, to do with representation learning. ICLR'24 might not be the best venue for this paper.

Some minor typos in the last paragraph on page 1: "this analysis rely" -> "this analysis relies"; same place: "no longer involve" -> "no longer involves".

Questions

Can you think of a possible impact your result can have on the field of representation learning?

Comment

We thank the reviewer for providing feedback. We are glad to hear that the reviewer acknowledges that "The paper appears to be a good contribution to the field of RL".

We re-examined the Call For Papers of ICLR 2024, where it states

We consider a broad range of subject areas including feature learning, metric learning, compositional modeling, structured prediction, reinforcement learning, uncertainty quantification and issues regarding large-scale learning and non-convex optimization, as well as applications in vision, audio, speech, language, music, robotics, games, healthcare, biology, sustainability, economics, ethical considerations in ML, and others.

Here, reinforcement learning is clearly one of the topics that ICLR 2024 is interested in. Additionally, in "A non-exhaustive list of relevant topics", the second topic is "representation learning for planning and reinforcement learning". We also searched ICLR 2022 accepted papers, where a simple query of "reinforcement learning" returns 68 papers, and querying "policy optimization" returns 7 papers. We therefore believe that manuscripts that have a good contribution to reinforcement learning should be considered by ICLR 2024.

For possible impacts, we understand that it is hard to imagine the immediate use cases of the work, because our analysis is on the original version of PPO-Clip and does not introduce new algorithms. Nevertheless, this does not mean a lack of implications. One example is that our analysis provides an understanding of the clip operator, which is not typically used in optimization. Our analysis of the events $B_{n,k}$ and $C_{n,k}$ characterizes the effectiveness of the clip and how often the clip is expected to happen in different stages of the optimization. Another example is our setting of $T$, which denotes the number of steps for which the same old reference policy is used. If $T$ is too large, the clip happens too often, and the gradient becomes zero too often, which makes the updates less efficient. If $T$ is small, the samples are utilized less effectively, and the algorithm spends more time interacting with the environment. There is a tradeoff between these two effects. The impact of this work lies in its long-term implications, where a better understanding of the algorithm could lead to new variants being developed. Considering the popularity of the algorithm, bringing up new understanding is valuable.

We thank the reviewer for pointing out several grammatical errors and we have revised them in the updated manuscript.

[1] ICLR 2024 - Call For Papers https://iclr.cc/Conferences/2024/CallForPapers

[2] ICLR 2022 Accepted Papers https://iclr.cc/virtual/2022/papers.html

Comment

Thank you for addressing my comments and answering my questions.

Comment

We thank the reviewer again for providing feedback on our paper. As the discussion period is coming to a close, we would like to know whether we have resolved the concerns expressed in the original review, especially the comment on venue fit, which we believe is contradicted by the ICLR 2024 CFP. We remain open to any further feedback and are committed to making additional improvements if needed. If you find that these concerns have been resolved, we would be grateful if you would consider reflecting this in your rating of our paper.

Comment

We thank the reviewer for the feedback and the further response. In case the concerns in the original review have been addressed and the questions have been answered, could the reviewer reflect this change in the rating of our manuscript? In case any concern remains, we are committed to further discussing with the reviewer and making updates to improve our manuscript.

Review (Rating: 6)

This work takes a closer look at the theories behind the clipped surrogate objective of PPO (Proximal Policy Optimization). The authors provide a comprehensive analysis that proves the stationary point convergence of PPO-Clip and demonstrates the convergence rate.

Strengths

The work is novel in that it investigates the theoretical convergence of PPO, whereas the field has overwhelmingly focused on the empirical performance and application of PPO (e.g., Dota 2 with PPO and ChatGPT, RLHF with PPO). This is partially because the clip operation is inherently non-smooth, thus posing challenges to theoretical analysis. The authors' analysis seems sound and comprehensive; they also clearly listed the necessary assumptions. The authors' Theorem 3.2 indicates "PPO-Clip has the same convergence property as the best current results available for policy gradient", which seems significant.

缺点

Please take this with a grain of salt, as I have primarily been using PPO under empirical settings. I struggle to understand how this work connects with the wider research community. What is the implication of this work? This work demonstrates PPO has stationary point convergence — can this property be used in some ways?

Questions

Regarding "the unbounded score function makes the ratio of two policies arbitrarily large, even in the late stages of the optimization process": do the authors mean the ratio could become arbitrarily large?

Comment

We thank the reviewer for the feedback. We are glad to learn that the reviewer acknowledges our novelty and the challenges involved in the analysis. Indeed, there are few analyses that directly tackle the vanilla version of PPO with the clipped surrogate objective. We are pleased to present the convergence characterization of PPO-Clip under a minimal set of assumptions.

We fully agree with the reviewer that it is hard to imagine the immediate use cases of the work. The primary cause is that our analysis is on the original version of PPO-Clip and does not introduce new algorithms. Nevertheless, this does not mean a lack of implications. One example is that our analysis provides an understanding of the clip operator, which is not typically used in optimization. Our analysis of the events $B_{n,k}$ and $C_{n,k}$ characterizes the effectiveness of the clip and how often the clip is expected to happen in different stages of the optimization. Another example is our setting of $T$, which denotes the number of steps for which the same old reference policy is used. If $T$ is too large, the clip happens too often, and the gradient becomes zero too often, which makes the updates less efficient. If $T$ is small, the samples are utilized less effectively, and the algorithm spends more time interacting with the environment. There is a tradeoff between these two effects.
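To make the role of $T$ concrete, below is a minimal, illustrative PyTorch sketch (not the code or experiments from our paper): the toy rollout, network sizes, learning rate, and clip range $\delta_0$ are arbitrary assumptions chosen only to show where $T$ enters the algorithm and why a larger $T$ tends to increase the fraction of clipped (zero-gradient) samples.

```python
# Minimal illustrative sketch of the PPO-Clip inner loop (not the paper's code).
# The toy rollout, network sizes, and hyperparameters are assumptions chosen
# only to show where T (inner steps per batch) enters the algorithm.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_states, n_actions, delta0, T = 4, 3, 0.2, 5   # delta0: clip range, T: inner steps

policy = nn.Sequential(nn.Linear(n_states, 16), nn.Tanh(), nn.Linear(16, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)

def log_prob(net, s, a):
    # log pi_theta(a | s) for a categorical policy parameterized by logits
    return torch.log_softmax(net(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)

for n in range(50):                              # outer iterations: fresh samples
    # Toy "rollout": random states, actions, and advantages stand in for real
    # environment interaction and advantage estimation.
    s = torch.randn(64, n_states)
    a = torch.randint(n_actions, (64,))
    adv = torch.randn(64)

    with torch.no_grad():                        # freeze the old reference policy
        logp_old = log_prob(policy, s, a)

    for k in range(T):                           # T optimizer steps on the same old policy
        ratio = torch.exp(log_prob(policy, s, a) - logp_old)
        surrogate = torch.min(ratio * adv,
                              torch.clamp(ratio, 1 - delta0, 1 + delta0) * adv)
        loss = -surrogate.mean()                 # minimize the negative clipped surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()

    # With larger T the ratio drifts further from 1 on the same batch, the clip
    # activates more often, and more per-sample gradients become zero.
    clip_frac = ((ratio < 1 - delta0) | (ratio > 1 + delta0)).float().mean().item()
    if n % 10 == 0:
        print(f"iter {n:2d}: clipped fraction = {clip_frac:.2f}")
```

The sketch is only meant to visualize the tradeoff described above; the clipped fraction reported after each outer iteration tends to grow with $T$ for a fixed batch.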

With this level of popularity of PPO in particular (e.g., 5,500 search results on GitHub), we believe it is worthwhile to overcome the technical challenges and bring this understanding of PPO to the community. Such knowledge could have a longer-term impact by inspiring new algorithms, which could be more valuable than proposing a single variant alone.

The reviewer's comment on the ratio of two policies is well taken and the corresponding fix has been made in the updated pdf.

Comment

Thank you for the response.

Comment

We thank the reviewer again for your time and effort in reviewing our paper. We are glad that the novelty of our analysis is recognized, and it is nice to have the opportunity to elaborate on the potential implications. We very much enjoyed the discussion with the reviewer.

Review (Rating: 8)

This paper presents a theoretical analysis of the convergence of the PPO algorithm using clipping and a version of the two-timescale method, i.e., updating the policy parameter with a particular period. Analysis of PPO-Clip is difficult due to the non-smoothness of the clipping operator and the involvement of the ratio of policies. Theorem 3.3 provides best-iterate and averaged-iterate convergence in terms of $\|\nabla V(\theta_{n,1})\|\to 0$.

Strengths

  • The paper tackles an important question regarding the theoretical analysis of the practical implementation of the PPO algorithm. The aim of the paper is clear and simple.

  • The analysis seems to be novel. The authors use a two-timescale method to overcome the difficulty of dealing with the ratio of policies, and constructing a particular set of events enables them to derive recursive inequalities that bound the norm of the gradient of the clipped loss function.

Weaknesses

  • Even though the aim of this paper is to tackle the theoretical analysis of the practical implementation of PPO, there is still some gap. Using an inner- and outer-loop style update seems far from the practical implementation.
  • The estimate of the advantage function, $\phi_n$, can be obtained with a parameterized value network (e.g., a neural network). However, I believe extending the proof to the Actor-Critic setting is non-trivial. Hence, I believe the proof is restricted to the Monte-Carlo setting.
  • Assumption 3.4 restricts the generality of $T$ and the learning rate. Providing simple examples of the learning rate and $T$ would help in understanding the conditions of Assumption 3.4. For example, can we use $\frac{1}{k}$ or $\frac{1}{\sqrt{k}}$ as the step-size?

Questions

  • Above the paragraph of Step 1, what does it mean to have a similar recursive inequality to policy gradient? Please provide more details.

  • In Step 1, how does the estimated clipped PPO loss have a gradient? Should it be a sub-gradient due to the non-smoothness of the clipping operator? Please provide more clarification.

  • What is the intuitive meaning of $X_{n,k}$ and $Y_{n,k}$? The motivation for decomposing the error term into $X_{n,k}$ and $Y_{n,k}$ is not really clear.

  • The introduction of $C_{n,k}$ in Step 2 is quite abrupt. Please provide more details.

  • In Step 2, what is the meaning of the bound on $\mathbb{E}[\|X_k\| \mid \mathcal{F}]$? Depending on $\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(s,a)}}$ seems problematic. How is this term compensated? In deriving (14), where did $\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(a\mid s)}}$ go?

  • I think there is a typo in (13). $\nabla V(\theta_{n-1})$ should be $\nabla V(\theta_{n-1,1})$.

Comment

We thank the reviewer for the feedback. We are glad to have our aim and our technical novelty acknowledged. One thing we want to clarify is that our formulation in Section 3.1 includes the single-timescale setting (which is what the vanilla implementation of PPO in OpenAI Baselines does). Our results hold for both the single-timescale version (the vanilla version) and the two-timescale version (the version where the optimizer is run $T$ times per sample, which is also commonly used).

We now provide detailed responses to the reviewer's questions and comments.

  1. Even though the aim of this paper is to tackle the theoretical analysis of the practical implementation of PPO, there is still some gap. Using an inner- and outer-loop style update seems far from the practical implementation.

Fortunately, our results hold for both the single-timescale version and the two-timescale version of PPO. We believe this is a strength rather than a weakness.

  2. The estimate of the advantage function, $\phi_n$, can be obtained with a parameterized value network (e.g., a neural network). However, I believe extending the proof to the Actor-Critic setting is non-trivial. Hence, I believe the proof is restricted to the Monte-Carlo setting.

We fully agree with the reviewer. We unify the estimation error of the advantage, including the sampling error, into $\phi_{n}$. Indeed, characterizing this error is very challenging. It involves estimating the fitting error of the neural network, which is an important problem in neural network theory. To the best of our knowledge, this problem is solved only for some specific networks, such as over-parameterized networks and linear neural networks [2, 3]. Extending the proof to the Actor-Critic setting would leave the variable $\phi_n$ unbounded.

  3. Assumption 3.4 restricts the generality of $T$ and the learning rate. Providing simple examples of the learning rate and $T$ would help in understanding the conditions of Assumption 3.4. For example, can we use $\frac{1}{k}$ or $\frac{1}{\sqrt{k}}$ as the step-size?

We first want to explain that $T$ is a small value, typically between 1 and 10 in practice. It is the number of off-policy PPO updates on the same old reference policy. Our notation of calling it "$T$" might cause some confusion, as it sounds like a large quantity.

Assumption 3.4 only concerns the behavior of the learning rate as the iteration index approaches infinity, and is not related to $T$, which is a small constant. The assumption is actually quite weak: it only requires the learning rate not to decay too quickly. Examples are $\epsilon_{n,k}=1/\sqrt{n}$, $\epsilon_{n,k}=1/n$, and $\epsilon_{n,k}=1/(n\ln n)$. An example that does not satisfy our assumption is $\epsilon_{n,k}=1/n^2$, which decreases too quickly and causes the improvement of the value function to stop before a stationary point is reached.
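As a purely illustrative comparison (the exact condition is stated as Assumption 3.4 in the paper; here we only assume the "does not decay too quickly" requirement amounts to non-summability of the step sizes), the examples above differ as follows:

$$
\sum_{n\ge 1}\frac{1}{\sqrt{n}}=\infty,\qquad
\sum_{n\ge 1}\frac{1}{n}=\infty,\qquad
\sum_{n\ge 2}\frac{1}{n\ln n}=\infty,\qquad
\text{whereas}\qquad
\sum_{n\ge 1}\frac{1}{n^{2}}=\frac{\pi^{2}}{6}<\infty .
$$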

  4. Above the paragraph of Step 1, what does it mean to have a similar recursive inequality to policy gradient? Please provide more details.

For the PG algorithm, i.e.,
$$\theta_{n+1}=\theta_{n}+\epsilon_{n}\hat{\nabla} V_{H}(\theta_{n}),$$
we can obtain the following recursive inequality:
$$\mathbb{E}\big(V^*-V(\theta_{n+1})\mid \mathcal{F}_n\big)\le V^*-V(\theta_n)-\epsilon_n\hat{\nabla} V_{H}(\theta_{n})^{\top}\nabla V(\theta_{n})+\frac{L\epsilon_n^2}{2}\big\|\hat{\nabla}V_{H}(\theta_{n})\big\|^{2}.$$
This inequality is crucial for the convergence of the PG algorithm. Therefore, a very natural approach to proving the convergence of the PPO algorithm is to construct an inequality similar to the one above, and thereby establish the convergence of PPO.

  5. In Step 1, how does the estimated clipped PPO loss have a gradient? Should it be a sub-gradient due to the non-smoothness of the clipping operator? Please provide more clarification. What is the intuitive meaning of $X_{n,k}$ and $Y_{n,k}$? The motivation for decomposing the error term into $X_{n,k}$ and $Y_{n,k}$ is not really clear.

The gradient of the clipped variable is defined as the normal gradient when it is unclipped, and zero when it is clipped. This is the same as what automatic differentiation implements in practice. The only caveat in the analysis is the "boundary" points, where the ratio is exactly $1+\delta_0$ or $1-\delta_0$. We set the gradient at these points to zero. In fact, we can set the gradient to any value in the subgradient and the analysis still holds, because the set of boundary points has measure zero.
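As a small illustrative check (not taken from the paper; the numbers are arbitrary), automatic differentiation in PyTorch indeed propagates a zero gradient through the clip whenever the input lies outside $[1-\delta_0, 1+\delta_0]$:

```python
# Illustrative check (arbitrary numbers): autograd gives zero gradient through
# the clip when the input lies outside [1 - delta0, 1 + delta0], and the
# ordinary gradient (here 1) when it lies strictly inside.
import torch

delta0 = 0.2
ratio = torch.tensor([0.7, 1.0, 1.3], requires_grad=True)   # clipped, unclipped, clipped
torch.clamp(ratio, 1 - delta0, 1 + delta0).sum().backward()
print(ratio.grad)                                            # tensor([0., 1., 0.])
```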

For the events, $X_{n,k}$ and $Y_{n,k}$ represent the errors of the objective with the clip operation and the objective without the clip operation, respectively. Since the part that remains after removing the clip operation (the first term on the right side of the inequality in Step 1) is a standard policy gradient, bounding these two error terms lets us obtain the final result.

Comment
  6. The introduction of $C_{n,k}$ in Step 2 is quite abrupt. Please provide more details.

The purpose of introducing this event is to estimate the range in which the clip operation holds. We have relaxed the range of the clip operation to an event that is only related to $\pi_{\theta_{n,1}}$ and $\pi_{\theta_{n-1,1}}$. In this way, we have transformed a double-loop problem into a single-loop problem. In other words, $C_{n,k}$ is a relaxation of $B_{n,k}$.

  7. In Step 2, what is the meaning of the bound on $\mathbb{E}[\|X_k\| \mid \mathcal{F}]$? Depending on $\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(s,a)}}$ seems problematic. How is this term compensated? In deriving (14), where did $\frac{1}{\sqrt{\pi_{\theta_{n-1,1}}(a\mid s)}}$ go?

We thank the reviewer for pointing this out. We indeed made a typo in this term in the original version of the appendix. In the revised version (updated PDF), we have corrected this typo. We have additionally added a detailed derivation to clarify how the term is obtained. Please refer to lines 17-27 on page 14 of the revised appendix.

  8. I think there is a typo in (13). $\nabla V(\theta_{n-1})$ should be $\nabla V(\theta_{n-1,1})$.

We thank the reviewer for pointing this out. We corrected this in the revised version.

[1] OpenAI Baselines https://github.com/openai/baselines

[2] Du, Simon S., Xiyu Zhai, Barnabas Poczos, and Aarti Singh. "Gradient descent provably optimizes over-parameterized neural networks." arXiv preprint arXiv:1810.02054 (2018).

[3] Baldi, Pierre, and Kurt Hornik. "Neural networks and principal component analysis: Learning from examples without local minima." Neural networks 2, no. 1 (1989): 53-58.

Comment

Thank you for the detailed response, and most of the concerns have been addressed. I have increased my score from 6 to 8.

Comment

We thank the reviewer again for your time and effort in reviewing our paper. Your review helps us a lot in improving the manuscript. We are glad to know that the reviewer recognized our contributions and we very much enjoyed the discussion with the reviewer. As the discussion period is coming to a close, we would like to note that we remain open to any further feedback and are committed to making additional improvements.

AC Meta-Review

This paper presents a theoretical analysis of the convergence of the PPO algorithm that uses clipping and the two-timescale method. The analysis appears to be novel in its use of (1) a two-timescale method that overcomes the technical difficulty related to the policy ratio, and (2) recursive inequalities that bound the norm of the gradient of the clipped loss function. The paper provides an important theoretical analysis of the popular PPO algorithm and a pivotal theoretical bound for the many RL approaches that rely on PPO.

The review submitted by Reviewer v9Uz is omitted due to its brevity and quality.

Why not a higher score

Reviewers still have questions regarding some technical details, such as the theoretical guarantees for more general learning rates, and the presentation of the paper requires more clarification and improvement.

Why not a lower score

The paper makes a significant novel contribution by proving the convergence of the PPO algorithm. The theoretical contributions appear to be relevant to a wide RL research community.

Final Decision

Accept (poster)