PaperHub
Overall rating: 6.5/10 (Rejected; 4 reviewers; min 5, max 7, std 0.9)
Individual ratings: 7, 5, 7, 7
Confidence: 3.8
Correctness: 3.3
Contribution: 3.3
Presentation: 2.5
NeurIPS 2024

Policy Gradient with Tree Expansion

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We introduce SoftTreeMax, a novel parametric policy combining tree expansion into policy gradient. We analyze its variance and bias, and implement a deep RL version of it.

Abstract

Keywords
Policy Gradient, Softmax policy, Tree expansion, Model-Based RL

Reviews and Discussion

Review (Rating: 7)

Disclaimer: I do not have the mathematical background required to check all proofs. It is also my first time reviewing for NeurIPS. However, I have contributions in classical (deep) RL and am finishing my PhD.

In this paper, the authors use a novel policy parametrization for RL, namely the SoftTreeMax policy, which replaces the traditional logit values with the rewards of small-horizon trajectories. The authors claim and prove that the SoftTreeMax policy gradient has lower variance than the traditional policy gradient. They run experiments with PPO on Atari.

I summarize my review as follows. The idea and work in this paper are novel, strong and well-motivated. However, it is in my opinion poorly presented, and poorly situated compared to existing related work.

I very slightly lean towards accepting this paper as is, but I give feedback in the following sections for the authors to increase their score.

Strengths

Originality: it is the first time that I see someone use look-ahead information in the policy parametrization. I have already seen it done in the critic part of Policy Gradient (PG) methods, e.g. for advantages estimation in PPO using n-step returns with n>1 and generalized advantage estimations.

Contribution: if the proofs of variance reduction as a function of the depth of the tree expansion in the logit values and as a function of the approximation error of the forward models are correct, I believe this work opens a whole new avenue for future work (which, by the way, is missing from the paper).

Weaknesses

In the following comments, I assume that the technical results are correct.

First weakness: clarity.

The proposed algorithm (PPO with the SoftTreeMax policy) makes heavy use of three major branches of reinforcement learning, namely policy gradient methods (the core of the paper), model-based RL (use of a forward model to compute small trajectories), and efficient implementations of tree search (to control the exponential cost of look-ahead search). These components are all central to your work, but the paper lacks a summary of how they work together. I recommend summarizing them in a schematic (see Questions section).

Second weakness: baselines and related work.

The related work mentioned in this paper does not clearly help to understand the significance of your work. There are well-studied tools reducing the variance in the gradient estimates (n-step returns and gae-lambda), and some works combining TS and PG. It would have been nice to see comparisons with those works. I recommend that the authors rewrite and extend their related work section (see Questions section).

REFERENCES.

TS + PG references: [An Actor-Critic Algorithm Using a Binary Tree Action Selector, Action Guidance with MCTS for Deep Reinforcement Learning, Policy Gradient Algorithms with Monte-Carlo Tree Search for Non-Markov Decision Processes, Policy Gradient Search: Online Planning and Expert Iteration without Search Trees]

Better sample efficiency and variance reduction: [Schulman et al., 2015b; Mnih et al., 2016; Munos et al., 2016; Schulman et al., 2017; Hessel et al., 2018; Schrittwieser et al., 2020; Wurman et al., 2022; Chebotar et al., 2023; Schwarzer et al., 2023, Brett Daley, Martha White, 2024]

Questions

Questions:

  • How do you learn the forward model? If it is not learned, how do you plan to use your algorithm for environments for which you have to learn the model?

  • Is there an Atari game named "NameThisGame" or is this a typo in your plots?

Suggestions:

I provide some suggestions to increase the score of the paper. I believe increasing my score by 1 for every three implemented suggestions seems fair.

  • Could you please briefly conclude on whether it is worth spending all this compute on tree expansion with respect to how much variance is reduced?

  • Could you motivate looking at the variance only at the end of training of PPO (line 301)?

  • Could you please add a schematic of the forward pass and gradient accumulation in the fashion of Fig. 1 from [Fully-Convolutional Siamese Networks for Object Tracking]? Basically, combining your Figure 2 with the text in lines 282-285 and lines 649-658 (Appendix B1) in a summary figure OR in an algorithm description with pseudo code?

  • Could you please move the results on one of C-SoftTreeMax or E-SoftTreeMax to the appendix and move the walltime plots and sample complexity plots (B2, B3) into the main paper?

  • Could you please add a dedicated future work section with dedicated paragraphs on efficient tree expansion implementations (better pruning, parallelism, GPU, ...) and on forward model learning?

  • Could you please have a dedicated related work section (you could again move some theory to the appendix) with dedicated paragraphs on other variance reduction methods in PG RL (you could have a look at the introduction of Averaging n-step Returns Reduces Variance in Reinforcement Learning, 2024), on combining TS and PG, and another paragraph on GPU-based environments that have differentiable models?

Advice:

If I may, having also worked with expensive tree structures, I recommend you limit your experiments to, e.g., depths 2, 4, 6, 8 rather than 2, 3, 4, 5, 6, 7, 8.

Thank you

Limitations

No limitations.

Author Response

Thank you for your response. Please find our answers below.

Question 1 - Do you learn the forward model and if not how can you use learned models?

In the experiments, the forward model is not learned but given. In cases where access to a forward model is not given, it should be learned and used in place of the exact model in the tree expansion. Note that learned forward models are typically represented using a neural network, which allows efficient tree expansion with Breadth-First Search (BFS), as we do in this work with the GPU simulator. In the last few years several prominent works have shown a forward model can be learned successfully, such as MuZero (Schrittwieser et al., 2020), EfficientZero (Ye et al., 2021), and Dreamer (Hafner et al., 2020). We discuss this alternative in Section 4.3 and provide a corresponding result in Theorem 4.8 for an approximate forward model.

Question 2 - Is “NameThisGame” the name of an Atari game or a typo?

Indeed an unusual name, but it is an Atari game; see https://en.wikipedia.org/wiki/Name_This_Game.

Question 3 - Is tree expansion worth the computation?

Evidently yes. As we show in Figure 3, the SoftTreeMax policy achieves substantially higher reward than those obtained by vanilla softmax. To see a comparison of time-based runs we refer the reviewer to Figure 4 in the Appendix.

Question 4 - What is the motivation of observing the variance only at the end of training?

In the initial stages of training, the return (cumulative reward) fluctuates wildly, since the policy is still almost random and exploration is intense (Anschel et al., 2016). This directly injects noise into the critic and the gradient variance estimator, as they build on the return. Thus, we consider the gradient variance after convergence for a fair comparison of the estimator itself. More formally, the theoretical analysis relates to the expected values obtained in the limit of the stationary distribution over states.

Question 5.1 - Add a schematic of the forward pass and gradient accumulation.

We tried capturing the forward-pass mechanism (lines 649-658 that you referred to) in the following diagram: https://ibb.co/5RPRkzF .

Question 5.2 - Add an algorithm description with pseudo code.

Per your request, we wrote the following pseudo code https://ibb.co/s2CRMp8, which we will add to the final version.

Question 6 - Change figure location in the paper.

Thank you for this suggestion. We tried re-arranging the figure locations and (i) see that this hurts the readability and flow more than it helps, in addition to (ii) violating the strict 9-page limit.

Question 7 & 8 - Future work and related work sections:

We have prepared new sections addressing future work and related work. Due to space constraints, we cannot include the full content here, but only include their outline. We would be happy to provide these expanded sections as additional comments later if the reviewer wishes to see them.

The future work section focuses on:

  1. Efficient tree expansion using GPU-based simulators like Atari-CuLE, IsaacGym, and Brax.

  2. Learning forward models when simulators are unavailable, building on works like MuZero.

  3. Extending to continuous control by sampling from parametric action distributions, adapting MPPI with a value function.

  4. Theoretical analysis of non-expansive operators and eigenvalue cancellation for infinite action spaces.

The related work section covers variance reduction techniques, including:

  1. Baseline subtraction (state and action-dependent)

  2. Importance sampling

  3. Sub-sampling (e.g., SVRPG)

  4. Control variates

  5. Natural policy gradients

  6. Experience replay

Comment

Thank you for the rebuttal. I am still not convinced that the computations of tree expansions are worth it because:

  • SoftTreeMax does not consistently improve on PPO.
  • SoftTreeMax performance sometimes decreases with the planning horizon $d$, which is strange.
  • I do not agree with what you answered to reviewer m94g : "If instead we used SAC with n-step TD as a baseline, then the comparison will be with SAC with n-step TD and SoftTreeMax as a policy." I think it would be fair to compare PPO/SAC with n-step TD against PPO/SAC with SoftTreeMax only. The point being that it would be a fair way to compare the performance gain over PPO/SAC with two different types of added information that are shown to reduce variance respectively in the Critic and in the Actor.

Can you provide more insight as to why the performance of SoftTreeMax can decrease with $d$ as opposed to not increase? Why not compare an Actor-Critic baseline with n-step return estimation in the Critic to an Actor-Critic baseline with tree expansion in the Actor estimation? This would answer the question of where it is best to spend compute and use more information.

Respectfully

Comment

Thank you for your reply.

Question 1: Can you provide more insight as to why the performance of SoftTreeMax can decrease with $d$?

This is a good observation. The deterioration is not inherent to the method, but due to the experiments having a fixed training time of one week. The deeper the tree expansion, the slower each step and the fewer steps are finished by the end of the run. Consequently, performance may decline beyond a certain depth d in these games, as the number of value updates becomes insufficient for proper training. Figure 5 illustrates this phenomenon: as depth increases, the number of online interactions (corresponding to value updates) decreases. However, when comparing performance at the same number of interactions (x-axis value), we observe a monotonic improvement with increasing d in almost all cases. This explanation clarifies that the method's effectiveness is not diminishing; rather, the fixed time constraint is limiting the learning process for deeper tree expansions in some games.

Question 2: Why not compare an Actor-Critic baseline with n-step return estimation of the Critic to an Actor-Critic baseline with tree expansion in the Actor estimation? This would answer the question of where it is best to spend computation and use more information.

We wish to first clarify that we in fact use a variant of n-step TD in the critic: the PPO implementation from stable-baselines3 that we extend uses, by default, the GAE estimator in the critic with lambda=0.95, which is very similar to n-step TD. We apologize for not making this clear earlier and will elaborate on this in the experiments section.

However, it's crucial to understand that SoftTreeMax is a new type of policy (actor) that can be combined with various types of critics (pure Monte-Carlo, TD(0), TD(N), GAE, etc). Our intention in the rebuttal was that one needs to compare actor to actor and critic to critic separately. Still, your question seems to approach this comparison from a computational angle.

In that aspect, the comparison between n-step return and SoftTreeMax is not straightforward, as they fundamentally differ in their approach and computational requirements. While n-step return samples one possible outcome, SoftTreeMax averages over all possible outcomes through tree expansion, providing more comprehensive information at a higher computational cost. A fair comparison might involve using tree expansion for value function estimation (like the concept of back-propagation in MCTS) or sampling multiple future trajectories for both methods. However, such a comparison, while interesting, is beyond the scope of our current work, which focuses on introducing and analyzing SoftTreeMax as a novel policy approach. We will clarify these points in the final version.

Comment

I believe this is great work. Really. So please keep your word on clarifying the discussed point in the camera-ready version of the paper. Make the presentation as clear as possible to reach as many readers as possible! I raised my score to 7. Great work.

Comment

Thank you so much for the kind words and insightful feedback! Your thorough reviews have been invaluable in improving our work, and we're committed to incorporating all these additions to enhance the paper's clarity and reach for the camera-ready version.

Review (Rating: 5)

The paper introduces a model-based online planning method called SoftTreeMax. The method acts as an extension of the softmax, replacing the logit in the softmax with an n-step return. Based on SoftTreeMax, the paper proposes two learning algorithms, C-SoftTreeMax and E-SoftTreeMax. The work includes a mathematical analysis of both algorithms to prove that the policy gradient variance is bounded and decays exponentially as the planning horizon grows. Then, the paper provides an empirical test of C-SoftTreeMax to compare the learned policy and the variance with PPO.

Strengths

  • The method is theoretically sound. The paper provides a detailed theoretical analysis to show the advantage of the decayed policy gradient variance of the proposed algorithm.

  • The paper reports the implementation and computation in detail and ensures reproducibility.

  • The paper empirically checks C-SoftTreeMax. The empirical results compare between the performance and the change of variance as the agent learns. Curves clearly suggest the inverse relationship between the performance and the variance, thus supporting the importance of having a method with guaranteed decaying variance.

Weaknesses

  • The main concern I have is the difference between the proposed method and the n-step return.

    In Section 3, the paper defines the SoftTreeMax logit as

    $$l_{s,a}(d; \theta) = \gamma^{-d} \left[ \sum^{d-1}_{t=0} \gamma^t r_t + \gamma^d \theta(s_d) \right].$$

    In the Actor-Critic method, it is normal to use $Q(s,a)$ for the scoring function $\theta$, and to learn a policy $\pi(a|s) \propto \exp(Q(s,a))$; for example, SAC, a commonly known online learning algorithm with state-of-the-art performance. In this case, $l_{s,a}(d; \theta)$ can be rewritten as

    $$\gamma^{-d} \left[ \sum^{d-1}_{t=0} \gamma^{t-1} r_t + \gamma^d E_a Q(s_d, a_d) \right].$$

    This equation is exactly the same as the equation in n-step TD, except for the normalizer $\gamma^{-d}$. However, the normalizer can be written as part of the temperature in softmax.

    In C-SoftTreeMax, the learned policy is written as $\pi^C_{d,\theta} (a|s) \propto \exp [\beta E^{\pi_b} l_{s,a}(d; \theta)]$ in formula (3).

    Replacing $l_{s,a}$ with the equation in (2), we have $\pi^C_{d,\theta} (a|s) \propto \exp \left[\beta E^{\pi_b} \left[\gamma^{-d} \left( \sum^{d-1}_{t=0} \gamma^t r_t + \gamma^d \theta(s_d) \right) \right] \right]$.

    As the expectation is taken over $\pi_b$, the normalizer $\gamma^{-d}$ does not depend on $\pi_b$ and thus can be moved out of the expectation. Then (3) becomes $\pi^C_{d,\theta} (a|s) \propto \exp \left[(\beta\gamma^{-d}) \mathbb{E}^{\pi_b} \left[ \sum^{d-1}_{t=0} \gamma^t r_t + \gamma^d \theta(s_d) \right] \right]$.

    Now the temperature becomes $\beta\gamma^{-d}$, which is a tunable parameter, and the expectation term becomes n-step TD when using $Q(s,a)$ for $\theta$. Similarly, when using the state value $V(s) = \theta$, the normalizer can be written as part of the temperature, and the expectation term becomes n-step TD. In this case, using the proposed method seems to be no different from simply changing the value update from TD(0) to n-step TD.

  • The empirical test seems to be incomplete. The paper proposes two algorithms, C-SoftTreeMax and E-SoftTreeMax. But only C-SoftTreeMax was empirically tested. The paper explains the empirical test for E-SoftTreeMax was left for future work on risk-averse RL. But it does not make sense to me that the empirical test is completely omitted while the method is listed as a new algorithm in the paper.

  • There could be a stronger baseline. Given the similarity between the SAC policy ($\pi(a|s) \propto \exp(Q(s,a))$) and the policy learned by C-SoftTreeMax ($\pi(a|s) \propto \exp(W_\theta(s,a))$), it may be worth testing SAC with both 1-step TD and n-step TD in the critic update.

  • The empirical test could include more seeds. Currently there are only five seeds. The variance calculated based on 5 seeds could be highly inaccurate.

  • The empirical test is limited to discrete action space. It is acceptable and understandable to leave continuous control experiments to future work, but still, adding continuous control results could make the paper more convincing.

  • The new method introduces a new parameter, the planning length $d$. According to the empirical results (Figures 3, 4, and 5), different $d$ gave very different performances. The best setting is different across tasks. In 3 out of 8 tasks (NameThisGame, Phoenix, and VideoPinball), longer planning hurts the performance. It is acceptable to say the performance is sensitive to the value of $d$ and leave this parameter as a tunable one, but it would be better to give an indication of how to pick this value when a reader would like to apply this method to a new task.

Typo:

  • Line 307, “one week The” -> “one week. The”

Questions

  • In Section 5, the paper says $\theta$ is replaced by a neural network $W_\theta$. It would be good if the authors could further explain what the target is when updating $W_\theta$, i.e., whether $W_\theta$ is a value function estimate or not.

  • It would be better to explain the empirical setting in more detail. In Section 6, the paper explains that the tree is pruned according to an estimated weight. It would be clearer if the paper included a description of how these weights are updated.

Limitations

Yes

Author Response

We appreciate your feedback on the method's soundness and reproducibility. We hope the following clarifications and new content address your concerns.

Weakness 1 - Difference between SoftTreeMax and n-step return

  1. Thank you for this insightful point. SoftTreeMax may seem similar to n-step TD, but it differs in several important aspects: In n-step TD, one uses observed n-long trajectories from the interaction to estimate the value. In our (model-based) approach, we expand the tree of all future trajectories using the forward model instead. In fact, averaging over multiple future scenarios is the engine behind variance reduction.

  2. In policy gradient (PG) methods, n-step returns merely serve as a Monte Carlo estimation technique for the advantage function $A$ in the gradient estimator $E[(\nabla \log \pi)A]$, but do not directly influence the policy itself. In contrast, in SoftTreeMax the multi-step reward is part of the policy itself, and it affects $\pi$ in $E[(\nabla \log \pi)A]$, rather than affecting $A$.

  3. In our SoftTreeMax, $\theta$ (defined in Section 2) appears in the softmax similarly to a Q-function. However, it is not necessarily the Q-function itself and is not explicitly minimizing any Bellman-type loss function. While n-step TD with value bootstrapping is common in value-based methods, its application to policy gradient algorithms remains unclear. Our work aims to address an important question: “how should planning be incorporated into PG methods?” As far as we know, the community hasn't fully explored this path.

Since SoftTreeMax and n-step TD operate on different terms of the gradient, both can be used at the same time by choosing SoftTreeMax as the policy.

Interestingly, Reviewer mmxv independently highlighted a key distinction of our approach. They noted the originality in using look-ahead information directly in policy parametrization, contrasting it with more common advantage estimation using n-step returns. This observation aligns with our view on the novel aspects of SoftTreeMax compared to traditional n-step TD approaches.

We will add a discussion on this topic to avoid any misunderstanding.
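
To make the sampling-vs-averaging distinction concrete, here is a small toy sketch (our own illustration on a random tabular MDP, not code from the paper): an n-step return is a single sampled trajectory, whereas the tree expansion computes the exact expectation over all depth-$d$ trajectories under the behavior policy (up to the $\gamma^{-d}$ normalization in Eq. (2)).

```python
import numpy as np

# Toy contrast (illustrative only, not the paper's code): an n-step return samples
# ONE length-d trajectory, while the SoftTreeMax logit averages over ALL length-d
# trajectories under the behavior policy, using the forward model.
rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 3, 0.99
P = rng.dirichlet(np.ones(S), size=(S, A))   # forward model: P[s, a] is a distribution over next states
R = rng.normal(size=(S, A))                  # reward model: R[s, a]
theta = rng.normal(size=S)                   # per-state score used at the leaves
pi_b = np.full((S, A), 1.0 / A)              # uniform behavior policy for the expansion

def sampled_n_step(s, a):
    """One sampled d-step return bootstrapped with theta (an n-step-TD-style estimate)."""
    g, disc = 0.0, 1.0
    for _ in range(d):
        g += disc * R[s, a]
        s = rng.choice(S, p=P[s, a])
        a = rng.choice(A, p=pi_b[s])
        disc *= gamma
    return g + disc * theta[s]

def tree_logit(s, a):
    """Exact expectation over all depth-d trajectories (what the tree expansion computes)."""
    g = R[s, a]                               # the first reward is deterministic given (s, a)
    dist = P[s, a]                            # distribution over the state reached after the first step
    P_pib = np.einsum("sa,sap->sp", pi_b, P)  # induced transition matrix P^{pi_b}
    for t in range(1, d):
        g += (gamma ** t) * dist @ (pi_b * R).sum(axis=1)
        dist = dist @ P_pib
    return g + (gamma ** d) * dist @ theta

samples = [sampled_n_step(0, 1) for _ in range(1000)]
print("sampled n-step return: mean %.3f, std %.3f" % (np.mean(samples), np.std(samples)))
print("tree-expanded logit  : %.3f (no sampling noise)" % tree_logit(0, 1))
```

The sampled estimate fluctuates from run to run, while the tree-expanded logit is deterministic given the model; this averaging is the effect the response above refers to.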

Weakness 2 - E-SoftTreeMax experiments

We actually repeated all experiments with E-SoftTreeMax, following the same protocols as C-SoftTreeMax (see footnote 1 on page 8). Results for E-SoftTreeMax are almost identical to those of C-SoftTreeMax, so we left the figures out. The reason the results of the two variants are almost identical is that, for quasi-deterministic Atari, the trajectory logits (Eq. (2) in the paper) have almost no variability. This makes the two SoftTreeMax variants (Eqs. (3) and (4)) nearly identical. We believe there is room for future work and a deeper investigation of E-SoftTreeMax using probabilistic environments that have inherent risk associated with them.
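
To spell out the quasi-deterministic point (an illustrative identity, not a quote from the paper): when the trajectory logit $l_{s,a}(d;\theta)$ has (almost) no variability under $\pi_b$, the expectation and the exponent commute,

$$\mathbb{E}^{\pi_b}\left[e^{\beta\, l_{s,a}(d;\theta)}\right] \approx e^{\beta\, \mathbb{E}^{\pi_b}\left[l_{s,a}(d;\theta)\right]},$$

so E-SoftTreeMax (expectation outside the exponent, Eq. (4)) and C-SoftTreeMax (expectation inside, Eq. (3)) induce nearly the same normalized policy.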

Weakness 3 - SAC baseline with n-step TD

The reason we used PPO as our baseline is that we built SoftTreeMax on top of it, as a model-based variant of PPO. This gives a fair comparison with similar compute power, using 256 GPU workers in the original PPO, considered as “depth 0” in the paper. SAC can be extended in similar fashion, and in that case we believe the “depth 0 SAC” would be a relevant baseline. Again, to highlight the point of Weakness 1 - If instead we used SAC with n-step TD as a baseline, then the comparison will be with SAC with n-step TD and SoftTreeMax as a policy.

Weakness 4 - Five seeds is not enough.

We followed the community standard for this parameter. Using 5 seeds is very common; for example, this is what is used in the seminal papers introducing DQN, PPO, SAC, TRPO, IMPALA, TD3, MuZero, EfficientZero, Dreamer, and many more. Figure 4 shows that while there is some variance between the experiments, the differences in the plots are considerably higher, demonstrating the efficacy of our proposed method.

Weakness 5 - Continuous control

Extending to continuous control is doable but requires significant research effort. To expand the tree, one can maintain a parametric distribution over actions which can depend on $\theta$. This method can be seen as a tree adaptation of MPPI (Williams et al., 2017) with a value function. Theoretically, we can exploit the non-expansive operator property of $P^{\pi_b}$ and the policy gradient's eigenvalue cancellation. These properties can be shown to hold for decision models with infinite actions. Nonetheless, we wish to highlight that a finite action space is an important setup in RL that has been studied abundantly.

Weakness 6 - How to pick $d$?

The choice of d depends on the game's planning requirements. Games needing long-term planning require better value estimates, which take longer to obtain at larger depths. For example, Breakout benefits quickly from large depths due to its short-term goal, while Gopher's complex task requires more training time for larger depths to be beneficial. We suggest three approaches for choosing d: 1) Increase until results stop improving, 2) Use a "natural" time horizon if applicable, or 3) Learn d online (though our preliminary tests found this unstable). We'll discuss these in the final version.

Question 1 - Is $W_\theta$ the value function?

Not necessarily. When making that assumption, we get very interesting connections to SAC and n-step TD, as pointed out in your first comment and by the other reviewers. But we do not constrain or provide any loss that links $W_\theta$ directly to the value or Q-function. Our work is a natural extension of vanilla PPO, which does not impose any requirements on the per-action weights either.

Question 2 - Could you explain the empirical setting in more detail?

Yes. We added a new diagram of the tree expansion implementation in the GPU https://ibb.co/5RPRkzF, and the pseudo code for SoftTreeMax-PPO https://ibb.co/s2CRMp8. We will add them to their appropriate location in the final version.

Comment

I appreciate the detailed reply from the authors. The reply addressed my main concern about the novelty. Now, the difference between the proposed method and the n-step TD is clear to me.

For the experiments, I am aware that many papers use 5 seeds, but usually, only a sample size of no less than 30 is considered sufficient for the central limit theorem to hold. Compared with 30, 5 is not a large number. Therefore, I still believe that 5 seeds are not good enough to make a conclusion on variance. Increasing the number of seeds is definitely making the conclusion more convincing. However, considering the limited time, I would not insist on asking authors to add more seeds to this paper.

After reading other reviews and the rebuttal, I agree with what reviewers d2zq, dZVs, and mmxv said about the novelty. The paper does provide a new aspect of policy gradient learning. Given the good novelty and concern about experiments (number of random seeds and baseline), I have increased my score accordingly.

Comment

Thank you for your reply and for raising your score in light of the discussion!

The statistical significance of 5 seeds

Thank you for raising this concern. This claim coincides with the two references shared by Reviewer d2zq and indeed highlights an important point regarding deep RL research. While using 5 seeds is a common practice in the field, we acknowledge the growing concerns about the statistical reliability of this approach. We will explicitly discuss this limitation in the paper, noting both the standard practice and the recent critiques. We will make an effort to include more seeds for the final version.

Review (Rating: 7)

This paper proposes a new policy parameterization called SoftTreeMax, which can reduce the variance of the stochastic policy gradient. The authors consider finite state and action spaces throughout the paper. They start by taking a softmax tabular policy and replacing the logit $\theta(s,a)$ with a score from a trajectory of horizon $d$ sampled from a behavior policy. They then address the normalization accordingly. Because of the randomness of the trajectory of length $d$, the authors take an expectation with respect to this randomness to well-define the policy parameterization. By taking the expectation before or after the exponent operator, the authors propose two policy parameterizations: C-SoftTreeMax and E-SoftTreeMax. It turns out that under either the C-SoftTreeMax or the E-SoftTreeMax parameterization, the variance of the stochastic gradient of the value function decays exponentially with the planning horizon $d$. The theoretical findings are well supported by simulations. Finally, the authors propose a parallel GPU-based simulator for the practical implementation of SoftTreeMax. Applying the C-SoftTreeMax parameterization to PPO, adapting $\theta(s)$ with a neural network, and using the parallel GPU-based implementation leads to better performance compared to distributed PPO.
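
For concreteness, the two parameterizations described here can be written as follows (a paraphrase based on the notation quoted in the reviews, not a verbatim copy of the paper's Eqs. (2)-(4)):

$$\pi^{C}_{d,\theta}(a|s) \propto \exp\left[\beta\, \mathbb{E}^{\pi_b} l_{s,a}(d;\theta)\right], \qquad \pi^{E}_{d,\theta}(a|s) \propto \mathbb{E}^{\pi_b} \exp\left[\beta\, l_{s,a}(d;\theta)\right],$$

where $l_{s,a}(d;\theta) = \gamma^{-d}\left[\sum_{t=0}^{d-1} \gamma^t r_t + \gamma^d \theta(s_d)\right]$ is the score of a depth-$d$ trajectory that starts with action $a$ in state $s$ and then follows the behavior policy $\pi_b$.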

Strengths

  • The concept of integrating tree search (planning) directly into policy parameterization is both intriguing and innovative. It demonstrates two key advantages: 1) With the SoftTreeMax parameterization, the variance of the stochastic gradient of the value function decreases exponentially with the planning horizon. 2) It can be easily combined with various policy gradient algorithms, such as REINFORCE and PPO, to enhance their performance.
  • Although computing the parameterization is computationally intensive, the authors offer a practical solution through parallel GPU-based simulation.
  • SoftTreeMax shows significant potential for extension to infinite or large spaces, making it a promising method for addressing complex problems.

Weaknesses

  • Not really a weakness. Since SoftTreeMax can benefit different policy gradient methods, I expect to see its improvement applied to some (original) policy gradient methods in the experiments, such as REINFORCE and actor-critic, to demonstrate its generality, which is another strength of the paper.
  • Since SoftTreeMax is inspired from tree search, the authors can highlight the difference between the "tree search" literature and the "tree expansion" used in this paper in order to avoid confusion. The tree search method uses a max to find the best policy, while here the authors use a fixed behavior policy.
  • For the tree search literature, the authors can cite Efroni et al. [2018] and Protopapas and Barakat [2024]. In particular, Protopapas and Barakat [2024] combine a variant of PG, policy mirror descent, with tree search.
  • For the theoretical PG literature in Line 312, the authors can cite Yuan et al. [2022], where they provide a general analysis of PG, including softmax tabular parameterization as a special case.

Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S. Beyond the one-step greedy approach in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1387–1396. PMLR, 10–15 Jul 2018.

Kimon Protopapas, Anas Barakat (2024). Policy Mirror Descent with Lookahead.

Rui Yuan, Robert M. Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla policy gradient. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 3332–3380. PMLR, 28–30 Mar 2022.

Questions

No specific questions.

Limitations

Yes.

Author Response

Thank you for supporting the paper and for your detailed and insightful response. We are delighted that you found the concept intriguing and innovative. Please find our answers below.

Weakness 1 - “Not a weakness”: I expect to see improvement to any policy gradient method

Indeed, our SoftTreeMax method can be used with any other PG algorithm since it provides a parametric policy that is an alternative to softmax. We agree that it is expected to improve other algorithms besides PPO studied here. We focused the experimental efforts on PPO because it is commonly used, has good performance overall and can be reduced and extended to multiple other variants in the literature (such as REINFORCE).

Weakness 2 - Highlight the difference between "Tree search" and "tree expansion"

Good point; we will emphasize this when introducing the expansion mechanism in the third paragraph of the introduction in the final version.

Weakness 3,4 - You can cite Efroni [2018], Protopapas and Barakat [2024], and Yuan [2022]

Thank you for the references; we will add them to the appropriate parts of the paper. Specifically intriguing is Efroni et al., 2018. There, they show that multi-step policies improve convergence rate at the expense of more complex iterations for policy iteration. We believe that a similar property can be shown for such policies used in policy gradient, and specifically for SoftTreeMax, with higher depth leading to better convergence rate. This behavior is also expected because the policy then gains access to more information (reward).

Review (Rating: 7)

This paper proposes a model-based policy gradient (PG) algorithm that tries to decrease the variance of the gradient updates in PG methods and thus (hopefully) improve their sample complexity. The proposed approach works by modifying the softmax policy to work not just with simple score functions (mapping states to real numbers), but rather with the truncated expected discounted return from a state. The resulting policy is called "SoftTreeMax". As such, this paper seems a little bit similar to decision-time planning (or MCTS-type) algorithms.

The paper proposes two variants of SoftTreeMax and analyzes their variance properties; in particular, the results show that as the size of truncation increases, the variance decreases exponentially. The paper also has a theorem which shows that SoftTreeMax can estimate the gradients well even with an approximate model. There are also some experiments on certain Atari games which show the efficacy of SoftTreeMax against the softmax policy using the popular PPO algorithm.

Strengths

Originality: The proposed algorithm seems novel to me in that I have not seen tree search used with policy gradient. As far as I remember, most tree search methods employ value function based RL algorithms (most popular being AlphaZero). In that way, this paper is quite novel. Further, the modification of logits to instead include the expected return seems novel too. (Although, it might be related to the "soft-Q learning" papers a little bit; something which could be mentioned in the paper. For instance, see https://arxiv.org/pdf/1704.06440, Eq. 2.)

Quality and Clarity: The paper is well written and has clear theorem statements that are easy to interpret. The setting considered is the popular discounted finite MDP, making the results very easy to follow. The experiments are also conducted using popular benchmarks and thus help the reader understand and compare the proposed method. Another nice thing is that the experimental results clearly align with the theoretical statements (such as the variance reduction with depth) and the benefits in terms of achieving better sample complexity are quite clear. Overall, the paper seems like a logical next step: using policy gradient methods with model based and tree search based methods. (In particular, for continuous action spaces, this would be highly useful and relevant.)

Significance: The paper combines, a very popular class of RL methods - policy gradients, with tree search like planning algorithms. For many difficult problems, particularly those with very large search spaces, Tree search has been shown to be very effective. On the other hand, policy gradient methods are broadly the most popular way of dealing with continuous action RL problems. Therefore, an algorithm combining these two solution methods can have immense applications, especially as AI methods continue to be adopted for solving more and more real-world decision making tasks (robotics being a good representative application area).

Weaknesses

  • The major weakness of this paper is that there is not enough discussion of the exact mechanisms for implementing SoftTreeMax. SoftTreeMax is akin to a policy parameterization (such as softmax). As such, when implementing an RL agent, we need a large number of other details (some details being how the exploration policy is chosen, what optimizer is used to update the gradients, how the function approximators are initialized, whether there is any processing of the observation features, etc.). These implementation details are important since they can affect the performance of RL algorithms more than the core idea itself (for instance, recall RainbowDQN https://arxiv.org/abs/1710.02298 and PPO https://arxiv.org/abs/2005.12729).

  • Can you guess how SoftTreeMax would run on a continuous action space MDP, such as those from Mujoco?

  • Does SoftTreeMax solve the problems of softmax policies? I know it helps with variance, but what about softmax's dependence on the initialization? Or the possibly slow convergence of softmax PG in MDPs due to an exponential dependence on the state space size (https://arxiv.org/abs/2102.11270)? Do you have any intuition? From the nice discussion in Appendix C.1, it seems that the answer is no. In particular, choosing the behavior policy while generating the tree could be quite important. Maybe some comments on how this policy is chosen in the implementation could be helpful (I saw the high-level comments about controlling the eigenvalues of the transition matrix, but those seem like high-level ideas).

  • What does SoftTreeMax really do? Is it just an MCTS like decision time planning algorithm that is helped a tiny bit by the theta values in Eq. 2? For instance, what if I do not update the theta values at all (i.e. set learning rate to zero). In that case, would the performance decrease drastically? Related to that, do you have any comments on how it relates with MCTS? I remember Bertsekas (http://web.mit.edu/dimitrib/www/LessonsfromAlphazero.pdf) said that as the depth increases, the utility / importance of the learned values (in this case theta values) in decision making decreases. Would that apply here as well?

  • Once the policy is learned, how do you sample from it? I guess, you still need a model to sample from this policy. This seems like a clear disadvantage (although, this is true for all tree search based / decision time planning methods). Maybe mention this as a limitation.

line 53: "In all tested Atari games, our results outperform the baseline and obtain up to 5x more reward." --> this is not true? For instance, results on VideoPinball or Breakout show that PPO is pretty much as good. Also, the paper doesn't specify how the hyperparameters were chosen. So maybe PPO with a better choice of hyper-params might outperform SoftTreeMax?

Questions

The method seems to be highly computationally expensive. Thus, a major question is whether the convergence is sped up by the variance reduction. The paper claims that the SoftTreeMax algorithm would result in faster convergence than traditional policy gradient methods.

Now, it looks like Figure 4 (page 24) is talking about exactly that. (Consider moving it into the main paper? This seems like a very important experiment.) Also, in Figure 4 caption "The x-axis is the wall-clock time. The runs ended after one week with varying number of time-steps.", what does 'varying number of steps' mean? Does this mean that the algorithms run for the same amount of time but different number of (non-simulated) environment interactions?

Minor comments (feel free to ignore)

line 22: "PG algorithms are also notoriously unstable due to the high variance of the gradients computed over entire trajectories" --> I don't think actor-critic type methods do that.

line 34: "for cumulative reward and one for exponentiated reward." --> reward is not exponentiated, right? The SoftTreeMax logits (Eq. 2) are exponentiated.

line 55: "demonstrating the relation between low gradient variance and high reward" --> seems like a strong statement? We don't know if low gradient variance is necessarily good, right? As Appendix C.1 says, this could also sometimes be harmful.

line 69-81: Nice preliminaries section! Also consider specifying the agent-MDP interaction: $s_0 \sim \nu$, $a_t \sim \pi(\cdot | s_t)$, $s_{t+1} \sim p(\cdot | s_t, a_t)$, $r_{t+1} = r(s_t, a_t)$, etc.

line 76: define $P^{\pi'}(s' | s)$ as $P^{\pi'}(s, s')$ instead?

line 82: consider writing bold $\Theta \in \mathcal{R}^S$ as bold $\theta \in \mathcal{R}^S$ --- capital letters are often reserved for matrices.

line 84: consider replacing $D(u)$ with $\textrm{diag}(u)$ throughout the paper -- midway, I got confused by $D$.

line 127: "by showing that the gradient is strictly positive everywhere [Mei et al., 2020b, Lemmas 8-9]." --> Gradient being positive is unclear, since it is a vector. Also, I'm not sure if Lemmas 8 and 9 actually show this.

line 165: "where the gradient is scaled to reduce its variance," --> do you mean that the gradient is clipped to reduce variance?

line 181: "the C-SoftTreeMax gradient is given" --> try using 'Jacobian' instead of 'gradient'

line 214-217: would the variance be zero in this case?

Figure 2: Consider writing "I.e., instead of the" as --> "That is, instead of the"

line 254: what norm is being used?

line 254, Theorem 4.8: It is quite interesting that the bound is unaffected by the behavior policy and the transition dynamics $\mathcal{P}^\pi$. Any comments?

line 273: "SoftTreeMax with an exhaustive TS of depth d" --> Is the search really going through all possible state-actions? If not, consider dropping the word exhaustive.

line 282: maybe Figure 2 should be moved somewhere closer.

lines 311-326: maybe put a colon, instead of a period, after the bolded letters, e.g. in "true dynamics. Aversion Risk. Previous wor..."

Limitations

The major limitation seems to be that the paper doesn't provide

  1. a written algorithm - while I understand that this is almost straightforward, just putting a clear algorithm (specifying order of planning/simulation steps, how to take real world actions with SoftTreeMax policy, computing gradients, and updating the policy) would be highly useful.

  2. The paper doesn't run experiments on continuous action space environments, which is arguably the more relevant domain for PG methods. (For discrete environments like Atari, we already have a plethora of value function + tree search based methods - like AlphaZero or MuZero.) Further, there are no hyper-parameter settings specified; it would be difficult to reproduce these results without looking at the codebase.

Further, a limitations section is missing

  • if the PG is biased, would that affect convergence? In particular, note that Theorem 4.8 shows that the bias would scale with the state space size $S$ (which is generally huge). Maybe setting $d = \log(S) / \log(1/\gamma)$ helps with this and reduces the error to $\log(S)$. But the computational complexity is probably exponential in $d$, so each gradient update becomes $O(S)$, which is also bad.

Maybe a discussion of these factors would be helpful:

  • How does the increase in bias affect the convergence speed (and the quality of the final policy obtained)? What bias does the pruning approach introduce?

  • How does the decrease in the variance of the gradient update affect the convergence speed (and the quality of the final policy obtained)? Does the additional variance introduced by pruning (in computing Lemma 4.6) counteract the variance reduction (as given in Theorem 4.4)?

  • How does changing the bias / variance affect $d$ and thus the computational complexity?
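
For scale, a rough instantiation of the depth suggestion $d = \log(S)/\log(1/\gamma)$ from the bullet above (our own illustrative numbers, not from the paper):

$$\gamma = 0.99,\ S = 10^{6} \;\Rightarrow\; d \approx \frac{\ln 10^{6}}{\ln(1/0.99)} \approx \frac{13.8}{0.01} \approx 1.4 \times 10^{3},$$

which is far beyond what an exhaustive tree of size $|A|^{d}$ (before pruning) can afford, illustrating the bias-variance-compute tension raised here.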

Author Response

Weakness 1 - Implementation details

We've added pseudo-code for SoftTreeMax-PPO (https://ibb.co/s2CRMp8), a new GPU tree expansion diagram (https://ibb.co/5RPRkzF), and open-source code (supplementary). We used hyper-parameters from the original PPO paper (Appendix B.1) and defaults from stable-baselines3 for the other details requested.

Weakness 2 - Continuous action spaces

For continuous control, we propose sampling from a parametric distribution over actions depending on $\theta$, similar to MPPI (Williams et al., 2017) with a value function. Theoretically, we can exploit the non-expansive operator property of $P^{\pi_b}$ and the policy gradient's eigenvalue cancellation.

Weakness 3.1 - Exponentially slow convergence

Unlike Softmax, SoftTreeMax's convergence is less likely to depend exponentially on state space size. The lookahead and added reward prevent gradient collapse except in degenerate cases where $P^{\pi_b}$ becomes rank-one, enabling more stable learning across various state space sizes.

Weakness 3.2 – Choice of behavior policy

We use a uniform tree expansion policy, assigning equal weight to all actions. This approximates the optimal variance decay, achieved when the induced transition matrix is near rank-one. Such a policy smooths the transition probabilities, minimizing the associated Markov chain's mixing time without requiring specific MDP assumptions.
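
As a toy illustration of this point (our own sketch, not the paper's code), one can check numerically that a uniform behavior policy pushes the induced transition matrix $P^{\pi_b}$ closer to rank-one, i.e., shrinks its second-largest eigenvalue modulus, compared to a near-deterministic behavior policy:

```python
import numpy as np

# Toy check (illustrative only): the closer P^{pi_b} is to rank-one, the smaller its
# second-largest eigenvalue modulus, which is the quantity the variance-decay
# argument above depends on. Sizes below are arbitrary, for illustration.
rng = np.random.default_rng(0)
S, A = 20, 4
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over next states

def induced_matrix(pi):
    """P^{pi}[s, s'] = sum_a pi(a|s) * P(s'|s, a)."""
    return np.einsum("sa,sap->sp", pi, P)

def second_eig_modulus(M):
    moduli = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
    return moduli[1]                           # moduli[0] == 1 for a stochastic matrix

uniform = np.full((S, A), 1.0 / A)             # uniform tree-expansion policy
greedy = np.zeros((S, A))                      # near-deterministic policy, for contrast
greedy[np.arange(S), rng.integers(A, size=S)] = 1.0

print("uniform pi_b:", second_eig_modulus(induced_matrix(uniform)))
print("greedy  pi_b:", second_eig_modulus(induced_matrix(greedy)))
```

On random MDPs of this kind the uniform policy typically yields a noticeably smaller second eigenvalue, consistent with the smoothing argument above.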

Weakness 4 - SoftTreeMax's nature and learned theta values

SoftTreeMax is an alternative parametric policy to Softmax, incorporating reward and forward model. It differs from MCTS in framework, extensions, and loss functions. Learning significantly improves performance, especially at larger depths in complex games. Figure 4 (Appendix B.2) shows performance variations across games with and without learning. The importance of learned theta values increases with depth, particularly in games requiring long-term planning. We can provide more concrete examples and explanations if needed.

Weakness 5 - Sampling from learned policy

Correct. Inference requires tree expansion using the same behavior policy as training. Potential mitigations like policy distillation need further research. We'll add this as a limitation in the paper.

Weakness 6 - Results

Our results show significant improvement over PPO in all 8 games, including Breakout and VideoPinball. In Breakout, we reach a reward over 800 for depths 4 and 5, while PPO doesn't. In VideoPinball, we obtain 800K for depth 3 while vanilla PPO reaches less than 200K.

Question 1 - Convergence and variance reduction

We believe convergence is sped up by variance reduction. In most games, variance monotonically reduces as depth increases, while reward acts oppositely. Figure 1 shows SoftTreeMax's gradient variance is consistently three orders of magnitude lower than PPO's, correlating with tree depth.

Question 2 - Figure 4 interpretation

Correct. 'Varying number of steps' means algorithms run for the same wallclock time but different environment interactions. This balances online vs. planning iterations, accounting for both outer and inner iteration complexity. Appendix B.2, Figure 4, shows a sweet spot balancing these iterations at different depths depending on the Atari game.

Question 3 - PG algorithms stability

We'll revise the statement about PG algorithms' instability. Actor-critic methods mitigate variance from entire trajectories, but instability persists for other reasons. We'll cite TRPO and PPO papers for relevance.

Question 4 - Reward exponentiation

E-SoftTreeMax (Equation 4) can be viewed as exponentiating rewards within logits before expectation, leading to a product of exponents of the reward sequence. We'll clarify this phrasing.

Question 5 - General comments

We appreciate your thorough suggestions on wording, figure locations, notations, and preliminary section. We'll incorporate these in the final version and fix any typos and inaccurate phrasing.

Question 6 - Gradient positivity

We meant the norm of the gradient is strictly positive everywhere. We'll clarify this and elaborate on the relevant lemmas from Mei et al., 2020b. Lemma 8 provides a lower bound on the gradient norm, while Lemma 9 shows this bound is strictly positive.

Question 7 - Variance in permutation matrix case

In this case, variance isn't zero. We have a single leaf per action selection, with rewards and logits exponentiated in the softmax operation. The zero-variance case occurs with zero second eigenvalue, detailed in Section C.1.

Question 8 - Norm in Theorem 4.8

We use the induced infinity norm, mentioned in line 597. The bound's independence from the behavior policy and transition dynamics stems from approximating the dynamics and reward up to an $\epsilon$ error. We can bound the relevant matrices by a factor of $\epsilon$ because they are convex combinations of the transition and reward. For further details please refer to the proof of Lemma A.4 and line 624 in the appendix.

Question 9 - Search process

For depths up to 3, we search all possible state-actions due to the quasi-deterministic nature of Atari environments. For larger depths, we prune low-weight trajectories. We'll clarify this in the paper.

Limitations

We've added pseudo-code and implementation details as requested. For continuous action spaces, we've proposed approaches but acknowledge further research is needed. We'll expand the limitations section, incorporating feedback from reviewers. The observation regarding the linear dependence on $S$ is a great one; the dependence in our bias bound may be an artifact of our proof and is potentially improvable in future work, with a tighter bound in Eq. (81). Pruning introduces bias through sub-sampling, affecting performance and convergence. The behavior policy controls the bias-variance tradeoff, balancing between proximity to $\pi_{t,d}$ (lower bias) and similarity of the rows of $\pi_b$ (lower variance).

Comment

I have read the rebuttal and I thank the author for their clarifications; they helped a lot. I also read the other reviews and the corresponding rebuttals. I think all of them raise relevant points. I think if the authors were to incorporate these suggestions, the quality of the paper would increase a lot. I retain my score of 7; it seems to be a fair evaluation of the work.

Some additional / minor comments:

Related to my review:

Lines 12 and 13 in https://ibb.co/s2CRMp8 can't really be implemented (since the objective is highly non-convex in terms of the policy / value function weights). So maybe, put "take a gradient step" or something similar there instead of "maximizing".

The diagram https://ibb.co/5RPRkzF looks very nice, but because it is somewhat complicated, it needs a proper caption. I hope that will be added too.

"Our results show significant improvement over PPO in all 8 games, including Breakout and VideoPinball. In Breakout, we reach a reward over 800 for depths 4 and 5, while PPO doesn't. In VideoPinball, we obtain 800K for depth 3 while vanilla PPO reaches less than 200K."

I still think that the results are very similar for these problems. I will suggest not writing "obtain up to 5x more reward"; it is not well supported by the experiments and seems like an oversell. Instead, just saying "competitive" or "outperforms" is enough.

Related to other reviewer's comments:

It seems like making the distinction between this work and n-step TD based actor-critic methods would be very useful. In particular, the nice and short analysis by m94G could be included and clarified.

The comment on five seeds not being enough is correct, and I don't agree with the author's responses. Now I'm not necessarily asking you to run more seeds, but at the very least, acknowledging that five seeds is probably not enough in the paper would be great! (Also see Figure 5 of https://arxiv.org/pdf/1709.06560 and Figure 15 of https://arxiv.org/pdf/2304.01315).

May I request the authors to include the experiments they already ran on the E-Softtreemax?

More importantly, is it possible to run the experiments for a learned model? I think having even just one experiment demonstrating that SoftTreeMax works with a learned model could be very encouraging for the readers.

Comment

Thank you for your reply.

Comment 1: Maximize -> take a gradient step

Thank you for the suggestion, see the revised version here: https://ibb.co/L9Jfw9G

Comment 2: Complicated diagram requires captioning

We agree, here is the captioning we will add:

“A diagram of the tree expansion used by SoftTreeMax. In every step, the states in the current level of the tree are duplicated and concatenated with each possible action. The resulting state-action pairs are then fed as a batch to the GPU simulator to generate the next level of states. Finally, the states of the last level $d$ are inserted into the neural network $W_\theta$ and the logits are computed using the corresponding rewards along each trajectory.”
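
A minimal sketch of the expansion loop this caption describes (our own illustration with a hypothetical batched `simulator_step(states, actions)` API and score network `W_theta`; not the authors' implementation):

```python
import torch

def softtreemax_logits(root_state, d, num_actions, simulator_step, W_theta, gamma=0.99):
    """Illustrative BFS tree expansion for C-SoftTreeMax with a uniform behavior policy.

    Hypothetical interfaces (assumptions, not the paper's API), with d >= 1:
      simulator_step(states, actions) -> (next_states, rewards), batched on the GPU
      W_theta(states) -> per-state score of shape (N, 1)
    Returns one logit per root action, averaging over all depth-d trajectories.
    """
    states = root_state.unsqueeze(0)                      # current tree level
    returns = torch.zeros(1, device=root_state.device)    # discounted reward along each path
    root_action = None                                    # root action each node descends from
    for t in range(d):
        n = states.shape[0]
        # duplicate every state in the level and concatenate it with each possible action
        states = states.repeat_interleave(num_actions, dim=0)
        actions = torch.arange(num_actions, device=states.device).repeat(n)
        states, rewards = simulator_step(states, actions)  # batched call to the GPU simulator
        returns = returns.repeat_interleave(num_actions) + (gamma ** t) * rewards
        root_action = actions if t == 0 else root_action.repeat_interleave(num_actions)
    # feed the last level into the score network and aggregate the leaves per root action
    leaf_scores = returns + (gamma ** d) * W_theta(states).squeeze(-1)
    logits = torch.zeros(num_actions, device=states.device)
    logits.scatter_add_(0, root_action, leaf_scores)
    return logits * (gamma ** -d) / (num_actions ** (d - 1))  # uniform average over A^(d-1) leaves
```

The pruning described for larger depths would presumably replace the full duplication with a selection of the highest-weight branches at each level; it is omitted here for brevity.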

Comment 3: Presentation of the results

We understand your concern and will rephrase accordingly.

Comment 4: Connection to n-step TD is useful

Thank you, we will add a summary of the discussion with the reviewers in the rebuttal; please also see the follow up conversation with Reviewer mmxv.

Comment 5: The statistical significance of 5 seeds

Thank you for the references; they indeed highlight an important point regarding deep RL research. While using 5 seeds is a common practice in the field, we acknowledge the growing concerns about the statistical reliability of this approach. We will explicitly discuss this limitation in the paper, noting both the standard practice and the recent critiques. We will make an effort to include more seeds for the final version.

Comment 6: Please add the results on E-SoftTreeMax to the paper as well

Certainly, we will include these plots in the final revision.

Comment 7: Is it possible to run the experiments for a learned model?

We appreciate the suggestion to include experiments with a learned model. Given the time constraints before the final submission, implementing this would be challenging. However, we recognize its importance and will make our best effort to include preliminary results if time allows. If not feasible for this submission, we will explicitly address this as an important direction for future work in our paper.

Comment

Thank you for your response. I have read it and I appreciate it.

Author Response

We thank the reviewers for their dedicated and comprehensive efforts. We are encouraged that three reviewers found our paper to be technically solid with possibly high impact and recommended to accept. We are also pleased the reviewers found the paper novel [dZVs, mmxv, d2zq], innovative and well motivated [dZVs, mmxv, d2zq] with potentially “immense applications” [d2zq] that opens a path for future works [mmxv]. Reviewers found the proposed SoftTreeMax algorithm can be easily paired with any policy gradient algorithm [dZVs], its theoretical analysis sound [m94G], and the experimental results to demonstrate the claims in the paper [dZVs,m94G].

During the rebuttal process, we addressed a concern raised about the relationship between SoftTreeMax and n-step TD methods. We clarified how our approach fundamentally differs, particularly in its novel integration of look-ahead information into policy parametrization. Notably, other reviewers independently recognized this key distinction [dZVs, mmxv, d2zq]. This dialogue has helped highlight the unique contributions of SoftTreeMax to the field.

Beyond this encouraging feedback, the reviewers also made valuable suggestions which inspired us to improve the paper with new content. Among others, these include new related work and future work sections (see response to mmxv), a new diagram of the tree expansion implementation on the GPU https://ibb.co/5RPRkzF, and the pseudo code for SoftTreeMax-PPO https://ibb.co/s2CRMp8 (see responses to m94G, mmxv, d2zq). We will incorporate these in the revised version.

Lastly, due to rebuttal size restrictions, we significantly shortened all our responses after the initial writing. Should the reviewers find our answers too short, we would gladly expand in additional comments.

Final Decision

Summary

This paper proposes a combination of tree expansion and policy gradient. The core idea is to replace the softmax parameter with Eq. (2), using the cumulative reward plus a $\gamma^d$-discounted parameter. Two methods are derived based on this idea, C-SoftTreeMax and E-SoftTreeMax, differing in whether the expectation (over trajectories) is taken inside or outside the exponential in the softmax.

The major claim is that, by parameterizing the policy this way, the variance of the policy gradient can be largely reduced, decaying exponentially with the expansion depth $d$. The authors conducted experiments on Atari to show that the proposed methods achieve better performance as well as lower variance than PPO.

Strengths

The idea of combining policy gradient with tree expansion is novel to me (as also noted by the reviewers). I believe this idea is reasonable and merits more investigation. The paper is also well written and the results are clearly presented.

Weaknesses

After reading, I consider that the major insight the authors want to convey in this paper, i.e., "reducing variance exponentially", is not convincing and is misleading.

Eq. (2) uses $\gamma^d$ as a scaling factor on the parameter $\theta$. The issue with that is that the gradient w.r.t. $\theta$ itself is reduced exponentially by $\gamma^d$ as well. This makes the "variance reduction" claim invalid, since we want to preserve the scale of the policy gradient and then reduce the variance, rather than make the whole gradient small.

I believe the reason the proposed methods work is something else, not reducing variance this way. From the paper, the authors seem to realize this (that the gradient norm is also reduced exponentially), while still attributing the improvements to variance reduction. In total, this delivers a misleading message that "reducing the variance and thereby improving policy gradient is the major contribution of this work".

Comments

That being said, I think the proposed methods work for some other, deeper reasons (not the variance reduction, as explained above), and they are worth investigating. It could be that the tree expansion guides faster exploration (learning more important states), or that the gradient here makes the parameters easier to learn than the usual policy parameterizations (like in PPO). However, these are all my guesses and they require more effort from the authors to confirm or disprove.