PaperHub
Average rating: 6.0 / 10 — Rejected. 4 reviewers; individual ratings 5, 8, 5, 6 (min 5, max 8, std 1.2). Average confidence: 3.0.
ICLR 2024

SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Expansion

Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We introduce SoftTreeMax, a novel parametric policy combining tree expansion with policy gradient. We analyze its variance and bias, and implement a deep RL version of it.

Abstract

Keywords

Reinforcement Learning, Policy Gradient, Softmax, Tree expansion

Reviews and Discussion

Review
Rating: 5

The authors propose SoftTreeMax in this paper, a new method that aims to mitigate the high sample complexity and large variance of policy gradient methods by employing planning. It extends traditional logits with the multi-step discounted cumulative reward and the logits of future states. It is shown that tree expansion helps reduce gradient variance. The variance decays exponentially with the planning horizon, and the closer the induced transitions are to being state-independent, the faster the decay. With approximate forward models, the resulting gradient bias diminishes with the approximation error while retaining the same variance decay. SoftTreeMax reduces the gradient variance by three orders of magnitude in Atari, leading to better sample complexity and improved performance compared to distributed PPO.
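For intuition only, below is a minimal sketch of the kind of tree-expansion logit described above, written against a known tabular model. It is not the paper's Eq. (2); the shapes of `P`, `r`, `pi_b`, and `theta`, the leaf-term convention, and the temperature `beta` are assumptions made for illustration.

```python
import numpy as np

def tree_logits(s, theta, P, r, pi_b, gamma=0.99, depth=2):
    """Illustrative cumulative tree-expansion logit: for each first action a,
    accumulate the expected discounted reward of following pi_b for `depth`
    steps, plus the discounted logits theta of the reachable leaf states."""
    A = r.shape[1]                          # P: (S, A, S); r, pi_b, theta: (S, A)
    logits = np.zeros(A)
    for a in range(A):
        dist = P[s, a]                      # state distribution after taking (s, a)
        total = r[s, a]
        for t in range(1, depth):
            total += gamma ** t * np.einsum('x,xa,xa->', dist, pi_b, r)
            dist = np.einsum('x,xa,xaz->z', dist, pi_b, P)
        total += gamma ** depth * dist @ (pi_b * theta).sum(axis=1)
        logits[a] = total
    return logits

def softtreemax_like_policy(s, theta, P, r, pi_b, gamma=0.99, depth=2, beta=1.0):
    z = beta * tree_logits(s, theta, P, r, pi_b, gamma, depth)
    e = np.exp(z - z.max())
    return e / e.sum()
```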

Strengths

The paper proposes a novel approach, SoftTreeMax, to mitigate the large variance and high sample complexity for policy gradient methods by leveraging tree expansion and softmax. While there have been related works that study the softmax operation in policy gradient or value-based approaches, SoftTreeMax is unique in its focus on tree expansion to reduce variance. The paper is well-written and easy to follow, with most claims being well-discussed within the paper. The problem of mitigating large variance and high sample complexity for policy gradient methods is a significant challenge in RL, and SoftTreeMax provides a promising solution.

Weaknesses

One weakness of the paper is in its experimental evaluation section. While the paper presents promising results for SoftTreeMax in Atari, some of the claims made are not well-supported. For example, the paper does not include enough baselines to make a fair comparison with SoftTreeMax. This makes it difficult to determine the extent of SoftTreeMax's improvement over existing methods.

Additionally, the paper lacks in-depth comparison with other related methods. While the paper compares SoftTreeMax with distributed PPO, it does not provide a comprehensive comparison with other state-of-the-art methods in the field. This makes it difficult to determine the generalizability of SoftTreeMax and its performance in comparison to other methods.

Questions

Policy gradient methods suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax—a generalization of softmax that employs planning.

However, in the experimental evaluation, only PPO is used as the baseline algorithm. There have been many efforts to reduce the variance and improve sample complexity of policy gradient methods (e.g., including a baseline). It would therefore be better to also compare against state-of-the-art approaches that solve the same problem.

We do so by sub-sampling only the most promising branches at each level. Limiting the width drastically improves runtime, and enables respecting GPU memory limits, with only a small sacrifice in performance.

Doesn’t this also introduce additional variance by sub-sampling and pruning?

For depths $d \geq 3$, we limited the tree to a maximum width of 1024 nodes and pruned trajectories with low estimated weights.

Does it suffer from a limitation when used with a larger value of $d$, which may impose a more significant limit on the allowed maximum width due to cost?

Figure 3: Reward and Gradient variance: GPU SoftTreeMax (single worker) vs PPO (256 GPU workers).

Is it more sample-efficient than the baseline methods (as opposed to comparing only final performance or actual wall-clock time)?

Comment

Thank you for your response. Please find our answers in the following.

W1: Add baselines … to determine the extent of SoftTreeMax's improvement over existing methods? Following this important comment, we have now added a SOTA baseline: EfficientZero [Ye et al., 2021]. We chose EfficientZero because it is a highly sample-efficient version of MuZero -- one of the best-known RL algorithms today -- and is also open source. Due to the time limitation of the rebuttal, we present experiments on four games, with a runtime of 96 hours. The results are given here: https://ibb.co/6Zd24yk. As seen in the plots, SoftTreeMax surpasses EfficientZero in all games but one. We will add full experiments in the final version.

Experiment details: We used the same hardware to run all experiments, as described in Appendix B.1. For “SoftTreeMax”, we present the best depth out of those in Appendix B.2, Figure 4.

We thank the reviewer for their useful suggestion, which shows that there are indeed better baselines than PPO to compare to. With that said, we also recall that the PPO baseline we run here has “enhanced” sample efficiency due to its multiple GPU actors.

Q1: There have been efforts to reduce the variance and improve sample complexity (e.g., including a baseline). Compare to SoTA that solves that problem? Thank you for bringing up this critical point. Our implementations of SoftTreeMax and SoftMax PPO already employ a baseline subtraction technique. This means that on top of our technique to reduce "direct" variance (i.e., SoftTreeMax), we also run a SOTA technique to reduce "estimation" variance (i.e., baseline subtraction). Importantly, SoftTreeMax puts forward a completely new category of variance reduction techniques because it shrinks the inherent variance of the gradient itself, as defined in Section 2.1 below Eq. (1), instead of the variance of the gradient estimator. As such, it can be applied orthogonally, in addition to previous variance reduction techniques.

We will clarify this point in the revised version.

Q2: What is the effect of sub-sampling and pruning on the variance? We perform sub-sampling as part of the pruning process, beyond depth 3. Its effect on variance is complex but small, and may vary with the number of pruned nodes. It is small because scores in the leaves are exponentiated to form the policy; thus, most pruned nodes have very low probability and the effect is insignificant compared to the computational gain. Nonetheless, in ablation studies, we saw that extreme pruning leads to low performance and slow convergence.
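As a rough sketch of the width-limiting step described above (hypothetical; `expand`, `score`, and `max_width` are placeholders, not the authors' implementation):

```python
import heapq

def expand_level(frontier, expand, score, max_width=1024):
    """Expand every node in the current frontier, then keep only the
    `max_width` children with the highest estimated weights. Since leaf
    scores are later exponentiated to form the policy, the dropped children
    would typically carry near-zero probability mass."""
    children = [child for node in frontier for child in expand(node)]
    if len(children) <= max_width:
        return children
    return heapq.nlargest(max_width, children, key=score)
```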

Q3: Can you use higher depths than d=3 without limitation? Not with the current hardware. Beyond a depth of 3, GPU memory becomes a bottleneck, and without pruning, searches at such depths become infeasible. With pruning, the complexity of applying the policy grows linearly with the depth. Figure 5 in Appendix B.3 compares the number of online steps taken during one week for several depths, up to 8. Indeed, for higher depths, the complexity of the policy inference becomes a limiting factor for training. We do note that this cost depends heavily on the simulator itself, and for RL problems with faster simulators higher depths are more tractable.

Q4: Is there a better way to compare sample efficiency instead of run-time? To answer the question, we first define "sample efficiency" more carefully. We distinguish "online environment interactions" from "planning interactions" (i.e., tree search predictions). If we compare only online interactions, we will overestimate the sample efficiency of SoftTreeMax because it hides the extra planning iterations during policy inference. However, if we also naively add in the planning interactions, we will underestimate the sample efficiency of SoftTreeMax. The reason is that the computational complexity of a planning interaction is lower than that of an online interaction, thanks to the GPU batching mechanism that our tree search takes advantage of.

For a fair comparison, we chose the metric of overall wall-clock time, which accounts for both outer and inner iteration complexity. Appendix B.2, Figure 4, shows a sweet spot that balances online vs. planning iterations. This occurs at different depths, depending on the Atari game, with optimal data efficiency usually obtained at $d > 0$. Appendix B.3, Figure 5, complements Figure 4 and highlights the need for fewer outer (online) interactions as the tree deepens. For more details, see the discussions in Sections B.2 and B.3.

Review
Rating: 8

The article introduces a new family of policies called SoftTreeMax, which are a model-based generalization of the popular softmax used in reinforcement learning (RL). SoftTreeMax policies replace the standard policy logits with the expected value of trajectories that originate from specific states and actions. These policies aim to reduce the high variance of policy gradients and improve RL performance.

The article contains theoretical analysis, including variance bounds for SoftTreeMax, that demonstrates how the gradient variance decays exponentially with the planning horizon. Additionally, the article discusses how the gradient bias introduced by an approximate forward model diminishes with the approximation error.

Experimental results comparing SoftTreeMax to distributed Proximal Policy Optimization (PPO) demonstrate that SoftTreeMax leads to better sample complexity and improved performance in various Atari games, with significantly lower gradient variance.

Strengths

The methods introduced in this paper are shown to reliably reduce the variance of PG, which is derived theoretically and then verified in experiments. The paper is clearly written, provides mathematical proofs and practical implementations of the results, and seems like a meaningful incremental contribution.

Weaknesses

Lack of experiments with probabilistic environments.

Questions

  • What do you mean by "reward and variance are negatively correlated" on page 8?
  • The definition of Var_x(X) seems to have a typo.
  • How would you expect the sampling variance to impact the policy gradient if the expectations cannot be computed exactly?
Comment

Thank you for your response. Please find our answers in the following.

Q1: What do you mean by "reward and variance are negatively correlated" on page 8? We refer to the empirical finding in Figure 3. Its data shows that low variance is correlated with high reward.

Q2: Can you fix the typo in the Var_x(X) definition? Typo fixed, thank you.

Q3: How would sampling impact the policy gradients if expectations cannot be computed exactly? We perform sub-sampling as part of the pruning process, beyond depth 3. In ablation studies, we saw that extreme pruning leads to low performance and slow convergence. This is because sub-sampling increases the bias of the gradient, as it is no longer an exact expectation over trajectories. Sub-sampling also affects the variance, but the effect is complex and may vary for different choices of pruned nodes.
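A small numerical illustration of this point (a standalone sketch, not from the paper): when the leaf scores are well separated, pruning the low-scoring leaves before exponentiation barely changes the resulting distribution, while aggressive pruning removes non-negligible mass and hence biases the gradient more.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(scale=3.0, size=1024)        # hypothetical, well-spread leaf scores

def pruned_policy(scores, k):
    keep = np.argsort(scores)[-k:]               # keep only the top-k leaves
    p = np.zeros_like(scores)
    p[keep] = np.exp(scores[keep] - scores[keep].max())
    return p / p.sum()

full = pruned_policy(scores, len(scores))        # no pruning
for k in [512, 256, 16]:
    print(k, np.abs(full - pruned_policy(scores, k)).sum())  # L1 gap grows as k shrinks
```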

Review
Rating: 5

This work proposes SoftTreeMax, which uses planning to reduce the policy gradient variance. In particular, the authors propose two variants, C-SoftTreeMax and E-SoftTreeMax, where the logits are re-defined as in Eq. (2). They show that the variance of the proposed gradient decays exponentially w.r.t. $d$ (the trajectory depth). They also characterize the gradient bias in terms of the approximation error. Experiments on Atari show that the proposed methods achieve better performance and lower variance than PPO.

Strengths

  1. The paper is well-written, with clear introduction of the settings, methods, and results.
  2. Combining policy gradient and tree search seems very interesting.
  3. Experiments verify the proposed methods, and they look promising.

Weaknesses

  1. It is confusing to me where the exponential decay of variance comes from, i.e., whether it is due to the design or to the fact that the policy is nearly deterministic, and it is therefore not clear to me whether reducing both the gradient and the variance would be beneficial (please see the question below).

Questions

Looking at Lemma 4.1 and Lemma 4.3, it seems the exponential decay of variance comes from $\nabla_\theta \log \pi_\theta(\cdot \mid s)$. If $\pi_\theta(\cdot \mid s)$ has a softmax parameterization, then this basically means the policy is nearly deterministic? If this is true, then it also means the policy gradient has to be close to zero (a softmax policy has almost zero gradient near deterministic policies), which is expected to slow down convergence. Could you explain why reducing both the gradient and the variance to exponentially small values would help learning?

Comment

Thank you for your response. Please find our answers in the following.

Q1: When the policy is nearly deterministic, how do softmax and SoftTreeMax behave? In short: softmax may indeed become "stuck" at deterministic policies, but this does not happen with SoftTreeMax. This question highlights the advantage of SoftTreeMax over softmax.

In more detail, the softmax gradient is indeed almost zero near deterministic policies, as also observed in [Mei et al., 2020a]. In contrast, SoftTreeMax avoids these local optima by integrating the reward into the policy itself. More formally, for softmax, $\nabla_\theta \log \pi_\theta(a|s) = e_a - \pi_\theta(\cdot|s)$, where $e_a \in [0,1]^A$ is the vector with 0 everywhere except for the $a$-th coordinate. A deterministic policy corresponds to $\theta(s,a) \rightarrow \infty$ for one $a$ per $s$, and then this expression tends to zero. In contrast, it follows from Lemmas 4.3 and 4.6 that the SoftTreeMax gradient will not be zero in the respective cumulative or exponential variants.
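A quick numerical check of the formula above (a standalone illustration, not code from the paper): as the softmax logits of one state grow more extreme, the norm of $e_a - \pi_\theta(\cdot|s)$ for the greedy action shrinks toward zero.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for scale in [1.0, 10.0, 100.0]:
    theta = scale * np.array([3.0, 1.0, 0.5])    # logits of one state
    pi = softmax(theta)
    a = int(np.argmax(pi))                       # greedy (near-deterministic) action
    grad = np.eye(len(theta))[a] - pi            # grad of log pi(a|s) w.r.t. theta(s, .)
    print(scale, np.linalg.norm(grad))           # norm decays toward zero
```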

Weakness & Q2: Why would reducing both gradient and variance to exponentially small values help learning? Thank you for a very important question. The key observation is that SoftTreeMax does not reduce the gradient uniformly (which would have been equivalent to a change in learning rate). While the gradient norm shrinks, the gradient itself scales differently along the different eigenvectors, as a function of the problem parameters ($P$, $\theta$) and our choice of behavior policy ($\pi_b$). This allows SoftTreeMax to learn a good "shrinkage" that, while reducing the overall gradient, still updates the policy quickly enough. We will clarify this in the revised paper.

Figure 3 in the paper presents empirical evidence that the theory holds in our case. It depicts the gradient variance for each (environment, depth) pair, showing a strong correlation between lower variance and higher reward. In some games, performance is non-monotonic because all experiments used the same computational budget (the same hardware for the same limited amount of time). So, in some cases the variance was very small, but each training iteration took very long to compute and fewer iterations were completed by the end of the run.

Review
Rating: 6

This paper proposes a new policy gradient method, named SoftTreeMax, which combines tree search with the policy gradient method. The authors analyze the gradient variance of SoftTreeMax and reveal how tree expansion helps reduce this variance.

Strengths

  1. The idea of incorporating tree search within policy gradient is novel.

  2. The variance analysis is solid.

  3. The proposed softmax tree method is also extended to an infinite action space.

  4. Multiple experiments are conducted to demonstrate the necessity of reducing the variance of PG for improving performance, as well as the empirical performance advantage of the proposed method.

Weaknesses

See questions.

Questions

  1. When there is approximation error in the model P and r, what is the variance of SoftTreeMax? Does the claimed exponential variance reduction still hold with the approximate model? Is it worthwhile to turn PG, a model-free method, into a model-based method by combining it with tree search?

  2. Although the authors mention that formally proving the conjectured global convergence with a fast rate, as in (Mei et al., 2020b), is left to future work, it is hard to demonstrate an advantage over the traditional softmax policy gradient (or, more generally, over traditional policy gradient methods) without comparing the sample complexity of the SoftTreeMax policy gradient with that of the traditional softmax policy gradient. The main missing piece is that a reduction in variance does not necessarily imply faster convergence or smaller sample complexity if achieving this variance reduction requires a form of policy gradient that sacrifices performance in the deterministic setting (for example, I am not sure whether the proposed SoftTreeMax policy gradient will even converge in the deterministic setting).

  3. There are some relevant papers that address related problems, which the authors may want to add to the related work.

(1) Optimization Methods for Interpretable Differentiable Decision Trees in Reinforcement Learning, Andrew Silva et al., AISTATS 2020.

(2) On the Global Optimum Convergence of Momentum-based Policy Gradient, Yuhao Ding et al., AISTATS 2022.

It is important to compare with (1) to evaluate whether this paper is still the first work to propose a differentiable parametric policy that combines tree expansion with PG. (2) also studies convergence and variance reduction for softmax PG.

Comment

Thank you for your response. Please find our answers below.

Q1a: Does the claimed exponential variance reduction still hold with the approximate model? Thank you for raising this important question. Yes, the variance decays exponentially at a rate dictated by $\hat{P}^{\pi_b}$ instead of $P^{\pi_b}$. As the two matrices get closer when the approximation error decays, the rate naturally also becomes closer due to eigenvalue continuity. To study the consequence of using an approximate model, we analyzed the bias in Theorem 4.8.

To formally see why the variance decays exponentially in the approximate case, recall the three key steps in our proof of Theorem 4.4:

  1. Eigendecomposition of $P^{\pi_b}$, with the eigenvalue 1 not repeating, and noting that the leading eigenvector is $1_A$ (the all-ones vector of size $A$).

  2. Noting that $P_s 1_S = 1_A$.

  3. Noting that $[I_A - 1_A \pi^C_{d,\theta}] 1_A = 0$.

The first two steps also hold for the approximate model since these steps only rely on row-stochasticity (leaving aside the pathological case where the eigenvalue $1$ repeats in $\hat{P}^{\pi_b}$, which will not occur when the latter is close to $P^{\pi_b}$). Finally, the third step is valid as long as $\hat{\pi}^C_{d,\theta}$ is a distribution, which is trivially true.
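For concreteness, here is a small numpy check of steps 2 and 3 (an illustration under assumed shapes: a row-stochastic per-state matrix $P_s$ of size $A \times S$, and an arbitrary action distribution standing in for $\hat{\pi}^C_{d,\theta}$):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3

# Step 2: P_s has one row per action, each row a next-state distribution,
# so row-stochasticity alone gives P_s 1_S = 1_A.
P_s = rng.random((A, S))
P_s /= P_s.sum(axis=1, keepdims=True)
assert np.allclose(P_s @ np.ones(S), np.ones(A))

# Step 3: for any distribution pi over the A actions (row vector),
# (I_A - 1_A pi) 1_A = 1_A - 1_A (pi @ 1_A) = 0.
pi = rng.random(A)
pi /= pi.sum()
assert np.allclose((np.eye(A) - np.outer(np.ones(A), pi)) @ np.ones(A), 0.0)
```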

We will add this new discussion to the revised version.

Q1b: Is it useful to turn PG, a model-free method, into a model-based method by combining it with tree search? We believe that PG as a model-based method has great potential to improve upon its model-free version. To demonstrate this beyond what was shown in the submitted version with respect to PPO, we have now added a SOTA baseline: EfficientZero [Ye et al., 2021]. We chose EfficientZero because it is a highly sample-efficient version of MuZero -- one of the best-known model-based algorithms today -- and is also open source. The results are given here: https://ibb.co/6Zd24yk. As seen in the plots, SoftTreeMax surpasses EfficientZero in all games but one. We will add full experiments in the final version.

Experiment details: We used the same hardware to run all experiments, as described in Appendix B.1. For “SoftTreeMax”, we present the best depth out of those in Appendix B.2, Figure 4.

Q2a: Direct comparison of sample complexity is missing. For a fair comparison, we estimate sample complexity through wall-clock time; we explain the considerations behind this decision in detail in our reply to Q4 of Reviewer eGu2.

For wall clock time, in Appendix B.2, Figure 4, we provide the convergence plots as a function of wallclock time. It shows a clear benefit for SoftTreeMax over distributed PPO with the standard softmax policy. In most games, PPO with the SoftTreeMax policy shows very high sample efficiency: it achieves higher episodic reward even though it observes far fewer episodes, for the same running time. In addition, in Appendix B.3, Figure 5, we provide convergence plots of SoftTreeMax where the x-axis is the number of online interactions with the environment, thus excluding the tree expansion complexity. The plot demonstrates a monotone improvement as a function of the tree depth.

Q2b: Does a reduction in variance necessarily imply faster convergence? There are several theoretical results that support this relation. First, the general fact that variance reduction translates to faster convergence was established in stochastic approximation theory. In the context of PG, several recent works proved improved convergence rates thanks to variance reduction techniques such as SVRG [Papini et al., 2018] (see also [Liu et al., 2020]).

For the SoftTreeMax discussed here, note that SoftTreeMax does not reduce the gradient uniformly (which would have been equivalent to a change in learning rate). While the gradient norm shrinks, the gradient itself scales differently along the different eigenvectors, as a function of the problem parameters ($P$, $\theta$) and our choice of behavior policy ($\pi_b$). This allows SoftTreeMax to learn a good "shrinkage" that, while reducing the overall gradient, still updates the policy quickly enough.

Figure 3 in the paper presents empirical evidence that supports the theory in our case. It depicts the gradient variance for each (environment, depth) pair, showing a strong correlation between lower variance and higher reward. In some games, performance is non-monotonic because all experiments used the same computational budget (the same hardware for the same limited amount of time). So, in some cases the variance was very small, but each training iteration took very long to compute and fewer iterations were completed by the end of the run.

Comment

Q2c: I am not sure whether the proposed SoftTreeMax policy gradient will even converge in the deterministic setting. SoftTreeMax does converge in a deterministic setting. To test this, we conducted experiments in both the deterministic and the quasi-deterministic variants of the Atari environment. In both cases, we obtained convergence, and the results are almost identical. The quasi-deterministic setting is a popular variation that adds to the deterministic emulator a small stochastic component via “sticky actions” [Machado et al., 2017]. We will discuss this in the revised version.

Q3a: How does your work relate to "Interpretable Differentiable Decision Trees" from [Silva et al., 2020]? Although the terminology appears closely related, the work by Silva et al. is only loosely connected to ours. There is a delicate distinction to be made here, and one should be aware of the overloading of the term "tree". The differentiable tree-policy that we propose relates to a tree of futures, in the context of planning. The DDT paper examines a decision tree as a function approximator (instead of, e.g., a neural network). The two are orthogonal and can be used in conjunction. We added this reference to the related work section in the revised version.

Q3b: Consider [Ding et al., 2022] as a reference on convergence and variance reduction for softmax PG. Thank you, we added it in the revised version.

Comment

We thank the reviewers for their dedicated and comprehensive efforts. We are encouraged that all reviewers found the paper well written [84Hk], clearly presented, novel, and impactful. Specifically, they found that the paper provides a meaningful contribution [Xrnz], that the proposed SoftTreeMax algorithm is a promising and unique solution for variance reduction [eGu2], that the variance analysis is solid [vDhi], and that the experiments are promising [84Hk]. Beyond this encouraging feedback, the reviewers also made valuable suggestions, which we will incorporate in the revised version, and comments, which we now address.

Following a comment by Reviewer eGu2, we added a comparison with a strong baseline to our experiments: EfficientZero [Ye et al., 2021]. We chose it because it is a highly sample-efficient version of MuZero that is also open source; MuZero is one of the best-known RL algorithms to date. The results completed during the rebuttal are given here: https://ibb.co/6Zd24yk, showing that SoftTreeMax surpasses EfficientZero in all the games we tested except one.

AC Meta-Review

This paper proposes a new policy parameterization called SoftTreeMax, which, the authors claim, can reduce the policy gradient variance. Experiments on Atari show that the proposed methods achieve better performance and lower variance than PPO.

However, there is a major issue: the so-called "variance reduction" seems to be an artifact of parameter scaling, which reduces the gradient and the standard deviation (the square root of the variance) at the same rate; this does not make sense for a "variance-reduced" method.

In detail, let us take the C-SoftTreeMax parameterization as an example. Once the behavioral policy $\pi_b$ is chosen and fixed, it simply has the form of a softmax-linear policy parameterization with a shift: $\pi(\cdot|s) \propto \exp(a_s + \gamma^d B_s \theta)$ for some vector $a_s$ and matrix $B_s$ depending on $\pi_b$ and the transition kernel, which is nothing new. Moreover, the authors claim that the variance decays exponentially in $d$ at a rate of $\gamma^{2d}$. However, this argument is not convincing because, by Lemma 4.3, the magnitude of the policy gradient is also $O(\gamma^d)$, which is of the same order as the standard deviation (the square root of the variance). Therefore, the claimed variance reduction looks more like an artifact of parameter scaling, and the "reduced noise" does not seem to be smaller than the gradient information itself.

Moreover, as practical values of $\gamma$ are often $0.9$ or $0.99$, the "exponential decay" looks only linear when $d$ is small, e.g., $0.99^{100} = 0.37$; that is, for $d \leq 100$, almost no "exponential decay" in the variance can be experienced. Instead, the computational cost does grow exponentially in $d$. Indeed, the authors report that $d = 3$ is already the best tradeoff their device can handle.
Moreover, when $d$ increases, the gradient also diminishes exponentially in $d$, which can make training extremely slow. For this reason, even if the authors' device allowed them to compute extremely deep trees, the exponentially small effective stepsize would again hinder training.

The proposed method is only advantageous when the behavioral policy $\pi_b$ already performs very well; then the method is able to make some slow improvement. If $\pi_b$ is bad, then the proposed parameterization does not seem much different from a simple rescaling of tabular softmax, $\pi(\cdot|s) \propto \exp(\epsilon \theta_s)$, which also has $O(\epsilon^2)$ variance for an $O(\epsilon)$-small gradient.
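The scaling point above can be illustrated numerically (a standalone sketch, not from the paper or the review): for a tabular softmax with logits $\epsilon\theta_s$, each sampled score-function term scales with $\epsilon$, so the mean gradient and its standard deviation shrink together and the variance is $O(\epsilon^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.6, -0.2, 0.1])              # tabular logits for one state
q = np.array([1.0, 0.2, -0.5])                  # arbitrary fixed action values

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for eps in [1.0, 0.1, 0.01]:
    pi = softmax(eps * theta)
    acts = rng.choice(3, size=100_000, p=pi)
    # per-sample REINFORCE term: q(a) * grad_theta log pi(a) = q(a) * eps * (e_a - pi)
    grads = q[acts, None] * eps * (np.eye(3)[acts] - pi)
    print(eps, np.linalg.norm(grads.mean(axis=0)), grads.std(axis=0).mean())
```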

Therefore, the AC is not convinced by the authors' results and would like to suggest a rejection.

Why Not a Higher Score

Paper has theoretical flaws.

Why Not a Lower Score

N/A

Final Decision

Reject