PaperHub
Overall score: 6.1/10 · Poster · 4 reviewers (ratings 3, 3, 3, 4; min 3, max 4, std 0.4)
ICML 2025

Policy Gradient with Tree Expansion

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

SoftTreeMax: A novel parametric policy based on tree expansion that can be optimized using policy gradient methods

Abstract

Keywords
Reinforcement Learning, Policy Gradient, Tree Expansion, Softmax, Variance Reduction

Reviews and Discussion

Review
Rating: 3

The paper introduces SoftTreeMax, a new approach that combines policy gradient (PG) with tree search. The goal is to address the inherent high gradient variance in traditional PG algorithms.

The authors present theoretical analysis showing that the gradient variance of SoftTreeMax decays with the depth of the tree expansion, and that this decay rate is influenced by the second eigenvalue of the transition matrix induced by the tree expansion policy.

On the empirical side, they utilize a GPU-based simulator to efficiently manage the exponential computational cost of expanding the tree. Empirical validation is done through experiments on the Atari benchmark suite, where the proposed approach demonstrated lower gradient variance and better performance compared to Proximal Policy Optimization (PPO).

After rebuttal

The rebuttal did not significantly change my view of the paper. I will maintain my score.

Questions for Authors

  • How can one handle a stochastic environment with an infinite number of states?

  • Does it make sense to change the behavioral policy over time? Specifically, does it make sense to use the current policy (or a regularized version of it) as the tree-expansion policy?

Claims and Evidence

Overall, the claims are clear.

The variance bound of C-SoftTreeMax in Theorem 4.4 depends on the number of states S, which is typically large (or even infinite). It is unclear whether this dependency is unavoidable in the worst case, as no lower bound is given, and a comparison between the bound and the empirical variance is not provided.

For E-SoftTreeMax, the bound in Theorem 4.7 does not depend on S, and an empirical comparison is included. Thus, this claim is more convincing.

Theorem 4.8 provides a bound on the bias of the gradient estimator in cases where the dynamics model is inaccurate. This bound scales linearly with $S$ and $d$. Again, it is unclear whether this dependency is unavoidable, and no empirical comparison is provided. Assuming this dependency is accurate, the variance decreases with $d$ while the bias increases. This tradeoff on overall performance is not examined, as the experiments are conducted with a precise forward model (which does not exist in most practical applications).

Finally, in the conducted experiments on the Atari environment, where a precise forward model does exist, the paper demonstrates that the variance indeed reduces with $d$ and the overall performance is better than PPO's.

Methods and Evaluation Criteria

The theoretical results provide bounds on the variance as well as on the bias of the estimator as a function of the tree depth d which definitely makes sense.

The comparison to PPO also makes sense, as this is one of the most popular PG algorithms.

Since the method aims to combine PG with tree search, it would make sense to compare it to a benchmark that is based on MCTS (such as AlphaZero or something similar).

Theoretical Claims

I read the proof outline in the main text and did not find anything that raised my suspicion. I have also read the proof of Lemma 4.1 in the appendix.

Experimental Design and Analysis

The conducted experiments seem valid. As mentioned earlier, some additional experiments could be beneficial, such as a comparison to a tree search algorithm. Additionally, regarding the comment in footnote 1, I think it would be beneficial to compare the two suggested algorithms in a non-deterministic benchmark.

Supplementary Material

Lemma 4.1, and the experiments in figures 4 and 5

Relation to Existing Literature

The paper addresses the well-known issue of high gradient variance in policy gradient algorithms.

Missing Important References

N/A

Other Strengths and Weaknesses

The paper presents a novel approach for reducing gradient variance.

A major limitation is that the algorithm requires computing expectations over future states, which can be very challenging in stochastic environments with a large number of states.

Other Comments or Suggestions

N/A

Author Response

Thank you for your detailed and thoughtful review.

Dependency of the bound on $S$, and lower bound

RL analyses on tabular MDPs often include $S$ terms, usually stemming from the triangle inequality applied to summations over states. These dependencies can sometimes be replaced by structural assumptions on the transition matrix (such as branching). This is a common and important theoretical-practical gap: while the theory suggests vacuous bounds for large $S$, in practice we see exponential variance reduction with $d$ for the huge state space in Atari.

On a related note, a lower bound on the variance is given in Appendix A.5 (it does not include $S$).

Bound on E-SoftTreeMax does not contain $S$

There is a dependence on $S, A$ also in the case of E-SoftTreeMax. However, notice that the bound in Thm 4.7 is presented in O-notation while the C-SoftTreeMax bound is expressed explicitly. The critical theoretical contribution in both cases is the identification of the exponential decay rate with respect to tree depth.

Regarding the missing numerical variance measurements for C-SoftTreeMax: we conducted similar variance analyses for both SoftTreeMax variants, but included only the E-SoftTreeMax results in the paper to avoid redundancy. The numerical results for C-SoftTreeMax show similar behavior and correspondence to the theory. We will clearly state this in the revision.

The primary reason we included the detailed variance measurements for E-SoftTreeMax was to verify our theoretical hypothesis from Thm 4.7 that the parameter $\alpha$ is related to $\lambda_2$. Our numerical experiments confirmed this in the general case by showing that variance decays at a rate determined by the mixing properties of the baseline policy.

We will add a brief discussion of C-SoftTreeMax's empirical variance characteristics in the revised manuscript to ensure completeness.

Comparison to MCTS

Following your comment, we added comparisons with a strong baseline to our experiments: EfficientZero [Ye et al., 2021]. We chose it because it is a highly sample-efficient version of MuZero that is also open source. MuZero is one of the best known RL algorithms to date. The results are given here: https://ibb.co/zHG3KP8m, showing that SoftTreeMax surpasses EfficientZero in all the games we tested except one. We will add full experiments in the final version.

Question 1: How can one handle a stochastic environment with an infinite number of states?

For stochastic environments with infinite state spaces, we propose three complementary approaches:

  1. We can use Monte Carlo sampling to expand the tree. This approach maintains a parametric distribution over actions (potentially dependent on $\theta$) and samples from it to build the tree. This method can be viewed as a tree adaptation of Model Predictive Path Integral control (MPPI) with a value function (a minimal sampling sketch is given after this list).

  2. Function approximation for leaf states: For infinite state spaces, we already use neural networks to approximate the value function at leaf states. This approach naturally extends to continuous state spaces.

  3. Theoretical extension: The key concepts in our analysis can be adapted to infinite state spaces by:

    3a. Replacing transition matrices with kernels in the continuous case.

    3b. Exploiting that $\pi_b$'s transition kernel is a non-expansive operator.

    3c. Leveraging that the eigenvalue 1 gets canceled for the policy gradient.

These properties can be shown to hold for decision models with infinite state and action spaces, though we leave the complete theoretical development for future work.
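
For concreteness, here is a minimal sketch of the sampled expansion in approach 1. It assumes a hypothetical forward-model interface with `env.num_actions` and `env.step(state, action) -> (next_state, reward, done)`, a sampling `expansion_policy`, and a learned `leaf_value`; none of these names come from the paper's code.

```python
import numpy as np

def sampled_expansion_scores(root_state, env, expansion_policy, leaf_value,
                             depth, num_samples, gamma=0.99):
    """Estimate a per-root-action score by Monte Carlo rollouts of the expansion
    policy instead of exhaustive enumeration (useful for large/infinite state spaces)."""
    scores = np.zeros(env.num_actions)
    for a in range(env.num_actions):
        returns = []
        for _ in range(num_samples):
            state, reward, done = env.step(root_state, a)
            total, discount = reward, gamma
            for _ in range(depth - 1):
                if done:
                    break
                action = expansion_policy.sample(state)
                state, reward, done = env.step(state, action)
                total += discount * reward
                discount *= gamma
            total += discount * leaf_value(state)  # bootstrap at the leaf
            returns.append(total)
        scores[a] = np.mean(returns)
    return scores  # e.g., softmax(beta * scores) would give an action distribution
```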

Question 2: Does it make sense to change the behavioral policy over time? Specifically, does it make sense to use the current policy as the tree-expansion policy?

Yes, adapting the behavioral policy over time is a promising direction. As training progresses and the policy improves, using it as the tree expansion policy could lead to exploring more relevant parts of the state space. This creates a bootstrapping effect similar to what's done in MCTS algorithms like AlphaZero, where the current policy guides the search.

Still, care must be taken, as trained policies often become deterministic, leading to unfavorable variance properties in semi-deterministic environments (when $P^{\pi_b}$ approaches a permutation matrix, $\lambda_2(P^{\pi_b})$ could potentially approach 1, eliminating the variance reduction benefit). Also, we cannot use the SoftTreeMax policy directly to open the tree, since evaluating it would itself require expanding a tree within the tree expansion.

A reasonable compromise might be to learn a proxy policy as a mixture of the current policy and a uniform policy, with the mixing weight potentially decreasing over time to balance exploration and exploitation while maintaining favorable variance properties.
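
A tiny illustrative sketch of such a mixture policy, assuming a hypothetical `current_policy(state)` that returns a probability vector over actions (the decay schedule is arbitrary):

```python
import numpy as np

def mixed_expansion_policy(state, current_policy, num_actions,
                           step, total_steps, min_uniform_weight=0.1):
    """Mix the current policy with a uniform policy; the uniform weight decays over
    training but is floored so the mixture keeps favorable mixing/variance properties."""
    w_uniform = max(min_uniform_weight, 1.0 - step / total_steps)
    uniform = np.full(num_actions, 1.0 / num_actions)
    return w_uniform * uniform + (1.0 - w_uniform) * current_policy(state)
```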

Thank you for raising this excellent point, which provides fertile ground for future work. We will address this in the revised paper.

Review
Rating: 3

The authors propose a generalization of the softmax parametrization for policy gradient methods that utilizes a breadth-first search tree over future states. This type of parametrization combines planning with policy gradient methods to reduce the latter's variance. The authors prove that, given some assumptions on the planning policy, the variance of the new gradient decreases exponentially with the planning depth, allowing a trade-off between computational efficiency and gradient variance. Finally, the authors propose a GPU implementation of their algorithm, which shows improvement over the PPO baseline in terms of sample complexity.

After rebuttal

I was happy to see the MCTS baseline. However, I still believe that the assumption on the presence of GPU simulators is very strong, so I decided to maintain my score.

Questions for Authors

  • Does the number of online interactions include the samples generated during the tree search?
  • For $d=0$, the method corresponds to the usual softmax parametrization (lines 147-157). Why, in your experiments, can increasing $d$ to 1 or 2 harm performance? In this case, the method gets more information about the environment, which should help training. I also found the case of $d = 1, 2$ the most interesting from a practical perspective, since it should not introduce significant overhead on the tree-search side while still decreasing the variance (across all the experiments, the relative variance reduction between $d=0$ and $d=1$ is the most significant).

Claims and Evidence

Claim 1. The gradient variance of SoftTreeMax decays exponentially with a rate depending on the spectral properties of the transition matrix induced by a tree-search policy;

This claim has been proven theoretically (Theorems 4.4 and 4.7) and demonstrated empirically in an experiment on a randomly generated MDP.

Claim 2. In the case of an approximate forward model, the gradient bias is proportional to the approximation error.

This claim has been proven theoretically; however, there is no empirical evidence since all the experiments use the exact forward model.

Claim 3. The algorithm is implementable on GPU and outperforms PPO in terms of both gradient variance reduction and achieving higher reward.

This claim was confirmed on 8 Atari games, implemented in a parallelizable GPU version of the environment.

Methods and Evaluation Criteria

The proposed method and the evaluation methodology make sense to me. However, given the focus on improving the algorithm's planning capabilities, the inclusion of the planning benchmark (e.g., the game of Go or any other strategic game) could be important in assessing the final improvement.

Theoretical Claims

Theoretical statements make sense to me, although I have not carefully checked them in the appendix.

Experimental Design and Analysis

  • The experimental design lacks comparisons with other model-based approaches.

The proposed algorithm makes a relatively strong assumption about the interaction model with the MDP: the presence of a simulator, or an approximate simulator, to perform a BFS over the next states. Given this assumption, I believe a comparison with MCTS-based methods (with a comparable tree-search budget) should be present.

  • The lack of planning benchmarks.

Since the method is assumed to improve planning capabilities, I expect to see a comparison on some planning benchmark, such as Go or any other strategic game (at least Connect-4).

Supplementary Material

I skimmed the proofs for C-SoftTreeMax and read the experimental details in detail.

Relation to Existing Literature

The improvement of the planning capabilities of policy-gradient methods is relevant to the current scientific literature.

Missing Important References

The paper lacks a discussion of another type of planning without deep search, such as k-step lookahead, during advantage estimation (e.g., Tree Backup (Precup et al, 2000), Retrace (Munos et al, 2016), application in Muesli (Hessel et al, 2021)). In particular, I found that the method resembles the idea of n-step Tree Backup (e.g., Sutton & Barto book) but from the perspective of policy parameterization, not Q-value estimation.

Precup, Doina, Richard S. Sutton, and Satinder Singh. "Eligibility traces for off-policy policy evaluation." ICML. Vol. 2000. 2000.

Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29.

Hessel, M., Danihelka, I., Viola, F., Guez, A., Schmitt, S., Sifre, L., ... & Van Hasselt, H. (2021, July). Muesli: Combining improvements in policy optimization. In International conference on machine learning (pp. 4214-4226). PMLR.

Other Strengths and Weaknesses

Overall, I enjoy the idea of integrating planning directly into the policy's parametrization and the theoretical guarantees for variance reduction given the depth of the search.

However, as a limitation acknowledged by the authors, I also find the assumption of the presence of a GPU simulator to be very strong.

Other Comments or Suggestions

Author Response

Thank you for your detailed and thoughtful review. We appreciate the careful reading of our work and the constructive feedback. Below, we address your concerns:

Additional MCTS baselines

Following your comment, we added comparisons with a strong baseline to our experiments: EfficientZero [Ye et al., 2021]. We chose it because it is a highly sample-efficient version of MuZero that is also open source. MuZero is one of the best known RL algorithms to date. The results that were completed during the rebuttal are given here: https://ibb.co/zHG3KP8m, showing that SoftTreeMax surpasses EfficientZero in all the games we tested except one. We will add full experiments in the final version.

Question 1: Does the number of online interactions include the samples generated during the tree search?

No, the number of online interactions does not include the samples generated during tree search. We distinguish between two types of interactions:

  1. Online environment interactions: These are actual interactions with the environment during training (e.g., when collecting trajectories for PPO updates).

  2. Planning interactions (tree search): These are simulated forward passes using our GPU-based simulator to expand the tree.

As we explained in our experiments, the computational complexity of planning interactions is substantially lower than online interactions, especially with our GPU batching mechanism for tree expansion. This is why we track them separately, and our sample complexity plots show only the online interactions.
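
For intuition about why planning interactions batch cheaply, below is a schematic numpy sketch of breadth-first expansion with one batched simulator call per level. The `batched_step(states, actions) -> (next_states, rewards)` interface is hypothetical, not the paper's actual GPU simulator API.

```python
import numpy as np

def exhaustive_expansion(root_state, batched_step, num_actions, depth, gamma=0.99):
    """Breadth-first expansion: at every level, each current leaf is expanded with
    every action in a single batched simulator call (A**depth leaves at the end)."""
    states = np.asarray(root_state)[None, ...]   # (num_leaves, *state_shape)
    returns = np.zeros(1)
    discount = 1.0
    for _ in range(depth):
        n = len(states)
        rep_states = np.repeat(states, num_actions, axis=0)      # (n * A, ...)
        rep_actions = np.tile(np.arange(num_actions), n)         # (n * A,)
        states, rewards = batched_step(rep_states, rep_actions)  # one planning batch
        returns = np.repeat(returns, num_actions) + discount * rewards
        discount *= gamma
    return states, returns  # leaves and their accumulated discounted rewards
```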

For a complete view of computational efficiency, we also provide wallclock time comparisons in Figure 4, which accounts for the total computation including tree expansion. As seen there, despite the additional computation for tree search, SoftTreeMax still demonstrates favorable performance per unit time for appropriate choices of depth.

Question 2: Why can increasing d to 1 or 2 harm performance in some experiments?

This is an excellent observation. While theoretically more information should help, this illustrates the basic bias-variance tradeoff in the SoftTreeMax policy:

At small depths (d=1,2), we introduce a structural bias by incorporating tree search, but the variance reduction benefit hasn't fully kicked in yet. The theoretical variance reduction is exponential in d, so the effect becomes more pronounced at larger depths.

We will address this valuable insight in the revised paper.

Reviewer Comment

I would like to thank the authors for their response, and I would like to keep my score.

Review
Rating: 3

This paper extends softmax policy gradient methods by integrating planning through tree expansion. The authors introduce two implementation variants, C-SoftTreeMax and E-SoftTreeMax, which differ in whether the expectation is computed inside or outside the exponent. They analyze the policy gradient variance for both approaches in the tabular case and demonstrate that the variance upper bound decreases with increasing tree depth at a rate determined by the second-largest eigenvalue of the behavior policy. Additionally, the paper examines the gradient bias introduced when employing an approximate model. The theoretical findings are further validated through experiments on several Atari games, where GPU-based simulation is leveraged to mitigate the computational overhead introduced by tree expansion.
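
Schematically (in illustrative notation not taken from the paper, with $R_d$ standing for the discounted score accumulated over a depth-$d$ expansion from root action $a$ under the expansion policy), the two variants differ as follows:

$$\pi^{C}_{d,\theta}(a \mid s) \;\propto\; \exp\!\big(\beta\,\mathbb{E}[R_d \mid s, a]\big), \qquad \pi^{E}_{d,\theta}(a \mid s) \;\propto\; \mathbb{E}\big[\exp(\beta R_d) \mid s, a\big].$$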

Questions for Authors

  1. The bounds on the variance seem loose, in that when the temperature $\beta$ is infinite, the bound is infinite while the policy is deterministic, which should yield zero variance in a deterministic MDP. Could you clarify whether this is an artifact or whether there is another reason?
  2. Could you clarify the potential issue with the proofs of Lemma A.5 and Theorem A.6? A detailed discussion of the bound on $\|\hat{M}-M\|$ would be particularly helpful.
  3. Could you provide intuition regarding the differences between the variance upper bounds of C-SoftTreeMax and E-SoftTreeMax?

Claims and Evidence

The main issue is in the bound for the policy gradient bias when using an approximate model.

The proof doesn't entirely check out. In the proof of Lemma A.5, it is stated that $\|\hat{M}-M\| = O(\beta d \epsilon)$. However, it is unclear why $\gamma^{-d}$ from $C_{s,d}$ does not appear in the bound. Moreover, if the gap between $\hat{M}$ and $M$ is large, i.e., not approaching 0, the subsequent proof may not hold.

Another reservation I have concerns the difference between the variance upper bounds of C-SoftTreeMax and E-SoftTreeMax. Specifically, the former scales with $S^2$ and $A^2$, while the latter contains neither $S$ nor $A$. I am not sure why there is such a discrepancy.

Related to the bounds, while the variance of E-SoftTreeMax is verified in the numerical experiment illustrated in Fig. 1, no such verification is provided for the result on C-SoftTreeMax.

Methods and Evaluation Criteria

The methods and evaluation mostly make sense to me. However, I have concerns about the assumption in E-SoftTreeMax that $r(s, a)$ equals $r(s)$, as this does not seem to hold in either Atari or MuJoCo, despite the paper's claim that this is typical for these environments. In Atari, for example, the reward is determined by transitions between consecutive states (i.e., $r(s, s')$). MuJoCo's reward functions generally include a dedicated component determined by the action.

Theoretical Claims

I carefully checked the proof for all results.

There seems to be an issue with the proof for Lemma A.5 and Theorem A.6, as I discussed in the “Claims And Evidence” section.

In addition, the mathematical analysis would benefit from improved clarity and proofreading.

For instance, the definitions of $\theta$ and $\Theta$ are ambiguous. Sec. 2.1 describes $\theta$ as a mapping from $S \times A$ to $\mathbb{R}$. However, the beginning of Sec. 2 describes $\Theta$ as a vector in $\mathbb{R}^S$, a vector representation of $\theta(s)$, which seems to imply that the latter is a scalar. It is not clear how $\theta(s)$ is defined, or what its dimensionality is.

Also please note the typos listed under “Other Comments Or Suggestions”.

Experimental Design and Analysis

Yes, I checked the validity of both numerical and Atari experiments and found no major issues overall. However, there are two minor issues:

  1. variance bounds: the missing numerical result for C-SoftTreeMax which was brought up earlier.

  2. behavior policy: In Sec. 8, the paper claims to have "explained how to choose the expansion policy to minimize the gradient variance". More precisely, however, I think what was explained is which properties we want the induced transition matrix to have; the procedure for how to "choose" a behavior policy is not clearly explained. In fact, in the deep RL experiments on Atari, only one type of policy (uniform) was used as the behavior policy. It is not clear what level of impact the behavior policy has on learning performance.

Supplementary Material

I did not run the code but took a look into it.

Relation to Existing Literature

It extends the existing RL literature on two fronts:

1. It integrates planning with model-free policy gradient methods.

2. In the context of planning with a local simulator [Yin et al., 2022], most prior work has been concerned with action-value methods. This paper provides some empirical evidence supporting the benefits of combining policy gradient methods with planning.

D. Yin, B. Hao, Y. Abbasi-Yadkori, N. Lazić, and C. Szepesvári. Efficient local planning with linear function approximation. The 33rd International Conference on Algorithmic Learning Theory, 2022.

Missing Important References

No

Other Strengths and Weaknesses

N.A.

Other Comments or Suggestions

I recommend changing the title to "Softmax Policy Gradient with Tree Expansion", since the main results are specific to the softmax parameterization.

List of typos:

page 1, the last sentence of the 2nd last paragraph, “we prove that the with …”

page 2, Sec 2, the RHS of the definition of $\mu_\pi$ should be $\mu_\pi^\top$ instead of $P^\pi$

page 2, Sec 2.1, in the variance expression of a random vector $X$, the $\top$ should be on the second parenthesis.

page 24, Sec B.3, the symbol inside the parenthesis in “In Gopher, however, for large depths ()”

Author Response

Thank you for your thoughtful review. We appreciate the careful reading of our proofs and the constructive feedback. We address your concerns below.

Proof issues in Lemma A.5 and Theorem A.6

You're correct that $\gamma^{-d}$ from $C_{s,d}$ should appear in this bound. This oversight makes the proper bound $O(\beta d \gamma^{-d} \epsilon)$. When the approximation error $\epsilon$ is small enough and $\beta$ properly controlled, this term still approaches 0 as required for the subsequent proof. We will correct this in the revised version.

Variance bounds discrepancy

E-SoftTreeMax also depends on $S, A$, but Theorem 4.7 uses O-notation while C-SoftTreeMax shows explicit constants. Both establish the crucial exponential decay rate with tree depth. This exponential decay explains why deeper trees provide more stable policy gradient estimates and better overall performance, regardless of specific constants or dependencies. The key contribution is establishing this relationship between tree depth and variance reduction through the eigenstructure of the MDP's transition dynamics.

We'll highlight this insight more clearly.

Assumption that r(s,a) equals r(s)

This simplification was made for theoretical analysis of E-SoftTreeMax. Our practical implementation handles general reward structures correctly in all experiments.

Definition of $\theta$ and $\Theta$

Thank you for highlighting this inconsistency. In Section 2.1, we use $\theta$ conventionally for policy parameters, but later use $\Theta \in \mathbb{R}^S$ and $\theta(s)$ as a scalar state function. We'll replace the notation in Section 2.1 with a different symbol, reserving $\theta$ and $\Theta$ exclusively for our SoftTreeMax formulation to ensure consistency throughout the paper.

Verification for C-SoftTreeMax

We analyzed both variants but included only E-SoftTreeMax results (Figure 1) to avoid redundancy. C-SoftTreeMax shows similar behavior and correspondence to theory. We focused on E-SoftTreeMax to verify that the parameter $\alpha$ relates to the transition matrix's second eigenvalue in the general case, which our experiments confirmed. We'll add a brief discussion of C-SoftTreeMax's empirical variance characteristics for completeness.

Behavior policy choice

Our theory identifies properties of ideal behavior policies but doesn't provide concrete procedures for complex domains. In Atari experiments, we used uniform policies for simplicity. Preliminary experiments with other policies showed similar trends but weren't comprehensive enough to include.

We'll clarify our claims and add discussion about the practical impact of behavior policy choice.

Minor comments and title suggestion

We appreciate your writing suggestions and will consider revising the title to better reflect our focus.

Question 1 regarding loose variance bounds

This is indeed an artifact of our analysis approach. Our bound becomes loose in this regime because we use worst-case inequalities at multiple steps in the proof that do not capture the special structure that emerges in deterministic settings. The $\beta^2$ term appears because we bound the gradient norm without accounting for how the policy structure changes as $\beta$ increases. For the practical parameter ranges used in training, our bound provides useful insights about variance reduction with tree depth. In the revised paper, we'll clarify this limitation and discuss how the bound could be refined to better handle the deterministic limit case.

Question 2 on the potential issue with the proof

Please see our response above regarding the missing $\gamma^{-d}$ term.

Question 3 on variance bounds intuition

C-SoftTreeMax has a more decomposable gradient structure (Lemma 4.3): $$\nabla_\theta \log \pi_{d,\theta}^C = \beta\left[I_A - \mathbf{1}_A \left(\pi_{d,\theta}^C\right)^\top\right] P_s \left(P^{\pi_b}\right)^{d-1}.$$ This allows direct spectral decomposition. When the projection matrix interacts with the decomposed $(P^{\pi_b})^{d-1}$, the stationary-distribution term gets canceled, leaving terms scaled by $\lambda_i^{d-1}$, with $|\lambda_2|$ dominating.

E-SoftTreeMax's analysis is more complex, as its gradient involves matrix products and ratios with exponentiated rewards, requiring more sophisticated techniques to establish the convergence rate $\alpha$.

This explains why we provide an exact variance bound for C-SoftTreeMax, while for E-SoftTreeMax we give an asymptotic bound where $\alpha = |\lambda_2(P^{\pi_b})|$ only under certain conditions.
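
As an illustrative check of this cancellation (not from the paper), the following numpy sketch builds a random tabular MDP, uses a fixed probability vector $\pi$ as a stand-in for $\pi^C_{d,\theta}$, and shows that the norm of $\beta[I_A - \mathbf{1}_A\pi^\top]P_s(P^{\pi_b})^{d-1}$ decays with depth at the rate $|\lambda_2|^{d-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, beta = 50, 6, 1.0

P_pib = rng.random((S, S)); P_pib /= P_pib.sum(axis=1, keepdims=True)  # P^{pi_b}, S x S
P_s   = rng.random((A, S)); P_s   /= P_s.sum(axis=1, keepdims=True)    # P_s, A x S
pi    = rng.random(A);      pi    /= pi.sum()                           # stand-in policy

proj = np.eye(A) - np.outer(np.ones(A), pi)           # I_A - 1_A pi^T
lam2 = sorted(np.abs(np.linalg.eigvals(P_pib)))[-2]   # |second-largest eigenvalue|

for d in range(1, 15):
    grad = beta * proj @ P_s @ np.linalg.matrix_power(P_pib, d - 1)
    print(d, np.linalg.norm(grad), lam2 ** (d - 1))    # both decay at the same rate
```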

Reviewer Comment

Thank you for the detailed response, which has addressed most of my concerns. I've increased my score accordingly.

Review
Rating: 4

The paper introduces the SoftTreeMax algorithm, a softmax policy gradient algorithm extended with planning. The main idea of the extension is that estimating the gradient from longer paths reduces its variance. Two variants are considered, depending on whether the expectation is inside or outside of the exponent. The main theoretical results are bounds on the variance for the two variants. A bound is also provided for the bias introduced in the case of an approximate forward model. Empirical evidence on several Atari benchmarks shows that the proposed algorithms perform favorably compared to standard PPO in terms of cumulative reward and variance.

Questions for Authors

The variance scales favorably with depth, but a small eigenvalue is crucial in this dependence. The assumption on the transition matrix is common for the theoretical analyses, but often does not hold in practical applications. Do you expect an increase of variance with depth in such scenarios? A discussion on the topic could be useful.

The variance is reduced with search depth; however, the bias increases with an approximate forward model. Any solutions for the tradeoff?

Claims and Evidence

The claims are supported both theoretically and empirically.

Methods and Evaluation Criteria

The evaluation is sensible.

Theoretical Claims

I checked the proofs only briefly.

Experimental Design and Analysis

The experimental setting is fine. Additional baselines that use planning in the training phase (e.g. using MCTS to generate traces) would have been useful.

Supplementary Material

I parsed through the proofs and looked at the additional experimental data.

Relation to Existing Literature

The bounds on the variance are valuable.

Missing Important References

The most relevant literature is mentioned, but a deeper comparison with policy gradient algorithms that are combined with MCTS would have been useful.

Other Strengths and Weaknesses

The search/planning part of the algorithm is awkward. It attempts exhaustive search, but does intensive pruning to keep the tree narrow (to achieve linear complexity scaling with depth). It is really unclear why MCTS, or at least Monte Carlo sampling, was not used. Either would also have linear dependence, and both are extensively studied in RL. I assume the variance would have been more difficult to analyze theoretically in the case of MCTS.

Other Comments or Suggestions

One of the advantages of PPO is that it is model-free. Usually, one still needs a simulator for training, but there are fewer constraints on the simulator compared to the setting here, which needs a forward model (e.g., we need to be able to reset the simulator to a certain state). If it is possible to use planning, using MCTS to generate traces improves training (even with an approximate forward model, as in MuZero). Therefore, I would suggest adding baselines that use MCTS in the training phase (perhaps off-policy variants).

Author Response

Thank you for your thoughtful review and for recognizing the value of our theoretical and empirical contributions. We appreciate your feedback and address your questions below.

MCTS vs. our search approach

Thank you for this important point of clarification. SoftTreeMax and MCTS represent fundamentally different approaches with distinct goals. While both involve planning, our work specifically focuses on how to directly integrate forward modeling into the policy gradient framework itself. The key innovation of SoftTreeMax is that it incorporates planning directly into policy parameterization, creating a differentiable policy that naturally reduces gradient variance. Unlike MCTS, which operates as a separate planning algorithm that produces improved rollouts, our approach transforms the underlying architecture of policy gradient methods. This enables us to maintain the theoretical elegance of policy gradient approaches while addressing their fundamental variance issues. Our analytical results show how this integration affects gradient properties in ways that would not be directly applicable to MCTS-based methods. We believe this approach opens new avenues for research at the intersection of planning and policy-based methods.

That said, we acknowledge the value of comparing against and potentially combining with MCTS-based approaches in future work.

Additional MCTS baselines

Following your comment, we added comparisons with a strong baseline to our experiments: EfficientZero [Ye et al., 2021]. We chose it because it is a highly sample-efficient version of MuZero that is also open source. MuZero is one of the best known RL algorithms to date. The results that were completed during the rebuttal are given here: https://ibb.co/zHG3KP8m, showing that SoftTreeMax surpasses EfficientZero in all the games we tested except one. We will add full experiments in the final version.

Variance scaling with eigenvalue assumptions

You've identified an important theoretical-practical gap. While our theory depends on properties of transition matrices that may not always hold in practice, our empirical results consistently show variance reduction across various environments. In cases where the eigenvalue assumptions are violated, we would expect diminished (but still positive) variance reduction benefits rather than variance increases. Figure 3 in our paper demonstrates this empirically - the variance consistently decreases with depth across different games, even though the theoretical assumptions likely vary between environments.

Bias-variance tradeoff with approximate models

The bias-variance tradeoff we observed follows this pattern:

  • The closer $\pi_b$ is to $\pi_{d,t}$ at time $t$, the lower the bias

  • The closer $\pi_b$ is to having similar rows in $P^{\pi_b}$, the lower the variance

For practical solutions to this tradeoff, we recommend the following (a minimal depth-schedule sketch is given after the list):

  1. Adaptive depth selection based on model certainty
  2. Progressive increase in depth as model accuracy improves during training
  3. Ensemble methods to better estimate model uncertainty
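
For recommendations 1-2 above, a minimal illustrative depth schedule (assuming some scalar estimate `model_error` of forward-model inaccuracy; the functional form is arbitrary) could look like:

```python
def planning_depth(model_error, max_depth=8, error_scale=0.05):
    """Illustrative adaptive-depth rule: expand deeper only when the estimated
    forward-model error is small, since bias grows with depth for inexact models."""
    return max(1, int(max_depth / (1.0 + model_error / error_scale)))
```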

Our Theorem 4.8 shows that the gradient bias scales with the approximation error while similar variance reduction properties are retained. Even with approximation errors, the variance continues to decay exponentially, just at a rate dictated by $\hat{P}^{\pi_b}$ instead of $P^{\pi_b}$.

We will expand this discussion in the revised paper to provide clearer guidance on managing this tradeoff in practical implementations.

Reviewer Comment

Regarding MCTS, I think the authors missed my point. SoftTreeMax has a planning component that helps to reduce the variance. My argument was that the planning component uses an awkward pruning technique, and using MCTS as the planning component could have been more powerful. Theoretically it would probably be more difficult to tackle, but I would expect it to be a stronger algorithm.

I am on the fence about the variance reduction in practical application. There are domains where search pathology was shown, those might apply here as well. I am not sure though what kind of domains are more suitable for the proposed approach.

Final Decision

This paper proposed a generalization of softmax, a tree-based implementation of the softened max, to improve planning for RL.

The method reduces the variance of the policy gradient, and is well supported by experiments.

The tree depth is a hyperparameter that can influence performance quite a bit.

One reviewer pointed out that the main issue is the bounding of the variance given an approximate model. The authors need to improve the presentation of the result to make it easier to follow.

Another reviewer suggested that the exhaustive search is too time-consuming and could be improved by sampling-based search.

These can be addressed to improve the quality of the paper.