Gradient Boosting Reinforcement Learning
Summary
Reviews and Discussion
This paper presents Gradient Boosting Reinforcement Learning (GBRL), an attempt to integrate Gradient Boosting Trees (GBT) into reinforcement learning (RL). GBTs are known for their performance in structured data tasks, but RL has long relied on deep neural networks. The authors propose modifications that allow GBTs to function effectively in RL.
The main idea is to adjust standard GBT updates so they can handle the dynamic and non-stationary nature of RL. Instead of training separate networks for policy and value functions, GBRL uses a single decision tree ensemble for both, which reduces memory consumption.
Experiments show that GBRL performs well in structured environments like MiniGrid, is competitive in continuous control tasks, but struggles with high-dimensional unstructured data such as Atari-RAM. The authors also implement GPU acceleration, making large-scale training feasible.
Questions for Authors
Please refer to the weaknesses section for key concerns regarding theoretical justification, baseline comparisons, and failure case analysis.
Claims and Evidence
The claims regarding GBRL’s superior performance in structured RL tasks are well-supported by experiments. However, the claim that GBRL can be a general alternative to deep RL models is not fully justified, as results on high-dimensional tasks like Atari-RAM show significant performance gaps.
update after rebuttal
Thanks for the response. I will keep my score.
Methods and Evaluation Criteria
The methods and evaluation criteria are generally appropriate. The selection of structured and continuous control tasks aligns well with the goal of testing GBTs in RL. However, including more comparisons against structured-data-oriented models could improve the fairness of the evaluation.
Theoretical Claims
The paper lacks formal theoretical guarantees. There is no proof of convergence, and the theoretical discussion on stability in non-stationary RL environments is minimal.
Experimental Design and Analysis
The experimental setup is well-structured and diverse, covering different RL environments. However, failure cases, particularly in high-dimensional settings, are not deeply analyzed.
Supplementary Material
Yes
Relation to Prior Work
N/A.
Missing Essential References
N/A.
Other Strengths and Weaknesses
Strengths
One clear strength is novelty: very few works have tried scaling gradient boosting for RL. The shared tree-based actor-critic model is an interesting way to cut down memory requirements. The paper's experimental coverage is also broad, evaluating performance across structured, continuous control, and high-dimensional tasks.
GBRL also enhances interpretability in RL. Unlike deep networks, decision trees provide a transparent way to understand decision-making, which could be crucial for real-world deployment.
Additionally, practicality is a strong point. GBRL integrates with existing RL frameworks like Stable Baselines3, lowering the barrier for adoption.
Weaknesses
However, there are some issues. Theoretical justification is lacking: there is no proof of convergence, and the paper does not fully explore how GBTs handle non-stationary RL data. While GBRL does well in structured environments, its performance in unstructured tasks is poor, especially on Atari-RAM, and the authors don't offer a deep explanation of these failures.
Another limitation is baseline comparisons. The experiments mostly compare GBRL to standard MLP-based RL models but ignore specialized architectures like TabNet or SAINT that are designed for structured data.
Lastly, computational cost is high. GBRL requires significantly more GPU memory than MLPs (~24GB vs ~3GB in some cases), making it less practical for resource-constrained environments.
Other Comments or Suggestions
The paper would benefit from a more detailed discussion on why GBRL struggles in high-dimensional settings and an analysis of computational trade-offs in real-world deployment.
Thank you for your detailed review and valuable feedback. Below we address each of your concerns.
-
Lack of theoretical proof of convergence & handling non-stationarity: While comprehensive theoretical analysis is beyond the current scope, GBRL builds upon established foundations from both gradient boosting and policy gradient theory. Each tree in our approach minimizes a functional gradient step, making GBRL's convergence analysis potentially related to known policy gradient convergence results. The recent work on softmax policy gradient convergence rates [1] shows O(1/t) convergence for tabular settings with true gradients. We believe similar analysis could potentially apply to GBRL, as our tree-based approximation maintains the key properties that enable convergence -- specifically, GBRL preserves the policy gradient's directional information while providing a piecewise approximation of the advantage landscape. Furthermore, GBT methods have convergence guarantees in various settings. Recent theoretical works have analyzed convergence for GBT on convex losses [2], and online gradient boosting methods have known convergence guarantees under certain conditions [3,4]. The work on non-parametric policy gradients [5] also provides a foundation for our approach. A full theoretical analysis would require examining how the tree-based approximation interacts with policy improvement dynamics and whether it preserves the Łojasiewicz condition identified in [1]. We believe our empirical evidence of convergence across diverse environments provides strong practical support for GBRL while setting the stage for future theoretical analysis.
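To make the functional-gradient view we appeal to here concrete, a brief sketch in standard notation (our shorthand, not the paper's exact formulation): each boosting round fits a tree to the policy-gradient direction and takes a small step along it.

```latex
% Sketch: GBRL policy update as functional gradient ascent (standard notation).
% F_k^\pi denotes the policy ensemble after k trees, \eta the boosting learning rate,
% and \mathcal{T} the class of depth-limited regression trees.
\begin{align*}
  g_k(s, a) &= \hat{A}^{\pi_k}(s, a)\, \nabla_{F^\pi} \log \pi_{F_k^\pi}(a \mid s)
      && \text{(per-sample policy gradient)} \\
  h_k &= \arg\min_{h \in \mathcal{T}} \; \mathbb{E}_{(s,a) \sim \pi_k}
      \big[ \| h(s) - g_k(s, a) \|^2 \big]
      && \text{(tree fit to the gradient)} \\
  F_{k+1}^\pi &= F_k^\pi + \eta\, h_k
      && \text{(boosting step)}
\end{align*}
% A convergence analysis would need to bound the approximation error of h_k relative
% to the true gradient, e.g. under the non-uniform Lojasiewicz condition of [1].
```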
-
Poor performance on high-dimensional unstructured tasks (Atari-RAM): We agree with your observation. GBRL's weaker performance in high-dimensional raw-feature spaces (such as Atari-RAM) arises from a fundamental difference in inductive biases between decision trees and NNs [6, 7]. Decision trees partition the input space according to single feature thresholds, whereas NNs learn complex feature compositions via continuous transformations. This makes NNs more suited for environments where useful information is embedded in combinations of features. This highlights the complementary nature of GBRL to existing NN-based methods rather than positioning it as a replacement. We will expand our discussion in the revised paper to explicitly highlight this limitation and its implications.
-
Lack of comparisons to specialized architectures (TabNet, SAINT): The goal of this work is to combine the popular GBT tool with RL, analyze its performance, and overcome the challenges that this combination introduces. One major advantage we found is GBT's robustness to OOD scenarios compared to NNs, not just its handling of structured data. While specialized architectures like TabNet/SAINT excel at purely tabular data, our environments span a broader range of settings including those with mixed features, temporal dependencies, and various action spaces. Our primary focus was therefore on comparing with the standard MLP-based approaches most commonly used in RL rather than models optimized solely for tabular data. This allows us to assess GBRL's performance in the broader context of RL applications.
-
Computational cost: You raise a valid concern about the scalability of our approach. Please see our detailed response to Reviewer KuLP regarding the unbounded growth of the ensemble as the policy improves.
[1] Mei, Jincheng, et al. "On the global convergence rates of softmax policy gradient methods." International conference on machine learning. PMLR, 2020.
[2] Cortes, Corinna, Mehryar Mohri, and Dmitry Storcheus. "Regularized gradient boosting." Advances in neural information processing systems 32 (2019).
[3] Beygelzimer, Alina, et al. "Online gradient boosting." Advances in neural information processing systems 28 (2015).
[4] Hu, Hanzhang, et al. "Gradient boosting on stochastic data streams." Artificial Intelligence and Statistics. PMLR, 2017.
[5] Kersting, Kristian, and Kurt Driessens. "Non-parametric policy gradients: A unified treatment of propositional and relational domains." Proceedings of the 25th international conference on Machine learning. 2008.
[6] Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?." Advances in neural information processing systems 35 (2022): 507-520.
[7] Rahaman, Nasim, et al. "On the spectral bias of neural networks." International conference on machine learning. PMLR, 2019.
This paper presents a GPU-accelerated method for using gradient-boosted trees in RL. The contribution lies in the fact that existing methods for boosting trees are tailored to offline learning, whereas the authors' method is fully incremental. They then empirically show many benefits of using tree-based function approximation in RL settings, such as robustness to spurious correlations and input perturbations. The algorithm is RL-loss agnostic, can use 'weight sharing' (or more accurately '(sub-)tree sharing'), and compares well against a neural network baseline.
Questions for Authors
Figure 5; which RL algorithm was used for GBRL and NN, PPO? What neural network architecture was used?
Figure 11; how come AWR-GBRL performs so poorly on Pendulum-v1?
Claims and Evidence
The authors describe a variety of benefits of using tree-based function approximation, all of which they address in their experiments. Limitations of their method are also briefly discussed.
So the claims match well with what is shown. However, I do think that e.g., line 71-72 column 1 deserves a more nuanced discussion. The results are simply not decisive enough to conclude superior performance on structured tasks.
Methods and Evaluation Criteria
No complaints.
Theoretical Claims
NA
Experimental Design and Analysis
The experimental setup is exhaustive, explained well, and aligns with the claims that the authors want to make. However, much of the discussion of the results could be presented more scientifically rather than as the current sales pitch, especially in section 5.1.
For example: Line 240 column 2, the authors praise results from the appendix on PutNear, Four Rooms, and Fetch. Figure 14, however, shows that this is not consistent across RL algorithms, and note that only 5 seeds are shown. The GBRL method fails on RedBlueDoors and KeyCorridor with A2C, and is slower to learn with AWR compared to the NN. But it does seem to often outperform the NN when paired with PPO.
Why not summarize much of the current discussion and shed some light on why some RL losses yield strong differences in performance? This would improve impact so practitioners understand how GBRL interacts with such choices. Although I doubt that this question can be answered with the currently low number of seeds...
Supplementary Material
Please tidy up the figures in the supplementary material to fit within the margins, pages 17-20. Also, sort or split/group the methods for better clarity and keep this consistent across figures.
Typo in L667 section title.
From the captions it is unclear what the intervals in the result tables mean; best to repeat this from the main text.
The text doesn't say how many seeds were run for the learning curves in C., neither does it say what the confidence bands show in figures 11-14. Repeat this from or refer to the main text.
Relation to Prior Work
I am not very familiar with prior work that uses gradient boosted trees in RL.
The authors already cite the work by Abel D., et al. (2016) and Brukhim N. et al. (2023).
However, there have also been approaches for decision tree learning within RL settings, see Silva et al. (2020) or Vos D. (2023). These are not mentioned in the paper right now, but could strengthen the 'tabular data' paragraph in the related work for concurrent tree-based approaches within RL.
Silva, A., Gombolay, M., Killian, T., Jimenez, I. & Son, S. (2020). Optimization Methods for Interpretable Differentiable Decision Trees Applied to Reinforcement Learning. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. PMLR. 108:1855-1865
Vos, D., & Verwer, S. (2023). Optimal decision tree policies for Markov decision processes. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (pp. 5457–5465). IJCAI. https://doi.org/10.24963/ijcai.2023/606
Missing Essential References
NA
Other Strengths and Weaknesses
Strengths
Language, wording, structure, style, and coherence are excellent. This is a well written paper that should be understandable for a broad audience.
Limitations of the method and implementation are clearly formulated.
In general, there is not too much to critique about this paper. There are indeed still limitations, however the aim, motivation, and approach are all very clear. The authors also provide a codebase that is implemented in CUDA and compatible with commonplace baseline repositories. This paper should be accepted as I think it establishes a valuable starting point for future work.
Weaknesses
The large GPU memory requirements and inefficient compression of old trees does not make this algorithm super competitive. Although, it does have redeeming features that future work should directly explore.
The last sentence of the conclusion is too salesman-like and could be more nuanced. GBRL is a nice alternative to existing solutions. However, I don't see it as a 'step toward' real-world tasks. It rather feels like a sidestep.
The critique from 350-353 on NNs is a bit arbitrary; you can always come up with examples that break particular models. If I use LayerNorm within my NN, then the output will always be in a reasonable range even if OOD. This can be formulated better.
Other Comments or Suggestions
Line 85 column 2, use \citet instead of \citep for 'Kersting & Driessens (2008) ...'.
Section 3, why are the variables bold-face? To indicate a vector? This confused me for a moment since the reward is also bold-face, which made me think we were dealing with a multiobjective problem. Also, in figure 1 the state is not bold while in the main text it is. So, I suggest a slight revision for consistency and clarity.
Remove 'superior' from line 312 column 2.
The font size of plot ticks and labels in figure 5 is too small. The same goes for the ticks in figures 3, 8 and 9.
The title/y-labels in figure 6 are a bit confusing; I suggest right-aligning the current y-labels and changing the ysuplabel to 'episodic reward'.
Line 272-273 column 1 is difficult to read.
Line 380-384 column 2 broken sentence.
Summary in line 425-428 column 1 belongs in the conclusion (obsolete).
At times there are singular words being split up from their text due to the format, this obstructs reading. Please revise the following,
- Line 290 column 2, 'dilution'
- Line 365 column 2, ''
- Line 416 column 1, 'missing.'
Many references are formatted inconsistently, wrong, or do not have proper journals or venues listed. Examples:
- Line 482 column 1, URL uses a DOI link.
- Line 458 column 2, Duan T. (2020) is a ICML 2020 paper.
- Line 471 column 2, Gorishniy Y. (2023) is a NeurIPS 2021 paper.
Thank you for your detailed review and constructive feedback. We address your main points below and will incorporate the suggested revisions in the final version.
Regarding line 71-72 and performance on structured tasks: We agree with this point. We will revise the contribution statement (lines 71-72) to: "In popular environments, our framework demonstrates competitive performance against NNs while showing particular strengths on structured tasks."
Regarding the critique on NNs (lines 350-353): We agree that any model class—including NNs and GBTs—can be made to fail or succeed depending on the setup. Our intent is to highlight a key difference in inductive biases. As discussed in Jeffares et al. (2024) [1], the kernel representations induced by GBTs are bounded and often behave more predictably on irregular or out-of-distribution inputs, while NN-tangent kernels can be unbounded and vary significantly on test points far from the training distribution. While techniques like LayerNorm can help stabilize neural network activations, they don't fundamentally change the difference in inductive biases and they often don't fully solve the extrapolation issues in OOD scenarios. We will revise the relevant paragraph in the paper to reflect this more clearly and avoid overgeneralization.
Why do certain algorithms perform differently than others? Apart from inherent algorithmic differences, the main difference for GBRL is the number of fitted trees per update.
- A2C - performs a single gradient step per rollout using the entire batch. In GBRL, this results in a single constructed tree per rollout. Since GBTs update the value and policy functions directly, such large, coarse updates can destabilize learning.
- PPO - in contrast, uses minibatches and multiple epochs per rollout. This allows GBRL to fit several small trees per rollout, yielding more stable and incremental updates.
- AWR - is an off-policy algorithm. Therefore, GBRL could potentially build many trees without adding sample complexity. However, we noticed that our best performing models incurred a huge computational cost to finish training. As a result, we reduced the number of gradient steps per update at the expense of performance. Future work could investigate methods for optimizing GBRL with a fixed tree budget (such as pruning and distillation), which may make more demanding RL algorithms, such as AWR, practical with GBRL.
In general, as each tree approximates a gradient step, GBRL benefits from frequent updates to refine its ensemble gradually. This makes it particularly well-suited to algorithms that support minibatching and multiple epochs per rollout. We will include a targeted PPO-based ablation study in the appendix to demonstrate the impact of minibatching on performance. Results can be found in the following link: https://pasteboard.co/ENlUg1ffghZU.png.
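As a rough illustration of this granularity difference (a hypothetical helper, not part of the GBRL codebase), the number of trees added per rollout is simply the number of gradient steps taken on that rollout:

```python
import math

def trees_per_rollout(rollout_len: int, batch_size: int, n_epochs: int) -> int:
    """One tree is fit per gradient step, so trees per rollout equals
    minibatches per epoch times the number of epochs."""
    return n_epochs * math.ceil(rollout_len / batch_size)

# A2C: a single full-batch step per rollout -> one coarse tree.
print(trees_per_rollout(rollout_len=2048, batch_size=2048, n_epochs=1))  # 1
# PPO: minibatches and multiple epochs -> many small, incremental trees.
print(trees_per_rollout(rollout_len=2048, batch_size=64, n_epochs=10))   # 320
```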
Other improvements:
We greatly appreciate your detailed comments on presentation and formatting issues. We will implement all of your suggested improvements to enhance the clarity, readability, and technical accuracy of our paper.
Regarding specific questions:
-
Figure 5; which RL algorithm was used for GBRL and NN, PPO? What neural network architecture was used? We compared GBRL to NNs using PPO and a standard MLP with two hidden layers, each containing 64 units and tanh activations. This architecture is the default in Stable Baselines3, and a common choice in RL literature, for these environments. As such, we used the same NN architecture for the Equation environment. We used PPO for all experiments in section 5.2.
-
Figure 11; how come AWR-GBRL performs so poorly on pendulum-v1? AWR-GBRL performs poorly on Pendulum-v1, despite working well on many other environments. Since PPO-GBRL and AWR-NN succeed in this task, we believe the issue lies in the optimization process specific to AWR. Although we did try to tune hyperparameters for this setup, it's likely that the configuration remains suboptimal.
[1] Jeffares, A., Curth, A., and van der Schaar, M. "Deep Learning Through a Telescoping Lens: A Simple Model Provides Empirical Insights on Grokking, Gradient Boosting & Beyond." Advances in Neural Information Processing Systems, 37, 2024.
I thank the authors for their detailed response and commend them for adding additional results. I have also read the comments by the other reviewers, and the authors' rebuttals.
My score and impression of the paper remains the same 4-accept. This is a good paper, which I believe can form a strong basis for future work.
I have left a few last comments on your rebuttal and paper, and one question.
Question on extra PPO results batch-sizes
This result is interesting, but a bit counterintuitive. Typically performance tends to increase with larger batch sizes (lower variance), but in your method larger batches decrease performance. I read your response to Reviewer KuLP, and see that intermediate batch sizes perform best due to a trade-off in learning speed and variance.
Do you have additional thoughts on how to deal with this? Can you recommend reasonable batch-size values for environments other than CartPole based on this?
Regarding the critique on NNs (lines 350-353)
I agree with your reasoning on how NNs and GBTs inherently induce different inductive biases. I see that you cite Jeffares' work in the main paper, but the point you sketch here is not obvious in the main paper, it's practically buried. I think the paper will benefit from including this discussion. I believe this also runs tangent to the critiques listed by all other reviewers on theoretical grounding of GBTs for RL.
I'd suggest putting this in the related work, in the tabular data paragraph. At the moment, this paragraph only states that GBT can perform better than NNs, but not why. My example of LayerNorm was only an attempt at a minimal counterexample to invalidate your point in section 5.2.3. So, explain to the reader why GBTs can guarantee more predictable OOD behaviour, which can be supported by the architecture bias guaranteeing kernel boundedness.
Why do certain algorithms perform differently than others?
Thanks for the clear explanation. I could not find this in the paper (maybe I read over this), please make sure to add this difference in RL algorithms and their nuances on your tree construction somewhere with a separate section heading (appendix is fine).
Regarding specific questions
Please add your answer to 1) to the supplementary, even if it uses the stable-baselines3 defaults. Or otherwise cite a source that exactly details this.
For 2) figure 11 bottom-center, I was wondering if you had specific thoughts on this, since it was the only outlier in this figure. But I understand that this can be due to a tuning problem.
Last points of improvements:
- Run more than 5 seeds and report proper confidence intervals
- Appendix A title has a spelling error 'Implementaion'
We address your main points below:
Question on extra PPO results batch-sizes: We hypothesize that the drop in performance with large batch sizes stems from two factors:
- Tree-level variance trade-off: Larger batch sizes mean more diverse data per tree, which can challenge split quality at fixed depths (we used depth = 4 for all experiments).
- Fewer trees per rollout: Since each batch creates a single tree, larger batches result in fewer added trees per rollout.
To further investigate this, we ran an additional ablation study, where we fixed the batch size and increased the number of epochs per rollout. This increases the number of trees (updates) without changing the batch size. Results can be found in the following link: https://pasteboard.co/VBqZXlhOScsg.png (runtime axis has been zoomed to focus on initial convergence).
We observe that when keeping the total number of updates constant per rollout:
- More PPO epochs lead to faster convergence. The policy reaches a higher score with the same number of environment interactions (higher sample efficiency).
- Increasing the number of epochs, at larger batch sizes, mitigates the low sample efficiency seen before. For example, the following three experiments construct the same number of trees per rollout (32) and exhibit roughly the same convergence rates in terms of sample efficiency: (blue), (red), (purple).
However, larger batches increase runtime due to greater per-tree sample complexity.
Regarding the critique on NNs (lines 350-353): We agree and will implement this suggestion.
Why do certain algorithms perform differently than others? We’ll add a new section in the appendix summarizing the differences between RL algorithms and how these affect tree construction.
Regarding specific questions We will include our response to question (1) in the supplementary, along with either a citation or a note about the stable-baselines3 defaults.
The paper proposes GBRL, a framework that adapts gradient boosting trees (GBTs) to reinforcement learning (RL) tasks. Recognizing that neural networks (NNs) often struggle with structured and categorical features, the authors leverage the natural strengths of GBTs, namely their ability to handle heterogeneous data and capture abrupt transitions in decision boundaries. Furthermore, to adapt GBTs to the dynamic nature of reinforcement learning, the GBRL framework continuously interleaves tree construction with environment interaction. The paper presents extensive experiments on a range of tasks, demonstrating that GBRL can outperform or match NN-based approaches in settings where the input features are naturally structured, while also showing increased robustness to out-of-distribution states, signal dilution, and spurious correlations.
Questions for Authors
.
Claims and Evidence
yes.
Methods and Evaluation Criteria
yes.
Theoretical Claims
yes (no proofs).
Experimental Design and Analysis
yes, the experiments are sound.
Supplementary Material
yes, extensive experimental results.
Relation to Prior Work
The paper encourages the exploration and use of gradient boosting tree algorithms in the domain of deep reinforcement learning.
Missing Essential References
no.
Other Strengths and Weaknesses
Strengths:
- The proposed GBRL framework was shown to outperform traditional PPO methods implemented in Stable Baselines3 in many tasks.
- Comprehensive experiments are conducted to support the claims of the authors, providing many valuable insights.
- The paper is well-written and sound.
"Weakness":
- A significant limitation is the unbounded growth of the ensemble as the policy improves. While the paper discusses potential solutions (e.g., tree pruning or ensemble compression) for future work, this issue might affect long-term or real-time deployment and is not resolved in this paper.
- Although the experiments are comprehensive, additional ablations on hyperparameter sensitivity and design choices (such as the impact of learning rates for policy vs. value functions) would strengthen the empirical claims.
- Though comprehensive experiments are conducted, the paper lacks in-depth theoretical analysis to concretely support the strengths of GBTs.
Other Comments or Suggestions
.
Ethics Review Concerns
No.
Thank you for your positive assessment of our paper, recognizing the comprehensive experiments and the value of our approach for structured data in RL. We address each of your concerns below and will revise the paper accordingly.
Regarding the "unbounded growth of the ensemble as the policy improves":
You raise a valid concern about the scalability of our approach. In our current implementation, ensemble size does increase with training steps.
There are several possible approaches to address this limitation:
- Tree Pruning: One can reduce memory requirements through pruning redundant or low-importance splits after training.
- Ensemble Compression: One can distill knowledge from large ensembles into smaller, more efficient tree structures.
- Adaptive Growing Strategy: Rather than adding trees at fixed intervals, one can add them selectively based on performance plateaus or gradient magnitudes.
While we briefly mentioned these approaches in the paper, we will expand the limitations section to further explain these approaches as future research directions.
Regarding "additional ablations on hyperparameter sensitivity":
This is an excellent suggestion. We performed these experiments, provide them below, and will add them to the paper. Specifically, we propose to add an ablation study of GBRL-based PPO in the appendix examining:
-
Learning rate ablation for policy and value function components: Here, we keep one fixed while changing the other. Similar to training with neural network estimators, we observe that large learning rates harm stability, whereas low learning rates result in very slow convergence. The results can be seen in the following links: https://pasteboard.co/PxZtndOMFuzZ.png, https://pasteboard.co/2W5UT93RBNrn.png.
-
Tree depth limitations. This ablation shows that tree depth has a large impact on convergence; however, deeper trees lead to increased runtime and compute cost. A larger tree depth corresponds to a better gradient approximation, which may explain the faster convergence shown in these results. In the paper, we chose a maximal tree depth of 4, a value that balances performance (the ability to solve environments) with maintaining a reasonable wall-clock time to converge. The results can be found in the following link: https://imgur.com/iu0IUsT.
-
Batch size. Using a rollout length of 2048 samples, we analyze the impact of PPO batch size on performance. We observe that batch size significantly impacts convergence. Specifically, smaller batches result in GBRL building more trees per rollout, improving adaptability. However, smaller batches also lead to noisier gradient estimates due to limited samples per constructed tree (blue, 16 samples per batch, starts much faster but fails to converge to a stable solution). Conversely, larger batches stabilize training by reducing variance through averaging within leaves (more samples utilized per constructed tree), but build fewer trees per rollout (gray, 2048 samples per batch, is very stable but also very slow to converge). Hence, both excessively small and large batch sizes negatively impact performance. Results can be found in the following link: https://pasteboard.co/ENlUg1ffghZU.png.
Regarding the lack of "in-depth theoretical analysis":
Please see our detailed response to Reviewer 6bS7 regarding the lack of theoretical proof of convergence and handling non-stationarity.
The authors introduce a reinforcement learning method based on gradient-boosted trees (Gradient Boosting Reinforcement Learning, GBRL).
The method consists of gradually training an additive ensemble of decision trees to output both a policy and Q-value estimates at the leaves. The loss function has two terms: a cumulative reward term for supervising the actor (i.e. policy outputs of the trees) and an L2 term for supervising the critic (which the authors don’t specify).
The gradient of the cumulative reward with respect to the policy outputs can be computed using the policy gradient theorem, with the Q-value predictions being used to estimate the advantage function. The critic uses an L2 loss, whose gradient is the difference between the critic’s prediction and the target. Hence, the functional gradient can be computed from the tree’s outputs, enabling the application of the usual gradient-boosted trees approach (Section 4.1).
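Put differently (my shorthand in standard actor-critic notation; the paper's exact symbols may differ), the per-sample functional gradients that each new tree is fit to are roughly:

```latex
% Per-sample gradients with respect to the tree outputs (reviewer's sketch).
\begin{align*}
  g^{\pi}_t &= -\,\hat{A}_t\, \nabla_{F^{\pi}} \log \pi\!\left(a_t \mid s_t\right)
      && \text{(actor: advantage-weighted policy gradient)} \\
  g^{V}_t  &= V(s_t) - \hat{R}_t
      && \text{(critic: gradient of the L2 loss, prediction minus target)}
\end{align*}
```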
The authors then experimentally validate GBRL, comparing it to neural network methods on:
-
Standard RL environments: GBRL exhibits equivalent or superior performance on grid environments, a football environment, and continuous control environments. It is inferior in the Atari RAM domain, which exposes raw system memory states in Atari games. The authors attribute this underperformance to the fact that decision trees rely on single-feature splits, while NNs can compose raw features into more informative representations.
-
A custom environment for assessing robustness to signal dilution: the environment consists of simple algebraic manipulations on equations whose complexity can be tuned. The rewards in this environment are sparse. GBRL converges significantly faster than NN based methods, and achieves higher reward.
-
A custom environment for assessing spurious correlations: the environment consists of a grid world where the agent is meant to pick up a red ball. The authors experiment with placing a red box at various locations near the target ball to distract the agent. They find that GBRL exhibits greater out-of-distribution performance when training on one box location and testing on other box locations.
-
Variations of the aforementioned environments to assess robustness to state perturbations: the authors consider (a) introducing an additional confounder object in the grid world mentioned above, to assess robustness to irrelevant information, (b) introducing Gaussian noise in classical control tasks to assess robustness to corruption/noise, and (c) varying the number of players in the football environment mentioned above, to assess robustness to missing features. In all cases, they find GBRL generalizes better than the NN baseline.
The authors provide a CUDA-based implementation of their method.
Questions for Authors
- What NNs are used as a baseline in each experiment?
- Would it be possible to give more detail (e.g. via pseudocode) on (a) how exactly the advantage is computed in the policy gradient and (b) what loss is used to train the critic?
Claims and Evidence
I found the major claims of the paper (especially w.r.t. GBRL’s robustness) to be well-justified by the experiments.
Methods and Evaluation Criteria
Yes, the proposed method makes sense for online RL.
Theoretical Claims
N/A (there are no theorems, propositions, lemmas or proofs in the paper).
Experimental Design and Analysis
Experiments are extensive, covering both standard RL environments and tasks designed specifically to assess various types of robustness. This paints a much clearer picture of GBRL’s strengths than if the authors had only evaluated it in standard benchmarks.
Supplementary Material
I skimmed appendix A to better understand experimental details.
Relation to Prior Work
Prior work of Brukhim et al. (2022) (A Boosting Approach to Reinforcement Learning) had considered the application of boosting to RL, but from a theoretical perspective. In addition, Kersting & Driessens (2008) introduced Non-Parametric Policy Gradients (NNPG), which serve as a starting point for GBRL. GBRL also builds on foundational RL methodologies by using an Actor-Critic training setup.
Missing Essential References
I am not aware of any glaring omissions of references to prior work.
Other Strengths and Weaknesses
Strengths:
-
Clear presentation of key motivations and results in the experiments section.
-
Extensive experiments, covering both standard RL environments and tasks designed specifically to assess various types of robustness. This paints a much clearer picture of GBRL’s strengths than if the authors had only evaluated it in standard benchmarks.
-
Provision of a CUDA-based library for GBRL (at least as far as the authors claimed), which should enable practitioners and researchers to more easily test GBRL in their domains of interest.
-
More generally, the experiments and running time considerations (appendix A.3) indicate GBRL is competitive with baseline RL methods in terms of performance and training cost, while showcasing certain robustness methods. This expands the toolbox of practitioners when doing RL in e.g. noisy or nonstationary environments.
Weaknesses:
-
It was not immediately clear from reading the paper what architectures are used for the baseline neural networks in the experiments section. It would help contextualize the results if the authors could be clearer about this in the main paper.
-
Lack of clarity in the methods section regarding (a) how the advantage in Eq. 3 is computed and (b) how exactly are the Q-value prediction heads supervised.
-
Assuming e.g. a temporal difference loss for the critic, I would imagine the final loss used for obtaining the tree would be different from the one presented in Eq. (4). More generally, it would be good to contextualize how multi-objective training works when dealing with trees, and any additional constraints relative to working with neural networks.
Other Comments or Suggestions
While there is already an inherent amount of non-stationarity in the training data when doing online RL (as changes in the policy produce changes in the training data), it would have been good to see an experiment directly comparing GBRL’s behavior under non-stationarity relative to baseline NNs.
Thank you for your thoughtful review. Below we address your points, which we will also clarify in the revised paper.
Stable Baselines3 and RL Zoo3: Our GBRL method was implemented within the Stable Baselines3 framework, a widely-used reinforcement learning library providing standard implementations for RL algorithms [1]. We compared GBRL against baseline neural networks (NNs) provided by RL Zoo3 [2], which include tuned hyperparameters for each environment, ensuring a fair and standardized evaluation.
NN architecture: For all experiments, we used a standard MLP with two hidden layers, each containing 64 units and tanh activations. This architecture is the default in Stable Baselines3 for the environments we tested and is widely used as a baseline in RL literature. We maintained this consistent architecture across all environments to ensure fair comparison.
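For reference, a minimal sketch of how this configuration is expressed in Stable Baselines3 (the policy_kwargs below simply restate the library defaults; the environment name is illustrative):

```python
import torch as th
from stable_baselines3 import PPO

# Default SB3 MlpPolicy for these tasks: two hidden layers of 64 units with tanh.
policy_kwargs = dict(net_arch=[64, 64], activation_fn=th.nn.Tanh)

model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs, verbose=0)
model.learn(total_timesteps=10_000)
```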
- Advantage computation: We use Generalized Advantage Estimation (GAE) as proposed by Schulman et al. (2018) [3] to compute the advantage: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The GAE parameters $\gamma$ and $\lambda$ for each environment are provided in Table 2 of the appendix.
- Critic: We use on-policy actor-critic algorithms, where the critic predicts the value function (not the action-value Q). We use a standard L2 loss between the critic's prediction and the target return estimate: $L_{\text{critic}} = \tfrac{1}{2}\big(V(s_t) - \hat{R}_t\big)^2$, where $\hat{R}_t$ is either a Monte Carlo return estimate or a bootstrapped n-step return, depending on the specific algorithm (PPO, A2C, or AWR).
Both the GAE and target return calculations were done using built-in functions in Stable Baselines3.
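For completeness, a minimal standalone sketch of the standard GAE recursion (equivalent in spirit to the Stable Baselines3 built-in, ignoring episode-boundary handling):

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard GAE: A_t = sum_l (gamma * lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # critic regression targets
    return advantages, returns

# Example: a short 4-step rollout.
adv, ret = compute_gae(rewards=[1.0, 1.0, 0.0, 1.0],
                       values=[0.5, 0.4, 0.6, 0.3],
                       last_value=0.2)
```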
Multi-objective training with GBT: In GBRL, we compute gradients for both the actor and critic objectives per timestep. We then concatenate these gradients, $g_t = \big[\, g_t^{\pi},\; g_t^{V} \,\big]$, and use them to build a decision tree. At each candidate node split, we evaluate a score function to decide the best split. However, since the actor and critic gradients can differ in magnitude, this score can become biased toward one objective. To address this, we explored two strategies:
- Gradient normalization per output dimension with L2-based split scoring.
- Cosine similarity, which emphasizes gradient direction over magnitude.
Both approaches are supported in our codebase and produce good results, while the cosine similarity seems to perform slightly better. The results are visualized for the Cartpole environment in the following link: https://pasteboard.co/qXY0zekTq7uM.png. We will add this result as part of an ablation study to the appendix of the final version.
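To illustrate the two strategies (a simplified NumPy sketch with hypothetical helper names; the actual CUDA implementation differs):

```python
import numpy as np

def split_score_l2(grads: np.ndarray) -> float:
    """L2-based score on gradients normalized per output dimension, so that
    the actor and critic components contribute on a comparable scale."""
    g = grads / (np.linalg.norm(grads, axis=0, keepdims=True) + 1e-8)
    mean_g = g.mean(axis=0)
    return float(np.dot(mean_g, mean_g))

def split_score_cosine(grads: np.ndarray) -> float:
    """Cosine-similarity score: measures agreement in gradient direction,
    de-emphasizing magnitude differences between the two objectives."""
    mean_dir = grads.mean(axis=0)
    mean_dir = mean_dir / (np.linalg.norm(mean_dir) + 1e-8)
    unit = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
    return float((unit @ mean_dir).mean())

# grads: per-sample concatenation [actor gradient | critic gradient], shape (n, d_pi + 1).
grads = np.random.randn(128, 5)
print(split_score_l2(grads), split_score_cosine(grads))
```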
[1] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. "Stable-Baselines3: Reliable Reinforcement Learning Implementations." Journal of Machine Learning Research 22 (2021): 1–8.
[2] Raffin, A. "RL Baselines3 Zoo." GitHub repository, 2020. https://github.com/DLR-RM/rl-baselines3-zoo
[3] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv preprint arXiv:1506.02438 (2015).
The paper is well motivated by powering RL with an alternative to NNs, namely boosting trees. The authors provided good feedback to the reviews. Although the reviewers uniformly agree that this is a solid contribution, the empirical results of the paper are a little weak. First, the experimental domains are too simple. Second, the performance on other benchmarks like Atari still needs to be improved. Please also try to address the points made by the reviewers.