PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5) · Confidence: 3.0
Novelty: 3.0 · Quality: 2.8 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Statistical Inference for Gradient Boosting Regression

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We provide confidence intervals, prediction intervals, and hypothesis tests for variable importance for boosting with dropout and parallel training.

Abstract

Gradient boosting is widely popular due to its flexibility and predictive accuracy. However, statistical inference and uncertainty quantification for gradient boosting remain challenging and under-explored. We propose a unified framework for statistical inference in gradient boosting regression. Our framework integrates dropout or parallel training with a recently proposed regularization procedure called Boulevard that allows for a central limit theorem (CLT) for boosting. With these enhancements, we surprisingly find that increasing the dropout rate and the number of trees grown in parallel at each iteration substantially enhances signal recovery and overall performance. Our resulting algorithms enjoy similar CLTs, which we use to construct built-in confidence intervals, prediction intervals, and rigorous hypothesis tests for assessing variable importance in only $O(nd^2)$ time with the Nyström method. Numerical experiments verify the asymptotic normality and demonstrate that our algorithms perform well, do not require early stopping, interpolate between regularized boosting and random forests, and confirm the validity of their built-in statistical inference procedures.
Keywords
Boosting, regression, nonparametric, trees, inference, statistics

Reviews and Discussion

Review
Rating: 5

The paper builds upon Boulevard regularization and proposes two modified algorithms (dropout, parallel). The authors show the asymptotic normality of the proposed estimators and derive statistical inference procedures.

Strengths and Weaknesses

Strengths

  • the paper is well motivated and tackles a relevant problem
  • the paper proposes a unified framework: algorithm, proof of asymptotic normality and statistical inference
  • the results and the corresponding assumptions are clearly discussed

Weaknesses

  • In my opinion, once the asymptotic normality is achieved, the contributions regarding statistical inference are straightforward. It is not clear to me why the proposed two algorithms should be preferred over vanilla Boulevard regularization, which is also asymptotically normal and allows for statistical inference. A clear description of the advantages and disadvantages (e.g. MSE, asymptotic efficiency, computational resources) of vanilla Boulevard regularization, BRAT-D, and BRAT-P, including a discussion of when to use which, would be beneficial. Furthermore, empirically comparing the confidence intervals etc. with those based on vanilla Boulevard regularization would strengthen the contribution, particularly Corollary 4.

Questions

  • The asymptotic variance is estimated. Can you please comment on the convergence to the true asymptotic variance?
  • The proposed confidence intervals are quite large limiting the practical use. Could you please comment on potential optimality?
  • It is hard to compare the MSE performance of the different methods in Figure 2. Please make the figures larger (in particular Figs. 2 and 4).

Limitations

Yes

Final Justification

The authors addressed my concerns. I have raised my score accordingly.

Formatting Issues

No

Author Response

We thank the reviewer for recognizing the strengths of the paper, and for offering very helpful suggestions. We respond below to your concerns. In particular, we offer a description on the advantages of each algorithm, and provide additional results comparing the confidence intervals with vanilla Boulevard. If our response sufficiently addresses your concerns, we would greatly appreciate it if you’d consider increasing your score.

Weaknesses

  1. Methodological contribution past the CLT. We agree with the reviewer that constructing the baseline procedures for statistical inference is straightforward once the CLT is established. The intervals do indeed fall out of the CLT. The hypothesis test requires a little more insight, but also does so.
  • Problem with the straightforward approach. Unfortunately, the straightforward procedures are not computationally tractable on the large datasets commonly encountered when using gradient boosting! Although the CLT reduces the problem of inference to kernel ridge regression (with the right kernel), computing the intervals is an $O(n^3)$ problem. This is the chief difficulty that Zhou and Hooker (2022) face when constructing replication intervals for vanilla Boulevard.
    • They mention in Section 6.2 that: “Although the matrix inverse involved in the expression is too time consuming to compute for large $n$, it is possible to estimate it using Monte Carlo when $n$ and $d$ are small.”
  • We do not have this problem. An important methodological contribution that we should have highlighted more is in making the intervals and tests practical and tractable via matrix sketching. We employ the Nyström method and subsample $s$ landmarks to form a low-rank approximation of the kernel. This yields a solution that requires only $O(ns^2)$ time to precompute the kernel, and only $O(s^2)$ time for inference -- yielding practical statistical inference in near-linear time in the number of datapoints $n$ (see the sketch after this list).
    • With the Nyström method of Musco and Musco (2017), we can choose $s$ to be near-linear in the effective dimension $d_{\text{eff}}^\mu$ of a ridge regression problem with regularization parameter $\mu$ on the kernel matrix: $s = \widetilde{O}(d_{\text{eff}}^\mu)$.
  2. Comparison to vanilla Boulevard. Certainly! We provide the requested comparison below, and will include it in the final version.
  • Should I use Boulevard or Algorithm 1? Boulevard regularization is a special case of Algorithm 1 when no dropout is used ($p=0$). Should we use dropout? There are good reasons to believe so.
    • Improvements in prediction error. In theory, as we discuss in Line 59 and Corollary 4, Algorithm 1 achieves up to a fourfold improvement in terms of the asymptotic relative efficiency compared to vanilla Boulevard, when all other hyperparameters but the dropout rate are held equal.
      • In practice, this is similar to asking if one should use dropout with a neural network. There, it is almost always used, but when it is not, there are alternatives, such as weight decay, that also allow for regularization. With this analogy in mind, choosing $p$ can be a go-to knob for tuning the amount of regularization – though alternatives like tree depth and learning rate exist.
    • Improvements in statistical inference. From a coverage/statistical inference perspective, as we discuss in Line 339, there are drawbacks to using vanilla Boulevard instead of Algorithm 1. Setting $p \to 0$ empirically leads to wider intervals and less power in the hypothesis test for variable importance.
      • Why does this happen? We scale Algorithm 1 by $\frac{1+\lambda (1-p)}{\lambda}$ to correct the bias. As a result, the irreducible error will be scaled up as well -- but less so when $p$ is large (for instance, with $\lambda = 0.8$ this factor is $2.25$ at $p=0$ but only $1.35$ at $p=0.9$). A higher dropout rate also introduces decorrelation between trees, suggesting a more stable variance estimate.
  • Should I use Boulevard or Algorithm 2? Asking why one should use Algorithm 2 over Boulevard amounts to asking whether one wants to fit trees in parallel or not, and there are computational reasons (e.g. parallelism) to do so.
    • Improvements in prediction error. In theory, Algorithm 2 can achieve an unbounded improvement in relative efficiency over vanilla Boulevard as we take $\lambda \to 0$, as we discuss in Lines 295-296.
      • But this improvement is in terms of the number of boosting rounds, not ensemble size! This clarifies the intuition – boosting random forests (a strong learner) should lead to better prediction error than boosting a tree (a weak learner), with the same number of boosting rounds.
      • Our results in Figure 2 are somewhat unfair to Algorithm 2, as the x-axis is in the ensemble size and not the number of boosting rounds. The number of boosting rounds that Algorithm 2 encounters can be as few as 31 on Abalone and 50 on Wine Quality, in contrast to the 500 boosting rounds all other algorithms enjoy. Given the same number of boosting rounds, we would certainly expect an improvement.
    • Parallel computation. With modern multi-core processors and distributed computing, there is immense potential for exploiting the possibility of increased parallelism in boosting.
  3. Comparison of confidence intervals. We provide the requested comparison below.
    • For a quantitative comparison, we provide a new figure that plots the width and coverage of our intervals against the dropout probability. We have lodged this with the AC. Unlike in Figure 3, we do not perform the conformal adjustment outlined in Lines 320 to 327, and we plot the coverage and width across all datapoints and repetitions for simplicity. Although not performing this leads to overcoverage in the PIs, we have to do so to compare the widths – otherwise any over- or under-coverage would simply be adjusted to the nominal 0.95 level.
      • The mean CI and PI widths decrease as $p$ increases. The PI coverage remains close to 1 throughout (unlike in Figure 3, overcoverage occurs as we do not perform the conformal adjustment here), while the CI coverage decreases as $p$ increases -- as we see in the table below.
    • For a qualitative comparison, we also provide a version of Figure 1 that we have lodged with the AC. This uses the same hyperparameters, except for setting $p=0$. Empirically, it seems that the width of the intervals does indeed decrease in $p$, though this can come at a cost with respect to CI coverage (though achieving good CI coverage is difficult in nonparametric regression in general).
| Dropout rate | CI width | CI coverage | PI width | PI coverage |
|---|---|---|---|---|
| 0.0 (vanilla Boulevard) | 57.98 | 0.66 | 80.74 | 0.9997 |
| 0.1 | 45.62 | 0.61 | 70.02 | 0.9998 |
| 0.2 | 23.22 | 0.46 | 54.08 | 0.9995 |
| 0.3 | 16.66 | 0.42 | 49.43 | 0.9997 |
| 0.4 | 23.36 | 0.47 | 54.35 | 0.9998 |
| 0.5 | 15.73 | 0.44 | 48.76 | 0.9950 |
| 0.6 | 13.77 | 0.43 | 47.64 | 0.9950 |
| 0.7 | 10.21 | 0.42 | 44.90 | 0.9957 |
| 0.8 | 5.57 | 0.41 | 41.83 | 0.9950 |
| 0.9 | 6.88 | 0.47 | 42.33 | 0.9950 |
  • We also examine test size and power as a function of dropout. Performance is strong for $p \in [0.2, 0.5]$, peaking at $p=0.5$. However, for $p \leq 0.1$, Type II error rises significantly (vanilla Boulevard has Type II error $0.43$), and for $p \geq 0.6$, Type I error becomes uncontrollable. We provide a detailed figure lodged with the AC.
| Dropout Rate | Average Type I Error | Average Type II Error |
|---|---|---|
| 0.0 | 0.0 | 0.43 |
| 0.1 | 0.0 | 0.33 |
| 0.2 | 0.033 | 0.1 |
| 0.3 | 0.067 | 0.1 |
| 0.4 | 0.033 | 0.033 |
| 0.5 | 0.033 | 0.0 |
| 0.6 | 0.7 | 0.0 |
| 0.7 | 0.67 | 0.0 |
| 0.8 | 0.77 | 0.0 |
| 0.9 | 0.63 | 0.0 |
| 1.0 | 0.7 | 0.0 |
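
To make the tractability point above concrete (see "We do not have this problem"), here is a minimal numpy sketch of the Nyström-plus-Woodbury idea. It is our own illustration, not the implementation from the paper: the RBF kernel, the uniform landmark sample, and the ridge parameter `mu` are illustrative assumptions, whereas the paper selects landmarks via the scheme of Musco and Musco.

```python
# Minimal sketch (not the authors' code): Nystrom low-rank approximation of a
# kernel matrix plus the Woodbury identity, so that (K + mu*I)^{-1} v is
# approximated in roughly O(n s^2) time instead of the O(n^3) exact solve.
# The RBF kernel, uniform landmark sampling, and `mu` are illustrative choices.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def nystrom_solve(X, v, s=100, mu=0.1, seed=0):
    """Approximately solve (K + mu*I) x = v using s Nystrom landmarks."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(s, n), replace=False)  # landmark points
    C = rbf_kernel(X, X[idx])                           # n x s
    W = rbf_kernel(X[idx], X[idx])                      # s x s
    # K is approximated by C W^+ C^T.  By the Woodbury identity,
    #   (C W^+ C^T + mu I)^{-1} v = (v - C (mu W + C^T C)^{-1} C^T v) / mu,
    # which only requires an s x s linear solve.
    inner = mu * W + C.T @ C + 1e-10 * np.eye(len(idx))
    return (v - C @ np.linalg.solve(inner, C.T @ v)) / mu

# Small check against the exact O(n^3) solve on a toy problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
v = rng.normal(size=500)
exact = np.linalg.solve(rbf_kernel(X, X) + 0.1 * np.eye(500), v)
approx = nystrom_solve(X, v, s=200, mu=0.1)
print(np.max(np.abs(exact - approx)))  # approximation error shrinks as s grows
```

Forming `C` costs $O(nsd)$, the inner $s \times s$ solve costs $O(s^3)$, and the matrix products cost $O(ns^2)$, so the overall work is near-linear in $n$ for moderate $s$.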

Questions

  1. The asymptotic variance is estimated. It is unclear whether the reviewer is referring to $\mathbf{r}_n$ or $\widehat{\sigma}^2$. We address both. The convergence in $\mathbf{r}_n$ in Theorem 2 is in the theorem statement, in the sense that this is the exact quantity used in the CLT. This is a happy result – showing that the estimation is baked into the result, and the empirical quantity is the quantity of interest. Regarding $\widehat{\sigma}^2$, we estimate this as the sample average of the residuals on a hold-out dataset, for which the convergence and consistency are clear.
  2. The CIs are large. This is a known issue with nonparametric regression.
  • In theory. Sadly, large intervals are expected due to the curse of dimensionality, as minimax optimal rates are exponentially dependent on dimension. Although Corollaries 1 and 2 show that our algorithm is suboptimal by a quadratic factor for Lipschitz functions, this is small compared to the exponential dependence in $d$ that no algorithm can escape without making further assumptions, e.g. on additive structure.
  • In practice. Empirically, Figure 3 demonstrates our prediction intervals achieve comparable median widths to conformal baselines, while providing stronger conditional (rather than merely marginal) coverage guarantees. This should be viewed positively.
  3. Size of Figure 3. We agree, thank you for pointing this out. We will enlarge it for the camera-ready version.
  • Regarding MSE performance, there is no consistent winner. Our methods remain competitive: rounding to three decimals, GBT and LightGBM each win four times; SGBT wins three; Algorithm 1, random forests, and XGBoost each win twice; Algorithm 2 and Boulevard each win once.

We hope that this response has assuaged your concerns to your satisfaction. Again, if it has, we would greatly appreciate it if you’d consider increasing your score!

References

Musco and Musco (2016), Recursive Sampling for the Nyström Method

Comment

Thank you very much for addressing my concerns and questions in such detail. Please especially highlight the explanations you provide under "Problem with the straightforward approach" and "We do not have this problem" in your manuscript. I have raised my score accordingly.

Comment

Dear Reviewer AVWH,

Certainly. We appreciate you raising this concern in particular, and very much agree that highlighting this explanation within the paper would improve it. Thank you for your support!

Review
Rating: 4

The paper proposes a variation of gradient boosted decision trees in which dropout and parallel boosting are incorporated. It then provides a theoretical analysis of the algorithm, showing central limit theorems (CLTs). These can be used for statistical inference: prediction, confidence, and reproduction intervals. Some numerical experiment results are presented.

Strengths and Weaknesses

The paper proves for the first time a central limit theorem (CLT) for gradient boosting regression trees. However, I find the paper quite difficult to follow because of its dense mathematical presentation. While the theoretical derivations and equations seem rigorous and technically sound, their practical implications are not clearly conveyed. The central limit theorem (CLT) results rely on multiple assumptions, many of which seem difficult to verify or ensure in practice. Moreover, the empirical evaluation does not convincingly demonstrate a significant performance improvement over strong baselines such as XGBoost. As a result, the contribution of the proposed method remains unclear in practice. Also, these CLT results are shown for a modified version of gradient boosting. Why this has not been shown for a traditional version of Friedman's gradient boosting is not clear.

Questions

XGBoost and other tree ensembles have been widely used for many practical problems without CLT and statistical inference. Why does one need these in practice? I think the paper needs to motivate more the importance of statistical inference.

Limitations

yes

Final Justification

The paper is more theoretically oriented, and the practical significance of the results still remains unclear to me. It seems to have merit in this more theoretically oriented statistics community. I did not check the theory, and so I decrease my confidence score and raise my rating.

Formatting Issues

On page 1, the text is very close to the "Submitted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Do not distribute." footnote, which is not the case with other Neurips formatted papers.

Author Response

We appreciate the referee's attention to the paper and their confirmation that our mathematical development is sound.

However, we do strongly feel that the reviewer's score is unreasonable, as is their framing of comparisons solely in terms of predictive performance. We think that inference should be regarded as an additional component -- you don't need it to predict well on aggregate, but it can be important for some applications.

On the importance of statistical inference.

We turn to the question of "Why does one need statistical inference?", which we interpret as asking about these models specifically. Here we could provide a very considerable range of situations where uncertainty quantification would be relevant. By this we mean: "if you collected a new data set and ran the model again, how different would your prediction be?"

  • You might feel quite different about having knee reconstruction surgery if you were told that it had an 80% success rate empirically, than if you were told that the confidence interval for what this success rate might actually be ran from 40% to 95%.
  • To be more concrete, if you had run XGBoost on geologic samples to give you a predicted yield for proposed rare earth mines, you might prioritise a mine with a smaller but more certain predicted yield than one with a larger prediction but an enormous confidence interval.
  • Conversely, a marketing campaign may want to avoid sending materials to someone who is very unlikely to respond in order not to antagonise them, but might want to send materials out to someone of whom we have very high uncertainty.

These are all immediately practical situations in which uncertainty about a predicted value is already relevant. This is without considering use in downstream tasks:

  • Prioritizing data to collect to reduce uncertainty overall.
  • Propagating uncertainty through to forecasts about the consequences of taking actions in a sequential decision-making problem.
  • Propagating uncertainty into explanation or interpretation methods, where it is also relevant.
  • Model reduction, or testing whether expensive features are worth collecting given how they affect performance.

We note that in all these cases, inference is in addition to prediction. Our methods perform comparably with SoTA boosting, but also provide uncertainty quantification – we are not sacrificing one for the other.

The NeurIPS format does not allow space for an in-depth discussion of the uses of inferential quantities. We felt both that these would be broadly understood and that it was important to focus on the technical components of our results; but we will include some additional motivation in the introduction.

On the algorithmic changes to boosting and assumptions.

For the remainder of your comments; our results do require specific algorithmic changes to boosting. It is not clear how this can be done otherwise – for instance, boosting to a set number of rounds can lead to rampant overfitting and absolutely no hope for convergence or a CLT, while early stopping has its own issues, as we highlight in our response to Reviewer cjXq.

However, the assumptions about the data are not stronger than theoretical guarantees for most machine learning algorithms: we need the distributions of our features and response to behave reasonably, and we need the data we collect to be both independent and representative of the task. These assumptions are relevant for any ML modeling task. Regarding the assumptions about the trees and the learning algorithm itself, we address this in detail within Lines 131-137, 255-258, and 261-263. None of these assumptions are new – they are all inherited from prior work (Zhou and Hooker (2022), Athey et al. (2018)) on trees. As we allude to in Line 137, numerical experiments indicate that our boosting algorithms and inferential procedures work even when these assumptions are not explicitly verified.

We note that purely on an algorithmic level, our extension of Boulevard also provides an interpolation between boosting and bagging methods such as random forests, and while this was not our focus, it does allow us extra degrees of freedom to optimize between them.


At the end of the day, we do not think this review is fair. We strongly feel that both the reviewer's score, and the lion’s share of their objections, are unreasonable, and that this review does not provide a meaningful assessment of our work.

Given that you do acknowledge that our results appear to be correct, we hope that you will consider significantly revising your score; inadequacies in expressing motivation do not render a paper irredeemable.

Comment

Thank you for providing the response. I increase my score. I have no further comments/questions.

Comment

Dear Reviewer oGZf,

We appreciate your support, and thank you for doing so!

Review
Rating: 5

This paper focuses on gradient boosting and builds upon the Boulevard regularization framework introduced by Zhou and Hooker (2022). It extends Boulevard in two directions: first, by introducing random dropout of trees; and second, by integrating the leave-one-out procedure and Boulevard regularization into a parallel boosting framework. The paper studies the statistical properties of both extensions. In particular, it shows that despite the randomness introduced by dropout, the algorithm still converges to a kernel ridge regression estimator that satisfies a central limit theorem (CLT). Moreover, the dropout improves signal recovery over the original Boulevard regularization. The authors also conduct theoretical analysis for the second algorithm, proving convergence to kernel ridge regression and a central limit theorem. The paper discusses confidence interval construction, and provides numerical experiments to support the theoretical claims.

Strengths and Weaknesses

Strengths:

  • The paper is very well written, with clear explanations of both the algorithms and the theoretical developments.
  • The paper makes both algorithmic and theoretical advances. On the algorithmic side, it introduces two meaningful extensions to Boulevard regularization, dropout boosting and parallel tree growth with leave-one-out structure, with improvement in signal recovery. On the theoretical side, the authors rigorously prove that these new algorithms retain key statistical properties, including convergence to kernel ridge regression and central limit theorems.
  • One of the key contributions is the introduction of dropout in the boosting process. While dropout is often used heuristically in ensemble methods, this paper provides a formal justification for its use and shows that it leads to improved signal recovery compared to the original Boulevard method. This insight is valuable both from a theoretical and practical standpoint.

Weaknesses:

  • While the role of dropout is mentioned, there is not much discussion of how to choose the dropout probability $p$ in practice.

Questions

  • Would it be possible to add some visual diagnostics (e.g., QQ plots) to verify the empirical normality implied by the CLT?
  • Could the authors provide a more in-depth discussion on how to tune hyperparameters such as the regularization strength $\lambda$ and the dropout probability $p$ in practice? Practical guidance or heuristics for selecting these values would be valuable for applying the method effectively.

Limitations

The paper is theoretical and methodological in nature and does not raise direct societal concerns.

Final Justification

I would like to retain my score, as I find the contributions of this work to be solid in both algorithmic design and theoretical analysis.

Formatting Issues

NA

Author Response

Thank you for your kind review! We are heartened by your comments that the paper is “very well written” (which is high praise!), that we offer both practical and theoretical advancements, and by your highlighting of the utility of a formal justification for the use of dropout within boosting. We provide answers to your questions below (the second is addressed before the first, given that the second was also raised as a weakness).

Weaknesses and Questions

  1. Practical guidance for choosing hyperparameters.
  • Learning rate $\lambda$. We found that a learning rate close to $\lambda=1$ seemed to work well empirically. At no point did we choose a learning rate smaller than $0.1$. As a strategy, we would try $\lambda=1$ first, and decrease this in increments of $0.2$ based on cross-validated performance or its performance on a hold-out validation set. $\lambda$ should not be set to more than $1$, as specified in the setup of the paper in Line 99. Doing so invalidates the theory, and leads to instability in practice.
  • Dropout probability $p$. Regarding the choice of dropout, we found that intermediate values of $p=0.3, 0.6, 0.9$ seemed to work well. The extremes lead to interesting behavior – $p=0$ (yielding vanilla Boulevard) did not yield stable prediction intervals, and $p=1$ yields a random forest. In Lines 339-341, we highlight that the variance of the estimator, and therefore the CIs, grows as $\lambda$ and $p$ decrease – but we certainly agree that more discussion is warranted, and will make this amendment in the final paper.
  • Number of trees grown in parallel $K$. We would set this to the highest possible $K$ that does not lead to a significant slowdown in the time taken for one (parallel) boosting round. This is likely to be a multiple of the number of processors one has available. Why? Having a larger $K$ essentially allows us to boost a mini-random forest. Boosting random forests (a strong learner) should lead to better prediction error than boosting a tree (a weak learner), if one keeps the number of boosting rounds the same. Algorithm 2 can in theory achieve an unbounded improvement in asymptotic relative efficiency compared to vanilla Boulevard. However, this improvement is in terms of the number of boosting rounds, not ensemble size!
    • Our results in Figure 2 regarding the MSE of Algorithm 2 are somewhat unfair to Algorithm 2 – as the x-axis is in the ensemble size, not the number of boosting rounds. The number of boosting rounds that Algorithm 2 encounters can be as few as 31 on Abalone and 50 on Wine Quality, in contrast to the 500 boosting rounds all other algorithms enjoy! Given the same number of boosting rounds, we would certainly expect an improvement.
  2. Visual diagnostics on empirical normality. Recent response guidelines do not allow us to provide new figures. However, we have created the requested QQ plot from a simulation using Algorithm 1 and have lodged it with the AC.
  • Qualitatively it shows a good fit to the proposed CLT – the plot is almost linear, with a slope just smaller than one. There is some indication of heavy tails in the extreme quantiles, which involve only six out of 100 simulations, and the worst-behaved datapoint had a theoretical quantile of around $-2.75$ and a sample quantile of $-6$. This is not entirely unexpected – trees are notoriously bad at the very edge of the support of the data.
  • In case this is inaccessible to the reviewer, we also conducted normality tests (Shapiro-Wilk and Jarque-Bera tests) based on the simulation. Results suggest that aside from outliers (that can be blamed on the poor behavior of trees near the edge of the data support), overall the predictions are pretty normal:
| Test | Trimming | Outliers Removed | Statistic | p-value |
|---|---|---|---|---|
| Shapiro–Wilk | None | 0 | 0.9472 | 0.0005 |
| Shapiro–Wilk | $2\sigma$ | 8 | 0.9876 | 0.5431 |
| Jarque–Bera | None | 0 | 20.9855 | 2.7736e-05 |
| Jarque–Bera | $2\sigma$ | 8 | 1.0995 | 0.57709 |

Overall, the results are encouraging for verifying empirical normality, and we thank the reviewer for this suggestion. We will include this QQ plot in the revised manuscript.
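
For readers following along, the diagnostics above amount to a QQ plot plus two standard normality tests. A minimal sketch of this kind of check (our illustration, not our experimental code; the array `z` below is a simulated placeholder for the standardized prediction errors, which are not reproduced here) is:

```python
# Minimal sketch of the normality diagnostics described above: a QQ plot and
# Shapiro-Wilk / Jarque-Bera tests, with and without trimming at 2 sigma.
# `z` is a simulated placeholder for standardized prediction errors.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
z = rng.standard_normal(100)  # placeholder standardized errors, one per simulation

# QQ plot against the standard normal
stats.probplot(z, dist="norm", plot=plt)
plt.title("QQ plot of standardized prediction errors")
plt.show()

# Normality tests, before and after removing points more than 2 sigma from the mean
trimmed = z[np.abs(z - z.mean()) <= 2 * z.std()]
for label, sample in [("None", z), ("2 sigma", trimmed)]:
    sw = stats.shapiro(sample)
    jb = stats.jarque_bera(sample)
    print(f"trim={label}: removed={len(z) - len(sample)}, "
          f"Shapiro-Wilk W={sw.statistic:.4f} (p={sw.pvalue:.4f}), "
          f"Jarque-Bera JB={jb.statistic:.4f} (p={jb.pvalue:.4g})")
```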

If our response and the additional figures that we have generated in response to your review and those of Reviewers cjXq and AVWH have improved the paper in your eyes, we hope that you might consider increasing your confidence in your assessment of the paper. Thank you for your kind review again!

Comment

I thank the authors for their response and their effort in addressing the normality testing in accordance with the updated NeurIPS rebuttal guidelines. I have no further questions.

Comment

Dear Reviewer kHmv,

Of course, checking for normality to see if the theory indeed holds in practice is probably something we should have done anyway in the original paper. Thank you for your kind review again!

Review
Rating: 4

Gradient boosting is one of the most widely used machine learning (ML) methods, but its theoretical properties are less well understood than for other ML methods. The paper aims to fill this gap and provides theoretical results for boosting, namely a central limit theorem for boosting predictions. Two algorithms (Boulevard Regularized Additive regression Trees - Dropout (BRAT-D) and BRAT Parallel (BRAT-P)) are proposed and their theoretical results are derived. Finally, numerical experiments are presented which underscore the theoretical results.

Strengths and Weaknesses

Strengths
Overall, the paper is an important contribution for understanding boosting algorithms better. The developed theory is very involved and impressive. The paper is structured clearly.

Weaknesses

  • The literature review seems incomplete. The claim that "Still, this is only method available for frequentist inference for gradient boosting,..." is not correct. E.g. Luo et al. (2016) analyse gradient boosting with univariate regressions as base learners and derive theoretical properties. A comparison with this method would be interesting, as would including other related papers in the literature review.

Ye Luo, Martin Spindler, Jannis Kueck (2016): High-Dimensional L2 Boosting: Rate of Convergence. arXiv:1602.08927, Journal of Machine Learning Research, forthcoming.

  • A crucial point of boosting is early stopping and determining when to stop. This is also key for empirical applications, in particular having a feasible stopping criterion. It seems that the stopping time is given in the algorithms, but how would the optimal (feasible) stopping time be determined, and how does the stopping criterion affect the theoretical results? How do stopping criteria affect the empirical results?
  • The theoretical setting seems to be very much inspired by random forests, in particular the "parallel structure", but is not directly related to the concrete implementation of gradient boosting, in particular its sequential nature.

Questions

  • Is it possible to give (feasible) stopping criteria for the boosting algorithms and to analyze the behaviour under such criteria?
  • How do different stopping criteria affect the results from the numerical experiments?

Limitations

yes

Final Justification

I keep my score. It is a theoretically interesting contribution building on previous work. My only caveat, as outlined above, is that the algorithms are modified so that they don't have much in common with the original boosting algorithm (which of course is common practice to get theoretical results at all).

Formatting Issues

none

Author Response

We thank the reviewer for their kind comments on the strengths of the paper! We are very pleased to hear that you think that this is an important contribution towards understanding boosting algorithms. We are also glad that you think that the paper is structured clearly – we put a lot of thought into how this could be presented in 8 pages. We also thank you for your kind comments that our theory is involved and impressive.

Below are responses and answers to the weaknesses and questions you raised. These include more numerical results that pertain to whether or not early stopping affects our algorithms. We hope that these assuage your concerns.

Weaknesses and Questions

  1. Are there other methods available for frequentist inference for gradient boosting? This is a good question. We hope that there are! Alternatives to Boulevard regularization are very much needed. We thank the reviewer for their comment – we had read Luo et al. (2016), but it did not come to mind that statistical inference could be possible with their approach. There are two things that you might mean, and we address them below.
  • Do you mean that one can extract valid intervals out of their post-L2 boosting estimator, given that it is a valid post-selection inference algorithm? This is likely! Though we aren’t sure that this would provide inference for boosting in the sense that we had in mind. Their method performs boosting on component-wise linear models, and not trees.
    • Furthermore, their post-L2 boosting method refits a linear model on the predictors selected by their boosting procedure. This does not provide statistical inference for the model that was fitted by boosting.
    • This observation does raise a question. Viewing trees as OLS on adaptively selected rectangular indicator variables, can we fit a kernel regression on the kernel given by gradient boosting and perform inference on that model? Yes. This would constitute a similar algorithm to Friedberg et al. (2018), except applied to boosting and not random forests. We are working on it right now for another related paper. Please don’t scoop us!
  • Do you mean that one may be able to obtain valid intervals in general given rates of convergence? Luo et al. (2016) show rates of convergence, but neither provide a CLT, nor do they use the convergence rates established to provide methods for statistical inference.
    • A CLT alone is not sufficient for statistical inference – one needs also an estimate of the asymptotic variance. It is unclear how this can be done in the framework of Luo et al. (2016). An analysis of the convergence rates is also not sufficient for statistical inference – constants are untracked, and the limiting distribution, or even whether asymptotic normality holds, is unknown.
    • Their paper makes no mention of confidence etc. intervals, and it is unclear to us how their results can be used for that. We think that this may be possible if one can either estimate the complexity of the early-stopped function class on the fly, or if one can track down the exact constants within their analysis – but this extends far beyond the scope of their paper.
    • Having said that, we do agree that we should have included this within the literature review – we did read the paper, and found it to yield spiritually similar conclusions to Zhang and Yu (2005), which should also have been included. We thank the reviewer for pointing this out, apologize for the omission, and will fix this in the final version of the paper.
  2. Stopping times. On questions about stopping criteria: one of the features of the Boulevard framework is that we do not need an explicit stopping criterion to avoid overfitting. Because it works on a running average rather than a sum of trees, it is akin to random forests, where building a larger ensemble never hurts you, but leads to diminishing returns.
  • In that sense, we do not need explicit stopping rules in our theory – in fact the case we analyse is based on the limit of infinite boosting rounds – and in practice we would use similar tools as in random forests: monitoring test set performance and stopping at the point where we no longer see improvements.
  • Having said that, the reviewer alludes to an important point: gradient boosting with early stopping is a promising direction for statistical inference. Certainly, if rates of convergence are available (such as with the work of Luo et al. (2016) and Zhang and Yu (2005)), perhaps something can be pulled out from them. Asymptotic normality may not hold, but non-normal limits and nonasymptotic results are still possible even then. We are interested in addressing this in future work.
  • How does early stopping affect the empirical results? Last-minute guidance restricts us from providing figures in this response, but we created a new figure that plots the test set performance against the number of boosting rounds, which verifies this. We have lodged it with the AC. Under a moderate noise-to-signal ratio DGP with the ground truth being a sinusoid, $y = \sin(4\pi x) + \epsilon$, $\epsilon \sim \mathcal{N}(0,4)$, we fit Algorithm 1 with $p=0.5$ and Algorithm 2 with $K=2$, boost to 500 rounds, and keep all other hyperparameters the same.
    • The test MSE decreases rapidly for all models. For gradient boosting and stochastic gradient boosting, it starts to increase dramatically after 10 to 20 rounds. XGBoost and LightGBM achieve the minimum at around the same mark, but the test MSE increases to a lesser degree (the former more than the latter) after this point. We attach a table below, depicting the results from highest final MSE to lowest final MSE at the end of 500 boosting rounds.
    • Algorithm 1, Algorithm 2, and vanilla Boulevard achieve approximately the same minimum, but the test set MSE remains flat thereafter – unlike the previously described boosting algorithms, it does not increase in the number of boosting rounds past this point. Early stopping is not necessary for our algorithms, both in theory and in practice!
| Algorithm | Step at which minimum was achieved | Approximate minimum MSE | Final MSE | Early stopping needed? |
|---|---|---|---|---|
| GBT | ~10 | ~40.5 | ~60 | Yes |
| SGBT (subsample=0.6) | ~12 | ~40.5 | ~60 | Yes |
| XGBoost | ~15 | ~41 | ~50 | Yes |
| LightGBM | ~18 | ~41 | ~44 | Yes |
| Algorithm 2 | Flat at 50-500 | ~42 | ~42 | No |
| Boulevard | Flat at 50-500 | ~41 | ~41 | No |
| Algorithm 1 | Flat at 50-500 | ~41 | ~41 | No |
| Random Forest | Flat at 50-500 | ~40.7 | ~40.7 | No |
  3. Random forests. A final suggested weakness was the similarity between random forests and our parallelization structure.
  • The referee could be referring to the contrast between the parallel structure of parallelized boosting (which starts to look more like a random forest) and the sequential processing in gradient boosting. This applies only to Algorithm 2 – Algorithm 1 is a bona fide sequentially processed gradient boosting algorithm for the L2 loss.
  • In either case, this does not mean that Algorithms 1 and 2 are a reduction of boosting to random forests. While trees are averaged instead of summed, these trees are still grown in sequential order, one after the other. This leads to convergence to a kernel ridge regression, $\widehat{\mathbf{y}}_n^{(b)} \xrightarrow[b \rightarrow \infty]{a.s.} \left(\lambda^{-1}\mathbf{I}+(1-p) \mathbb{E}\left[\mathbf{S}_n\right]\right)^{-1} \mathbb{E}\left[\mathbf{S}_n\right] \mathbf{y}_n$, rather than convergence to a random forest, $\mathbb{E}\left[\mathbf{S}_n\right] \mathbf{y}_n$. The analysis in terms of kernel ridge regression is in fact quite different, and these lead to different estimators in practice as well (a small numerical illustration is given after this list).
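
To illustrate the distinction numerically, here is a toy sketch (our own illustration, with a made-up row-stochastic matrix standing in for $\mathbb{E}[\mathbf{S}_n]$, not a quantity from the paper) of the two limits: the kernel-ridge-type boosting limit versus the plain random-forest smoother.

```python
# Toy illustration (not from the paper) of the two limits discussed above.
# S is a made-up row-stochastic matrix standing in for E[S_n]; in the actual
# algorithms it is the expected tree structure matrix.
import numpy as np

rng = np.random.default_rng(0)
n, lam, p = 50, 0.8, 0.5
S = rng.random((n, n))
S /= S.sum(axis=1, keepdims=True)  # rows sum to one, like an averaging smoother
y = rng.normal(size=n)

# Boosting limit: (lambda^{-1} I + (1 - p) E[S_n])^{-1} E[S_n] y_n
boosting_limit = np.linalg.solve(np.eye(n) / lam + (1 - p) * S, S @ y)
# Random forest limit: E[S_n] y_n
forest_limit = S @ y

print(np.linalg.norm(boosting_limit - forest_limit))  # the two estimators differ
```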

Again, we thank the referee for your kind comments and insightful questions. We hope that the response above is helpful and has addressed your concerns. If you think our response has adequately addressed your concerns, we will be glad if you choose to increase your score!

References

  1. Tong Zhang and Bin Yu (2005), Boosting with early stopping: Convergence and consistency.
  2. Friedberg et al. (2018), Local linear forests
  3. Luo et al. (2016), High-dimensional L2 boosting: Rate of convergence

Comment

Thanks a lot for the detailed replies, which clarify many of the points. I think almost all technical points are clear now.

I have a conceptual point which is beyond the scope of the rebuttal: for me, boosting is a sequential procedure where the steps depend on each other. In your framework it is treated more as a parallel procedure, like a random forest (e.g., stopping times do not matter at all, even though they are core to and special about boosting). Therefore I would not consider this a theory for boosting. But nevertheless the theory is impressive.

Comment

Dear Reviewer cjXq,

Thank you for the compliment! We appreciate the remark. Is your objection solely due to the fact that we are happy to boost for infinite rounds and do not need early stopping? If so, then the following points may be helpful as a point of clarification:

  1. This is something inherited from Zhou and Hooker (2022), who construct the original Boulevard procedure that shares the same willingness to boost for infinite rounds.
  2. Within Algorithms 1 and 2, trees are grown dependent on the output of previous trees, where we fit new trees on the residuals of ensembles of old trees. This is faithful to the boosting procedure, where the steps are dependent on each other. The only caveat is that the procedure does admittedly require eventual non-adaptivity, but this only needs to happen in the limit.
  3. Stopping times are only one of the ways that we can ensure that boosting algorithms do not overfit. Alternatives include restricting tree depth, reducing the learning rate, employing L2 regularization, or even including dropout (e.g. DART). Boulevard regularization just so happens to be one way this can be done that is especially amenable to statistical inference. With this, everything remains the same about the original boosting procedure -- we simply update $f_{b+1} = \frac{b-1}{b}f_b + \frac{\lambda}{b} t_{b+1}$ instead of $f_{b+1} = f_b + \lambda t_{b+1}$. This can be thought of as a decreasing learning rate and ensemble shrinkage (a toy sketch contrasting the two updates appears below).
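
As a concrete toy sketch of the point in item 3 (our illustration, using sklearn regression trees fit to residuals; the data-generating process and hyperparameters are made up, and the bias correction and dropout described elsewhere are omitted), the only change is the update line:

```python
# Toy contrast between the vanilla boosting update and the Boulevard-style
# averaged update; both fit each new tree to the residuals of the current fit,
# so the procedure remains sequential in both cases.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 1))
y = np.sin(4 * np.pi * X[:, 0]) + rng.normal(scale=0.5, size=500)

lam, B = 0.8, 200
f_vanilla = np.zeros(500)    # f_{b+1} = f_b + lam * t_{b+1}
f_boulevard = np.zeros(500)  # f_{b+1} = ((b-1)/b) f_b + (lam/b) t_{b+1}

for b in range(1, B + 1):
    t_v = DecisionTreeRegressor(max_depth=3).fit(X, y - f_vanilla).predict(X)
    t_b = DecisionTreeRegressor(max_depth=3).fit(X, y - f_boulevard).predict(X)
    f_vanilla = f_vanilla + lam * t_v
    f_boulevard = (b - 1) / b * f_boulevard + lam / b * t_b

truth = np.sin(4 * np.pi * X[:, 0])
print("Training-sample MSE vs truth:",
      np.mean((f_vanilla - truth) ** 2), np.mean((f_boulevard - truth) ** 2))
```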

If your objection lies in the fact that this does require a non-standard modification to the boosting algorithm from what is commonly used in practice, that is a fair objection. We aren't too happy with it either, and are currently exploring alternatives in ongoing projects that amount to future work.

Again, if you think our response has adequately addressed your concerns, we will be glad if you choose to increase your score! Otherwise, please let us know if you do still have objections -- your feedback will be very helpful in improving the paper and how we present our results.

Comment

Thanks a lot for the clarification and efforts!

Comment

Anytime! Thank you for your effort as well!

Final Decision

Building on work by Zhou & Hooker (2022), this paper proposes a gradient boosting algorithm with a CLT guarantee: in the limit of boosting rounds, the ensemble output converges to a kernel ridge regression. The CLT guarantee means that one can use the ensemble for statistical inference, like computing confidence intervals around the predictions or ground truth. The paper accomplishes improvements over Zhou & Hooker by incorporating dropout, in addition to Zhou & Hooker's Boulevard regularization. Two boosting algorithms are proposed: one sequential, and one parallel -- the latter of which also incorporates a leave-one-out procedure. Experiments on synthetic and real data round out the study.

The reviews all say this is a technically solid, well written paper. The weaknesses are mostly about the writing, with some reviewers asking for more elaboration on tuning the dropout parameter, the importance of the CLT, and comparison/advantages over prior work (notably, Zhou & Hooker). The authors' rebuttal seems to have addressed most of these concerns, so their points could be incorporated into the camera-ready.

As all scores are on the positive side, I will agree with the reviews and recommend acceptance. Since there is no overwhelmingly positive review (i.e., strong accept), and the impact of this work is a bit unclear, I recommend poster presentation.