Subsampled Ensemble Can Improve Generalization Tail Exponentially
Our paper shows that ensemble learning via majority voting achieves exponentially faster risk decay, improving base learners with slow rates.
Abstract
Reviews and Discussion
This paper studies ensemble learning, a popular paradigm that aggregates several weak learners into a strong model (e.g., combining decision trees into tree ensembles). The authors propose the novel methodology of simply selecting the best base learner out of the pool of models trained on different subsamples of the data. In the easiest case, this happens via majority voting. They show that this improves the generalisation bound of the model, guaranteeing an exponentially decaying generalisation tail even when the underlying problem has a slow polynomial decay.
More specifically, in a discrete problem the model appearing most frequently is selected as the output model (MoVE). In continuous problems this degenerates, as every learned model likely appears exactly once. Therefore, ROVE chooses the model that has the highest likelihood of being eps-optimal.
The authors then formally derive theoretical bounds for the exponentially decaying generalisation tail in both scenarios.
They then experiment with MoVE and ROVE, which actually comes in two versions: one where all data is used in both steps, and one that splits the data between learning candidate models and performing the eps-optimality vote. Experiments are performed on synthetic regression datasets tackled with MLPs, and on four discrete heavy-tailed stochastic programs. For the regression cases the proposed algorithms seem to work better than a single base learner. A further comparison is drawn with bagging - here we see that bagging wins on light-tailed data and vice versa, which is reasonable. Performance on the stochastic programs looks competitive, and ROVE seems the best alternative.
Strengths and Weaknesses
STRENGTHS
- significance: ensembling is a very popular technique and studies advancing its applicability are definitely valuable to the ML and optimisation communities
- quality: the paper is well written, has a solid theoretical justification and an extensive experimental evaluation (I have only one concern, which I express in my questions)
- originality: the work is original and seemingly valid for any kind of ensemble - therefore it could lead to practical improvements in diverse areas where current ensembling solutions are not satisfactory
WEAKNESSES
- the work is quite theoretical and I do not necessarily see an immediate reflection in ML models; e.g., I am still a bit unsure that a single tree can do better than an ensemble in a lot of scenarios. See my related question.
- while I get that the main point is a higher-level contribution, it feels a bit weird to me that a paper on ensembles does not have any experiment related to tree ensembles, arguably the most popular ensembling technique in ML - see my related question later
Questions
- I'm not entirely sure I get the discrete scenario. The majority vote in MoVE happens over the different trained models. Does that mean multiple models that are exactly the same are the output of the training on different subsamples? e.g., we would need to have two identical decision trees? Or is this for example related to the numerical solution of a stochastic program? Could you please provide a practical example of a discrete scenario? It's probably just a lack of expertise on my side.
- In addition to the formula in the algorithm, could you better explain in a couple of sentences and/or with an example what it means that a model is eps-optimal?
- The proposed method envisions always using one single learner (extracted through majority voting or finding the eps-optimal one), and justifies how this improves generalisation. Intuitively, often using boosting/bagging is a much more powerful solution (e.g. boosted ensembles such as XGB perform way better than single decision trees). Indeed in Figure 2 you show that for light-tailed distributions subagging wins over ROVE.
- Is this always the case? Does your method always win for heavy-tailed distributions?
- Were you able to find a "cutoff" which tells when your method is preferable, depending on how data is distributed? Can you extract this somehow from the training data?
- Have you compared your algorithms to other ensembling techniques other than bagging (from Figure 2)?
- Arguably the most popular type of ensembles in machine learning is tree ensembles. I wonder why no experiments were presented using decision trees? Do you know if your algorithm performs well on those?
Limitations
The authors address one potential limitation in the last paragraph. The work has no potential negative societal impact.
Justification for Final Rating
I have appreciated the authors' responses, as they were on point and even included an additional suggested experiment. On the other hand, I do understand some of the weaknesses highlighted by other reviewers and I still somewhat question the practical usefulness of this method. Therefore I am retaining my original score of borderline accept - but I'm in favour of the paper being accepted if the replies to other reviewers are considered satisfactory enough!
Formatting Issues
None
Thank you very much for your detailed review and positive feedback. We hope our replies below can provide further clarity and address your concerns. Please feel free to let us know if you have any further questions.
Q3: The proposed method envisions always using one single learner (extracted through majority voting or finding the eps-optimal one), and justifies how this improves generalisation. Intuitively, often using boosting/bagging is a much more powerful solution (e.g. boosted ensembles such as XGB perform way better than single decision trees). Indeed in Figure 2 you show that for light-tailed distributions subagging wins over ROVE. (1) Is this always the case? Does your method always win for heavy-tailed distributions? (2) Were you able to find a "cutoff" which tells when your method is preferable, depending on how data is distributed? Can you extract this somehow from the training data? (3) Have you compared your algorithms to other ensembling techniques other than bagging (from Figure 2)?
A: Thank you for the questions. The goal of our work is to provide a fresh perspective, especially for problems that suffer from heavy-tailed noise: by performing voting (subsampled ensembling) at the model level instead of the output level, one might be able to achieve better generalization performance. Although the output model from our algorithms is one of the base learners, the main message we want to convey is not that "a single tree can do better than an ensemble of trees". Instead, we want to demonstrate that by carefully selecting a model from the candidate set, one can possibly achieve generalization performance that a simple aggregation cannot offer (for example, for heavy-tailed problems, a simple average might not effectively reduce the noise).
In theory, as long as the mild assumptions stated in our theorems are satisfied, our methods are guaranteed to have exponentially decreasing excess risk tails. This is what other ensemble approaches do not have. Indeed, in our new experiments shown below, our methods outperform all popular ensemble methods in tail performance when the problem gets heavy-tailed.
W2: Comparison with tree ensembles
A: We have added a new set of experiments comparing our approach with popular existing tree ensembles, including Random Forest (RF), Gradient Boosted Decision Trees (GBDT), and XGBoost (XGB). Decision trees are used as the base learner for all methods. Random forests are constructed with the same number of decision trees as our methods to ensure a fair comparison. GBDT and XGB are constructed with early stopping to avoid overfitting. Results on percentiles of the test errors are as follows:
Comparison with tree ensembles in terms of tail performance
For the same synthetic data as described in lines 209-212
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 19.664 | 38.005 | 81.457 | 193.69 | 492.9 | 2035.014 |
| ROVE | 3.455 | 4.431 | 4.943 | 5.569 | 6.153 | 6.323 |
| ROVEs | 2.309 | 3.056 | 3.4 | 3.978 | 4.261 | 4.547 |
| RF | 1.016 | 2.683 | 4.617 | 6.5 | 32.222 | 210.42 |
| GBDT | 0.787 | 1.825 | 3.415 | 6.81 | 30.019 | 87.736 |
| XGB | 1.635 | 5.092 | 9.101 | 21.094 | 52.973 | 105.72 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 2.55 | 3.436 | 4.734 | 6.184 | 9.825 | 24.89 |
| ROVE | 0.927 | 1.116 | 1.238 | 1.329 | 1.489 | 1.604 |
| ROVEs | 0.665 | 0.832 | 0.907 | 0.987 | 1.035 | 1.094 |
| RF | 0.121 | 0.181 | 0.249 | 0.326 | 0.757 | 2.789 |
| GBDT | 0.114 | 0.184 | 0.246 | 0.324 | 0.819 | 1.057 |
| XGB | 0.289 | 0.573 | 1.082 | 2.279 | 3.711 | 6.764 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 0.845 | 0.994 | 1.138 | 1.315 | 1.599 | 2.415 |
| ROVE | 0.389 | 0.448 | 0.502 | 0.535 | 0.62 | 0.65 |
| ROVEs | 0.289 | 0.353 | 0.382 | 0.41 | 0.431 | 0.44 |
| RF | 0.044 | 0.053 | 0.062 | 0.073 | 0.102 | 0.189 |
| GBDT | 0.043 | 0.06 | 0.074 | 0.087 | 0.105 | 0.116 |
| XGB | 0.103 | 0.173 | 0.248 | 0.421 | 0.588 | 0.913 |
For a new synthetic dataset where each covariate is uniformly distributed and the noise terms are Pareto
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 46.095 | 103.303 | 193.074 | 364.772 | 640.562 | 1287.311 |
| ROVE | 17.683 | 18.981 | 19.483 | 20.908 | 21.484 | 22.072 |
| ROVEs | 17.907 | 19.262 | 20.444 | 21.171 | 21.715 | 23.893 |
| RF | 6.518 | 10.507 | 18.519 | 34.58 | 46.284 | 79.079 |
| GBDT | 21.61 | 40.734 | 59.422 | 87.771 | 136.325 | 232.915 |
| XGB | 31.989 | 46.385 | 72.751 | 207.12 | 496.714 | 596.137 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 6.341 | 7.251 | 7.924 | 9.374 | 12.617 | 18.885 |
| ROVE | 8.975 | 9.248 | 9.413 | 9.501 | 9.563 | 9.683 |
| ROVEs | 10.95 | 11.299 | 11.506 | 11.602 | 11.706 | 11.821 |
| RF | 4.18 | 4.274 | 4.322 | 4.477 | 4.583 | 4.742 |
| GBDT | 2.699 | 4.052 | 6.68 | 9.019 | 12.409 | 18.297 |
| XGB | 5.77 | 8.136 | 10.154 | 13.496 | 17.501 | 28.473 |
For a new synthetic dataset where each covariate is uniformly distributed and the noise terms are Pareto
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 35.753 | 84.013 | 149.819 | 267.985 | 423.76 | 665.424 |
| ROVE | 5.969 | 7.133 | 7.808 | 8.481 | 9.272 | 9.704 |
| ROVEs | 4.633 | 5.347 | 5.902 | 7.043 | 7.916 | 10.408 |
| RF | 2.967 | 7.792 | 14.225 | 22.411 | 36.374 | 62.494 |
| GBDT | 6.247 | 15.913 | 31.117 | 50.551 | 72.74 | 178.967 |
| XGB | 3.636 | 6.206 | 8.579 | 13.917 | 25.043 | 27.09 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 1.944 | 2.579 | 3.479 | 4.35 | 5.367 | 7.477 |
| ROVE | 0.889 | 0.997 | 1.065 | 1.11 | 1.149 | 1.264 |
| ROVEs | 0.726 | 0.824 | 0.874 | 0.936 | 1.023 | 1.049 |
| RF | 0.165 | 0.245 | 0.307 | 0.377 | 0.529 | 0.637 |
| GBDT | 0.732 | 0.924 | 1.129 | 1.431 | 1.659 | 2.511 |
| XGB | 0.331 | 0.443 | 0.568 | 0.687 | 0.875 | 1.007 |
These results show that our methods outperform the base learner in almost all cases, and also outperform RF, GBDT, and XGB, especially at high percentiles, when the tail is heavy (e.g., with Pareto shape 1.5). In less heavy-tailed cases RF, GBDT, and XGB may outperform our methods.
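For reference, the following is a minimal sketch of this kind of tail-performance comparison. It is not our exact experimental configuration: the data-generating parameters, model settings, and early-stopping setup are placeholders, and our ROVE/ROVEs implementations are omitted since they follow Algorithm 2 in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor  # assumes xgboost is installed

def tail_percentiles(n_runs=200, n=500, d=5, pareto_shape=1.5, seed=0):
    """Collect test errors over independent runs; the tables above report
    exactly such percentiles of the test-error distribution."""
    rng = np.random.default_rng(seed)
    beta = rng.uniform(size=d)  # fixed linear signal (placeholder ground truth)
    errors = {name: [] for name in ["base", "RF", "GBDT", "XGB"]}
    for _ in range(n_runs):
        X = rng.uniform(size=(n, d))
        y = X @ beta + rng.pareto(pareto_shape, size=n)  # heavy-tailed noise
        Xte = rng.uniform(size=(2000, d))
        yte = Xte @ beta + rng.pareto(pareto_shape, size=2000)
        Xtr, Xval, ytr, yval = X[:-100], X[-100:], y[:-100], y[-100:]
        models = {
            "base": DecisionTreeRegressor(),
            "RF": RandomForestRegressor(n_estimators=50),
            "GBDT": GradientBoostingRegressor(n_iter_no_change=10),  # early stopping
            "XGB": XGBRegressor(n_estimators=500, early_stopping_rounds=10),
        }
        for name, model in models.items():
            if name == "XGB":  # XGB early-stops on a held-out validation split
                model.fit(Xtr, ytr, eval_set=[(Xval, yval)], verbose=False)
            else:
                model.fit(X, y)
            errors[name].append(mean_squared_error(yte, model.predict(Xte)))
    return {name: np.percentile(e, [50, 80, 90, 95, 97.5, 99])
            for name, e in errors.items()}
```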
Q1: I'm not entirely sure I get the discrete scenario. The majority vote in MoVE happens over the different trained models. Does that mean multiple models that are exactly the same are the output of the training on different subsamples? e.g., we would need to have two identical decision trees? Or is this for example related to the numerical solution of a stochastic program? Could you please provide a practical example of a discrete scenario? It's probably just a lack of expertise on my side.
A: We are certainly happy to provide more clarification. A discrete scenario refers to problems where the decision space consists of discrete solutions. In such cases, when solving the same problem on multiple subsamples, some solutions may appear repeatedly across different base learners. In the machine learning context, a good example is model selection, where one selects a final model among finitely many candidates. It arises even more naturally in stochastic programming with a discrete solution space. A sketch of MoVE in this setting is given below.
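Here is a minimal sketch of the idea, assuming the base learner returns a hashable discrete solution; the routine name `train_on` is illustrative, not our actual implementation:

```python
from collections import Counter
import numpy as np

def move(data, train_on, k, B, rng=None):
    """MoVE in a discrete scenario: train the base learner on B subsamples of
    size k and return the most frequently produced model. `train_on` maps a
    (numpy-indexable) dataset to a discrete model/solution, e.g., the optimal
    solution of an SAA stochastic program or the winner of model selection."""
    rng = rng or np.random.default_rng()
    votes = Counter(
        train_on(data[rng.choice(len(data), size=k, replace=False)])
        for _ in range(B)
    )
    return votes.most_common(1)[0][0]  # ties broken arbitrarily
```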
Q2: In addition to the formula in the algorithm, could you better explain in a couple of sentences and/or with an example what it means that a model is eps-optimal?
A: No problem! In general, a solution $\hat{\theta}$ is called $\epsilon$-optimal for the optimization problem $\min_{\theta} R(\theta)$ if its loss satisfies $R(\hat{\theta}) \le R(\theta^*) + \epsilon$, where $\theta^*$ is the true optimal solution.
In the specific context of Algorithm 2 (ROVE/ROVEs), a solution is considered $\epsilon$-optimal on a given subsample if, when evaluated on that subsample, its loss is within $\epsilon$ of the best loss achieved on that subsample. For each subsample, Phase II of Algorithm 2 basically finds all such $\epsilon$-optimal solutions among the retrieved solutions. Finally, the algorithm selects the solution that is $\epsilon$-optimal for the largest number of subsamples. This procedure can be viewed as a generalized majority vote in the continuous space.
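In code, the per-subsample $\epsilon$-optimal set is just a thresholded argmin; the small helper below (an illustrative sketch, not our implementation) makes the vote concrete:

```python
import numpy as np

def eps_optimal_on_subsample(losses, eps):
    """Indices of candidates that are eps-optimal on one subsample, i.e., whose
    loss is within eps of the best candidate loss on that subsample. ROVE then
    outputs the candidate landing in these sets most often across subsamples."""
    losses = np.asarray(losses)
    return np.flatnonzero(losses <= losses.min() + eps)
```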
Thank you for the clarifications! I appreciate the authors' responses, as they were on point and even included the suggested experiment. On the other hand, I do understand some of the weaknesses highlighted by other reviewers and I still somewhat question the practical usefulness of this method. Therefore I am currently retaining my original score of 4 - but I'm in favour of the paper being accepted if the replies to other reviewers are considered satisfactory enough!
Thank you for your support! We are glad that our response and new experiments can provide further clarity : )
The authors propose a new method for aggregating base learners, each trained on a subsample of the training set. They prove and show for toy examples that the tail probability for the corresponding excess risk can be exponentially small, when the assumptions are met. The empirical performance of the proposed method is also demonstrated.
Strengths and Weaknesses
Strengths: The proposed method is reasonable and potentially useful when the tail probability needs to be controlled. The results presented in Figs. 1 and 2 are consistent with intuitive expectations.
Weaknesses:
- Theoretically, although the authors claim that the performance enhancement of their method is ``agnostic to the underlying base learner,'' the crucial condition on $p_k^{\max}$ seems hard to verify for a given algorithm and a given $\epsilon$. If such a condition only holds for extremely large $\epsilon$ while the population risk is bounded, then such a result may not be very informative. This scenario is possible. Consider the case of a finite model space for simplicity. Suppose that, for some dataset, the algorithm outputs each of the finitely many models with roughly the same probability, and that relative to the population risk minimizer, every other model has an excess risk above $\epsilon$. Then $p_k^{\max}$ is roughly the reciprocal of the number of models, while the tail probability of the excess risk stays close to one for a large range of $\epsilon$. Then the condition that $p_k^{\max}$ strictly exceed this tail probability cannot be satisfied in such a case.
- Also, the condition is unreasonably strong for me. On the one hand, $p_k^{\max}$ is the probability corresponding to the mode of the sampling distribution of the learned model. On the other hand, the quantity it is compared against is the tail probability of the excess risk incurred by a random output of the base learner. There is no reason that these two can be compared. By requiring the former to be (strictly) greater than the latter, it is qualitatively required that the excess risk of the ``mode'' base learner can almost be controlled, which leads to the exponential tail. In my opinion, this is too strong. It is also unsatisfactory that there is no explicit dependence on the gap between these two quantities in the final expression of the tail bound.
- There are some technical details that might not be totally correct, or need further clarification. (See Questions below.)
- Practically, the authors only compare their method with conventional bagging in terms of the mean test error, rather than the tail probability. I believe it makes much more sense to compare the tail probability with bagging.
Questions
Q1. If the condition on $p_k^{\max}$ fails, what can be said?
Q2. What are reasonable conditions under which the condition on $p_k^{\max}$ holds?
Q3. In Figs. 1 and 2, can you add the tail probability for bagging?
Q4. In Example 1.1 and/or the linear regression example, can you provide the corresponding tail bound for bagging? Is there any substantial improvement by applying your proposed method?
Q5. In Algorithm 2, can you specify ``Choose $\epsilon$ using the data...''? How is $\epsilon$ picked practically such that it satisfies the conditions in Theorem C.11, which seem hard to check?
Q6. In Theorem C.11, it is unclear to me in which ways the relevant quantities go to infinity as the sample size goes to infinity, respectively. It seems problematic in some cases: the tail probability on the right-hand side of Eq. (34) can remain bounded away from zero, in which case the bound is trivially vacuous. How can the left-hand side converge to 0?
Q7. This is a minor comment. In Technical Novelty, the authors mention ``a sharper concentration result for U-statistics with binary kernels, improving upon standard Bernstein-type inequalities.'' I believe the KL divergence type concentration inequality is not new. In D.2, once it comes to the step where the quantity can be viewed as a binomial random variable, the rest is known. The novelty of the work might be the application of this result to U-statistics with binary kernels.
Limitations
Yes. (I do not foresee any negative societal impact of the work.)
Justification for Final Rating
The proposed algorithm is practically relevant, and the authors provide theoretical justification under reasonable assumptions. My initial rejection was targeting the unrealistically strong statement in the paper. As the authors have agreed to revise accordingly, I will raise my score. Since reviewers are advised to assign borderline scores sparingly and this year there are only 6 scores in total, I gave the paper a Reject first, but now I change it to an Accept.
Formatting Issues
None.
W1: Hard to verify the condition for a given algorithm and a given $\epsilon$.
A: Although it can be nontrivial to verify the condition, since it is not as explicit as moment or structural conditions, we chose not to use more explicit conditions in order to keep the theorem lightweight while still conveying enough insight into the behavior of the algorithm. In particular, the theorem clearly shows the difference our algorithm makes: our explicit bound (10) demonstrates how the tail of the base learner is improved by the ensemble. We believe this strikes a good balance between technical complexity and the insightfulness of the theorem.
W2: Restrictiveness of the condition
A: Thanks for pointing this out. This condition can be relaxed. We use the more restrictive condition in Theorem 2.1 to streamline the presentation. See Theorem C.8 for the original version of this condition in equation (28), where we compare the global mode of the solution space with the mode among all non-$\epsilon$-optimal solutions. We can readily replace the condition in Theorem 2.1 with this relaxed condition without affecting its validity. The relaxed condition is satisfied by the example provided by the reviewer.
W4: Comparing tail probability with bagging
Percentiles of test performance
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 6.35 | 15.09 | 31.57 | 68.878 | 154.42 | 381.5 |
| ROVE | 1.246 | 2.837 | 3.53 | 4.604 | 5.342 | 7.379 |
| ROVEs | 1.806 | 3.042 | 3.87 | 4.243 | 4.649 | 6.516 |
| Bagging | 3.914 | 7.715 | 11.952 | 30.704 | 41.481 | 114.48 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 1.002 | 2.173 | 2.911 | 4.214 | 5.041 | 5.21 |
| ROVE | 0.668 | 0.806 | 0.93 | 0.984 | 1.051 | 1.16 |
| ROVEs | 0.699 | 0.801 | 0.877 | 0.978 | 1.061 | 1.107 |
| Bagging | 0.444 | 0.561 | 0.637 | 0.81 | 1.621 | 3.002 |
Q1: If the condition on $p_k^{\max}$ fails, what can be said?
A: We do not have a bound for the tail in this case.
Q5: In Algorithm 2, can you specify "Choose $\epsilon$ using the data..."? How is $\epsilon$ picked practically such that it satisfies the conditions in Theorem C.11, which seem hard to check?
A: Thank you for the question. It is nontrivial to pick $\epsilon$ to satisfy the technical condition in Theorem C.11. In practice we use an adaptive strategy for choosing $\epsilon$. Yet, we deferred it to Appendix F.3 (page 42) because of the space limitation. On a high level, the proposed procedure aims to find an $\epsilon$ that makes it more likely for the true optimal solution to be included in the "near optimum set", instead of being ruled out by suboptimal solutions. We also kindly refer the reviewer to Figures 4, 5, and 8 in Appendix F.3 for more experiments that compare the algorithm's performance under various choices of $\epsilon$. Regardless of potential violations of the conditions, the practical performance of this adaptive strategy is satisfactory.
Q6: Growth rates of the parameters to achieve convergence
A: The reviewer is correct that these quantities need to grow in a certain way to make the bound vanish; when applying our method they are typically chosen to be of comparable order.
Q7: In Technical Novelty, the authors mention "a sharper concentration result for U-statistics with binary kernels, improving upon standard Bernstein-type inequalities." I believe the KL divergence type concentration inequality is not new. In D.2, once it comes to the step where the quantity can be viewed as a binomial random variable, the rest is known. The novelty of the work might be the application of this result to U-statistics with binary kernels.
A: The reviewer is correct about the KL divergence type concentration. The novelty indeed lies in the application of this result to U-statistics with binary kernels.
Extra response to reviewer's questions
Q1: If the condition on $p_k^{\max}$ fails, what can be said?
A: We are not able to obtain a bound for the tail in this case. In practice, this means the base learner cannot identify $\epsilon$-optimal solutions often enough, and our ensemble method is not guaranteed to have exponential decay for the tail at this $\epsilon$. However, exponential decay is still guaranteed by our method for larger values of $\epsilon$ at which the condition holds. Therefore, our method does not necessarily improve the average performance of the base learner, but does improve its tail.
Q2: Conditions implying the key condition
A: As responded above, we chose not to pursue further explicit conditions implying the condition on $p_k^{\max}$. In general, if the base learner admits some kind of concentration property, one should be able to derive a threshold for $\epsilon$ based on the concentration such that the condition holds. But again, this typically involves unnecessary extra complexity and does not provide deeper understanding of the algorithm.
Q4: Tail bound of bagging for Example 1.1 and/or the linear regression example
A: We focus on Example 1.1. First of all, bagging is more for machine learning and is not a standard procedure for stochastic programming. A natural implementation of bagging in stochastic programming is to repeatedly solve randomly subsampled optimization problems (e.g., SAA) and then take the average of their optimal solutions as the final solution estimate, i.e., $\bar{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_b$, where each $\hat{\theta}_b$ is the optimal solution of the SAA on the $b$-th randomly drawn subsample.
In this particular problem, we find it more meaningful to compare the expected excess risk of bagging with that of our method. For bagging, the expected excess risk admits only a polynomially decaying bound, obtained via the inequality in equation (3) applied with the subsample size used in bagging. According to Corollary 2.2, the expected excess risk of MoVE can be bounded by an exponentially decaying quantity whenever the condition of the corollary holds.
Therefore our bound has an exponentially decaying excess risk whereas bagging has a polynomially decaying one. Note that although our theory only guarantees exponential decay for the risk tail, for discrete problems like this example exponential tails can possibly translate into exponentially decaying expected risk.
For the tail comparison, since this problem leads to a binary solution estimate from the SAA and the bagging U-statistic consists of a bounded kernel, one can invoke established concentration inequalities for U-statistics to obtain an exponential tail bound for the bagged solution. Such an exponential bound, however, only holds beyond a threshold that itself decays polynomially.
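To make the comparison concrete, the following is a minimal sketch of the bagging baseline just described, under the assumption that a routine `solve_saa` returning the optimal solution vector of an SAA is available (the name is illustrative):

```python
import numpy as np

def bagged_saa(data, solve_saa, k, B, rng=None):
    """Bagging for stochastic programming as described above: solve the SAA on
    B random subsamples of size k and average the resulting optimal solutions."""
    rng = rng or np.random.default_rng()
    sols = [solve_saa(data[rng.choice(len(data), size=k, replace=False)])
            for _ in range(B)]
    return np.mean(sols, axis=0)  # the average may leave a discrete feasible set
```

As the final comment notes, the averaged solution may not even be feasible for a discrete problem, which is one reason model-level voting as in MoVE is the more natural aggregation there.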
Q7: In Technical Novelty, the authors mention "a sharper concentration result for U-statistics with binary kernels, improving upon standard Bernstein-type inequalities." I believe the KL divergence type concentration inequality is not new. In D.2, once it comes to the step where the quantity can be viewed as a binomial random variable, the rest is known. The novelty of the work might be the application of this result to U-statistics with binary kernels.
A: The reviewer is correct about the KL divergence type concentration. The novelty indeed lies in the application of this result to U-statistics with binary kernels. This application, however, is nontrivial in that a U-statistic consists of non-i.i.d. statistics computed on overlapping samples, so the KL divergence type concentration cannot be directly applied. We address this problem by bounding the U-statistic's moment generating function with that of the mean of statistics computed on disjoint batches through Jensen's inequality (our Lemma C.3) and applying the concentration to the latter.
I was assigned several highly theoretical papers, and it took me some time to complete the reviews. I appreciate the authors' patience. Their response is helpful in clarifying their ideas and provides additional empirical support.
Upon further review of the paper and the authors' response, I believe the most unacceptable point for me is the statement that "This tail enhancement power of ensembling is agnostic to the underlying base learner..." The authors seem to suggest that their aggregation method can be applied to any base learner. But this is not the case. The crucial condition (or its relaxed version) reflects a desirable property of the base learner: the mode model is not a bad one in terms of prediction performance. In this case, the proposed method will be effective in controlling the tail probability. But if the base learner is not able to generate a good prediction with high probability, then there is no guarantee for the method. It is not hard to construct pathological examples where the condition is not satisfied.
With this understanding, if we do not consider the worst case with an arbitrary base learner, then the above condition seems reasonable for many modern machine learning algorithms. On the one hand, this is good, as the results are relevant for applied machine learning. On the other hand, the results are not provably valid for an arbitrary algorithm (because the condition is still too strong in the most general case). But overall, the practical relevance of the work might outweigh its theoretical limitations. If the authors agree to revise the statement and properly address this issue, especially in the Main Intuition and Limitations sections, then I’d consider increasing my score.
We appreciate your positive feedback on the technical condition of our Theorem 2.1 and the practical relevance of our work. We understand that the (relaxed) condition might still be strong in extreme cases, and we are happy to make the scope of our algorithm clearer. Since our relaxed condition effectively means that all mode models (i.e., all models attaining the largest probability under the sampling distribution of the base learner) are $\epsilon$-optimal, we will make the following changes:
- In the abstract, change "This tail enhancement power of ensembling is agnostic to the underlying base learner..." to "This tail enhancement power of ensembling applies to base learners that have reasonable predictive power to begin with..."
- In lines 41-42, change "The theoretical framework and methodology proposed in this paper are agnostic to the choice of learning algorithms." to "The theoretical framework and methodology proposed in this paper work for learning algorithms that meet a formally characterizable performance criterion."
- In line 77 of the "Main Intuition" section, between "..hence admits exponential decay in the tail." and "To illustrate...", insert "Consequently, base learners with modes near the optimal model receive an exponential enhancement in their tail behavior." This is where we make a precise connection to the condition.
- In lines 300-301, change "...a meta algorithm that acts on any learning algorithm to provide exponential tail bounds regardless of the underlying problem characteristics." to "...a meta algorithm that provides exponential tails as long as the base learning algorithm possesses reasonable predictive performance as characterized in our Theorem 2.1."
- At the end of the Conclusion and Limitations section, add "Moreover, the tail guarantee of our method requires the mode of the sampling distribution of the base learner to be a reasonably good model."
We hope these revisions address the reviewer's remaining concerns regarding our technical condition. We remain open to incorporating any additional suggestions the reviewer might have.
I am satisfied with the revisions and will increase my score.
However, the theoretical results no longer apply to arbitrary base learners, and the crucial condition is difficult to verify quantitatively. As a result, the theoretical contributions of the paper are primarily of qualitative value. Therefore, I cannot strongly recommend it for publication.
Thank you for recognizing our contributions and for increasing our score! In addition to the proposed revisions above, we would like to offer a "dual" perspective on the technical condition (or its relaxed form), which may help further explain our theoretical contribution.
Our proposed revision was framed around the question: given a fixed accuracy threshold $\epsilon$, what properties must the base learner satisfy for the condition to hold? As explained in our previous response, this requires that the mode models of the base learner be close to the optimal model, which leads to the impression that "the theoretical results no longer apply to arbitrary base learners, and the crucial condition is difficult to verify quantitatively", as concluded by the reviewer.
However, we can also consider a dual perspective: given a fixed base learner, what values of $\epsilon$ satisfy the condition? As we noted in our initial rebuttal, a sufficiently large $\epsilon$ will always satisfy the condition, regardless of the base learner. In this sense, our theoretical guarantees apply broadly to arbitrary base learners, with the caveat that the tail improvement applies specifically to the upper (i.e., high-risk) tails of the excess risk distribution.
This dual view further highlights the practical relevance of our method: it provides guarantees precisely where users care most—at the high-risk end of the distribution, where extreme losses occur.
We hope this additional clarification further strengthens both the theoretical foundation and the practical utility of our approach.
Thanks. The further discussion is still qualitative. Quantitatively, there is no way to tell what nontrivially counts as a "sufficiently large" $\epsilon$ for an arbitrary base learner in an arbitrary problem. This doesn't change my view of the paper: the algorithm itself is practically relevant, and the theory provides further justification under reasonable assumptions.
We sincerely thank the reviewer for engaging in this further discussion and for recognizing both the practical relevance of our algorithm and the value of the theory under reasonable assumptions.
While it is generally hard to find a theoretical threshold for qualifying values of $\epsilon$ for the condition, in practice, if such a threshold is really desired, one can always attempt to estimate the relevant probabilities in the condition by bootstrapping. That is, one pretends that the empirical data distribution is the exact true distribution (so that the excess risk of a model is just its empirical loss minus the minimum empirical loss), and repeatedly trains base learners on resampled data of the subsample size to generate bootstrap samples. Note that such a bootstrapping procedure is exactly part of what our algorithm does, and hence comes with our algorithm almost as a byproduct. It will be interesting to formalize this idea in a future work.
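A rough sketch of this bootstrap check, assuming discrete (hashable) models and illustrative routine names `train_on` (the base learner) and `emp_loss` (full-sample empirical loss), might look as follows:

```python
from collections import Counter
import numpy as np

def bootstrap_condition_check(data, train_on, emp_loss, k, B, eps, rng=None):
    """Treat the empirical distribution as the truth, retrain on B resamples of
    size k, and estimate (i) the mode probability of the base learner's sampling
    distribution and (ii) the probability of producing an eps-optimal model."""
    rng = rng or np.random.default_rng()
    models = [train_on(data[rng.choice(len(data), size=k, replace=True)])
              for _ in range(B)]
    p_max = Counter(models).most_common(1)[0][1] / B
    best = min(emp_loss(m) for m in models)  # proxy for the minimum empirical loss
    p_eps = float(np.mean([emp_loss(m) <= best + eps for m in models]))
    return p_max, p_eps  # roughly, eps qualifies when p_max > 1 - p_eps
```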
Dear Reviewer pCzZ,
Thank you again for your valuable time and insightful review. As the author-reviewer discussion deadline approaches, we would like to respectfully ask whether our responses and additional experiments addressed your concerns. If you have any further comments, we would greatly appreciate your feedback.
Best regards, The authors
This paper introduces a novel ensemble learning method comprising MoVE (Majority Vote Ensembling) and ROVE (Retrieval and $\epsilon$-Optimality Vote Ensembling), which leverages subsampling and model selection strategies to achieve exponential decay in excess risk, claiming to significantly outperform traditional ensembling techniques, which have polynomial decay, particularly on heavy-tailed data distributions. By training multiple base learners on data subsets and combining them through majority voting or likelihood-based retrieval, the approach ensures stronger generalization even when individual learners exhibit only polynomial decay in excess risk. Theoretical analysis confirms the presence of exponential tails in the generalization error. Experiments were conducted on: the MLP regression problem with synthetic data; four discrete stochastic programs: resource allocation, supply chain network design, maximum weight matching, and stochastic linear programming; and one continuous: mean-variance portfolio optimization, showing improved results.
Strengths and Weaknesses
One-line summary: the paper is strong in theoretical contributions but, at least to me, not totally convincing in terms of practicality. My initial rating is based on my assessment that the theoretical contributions slightly outweigh the practical concerns.
Strengths
- This is a paper that is very solid in terms of theoretical contributions but well written in such a way that the audience can manage to follow. Example 1.1 is particularly useful in motivating what the paper mainly contributes: even if base learners have polynomial decay in excess risk, the classical bagging approach would still result in polynomial decay, but with the paper's proposed approaches one can obtain exponential decay.
- Apart from the main theoretical contribution, which is exponential decay instead of polynomial decay, there are a number of smaller theoretical contributions made in deriving it: a sharper concentration result for U-statistics with binary kernels improving upon Bernstein-type inequalities, an analysis of regret sensitivity, and a uniform law of large numbers (LLN) for $\epsilon$-optimal events.
- Experimental results support the theoretical claims.
Weaknesses
- Although the main claim is about going from polynomial decay to exponential decay for the excess risk, I do not think this is too big of a claim because it is well known that in classification we have exponential decay. I may be wrong, but perhaps in regression the best bounds have not approached exponential decay yet and this paper fills in that missing link.
- In practice, when it comes to ensemble learning these days it is very popular to use gradient boosting (e.g., XGBoost or LightGBM implementations) for both regression and classification problems, as it has shown success on many benchmarks. Gradient boosting is cited in the paper but nowhere to be found in the experiments, which is a big surprise to me. I think gradient boosting should have served as a strong baseline to compare against, alongside any bagging approach. Why? Because problem-wise, these approaches solve the same problem. In contrast, bagging, and even advanced bagging approaches like random forests, are no longer very popular, as they are not as performant as gradient boosting. Gradient boosting approaches have a difference, maybe a strength (though I may be wrong), over the proposed MoVE/RoVE/RoVEs approaches here in that they build new base learners based on the performance of previous base learners, whereas the approaches here build all base learners at once initially. Although the theory side may be lacking, intuitively the strategy that gradient boosting adopts may lead to a faster exponential decay rate.
- Although the idea of majority voting sounds interesting at first, it seems to come with more drawbacks than benefits. In order to do majority voting, you have to discretize the parameter space $\Theta$. In this deep learning era where model weights are in the order of billions, discretizing the parameter space does not sound practical. Initially, MoVE only works with a discrete $\Theta$. The authors have come up with the $\epsilon$-optimality notion and derived RoVE/RoVEs to work with a continuous $\Theta$, which appears to be an attempt to "cluster" the continuous $\Theta$ into a countable set of models that the algorithms can manage. However, when accuracy becomes highly demanded, one has to bring $\epsilon$ very close to 0. But that might exponentially explode the number of discrete models that the algorithms have to train. Wouldn't that mean the ensemble size also has to be increased exponentially? The proposed theory treats the ensemble size essentially as a constant, which, I think, has a tendency to hide the complete view. The ensemble size deserves a deeper analysis, especially when $\epsilon$ becomes very close to 0.
Questions
- As claimed by the authors, classification is not addressed here due to the cross-entropy classification loss. This is somewhat reasonable, but I am not entirely convinced. Heavy-tailed distributions can show up when the number of classes becomes very large, e.g., thousands, and we change our viewpoint to the distribution of the classes. In such a case it would be very interesting to see how the proposed approaches perform. Do you have any comment on this point?
- I have a suggestion for mixing the approaches here with gradient boosting, which may be worth exploring. You can use MoVE/RoVE/RoVEs as the base learner at each iteration of gradient boosting, limiting it to a small, manageable size at each iteration. In such a way, you have a stronger base learner than the decision stumps that gradient boosting approaches are using. It would be interesting to see if such a mixed approach could lead to a result that is better than both vanilla gradient boosting and MoVE/RoVE/RoVEs.
Limitations
N/A
Justification for Final Rating
The rebuttal has made the paper more clear. I have read other reviews and rebuttal. I have revised my score accordingly.
Formatting Issues
N/A
Thank you very much for your detailed review and positive feedback. We hope our replies below can provide further clarity and address your concerns. Please feel free to let us know if you have any further questions.
W1: Significance in going from polynomial to exponential decay
A: It is correct that in classification the widely used cross-entropy loss is inherently light-tailed, but in regression the loss function can be heavy-tailed, and our method improves the tail decay from polynomial to exponential. There are existing methods, such as MOM and truncation-based methods as discussed in Section 4, that can provide such tail enhancement under certain structural conditions on the model class or for certain types of loss functions, but our method is the first that is free of these restrictions.
Q1: Classification with many classes
A: In classification with an infinite number of classes, the distribution of the loss is exactly the distribution of the (negative) log probabilities of the classes. Therefore in theory the loss can be heavy-tailed if the log probabilities are heavy-tailed, but in practice it may not be that likely for a log probability to be heavy-tailed.
W2: Comparison to modern ensemble methods
A: We have added a new set of experiments comparing our approach with popular existing tree ensembles, including Random Forest (RF), Gradient Boosted Decision Trees (GBDT), and XGBoost (XGB). Decision trees are used as the base learner for all methods. Random forests are constructed with the same number of decision trees as our methods to ensure a fair comparison. GBDT and XGB are constructed with early stopping to avoid overfitting. Results on percentiles of the test errors are as follows:
Comparison with tree ensembles in terms of tail performance
For the same synthetic data as described in lines 209-212
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 19.664 | 38.005 | 81.457 | 193.69 | 492.9 | 2035.014 |
| ROVE | 3.455 | 4.431 | 4.943 | 5.569 | 6.153 | 6.323 |
| ROVEs | 2.309 | 3.056 | 3.4 | 3.978 | 4.261 | 4.547 |
| RF | 1.016 | 2.683 | 4.617 | 6.5 | 32.222 | 210.42 |
| GBDT | 0.787 | 1.825 | 3.415 | 6.81 | 30.019 | 87.736 |
| XGB | 1.635 | 5.092 | 9.101 | 21.094 | 52.973 | 105.72 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 2.55 | 3.436 | 4.734 | 6.184 | 9.825 | 24.89 |
| ROVE | 0.927 | 1.116 | 1.238 | 1.329 | 1.489 | 1.604 |
| ROVEs | 0.665 | 0.832 | 0.907 | 0.987 | 1.035 | 1.094 |
| RF | 0.121 | 0.181 | 0.249 | 0.326 | 0.757 | 2.789 |
| GBDT | 0.114 | 0.184 | 0.246 | 0.324 | 0.819 | 1.057 |
| XGB | 0.289 | 0.573 | 1.082 | 2.279 | 3.711 | 6.764 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 0.845 | 0.994 | 1.138 | 1.315 | 1.599 | 2.415 |
| ROVE | 0.389 | 0.448 | 0.502 | 0.535 | 0.62 | 0.65 |
| ROVEs | 0.289 | 0.353 | 0.382 | 0.41 | 0.431 | 0.44 |
| RF | 0.044 | 0.053 | 0.062 | 0.073 | 0.102 | 0.189 |
| GBDT | 0.043 | 0.06 | 0.074 | 0.087 | 0.105 | 0.116 |
| XGB | 0.103 | 0.173 | 0.248 | 0.421 | 0.588 | 0.913 |
For a new synthetic dataset where each covariate is uniformly distributed and the noise terms are Pareto
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 46.095 | 103.303 | 193.074 | 364.772 | 640.562 | 1287.311 |
| ROVE | 17.683 | 18.981 | 19.483 | 20.908 | 21.484 | 22.072 |
| ROVEs | 17.907 | 19.262 | 20.444 | 21.171 | 21.715 | 23.893 |
| RF | 6.518 | 10.507 | 18.519 | 34.58 | 46.284 | 79.079 |
| GBDT | 21.61 | 40.734 | 59.422 | 87.771 | 136.325 | 232.915 |
| XGB | 31.989 | 46.385 | 72.751 | 207.12 | 496.714 | 596.137 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 6.341 | 7.251 | 7.924 | 9.374 | 12.617 | 18.885 |
| ROVE | 8.975 | 9.248 | 9.413 | 9.501 | 9.563 | 9.683 |
| ROVEs | 10.95 | 11.299 | 11.506 | 11.602 | 11.706 | 11.821 |
| RF | 4.18 | 4.274 | 4.322 | 4.477 | 4.583 | 4.742 |
| GBDT | 2.699 | 4.052 | 6.68 | 9.019 | 12.409 | 18.297 |
| XGB | 5.77 | 8.136 | 10.154 | 13.496 | 17.501 | 28.473 |
For a new synthetic dataset where each covariate is uniformly distributed and the noise terms are Pareto
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 35.753 | 84.013 | 149.819 | 267.985 | 423.76 | 665.424 |
| ROVE | 5.969 | 7.133 | 7.808 | 8.481 | 9.272 | 9.704 |
| ROVEs | 4.633 | 5.347 | 5.902 | 7.043 | 7.916 | 10.408 |
| RF | 2.967 | 7.792 | 14.225 | 22.411 | 36.374 | 62.494 |
| GBDT | 6.247 | 15.913 | 31.117 | 50.551 | 72.74 | 178.967 |
| XGB | 3.636 | 6.206 | 8.579 | 13.917 | 25.043 | 27.09 |
- , Pareto shape =
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 1.944 | 2.579 | 3.479 | 4.35 | 5.367 | 7.477 |
| ROVE | 0.889 | 0.997 | 1.065 | 1.11 | 1.149 | 1.264 |
| ROVEs | 0.726 | 0.824 | 0.874 | 0.936 | 1.023 | 1.049 |
| RF | 0.165 | 0.245 | 0.307 | 0.377 | 0.529 | 0.637 |
| GBDT | 0.732 | 0.924 | 1.129 | 1.431 | 1.659 | 2.511 |
| XGB | 0.331 | 0.443 | 0.568 | 0.687 | 0.875 | 1.007 |
These results show that our methods outperform the base learner in almost all cases, and also outperform RF, GBDT, and XGB, especially at high percentiles, when the tail is heavy (e.g., with Pareto shape 1.5). In less heavy-tailed cases RF, GBDT, and XGB may outperform our methods.
Q2: Mixing our method with gradient boosting
A: Thank you for your insightful comments. We greatly appreciate your suggestion regarding integrating our approach with gradient boosting. We agree that using the base learner selected by MoVE/RoVE/RoVEs to replace the plain base learner can possibly strengthen the overall model compared to the two individual approaches. We will carefully consider this mixed strategy in our future study.
W3: Exponentially growing ensemble size?
A: We have rigorously investigated how the accuracy of the retrieved model set (measured by the excess risk of the best model in the set) scales with the ensemble size. Our result (Lemma C.12) in line 1021 shows that, in order to retrieve an $\epsilon$-optimal model with high probability, the ensemble size can be set to the order of the reciprocal of the probability that a single subsample yields an $\epsilon$-optimal model. This is straightforward to understand: each model trained on a random subsample is $\epsilon$-optimal with exactly this probability by the definition of the excess risk tail, and if the subsamples are approximately independent, the number of subsamples needed to obtain an $\epsilon$-optimal model is approximately geometrically distributed with this probability as the success rate. This probability typically does not approach zero exponentially fast as $\epsilon$ decreases, so the required ensemble size does not need to increase exponentially.
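This geometric argument can be spelled out in a few lines; the snippet below is an illustration of the scaling claim (with `p_eps` standing for the assumed probability that one subsample yields an $\epsilon$-optimal model), not part of our algorithm:

```python
import numpy as np

def ensemble_size_needed(p_eps, confidence=0.99):
    """If each subsample independently yields an eps-optimal model with
    probability p_eps, then B subsamples all miss with probability
    (1 - p_eps)**B, so B ~ log(1 - confidence) / log(1 - p_eps) = O(1/p_eps)
    suffices; the required size is linear in 1/p_eps, not exponential."""
    return int(np.ceil(np.log(1 - confidence) / np.log(1 - p_eps)))

# e.g., ensemble_size_needed(0.1) -> 44 and ensemble_size_needed(0.01) -> 459
```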
I thank the authors for their rebuttal. It has improved the clarity of the paper. I will increase my score now.
We are glad that our reply addresses your concern. Thank you for increasing the score and the positive view on our paper!
This paper investigates a new ensemble learning approach that leverages subsampling to dramatically improve the tail behavior of generalization error. In contrast to classical bagging (which reduces variance by averaging outputs of multiple models), the authors propose to aggregate at the parameter/model level rather than just the output level. Specifically, they train many models on random subsamples of the data and then select a final model via a majority-vote scheme across those subsamples. The main theoretical result is that this procedure yields an exponentially decaying tail for the excess risk (generalization error), even when each individual "base" learner only achieves a polynomial (slow) tail decay. In other words, with high probability the ensemble's error is as small as the best model's, providing a form of oracle-level performance guarantee. This improvement is agnostic to the base learning algorithm: it holds for standard empirical risk minimization as well as more robust schemes (e.g., distributionally robust optimization or regularized methods). The paper introduces a family of methods (named MoVE, ROVE, and ROVEs) to implement this idea, and demonstrates in experiments that these ensemble methods significantly improve out-of-sample performance on tasks with heavy-tailed data or inherently slow convergence rates. The authors also provide open-source code for their ensemble algorithms, supporting reproducibility.
Strengths and Weaknesses
Strengths: The paper presents a new ensemble learning framework that significantly enhances theoretical approaches for learning with heavy-tailed data distributions, a complex scenario frequently encountered in safety-critical fields like finance and medicine, and in systems influenced by noisy or adversarial inputs. The authors show that the suggested ensemble method can turn polynomially decaying generalization tails into exponentially decaying ones. This is a big step forward in learning guarantees, not just a small one, and it is especially important for situations where high-confidence reliability matters most. The method is also base-learner agnostic, which means it could be a powerful and versatile "wrapper" that makes a number of existing learning algorithms more resilient. The work is well organized and written in formal academic language, with well-defined terms (such as "excess risk" and "tail decay rate") and descriptive names for the approaches (MoVE, ROVE, ROVEs) that make their purpose apparent. Each approach is also put in context by explaining what it is for, and the theoretical ideas are backed up with clear explanations and strong visuals, such as figures and examples, that help make sense of how the methods work in heavy-tailed situations.
Weaknesses: The paper has some good points, but it also has some problems with readability, real-life usefulness, and broader applicability. The main idea behind the method is to choose a model by majority vote, but the way it is presented could be misleading: many readers will think of "majority voting" in the traditional sense (i.e., voting on predictions) instead of voting on the models themselves. This unusual framing, which comes without an early, step-by-step pseudocode or algorithm description, makes it harder to understand. Also, important details like how to choose the number of subsampled models or the best subsample sizes are not stressed in the main text and may be hidden in the appendix, which makes it tougher for readers to use or reproduce the strategy in practice. The method's practical applicability may be confined to specific contexts, with its benefits being most evident in situations characterized by heavy-tailed loss distributions or suboptimal convergence behavior. In typical ML applications characterized by sub-Gaussian noise or well-behaved distributions, the ensemble may provide minimal to no advantage compared to simpler bagging or robust loss techniques. Also, the predicted exponential improvements can be asymptotic, which means that we do not know whether real advantages appear at realistic sample sizes. In general, the method looks good on paper, but it might not be used in real-world ML pipelines until it is tested further, made simpler, and given clearer instructions for how to use it.
Questions
- Clarification of the "Majority Vote" Process: Could you please explain how the model selection process works in more specific terms, maybe by using simple algorithmic pseudo-code? This part was a little perplexing. For example, if the model parameter space is continuous, how does a "vote" work? It would help readers and practitioners use the procedure correctly if it were made clear whether this means picking the model that does best on most subsample runs, and how ties or near-ties are handled. What criteria on the model class cause several subsamples to converge to the same "majority" model?
- Assumptions for guarantees of exponential tails: The theory says that the tails of the excess risk will decrease exponentially. What assumptions about the data or the base learner must be true for this guarantee to hold? For example, do we need a finite second moment or only a weaker moment condition on the loss distribution? If the data has a very heavy tail (for example, infinite variance), does the ensemble still get an exponential tail, or would the rate improvement be limited? It would be beneficial for the paper to clearly describe the criteria (e.g., "the loss belongs to a certain class" or analogous statements) under which Theorem 1 (or the principal conclusion) is valid. This tells readers when it is okay to use the method.
- Cost and scalability of computing: The ensemble trains many models on subsamples. What advice can you give on how to choose the ensemble size and the subsample fraction in practice? The experiments probably employ some defaults, but it would be helpful if the paper discussed the trade-offs (more models for greater tail assurance vs. higher computation). Also, have you tried the method on bigger problems, such as deep neural networks or big datasets? If not, do you foresee any problems (like optimization issues or too much variation in the outcomes across subsamples) when scaling up? It would be helpful to include some examples or explanations of how performance improves as the ensemble size grows.
- Comparison to Other Robust Methods: How does the subsampled ensemble stack up against other robust methods for heavy tails, such as applying a robust loss function (like Tukey's loss or trimmed-mean aggregation) or the median-of-means applied directly to outcomes? Kwon et al. (2021) specifically suggested robust model selection utilizing MOM; how does your methodology differ or exhibit greater generality? A clear explanation of how this method is different from median-of-means or bootstrap-of-means methods would make the novelty clearer. For example, can a basic median-of-means estimator on the losses fail to produce exponential tails while your model-selection method works? Such information would help readers know when to choose the suggested ensemble over other robust estimators.
- Things to think about in the real world and examples of how to use it: The paper centers on heavy-tailed synthetic or controlled trials. Can you give any instances of real-world datasets or areas where you think base learners will only have polynomial tail decay and will benefit a lot from your method? For example, would this be useful for reinforcement learning (where returns can be heavy-tailed), financial risk models, or any other big benchmarks? It would also be helpful to talk about the ensemble's failure modes, such as when subsampled voting might impair performance or not produce an exponential tail (for example, if the subsamples are too small to train usable models, or if the data are not heavy-tailed at all). This kind of advice would help practitioners who are thinking about adopting this ensemble technique.
Limitations
The main problem with this approach is that it takes a lot of computing power. To get the exponential tail bound, you need to train and test many models on different data subsets, which can be costly. This could make training take much longer in settings like deep learning than it would with a single model. Another problem is that the theory only looks at the worst-case tails of the error distribution. When the ensemble targets the best performance in the worst case (high-confidence bounds), it might give up some average-case accuracy or introduce bias; for example, it might choose a model that is safer but not quite as good on average. Users should know that the strategy works best when there are genuine concerns about heavy tails; otherwise, a smaller ensemble or a single model can be enough. There is also a statistical limitation: the method assumes that we can accurately find the "best" model by looking at how well models do on subsamples. If the number of subsamples is too small or the subsample performance is sufficiently noisy, the majority vote could choose a model that is not good enough. In rare circumstances, such as when the subsamples are very small or the model classes are very complicated, the ensemble could fail to work better and could potentially make things worse because the voting results are unstable. These subtleties imply that the method requires meticulous adjustment of the ensemble size and the subsample size, and perhaps validation on a hold-out set, if feasible.
Justification for Final Rating
I have looked at the rebuttal and am somewhat satisfied with the discussion, and hence improve my score.
Formatting Issues
None
Thank you very much for your detailed review and positive feedback. We hope our replies below can provide further clarity and address your concerns. Please feel free to let us know if you have any further questions.
Q1: Confusing usage of the term "majority vote"
A1: We will update the relevant part of the introduction as follows. To avoid the confusion between our procedure and the widely known majority vote in ensemble classification, we will no longer use the phrase "majority vote" at the outset, but first describe the procedure in plain English, e.g., "selecting the model that is most frequently output by the base learner", and then identify it as a counterpart of the widely known majority vote for selecting models (as opposed to classes). This applies to a discrete model space. If the reviewer still finds it misleading, we are happy to replace the phrase "majority vote" with more distinguishable terms such as "mode finding" throughout. For a general (including continuous) model space, the vote is preceded by a step that retrieves a finite set of high-quality models to vote on, thereby addressing the "each model appearing at most once" issue in directly copying the procedure from the discrete to the general case.
A short description of the procedure is to select the model that does best on most subsample runs, as the reviewer suggests, but the notion of "best" here is determined by what the base learner attempts to optimize. Tie-breaking can be done arbitrarily, as mentioned in line 99, and convergence theory is provided in Theorem 2.1. A compact sketch follows below.
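Since the reviewer asked for pseudo-code, here is a compact Python sketch of the retrieve-then-vote procedure; the names (`train_on`, `loss`) are illustrative stand-ins rather than our actual implementation, and the data-splitting variant (ROVEs) would simply draw the Phase I and Phase II subsamples from disjoint halves of the data:

```python
import numpy as np

def rove_sketch(data, train_on, loss, k, B1, B2, eps, rng=None):
    """Phase I retrieves candidate models from B1 subsamples of size k;
    Phase II gives each of B2 subsamples one vote for every candidate that is
    eps-optimal on it; the candidate with the most votes wins."""
    rng = rng or np.random.default_rng()
    sub = lambda: data[rng.choice(len(data), size=k, replace=False)]
    candidates = [train_on(sub()) for _ in range(B1)]  # Phase I: retrieval
    votes = np.zeros(len(candidates))
    for _ in range(B2):                                # Phase II: eps-vote
        S = sub()
        losses = np.array([loss(m, S) for m in candidates])
        votes += losses <= losses.min() + eps
    return candidates[int(np.argmax(votes))]           # ties broken arbitrarily
```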
Q2: Assumptions for guarantees of exponential tails
A2: For our key result Theorem 2.1, the only required technical condition is the one stated in line 116. An informal interpretation of this condition is that the base learner generates $\epsilon$-optimal solutions more often than other solutions (but not necessarily exponentially more often). We chose not to use more specific conditions, such as moment conditions on the loss and structural conditions on the model class, that imply our condition, in order to 1) make the theorem more general and 2) show a direct contrast between the tail of the base learner and that of our ensemble method. Indeed, our condition will hold for large enough $\epsilon$ and sample size as long as the base learner is weakly consistent, i.e., converging to optimal solutions in probability as the sample size grows. Weak consistency is a rather weak requirement that does not need finite high-order moments. On the other hand, our explicit bound (10) clearly demonstrates how the tail of the base learner is improved by the ensemble.
Q3: Cost and scalability of computing
A3: Like those of other machine learning algorithms, in practice the hyperparameters (the subsample size and the ensemble size) can be tuned via techniques such as cross validation. On a technical level, the subsample size controls a tradeoff between the bias and the tail of the method: using a large subsample size does not inflate the bias against the ground truth but also may not be able to shrink the tail, due to the small ratio between the data size and the subsample size, whereas a small subsample size helps shrink the tail but may enlarge the bias. As for the ensemble size, increasing it typically improves the performance, but our bound also shows that beyond a certain point a larger ensemble provides no further benefit. We will update the relevant discussion in lines 133-138. We also kindly refer the reviewer to Appendix F.3 for our configuration recommendation that works well for most problems we consider in the experiments. Regarding the computation cost of repeated training, we argue that this is common to all ensemble methods, and it is less significant for our method because subsample training is parallelizable. See Figure 7 for an experiment on the effect of the ensemble size on the performance.
Q4: Comparison to other robust methods
A4: We have provided a comprehensive comparison between our method and robust techniques such as MOM and truncation-based methods in lines 289-303. As discussed there, MOM approaches require squared loss and/or convex function classes, whereas our method only relies on the weak convergence of the base learner and thus is free of these conditions. Thanks for pointing out the work of Kwon et al. (2021); their paper focuses on the squared loss and requires fourth-moment conditions. We will add this reference to the paper.
Q5: Things to think about in the real world and examples of how to use it
A5: The reviewer is correct that our method is most useful for heavy-tailed problems, such as those from finance and physics. We tried our method on several real-world datasets from the UCI repository, and found ROVE useful in predicting the energy yield of gas turbines and the critical temperature of superconductors:
- Gas turbine data
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 1.907 | 2.477 | 2.762 | 3.022 | 3.633 | 4.368 |
| ROVE | 1.746 | 1.853 | 1.905 | 1.968 | 2.024 | 2.05 |
- Superconductivity data
| | 50% | 80% | 90% | 95% | 97.5% | 99% |
|---|---|---|---|---|---|---|
| base | 202.04 | 221.318 | 249.03 | 276.546 | 286.248 | 296.915 |
| ROVE | 204.562 | 210.36 | 213.277 | 215.045 | 216.227 | 217.942 |
Here we list the percentiles of the test error of the base learner and our ROVE algorithm to compare their tail performance. We run each method multiple times on each dataset; each time we randomly split the data into training and testing sets and evaluate the trained models on the testing set.
In cases where the problem is not heavy-tailed, our method usually performs comparably with the base learner or brings only marginal improvement, as demonstrated by our experiments.
Thank you for the thoughtful and detailed rebuttal. The clarifications addressed most of my concerns, and I appreciate the effort to improve both the conceptual clarity and practical usability of the proposed method. The explanation of the model selection process and the conditions for achieving exponential tail decay significantly strengthen the technical contribution, and the discussion on scalability and comparisons to other robust approaches was also useful.
That said, I still feel the paper would benefit from more concrete practical guidance—particularly on hyperparameter tuning and applicability in standard ML scenarios—to help make it more accessible to practitioners. Overall, I'm satisfied with the response and have updated my score accordingly.
We truly appreciate your recognition of our efforts and sincerely thank you for raising your score. Your comprehensive feedback has been invaluable in helping improve the paper. We will include the additional experiments in the revised version and do our best to provide more practical guidance.
The authors present a new perspective on building ensemble models. The authors argue that, in the presence of heavy-tailed noise, it is beneficial to "ensemble" the models rather than the outputs. They propose training models on multiple subsamples of the training data (just like a normal ensemble), but, rather than averaging the predictions of these models, one should just pick the single base learner that was most often obtained (thus the final model output by the procedure is not actually an ensemble in the classical sense, but a single base model). The authors provide a comprehensive theoretical treatment of the problem and experiments on synthetic data supporting the theory. After the discussion with the authors, all reviewers recommend accepting the paper provided that the authors revise it to address their concerns.
In addition, I would like to highlight one shortcoming of the paper that received less attention from the reviewers: the lack of empirical evaluation on real data. In the rebuttal the authors show limited results on real problems, and without comparing to ensemble-based methods like bagging and random forests. While I appreciate that the main contribution of this paper is theoretical, it would greatly strengthen the paper if the authors showcased the practicality of their approach by tackling more realistic problems. Since, as the authors claim, optimization with heavy-tailed noise has received significant attention in various fields, there should be ample opportunity to apply the proposed technique to real data from finance, scheduling, or physics. I strongly encourage the authors to do so.