No Free Lunch from Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions
We show that ensembles of random feature models never outperform a single model of the same total size, and identify conditions where near-optimal performance by ensembles is possible in both the overparameterized and underparameterized regimes.
Abstract
Reviews and Discussion
The paper investigates random-feature ridge regression, comparing the use of a single large model versus multiple smaller models (ensembles). The authors demonstrate that ensembles can achieve near-optimal performance when the total feature count remains high in the overparameterized regime, while in the underparameterized regime, error scaling laws depend on the relationship between ensemble size and model size.
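To make the object of study concrete, here is a minimal Python sketch (our own illustrative code, not the authors' implementation; the helper names, the toy linear target, and the fixed ridge value are all assumptions) comparing a single random-feature ridge regressor with feature budget M against ensembles of K independently drawn models with M/K features each, averaging their predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_features(X, W):
    """Random ReLU features: relu(X W^T) / sqrt(N)."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(W.shape[0])

def fit_ridge(Phi, y, lam):
    """Ridge regression readout weights for feature matrix Phi."""
    N = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)

def ensemble_test_error(X_tr, y_tr, X_te, y_te, K, M, lam):
    """Average K independently drawn models with M // K features each (K = 1 is the single model)."""
    N = M // K
    preds = np.zeros_like(y_te)
    for _ in range(K):
        W = rng.standard_normal((N, X_tr.shape[1]))  # independent feature draw per member
        w = fit_ridge(relu_features(X_tr, W), y_tr, lam)
        preds += relu_features(X_te, W) @ w
    preds /= K
    return np.mean((preds - y_te) ** 2)

# Toy task: noisy linear target on Gaussian inputs.
d, P = 20, 200
X_tr, X_te = rng.standard_normal((P, d)), rng.standard_normal((1000, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y_tr = X_tr @ beta + 0.1 * rng.standard_normal(P)
y_te = X_te @ beta

for K in (1, 2, 4, 8):
    print(K, ensemble_test_error(X_tr, y_tr, X_te, y_te, K, M=512, lam=1e-2))
```

In the paper's setting the ridge would be tuned separately for each configuration; with that tuning, the no-free-lunch result predicts that the K = 1 configuration performs at least as well as any K > 1.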
Questions for Authors
Can the authors rely on the results presented in [1] and [2]? In my understanding, the solution for random feature models in [2] directly corresponds to the solution of ensembled random feature models. This follows from the fact that for different vectors ,
,
which implies that
.
Claims and Evidence
The claims are supported by clear evidence.
Methods and Evaluation Criteria
I appreciate the real-data experiments on CIFAR10 and MNIST. The authors first performed fully numerical experiments to support their theory, and then also conducted real-data experiments.
Theoretical Claims
I did not check how equation (9) is derived. To my understanding, all the comparisons below come from equation (9); the authors should give a clearer exposition in Section 3.1, or at least give clear sources for where these equations come from.
Experimental Design and Analysis
In Sections D.1 and D.2, the authors should clearly point out where the experimental results are.
Supplementary Material
I have checked all supplementary materials.
Relation to Prior Literature
The authors missed two important references, which already rigorously gave the risks of the single random feature model and of ensembled random feature models. The authors should discuss these references adequately and compare the results.
[1] Mei, S., & Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4), 667-766.
[2] Meng, X., Yao, J., & Cao, Y. (2024). Multiple descent in the multiple random feature model. Journal of Machine Learning Research, 25(44), 1-49.
Missing Important References
I have some concerns about the essential contribution of this paper. The rigorous excess risk curve has already been investigated by previous papers. [1] gave the rigorous theoretical values of the risk of the random feature model, and [2] extended the single random feature model to the ensembled random feature model. Based on the analysis of [1] and [2], one can also easily obtain rigorous theoretical values of the risk of the ensembled random feature model. The authors should discuss these two references in more detail.
[1] Mei, S., & Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4), 667-766.
[2] Meng, X., Yao, J., & Cao, Y. (2024). Multiple descent in the multiple random feature model. Journal of Machine Learning Research, 25(44), 1-49.
Other Strengths and Weaknesses
The paper is generally well written. However, there are two main concerns.
[1] Literature-related issues given above; please also see my questions below. [2] Clearer sources for the equations in Section 3.1.
Other Comments or Suggestions
See above.
Thank you for your review. We respond to your questions and concerns as follows:
In Section D.1, D.2, the authors should clearly point out where are the experimental results.
Response: Thank you for your suggestion. We will update the text to point this out.
Section D.1 on synthetic tasks pertains to the numerical error points in Figures 3 and S2.A-C.
Section D.2 on the numerical experiments for binarized MNIST and CIFAR-10 with ReLU features pertains to figures 1, 2, 4, S1, S2.D-F, S3, S4, S5, S6, and S7.
The authors missed two important references ([1], [2])...compare the results.
Thank you for bringing these works to our attention. We were aware of ref. [1] and many other papers deriving the error... These papers focus on the over-fitting and double descent / multiple descent phenomena in their analysis. However, these effects vanish entirely when the ridge parameter is optimized at each sample size. By studying the behavior of the test risk at optimal ridge, we obtain more practical results that will apply even in settings where the ridge parameter is set to its optimal value using cross-validation.
I have some concerns on the essential contribution ... these two references. Can the authors rely on the results ...
Thank you for bringing these papers to our attention. In short, no, we cannot rely on either of them to derive the test risk of ensembled RFRR.
As for ref. [1], the data model considered there consists of samples drawn uniformly from the unit sphere, whereas the results we use apply to arbitrary (but nicely behaved) data distributions.
As for ref. [2], the loss function considered there encodes joint training of the parameters and . This can be written as
The square term here cannot be factored into a sum of contributions depending separately on and , as you have assumed in your comment.
In the notation of [2], the model we consider corresponds to a decoupled loss function which optimizes and separately:
with additional terms if .
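For readers following this exchange, a schematic version of the distinction (written in our own notation, which need not match [2] or the paper exactly) is the following. Joint training, as in [2], solves a single ridge problem over all $K$ feature blocks, while the ensembles considered in the paper train each readout separately and average the predictors:

$$
\mathcal{L}_{\text{joint}}(a_1,\dots,a_K) = \Big\| y - \sum_{k=1}^{K} Z_k a_k \Big\|^2 + \lambda \sum_{k=1}^{K} \| a_k \|^2,
\qquad
\mathcal{L}_{k}(a_k) = \| y - Z_k a_k \|^2 + \lambda \| a_k \|^2,
\quad
\hat f(x) = \frac{1}{K}\sum_{k=1}^{K} z_k(x)^{\top} \hat a_k .
$$

Expanding the square in $\mathcal{L}_{\text{joint}}$ produces cross terms of the form $a_k^{\top} Z_k^{\top} Z_{k'} a_{k'}$ for $k \neq k'$, which is why it does not split into $K$ independent problems.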
Additional response to reviewer comments:
Many of your questions and concerns revolve around the derivation of the risk estimate for random feature ensembles, and its presence in prior literature. There are many papers which derive the bias and variance terms for random-features regression using a variety of methods, which we have referenced (Atanasov et al., 2024; Canatar et al., 2021; Simon et al., 2023; Adlam & Pennington, 2020; Rocks & Mehta, 2021; Hastie et al., 2022; Zavatone-Veth et al., 2022), and we are happy to add discussions of [1] and [2].
We do not claim to have contributed the error formula for ensembles -- our contribution is a novel and informative analysis of the test risk formula.
The error formulas for RFRR ensembles derived in previous works are not easily interpretable, and studying the implications of these error formulas is an important task unto itself. We believe that our paper has addressed an important gap in our understanding of ensembled random feature models. We have used the (known) bias-variance decomposition of the test risk of RFRR to study the tradeoff between model size and ensemble size. Specifically, we have shown that:
- Ensembling is never optimal under a fixed parameter budget at optimal ridge.
- Ensembling can achieve near-optimal performance in both the overparameterized and underparameterized regimes, with precise spectral conditions for near-optimal scaling in the underparameterized regime. If accepted, we will clarify these contributions in the abstract and introduction. We may also change the title of the paper to "No Free Lunch from Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions" to highlight all of our main contributions.
[1] Mei, S., & Montanari, A. (2022). The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4), 667-766.
[2] Meng, X., Yao, J., & Cao, Y. (2024). Multiple descent in the multiple random feature model. Journal of Machine Learning Research, 25(44), 1-49.
Updates after author discussion
Thanks a lot for all the clear discussion. A lot of my issues/confusion with the paper have been addressed in the comments, and I'm convinced that the theory just needs some cleaning up to be fully clear. The paper then tells an interesting -- and, to my knowledge, novel -- story about ensembling. I've increased my score to a 4/5 to vote for an accept. I would just recommend that the authors do a few careful scrubbing passes over the theoretical sections to make sure notation is being used consistently and all the asymptotics are carefully defined and explained (e.g., the fact that the small-ridge assumption is (1) made for analytical convenience, but (2) has some justification in the literature is helpful discussion that should definitely be in the paper!)
Best, Reviewer Ef22
Original review before author response
This is a theoretical paper studying kernel ridge regression with random features. Specifically, it asks whether ensembles of multiple models are effective when the optimal ridge parameter is used. The authors show that, in fact, when holding the total number of random features constant across all models (representing a fixed compute budget), the test risk is minimized by not ensembling. However, the authors note that ensembles can achieve nearly optimal performance in the overparameterized regime. The authors validate their theory in small-scale synthetic and real experiments, which point to some interesting future directions around when the ridge parameter is not optimally chosen.
Questions for Authors
My main questions that influenced my review are:
- Are all the approximate equalities really justified in the proofs / theoretical statements?
- Can the proof of theorem 5.1 be expanded on or my understanding of it corrected?
- What is the purpose of Section 6?
Claims and Evidence
I think that the theoretical results, as stated, back up the overall premise of the paper. And the empirical results also effectively illustrate the theoretical results. I do have some issues with the theoretical development, which are listed below.
The one portion of the paper that I didn't see as offering much evidence was Section 6. This felt a little tacked on, and I wasn't sure how it related to the main story of the paper, which is (to my reading) about ensembling not being effective. I think some more discussion of what the motivation / takeaways of this section are would be helpful.
Methods and Evaluation Criteria
Yes, the paper is mostly theoretical, so there isn't too much choice of methodology here.
The only comments I had were from reading Appendix D (details on experimental setups):
- Section D.1 states that the are normalized so that and . Why is it that these are both simultaneously satisfiable?
- In Section D.2, why is the variance of chosen to be ? Is this a result from Lee et al. (2018)? If so, some quick rewording might clarify this.
Theoretical Claims
There were a few times where I thought the theoretical development was hard to follow because it seemed to be lacking definition, was not fully rigorous, or seemed to have skipped a few steps in proofs. I've bulleted out these issues below:
Lacking definition
- "consider the "featurization" transformation " -- I think this could really use a precise example to clarify what is supposed to be.
- was used around the second column of line 62, but I didn't see a definition for it.
- ``As , this stochastic kernel converges to the deterministic kernel ''. Does this not require a careful choice of and ?
- `` and the true target function '' -- I think this is just a typo
- "where is the "true'' risk" -- How is the definition of in equation 9 not the `"true" risk? I didn't really understand what is supposed to be.
- Eq 7 seems to define for a fixed . But then Eqs 12-14 decompose using expectations over (i.e., over the random features making up ). These seem inconsistent with one another.
- In section 5.1, "". It doesn't seem possible that could be both and 0.
Missing Rigor
- I think it's fine to not show the derivation of Eq 9. But it's written as an approximate equality without ever saying what the slack in this approximation is. I think this should be stated.
- Eq 21 seems like a major point of the paper, given that it's the main result showing near optimality of ensembling in the overparameterized regime. But its proof in Appendix C.2 doesn't seem fully rigorous. First, in the proof and are stated to be "[assumed] to be on the same order of magnitude" without stating what this means. Second, the proof uses a number of approximate equalities (Eqs. C.15-C.17) without specifying what the approximation is or how it affects the proof. Overall, I think this result should be wrapped into a lemma / theorem environment with carefully stated assumptions and a complete proof.
- In Appendix C (derivation of the scaling laws), there were a lot of approximations used. It's not clear how these affect the proof.
Overall, my major issue with rigor is that I don't think it's appropriate to use approximate equalities in a formal proof without precisely defining what terms the approximation sign is hiding and verifying that dropping those terms doesn't affect the result.
Missing Steps in Proofs
Appendix B.1 contains the proof of Theorem 4.1, which says that when minimizing the test risk over the ridge parameter , the test risk decreases monotonically with the number of features, number of ensemble elements, and number of training datapoints. It's overall not clear to me how this proof accounts for the fact that we are minimizing over . There's a reference to Eq 11 and the fact that its fixed point can be held constant by scaling up . But we're optimizing over . So I'm not totally sure what this is pointing out.
Also, the proof of monotonicity in (the number of model features) uses Eq. 9 to derive its result. But Eq 9 is an approximate equality, so I don't think that analyzing it can lead us to rigorous conclusions.
Experimental Design and Analysis
Yes, the paper is mostly theoretical, so there isn't too much choice of experimental design here.
Supplementary Material
Yes, the only portion I did not read is Appendix B.2 (the proof of theorem 5.1); I was a little too confused about some of the notation / previous results to really dig into that proof.
Relation to Prior Literature
Ensembling is an important idea in the machine learning literature; e.g. random forests are hugely successful in practice, and ensembles of deep models have been proposed recently for various purposes such as uncertainty estimation. As far as I know, most work on ensembling does not study the very important tradeoff in terms of number of parameters (representing compute) and statistical accuracy. I think this paper is an interesting step towards understanding that tradeoff.
Missing Important References
Nothing as far as I know!
Other Strengths and Weaknesses
I've listed everything in the sections above.
Other Comments or Suggestions
I would just point out that, while there's not a statistical gain in ensembling here (in fact it seems there may be a statistical loss!), there is a computational gain, as the compute required to fit a single model with all the features exceeds that required when the same features are spread over an ensemble of smaller models. I think this would be good to bring up around the discussion of Theorem 5.1.
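As a back-of-the-envelope version of this point (our own estimate, assuming the dominant cost of fitting one member is cubic in its feature count, e.g. when solving the $N \times N$ normal equations with $N < P$): fitting a single model with $M$ features costs roughly $\mathcal{O}(M^3)$, whereas an ensemble of $K$ members with $M/K$ features each costs

$$
K \cdot \mathcal{O}\!\left(\left(\tfrac{M}{K}\right)^{3}\right) = \mathcal{O}\!\left(\tfrac{M^{3}}{K^{2}}\right),
$$

and the $K$ members can be fit in parallel.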
Thank you for your detailed review. Below, we address your questions and concerns:
"The one portion ... would be helpful."
Response: Thank you for this suggestion. We will add more justification. While ensembles are never optimal, they allow parallelization and can be near-optimal, hence useful in practice. Section 5.1 addresses near-optimality in the overparameterized regime. The scaling laws derived in Section 6 address near-optimality in the underparameterized regime, which is necessary for a complete characterization.
- Section D.1 states ... satisfiable?
Response: We will clarify that both the and are normalized to satisfy these constraints.
- In Section D.2, why ... clarify this.
Response: Yes, this choice is made so that the kernel converges to the NNGP kernel of Lee et al. (2018). We will clarify this in the text.
- "consider the featurization transformation ". I ... supposed to be.
- was ... for it.
Response: We have revised the text:
"Define the random features by , where the are random parameter vectors sampled independently from measure on . Here, the function is a "featurization transformation," often taking the form for some nonlinear activation function ... "
- **"As , this... and ?
Response: A careful choice of and is required to ensure convergence to a particular kernel (see point 2). However, the stochastic kernel in general converges to some deterministic kernel given by . We will clarify.
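A minimal numerical illustration of this convergence (our own sketch, assuming Gaussian weights and ReLU features, for which the deterministic limit is the first-order arc-cosine kernel; the paper's choice of featurization and weight measure may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal(10), rng.standard_normal(10)  # two fixed inputs

def stochastic_kernel(x1, x2, N):
    """Empirical kernel (1/N) * sum_i relu(w_i . x1) * relu(w_i . x2) over N random weights."""
    W = rng.standard_normal((N, x1.size))
    return np.mean(np.maximum(W @ x1, 0.0) * np.maximum(W @ x2, 0.0))

def arccos_kernel(x1, x2):
    """Deterministic N -> infinity limit for ReLU features with standard Gaussian weights."""
    n1, n2 = np.linalg.norm(x1), np.linalg.norm(x2)
    theta = np.arccos(np.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0))
    return n1 * n2 * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

for N in (10, 100, 1000, 10_000, 100_000):
    print(N, stochastic_kernel(x1, x2, N), "->", arccos_kernel(x1, x2))
```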
- Equation (7) seems to ... with one another.
Response: This is a standard bias-variance decomposition of the error (see [2]). An expectation over is not necessary on the left side of eq. 12 because the test risk concentrates over .
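For concreteness, the decomposition being referenced looks schematically as follows (our own shorthand, holding the training data fixed; the paper and [2] additionally track variance over the training set and label noise). Writing $\hat f_K = \frac{1}{K}\sum_{k} \hat f^{(k)}$ for an average of $K$ predictors trained on independent feature draws $Z$,

$$
\mathbb{E}_{Z}\!\left[\big(\hat f_K(x) - f_*(x)\big)^2\right]
= \big(\mathbb{E}_{Z}[\hat f^{(1)}(x)] - f_*(x)\big)^2
+ \tfrac{1}{K}\,\mathrm{Var}_{Z}\!\big[\hat f^{(1)}(x)\big],
$$

so ensembling suppresses only the feature-sampling variance, while the bias term is unaffected by $K$.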
- In Section 5.1, "".... and 0.
Response: To clarify, we now say "we expand the risk estimate (eq. 9) as power series in and ".
Comments regarding the use of eq. 9, and the slack in this estimate ("where is ... to be."; "I think... should be stated."; "In Appendix C ... result."; "Also, the proof ... rigorous conclusions."):
Response: Thank you, we agree that a more thorough explanation would be helpful. We call the "true" risk the error for a particular realization of the random parameters. The key result of Defilippis et al. (2024) is to show that the distribution of these values concentrates around a deterministic-equivalent expression which depends on the feature count. The difference between the true random-features generalization error and its deterministic equivalent is controlled by a multiplicative concentration bound, holding with high probability for a constant $C$. We will clarify the notion of "deterministic" equivalence between the random quantity $\mathcal{E}_g$ and the deterministic quantity $E_g$ at large $P$ and $N$.

> Eq 21 seems ... complete proof.

**Response:** The approximate equalities in Eqs. C.15-C.17 indicate that we are neglecting higher-order terms in $\lambda$ and $1/N$. The result is a rigorous first-order approximation in these variables. To clarify, we will replace $\approx$ with $\asymp$ (equal to leading order) in the derivation.

> Appendix B.1 ... pointing out.

**Response:** Thanks, we will clarify these steps in the SI. For increasing $N$ we could argue as follows. Denote by $E_g(N, \lambda)$ the test error with $N$ random features and ridge $\lambda$, and consider $N \to N'$ for $N' > N$. We show that there exists a ridge parameter $\lambda'$ such that $E_g(N', \lambda') \leq E_g(N, \lambda)$. Next, we have that
$$ \min_\lambda E_g(N', \lambda) \leq E_g(N', \lambda') \leq E_g(N, \lambda). $$
Because this is true for any $\lambda$, we may assign $\lambda = \operatorname{argmin}_{\lambda} E_g(N, \lambda)$, completing the proof. Analogous steps can be applied when increasing $P$, or for the joint transformation of $N$ and $K$ with $KN = M$.

References:
[1] Defilippis et al. https://arxiv.org/abs/2405.15699
[2] Adlam et al. https://arxiv.org/abs/2011.03321

(Reposting this as a Rebuttal Comment, as I didn't realize that authors can't see Official Comments. Sorry about that!)
Thanks for all the replies! I have a few follow-up points listed below:
- On section 6: these updates sound great. In addition to the lowered computational complexity, I totally agree with the point that ensembling allows parallelization. This is a really interesting statistical-computation tradeoff.
- On Eq (7): To clarify, I'm confused because Eq 7 seems to be a function of Z. In particular, a random defines a given . And Eq (7) refers to a fixed , that is, a fixed . On the other hand, the right hand side of Eq. (12) does not depend on , as it takes expectations over . I would believe that Eq. (7) will concentrate in for large numbers of features. But then there needs to be some kind of limit taken here for Eq (7) and Eq (12) to both hold. But maybe I'm misunderstanding the point about concentration in .
- On the high probability bound between and -- are the commas here typos? Should the bound be ? Just want to make sure I'm following here! Also is there an exact reference from Defilippis et al. that has this result?
- On the approximate equalities: is it correct to say that every approximate equality in the paper actually means "this is an equality when dropping terms that are of size or or smaller?" If so, doesn't this assume we're working under an asymptotic model that has ? Why should we expect this?
- One final thing from the review that wasn't addressed: what does it mean that and are "[assumed] to be on the same order of magnitude"? This sounds like the imagined asymptotic model not only has , but also has and they're going at about the same rate. Can the authors give some more clarification here?
Thank you!
Thank you for your constructive feedback, which has helped us improve the clarity of our results. We respond to your comments below:
On section 6: ... statistical-computation tradeoff.
Response: We are glad that you agree that this is an interesting question, and will emphasize these points about parallelization and improved computational complexity in ensembles in the final version if accepted.
On Eq (7): To clarify, I'm confused because Eq 7 seems to be a function of Z...
Response: Thank you for your comment, which we did not have space to fully address in our last reply. We will make some edits to Section 2 to clarify the meaning of the error formulas, and to make sure that the meaning of is consistent throughout.
First, we will replace the error symbol in equation 7 with the "true" error symbol $\mathcal{E}_g$, to indicate that this quantity depends on a particular realization of the random features.
We will then introduce the deterministic equivalent error $E_g$, which is the deterministic quantity about which the "true" error concentrates. It satisfies a multiplicative high-probability error bound of the type described in our previous reply (the commas there were a formatting error). As a shorthand, we may write $\mathcal{E}_g \simeq E_g$, with $\simeq$ indicating deterministic equivalence with a multiplicative error bound of this type. We will update equation 9 so that it reads:

$$ \mathcal{E}_g^1 \simeq E_g^1 = \frac{1}{1-\gamma_1}\left[-\rho \kappa_2^2 \,\mathrm{tf}_1^{\prime}\left(\kappa_2\right)+(1-\rho) \kappa_2 \,\mathrm{tf}_1\left(\kappa_2\right)+\sigma_\epsilon^2\right] \qquad (9) $$

So, the error formula on the right of eq. 9 is exactly equal to $E_g^1$. $E_g^1$ is, in turn, a deterministic-equivalent approximation of the "true" error $\mathcal{E}_g^1$. We will similarly have $\mathcal{E}_g^K \simeq E_g^K$ for $K > 1$.

> On ... from Defilippis et al. that has this result?

**Response**: Defilippis et al. is indeed the reference with this exact result; see equation 3 in their paper.

> On the approximate equalities: ... we expect this?

**Response**: No, it would not be correct to say that every approximate equality in our paper corresponds to dropping higher-order terms in $\lambda$ and $1/N$. There are two different types of approximation we are making. The first is to replace the random quantity $\mathcal{E}_g^K$ with its deterministic equivalent $E_g^K$. All of our results rest on this simplification, which is justified by the multiplicative concentration bounds proven by Defilippis et al., and by our numerical simulations. Theorems 4.1, 5.1, and Corollary 5.5 apply exactly to $E_g^K$ with no further approximations, and with no assumptions about the ridge. Additional approximations are used in Sections 5.1 and 6 to study the behavior of $E_g^K$ in limits corresponding to the overparameterized and underparameterized regimes, respectively. In Section 5.1, we expand $E_g^K$ in the limit $N \gg 1$, keeping $P \sim \mathcal{O}(1)$, so that $N \gg P$ (overparameterized). Equation 21 is the **only** equation in the paper that assumes an asymptotically small ridge parameter $\lambda \approx 0$, and this is indicated in the correction term $\mathcal{O}(\lambda^2, \lambda/N, 1/N^2)$. Here, we have assumed small ridge for analytical convenience, but the result (eq. 21) provides a good explanation of our empirical results in Figs. S2.C and S2.F at optimal ridge as well. Specifically, the overlap between the green curves ($P = 10$) and the dashed black lines shows that ensembles achieve *near-optimal* performance in the heavily overparameterized regime. Simon et al. (2023) also provide an argument for why the optimal ridge is always small in the overparameterized regime (see Theorem 2 there). We will include discussion of this in the final version of the paper.

> One final thing ... "[assumed] to be on the same order of magnitude"? ... more clarification here?

**Response**: To clarify this point, we will remove the statement that we are assuming that $\lambda$ and $1/N$ are "on the same order of magnitude," and instead use the terminology "we expand as a power series in $1/N \gtrsim 0$ and $\lambda \approx 0$." Again, this only applies to the derivation of eq. 21, and higher-order contributions are accounted for in the additive correction term there ($\dots + \mathcal{O}(\lambda^2, \lambda/N, 1/N^2)$).
**To conclude:** Since this is our last opportunity to reply, we want to again thank the reviewer for giving such a careful reading of our paper, and for helping us improve the clarity of our presentation.

In the context of random feature high-dimensional ridge regression, this paper investigates the problem of training an ensemble of independent models and the trade-off between ensemble size and model size for a fixed total number of features. The authors prove a 'no free lunch' theorem, showing that increasing the ensemble size while keeping the total number of features fixed leads to higher test risk, making a single large model the optimal choice, provided the ridge parameter is fine-tuned. However, in the overparameterized regime, small ensembles can achieve near-optimal performance over a wider range of ridge parameter values, making them more robust than single models. The authors derive scaling laws showing that while optimal error scaling is always achieved by increasing model size with a fixed ensemble size, near-optimal scaling can be achieved under certain conditions on the kernel and task eigenstructure. These findings are validated through numerical simulations on synthetic data and real-world datasets.
Questions for Authors
I have no important questions.
Claims and Evidence
All claims are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods are well-suited for the problem considered.
Theoretical Claims
I checked the correctness of the proofs and I have no issues to discuss.
Experimental Design and Analysis
I have no issues to discuss.
Supplementary Material
I have reviewed the supplementary material in its entirety.
Relation to Prior Literature
This paper builds on prior works showing that single large models can outperform ensembles when optimally tuned and extends the theoretical understanding of random feature generalization to the ensemble setting. The authors also contribute to the literature on scaling laws by deriving how test risk scales with ensemble and model size, providing insights into the trade-offs between the two quantities.
Missing Important References
I am not aware of any relevant references that have been omitted.
Other Strengths and Weaknesses
This work improves the current understanding of generalization and scaling laws for RFRR, extending previous findings by tackling the problem of ensemble learning, which is closely related to practical issues and applications. The presentation is clear and provides valuable insights that could apply to other settings, such as explaining the apparent violation of the "no free lunch from ensembles" principle. Moreover, all claims are supported by extensive numerical simulations. I have not identified any major weaknesses.
Other Comments or Suggestions
Typo on line 121, second column: "et al." is repeated twice
Thank you for your review!
The paper investigates the performance of random feature ensembles, and discusses whether ensemble models outperform a single model when the total number of parameters is fixed. The theoretical analysis is based on random feature ridge regression, while the empirical studies are performed on a binarized CIFAR10 task. The paper provides two main results: the first shows that RF ensembles always benefit from larger ensemble members and from more ensemble members, while the second shows that, when the total parameter count is fixed, increasing the ensemble size degrades performance at the optimal ridge parameter. Moreover, the paper also derives a scaling law for underparameterized ensembles.
Questions for Authors
Most of the experiments are based on the random feature (RF) model. Would the same results hold for larger ensemble models?
Claims and Evidence
The paper provides sufficient theoretical analysis to support the claims, and the experiments based on random feature ridge regression on the binarized CIFAR10 task also show convincing results.
Methods and Evaluation Criteria
The theorems and results provided in this paper are easy to understand, and several works already referenced in this paper give similar analyses showing that decreasing the number of random features leads to poorer performance. The proposed results are clear and make sense for ensembled RF models.
Theoretical Claims
The paper presents theoretical analysis to prove the theorems, and the claims are proved based on prior works and some straightforward inequalities.
Experimental Design and Analysis
The experimental studies are sufficient and verify the claims in this paper. Figure 1 shows that increasing the number of samples P and the ensemble member size N reduces the prediction error, while Figure 2 shows that increasing the number of ensemble members K while fixing the total number of random features M degrades the performance of the ensemble model.
Supplementary Material
The supplementary material contains sufficiently detailed proofs.
Relation to Prior Literature
The results presented in this paper are clear and evident, and the theoretical analysis is mainly based on prior works. The results mainly focus on the random feature ridge regression model; whether they are useful for other machine learning models or deep neural networks is unclear.
Missing Important References
No additional references need to be discussed here. The authors have covered all the essential studies in the experiments or the related work.
Other Strengths and Weaknesses
Weaknesses:
- The contribution of this paper is limited; the results are evident and can be easily derived from prior works.
- The results are mainly based on the random feature ridge regression model; whether they apply to deep learning models is not discussed.
- The layout of the figures in this paper is irregular, as are the fonts in these figures.
Other Comments or Suggestions
This is a paper with limited contributions; the authors should provide more interesting findings, such as extending the results to the deep learning setting or to other ensemble models.
Rebuttal to Reviewer 2SLL (Complete)
Thank you for your review and questions. It appears that you are convinced of the correctness of our results, but have concerns about the significance of the contribution. We will respond to your concerns and questions individually below:
- The contribution of this paper is limited; the results are somewhat evident and can be derived from prior works.
Response: The error formulas for RFRR ensembles derived in previous works are not easily interpretable, and studying the implications of these deterministic equivalent errors is an important task unto itself. We believe that our paper has addressed an important gap in our understanding of ensembled random feature models. We have used the (known) bias-variance decomposition of the test risk of RFRR to study the tradeoff between model size and ensemble size. Specifically, we have shown that:
- Ensembling is never optimal under a fixed parameter budget at optimal ridge.
- Ensembling can achieve near-optimal performance in both the overparameterized and underparameterized regimes, with precise spectral conditions for near-optimal scaling in the underparameterized regime. If accepted, we will clarify these contributions in the abstract and introduction. We may also change the title of the paper to "No Free Lunch from Random Feature Ensembles: Scaling Laws and Near-Optimality Conditions" to highlight our contributions beyond Theorem 4.1.
- The results mainly pertain to the random feature ridge regression model; whether they apply to deep learning models remains unaddressed.
Response: While we are ultimately interested in understanding the utility of ensemble learning with state-of-the-art machine learning models, random-features regression provides an important "ground floor" for investigating the utility of ensemble learning in a tractable setting. Furthermore, because of the approximate correspondence between RFRR and deep networks trained in the lazy learning regime [3], our results already bear some relevance to deep ensembles. Understanding the properties of deep ensembles trained in the rich regime is a critical direction for our future work, but it will be better presented with reference to these baseline results in linear models, which have already saturated the page limit for this venue.
Our results also add to a long history of research on random-feature models in analogy to deep networks. Another example of this fruitful correspondence is work on fine-grained bias-variance decompositions [2].
We have also performed numerical experiments for deep ensembles. In these experiments, we train ensembles of deep convolutional neural networks on a computer vision task (CIFAR10 image classification) using a parameterization which keeps training dynamics consistent across widths [1]. In addition to the weight decay, there is a "richness" parameter which controls the amount of feature learning in the network. Our simulations show that a large network outperforms any ensemble of smaller networks with the same total size when both the weight decay and the "richness" parameter are tuned to their optimal values. If it will have a positive impact on your evaluation, we are willing to add these numerical results to the supplemental material. This result is available at this anonymized GitHub repo: https://anonymous.4open.science/r/NoFreeLunchRandomFeatureEnsembles/README.md
- The layout of the figures is irregular, and the fonts in these figures could be improved.
Response: Thank you for your feedback, we are open to any and all suggestions on how to improve the clarity and appearance of our figures for the final version of this paper if accepted!
"Most of the experiments are based on the random feature (RF) model. Would the same results hold for larger ensemble models?"
Response: See our response to point 2 above.
[1] Nikhil Vyas, Alexander Atanasov, Blake Bordelon, Depen Morwani, Sabarish Sainathan, and Cengiz Pehlevan. Feature-learning networks are consistent across widths at realistic scales, 2023. URL https://arxiv.org/abs/2305.18411.
[2] Ben Adlam and Jeffrey Pennington. Understanding double descent requires a fine-grained bias-variance decomposition, 2020. URL https://arxiv.org/abs/2011.03321.
[3] Chizat, L., Oyallon, E., & Bach, F. (2020). On lazy training in differentiable programming (arXiv:1812.07956v5). Retrieved from https://arxiv.org/abs/1812.07956
This work provides an asymptotic comparison of random feature ensembles versus a single large random feature regressor under a fixed feature budget. While the authors demonstrate that no ensemble will outperform a single regressor with an optimally tuned ridge parameter, they identify conditions under which an ensemble is near-optimal. This work is theoretically sound and builds upon many recent derivations of the asymptotic risk of random feature regression. While some portions of the theoretical argument were not rigorously presented in the original submission, the authors have addressed these concerns during the discussion period. I urge the authors to incorporate these suggestions for clarity and rigour in the camera-ready. Overall, this work tackles a relevant problem and adds to the knowledge of scaling laws, and therefore I recommend acceptance.