Weak-to-Strong Generalization Even in Random Feature Networks, Provably
Abstract
Reviews and Discussion
This paper theoretically proves that, even in the random features model (a special case of two-layer neural networks where only the second layer is trained), we can still obtain weak-to-strong generalization if we choose a suitable early stopping time for the student model. The contributions include the following aspects:
-
In Theorem 3.1, for the random features model, if the student model is trained for a suitable number of steps, its test loss can be made substantially smaller than that of the teacher model. Under the Gaussian universality assumption, the relationship can be described with an exponent of 1.49.
-
Without the Gaussian universality assumption, such a characterization is given for a linear model with a suitable spectral decay in Theorem 3.2.
-
In Theorem 4.1, the limits of the random features model with early stopping are given in terms of a lower bound.
Questions For Authors
-
What is the relationship between the quantity in Eq. (14) and the effective rank? For example, if there is a large eigenvalue gap, is this quantity close to 1 while the effective rank (the definition can be found in Peter Bartlett's benign overfitting paper) is well-controlled?
-
In Theorem 6.3, it is still unclear to me what the optimal early stopping time is.
Claims And Evidence
The key idea (Theorem 6.3) is that early stopping can reduce the variance if there is a large eigenvalue gap, while leaving the bias unchanged, so we can obtain weak-to-strong generalization.
-
The key issue here is that the authors assume the target function is supported on the top-K eigenspace of the student's kernel with nonzero eigenvalues. In this case, the bias is unchanged and we only need to reduce the variance. However, I'm not sure this matches the key spirit of weak-to-strong generalization from the perspective of model capacity going from a smaller to a larger model (mainly on the bias part). More importantly, if we assume the target function is supported by the student's kernel, why do we need to introduce the teacher model in a two-stage learning process? In this case, we could directly learn the target with the student model.
-
Another issue is that the authors use gradient flow to derive the results. I understand that this simplifies the analysis and directly yields the dynamics of the output in closed form. But using a certain number of gradient steps for early stopping would be more natural, given that early stopping is the key idea of this work.
Methods And Evaluation Criteria
This is a theoretical paper and no algorithm is provided.
Theoretical Claims
I have checked the theoretical claims in Section 3, as well as Sections 5 and 6 on page 7.
Experimental Design And Analyses
The experiments on toy models, e.g., Figures 1, 2, and 3, are helpful for grasping the main idea of this work.
Supplementary Material
Yes, I checked the proofs in the supplementary materials at a high level.
Relation To Broader Scientific Literature
Yes. This paper provides a theoretical analysis of weak-to-strong generalization.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
This paper provides one way to understand weak-to-strong generalization from the variance reduction part.
Other Comments Or Suggestions
This paper is written in quite a rushed way, with many typos, e.g.:
line 271: Random Feature Netowkr -> Random Feature Network
Notation inconsistency: line 300, U_S is bold but sometimes not.
We thank the reviewer for their positive feedback. Below we address the main comments:
-The reviewer asks "Why at all introduce the teacher?" Indeed, in the weak-to-strong setup as introduced by Burns, and as we study here, training directly on the target would give better performance than weak-to-strong training. This is what Burns refers to as the "ceiling performance" in the definition of the PGR (our Equation 9). The weak-to-strong question, as introduced by Burns, asks how training with only the teacher labels, and not the true target, can get close to this ceiling performance. This captures a situation where the target is not available when training the strong student, only the weak teacher. See Burns et al for further discussion. What we are investigating are scenarios in which, even if the student is trained only through the teacher, the student is better than the teacher (but not better than the ceiling performance!). Unsurprisingly, this happens when the target is aligned in some way with the inductive bias of the student (and teacher), and the way we capture this in order to obtain crisp possibility results is through the eigenspace assumption (though our general results do not need this specific assumption and are more general).
-Gradient flow: gradient flow can be thought of simply as gradient updates with a very small step size (formally, the limit as the step size goes to zero); see e.g. "A Continuous-Time View of Early Stopping for Least Squares" by Ali et al 2019 or "Continuous vs. Discrete Optimization of Deep Neural Networks" by Elkabetz et al 2021. Using standard results on the convergence of gradient descent to gradient flow, it is possible to obtain the same results also for early-stopped gradient descent, with an additional error term that depends on the step size and can be made as small as we would like. Studying how using large step sizes affects weak-to-strong generalization, and how small the step size needs to be to realize weak-to-strong, is indeed an interesting and difficult problem (this follows the path of many results on the implicit bias of gradient descent, which are also much better understood in the gradient flow, i.e. small step size, limit).
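To illustrate the point about gradient flow versus small-step-size gradient descent, here is a minimal sketch (not the paper's code; the covariance, target, and stopping time are arbitrary placeholder choices) on a population least-squares objective:

```python
# A minimal sketch (not the paper's code): early-stopped gradient descent with a
# small step size tracks the gradient-flow solution on a population least-squares
# objective. The covariance, target, and stopping time below are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(0)
d = 10
eigvals = np.sort(rng.uniform(0.1, 2.0, size=d))[::-1]   # spectrum of the covariance
Sigma = np.diag(eigvals)
w_star = rng.normal(size=d)                              # population-optimal weights

def gradient_flow(t):
    """Closed-form GF solution w(t) = (I - exp(-Sigma*t)) w_star, started from 0."""
    return (np.eye(d) - np.diag(np.exp(-eigvals * t))) @ w_star

def gradient_descent(t, eta):
    """GD on 0.5*(w - w_star)' Sigma (w - w_star), run for t/eta steps."""
    w = np.zeros(d)
    for _ in range(int(t / eta)):
        w -= eta * Sigma @ (w - w_star)
    return w

t_stop = 3.0                                             # an early stopping "time"
for eta in [0.1, 0.01, 0.001]:
    gap = np.linalg.norm(gradient_descent(t_stop, eta) - gradient_flow(t_stop))
    print(f"eta={eta:>6}: ||GD - GF|| = {gap:.2e}")       # shrinks as eta -> 0
```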
-Regarding the quantity in Eq. (14): Effective rank is a property of the student's kernel only, i.e. of the eigenvalues of the eigendirections of the kernel function. The quantity in Eq. (14) measures, roughly, how much of the teacher error is contained in the eigendirections. It depends on the projection of the teacher features onto the eigendirections, and is not related to the student at all. So there is no a-priori connection: the eigenvalues of the student kernel do not factor into this quantity. This can be seen from the fact that it and the student eigenvalues factor separately into the final result.
-Regarding the optimal stopping time: we indeed do not have a simple expression for the optimal stopping time. What we show is that with some stopping time we can get W2S generalization. We do however have some insight into the optimal stopping time through the experimental results. We agree that understanding the optimal stopping time is an interesting question. Empirically, optimal stopping can often be achieved through validation.
The paper studies weak-to-strong generalization in random feature models trained via gradient flow with early stopping. Contrary to the prior belief that well-pretrained models (like GPT-4) are necessary for weak-to-strong, this paper shows that weak-to-strong can occur even in much simpler models without pretraining, like two-layer neural networks with random, fixed bottom layers and trained top layers. In the context of random feature models, the weak vs. strong models are modeled as linear or two-layer ReLU networks with small vs. large numbers of hidden units, respectively. For weak-to-strong, the last layer of the weak teacher is first trained over the population with ground truth labels. Then, an infinitely wide strong student is trained via gradient flow over the same population with pseudo-labels from the weak teacher. It is shown that the strong student can achieve polynomially smaller generalization error than the weak teacher. A lower bound on such improvement is derived. Insights are provided on the mechanisms behind weak-to-strong from the early stopping perspective.
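For concreteness, here is a minimal simulation sketch of the two-stage pipeline described above (illustrative only, not the paper's experiment; the dimensions, widths, target, step size, and stopping point are placeholder choices, and whether the student actually beats the teacher depends on the spectrum and stopping time, which is exactly what the theorems characterize):

```python
# Illustrative only (not the paper's experiment): a narrow random-feature teacher is
# fit on true labels, then a wider random-feature student is trained with early-stopped
# gradient descent on the teacher's pseudo-labels. All sizes, the target, the step size,
# and the stopping point are placeholder choices.
import numpy as np

rng = np.random.default_rng(0)
d, n_teacher, n_student, n_samples = 20, 50, 1000, 5000

X = rng.normal(size=(n_samples, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # inputs on the unit sphere
beta = rng.normal(size=d); beta /= np.linalg.norm(beta)
y = X @ beta                                             # a simple linear target

def relu_features(X, W):
    return np.maximum(X @ W.T, 0.0)

W_teacher = rng.normal(size=(n_teacher, d))
W_student = rng.normal(size=(n_student, d))
F_t, F_s = relu_features(X, W_teacher), relu_features(X, W_student)

# Stage 1: teacher's top layer fit on the true labels (a large sample stands in for the population).
a_teacher, *_ = np.linalg.lstsq(F_t, y, rcond=None)
pseudo = F_t @ a_teacher                                 # teacher pseudo-labels

# Stage 2: student's top layer trained by GD on the pseudo-labels, stopped early.
a_student, eta, steps = np.zeros(n_student), 1e-3, 200
for _ in range(steps):
    a_student -= eta * F_s.T @ (F_s @ a_student - pseudo) / n_samples

X_test = rng.normal(size=(2000, d))
X_test /= np.linalg.norm(X_test, axis=1, keepdims=True)
y_test = X_test @ beta
err_t = np.mean((relu_features(X_test, W_teacher) @ a_teacher - y_test) ** 2)
err_s = np.mean((relu_features(X_test, W_student) @ a_student - y_test) ** 2)
print(f"teacher test error: {err_t:.4f}, student test error: {err_s:.4f}")
```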
Questions For Authors
Major questions are raised in the "Other Strengths And Weaknesses" section.
Claims And Evidence
Yes, while I couldn't verify all the proofs in the appendix, the main results in the paper seem reasonable. The claims are well-supported by theoretical analysis and some synthetic numerical experiments.
Methods And Evaluation Criteria
Yes, the theoretical tools used in the paper are reasonable. In particular, the connection between weak-to-strong and early stopping (or in general, taking training dynamics into consideration when analyzing weak-to-strong) is insightful.
Theoretical Claims
While I couldn't verify all the proofs in the appendix, the main results in the paper seem reasonable.
Experimental Design And Analyses
This is a theoretical paper with a few synthetic numerical experiments. The experiments are simple but sufficient to support the theoretical claims.
Supplementary Material
Not applicable.
Relation To Broader Scientific Literature
In contrast to the common belief that weak-to-strong requires well-pretrained models, this paper provides rigorous analysis showing that weak-to-strong can occur even in much simpler models without pretraining.
Essential References Not Discussed
Upon my knowledge, the paper discusses the essential references in the field.
Other Strengths And Weaknesses
Strengths:
- In contrast to the common belief that weak-to-strong requires well-pretrained models, this paper provides rigorous analysis showing that weak-to-strong can occur even in much simpler models without pretraining.
- The connection between weak-to-strong and early stopping (or in general, taking training dynamics into consideration when analyzing weak-to-strong) is insightful.
Weaknesses
- The insights underneath some theoretical results are not thoroughly explained. For example, Theorems 3.1 and 3.2 imply that the PGR approaches 1 as the teacher width tends to infinity. Toward this limit, it seems that both the teacher and the student will be infinitely wide random feature models. This limiting scenario seems to go beyond the weak-to-strong setting yet brings the best PGR, which is in some sense counterintuitive regarding where the weak-to-strong effect comes from. It would be helpful to provide more insights on this.
- The notion of the "Gaussian Universality Ansatz" seems to be critical but is not well explained in the paper. For example, a formal definition of the ansatz in the main text (instead of in footnotes) would be helpful. It is somewhat confusing what the key difference is between the linear model 2.2 and the two-layer ReLU model 2.1 under the Gaussian Universality Ansatz in terms of analysis. Further clarification would be helpful, e.g., what leads to the different exponents 1.49 vs. 2 (aside from the special spectrum of the covariance in Theorem 3.2)?
- Section 5 provides valuable insights into the mechanism of weak-to-strong and the importance of early stopping. However, the writing is a bit confusing. It would be helpful to provide more structured explanations for better readability.
Other Comments Or Suggestions
Here are some additional minor points:
- Line 197, right: "Indeed, we can see that even in the ReLU Model 2.1, the teacher error is polynomially smaller..." -> "student error"?
- Line 250, right: "In Theorem 4.1 we show that this is not possible, at least not with a random feature mdel as in Section 2." -> "model".
We thank the reviewer for their positive feedback. Below we address the main comments:
-Regarding the infinite-width limit: First, it is important to emphasize that our results are not asymptotic, and we show a gap for finite teacher width, not only in the limit of infinite teacher width. Indeed, we also discuss this limit, and this is where the gap is largest. Even in this limit, this is still very much a weak-to-strong setting: the teacher becomes larger, but the student is proportionally even larger. We actually consider this the more interesting and relevant setting, as it captures using an already very large teacher (like GPT2 or 3, which, as statistical models go, already has an extremely large number of parameters, well in the "high dimensional" regime) to train an even larger student (e.g. GPT4). In short, we always consider students which are much larger than their teachers, and even when we take both widths to be very large, the student remains much larger than the teacher.
-Regarding the Gaussian Universality Ansatz: First, we want to emphasize that this is only used in the second part of Theorem 3.1, in order to easily get a lower bound on the teacher, and in a sense is not crucial for the main weak-to-strong effect, which is the improvement in the student as specified in the first part of Theorem 3.1. Regarding the difference between the ReLU setting with the Ansatz and the linear setting: the main difference is that the ReLU network imposes a particular structure on the spectrum of the kernel, which is different from that of the linear model. The difference in exponent is due to the difference in this spectral structure. We can expand on this in the final version of the paper.
-Regarding Section 5: As the reviewer pointed out, the goal of Section 5 is to explain how weak-to-strong generalization happens in this case. The main takeaway is understanding how the trajectory of the student predictor behaves in function space. This can also be seen from the closed-form equation of GF, i.e. Equation (13) on page 7. Namely, once we understand the directions along which GF (and approximately GD) shrinks, and the fact that the shrinkage is faster for directions with larger covariance, we can explain the weak-to-strong phenomenon in this case. As it turns out, the relevant directions are eigendirections: the coordinate directions for linear networks and spherical harmonics for ReLU networks. How fast the shrinkage happens is determined by the eigenvalue corresponding to the given eigendirection. Therefore, if there is a big eigengap between the top eigenvalues and the remaining ones, we can learn the teacher predictor along the top eigendirections much quicker than along the remaining ones, effectively zeroing out the signal along those remaining directions. We will update Section 5 to clarify this in our final version.
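To make the shrinkage picture concrete, here is a generic closed form for gradient flow on a population least-squares objective (a standard identity, written here only as a stand-in for the paper's Equation (13), whose exact form is not reproduced):

```latex
% Gradient flow on a population least-squares objective, with kernel/covariance
% eigendecomposition \Sigma = \sum_i \lambda_i u_i u_i^\top, teacher predictor f_T,
% and the student started at zero:
\[
  f_t \;=\; \sum_i \bigl(1 - e^{-\lambda_i t}\bigr)\,\langle f_T, u_i\rangle\, u_i ,
\]
% so the component along $u_i$ is learned at rate $\lambda_i$. If $\lambda_K \gg \lambda_{K+1}$,
% there is a window of stopping times $t$ with $\lambda_K t \gg 1$ but $\lambda_{K+1} t \ll 1$:
% the teacher's top-$K$ components are fully picked up while its remaining (spurious)
% components are still essentially zeroed out.
```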
Thank you for the responses. Overall, I appreciate the valuable insights presented in this work. But I believe the paper would benefit from a careful revision to improve the presentation and correct typos. I will maintain my current evaluation.
The paper investigates weak-to-strong generalization by studying random feature networks. A set of experiments and a series of proofs are presented which give evidence that weak-to-strong generalization occurs in random feature networks, with student models having significantly smaller error than the teacher model, giving a framework from which further studies can be conducted.
Questions For Authors
What datasets are the experiments investigated on?
Claims And Evidence
I find the experimental evidence to be of good quality, and it appears that multiple seeds are run for each architecture and setting investigated.
Methods And Evaluation Criteria
Yes, but I would like to see some investigations on conventional datasets.
Theoretical Claims
I took a brief look at the proofs, which are laid out clearly, but I am uncertain as to correctness.
Experimental Design And Analyses
The experiment design is good. The experiments are done with 5 seeds. I am a bit confused about what the inputs are, whether they are images, text, or random numbers, so it would be good if this were clarified wherever the experiments are detailed.
Supplementary Material
Yes, the experimental detail.
I took a brief look at the proofs, which are laid out clearly, but I am uncertain as to correctness.
Relation To Broader Scientific Literature
A set of results on random feature networks for weak-to-strong generalization is presented.
Essential References Not Discussed
n/a
Other Strengths And Weaknesses
n/a
Other Comments Or Suggestions
line 387 -> "firsy"
line 272 -> "Netowkr"
line 201: punctuation
line 270 -> "simmulation"
Grammar is odd in many parts.
There is a lack of a conclusions section, making this paper appear incomplete.
We thank the reviewer for their positive feedback.
Full details on the data distributions used in the simulations are provided in the figure captions. These are synthetic data distributions that match Theorems 3.1 and 3.2. More explicitly:
In Figure 2, we use ReLU networks (the specific input dimension is given in the caption). The student has a large number of hidden units, and the different curves correspond to teachers with different numbers of hidden units. The inputs are uniformly distributed on the unit-radius sphere and the target is given in terms of a unit-norm vector β.
In Figure 3, we consider linear networks where the number of hidden units for the student is fixed, and teachers have varying numbers of hidden units (the specific values are given in the caption). The input dimension depends on the teacher width, as in Theorem 3.2. The inputs follow a non-spherical Gaussian distribution with the covariance spectrum of Theorem 3.2, and the target is linear.
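For concreteness, a minimal sketch of how synthetic data of this kind could be generated; the dimensions, spectrum, and target below are placeholder choices, since the exact values appear only in the paper's figure captions:

```python
# A hedged sketch of how synthetic data of this kind could be generated for the
# linear-network experiments (Figure 3). The input dimension, covariance spectrum,
# and target below are placeholders; the exact values are in the figure captions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 10000                                    # placeholder sizes
eigvals = np.arange(1, d + 1, dtype=float) ** -2.0   # a non-spherical, decaying spectrum
X = rng.normal(size=(n, d)) * np.sqrt(eigvals)       # Gaussian inputs, diagonal covariance
beta = np.zeros(d); beta[0] = 1.0                    # a linear target (placeholder direction)
y = X @ beta
```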
The paper analyzes how weak-to-strong generalization can emerge in random feature models when the student model is trained using gradient descent with early stopping. It demonstrates that the student model can significantly outperform the teacher model, with the student’s error being much smaller, even to the extent of being a power of the teacher’s error.
For two-layer ReLU networks, the study shows that the gap between the student and teacher models can be arbitrarily large, meaning the student can have a much smaller error than the teacher. Specifically, the student's performance can be better by an arbitrarily large factor, and the Performance Gap Recovered (PGR), as defined by Burns et al. (2023), can approach 1.
In the case of two-layer linear networks learning linear target functions with a specific input distribution, the student's error is shown to be asymptotically smaller than the teacher's error, scaling roughly as the square of the teacher's error.
The paper explores the limit of weak-to-strong generalization and proves that for random feature models, the largest possible gap occurs in the linear network case, where the student’s error is at least quadratic in the teacher’s error. It also highlights that weak-to-strong generalization cannot arbitrarily improve a very weak teacher model, and for near-zero student error, the teacher's error must also be small.
The authors explain that early stopping during gradient descent helps the student model by eliminating spurious noise in higher-order eigen-components introduced by the weak teacher model, thereby enabling the student to better align with the leading components of the target.
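To illustrate this mechanism, here is a small numerical sketch (not taken from the paper; all eigenvalues and coefficients are made up) of how early stopping trades off picking up the target's leading components against fitting the teacher's spurious tail components:

```python
# A small numerical illustration (not from the paper) of the mechanism described above:
# in the eigenbasis of the student's kernel, gradient flow picks up the i-th component
# of the teacher's predictor with factor (1 - exp(-lambda_i * t)). If the target lives in
# the top eigendirections and the teacher's error lives in low-eigenvalue directions,
# the student's test error dips at an intermediate stopping time. All numbers are made up.
import numpy as np

lam = np.array([1.0, 1.0, 1.0] + [1e-3] * 20)             # big eigengap after the top 3
f_star = np.array([1.0, 0.5, 0.2] + [0.0] * 20)           # target supported on the top 3
teacher = f_star + np.concatenate([np.zeros(3), 0.3 * np.ones(20)])  # spurious tail

def student_error(t):
    learned = (1.0 - np.exp(-lam * t)) * teacher           # GF student at time t, in the eigenbasis
    return np.sum((learned - f_star) ** 2)

teacher_error = np.sum((teacher - f_star) ** 2)
for t in [1.0, 5.0, 20.0, 1e3, 1e5]:
    print(f"t={t:>8}: student error = {student_error(t):.4f} (teacher error = {teacher_error:.4f})")
# The error first drops well below the teacher's, then climbs back as the spurious
# tail components are eventually fit (t -> infinity recovers the teacher itself).
```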
Questions For Authors
-
Is there any intuition behind the fact that the linear network achieves the limit of weak-to-strong generalization in Theorem 4.1?
-
For Theorems 3.1 and 3.2, as the teacher width approaches infinity, what is the regime of the student width? Specifically, is it linear or supra-linear in the teacher width?
-
What are the potential technical difficulties when the authors analyze the models with finite data during the training of the teacher model?
-
In the case of linear networks, what are the differences and similarities between Ildiz et al. 2024 and this paper?
Claims And Evidence
The authors clearly prove all of their claims in the paper. In addition, the authors provide intuitions about how weak-to-strong generalization and superalignment occur in Section 5.
Methods And Evaluation Criteria
This paper is a theoretical paper where the authors prove their contributions in formal statements. The authors provide two different experimental settings, using 2-layer ReLU networks and linear networks. The experimental settings make sense in general and verify the theoretical results.
Theoretical Claims
I checked Appendix C, where the authors analyze Linear Networks; and I checked the arguments utilized in Appendix D, where the authors analyze the 2-layer ReLU networks.
Experimental Design And Analyses
N/A
Supplementary Material
N/A
Relation To Broader Scientific Literature
The authors analyze two different models (2-layer ReLU networks and linear networks) in the context of weak-to-strong generalization, which was proposed in Burns et al. (2023). This paper provides a clear statement of when (based on the stopping criterion) and how (implicit regularization induced by early stopping) weak-to-strong generalization appears in random features. Given the previous works, I think this is the paper that most clearly explains the super-alignment behavior.
Essential References Not Discussed
The authors discussed the previous works on weak-to-strong generalization under the sections on empirical and theoretical works. However, I think there are missing previous works related to the technical part of the paper. The related work on random features is weak, and there is no related work on gradient flow dynamics as far as I can see. I think the authors should discuss the technical parts of their work with respect to previous works.
Other Strengths And Weaknesses
Strengths
-
The paper introduces a detailed framework for understanding how weak-to-strong generalization can emerge when the student model is trained using gradient descent with early stopping. This adds valuable theoretical insights into how a student model can outperform a teacher model, even with weaker training data or less optimal settings, which has significant implications for machine learning.
-
The paper provides strong, rigorously proven results, such as showing that the gap between the student and teacher models can be arbitrarily large.
-
The explanation of how early stopping helps the student model by eliminating spurious noise in higher-order eigen-components is a valuable contribution.
-
The authors also discuss the limit of weak-to-strong generalization and they found that the limit is achieved with Linear Networks.
Weaknesses
-
The training of the student and teacher models is done with population data. For the student model, this is acceptable since the dataset is obtained using the teacher model. However, the authors analyze the training of the teacher model only with population data. I am curious about the potential theoretical drawbacks of analyzing finite data in the training of the teacher model. It might be possible to analyze the linear networks with finite data for training the teacher model.
-
I noticed lots of typos in the paper. I think it would be better if the authors carefully check the typos.
-
There is no conclusion section in this paper. I think the authors should add a conclusion section where they can discuss the limitations and the future works of this paper.
-
There are some missing references regarding dynamic flow analysis.
Other Comments Or Suggestions
I noticed some of the typos in the paper:
-
generalization in line 105 column 1.
-
trained in line 149 column 1.
-
dependence line 182 column 2.
-
random in line 234 column 1.
-
network in line 271 column 2.
-
coordinates in line 310 column 1.
-
quantities in line 332 column 2.
We thank the reviewer for their positive feedback. Below we address the main comments:
-Regarding linear networks achieving optimal exponent: With the linear network model, we choose the covariance directly and so can impose an extreme spectral structure, namely a huge eigengap. This makes our learning algorithm (GF with early stopping) the ideal truncating algorithm — zeroing out all projections onto the eigendirections with low eigenvalues but exactly keeping the projections into the eigendirections with large eigenvalues. This turns out to be sufficient for matching the lower bound. But ReLU networks with spherical Gaussian random weights have a specific spectral structure, which does lead to significant weak-to-strong behavior (as we analyze), but is not as extreme.
-Regarding the relevant limits: We take the student width to infinity first, so the teacher width should be thought of as small in comparison. Our results do not use the simultaneous limit, i.e. taking the student and teacher widths to infinity together.
-Regarding analysis with finite data: It is certainly possible to obtain similar results also when the teacher is trained with a large but finite amount of data. This is possible with a large enough amount of data such that standard generalization guarantees ensure the teacher is close to its population optimum. There isn't a significant technical difficulty here, just bookkeeping, additional terms to track, and a corresponding complication in the theorem statements and proofs. We chose not to get into this complication as we felt it does not provide additional insight (the separation and gaps would be the same), would unnecessarily increase the complexity of the presentation, and that a weak-to-strong effect even when the teacher has infinite data (and so is less weak) is in a sense even stronger.
-Ildiz et al also study a two-stage learning process, first training a linear teacher and then a linear student based on the teacher. But the effect they show is that of "distillation", i.e. how training the student based on teacher labels is better than training the student based on real labels. This is different from a weak-to-strong effect, where we compare the student trained on teacher labels to the teacher itself. In their Section 3.1 and Proposition 3, Ildiz et al do study a small teacher that, like our weak teacher, has fewer features than the student. But what they show is very different: even in this subsection (which they title "weak to strong"), they show how a student trained on teacher labels can be better than a student trained on real labels (as opposed to weak-to-strong, which is: the student trained on teacher labels is better than the teacher). Even in their setting of Section 3.1, it is NOT the case that the student trained on teacher labels is better than the teacher, i.e. there is no "weak to strong" effect in the sense of Burns et al. The type of distillation benefit they show (which, we again emphasize, is different from the weak-to-strong effect we study) is possible due to the following differences from our work:
--Ildiz et al train with either ERM or the min-norm interpolating predictor, not with early stopping. We emphasize that early stopping is crucial for weak-to-strong generalization. What they can show is that the student trained on teacher labels can be better than the student trained on real labels if the training data is finite, but this does not mean the student can be better than the teacher. In fact, if we increase the amount of data to infinity, the student will eventually become the same as the teacher, while in our setting early stopping enables the student to perform better than the teacher.
--The teacher uses a carefully selected subset of features, as opposed to random features.
Similarities include the use of linear models, the two-stage learning process (though without early stopping) and training the teacher model on infinite data.
-Regarding random features and gradient flow: Our main claim for random features factors into Theorems 6.5 and 6.9 and is a statement about how the random features (in the teacher's model) project onto the top-S eigenbasis. For this, we apply a general Hilbert space concentration result to a carefully constructed Hilbert space to prove this statement. We are not aware of prior works with a similar approach. For the gradient flow references, we agree that it is appropriate to include additional works. We will add references in the final version, including "A Continuous-Time View of Early Stopping for Least Squares" by Ali et al 2019 and "Continuous vs. Discrete Optimization of Deep Neural Networks" by Elkabetz et al 2021.
I want to thank the authors for their response. They addressed most of my concerns, and I understand their explanation of the differences from Ildiz et al. 2024. However, the authors did not discuss the limitations and future directions of their work; this might be because of the character limitations during the rebuttal. Since I am responding late, I suggest the authors add such a section in the final version if the paper is accepted. I am maintaining my positive score.
This paper studies the phenomenon of weak-to-strong generalization in a random feature model: a weak (= fewer parameters) teacher gives labels to a strong (= more parameters) student, and the student can outperform the teacher. Specifically, the paper provides upper bounds on the performance of the student: the student's error scales roughly as the teacher's error to the power 1.49 for ReLU under the Gaussian Universality Ansatz (which is pretty common in related work), and as the square of the teacher's error in the linear case. Remarkably, the paper also gives a lower bound of quadratic order. Furthermore, the results are interpreted through the lens of early stopping.
All reviewers are positive about the paper, with varying levels of enthusiasm. The feedback of the authors has also been well received, addressing most concerns. There is consensus towards accepting the paper, and I concur. The results are rather strong, interesting and insightful; this is a solid contribution to ICML and I recommend accepting it.
It has been pointed out that the presentation could be improved and a number of typos should be corrected, and I encourage the authors to do a careful round of polishing when preparing the camera-ready.