From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning
This work presents a multi-scale adaptive theory to understand feature learning in neural networks, bridging kernel rescaling and kernel adaptation across different scaling regimes and providing insights into directional feature learning effects.
Abstract
Reviews and Discussion
The paper addresses the problem of Bayesian learning with a wide two-layer linear network. It develops a formalism which bridges between prior works and, in particular, covers in a unified manner the rich (mean-field) and lazy (standard-parameterization) regimes. The analysis consists in rephrasing the problem as the computation of the Legendre transform of a generating function, which can be approximated using expansions in the various considered regimes. The theory is corroborated by numerical simulations across the various scalings. Notably, the mean predictor is found to coincide with that of a rescaled NNGP kernel. However, the covariance of the predictor is anisotropic and reveals feature learning.
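Schematically, and in my own generic notation rather than necessarily the paper's, the route is: compute the cumulant-generating function of the network outputs under the posterior, Legendre-transform it into an effective action, and read off the predictor statistics from its stationary point and curvature.

```latex
% Generic sketch of the field-theoretic route (my notation, not the paper's):
\begin{align*}
  W(j) &= \log \int \mathrm{d}\theta \; p(\theta \mid \mathcal{D}) \,
          e^{\, j^{\top} f_{\theta}}
       && \text{cumulant-generating function of the outputs} \\
  \Gamma(\bar f) &= \sup_{j} \big[ j^{\top} \bar f - W(j) \big]
       && \text{Legendre transform: effective action} \\
  0 &= \frac{\partial \Gamma}{\partial \bar f}\bigg|_{\bar f = \langle f \rangle},
  \qquad
  \operatorname{Cov}(f) = \big[ \nabla^{2} \Gamma(\langle f \rangle) \big]^{-1}
       && \text{mean and covariance of the predictor}
\end{align*}
% Tree-level and one-loop approximations then correspond to expanding
% this object to successive orders in the output fluctuations.
```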
Questions for the Authors
I do not have further questions for the authors.
Claims and Evidence
The theoretical claims are supported by numerical experiments in all figures. One possible source of concern is that the results of the proposed approach at one-loop level and those of the rescaling approach (Li and Sompolinsky, 2021) are very close in Figs. 1 and 3, making it hard to tell whether one matches the numerical experiments significantly better than the other. It would be informative, if possible, to provide a figure in a setting where the two approaches differ more markedly.
Methods and Evaluation Criteria
The paper is primarily theoretical in nature. All setups considered are stylized. As a minor suggestion, since to the best of my awareness the theory makes no distributional assumptions, would it be feasible to include experiments on simple real datasets, e.g. MNIST? If not, what are the barriers?
Theoretical Claims
I did not check the technical parts in detail.
I have a few questions regarding the theoretical parts, which I list below.
- Regarding Fig. 4: from which expression is the curve for the rescaling approach (pink) evaluated? I have only found a computation of the predictor mean and variance in the Appendix, but it is possible I might have overlooked a part.
- Does the equivalence with the rescaling approach in the proportional regime hold for all or just for ?
Experimental Design and Analysis
I have not identified any particular issue with the experiments.
Supplementary Material
I did not check carefully the supplementary material.
Relation to Prior Work
The main contribution of the paper is to reach a formalism holding across various previously considered scaling limits, in particular the lazy (e.g., Li and Sompolinsky, 2021; Pacelli et al., 2023) and rich limits of Bayesian learning. This allows the authors to recover a number of previous expressions, e.g. those of Seroussi et al. (2023). Appendix A.4 in particular connects their formalism with that of Li and Sompolinsky (2021). As such, I believe it could be of interest to researchers working on this topic.
As a comment, the paper focuses on shallow, linear networks. I believe this is an important qualification which should at least be mentioned in the abstract and main contributions, as this limitation is not shared by all the related works.
Essential References Not Discussed
The authors crucially do not mention (Van Meegen and Sompolinsky, Coding schemes in neural networks learning classification tasks, 2024), despite its close overlap in topic and in some of its claims. This reference is not a concurrent work. For example, the fact that the mean predictor of linear networks in the rich regime still coincides with a rescaled NNGP kernel is already observed in this reference, yet is not mentioned in the manuscript under review.
Other Strengths and Weaknesses
I believe the results can be of interest, as they connect a number of prior works, but have limited confidence in my assessment. Due to the concerns I listed above, and more crucially the key missing reference, I am in favor of not accepting the current manuscript, but am happy to increase my score if the authors address and clarify my questions, and revise the manuscript accordingly.
Other Comments or Suggestions
I do not have further comments or suggestions.
We thank the reviewer for their overall positive feedback and are confident that we can address the raised points below.
Application to real-world data sets
The theory indeed applies to arbitrary data, since it only requires the input kernel without specific assumptions on the task. While we focused on linearly separable tasks in the initial version of the manuscript, we add results for MNIST in the revised manuscript (see Fig. 1 in Supplement). Further, we extend the formalism to include non-linear activation functions in an approximate manner (see reply to reviewer YLkj and Fig. 2 & 3 in Supplement).
Relation to van Meegen & Sompolinsky 2024
We thank the reviewer for pointing out that we accidentally failed to cite van Meegen & Sompolinsky (2024). We sincerely regret this oversight. The main distinction is their mean-field scaling of the readout synapses, which differs from ours. In the proportional limit, they thus effectively consider a “super mean-field scaling”. As a result, not only the network output but also the posterior readout weights concentrate, justifying an additional saddle point approximation on the readout weights. The latter enables them to find spontaneous symmetry breaking, the mechanism behind the seminal finding of coding schemes they describe. We will explain this relation in a revision. To expose it on a formal level, we derive their theory in our framework (see Section 5 in Supplement) and will add it as an appendix in a revision. We share the reviewer's assessment of the high relevance of van Meegen & Sompolinsky (2024) as a crucial work in the field and a highly relevant reference for our manuscript. We will include it prominently in the following places:
• p. 1, l. 31ff: "However, the NNGP does not capture FL, which emerges at finite network width as well as in the proportional limit, where both network width and sample size jointly tend to infinity (Li & Sompolinsky, 2021), or in certain scaling regimes (Yang et al., 2024). Hence the NNGP also fails to capture the networks' nuanced internal representations that arise from feature learning (vMS, 2024)."
• p. 2, l. 106: “In contrast, adaptive theories of FL (Roberts et al., 2022; Seroussi et al., 2023; Bordelon & Pehlevan, 2023; Fischer et al., 2024b; vMS, 2024) ..."
• p. 3, l. 121 "(vMS, 2024) considers the proportional limit in a regime where weight variances scale as (see Appendix for a derivation of their theory in our framework and a comprehensive comparison of the approaches)."
• p. 4, l. 166 "Following along the lines of (Segadlo et al. 2022a, Fischer et al., 2024b, vMS, 2024)..."
• p. 6, l. 324 "...and adaptive theories (Naveh & Ringel, 2021a; Seroussi et al., 2023; Fischer et al., 2024b; Rubin et al., 2024, vMS, 2024)"
Similarity of mean-predictors in the rescaling and the adaptive theory
We agree that the mean predictors of the adaptive and the rescaling theory in Fig. 1 and Fig. 3 of the manuscript are close. This point is precisely the question we target in this manuscript: How can these a priori very different theories be reconciled? We therefore show explicitly in Section 5 that the rescaling theory can be derived from the adaptive theory for the mean predictor. We then show in Section 6 that, beyond mean predictors, the adaptive theory captures directional aspects of feature learning that escape rescaling theories. In a revision we will strengthen this point by demonstrating that only the adaptive theory captures the emergence of structure in the kernels driven by coherent statistics between input and target (see Fig. 4 in Supplement). We will also show that for nonlinear networks the predictions of the different theories differ qualitatively (see reply to reviewer YLkj and Fig. 3 in Supplement).
Reviewer's specific questions
• In Fig. 4, the curves are computed from Eq. (27) with Eq. (29) for the rescaling theory. The variance for the rescaling theory is calculated in Eqs. (116)-(118) in App. A.4.
• While the tree-level solution and its equivalence to rescaling apply only to , the one-loop theory, and thus also the equivalence to rescaling discussed in Section 5, applies to the full regime. Note that this equivalence is restricted to the mean predictor, while the adaptive theory captures additional aspects, such as directional feature learning in the variance terms (cf. Section 6).
In a revision we will clarify these points in the main text.
I thank the authors for their detailed and exhaustive answer to my concerns and questions, and the additional plots and discussion which markedly strengthen the paper. In light of this, I am in favor of acceptance and updated my score accordingly. I think the paper will be of interest to the community. I wish to reiterate I have not read the technical derivations in detail, so please use my evaluation in light of this caveat.
This paper aims to provide an account of feature learning in linear Bayesian neural networks that bridges the gap between results emphasizing isotropic rescaling and anisotropic reshaping of kernels.
Update after rebuttal
After the authors' rebuttal, I weakly recommend acceptance.
Questions for the Authors
I'm curious about the computational complexity and numerical stability of evaluating the one-loop corrections, as these have been issues for previous perturbative approaches to feature learning (see the discussion in Bordelon & Pehlevan 2022). Could you provide some comments along these lines, at least in the Supplement? These issues could pose particular challenges in nonlinear networks, where higher moments of activations cannot be computed analytically. This is a key advantage of the Bayes-optimal setting as studied by Cui et al.: there, the nonlinear case is not so much more challenging.
Claims and Evidence
Most of the authors' claims are well-supported, but there is room for improvement of clarity. I elaborate on this issue in various places below.
One issue I will note up front is that the authors should emphasize in the Abstract and Introduction the fact that they focus on linear networks.
Methods and Evaluation Criteria
The analytical methods used are standard in statistical field theory, and are appropriate here.
Theoretical Claims
This is a physics paper using standard methods from statistical field theory, and the derivations appear correct (though I have not checked them line-by-line).
Experimental Design and Analysis
The experiments largely seem well-executed, but the choice of datasets is a bit disappointing. The paper would be enhanced by some experiments on real data, even subsets of MNIST or CIFAR-10 as in Baglioni et al. PRL 2024, Zavatone-Veth, Canatar et al. NeurIPS 2021, or Seroussi et al Nat Comm 2023. I don't expect experiments at the scale of Izmailov et al 2021 (https://arxiv.org/abs/2104.14421), but the authors should be aware that such things are possible.
Supplementary Material
I have quickly skimmed the Supplementary Material, and have not checked the derivations step-by-step.
Relation to Prior Work
I think the paper generally does a good job of situating itself relative to past works on the statistical physics of Bayesian neural networks, though the accuracy of its citations to past works could be improved (see below).
Essential References Not Discussed
The discussion of related works in Section 2 is not complete, and incorrectly describes some of the works which are cited.
-
The claim of the "often superior performance of finite-width networks" made in the opening paragraph of Section 2 is out of date in the context of gradient-descent-trained networks, thanks to work on mean-field parameterizations where the infinite-width limit can be taken without losing feature learning. See for instance Yang et al. (NeurIPS 2021) https://arxiv.org/abs/2203.03466, or Vyas et al. (NeurIPS 2023) https://arxiv.org/abs/2305.18411. Here, larger networks generally perform better, as increasing width suppresses spurious predictor variance due to initialization.
-
The citation to Canatar & Pehlevan (2022) on Lines 134-135 is somewhat misplaced, as they use SGD in their experiments and do not take an explicitly Bayesian perspective.
-
Lines 136-140 do not make clear the close relationship between Zavatone-Veth & Pehlevan (2021) and Hanin & Zlokapa (2023): the former paper characterizes the prior in terms of Meijer G-functions, and the latter leverages that result to characterize the variance in the predictor at zero temperature. The authors should also cite Hanin & Zlokapa's recent perturbative extension of that work to nonlinear networks in https://arxiv.org/abs/2405.16630.
-
Cui et al. (2023) arose as an attempt to rigorously justify the Gaussian approximation of Li & Sompolinsky (2021) in a mathematically tractable setting. Also, Maillard et al. (2024) is in a sense a follow-up to this work. A key feature is that these works allow one to study nonlinear networks and tasks.
-
I think the authors should make a greater attempt to connect their results in the linear case to those of Aitchison (2020), whose work emphasizes the adaptive aspect of feature learning and is exact in the limit he considers. Also relevant in the linear case is Zavatone-Veth & Pehlevan (https://arxiv.org/abs/2111.11954), who derive exact expressions for the predictor mean and covariance, generalizing Li & Sompolinsky and Aitchison.
-
When discussing the properties of linear networks in the proportional limit, the authors should reference Hanin & Zlokapa (2023) as well as Zavatone-Veth, Tong, & Pehlevan PRE (2022; https://arxiv.org/abs/2203.00573), both of which furnish sharp asymptotics for the generalization error. The latter paper is also relevant because it compares that performance against sharp asymptotics for a random feature model of matched architecture.
-
The authors should cite and discuss van Meegen & Sompolinsky (https://arxiv.org/abs/2406.16689), which considers mean-field parameterization. Here there is a matrix-valued order parameter, and the authors find drastically different properties of the posterior depending on the nonlinearity.
Other Strengths and Weaknesses
-
This is partially a matter of taste, but I think the clarity of the paper would be improved by giving it a title that more clearly states what you actually have done. For the same reason, I am not in favor of the choice to describe the results as a "multi-scale adaptive theory". It would be more illuminating to state in plain words what you have done: a standard expansion in fluctuations of the predictor.
-
The clarity of the paper could be improved. In particular, there are many overly long sentences, like that starting with "Another line of work..." on Line 161.
-
In the Introduction, the authors should more clearly emphasize that they focus on linear networks. I don't view the specialization to linear networks as a flaw, but it does limit the generality of the authors' "Universal theory of train and test statistics" (as they claim in Appendix A).
Other Comments or Suggestions
-
It might be worth noting that the "directional feature learning measure" introduced in eq. (27) is closely related to measures of feature learning based on centered kernel alignment, but differs in that those measures usually consider hidden-layer statistics of the trained network rather than output statistics (a minimal illustration follows these comments).
-
Following on my earlier comment about writing: the sentence "The appearing matrix product allows a non-trivial change of the NNGP kernel in certain meaningful directions, yielding additional insights" in Lines 362-364 is not meaningful, because it does not clarify what those insights are.
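To make the first comment above concrete, here is a minimal sketch of centered kernel alignment between a kernel matrix and the rank-one target kernel; this is the hidden-layer-based cousin of the output-statistics measure in eq. (27). The function and variable names are mine, not the paper's.

```python
import numpy as np

def centered_kernel_alignment(K, y):
    """Centered kernel alignment between a P x P kernel matrix K and the
    rank-one target kernel y y^T (y of shape (P,)). Illustrative only:
    the paper's eq. (27) instead uses output statistics of the network."""
    P = K.shape[0]
    H = np.eye(P) - np.ones((P, P)) / P       # centering matrix
    Kc, Kyc = H @ K @ H, H @ np.outer(y, y) @ H
    num = np.sum(Kc * Kyc)                    # Frobenius inner product
    return num / (np.linalg.norm(Kc) * np.linalg.norm(Kyc))

# Example: alignment of a linear input kernel with binary labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.sign(X[:, 0])
print(centered_kernel_alignment(X @ X.T / X.shape[1], y))
```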
We thank the reviewer for their feedback.
Claims and Evidence
We will add experiments on MNIST in a revision (see Fig. 1 in Supplement). We also extend the formalism to non-linear activations and include experiments (see Fig. 2 & 3 in Supplement; for details, also see reply to point 2 to reviewer YLkj and Sections 2 & 3 in Supplement).
Choice of title
We regard the formalism that is valid across all scaling regimes as a main merit of our work. Adequate fluctuation corrections are one important technical aspect, but we fear that this term would not make for an easily comprehensible title; instead, we will clarify in the abstract and introduction that we perform systematic expansions in the output fluctuations. We further add a derivation of the theory by van Meegen & Sompolinsky (2024) within our framework: it results from an additional saddle point approximation that is justified in their scaling (see Section 5 in Supplement). Thus our work covers the entire range of scalings and explains the relation between kernel rescaling and kernel adaptation.
Improvement of accuracy of citations
We are glad to hear that the reviewer in general agrees to our presentation of prior work and are grateful for the detailed and helpful suggestions to improve this section further:
• We agree that feature learning arises either at finite width or in infinite-width limits different from the NNGP limit; we will change l. 122 to 'However, the NNGP cannot explain the often superior performance of finite-width networks [...], requiring either the inclusion of finite-width effects or different infinite-width limits such as μP scaling (Yang et al., 2021; Vyas et al., 2023)'.
• We agree that Canatar & Pehlevan (2022) is better placed below l. 129 as: 'These differ in the choice of order parameters considered and also in the explained phenomena. An experimental investigation of kernels in feature learning in gradient descent settings was performed by Canatar & Pehlevan (2022).'
• We will include Hanin & Zlokapa (2024) in the list of perturbative approaches in l. 149.
• We will change the citation of Cui et al. (2023) in l. 139ff. to "Cui et al. (2023) study non-linear networks [...]". Concerning Maillard et al. (2024), we feel that the current citation is appropriate since this work considers a different scaling in the amount of training data.
• We already cite Aitchison (2020) in l. 121; we will further include it in the enumeration of adaptive approaches, e.g. in l. 127. The additional reference of Zavatone-Veth & Pehlevan (2021b) will be included as “Zavatone-Veth & Pehlevan (2021b) investigate deep linear networks in different proportional limits, recovering the results from Li & Sompolinsky in an adaptive approach.”
• We already cite Hanin & Zlokapa (2023) in l. 138 and will add there: 'Zavatone-Veth, Tong & Pehlevan (2022) study the same setting but consider explicit models on the input data in the limit of infinite pattern dimension.'
• In the revised manuscript, we discuss the relation to van Meegen & Sompolinsky (2024) in the related works and show how to obtain their theory within our framework. For details please also see our reply to reviewer QcC4.
Computational complexity and numerical stability
Solving the tree-level self-consistency equations and the one-loop self-consistency equations requires numbers of operations that differ only by a pre-factor. Note that neither the input dimension nor the network width, but only the number of training samples, affects the computational complexity.
A naive implementation of the self-consistency equations indeed faces stability problems. Our stable implementation of the one-loop equations first solves the tree-level equations with the NNGP as initial value and then uses the tree-level solution as the initial value for the one-loop equations. The tree-level and one-loop equations are more unstable in the mean-field regime; we resolve this by annealing solutions from the standard to the mean-field regime. We will include this in App. B of a revision.
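A minimal sketch of this warm-started and annealed fixed-point scheme is given below; the update functions as well as the damping and annealing values are placeholders, not the exact routines used for the figures.

```python
import numpy as np

def fixed_point(update, K_init, damping=0.5, tol=1e-8, max_iter=10_000):
    """Damped fixed-point iteration K <- (1 - d) * K + d * update(K)."""
    K = K_init.copy()
    for _ in range(max_iter):
        K_new = (1 - damping) * K + damping * update(K)
        if np.max(np.abs(K_new - K)) < tol:
            return K_new
        K = K_new
    raise RuntimeError("fixed-point iteration did not converge")

def solve_one_loop(K_nngp, tree_level_update, one_loop_update, anneal_path):
    """Warm-started solution of the self-consistency equations:
    (1) solve the tree-level equations starting from the NNGP kernel,
    (2) initialize the one-loop equations with the tree-level solution,
    (3) anneal along `anneal_path` of scaling parameters from the
        standard towards the mean-field regime."""
    K = fixed_point(lambda K: tree_level_update(K, anneal_path[0]), K_nngp)
    for gamma in anneal_path:
        K = fixed_point(lambda K, g=gamma: one_loop_update(K, g), K)
    return K
```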
Further points
We will carefully incorporate the comments on text improvements in the revised manuscript; in particular in l. 362-364, clarifying that meaningful directions are either the teacher direction or dominant eigendirections of the input kernel that align with the target. We will also explain the additional insights due to this result: encoding of significant directions allows for more effective learning by reducing the sample complexity. Please see the response to reviewer YLkj for details. We will rename App. A to 'General approach to train and test statistics', as it is the common starting point for train and test statistics in linear and non-linear networks across all scaling regimes.
Thank you for your detailed response. I think these revisions adequately address my concerns and those raised by the other referees, so I will raise my score accordingly.
This paper introduces a unifying theoretical framework which connects the kernel rescaling approach with the adaptive kernel approach in the Bayesian setup. The authors demonstrate that both of these approaches can be derived from the same starting point (the network's posterior) but differ in the choice of order parameters in an effective action formulation. Specifically, the paper explores two different initialization scalings:
- Mean-field scaling: In this regime, network outputs concentrate, and feature learning is captured through a tree-level (saddle-point) approximation. While the kernel adapts directionally in principle, for a linear network's mean predictions this adaptation can be effectively approximated by a simple rescaling of the NNGP kernel.
- Standard scaling: Here, fluctuations in network outputs become significant, necessitating a one-loop approximation. The authors demonstrate that in the proportional limit of N and P, even these corrections can be expressed in a rescaling-like form for the mean predictor, although the kernel undergoes non-isotropic modifications in key directions.
Experiments using linear networks on synthetic tasks validate these theoretical insights. Notably, the study underscores how this new framework effectively captures directional feature learning effects in the covariance of network outputs, which conventional rescaling approaches fail to account for.
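For orientation, the rescaling statement for the mean predictor takes the familiar GP-regression form with a scalar-renormalized NNGP kernel; the sketch below uses my own notation, and the self-consistency condition fixing the scalar depends on the regime and is not spelled out here.

```latex
% Schematic form of the kernel-rescaling result for the mean predictor
% (generic notation; s is fixed by a regime-dependent self-consistency condition):
\begin{equation*}
  \langle f(x_{*}) \rangle
  \;\approx\;
  s \, k_{\mathrm{NNGP}}(x_{*}, X)
  \left[ s \, K_{\mathrm{NNGP}}(X, X) + \sigma^{2} I \right]^{-1} y .
\end{equation*}
% The adaptive theory instead allows direction-dependent modifications of the
% kernel, which show up in the covariance of the predictor.
```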
Questions for the Authors
No questions.
Claims and Evidence
Their claims are convincing.
Methods and Evaluation Criteria
Methods: The use of a Bayesian field-theoretic framework is well-justified given the theoretical focus. Restricting experiments to linear networks makes the math tractable. Evaluation Criteria: Measuring both the mean and covariance of the posterior outputs on carefully chosen synthetic tasks is appropriate.
Theoretical Claims
I did not review the supplementary material, where the detailed theoretical analysis is presented. The theoretical results and equations in the main paper look decent.
Experimental Design and Analysis
The authors use synthetic tasks for which analytic predictions exist, train networks using Langevin stochastic gradient descent, and systematically compare theoretical vs. empirical mean and covariance statistics. While the scope is necessarily limited to linear networks and synthetic data (Ising tasks and a simple teacher-student setup), it is appropriate for verifying the paper's key claims about the theory of feature learning across scaling regimes.
Supplementary Material
I did not review the supplementary material.
Relation to Prior Work
This paper bridges separate lines of work of kernel rescaling and kernel adaptation by showing they can be viewed with a field-theoretical approach but differ in the choice of order parameters. This unified viewpoint is a valuable contribution to the literature. Specifically, regarding the kernel rescaling theories, the authors demonstrate why and when the kernel-rescaling perspective is valid (e.g., for the mean predictor in linear networks, especially in the proportional limit). They identify the dimensionality of the order parameter (just a scalar vs. high-dimensional) as the crux behind whether the kernel is rescaled or adapted (to specific directions).
Essential References Not Discussed
I am not aware of papers which should be cited/discussed.
Other Strengths and Weaknesses
Strength: This paper creatively combines statistical physics methods with existing neural network kernel/rescaling theories to derive a unified perspective for kernel adaptation in Bayesian setup. Their numerical experiments with toy tasks beautifully agree with the theoretically derived quantities.
Weaknesses:
1. Unclear practical benefit. While the multi-scale adaptive theory elegantly unifies kernel rescaling and kernel adaptation, its practical utility remains unclear. One would hope for novel or stronger predictive insights, for example phenomena previously unexplained that this theory can now elucidate, but the paper does not provide explicit examples of such breakthroughs.
2. Limited to single-hidden-layer linear networks. The work focuses on single-hidden-layer linear networks, which is a highly constrained setting. It is not evident how to extend the framework to more realistic architectures (e.g., multi-layer or nonlinear) without incurring significant complexity. This limits applicability to modern deep-learning scenarios.
3. Bayesian setup vs. SGD practice. The theory is developed under Bayesian posterior sampling assumptions, but most modern neural networks are trained via (variants of) stochastic gradient descent. It is not straightforward how the results might carry over to typical (non-Bayesian) training regimes, potentially reducing the theory's direct impact on practical deep learning methods.
Other Comments or Suggestions
No suggestions.
We thank the referee for their positive evaluation and for the precise summary. In a revision, we will address the three mentioned weaknesses:
-
Our work aims at a feature learning theory from first principles. We hope that in the long term this theory can be leveraged to build a theoretical foundation for mechanistic interpretability. We will provide three additional results in that direction:
(a) As the theory makes no assumption on the distribution of the data, we will show in a revision that the theory makes accurate predictions for train and test predictor also on more practical datasets (MNIST, see Fig. 1 in Supplement).
(b) We will include evidence of sample complexity reduction in the presence of feature learning in non-linear networks (see Fig. 3 in Supplement). To this end, we consider a teacher-student setting where the teacher is of the form with being the first- and third-order Hermite polynomials (an illustrative data-generation sketch follows this list). In this setting the adaptive theory correctly predicts that the network will learn the non-linear components at , whereas both the NNGP and the rescaling theory capture only the linear component. This result further demonstrates that the adaptive theory captures feature learning phenomena beyond rescaling.
(c) The theory explains the collaborative effect of input and output statistics in shaping the resulting kernel and thus the anisotropy of the predictor's variance. In particular, we show that low-rank structures in the input kernel get amplified by the number of training patterns if the labels contain coherent information (see Section 4 and Fig. 4 in Supplement).
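An illustrative sketch of such a teacher-student dataset: inputs are Gaussian and the teacher combines the first and third probabilists' Hermite polynomials of a one-dimensional projection. The teacher direction w, the unit-norm normalization, and the equal weighting of the two components are assumptions for illustration and need not match the setting in the Supplement.

```python
import numpy as np

def hermite_teacher_data(P, D, seed=0):
    """Hypothetical teacher-student data: Gaussian inputs, targets given by
    He1(z) + He3(z) of a unit-variance projection z = w . x with ||w|| = 1,
    where He1(z) = z and He3(z) = z**3 - 3*z (probabilists' convention).
    The equal weighting of the two Hermite components is an illustrative choice."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, D))
    w = rng.standard_normal(D)
    w /= np.linalg.norm(w)          # unit-norm teacher direction
    z = X @ w                       # unit-variance projection
    y = z + (z**3 - 3.0 * z)        # He1(z) + He3(z)
    return X, y
```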
-
The presented framework is indeed general and allows the inclusion of non-linearities. The difficulty lies in approximating the cumulant-generating function in Eq. (4) of the main text. In a revision, we present two approaches. (a) A cumulant expansion that for point-symmetric activation functions leads to a simple replacement by a different matrix, while general non-symmetric non-linearities yield additional terms (see Section 2 for theory and Fig. 2 for empirics in Supplement). This allows us to obtain the rescaling theories by Li & Sompolinsky (2022) and by Pacelli et al. (2022) within our framework. (b) A variational Gaussian approximation yields results that differ qualitatively from the rescaling approach, leading to a reduction of sample complexity as discussed in the previous point (see Section 3 and Fig. 3 in Supplement).
We will also add an appendix showing how to derive the seminal results by van Meegen and Sompolinsky (2024) on coding schemes within our framework; in brief, their results are valid in the “super-mean-field” scaling for the readout weights, which allows them to make an additional saddle point approximation on the readout weights (see Section 5 in Supplement). We believe that this unification is useful for the community to better compare the different variants of the theories and to understand their limitations.
-
We agree with the referee that the Bayesian approach is in general different from practical ways of training. However, while Bayesian sampling does not exactly correspond to SGD, it can serve as a useful approximation of SGD under certain conditions and with appropriate parameter tuning. This approximate correspondence builds on the noise inherent in SGD, which explores the parameter space in a manner that aligns with Bayesian principles.
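As an illustration of this correspondence, the sketch below shows a generic Langevin update whose stationary distribution is a Bayesian posterior at temperature T. This is the standard discretized Langevin dynamics with a Gaussian-prior (weight-decay) term; the hyperparameter values are illustrative, and it is not necessarily the exact training protocol used for the figures.

```python
import numpy as np

def langevin_step(theta, grad_loss, lr=1e-3, temperature=1.0,
                  weight_decay=1e-2, rng=np.random.default_rng(0)):
    """One step of (full-batch) Langevin dynamics on the parameters theta.
    The drift is the gradient of the loss plus a Gaussian-prior term
    (weight decay); injecting noise of scale sqrt(2 * lr * T) makes the
    stationary distribution the corresponding Bayesian posterior at
    temperature T. Hyperparameter values are illustrative."""
    drift = grad_loss(theta) + weight_decay * theta
    noise = np.sqrt(2.0 * lr * temperature) * rng.standard_normal(theta.shape)
    return theta - lr * drift + noise
```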
All reviewers recommend acceptance. The paper develops a theory of feature learning in Bayesian linear networks that bridges kernel rescaling and kernel adaptation in different scaling regimes. Their approach uses methods grounded in statistical field theory, and provides consistent predictions in both standard and mean-field limits. Numerical experiments on synthetic tasks and MNIST validate the theory. While the framework currently targets linear Bayesian networks, the authors explain how approximate Bayesian sampling can inform SGD-based neural networks. They have also clarified relationships with prior work and extended the analysis to nonlinear activations.