Sharp Generalization for Nonparametric Regression by Over-Parameterized Neural Networks: A Distribution-Free Analysis in Spherical Covariate
Abstract

We show that an over-parameterized two-layer neural network trained by gradient descent (GD) exhibits minimax optimal convergence rates for nonparametric regression, and our results are distribution-free for covariates on the unit sphere.

Reviews and Discussion
This paper studies the generalization bounds of two-layer neural networks under the neural tangent kernel (NTK) regime. Using the critical radius as an error measure, this paper establishes distribution-free generalization bounds for the network, which recover the optimal bounds derived in the previous literature in certain distributional cases. Moreover, this paper also reduces the requirement on the neural network width in the literature using a new error decomposition.
Questions for Authors
- What does "preconditioned" in Section 4 heading "Preconditioned Gradient Descent" refer to?
- It seems that in the literature the EDR of the NTK remains the same for various data distributions. Can you provide concrete examples of other EDRs of the NTK under some distribution?
- Regarding the simulation, what is the performance of the neural network under the theoretical stopping time?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
NA
Theoretical Claims
The proofs seem to be solid, with a clear proof roadmap and justified technical improvements.
Experimental Design and Analyses
NA
Supplementary Material
I have reviewed the proofs in the appendix roughly.
Relation to Existing Literature
This paper improves the results in the literature, establishing distribution-free generalization bounds for two-layer neural networks under the NTK regime, while previous literature mainly focuses on specific distributions. Also, this paper reduces the requirement on the neural network width by using a new error decomposition. The contributions are mainly technical improvements.
Essential References Not Discussed
I recommend adding this paper, which, as far as I know, is the first to study early stopping under the RKHS framework.
Yao, Yuan, Lorenzo Rosasco, and Andrea Caponnetto. “On Early Stopping in Gradient Descent Learning.” Constructive Approximation 26 (August 1, 2007): 289–315. https://doi.org/10.1007/s00365-006-0663-2.
Other Strengths and Weaknesses
Weaknesses
The problem setting seems to be quite simplified in this paper. For example, the neural network is a two-layer network with only the inner weights trainable. Also, while the distribution of the data can be arbitrary, the distribution is required to be supported on the sphere. These could impact the generality of the results.
Other Comments or Suggestions
Minor: the choice between "early stopping" and "early-stopping" should be consistent throughout the paper.
We appreciate the review and its suggestions. The raised issues are addressed below. In the following text, the line numbers refer to the revised paper.
Regarding the weakness, we emphasize that the two-layer neural network studied in this paper still achieves a sharp risk bound, supporting the claim in [Bietti & Bach, 2021] that shallow over-parameterized neural networks with ReLU activations exhibit the same approximation properties as their deeper counterparts.
Regarding the suggested reference and the wording suggestion, we will discuss [Yao et al., 2007] and use "early stopping" consistently in the final version of this paper.
Below are our responses to the questions.
(1) "What does "preconditioned" in Section 4 heading "Preconditioned Gradient Descent" refer to?"
The "preconditioned" in Section 4 heading is a typo, and we have fixed it which will be reflected in the final version of this paper.
(2) "It seems that in the literature the EDR remains to be for varous data distributions. Can you provide concrete examples of other EDRs of the NTK under some distribution?"
Theorem 10 of [Li et al., 2024] suggests that a different polynomial eigenvalue decay rate (EDR) is achieved by learning the bias in a neural network, which differs from the EDR discussed in this paper. In this case, the main results (Theorem 5.1 and Corollary 5.2) can still be applied to obtain the corresponding minimax optimal nonparametric regression risk.
(3) "Regarding the simulation, what is the performance of the neural network under the theoretical stopping time ?"
In the following table we report the minimum test loss and the test loss at the theoretical stopping time of the neural network trained in our simulation study (in Section D of the appendix), with the training data size n ranging from 100 to 1000 in increments of 100. It can be observed that the test loss at the theoretical stopping time is only marginally higher than the minimum test loss for each training data size n, justifying the good generalization of the model at the theoretical stopping time.
| Training Data Size (n) | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Test Loss at Theoretical Stopping Time | 0.3269 | 0.1901 | 0.1735 | 0.1425 | 0.1127 | 0.1058 | 0.0853 | 0.0771 | 0.0769 | 0.0665 |
| Minimum Test Loss | 0.3265 | 0.1889 | 0.1732 | 0.1419 | 0.1098 | 0.1039 | 0.0850 | 0.0771 | 0.0767 | 0.0663 |
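For reference, the two rows above can be produced with a simple logging loop like the sketch below. This is a minimal illustration in Python, where `gd_step` and `test_loss` are hypothetical placeholders for the training step and the test-loss evaluation used in our simulation, not the actual code of Section D.

```python
def stopping_time_comparison(gd_step, test_loss, params_init, t_stop, t_max):
    # Run gradient descent for t_max steps, recording the test loss after
    # every step; return the loss at the theoretical stopping step t_stop
    # and the minimum loss over the whole trajectory (the two rows above).
    params, losses = params_init, []
    for _ in range(t_max):
        params = gd_step(params)
        losses.append(test_loss(params))
    return losses[t_stop - 1], min(losses)
```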
References
[Li et al., 2024] Li, Y., Yu, Z., Chen, G., and Lin, Q. On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains. JMLR 2024.
This paper addresses the generalization capabilities of over-parameterized two-layer neural networks (NNs) trained by gradient descent (GD) with early stopping for nonparametric regression. The authors establish a sharp generalization bound for the nonparametric regression risk. This result is distribution-free, meaning it does not rely on specific distributional assumptions about the covariate, as long as the covariate lies on the unit sphere. The authors also provide minimax optimal rates for specific cases, such as when the eigenvalues of the NTK decay polynomially. The paper contributes to the theoretical understanding of over-parameterized NNs by bridging the gap between classical kernel regression and finite-width NNs trained by GD.
Questions for Authors
- Can the results be extended to deeper neural networks? If so, what additional challenges arise in the analysis?
- The assumption that the target function lies in an RKHS ball may not hold in practice. How robust are the results if this assumption is relaxed? Are there any results for target functions outside the RKHS?
- The paper suggests that a constant learning rate can be used. Are there any conditions under which a varying learning rate might be beneficial, or is a constant rate always sufficient?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes
Experimental Design and Analyses
No
Supplementary Material
No
Relation to Existing Literature
The contributions of this paper are relevant to the deep learning theory community.
Essential References Not Discussed
Some related papers:
- Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks. arXiv preprint arXiv:2407.07670, 2024.
- Mike Nguyen and Nicole Mücke. How many neurons do we need? A refined analysis for shallow networks trained with gradient descent. Journal of Statistical Planning and Inference, page 106169, 2024.
Other Strengths and Weaknesses
Strengths:
- The authors establish sharp generalization bounds for over-parameterized neural networks under weaker assumptions, both on the data distribution and on the eigenvalues of the empirical NTK matrix.
- The proof techniques, particularly the use of local Rademacher complexity and uniform convergence to the NTK, are novel and provide a fresh perspective on analyzing the generalization of over-parameterized NNs.
Weaknesses:
- The analysis is restricted to covariates lying on the unit sphere.
- The results assume that the target function lies in the RKHS associated with the NTK. This assumption may not hold in practice.
- The paper focuses exclusively on two-layer NNs.
Other Comments or Suggestions
- Page 1, line 34, right: "learning learning" should be "learning".
- Page 7, line 348, "limit the number of steps T in for Algorithm": delete "for".
- Page 8, line 409: r or r'?
We appreciate the review and its suggestions. The raised issues are addressed below. In the following text, the line numbers refer to the revised paper.
(1) “Can the results be extended to deeper neural networks? If so, what additional challenges arise in the analysis?”
Yes, this work can be extended to deeper neural networks. The key roadmap is as follows. (1) We can establish the uniform convergence to the NTK of the deeper neural network using the existing uniform convergence results in [Li et al., 2024], and still apply the proof strategy in Theorem C.8 of this paper to decompose the neural network function into a "kernel part" and an error function. (2) Lemma C.9 of this paper can then still be used to bound the local Rademacher complexity of all such neural network functions, and we can use Theorem C.10 of this paper, together with the convergence results for the training loss of deeper neural networks widely studied in the literature such as [Li et al., 2024; Allen-Zhu et al., 2019], to prove a sharp nonparametric risk bound in terms of the critical population rate of the NTK of the deeper neural network.
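Schematically, and in placeholder notation of our own (the precise symbols and error terms are those of Theorem C.8 and are not reproduced here), the decomposition has the form

```latex
f_{\mathrm{NN}}(x) \;=\;
\underbrace{f_{K}(x)}_{\text{kernel part, lying in an RKHS ball of the NTK}}
\;+\;
\underbrace{e(x)}_{\text{error function, controlled by the network width}},
```

so that, per the roadmap above, the kernel part is handled by the local Rademacher complexity bound while the error function is controlled by the network width.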
We would like to emphasize that the two-layer neural network studied in this paper still achieves a sharp risk bound, supporting the claim in [Bietti & Bach, 2021] that shallow over-parameterized neural networks with ReLU activations exhibit the same approximation properties as their deeper counterparts.
(2) "The assumption that may not hold in practice. How robust are the results if this assumption is relaxed? Are there any results for target functions outside the RKHS?"
We would like to point out that our generalization results can be extended to the relaxed case in which the RKHS norm of the target function diverges as the sample size grows. In this relaxed case, the target function is no longer in an RKHS ball of constant radius, which is the setting considered in this paper. We remark that this relaxed case is close to the setup considered in [Bordelon et al., 2025], where the RKHS norm of the target function also diverges. We note that Theorem C.10 still holds with the correspondingly adjusted quantities, and Theorem C.10 with such adjustments can be used to prove the main result, a new version of Theorem 5.1, for this relaxed case. The new version of Theorem 5.1 would have a nonparametric regression risk bound with an additional multiplicative factor depending on the diverging RKHS norm, and this risk bound still converges to zero provided the RKHS norm diverges slowly enough. For example, when the covariate follows the spherical uniform distribution, the risk bound converges to zero under an explicit growth condition on the RKHS norm.
(3) "The paper suggests that a constant learning rate can be used. Are there any conditions under which a varying learning rate might be beneficial, or is a constant rate always sufficient?"
A varying learning rate could be beneficial compared to a constant learning rate in nonconvex optimization, such as the optimization of DNNs, which will be discussed in the final version of this paper. In this paper, we state that the constant learning rate leads to empirically faster convergence for training the neural network than in the literature where an infinitesimal learning rate is used.
Furthermore, we will fix the typos, including the one in line 409 raised in the review. We will also discuss the suggested works in "Essential References Not Discussed" in the final version of this paper.
References
[Li et al., 2024] Li, Y., Yu, Z., Chen, G., and Lin, Q. On the eigenvalue decay rates of a class of neural-network related kernel functions defined on general domains. JMLR 2024.
[Allen-Zhu et al., 2019] Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. ICML 2019.
[Bietti & Bach, 2021] Bietti, A. and Bach, F. R. Deep equals shallow for ReLU networks in kernel regimes. ICLR 2021.
[Bordelon et al., 2025] Bordelon, B., Atanasov, A., and Pehlevan, C. How Feature Learning Can Improve Neural Scaling Laws. ICLR 2025.
This manuscript contributes to the generalization analysis of overparameterized ReLU neural networks (with one hidden layer) in the context of nonparametric regression tasks. The training data are assumed to be such that the covariates are drawn from the unit sphere, while the response is generated by a target function belonging to an RKHS ball, perturbed with random sub-Gaussian noise. Specifically, considering such a neural network trained by gradient descent with a constant learning rate and early stopping, it is proved that the expected risk converges at a rate governed by the critical population rate of the NTK associated with the network and the number of training data. Notably, this result holds without imposing any additional distributional assumptions on the input. As a corollary of the above-mentioned main result, in the special case when the eigenvalues of the integral operator associated with the reproducing kernel decay polynomially, the minimax optimal rate in nonparametric regression can be obtained.
Simulation studies are presented at the end of the Appendix to illustrate that the ratio between the "empirical early stopping time" and the "theoretically predicted early stopping time" remains stable across different values of the training data size n. This stability suggests a proportional relationship between the empirically observed and theoretically predicted early stopping times.
Questions for Authors
- In the paper, the authors presented a special example in which the eigenvalues decay polynomially, so that a convergence rate of the expected risk can be obtained that is considered minimax optimal, as claimed by previous papers like (Yang and Barron, 1999; Yuan and Zhou, 2016). Can the authors provide any other examples as extensions of Theorem 5.1? For example, if the target function belongs to an RKHS ball induced by the Gaussian kernel, what should be the convergence rate of the expected risk?
- I have some concerns about the practicality of implementing the early stopping rule for the algorithm. The early stopping time defined in (8) relies on the empirical kernel complexity. If one does not know which RKHS the target function lies in (or, more specifically, if one has no information on the behaviour of the eigenvalues), how should one estimate the stopping time?
Claims and Evidence
Yes, the claims made in the submission are supported by clear evidence. All theoretical results are backed by proofs that appear correct based on my assessment. I did not identify any problematic claims.
Methods and Evaluation Criteria
The paper primarily focuses on shallow yet overparameterized ReLU neural networks with a single hidden layer, a well-studied setting in theoretical analysis. The hidden layer is trained using gradient descent with a constant learning rate, while the second-layer weights remain fixed. This network architecture and training approach are reasonable choices. While considering deeper neural networks (i.e., those with multiple hidden layers) would better align with practical applications, the use of a shallow network is understandable given the theoretical focus of the study.
Theoretical Claims
Yes, I have reviewed the proofs presented in the paper. Based on my assessment, they appear to be correct.
Experimental Design and Analyses
This paper is a theoretical work and includes only a simple simulation study. It does not include any real-data experiments.
Supplementary Material
I confirm that I have read the majority of the supplementary material.
Relation to Existing Literature
This paper contributes to a deeper understanding of the generalization analysis of overparameterized neural networks with algorithmic guarantees.
Essential References Not Discussed
I am not aware of any essential reference that is missing so far.
Other Strengths and Weaknesses
Strengths:
- I appreciate the presentation of this work. The theoretical proofs are easy to follow. The authors did a good job emphasizing the novelty of their results compared to previous works such as (Hu et al., 2021; Suh et al., 2022), namely achieving the same rate of convergence in nonparametric regression without imposing distributional assumptions on the input covariate as long as it lies on the unit sphere.
- As a follow-up to the previous point, I believe the novelties (from a technical viewpoint) of this work include:
(i) Adopt a new error decomposition in which one component is an error function bounded in terms of the network width and the other lies in a bounded RKHS ball. With this decomposition, the authors no longer require approximating the kernel regressor and no longer need the uniform distributional assumption on the input covariate.
(ii) Establish a reduced lower bound on the required network width compared to the existing literature.
(iii) Use local Rademacher complexity to, in turn, bound the Rademacher complexity of the hypothesis space consisting of all the neural network functions trained by GD.
- Admittedly, studying the theoretical guarantees of training neural networks via gradient descent in the NTK regime is not entirely new. However, overall, I believe this work is technically solid.
For other weaknesses, please refer to the questions below.
Other Comments or Suggestions
See questions below.
We appreciate the review and its suggestions. The raised issues are addressed below.
(1) "Can the authors provide any other examples as extensions of Theorem 5.1? For example, if the target function belongs to an RKHS ball induced by the Gaussian kernel, what should be the convergence rate of the expected risk?"
It is known that the eigenvalues of the Gaussian kernel on the unit sphere exhibit super-geometric decay (see [Scetbon et al., 2021]), with a positive constant depending on the dimension, and kernel regression with the Gaussian kernel has a correspondingly fast regression risk by Theorem 3.1 of [Scetbon et al., 2021]. Furthermore, Theorem 10 of [Li et al., 2024] suggests that a different polynomial eigenvalue decay rate (EDR) is achieved by learning the bias in a neural network, which differs from the EDR discussed in this paper. In this case, the main results (Theorem 5.1 and Corollary 5.2) can still be applied to obtain the corresponding minimax optimal nonparametric regression risk.
(2) "If one does not know which RKHS the target function lies in (or more specifically, if one has no information on the behaviour of eigenvalues ), how should one estimate "?
We respectfully point out that the empirical kernel complexity in Eq. (7) is defined in terms of the (empirical) eigenvalues of the kernel gram matrix computed on the training data. In the case that one does not know the behavior of the population eigenvalues, one can still estimate the stopping time through the fixed point of the empirical kernel complexity computed with the eigenvalues of the gram matrix. To do so, we can estimate the fixed point by numerically solving the corresponding fixed-point equation.
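To make this concrete, here is a minimal numerical sketch in Python; the complexity functional `empirical_complexity` and the constant `c` below are simplified stand-ins for a standard local empirical Rademacher complexity and critical-radius equation, not the exact expressions in Eq. (7)-(8) of the paper.

```python
import numpy as np

def empirical_complexity(delta, eigvals, n):
    # A standard local empirical Rademacher complexity surrogate for a kernel:
    # R_hat(delta) = sqrt( (1/n) * sum_j min(delta^2, mu_hat_j) ),
    # where mu_hat_j are the empirical eigenvalues of the gram matrix.
    return np.sqrt(np.sum(np.minimum(delta ** 2, eigvals)) / n)

def estimate_fixed_point(gram, c=1.0, tol=1e-10, hi=10.0):
    # Bisection for the smallest delta > 0 with R_hat(delta) <= c * delta^2,
    # a common form of the critical-radius (fixed-point) equation.
    n = gram.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(gram / n), 0.0, None)
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if empirical_complexity(mid, eigvals, n) <= c * mid ** 2:
            hi = mid
        else:
            lo = mid
    return hi

# Usage: given the n x n NTK gram matrix K on the training inputs,
# delta_hat = estimate_fixed_point(K) approximates the fixed point, from
# which the stopping time can then be set.
```

Only the gram matrix on the training data enters this computation; no knowledge of the population eigenvalues is needed.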
References
[Scetbon et al., 2021] Scetbon, M. and Harchaoui, Z. A Spectral Analysis of Dot-product Kernels. AISTATS 2021.
The authors study risk convergence rates in the infinite-width NTK regime for two-layer neural networks with ReLU activations trained with gradient descent and early stopping. The authors present theoretical results that relax some assumptions made in prior work: specifically, it is common for results to be derived for data sampled uniformly from the hypersphere, and the authors recover these results for data on the sphere without requiring uniform sampling.
Questions for Authors
The paper Ko & Huo (2024) seems to have different relaxations of the data distribution that do not even require the data to be on the hypersphere; are their results comparable to the results in Table 1? If so, would the authors consider explaining the relationship between the two in more detail? As far as I can see, they also give some rates.
Claims and Evidence
The claims are well supported by proofs which do not rely on uniform spherical data. The scope of the paper itself seems fairly incremental and the results seem to be largely a verification of existing theoretical results in this slightly relaxed setting. It looks like the proofs offer some novelty in proof techniques.
Methods and Evaluation Criteria
N/A
Theoretical Claims
I did not check the proofs in detail, but I skimmed them and the proof overview, and they look reasonable.
Experimental Design and Analyses
N/A
Supplementary Material
I skimmed some proofs but did not look at them in detail.
Relation to Existing Literature
This work contextualizes its results within many related works.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
It seems the main results of this paper are moving from uniform spherical data to spherical data (not necessarily uniform), and showing that this setting recovers optimal rates from existing work. Further, this paper works in the setting where only the first-layer weights are trained and the second-layer weights are initialized randomly and fixed, whereas it seems the prior work trains all layers of the network. It seems like the authors here are trading off one somewhat restrictive assumption (uniform spherical data) for another (fixing the second weight layer to random +1/-1 weights). I'm not sure I see one assumption as more restrictive than the other.
The authors also reduce the lower bound on the width of the network required to achieve optimal rates, which seems to be a new contribution (I am not familiar enough with the literature to know).
Overall the contributions seem incremental but altogether can provide a useful basis for future work studying generalization bounds, hence my score.
Other Comments or Suggestions
N/A
We appreciate the review and its suggestions. The raised issues are addressed below. In the following text, the line numbers refer to the revised paper.
(1)"... It seems like the authors here are trading off one somewhat restrictive assumption (uniform spherical data) for another (fix the second weight layer to random +1/-1 weights)."
We respectfully point out that the training setup (fixing the weights of the second layer to random +1/-1 weights) is not an assumption, and such a training setup reduces the training cost by not training the second layer while still obtaining a neural network with a sharp risk bound. Such a training setup has also been used in prior analyses of the generalization of two-layer neural networks, such as [Du et al., 2019].
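For illustration, below is a minimal NumPy sketch of this training setup; the width, initialization, learning rate, and synthetic data are our own illustrative choices rather than the exact construction in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_two_layer(d, m):
    # First-layer (inner) weights are trainable; second-layer weights are
    # drawn uniformly from {-1, +1} and kept fixed throughout training.
    W = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    return W, a

def forward(W, a, X):
    # f(x) = (1 / sqrt(m)) * sum_r a_r * relu(w_r^T x)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def gd_step(W, a, X, y, lr):
    # One gradient descent step on the squared loss, updating only W.
    m, n = W.shape[0], X.shape[0]
    pre = X @ W.T                                  # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    # d f(x_i) / d w_r = (1 / sqrt(m)) * a_r * 1{w_r^T x_i > 0} * x_i
    grad = ((pre > 0) * resid[:, None] * a[None, :] / np.sqrt(m)).T @ X / n
    return W - lr * grad

# Usage sketch: covariates on the unit sphere, constant learning rate,
# and training stopped after T steps chosen by an early-stopping rule.
d, m, n, T = 5, 4096, 200, 100
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = X[:, 0] + 0.1 * rng.normal(size=n)
W, a = init_two_layer(d, m)
for _ in range(T):
    W = gd_step(W, a, X, y, lr=1.0)
```

Only W is updated; a stays at its random +1/-1 initialization, matching the setup described above.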
(2)"The paper Ko & Huo (2024)…"
In this paper, we focus on the generalization analysis of neural networks with algorithmic guarantees, that is, the generalization bounds for neural networks trained by gradient descent or its variants. Such analysis is of particular practical importance since the neural networks used in practice are mostly trained by GD or its variants. Although the work mentioned in this review, Ko & Huo (2024), provides certain risk rates, as mentioned in lines 181-182, it does not provide algorithmic guarantees and the neural networks achieving such rates are not trained by GD or its variants.
Novelty of This Paper
We would like to emphasize the novelty of this paper, which gives the first sharp nonparametric regression risk bound without distributional assumptions beyond the covariate lying on the unit sphere. Our analysis introduces new proof techniques, including new uniform convergence results for the NTK and a novel local Rademacher complexity based analysis (acknowledged by Reviewer EyzF and Reviewer JB4U), which lead to the distribution-free generalization bounds with spherical covariates. Our results also give a better lower bound for the network width compared to the existing literature. Furthermore, in our response to Reviewer JB4U (points (1) and (2)), we provide detailed discussions about (1) how to extend our work from a two-layer neural network to a deeper neural network, and (2) how to relax the assumption that the target function is in an RKHS ball of constant radius. The current novelty and contributions, together with such extension and relaxation, would benefit the theoretical deep learning literature with new insights and proof strategies.
References
[Du et al., 2019] Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. ICLR 2019.
This paper studies over-parameterized ReLU networks with one hidden layer, trained with gradient descent and early stopping. The paper contributes a theoretical convergence analysis with minimax optimal rates, using the tools of the neural tangent kernel and local Rademacher complexity. Reviewer Fjrf roughly reviewed the technical details, and Reviewer EyzF carefully checked the proofs. No technical mistake was identified. Although Reviewer pA14 described the work as "incremental", the novelty was appreciated by Reviewers EyzF and JB4U. Other minor concerns include the "when-to-stop" issue of early stopping (Reviewer EyzF) and the unit-sphere assumption being restrictive (Reviewers JB4U and Fjrf).