On the Saturation Effects of Spectral Algorithms in Large Dimensions
We fully depict the saturation effects of spectral algorithms and reveal a new phenomenon in large dimensional settings.
Abstract
Reviews and Discussion
This paper concerns the convergence rate of spectral methods, particularly kernel ridge regression (KRR) and kernel gradient flow (KGF), in large-dimensional settings where the sample size $n$ is of the same magnitude as a power of the input dimension $d$, i.e., $n \asymp d^{\gamma}$ for some $\gamma > 0$. It reveals a new phenomenon of the saturation effect in large dimensions, which is different from its fixed-dimensional counterpart. Specifically, it shows that in large dimensions, KRR still suffers from the saturation effect while KGF does not. The key techniques involve the use of analytic filter functions to characterize the regressors obtained from different spectral algorithms, followed by standard concentration results applied to the bias-variance decomposition of the excess risk.
Strengths
This paper offers solid theoretical results that improve over the previous literature and reveal novel phenomena. It offers important insights into spectral algorithms in the large-dimensional setting, for example the interpolation (across different qualifications) of the learning rate between KRR and KGF, and the exact description of the periodic plateau behaviour and the polynomial approximation barrier. The paper also offers numerical validation of its claims.
Weaknesses
There is no major weakness spotted in this paper.
Questions
I appreciate the results of this paper and hence look for possible extensions of the current results.
This paper focuses on dot-product kernels with inputs distributed uniformly on a hypersphere and with polynomial spectral decay.
- By [Belkin2018] and [Haas2024], it seems the kernel function cannot be smooth but only, say, once differentiable, e.g., when it is induced by the ReLU-NTK. If the kernel function is smooth, for example for the Gaussian kernel, then the spectral decay is exponential. Can the result in this paper extend to this case? If yes, where is the adaptation? If no or not obvious, what would be the main technical difficulties?
- In a realistic setting, a uniform input distribution on a hypersphere is too restrictive. Could one relax the condition to distributions whose support is the whole sphere instead? I recall Lemma F.9 in [Haas2024] stating that the spectral decay is still polynomial in this case. Could one extend the analysis to this case?
Also, I have some technical questions concerning the appendix.
- In Eq (62) in Lemma D.7, the left hand side (LHS) should be independent of the noise, but why is there a term involving the noise level on the right hand side (RHS)?
- Less a question and more a comment: I think there is a typo in Eq (42). Also, is Eq (44) actually redundant as a special case of Eq (43), given the notation mentioned in lines 602-603?
Reference:
- Haas, Moritz, et al. "Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension." Advances in Neural Information Processing Systems 36 (2024).
- Belkin, Mikhail. "Approximation beats concentration? An approximation view on inference with smooth radial kernels." Conference On Learning Theory. PMLR, 2018.
Limitations
All assumptions and conditions are stated clearly in the paper.
We sincerely thank you for taking the time to read our paper and for providing valuable feedback. We are pleased to see that you not only accurately described the contributions of our work but also gave it high praise. We would like to address the questions and comments you raised regarding possible extensions of our research.
Author's response to Question 1:
Thank you for suggesting that we check our assumptions on two specific kernels. After a careful check of [Belkin2018] and [Haas2024], we find that the kernel functions of both the ReLU-NTK and the Gaussian kernel are analytic, with all coefficients strictly positive, thus satisfying Assumption 1. Therefore, the results of this paper apply to both the ReLU-NTK and the Gaussian kernel. Please let us clarify in detail:
- From the definition of the ReLU-NTK in [Bietti2021], the kernel function of the ReLU-NTK is analytic (see, e.g., page 6 in [Bietti2021]). Moreover, Corollary 3 of [Bietti2021] as well as Lemma B.2 of [Haas2024] both imply that all coefficients are strictly positive. Notice that the proof of the positive-definiteness of the kernel in the above studies relies on [Gneiting2013], hence they also require that the kernel function is analytic.
- From the definition of the Gaussian kernel restricted to the sphere, we can show that its kernel function is analytic, with all coefficients strictly positive.
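To spell out the Gaussian case, here is a short illustrative calculation (the bandwidth $h$ and the restriction to the unit sphere are our notational convention, not necessarily the manuscript's): for $x, x' \in \mathbb{S}^{d-1}$,
$$\exp\!\Big(-\frac{\|x - x'\|^2}{2h^2}\Big) = \exp\!\Big(-\frac{2 - 2\langle x, x'\rangle}{2h^2}\Big) = e^{-1/h^2}\sum_{j \ge 0} \frac{1}{j!\,h^{2j}}\,\langle x, x'\rangle^{j},$$
so the kernel is an analytic function of the inner product and every coefficient is strictly positive.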
Since both kernels satisfy Assumption 1, we have the following claims:
- You are right. When the dimension is fixed, [Bietti2021] showed that the eigenvalues of the ReLU-NTK decay polynomially for large index, whereas it is known that the eigenvalues of the Gaussian kernel defined on the sphere decay exponentially (see, e.g., [Amnon2020]). Therefore, we may need different tools to deal with the rate of the excess risk of fixed-dimensional spectral algorithms with the ReLU-NTK and with the Gaussian kernel.
- In large dimensions, the spectra of both the ReLU-NTK and the Gaussian kernel exhibit the same strong block structure: the eigenvalues associated with degree-$k$ spherical harmonics are of order $d^{-k}$ and appear with multiplicity of order $d^{k}$ (see, e.g., Lemmas D.11 and D.13). Therefore, our results hold for both the ReLU-NTK and the Gaussian kernel in large dimensions without any adaptation.
Author's response to Question 2:
We agree that a uniform input distribution on spheres is too restrictive. As noted in Remark 2.1, most studies analyzing spectral algorithms in large-dimensional settings concentrate on inner product kernels on spheres for two main reasons:
- Firstly, harmonic analysis on the sphere is clearer and more concise. For example, the properties of spherical harmonic polynomials are simpler than those of orthogonal series on general domains. This clarity makes Mercer's decomposition of the inner product kernel more explicit, avoiding several abstract assumptions.
- Secondly, there are few results available for Mercer's decomposition of kernels on general domains, especially when tracking the dependence on the domain's dimension. For example, no existing results determine the eigenvalues of Sobolev spaces in large dimensions.
Thus, if we believe that large-dimensional KRR and KGF exhibit some new phenomena, it would be prudent to start with a more tractable setting, and then extend the results to general domains.
Regarding your second question, thank you for bringing up the interesting work [Haas2024] and suggesting an extension of our analysis. We believe that the results in [Haas2024] can indeed be extended to large-dimensional settings, although we must carefully track the constants that depend on the dimension $d$. For example, the density in Lemma D.7 of [Haas2024] is lower and upper bounded by two constants depending on $d$.
We appreciate your insightful suggestion. We will add the paper [Haas2024] to the discussion section of our manuscript, and we will consider extending our results to general domains in future work.
Author's response to Question 3: Thank you for pointing out the potential issue in Eq (62). We have revised our manuscript and updated Eq (62) accordingly.
Moreover, we followed your suggestion and added a rigorous definition of the asymptotic notation used there: we say that two (deterministic) quantities satisfy this relation if and only if, for any fixed tolerance, there exists a constant, depending only on that tolerance and the absolute positive constants in our assumptions, such that the corresponding inequality holds for all sufficiently large sample sizes.
We hope that the updated definitions clarify that Eq (62) is correct when the constant involved is the absolute constant defined in (1).
Author's response to Question 4:
Thank you for your careful reading of our proof and for pointing out the typo in Eq (42). We will correct it in the updated manuscript. Regarding your comment about Eq (44), we believe it is not redundant: in Eq (44), the filter function refers to that of a general spectral algorithm (not necessarily KRR), while the one in Eq (43) specifically denotes the filter function of KRR.
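For concreteness, and purely as an illustration (the notation below follows the standard filter-function convention in this line of work and may differ slightly from the manuscript's), the two cases can be written as
$$\varphi^{\mathrm{KRR}}_{\lambda}(z) = \frac{1}{z + \lambda} \quad (\text{qualification } 1), \qquad \varphi^{\mathrm{GF}}_{t}(z) = \frac{1 - e^{-tz}}{z} \quad (\text{qualification } \infty),$$
and Eq (44) is stated for a general filter function of this type, of which the KRR filter in Eq (43) is one instance.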
Reference:
- Alberto Bietti and Francis Bach. "Deep equals shallow for ReLU networks in kernel regimes." In International Conference on Learning Representations (ICLR), 2021.
- Tilmann Gneiting. "Strictly and non-strictly positive definite functions on spheres." Bernoulli, 19(4): 1327–1349, 2013.
- Geifman, Amnon, et al. "On the similarity between the Laplace and neural tangent kernels." Advances in Neural Information Processing Systems 33 (2020): 1451-1461.
Thank you very much for your detailed response. I am content to see that the results are valid for various kernels in high-dimensional setting. I would lean to accept this paper.
Thank you very much for your positive feedback. We are pleased that you find the results valid in high dimensions. Your support and recommendation to accept the paper are greatly appreciated.
In a large-dimensional setting, i.e., when the dimension $d$ of the input grows polynomially with respect to the sample size $n$, this manuscript rigorously proves upper and lower bounds for spectral algorithms and shows their dependence on the qualification and the interpolation index. Consequently, the manuscript proves the saturation effect of spectral algorithms for large-dimensional data.
Strengths
- Identifies several phenomena in large-dimensional spectral algorithms based on the derived rates. These phenomena are also illustrated by figures, making the explanations easy to follow.
- Discovers that the threshold for triggering the saturation effect differs between the large-dimensional and fixed-dimensional settings. Specifically, in the large-dimensional setting, the saturation effect occurs when the interpolation index exceeds the qualification, whereas in the fixed-dimensional setting it must exceed twice the qualification.
Weaknesses
- By checking the previous works [1,2] and their proofs, it looks like Theorem 3.1 has already been established in Section 4 of [1], while Theorem 4.1 and Theorem 4.2 are direct extensions of partial results in Theorem 2 and Theorem 3 of [2]. For instance, the proofs of Theorems 4.1 and 4.2 are obtained by replacing the Tikhonov-regularized filter function in the variance and bias decomposition of [2] with a general filter function satisfying specific conditions such as C1 and C2. Such a proof trick has been used in a previous extension from KRR (Tikhonov regularization) to general spectral algorithms, i.e., from [3] to [4].
However, unlike the extension from [3] to [4], the current manuscript seems to be a partial extension of [2] with a similar proof trick, as I mentioned before. Therefore, I have concerns about the technical contribution and novelty of this manuscript as a conference submission. This work seems more like an extension suited to a journal such as JMLR.
I am just not sure whether such a partial extension of a previous article with almost the same proof technique is suitable for conference publication or whether it would be better evaluated in a journal. I defer this justification to the AC. Please disregard this comment if the AC deems the current context appropriate for conference publication.
- I noticed there are some simulation experiments confirming the saturation effect for fixed-dimensional KRR; see [5]. Is it possible to confirm the results in this manuscript numerically? I understand that, given the rates are asymptotic, it might be hard to carry out thorough investigations due to the extremely large $d$ required. But I am still curious whether any preliminary experiments can be done.
[1] Lu, Weihao, et al. "Optimal rate of kernel regression in large dimensions." arXiv preprint arXiv:2309.04268 (2023).
[2] Zhang, Haobo, et al. "Optimal Rates of Kernel Ridge Regression under Source Condition in Large Dimensions." arXiv preprint arXiv:2401.01270 (2024).
[3] Zhang, Haobo, et al. "On the optimality of misspecified kernel ridge regression." International Conference on Machine Learning. PMLR, 2023.
[4] Zhang, Haobo, Yicheng Li, and Qian Lin. "On the optimality of misspecified spectral algorithms." Journal of Machine Learning Research 25.188 (2024): 1-50.
[5] Li, Yicheng, Haobo Zhang, and Qian Lin. "On the Saturation Effect of Kernel Ridge Regression." International Conference on Learning Representations. (2024)
Questions
- I'm curious to know whether it is possible to conduct a similar analysis in ultra-high-dimensional settings where the dimension $d$ grows exponentially fast with the sample size $n$. Do we need additional techniques to conduct these analyses?
- Based on the figure, it looks like even when the smoothness is large, as long as $d$ grows with $n$, the saturation effect will not happen, which is different from the fixed-dimensional setting. While this may be a consequence of the derived rate, can the authors provide some intuition behind this?
- Is there a particular reason that the authors restrict the relevant ratio to be an integer when deriving the rates? Why is this ratio an integer?
Limitations
Yes.
Thank you for your thorough review and valuable comments. Below, we address your concerns and questions in detail.
Concern 1: Please let us clarify our novelties and contributions below.
- Although the saturation effect has been observed for kernel ridge regression in a large-dimensional setting [2], its persistence for other spectral algorithms was unclear. We determine the exact rate of the excess risk for analytic spectral algorithms with arbitrary qualification. Key outcomes include:
- Kernel gradient flow is minimax optimal in large dimensions, i.e., its excess risk attains the minimax rate.
- We proved the saturation effect for large-dimensional KRR, and more generally for spectral algorithms with finite qualification, whenever the interpolation index of the regression function exceeds the qualification.
Regarding your concerns on the technical contributions:
- [3][4] addressed the convergence rate of fixed-dimensional spectral algorithms, focusing on non-saturated cases where KRR and kernel gradient flow are rate-optimal. In contrast, they did not consider the saturation effect of spectral algorithms in the saturated regime.
- [1] determined the minimax rate of kernel regression in a restricted smoothness regime. The proof in [1] relies heavily on empirical process theory, requiring bounds on the empirical loss and on its difference from the expected loss (excess risk). This approach does not generalize to regression functions of higher smoothness.
- [2] provided a minimax lower bound for KRR using integral operator techniques, hinting at a saturation effect but not generalizing to other spectral algorithms.
We extended results to general spectral algorithms in large dimensions using complex variable analysis (Appendix E.1 and [6]) to match upper and lower bounds. Large dimensions introduce unique challenges:
- The polynomial eigendecay assumption does not hold in large dimensions, since the hidden constant factors in that assumption vary with $d$.
- The embedding index assumption in [6] does not hold either. We develop a new condition in (63) and (74) to replace the embedding index assumption, and we verify this new condition in Appendix D.4.
We believe our contributions are novel and valuable for the machine learning community, meriting publication at NeurIPS.
Reference: [6] Li, Yicheng, et al. "Generalization Error Curves for Analytic Spectral Algorithms under Power-law Decay."
Concern 2: Thank you for your suggestion. We agree that conducting thorough numerical investigations can be challenging due to the extremely large dimensionality . Following your recommendation, we conducted two preliminary experiments. We will follow your advice to conduct more comprehensive experiments later and report the results in the updated manuscript.
Question 1: In this paper, we consider the asymptotic framework $n \asymp d^{\gamma}$. As discussed in lines 97-115, many studies focus on the performance of spectral algorithms within this asymptotic framework. By comparing our work with these studies, we highlight the novelty and contributions of our results to the field. In contrast, we were not aware of similar studies in ultra-high-dimensional settings, hence we did not explore this scenario in our manuscript.
Your insightful suggestion implies that this could be an interesting avenue for future research. When $\gamma$ is sufficiently small, the excess risk in our results follows the corresponding regime of Theorems 4.1 and 4.2. If we view the ultra-high-dimensional setting, where $d$ grows exponentially with $n$, as a limiting case of $\gamma \to 0$, we might conjecture that the corresponding limit of this rate remains the correct rate of the excess risk.
We believe your suggestion is promising, and we will consider conducting a similar analysis under ultra-high-dimension settings in future work.
Question 2: Thank you for your question. We suspect there may be some typos in your expression; we believe you intended to describe the distinction between our results and the existing results in fixed-dimensional settings, which is as follows:
- Our results, Theorems 4.1 and 4.2, demonstrate that the saturation effect occurs in large-dimensional settings if and only if the interpolation index exceeds the qualification. In Figure 2 (on page 13), we illustrate this with the spectral algorithm rate (blue line) and the minimax rate (orange line): the blue and orange lines exhibit non-overlapping regions if and only if the interpolation index exceeds the qualification. Therefore, Figure 2 aligns with our claims.
- In contrast, in fixed-dimensional settings, the saturation effect occurs if and only if the interpolation index exceeds twice the qualification.
Then let us provide some intuition behind the derived rate. We highlight that the periodic behavior of the rates with respect to $\gamma$ in Theorems 4.1 and 4.2 is closely related to the spectral properties of inner product kernels for uniformly distributed data on a large-dimensional sphere. In Lemmas D.11 and D.13, we show that the eigenvalues associated with degree-$k$ spherical harmonics are of order $d^{-k}$ and appear with multiplicity of order $d^{k}$. Moreover, recall that the leading terms of the bias and the variance are derived in Appendices D.1-D.3. Therefore, by comparing the bias and the variance (with calculations detailed in Lemma D.14) under the strong block structure of the spectrum, we found that the rate of the excess risk behaves periodically over each interval $\gamma \in (p, p+1]$ with integer $p$, and that the saturation effect occurs when the interpolation index exceeds the qualification.
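To make this concrete, here is a rough back-of-the-envelope sketch (an informal heuristic with illustrative notation, not the exact expressions in Appendix D). With $n \asymp d^{\gamma}$ and $\gamma \in (p, p+1]$,
$$\text{eigenvalue of degree-}k\text{ harmonics} \asymp d^{-k}\ \text{with multiplicity} \asymp d^{k} \;\Longrightarrow\; \text{degrees } k \le p \text{ are resolvable } (d^{-k} \gtrsim n^{-1}) \text{ while degree } p+1 \text{ is not},$$
so the variance scales roughly like $d^{p}/n = d^{-(\gamma - p)}$, the bias is governed by how well the algorithm fits the unresolved part of the regression function, and both leading terms depend on $\gamma$ essentially only through $\gamma - p$. This is the source of the periodic behaviour, and the saturation question is whether the algorithm's filter can exploit smoothness beyond its qualification when fitting the resolvable part.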
Question 3: The intervals $\gamma \in (p, p+1]$ arise naturally from our derivation process; thus, we may focus on a single such interval. Regarding your second question about why this ratio is an integer, we are not entirely certain whether you are referring to the constant $\gamma$ (with $n \asymp d^{\gamma}$) or to the integer part of $\gamma$. Please let us know, and we will be happy to clarify further.
I appreciate the detailed response and explanation that confirms the technical contribution of this paper. Also, thanks for the additional experiments that enhanced the content of the manuscript.
I have adjusted my rating, good luck!
We are pleased that you recognize the technical contributions of our paper and the additional experiments we conducted. We will follow your suggestion to include these experiments in the updated manuscript. Thank you for raising your score.
Summary:
The authors study the saturation of spectral algorithms (KRR & GF) in high dimensions where the sample size and the input dimension are both large, meaning that KRR cannot achieve the information-theoretic lower bounds for overly smooth regression functions while kernel gradient flow (GF) can. Theorem 3.1 states the optimal convergence rate of kernel GF, which matches the minimax lower bound provided in Theorem 3.3. Moreover, they find that KRR is unable to achieve this lower bound (being suboptimal) for sufficiently smooth interpolation spaces.
Strengths
Pros:
- very well-written
- tightening previous results on the minimax rate of kernel GF in high dimensions
- proving the saturation of KRR in high dimensions
Weaknesses
Cons:
- the main results of this paper are not stated until page 7; the presentation of the results is slow
Questions
This is an interesting paper about the saturation of KRR in high dimensions. The authors provide several new results and the paper is well written and organized.
- line 200 -- what does the notation there mean? It is only defined on the next page.
Limitations
N/A
We sincerely thank you for your detailed review and thoughtful comments. We are grateful that you found our paper well-written and technically solid. Below, we address your concerns and questions in detail.
Author's response to Concern 1: We appreciate your suggestion to present our results earlier in the paper. We agree that, in a 9-page conference paper, it is crucial to present the main results as early as possible. Therefore, we followed your recommendation and added a non-rigorous version of our main results (Theorem 4.1 and Theorem 4.2) in the contribution part (page 2 in our manuscript). For your convenience, we restate the non-rigorous version as follows:
Theorem (restatement of Theorems 4.1 and 4.2, non-rigorous)
Fix the interpolation (smoothness) index of the regression function, the qualification of the algorithm, and the exponent $\gamma$ with $n \asymp d^{\gamma}$, and let $p$ be the integer satisfying $\gamma \in (p, p+1]$. Then, under certain conditions, the excess risk of the large-dimensional spectral algorithm with the given qualification is of an order determined explicitly by these quantities; see Theorems 4.1 and 4.2 for the precise exponents.
We would greatly appreciate your further advice on how to better organize our paper.
Author's response to Question 1: Thank you for highlighting this point. We will clarify the notation used in line 200: it represents the regression function, expressed in terms of the components of the input vector. We will ensure this definition appears before its first use in the revised manuscript.
We thank you once again for your expertise and valuable feedback. Please let us know if you have any further questions.
Thank you! I appreciate the authors' response. It is really helpful to have such non-rigorous versions of the results earlier in the paper, and I'm happy that the authors included this to improve the presentation of their draft. I continue supporting this paper so I keep my score positive.
Thank you as well! We are pleased that the changes you suggested have enhanced the clarity and presentation of our work. We sincerely appreciate your positive assessment of our paper.
Following the recommendation of Reviewer JAP1, we conducted two preliminary experiments using two specific kernels: the RBF kernel and the NTK. Experiment 1 was designed to confirm the optimal rates of kernel gradient flow and KRR in the non-saturated regime. Experiment 2 was designed to illustrate the saturation effect of KRR when the regression function is sufficiently smooth.
Experiment 1: We consider the following two inner product kernels:
- an RBF kernel with a fixed bandwidth;
- the Neural Tangent Kernel (NTK) of a two-layer ReLU neural network, written as a function of the inner product between the inputs.
The RBF kernel satisfies Assumption 1. For the NTK, some coefficients of its power-series expansion vanish while the remaining ones are strictly positive (see, e.g., [Lu2023]). As noted after Assumption 1, our results can be extended to inner product kernels with certain zero coefficients: as long as the coefficients relevant to the chosen range of $\gamma$ are strictly positive, the proof and the convergence rate remain the same. Therefore, for the values of $\gamma$ used in our experiments, the convergence rates for the NTK are the same as those for the RBF kernel.
We used the following data generation procedure:
$y_i = f^*(x_i) + \epsilon_i$, where each $x_i$ is i.i.d. sampled from the uniform distribution on the sphere $\mathbb{S}^{d-1}$, and each noise term $\epsilon_i$ is i.i.d. centered Gaussian.
We selected the training sample sizes $n$ with corresponding dimensions $d$ such that $n \asymp d^{\gamma}$. For each kernel and dimension $d$, we consider a specific regression function $f^*$. This function lies in the RKHS, and it is easy to prove that, for any $d$, Assumption 2 (b) in our revision holds for it; therefore, Assumption 2 holds for $f^*$. We used logarithmic least squares to fit the excess risk with respect to the sample size, and the resulting estimated convergence rates align well with our theoretical findings.
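For reproducibility, the following is a minimal Python sketch of Experiment 1; the bandwidth, noise level, regularization and stopping-time schedules, target function, and grid of sample sizes are illustrative placeholders rather than the exact values used in our experiments:

```python
import numpy as np

def sample_sphere(n, d, rng):
    """n i.i.d. samples from the uniform distribution on the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def rbf_kernel(X, Z, bandwidth=1.0):
    """RBF kernel; on the sphere it depends on the inputs only through <x, z>."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_predict(K, K_test, y, lam):
    """KRR predictor k(x, X) (K + n*lam*I)^{-1} y evaluated on test points."""
    n = K.shape[0]
    return K_test @ np.linalg.solve(K + n * lam * np.eye(n), y)

def kgf_predict(K, K_test, y, t):
    """Kernel gradient flow at time t via the filter phi_t(z) = (1 - exp(-t z)) / z applied to K/n."""
    n = K.shape[0]
    evals, evecs = np.linalg.eigh(K / n)
    evals = np.clip(evals, 0.0, None)
    phi = np.where(evals > 1e-12, (1.0 - np.exp(-t * evals)) / np.maximum(evals, 1e-12), t)
    return K_test @ (evecs @ (phi * (evecs.T @ y)) / n)

rng = np.random.default_rng(0)
gamma = 1.5                              # n ~ d^gamma (illustrative choice)
sample_sizes = [500, 1000, 2000]
risk_krr, risk_kgf = [], []
for n in sample_sizes:
    d = int(round(n ** (1.0 / gamma)))
    X, X_test = sample_sphere(n, d, rng), sample_sphere(2000, d, rng)
    f_star = lambda Z: Z @ np.ones(d) / np.sqrt(d)   # placeholder regression function in the RKHS
    y = f_star(X) + 0.1 * rng.standard_normal(n)
    K, K_te = rbf_kernel(X, X), rbf_kernel(X_test, X)
    risk_krr.append(np.mean((krr_predict(K, K_te, y, lam=n ** -0.5) - f_star(X_test)) ** 2))
    risk_kgf.append(np.mean((kgf_predict(K, K_te, y, t=n ** 0.5) - f_star(X_test)) ** 2))

# logarithmic least squares: the slope of log(excess risk) against log(n) estimates the rate
for name, risks in [("KRR", risk_krr), ("KGF", risk_kgf)]:
    slope = np.polyfit(np.log(sample_sizes), np.log(risks), 1)[0]
    print(f"{name}: estimated excess-risk rate n^({slope:.2f})")
```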
Experiment 2: We use most of the settings from Experiment 1, except that the regression function is changed to one built from a Gegenbauer polynomial evaluated at the inner product with a fixed point on the sphere. Notice that the addition formula for Gegenbauer polynomials implies that this function is a linear combination of spherical harmonics of a single degree, hence it satisfies Assumption 2.
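For completeness, one way to construct such a Gegenbauer-type target in code (the degree, pole, and normalization below are illustrative and not necessarily those of our experiment):

```python
import numpy as np
from scipy.special import eval_gegenbauer

def gegenbauer_target(X, pole, degree):
    """f*(x) = C_degree^{(nu)}(<x, pole>) with nu = (d - 2) / 2, the index attached to S^{d-1}."""
    d = X.shape[1]
    nu = (d - 2) / 2.0
    return eval_gegenbauer(degree, nu, X @ pole)

# example: a degree-4 Gegenbauer target with pole e_1 on S^{d-1}
d = 64
pole = np.zeros(d); pole[0] = 1.0
X = np.random.default_rng(1).standard_normal((10, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(gegenbauer_target(X, pole, degree=4))
```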
Our experiment settings are similar to those on page 30 of [5]. We choose the regularization parameter for KRR and the stopping time for kernel gradient flow according to Corollary D.16, which suggests the order of the optimal regularization parameter for KRR and, similarly, the order of the optimal stopping time for kernel gradient flow. The results indicate that the best convergence rate of KRR is slower than that of kernel gradient flow, implying that KRR is inferior to kernel gradient flow when the regression function is sufficiently smooth.
References
- Lu, Weihao, et al. "Optimal rate of kernel regression in large dimensions." arXiv preprint arXiv:2309.04268 (2023).
- Li, Yicheng, Haobo Zhang, and Qian Lin. "On the Saturation Effect of Kernel Ridge Regression." International Conference on Learning Representations. (2024)
A very solid contribution on the saturation effect of spectral learning algorithms in high dimension (dimension grows as a power of the sample size), a setting that has generated a lot of attention and investigation in recent years, while the saturation effect had only been studied in fixed dimension. The concern of possibly only technically incremental results with respect to previous work raised by one reviewer has been answered convincingly by the authors in the discussion, pointing out novelties in results and techniques used. All reviews lauded the paper for its quality of writing and presentation.