Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors
Abstract
Reviews and Discussion
The paper examines the prediction and estimation risk of the ridgeless least squares estimator under a general error structure. The i.i.d. assumption on the errors is often not valid in settings such as time series, panel, and grouped data. The paper introduces a theoretical framework for analyzing the variance components of both the prediction and estimation risks in these settings. It shows that the benefits of overparameterization, previously observed in the i.i.d. context, also hold under dependent errors.
Strengths
The strengths of the paper are as follows:
- Investigation of prediction and estimation risk under non-i.i.d. regression errors, with a specific focus on time series and clustered data.
- Explicit quantification of the variance component of both risks, which depends on the trace of the error covariance matrix and the trace of a function of the design matrix as a separable product.
- Explicit analysis of the variance and bias terms of both risks in high-dimensional asymptotics.
- Well constructed numerical experiments to support the theory.
Weaknesses
The weaknesses of the paper are as follows:
- The theoretical results, particularly the bias component analysis section, could have been more rigorous and better written. There are some notational discrepancies and theoretical inconsistencies.
- Some remarks following Theorems 3.4 and 3.5 for the case where the design matrix has a known distribution, say Gaussian, would have been useful examples for gaining insight into the results proved in the theorems.
- Some notations such as and used in Theorem 3.4 are only clarified later in the appendix. It would be better to introduce them in the sketch of the proof, since they are used there.
Questions
I have the following queries for the authors:
- It would be interesting to see how the prediction risk and the estimation risk behave in Figures 1 and 2, respectively, as n and p increase. In other words, can we infer any pattern from the results shown in Theorems 3.4 and 3.5?
- In Section 4, in the bias component analysis before Assumption 4.1, the authors mention that each has a positive definite covariance matrix and is independent of each other. Why are dependent 's not considered?
- In Assumption 4.2, the expectation has subscript β, and there is also an assumption that β is independent of X. Are the authors assuming β to be random? What does this mean?
- What is in equation (4)? Did the authors define it earlier?
- In equation (5), if , does this mean the bias goes away?
- In Corollary 4.8, I thought the double asymptotics on n and p had already been used. Then what does the limit wrt mean in the second term on the right-hand side?
Limitations
The authors have adequately addressed the limitations of the paper.
We thank the reviewer for the feedback. We respond to the concerns and questions below:
[Q1] (...) how do the prediction risk and the estimation risk behave in Figure 1 and 2 respectively as n and p increase? In other words can we infer any pattern from the results shown in Theorem 3.4 and 3.5?
- Please see [General Response] and Fig S1 in the attached PDF. We test a wide range of (n, p) pairs where .
- Fig S1 (first three rows) shows that, even if the values of n and p change, the results are almost identical if the ratio p/n remains the same.
- This result is interesting and easily predictable, since affects the values of and . For example, if , then in the limit of . So for a sufficiently large pair n and p, only the ratio p/n determines the level sets. For a discussion of anisotropic , please refer to Remark 4.9 below Cor 4.8 (Lines 273-283).
[Q2] (...) Why the dependent 's are not considered?
- We can allow for dependent 's. Our remark that "each has a positive definite covariance matrix and is independent of each other" is just a sufficient condition for "Assumption 4.1. rank() almost everywhere". It is possible to satisfy Assumption 4.1 even if 's are dependent. More importantly, without the rank assumption, the numerator in the equation below Line 243 becomes , which makes the RHS .
[Q3] (...) Are the authors assuming to be random. What does this mean?
- Yes, we are assuming β to be random. We make the random-β assumption (Assumption 4.2) in order to obtain an exact closed-form finite-sample expression for the prediction risk in Corollary 4.3. This type of assumption has been used before in the literature [e.g., 23, 20, 7] after the influential work by Dobriban and Wager (2018, Annals of Statistics). Although it may be less natural than the fixed-β assumption, it helps to obtain clean insight into the problem.
[W1, W3, Q4] There are some notational discrepancies and theoretical inconsistencies. Some notations such as and (...) have been clarified later in the appendix. (...) What is in equation (4)? (...)
- Thank you for pointing that out. Because of the page limit, we moved the full proof of Thm 3.4, in which all are defined, to the Appendix. This may have caused some notational discrepancies. We will move the definitions to the main part and revise our manuscript accordingly.
[Q5] In equation (5) if , does this mean the bias go away?
- No, the bias does not go away. If , i.e. , then the bias is which is 1 in Fig 4 (Left).
[Q6] In corollary 4.8, I thought the double asymptotics on and have already been used. Then what is the limit wrt mean in the second term on the right hand side?
- The second term you refer to is from the expected variance. As shown in our main Thm 3.4, we decompose the expected variance into two parts (i) and (ii) , i.e., Here, we apply the double asymptotics on and to each part (the limit of a product is the product of the limits), i.e., and since (ii) does not depend on (and ).
[W2] Some remarks following theorem 3.4 and 3.5 where the design matrix has a known distribution say Gaussian would have been useful examples to get insight on the results proved in the theorems.
- This is a great suggestion. The design matrix plays a role in (Thm 3.4) and in (Thm 3.5).
- First, in Thm 3.5, in the limit is which decreases to as and increases to as . This is because in the limit (Thm 4.7). Thus, for sufficiently large n and p, we have . Fig 4 (Right) empirically validates this approximation for not very large n and p ( and ). And (cf. ). This approximation depends on the degree of anisotropy of as discussed in eq (7).
- Second, it is not straightforward for Thm 3.4. Thus, to get some insights, we set and then The numerator has an expectation of , and the denominator () has an expectation of which increases faster than that of the numerator as . Furthermore, if and , then as . Fig 4 (Left) empirically validates that the variance is small for a large .
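As a rough illustration of the Gaussian case raised in [W2], the sketch below checks one standard fact by simulation: for an n x p design with i.i.d. standard normal entries and p > n + 1, the expected trace of (X X^T)^{-1} equals n / (p - n - 1), which is the mean of an inverse Wishart matrix. The isotropic feature covariance is an assumption made here for simplicity, and the function name, sizes, seed, and repetition count are hypothetical choices, not taken from the paper.

```python
import numpy as np

def mean_trace_inverse_gram(n, p, reps=4000, seed=0):
    """Monte Carlo estimate of E[tr((X X^T)^{-1})] for a standard Gaussian n x p design."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, p))          # assumed design: i.i.d. N(0, 1) entries
        total += np.trace(np.linalg.inv(X @ X.T))
    return total / reps

n, p = 5, 20                                     # illustrative sizes with p > n + 1
simulated = mean_trace_inverse_gram(n, p)
closed_form = n / (p - n - 1)                    # inverse Wishart mean, traced
print(simulated, closed_form)
```

Writing the ratio as p/n, the closed form is roughly 1 / (p/n - 1) for large n, so it shrinks as the ratio grows and blows up as the ratio approaches 1.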
The paper explores the prediction risk and estimation risk of the ridgeless least squares estimator under more general assumptions on regression errors. It highlights the benefits of overparameterization in a realistic setting that allows for clustered or serial dependence. The paper establishes that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. The findings suggest that the benefits of overparameterization can extend to time series, panel, and grouped data. The paper is a theoretical work that discusses various aspects of linear regression models, providing details on the assumptions and proofs for the theoretical results presented. It also includes information on the experimental setting and provides code and instructions for reproducing the main results.
Strengths
This study addresses an important research gap by considering more realistic assumptions on regression errors. It provides exact finite-sample characterizations of the variance components of prediction and estimation risks, includes numerical experiments that validate the theoretical results, and demonstrates the relationship between the expected variance and the covariance of the regression errors. Additionally, it analyzes the bias components of prediction and estimation risks, offers a comprehensive overview of linear regression models covering various theoretical aspects, and provides detailed proofs for the theoretical results, ensuring the validity of their claims.
Weaknesses
Is it possible to provide validation on large-scale data?
Questions
Refer to weaknesses.
Limitations
N/A
We thank the reviewer for the feedback. We respond to the question below:
On large-scale validation
- Please see our top-level comment [General Response] and Fig S1 in the PDF attached to it.
- We additionally tested a wide range of (n, p) pairs including where .
- Fig S1 (last row) shows that our theory (-axis) matches the expected variance (-axis) for a high-dimensional for . Note that CIFAR-10 and ImageNet have and dimensional data, respectively.
- Fig S1 (first three rows) shows that, even if the values of n and p change, the results are almost identical if the ratio remains the same.
- This result is interesting and easily predictable, since affects the values of and . For example, if , then in the limit of . So for a sufficiently large pair n and p, only the ratio determines the level sets. For a discussion of anisotropic , please refer to Remark 4.9 below Corollary 4.8 (Lines 273-283).
Thanks for the response. I will keep my rating.
The paper investigates the properties of minimum norm (ridgeless) interpolation least squares estimators, analyzing prediction risk and estimation risk under broader regression error assumptions, including clustered or serial dependence. This diverges from the typical assumption of i.i.d. errors with zero mean and common variance. The paper shows that the challenges in estimating the variance components of prediction and estimation risks can be captured by the trace of the variance-covariance matrix of the regression errors.
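For readers less familiar with the estimator under review, a minimal sketch of minimum-norm interpolation follows. It assumes a Gaussian design with p > n and uses NumPy's pseudoinverse; the sizes, seed, and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 12                                  # overparameterized regime: p > n
X = rng.standard_normal((n, p))               # assumed Gaussian design (full row rank a.s.)
y = rng.standard_normal(n)

# Minimum-norm ("ridgeless") least squares solution via the pseudoinverse.
beta_hat = np.linalg.pinv(X) @ y

# Any other interpolator equals beta_hat plus a vector in the null space of X.
v = rng.standard_normal(p)
null_part = v - np.linalg.pinv(X) @ (X @ v)   # component of v orthogonal to the row space
beta_other = beta_hat + null_part

print(np.allclose(X @ beta_hat, y))           # beta_hat fits the data exactly
print(np.allclose(X @ beta_other, y))         # so does the alternative solution
print(np.linalg.norm(beta_hat) < np.linalg.norm(beta_other))
```

Both vectors interpolate the data, and the pseudoinverse solution is the one with the smallest Euclidean norm among them.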
Strengths
- The paper provides a more general theoretical analysis of minimum norm interpolation least squares estimators, going beyond the restrictive i.i.d. error assumption.
- The paper suggests that the benefits of overparameterization can extend to a wider range of regression settings, including time series, panel, and grouped data.
Weaknesses
While the paper examines broader error structures, it might not fully grasp the complexity of real-world regression challenges, which could involve even more intricate patterns of error dependence.
Questions
1. Is it possible to remove the assumption that is independent of ?
2. Is it overly restrictive to demand that the design matrix is left-spherical and has a rank of almost everywhere?
Limitations
The authors have addressed the limitations.
We thank the reviewer for the feedback and comments. We also believe that it is important to understand the real-world regression challenges with even more intricate patterns of error dependence. Even though it is extremely difficult to fully grasp the complexity of real-world problems, the paper aims to provide a relatively general theoretical analysis by relaxing some restrictive assumptions. We again emphasize our contributions: we relax the previous assumptions as follows:
We respond to the concerns and questions below:
On the assumption " is independent of "
- This is a common assumption in most existing studies.
- We can further relax the independence assumption. Specifically, may depend on . Then, the variance is where , , and is a vector with its -th element as the -th largest eigenvalue of .
- Therefore, with a weaker assumption " for any orthogonal matrix ", we can still obtain a similar conclusion: Here, is the all-ones matrix (see the proof of Theorem 3.4 for the details).
- Even without this assumption, using the matrix inequality , we can obtain an inequality
On the left-spherical symmetry assumption
- We believe that the left-spherical symmetry assumption is restrictive but not overly restrictive because it can be strictly weaker than the usual assumption . For example, 's can be i.i.d. features from a mixture of centered Gaussian distributions.
On the assumption almost everywhere
- If is independent of each other and has a positive definite covariance matrix (e.g., and ), then is almost everywhere. Thus, we believe that the rank assumption is not overly restrictive.
- Moreover, this assumption is only made for the convenience of our asymptotic analysis. Even without the assumption we can obtain a similar result with instead of .
- Without the rank assumption, the numerator in the equation below Line 243 becomes , which makes the RHS .
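The claim that a full-rank design is generic can be illustrated numerically. The sketch below assumes a standard Gaussian design with p > n, for which the rank equals n with probability one, and contrasts it with a design containing a duplicated row; the sizes and seed are arbitrary illustrative choices, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 15                              # illustrative sizes with p > n
X = rng.standard_normal((n, p))           # rows drawn independently from N(0, I_p)
rank_generic = np.linalg.matrix_rank(X)

X_dup = X.copy()
X_dup[1] = X_dup[0]                       # duplicate a row to break the rank condition
rank_degenerate = np.linalg.matrix_rank(X_dup)

print(rank_generic, rank_degenerate)
```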
Thank you for your detailed response. I will maintain my score unchanged.
The paper considers the ridgeless least-squares estimator, and derives its prediction and estimation risk. One of the assumptions used is that the expectation of the noise variance matrix is finite and positive-definite. This is more general than the assumption that this expectation is some positive multiple of the identity matrix.
Strengths
- The paper has an easy-to-follow introduction that motivates the need to derive theoretical results under general assumptions on regression errors.
- Related works are sufficiently discussed. The most relevant papers are those of Chinot et al. [9] and Chinot and Lerasle [8], which are based on noise assumptions different from those this paper makes.
- The technical presentation is clear, with examples and figures to help the reader understand the notations and results.
Weaknesses
The major concern I have is whether the paper makes sufficient technical contributions. Even with the more general assumption on noise (Assumption 2.1), the technical change in the proofs seems very small compared to prior work. For example, the proof of Theorem 3.4 is short and relatively straightforward (and this might further simplify if we make Gaussian assumptions on data rather than left-spherical assumptions. Gaussian assumptions are what I like to make personally). It is always nice to have short and concise proofs whenever possible, but this might also indicate that the paper is not very technically solid.
Questions
N.A.
Limitations
N.A.
We thank the reviewer for the feedback. We respond to the concerns below:
On the technical contributions
- We have a concise proof. A compact and special technique made it possible.
- The main technical difficulty is that we generally cannot directly factor out from .
- In the isotropic error case , we can easily obtain ( out of )
- Under general error assumption (e.g., anisotropic error), however, this is not feasible. We would like to emphasize that, to address this technical difficulty, we compute the "expected" variance ( over ) to apply Lemma 3.3 which is technically novel:
- We again emphasize our contributions: we relax the previous assumptions as follows:
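To illustrate the "expected variance" idea in the technical-contributions response above, the following sketch shows how averaging over orthogonal transformations lets the trace of the error covariance factor out. It is not the paper's proof and does not reproduce Lemma 3.3: it averages exactly over the finite group of signed permutation matrices, a subgroup of the orthogonal group under which any left-spherical design is also invariant. The symbols `omega` (error covariance) and `X` (design), the helper names, and all sizes are assumptions made for this example.

```python
import itertools
import numpy as np

def signed_permutations(n):
    """Yield all n x n signed permutation matrices (a finite subgroup of O(n))."""
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[list(perm)]
        for signs in itertools.product((1.0, -1.0), repeat=n):
            yield np.diag(signs) @ P

def variance_trace(X, omega):
    """tr(X^+ omega X^+T): trace of the min-norm estimator's conditional covariance."""
    X_pinv = np.linalg.pinv(X)
    return np.trace(X_pinv @ omega @ X_pinv.T)

n, p = 3, 7
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n))
omega = A @ A.T + np.eye(n)               # an arbitrary anisotropic, dependent error covariance
X = rng.standard_normal((n, p))

group = list(signed_permutations(n))

# Group average of O^T omega O collapses to (tr(omega) / n) * I.
avg_omega = sum(O.T @ omega @ O for O in group) / len(group)

# Averaging the variance over the rotated designs O X therefore factors the trace out.
avg_variance = np.mean([variance_trace(O @ X, omega) for O in group])
factored = np.trace(omega) / n * np.trace(np.linalg.inv(X @ X.T))
print(avg_variance, factored)
```

For a single fixed design the variance generally depends on the full matrix `omega`; only after averaging over the symmetry does it reduce to a product of two traces.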
Dear authors, thank you for the reply. I would like to keep my score.
I think this paper is well-written and clearly presented. On the other hand, the technical contributions are clear but seem to be limited. Personally, and respectfully, I feel this is a borderline case, so I keep my low confidence and rely on ACs and other more experienced reviewers for a final decision. Thank you for your understanding.
[General Response]
We would like to thank the reviewers for the thorough examination of the paper and their insightful and valuable comments.
We appreciate that all the reviewers recognized the strengths of our paper with positive ratings, saying "the presentation is clear", the introduction is "easy-to-follow" (JpQB) and numerical experiments are "well constructed (...) to support the theory" and "to help the reader understand (...) the results" (JpQB, pxrE, AHim). They also said our paper "addresses an important research gap", "provides a more general theoretical analysis", and "offers a comprehensive overview (...) covering various theoretical aspects" that "can extend to a wide range of regression settings, including time series, panel, and grouped data" (JpQB, fTDr, AHim, pxrE). This setting is "more realistic", "going beyond the restrictive i.i.d. error assumption" (JpQB, fTDr, AHim, pxrE).
During the author response period, we have given careful thought to the reviewers’ suggestions to answer the questions and concerns (we will make the corresponding revisions to our manuscript):
- We clarify some notations (pxrE).
- . Here, is a vector with its -th element as the -th largest eigenvalue of .
- We conduct extra experiments with larger n and p (e.g., ) (AHim, pxrE).
- See the attached pdf file.
- We discuss some generalizations to further relax the assumptions and their limitations (fTDr, pxrE).
- We restate our technical contributions (JpQB).
I am happy with the comments the authors made in their rebuttal. I think they have largely addressed most of the reviewers' queries. I have increased my score to 6.
One issue I am still not comfortable with is the random β assumption in Assumption 4.2. I accept the literature the authors cited, Dobriban and Wager (AoS, 2018), but what makes me uncomfortable is that we are doing OLS to get the estimate of β, yet to obtain the bias of the prediction risk we assume that β has a prior distribution.
Is it not possible to make some other assumption on β, rather than the distributional assumption, and work in the fixed-β setting?
Many thanks to Reviewer pxrE for raising the score to 6 and providing further comments. In principle, it would be possible to work in the fixed-β setting. For example, if we focus on the prediction risk, equation (4) in the paper (the displayed equation between lines 236 and 237) is still valid without the random β assumption. Then, we can interpret the bias result similarly with , where , instead of defined in Assumption 4.2. However, when we consider asymptotic results by letting n and p go to infinity, we need to come up with a suitable assumption such that could be treated as fixed, or we need to find a suitable limit. To simplify our discussion and provide a clean result, we followed Dobriban and Wager (AoS, 2018), who provide a heuristic approach for an average-case analysis of dense parameters. We hope that you will find our shortcut approach more acceptable.
Dear Reviewers,
Thank you for the valuable comments during the discussion period. We would like to summarize our contributions again:
The majority of the previous analyses have been limited to an unrealistic regression error structure, assuming iid errors with common variance. We explore more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data.
Notably, we establish that the estimation difficulties associated with the variance components can be summarized through and thus they are not affected by the degrees of dependence across observations.
Thanks again.
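The claim that the variance component depends on the errors only through the trace of their covariance can be checked with a small simulation. The sketch below assumes a standard Gaussian design and AR(1) errors with unit variances, so the trace equals n for every correlation level. It uses the identity tr(X^+ Ω X^+T) = tr(Ω (X X^T)^{-1}) for a full-row-rank X and the inverse Wishart mean n / (p - n - 1). The function names, sizes, seed, and repetition count are hypothetical, not the paper's experimental setup.

```python
import numpy as np

def ar1_cov(n, rho):
    """AR(1) covariance with unit variances: entry (i, j) is rho ** |i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def expected_estimation_variance(omega, p, reps=3000, seed=3):
    """Monte Carlo estimate of E[tr(omega (X X^T)^{-1})] over Gaussian n x p designs."""
    rng = np.random.default_rng(seed)
    n = omega.shape[0]
    total = 0.0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        total += np.trace(omega @ np.linalg.inv(X @ X.T))
    return total / reps

n, p = 10, 40
var_independent = expected_estimation_variance(ar1_cov(n, 0.0), p)   # no serial dependence
var_dependent = expected_estimation_variance(ar1_cov(n, 0.9), p)     # strong serial dependence
predicted = n / (p - n - 1)               # (tr(omega) / n) * n / (p - n - 1) with tr(omega) = n
print(var_independent, var_dependent, predicted)
```

Both estimates should land near the same predicted value, since the two covariance matrices share the same trace even though their off-diagonal structure differs sharply.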
Dear authors -- thank you for contributing a well-written paper on the analysis of ridgeless overparameterized interpolation methods in the non-iid setting. The reviewers agreed that this is a useful theoretical gap to fill, as indeed pure iid data rarely arises in applications outside of academia, and with ridgeless analysis being fairly new it has not yet been filled. The authors responded well to most questions from reviewers, except for the decision to combine random beta assumptions with plain OLS estimates. This needs to be better motivated or, better yet, the paper should provide an analysis for both fixed and random beta.
The reviewers also thought that the technical contribution, while useful, is somewhat limited. The scores are borderline, so I will have to make a judgement. I like the paper, but I am not sure whether the extension from the iid to the non-iid setting contains ideas beyond what is well established in the econometrics literature. You cited the textbook by Hansen, but did not explore the connection in any depth at all. The sandwich estimator and the difficulty of analyzing the trace of the covariance are very thoroughly studied in econometrics, e.g. https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf (Practitioner Guide to cluster-robust inference), White's and Newey-West estimators, and so forth, including panel and time-series data with spatial and temporal correlation structure. I would really like a more detailed discussion of the techniques in the econometrics literature, and of the novelty in applying them to the ridgeless setting. Otherwise the contribution indeed seems of the more direct type. What's different and special about analyzing these issues in the ridgeless setting?
While I cannot accept the paper in its current form, I encourage you to explore and explain the connections to the econometrics literature on handling the non-iid setting.
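For context on the sandwich estimator mentioned in the decision above, here is a minimal sketch of White's heteroskedasticity-robust (HC0) covariance for classical OLS with n > p. It is a textbook construction, not something taken from the paper, and the function name, simulated data, and sizes are assumptions made for illustration. The last lines show one observation only: for a minimum-norm interpolator with p > n the residuals vanish, so a residual-based "meat" term carries no information there.

```python
import numpy as np

def ols_with_hc0(X, y):
    """Classical OLS (n > p) with White's HC0 sandwich covariance estimate."""
    XtX_inv = np.linalg.inv(X.T @ X)                  # "bread"
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    meat = X.T @ (X * resid[:, None] ** 2)            # sum_i e_i^2 x_i x_i^T
    return beta_hat, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])                     # hypothetical true coefficients
errors = rng.standard_normal(n) * (0.5 + np.abs(X[:, 0]))   # heteroskedastic errors
y = X @ beta + errors
beta_hat, cov_hc0 = ols_with_hc0(X, y)

# In the overparameterized regime the min-norm fit interpolates, so residuals are zero.
X_wide = rng.standard_normal((5, 12))
y_wide = rng.standard_normal(5)
resid_wide = y_wide - X_wide @ (np.linalg.pinv(X_wide) @ y_wide)
print(np.sqrt(np.diag(cov_hc0)), np.abs(resid_wide).max())
```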
We thank the reviewers again. With their insightful and valuable comments, the contents and the clarity of our paper are much improved in the revised version. Please check our published version at the following link: https://openreview.net/forum?id=AsAy7CROLs&noteId=AsAy7CROLs