PaperHub
Rating: 5.8 / 10 · Rejected · 4 reviewers
Individual ratings: 6, 5, 7, 5 (min 5, max 7, std 0.8)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

Prediction Risk and Estimation Risk of the Ridgeless Least Squares Estimator under General Assumptions on Regression Errors

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

In recent years, there has been a significant growth in research focusing on minimum $\ell_2$ norm (ridgeless) interpolation least squares estimators. However, the majority of these analyses have been limited to an unrealistic regression error structure, assuming independent and identically distributed errors with zero mean and common variance. In this paper, we explore prediction risk as well as estimation risk under more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Notably, we establish that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data.
Keywords
minimum norm solution · ridgeless estimator · benign overfitting · double descent · overparameterization

Reviews and Discussion

Review (Rating: 6)

The paper examines the prediction and estimation risk of the ridgeless least squares estimator in the setting of a general error structure. The i.i.d. assumption on the errors is often not valid in settings such as time series data, panel data, grouped data, etc. The current paper introduces a theoretical framework that investigates the variance component of both prediction and estimation risks in the above-mentioned data settings. The benefits of overparameterization, previously established in the i.i.d. context, are shown to persist under dependent error structures as well.

Strengths

The strengths of the paper are as follows:

  • Investigation of prediction and estimation risk under non-i.i.d. regression errors, with specific focus on time series and clustered data

  • Explicit quantification of the variance component of both risks (as mentioned above), which depends on the trace of the error covariance matrix and the trace of a function of the design matrix as a separable product

  • Explicit analysis of the variance and bias terms of both risks (as mentioned above) in the high-dimensional asymptotics

  • Well-constructed numerical experiments to support the theory

Weaknesses

The weaknesses of the paper are as follows:

  • The theoretical results, particularly the bias component analysis section, could have been more rigorous and better written. There are some notational discrepancies and theoretical inconsistencies.

  • Some remarks following Theorems 3.4 and 3.5, where the design matrix $X$ has a known distribution (say, Gaussian), would have been useful examples for gaining insight into the results proved in the theorems.

  • Some notations, such as $a(X)$ and $b$ used in Theorem 3.4, are only clarified later in the appendix. It would be better to introduce them in the proof sketch, since they are used there anyway.

Questions

I have the following queries for the authors:

  • It would be interesting to see how the prediction risk and the estimation risk behave in Figures 1 and 2, respectively, as $n$ and $p$ increase. In other words, can we infer any pattern from the results shown in Theorems 3.4 and 3.5?

  • In Section 4, in the bias component analysis before Assumption 4.1, the authors mention that each $x_i$ has a positive definite covariance matrix and is independent of the others. Why are dependent $x_i$'s not considered?

  • In Assumption 4.2, the expectation has subscript $\beta$, and there is also an assumption that $\beta$ is independent of $X$. Are the authors assuming $\beta$ to be random? What does this mean?

  • What is $S$ in equation (4)? Did the authors define it earlier?

  • In equation (5), if $p \gg n$, does this mean the bias goes away?

  • In Corollary 4.8, I thought the double asymptotics on $n$ and $p$ had already been used. Then what does the limit w.r.t. $n$ in the second term on the right-hand side mean?

Limitations

The authors have adequately addressed the limitations of the paper.

Author Response

We thank the reviewer for the feedback. We respond to the concerns and questions below:

[Q1] (...) how do the prediction risk and the estimation risk behave in Figures 1 and 2, respectively, as $n$ and $p$ increase? In other words, can we infer any pattern from the results shown in Theorems 3.4 and 3.5?

  • Please see [General Response] and Fig S1 in the attached PDF. We test a wide range of $(n,p)$ pairs: $(100,200)$, $(100,400)$, $(100,1\text{k})$, $(500,1\text{k})$, $(500,2\text{k})$, $(500,5\text{k})$, $(500,50\text{k})$, $(500,100\text{k})$, $(10\text{k},150\text{k})$, where $1\text{k}=1000$.
  • Fig S1 (first three rows) shows that, even if the values of $n$ and $p$ change, the results are almost identical if $\gamma=p/n$ remains the same.
  • This is an interesting result and easily predictable, since $\gamma$ affects the values of $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]$ and $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+)]$. For example, if $\Sigma=I$, then $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]=\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+)]\rightarrow\frac{1}{\gamma-1}$ in the limit $n,p\rightarrow\infty$, $p/n\rightarrow\gamma$. So for a pair of sufficiently large $n$ and $p$, only $\gamma$ determines the level sets. For a discussion of anisotropic $\Sigma$, please refer to Remark 4.9 below Cor 4.8 (Lines 273-283).
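This limit is easy to check numerically. The sketch below is our own illustration (not from the paper); the sizes $n=100$, $p=400$ and the repetition count are hypothetical choices. It estimates $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+)]$ for Gaussian $X$ with $\Sigma=I$ and compares it to $1/(\gamma-1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 400                      # gamma = p/n = 4
reps = 50
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, p))  # rows x_i ~ N(0, I_p)
    # rank(X) = n a.s., so Tr((X^T X)^+) = Tr((X X^T)^{-1})
    vals.append(np.trace(np.linalg.inv(X @ X.T)))
est = np.mean(vals)
print(est, 1 / (p / n - 1))          # both close to 1/3
```

For Wishart matrices the exact finite-sample mean is $n/(p-n-1)$, which already differs from $1/(\gamma-1)$ only in lower-order terms, consistent with the level sets depending on $\gamma$ alone.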

[Q2] (...) Why are the dependent $x_i$'s not considered?

  • We can allow for dependent $x_i$'s. The remark that "each $x_i$ has a positive definite covariance matrix and is independent of the others" is just a sufficient condition for Assumption 4.1 ($\mathrm{rank}(X)=n$ almost everywhere). Assumption 4.1 can be satisfied even if the $x_i$'s are dependent; more importantly, without the rank assumption, the numerator in the equation below Line 243 becomes $p-\mathrm{rank}(X)$, which makes the RHS $=r_\Sigma^2\frac{p-\mathrm{rank}(X)}{p}$.

[Q3] (...) Are the authors assuming $\beta$ to be random? What does this mean?

  • Yes, we assume $\beta$ to be random. We make the random-$\beta$ assumption (Assumption 4.2) in order to obtain an exact closed-form finite-sample expression for the prediction risk in Corollary 4.3. This type of assumption has been used in the literature [e.g., 23, 20, 7] after the influential work by Dobriban and Wager (2018, Annals of Statistics). Although it may be less natural than a fixed-$\beta$ assumption, it is helpful for obtaining a clean insight into the problem.

[W1, W3, Q4] There are some notational discrepancies and theoretical inconsistencies. Some notations such as $a(X)$ and $b$ (...) have been clarified later in the appendix. (...) What is $S$ in equation (4)? (...)

  • Thank you for pointing that out. We moved the (full) proof of Thm 3.4, in which $S=\Sigma^{1/2}$, $a(X)=\lambda((X^\top X)^+\Sigma)$, and $b=\lambda(\Omega)$ are all defined, to the Appendix because of the page limit. This may have caused some notational discrepancies. We will move the definitions to the main text and revise our manuscript accordingly.

[Q5] In equation (5), if $p\gg n$, does this mean the bias goes away?

  • No, the bias does not go away. If $p\gg n$, i.e., $\gamma\gg 1$, then the bias is $[\mathrm{Bias}(\hat\beta\mid X)]^2=r_\Sigma^2\frac{p-n}{p}\approx r_\Sigma^2>0$, which is 1 in Fig 4 (Left).

[Q6] In Corollary 4.8, I thought the double asymptotics on $n$ and $p$ had already been used. Then what does the limit w.r.t. $n$ in the second term on the right-hand side mean?

  • The second term you refer to comes from the expected variance. As shown in our main Thm 3.4, we decompose the expected variance into two parts, (i) $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]$ and (ii) $\frac{\mathrm{Tr}(\Omega)}{n}$, i.e., $\mathbb{E}_X[\mathrm{Var}(\hat\beta\mid X)]=\text{(i)}\times\text{(ii)}$. Here, we apply the double asymptotics on $n$ and $p$ to each part (the limit of a product is the product of the limits), i.e., $\lim_{n,p\rightarrow\infty,\,p/n\rightarrow\gamma}\text{(i)}=s^*$ and $\lim_{n,p\rightarrow\infty,\,p/n\rightarrow\gamma}\text{(ii)}=\lim_{n\rightarrow\infty}\frac{\mathrm{Tr}(\Omega)}{n}$, since (ii) does not depend on $p$ (or $\gamma$).

[W2] Some remarks following Theorems 3.4 and 3.5, where the design matrix has a known distribution, say Gaussian, would have been useful examples to get insight into the results proved in the theorems.

  • This is a great suggestion. The design matrix plays a role in $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]$ (Thm 3.4) and in $\mathbb{E}_X[\mathrm{Tr}(\Lambda^+)]/p$ (Thm 3.5).
  • First, in Thm 3.5, $\mathbb{E}_X[\mathrm{Tr}(\Lambda^+)]/p$ in the limit is $\Theta(\frac{1}{\gamma-1})$, which decreases to $0$ as $\gamma\rightarrow\infty$ and increases to $\infty$ as $\gamma\searrow 1$. This is because $\mathbb{E}_X[\mathrm{Tr}(\Lambda^+)]/p\rightarrow s^\ast$ in the limit (Thm 4.7). Thus, for sufficiently large $n$ and $p$, we have $\mathbb{E}_X[\mathrm{Tr}(\Lambda^+)]/p\approx s^\ast$. Fig 4 (Right) empirically validates this approximation even for not-very-large $n$ and $p$ ($n=50$ and $p\in[50,5000]$). And $s^*=\Theta(\frac{1}{\gamma-1})$ (cf. $s^*_{\text{iso}}=\frac{1}{\gamma-1}$). This approximation depends on the degree of anisotropy of $\Sigma$, as discussed in eq. (7).
  • Second, it is not straightforward for Thm 3.4. Thus, to get some insight, we set $n=1$; then $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]=\mathbb{E}_x[\mathrm{Tr}((xx^\top)^+\Sigma)]=\mathbb{E}_x\big[\frac{x^\top\Sigma x}{\|x\|^4}\big]$. The numerator $x^\top\Sigma x$ has an expectation of $\mathrm{Tr}(\Sigma^2)=\Theta(p)$, and the denominator $\|x\|^4$ (with $x\sim\mathcal{N}(0,\Sigma)$) has an expectation of $2\mathrm{Tr}(\Sigma^2)+\mathrm{Tr}(\Sigma)^2=\Theta(p^2)$, which increases faster than that of the numerator as $p\rightarrow\infty$. Furthermore, if $\Sigma=I$ and $p>2$, then $\mathbb{E}_x[1/\|x\|^2]=1/(p-2)\rightarrow 0$ as $p\rightarrow\infty$. Fig 4 (Left) empirically validates that the variance is small for a large $\gamma$.
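The final identity, $\mathbb{E}_x[1/\|x\|^2]=1/(p-2)$ for $x\sim\mathcal{N}(0,I_p)$ (the mean of an inverse chi-squared variable with $p$ degrees of freedom), can be confirmed with a quick Monte Carlo sketch. This is our own illustration, not part of the paper; $p=100$ and the sample count are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 100, 20000
x = rng.standard_normal((reps, p))         # each row x ~ N(0, I_p)
est = np.mean(1.0 / np.sum(x**2, axis=1))  # Monte Carlo E[1/||x||^2]
print(est, 1 / (p - 2))                    # both close to 0.0102
```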
Review (Rating: 5)

The paper explores the prediction risk and estimation risk of the ridgeless least squares estimator under more general assumptions on regression errors. It highlights the benefits of overparameterization in a realistic setting that allows for clustered or serial dependence. The paper establishes that the estimation difficulties associated with the variance components of both risks can be summarized through the trace of the variance-covariance matrix of the regression errors. The findings suggest that the benefits of overparameterization can extend to time series, panel, and grouped data. The paper is a theoretical work that discusses various aspects of linear regression models, providing details on the assumptions and proofs for the theoretical results presented. It also includes information on the experimental setting and provides code and instructions for reproducing the main results.

Strengths

This study addresses an important research gap by considering more realistic assumptions on regression errors. It provides exact finite-sample characterizations of the variance components of prediction and estimation risks, includes numerical experiments that validate the theoretical results, and demonstrates the relationship between the expected variance and the covariance of the regression errors. Additionally, it analyzes the bias components of prediction and estimation risks, offers a comprehensive overview of linear regression models covering various theoretical aspects, and provides detailed proofs for the theoretical results, ensuring the validity of their claims.

Weaknesses

Is it possible to provide validation on large-scale data?

Questions

Refer to weaknesses.

Limitations

N/A

Author Response

We thank the reviewer for the feedback. We respond to the question below:

On large-scale validation

  • Please see our top-level comment [General Response] and Fig S1 in the PDF attached to it.
  • We additionally tested a wide range of $(n,p)$ pairs including $(500,5\text{k})$, $(500,50\text{k})$, $(500,100\text{k})$, $(10\text{k},150\text{k})$, where $1\text{k}=1000$.
  • Fig S1 (last row) shows that our theory ($y$-axis) matches the expected variance ($x$-axis) for high-dimensional $x_i\in\mathbb{R}^p$ with $p=50\text{k},100\text{k},150\text{k}$. Note that CIFAR-10 and ImageNet have $p\approx 3\text{k}$- and $p\approx 150\text{k}$-dimensional data, respectively.
  • Fig S1 (first three rows) shows that, even if the values of $n$ and $p$ change, the results are almost identical if the ratio $\gamma=p/n$ remains the same.
  • This is an interesting result and easily predictable, since $\gamma$ affects the values of $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]$ and $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+)]$. For example, if $\Sigma=I$, then $\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+\Sigma)]=\mathbb{E}_X[\mathrm{Tr}((X^\top X)^+)]\rightarrow\frac{1}{\gamma-1}$ in the limit $n,p\rightarrow\infty$, $p/n\rightarrow\gamma$. So for a pair of sufficiently large $n$ and $p$, only the ratio $\gamma$ determines the level sets. For a discussion of anisotropic $\Sigma$, please refer to Remark 4.9 below Corollary 4.8 (Lines 273-283).
Comment

Thanks for the response. I will keep my rating.

Review (Rating: 7)

The paper investigates the properties of minimum norm (ridgeless) interpolation least squares estimators, analyzing prediction risk and estimation risk under broader regression error assumptions, including clustered or serial dependence. This diverges from the typical assumption of i.i.d. errors with zero mean and common variance. The paper shows that the challenges in estimating the variance components of prediction and estimation risks can be captured by the trace of the variance-covariance matrix of the regression errors.

Strengths

  1. The paper provides a more general theoretical analysis of minimum norm interpolation least squares estimators, going beyond the restrictive i.i.d. error assumption.

  2. The paper suggests that the benefits of overparameterization can extend to a wider range of regression settings, including time series, panel, and grouped data.

Weaknesses

While the paper examines broader error structures, it might not fully grasp the complexity of real-world regression challenges, which could involve even more intricate patterns of error dependence.

Questions

  1. Is it possible to remove the assumption that $\varepsilon$ is independent of $X$?

  2. Is it overly restrictive to demand that the design matrix $X$ is left-spherical and has rank $n$ almost everywhere?

Limitations

The authors have addressed the limitations.

Author Response

We thank the reviewer for the feedback and comments. We also believe that it is important to understand the real-world regression challenges with even more intricate patterns of error dependence. Even though it is extremely difficult to fully grasp the complexity of real-world problems, the paper aims to provide a relatively general theoretical analysis by relaxing some restrictive assumptions. We again emphasize our contributions: we relax the previous assumptions as follows:

$$\begin{array}{|c|c|c|}
\hline
\text{Previous work} & \text{Their assumptions} & \text{Our relaxed assumption} \\
\hline
\text{Hastie et al. (2022) and Bartlett et al. (2020)} & \text{isotropic covariance } \Omega=\mathbb{E}[\varepsilon\varepsilon^\top]=\omega^2 I & \text{general covariance } \Omega \\
\text{Chinot and Lerasle (2023)} & x_i \sim_{iid} \mathcal{N}(0,\Sigma) & X \text{ is left-spherical} \\
\hline
\end{array}$$

We respond to the concerns and questions below:

On the assumption "$\varepsilon$ is independent of $X$"

  • This is a common assumption in most existing studies.
  • We can further relax the independence assumption. Specifically, $\Omega(X):=\mathbb{E}[\varepsilon\varepsilon^\top\mid X]$ may depend on $X$. Then, the variance is $\mathrm{Var}_\Sigma(\hat\beta\mid X)=\mathrm{Tr}(X^\dagger\Omega(X)X^{\dagger\top}\Sigma)=a(X)^\top\Gamma(X)b(X)$, where $a(X):=\lambda((X^\top X)^\dagger\Sigma)$, $b(X):=\lambda(\Omega(X))$, and $\lambda(A)$ is the vector whose $i$-th element $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$.
  • Therefore, with the weaker assumption "$\lambda(\Omega(X))=\lambda(\Omega(OX))$ for any orthogonal matrix $O$", we can still obtain a similar conclusion: $$\mathbb{E}_X[\mathrm{Var}_\Sigma(\hat\beta\mid X)]=\mathbb{E}_X[a(X)^\top\Gamma(X)b(X)]\overset{\text{Lemma 3.3}}{=}\mathbb{E}_X\big[\mathbb{E}_O[a(OX)^\top\Gamma(OX)b(OX)]\big]=\mathbb{E}_X\big[a(X)^\top\mathbb{E}_O[\Gamma(OX)]b(X)\big]=\mathbb{E}_X\Big[a(X)^\top\tfrac{1}{n}Jb(X)\Big]=\mathbb{E}_X\Big[\tfrac{1}{n}\sum_{i,j}a_i(X)b_j(X)\Big]=\mathbb{E}_X\Big[\tfrac{1}{n}\mathrm{Tr}((X^\top X)^\dagger\Sigma)\,\mathrm{Tr}(\Omega(X))\Big].$$ Here, $J$ is the all-ones matrix (see the proof of Theorem 3.4 for the details).
  • Even without this assumption, using the matrix inequality $\Omega(X)\preceq\Omega^\ast:=\sup_X\|\Omega(X)\|\,I_n$, we can obtain the inequality $\mathbb{E}_X[\mathrm{Var}(\hat\beta\mid X)]\leq\frac{\mathrm{Tr}(\Omega^\ast)}{n}\mathbb{E}_X[\mathrm{Tr}((X^\top X)^\dagger\Sigma)]$.

On the left-spherical symmetry assumption

  • We believe that the left-spherical symmetry assumption is restrictive but not overly restrictive, because it can be strictly weaker than the usual assumption $x_i\sim_{iid}\mathcal{N}(0,\Sigma)$. For example, the $x_i$'s can be i.i.d. features from a mixture of centered Gaussian distributions.

On the assumption $\mathrm{rank}(X)=n$ almost everywhere

  • If the $x_i$'s are independent of each other and each has a positive definite covariance matrix (e.g., $x_i\sim_{iid}\mathcal{N}(0,\Sigma)$ with $\Sigma\succ 0$), then $\mathrm{rank}(X)=n$ almost everywhere. Thus, we believe that the rank-$n$ assumption is not overly restrictive.
  • Moreover, this assumption is only made for the convenience of our asymptotic analysis. Even without the assumption, we can obtain a similar result with $\mathrm{rank}(X)$ in place of $n$.
  • Without the rank-$n$ assumption, the numerator in the equation below Line 243 becomes $p-\mathrm{rank}(X)$, which makes the RHS $=r_\Sigma^2\frac{p-\mathrm{rank}(X)}{p}$.
Comment

Thank you for your detailed response. I will maintain my score unchanged.

Review (Rating: 5)

The paper considers the ridgeless least-squares estimator, and derives its prediction and estimation risk. One of the assumptions used is that the expectation of the noise variance matrix is finite and positive-definite. This is more general than the assumption that this expectation is some positive multiple of the identity matrix.

Strengths

  • The paper has an easy-to-follow introduction that motivates the need to derive theoretical results under general assumptions on regression errors.
  • Related works are sufficiently discussed. The most relevant papers are those of Chinot et al. [9] and Chinot and Lerasle [8], which are based on different noise assumptions than those this paper makes.
  • The technical presentation is clear with examples and figures to help the reader understand the notations and results.

Weaknesses

The major concern I have is whether the paper makes sufficient technical contributions. Even with the more general assumption on noise (Assumption 2.1), the technical change in the proofs seems very small compared to prior work. For example, the proof of Theorem 3.4 is short and relatively straightforward (and this might further simplify if we make Gaussian assumptions on data rather than left-spherical assumptions. Gaussian assumptions are what I like to make personally). It is always nice to have short and concise proofs whenever possible, but this might also indicate that the paper is not very technically solid.

Questions

N.A.

Limitations

N.A.

Author Response

We thank the reviewer for the feedback. We respond to the concerns below:

On the technical contributions

  • We have a concise proof; a compact, specialized technique made it possible.

  • The main technical difficulty is that we generally cannot directly factor $\Omega$ out of $\mathrm{Tr}(X^\dagger\Omega X^{\dagger\top}\Sigma)$.

  • In the isotropic error case $\Omega=\omega^2 I_n$, we can easily pull $\omega^2$ out of the trace: $\mathrm{Var}_\Sigma(\hat\beta\mid X)=\mathrm{Tr}(X^\dagger\Omega X^{\dagger\top}\Sigma)=\omega^2\mathrm{Tr}(X^\dagger X^{\dagger\top}\Sigma)=\omega^2\mathrm{Tr}((X^\top X)^\dagger\Sigma)$.
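The last equality above uses the standard pseudoinverse identity $X^\dagger X^{\dagger\top}=(X^\top X)^\dagger$. A quick deterministic sketch (our own illustration, not from the paper; the sizes are arbitrary) confirms it on a random wide matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 12
X = rng.standard_normal((n, p))
S = rng.standard_normal((p, p))
S = S @ S.T                                  # arbitrary PSD matrix playing the role of Sigma
Xd = np.linalg.pinv(X)                       # X^dagger, shape (p, n)
lhs = np.trace(Xd @ Xd.T @ S)                # Tr(X^dagger X^daggerT Sigma)
rhs = np.trace(np.linalg.pinv(X.T @ X) @ S)  # Tr((X^T X)^dagger Sigma)
assert np.isclose(lhs, rhs)
```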

  • Under the general error assumption (e.g., anisotropic errors), however, this is not feasible. We would like to emphasize that, to address this technical difficulty, we compute the "expected" variance ($\mathbb{E}_X[\cdot]$ over $X$) in order to apply Lemma 3.3, which is technically novel: $$\mathbb{E}_X[\mathrm{Var}_\Sigma(\hat\beta\mid X)]=\mathbb{E}_X[\mathrm{Tr}(X^\dagger\Omega X^{\dagger\top}\Sigma)]\overset{\text{Thm 3.4}}{=}\frac{\mathrm{Tr}(\Omega)}{n}\,\mathbb{E}_X[\mathrm{Tr}((X^\top X)^\dagger\Sigma)].$$
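As a sanity check, a small Monte Carlo sketch is consistent with the factorization $\mathbb{E}_X[\mathrm{Tr}(X^\dagger\Omega X^{\dagger\top}\Sigma)]=\frac{\mathrm{Tr}(\Omega)}{n}\mathbb{E}_X[\mathrm{Tr}((X^\top X)^\dagger\Sigma)]$ for an anisotropic $\Omega$. This is our own illustration, not from the paper; the AR(1) error covariance and the sizes $n=30$, $p=90$ are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 30, 90, 400
rho = 0.7
# AR(1)-type error covariance: strong serial dependence, not isotropic
Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = np.eye(p)                      # isotropic features, for simplicity

lhs_vals, tr_vals = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))    # rows iid N(0, I_p): left-spherical
    Xd = np.linalg.pinv(X)
    lhs_vals.append(np.trace(Xd @ Omega @ Xd.T @ Sigma))
    tr_vals.append(np.trace(np.linalg.pinv(X.T @ X) @ Sigma))

lhs = np.mean(lhs_vals)                           # E_X[Tr(X^+ Omega X^+T Sigma)]
rhs = np.trace(Omega) / n * np.mean(tr_vals)      # (Tr(Omega)/n) E_X[Tr((X^T X)^+ Sigma)]
print(lhs, rhs)                                   # the two averages agree closely
```

Note that only $\mathrm{Tr}(\Omega)$ enters the right-hand side, so the off-diagonal dependence structure of $\Omega$ does not affect the expected variance, matching the claim of Theorem 3.4.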

  • We again emphasize our contributions: we relax the previous assumptions as follows:

$$\begin{array}{|c|c|c|}
\hline
\text{Previous work} & \text{Their assumptions} & \text{Our relaxed assumption} \\
\hline
\text{Hastie et al. (2022) and Bartlett et al. (2020)} & \text{isotropic covariance } \Omega=\mathbb{E}[\varepsilon\varepsilon^\top]=\omega^2 I & \text{general covariance } \Omega \\
\text{Chinot and Lerasle (2023)} & x_i \sim_{iid} \mathcal{N}(0,\Sigma) & X \text{ is left-spherical} \\
\hline
\end{array}$$
Comment

Dear authors, thank you for the reply. I would like to keep my score.

I think this paper is well-written and clearly presented. On the other hand, the technical contributions are clear but seem to be limited. Personally, and respectfully, I feel this is a borderline case, so I keep my low confidence and rely on ACs and other more experienced reviewers for a final decision. Thank you for your understanding.

Author Response

[General Response]

We would like to thank the reviewers for the thorough examination of the paper and their insightful and valuable comments.

We appreciate that all the reviewers recognized the strengths of our paper with positive ratings, saying "the presentation is clear", the introduction is "easy-to-follow" (JpQB) and numerical experiments are "well constructed (...) to support the theory" and "to help the reader understand (...) the results" (JpQB, pxrE, AHim). They also said our paper "addresses an important research gap", "provides a more general theoretical analysis", and "offers a comprehensive overview (...) covering various theoretical aspects" that "can extend to a wide range of regression settings, including time series, panel, and grouped data" (JpQB, fTDr, AHim, pxrE). This setting is "more realistic", "going beyond the restrictive i.i.d. error assumption" (JpQB, fTDr, AHim, pxrE).

During the author response period, we have given careful thought to the reviewers’ suggestions to answer the questions and concerns (we will make the corresponding revisions to our manuscript):

  • We clarify some notations (pxrE).
    • $a(X):=\lambda((X^\top X)^\dagger\Sigma)$, $b:=\lambda(\Omega)$, $S:=\Sigma^{1/2}$. Here, $\lambda(A)$ is the vector whose $i$-th element $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$.
  • We conduct extra experiments with larger $n$ and $p$ (e.g., $n=10\text{k}$, $p=150\text{k}$) (AHim, pxrE).
    • See the attached PDF file.
  • We discuss some generalizations to further relax the assumptions and their limitations (fTDr, pxrE).
  • We restate our technical contributions (JpQB).
Comment

I am happy with the comments the authors made in their rebuttal. I think they have largely addressed most of the queries of the reviewers. I have increased my score to 6.

One issue with which I am still not comfortable is the random-$\beta$ assumption in Assumption 4.2. I accept the literature the authors point to (Dobriban and Wager, AoS, 2018), but what makes me uncomfortable is that we are doing OLS to get $\hat\beta$, yet to obtain the bias of the prediction risk we are making the assumption as if $\beta$ had a prior distribution.

Is it not possible to make some other assumption on $\beta$, rather than the distributional assumption, and work in the fixed-$\beta$ setting?

Comment

Many thanks to Reviewer pxrE for raising the score to 6 and providing further comments. In principle, it would be possible to work in the fixed-$\beta$ setting. For example, if we focus on the prediction risk, equation (4) in the paper (the displayed equation between Lines 236 and 237) is still valid without the random-$\beta$ assumption. Then, we can interpret the bias result similarly with $S\beta(S\beta)^\top$, where $S=\Sigma^{1/2}$, instead of the $r_\Sigma^2$ defined in Assumption 4.2. However, when we consider asymptotic results by letting $n$ and $p$ converge to infinity, we need to come up with a suitable assumption such that $S\beta(S\beta)^\top$ could be treated as fixed, or we need to find a suitable limit. To simplify our discussion and provide a clean result, we followed Dobriban and Wager (AoS, 2018), which provides a heuristic approach for an average-case analysis of dense parameters. We hope that you will find our shortcut approach acceptable.

Comment

Dear Reviewers,

Thank you for the valuable comments during the discussion period. We would like to summarize our contributions again:

The majority of the previous analyses have been limited to an unrealistic regression error structure, assuming iid errors with common variance. We explore more general regression error assumptions, highlighting the benefits of overparameterization in a more realistic setting that allows for clustered or serial dependence. Our findings suggest that the benefits of overparameterization can extend to time series, panel and grouped data.

$$\begin{array}{|c|c|c|}
\hline
\text{Previous work} & \text{Their assumptions} & \text{Our relaxed assumption} \\
\hline
\text{Hastie et al. (2022) and Bartlett et al. (2020)} & \text{isotropic covariance } \Omega=\mathbb{E}[\varepsilon\varepsilon^\top]=\omega^2 I & \text{general covariance } \Omega \\
\text{Chinot and Lerasle (2023)} & x_i \sim_{iid} \mathcal{N}(0,\Sigma) & X \text{ is left-spherical} \\
\hline
\end{array}$$

Notably, we establish that the estimation difficulties associated with the variance components can be summarized through $\mathrm{Tr}(\Omega)$, and thus they are not affected by the degree of dependence across observations: $$\mathbb{E}_X[\mathrm{Var}_\Sigma(\hat\beta\mid X)]=\mathbb{E}_X[\mathrm{Tr}(X^\dagger\Omega X^{\dagger\top}\Sigma)]\overset{\text{Thm 3.4}}{=}\frac{\mathrm{Tr}(\Omega)}{n}\,\mathbb{E}_X[\mathrm{Tr}((X^\top X)^\dagger\Sigma)].$$

Thanks again.

Final Decision

Dear authors -- thank you for contributing a well-written paper on the analysis of ridgeless overparameterized interpolation methods in the non-iid setting. The reviewers agreed that this is a useful theoretical gap to fill, as indeed pure iid data rarely arises in applications outside of academia, and with ridgeless analysis being fairly new the gap has not yet been filled. The authors responded well to most questions from reviewers, except for the decision to combine random-beta assumptions with plain OLS estimates; this needs to be better motivated, or better yet, provide both an analysis for fixed and random beta.

The reviewers also thought that the technical contribution, while useful, is somewhat limited. The scores are borderline, so I will have to make a judgement. I like the paper, but I am not sure whether the extension from the iid to the non-iid setting contains ideas beyond what's well established in the econometrics literature. You cited the textbook by Hansen, but did not explore the connection in any depth at all. The sandwich estimator and the difficulty of analyzing the trace of the covariance are very thoroughly studied in econometrics, e.g. https://cameron.econ.ucdavis.edu/research/Cameron_Miller_JHR_2015_February.pdf (Practitioner Guide to cluster-robust inference), White's and Newey-West estimators, and so forth, including panel and time-series data with spatial and temporal correlation structure. I would really like a more detailed discussion of the techniques in the econometrics literature, and the novelty in applying them to the ridgeless setting; otherwise the contribution indeed seems of the more direct type. What's different and special about analyzing these issues in the ridgeless setting?

While I can not accept the paper in the current form, I encourage you to explore and explain the connections to econometrics literature on handling non-iid setting.

Public Comment

We thank the reviewers again. With their insightful and valuable comments, the contents and the clarity of our paper are much improved in the revised version. Please check our published version at the following link: https://openreview.net/forum?id=AsAy7CROLs&noteId=AsAy7CROLs