PaperHub
Rating: 4.3/10 (withdrawn; 3 reviewers)
Scores: 5, 5, 3 (min 3, max 5, std 0.9)
Average confidence: 3.0
ICLR 2024

Denoising Low-Rank Data Under Distribution Shift: Double Descent and Data Augmentation

Submitted: 2023-09-23 · Updated: 2024-03-26
TL;DR

We compute the test error for linear denoisers for low-rank data in the presence of covariate shifts, experimentally verify our results, and derive insights about double descent, data augmentation and benign overfitting from them.

Abstract

Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to have access to noiseless training data from a different dataset than the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, while most past theoretical work assumes that the data covariance matrix is full-rank and well-conditioned, empirical studies have shown that real-life data is approximately low-rank. Thus, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and dimensionality of data have made it important to study non-classical regimes of learning. Thus, we work in the non-classical proportional regime, where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$. For this setting, we derive general test error expressions for both denoising and noisy-input regression, and study when overfitting the noise is benign, tempered or catastrophic. We show that the test error exhibits double descent under distribution shift, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, where we match the theoretical predictions with under 1% MSE error for low-rank data.
Keywords
Denoising, Low Rank, distribution shift, double descent, overfitting, supervised, benign overfitting, autoencoder, covariate shift, out of distribution generalization

Reviews and Discussion

Review (Rating: 5)

This work focuses on analyzing multi-output linear regression in the presence of noisy inputs. An important part of their work is the discussion of the scenario in which the target multivariate linear regressor equals the identity, allowing them to explore the problem of linear denoising. Similar to a recent line of works, and relevant to contemporary machine learning practices, this work explores the asymptotic regime where data dimensions and the number of samples grow proportionally.

The primary distinctions from prior studies include the absence of an assumption that the training and test data are independently and identically distributed (i.i.d.) or even sourced from the same distribution. Instead, it is assumed that both the test and training data exist within the same low-dimensional subspace (exhibiting low-rankness). Similarly, this work employs relatively mild assumptions concerning the noise in both the test and training data.

Due to the specific assumptions made about the data, the test error is defined as the expected value taken over both the training and test noise of the mean squared error (MSE) on the test data. The authors subsequently derive precise formulas for the test error of the linear regressor that minimizes the training error and the one that minimizes the expected training error. Additionally, they discuss distribution shift bounds and the asymptotic behavior of the (relative) excess risk.

The authors conclude by comparing their theoretical predictions on real datasets, demonstrating good agreement.

Strengths

This paper departs from the conventional assumptions that have been prevalent in previous high-dimensional regression studies. Notably, a particularly intriguing contribution lies in the elimination of the iid or "Gaussian-like" assumptions previously relied upon. This requires the application of novel and technical analyses, paving the way for promising avenues for future research.

Lastly, the strong agreement between the theoretical formulas and the experimental results on standard ML datasets suggests that the empirical observations hold a certain level of "universality" for approximately low-rank datasets.

Weaknesses

The paper operates in a regime where the noise is essentially negligible in comparison to the signal's magnitude. This is characterized by a signal-to-noise ratio that approaches infinity. However, the more intriguing scenario is when the noise magnitude is comparable to that of the signal.

On the same note, the authors assert that their noise assumption is "strictly more general" than that of previous works, like Cui and Zdeborova. It's important to note that Cui and Zdeborova consider Gaussian noise with a variance of O(1) for each entry, which contrasts with the o(1) variance examined here. For example, in Cui and Zdeborova, the squared Frobenius norm of Atr is of the order O(N).

The phenomenon of vanishing noise, combined with a fixed subspace dimension, leads to the peculiar situation where both training and test errors tend to zero, regardless of the under/over-parameterization ratio (c).

The theoretical results presented in the "Out-of-Distribution and Out-of-Subspace" section are not adequately discussed, leaving the reader to ponder their implications.

The predictions derived from empirical results apply specifically to linear denoisers on real datasets. However, it remains unclear whether these observations hold true or offer insights into the behavior of more general denoisers in contemporary practices, such as deep neural networks.

Questions

Could you please provide clarification regarding the regularization of W due to the noise discussed on page 8?

Firstly, there is already significant regularization in place due to the selection of the minimum norm solution. Therefore, it's not entirely clear to me how this interacts with noise-induced regularization.

Secondly, it's worth noting that the empirical results are based on noiseless data. If the objective is to showcase the effects of noise regularization, wouldn't it be beneficial to introduce a small amount of noise and compare it with the noise-free scenario?

Comment

We thank the reviewer for the in-depth comments about our work and for agreeing with us that removing the Gaussian-like assumptions is one of our major contributions. If the reviewer feels that our series of responses adequately addresses their concerns, we hope that they will consider revising their score.

We would first like to clarify a misconception about the relative size of the signal and noise.

From Assumption 1.2, for $X_{trn} \in \mathbb{R}^{d \times N}$ we have $\|X_{trn}\|_F^2 = O(N)$. This implies that each data point has norm $O(1)$.

From Assumption 2, for $A_{trn} \in \mathbb{R}^{d \times N}$ we have $\mathbb{E}[A_{ij}^2] = \eta^2/d$. This means that for the whole matrix $A_{trn}$,

$$\mathbb{E}\left[\|A_{trn}\|_F^2\right] = \sum_{i=1}^d \sum_{j=1}^N \mathbb{E}\left[A_{ij}^2\right] = d N \eta^2/d = \eta^2 N.$$

Further, from Assumption 2, we assume that $\eta^2 = \Theta(1)$.

Hence the norm of the data matrix and the norm of the noise matrix are on the same scale, so we do not have vanishing noise.

In fact, based on our assumptions we have $\mathbb{E}\left[\|A_{trn}\|_F^2\right] = \Theta(N)$, whereas we only have $\|X_{trn}\|_F^2 = O(N)$; in particular, we allow for sublinear growth. Hence our results are applicable to situations where we have diminishing signal.


We picked the scaling so that data points and noise vectors have constant norm as $d, N \to \infty$, rather than having both norms go to $\infty$ as is the case in many prior works. Due to this, the errors go to 0. However, if we undo the scaling, i.e., multiply everything by $d$, the asymptotic error would be non-zero and would depend on $c$. This is not due to vanishing noise; it is due to the scaling we use for the problem, which is the same for both the signal and the noise.
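To make this scaling concrete, here is a minimal numerical sketch (the dimensions, rank, singular values, and Gaussian noise below are illustrative choices consistent with the assumptions above, not the exact setup of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, r, eta = 1000, 800, 5, 1.0   # illustrative sizes; r is the fixed rank

# Low-rank data with ||X_trn||_F^2 = O(N), so each column has norm O(1).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
V, _ = np.linalg.qr(rng.standard_normal((N, r)))
S = np.diag(np.sqrt(N) * np.linspace(1.0, 2.0, r))
X_trn = U @ S @ V.T

# Noise with E[A_ij^2] = eta^2 / d, so E||A_trn||_F^2 = eta^2 * N.
A_trn = rng.standard_normal((d, N)) * (eta / np.sqrt(d))

print(np.linalg.norm(X_trn, "fro") ** 2 / N)   # O(1): signal energy per sample
print(np.linalg.norm(A_trn, "fro") ** 2 / N)   # ~ eta^2: noise energy per sample
```

Both printed quantities are of constant order, illustrating that the signal and noise are on the same scale per sample.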

Comment

The theoretical results presented in the "Out-of-Distribution and Out-of-Subspace" section are not adequately discussed, leaving the reader to ponder their implications.

We have added more discussion in the updated manuscript. The main implications are as follows:

  1. For out-of-distribution generalization, we note that the risk curve for out-of-distribution data is close to the risk curve for in-distribution data. This suggests:
a. Improving in-distribution performance improves out-of-distribution performance
b. We see more evidence for the linear relationship between in-distribution and out-of-distribution risk. (See  Miller et al. (2021); Recht et al. (2019); Miller et al. (2020); McCoy et al. (2020))
c. We see that the out-of-distribution risk curve also exhibits double descent. 

2. The implications for out-of-subspace are similar.

The predictions derived from empirical results apply specifically to linear denoisers on real datasets. However, it remains unclear whether these observations hold true or offer insights into the behavior of more general denoisers in contemporary practices, such as deep neural networks.

The reviewer is correct that our predictions are for linear denoisers on real datasets. We believe this to be a significant contribution. We agree with the reviewer that more work is needed to understand denoising via other models. This analysis needs to be done carefully and deserves to be the focus of its own paper. We believe that our paper serves as a foundation upon which to build such work.

Questions:

Could you please provide clarification regarding the regularization of W due to the noise discussed on page 8?

Prior work, such as Bishop 1995, has shown that input noise acts as a regularizer. Further, prior work such as Nakkiran et al. 2020 has shown that properly regularizing the model alleviates double descent. Hence, we interpret the existence of the peak (despite the regularization from the noise) as evidence that the strength of the regularization near the peak is small.

Firstly, there is already significant regularization in place due to the selection of the minimum norm solution. Therefore, it's not entirely clear to me how this interacts with noise-induced regularization.

The reviewer is correct. The noise regularization further regularizes the norm of the denoiser, biasing it towards solutions with even smaller norms.

Secondly, it's worth noting that the empirical results are based on noiseless data. If the objective is to showcase the effects of noise regularization, wouldn't it be beneficial to introduce a small amount of noise and compare it with the noise-free scenario?

Apologies, but we are not sure we understand this comment. Could the reviewer provide more details?

We thank the reviewer for their comments. We hope that with this response we have addressed all concerns.

Review (Rating: 5)

This paper derives generalization bounds for linear least squares problems with noisy data. In contrast to prior literature, this work considers a more realistic and general setting where 1) the data matrix is low rank, 2) a non-classical proportional regime is used by taking both the data dimension $d$ and the number of samples $N$ to $\infty$, and 3) the test data can be drawn from an arbitrary, non-iid distribution. The results of this paper yield some novel insights into the double descent phenomenon and out-of-distribution generalization.

Strengths

The problem setting in this paper is well-explained and well-motivated. Also, this paper offers a wide range of theoretical and empirical results. These results lead to some interesting insights into a variety of topics of current interest. In particular, I like that Theorem 1 demonstrates the different generalization behaviors in the under-parameterized vs. over-parameterized regimes, and the experiments reinforce this contribution by illustrating the double descent phenomenon in accordance with the predictions of Theorem 1. Lastly, the out-of-distribution generalization bounds look promising, and I suggest that the authors mention Corollary 5 in the main body, as it is a considerably stronger statement than Corollary 1.

Weaknesses

However, my impression is that the paper tries to cover too much mileage at once and thus does not explain the results very clearly. Here are my criticisms:

  1. The authors did not clearly explain the terms in Theorem 1. In the second-to-last paragraph of page 6, what are "bias term" and "variance term" referring to? The authors also did not explain the difference between the second terms in the respective bounds for under/over-parameterized case. And finally, for the over-parameterized case, where did the extra third term come from?

  2. There is no proof sketch for any of the results. Given that the exact terms of the generalization bounds are difficult to digest, it is imperative for the authors to elucidate the key ideas and techniques behind the results. Also, since the proof of Theorem 1 is almost 20 pages long, without an outline of the proof's approach, there is no way for me to even superficially check its validity.

  3. The additional results in the Appendix are very poorly organized; many experimental results are mixed into the "Additional Theoretical Results" section. And many theorem statements (e.g., Theorem 5) in Appendix C are not written out completely and rigorously, even though the page limit is not a concern.

  4. In Section 4, I think the conclusion in "Overfitting Paradigms" is incorrect. In the limit as $N \to \infty$, the value of $c$ should approach zero, NOT $+\infty$. Then, we should not be achieving benign overfitting as suggested by the authors.

Lastly, as a suggestion, the authors should consider briefly explaining that the Marchenko-Pastur measure is the "natural" limit distribution of the singular values of a random matrix. Otherwise, specifying the shape parameter $c$ may look weird to the reader.

Questions

  1. In the first term of Theorem 2's bound, I feel that there should be an $N_{tst}$ in the denominator?

  2. For the figures, I personally do not like that the $x$-axis is in $1/c$ instead of $c$. In the literature, the $x$-axis is usually for the amount of over-parameterization (e.g., Figure 1 in [1]). What are your motivations for doing it this way?

  3. If I understood correctly, the solid lines in Figure 1a simply connect the empirical data points, right? This is quite confusing as the solid line denotes the theoretical prediction in Figure 1c.

  4. Why is that the theoretical prediction deviates slightly from the empirical observation in Figure 1c (real world setting) but is a near perfect match in Figures 9 and 10? Is that because the experiments for Figure 1c rely on a low-rank approximation? Or are there other reasons?

[1] Schaeffer, Rylan, et al. "Double Descent Demystified: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle." arXiv preprint arXiv:2303.14151, 2023.

Comment

We thank the reviewer for their positive comments. We agree with the reviewer that we offer a wide variety of theory results and that the difference in generalization between the under-parameterized and the over-parameterized regimes is fascinating. If the reviewer feels that our series of responses adequately addresses their concerns, we hope that they will consider revising their score.

We thank the reviewer for their interest in Corollary 5. We have updated the paper to mention Corollary 5 in the main paper.

We now address the reviewer's concerns.

Explanation of the terms in Theorem 1.

We thank the reviewer for pointing out this issue. We have added the following clarification. Our generalization error is given by

$$\mathbb{E}_{A_{trn}, A_{tst}}\left[\frac{\|Y_{tst} - W_{opt}(X_{tst} + A_{tst})\|^2}{N_{tst}}\right]$$

We can expand this into

$$\frac{1}{N_{tst}}\left( \|Y_{tst} - W_{opt} X_{tst}\|^2 + \|W_{opt}A_{tst}\|^2 - 2\,\mathrm{Tr}\left((Y_{tst}-W_{opt}X_{tst})^T W_{opt}A_{tst}\right)\right).$$

Due to the assumption that the entries of $A_{tst}$ have mean 0, when we compute the expectation, the cross term vanishes. Then, we are left with two terms. The first is $\frac{1}{N_{tst}} \|Y_{tst} - W_{opt} X_{tst}\|^2$, which we think of as the bias, and the second is $\frac{1}{N_{tst}}\|W_{opt}A_{tst}\|^2$, which we think of as the variance.

Relating this to Theorem 1 (ignoring the big-O and little-o terms), we have two terms for both the over- and under-parameterized regimes. The first term (which depends on $L$) is the bias term. The second term (which is independent of $L$) is the variance term.

The extra term in the overparameterized case appears for a technical reason. In the under-parameterized case, certain products of matrices cancel nicely and become either 0 or the identity, so we do not need to estimate them using random matrix theory. This is not the case in the overparameterized setting: there we get a term for which the error between the true value and our estimate is $O(\|\Sigma\|/N^2)$.
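For concreteness, here is a small Monte Carlo sketch of the bias/variance decomposition above (illustrative sizes, a random placeholder $W_{opt}$, and Gaussian test noise with entrywise variance $\eta^2/d$, which is one admissible example):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N_tst, eta = 50, 2000, 1.0

# Placeholder quantities, for illustration only.
W_opt = rng.standard_normal((d, d)) / np.sqrt(d)
X_tst = rng.standard_normal((d, N_tst))
Y_tst = X_tst.copy()          # denoising target: Y_tst = X_tst

bias = np.linalg.norm(Y_tst - W_opt @ X_tst, "fro") ** 2 / N_tst

# Average the test error over many draws of the mean-zero test noise A_tst.
errs, variances = [], []
for _ in range(200):
    A_tst = rng.standard_normal((d, N_tst)) * (eta / np.sqrt(d))
    errs.append(np.linalg.norm(Y_tst - W_opt @ (X_tst + A_tst), "fro") ** 2 / N_tst)
    variances.append(np.linalg.norm(W_opt @ A_tst, "fro") ** 2 / N_tst)

# The cross term averages to roughly 0, so mean error ~ bias + mean variance.
print(np.mean(errs), bias + np.mean(variances))
```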

Proof Sketch

The proof is broken down into the following steps.

  1. Obtain a formula for $W_{opt}$. In particular, we know that $W_{opt} = X(X+A)^+$. This step expands $(X+A)^+$ and simplifies the expressions. (Lemma 1 for the overparameterized and Lemma 9 for the under-parameterized case.)
  2. Substitute this expression into the formulas for the bias and the variance. Then, we can compute these norms as traces of various matrices. (Lemmas 2 and 10 for the bias; the variance is handled in the proof of the main theorem.)
  3. This is the most challenging step: show that the various products concentrate to diagonal matrices. This is quite difficult, and Lemmas 3-7 and 11-15 prove these concentration results.

Appendix organization.

We thank the reviewer for pointing this out. We have reorganized the appendix. We have also added more details to Appendix C.

Overfitting incorrect

We thank the reviewer for pointing this out. However, we believe the error is a typo (not a fundamental one), and the result has been updated. In the theorem for the underparameterized case, we were missing a $c$. So the relative error is $c/(1-c)$, which approaches 0 as $c \to 0$. In the highly overparameterized case, we have a relative error of $1/(c-1)$, which also approaches 0 as $c \to \infty$. Thus both the highly overparameterized and the highly underparameterized cases benignly overfit, and we see the more complex tempered overfitting in the proportional regime.

Questions

Theorem 2 bound

Yes, that is correct. We have fixed it.

c vs 1/c

We did this because we test on real data. Recall that $c = d/N$. For real data, $d$ is fixed (for CIFAR, $d = 3 \cdot 32 \cdot 32 = 3072$) and is something we cannot change. So, we vary $c$ by varying $N$ and plot our graphs against $1/c = N/d$. This reads naturally once one notices that $N$ was the varied quantity.

Figure 1a

This is not correct: in Figure 1a, the solid line connects the theoretical values. If the reviewer looks closely (zooms in on the old PDF), the scatter points and the line do not exactly match. They are close, but it is not exact. However, to avoid this confusion, we have sampled more theory points and plotted a smoother version of the theory curves, making it more evident that the solid lines are always theory.

Figure 1c vs Figures 9 and 10

For Figures 9 and 10, we are doing this for distributions in our low-rank subspace. Hence, the curves exactly line up. Figure 1c is for an experiment without the exact low-rank assumptions needed by our results. Hence, there is a mismatch.

—-------------

We thank the reviewer again. We hope that, with the above clarifications and the changes to the paper, the reviewer will be willing to accept the paper.

Comment

Thank you for the response and I appreciate the changes made to address my concerns.

However, I find that not all changes to the manuscript were highlighted. Since it seems that I cannot access the original version anymore, it is rather difficult to locate the changes. So, please correct me if I missed anything.

All of my concerns except for points 1 and 2 are addressed to varying degrees. But there are two major issues that remain.

  1. The restructuring of the Appendices was very low-effort. Appendix B now contains no explanations of the figures at all. And the re-shuffling created some dangling pointers at the bottom of page 6.

  2. I am not entirely happy with the added discussions about Theorem 1 and the proof sketch. I don't think it is difficult to guess that the bounds would contain a variance and bias term. I am interested in seeing a careful dissection of the terms, e.g., what is the significance of the matrix $L$, while $V_{trn}$ does not make it into the bound? In the second term, why is the exponent of $c$ different in the two settings? There are many possible questions the authors could answer that could offer some technical insights beyond the equations. And speaking of the proof sketch, step 3 basically says it involves a bunch of complicated math without elaborating further. I feel I did not gain any useful intuition regarding the proof (the response to reviewer RbRx was a bit more insightful, however).

My general impression is that authors did not make enough changes to address my concerns, especially regarding my points 1 and 2. Also, I feel the summary of my review in the top-level thread is potentially misleading (see my response above). Overall, I am comfortable with maintaining my original assessment of this paper.

Comment

We apologize again, and we sincerely thank the reviewer for their engagement.

Here, we present more details about the proof. We focus on the $\beta = I$ case for simplicity.

Notational Setup

The first step we expand on is the expansion of $W = X(X+A)^+$ (note $\beta = I$). We do this using the results from Wei 2001.

To do so, we need some notation. Define $Z = I + V_{trn}^T A_{trn}^+ U\Sigma_{trn}$, which is common to both $c > 1$ and $c < 1$.

Then for $c > 1$, we define the following. Note that when $c > 1$, $A_{trn}$ has full column rank (and not full row rank), so $I - A_{trn}A_{trn}^{+} \neq 0$.

  1. $P = -(I - A_{trn}A_{trn}^+)U\Sigma_{trn}$ (this is a projection of the left singular vectors onto the orthogonal complement of the range of the noise),
  2. $H = V_{trn}^T A_{trn}^+$ (this measures the alignment between the right singular vectors of the two matrices),
  3. $K_1 = HH^T + Z(P^TP)^{-1}Z^T$.

For $c < 1$:

  1. $Q = V_{trn}^T(I - A_{trn}^+ A_{trn})$ (this is the analog of $P$),
  2. $K = -A_{trn}^+ \Sigma_{trn} U$ (this is the analog of $H$),
  3. $H_1 = K^TK + Z^T(QQ^T)^{-1}Z$ (this is the analog of $K_1$).

Notice that so far, everything is nicely symmetric.

Then for $c > 1$ we get

$$W_{opt} = U\Sigma_{trn} (P^TP)^{-1}Z^T K_1^{-1}H - U \Sigma_{trn} Z^{-1}HH^T K_1^{-1}ZP^+$$

and for $c < 1$ we get

$$W_{opt} = -U\Sigma_{trn} H_1^{-1} K^T A_{trn}^+ + U \Sigma_{trn} H_1^{-1}Z^T(QQ^T)^{-1}H.$$

The first thing to highlight is that the symmetry is broken. The expression for $c < 1$ is not just the expression for $c > 1$ with the analogous terms substituted. This is the first reason the expressions are different. There is a second reason that we will get to shortly.

Rank 1 Case

The above expressions are very complicated. So, to build intuition, let us focus on the $r = 1$, $c > 1$ case. Here we present an example of the concentration argument. This has the following steps:

  1. Step 1: Write all expressions as a trace or a sum depending only on the singular values of $A_{trn}$.
  2. Step 2: Estimate using random matrix theory and show concentration.

Step 1

For the $r = 1$ case, we can consider the thin SVD $X = \sigma_{trn}u_{trn} v_{trn}^T$, where $u_{trn}, v_{trn}$ are vectors. In this case, $Z$ is a scalar, $P$ is a column vector, $H$ is a row vector, and $K_1$ is a scalar.

As mentioned, the main idea is to show that the expectations of all terms concentrate. Let us do this by showing that $K_1$ concentrates. Recall

  1. $P = -(I - A_{trn}A_{trn}^+)u_{trn}\sigma_{trn}$,
  2. $H = v_{trn}^T A_{trn}^+$,
  3. $K_1 = HH^T + Z(P^TP)^{-1}Z^T$.

Since the expectation is linear, let us focus on $HH^T$. Here, we see that

$$HH^T = v_{trn}^T(A_{trn}^T A_{trn})^+ v_{trn}$$

In this case, let $A_{trn}^T A_{trn} = \tilde{V} \tilde{\Sigma}^2 \tilde{V}^T$. Using bi-rotational invariance, we see that $\tilde{V}$ is a uniformly random orthogonal matrix that is independent of $\tilde{\Sigma}$. Hence $a := v_{trn}^T\tilde{V}$ is a uniformly random vector on the sphere (since $\|v_{trn}\| = 1$).

Thus,

$$HH^T = \sum_{i} a_i^2 \frac{1}{\tilde{\sigma}_{i}^2}$$

Then, when we take the expectation, since $a$ is independent of $\tilde{\Sigma}$, we get

$$\mathbb{E}[HH^T] = \sum_{i=1}^N \mathbb{E}[a_i^2]\, \mathbb{E}\left[\frac{1}{\tilde{\sigma}_{i}^2}\right] = \sum_{i=1}^N \frac{1}{N}\, \mathbb{E}\left[\frac{1}{\tilde{\sigma}_{i}^2}\right]$$

For the second term $Z(P^TP)^{-1}Z^T$, the idea is similar: we write $Z$ and $P^TP$ in the same form. Then, after Step 2, where we show concentration, we can show that the inverse of $P^TP$ concentrates and then that the product concentrates.

Hence, we see that we have done step 1.

Step 2

This is where we use Assumption 2.4. Using that assumption, we have (note: without the expectation)

$$\sum_{i=1}^N \frac{1}{N} \frac{1}{\tilde{\sigma}_{i}^2} \to \int \frac{1}{\lambda}\, d\mu_c(\lambda)$$

where $\mu_c$ is the appropriate Marchenko-Pastur distribution.

We compute the variance to show concentration and see that it is decaying.
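As an illustration of this concentration step, the sketch below (Gaussian noise as one admissible example, with illustrative $d$, $N$, and $\eta$) compares the empirical average of $1/\tilde{\sigma}_i^2$ with the corresponding Marchenko-Pastur integral; the first inverse moment of the Marchenko-Pastur law with aspect ratio $\gamma = 1/c < 1$ is $1/(1-\gamma)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, eta = 4000, 2000, 1.0            # c = d/N = 2 > 1
c = d / N

# Gaussian noise satisfies the bi-rotational invariance used above.
A = rng.standard_normal((d, N)) * (eta / np.sqrt(d))   # E[A_ij^2] = eta^2 / d

sigma2 = np.linalg.eigvalsh(A.T @ A)                   # the N values sigma_tilde_i^2
empirical = np.mean(1.0 / sigma2)

# A^T A = eta^2 * (B^T B / d) with unit-variance B, whose spectrum follows the
# Marchenko-Pastur law with aspect ratio 1/c; its first inverse moment is 1/(1 - 1/c).
theory = (1.0 / eta**2) / (1.0 - 1.0 / c)              # = c / (eta^2 (c - 1))

print(empirical, theory)   # the two agree closely
```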

Intuition

Just for some intuition: similar to the above, we can break up all terms into a form that looks like $x^T(A_{trn})^k y$.

When $x \neq y$, due to the bi-rotational invariance, after multiplying by the singular vector matrices of $A_{trn}$ (similar to the $HH^T$ case), we get independent mean-0 vectors. Hence, this term concentrates to 0.

When $x = y$, we must have that $k$ is even, and this concentrates to $\mathbb{E}[\lambda^{k/2}]$, which we estimate by assuming that $\lambda$ is drawn from the appropriate Marchenko-Pastur distribution.

Difficulty when $r \neq 1$

When $r > 1$, everything above is a matrix, not a vector. However, the spirit of the argument is the same.

Comment

Let us continue to build some more intuition.

Why do things concentrate

Let $A$ be a $p \times q$ matrix with IID entries that have mean 0, variance 1, and bounded fourth moment. Consider $\frac{1}{q} AA^T$, let $\lambda_1, \ldots, \lambda_p$ be its eigenvalues, and let $\mu_{p,q}$ be the uniform distribution on these eigenvalues.

Note that here the randomness in $\mu_{p,q}$ is over which eigenvalue of $AA^T$ we are picking, AND NOT over $A$.

Then the Marchenko-Pastur theorem tells us that $\mu_{p,q}$ converges weakly to the Marchenko-Pastur distribution. This convergence holds for (almost) every $A$, so we have pathwise convergence instead of convergence in probability or in expectation.

Then, as we saw above, expressions of the form $\sum_{i=1}^N \frac{1}{N} \frac{1}{\sigma_i^2}$ can be written as $\mathbb{E}_{\mu_{d,N}}\left[\frac{1}{\sigma^2}\right]$, where the expectation is taken not over $A$ but over this $\mu_{d,N}$ measure. Hence, everything concentrates.

Difference between $c > 1$ and $c < 1$

Continuing with the notation from above: suppose $p < q$. Then $AA^T$ is $p \times p$, and it can be shown that $A$ is full rank.

However, in our work, in one case we see $AA^T/q$ and in the other $A^TA/q$ (recall the $P$, $Q$ terms at the very beginning of the previous comment).

However, the Marchenko-Pastur theorem doesn't apply directly to $A^TA/q$, since it doesn't have the correct normalization. To correct this, we need to multiply by $q/p$. For our results, $q = d$ and $p = N$. Hence, we are multiplying by $c$.

This is the second reason that the powers of $c$ are different.
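The following sketch (with arbitrary $p < q$) illustrates the normalization mismatch: the mean eigenvalue of $AA^T/q$ matches the Marchenko-Pastur mean of 1, while that of $A^TA/q$ is off by exactly the factor $q/p$, which is $c$ in our setting:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 500, 1500                      # p < q, so q/p plays the role of c
A = rng.standard_normal((p, q))       # iid entries, mean 0, variance 1

ev_small = np.linalg.eigvalsh(A @ A.T / q)   # p x p: MP-normalized
ev_big   = np.linalg.eigvalsh(A.T @ A / q)   # q x q: includes q - p zero eigenvalues

print(ev_small.mean())                 # ~ 1, matches the Marchenko-Pastur mean
print(ev_big.mean())                   # ~ p/q, off by the factor q/p
print(ev_big.mean() * q / p)           # ~ 1 after multiplying by q/p (i.e. by c)
```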

Understanding $L$

$L$ is the coordinates of $X_{tst}$ in the basis given by $U$. Suppose $X_{tst} = U_{tst} \Sigma_{tst} V_{tst}^T$. Then $L = U^T U_{tst} \Sigma_{tst} V_{tst}^T$, and, using the unitary invariance of the norm, we have

$$\left\|\beta_U^T( \Sigma_{trn}^2 c + \eta_{trn}^2 I)^{-1} L \right\|_F^2 = \left\|\beta_U^T( \Sigma_{trn}^2 c + \eta_{trn}^2 I)^{-1} U^T U_{tst} \Sigma_{tst} \right\|_F^2$$

Hence we see that the error depends on an alignment term between the training and test data, $U^T U_{tst}$, and on the singular values $\Sigma_{tst}$.
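A quick numerical check of this unitary-invariance step (all matrices below are random, illustrative stand-ins; $M$ plays the role of $\beta_U^T(\Sigma_{trn}^2 c + \eta_{trn}^2 I)^{-1}$):

```python
import numpy as np

rng = np.random.default_rng(4)
d, N_tst, r = 100, 400, 5

U, _ = np.linalg.qr(rng.standard_normal((d, r)))                        # training left singular vectors
X_tst = rng.standard_normal((d, r)) @ rng.standard_normal((r, N_tst))   # rank-r test data
U_tst, S_tst, Vt_tst = np.linalg.svd(X_tst, full_matrices=False)

L = U.T @ X_tst                   # coordinates of X_tst in the U basis
M = rng.standard_normal((r, r))   # stand-in for beta_U^T (Sigma_trn^2 c + eta_trn^2 I)^{-1}

lhs = np.linalg.norm(M @ L, "fro")
rhs = np.linalg.norm(M @ (U.T @ U_tst) @ np.diag(S_tst), "fro")
print(lhs, rhs)   # equal up to floating point, since V_tst^T has orthonormal rows
```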

Why no $V$ terms

Since we are solving linear least squares problems, everything depends on the Gram matrix of the data (this is also related to the kernel trick). Hence, we only get dependence on $U$ and not on $V$. In the classical regime, this would mean we depend on the eigenvectors and eigenvalues of the covariance matrix and not on anything else.

Different powers of $c$

This is due to two reasons

  1. The lack of symmetry in the expressions for $W_{opt}$ (see the previous reply).
  2. The need to multiply by $d/N$ to get convergence (see the section on concentration in this reply).

We hope that the above helps better understand the proof. If there are steps that are unclear or parts where the reviewer would like more detail, we are happy to clarify and provide more details.

Comment

We also updated the manuscript to

  1. State that the Marchenko-Pastur distribution is the natural limit.
  2. Fix the dangling references on page 6.
  3. Add more description to the figures in Appendix B.

We sincerely thank the reviewer for their feedback. We hope with this, we have addressed the reviewer's concerns.

Comment

Thank you for the quick updates. I know this may be difficult, but can you integrate at least some of the ideas you described above into the paper? I think it is absolutely crucial to have a discussion of proof ideas in the paper. I would be happy to raise my score if this is done.

Comment

Thank you for your constructive feedback and flexibility! We are happy to do that, and will do so shortly. It would be great if this convinces you to increase your score.

Comment

We thank the reviewer for their feedback and engagement.

We have just updated the manuscript to include a proof sketch and a paragraph that provides more insight into the terms in the expression.

We hope that this revision satisfies the reviewer.

Comment

Thank you for the responses and the revision.

While the paper looks to be in better shape now, I still think it is a borderline case. Given that the authors made some unprofessional remarks during the discussion (see links below), I am hesitant to give them the benefit of the doubt. So, I decided to maintain my original score.

  1. in a top-level comment: https://openreview.net/forum?id=WmB803HJkD&noteId=uZ266DDUeY

  2. in a reply to reviewer RbRx: https://openreview.net/forum?id=WmB803HJkD&noteId=KlEIinip8U

Comment

We thank the reviewer for their feedback.

Review (Rating: 3)

This paper considers a linear denoising problem in which the responses Y_{trn} are formed as Y_{trn} = \beta^T X_{trn}. These responses are observed noiselessly, but the data is observed with noise as Z_{trn} = X_{trn} + A_{trn}. The authors study the performance of a least-squares denoiser W_{opt} = Y_{trn} \cdot (Z_{trn})^{\dagger} so that new noisy observations Z_{tst} are denoised as W_{opt} Z_{tst}. Under a well conditioned, low rank assumption on X_{tst} and regularity assumptions (e.g. isotropic, existence of second moments, convergence of the empirical spectral distribution to the Marchenko-Pastur law) on the noise A_{trn}, the authors provide a characterization of the test error of this denoising procedure.

Strengths

The non-asymptotic character of the results is appreciated and the validation on real datasets is very nice.

Weaknesses

  • Assumption A2 seems quite similar to some classical random matrix theory assumptions, explicitly in part 4, in which convergence of the empirical spectral distribution to the Marchenko-Pastur law is assumed. If I have understood correctly, this assumption (which would follow if the coordinates of the noise matrix were assumed to be independent) allows one to bypass the independence assumption.

  • The main technical difficulty seems to come from analyzing the pseudoinverse (A + X)^{\dagger} as opposed to the inverse (if it existed) (A + X)^{-1}, as the authors point out in the commentary following Theorem 1. The analysis of this, while cumbersome, does not seem to me to involve much technical novelty. In particular, the setting with additive structure A + X, where A is random and enjoys favorable structure and X is low rank is quite a bit easier to analyze than the setting in which samples are i.i.d. \Sigma^{-½} x_i, and \Sigma may be ill-conditioned (e.g., Cheng and Montanari, Assumption 3).

  • The experimental verification on real datasets is nice, but it is not very surprising in the simple context considered here. For instance (and to give an example not cited in the paper), in a similar situation (quadratic loss) which is amenable to random matrix theory, Paquette et al. (https://arxiv.org/abs/2205.07069) study the dynamics of SGD on quadratic models and give exact trajectories when the data is from MNIST (for example).

  • Regarding the insights obtained by the characterizations: The insight over previous work regarding double descent and data augmentation seems limited. Also, the discussion around benign overfitting is fairly unclear to me. The setting here is fairly different from that of Bartlett et al. (https://arxiv.org/abs/1906.11300), in which benign overfitting is characterized in terms of effective ranks. The assumption on the spectrum and dimension (d/N = c + o(1), finite dimensional data) considered here seems to preclude this kind of decay.

  • (Minor) If my understanding of the previous points is correct, it would help to soften the language in the introduction. As written, the claims made in the introduction are quite a bit stronger than what seems to be concretely shown.

  • (Minor) There are also several typos and misspellings in the paper that should be fixed before publication.

Questions

  • Could the authors please provide an example of a distribution on the noise which satisfies the assumptions (Assumption A2) and does not have independent coordinates (e.g. is not a standard ensemble such as a Wigner matrix)? Why is this more realistic than the examples considered previously?

  • Could the authors please clarify what they feel the technical novelty is in analyzing the pseudoinverse (A + X)^{\dagger}? In particular, it is not clear to me from looking at the proof what the new ideas involved in analyzing this are: It seems that the setting is technically more complicated than if A + X were invertible, but the complications seem to be purely technical and overcome using standard techniques. To be clear, I do not think this is a bad thing, but clarifying this would help clarify the contribution overall.

  • In the conclusion, the authors write "Our work has opened the doors for a similar analysis of more sophisticated denoising problems, involving linear networks or non-linearity...". Could you please elaborate why you feel this is the case? The analysis seems to me quite tailored to the linear setting with quadratic loss considered here. In the appendix, the non-linear extension discussed also seems to require further independence assumptions, which seems counter to the viewpoint adopted here.

Comment

We thank the reviewer for their detailed and in-depth review. We also agree with the reviewer that the non-asymptotic validation with real data is a strength of our paper. If the reviewer feels that our series of responses adequately addresses their concerns, we hope that they will consider revising their score.

Here are some short responses to the weaknesses before we delve deeper:

Weakness 1: As a punchline, using the spectrum of the noise allows us to work with possibly dependent data. However, as discussed in the paper, most results already make assumptions that encompass our assumptions on the noise. The fact that we only use assumptions about the noise is a strength of the work, not a weakness. Being able to handle such a broad setting comes with many complications, discussed in detail below.

Weakness 2: It is already hard to analyze the inverse of a sum of random matrices, $(X+A)^{-1}$. This is, in fact, a fundamental issue with analyzing input-noise settings. Our idea is to use matrix inverse perturbation results, which already comes with several complications. The problem is further complicated by having to deal with pseudo-inverses rather than inverses. We provide more details below.

We will now explain why deriving our test error results is not easy. The difficulty in the proof comes from the placement of the noise and when the expectation is taken. Both of these are crucial.


Placement of the noise

In our setup, we have noise on the input data instead of the output. However, for comparison, let us briefly consider the situation where we have noise on the output.

The case of output noise

When the noise is on the output, we have some training data $X$ and $y = \beta^T X + \epsilon$, where $\epsilon$ is noise. In this situation, if we solve the least squares problem, we have the estimate:

$$\hat{\beta}^T = yX^+ = \beta^T X X^+ + \epsilon X^+$$

Now we have two terms. The first depends only on $X$ and not on the noise; hence, it is straightforward to analyze. The second term is linear in the noise.

Continuing from here, suppose we care about $\|\beta - \hat{\beta}\|^2 = \|\beta\|^2 + \|\hat{\beta}\|^2 - 2\beta^T\hat{\beta}$ (as is the case in prior work). We see that $\|\beta\|$ is a constant.

Next, $\|\hat{\beta}\|^2 = \|\beta^T X X^+\|^2 + \|\epsilon X^+\|^2 + 0$, where the cross term is zero (in expectation) due to the independence of the noise. Here $XX^+$ is roughly constant, so it is easy to analyze, and if $\epsilon$ is Gaussian, then $\epsilon X^+$ is Gaussian and the norm is easy to calculate.

For the cross term, $\beta^T\hat{\beta} = \beta^T X X^+\beta + \beta^T(X^+)^T\epsilon^T$. The second term is zero (in expectation) due to the independence of $\epsilon$, and the first is a quadratic form depending on $XX^+$.
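A minimal numerical sketch of this output-noise decoupling (random low-rank $X$ and Gaussian $\epsilon$; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, N, r = 30, 100, 5

X = rng.standard_normal((d, r)) @ rng.standard_normal((r, N))   # rank-r training data
beta = rng.standard_normal(d)
eps = 0.1 * rng.standard_normal(N)
y = beta @ X + eps                                               # output noise only

Xp = np.linalg.pinv(X)
beta_hat = y @ Xp                                                # least-squares / min-norm estimate

# The estimate decouples into a signal part and a part linear in the noise.
print(np.allclose(beta_hat, beta @ X @ Xp + eps @ Xp))           # True
```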

Comparison to our setup (noisy inputs)

Let us now compare this to our setup. Since we have noise on the input, our estimate is

$$\hat{\beta}^T = \beta^T X (X+A)^+$$

To start off, we already cannot decouple the noise and the data. Hence, this is one challenge that we had to overcome. In the output noise setup, since the two decouple, we only need to understand the spectrum of $X$. In our case, we need to understand the spectrum of $X+A$.

In the classical regime ($d$ fixed, $N \to \infty$), this would be easier since $(X+A)(X+A)^T$ would converge (with appropriate normalization) to the covariance matrix $XX^T$. However, there are two issues in our problem that complicate this:

  1. First, we are not in the classical regime. In the proportional regime, as noted by Cheng and Montanari and many other prior works, $(X+A)(X+A)^T$ is not a consistent estimate of the covariance matrix.

  2. Second, we actually do not have terms of the form $(X+A)(X+A)^T$ (as was the case in the output noise situation). We have terms of the form $X(X+A)^T$. While this is subtle, it makes a significant difference.

Points 1 and 2 combined imply that we must understand the eigenstructure of $X+A$. Specifically, point 2 asks us to study $X(X+A)^T$, which implies that we need to understand the alignment between the eigenvectors of $X$ and the eigenvectors of $X+A$. Studying the eigenvectors of a perturbed matrix is very difficult. Further, we need to understand not just the leading eigenspaces, but all eigenvectors.

We believe that the fact we got around this hurdle with accessible techniques is, in fact, a strength of our work.

When the Expectation is Taken

Another fact to highlight is that we compute the expectation over the training noise after computing the test error for $W_{opt}$. Recall that $W_{opt}$ is trained to minimize the MSE over a single noise instance. If we instead computed the test error for the linear function minimizing the expected MSE (corresponding to $W^*$), the result would be much simpler. This latter case is exactly the case of learning a regularized auto-encoder.

Comment

Concentration results

We get around the first issue by using a result on the pseudoinverse of the perturbation of matrices. We note that our perspective on this is new. Most results consider the full-rank matrix as the base matrix and the low-rank matrix as the perturbation. However, in our setting, we have the low-rank matrix as the base matrix and the full-rank matrix as the perturbation. In hindsight, this idea might seem natural, but that is not the case a priori.

Additionally, we get very complex expressions that are not linear (which would be straightforward) or quadratic (where we could have used concentration of quadratic forms, such as Hanson-Wright). Instead, we have products of dependent quadratic (or even higher-order, e.g., quartic) forms.

Hence, we have the technical novelty where we need to prove the concentration of the various different quantities. To further complicate things, we need to prove the concentration of the inverses of many matrices. This is very non-trivial. We do this by computing the variance of each of the terms and then carefully tracking the error between the product of expectations and the expectations of products.

Hence, the technical challenges we had to overcome are

  1. Understanding the spectrum of $X+A$ and $X$ individually, without being able to use the covariance matrix or a Wigner matrix for the data Gram matrix $XX^T$.
  2. Proving concentration of many different terms in the non-classical proportional regime, in order to estimate the expectation of products of dependent random variables.
Comment

Experiments not surprising

This was about our experimental results. We thank the reviewer for pointing us to the great work done by Paquette et al. We have added this to our references. However, we would like to point out one key difference from their work that we believe makes our results surprising.

This is the fact that Paquette et al. train and test on the same dataset. However, we can train and test on different datasets as evidenced in Figure 1a. We apologize if this was not adequately highlighted in the paper. For all three figures in Figure 1a, the model is trained on CIFAR. Then, the left figure tests on CIFAR, the middle on STL10 (a completely different dataset), and the right on SVHN (another completely different dataset).

Insights from results

We would like to point out that we are the first to analyze double descent under distribution shift. The insight on data augmentation is highly practical and tells us that in real life, it is important to avoid situations where $c \approx 1$, and we provide a way to do so.

We agree that the insight on overfitting was somewhat confusing. We have now split it into a succinct insight in the main paper and point to an elaborate insight in the appendix.

Please also see the response to the other reviewers.

Softening Claims

We thank the reviewer for pointing these out, and we are happy to soften our claims in the introduction. We have also gone through the paper to reduce typos.


Questions

Question 1

The reviewer is correct that this is not evident a priori. The primary example of such noise is obtained by letting $B$ be a matrix with IID entries (under the same assumptions as for Wigner matrices). Then, let $Q_1, Q_2$ be two uniformly random rotation matrices. Matrices of the form $Q_1 B Q_2$ satisfy our noise assumptions.
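A sketch of how such noise can be generated (the Rademacher entries for $B$ are an arbitrary choice satisfying the i.i.d. moment assumptions, and the Haar-orthogonal sampler is the standard QR construction):

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Sample an orthogonal matrix from the Haar measure via QR with a sign fix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(6)
d, N, eta = 200, 100, 1.0

# B: iid entries with mean 0 and variance eta^2 / d (Rademacher here, an arbitrary choice).
B = rng.choice([-1.0, 1.0], size=(d, N)) * (eta / np.sqrt(d))

Q1, Q2 = haar_orthogonal(d, rng), haar_orthogonal(N, rng)
A = Q1 @ B @ Q2        # bi-rotationally invariant noise; spectrum of A equals that of B

print(np.allclose(np.linalg.svd(A, compute_uv=False),
                  np.linalg.svd(B, compute_uv=False)))   # True
```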

Question 2

Please see our responses on the technical difficulty of proving our results (parts 1 and 2 above). We do not think the difficulty comes from having a pseudoinverse instead of an inverse. Instead, it comes from the fact that we have noise on the input, and $y$ depends only on the signal.

If the reviewer has a reference for the situation where $\|\beta^T X - W(X+A)\|^2$ is analyzed, even for the case where $(X+A)$ is invertible, we would appreciate a pointer to it. As mentioned in the paper, we know of a very limited number of prior works (3, to be exact) that analyze the input-noise setting.

We agree with the reviewer that the core linear algebraic and probabilistic tools we have used are well-established among mathematicians. However, in the spirit of machine learning research, we hope that the reviewer agrees that doing difficult computations using elementary techniques to prove novel results is a worthwhile contribution.

Question 3

We have edited the sentence in the revised manuscript to soften the claim.

Comment

First, thank you to the authors for the response. I am happy to leave my score as is: In my opinion, this will be—upon further editing—an interesting contribution, but I think it falls short (again in my opinion) of the bar for ICLR. My main suggestion for improvement would be for the authors to more critically (and fairly) situate their work in the context of previous literature. As an aside, I would like to echo Reviewer jx5r regarding the cherry picking of comments in the summary.

I would like to clarify some points of my review and provide some concrete suggestions for improvement. Also, to clarify, I agree with the authors that simple, accessible proofs are desirable and that technical difficulty for the sake of technical difficulty is not a useful exercise. My apologies if my initial review signaled otherwise.

  • You’ve provided a phenomenological study in which you propose a simple model which reflects some aspects of more realistic situations and analyze the simpler set-up to provide insight. I have no issue with this broad set-up. However, there are two axes on which to evaluate the study. First regards the obtained qualitative insights and how well they reflect reality, and second regards the technical insights derived from the simplified model. I am not sure that there is much novelty in either direction.
    For instance, regarding the qualitative insights, the authors mention avoiding c \approx 1. Why, aside from the difference in simple models considered, is that not an insight already evident from, e.g., [1]?

  • Thank you to the authors for their response regarding the difficulty in the analysis. I read through the proof of Theorem 1 again in the appendix (as well as the response to reviewer jx5r), but am unable to articulate precisely what the author’s key technical insights here were. Could you please point me to the lemmas which attack this issue you have mentioned, and in particular where the difficulty arises as compared to Sonthalia and Nadakuditi (2023)? Many of the estimates used in the proof of Theorem 1 seem to come from that paper.

  • Again, regarding technical insights and novelty, I want to emphasize that part of the confusion here since the authors have not clearly situated themselves with respect to prior work. In general, the authors frequently cite previous work and situate themselves positively with respect to the previous work, without mention of any differences (and perhaps strengths that the previous work may have and weaknesses of their own).

  • There seem to be two directly related works cited by the authors: Pretorius et. al (2018), and Sonthalia and Nadakuditi (2023). I do agree that the results obtained here are improvements upon these.

  • However, there are misleading and superficial comparisons to other previous work. For instance, the result of Cheng and Montanari (2022) considers kernel ridge regression (and not in the input noise set-up). There is no additive “Marchenko-Pastur–like” noise in that work; rather the random matrices involved are non-standard and require quite a bit of care to deal with. In particular, my impression of the analysis done in this paper is that it is precisely the nice assumptions on the additive noise which enable the guarantees. This is not emphasized at all in the manuscript.

  • Related to the previous point: It is likely that I have missed something crucial here, and it would help for the authors to clarify this. In your paragraph on “Comparison on assumptions in prior work”, you point out that most authors assume something along the lines of $x \sim \mathcal{N}(0, \Sigma)$, and stress that you make less stringent assumptions on the independence conditions. Perhaps I have mis-interpreted, but it seems to me that it might be a more fair comparison to be with respect to the centered data assumption? In which case, if we take the signal-less setting for both, the assumption from previous work is more general? Again, perhaps I’ve mis-interpreted this point.

  • To give one more example of potentially unfair comparisons, the authors provide a comparison to Cui and Zdeborova (2023), and simply say that the assumptions here are strictly more general than in Cui and Zdeborova (2023). I am not sure this is true. While your assumptions on the data are more general, they consider a seemingly more realistic, and more complicated, denoising model. It seems to me that it is a bit misleading to compare your results to previous work in such a way that only favors your results and disregard other aspects of the work. I bring up these points to give just a few examples, but the paper has many such, which I would suggest the authors to improve.

  • One other big missing piece is comparisons and citations to results already in the random matrix theory literature.

[1] Hastie, et. al. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation”

Comment

We kindly ask if the reviewer, due to the subjective nature of their concern and the discussion with us, would be willing to increase their score to 5. (Note we have limited the audience of this comment).

Apologies if I am interrupting your conversation with reviewer RbRx. I think it is extremely inappropriate to directly ask the reviewer for a specific score. And trying to hide this request with a limited-view permission goes against the spirit of peer review.

Comment

We apologize; we did not realize this was inappropriate. We assumed it was similar to asking the reviewer to increase their score. The comment is deleted.

Comment

We apologize; we did not realize this was inappropriate. We assumed it was similar to asking the reviewer to increase their score. The comment is deleted.

Mistakes happen, and I may be a little judgemental. But can you please not make this conversation even less transparent? Regardless of how people feel about directly asking the reviewer for a specific score, it is definitely wrong to make such request private, and now you are only making things worse by deleting it.

Comment

We thank the reviewer for their comment and engagement and apologize for the summary.

My main suggestion for improvement would be for the authors to more critically (and fairly) situate their work in the context of previous literature.

We thank the reviewer for their comment. Our main difficulty is that in the denoising setting, there seem to be very few works (Sonthalia and Nadakuditi 2023, Pretorius et al. 2018, and Cui and Zdeborova 2023). The reviewer agrees that we generalize the first two. We have modified the paper to highlight the strengths of all three papers, primarily Cui and Zdeborova 2023.

As an aside, I would like to echo Reviewer jx5r regarding the cherry picking of comments in the summary.

We apologize. We had hoped to identify the main concerns for each reviewer. We have updated the summary.

I would like to clarify some points of my review and provide some concrete suggestions for improvement. Also, to clarify, I agree with the authors that simple, accessible proofs are desirable and that technical difficulty for the sake of technical difficulty is not a useful exercise. My apologies if my initial review signaled otherwise.

We are thankful for the concrete suggestions. We hope that we have addressed most of these in the updated manuscript.

We are also thankful that the reviewer agrees that simple proofs are desirable. Hence, we hope that the simplicity is not held against us.

You’ve provided a phenomenological study in which you propose a simple model which reflects some aspects of more realistic situations and analyze the simpler set-up to provide insight ... is that not an insight already evident from, e.g., [1]?

[1] does tell us that avoiding $c = 1$ is necessary. However, that is not the only insight from our paper. Specifically, our insights are novel when considering distribution shifts for denoising, which is where we believe many of our insights appear. As we mentioned, there are very few works that study denoising. As far as we can tell, only Sonthalia and Nadakuditi 2023 consider distribution shifts, but in a limited manner.

With the importance of denoising, we believe that having a simple model for which we can theoretically justify (known) phenomena is essential. Specifically, we show

  1. Double descent occurs even when the test data is from a different distribution.
  2. We show that data augmentation hurts in the $c > 1$ regime. While this can be seen as understood from prior work such as [1], data augmentation introduces inherent dependence into the data, and we believe having a simple model where such results are theoretically backed is important.
  3. As far as we can tell, the results on tempered and benign overfitting for denoising are also new. These provide the interesting qualitative insight that when we have a small amount of training data, we can improve the denoiser (even under distribution shift) by increasing overparameterization.

Thank you to the authors for their response regarding the difficulty in the analysis ... Many of the estimates used in the proof of Theorem 1 seem to come from that paper.

We agree with the reviewer that we do not have new, deep, technical mathematical results. However, the analysis here is not straightforward. We highlight some of the difficulties.

  1. Lemma 4, which considers the concentration of the inverse of matrices. This is not needed in Sonthalia and Nadakuditi 2023 as they don't have to invert matrices.
  2. Lemma 5, for bounding the variance of each entry, we consider the square of the matrix. We show that we can estimate this. Hence, by symmetry, we can bound the variance of the sums of the terms and, then, the individual terms. This argument is new and is not present in Sonthalia and Nadakuditi 2023.
  3. Lemma 7, we had to bound the variance of the product of random variables. Sonthalia and Nadakuditi 2023 did not need to do this.

Again, regarding technical insights and novelty, ... without mention of any differences (and perhaps strengths that the previous work may have and weaknesses of their own).

We tried to highlight the differences between prior work and ours — specifically, the difference in data setting. We have added a paragraph in the introduction concerning some of our method's limitations and highlighted the strength of some prior work.

There seem to be two directly related works cited by the authors: Pretorius et al. (2018) and Sonthalia and Nadakuditi (2023). I do agree that the results obtained here are improvements upon these.

We agree with the reviewer.

Comment

Thanks to the authors for the quick response and for clarifying the points of difference between Sonthalia and Nadakuditi (2023). I still feel that the contribution is rather incremental and stand by my suggestion in the initial review that it may be a more appropriate submission to, for instance, TMLR. Of course, given the subjective nature of this, if the other reviewers and the area chairs have a conflicting opinion, I am happy with whatever they choose.

Comment

We thank the reviewer for their swift engagement and feedback in helping us improve the paper.

We thank the reviewer for the suggestion of TMLR.

We kindly ask if the reviewer, due to the subjective nature of their concern and the discussion with us, would be willing to increase their score to 5. (Note we have limited the audience of this comment).

We sincerely thank the reviewer for the time and energy put into helping improve our paper.

Comment

Cheng and Montanari (2022)

We first want to make sure that we are looking at the same paper. Is the reviewer referring to https://arxiv.org/pdf/2210.08571.pdf?

Further, we are uncertain where we say Cheng and Montanari have input noise. Could the reviewer please point that out to us? We cite this paper in 4 places.

  1. To motivate the problem of poorly conditioned covariances, which in turn helps motivate the study of low-rank data.
  2. As a reference to prior work that mentions the proportional regime.
  3. When we discuss their data assumption, namely that they have assumptions similar to $x_i = \Sigma^{1/2} z_i$. Quoting directly from their paper:

``We allow the feature vector to be high-dimensional, or even infinite-dimensional, in which case it belongs to a separable Hilbert space, and assume either $z_i = \Sigma^{-1/2} x_i$ to have i.i.d. entries, or to satisfy a certain convex concentration property.''

Further, their paper does have additive Marchenko-Pastur noise and, unless we are mistaken, is not about kernel regression. Again, quoting directly from their paper:

``we revisit ridge regression ($\ell_2$-penalized least squares) on i.i.d. data $(x_i, y_i)$, $i \le n$, where $x_i$ is a feature vector and $y_i = \langle \beta, x_i \rangle + \xi_i \in \mathbb{R}$ is a response.''

Page 4 details the assumptions on $\xi_i$, which are mean 0, variance $\tau^2$, and i.i.d. Theorems 1, 2, and 3 on pages 9, 10, and 11 seem to be the main results. However, none of them are for kernel regression. Theorem 1 is for the ridge regression situation, and Theorems 2 and 3 are for the ridgeless case in the over- and under-parameterized regimes.

Due to these differences, we believe we and the reviewer are potentially referring to different papers.

Related to the previous point: It is likely that I have missed something crucial here ... Again, perhaps I’ve mis-interpreted this point.

Apologies, but we are not certain we follow.

The point here was to compare our data assumptions with prior work. Many prior works assume $x_i \sim \mathcal{N}(0, \Sigma)$ i.i.d., whereas our paper studies $x_i \in \mathcal{V}$ for some low-dimensional subspace $\mathcal{V}$. We do not need the data to be centered in our paper. Hence, we are unsure about the reviewer's point.

To give one more example of potentially unfair comparisons, the authors provide a comparison to Cui and Zdeborova (2023),

We have updated the manuscript to mention that Cui and Zdeborova 2023 have a more general model.

One other big missing piece is comparisons and citations to results already in the random matrix theory literature.

We do cite the original works showing convergence of the spectrum. However, since, as the reviewer points out and we agree, we are not proving new random matrix theory results, we do not include an in-depth literature review. If the reviewer feels there are papers we should mention, please let us know.


We thank the reviewer again for their comments. We hope that these discussions clear up the doubts regarding our contribution and its relation to prior work, and that the updated manuscript reflects this.

Comment

Thanks very much to the authors for pointing this out: you are entirely correct that the Cheng and Montanari (2022) paper does not study kernel ridge regression; my apologies for this (this was an unfortunate typo that I did not catch after erasing a sentence). I have updated the comment to reflect this. Thanks once more, and my apologies again for this typo. I also did not mean to imply that you mentioned their work as studying input noise.

I will try to make clearer what I was trying to say while clarifying the centering issue I brought up:

The point here was to compare our data assumptions with prior work. Many prior works assume $x_i \sim \mathcal{N}(0, \Sigma)$ i.i.d., whereas our paper studies $x_i \in \mathcal{V}$ for some low-dimensional subspace $\mathcal{V}$. We do not need the data to be centered in our paper. Hence, we are unsure about the reviewer's point.

To clarify, what I am wondering is whether it is more accurate to compare your results to a setting in which data is not centered, but the covariance of the noise is well-behaved. I.e., a more closely related model to yours seems to me to be the situation in which the observed data is $X_{trn} + G$, where the entries of $G$ are standard Gaussian. If I am correct about this, it is quite different from the $\mathcal{N}(0, \Sigma)$ setting you compare to, and my reference to Cheng and Montanari (2022) was to contrast these. I hope this is clearer, and apologies for the confusion.

Comment

Our setup is exactly $X_{trn} + G$. Our point is precisely the difference between this and $\mathcal{N}(0, \Sigma)$. Our paper is in a different setting.

If the reviewer has references to papers that study generalization for such data, we would be extremely grateful to hear about them, as we agree these would be the most relevant papers to compare against. Unfortunately, besides Cui and Zdeborova 2023, we do not know of any that study non-centered data plus Gaussian noise.
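
For concreteness, the snippet below is our own illustration (not taken from either paper) of the contrast between the two data models being discussed: low-rank, not-necessarily-centered data corrupted by isotropic Gaussian noise, $X_{trn} + G$, versus i.i.d. samples $x_i \sim \mathcal{N}(0, \Sigma)$. The dimensions and the particular choice of $\Sigma$ are arbitrary.

```python
# Illustrative contrast between the two data models (arbitrary sizes and covariance).
import numpy as np

rng = np.random.default_rng(0)
d, N, r = 100, 300, 5

# Model A (our setting): low-rank, non-centered signal plus isotropic Gaussian noise.
U = np.linalg.qr(rng.standard_normal((d, r)))[0]            # basis of the r-dimensional subspace V
X_trn = U @ (2.0 + rng.standard_normal((r, N)))             # columns lie in V, mean is not zero
G = rng.standard_normal((d, N)) / np.sqrt(d)                # isotropic noise matrix
Y = X_trn + G                                               # observed noisy data

# Model B: i.i.d. Gaussian samples with a full-rank covariance (the N(0, Sigma) setting).
Sigma_sqrt = np.diag(1.0 / np.sqrt(np.arange(1, d + 1)))    # e.g. polynomially decaying spectrum
Z = Sigma_sqrt @ rng.standard_normal((d, N))                # z_i = Sigma^{1/2} g_i

# The sample spectra look qualitatively different: Model A shows a few spiked singular
# values above a Marchenko-Pastur-type noise bulk; Model B has a smoothly decaying,
# full-rank spectrum.
print("Model A, top 8 singular values:", np.round(np.linalg.svd(Y / np.sqrt(N), compute_uv=False)[:8], 2))
print("Model B, top 8 singular values:", np.round(np.linalg.svd(Z / np.sqrt(N), compute_uv=False)[:8], 2))
```

This is one way to see that the two settings are genuinely different, which is the point we are trying to make above.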

Comment

Thanks for clarifying. I do stand by my point above then. It would be far clearer to avoid comparisons to the $\mathcal{N}(0, \Sigma)$ setting entirely, or to explicitly mention that you weaken assumptions on the noise and consider distributions with different means. As I mentioned before, almost all of the authors' references to previous work painted your results in a favorable light. I think it would be far more beneficial to compare your results critically to past work. I do appreciate what you have pointed out on the lack of references in your specific setting. I suppose it would be more beneficial to spend more time motivating your specific setting (whose motivation, as I mentioned in a previous comment, I do not find quite satisfactory in the linear case), rather than drawing potentially misleading comparisons to prior work in quite different settings.

I very much appreciate the effort the authors have put into their responses, but I will elect to keep my score as is.

Comment

We thank the reviewer for the discussion. We believe that we have updated the paper to be more critical of our contributions. We do, however, believe the comparison to $\mathcal{N}(0, \Sigma)$ is fair, as both settings study least-squares regression problems. Nevertheless, we appreciate the reviewer's position.

Comment

We restored the comment. The limited visibility setting allowed the PCs, SACs, ACs, and all reviewers to see the post. We appreciate the reviewer's feedback.

Comment

We thank the reviewers for their insightful comments and for helping improve the paper. We have posted an updated manuscript with the additions to the main text highlighted in blue.

We would like to summarize the positives.

In this new setting, the reviewers found the following:

  1. Reviewer RbRx - The non-asymptotic character of the results is appreciated, and the validation on real datasets is very nice.
  2. Reviewer jx5r - The problem setting in this paper is well-explained and well-motivated. Also, this paper offers a wide range of theoretical and empirical results. These results lead to some interesting insights in a variety of topics of current interests.
  3. Reviewer sZRb - This paper departs from the conventional assumptions that have been prevalent in previous high-dimensional regression studies. Notably, a particularly intriguing contribution lies in the elimination of the iid or "Gaussian-like" assumptions previously relied upon. This requires the application of novel and technical analyses, paving the way for promising avenues for future research. Lastly, the strong agreement between the theoretical formulas and the experimental results on standard ML datasets suggests that the empirical observations hold a certain level of "universality" for approximately low-rank datasets.

In terms of the negatives:

  1. Reviewer RbRx mentions that the proof techniques are standard. The reviewer is also uncertain about the insights we obtain, and is concerned about how we place our work in the context of prior work and how we describe that work.

Response: The proof is quite involved; please see our reply to the reviewer for a more detailed discussion of the difficulty. We have updated the manuscript to soften many of the claims regarding prior work. We have also added limitations of our work and highlighted strengths of prior work.

  2. Reviewer jx5r had complaints about the writing, specifically about the discussion of the results and the appendix. The reviewer would also like a proof sketch to help build intuition, and pointed out an issue with the overfitting result.

Response: As the reviewer points out, this is partly due to the vast number of results in the paper. We have taken the reviewer's feedback into consideration and posted a revised version of the paper that makes the requested changes. After hearing from the reviewer again, we have made further changes. We also corrected the overfitting result. Finally, in the response to the reviewer, we provide more details pertaining to the proof sketch.

  3. Reviewer sZRb’s primary complaint was the scaling of the noise compared to the signal. The reviewer also wanted additional discussion of some of the results, and wondered whether our results generalize to non-linear networks.

Response: The concern about the scaling of the noise stems from a misunderstanding. We have added more discussion about our results. The study of non-linear networks is important and interesting, but would require a substantial separate study.

Hence, we hope that with this, we have removed any doubts about our work. We kindly ask the reviewers to let us know if there are still concerns.

Comment

I would first like to thank the authors for their detailed responses. However, I feel cherry-picking the content of my review is not conducive to a productive discussion. Your summary of my review focused on my opinion of the first two introductory sections of this paper and barely mentioned my complaints about the presentation of your main results in Section 3. Therefore, I kindly request the authors to update their summary to better reflect the reviewers' criticisms and expand on the specific steps taken to improve the paper.

Comment

We apologize to the reviewer. This was not our intention. We thank the reviewer for pointing it out.

We have updated the above summary. Please let us know if you still have concerns.

Comment

Thank you, this looks much better.