PaperHub
Rating: 4.3 / 10 · Rejected · 3 reviewers
Individual ratings: 5, 3, 5 (min 3, max 5, std 0.9) · Average confidence: 3.3
ICLR 2024

The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We analyze natural and synthetic datasets using RMT tools, finding that some universal properties are related to the strength of correlations in the feature distribution, and connecting these with the number of samples required to reach ergodicity.

Abstract

Keywords

Random Matrix Theory, Data Structure, Universality, Random Feature Models, Empirical Data Estimation, Neural Scaling Laws

Reviews and Discussion

Review (Rating: 5)

This work studies the spectral statistics of the bulk of sample covariance matrices from different image datasets (MNIST, fMNIST, CIFAR10, tiny-IMAGENET). Its main result is to show that different spectral statistics of the bulk, such as the spectral density tails, level spacing, r-statistics, spectral form factor, and Shannon entropy, closely follow those of correlated Wishart matrices with matching population covariance.
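For readers less familiar with these local spectral statistics, the adjacent-gap ratio (r-statistic) is perhaps the simplest to compute, since it requires no spectral unfolding. Below is a minimal, hypothetical sketch (not the authors' code); the matrix sizes are made up, and the GOE and Poisson reference values (roughly 0.53 and 0.39) are the standard ones from the RMT literature.

```python
import numpy as np

def mean_gap_ratio(eigvals):
    """Mean adjacent-gap ratio <r> = <min(s_n, s_{n+1}) / max(s_n, s_{n+1})>,
    where s_n are consecutive level spacings of the sorted spectrum.
    Reference values: ~0.53 for GOE statistics, ~0.39 for Poisson."""
    lam = np.sort(eigvals)
    s = np.diff(lam)
    return np.mean(np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:]))

# Illustration on a plain (white) Wishart / Gram matrix.
rng = np.random.default_rng(0)
d, M = 300, 1200
X = rng.standard_normal((d, M))              # d features, M samples
G = X @ X.T / M                              # empirical Gram/covariance matrix
print(mean_gap_ratio(np.linalg.eigvalsh(G)))  # close to the GOE value ~0.53
```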

Strengths

The paper is well written and relatively smooth to read. Concepts from RMT are introduced from scratch and intuition is given, making it accessible for a wider audience. The result suggesting a universality of the Gaussian ensemble for the bulk of the sample covariance matrix is interesting, and motivates the classical assumption of Gaussian data used in many theoretical works in high-dimensional statistics.

Weaknesses

The paper has a few limitations which should be better discussed by the authors:

  1. First, it only considers natural image data. The statistical properties of images have been intensively studied in the context of signal processing, and the observation that natural images tend to display a power-law behaviour is much older than neural scaling laws, see e.g. [Ruderman 1997]. The assumption of power-law features is also classical in the study of kernel regression, where it is known as "capacity conditions" [Caponnetto 2007].

  2. Second, the paper focuses on the bulk of the sample covariance spectrum. While one does not expect universality beyond the bulk, the outliers play a central role in learning. For instance, consider $k$-Gaussian mixture data $X = y\mu^{\top}+Z$ with means $\mu\in\mathbb{R}^{d\times k}$ and labels $y\in\mathbb{R}^{N\times k}$. In this case, the information about the $k$ modes lies in the outliers, and the performance of a classifier trained on this dataset will crucially depend on the correlation between the labels and the means (see the numerical sketch below). Moreover, the outliers are crucial in feature learning: it has recently been shown that for two-layer neural networks a single gradient step away from random initialisation takes precisely the form of a rank-one spike that correlates with the labels [Ba et al. 2023]. The neural collapse phenomenon provides a similar observation for deep neural networks [Papyan et al. 2020].
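To make the point about outliers concrete, here is a small, hypothetical numerical illustration of the mixture model above (all dimensions and signal strengths are made up): the $k$ class means create roughly $k$ eigenvalues separated from the Marchenko-Pastur-like bulk, and it is these spikes, not the bulk, that carry the class information.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 2000, 400, 3                                # samples, dimension, classes
mu = 3.0 * rng.standard_normal((d, k)) / np.sqrt(d)   # class means, R^{d x k}
y = np.eye(k)[rng.integers(0, k, size=N)]             # one-hot labels, R^{N x k}
Z = rng.standard_normal((N, d))                       # noise
X = y @ mu.T + Z                                      # X = y mu^T + Z, R^{N x d}

lam = np.sort(np.linalg.eigvalsh(X.T @ X / N))[::-1]
print("top 5 sample-covariance eigenvalues:", np.round(lam[:5], 2))
print("Marchenko-Pastur bulk edge:         ", round((1 + np.sqrt(d / N)) ** 2, 2))
# The ~k leading eigenvalues sit above the bulk edge; the bulk itself is
# essentially blind to the class structure.
```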

Questions

  • [Q1]: In the abstract, the authors say that:

"These findings show that with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization which rely on the data Gram matrix."

However, how these results open the door to the study of "neural networks dynamics and generalization" is actually never discussed. In light of my second comment in Weaknesses on the relationship between feature learning and generalisation with the outliers, why do the authors believe the observation of spectral universality of the bulk is relevant to the study of generalisation?

  • [Q2]: I miss a discussion of the relationship between these results and error universality, which has been intensively investigated starting from [Mei & Montanari 2022; Gerace et al. 2020; Goldt et al. 2022; Hu, Lu 2023]. Here, one looks directly at the universality of the training and generalisation error instead of the features, taking into account the labels and the task. Although more restrictive, it has been observed to hold for data close to the one studied here [Loureiro et al. 2021]. For a simple regression task, a parallel with the type of universality discussed here can be drawn, since the computation of the error boils down to an RMT problem [Wei et al. 2022]. In particular, it has been noted that in some cases the structure of the bulk fully characterises the error, even for multi-modal distributions, see [Gerace et al. 2023; Pesce et al. 2023].

Minor comments:

  • Although that's ultimately up to the authors, the notation $X_{ia}\in\mathbb{R}^{d\times N}$ is unconventional in machine learning.

References

  • [Ruderman 1997] Daniel L. Ruderman. Origins of scaling in natural images. Vision Research, Volume 37, Issue 23, 1997, Pages 3385-3398, ISSN 0042-6989, https://doi.org/10.1016/S0042-6989(97)00008-4.

  • [Caponnetto 2007] Caponnetto, A., De Vito, E. Optimal Rates for the Regularized Least-Squares Algorithm. Found Comput Math 7, 331–368 (2007). https://doi.org/10.1007/s10208-006-0196-8

  • [Ba et al. 2023] Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, Greg Yang. High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation. Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

  • [Papyan et al. 2020] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. PNAS, September 21, 2020, 117 (40) 24652-24663, https://doi.org/10.1073/pnas.2015509117.

  • [Mei & Montanari 2022] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.

  • [Gerace et al. 2020] Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mezard, Lenka Zdeborova. Generalisation error in learning with random features and the hidden manifold model. Proceedings of the 37th International Conference on Machine Learning, PMLR 119:3452-3462, 2020.

  • [Goldt et al. 2022] Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mezard, Lenka Zdeborova. The Gaussian equivalence of generative models for learning with shallow neural networks. Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:426-471, 2022.

  • [Hu, Lu 2023] H. Hu and Y. M. Lu, Universality Laws for High-Dimensional Learning With Random Features, in IEEE Transactions on Information Theory, vol. 69, no. 3, pp. 1932-1964, March 2023, doi: 10.1109/TIT.2022.3217698.

  • [Loureiro et al. 2021] Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021).

  • [Wei et al. 2022] Alexander Wei, Wei Hu, Jacob Steinhardt. More Than a Toy: Random Matrix Models Predict How Real-World Neural Representations Generalize. Proceedings of the 39th International Conference on Machine Learning, PMLR 162:23549-23588, 2022.

  • [Gerace et al. 2023] Federica Gerace, Florent Krzakala, Bruno Loureiro, Ludovic Stephan, Lenka Zdeborová. Gaussian Universality of Perceptrons with Random Labels. arXiv:2205.13303 [stat.ML]

  • [Pesce et al. 2023] Luca Pesce, Florent Krzakala, Bruno Loureiro, Ludovic Stephan. Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation. Proceedings of the 40th International Conference on Machine Learning, PMLR 202, 2023.

Comment

Thank you for your thoughtful review and insightful questions regarding our work. We are pleased that you found the paper to be clearly presented and accessible to a wider audience, as this was one of our goals. We also agree that the onset of universality properties in real-world datasets should motivate some ubiquitous modeling assumptions.

Moving forward, we aim to address the limitations and questions you raised to strengthen the contributions of this research. We have incorporated several changes into our manuscript, in the hope that clarifying the issues raised here will lead you to consider acceptance.

  1. Earlier works and context: Thank you for bringing these important works to our attention. We clearly should have situated our observations within this broader literature, rather than presenting them without the proper foundations. To address this weakness, we plan to refer directly to these works in the main text. Our work differs in focusing specifically on relating these heavy-tailed behaviors to random matrix theory. Moreover, we would like to comment that the observation of power-law data is valid for natural images, but does not seem to hold for dynamical system simulations, which often display multi-scale behavior.

  2. Effects beyond the bulk on the learning process: Thank you for raising this excellent point; we glossed over a key aspect regarding the role of outliers in feature learning and classification performance. Focusing only on the bulk spectrum is an oversimplification, as outliers often encode crucial information, as you rightly pointed out. To address this weakness more comprehensively: we agree that for classification tasks, such as Gaussian mixtures, the class information resides in the outliers rather than the bulk. More generally, the relationship between spectral properties (bulk vs outliers), feature learning dynamics, and generalization is complex and not fully captured by only considering the bulk. While the bulk may characterize certain neural scaling properties under distribution shift, it does not explain how networks learn discriminative representations or classify datasets that differ primarily in their outliers/eigenvectors rather than in bulk spectral shape. We should have acknowledged these limitations. Analyzing the interaction between bulk, outliers, and learning dynamics/generalization remains an important open question, as recent works have started to demonstrate how outliers impact early gradient steps and neural collapse. Truthfully, we have preliminary work investigating how datasets with identical bulk spectra but different outliers/eigenvectors and means may lead to different performance, among other configurations we find to be interesting. This future work should provide avenues for addressing some of these open questions. We have included a more careful discussion of these limitations and future work in the revised manuscript.

Questions:

  1. Bulk spectrum and network analysis: Thank you for bringing up this point. We agree that we could have better justified how our results can be explicitly used for studying neural network dynamics and generalization. While the bulk universality provides a solvable model for aspects like neural scaling laws, as you note, the relationship between feature learning and generalization is complex, involving both the bulk and outliers. As a simple example of how the observations in this paper affect network analysis, we have added Appendix D, where we study a simple teacher-student model trained on power-law scaling data, and show that even in this simple example, our results are necessary for a full analysis. We appreciate you pushing us to better justify the relevance and limitations of our results.

  2. Error universality: You make an excellent point that is directly relevant to our work but that we failed to sufficiently discuss - the relationship between bulk universality and recent developments in analyzing error universality from an RMT perspective. It is an oversight on our part not to connect our observation that the bulk spectrum follows RMT to the implications this has for analyzing generalization error using tools from RMT. As you note, for simple tasks like regression, knowing the bulk spectral density characterizes the error by allowing its computation via RMT, even for multi-modal distributions. To address this major gap, we will expand our discussion on universality to include a connection to error universality, incorporating the suggested works. We will explain how our finding of bulk universality, motivated by RMT, enables bridging to these works analyzing error from an RMT perspective, even allowing characterization of the error beyond just the population regime.

Notation: We understand that this notation may be non-standard, as it is borrowed from the physics literature, and we will be sure to clarify that $X_{ia}$ is simply the transpose of the standard data matrix, $X^T$.

Comment

I thank the authors for their detailed rebuttal and for the updated version of the manuscript.

The question of the role played by the bulk vs. outliers (in particular to generalization), as well as the relationship between bulk and Gaussian universality were raised by different reviewers. While your answer does touch upon these points, I think they require an extended discussion which would be more appropriate for a resubmission.

I would also like to mention that the decision of the authors to not engage in a discussion with the reviewers by posting their rebuttal on the last day of the discussion period was not really helpful.

For these reasons, I am keeping my score.

Review (Rating: 3)

This paper observes that many real-world datasets have a power-law scaling for the bulk eigenvalues of their sample covariance matrices. Hence, the authors construct a correlated Gaussian dataset by designing its population covariance from a Toeplitz matrix. In this case, the power-law tail is related to the strength of the correlations between different features of each sample. By comparing some global and local statistics of sample covariance matrices, such as the global distribution, level spacing distribution, $r$-statistics, and spectral form factor, the authors empirically show that the sample covariance model from real-world datasets falls into the same universal class as this Gaussian model.
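As a rough illustration of the setup summarized above, the sketch below generates a correlated Gaussian dataset whose population covariance has a power-law spectrum $\lambda_k \propto k^{-(1+\alpha)}$ and checks that the empirical bulk inherits this decay once the sample size is large enough. This is a hypothetical stand-in, not the paper's construction: the paper builds the covariance from a Toeplitz matrix (its Eq. (18)), whereas here the power law is imposed directly, and all sizes and the value of $\alpha$ are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, alpha = 300, 30000, 0.4                    # M >> d so the bulk is well resolved

# Population spectrum lambda_k ~ k^{-(1+alpha)} placed in a random orthogonal basis.
pop = np.arange(1, d + 1, dtype=float) ** (-(1.0 + alpha))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma_half = Q * np.sqrt(pop)                    # = Q @ diag(sqrt(pop))

X = sigma_half @ rng.standard_normal((d, M))     # correlated Gaussian dataset, d x M
emp = np.sort(np.linalg.eigvalsh(X @ X.T / M))[::-1]

# Fit the decay exponent of the empirical bulk (skipping the leading eigenvalues).
k = np.arange(21, 151)
slope = np.polyfit(np.log(k), np.log(emp[20:150]), 1)[0]
print("population exponent:", -(1.0 + alpha), " fitted empirical exponent:", round(slope, 2))
```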

Strengths

The motivation of this paper is clear and convincing to me. This work may bring attention from the random matrix theory (RMT) and statistical physics communities. The local and global eigenvalue statistics used in this paper are standard and can effectively represent the spectral behavior of the dataset. Potentially this indicates that we can use classical random matrix ensembles like the GOE and Poisson ensembles to analyze the spectral properties of deterministic real-world datasets, understand the representations in datasets, and construct metrics to measure the quality of datasets. There is another potential motivation for analyzing the eigenstructure of datasets: in certain cases, we can use synthetic data generation, e.g. GANs, to create new datasets for training, so understanding the spectral distributions of real-world datasets via correlated Gaussian datasets helps us investigate whether the synthetic data we generate really captures the useful representation of the data.

Weaknesses

  1. This paper analyzes the bulk spectra of real-world datasets by a Gaussian sample covariance matrix whose population covariance is generated by a Toeplitz matrix and power parameter $\alpha$. The construction of this Toeplitz matrix and $\alpha$ indicates the correlation among the features of each sample. However, there is no algorithm to find the suitable $\alpha$ for different datasets. Besides, it would be more informative to have an additional discussion on power-law scaling, encoding information in the Gaussian dataset, and the representation learning of the dataset.

  2. This paper focuses on bulk eigenvalues but, in machine learning, the outlier eigenvalues and eigenvector statistics are more important for learning. For instance, [1] shows that representations of GAN-data behave as Gaussian mixtures, where spikes in the spectra are beneficial for linear classification. As also studied in [2], power-law tails in the spectra may not represent useful features in neural networks, but the principal components contain some useful features for learning. There should be some discussion here to indicate that this power-law scaling is useful for real-world datasets.

  3. The authors may need to present additional references in RMT or more detailed proofs for the formulas they presented, to help readers in the machine learning community better understand the math here, for instance the claim in (4) and formulas (11) and (28). In particular, (28) seems to provide the limiting Stieltjes transform of the correlated Gaussian sample covariance matrix. I am not sure if this can be used to predict the bulk spectra of CGDs in Figure 2 (Bottom).

Questions

  1. There is another random matrix model that possesses power-law scaling, heavy-tailed Wigner matrices (Lévy matrices), see [3-5], which may have Poisson statistics in the spectra. In this ensemble we still have an i.i.d. structure, but the distribution of each entry may have heavier tails. How do you exclude the possibility that the real-world datasets fall into this universality class?

  2. This paper empirically shows that the bulk eigenvalue distribution of the Gram matrix of the real-world dataset can be mimicked by a correlated Gaussian sample covariance matrix. Does that really mean the synthetic dataset captures useful features in the real dataset? Is it possible to train a neural network separately on both real and synthetic datasets and see if they have similar generalization behaviors? This may provide a better understanding of whether this power law scaling in datasets is useful or not.

  3. In the introduction, you claimed that $O(10)$ large eigenvalues are separated from the bulk and the rest of the bulk eigenvalues follow a power law. Is this order $O(10)$, in Eq. (3), a constant order, or is it related to the number of classes in the dataset? Further quantitative analysis, e.g. by increasing the number of samples, may be needed here.

  4. If we consider the spectra of different classes in one dataset, like CIFAR10, do they have power-law scaling with the same $\alpha$ or not in Figure 1? So far, I have not seen a relation between $\alpha$ and classes in the dataset. Concretely, does the bulk spectrum of the full dataset have the same power-law scaling as the spectrum of the data points in a certain class of this dataset?

  5. Eq. (1) looks incorrect. Are $X_{ia}$ the entries of the data matrix or the full matrix? You need to make the notation consistent.

  6. In Eq. (2), is $\mathbf{I}_{ij}$ the $(i,j)$ entry of the identity matrix?

  7. In (4), you claimed the power law of the eigenvalues of the correlated Gaussian matrix, but you showed the power law for the Toeplitz-type population covariance matrix in (24). Is this enough to conclude (4)?

  8. What is $\Sigma(\rho_\Sigma)$ in the exponent below Eq. (8)?

  9. Can you explain the last sentence in Section 4.1?

  10. How do you ensure that the population covariance $\Sigma^{\text{Toe}}$ is p.s.d. from Eq. (18)?

  11. The caption in Figure 8 is not in the right order.

=================================================================================================

[1] Seddik, et al. "Random matrix theory proves that deep learning representations of gan-data behave as gaussian mixtures."

[2] Wang, et al. "Spectral evolution and invariance in linear-width neural networks."

[3] Burda and Jurkiewicz. "Heavy-tailed random matrices."

[4] Arous and Guionnet. "The spectrum of heavy tailed random matrices."

[5] Guionnet. "Heavy tailed random matrices: How they differ from the GOE, and open problems."

Ethics Review Details

N/A

Comment

Thank you for your thoughtful and constructive feedback. We are encouraged by the strengths you identified in our work, namely that the motivation and use of random matrix theory tools were found to be clear and compelling. The observation that our findings could bring useful attention to understanding representation in datasets from the perspective of classical random matrix ensembles is promising. To strengthen the technical & empirical rigor of the results, we address each of the points raised in order, hopefully leading to the acceptance of the paper.

  1. This point could be construed in two different ways. First, regarding the algorithm for finding $\alpha$ for a given dataset: this is done by computing the empirical Gram matrix, diagonalizing it, arranging the eigenvalues in decreasing order, and fitting a power law to the bulk (a sketch of this procedure appears after this list). We regret not describing this process explicitly in the main text, and will correct it in the revised paper. From a theoretical point of view, there is no a priori way to know the correct $\alpha$ for real data, but we observe empirically that natural image datasets display $\alpha = 0.2$–$0.6$. Furthermore, it need not have been the case that a single correlation scale exists, as suggested by a single $\alpha$; for instance, chaotic fluid images exhibit multiple scalings. We add a discussion on the implications of power-law scaling for NNs in a new appendix (D), where we apply our results in the context of a simple learning problem.

  2. Thank you for bringing up this important subject, as we are currently working on a paper which explores the relations between outliers, bulk, and tail, for both eigenvalues and eigenvectors, when used for training NNs. As shown by [1] and verified by our results, the outliers can also be described by a Gaussian model, just not the CGD that we suggest, since the outliers describe the most shared features in the data and do not carry the local correlation structure of the bulk. They are crucial in classification tasks; in particular their effect, as well as that of the different class means, is the most important for linear classifiers. Our approach focused on the case where one studies improved performance from using more data, long after the outliers have been well described and the only improvement gains must be squeezed out of the bulk, as described in our related work section. We agree that this discussion is important, and it will be included in the revised manuscript.

  3. We appreciate the feedback from the ML perspective regarding our RMT claims, and agree that the results can be better stated. To strengthen our claims, we have revised Appendix B to include a detailed derivation of the spectral density, and compare it with numerical simulations and the bulk of real data.
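As referenced in point 1 above, a minimal sketch of the described $\alpha$-fitting procedure might look as follows; the cutoffs used to exclude the leading outliers and the far tail are illustrative choices, not the paper's.

```python
import numpy as np

def fit_bulk_exponent(X, skip_outliers=10, bulk_frac=0.5):
    """Estimate the bulk power-law exponent of the empirical Gram spectrum.

    Steps, following the procedure described above: form the empirical Gram
    matrix, diagonalize it, sort the eigenvalues in decreasing order, and fit
    lambda_k ~ k^(-beta) over the bulk, excluding the leading outliers and
    the extreme tail.  X is a (d, M) array of M samples of dimension d.
    """
    d, M = X.shape
    lam = np.sort(np.linalg.eigvalsh(X @ X.T / M))[::-1]
    lo, hi = skip_outliers, int(bulk_frac * d)
    k = np.arange(lo + 1, hi + 1)
    beta = -np.polyfit(np.log(k), np.log(lam[lo:hi]), 1)[0]
    return beta

# Usage on any (d, M) data matrix, e.g. flattened images stacked as columns:
# beta = fit_bulk_exponent(images.reshape(num_images, -1).T)
```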

Questions:

  1. This is a very interesting question, and in practice we believe that the local spectral properties converging to GOE rather than a Poisson-GOE intermediate ensemble, coupled with the correct prediction for the bulk spectral density, is sufficient proof. Nevertheless, one may wonder if there exist datasets which are sufficiently constrained to produce Poisson statistics, but as those datasets would require a strictly diagonal matrix with i.i.d. entries, we have not found such an example.

  2. Thank you for this question; it is indeed possible, and we are in the process of completing a paper which takes the synthetic model as a basis for network analysis and deals precisely with this question. We also consider the effects of outliers and higher/lower moments in this analysis. Regardless, we point to Refs. [1,2] in our manuscript for examples that require only power-law scaling for generalization.

  3. The number of outliers itself should be data dependent and is not necessarily related directly to the number of classes; it may be related to the nearly constant values that are shared between all images.

  4. This question is apt, and we found that different classes can have either very similar, or quite dissimilar spectra, with somewhat different exponents. We study the effects of these differences in an upcoming work.

  5. $X_{ia}$ is the design (data) matrix composed of $M$ columns, each representing a sample of dimension $d$, implying that Eq. (1) is simply $\frac{1}{M} X^T X$, the standard definition of the empirical Gram matrix.

  6. Yes.

7+10) We apologize for the confusion. The covariance used to generate samples is composed of the singular values of the Toeplitz matrix, and is therefore PSD, and so the formula (24) holds for it. This is now more clearly explained in the text.

  8) The exponent is that of the modular Hamiltonian, but in this instance it is simply that of $\Sigma_M$. This has been clarified in the revised text.

  9) The linear part of Equation 9, $2\tau$, is the universal part. The onset of this ramp is a known indicator of universality, as shown for real data in Fig. 3.

  11) We apologize for the mistake; it has been corrected in the revised manuscript.

Review (Rating: 5)

This submission studies the bulk eigenvalues of the Gram matrix for real-world data. The main contribution is an empirical verification that certain macroscopic (global law) and microscopic (eigenvalue spacing) spectral properties of the Gram matrix can be described by a Wishart matrix with certain correlation structures. This opens up new possibilities to use random matrix theory to understand the learning curve for realistic datasets.

Strengths

The authors consider an important research problem: most existing random matrix theory analyses on machine learning models require strong assumptions on the data distribution, such as Gaussian with identity covariance, and one may wonder if these theoretical results have any implications in practical settings. The universality perspective in the current submission provides some justification of the Gaussianity assumption.

Weaknesses

I have the following concerns.

  1. The implications of the studied universality phenomenon on the learning behavior of machine learning models need to be elaborated. The authors motivated the study of the Gram matrix using the neural scaling law, but how do the macroscopic and microscopic properties of the eigenvalues (especially the microscopic properties) relate to the power-law scaling of learning performance? I cannot find such a discussion in the main text. For example, although we know that the universality of the eigenvalue spacing distribution can be very robust, it is unclear what insight a machine learning researcher may acquire from such statistics. It would appear that the more important quantity is the Gram matrix of the neural network representation, which has been explored in many recent works including (Seddik et al. 2020) (Wei et al. 2022).
  • (Wei et al. 2022) More than a toy: random matrix models predict how real-world neural representations generalize.
  • (Seddik et al. 2020) Random matrix theory proves that deep learning representations of GAN-data behave as Gaussian mixtures.
  2. Related to the previous point, it should also be noted that the eigenvalue statistics alone do not decide the learning curve. For trained neural networks, the aligned eigenvectors arising from representation learning play a major role in the improved generalization performance, as shown in (Ba et al. 2022) (Wang et al. 2023). The authors should comment on whether such eigenvector statistics also fit into the universality perspective in this submission.
  • (Ba et al. 2022) High-dimensional asymptotics of feature learning: how one gradient step improves the representation.
  • (Wang et al. 2023) Spectral evolution and invariance in linear-width neural networks.
  3. Related works are not adequately discussed. Various forms of universality laws for neural networks have appeared in (Seddik et al. 2020) (Wei et al. 2022), and in the context of empirical risk minimization, the "Gaussian equivalence property" has been studied in many prior works, see (Goldt et al. 2020). How do these results relate to the findings in this submission? Also, the Marchenko-Pastur law for general covariance is a classical result that existed way before (Couillet and Liao 2022).
  • (Goldt et al. 2020) The Gaussian equivalence of generative models for learning with shallow neural networks.

Questions

See weaknesses above.

Comment

Thank you for your important comments and questions, which have led us to consider carefully how to better put our results in the correct context for the ML community. We are glad that you agree that providing better minimal theoretical models for data can lead to a better understanding of neural network behavior. We have made a number of changes which we hope address or clarify some of the issues you’ve raised, and which we hope will lead you to consider acceptance. We address these in more detail now, in the order in which they were raised.

  1. Explaining how the universality in data affects the learning behavior of ML models: We thank you for highlighting the importance of relating our results to the learning behavior of ML models. We attempted to address this point by appealing to works on neural scaling laws, in which the Gram matrix spectrum is a key part of the analysis, but which are limited by either taking the population limit ($d/N=\gamma\to 0$) or taking a phenomenological distribution for the eigenvalues. Our work verifies that the Gram matrix of real-world datasets can faithfully be described by the RMT regime for a finite number of samples $N\sim d$, as demonstrated by the local properties we studied. The implication is that given a power-law exponent $\alpha$, we know not only the spectrum but also its degeneracy, given by the spectral density $\rho_\Sigma$ which we derive in Appendix B. In order to further strengthen the connection between our results and the ML setup, we have added a new appendix, Appendix D, in which we give a concrete example of how one may use our results to obtain theoretical predictions. Namely, we solve a simple teacher-student model with power-law correlated data, and show that the training dynamics as well as the convergence depend on the spectral density of the Gram matrices studied in the main text (an illustrative sketch appears after this list). We obtain analytical expressions for the training and generalization losses.

  2. Eigenvector distribution and relation to learning in the GOE regime: Thank you for stressing the importance of the eigenvector behavior, which we agree is very important in the context of network learning. While we did acknowledge that the basis in which the covariance matrix is given is very important to the learning dynamics and convergence ("Additionally, the interplay between eigenvectors and eigenvalues in neural networks merits further exploration, as both components likely play crucial roles in the way neural networks process information"), we chose to focus only on the eigenvalue properties in this work. In ongoing work, which will be presented soon, we will discuss in detail the relationship between eigenvalues, eigenvectors, and network performance in multiple settings. To address the specific question "...whether such eigenvector statistics also fit into the universality perspective..." we would like to make two points. First, eigenvectors are important, e.g., for classification problems, since they encode information about the relative embedding of the objects with respect to a given fixed basis. Second, while the eigenvectors are fixed for a single realization of the covariance matrix, they are expected to be random and rather delocalized when considering an ensemble of such realizations. This is not the case for the eigenvalue distribution, which admits its RMT prediction even using a single realization of the Gram matrix. It is therefore more subtle to assign universal characteristics to the eigenvectors. This implies, in particular, that the eigenvector information used for training may not be valuable for generalization, depending on how delocalized the eigenvectors are, which also depends on the spectral density (see Bun et al., https://arxiv.org/abs/1502.06736).

  3. Discussion regarding related works: We appreciate that our discussion of related works may be lacking. We have made revisions to the manuscript to better put the paper in the correct context within the literature. Firstly, we have added the suggested references, and in particular our citation of (Couillet and Liao 2022) is meant as a proxy for the many theorems therein, but we have included the appropriate source in the revised manuscript. Secondly, we are fully aware of many of the suggested works, and agree that they should be placed in the discussion on universality and its applications in machine learning. For example, the Gaussian equivalence property will certainly hold for the datasets we studied, as they have the correct symmetry property from the RMT perspective.
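As mentioned in point 1 above, the following is a minimal, hypothetical sketch of the kind of teacher-student experiment being described: a linear student trained by gradient descent on Gaussian inputs whose covariance spectrum decays as a power law. It is not the model of Appendix D; the hyperparameters are made up and the covariance is taken diagonal for simplicity. The point it illustrates is that the convergence of each mode is set by the corresponding Gram eigenvalue, so the bulk spectral density controls the training dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, alpha, lr, steps = 200, 2000, 0.4, 0.1, 2000

# Inputs with a power-law covariance spectrum (diagonal for simplicity) and a
# random linear teacher; the student is linear and trained by gradient descent.
pop = np.arange(1, d + 1, dtype=float) ** (-(1.0 + alpha))
X = rng.standard_normal((M, d)) * np.sqrt(pop)     # rows are samples
w_star = rng.standard_normal(d)
y = X @ w_star

G = X.T @ X / M                                    # empirical Gram/covariance matrix
eigvals, V = np.linalg.eigh(G)                     # ascending eigenvalues

w = np.zeros(d)
for _ in range(steps):
    w -= lr * X.T @ (X @ w - y) / M                # gradient of 0.5 * mean squared error

err_modes = V.T @ (w - w_star)                     # residual error per eigendirection
print("train loss:", 0.5 * np.mean((X @ w - y) ** 2))
print("5 smallest Gram eigenvalues:    ", np.round(eigvals[:5], 5))
print("residual along those directions:", np.round(np.abs(err_modes[:5]), 2))
# Directions with small eigenvalues decay like (1 - lr * lambda)^t and are still
# far from the teacher, while large-eigenvalue modes have long since converged.
```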

AC Meta-Review

The article focuses on the spectral properties of sample covariance matrices in real-world image data and links these properties to a Wishart matrix with a specific correlation structure. This connection suggests that random matrix theory tools might be useful in analyzing neural network training. The reviewers acknowledged that the paper tackles an important problem and brings interesting contributions; however, they expressed a number of concerns, in particular about the lack of discussion on the connection to the learning curve and the absence of related references on universality laws for neural networks. The authors' response and revised version were considered by the reviewers. However, due to the extent of the concerns raised, a resubmission to another venue is more suitable.

Why not a higher score

All three reviewers agreed that the current version has a number of limitations, in particular

  • Lack of discussion regarding the connection to the learning curves
  • Missing references

Why not a lower score

NA

Final Decision

Reject