Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias
We reduce the bias of eigenspectrum diagnosis using a method based on matrix subsampling
Summary
Reviews and Discussion
This paper proposes FARMS, a method for reducing the bias in the estimation of heavy-tailedness (HT) metrics due to the aspect ratio of the weight matrix. FARMS samples submatrices with a fixed aspect ratio and averages the sampled empirical spectral densities (ESDs). Empirically, using HT metrics estimated by FARMS leads to improved downstream performance in pruning and in choosing layer-wise learning rates.
Questions For Authors
- Is there any theoretical motivation for the particular subsampling strategy in FARMS? Can you quantify its effect and explain why it should preserve important spectral properties of the original matrix?
- If the subsampling strategy is not theoretically well-motivated, have you tried alternative approaches to mitigating the aspect ratio bias?
Claims And Evidence
Yes
Methods And Evaluation Criteria
Yes
Theoretical Claims
N/A
Experimental Design And Analyses
The experiments in Section 4 are sound.
Supplementary Material
I reviewed the full supplementary material.
Relation To Broader Scientific Literature
The paper highlights the importance of recognizing that the numerical value of the same property measured in different layers can naturally have different scales, and should not be compared directly without proper normalization. This is a basic but important idea that deserves more attention from the community. Similar considerations around the different shapes of the weight matrices in each layer have led to impactful theoretical and practical advances in areas such as hyperparameter transfer and infinite-width limits of neural networks with feature learning [1, 2].
[1] Yang, Greg, and Edward J. Hu. "Tensor programs iv: Feature learning in infinite-width neural networks." International Conference on Machine Learning. PMLR, 2021.
[2] Yang, Greg, et al. "Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer." arXiv preprint arXiv:2203.03466 (2022).
Essential References Not Discussed
None that I'm aware of.
Other Strengths And Weaknesses
Strength: The paper is well-written, and experimental results are strong and well-presented. The core idea of mitigating the aspect ratio bias makes intuitive sense.
Weakness: While the proposed method empirically improves the downstream performance of the estimated HT metrics, the paper does not provide a theoretical justification for the approach's soundness. For example, while subsampling the matrices reduces the bias due to their aspect ratio, does it introduce additional biases, since we no longer work with the full weight matrix, which is what we ultimately care about?
Other Comments Or Suggestions
The abbreviation FARMS is used in the title without being defined.
Thank you for your insightful and constructive comments. We have addressed your comments as follows.
Broader Literature
Thank you for suggesting references [1, 2]. Tensor Programs IV [1] shows that standard parametrizations can collapse to the kernel regime in the infinite-width limit. It proposes the Maximal Update Parametrization (µP) to ensure feature learning by carefully choosing parameter scaling rules based on layer dimensions. Tensor Programs V [2] further shows that µP enables zero-shot hyperparameter transfer: hyperparameters tuned on smaller models remain optimal for much larger ones. In this context, "shape" awareness relates to how fan-in/fan-out dimensions affect the optimal scaling of initializations and learning rates to maintain stable training dynamics.
Our paper, on the other hand, focuses on how the aspect ratio (number of rows vs. columns) of an individual weight matrix biases the measurement of empirical spectral density (ESD). We show that different aspect ratios can artificially stretch or compress the ESD, which confounds the interpretation of HT metrics like the Power Law (PL) exponent Alpha as indicators of training quality. FARMS targets this bias. By analyzing the average ESD of submatrices sampled at a fixed aspect ratio, FARMS provides a normalized HT metric for more reliable comparison across layers of diverse shapes within a given network.
Thus, while [1][2] use shape-aware parametrization (like µP), our work uses a different shape-aware analysis technique (FARMS) to correct measurement bias in spectral diagnostics. We will discuss this in detail in the revised paper.
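For concreteness, here is a minimal sketch of the fixed-aspect-ratio subsampling idea described above. The random submatrix placement, the square submatrix size, and the Hill-style tail fit are our own illustrative choices for this sketch, not the exact implementation in the paper (which uses a sliding window and a power-law fit for PL_Alpha):

```python
import numpy as np

def farms_esd(W, sub_shape=(64, 64), n_samples=20, seed=0):
    """Pool the empirical spectral density (ESD) over submatrices of W
    sampled at a fixed aspect ratio (here square, 64 x 64)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    sm, sn = sub_shape
    eigs = []
    for _ in range(n_samples):
        i = rng.integers(0, m - sm + 1)
        j = rng.integers(0, n - sn + 1)
        S = W[i:i + sm, j:j + sn]
        # eigenvalues of the correlation matrix S S^T / sn
        eigs.append(np.linalg.svd(S, compute_uv=False) ** 2 / sn)
    return np.sort(np.concatenate(eigs))

def hill_alpha(esd, k=50):
    """Hill estimator of the power-law tail exponent of the pooled ESD,
    one common proxy for a heavy-tailedness metric like PL_Alpha."""
    tail = np.sort(esd)[-k:]
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 2048))  # a rectangular weight matrix
esd = farms_esd(W)                    # pooled ESD at fixed aspect ratio
alpha = hill_alpha(esd)               # per-layer HT metric
```

Because every layer's ESD is built from submatrices of the same shape, the resulting alpha values are comparable across layers regardless of each layer's original aspect ratio.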
References
[1] Yang, Greg, and Edward J. Hu. "Tensor programs iv: Feature learning in infinite-width neural networks." International Conference on Machine Learning. PMLR, 2021.
[2] Yang, Greg, et al. "Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer." arXiv preprint, 2022.
Weakness 1 and Question 1
First, see our previous response to clarify what aspect ratio bias is and explain why FARMS addresses it. Then, we will answer your question in more detail.
- Why does the FARMS subsampling approach preserve important spectral properties of the original matrix?
- Does it introduce additional bias?
The goal of measuring heavy-tailedness in HTSR is to evaluate the strength of correlations introduced by training, as established in previous work [1]. However, when we subsample a single submatrix and measure correlations only within it, some correlations between elements inside and outside the subsampled matrix are inevitably lost. This motivates our approach of using multiple submatrices to capture a broader range of correlations.
Does this approach introduce additional bias? We believe it does, but we view this "bias" more as a form of partial coverage of the entire matrix. At a high level, this is conceptually similar to bootstrap sampling in random forests—using multiple samples to mitigate the effects of limited coverage.
Recent work, such as [2][3], aims to theoretically quantify heavy-tailedness by interpreting it as the accumulation and evolution of feature spikes in the ESD that align with the teacher model's features. These feature spikes are approximately rank-one updates to the original matrix, and because such a rank-one component is spread across the whole matrix, any sampled submatrix will contain it with high probability. Therefore, subsampling will not miss the feature spikes, which previous work [2][3] believes to cause the heavy-tailed structure. We believe this provides further justification that FARMS preserves important spectral information, as measured by the feature spikes in the ESD.
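The delocalized-spike argument above can be checked numerically. In this sketch (our own toy setup, not taken from the paper), a dense rank-one spike added to a Gaussian bulk remains clearly visible in the top singular value of a randomly placed submatrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s = 400, 1600, 200                      # full matrix shape; submatrix side
W = rng.standard_normal((m, n)) / np.sqrt(n)  # bulk (noise) part
u = rng.standard_normal(m); u /= np.linalg.norm(u)
v = rng.standard_normal(n); v /= np.linalg.norm(v)
M = W + 8.0 * np.outer(u, v)                  # dense rank-one "feature spike"

top_sv = lambda A: np.linalg.svd(A, compute_uv=False)[0]

# a randomly placed square submatrix, with and without the spike
i, j = rng.integers(0, m - s), rng.integers(0, n - s)
spiked_block = top_sv(M[i:i + s, j:j + s])
noise_block = top_sv(W[i:i + s, j:j + s])
# the spike covers the whole matrix, so the submatrix inherits a
# shrunken but still detectable outlier above the noise bulk
print(spiked_block, noise_block)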
Finally, we direct the reviewer’s attention to the experiment in Appendix A.1. In this experiment, we demonstrate that for i.i.d. initialized weights of varying sizes, our method produces a constant measure of heavy-tailedness, whereas existing methods vary with matrix size, exhibiting an aspect ratio bias. While FARMS may introduce partial coverage of the entire matrix, it does so uniformly across matrices of all sizes, preventing any layer-shape-dependent bias.
[1] Martin and Mahoney. "Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning." JMLR 2021.
[2] Wang et al. "Spectral evolution and invariance in linear-width neural networks." NeurIPS 2023.
[3] Kothapalli et al. "Crafting heavy-tails in weight matrix spectrum without gradient noise." arXiv preprint.
Other Comments Or Suggestions
Thank you for the suggestion. The abbreviation was used in the abstract without a definition. We will include the definition of FARMS in the abstract.
Additional experiments
Experiments to answer all reviewers' questions can be found here.
The paper studies the problem of eigenspectrum analysis of DNN weight matrices and its relationship with model quality during training. The paper proposes Fixed-Aspect-Ratio Matrix Subsampling (FARMS) to address the aspect ratio bias in existing Heavy-Tailed Self-Regularization (HT-SR) eigenspectrum analysis, which enables better computation of HT metrics. The idea is to compute the HT metrics based on the average ESD of a series of subsampled weight submatrices with a fixed aspect ratio. FARMS further yields an improved algorithm for layer-wise hyperparameter tuning, with extensive experiments demonstrating its effectiveness.
Questions For Authors
- How would one design experiments that directly evaluate the correlation between the HT metrics (calculated by different methods, i.e., existing ones and FARMS) and the so-called "training quality of each layer"? How should the training quality of different layers be defined? To me, this seems like the foundation for the claim that the newly proposed HT metric calculation can result in better layer-wise hyperparameter tuning. Since there appears to be no theory behind the design of the algorithm, such an experiment would make the overall methodology more convincing. The authors show in Section 4.3 that FARMS-assisted layer-wise hyperparameter tuning improves final test accuracy over the existing HT metric calculation method, but this does not directly show that FARMS gives a better indication of how well each single layer is trained.
Claims And Evidence
Yes, the claims made in the submission are supported by clear and convincing evidence.
Methods And Evaluation Criteria
Yes, the proposed methods make sense for the problem studied.
Theoretical Claims
N/A
Experimental Design And Analyses
The reviewer has checked the experimental designs and analyses.
Supplementary Material
The reviewer has briefly checked the additional experiment results in the supplementary material.
Relation To Broader Scientific Literature
The paper directly contributes to the line of work based on Heavy-Tailed Self-Regularization (HT-SR) [1, 2], which in turn yields an improved algorithm for the task of layer-wise hyperparameter tuning.
References:
[1] Mahoney, M. and Martin, C. Traditional and heavy tailed self regularization in neural network models. In Proceedings of the 36th International Conference on Machine Learning.
[2] Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165):1–73, 2021.
Essential References Not Discussed
To the best of the reviewer's knowledge, no related work is missing.
Other Strengths And Weaknesses
Strengths:
- The idea of matrix subsampling to debias the estimation of the heavy-tailedness metrics is new and interesting.
- The layer-wise hyperparameter tuning algorithms based on the new HT metric calculation show promising results on a range of tasks compared with previous SOTAs.
- The experiment design is quite extensive and solid.
Weaknesses:
- To me the main drawback is that it is not clear why, in principle, the HT statistics of the subsampled matrices are a better estimator of the quality of the trained model, despite the fact that the aspect ratio of the subsampled matrices has been fixed. Please also see my question in the Questions For Authors part. Is there any theoretical justification for such a method?
- Also, as a minor drawback, the computational cost could be large compared with previous SOTA methods (Appendix C.5).
Other Comments Or Suggestions
Please see the Questions For Authors part.
Thank you for your insightful and constructive comments. We have addressed your comments as follows.
Weakness 1 and Question 1
- Previous work on training quality
Previous work on HTSR has established that the heavy-tailedness of ESDs is strongly correlated with the test accuracy of models [1][2]. While this does not imply that "training quality" is identical to "test accuracy," the correlation between heavy-tailedness and test accuracy has been used to justify HTSR metrics. Therefore, improving test accuracy or similar performance metrics (e.g., perplexity) remains our primary goal.
Although previous work on HTSR does not explicitly define "training quality," several related quantities have been mentioned: (1) strong correlation between weight elements [1][2] and (2) feature spikes and their alignment with the target teacher model [3][4]. The feature spike, analyzed in the context of a single-index teacher and a two-layer student model [3][4], is approximately a rank-one update to the model (in the limit of infinite matrix size with fixed aspect ratio) and also persists after matrix subsampling. This is because the specific form of the rank-one update makes it cover the whole matrix with probability one.
- A toy experiment to measure training quality
We designed a toy experiment to test the correlation between "training quality" and the new HTSR metric measured using FARMS. Following [3][4], we use a single-index teacher to generate signals to train a two-layer student. The first layer of the student model is a weight matrix, while the second layer is a weight vector. Following [3][4], we only update the weights of the first layer. To measure "training quality," we track, during training, the alignment between the weight matrix and the ground-truth target vector of the teacher model, similar to [3][4], and we define this alignment to be the "training quality" of the student model.
Throughout the training process, we select the student network checkpoint with the highest alignment and report both the alignment value and the PL_Alpha value (the HTSR metric). We then vary the sizes of the student model with different weight matrix aspect ratios at a fixed input dimension of 500 and conduct multiple experiments. Each experiment provides one PL_Alpha value and one alignment value, and together the experiments produce one curve for PL_Alpha and one curve for alignment.
We then plot the two curves, estimating PL_Alpha both with existing methods and with FARMS. As shown in Figure 2, FARMS reveals a clear negative correlation between the two curves: the better the training quality, the larger the alignment and the smaller the PL_Alpha. For the existing method, however, the aspect ratio bias distorts this correlation.
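A stripped-down version of this teacher–student setup can be sketched in a few lines. The dimensions, the small second-layer scale, and the single large full-batch gradient step (instead of full training) are our own illustrative choices, loosely following the rank-one-spike picture in [3][4]:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_samples = 500, 200, 4000
u = rng.standard_normal(d); u /= np.linalg.norm(u)  # single-index teacher direction
X = rng.standard_normal((n_samples, d))
y = np.tanh(X @ u)                                  # teacher labels

W = rng.standard_normal((h, d)) / np.sqrt(d)        # student first layer
a = 0.1 * rng.standard_normal(h) / np.sqrt(h)       # small fixed second layer

def alignment(W, u):
    """Cosine between the top right singular vector of W and the teacher direction."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return abs(Vt[0] @ u)

align_before = alignment(W, u)
# one large full-batch gradient step on the first layer (squared loss);
# the step is approximately a rank-one update aligned with u
H = np.tanh(X @ W.T)
r = H @ a - y
grad = ((1 - H ** 2) * (r[:, None] * a[None, :])).T @ X / n_samples
W_after = W - 200.0 * grad
align_after = alignment(W_after, u)
print(align_before, align_after)
```

After the step, the top singular direction of the first layer is strongly aligned with the teacher direction, which is the "alignment" quantity our toy experiment uses as a proxy for training quality.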
Furthermore, in Appendix A.1, we compared FARMS with existing ways of measuring heavy-tailedness. For i.i.d. initialized weights of varying sizes, FARMS produces a constant value of heavy-tailedness degree, whereas existing methods change with matrix size, showing an aspect ratio bias. In this setting, all matrices are equally undertrained since they are just initialized.
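The pure aspect-ratio effect in this i.i.d. setting can be reproduced in a few lines: the ESD bulk follows the Marchenko–Pastur law, whose upper edge (1 + √q)² moves with the aspect ratio q = m/n, so any tail-fitting procedure applied to the full-matrix ESD inherits a shape dependence. This is an illustrative sketch, not the exact Appendix A.1 setup:

```python
import numpy as np

rng = np.random.default_rng(0)
results = []
for m, n in [(500, 500), (500, 2000), (500, 8000)]:
    q = m / n                                # aspect ratio of the m x n matrix
    X = rng.standard_normal((m, n))
    eigs = np.linalg.eigvalsh(X @ X.T / n)   # ESD of the correlation matrix
    mp_edge = (1 + np.sqrt(q)) ** 2          # Marchenko-Pastur upper bulk edge
    results.append((q, eigs.max(), mp_edge))
    print(f"q={q:.4f}  max eigenvalue={eigs.max():.3f}  MP edge={mp_edge:.3f}")
```

Even though all three matrices are identically distributed and equally "untrained," the support of their ESDs differs purely because of shape, which is exactly the confound that fixing the subsampling aspect ratio removes.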
- Summary of this question
We understand that formally defining "training quality" requires substantial novel work. Our goal is not to reinvent the wheel or to claim that our method measures training quality better than prior work. Instead, we aim to correct a specific oversight in prior work: the aspect ratio bias. In other words, our goal is to mitigate the aspect ratio bias without sacrificing the ability to measure heavy-tailedness.
[1] Martin and Mahoney. "Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning." JMLR 2021.
[2] Martin, Peng and Mahoney. "Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data." Nature Communications 2021.
[3] Ba et al. "High-dimensional asymptotics of feature learning: How one gradient step improves the representation." NeurIPS 2022.
[4] Wang et al. "Spectral evolution and invariance in linear-width neural networks." NeurIPS 2023.
Weakness 2
The computational cost can indeed be higher than previous methods, and we have acknowledged this. There are ways to mitigate the issue, e.g., using a larger window size (such as 4096 for LLaMA-7B and 5120 for LLaMA-13B) and a limited number of sampling steps. In Table 2, when we use a larger window size (such as 4096) and fewer sampling steps (such as 5), our method still improves model performance while reducing the computational cost.
Additional experiments
Experiments to answer all reviewers' questions can be found here.
This paper introduces an approach called FARMS (Fixed-Aspect-Ratio Matrix Subsampling) that addresses a current deficiency of approaches that quantify the heavy-tailedness of an empirical spectral density (ESD) without accounting for matrix shape. In particular, the authors observe that random Gaussian matrices converge to distributions with differing levels of heavy-tailedness depending on the aspect ratio of the matrix, but this effect is not captured by algorithms that rely on Heavy-Tailed Self-Regularization (HTSR) theory to assess how well trained certain layers are. FARMS accounts for this by computing a weight matrix's spectral density as the averaged spectral density of subsampled matrices of fixed dimension. By using a fixed dimension for the subsampled matrices when computing the ESD of all weight matrices, the impact of the aspect ratio on the ESD is controlled. The authors adopt FARMS for computing the ESD in three HTSR-based algorithms and show that FARMS improves performance over the standard approach for computing the ESD.
Questions For Authors
- Why do you think the LLM pruning results were more sensitive to the sliding window size and the number of submatrices?
- As seen in Table 4, it makes sense intuitively that a square matrix would give the most ratio-agnostic ESD. Do you think there is a parametric adjustment that could be made instead of taking a subsampling approach?
Claims And Evidence
The authors demonstrate sufficient evidence that accounting for the matrix dimension when computing the ESD for subsequent heavy-tail analysis improves the performance and robustness of HTSR algorithms developed for adaptive learning rate optimization and pruning.
Methods And Evaluation Criteria
The evaluation is relevant and sufficient for demonstrating the impact of accounting for the shape of a matrix. The authors evaluate FARMS applied to three different HTSR algorithms in three domains (computer vision, LLM pruning, and SciML) and show improvements for all three.
Theoretical Claims
There were no substantial theoretical claims.
Experimental Design And Analyses
I found the conclusions drawn from the experiment design and analyses valid in support of FARMS.
Supplementary Material
I viewed the additional results presented in the supplementary material. The main thing that stood out was the zero-shot accuracy results for pruning LLaMA models, which showed FARMS to improve accuracy but usually by less than 1% relative to the baseline. This echoes a general theme in the results: FARMS does improve performance but generally only shifts the needle slightly.
Relation To Broader Scientific Literature
The contribution of this paper is highly relevant to applications of HTSR and techniques that rely on accurately capturing the ESD.
Essential References Not Discussed
I found the discussion of prior work sufficient.
Other Strengths And Weaknesses
Strengths:
- While the idea of accounting for the weight matrix aspect ratio is simple and straightforward, the experiments demonstrate FARMS to offer consistent improvement over prior HTSR approaches across the board.
- I found the paper clear and well written.
- FARMS improves the robustness of TempBalance as shown in Figure 3 where FARMS does not need additional Layer Selection to improve performance.
Weaknesses:
- FARMS seems like it can be sensitive to the window size and sampling-steps hyperparameters (Table 5), which could make it difficult to apply in practice.
- Error bars are not provided for the LLM pruning experiments, which makes it difficult to gauge the statistical significance of the results in Table 1 and Table 5.
- The improvement from FARMS is somewhat incremental especially compared to the improvement from using HTSR adaptive learning rate relative to baseline.
Other Comments Or Suggestions
It seems like there could be more of a theoretically grounded adjustment for this ratio that would preclude the need for window size, sliding window, and subsampling number hyperparameters of FARMS. Have you given this direction some thought?
Thank you for your insightful and constructive comments. We have addressed your comments as follows.
Weakness 2
While we agree that error bars would help show statistical significance, we respectfully point out that several representative works [1-3] on LLM pruning did not provide error bars in their main results, which is why we did not provide them in the submitted paper. However, since both SparseGPT [1] and Wanda [2] use calibration data to estimate input statistics, we sampled different calibration sets of 128 samples using six different seeds [0, 1, 2, 3, 4, 5]. We evaluated FARMS (our method) and AlphaPruning and provide the mean and standard deviation (STD) of perplexity in Table 1. We find that FARMS consistently achieves lower perplexity. For example, the perplexity of LLaMA-7B is reduced from 96.02 ± 1.59 to 79.42 ± 3.86 using the SparseGPT pruning method at a 0.8 sparsity ratio. We also found that FARMS achieves a lower STD in most settings.
[1] Frantar, Elias, and Dan Alistarh. "Sparsegpt: Massive language models can be accurately pruned in one-shot." In International Conference on Machine Learning, pp. 10323-10337. PMLR, 2023.
[2] Sun, Mingjie, Zhuang Liu, Anna Bair, and J. Zico Kolter. "A simple and effective pruning approach for large language models." arXiv preprint arXiv:2306.11695 (2023).
[3] Cheng, Hongrong, Miao Zhang, and Javen Qinfeng Shi. "A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
Weakness 3
First, compared to existing HTSR adaptive methods, our approach demonstrates stronger robustness. Figure 3 in the paper shows that FARMS achieves better performance without Layer Selection, a heuristic tuning step in TempBalance (TB) used to avoid the aspect ratio bias. Second, HTSR methods such as TB and AlphaPruning are strong baselines. For example, in layer-wise learning rate assignment, TB beats state-of-the-art optimizers and schedulers such as SGDR, SGDP, LARS, and Lookahead.
Other Comments or suggestions
Please refer to our response to Weakness 1 and Question 1 from Reviewer C2rY for more details. We cannot use a very small window size or sampling steps because doing so may not cover the entire matrix. Conversely, selecting a very large size would result in too much overlap between sampled matrices. A precise theoretical characterization may be beyond the scope of this paper, but we will include a detailed discussion in the revised version.
Question 1 & Weakness 1
First, pruning LLMs layer-wise at high sparsity is a challenging optimization problem: if sparsity ratios are not properly allocated, the performance of the model can become very unstable [4]. The sensitivity shown in Table 1 mainly stems from this.
Second, we provide additional results validating the stability of our method. In Table 2, we repeat our experiments with six random seeds [0, 1, 2, 3, 4, 5] and report the mean and STD of perplexity. Except for cases with a very small window size (such as 500), all settings outperform the baseline, and the differences in perplexity across hyperparameter choices are not particularly significant.
Third, we show that the FARMS evaluation of the ESD is stable, which indicates that most of the sensitivity comes from pruning itself. In Figure 1, we plot the layer-wise sparsity ratios assigned by FARMS across 4 × 4 combinations of window sizes and sampling steps. The differences in sparsity ratio assignments across these settings are not significant; however, the assignment from the best FARMS setting is still distinct from the worst one, which explains the performance difference.
[4] Yin, Lu, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li et al. "Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity." arXiv preprint arXiv:2310.05175 (2023).
Question 2
We are unsure what "parametric adjustment" means and kindly request the reviewer to clarify. If it refers to the "change of variable" method in the Marchenko–Pastur pdf function, this is challenging because the ESD transitions smoothly from Marchenko–Pastur to heavy-tailed in a way that is difficult to quantify precisely. If "parametric adjustment" instead refers to other hyperparameter assignment approaches, those are orthogonal to FARMS and can be combined with it.
Additional experiments
Experiments to answer all reviewers' questions can be found here.
I appreciate the authors' response and find most of my points addressed. I will increase my score to a 4.
This paper introduces a method based on weight matrix subsampling that removes the spectral bias induced by matrix rectangularity, thereby facilitating more robust and comparable measurements of matrices' spectral heavy-tailedness, a property previously shown to correlate with model performance in some circumstances.
The reviewers found the paper to be well-written and the experiments thorough, with the proposed method delivering consistent downstream performance gains relative to the baselines considered; on the other hand, there were some concerns about the theoretical justification. Given the empirical nature of the paper, and the slightly loose motivations regarding the heavy-tailed metrics in the first place, it seems entirely reasonable to focus on the experimental results rather than a precise formulation of the theory behind the subsampling approach. As such, I think the ICML community will find this paper valuable and I recommend acceptance.