AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models
We use methods derived from HT-SR Theory to develop improved methods for pruning LLMs.
Abstract
Reviews and Discussion
This paper introduces AlphaPruning, a novel framework for unstructured LLM pruning. The framework leverages HT-SR theory, using the heavy-tailed shape of the ESDs of layer weight matrices to allocate layer-wise sparsity more effectively. By focusing on shape metrics rather than scale metrics, AlphaPruning demonstrates superior performance in maintaining model accuracy and reducing complexity. The method has been empirically validated across various architectures, showing significant improvements in performance and efficiency compared to existing methods, with strong generalizability to other compression techniques and architectures.
Strengths
- This paper introduces a novel sparsity allocation method that leverages the heavy-tailed shape of ESDs in layer weight matrices, a concept previously unexplored in the literature.
- The method is extensively validated with a range of LLM architectures, demonstrating significant improvements over state-of-the-art methods in terms of reducing perplexity, increasing accuracy, and achieving computational efficiency.
- AlphaPruning exhibits remarkable adaptability, integrating well with various other model compression techniques and extending beyond LLMs to include large vision model architectures, proving its versatility and broad applicability.
Weaknesses
- The novelty of the method for allocating sparsity based on layer quality (Section 3.2) is incremental, as a similar idea has previously been proposed in [1].
[1] Zhou, Yefan, et al. Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training.
Questions
- Section 3.2 proposes a linear mapping to obtain sparsity from layer quality. Would other mappings, such as first computing the logarithm of the metric and then performing a linear mapping, yield better outcomes?
- I would like to see results on using AlphaPruning to determine layer-wise sparsity for other structured pruning methods, such as OSSCAR [1].
[1] Meng, X., Ibrahim, S., Behdin, K., Hazimeh, et al. OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization.
Limitations
The authors addressed the limitations.
Weakness
We present the differences between AlphaPruning and [1], as detailed below:
- Different research focus. Our study investigates post-training LLM pruning, whereas [1] studies model training.
- The underlying principles of the two works are different. [1] aims to balance layer quality (or make layers equally well-trained) by tuning learning rates, aiming to improve generalization performance. This work focuses on minimizing pruning damage to high-quality layers by making them less sparse, thereby reducing performance loss. While both studies use heavy-tailed metrics from HT-SR theory to estimate layer quality, this shared aspect does not make our work a trivial extension or incremental idea of [1]. This work is the first to explore how HT-SR theory can be applied to allocate sparsity, a concept previously unexplored in the literature.
- Important technical difference: transformer-block-wise measurement instead of matrix-wise measurement. [1] measures the PL_Alpha_Hill metric for each weight matrix of the model individually. However, as we demonstrate in Appendix F, this approach provides suboptimal results for sparsity allocation. We improved upon this by averaging the metric scores across matrices within a transformer block and using the block-wise average score for sparsity allocation, which yielded significantly better results (a minimal sketch of this block-wise averaging is given after this list).
Question 1
We thank the reviewer for suggesting a new mapping function. We implemented the proposed method and compared it with the linear mapping function used in our current approach, as shown in Figure 12 of the rebuttal PDF. The results show that both methods perform similarly when combined with Wanda, but linear mapping slightly outperforms the proposed logarithmic mapping when combined with SparseGPT.
Here is the experimental setup for Figure 12. We pruned the LLaMA-V1-7B model to different sparsity levels using the two allocation methods and reported the perplexity on the WikiText validation set. The hyperparameter search and setup are consistent with the original settings in Appendix G.
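For readers following along, below is a hedged sketch of the two mapping variants compared in Figure 12. It assumes a simple min-max linear interpolation between two target sparsities (s_min, s_max); the exact parameterization in Section 3.2 of the paper may differ, so treat it as illustrative only.

```python
# Hedged sketch of the linear vs. logarithmic mapping from block-level
# PL_Alpha_Hill values to block-level sparsities. The min-max interpolation
# between (s_min, s_max) is an assumption for illustration; it is not
# necessarily the exact formula used in Section 3.2.
import numpy as np

def allocate_sparsity(alphas, s_min, s_max, log_transform=False):
    """Lower alpha (more heavy-tailed, better-trained block) -> lower sparsity."""
    a = np.asarray(alphas, dtype=float)
    if log_transform:
        a = np.log(a)                 # the reviewer-suggested variant
    a_norm = (a - a.min()) / (a.max() - a.min())
    return s_min + a_norm * (s_max - s_min)

# Illustrative values only; (s_min, s_max) would be tuned so that the
# average sparsity matches the global pruning target.
print(allocate_sparsity([2.8, 3.1, 3.5, 4.2], s_min=0.6, s_max=0.8))
```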
Question 2
We thank the reviewer for the new reference. We integrated AlphaPruning with OSSCAR and provided the updated results in Figure 13 of the rebuttal PDF. OSSCAR prunes only the linear sublayer of multi-head attention and the second sublayer of the feed-forward network, applying uniform pruning across each transformer block. By incorporating AlphaPruning's layer-wise sparsity allocation, we achieved non-uniform block-wise pruning ratios while keeping the global pruning ratio the same. The results show that integrating AlphaPruning with OSSCAR can reduce perplexity at different sparsities. We will include this experiment in the updated draft and cite the corresponding work.
I sincerely appreciate the authors' response, which addressed all my previous concerns. Especially in the response to Question 2, the authors demonstrate that combining AlphaPruning with other structured pruning methods can also yield results far superior to the original uniform sparsity. I believe this result is of great value.
We thank the reviewer for the positive feedback. We will make sure to include the new results in the updated draft.
This work presents AlphaPruning, which prunes the weight matrices of LLMs with different layer-wise sparsity levels based on Heavy-Tailed Self-Regularization (HT-SR) theory. Compared to pruning with uniform sparsity across layers, AlphaPruning alleviates performance degradation when the sparsity level is high and fine-tuning is not applied. AlphaPruning is composable with existing pruning methods. Experiments with various LLM models, such as Llama and Llama 2, demonstrate its effectiveness.
Strengths
- The presented method can be composed with other pruning methods.
- The authors conducted experiments on various LLM architectures, such as Llama and Llama 2, with different baselines. In addition, they performed pruning of image classifiers, such as ConvNext and ViT.
- Inference on CPU is accelerated by the proposed pruning by up to 3×, depending on the sparsity.
Weaknesses
- Although they say that "AlphaPruning is theoretically driven" in L83, I could not find any theoretical justification of the method in the text.
- Although the paper says "Here, we provide a brief overview of HT-SR theory" (L100), the theory is not actually described, making it challenging to understand the proposed method.
- Although I am not an expert in this domain, I think using stronger baselines would make this work better. For example, [Gale+19] demonstrated that even simple magnitude pruning retains the performance of ResNet-50 and Transformer even when the sparsity level is set to 80%.
[Gale+19] Trevor Gale, Erich Elsen, Sara Hooker "The State of Sparsity in Deep Neural Networks" arXiv 2019.
Questions
- Why is the symbol introduced in Equation (1)? It is hardly used in the text except at L125.
- What is the actual advantage of unstructured pruning, as in AlphaPruning, over other acceleration techniques such as quantization?
Limitations
- As written in Weaknesses, the baselines are quite limited.
Weakness 1 and 2
AlphaPruning is grounded in heavy-tailed self-regularization (HT-SR) theory, which we use to quantify the training quality of each layer and determine layer-wise sparsity. Here we provide a detailed overview of this theory and explain how AlphaPruning is built on it. We will include these explanations in the updated draft.
- HT-SR theory originated as a semi-empirical theory, with early empirical work [1-2] examining the empirical spectral density (ESD) of weight matrices, specifically the eigenspectrum of the correlation matrix W^T W. This research found that the heavy-tailed structure of the ESD strongly correlates with training quality. These findings are rooted in heavy-tailed random matrix theory and statistical physics, as detailed in Table 1 of [2].
- Recent theoretical work studies how heavy tails in the ESD emerge and why they correlate with training quality. It is well known [3-4] that spikes in the ESD represent "signals", while the bulk represents noise, which follows the Marchenko-Pastur law. In the theoretical setting of [3], the signal or spike aligns with ground-truth features from the teacher model, which corresponds to increased correlations among weight elements [1-2]. Furthermore, [5] shows that heavy tails in the ESD originate from the interaction between spikes and bulk, which can be quantified precisely using recent advances in free probability theory [6]; this is the "bulk-decay" phase in the five-plus-one phase model of [2]. These studies indicate that layers with more heavy-tailed ESDs have extracted more useful signals during training, indicating better training quality.
- This insight motivates our sparsity assignment method: layers with more heavy-tailed ESDs contain more learned signals and are assigned lower sparsity by our method, while layers with less heavy-tailed ESDs retain fewer signals and are assigned higher sparsity. In practice, the heavy-tailed structure is measured by fitting a power-law (PL) distribution to the ESD and extracting the PL exponent (alpha) as the indicator (a minimal sketch of this fitting step is given after this list). This is why our method is named "AlphaPruning".
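As a concrete illustration of this fitting step, the following is a minimal sketch of a Hill-type estimate of the PL exponent computed from the ESD of W^T W. The choice of the tail size k (half of the spectrum here) is an assumption, and the paper's implementation may differ.

```python
# Minimal sketch of a Hill-type estimate of the ESD's power-law exponent.
# The eigenvalues of W^T W are the squared singular values of W. The tail
# fraction k_frac = 0.5 is an assumed default, not necessarily the paper's.
# Assumes W has at least two singular values.
import torch

def plalpha_hill(W: torch.Tensor, k_frac: float = 0.5) -> float:
    W = W.float()
    eigs = torch.linalg.svdvals(W) ** 2            # ESD of W^T W
    eigs, _ = torch.sort(eigs)                     # ascending order
    n = eigs.numel()
    k = max(int(k_frac * n), 1)
    tail = eigs[n - k:]                            # k largest eigenvalues
    lam_min = eigs[n - k - 1]                      # threshold just below the tail
    return 1.0 + k / torch.sum(torch.log(tail / lam_min)).item()
```

A smaller estimate corresponds to a heavier tail and, under HT-SR theory, a better-trained layer, which AlphaPruning then maps to lower sparsity.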
References
[1] Martin et al. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data.
[2] Martin et al. Traditional and Heavy-Tailed Self Regularization in Neural Network Models
[3] Wang et al. Spectral Evolution and Invariance in Linear-width Neural Networks
[4] Couillet and Liao. Random Matrix Methods for Machine Learning
[5] Kothapalli et al. Crafting Heavy-Tails in Weight Matrix Spectrum without Gradient Noise
[6] Landau et al. Singular vectors of sums of rectangular random matrices and optimal estimation of high-rank signals: The extensive spike model
Weakness 3
We thank the reviewer for providing the new reference. The method used in [Gale+19] that performs best for Transformer models is uniform magnitude pruning with a gradual pruning schedule, which gradually increases the sparsity of the network while training the semi-pruned model. This schedule is computationally intensive for LLMs due to the iterative model training and has not been adopted in the LLM pruning literature. Our study focuses on pruning the LLM in a one-shot way without training, which aligns with previous studies [7-9] to ensure a fair comparison. We have included one-shot uniform magnitude pruning as a baseline, as shown in Tables 2 and 3 of the submitted paper, and our results demonstrate that our approach outperforms it.
We clarify that in the submitted paper, we have compared with the most relevant and competitive baseline, OWL [7], which is the current SOTA non-uniform sparsity allocation method. We have also compared with other recent LLM pruning baselines, such as Wanda [8] and SparseGPT [9], as well as baselines from the CV pruning literature such as global pruning, ER [10], rank selection [11], and layer-wise error thresholding [12].
References
[7] Yin et al. 2024
[8] Sun et al. 2024
[9] Frantar et al. 2023
[10] Mocanu et al 2018
[11] Kuzmin et al. 2019
[12] Ye et al. 2020
Question 1
We thank the reviewer for pointing out this writing issue. We will remove the symbol in the updated draft.
Question 2
Both unstructured pruning and quantization are effective methods for improving inference speed and reducing memory footprint. The comparison between these two approaches in terms of efficiency can be nuanced, depending on the hardware deployment and the algorithms used. For example, unstructured sparsity has limited acceleration support on GPUs compared to quantization, but it offers great value for speedups on other hardware such as CPUs, Cerebras chips, IPUs, etc.
Recent studies indicate that unstructured pruning can slightly outperform quantization in inference speedup. The SOTA approach SqueezeLLM [13] demonstrates that quantizing LLMs to 3 bits can achieve a 2.1× speedup. Meanwhile, recent advances in sparse CPU kernels (DeepSparse) provide better support for accelerating unstructured pruning, leading to a 3.35× speedup in CPU runtimes, as shown in Table 3 of [14] as well as Table 4 of our submitted paper. This is partly because pruning can better maintain and recover performance through fine-tuning compared to quantization, as noted in [14].
Pruning and quantization are compatible and complementary, and combining both approaches further enhances efficiency. Table 3 of [14] shows that combining unstructured pruning with quantization (INT8) can achieve up to 9.08× speedup, significantly higher than using either method alone.
References
[13] Kim et al. 2024
[14] Kurtic et al. Sparse Fine-tuning for Inference Acceleration of Large Language Models, 2023
I thank the authors for the rebuttal.
Weaknesses 1 and 2
I appreciate the authors' explanation of the HT-SR theory. I understand that it is essentially equivalent to applying low-rank approximation to the weight matrix based on its singular value distribution, which I think is quite natural.
Weakness 3 and Question 2
I thank the authors for the clarification.
We thank the reviewer for their response. We provide our understanding of the low-rank approximation (LRA) and our AlphaPruning method.
- The principle behind AlphaPruning, which determines layer-wise sparsity, is distinct from the method used to determine layer-wise rank in LRA. While they both involve measuring the eigenspectrum of the weights, our AlphaPruning method was not motivated by LRA. AlphaPruning aims to make the model more deterministic and less random, similar to how decision trees choose branches to reduce entropy maximally. It does this by preserving heavy-tailed layers that contain more signals and removing light-tailed layers. Higher sparsity is assigned to light-tailed layers, which, according to HT-SR theory, are closer to a random distribution and have higher rank. In contrast, LRA [1-3] focuses on applying more compression (similar to higher sparsity) to low-rank matrices where the largest eigenvalues dominate, which allows for minimal impact on reconstruction loss when removing small eigenvalues. Therefore, at first glance, the two methods do opposite things.
- We also note that minimizing reconstruction loss is not equivalent to minimizing the performance loss caused by compression methods. AlphaPruning outperforms baseline methods that assign sparsity based on stable rank, as shown in Table 1 of the submitted paper. This suggests that the heavy-tailed metric may be more relevant to model performance and better at determining layer-wise sparsity, while stable rank may be more relevant to matrix approximation.
- We believe that the distinct difference between AlphaPruning and LRA requires further study. We will add a section in the revised manuscript to discuss this distinction in depth.
References
[1] Zhang et al. Accelerating very deep convolutional networks for classification and detection
[2] Wen et al. Coordinating filters for faster deep neural networks
[3] Xu et al. Trained rank pruning for efficient deep neural networks
This paper introduces Alpha Pruning, a novel approach for pruning large language models based on Heavy-Tailed Self-Regularization theory. Instead of applying a uniform pruning ratio across layers, Alpha Pruning utilizes PL_Alpha_Hill, derived from empirical spectral densities (ESDs), to assess how well-trained each layer is. It then assigns a lower pruning ratio to well-trained layers to preserve model performance. Alpha Pruning is evaluated across various LLM architectures and datasets, demonstrating superior performance compared to baseline uniform pruning and SOTA methods like OWL.
Strengths
- Alpha Pruning leverages HT-SR theory to provide a principled method for guiding layer-wise pruning decisions, contrasting with heuristic-based approaches.
- This pruning method has been evaluated on various large language models and exhibits robust performance, showcasing its effectiveness and generalizability.
- By evaluating the importance of each layer and assigning non-uniform pruning ratios, Alpha Pruning can complement existing pruning techniques such as magnitude-based pruning and Wanda. It is also compatible with other model acceleration techniques, such as structured pruning and quantization.
- Despite the challenges of unstructured pruning in achieving significant speedups compared to structured methods, Alpha Pruning still achieves noticeable efficiency gains with high pruning ratios.
Weaknesses
- The paper's explanation, particularly concerning HT-SR theory and the terms used in the method section, may be challenging to grasp for readers unfamiliar with these concepts. Figure 1a could benefit from a clearer explanation.
- While Alpha Pruning is compared against uniform pruning and OWL across multiple architectures, other layer-wise pruning methods prevalent in the computer vision community [1] could provide additional comparative insights.
[1] Lee, Jaeho, et al. "Layer-adaptive sparsity for the magnitude-based pruning." arXiv preprint arXiv:2010.07611 (2020).
Questions
- How does the PL_Alpha_Hill metric evolve during fine-tuning? Does its value typically decrease as a block becomes better trained?
- What might be the reason why a block is better trained than others? Is it possible to include PL_Alpha_Hill in the training process to train each block equally and accelerate the training?
- What factors contribute to the non-linear relationship between sparsity and speedups shown in Table 4? Is the maintenance of early layers a significant factor in this observation?
Limitations
The limitation of this work is well discussed in the Appendix.
Weakness 1
We provide a detailed explanation of the parts the reviewer suggested, including HT-SR theory, terms in the method, and Figure 1a. We will include these in the updated draft.
- More details of HT-SR theory: HT-SR theory [1-2] examines the empirical spectral density (ESD) of weight matrices, specifically the eigenspectrum of the correlation matrix W^T W. This research found that the heavy-tailed structure of the ESD is strongly correlated with training quality. Recent theoretical work [3-4] finds that such a structure is a result of feature learning, a process of extracting useful correlations (or features) from data during optimization. These studies indicate that layers with more heavy-tailed ESDs have extracted more useful signals during training, indicating better training quality. In practice, the heavy-tailed structure is measured by fitting a power-law (PL) distribution to the ESD and extracting the PL exponent as the indicator. These studies motivate our work, and we allocate the layer-wise sparsity based on this metric. This is why our method is named "AlphaPruning".
- Terms used in the method: in our method section, lambda denotes the eigenvalues of the weight matrices' correlation matrix. The interval (lambda_min, lambda_max) defines the range of eigenvalues considered as the tail part of the ESD.
- Figure 1a: the blue histograms depict the empirical spectral density. The x-axis represents eigenvalue magnitudes and the y-axis represents the density, both on a logarithmic scale. The solid red curves depict the empirical distribution of the ESD tail, while the dashed red curves represent the fitted PL distribution. The PL_Alpha_Hill metric in the title is the fitted PL exponent (a sketch of how such a plot can be reproduced is given after this list).
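To make the figure description concrete, here is a hedged sketch of how such a plot could be reproduced with matplotlib. The overlay assumes the standard power-law density p(lam) = (alpha - 1) * lam_min^(alpha - 1) * lam^(-alpha) for lam >= lam_min; the exact plotting details of Figure 1a may differ.

```python
# Hedged sketch of an ESD plot in the style of Figure 1a: log-binned
# histogram of the eigenvalues of W^T W on log-log axes, with the fitted
# power-law tail overlaid. Constants and styling are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

def plot_esd(W: np.ndarray, pl_alpha: float, lam_min: float):
    eigs = np.linalg.svd(W, compute_uv=False) ** 2         # eigenvalues of W^T W
    eigs = eigs[eigs > 0]                                   # drop numerically zero eigenvalues
    bins = np.logspace(np.log10(eigs.min()), np.log10(eigs.max()), 50)
    plt.hist(eigs, bins=bins, density=True, alpha=0.5, label="ESD")
    lam = np.logspace(np.log10(lam_min), np.log10(eigs.max()), 100)
    density = (pl_alpha - 1) * lam_min ** (pl_alpha - 1) * lam ** (-pl_alpha)
    plt.plot(lam, density, "r--", label=f"PL fit, alpha={pl_alpha:.2f}")
    plt.xscale("log"); plt.yscale("log")
    plt.xlabel("eigenvalue"); plt.ylabel("density")
    plt.legend(); plt.title(f"PL_Alpha_Hill = {pl_alpha:.2f}")
    plt.show()
```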
Weakness 2
We thank the reviewer for providing the new baseline method LAMP, and we provide new experiments comparing our method with LAMP. The results are shown in Table 25 of the rebuttal PDF. We implement the original LAMP method, which allocates a different sparsity to each matrix, and a variant called LAMP (per block), which allocates the same sparsity to all matrices within a transformer block. This adaptation is based on our ablation study comparing per-matrix and per-block strategies in Appendix F. The results show that LAMP (per block) outperforms the original LAMP and performs better than the uniform pruning baseline when both are combined with SparseGPT. However, our method AlphaPruning outperforms this baseline.
Here are the experimental settings. We pruned the LLaMA-V1-7B model to three sparsity levels (60%, 70%, 80%) using different methods and reported the perplexity on the WikiText validation set. The hyperparameter search and setup are consistent with the original settings in Appendix G.
Question 1
In Table 26 of the rebuttal PDF, we present the PL_Alpha_Hill and performance metrics (perplexity, zero-shot task accuracy) before and after fine-tuning. We can see that the PL_Alpha_Hill metric decreases while the two performance metrics improve. Here is the experimental setup. We fine-tune the pruned LLaMA-V1-7B on 30K tokens from the C4 dataset. The model is pruned to 70% sparsity. The perplexity is evaluated on WikiText, and the accuracy is evaluated on the 7 zero-shot datasets listed in Section 4.1.
We select the initial stage of fine-tuning to demonstrate how the metric value evolves. Figure 14 of the rebuttal PDF shows that the PL_Alpha_Hill metric indeed continues decreasing during the fine-tuning.
Question 2
The phenomenon that layers/blocks are not equally well-trained is well documented in [5-7]. For example, [5] shows that layers or blocks of a model can have imbalanced training quality during training, and that balancing the training speed of layers/blocks improves performance. As another example, [6-7] show that some LLM layers are less useful than others, and removing these layers has negligible impact on performance. However, the underlying reason why some layers are less well-trained than others remains an open question, which we leave for future study.
Incorporating PL_Alpha_Hill into the training process is possible. Recent work [5] has used this metric to dynamically adjust the learning rate of each layer during training and demonstrated that their method makes the layers more equally well-trained and improves generalization performance.
Question 3
The non-linear relationship is due to the low-level hardware operations of sparsity structures rather than the maintenance of early layers. As shown in Figure 4 of [8], for low sparsity, the computational performance (red lines) increases slowly due to overheads in storing sparse structures and controlling sparse computations. As sparsity increases to moderate and high levels, we see sustained growth of performance until it usually levels off at extremely high sparsities where storage and control overheads dominate. This explains the slow speedup in our low-sparsity regime (e.g., less than 70% sparsity).
A new ablation study shows that early layer maintenance is irrelevant to the non-linear relationship. In Figure 11 of the rebuttal PDF, we compare the inference speedup of non-uniform pruning (AlphaPruning) and uniform pruning. The negligible speedup differences between the two methods indicate that non-uniform distribution is unrelated to the observed non-linearity. This is because our method allocates sparsity at the transformer block level, not at the individual layer level, and in transformer architectures, all blocks use identical computational resources.
References
[1] Ref [10] in submission
[2] Ref [13] in submission
[3] Ref [58] in submission
[4] Kothapalli et al. 2024
[5] Zhou et al. 2023 Temperature Balancing
[6] Gromov et al. 2024 The Unreasonable Ineffectiveness of the Deeper Layers
[7] Men et al. 2024 ShortGPT
[8] Hoefler et al. 2021 Sparsity in Deep Learning
The paper introduces AlphaPruning, a novel method for pruning large language models (LLMs) using Heavy-Tailed Self-Regularization (HT-SR) Theory. AlphaPruning uses ESDs of weight matrices to determine layerwise pruning ratios. The method demonstrates the ability to prune LLaMA-7B to 80% sparsity while maintaining reasonable perplexity.
Strengths
- The proposed method, AlphaPruning, achieved SOTA performance across various tasks compared to OWL.
- The experiments conducted were comprehensive and diverse.
Weaknesses
- The motivation is not explicitly explained. It appears to apply a theoretical concept to the pruning area without providing quantitative or qualitative proof.
Questions
- Can you explain why this property of the ESD can decide the sparsity of a layer, and why the method can be combined with different pruning metrics?
- To my knowledge, large language models (LLMs) are generally well-trained. The article concludes that PL_Alpha_Hill can indicate whether a layer is well-trained. Are there other methods or indicators that demonstrate if these layers are not well-trained?
- Have you compared the sparsity allocation of different layers with OWL? If so, is the sparsity allocation similar to that of OWL?
Limitations
The article includes a discussion of its limitations.
Weakness 1 and Question 1
Motivation of our method and why this property of the ESD can decide layer sparsity. Our method is grounded in heavy-tailed self-regularization (HT-SR) theory, which we use to quantify the training quality of each layer and determine layer-wise sparsity. The rationale is as follows:
- HT-SR theory originated as a semi-empirical theory, with early empirical work [1-2] examining the empirical spectral density (ESD) of weight matrices, specifically the eigenspectrum of the correlation matrix W^T W. This research found that the heavy-tailed structure of the ESD strongly correlates with training quality. These findings are rooted in heavy-tailed random matrix theory and statistical physics, as detailed in Table 1 of [2].
- Recent theoretical work studies how heavy tails in the ESD emerge and why they correlate with training quality. It is well known [3,5] that spikes in the ESD represent "signals", while the bulk represents noise, which follows the Marchenko-Pastur law. In the theoretical setting of [3], the signal or spike aligns with ground-truth features from the teacher model, which corresponds to increased correlations among weight elements [1-2]. Furthermore, [4] shows that heavy tails in the ESD originate from the interaction between spikes and bulk, which can be quantified precisely using recent advances in free probability theory [6]; this is the "bulk-decay" phase in the five-plus-one phase model of [2]. These studies indicate that layers with more heavy-tailed ESDs have extracted more useful signals during training, indicating better training quality.
- This insight motivates our sparsity assignment method: layers with more heavy-tailed ESD contain more learned signals and are assigned lower sparsity by our method, while layers with less heavy-tailed ESD retain fewer signals and are assigned higher sparsity. In practice, the heavy-tailed structure is measured by fitting a power-law distribution to the ESD, and extracting the power-law exponent as the indicator. This is why our method is named "AlphaPruning".
Why can our method be combined with other pruning metrics? Our layer-wise sparsity assignment is complementary to other pruning metrics, such as Wanda and SparseGPT. Our method determines the training quality of layers, while other pruning metrics identify the importance of components within each layer (or weight matrix). Thus, our approach decides how much to prune in each layer, while other metrics determine which components to prune within each layer.
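To illustrate this division of labor ("how much" per layer vs. "which weights" within a layer), here is a minimal sketch that uses magnitude as the within-layer score purely for illustration; Wanda or SparseGPT scores could be substituted, and this is not the authors' code.

```python
# Minimal sketch of composing a per-block sparsity allocation ("how much")
# with a per-weight importance score ("which"). Magnitude is used as the
# within-layer score purely for illustration.
import torch

@torch.no_grad()
def prune_matrix(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    score = W.abs()
    k = int(sparsity * W.numel())
    if k > 0:
        threshold = torch.kthvalue(score.flatten(), k).values
        W[score <= threshold] = 0.0
    return W

# `block_sparsities` would come from the layer-wise allocation; each matrix
# in block i is pruned with block_sparsities[i], while the within-layer
# metric decides which individual weights are removed.
```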
References
[1] Martin et al. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data.
[2] Martin et al. Traditional and Heavy-Tailed Self Regularization in Neural Network Models
[3] Wang et al. Spectral Evolution and Invariance in Linear-width Neural Networks
[4] Kothapalli et al. Crafting Heavy-Tails in Weight Matrix Spectrum without Gradient Noise
[5] Couillet and Liao. Random Matrix Methods for Machine Learning
[6] Landau et al. Singular vectors of sums of rectangular random matrices and optimal estimation of high-rank signals: The extensive spike model
Question 2
In addition to the PL_Alpha_Hill metric used in our work, [7-8] are other studies that investigated methods for measuring whether a layer is well-trained, demonstrating that LLM layers are not equally well-trained. [7] developed a method that assesses the similarity between the representations at different layers, defined as the angular distance between feature vectors. They found that deeper layers are more similar to their neighboring layers than shallow layers, suggesting that LLMs may not fully utilize the parameters in these deeper layers, i.e., that these layers are not well-trained. Similarly, [8] introduced a metric called Block Influence, which measures the impact of each transformer block on the hidden states to gauge layer significance. Their findings showed varying degrees of ineffectiveness/redundancy across layers, again suggesting that these layers are not well-trained.
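For readers unfamiliar with these indicators, below is a hedged sketch of both, written from our reading of [7] and [8]; the exact normalizations in those papers may differ.

```python
# Hedged sketch of the two indicators mentioned above, based on our reading
# of [7] (angular distance between layer representations) and [8] (Block
# Influence); exact normalizations in those papers may differ.
import torch
import torch.nn.functional as F

def angular_distance(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Per-token angular distance between the input and output of a block [7]."""
    cos = F.cosine_similarity(x_in, x_out, dim=-1)
    return torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi

def block_influence(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Block Influence [8]: one minus the mean cosine similarity across tokens."""
    cos = F.cosine_similarity(x_in, x_out, dim=-1)
    return 1.0 - cos.mean()
```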
References
[7] Gromov et al. The Unreasonable Ineffectiveness of the Deeper Layers
[8] Men et al. ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Question 3
In Figure 10 of the rebuttal PDF, we compare the sparsity allocation of the two methods. We show that the general trends of sparsity distribution generated by the two methods are similar, with lower sparsities allocated to earlier layers and higher sparsities allocated to deeper layers. However, our method produces a more granular distribution with clearer distinctions between consecutive deep layers, resulting in improved pruning performance.
We want to thank all the reviewers for the constructive feedback, which helps us improve our paper. Please refer to the attached PDF for our new experiments and see below for our responses to each comment.
The paper proposes a new approach to pruning large language models (LLMs) based on the Heavy-Tailed Self-Regularization (HT-SR) theory. The authors introduce a metric called PL_Alpha_Hill, which measures the heavy-tailed shape of the empirical spectral density (ESD) of layer-weight matrices. They use this metric to allocate layer-wise sparsity more effectively, resulting in improved performance and efficiency compared to existing methods.
The reviewers praise the paper's novelty, technical soundness, and extensive validation across various architectures. However, they also raise concerns about the paper's clarity, the lack of theoretical justification, and the comparison with other pruning methods. The authors' rebuttal addresses these concerns by providing additional explanations, clarifying the relationship between HT-SR theory and the proposed method, and presenting new results that demonstrate the effectiveness of AlphaPruning in combination with other pruning methods.
Overall, the paper appears to be technically sound, and the authors have made a significant effort to address the reviewers' concerns. The paper's contribution to the field of LLM pruning is notable, and it has the potential to inspire further research in this area.
After the rebuttal and discussion, all reviewers lean towards acceptance. Overall, I thus recommend accepting the submission. However, I further recommend that the authors take the reviewer feedback into account for the camera-ready version.