PaperHub
Overall score: 6.3/10 · Poster · 4 reviewers
Individual ratings: 6, 8, 3, 8 (min 3, max 8, std 2.0)
Average confidence: 4.0 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025

OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

Submitted: 2024-09-23 · Updated: 2025-05-17

Abstract

The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while highly successful in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that compresses the model weights by approximating each weight matrix as the sum of a sparse matrix and a low-rank matrix. Prior to the decomposition, the weights are first scaled by the second moment of their input embeddings, so as to ensure the preservation of outlier features recently observed in large transformer models. Without retraining, OATS achieves state-of-the-art performance when compressing large language models, such as Llama-3 and Phi-3, and vision transformers, such as Google's ViT and DINOv2, by up to $60\%$, all while speeding up the model's inference on a CPU by up to $1.37\times$ compared to prior pruning methods.
Keywords
network pruning, low-rank, compression, sparsification, large language models, outlier features

Reviews and Discussion

Review
Rating: 6

In this paper, the authors present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. The authors also conduct extensive experiments showing that OATS is able to consistently outperform the prior state-of-the-art across multiple benchmarks and compression rates while also improving on speed-up.

Strengths

1. This paper proposes a novel method for compressing large transformers that utilizes the second moment of the input embeddings to approximate the model’s weight matrices as a sum of a sparse matrix and a low-rank matrix.

2. Extensive experiments on recent large language models demonstrate the effectiveness of the proposed method, which also generalizes well to vision transformers.

Weaknesses

Please see the questions.

Questions

  1. The author has clarified the differences between Wanda and OATS. However, an ablation study would further illustrate the impact of the low-rank term on performance. How much does the low-rank term contribute to the performance boost compared to Wanda alone?
  2. The hardware speedup on GPU with N:M sparsity patterns can be shown and discussed in Section 3.4.
  3. The paper lacks details on how the sparsity pattern is defined for a matrix composed of a sparse matrix S plus a dense matrix L, especially within the context of N:M sparsity and structured pruning. Providing more detailed explanations would enhance clarity.
  4. The baseline methods are all originally designed for LLMs. It would be valuable to compare against pruning metrics specifically designed for ViTs. This would present a stronger baseline to effectively showcase the proposed method’s advantages.
  5. The motivation behind the choice of outlier information for the sparse term S is not entirely clear. Given that there are various methods for selecting the sparse components, such as using magnitude-based or gradient-based criteria, it would be helpful to know if the authors experimented with these or other selection methods.
  6. Given that most pruning studies report results on 70B-parameter models, has the author tested their method on larger language models, such as Llama2-70B or the newer Llama3-70B?
Comment

The baseline methods are all originally designed for LLMs. It would be valuable to compare against pruning metrics specifically designed for ViTs. This would present a stronger baseline to effectively showcase the proposed method’s advantages.

We appreciate the reviewer’s suggestion and we have added the following paragraph to Appendix A.1 directly addressing this issue:

There are a number of pruning approaches that have been specifically catered towards pruning vision transformers [1,2,3,4,5,6]. However, as much of the pruning literature developed on vision transformers involved models of much smaller scale than the large language models employed in this study, almost all of the prominent pruning algorithms require some form of training on the model parameters. As OATS was designed to require no training, OATS and the aforementioned pruning algorithms would not be comparable.

The motivation behind the choice of outlier information for the sparse term S is not entirely clear. Given that there are various methods for selecting the sparse components, such as using magnitude-based or gradient-based criteria, it would be helpful to know if the authors experimented with these or other selection methods.

We share the same concerns as the reviewer that the choice of using the matrix D may not be the optimal metric when determining the sparse term. We avoided utilizing gradient-based approaches as methods like OATS, DSNOT, SparseGPT, and Wanda can prune the layers through effectively a single forward pass. By requiring gradient information, it would increase the computational requirements of the pruning algorithm. In response to the reviewer, we have, however, extended our ablation studies in Section 3.3 to include experiments where the sparse term S is determined via magnitude pruning instead. These results are presented in a new section, Appendix A.5, titled “Magnitude-Based Pruning for the Sparse Component,” and are included in the table below for the reviewer’s convenience:

| Outlier Scaling | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| Low-Rank Term Only | 65.22 | 71.01 | 12.49 |
| Both Terms (OATS) | 65.84 | 70.71 | 11.50 |

We have also included experiments in a new subsection Appendix A.3 titled “Using a Robust Scaling Matrix” where the scaling matrix D utilizes a robust measurement of the second moments instead. We have included the new subsection below for the reviewer’s convenience:

To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [7]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.
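To make the comparison concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of how the two diagonal scalings could be computed from a hypothetical calibration activation tensor `X` of shape (batch, sequence, d_in); the `eps` constant is an assumption added for numerical stability:

```python
import numpy as np

def scaling_diagonals(X, eps=1e-8):
    """Return the diagonals of D (second-moment scaling) and D_robust (median scaling).

    X: hypothetical calibration activations of shape (batch, seq, d_in).
    eps: assumed small constant to keep the later division by D well-defined.
    """
    X2d = X.reshape(-1, X.shape[-1])             # flatten batch and sequence dimensions
    d = np.sqrt(np.mean(X2d ** 2, axis=0))       # square root of each feature's second moment
    d_robust = np.median(np.abs(X2d), axis=0)    # median absolute value of each feature
    return d + eps, d_robust + eps

# Column-wise scaling of a weight matrix W (d_out x d_in), i.e. W @ diag(d):
# W_scaled = W * d[None, :]
```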

[1] Chen, T., Cheng, Y., Gan, Z., Yuan, L., Zhang, L., & Wang, Z. (2021). Chasing Sparsity in Vision Transformers: An End-to-End Exploration.

[2] Yu, S., Chen, T., Shen, J., Yuan, H., Tan, J., Yang, S., Liu, J., & Wang, Z. (2022). Unified Visual Transformer Compression.

[3] Yu, F., Huang, K., Wang, M., Cheng, Y., Chu, W., & Cui, L. (2022). Width and Depth Pruning for Vision Transformers.

[4] Yu, L., & Xiang, W. (2023). X-Pruner: eXplainable Pruning for Vision Transformers.

[5] Zhu, M., Han, K., Tang, Y., & Wang, Y. (2021). Visual Transformer Pruning.

[6] Chavan, A., Shen, Z., Liu, Z., Liu, Z., Cheng, K.-T., & Xing, E. (2022). Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space.

[7] Huber, P. J. (1981). Robust statistics.

Comment

Given that most pruning studies report results on 70B-parameter models, has the author tested their method on larger language models, such as Llama2-70B or the newer Llama3-70B?

Thank you for this great suggestion. In response, we have run Llama-3 70B experiments for OATS and the pruning benchmarks and have updated the manuscript to include those experiments. These results are included in the table below for your convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 79.63 | 75.27 | 2.68 |
| 30% | SparseGPT | 78.28 | 75.07 | 3.24 |
| | Wanda | 79.15 | 75.19 | 3.28 |
| | DSNoT | 79.00 | 75.54 | 3.27 |
| | OATS | 78.47 | 75.24 | 3.07 |
| 40% | SparseGPT | 76.29 | 74.63 | 3.99 |
| | Wanda | 77.16 | 74.10 | 4.08 |
| | DSNoT | 77.70 | 74.29 | 4.10 |
| | OATS | 77.89 | 74.88 | 3.68 |
| 50% | SparseGPT | 72.47 | 73.17 | 5.27 |
| | Wanda | 72.04 | 72.85 | 5.38 |
| | DSNoT | 72.76 | 72.91 | 5.58 |
| | OATS | 74.79 | 73.30 | 4.78 |

We would like to thank the reviewer again for asking many insightful questions that have led to the improvement of our work. Should the reviewer find it fitting, we would be appreciative of any potential reconsideration of the score.

Comment

Dear authors,

Thank you for considering my comments and for the well-prepared rebuttal. I will keep my original positive score.

Best wishes,

Comment

Dear Reviewer rXsS,

As the discussion period draws to a close, we would like to sincerely thank the reviewer for their participation during the discussion, for complimenting the preparedness of our rebuttal, and for their overall positive assessment of our work.

Best,

The Authors.

Comment

We would like to first thank the reviewer for spending the time to provide feedback, and engaging with the authors through several questions.

The author has clarified the differences between Wanda and OATS. However, an ablation study would further illustrate the impact of the low-rank term on performance. How much does the low-rank term contribute to the performance boost compared to Wanda alone?

We thank the reviewer for the suggestion and we have added Appendix A.7 titled “Gap Between OATS and Wanda” highlighting the exact gaps for each performance metric between Wanda and OATS. This addition aims to quantify the impact of the low-rank term on performance. For the reviewer’s convenience, we have also included the relevant table below.

| Model | Compression | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | +1.21% | +0.82% | -0.44 |
| | 40% | +1.60% | +1.24% | -1.06 |
| | 50% | +5.42% | +3.38% | -2.05 |
| Phi-3 Medium | 30% | +0.97% | -0.01% | -0.43 |
| | 40% | +1.65% | +1.45% | -0.79 |
| | 50% | +2.52% | +2.43% | -1.07 |
| Llama-3 8B | 30% | +1.55% | +0.71% | +0.20 |
| | 40% | +2.13% | +1.64% | -0.50 |
| | 50% | +6.63% | +2.44% | -1.49 |
| Llama-3 70B | 30% | -0.68% | +0.05% | -0.17 |
| | 40% | +0.73% | +0.78% | -0.40 |
| | 50% | +2.74% | +0.45% | -0.60 |

The hardware speedup on GPU with N:M sparsity patterns can be shown and discussed in Section 3.4.

The experiments conducted in Section 3.4 with N:M sparsity were designed as exploratory investigations to compare pruning algorithms with 2:4 sparsity against OATS under the same compression rate (but sparser N:M). Unfortunately, current NVIDIA GPUs only support 2:4 sparsity patterns, preventing us from benchmarking speed-ups on a GPU. However, consistent with prior works like [1,2], which also explore sparser N:M patterns, our goal is to demonstrate that models can achieve higher sparsity while maintaining performance. Through our results, we hope to incentivize companies to support sparser N:M patterns paving the way for greater speed-ups in the future.

The paper lacks details on how the sparsity pattern is defined for a matrix composed of a sparse matrix S plus a dense matrix L, especially within the context of N:M sparsity and structured pruning. Providing more detailed explanations would enhance clarity.

We thank the reviewer for highlighting a point of unclarity in our paper. Once we have the sparse plus low-rank decomposition, we store the compressed weight matrix as three separate matrices: a sparse matrix coinciding with the sparse term, and two matrices coinciding with the low-rank factorization of the low-rank term $LD^{-1}$. We have included the following sentence at the end of Section 2.3 of the manuscript to improve clarity:

The original weight matrix is replaced with three matrices: the sparse matrix $SD^{-1}$, and two matrices coinciding with the low-rank factorization of $LD^{-1}$.
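For illustration, a minimal sketch (ours, not the paper's code) of how a forward pass could use the three stored matrices, where `S_prime` plays the role of the sparse matrix $SD^{-1}$ and `A`, `B` are hypothetical low-rank factors with $AB \approx LD^{-1}$; all shapes below are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def compressed_matvec(S_prime, A, B, x):
    """Approximate y = W x using the stored sparse matrix S D^{-1} and factors A @ B = L D^{-1}."""
    return S_prime @ x + A @ (B @ x)   # one sparse matvec plus two thin dense matvecs

# Hypothetical shapes, for illustration only.
d_out, d_in, r = 1024, 1024, 64
mask = np.random.rand(d_out, d_in) < 0.05                   # ~5% nonzero entries
S_prime = csr_matrix(np.random.randn(d_out, d_in) * mask)   # stands in for S D^{-1}
A = np.random.randn(d_out, r)                               # low-rank factors of L D^{-1}
B = np.random.randn(r, d_in)
y = compressed_matvec(S_prime, A, B, np.random.randn(d_in))
```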

[1] Yin, L., Wu, Y., Zhang, Z., Hsieh, C.-Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., & Liu, S. (2024). Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.

[2] Sun, W., Zhou, A., Stuijk, S., Wijnhoven, R., Nelson, A. O., Li, H., & Corporaal, H. (2021). DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks.

Review
Rating: 8

This paper proposes a relatively novel model compression technique aimed at Transformer architectures, in which weight matrices are approximated as a sum of a sparse and a low-rank matrix. To control for the (previously documented) outlier feature problem, weights are also scaled by the second moment of their input.

Strengths

The method proposed by this paper is well-explained and well-justified. The actual practical algorithm is easy to follow.

Real-time speedup is shown in the CPU setting.

The experiments cover a range of model sizes (3.8-14B parameters). I especially liked seeing results on fairly small models, since those may be harder to compress.

Section 5 is an interesting way to look at the problem, which I think can lead to interesting further work in the direction of interpretability.

Weaknesses

The choice of the rank ratio parameter could have been better explored (in particular, looking at multiple architectures/tasks).

Typos:
Line 18: “approximating each weight” -> “approximating each weight matrix”
Line 142: “the activations are calculated through a calibration set that is propagated through the compressed layers” - should be uncompressed?

Questions

It would be nice to see a bigger hyperparameter selection section with more architectures considered.

Comment

We would first like to thank the reviewer for providing us with valuable feedback and for highlighting concerns with regard to typos:

Line 18: “approximating each weight” -> “approximating each weight matrix”

Thank you for pointing this out, we have since updated the abstract to correct this typo.

Line 142: “the activations are calculated through a calibration set that is propagated through the compressed layers” - should be uncompressed?

This is actually not a typo and the choice of calculating the input activations through the compressed layers was a deliberate decision based on prior pruning algorithms that did the same (SparseGPT, Wanda, and DSNOT).

It would be nice to see a bigger hyperparameter selection section…

To provide a better picture of how the hyperparameters impact OATS, we have included the results for additional configurations for the Llama-3 8B model and the Phi-3 Mini model in Appendix A.6 titled “Additional Hyperparameter Tests for OATS”. We have included the table below for the reviewer’s convenience:

| Model | Compression | Rank Ratio | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | 0.1 | 68.70 | 71.65 | 10.24 |
| | | 0.2 | 68.02 | 71.81 | 10.21 |
| | | 0.3 | 69.28 | 72.07 | 10.28 |
| | 40% | 0.1 | 65.75 | 69.94 | 11.57 |
| | | 0.2 | 65.84 | 70.71 | 11.50 |
| | | 0.3 | 66.81 | 70.54 | 11.60 |
| | 50% | 0.1 | 57.96 | 67.37 | 15.48 |
| | | 0.2 | 59.12 | 68.02 | 15.13 |
| | | 0.3 | 58.68 | 68.63 | 15.47 |
| Llama-3 8B | 30% | 0.1 | 63.62 | 68.99 | 9.35 |
| | | 0.2 | 63.09 | 69.54 | 9.09 |
| | 40% | 0.1 | 61.44 | 68.23 | 9.23 |
| | | 0.2 | 61.97 | 68.43 | 9.09 |
| | 50% | 0.1 | 56.46 | 65.33 | 10.85 |
| | | 0.2 | 56.07 | 65.51 | 10.70 |

Unfortunately, due to computational constraints, we were unable to perform a more extensive grid search over the hyperparameters for each model. However, given that all Phi-3 experiments utilized a rank ratio of 0.25 and all Llama-3 experiments utilized a rank ratio of 0.3, we believe that this is a testament to the robustness of OATS' performance to its hyperparameters.

with more architectures considered.

We agree with the reviewer and we also believe that our experiments would benefit from including larger models. Thus, we have run experiments on Llama-3 70B for OATS and its pruning benchmarks, which we have included in the manuscript and below for the reviewer’s convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 79.63 | 75.27 | 2.68 |
| 30% | SparseGPT | 78.28 | 75.07 | 3.24 |
| | Wanda | 79.15 | 75.19 | 3.28 |
| | DSNoT | 79.00 | 75.54 | 3.27 |
| | OATS | 78.47 | 75.24 | 3.07 |
| 40% | SparseGPT | 76.29 | 74.63 | 3.99 |
| | Wanda | 77.16 | 74.10 | 4.08 |
| | DSNoT | 77.70 | 74.29 | 4.10 |
| | OATS | 77.89 | 74.88 | 3.68 |
| 50% | SparseGPT | 72.47 | 73.17 | 5.27 |
| | Wanda | 72.04 | 72.85 | 5.38 |
| | DSNoT | 72.76 | 72.91 | 5.58 |
| | OATS | 74.79 | 73.30 | 4.78 |

We again would like to sincerely thank the reviewer for their comments and we remain available for discussion should the reviewer have any more inquiries prior to the discussion deadline. If the reviewer finds appropriate, the authors would be appreciative of any further reconsiderations of the score.

Comment

I thank the reviewers for the clarification and additional experiments, and agree with the analysis that the hyperparameter search results provide evidence for the method's robustness.

Regarding Line 142 (compressed vs uncompressed weights in computing inputs). Then it would be helpful to clarify this in Algorithm 2.

Comment

Thank you for the additional suggestion of editing the Algorithm 2 bubble. We have now added “Layer Inputs Propagated through Prior Compressed Layers” to the algorithm bubble when describing the layer inputs that are used to calculate the scaling matrix D.

I especially liked seeing results on fairly small models, since those may be harder to compress…It would be nice to see… more architectures considered.

We wanted to add that we have since run additional experiments on the Qwen 2.5 3B Instruct model to provide an even better understanding of how OATS performs on a wide range of different LLM architectures. We have included the results in a new section, Appendix A.8, titled Qwen 2.5 Experiments and included the table below for the reviewer’s convenience:

| Compression | Method | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| 0% | Dense | 65.99 | 68.49 | 11.02 |
| 30% | SparseGPT | 65.65 | 67.91 | 11.55 |
| | Wanda | 65.46 | 68.08 | 11.66 |
| | DSNoT | 65.65 | 68.21 | 11.67 |
| | OATS | 65.36 | 68.74 | 11.45 |
| 40% | SparseGPT | 63.04 | 67.64 | 12.56 |
| | Wanda | 61.88 | 67.14 | 12.89 |
| | DSNoT | 62.26 | 67.42 | 12.91 |
| | OATS | 64.30 | 68.76 | 12.31 |
| 50% | SparseGPT | 57.43 | 64.36 | 14.92 |
| | Wanda | 55.39 | 64.10 | 16.27 |
| | DSNoT | 55.78 | 64.77 | 16.43 |
| | OATS | 58.78 | 65.74 | 14.91 |

We thank the reviewer again for their continued engagement with us. We remain available for discussion should the reviewer have any other suggestions that would lead to improvements for our work and its score.

Comment

Dear Reviewer PLVG,

As the discussion period wraps up, we would like to offer our final thanks to the reviewer for responding to our comments, for providing additional suggestions beyond their initial review, and for their positive evaluation of our work.

Best,

The Authors.

Review
Rating: 3

This paper introduces OATS, a novel compression technique for large-scale transformers that combines sparse and low-rank approximations to reduce memory and compute costs without requiring costly retraining. OATS scales model weights by the second moment of input embeddings to preserve essential outlier features in transformers, ensuring model performance is maintained during compression. This approach addresses the typical performance degradation seen with increasing compression in existing pruning methods.

Strengths

  • Low-rankness plus sparsity is a good fit for compressing LLMs without retraining.
  • By separating the model into a sparse and a low-rank part, the approximation error can be theoretically reduced as the two parts can compensate for each other.
  • This paper provides measurements for practical speedups on CPUs with existing sparse computation frameworks.

Weaknesses

  • The novelty is limited. The combination of low-rankness and sparsity is an old topic that has been explored for many years [R1, R2]. Applying the well-established approximation techniques to decompose/compress the large matrices in LLMs has little technical contribution. Besides, compressing DNN models using low-rank and sparse decomposition has already been well explored in [R3]. This paper just scales it to larger models and matrices. Authors are encouraged to specify the unique difference from existing approaches and why this difference is also unique for LLMs.
  • The proposed Truncated SVD and Threshold strategies to achieve low-rankness and sparsity are too trivial. It is unknown how to decide the rank and number of zeroes. Besides, the order of applying SVD and thresholding has a significant impact on the approximation errors. Authors are encouraged to clearly explain why using such decomposition strategies.
  • This paper claims "outlier information." However, I have not seen any analysis or explanation for the "outlier information," and the proposed solution is not related to the "outlier information." Instead, this paper seems to directly apply the pruning approaches proposed in Wanda. The authors are encouraged to provide explanations of why the diagonal matrix is related to "outlier information" and why it is good for compression.
  • Many works have been proposed to compress LLMs with low-rankness and sparsity [R4-R6]. The authors have not presented the main differences among them and the unique contributions that stand out from those works.
  • Even though theoretical analysis may not have a practical guarantee of accuracy, the authors are encouraged to provide one.
  • The paper presentation could be improved, especially the math equations.

[R1] Sparse and Low-Rank Matrix Decompositions, Forty-Seventh Annual Allerton Conference, 2009.

[R2] Godec: Randomized low-rank & sparse matrix decomposition in noisy case. ICML 2011.

[R3] On compressing deep models by low rank and sparse decomposition, CVPR 2017.

[R4] SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs.

[R5] LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

[R6] SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs.

Questions

See the Weaknesses. Additionally, why use the inverse transformation to reach the compressed weight? What is the actual speedup when using N:M sparsity on GPUs?

Ethics Concerns Details

N/A

Comment

N:M Speed-Up (Part 5/5)

[Reviewer's Comment]: What is the actual speedup when using N:M sparsity on GPUs?

The experiments conducted in Section 3.4 with N:M sparsity were designed as exploratory investigations to compare pruning algorithms with 2:4 sparsity against OATS under the same compression rate (but sparser N:M). Unfortunately, current NVIDIA GPUs only support 2:4 sparsity patterns, preventing us from benchmarking speed-ups on a GPU. However, consistent with prior works like [1, 2], which also explore sparser N:M patterns, our goal is to demonstrate that models can achieve higher sparsity while maintaining performance. Through our results, we hope to incentivize companies to support sparser N:M patterns paving the way for greater speed-ups in the future.

[1] Yin, L., Wu, Y., Zhang, Z., Hsieh, C.-Y., Wang, Y., Jia, Y., Pechenizkiy, M., Liang, Y., Wang, Z., & Liu, S. (2024). Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.

[2] Sun, W., Zhou, A., Stuijk, S., Wijnhoven, R., Nelson, A. O., Li, H., & Corporaal, H. (2021). DominoSearch: Find layer-wise fine-grained N:M sparse schemes from dense neural networks.

We would like to again thank the reviewer for deeply engaging with our paper and for providing numerous valuable comments that have allowed us to improve the paper. Given the comments and additions to the paper, should the reviewer find it appropriate, we would be grateful for any potential reconsideration of our score.

Comment

Additional Related Works and Overall Presentation (Part 4/5)

[Reviewer's Comment]: Many works have been proposed to compress LLMs with low-rankness and sparsity [R4-R6]. The authors have not presented the main differences among them and the unique contributions that stand out from those works.

We sincerely apologize for the oversight in not citing important and relevant works [R4, R6], and we greatly appreciate the reviewer for bringing these to our attention. We have corrected this error, and the following paragraphs outline the additions and adjustments made to address it.

To highlight and compare [R4] with OATS, we have included the following paragraph in Appendix A.1:

[New Section | Page 19]: In [1], the authors propose SLOPE, a novel method for accelerating the pre-training phase of LLMs by incorporating N:M sparsity and adding low-rank components to the model weights to enhance model capacity. Similar to OATS, SLOPE leads to a sparse plus low-rank structure in the model’s weight matrices; however, the low-rank terms are introduced during the final phase of pre-training and are actively trained on the model loss function. In contrast, OATS is designed as a lightweight method to accelerate inference. OATS does not require any training or fine-tuning, but instead approximates pre-trained weight matrices by solving the Robust PCA problem.

For [R6], we thank the reviewers for bringing this concurrent work to our attention. We were, unfortunately, unable to cite the paper since it was posted publicly on arXiv only after the submission deadline for ICLR. We have now included the following paragraph in Appendix A.1 to highlight its differences with OATS:

[New Section | Page 19]: An independent and concurrent work with OATS proposes SLIM [2], a novel pipeline that combines pruning and quantization. To restore lost performance from compression, SLIM derives a low-rank term using singular-value thresholding and adopts a scaling technique akin to OATS. However, instead of the $L^2$ norm, SLIM utilizes the average absolute value across the batch and sequence dimensions. As a further deviation from OATS, SLIM does not perform an alternating thresholding algorithm. Instead, they perform a single quantization and pruning step to initialize the quantized and sparse term, followed by a single singular value thresholding step to establish the low-rank term.

Regarding [R5], we previously cited it under the “Structured Pruning and Low-Rank Adaptation” paragraph in the Related Works section. We have included the paragraph below for the reviewer’s convenience:

[Page 10]: Recent works, such as LoSparse, LoRAPrune, and APT, propose variations of applying structured pruning on the weights while incorporating a low-rank adapter that is trained via gradient descent. These are markedly different than OATS, which does not employ any fine-tuning with low-rank adapters, nor does it perform structured pruning (but rather a sparse plus low-rank decomposition which can be thought of as a combination of structured and unstructured pruning).

[Reviewer's Comment]: The paper presentation could be improved, especially the math equations.

We thank the reviewer for raising their concerns and we have provided an additional annotation to the following equation in Section 2.4 when defining the rank ratio, $\kappa = \frac{r(d_{out} + d_{in})}{(1-\rho)\,d_{out} \cdot d_{in}}$, clarifying that the numerator represents the number of parameters in the low-rank term and that the denominator represents the total number of nonzero parameters in the compressed layer.
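As a quick numeric illustration of this definition (with assumed values, not dimensions taken from the paper), the rank-ratio equation can be solved for the rank r:

```python
# Hypothetical example: choose the rank r that realizes a target rank ratio kappa.
d_out, d_in = 4096, 4096   # assumed layer dimensions
rho, kappa = 0.5, 0.3      # assumed compression rate and rank ratio

# kappa = r * (d_out + d_in) / ((1 - rho) * d_out * d_in)  =>  solve for r
r = round(kappa * (1 - rho) * d_out * d_in / (d_out + d_in))
print(r)  # 307 -> the low-rank factors then hold roughly 30% of the retained parameter budget
```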

[Reviewer's Comment]: Additionally, why use inverse transformation to reach the compressed weight?

We thank the reviewer for raising a potential point of unclarity in our work. In OATS, the alternating thresholding algorithm returns a sparse plus low-rank decomposition that approximates $L + S \approx WD$. From this equation, if one wanted a sparse plus low-rank approximation of the original weight matrix, they would need to multiply by $D^{-1}$ on the right. To improve clarity, we have edited lines 136-137 to instead read:

…which gives a sparse plus low-rank approximation $WD \approx S + L$. OATS then applies the inverse transformation to reach the final compressed weight:

$$W_{compressed} := (L + S) D^{-1}.$$
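To tie the scaling, decomposition, and inverse transformation together, here is a minimal NumPy re-implementation sketch of the pipeline described above (illustrative only, not the authors' code); `d` is the diagonal of D, while `r`, `k`, and `n_iter` are assumed rank, sparsity, and iteration budgets:

```python
import numpy as np

def oats_like_compress(W, d, r, k, n_iter=40):
    """Sketch of the scale -> alternate-threshold -> unscale pipeline described above.

    W: (d_out, d_in) weight matrix; d: diagonal of the scaling matrix D;
    r: target rank; k: number of nonzeros kept in S; n_iter: alternating steps.
    """
    WD = W * d[None, :]                  # scale columns: W @ diag(d)
    S = np.zeros_like(WD)
    L = np.zeros_like(WD)
    for _ in range(n_iter):
        # singular-value thresholding: best rank-r approximation of WD - S
        U, sig, Vt = np.linalg.svd(WD - S, full_matrices=False)
        L = (U[:, :r] * sig[:r]) @ Vt[:r]
        # hard thresholding: keep the k largest-magnitude entries of WD - L
        R = WD - L
        thresh = np.partition(np.abs(R), -k, axis=None)[-k]
        S = np.where(np.abs(R) >= thresh, R, 0.0)
    # inverse transformation back to the original weight space: (L + S) @ diag(d)^{-1}
    return (L + S) / d[None, :]

# e.g. W_hat = oats_like_compress(W, d, r=64, k=int(0.4 * W.size))
```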

[1] Mozaffari, M., Yazdanbakhsh, A., Zhang, Z., & Mehri Dehnavi, M. (2024). SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs.

[2] Mozaffari, M., & Mehri Dehnavi, M. (2024). SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs.

Comment

Similarity with Wanda (Part 3/5)

[Reviewer Comment]: Instead, this paper seems to directly apply the pruning approaches proposed in Wanda.

Thank you for this comment. We wish to clarify that the Wanda algorithm is strictly for pruning dense weight matrices into sparse matrices. Unlike OATS, there is no sparse and low-rank decomposition or alternating thresholding being done. Throughout the paper, we have emphasized that the scaling done by OATS is inspired by Wanda, with an additional paragraph in the Related Works section specifically dedicated to highlighting that OATS would reduce to Wanda, in the specific case where the rank ratio is 0.

To further illustrate the differences between OATS and Wanda, we have included a new section Appendix A.7 titled “Performance Gap Between OATS and Wanda”, that highlights the exact gap between the two compression approaches for each performance metric. We have included the table below for the reviewer’s convenience:

| Model | Compression | MMLU (↑) | Zero-Shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- | --- |
| Phi-3 Mini | 30% | +1.21% | +0.82% | -0.44 |
| | 40% | +1.60% | +1.24% | -1.06 |
| | 50% | +5.42% | +3.38% | -2.05 |
| Phi-3 Medium | 30% | +0.97% | -0.01% | -0.43 |
| | 40% | +1.65% | +1.45% | -0.79 |
| | 50% | +2.52% | +2.43% | -1.07 |
| Llama-3 8B | 30% | +1.55% | +0.71% | +0.20 |
| | 40% | +2.13% | +1.64% | -0.50 |
| | 50% | +6.63% | +2.44% | -1.49 |
| Llama-3 70B | 30% | -0.68% | +0.05% | -0.17 |
| | 40% | +0.73% | +0.78% | -0.40 |
| | 50% | +2.74% | +0.45% | -0.60 |
Comment

Additional Ablations for the OATS Algorithm (Part 2/5)

[Reviewer Comment]: It is unknown how to decide the rank and number of zeroes.

All Llama-3 experiments use a rank ratio of 0.3, and all Phi-3 experiments use a rank ratio of 0.25, demonstrating the robustness of OATS to the rank ratio parameter.

A related and valid question is why the same rank ratio can be applied across different layers of the model. We hypothesize that the weight matrices possess meaningful low-rank subspaces of roughly similar dimensions across layers. While we are actively pursuing a theoretical investigation to better understand this phenomenon, we believe that such an analysis falls outside the scope of the current paper. Evidence for the existence of a meaningful low-rank subspace is presented in a concurrent submission to the ICLR conference [1], which examines the spectral properties of weight matrices through the lens of Random Matrix Theory. Specifically, the study demonstrates that the spectrum includes a small number of outlier eigenvalues, whose removal significantly degrades performance.

[Reviewer Comment]: Besides, the order of applying SVD and thresholding has a significant impact on the approximation errors.

We thank the reviewer for providing a thought-provoking remark about whether the order utilized by OATS is the optimal order to be used. To address this question, we have run additional experiments and added the following subsection in Appendix A.4 titled “Switching the Order of Thresholding”:

[New Section | Page 20]: OATS opts to perform the singular-value thresholding first followed by the hard thresholding similar to [2]. However, one might consider whether the alternative order could lead to faster convergence or a better approximation. Presented in the table below is an extension of the ablation studies presented in Section 3.3, reporting the performance of OATS where the hard-thresholding is performed first:

| First Thresholding Operation | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| Hard-Thresholding | 65.51 | 70.54 | 11.72 |
| Singular Value Thresholding (OATS) | 65.84 | 70.71 | 11.50 |

While the performance still remains competitive, across all performance metrics, the switched order falls short of matching the original order presented in the Algorithm.
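For readers who prefer code, a compact sketch (ours, not the authors' implementation) of a single alternating step with a flag that reproduces the switched order used in this ablation; `WD` is the scaled weight matrix, and `r`, `k` are assumed rank and sparsity budgets:

```python
import numpy as np

def alternating_step(WD, S, L, r, k, svt_first=True):
    """One alternating iteration; svt_first=False reproduces the switched-order ablation."""
    def svt(M):                                    # best rank-r approximation of M
        U, sig, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :r] * sig[:r]) @ Vt[:r]

    def hard(M):                                   # keep the k largest-magnitude entries of M
        t = np.partition(np.abs(M), -k, axis=None)[-k]
        return np.where(np.abs(M) >= t, M, 0.0)

    if svt_first:                                  # OATS order: SVT, then hard thresholding
        L = svt(WD - S)
        S = hard(WD - L)
    else:                                          # switched order from the ablation above
        S = hard(WD - L)
        L = svt(WD - S)
    return S, L
```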

[Reviewer Comment]: This paper claims "outlier information" in this paper. However, I have not seen any analysis or explanation for the "outlier information," and the proposed solution is not related to the "outlier information" … The authors are encouraged to provide explanations of why the diagonal matrix is related to "outlier information" and why it is good for compression."

We would like to thank the reviewer for this wonderful inquiry. In response, we have included a new subsection titled “Using a Robust Scaling Matrix” to Appendix A.3, which includes additional experiments exploring whether the matrix D is truly capturing the outlier information, and if that is the reason why it is good for compression:

[New Section | Page 20]: To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [3]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.

The reason why the scaling is good for compression has also been touched on in prior works like DSNoT and Wanda. Both papers utilize the same scaling matrix as OATS and both reasoned that the scaling is needed to steer the algorithm away from pruning weights that are in an outlier channel.

[1] https://openreview.net/forum?id=MmWkNmeDNE

[2] Zhou, T., & Tao, D. (2011). GoDec: randomized low-rank & sparse matrix decomposition in noisy case.

[3] Huber, P. J. (1981). Robust statistics.

Comment

OATS Novelty and Differences (Part 1/5)

We appreciate the reviewer’s critical assessment of our paper and we will attempt to address each of the reviewer’s concerns.

[Reviewer Comment]: The combination of low-rankness and sparsity is an old topic that has been explored for many years [R1, R2].

We apologize for having missed the seminal works [1,2] during our literature search and have fixed the manuscript appropriately to cite the former when Robust PCA is presented in the paper, and the latter when presenting the alternating thresholding algorithm. We want to clarify that the novelty associated with our work is the effectiveness of Robust PCA for compressing large transformer models once an outlier scaling is applied -- not the Robust PCA problem itself. When introducing the Robust PCA problem, we did cite two seminal works on the problem, a more widely cited work by the same authors of [1] titled Rank-Sparsity Incoherence for Matrix Decomposition and a work by Candes et al. titled Robust principal component analysis? (2009).

[Reviewer Comment]: Besides, compressing DNN models using low-rank and sparse decomposition has already been well explored in [R3]. This paper just scales it to larger models and matrices. Authors are encouraged to specify the unique difference from existing approaches and why this difference is also unique for LLMs.

We again apologize for not having cited this important and relevant work and we thank the reviewer for pointing it out. To rectify our mistake, we have now added the following paragraph to Appendix A.1, providing an in-depth explanation of the differences between the two algorithms and why such differences are unique to LLMs:

[New Section | Page 18 ]: [3] introduced a method for sparse and low-rank decomposition of CNNs, including AlexNet and GoogLeNet, by solving the following optimization problem:

$$\min_{S, L \in \mathbb{R}^{d_{out}\times d_{in}}} \|Y - (S+L)X\|_2^{2} \;\; \text{s.t.} \;\; \|W - (S + L)\|_F^2 \leq \gamma,\ \text{Rank}(L)\leq r,\ \|S\|_0 \leq k$$

where $Y = WX$. In contrast, OATS employs a different approach, solving:

$$\min_{S, L \in \mathbb{R}^{d_{out} \times d_{in}}} \| W - S - L \|_F^2 \;\; \text{s.t.} \;\; \text{Rank}(L) \leq r, \; \|S\|_0 \leq k.$$

A key distinction between these methods lies in their objectives: the former directly minimizes reconstruction error, while OATS adopts a simpler formulation. One might question why not follow the approach of minimizing reconstruction error. As noted in DSNoT [4], pruning methods that prioritize minimizing reconstruction error can degrade model performance in large transformers, particularly in the presence of outlier features. Their findings highlight the importance of avoiding pruning weights within outlier channels. Since feature outliers are a phenomenon unique to large transformer models [5], this issue would not have been relevant to the work of [3], which predates the transformer era.

[Reviewer Comment]: The proposed Truncated SVD and Threshold strategies to achieve low-rankness and sparsity are too trivial.

We acknowledge the reviewer's point that OATS is a simple algorithm. However, this simplicity is precisely what makes OATS appealing. It demonstrates that a conceptually simple approach, which has not been widely applied to large-scale transformer models, can outperform more complex pruning methods.

[1] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., & Willsky, A. S. (2009). Sparse and Low-Rank Matrix Decompositions.

[2] Zhou, T., & Tao, D. (2011). GoDec: randomized low-rank & sparse matrix decomposition in noisy case.

[3] Yu, X., Liu, T., Wang, X., & Tao, D. (2017). On Compressing Deep Models by Low Rank and Sparse Decomposition.

[4] Zhang, Y., Zhao, L., Lin, M., Yunyun, S., Yao, Y., Han, X., Tanner, J., Liu, S., & Ji, R. (2024). Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs.

[5] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2024). LLM.int8(): 8-bit matrix multiplication for transformers at scale.

Comment

Dear reviewer, we would greatly appreciate it if you could take a moment to read our rebuttal and provide any further feedback to the improvements made in response to your comments. This would give us enough time to respond should there be any further changes that the reviewer thinks are needed to improve the paper and its score. We understand the demanding nature of the review process and appreciate the time and effort that the reviewer is dedicating to this task.

Comment

Dear reviewer, we apologize for the repeated emails; however, as the deadline for editing the manuscript is tomorrow, we wanted to ensure that all the reviewer’s concerns have been addressed, especially given that their initial review was quite critical compared to the others. We would greatly appreciate it if the reviewer could provide us with some feedback so that we could take further action should the reviewer have any more concerns prior to the deadline. We want to thank the reviewer again for their in-depth review and reiterate that we do very much appreciate the comments and suggestions that were made in the original review.

Comment

Dear reviewer BUra, as we await your reply, we would like to bring to your attention that we have re-formatted, added a few clarifying sentences, and bolded key sentences in our replies addressing the reviewer’s comments. We hope that these changes make our responses easier to read for the reviewer. Please do not hesitate to reach out on the discussion page if there is anything that we, the authors, can do to ameliorate the reviewer’s process of reading/replying to our comments. We thank the reviewer and look forward to hopefully hearing back from them before the discussion deadline.

Review
Rating: 8

The paper introduces OATS (Outlier-Aware Pruning Through Sparse and Low-Rank Decomposition), a method designed to compress large transformer models without the need for retraining. The central concept involves representing weight matrices as the sum of a sparse and low-rank matrix.

The authors propose an iterative alternating thresholding technique to compute the joint sparse and low-rank decomposition of a matrix. They also focus on preserving outliers by scaling weights according to input embeddings prior to decomposition. The results indicate that OATS performs effectively on both language and vision tasks, outperforming other pruning methods proposed for transformers across various compression levels and tasks. Additionally, OATS offers speed improvements on the CPU.

Strengths

  • The concept of compressing a model as a sum of a sparse and low-rank matrix is very promising. Unlike most prior methods, which focus on one approach, OATS leverages both to potentially enhance performance.

  • OATS is retraining-free, which is crucial for practical applications where even a single backpropagation pass can be computationally prohibitive.

  • The framework has been tested on state-of-the-art models like Llama and ViT, demonstrating competitive performance.

  • The Alternating Thresholding technique in OATS heuristically finds an effective combination of low-rank and sparse components and accommodates different sparsity patterns.

  • By scaling the weight matrix W with a diagonal rescaling matrix D, OATS emphasizes outliers, enabling outlier-aware compression that avoids performance degradation.

Weaknesses

One concern is that the method relies on multiple calls to truncated SVD, which can be computationally intensive. Specifically, finding the top-$r$ singular values of an $m \times n$ matrix has a time complexity of $O(mnr)$. Given that compression speed is a significant factor for practical applications, it would be helpful if the authors could clarify the time complexity and wall-clock time spent on the compression process of the overall algorithm. This would offer a more concrete understanding of its practicality.
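As a rough illustration of where this cost appears, the rank-r factorization in each iteration could be obtained either from a full SVD that is then truncated, or from a partial routine such as scipy.sparse.linalg.svds that only extracts the top-r singular triplets and stays closer to the O(mnr) regime mentioned here. This is a sketch with assumed shapes, not the authors' implementation:

```python
import numpy as np
from scipy.sparse.linalg import svds

m, n, r = 4096, 4096, 128                     # hypothetical layer shape and target rank
A = np.random.randn(m, n)

# Full SVD then truncation: roughly O(m * n * min(m, n)), wasteful when r << min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r_full = (U[:, :r] * s[:r]) @ Vt[:r]

# Partial SVD of only the top-r singular triplets: closer to the O(mnr) regime.
U_r, s_r, Vt_r = svds(A, k=r)
A_r_partial = (U_r * s_r) @ Vt_r
```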

Questions

One potential extension of this work could involve incorporating quantization into the proposed framework. Although this addition may be a long shot, integrating quantization could make the model a unified approach to transformer compression. Could the authors provide insights on whether quantization can be integrated with their current framework, or if not, what are the main challenges?

Comment

By scaling the weight matrix W with a diagonal rescaling matrix D, OATS emphasizes outliers, enabling outlier-aware compression that avoids performance degradation.

We thank the reviewer for highlighting a strength with our work and we wanted to mention that we have further elaborated on this strength with a new section Appendix A.3 titled “Using a Robust Scaling Matrix”, included below, that empirically shows that the sensitivity to outliers is needed for good compression:

To explore whether the scaling matrix D is truly related to the outlier information, we run the following two experiments:

  1. scaling by the square root of the features' second moments, as is currently done in OATS.
  2. scaling by the median of the features' absolute values (computed along batch and sequence dimensions):

$$D_{robust} = \mathrm{median}(|X|)$$

The second experiment estimates the square root of the second moment of features in a manner that is robust (insensitive) to outliers akin to the Median Absolute Deviation estimator from the robust statistics literature [1]. The results of the two experiments are presented in the table below:

| Scaling Matrix | MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- | --- |
| $D_{robust}$ | 55.54 | 65.77 | 18.59 |
| $D$ | 59.99 | 68.41 | 15.18 |

The findings show that using the robust scaling method results in significantly worse performance. Hence, the scaling matrix D that is sensitive to the outlier features and captures their scale leads to better compression.

[1] Huber, P. J. (1981). Robust statistics.

We would like to give thanks to the reviewer for their detailed comments, careful inspection of our work, and foresight in extending OATS to quantization. Should the reviewer find it appropriate, the authors would always be appreciative of any additional reconsideration of the score.

Comment

The authors would like to extend their gratitude to the reviewer for highlighting the strengths of our paper, raising valid concerns about the runtime of OATS, and inquiring about follow-ups in regards to extending the algorithm to the quantization setting.

Given that compression speed is a significant factor for practical applications, it would be helpful if the authors could clarify the time complexity and wall-clock time spent on the compression process of the overall algorithm. This would offer a more concrete understanding of its practicality.

We thank the reviewer for the suggestion and in response have included a new section Appendix A.2 titled ”Time Complexity and Wall-Clock Time for OATS” which we have included below for the reviewer’s convenience:

The time complexity for OATS is $\mathcal{O}(LN\alpha)$, where $L$ is the number of transformer blocks, $N$ is the number of iterations, and

$$\alpha = \max_{W} \; d^{W}_{out} \cdot d^{W}_{in} \cdot r^{W}$$

where the max is taken over the weight matrices, $W \in \mathbb{R}^{d^{W}_{out} \times d^{W}_{in}}$, in a transformer block and $r^{W}$ is the rank of the low-rank term for that weight matrix. The value $\alpha$ represents the time complexity needed to perform the singular value thresholding in OATS.

The table below reports the wall-clock time (in seconds) needed to perform a single iteration of the alternating threshold algorithm for a single transformer block for the different models that were compressed. All experiments utilized a single NVIDIA A40 with 48GB of GPU memory.

| Phi-3 Mini (3.8B) | Phi-3 Medium (14B) | Llama-3 8B | Llama-3 70B |
| --- | --- | --- | --- |
| 8.85 | 26.02 | 17.10 | 152.80 |

While OATS does require more wall-clock time than prior pruning algorithms, in practice, model compression would only need to be performed once before deployment. This trade-off is therefore worthwhile given the substantial performance improvements, particularly on more challenging tasks like MMLU. Furthermore, like prior pruning algorithms, compressing the layers within a single transformer block can be done in parallel. For example, the time needed per transformer block of Llama-3 70B can be reduced to 71.10 seconds by compressing in parallel across four NVIDIA A40 GPUs.

The total wall-clock time can also be reduced by lowering the number of OATS iterations. Presented in the Table below is an exploratory experiment compressing Llama-3 70B by 50% with a rank ratio of 0.3 with only 20 iterations. Even with only a quarter of the iterations, OATS is still able to outperform all prior pruning algorithms across all performance metrics.

| MMLU (↑) | Zero-shot (↑) | Perplexity (↓) |
| --- | --- | --- |
| 74.02 | 73.41 | 4.95 |

One potential extension of this work could involve incorporating quantization into the proposed framework. Although this addition may be a long shot, integrating quantization could make the model a unified approach to transformer compression. Could the authors provide insights on whether quantization can be integrated with their current framework, or if not, what are the main challenges?

We share the same interest as the reviewer in integrating quantization with OATS. The formulation proposed in our work is flexible and can indeed accommodate that. However, one of the associated challenges is deciding which terms in the sparse and low-rank decomposition should be quantized. Depending on that decision, one would obtain one of the following three optimization problems:

  1. $\min_{S_Q, L_Q} \; \|S_Q + L_Q - W\| \;\; \text{s.t.} \;\; \|S_Q\|_0 \leq k$, $S_Q$ is quantized, $\text{rank}(L_Q)\leq r$, $L_Q$ is quantized

  2. $\min_{S_Q, L} \; \|S_Q + L - W\| \;\; \text{s.t.} \;\; \|S_Q\|_0 \leq k$, $S_Q$ is quantized, $\text{rank}(L)\leq r$

  3. $\min_{S, L_Q} \; \|S + L_Q - W\| \;\; \text{s.t.} \;\; \|S\|_0 \leq k$, $\text{rank}(L_Q)\leq r$, $L_Q$ is quantized

We are currently investigating the three different approaches to determine which would ultimately lead to the best-unified compression approach.

Comment

Dear reviewer, we would greatly appreciate it if you could take a moment to read our rebuttal before the deadline as it would give us enough time to construct a detailed response should there be any lingering concerns or issues that could be resolved to improve our work and its score. We are grateful for all the time and energy that the reviewer has dedicated to the demanding nature of the review process and we would like to reiterate our thanks to the reviewer.

Comment

I appreciate the authors' thorough explanations and efforts to address the questions and concerns raised. I believe OATS presents an interesting contribution to the community, with many promising directions for future research stemming from it. I will maintain my score and advocate for the acceptance of this paper.

Comment

Dear Reviewer Gg5U,

As the discussion period comes to a close, we would like to express our sincere gratitude to the reviewer for engaging with us during the discussion period, for the reviewer’s recognition of OATS as an interesting contribution to the community with many promising directions for future research, and for advocating for the acceptance of our work into the conference. Thank you.

Best,

The Authors.

Public Comment

I hope this message finds you well.

I have a question regarding the CPU throughput tests presented in Section 3.4 of your paper. Specifically, I was wondering if any hardware acceleration techniques, such as compression or other methods, were used during these tests. If not, in scenarios where the total data volume of the compressed matrices exceeds that of the original weight matrix (S, U, V matrices > weight matrix), what strategies would you suggest for improving throughput?

Thank you for your time, and I look forward to your insights.

Best regards

Comment

Thank you for showing interest in our work. Beyond compressing the weights according to OATS, we did not perform any other compression techniques. However, we did utilize Neural Magic's DeepSparse Engine to benchmark the throughput through the sparse model, which utilizes additional techniques to leverage sparsity for speed-up [1]. Two simple solutions to further reduce the memory footprint associated with the unstructured sparse matrix are to either induce more (unstructured) sparsity or to replace unstructured with structured sparsity – both of which OATS is able to do over prior unstructured pruning techniques. However, beyond OATS, another option would be to incorporate quantization, which is a direction that we are actively exploring.

[1] https://github.com/neuralmagic/deepsparse

AC Meta-Review

Summary

The paper presents OATS (Outlier-Aware Pruning Through Sparse and Low-Rank Decomposition), a technique developed to compress large transformer models and reduce memory and compute costs without requiring costly retraining. This method is based on decomposing the weight matrices into a combination of a sparse matrix and a low-rank matrix. OATS enhances model compression by scaling the weights of transformer models according to the second moment of input embeddings, a strategy designed to retain crucial outlier features. This method effectively maintains model performance, addressing the common issue of performance degradation observed with higher compression levels in conventional pruning techniques. The authors demonstrate that OATS effectively compresses transformer models for both language and vision tasks, surpassing the performance of other pruning methods across various tasks and compression levels. Additionally, OATS provides notable speed enhancements when deployed on CPUs.

Strengths

The reviewers unanimously highlighted several strengths of the proposed framework:

  • OATS eliminates the need for retraining, which is essential for practical scenarios where even a single backpropagation pass can be computationally expensive.
  • The method has been evaluated on cutting-edge architectures such as Llama and ViT on a large range of model sizes, solidly demonstrating that it can compete with recent benchmarks in terms of performance.
  • The paper includes data on practical speed improvements on CPUs when using current frameworks for sparse computation.

Weaknesses

The reviewers brought up several core weaknesses:

  • Some very relevant prior works are missing, some of which are conceptually identical to the proposed work, e.g., Yu et al., "On compressing deep models by low rank and sparse decomposition," CVPR 2017.
  • The choice of rank ratio parameter and why it should be fixed across different layers, as opposed to considering dynamic ranks across different layers, e.g., as done in the concurrent work [1].
  • The lack of computational analysis for OATS.
  • Poor placement of the paper in the landscape of current literature.

[1] "Dynamic Low-Rank Sparse Adaptation for Large Language Models," in Proc. Thirteenth Int. Conf. Learning Representations, 2024, under review. [Online]. Available: https://openreview.net/forum?id=oXh0939Zzq

Conclusion

The majority of the reviewers evaluated this paper positively. However, Reviewer BUra (Rating: 3, Confidence: 5) raised several critical issues about the submitted work, with which I largely agree, particularly the omission of critical prior works that are conceptually similar to the current submission. The authors have addressed most of the concerns raised by Reviewer BUra, which unfortunately, Reviewer BUra did not acknowledge. While I share some of Reviewer BUra's reservations, I believe that in light of the authors' rebuttal, the paper's strengths outweigh its weaknesses. Considering this, I view the paper as marginally above the acceptance threshold and recommend acceptance.

Additional Comments from the Reviewer Discussion

Despite my efforts to engage the reviewers in a discussion during the review period to reach a consensus on the paper’s merits and shortcomings, there was no participation in any discussion.

Reviewer BUra raised significant concerns about the paper, to which the authors responded adequately. In my view, their responses partially address the issues highlighted by Reviewer BUra. Unfortunately, the reviewer did not acknowledge the authors' rebuttal. While I agree with some of Reviewer BUra's criticisms, considering the authors' rebuttal and aligning with the majority of the reviewers, I believe the strengths of the paper outweigh its shortcomings. Furthermore, given the significance and relevance of the research topic, I assess the paper to be above the acceptance threshold and vote for its acceptance.

Final Decision

Accept (Poster)