PaperHub
Score: 5.5/10
Poster · 3 reviewers
Ratings: 4 / 3 / 2 (min 2, max 4, std 0.8)
ICML 2025

BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24
TL;DR

We identify the bias in existing pruning metrics and propose a novel automatic optimization framework to find an appropriate pruning metric for better pruning results.

Abstract

Keywords
LLM Pruning; Automatic Framework

Reviews and Discussion

Official Review
Rating: 4

This paper focuses on unstructured pruning of LLMs and introduces a new pruning metric. Unlike previous methods that estimate parameter importance based solely on magnitude, activations, or gradients, the proposed approach also considers the impact of outliers in model parameters.

The authors first demonstrate that a small number of outlier parameters with large magnitudes can significantly affect existing pruning metrics. To address this issue, the proposed method normalizes each model parameter using the ℓ2 norms of the corresponding input and output channels. Additionally, to handle outliers in the input, the authors introduce a power factor after computing the input norm.
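
For concreteness, the sketch below illustrates one plausible form of such a metric: per-weight magnitude, dual-channel ℓ2 normalization, and a power factor applied to the input-activation norm. It is a minimal sketch based on the description above, not the paper's actual formulation (Eq. 5); the function name, argument names, and the exact placement of the exponents theta1/theta2/theta3 are assumptions.

```python
import torch

def bawa_style_score(W, x_norm, theta1=0.5, theta2=0.5, theta3=0.5, eps=1e-8):
    """Sketch of a balanced weight/activation pruning score (assumed form).

    W:      (out_features, in_features) weight of one linear layer
    x_norm: (in_features,) per-input-channel L2 norm of calibration activations
    theta*: exponents on the normalization / activation terms; in BaWA these
            are hyperparameters found automatically rather than fixed by hand
    """
    in_norm = W.norm(p=2, dim=0, keepdim=True)   # (1, in): L2 norm of each input channel
    out_norm = W.norm(p=2, dim=1, keepdim=True)  # (out, 1): L2 norm of each output channel
    # Balance weight magnitudes across channels, then damp activation outliers
    # with a power factor on the per-channel input norm.
    balanced_w = W.abs() / ((in_norm + eps) ** theta1 * (out_norm + eps) ** theta2)
    return balanced_w * (x_norm.unsqueeze(0) + eps) ** theta3
```

Weights with the lowest scores would then be pruned per layer up to the target sparsity.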

To optimize the newly introduced hyperparameters, the authors propose using a zeroth-order gradient approach, which allows optimization without backpropagation. Experimental results show that the proposed method outperforms baseline approaches.
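
The forward-only hyperparameter search referred to here can be sketched as a two-point zeroth-order gradient estimate (an SPSA-style step). This is a generic illustration under that assumption, not the authors' exact algorithm; zeroth_order_step, loss_fn, mu, and lr are hypothetical names.

```python
import torch

def zeroth_order_step(theta, loss_fn, mu=1e-2, lr=1e-2):
    """One forward-only update of the pruning-metric hyperparameters.

    theta:   1-D tensor of hyperparameters (e.g. normalization exponents)
    loss_fn: callable mapping theta -> scalar calibration loss of the model
             pruned with that metric (forward passes only, no backprop)
    """
    u = torch.randn_like(theta)                                   # random perturbation direction
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    return theta - lr * g                                         # move against the estimated gradient
```

Because each step needs only two forward evaluations of the pruned model, no gradients of the full LLM ever have to be stored.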

Questions for the Authors

N/A

Claims and Evidence

The claims made in the submission are well-supported. The authors provide clear evidence for their core arguments regarding the influence of outliers in model parameters and input activations. Additionally, the proposed method is well-reasoned and justified.

Methods and Evaluation Criteria

The proposed method and empirical evaluation follow common practices in this domain. The reviewer generally agrees with the evaluation settings and does not find any significant issues with them.

Theoretical Claims

This paper does not contain theoretical claims.

Experimental Design and Analysis

The experimental design is sound and valid. It follows previous methods (e.g., Wanda, SparseGPT) and uses widely accepted benchmarks.

Supplementary Material

All.

Relation to Prior Literature

This paper is closely related to previous literature and methods in this domain. It builds upon existing pruning metrics and further enhances the effectiveness of unstructured model pruning.

Missing Important References

I have only one main comment here. The paper does not fully discuss past research on model pruning for LLMs, focusing mainly on the baselines used in the experiments. While this may be due to space constraints caused by the extensive evaluation and analysis, I still recommend that the authors provide a broader discussion of related work. In particular, discussing structured pruning methods [1-4] for LLMs would be valuable, as structured pruning is generally more hardware-friendly compared to unstructured pruning, which is the focus of this paper.

[1] Xia, Mengzhou, et al. "Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning." arXiv preprint arXiv:2310.06694 (2023).

[2] Sreenivas, Sharath Turuvekere, et al. "LLM Pruning and Distillation in Practice: The Minitron Approach." arXiv preprint arXiv:2408.11796 (2024).

[3] Ling, Gui, Ziyang Wang, and Qingwen Liu. "SlimGPT: Layer-wise Structured Pruning for Large Language Models." Advances in Neural Information Processing Systems 37 (2024): 107112-107137.

[4] Hou, Bairu, et al. "Instruction-Following Pruning for Large Language Models." arXiv preprint arXiv:2501.02086 (2025).

Other Strengths and Weaknesses

This paper is well-motivated, with clear writing and a coherent logical flow. The reviewer enjoyed reading it. Additionally, the proposed method is reasonable and well-structured, and the evaluation is rigorous and comprehensive.

The only concern is the performance under semi-structured sparsity. I assume that the unstructured sparsity results in Table 2 and Table 4 do not lead to efficiency improvements. As shown in Table 5, Table 8, and Table 10, the performance degradation remains significant compared to the original dense model. Furthermore, SparseGPT sometimes outperforms the proposed method. However, the results improve in Table 11 and Table 12.

Overall, the reviewer considers this a strong paper.

Other Comments or Suggestions

N/A

Author Response

Dear zn1Z:

We sincerely appreciate the valuable suggestions provided by the reviewer. We note the two main concerns you raised, which we address below.

Firstly, we thank the reviewer for emphasizing the importance of comparing with structured pruning methods. We compare structured and unstructured pruning in terms of pruning granularity, sparsity, accuracy, efficiency, and post-pruning training. Structured pruning [1, 2, 3] removes complete substructures or weight groups from LLMs (layers [4], FFN neurons [1], MHA heads [1], embedding dimensions [5]; all coarse-grained), enabling hardware-independent efficiency gains. However, under such coarse-grained constraints, the accuracy of the pruned LLM tends to drop drastically, so the generally applicable sparsity ratio is between 15% and 30%. Post-pruning fine-tuning, training, or knowledge distillation may be used to restore the performance of the pruned model at high sparsity ratios [2, 5]. Unstructured pruning removes individual elements from the weights (fine-grained) and stores or loads the pruned weights in a compressed format. Combined with decompression (memory-bottleneck optimization) or hardware support (2:4 sparse tensor cores), unstructured sparse LLMs can also obtain considerable efficiency improvements. Because of its finer granularity, unstructured pruning is less likely to harm model accuracy, so the sparsity can generally exceed 50%, and the stricter 2:4 sparsity pattern is also acceptable. For large LLMs such as Llama-2-70B, unstructured pruning without loss of accuracy is possible, and BaWA achieves it. An unstructured-pruned LLM can also be further trained, with techniques such as PEFT [6] and STE [7] providing support. In the revised manuscript, we will include a dedicated discussion (Section 2) comparing structured and unstructured pruning for LLMs to demonstrate the necessity of high-performance unstructured/semi-structured LLM pruning. The following table compares a portion of the experimental results of structured and unstructured pruning as a reference.

| Method | Type | Sparsity | Reference Speedup | BoolQ | RTE | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-13B | Dense | 0% | 1.0x | 83.43 | 67.51 | 61.25 | 73.8 | 80.83 | 50.43 | 32.8 | 64.3 |
| LLM-Pruner | Structured Pruning | 25% | 1.25x | 68.35 | 50.54 | 46.81 | 61.56 | 70.2 | 37.63 | 28.8 | 51.98 |
| ShortGPT | Layer Pruning | 25% | 1.25x | 62.54 | 59.57 | 47.7 | 70.96 | 61.24 | 37.88 | 27 | 52.41 |
| Wanda | 2:4 Semi | 50% | 1.4x | 75.26 | 56.68 | 46.43 | 66.77 | 68.35 | 34.47 | 24.4 | 53.19 |
| BaWA | 2:4 Semi | 50% | 1.4x | 78.26 | 56.32 | 48.5 | 66.93 | 70.79 | 35.49 | 26 | 54.61 |

Secondly, although unstructured pruning has traditionally lacked hardware support, recent advancements such as Flash-LLM [8] demonstrate its practical speedup through specialized kernels. Moreover, our method’s compatibility with N:M sparsity (natively supported by Ampere GPUs) further ensures deployability while maintaining higher accuracy than structured alternatives (Table 9). Additionally, we would like to clarify that the slight performance gap between BaWA and SparseGPT in semi-structured settings (Table 8) stems from SparseGPT’s weight reconstruction—a complementary technique orthogonal to pruning metrics, which is explained in our rebuttal to the reviewer xLm6. As shown in "BaWA+ADMM" (Table 3), combining our metric with weight reconstruction outperforms all baselines universally. Furthermore, larger models (e.g., LLaMA2-70B) exhibit greater robustness to N:M constraints due to inherent redundancy, reducing accuracy drops to <1% in 4:8 sparsity (Table 12).
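
To make the N:M constraint mentioned above concrete, the sketch below keeps the top-n scoring weights in every group of m consecutive input positions (2:4 by default). It is a generic illustration of N:M mask selection, not code from the paper; nm_mask and score are hypothetical names.

```python
import torch

def nm_mask(score, n=2, m=4):
    """Build an N:M sparsity mask from per-weight importance scores.

    score: (out_features, in_features) tensor; in_features must be divisible by m.
    Returns a boolean mask where True marks the n kept weights in each group of m.
    """
    groups = score.reshape(score.shape[0], -1, m)             # (out, in/m, m)
    keep = torch.zeros_like(groups, dtype=torch.bool)
    keep.scatter_(-1, groups.topk(n, dim=-1).indices, True)   # mark the top-n entries per group
    return keep.reshape_as(score)
```

Multiplying the weight matrix by such a mask zeroes exactly two of every four weights along the input dimension, the pattern accelerated by Ampere sparse tensor cores.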

References

[1] LLM-Pruner: On the structural pruning of large language models, NeurIPS'23.

[2] Sheared LLaMA: Accelerating language model pre-training via structured pruning, ICLR'24.

[3] Instruction-Following Pruning for Large Language Models, ArXiv'25.

[4] SlimGPT: Layer-wise Structured Pruning for Large Language Models, NeurIPS'24.

[5] LLM pruning and distillation in practice: the Minitron approach, ArXiv'24.

[6] SPP: Sparsity-preserved parameter-efficient fine-tuning for large language models, ICML'24.

[7] Sparsity-accelerated training for large language models, ACL'24.

[8] Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, VLDB'23.

Reviewer Comment

I thank the authors for the detailed response. After careful assessment, I still think this is a good paper with clear motivation and solid techniques. During the rebuttal, the authors further addressed my concerns about the application of semi-structured pruning. Based on this, I will maintain my rating (4, accept) and recommend the acceptance of this paper.

Author Comment

Dear zn1Z:

We sincerely appreciate your constructive feedback and continued support. Thank you for recognizing our efforts in addressing the semi-structured pruning concerns. We will carefully incorporate your suggestions in the final manuscript to further strengthen the technical presentation.

Official Review
Rating: 3

This work proposes a weight pruning method based on Wanda that performs normalization over input and output channels and scales the normalization factors. Wanda is a simple weight pruning method that uses scores computed from the L1 norm of the weight and the L2 norm of the input, but it suffers from imbalance in weight magnitudes and from the influence of outliers. This work alleviates these issues by normalizing weights according to the L2 norm of the input and output channels and by scaling the normalization terms. The scaling terms are optimized with forward-only methods for faster computation. Experiments are carried out on several zero-shot benchmarks and show that the proposed method achieves better results than other SOTA methods, including ADMM-Iter.
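
For reference, a minimal sketch of the Wanda-style score described here (per-weight magnitude times per-input-channel activation L2 norm) is shown below. It follows the published description of Wanda rather than code from this paper; wanda_score and X are hypothetical names.

```python
import torch

def wanda_score(W, X):
    """Wanda-style importance score: |W_ij| * ||X_j||_2.

    W: (out_features, in_features) layer weight
    X: (num_tokens, in_features) calibration activations feeding the layer
    """
    return W.abs() * X.norm(p=2, dim=0, keepdim=True)  # broadcast input norms over output rows
```

BaWA keeps this weight-times-activation structure but additionally normalizes the weight term by input/output channel norms and scales the terms, as summarized above.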

Questions for the Authors

It is not clear whether scales are optimized in Table 5 when showing the proposed method with, e.g., input channel normalization.

Claims and Evidence

The major claim regarding weight magnitude normalization for input and output channels is supported by the discussion and analysis in Sections 2 and 3, and by the ablation studies in Section 4.4. The use of scaling is clearly described in Sections 2 and 3, and the proposed optimization algorithm is a nice contribution. However, further analysis with more controlled scaling might help to clearly understand its impact. For instance, one could intentionally set scales of 0.1, 0.25, 0.5, 1.0, etc. and measure the impact on perplexity. The combination of the optimization algorithm with normalization is not thoroughly measured in the ablation studies, if my understanding is correct. For instance, it is possible to optimize a single scale parameter θ in Equation 4 when showing the ablations in Table 5 to measure the impact of optimization. Similarly, outlier regularization in Table 5 uses a fixed scaling of 0.5, but it could be optimized to show the effectiveness of the proposed algorithm.

Methods and Evaluation Criteria

Experiments cover diverse benchmarks for zero-shot settings as well as WikiText-2 for perplexity to compare prior baselines.

Theoretical Claims

No theoretical proofs for the proposed method, since this work focuses on empirical studies by carefully analyzing the behavior of weight magnitudes for pruning.

Experimental Design and Analysis

Experimental design sounds good to me, but I'd suggest additional ablation studies as noted in the "Claims and Evidence" section, i.e., running experiments by optimizing scales when ablating normalization factors.

  • Ablations in the rebuttal look promising.

Supplementary Material

I checked the code in Appendix F and the combination of pruning masks in Appendix G.

Relation to Prior Literature

This work is an extension of prior work in unstructured weight pruning such as Wanda as noted in this manuscript.

Missing Important References

This work is missing a discussion of ADMM-Iter and DSnoT, given that it shows experiments combining them with the proposed method. It is not clear whether the proposed method is orthogonal to those methods, and it is not even clear what findings or conclusions can be drawn from showing the results.

  • Additional details in the rebuttal.

Other Strengths and Weaknesses

Strengths

  • It is an extension of prior work on unstructured pruning, adding normalization and scaling. The design is well motivated, and the optimization of the scaling factors is yet another contribution to this field.

Weaknesses

  • Further ablations are necessary to justify the claim regarding the scaling, since it is not well supported by the experiments.

Other Comments or Suggestions

None.

Author Response

Dear xLm6:

We greatly appreciate your insightful comments. Below we provide a point-by-point response to your concerns.

Scaling Factor Analysis

We agree that analyzing scaling factors is critical. In the revised manuscript, we will add:

  • A new comparative table (below) demonstrating the superiority of optimized scales over fixed values on LLaMA2-7B

  • A new discussion in the revised Section 4.4 highlighting how adaptive scaling addresses LLM-specific distribution challenges. Specifically, different models and task settings will be explored to illustrate the effectiveness of BaWA's scaling strategy.

The evaluation results are as follows.

| Scaling Strategy | θ1 (Input) | θ2 (Output) | θ3 (Activation) | PPL | Δ vs. Best Fixed (0.5) |
|---|---|---|---|---|---|
| Fixed, θ=0.1 | 0.1 | 0.1 | 0.1 | 8.92 | +24.1% |
| Fixed, θ=0.5 | 0.5 | 0.5 | 0.5 | 7.18 | +0% (baseline) |
| Fixed, θ=1.0 | 1.0 | 1.0 | 1.0 | 7.53 | +4.9% |
| BaWA Optimized | 0.42 | 0.51 | 0.38 | 6.30 | -12.3% |

The key findings include:

  • Optimized scales reduce perplexity by 12.3% compared to the best fixed scale (θ=0.5)

  • Fixed scales exhibit significant sensitivity (±24.1% PPL variance)

Relationship with ADMM-Iter/DSnoT

We apologize for the lack of explanation of the relationship between BaWA and ADMM-Iter/DSnoT. The LLM pruning procedure can be divided into two stages: pruning mask selection and weight reconstruction. Unlike BaWA, which optimizes the pruning metric (Stage 1), these methods (both ADMM and DSnoT) focus on reconstructing the post-pruning weights (Stage 2). To illustrate the orthogonality, we evaluate BaWA combined with weight reconstruction methods (both DSnoT and ADMM) and compare against SparseGPT, ADMM-Iter, and DSnoT without the BaWA pruning metric on various LLaMA models with 2:4 sparsity. The results (table below) clearly show that using the BaWA pruning metric together with weight reconstruction achieves the best pruning performance. Furthermore, our evaluation in Table 17 (Appendix G) also demonstrates the effectiveness of adding weight reconstruction to the BaWA pruning metric. We will add a diagram of this pipeline to Appendix G.

| PPL | LLaMA-1 7B | LLaMA-1 13B | LLaMA-1 30B | LLaMA-1 65B | LLaMA-2 7B | LLaMA-2 13B | LLaMA-2 70B |
|---|---|---|---|---|---|---|---|
| SparseGPT | 11.00 | 9.11 | 7.16 | 6.28 | 10.17 | 8.32 | 5.40 |
| ADMM-Iter | 9.90 | 8.60 | 6.89 | 6.02 | 9.74 | 7.78 | 5.19 |
| DSnoT | 10.89 | 9.05 | 6.76 | 6.15 | 10.46 | 8.09 | 5.11 |
| BaWA | 10.32 | 7.94 | 6.37 | 5.61 | 9.93 | 7.13 | 4.84 |
| BaWA+DSnoT | 10.21 | 7.91 | 6.42 | 5.69 | 9.84 | 7.08 | 4.86 |
| BaWA+ADMM | 9.71 | 7.86 | 6.39 | 5.60 | 9.75 | 7.04 | 4.71 |
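
To summarize the two-stage view above in code, a minimal sketch follows; select_mask and reconstruct are hypothetical callables standing in for a pruning-metric stage (e.g. BaWA) and a weight-reconstruction stage (e.g. ADMM or DSnoT), not APIs from the paper or from those methods.

```python
def prune_layer(W, select_mask, reconstruct=None):
    """Two-stage pruning of one linear layer.

    Stage 1: select_mask(W) returns a boolean keep-mask computed from a
             pruning metric such as BaWA.
    Stage 2: reconstruct(W_pruned, mask) optionally adjusts the surviving
             weights, as ADMM- or DSnoT-style methods do.
    """
    mask = select_mask(W)          # stage 1: pruning mask selection
    W_pruned = W * mask            # zero out the pruned weights
    if reconstruct is not None:    # stage 2: orthogonal weight reconstruction
        W_pruned = reconstruct(W_pruned, mask)
    return W_pruned
```

The rows "BaWA+DSnoT" and "BaWA+ADMM" in the table correspond to supplying such a reconstruct step on top of the BaWA mask.
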
Official Review
Rating: 2

Existing pruning metrics are limited by their reliance on simple symbolic combinations of weights and activations, failing to account for imbalanced weight magnitudes and the disproportionate impact of activation outliers. To address these shortcomings, this paper introduces BaWA, a pruning metric that balances Weight and Activation distributions for more effective pruning. BaWA incorporates two key innovations:

  1. Magnitude Normalization, which mitigates weight imbalances across channels, enabling fairer pruning decisions.
  2. Outlier Regularization, which reduces the influence of activation outliers, ensuring more appropriate channel prioritization.

To further improve its effectiveness, BaWA includes an efficient, automated framework for optimizing normalization and regularization hyperparameters. Extensive experiments demonstrate that BaWA outperforms existing pruning metrics. For instance, applying BaWA to induce 2:4 sparsity in Mistral-7B reduces perplexity by 2.49 and increases average downstream task accuracy by 3.08%, surpassing the previous method Wanda.

Update after rebuttal

I revisited the paper and would like to keep my original rating for two reasons:

  1. The experimental comparisons are somewhat outdated. Except for Table 3, most of the comparisons are against Wanda (proposed in mid-2023) and SparseGPT, which is even older. In Table 4, the method is only slightly better than Wanda. Even in Table 3, the proposed method only shows a good improvement when combined with the weight reconstruction method ADMM; without it, the improvement seems marginal. So I feel the proposed method may not offer a significant advancement.
  2. Techniques like Magnitude Normalization and Outlier Regularization have already been extensively studied in previous work. This paper doesn't introduce anything particularly new or exciting for me.

I think this work is a slight extension of Wanda, with an additional normalization step applied to the weights. The method is reasonable but the contribution feels moderate to me; I'd be fine with the paper being either accepted or rejected.

Questions for the Authors

Please see above.

Claims and Evidence

The claims in the paper are well-supported by clear empirical evidence. However, the concepts of magnitude normalization and outlier regularization are not entirely novel, and the overall contribution may not appear significantly innovative.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate for the problem.

Theoretical Claims

This is an empirical paper; no theoretical claims are presented.

Experimental Design and Analysis

Yes, all parts.

Supplementary Material

Yes, all parts.

Relation to Prior Literature

The paper builds on prior work in LLM pruning that uses weight and activation magnitudes to guide pruning decisions (e.g., Wanda). While magnitude normalization and outlier regularization have been widely explored in the context of model adaptive sparsity and robust pruning, BaWA refines these ideas by introducing a balancing mechanism for weight and activation distributions. The contribution is a little bit incremental and limited for the community.

Missing Important References

No.

Other Strengths and Weaknesses

Strengths:

The motivation of the paper is clear, the proposed method is simple and easy to follow.

Weaknesses:

  1. The two proposed techniques, magnitude normalization and outlier regularization, are somewhat trivial in the context of LLM pruning and not significant enough.

  2. The novelty of the paper is somewhat limited; the contributions appear incremental, making the overall presentation less engaging.

  3. The proposed method involves additional complexity and computation compared with the baseline Wanda, especially with the search strategy.

Other Comments or Suggestions

Please see the comments above.

Author Response

Dear mDNk:

We sincerely appreciate your thoughtful feedback regarding BaWA’s novelty and computational overhead. Below, we address your concerns in detail.

Novelty of BaWA:

We respectfully disagree with the novelty concern for three key reasons:

(a) Problem Characterization in LLMs

  • Structural Outliers: LLMs exhibit extremely sparse activation outliers (>50% outliers in <5% channels, Fig 1c), unlike CNNs with uniform noise patterns [1].
  • Cross-Layer Imbalance: Weight magnitudes vary by 100× within layers (Fig 1a), violating CNN pruning assumptions.

(b) Methodological Advancements

  • Dual-Channel Normalization: Joint input/output scaling (Eq. 5) handles asymmetric LLM distributions, unlike standard normalization.

  • Dynamic Outlier Suppression: Learnable threshold θ3 adapts to layer-wise outlier density, improving upon fixed-threshold methods [2].

(c) Empirical Superiority

  • BaWA reduces perplexity by 2.49 over Wanda (Table 3), which is a substantial improvement.

  • Direct application of robust CNN pruning degrades accuracy by 4.1% (Table 8 in Appendix).

Computational Overhead:

Furthermore, the evaluation results in our paper show that the additional overhead of BaWA is negligible, in two respects.

(a) One-Time Search Cost: Searching a 70B model takes only 16 minutes (Table 6), <0.01% of typical training costs (thousands of GPU hours).

(b) Search-Free Mode: Even without search (BaWA w/o search), BaWA outperforms Wanda while maintaining the same pruning efficiency (Table 5).

References

[1] Li et al., Pruning Filters for Efficient ConvNets, ICLR 2017.

[2] Wei et al., Outlier Suppression+, EMNLP 2023.

Final Decision

This paper introduces a new metric for pruning language models, focused on addressing imbalanced weight magnitudes and activation outliers. While there is some disagreement about the novelty of the approach as it extends prior work, the proposed method improves over prior work and the reviewers agree it is clearly explained and experiments are sound. Other concerns raised by reviewers (including speed of pruning, scaling factor sensitivity, comparison to structured pruning methods) are largely addressed by the authors in the rebuttal. Based on this, we recommend this paper for acceptance.