PaperHub

NeurIPS 2024 · Poster · 3 reviewers
Overall rating: 5.3/10 (individual ratings: 6, 4, 6; min 4, max 6, std dev 0.9)
Average confidence: 4.3 · Soundness: 2.7 · Contribution: 3.0 · Presentation: 3.3

SlimGPT: Layer-wise Structured Pruning for Large Language Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-12-19

Abstract

Keywords
Model Compression, Structured Pruning, Layer-wise Pruning, Large Language Model

Reviews and Discussion

Review (Rating: 6)

This paper presents a novel SlimGPT framework to conduct structured pruning for LLMs in a fast and low-cost way. Specifically, SlimGPT modifies the Optimal Brain Surgeon (OBS) framework, and proposes a Batched Greedy Pruning to enhance the performance of head-wise pruning through Cholesky decomposition. SlimGPT also improves the FFN pruning efficiency via Dynamic Group Size. Besides, SlimGPT employs an Incremental Pruning Ratio in order to mitigate the error accumulation problem in layer-wise pruning. Experiments on the LLaMA, LLaMA-2, and other popular LLMs demonstrate that SlimGPT achieves a new SOTA on LLM pruning.

Strengths

  1. This paper presents a proper way to extend the OBS framework to structured pruning with strong theoretical foundation. The technical details are thorough and convincing.
  2. Extensive experiment results demonstrate that SlimGPT successfully achieves a new SOTA, surpassing all related works in this field.

Weaknesses

  1. SlimGPT employs an Incremental Pruning Ratio strategy. In L220, the article specifies that a logarithmic increasing strategy performs well and is employed in all experiments. Actually, this should be considered more carefully. Pruning can be viewed as a method to get rid of unnecessary information in the activations and only preserve necessary components for later layers. From this perspective, it resembles Token Merging [1,2]. It is already demonstrated in [1,2] that the performance loss caused by an aggressive pruning schedule in the first layers can be mitigated by re-training. Therefore, I suggest the authors test more increasing strategies beyond the logarithmic one, to fully utilize the power of SlimGPT.
  2. Pruning at a $p\%$ sparsity does not usually lead to a $\frac{1}{1-p\%}$ throughput speedup [2]. To demonstrate that SlimGPT actually helps to deploy LLMs, the authors should carefully compare the throughput (e.g., tokens per sec.) of pruned models generated by SlimGPT and the competing baselines.

[1] Token Merging: Your ViT But Faster (ICLR 2023)

[2] PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation (ECCV 2024)

Questions

Under the same total sparsity value, do different increasing schedules affect the model inference speedup?

Limitations

See Weaknesses and Questions. I do appreciate the theoretical contributions. So my rating may be adjusted after carefully checking the replies and other reviews.

Author Response

We truly appreciate the reviewer for the constructive comments.

W1: Further explanation about Incremental Pruning Ratio strategy.

In Section 5.3.2, we have discussed the impact of different pruning ratio strategies on performance, with detailed results presented in Table 5 (for convenience, Table 5 is reproduced below). Specifically, for the Incremental Pruning Ratio strategy, both logarithmic and linear approaches are employed. Each of these strategies offers distinct benefits: the linear approach achieves better Zero-shot Avg results but incurs a slight loss in PPL. Nevertheless, overall, whether the strategy is linear or logarithmic, the incremental scheme significantly outperforms uniform pruning or decrementing strategies.

| Strategy | Model Size | PPL | Zero-shot Avg. |
| --- | --- | --- | --- |
| log increase (SlimGPT) | 3.40B | 38.83 | 52.23 |
| linear increase | 3.34B | 46.57 | 53.45 |
| uniform | 3.50B | 123.05 | 44.34 |
| log decrease | 3.40B | 380.69 | 36.73 |
| linear decrease | 3.34B | 932.64 | 35.62 |
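For illustration, one simple way to construct such depth-increasing schedules is sketched below (an illustrative construction, not necessarily the exact parameterization used in the paper; the function name and arguments are hypothetical). Each layer is weighted by a logarithmic or linear function of its depth, then rescaled so the layer-averaged ratio matches the target sparsity.

```python
import numpy as np

def increasing_ratios(num_layers, target_ratio, mode="log"):
    """Per-layer pruning ratios that grow with depth (shallow layers pruned less,
    deep layers pruned more) while averaging to `target_ratio`.
    Assumes layers have roughly equal parameter counts; illustrative only."""
    depth = np.arange(1, num_layers + 1, dtype=float)
    weights = np.log1p(depth) if mode == "log" else depth
    ratios = target_ratio * weights * num_layers / weights.sum()
    return np.clip(ratios, 0.0, 0.99)

print(increasing_ratios(32, 0.5, mode="log")[[0, -1]])     # roughly [0.13, 0.66]
print(increasing_ratios(32, 0.5, mode="linear")[[0, -1]])  # roughly [0.03, 0.97]
```

As the printed endpoints suggest, the logarithmic variant of this sketch distributes sparsity more evenly across depth, while the linear one prunes the deepest layers much more aggressively.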

Please note that our experiments are conducted under low-resource conditions. After extensive fine-tuning on large-scale data, the performance differences resulting from various pruning ratio strategies will diminish further (Sheared-LLaMA [1], even with uniform pruning, mitigated the performance impact through subsequent large-scale training). However, under resource-limited conditions, selecting an appropriate pruning ratio strategy can reduce the reliance on subsequent training, thus minimizing performance loss. This is particularly crucial for LLMs, as a complete training cycle demands substantial resources and time.

[1]. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning.

W2: Compare the throughput (e.g., tokens per sec.) of pruned models generated by SlimGPT and the competing baselines.

Thank you for your valuable suggestion. Regarding the additional experiments on inference speed, we provide the experimental results and analysis in the "global response" at the top. Please refer to that response for detailed information.

Q: Under the same total sparsity value, do different increasing schedules affect the model inference speedup?

In general, under the same number of parameters, deeper models (with more layers) tend to have slower inference speeds. On the other hand, models with the same number of layers but different widths primarily experience variations in instantaneous computational load, which impacts speed differently depending on the performance and optimization of the GPU used.

Since our current method typically does not affect the number of layers, as long as there isn't an extreme variation in width distribution, inference speed should not be significantly impacted. This is supported by the experimental results from the previous question.

Comment

Dear Reviewer 8hG2:

We sincerely appreciate your valuable and insightful comments. With less than 24 hours remaining in the discussion period, we look forward to any further feedback from you.

We would like to further discuss the LLM inference speed. Since most operators are computed in parallel, the acceleration from pruning does not arise from a reduction in computational load but rather from a decrease in parameter access time, which is a significant bottleneck in large model inference. Therefore, the inference time is not linearly related to the parameter count. In our experiments, we find that pruning 50% of the parameters results in an inference latency that is 63% of the original.

Best Regards,
Authors of submission 8375

Review (Rating: 4)

This paper presents SlimGPT, a method for structured pruning of LLMs to balance performance with efficiency. The method is based on the OBS framework and introduces Batched Greedy Pruning to enhance pruning accuracy and speed. The authors also propose the Incremental Pruning Ratio strategy to mitigate performance loss due to error accumulation. Experimental results on LLaMA and other models demonstrate that SlimGPT achieves state-of-the-art results with significant improvements in performance retention compared to existing methods.

Strengths

  1. The paper introduces a novel method for structured pruning based on the OBS framework to accelerate large language models.
  2. The method is validated through comprehensive experiments on various models, showing improved performance and efficiency.

Weaknesses

  1. Since Table 9 illustrates the significant impact of the calibration dataset on pruning performance, I suspect that the selection of calibration samples may be more important than the design of the pruning criterion. The experimental results compared in the paper use different sampling strategies for the calibration dataset, so it is hard to evaluate the superiority of the proposed method.
  2. The design of the layer-wise sparsity is empirical with no theoretical analysis. Since the pruning within a layer is a greedy pipeline, it is unclear why the layer-wise sparsity design is not in a greedy pipeline.
  3. The inference speed should be provided for comparisons.

Questions

  1. In Figure 1, from my view, it seems that the output elements of the third head are all the smallest, so they are all reordered to the first head for pruning. My question is: what if not all elements of a head are of the same magnitude order? In this case, how is batched greedy pruning conducted?
  2. Can you provide some insights about the reason why the pruning results with 2048 samples and 2048 sequence length start to decrease in zero-shot average metric?

Limitations

Not applicable.

Author Response

We truly appreciate the reviewer for the constructive comments.

W1: The impact of the calibration dataset on the pruning performance.

Thank you for the valuable comment. Table 9 displays the pruning results of SlimGPT without fine-tuning (for convenience, Table 9 is reproduced below). When the calibration data is switched from C4 to Alpaca/GPT4-Alpaca, the PPL results indeed degrade (48.26/47.06 vs. 42.06). However, the Zero-shot Avg scores improve significantly (54.44/54.66 vs. 52.70), surpassing the fine-tuned results of all baselines.

| Dataset | PPL↓ | Zero-shot Avg.↑ |
| --- | --- | --- |
| C4 (SlimGPT) | 42.06 | 52.70 |
| Alpaca | 48.26 | 54.44 |
| GPT4-Alpaca | 47.06 | 54.66 |

This observation highlights a trade-off between PPL and Zero-shot Avg. Due to SlimGPT's inherent parameter compensation mechanism, it is sensitive to the quality of the input data, so there is a distinct difference in impact between pre-trained data and instruction-following data. Additionally, our experiments show that random open-source pre-trained data can achieve SOTA results. Therefore, we believe that using higher-quality data tailored to specific domains provides SlimGPT with greater potential for improvement compared to other methods.

W2: The design of the layer-wise sparsity is empirical.

The core concept of SlimGPT is derived from the OBS framework, which addresses global model pruning by breaking it down into layer-wise subproblems. Each layer is optimized sequentially from shallow to deep. Previous studies, such as OBC[1] and GPTQ[2], have demonstrated that this method yields excellent results across various domains. And our primary objective is to apply this framework to the structured pruning of LLMs.

We would like to discuss the viability of the layer-wise greedy pipeline. Due to the unidirectional influence between layers, the current layer is impacted solely by the preceding layer and remains unaffected by subsequent layers. If the pruning process is not executed sequentially, the local optimality at each step would be compromised. We believe this would complicate the task and significantly increase the computational time required.
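For illustration, the sequential layer-wise pipeline described above can be sketched as follows (a hypothetical outline, not the authors' code; `prune_layer` stands in for the per-layer OBS-style pruning step and each transformer block is treated as a simple callable). Each layer is pruned using activations that have already passed through its pruned predecessors, so upstream pruning errors are visible when pruning downstream.

```python
import torch

@torch.no_grad()
def layerwise_prune(layers, calib_hidden, ratios, prune_layer):
    """Prune transformer blocks sequentially from shallow to deep.
    layers       -- list of blocks, treated as callables on hidden states
    calib_hidden -- calibration activations entering the first block
    ratios[i]    -- sparsity assigned to block i (e.g. an increasing schedule)
    prune_layer  -- placeholder for the per-layer OBS-style pruning step"""
    x = calib_hidden
    for i, layer in enumerate(layers):
        prune_layer(layer, x, ratios[i])   # prune with inputs from already-pruned predecessors
        x = layer(x)                       # propagate through the freshly pruned block
    return layers
```

Because each step only needs the current block's weights and its calibration inputs, this pipeline also keeps the pruning-time memory footprint small, as quantified in the global response below.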

[1] Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning.
[2] GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers.

W3: The inference speed should be provided.

Thank you for your valuable suggestion. Regarding the additional experiments on inference speed, we provide the experimental results and analysis in the "global response" at the top. Please refer to that response for detailed information.

Q1: Further explanation on Figure 1.

As a structured pruning scheme, SlimGPT prunes attention blocks with the smallest pruning granularity at the head level. Consequently, heads are treated as indivisible units. We evaluate each head by summing the errors of all columns within it. For those interested in the specifics, our detailed algorithmic process is outlined in Algorithm 1.
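For illustration, a minimal sketch of this head-level scoring and the OBS-style elimination it feeds into is given below (an illustrative reconstruction under common OBS/GPTQ conventions, not the authors' released code; all function names and shapes are hypothetical). Each head's score is the sum of the column errors ||W[:, j]||^2 / Hinv[j, j] over its columns, so the ordering or magnitude of individual elements inside a head never splits the head.

```python
import numpy as np

def inverse_hessian(X, damp=0.01):
    """Inverse of the damped calibration Hessian H = X X^T + damp*mean(diag(H))*I,
    computed via a Cholesky factorization (GPTQ/OBC-style convention)."""
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Linv = np.linalg.inv(np.linalg.cholesky(H))
    return Linv.T @ Linv                                    # H^{-1} = L^{-T} L^{-1}

def head_scores(W, Hinv, head_dim):
    """Score each head by summing the OBS column errors ||W[:, j]||^2 / Hinv[j, j];
    heads are ranked and removed as indivisible units."""
    col_err = (W ** 2).sum(axis=0) / np.diag(Hinv)
    return col_err.reshape(-1, head_dim).sum(axis=1)

def prune_head(W, Hinv, head, head_dim):
    """Zero all columns of one head, compensating the kept weights with the OBS
    update W <- W - W[:, j] / Hinv[j, j] * Hinv[j, :] for each removed column."""
    W, Hinv = W.copy(), Hinv.copy()
    for j in range(head * head_dim, (head + 1) * head_dim):
        d = Hinv[j, j]
        W -= np.outer(W[:, j] / d, Hinv[j])                 # spread the error onto kept columns
        Hinv -= np.outer(Hinv[:, j], Hinv[j]) / d           # eliminate column j from H^{-1}
        W[:, j] = 0.0
    return W, Hinv

# Toy example: W plays the role of an attention output projection, X the
# calibration activations reaching it (shapes are made up for illustration).
rng = np.random.default_rng(0)
W, X = rng.normal(size=(512, 512)), rng.normal(size=(512, 2048))
Hinv = inverse_hessian(X)
worst = int(np.argmin(head_scores(W, Hinv, head_dim=64)))   # cheapest head to remove
W, Hinv = prune_head(W, Hinv, worst, head_dim=64)
```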

Q2: Can you provide some insights about the reason why the pruning results with 2048 samples and 2048 sequence length start to decrease in zero-shot average metric?

Thank you for the comment. The Zero-shot Avg is essentially a mean value, which can be easily influenced by sub-tasks with significant fluctuations. To facilitate analysis, we provide detailed results for the commonsense reasoning tasks below.

| Experiment | Sample Size | Sequence Length | BoolQ | PIQA | HellasW | WinoG | ARC-e | ARC-c | OBQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (a) | 512 | 64 | 63.12 | 68.82 | 51.34 | 57.62 | 46.17 | 29.35 | 35.00 | 50.20 |
| (a) | 512 | 1024 | 67.22 | 71.06 | 55.23 | 59.27 | 53.45 | 31.14 | 36.60 | 53.42 |
| (a) | 512 | 2048 | 66.73 | 71.60 | 55.35 | 57.30 | 53.11 | 32.42 | 35.60 | 53.16 |
| (b) | 128 | 512 | 65.66 | 69.37 | 54.07 | 58.33 | 50.04 | 30.55 | 35.2 | 51.89 |
| (b) | 256 | 512 | 63.76 | 70.29 | 54.47 | 58.88 | 53.07 | 30.89 | 34.6 | 52.28 |
| (b) | 1024 | 512 | 65.60 | 71.65 | 54.35 | 57.46 | 53.07 | 30.63 | 35.40 | 52.59 |
| (b) | 2048 | 512 | 63.30 | 71.71 | 54.79 | 57.38 | 53.2 | 31.66 | 34.60 | 52.38 |

As the sequence length increases from 64 to 1024, there is a noticeable improvement in the metrics across various tasks. However, when the length is further increased to 2048, the rate of improvement slows down, and some metrics even decline, particularly for WinoGrande (59.27 vs. 57.3). We believe that since most commonsense reasoning tasks involve short text data, the impact of SlimGPT on these tasks may diminish when the sequence length of the calibration set exceeds a certain threshold. In fact, it could even weaken the model's understanding of short texts.

When the sample size increases, the changes across the various subtasks are less consistent and exhibit smaller magnitudes. As shown in Figure 3 of the paper, the fluctuations in Zero-shot Avg in subplot (a) are significantly smaller than those in subplot (b), and there are two instances of decline. Both of these declines result from fluctuations in the BoolQ dataset (refer to the table above). Thus, we hypothesize that the BoolQ dataset is particularly sensitive to random sampling in the calibration set, which may result in the drop in Zero-shot Avg.

Comment

Dear Reviewer QvFZ:

As the discussion period comes to a close, we sincerely look forward to your feedback on our rebuttal. Your further opinions would be essential for us to improve our work.

Regarding your concern that the calibration set is more important than the pruning method, we include additional experiments with LLM-Pruner (our baseline). We perform pruning without fine-tuning at the same pruning ratio, yielding an evaluation result of PPL=136.19. In contrast, the worst result for SlimGPT is PPL=48.26. Therefore, we believe that while the calibration set may have a slight influence on the bias of model pruning, it does not fundamentally affect the pruning results.

Best Regards,
Authors of submission 8375

Review (Rating: 6)

The authors proposed a layer-wise pruning approach called SlimGPT that follows the Optimal Brain Surgeon framework but with a batched pruning procedure utilized to make it feasible on large models while remaining structured. The authors claim near-optimal pruning performance on commonsense reasoning datasets and wikitext ppl.

Strengths

Models pruned via structured pruning approaches can naturally gain efficiency benefits, and the proposed method supposedly inherits this important property (though missing some efficiency reports). The task- and relatively architecture-agnostic nature of SlimGPT also makes it score well in terms of adaptability. The experiment reports indicate SlimGPT beats three other SOTA methods by a healthy margin (though its alignment requires some additional polish).

Weaknesses

The main weakness of the paper is its eval, both in terms of alignment and coverage.

  1. Unaligned evaluation: Many of the compared baselines utilized a different calibration dataset and procedure, but it looks like only the LLM-Pruner is replicated in an aligned setup, while all results for other methods are copied from their original literature. This needs to be controlled, especially with only three methods to compare.

  2. Over-emphasis on the Llama family: The presented evaluation is solely conducted on various Llama 1/2 and Vicuna models, which are all Llama family-based. More coverage of other popular LLMs should be included. I'd recommend a healthy selection from Mistral, Phi, Gemma, Qwen, and the like.

  3. Only on zero-shot commonsense reasoning tasks: As mentioned around line 243, the real task evaluation of this paper is conducted "under a zero-shot setting on the Commonsense Reasoning datasets, which encompass seven diverse subtasks..." This is not comprehensive enough. Common intelligence datasets like MMLU and GSM8k in typical few-shot setups should also be reported. Additionally, given the weak generation/long context performance in some recent layer pruning (but not layer-wise structured pruning in finer granularity) work like ShortGPT, I'd like to see SlimGPT evaluated on tasks like LongBench, InfinityBench, Needle-In-A-Haystack retrieval, and HumanEval on models with longer context window (e.g., mistral 32k).

  4. Incomplete efficiency report: There are no throughput or latency results, which are key efficiency metrics for conducting structured pruning in the first place and must be reported in efficiency literature, especially because different structured pruning methods can yield different inference efficiencies, leading to work like ShearedLlaMA proposing targeted structured pruning. Also, the authors claim a "low-cost, low-resource, and time-efficient compression scheme" as their contribution in line 65, but there is no runtime or memory report on its pruning procedure.

  5. Not really a weakness, but the authors should consider giving a more detailed introduction of the compared baselines either in related work or around line 246.

Despite my score of 4, I believe many of the raised concerns are addressable, as they mostly just call for more evals, and I am open to improving my rating upon a proper rebuttal.

Questions

  1. Where is the plot for the unpruned model in Figure 2? What pruning method is applied here? How much performance can fine-tuning recover in this setup?

  2. Is a pruned model superior to a smaller pretrained model? Like can a SlimGPT-pruned 13B model with a pruning ratio set around 45% be more performant than a 7/8B model? From the look of Tables 1 & 2, it seems a 50% pruned 13B model is significantly inferior to a 7B, and a 50% pruned 30B model is required to provide similar zero-shot task performance to an unpruned 7B, so this is unlikely. If confirmed, I am afraid this massively discounts the contribution of this work, as one can often just adopt a smaller pretrained model with no pruning or calibration necessary. Though I understand different application scenarios may call for models with different sizes, where the pretrained models can't cover them all.

Limitations

The authors have provided a limitation and checklist section.

Author Response

We truly appreciate the reviewer for the constructive comments. Due to space limitations, we try to answer your questions concisely and convincingly.

W1: Unaligned evaluation.

We acknowledge your concerns. Achieving a fully aligned experimental setup is challenging since pruning tasks differ from tasks like model optimization, and various pruning methods have unique principles and data requirements. For instance, LLM-Pruner has two stages: pruning and fine-tuning, while Compresso and LoRAPrune involve only one stage of sparse training. Our method, SlimGPT, also follows a two-stage process, aligning it with LLM-Pruner. Given the differing principles of Compresso and LoRAPrune, we make a compromise in terms of aligning the experimental environment. However, we believe that this does not affect the conclusions of our paper. In fact, we achieve better pruning results using fewer data resources or fewer iterations.

W2: Over-emphasis on the Llama family.

Thank you for your valuable comments. In our model selection, we referenced previous works and the influence of the LLama and Vicuna models, but overlooked that Vicuna is derived from LLama. Due to time constraints, we choose Baichuan-7b, a prominent model in the Chinese community, for our experiments. This model comes with comprehensive MMLU and C-Eval evaluation scripts and is supported by LLM-Pruner, which aids our quick verification.

Using the same setup as the paper, we prune Baichuan-7b by 20% with LLM-Pruner and SlimGPT. SlimGPT achieves slightly better PPL results (19.73 vs. 19.85) and significantly outperforms LLM-Pruner in all commonsense reasoning tasks (58.8 vs. 56.38).

| Prune% | Method | #Params | PPL↓ | BoolQ | PIQA | HellasW | WinoG | ARC-e | ARC-c | OBQA | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | - | 7B | 13.25 | 70.09 | 76.93 | 70.06 | 64.09 | 67.05 | 40.53 | 38.20 | 60.99 |
| 20% | LLM-Pruner | 5.7B | 19.85 | 62.87 | 74.48 | 63.03 | 60.93 | 60.31 | 36.86 | 36.20 | 56.38 |
| 20% | SlimGPT w/o tune | 5.7B | 20.01 | 69.17 | 75.03 | 65.25 | 61.25 | 64.94 | 35.15 | 36.60 | 58.20 |
| 20% | SlimGPT | 5.7B | 19.73 | 66.21 | 75.03 | 66.74 | 63.85 | 62.84 | 38.91 | 38.00 | 58.80 |

W3: Only on zero-shot commonsense reasoning tasks.

Thank you for your suggestions. We evaluated our model based on the official LLama report and previous research, focusing on language modeling performance (PPL) and zero-shot commonsense reasoning. While we believe these tasks are convincing, we recognize they may be insufficient for a complete assessment. To improve our evaluation, we perform 5-shot tests on Baichuan-7b using MMLU and the cross-lingual C-Eval task, as shown in the table below.

The 5-shot evaluation results for MMLU show SlimGPT outperforming the baseline (35.4 vs. 24.3), and it also excels on the cross-lingual C-Eval dataset (28.7 vs. 22.4), despite a performance drop on C-Eval after finetuning on Alpaca.

| Dataset | Prune% | Method | #Params | Humanities | Social Sciences | STEM | Other | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | - | - | 7B | 38.1 | 49.2 | 35.2 | 47.7 | 42.1 |
| MMLU | 20% | LLM-Pruner | 5.7B | 24.8 | 23.4 | 21.9 | 26.8 | 24.3 |
| MMLU | 20% | SlimGPT w/o tune | 5.7B | 29.7 | 38.0 | 31.3 | 38.9 | 34.0 |
| MMLU | 20% | SlimGPT | 5.7B | 32.2 | 39.2 | 31.1 | 40.2 | 35.4 |
| C-Eval | - | - | 7B | 47.2 | 50.5 | 36.4 | 45.5 | 43.3 |
| C-Eval | 20% | LLM-Pruner | 5.7B | 23.5 | 24.9 | 21.1 | 21.2 | 22.4 |
| C-Eval | 20% | SlimGPT w/o tune | 5.7B | 32.9 | 34.3 | 28.3 | 30.9 | 31.0 |
| C-Eval | 20% | SlimGPT | 5.7B | 31.8 | 33.1 | 24.2 | 29.9 | 28.7 |

Regarding your suggestion for experiments on long-context performance, we recognize their importance but, due to time constraints, we won’t be able to conduct them at this moment. We plan to include them in future work for a more comprehensive study.

W4: Incomplete efficiency report.

Thank you for your suggestion. We include the experimental results and analysis on inference speed and pruning efficiency in the "global response" section, as other reviewers have similar questions. Please refer to that for detailed information.

W5: Detailed introduction of the compared baselines.

Due to space constraints, we have omitted the baseline introduction in the final paper version. We apologize for any inconvenience and will include it in the updated version.

Q1: Further explanation on Figure 2.

Figure 2 shows the output errors of various pruned models relative to the unpruned model, which has an error of zero and is not displayed (its PPL is 12.63). In our pruning approach, we utilize SlimGPT with the Alpaca dataset, pruning only the first layer to minimize output errors.

To assess performance recovery after finetuning, we present the PPL results below, where all models are finetuned with LoRA for one epoch using Alpaca. We observe that the PPL does not improve after finetuning, even for the unpruned model. We believe this may be due to the distribution gap between the instruction-following dataset Alpaca and the test dataset Wikitext2, potentially causing overfitting of the unpruned weights and poorer performance on the test data, which requires further validation.

| Model | PPL (w/o tune) | PPL (w/ tune) |
| --- | --- | --- |
| LLama-7B | 12.63 | 15.63 |
| Layer0-prune-25% | 12.86 | 16.86 |
| Layer0-prune-50% | 13.98 | 17.31 |
| Layer0-prune-75% | 21.49 | 31.27 |

Q2: Is a pruned model superior to a smaller pretrained model?

This topic deserves discussion. Under low-resource conditions, pruned models after finetuning generally perform worse than smaller pre-trained ones, as shown in our work and previous research (LLM-Pruner). However, with full training, pruned models can outperform pretrained models of a similar size, as in Sheared-LLaMA [1]. Thus, pruning may serve as a strong initialization that lessens the need for extensive training.

Our aim is to focus specifically on the task of LLM pruning itself. Under constrained resource conditions, such as when a more compact version of the model (e.g., less than 1B parameters) is required for edge-side deployment, LLM pruning and compression provide a cost-effective solution.

[1]. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning.

Comment

I thank the authors for the detailed rebuttal. The new results — especially on the efficiency front — look decent and thus warrant a slight bump to 5. However, to fully convince me, I'd like to see SlimGPT applied on truly challenging tasks like GSM8k, as well as long context tasks like LongBench and Needle-in-a-Haystack (on llama3 and mistral v0.2). I am particularly interested in the long context front due to the known drawbacks of ShortGPT.

With the unaligned nature discussed in W1, it looks like the proposed method is mostly compared to LLMPruner. While I recognize that LLMPruner is an established method, I wonder how SlimGPT would perform against some of the more advanced methods, like APT [2].

p.s. While I appreciate the addition of Baichuan, I would still like to see the MMLU report on a more mainstream model, like llama2-7b, for better cross-referencing needs. May the authors supply that?


[2] APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Comment

Dear Reviewer jqTK:

We truly appreciate the reviewer's recognition of our work. In response to your suggestions regarding extra experiments, we add the following evaluations:

Evaluation of LLama2-7b on MMLU and the mathematical reasoning task GSM8k.

| Prune% | Method | #Params | Humanities | Social Sciences | STEM | Other | Avg | GSM-8k-Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | 43.3 | 51.6 | 36.3 | 52.1 | 45.6 | 13.8 |
| 20% | LLM-Pruner | 5.4B | 25.7 | 23.6 | 24.2 | 26.8 | 25.2 | 2.3 |
| 20% | SlimGPT w/o tune | 5.4B | 36.0 | 45.2 | 33.5 | 44.1 | 39.4 | 4.2 |
| 20% | SlimGPT | 5.4B | 35.3 | 42.2 | 31.5 | 43.0 | 37.8 | 6.0 |

The first few columns display the results for MMLU, while the last column shows the evaluation results for GSM-8k. SlimGPT demonstrates significant improvements over the baseline in both tasks. Notably, for the challenging task of GSM-8k, LLM-Pruner retains 16.7% of the performance (2.3 vs. 13.8) after pruning, whereas SlimGPT retains 43% of the performance (6.0 vs. 13.8), achieving more than a twofold increase. Since we employ a very basic and low-cost fine-tuning method, we believe there is room for further performance enhancement.

Evaluation of Mistral-7B-Instruct-V2.0 on LongBench

Because Mistral's 32k context window is more representative, we present the evaluation results for Mistral-7B-Instruct-V2.0 on LongBench, hoping this will serve as a valuable reference.

The model evaluation took longer than we anticipated, so we primarily provide results for the single-document QA tasks, as shown in the table below. The performance of the pruned model varies across the English tasks, retaining as much as 97% of the original capability in some instances (30.54 vs. 31.47). However, its performance on the cross-language task MultiFieldQA-zh is comparatively modest. We find that this can be largely attributed to the Alpaca fine-tuning, which biases the model towards answering questions in English, consequently lowering the overall scores. For example, consider the following case.

{'pred': 'The appeals court determined that the defendant should pay a compensation amount of RMB 57,081.86.', 'answers': ['人民币57081.86元。'], 'all_classes': None, 'length': 2975}

| Prune% | Method | #Params | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | 27.33 | 31.47 | 48.59 | 58.17 |
| 20% | SlimGPT | 5.4B | 18.89 | 30.54 | 40.47 | 24.75 |

SlimGPT vs. APT.

The APT paper actually describes a sparse training pruning method. This approach manually inserts masks before and after specific modules and measures the importance of channels/heads using outliers. Its advantage lies in an end-to-end design; however, its task-specific nature and strong coupling with LoRA complicate its application to other tasks.

Based on the results provided in the paper, we conduct a rough comparative analysis. While the different pruning ratios make it difficult to directly compare the evaluation results, it is evident that SlimGPT, with 1 epoch of naive LoRA tuning, offers a more lightweight and straightforward setup without requiring alterations to the model structure.

| Method | Prune% | Tuning Data | Tuning Epochs | LoRA Rank | HellaSwag Eval | MMLU Eval |
| --- | --- | --- | --- | --- | --- | --- |
| APT | 30% | Alpaca | 15 | 8-256 | 71.1 | 36.9 |
| SlimGPT | 20% | Alpaca | 1 | 8 | 74.9 | 37.8 |

Best Regards,
Authors of submission 8375

Comment

I'll take a closer look at the rest soon, but may the author please address the question as titled? Just want to post earlier so that you can have a chance to reply. I'd also like to see APT on GSM8k under a comparable setting.

Comment

Dear Reviewer jqTK:

We would like to thank the reviewer for the comments.

Why can't you prune SlimGPT 30% to be comparable with your APT report?

Our analysis is based on the results provided in the APT paper and the existing results from the SlimGPT paper. Unfortunately, we do not have sufficient time to conduct additional experiments to prune the model by 30% for comparison now (the experiment is ongoing). We hope you can understand this limitation. Furthermore, the APT paper does not provide evaluation results for GSM8K, so we are currently unable to present GSM8K comparison results.

More evaluation of long-context tasks on LongBench.

We have updated our evaluation of long-context tasks on LongBench, covering 4 task types, as shown in the tables below. Compared to the previous version, the current evaluation fixes bugs related to the input format, leading to slight changes in performance.

According to the tables, we find that the performance of the pruned model varies on English tasks; in some cases, it even exceeds the results of the original model (2WikiMQA: 28.34 vs. 26.32). The majority of tasks maintain over 80% of the original performance.

  • Evaluations on Single-Doc QA tasks

| Prune% | Method | #Params | NarrativeQA | Qasper | MultiFieldQA-en | MultiFieldQA-zh |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | 27.32 | 31.47 | 48.57 | 49.06 |
| 20% | SlimGPT | 5.4B | 19.33 | 30.37 | 42.64 | 26.65 |

  • Evaluations on Multi-Doc QA tasks

| Prune% | Method | #Params | HotpotQA | 2WikiMQA | Musique | DuReader (zh) |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | 43.11 | 26.32 | 18.81 | 30.57 |
| 20% | SlimGPT | 5.4B | 38.13 | 28.34 | 15.16 | 13.98 |

  • Evaluations on Summarization tasks

| Prune% | Method | #Params | GovReport | QMSum | MultiNews | VCSUM (zh) |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | - | 22.92 | 25.46 | 14.91 |
| 20% | SlimGPT | 5.4B | - | 19.95 | 22.78 | 12.49 |

  • Evaluations on Few-shot Learning tasks

| Prune% | Method | #Params | TREC | TriviaQA | SAMSum | LSHT (zh) |
| --- | --- | --- | --- | --- | --- | --- |
| - | - | 6.7B | 68.50 | 86.98 | 42.00 | 39.00 |
| 20% | SlimGPT | 5.4B | 56.00 | 82.59 | 41.46 | 22.75 |

Best Regards,
Authors of submission 8375

Comment

Per your global rebuttal, pruning a Llama2-7B to 50% only takes a little over 1,000 seconds, which is almost a trivial cost, and a few epochs of Alpaca are often pretty fast too. So what is actually slowing things down to the point that you "do not have sufficient time to conduct additional experiments to prune the model by 30% for comparison now (The experiment is ongoing)"?

Just trying to make sure I am not missing any major things.

Comment

Dear Reviewer jqTK:

In fact, all of our hardware resources were dedicated to running the LongBench experiments, which exceeded our expectations in terms of resource demands. Just before our previous response, we completed the LongBench experiments. We have now finished the 30% pruning experiment for LLaMA2-7B and the subsequent evaluation. Below, we present our experimental results (with APT data referenced from the paper). We primarily focus on comparing the HellaSwag and MMLU tasks, as these two datasets are used for evaluation in both papers. We train for 3 epochs during fine-tuning. SlimGPT performs slightly lower than APT on MMLU but demonstrates better performance on HellaSwag. It is also important to note that SlimGPT requires fewer iterations and fewer trainable LoRA parameters during the fine-tuning phase.

| Method | Prune% | Tuning Data | Tuning Epochs | LoRA Rank | HellaSwag Eval | MMLU Eval |
| --- | --- | --- | --- | --- | --- | --- |
| APT | 30% | Alpaca | 15 | 8-256 | 71.10 | 36.9 |
| SlimGPT | 30% | Alpaca | 3 | 8 | 72.42 | 35.2 |

Best Regards,
Authors of submission 8375

Comment

Thank you for being resourceful and adding many requested experiments during the rebuttal time. The added results confirm my intuition: that cheap, non-ShearedLlama-like LLM pruning techniques do not perform well under rigorous evaluation. Your added results on GSM8k and LongBench confirm that, as there are visible drops with just 20% pruned. Note that we usually don't observe such a performance drop with techniques like weight-only quantization at a much more aggressive rate, even with vanilla group-wise quantization and no finetuning.

That being said, I recognize that pruning LLMs is much harder than quantizing LLMs. There surely are some benefits unique to the pruning approach, and overall, pruning is without a doubt a school of efficiency worth developing, especially knowing its gap with quantization. The proposed method is better than or at least on par with the established/recent baselines, so I recommend acceptance with score 6 & confidence 5. But I urge the authors to:

  • Tone down the claim a bit, e.g., the "98% performance" claim in your abstract is slightly misleading. Most common-sense reasoning tasks are easy and do not represent LLM's true capability, so it is almost an overstatement based on cherry-picked results.
  • Highlight the results that are not perfect (MMLU, GSM8k, LongBench, etc.) so that future works will have a clear direction for improvement, instead of always muddling those easy tasks.
  • Add a proper section to discuss the pros/cons of pruning compared to other efficiency techniques (e.g., quantization) and their unique challenges.

Comment

Dear Reviewer jqTK:

Thanks for your valuable feedback. We sincerely appreciate your time to review our submission and response. We will revise the paper accordingly and incorporate the above results into the updated version.

Best Regards,
Authors of submission 8375

Author Response

Dear Reviewers,

We sincerely appreciate your valuable and insightful comments. Here we would like to address the concerns regarding inference speed and pruning efficiency raised by all reviewers.

Inference speed and memory usage report.

As the inference speed is primarily influenced by the final model structure and is not specifically tied to the pruning algorithm used (typically, the number of layers does not decrease), we initially omitted the inference runtime report. To demonstrate that SlimGPT actually helps to deploy LLMs, we provide the inference speeds for LLama-7b with 20% and 50% pruning, as shown in the table below. The batch size is set to 1, the maximum output limit is 512, and the average value is taken over 50 inference runs. Additionally, we examine the impact of two different pruning ratio strategies on inference speed: uniform pruning and the Incremental Pruning Ratio strategy. All supplementary experiments were conducted on an NVIDIA H20 GPU.

In the case of pruning 50% of the parameters using the log increase strategy, the model's memory usage during inference is reduced to 51% (14297MB vs. 27737MB), and the inference latency decreases to 63% (9.21ms vs. 13.51ms). When employing uniform pruning, both memory usage and latency experience slight reductions, although it is important to note that the parameter counts are not entirely equivalent between the two methods.

| Prune% | Strategy | #Params | Memory | Avg Latency (per token) |
| --- | --- | --- | --- | --- |
| - | - | 6.7B | 27737MB | 13.51ms |
| 20% | log-increase | 5.4B | 22497MB | 11.89ms |
| 50% | log-increase | 3.4B | 14297MB | 9.21ms |
| 50% | uniform | 3.4B | 13793MB | 9.05ms |
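For reference, a per-token latency measurement along these lines can be sketched with a simple generation benchmark (a sketch only; the model path and prompt are placeholders, and absolute numbers depend on the GPU, runtime, and generation settings):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_latency_per_token(model_path, prompt, runs=50, max_new_tokens=512):
    """Average decoding latency per generated token with batch size 1,
    mirroring the protocol above (max output 512, averaged over 50 runs)."""
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto")
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    total_time, total_tokens = 0.0, 0
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.time()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        total_time += time.time() - start
        total_tokens += out.shape[1] - inputs["input_ids"].shape[1]
    return 1000.0 * total_time / total_tokens   # milliseconds per generated token
```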

Pruning runtime and memory usage report.

Regarding the runtime and memory usage during the pruning procedure, we briefly mentioned that all pruning processes can be completed within 1 GPU hour (using A100 hardware). Specifically, the memory usage varies depending on the model size and the calibration size, and the pruning speed is additionally influenced by the pruning ratio. We present the pruning efficiency results under the experimental setup of our paper.

Since SlimGPT operates in a layer-wise manner, we don't need to load the entire model but only load the parameters of the current layer and the corresponding input features at one time, which significantly reduces memory usage. For the task of pruning the 13B model by 50%, we only require 12 GB of GPU memory and 41 minutes to complete the process.

| Model | Memory | Prune-20% Runtime | Prune-50% Runtime |
| --- | --- | --- | --- |
| LLama-7b | 7375M | 678.4s | 1073.9s |
| LLama-13b | 11601M | 1417.1s | 2475.3s |
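To illustrate the layer-wise loading described above, the calibration features entering a single layer can be captured with a forward pre-hook along these lines (a hypothetical sketch; `forward_fn` is assumed to run the partially pruned model on one calibration batch):

```python
import torch

@torch.no_grad()
def collect_layer_inputs(layer, forward_fn, calib_batches):
    """Capture the hidden states entering `layer` during calibration passes.
    Only this layer's weights plus the captured features need to be resident
    on the GPU while the layer itself is being pruned."""
    captured = []
    handle = layer.register_forward_pre_hook(
        lambda module, args: captured.append(args[0].detach().cpu()))
    for batch in calib_batches:
        forward_fn(batch)          # forward pass of the (partially pruned) model
    handle.remove()
    return captured
```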
Final Decision

The authors introduced a low-cost and fast structured pruning method named SlimGPT. Experimental results show that SlimGPT outperforms other methods and achieves state-of-the-art performance.

We appreciate the responses from the authors and the discussions with reviewers. Please add all the mentioned experiments and analyses to the final version.