PaperHub
Overall score: 6.4 / 10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.0
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.8
NeurIPS 2025

Computation and Memory-Efficient Model Compression with Gradient Reweighting

OpenReview · PDF
Submitted: 2025-05-02 · Updated: 2025-10-29

Abstract

Pruning is a commonly employed technique for deep neural networks (DNNs) aiming at compressing the model size to reduce computational and memory costs during inference. In contrast to conventional neural networks, large language models (LLMs) pose a unique challenge regarding pruning efficiency due to their substantial computational and memory demands. Existing methods, particularly optimization-based ones, often require considerable computational resources in gradient estimation because they cannot effectively leverage weight sparsity of the intermediate pruned network to lower computation and memory costs in each iteration. The fundamental challenge lies in the need to frequently instantiate intermediate pruned sub-models to achieve these savings, a task that becomes infeasible even for moderately sized neural networks. To this end, this paper proposes a novel pruning method for DNNs that is both computationally and memory-efficient. Our key idea is to develop an effective reweighting mechanism that enables us to estimate the gradient of the pruned network in the current iteration via reweighting the gradient estimated on an outdated intermediate sub-model instantiated at an earlier stage, thereby significantly reducing model instantiation frequency. We further develop a series of techniques, e.g., clipping and a preconditioning matrix, to reduce the variance of gradient estimation and stabilize the optimization process. We conducted extensive experimental validation across various domains. Our approach achieves 50% sparsity and a 1.58$\times$ speedup in the forward pass on the Llama2-7B model with only 6 GB of memory usage, outperforming state-of-the-art methods with respect to both perplexity and zero-shot performance. As a by-product, our method is highly suited for distributed sparse training and can achieve a 2$\times$ speedup over the dense distributed baselines.
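
To make the reweighting idea concrete, the following is a minimal illustrative sketch, assuming a factorized Bernoulli mask distribution with logits theta: losses of sub-models sampled earlier from a stale distribution are reweighted by a clipped importance ratio between the current and outdated distributions before forming a score-function gradient estimate. Variable names, shapes, and the clipping threshold are assumptions, not the authors' implementation.

```python
import torch

def reweighted_mask_gradient(theta, theta_old, masks, losses, clip=2.0):
    """Illustrative sketch: estimate d E_{m~p_theta}[L(m)] / d theta using
    masks sampled earlier from the stale distribution p_{theta_old}.

    theta, theta_old: (d,) logits of the Bernoulli keep-probabilities
    masks:  (S, d) binary masks drawn from Bernoulli(sigmoid(theta_old))
    losses: (S,)   losses of the corresponding stale sub-models
    """
    p_new, p_old = torch.sigmoid(theta), torch.sigmoid(theta_old)
    # log-probability of each stale mask under the new and old distributions
    logp_new = (masks * p_new.log() + (1 - masks) * (1 - p_new).log()).sum(-1)
    logp_old = (masks * p_old.log() + (1 - masks) * (1 - p_old).log()).sum(-1)
    # clipped importance weights reweight the stale samples toward the current theta
    w = (logp_new - logp_old).exp().clamp(max=clip)
    # score function of a Bernoulli(sigmoid(theta)) factor: d log p / d theta = m - sigmoid(theta)
    score = masks - p_new
    return (w[:, None] * losses[:, None] * score).mean(0)
```
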
Keywords
model compression, deep neural networks, sparse training, deep learning

Reviews and Discussion

Review (Rating: 4)

This paper proposes an optimization-based pruning method for large language models (LLMs). The key contribution is a reweighting technique that allows the reuse of outdated sub-models to estimate gradients for structural parameters, reducing the need to frequently instantiate new sub-networks. The method also incorporates variance reduction techniques to stabilize training. Experimental results demonstrate memory savings, inference speedup, and competitive performance compared to state-of-the-art pruning methods. The paper also extends the method to a distributed sparse training setting.

Strengths and Weaknesses

Strengths

  1. The method allows using an outdated sub-model for multiple pruning steps, which significantly reduces computational cost.

  2. The method achieves competitive or superior performance compared to several baselines in terms of perplexity, memory usage, and inference speed.

  3. The paper is well organized and easy to follow.

Weaknesses

  1. The introduction states that "Consequently, LLMs pose a unique challenge for pruning", but this claim lacks sufficient justification. A clearer explanation should be provided.

  2. The theoretical parts in Section 4 are sometimes dense and hard to follow without strong background knowledge.

  3. Comparisons with more recent optimization-based methods are limited. Most baselines are metric-based.

  4. typos: caption of Figure 1 ("vanila" -> "vanilla")

Questions

Please refer to Strengths and Weaknesses

Limitations

The authors have not discussed the limitations and potential negative societal impact of their work. It is recommended that these be included in the final revision.

Formatting Issues

n/a

Author Response

W1: Explanation for "LLMs pose a unique challenge for pruning".

As we mentioned in the main text, unlike conventional DNNs, LLMs face unique challenges in pruning, primarily due to the asymmetry in resources (data volume and memory) between training and pruning. Conventional DNNs, e.g., ResNet for image classification, can typically be pruned under resource conditions similar to those used during training, such as comparable datasets and memory (e.g., the ImageNet dataset and 4 GeForce RTX 4090 GPUs), which is generally affordable for downstream practitioners. In contrast, training LLMs often relies on extremely large-scale computational resources and massive data. For example, LLaMA3-405B was trained with up to 16K H100 GPUs and approximately 15T training tokens [r1], which far exceeds what is accessible during the pruning phase. This characteristic of LLMs therefore presents unique challenges for pruning methods, distinct from those encountered in conventional DNNs.

W2: Theoretical parts in Section 4 are dense and hard to follow.

We will move some of the theoretical parts to the appendix in the revision.

W3: Comparison with optimization-based baselines is limited.

Thanks for your suggestion. We additionally include experiments with the optimization-based pruning methods Compresso, NutePrune, LoRAShear [r2], and MaskPrune [r3] across different models and prune rates. More comparisons will be included in the revised version if accepted. For a fair comparison under equal training time, we extend the training epochs of our method to fully exploit its efficiency. We additionally report the time and memory overhead incurred by each method during the pruning process. The results are summarized below.

Table R1. Comparison with various optimization-based methods on Llama2-7B. To ensure a consistent training time budget, we extend the training epoch accordingly. Meanwhile, we also record the memory overhead for reference. The fine-tuned version is trained on the Alpaca dataset.

| Prune rate | Method | WikiText2 PPL↓ | Ptb PPL↓ | WikiText2 PPL (&Fine-tune)↓ | Ptb PPL (&Fine-tune)↓ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|---|---|
| 0% | - | 5.47 | 8.39 | - | - | - | - |
| 20% | Ours(1 Epoch) | 9.67 | 13.94 | 8.35 | 13.18 | 4.8h | 10224.34 |
| 20% | Ours(2 Epochs) | 9.44 | 13.70 | 8.02 | 12.90 | 9.6h | 10224.34 |
| 20% | Ours(3 Epochs) | 9.35 | 13.48 | 7.94 | 12.57 | 14.5h | 10224.34 |
| 20% | Compresso | 11.47 | 21.25 | 9.92 | 18.39 | 16.8h | 33298.74 |
| 20% | NutePrune | 9.88 | 16.64 | 8.74 | 14.90 | 13.4h | 37460.78 |
| 50% | Ours(1 Epoch) | 26.83 | 44.23 | 14.39 | 30.07 | 3.6h | 6390.21 |
| 50% | Ours(2 Epochs) | 23.46 | 41.79 | 12.79 | 25.42 | 7.2h | 6390.21 |
| 50% | Ours(3 Epochs) | 21.52 | 38.69 | 12.55 | 25.20 | 10.7h | 6390.21 |
| 50% | Compresso | 49.28 | 78.90 | 40.83 | 67.08 | 16.8h | 33298.74 |
| 50% | NutePrune | 24.89 | 40.83 | 12.94 | 26.04 | 13.4h | 37460.78 |

Table R2. Comparison with various optimization-based methods on Llama-7B. Zero Shot task includes BoolQ, PIQA, HellaSwag, Winogrande, ARC-e, ARC-c and OBQA.

| Prune rate | Method | Zero Shot Avg. Score↑ | Zero Shot Avg. Score (&Fine-tune)↑ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|
| 0% | - | 63.25 | - | - | - |
| 20% | Ours(1 Epoch) | 60.37 | 61.71 | 4.8h | 10173.46 |
| 20% | Ours(2 Epochs) | 61.34 | 61.97 | 9.6h | 10173.46 |
| 20% | Compresso | - | 60.75 | 16.8h | 33298.74 |
| 20% | NutePrune | 59.46 | 61.46 | 13.4h | 37460.78 |
| 20% | LoRAShear | - | 60.63 | - | - |
| 20% | MaskPrune | 61.17 | - | - | - |
| 50% | Ours(1 Epoch) | 51.73 | 53.26 | 3.6h | 6239.06 |
| 50% | Ours(2 Epochs) | 51.94 | 53.68 | 7.2h | 6239.06 |
| 50% | Compresso | - | 46.87 | 16.8h | 33298.74 |
| 50% | NutePrune | 50.35 | 52.13 | 13.4h | 37460.78 |
| 50% | LoRAShear | - | 50.39 | - | - |
| 50% | MaskPrune | 49.92 | - | - | - |

W4: Typos.

We will correct these typos in the revision.

L5: Missing discussion of limitations and broader impacts.

Thanks for the reminder. In fact, we discussed the limitations and broader impacts in Appendices F and G. We will move them into main text in the revised version if accepted.

Reference

[r1] Dubey A, Jauhri A, Pandey A, et al. The llama 3 herd of models. arXiv e-prints, 2024: arXiv: 2407.21783.

[r2] Chen T, Ding T, Yadav B, et al. Lorashear: Efficient large language model structured pruning and knowledge recovery. ArXiv 2023.

[r3] Qin J, Tan J, Zhang K, et al. MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures. ArXiv 2025.

Comment

Dear Reviewer Hb8r,

We sincerely appreciate your support and have carefully and thoroughly addressed your constructive comments.

Since the deadline for the author-reviewer discussion phase is approaching, would you mind kindly letting us know if any further discussion or clarification is needed? Thank you again for your valuable time.

Review (Rating: 4)

This paper introduces a computationally and memory-efficient pruning algorithm for LLMs, which addresses a key practical challenge in optimization-based structured pruning — the high overhead of sub-model instantiation. The core idea is a reweighting scheme, which reuses the old sub-model gradients to estimate the current gradient. The paper integrates multiple variance reduction techniques and demonstrates the method’s advantages in both standalone and distributed pruning scenarios.

Strengths and Weaknesses

Strengths

  1. Novel reweighting-based pruning technique with strong theoretical and practical grounding.
  2. Extensive experimental coverage across LLaMA, OPT, and vision models.

Weaknesses

  1. I'm wondering about the results beyond 50% sparsity and robustness under extreme pruning, which are lacking in the paper.
  2. SliceGPT reported that the WikiText2 PPL of Llama2-7B at 30% sparsity is 8.12, which significantly differs from the one reported in this paper (23.31). The WikiText2 PPL of Llama2-13B at 30% sparsity is also larger than the one in the original paper. The same difference occurs in the results of LLM-Pruner; I'm wondering why there is such a gap. Is there some difference in the experimental setup? If so, why not use the same setup as the original paper?
  3. Comparison with optimization-based baselines is limited. Only 50% sparsity is compared; more sparsity levels should be compared. NutePrune reported that the WikiText2 PPL of Llama2-7B at 50% sparsity is 12.29, which significantly differs from the one reported in this paper. Does this difference come from the lack of fine-tuning?
  4. Since the other methods do not appear to match the results presented in their own original papers, the scientific value of these results is low, because it is difficult to draw any conclusions from them. I hope the authors can provide some explanation for the difference.
  5. I want to know why the authors only compare the results without fine-tuning. If the models pruned by gradient reweighting can't beat those SOTA methods after fine-tuning, the practical significance of gradient reweighting would seem to be lower. I recommend that the authors compare the performance after fine-tuning.
  6. I wish the authors could provide statistical information rather than single experiments, which would help increase the credibility of the results.

Questions

See weaknesses.

If the authors can solve some of my confusion in experiments, the practical significance of the paper will be greatly improved, and I will be happy to increase my score.

Limitations

Yes

Final Justification

The authors have addressed my concerns on the higher sparse level, different results in the original paper, comparison with the results with fine-tuning, and statistical information. I think those improvements increase the value of the paper, and I have changed the rating from 2 to 4.

Formatting Issues

N/A

Author Response

W1: Results beyond 50% sparsity.

We provide experimental results under 55% and 60% prune rates, as shown in Table R1.

Table R1. Performances with higher prune rates on Llama2-7B.

| Prune rate | Method | WikiText2 PPL↓ | Ptb PPL↓ |
|---|---|---|---|
| 55% | LLM-Pruner | 434.51 | 531.02 |
| 55% | SliceGPT | 97.69 | 311.01 |
| 55% | Wanda-sp | 218.80 | 164.81 |
| 55% | FLAP | 363.82 | 413.05 |
| 55% | Ours | 35.34 | 59.83 |
| 60% | LLM-Pruner | 761.97 | 816.07 |
| 60% | SliceGPT | 139.64 | 475.73 |
| 60% | Wanda-sp | 298.84 | 247.60 |
| 60% | FLAP | 464.39 | 532.67 |
| 60% | Ours | 53.84 | 72.95 |

W2: The experimental performance gap in SliceGPT and LLM-Pruner.

The performance difference arises because we used a different experimental setup to ensure a fair comparison.

Before presenting our detailed setting, we'd like to clarify the following three factors:

  1. SliceGPT uses WikiText2 as both the calibration and evaluation dataset, which makes it difficult to fully evaluate the generalization ability of the pruned model. Such an experimental setup is also not widely adopted in the literature.
  2. Regarding the performance difference on LLM-Pruner, there may be a misunderstanding: the original paper used LLaMA-1 with an evaluation context length of 128, whereas our experiments are based on LLaMA-2 and LLaMA-3 with a context length of 2048.
  3. Pruning methods such as SliceGPT and LLM-Pruner can show notable performance variation across different calibration datasets and input sequence lengths. As shown in Table R2 below, SliceGPT achieves an average score of 51.50 when using WikiText2 as the calibration set, compared to 57.93 when using Alpaca. Figure 7 of SliceGPT further illustrates that when varying the length of the calibration dataset, the WikiText2 PPL of OPT-6.7B fluctuates within a range of 12.0 to nearly 20.0.

Table R2. The results extracted from Tables 7–10 in the original SliceGPT paper demonstrate that the performance of LLaMA2-7B at a 30% prune rate varies significantly depending on the calibration data used. "-" means the result is not available in the original paper.

| Calibration Data | WikiText2 PPL↓ | PIQA↑ | WinoGrande↑ | HellaSwag↑ | ARC-e↑ | ARC-c↑ | Avg. Score↑ |
|---|---|---|---|---|---|---|---|
| WikiText2 | 8.12 | 63.55 | 61.33 | 49.62 | 51.77 | 31.23 | 51.50 |
| WikiText2 & fine-tuning | - | 67.41 | 63.22 | 55.65 | 50.76 | 34.13 | 54.23 |
| Alpaca | - | 72.25 | 59.83 | 55.86 | 63.93 | 37.80 | 57.93 |
| Alpaca & fine-tuning | - | 74.59 | 61.64 | 63.06 | 66.54 | 40.87 | 61.34 |

Our experimental settings. To ensure fair comparison across all baselines in our experiments, we standardize the calibration dataset to C4 and use a sequence length of 128. For evaluation, we use WikiText2 and Ptb as test sets with a length of 2048, and we do not apply fine-tuning.
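
For reference, the evaluation protocol above can be sketched as follows: a minimal, assumed implementation using Hugging Face `transformers`/`datasets` with non-overlapping 2048-token windows on the WikiText2 test split. The model path and windowing details are our assumptions, not the authors' evaluation script.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; the pruned checkpoint would be loaded here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids

seq_len, nlls, n_tokens = 2048, [], 0
for i in range(0, ids.size(1) - seq_len, seq_len):      # non-overlapping 2048-token windows
    chunk = ids[:, i:i + seq_len].to(model.device)
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss          # mean token negative log-likelihood
    nlls.append(loss.float() * seq_len)
    n_tokens += seq_len
ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"WikiText2 PPL @ 2048: {ppl.item():.2f}")
```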

Results under the same setting as SliceGPT. Furthermore, we evaluate our method under the same experimental setting as SliceGPT, where WikiText2 is used as both the training and evaluation set. Under this setup, our method also surpasses SliceGPT on multiple models. The results are shown in the following table.

Table R3. Performance of our method, with training and evaluation conducted on WikiText2 (↓).

| Prune rate | Method | Llama2-7B | Llama2-13B | OPT-2.7B | OPT-6.7B |
|---|---|---|---|---|---|
| 20% | SliceGPT | 6.64 | 5.81 | 13.73 | 11.48 |
| 20% | Ours | 6.32 | 5.63 | 12.69 | 10.94 |
| 30% | SliceGPT | 8.12 | 6.99 | 15.83 | 12.51 |
| 30% | Ours | 7.75 | 6.61 | 15.02 | 12.07 |

W3-1: Comparison with optimization-based baselines is limited.

Thanks for your suggestion. We additionally include experiments with the optimization-based pruning methods Compresso, NutePrune, LoRAShear [r1], and MaskPrune [r2] across different models and prune rates. More comparisons will be included in the revised version if accepted. For a fair comparison under equal training time, we extend the training epochs of our method to fully exploit its efficiency. We additionally report the time and memory overhead incurred by each method during the pruning process. The results are summarized below.

Table R4. Comparison with various optimization-based methods on Llama2-7B. To ensure a consistent training time budget, we extend the training epoch accordingly. Meanwhile, we also record the memory overhead for reference. The fine-tuned version is trained on the Alpaca dataset.

| Prune rate | Method | WikiText2 PPL↓ | Ptb PPL↓ | WikiText2 PPL (&Fine-tune)↓ | Ptb PPL (&Fine-tune)↓ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|---|---|
| 0% | - | 5.47 | 8.39 | - | - | - | - |
| 20% | Ours(1 Epoch) | 9.67 | 13.94 | 8.35 | 13.18 | 4.8h | 10224.34 |
| 20% | Ours(2 Epochs) | 9.44 | 13.70 | 8.02 | 12.90 | 9.6h | 10224.34 |
| 20% | Ours(3 Epochs) | 9.35 | 13.48 | 7.94 | 12.57 | 14.5h | 10224.34 |
| 20% | Compresso | 11.47 | 21.25 | 9.92 | 18.39 | 16.8h | 33298.74 |
| 20% | NutePrune | 9.88 | 16.64 | 8.74 | 14.90 | 13.4h | 37460.78 |
| 50% | Ours(1 Epoch) | 26.83 | 44.23 | 14.39 | 30.07 | 3.6h | 6390.21 |
| 50% | Ours(2 Epochs) | 23.46 | 41.79 | 12.79 | 25.42 | 7.2h | 6390.21 |
| 50% | Ours(3 Epochs) | 21.52 | 38.69 | 12.55 | 25.20 | 10.7h | 6390.21 |
| 50% | Compresso | 49.28 | 78.90 | 40.83 | 67.08 | 16.8h | 33298.74 |
| 50% | NutePrune | 24.89 | 40.83 | 12.94 | 26.04 | 13.4h | 37460.78 |

Table R5. Comparison with various optimization-based methods on Llama-7B. Zero Shot task includes BoolQ, PIQA, HellaSwag, Winogrande, ARC-e, ARC-c and OBQA.

| Prune rate | Method | Zero Shot Avg. Score↑ | Zero Shot Avg. Score (&Fine-tune)↑ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|
| 0% | - | 63.25 | - | - | - |
| 20% | Ours(1 Epoch) | 60.37 | 61.71 | 4.8h | 10173.46 |
| 20% | Ours(2 Epochs) | 61.34 | 61.97 | 9.6h | 10173.46 |
| 20% | Compresso | - | 60.75 | 16.8h | 33298.74 |
| 20% | NutePrune | 59.46 | 61.46 | 13.4h | 37460.78 |
| 20% | LoRAShear | - | 60.63 | - | - |
| 20% | MaskPrune | 61.17 | - | - | - |
| 50% | Ours(1 Epoch) | 51.73 | 53.26 | 3.6h | 6239.06 |
| 50% | Ours(2 Epochs) | 51.94 | 53.68 | 7.2h | 6239.06 |
| 50% | Compresso | - | 46.87 | 16.8h | 33298.74 |
| 50% | NutePrune | 50.35 | 52.13 | 13.4h | 37460.78 |
| 50% | LoRAShear | - | 50.39 | - | - |
| 50% | MaskPrune | 49.92 | - | - | - |

W3-2: Performance differences on NutePrune. Does this difference come from the lack of fine-tuning?

Yes. To ensure a fair comparison, we did not perform additional fine-tuning when reproducing the experiments of NutePrune, which may lead to discrepancies from the results reported in the original paper.

W4: Explanation for the difference.

For the differences in certain experimental results, please refer to the responses in W2 and W3-2.

W5: Why the authors only compare the results without fine-tuning?

Thanks for your suggestions. We provide the fine-tuning results in Table R4 and Table R5. For fair comparison, we adopt the same setup as LLM-Pruner: LoRA fine-tuning is performed on the pruned model using the Alpaca dataset for 2 epochs, with a rank of 8 and a learning rate of 1e-4.
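
For concreteness, a minimal sketch of such a LoRA recovery fine-tuning setup with the Hugging Face `peft` library is shown below; only the rank (8), learning rate (1e-4), number of epochs (2), and the Alpaca dataset come from the response above, while the target modules, alpha, and dropout are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the pruned checkpoint (hypothetical path).
pruned = AutoModelForCausalLM.from_pretrained("path/to/pruned-llama2-7b")

lora_cfg = LoraConfig(
    r=8,                                                      # rank stated in the response
    lora_alpha=16,                                            # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,                                        # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(pruned, lora_cfg)
model.print_trainable_parameters()
# Train with a standard causal-LM loop (or transformers.Trainer) on Alpaca
# for 2 epochs with learning rate 1e-4, as described above.
```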

W6: Statistical information.

Thank you for your valuable suggestion. Actually, we observed that the experimental results exhibited low variance in the initial training stage, so we fixed a random seed for all subsequent experiments. To illustrate this, we conducted statistical experiments using five different random seeds on both Llama2-7B and Llama3-8B. The results, shown in the table below, indicate that the standard deviation is relatively small compared to the performance gains. Complete results with statistical information will be included in the revised version if accepted.

Table R6. Statistical information over 5 random seeds.

| Model | WikiText2 PPL↓ | Ptb PPL↓ | PIQA↑ | HellaSwag↑ | ARC-e↑ | ARC-c↑ | OBQA↑ |
|---|---|---|---|---|---|---|---|
| Llama2-7B | 10.58 (±0.64) | 15.86 (±0.73) | 73.59 (±1.24) | 63.09 (±0.93) | 58.51 (±1.53) | 32.85 (±0.87) | 38.62 (±0.91) |
| Llama3-8B | 17.88 (±0.79) | 29.26 (±0.96) | 72.18 (±1.66) | 58.68 (±0.85) | 56.50 (±0.56) | 30.87 (±0.75) | 33.68 (±0.64) |

Reference

[r1] Chen T, Ding T, Yadav B, et al. Lorashear: Efficient large language model structured pruning and knowledge recovery. ArXiv 2023.

[r2] Qin J, Tan J, Zhang K, et al. MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures. ArXiv 2025.

Comment

I thank the authors for their detailed response. The explanations and additional experiments are very helpful and have addressed my concerns. I'm very happy to increase the rating.

Comment

We sincerely appreciate your follow-up response and are grateful for your recognition of our efforts. If you have any further questions or suggestions, we would be happy to continue the discussion. Thank you again for your thoughtful review.

Review (Rating: 4)

This work introduces a reweighting-based mask updating method for pruning DNNs, particularly Large Language Models (LLMs), with a focus on computational and memory efficiency. The primary challenge addressed is the computational complexity and memory redundancy associated with instantiating intermediate pruned sub-models. Instead of relying on back-propagation for the entire neural network, this method estimates the gradient of the pruned network. The authors highlight that the DNN feed-forward-only mechanism can also be generalized to sparse distributed training. The results demonstrate state-of-the-art performance in LLM and DNN compression compared to existing pruning algorithms.

Strengths and Weaknesses

Strengths

  • The reweighted policy gradient estimator approximates the gradient of the mask distribution parameter using outdated sub-model instantiation results, significantly reducing model instantiation frequency.
  • Variance reduction in the pruning framework is mathematically proven to be bounded, demonstrating the framework's stability for practical application.
  • The framework runs in parallel with model training, making it compatible with existing methods for simultaneous model finetuning and compression.
  • As a byproduct, the framework is also compatible with sparse distributed training.
  • The reweighting-based pruning framework achieves state-of-the-art performance compared to existing pruning frameworks for DNNs and LLMs.
  • The presentation is logical, smooth, and clear, making it easy for future researchers to extend the work. Detailed discussions on theorems, implementation, and code are provided in the appendix.

Weaknesses

  • The study highlights that pruning performance on Large Language Models (LLMs) remains largely unexplored. While the proposed method outperforms other LLM pruning techniques, the pruned network significantly underperforms the original LLM. This disparity might stem from LLMs' inherent sensitivity to noise. However, computational resource limitations prevented the application of larger, more redundant networks to validate this hypothesis. Consequently, further research is needed to determine if the trade-off between LLM complexity and efficiency justifies this approach to LLM pruning.
  • minor: line 159 compactess -> compactness

Questions

Super LLMs and Pruning Limitations

  • Resource Requirements: What are the potential resource requirements for super LLMs to demonstrate the limitations of pruning?
  • Network Growth and Redundancy: Theoretically, at what point does network growth indicate sufficient redundancy to warrant pruning?

Distributed Sparse Training and Asynchronous Pruning

  • Full-Sync Optimization: Does the current distributed sparse training framework necessitate full-synchronization for optimization?
  • Asynchronous Training and Framework Adjustments: If asynchronous training is implemented, what adjustments are required? Would a separate framework be necessary to accommodate asynchronous pruning?

Limitations

yes

Final Justification

All of my concerns have been addressed by the rebuttal. After combining all the comments, I'd like to keep the rating of 4 for this work.

Formatting Issues

N/A

Author Response

W1: Further research is needed to determine if the trade-off between LLM complexity and efficiency justifies this approach to LLM pruning.

We'd like to clarify as follows.

Reasons on Unexplored Pruning for LLMs. We observe that LLMs often exhibit performance degradation after pruning, especially under high sparsity structural pruning. We argue this phenomenon may be attributed to two main factors:

  1. A mismatch between pruning and training resources. The data scale and computing resources used for pruning are much lower than those used for training, which hinders its ability to achieve the desired effectiveness.
  2. Potentially lower redundancy in LLMs compared to vision models. One possible explanation is that language data contains less redundancy than image data, in which local pixel regions typically exhibit high similarity. However, this remains unclear under the current limited pruning resources.

Based on the above background, we emphasize the key contributions of our work as follows:

  1. The memory cost during pruning is significantly reduced by our method, making it more feasible to prune larger models with the same memory resources. This is important for downstream practitioners of LLMs, whose resources are often limited. As shown in Table R1, our method incurs very low memory overhead.

  2. Our method is more efficient. We emphasize that improving pruning performance is closely tied to the efficiency of the algorithm, especially for large-scale networks like LLMs. A more efficient pruning algorithm allows for more iterative updates, enabling more thorough exploration and refinement of sparse structures. As shown in Table R1, extending the training time to match that of the baselines leads to an even more significant performance advantage.

  3. Our pruning method achieves SOTA performance compared to the baselines.

Table R1. Comparison of pruning performance, training time, and memory overhead at a 50% prune rate on Llama2-7B. To ensure a fair comparison, we extended the training schedule of our method to a comparable time budget.

| Method | WikiText2 PPL↓ | Ptb PPL↓ | Time cost | Memory (MiB) |
|---|---|---|---|---|
| Ours(1 Epoch) | 26.83 | 44.23 | 3.6h | 6390.21 |
| Ours(2 Epochs) | 23.46 | 41.79 | 7.2h | 6390.21 |
| Ours(3 Epochs) | 21.52 | 38.69 | 10.7h | 6390.21 |
| Compresso | 49.28 | 78.90 | 16.8h | 33298.74 |
| NutePrune | 24.89 | 40.83 | 13.4h | 37460.78 |

Q1: What are the potential resource requirements for super LLMs to demonstrate the limitations of pruning?

For medium-sized networks, e.g., ResNet for image classification, existing pruning algorithms often require resource investments comparable to full training in order to achieve satisfactory pruning results, e.g., 4 GeForce RTX 4090 GPUs. By analogy, for LLMs with more complex parameter structures and larger parameter scales, it is reasonable to hypothesize that approaching the limit of pruning performance similarly demands computational resources and data volumes on par with pre-training.

Q2: Theoretically, at what scale does network growth imply sufficient redundancy to make pruning worthwhile?

Most existing neural networks inherently contain a certain degree of redundancy, as they are manually designed and cannot guarantee that each parameter is optimally utilized to represent knowledge. Consequently, pruning is essential in applications where extreme time efficiency is required.

Q3: Does the current distributed sparse training framework necessitate full-synchronization for optimization?

Our framework can be flexibly integrated with both synchronous and asynchronous optimization. It is currently designed as a synchronized optimization process, since synchronous optimization is widely used in distributed training systems.

To support asynchronous optimization, only Step 8 in Algorithm 2 needs to be modified. Instead of waiting for all workers to return their parameter updates $\Delta W_{m(i)}$ before updating the global model, the server can update the global weights using each received gradient individually as follows:

For each incoming $\Delta W_{m(i)}$ from worker $i$, update the global weights as

$$W \leftarrow W + \eta_{3} \Delta W_{m(i)},$$

where $\eta_{3}$ is a small learning rate that controls the pace of asynchronous updates.

This minor modification would allow the framework to run fully asynchronously without requiring a complete redesign, while still benefiting from the lightweight communication scheme enabled by the sparse structure and reweighting strategy.
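
A minimal sketch of the corresponding server-side loop is given below; `recv_update` and the sparse update format are hypothetical placeholders, and only the update rule $W \leftarrow W + \eta_{3} \Delta W_{m(i)}$ comes from the description above.

```python
import torch

def async_server_loop(W, recv_update, num_updates, eta3=0.1):
    """Apply each worker's sparse update as soon as it arrives (no global barrier).

    W:           dict name -> torch.Tensor, the global dense weights
    recv_update: blocking callable returning (worker_id, delta_W), where delta_W
                 only contains entries for the sub-modules kept by that worker's mask
    eta3:        small learning rate pacing the asynchronous updates (assumed value)
    """
    for _ in range(num_updates):
        worker_id, delta_W = recv_update()   # returns whenever any worker finishes
        for name, dW in delta_W.items():
            W[name].add_(eta3 * dW)          # W <- W + eta3 * Delta W_{m(i)}
    return W
```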

Q4: If asynchronous training is implemented, what adjustments are required? Would a separate framework be necessary to accommodate asynchronous pruning?

The asynchronous version can be implemented following the answer in Q3, without the need for a separate training framework. We provide experimental results of asynchronous sparse training on ResNet-50, as shown in Table R2.

Table R2. Comparison of asynchronous sparse training results on ResNet-50.

| Method | Acc (%)↑ | Params (%)↓ | FLOPs (%)↓ | Time↓ |
|---|---|---|---|---|
| Dense Dist. | 76.9 | 100 | 100 | 23.00 |
| Local SGD | 75.0 | 100 | 100 | 20.52 |
| PowerSGD | 76.0 | 100 | 100 | 19.25 |
| TSNNS | 74.5 | 63.2 | 77.4 | 19.49 |
| EffTrain | 76.1 | 48.2 | 46.8 | 115.04 |
| Ours(Syn) | 75.2 | 48.1 | 46.5 | 9.79 |
| Ours(Asyn) | 74.7 | 48.2 | 46.7 | 9.13 |

Comment

Thanks for addressing the concerns in the review and detailed explanation. I have no other concerns.

Comment

We sincerely appreciate your feedback and recognition, and are pleased that our work has addressed your concerns.

Review (Rating: 4)

This paper addresses the challenges of pruning large language models (LLMs), where conventional methods struggle due to the high computational and memory overhead of repeated sub-model instantiations. The authors propose a novel, resource-efficient pruning approach that introduces a reweighting mechanism to estimate gradients of the current pruned model using gradients from previously instantiated sub-models. This significantly reduces the frequency of model instantiation. Additional techniques such as gradient clipping and preconditioning are introduced to stabilize optimization and reduce variance. The method is empirically validated across domains, achieving 50% sparsity and a 1.58$\times$ forward-pass speedup on LLaMA2-7B using only 6 GB of memory, with improved perplexity and zero-shot performance compared to prior work. Furthermore, the technique demonstrates strong potential for distributed sparse training, achieving a 2$\times$ speedup over dense baselines.

Strengths and Weaknesses

Strengths:

  1. The paper presents a clear and concise mathematical formulation, including only the essential equations—particularly for the variance reduction mechanism—without unnecessary complexity.
  2. The ablation studies are thorough and well-structured, effectively isolating the impact of each proposed component on addressing key bottlenecks in pruning.
  3. The experimental evaluation is comprehensive, covering both algorithmic metrics (e.g., perplexity, zero-shot performance) and practical system-level outcomes (e.g., memory usage and actual speedup).

Weaknesses:

  1. The manuscript repeatedly emphasizes the three main contributions throughout multiple sections, leading to redundancy and detracting from the overall clarity and flow of the writing.
  2. The empirical comparison with prior optimization-based pruning methods is limited, relying on only two baseline works. A broader and more diverse set of baselines would strengthen the validity of the claims.
  3. The trade-off introduced by the variance reduction techniques appears to slightly degrade model performance in Table 5. It would be helpful to evaluate whether variance reduction is strictly necessary—can the method maintain acceptable accuracy while further improving throughput and memory consumption if the variance reduction is omitted?

Questions

Is the m in Equation 1 either 0 or 1? Can we prune part of each sub-module?

Limitations

See above.

Final Justification

The rebuttal addresses my confusion and makes the paper solid. I will raise my score to 4.

Formatting Issues

No formatting issues.

Author Response

W1: Repeated emphasis on three main contributions.

We will revise the manuscript to improve clarity and make it easier for readers to follow.

W2: Lack of comparison with optimization-based methods.

Thanks for your suggestion. We additionally include experiments with the optimization-based pruning methods Compresso, NutePrune, LoRAShear [r1], and MaskPrune [r2] across different models and prune rates. More comparisons will be included in the revised version if accepted. For a fair comparison under equal training time, we extend the training epochs of our method to fully exploit its efficiency. We additionally report the time and memory overhead incurred by each method during the pruning process. The results are summarized in Table R1 and Table R2.

Table R1. Comparison with various optimization-based methods on Llama2-7B. To ensure a consistent training time budget, we extend the training epoch accordingly. Meanwhile, we also record the memory overhead for reference. The fine-tuned version is trained on the Alpaca dataset.

| Prune rate | Method | WikiText2 PPL↓ | Ptb PPL↓ | WikiText2 PPL (&Fine-tune)↓ | Ptb PPL (&Fine-tune)↓ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|---|---|
| 0% | - | 5.47 | 8.39 | - | - | - | - |
| 20% | Ours(1 Epoch) | 9.67 | 13.94 | 8.35 | 13.18 | 4.8h | 10224.34 |
| 20% | Ours(2 Epochs) | 9.44 | 13.70 | 8.02 | 12.90 | 9.6h | 10224.34 |
| 20% | Ours(3 Epochs) | 9.35 | 13.48 | 7.94 | 12.57 | 14.5h | 10224.34 |
| 20% | Compresso | 11.47 | 21.25 | 9.92 | 18.39 | 16.8h | 33298.74 |
| 20% | NutePrune | 9.88 | 16.64 | 8.74 | 14.90 | 13.4h | 37460.78 |
| 50% | Ours(1 Epoch) | 26.83 | 44.23 | 14.39 | 30.07 | 3.6h | 6390.21 |
| 50% | Ours(2 Epochs) | 23.46 | 41.79 | 12.79 | 25.42 | 7.2h | 6390.21 |
| 50% | Ours(3 Epochs) | 21.52 | 38.69 | 12.55 | 25.20 | 10.7h | 6390.21 |
| 50% | Compresso | 49.28 | 78.90 | 40.83 | 67.08 | 16.8h | 33298.74 |
| 50% | NutePrune | 24.89 | 40.83 | 12.94 | 26.04 | 13.4h | 37460.78 |

Table R2. Comparison with various optimization-based methods on Llama-7B. Zero Shot task includes BoolQ, PIQA, HellaSwag, Winogrande, ARC-e, ARC-c and OBQA.

| Prune rate | Method | Zero Shot Avg. Score↑ | Zero Shot Avg. Score (&Fine-tune)↑ | Time cost | Memory (MiB) |
|---|---|---|---|---|---|
| 0% | - | 63.25 | - | - | - |
| 20% | Ours(1 Epoch) | 60.37 | 61.71 | 4.8h | 10173.46 |
| 20% | Ours(2 Epochs) | 61.34 | 61.97 | 9.6h | 10173.46 |
| 20% | Compresso | - | 60.75 | 16.8h | 33298.74 |
| 20% | NutePrune | 59.46 | 61.46 | 13.4h | 37460.78 |
| 20% | LoRAShear | - | 60.63 | - | - |
| 20% | MaskPrune | 61.17 | - | - | - |
| 50% | Ours(1 Epoch) | 51.73 | 53.26 | 3.6h | 6239.06 |
| 50% | Ours(2 Epochs) | 51.94 | 53.68 | 7.2h | 6239.06 |
| 50% | Compresso | - | 46.87 | 16.8h | 33298.74 |
| 50% | NutePrune | 50.35 | 52.13 | 13.4h | 37460.78 |
| 50% | LoRAShear | - | 50.39 | - | - |
| 50% | MaskPrune | 49.92 | - | - | - |

W3: The variance reduction in Table 5 seems to degrade the performance.

There may be some misunderstanding. As shown in Table 5, removing the components of our variance reduction technique (either the clipping operation or the preconditioning matrix H) leads to performance degradation. "Ours" in Table 5 refers to our pruning method equipped with the full variance reduction technique. It is important to note that for WikiText2 and Ptb, the evaluation metric is perplexity, where lower values indicate better performance, whereas for PIQA, HellaSwag, ARC-e, and ARC-c, the metric is accuracy, where higher values are better. Taken together, these results demonstrate that the variance reduction strategies contribute positively to performance and remain an integral part of our method.

Q1: Is the m in Equation 1 either 0 or 1? Can we prune part of each sub-module?

First, yes, in Equation 1, m is either 0 or 1, representing the retention state of a sub-module. Second, we can prune part of each sub-module by setting finer-grained masks. In existing works (e.g., Sheared LLaMA [r3] and FLAP [r4]), structural pruning is typically applied at the level of attention heads or the intermediate dimensions of FFNs, as such sparsity patterns can be efficiently implemented on various hardware platforms to achieve speedup. Therefore, our method adopts these default settings. However, thanks to the flexibility of our masks, our approach can also be extended to support finer-grained pruning. For instance, within each attention head, we can assign individual masks to the intermediate dimensions corresponding to Q, K, and V, enabling more fine-grained pruning strategies.
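
As a small illustration of this finer-grained masking, the sketch below applies separate binary masks to the intermediate dimensions of the Q, K, and V projections of one attention head; the shapes and names are illustrative assumptions rather than the paper's implementation.

```python
import torch

def masked_qkv(x, Wq, Wk, Wv, mq, mk, mv):
    """x: (batch, seq, d_model); W*: (d_model, d_head); m*: (d_head,) binary masks.

    Setting an entry of mq/mk/mv to 0 prunes that intermediate dimension of the
    corresponding projection, which is finer-grained than a single 0/1 mask
    that removes the whole attention head.
    """
    q = (x @ Wq) * mq   # zero out pruned Q dimensions
    k = (x @ Wk) * mk   # zero out pruned K dimensions
    v = (x @ Wv) * mv   # zero out pruned V dimensions
    return q, k, v
```

In practice the Q and K masks would typically be tied so that the pruned dimensions can be physically removed from both projections and yield an actual speedup.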

Reference

[r1] Chen T, Ding T, Yadav B, et al. Lorashear: Efficient large language model structured pruning and knowledge recovery. ArXiv 2023.

[r2] Qin J, Tan J, Zhang K, et al. MaskPrune: Mask-based LLM Pruning for Layer-wise Uniform Structures. ArXiv 2025.

[r3] Xia M, Gao T, Zeng Z, et al. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. ICLR 2024.

[r4] An Y, Zhao X, Yu T, et al. Fluctuation-based adaptive structured pruning for large language models. AAAI 2024.

Comment

Thanks for the detailed response. I will raise my score.

Comment

We sincerely appreciate your response and your willingness to raise the score. If there are any additional questions or suggestions, we would be more than happy to discuss them.

Final Decision

This manuscript proposes a method to explore pruned LLMs, given access to a dense, fully trained LLM. The approach is novel in that it makes better use of the potential pruned intermediate LLMs, with less variance in the reweighting function, the loss, and the gradient. These properties are believed to allow a more stable training process when searching for the final pruned LLM. The reviewers viewed the initial manuscript as needing additional experiments, especially studies with fine-tuning. The authors engaged well with these criticisms, and the result is that all four reviewers gave scores of 4, with the only variation being the reviewers' certainty in their scores; that said, the feedback from the less certain reviewers was consistent with that of the more certain reviewers.

The manuscript is a reasonable contribution. Using variance reduction for these less stable quantities is a natural and good idea. It has been implemented and shown to give superior performance. Reviewers have the requisite expertise to review the results and the only remaining question is how well the methods scale to larger problems which is beyond the scope of what the authors would be able to do. The ideas of the manuscript could be conveyed well through a poster; there isn't need for an oral or spotlight presentation.