Targeted Low-rank Refinement: Enhancing Sparse Language Models with Precision
We propose an iterative method for refining pruned neural network weights, aiming to improve model performance while maintaining sparsity.
Abstract
Reviews and Discussion
The paper introduces a novel method to improve the performance of pruned large language models (LLMs) by combining sparsity with a low-rank approximation. The authors propose an iterative refinement algorithm that updates the sparse weight matrix while incorporating a low-rank component to approximate the difference between the original dense matrix and the pruned sparse matrix. This approach aims to recover information lost during pruning without requiring extensive retraining or large datasets, maintaining the sparsity pattern for hardware efficiency. Key contributions include:
- An iterative weight update method (Algorithm 1) that refines the sparse matrix and adds a low-rank patch, progressively increasing the rank from 2 to a target value over the iterations while preserving the sparsity pattern with a binary mask (a code sketch of this scheme follows the list below).
- The method reduces perplexity compared to baseline pruning techniques (e.g., magnitude pruning and Wanda) across various sparsity levels.
- It also achieves competitive performance on benchmark datasets such as TruthfulQA, GSM8K, ARC-c, and MMLU relative to the Magnitude and Zero-shot SVD baselines.
- The paper provides a theoretical analysis proving sparsity preservation (Theorem 4.1), convergence to a solution (Theorem 4.2), and monotonic error reduction after a certain iteration (Theorem 4.3).
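To make the summarized update scheme concrete, here is a minimal PyTorch sketch of an iterative sparse-plus-low-rank refinement of this kind; the function name, the linear rank schedule, and the exact update order are illustrative assumptions rather than the paper's Algorithm 1.

```python
import torch

def refine_sparse_plus_lowrank(W, mask, k, T=50):
    """Illustrative sketch: refine a pruned matrix S = mask * W with a rank-k patch L."""
    S = W * mask                       # pruned sparse matrix; pattern fixed by the binary mask
    L = torch.zeros_like(W)            # low-rank patch, initially empty
    for t in range(T):
        # Assumed linear rank schedule: start at rank 2, end at the target rank k.
        k_t = 2 + round((k - 2) * t / max(T - 1, 1))
        # Best rank-k_t approximation of the current residual W - S (Eckart-Young).
        U, sigma, Vh = torch.linalg.svd(W - S, full_matrices=False)
        L = U[:, :k_t] @ torch.diag(sigma[:k_t]) @ Vh[:k_t, :]
        # Refine the sparse part on the kept positions only, preserving the sparsity pattern.
        S = (W - L) * mask
    return S, L

# Toy usage: any pruning mask works; no calibration data is involved.
W = torch.randn(256, 128)
mask = (torch.rand_like(W) > 0.5).float()
S, L = refine_sparse_plus_lowrank(W, mask, k=16, T=10)
```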
Questions for the Authors
- Calibration Data Details: What calibration data (size, source) was used for perplexity experiments? Wanda specifies 128 C4 sequences; this omission affects reproducibility. A response detailing this would strengthen confidence in the results.
- Performance Target for Parameter Reduction: The abstract claims an 8.6% parameter reduction for a specific target at 50% sparsity. What is this target, and can you provide supporting data? Without this, the claim feels unsubstantiated.
- Zero-shot Task Evaluation: Why were zero-shot accuracies (e.g., as in Wanda’s Table 2, also Tables in the appendix e.g., Table 23) not included? Adding these could align your evaluation with Wanda’s, enhancing practical relevance. A justification or additional results would influence my view on the method’s applicability.
- SparseGPT Comparison: Why was SparseGPT excluded from experimental comparisons despite its relevance (cf. Wanda’s Table 3)? Including it could better position your method; its absence weakens the competitive analysis.
Claims and Evidence
- Claim 1: The method bridges the gap between dense and sparse models using a low-rank component.
- Assessment: Well-supported
- Claim 2: The iterative algorithm enhances precision by prioritizing larger singular values.
- Assessment: Well-supported, though the empirical comparison with PCP could be expanded to quantify precision gains more explicitly.
- Claim 3: Significant perplexity improvements, especially at high sparsity levels (e.g., a 99.6% reduction at 70% sparsity).
- Assessment: There is supporting evidence, though the lack of zero-shot task performance (e.g., accuracy on downstream tasks) limits the scope of evaluation compared to Wanda’s broader benchmarks (e.g., Table 2 in Wanda).
- Claim 4: The method enables a reduction in model parameters while maintaining 50% sparsity and meeting a specific performance target.
- Issue: Lacks details and remains vague. The paper does not specify the performance target or provide a direct comparison showing parameter reduction versus performance trade-offs.
Methods and Evaluation Criteria
- Methodology: yes, the proposed methodology is sound.
- The evaluation criteria are appropriate for language modeling (perplexity) and generalizability (benchmarks). However, unlike Wanda, which includes zero-shot task accuracies (e.g., Table 23) and few-shot results (e.g., MMLU in Table 21), this paper lacks zero-shot accuracy metrics (and few-shot MMLU results), limiting its comparability on downstream tasks. Adding such evaluations would strengthen the assessment of practical utility.
Theoretical Claims
The proofs generally seem to be correct but rely on assumptions about singular value decay that could be sensitive to matrix properties.
Experimental Design and Analysis
- Perplexity Comparison (Tables 1, 2): Tests LLaMa-7B and LLaMa-13B across sparsity levels.
- Soundness: The design is valid, using WikiText-2 perplexity as a standard metric, consistent with Wanda and SparseGPT. Results are reproducible with fixed $k$ and $T$.
- Issues: The paper lacks details on calibration data (e.g., size, source), unlike Wanda (128 sequences from C4). This affects reproducibility. Additionally, only magnitude pruning and Wanda are baselines, omitting SparseGPT (a key competitor in Wanda’s Table 3).
- Benchmark Evaluation (Table 3): Assesses performance on four datasets.
- Soundness: The choice of datasets is reasonable for LLM evaluation, and comparisons with dense and sparse models are fair.
- Issues: Sample sizes and statistical significance (e.g., variance across runs) are not reported, unlike Wanda’s robustness analysis (Table 18). This reduces confidence in the results’ stability.
- Iterative Analysis (Figures 4, 5): Visualizes singular value spectra, energy retention, and convergence.
- Soundness: The synthetic and real-world (LLaMa-7B) analyses are well-designed to support theoretical claims.
- Issues: Limited to one sparsity level (50%) in Figure 4, missing higher sparsity cases (e.g., 70%) where Wanda shows larger gaps.
The experiments are sound but incomplete compared to Wanda, which includes zero-shot accuracies, few-shot tasks, and robustness analysis, highlighting gaps in broader evaluation.
Supplementary Material
Section A
Relation to Prior Literature
- LLM Pruning: Extends magnitude pruning (Han et al., 2015), SparseGPT (Frantar & Alistarh, 2023), and Wanda (Sun et al., 2023) by adding low-rank refinement, addressing the performance gap noted in the Junk DNA Hypothesis (Yin et al., 2024).
- Low-rank Approximation: Leverages SVD and iterative refinement, drawing from matrix completion (Chandrasekaran et al., 2011) and robust PCA (Candès et al., 2011), adapting them for sparsity preservation.
Missing Essential References
- SparseGPT (Frantar & Alistarh, 2023): Cited but not experimentally compared in Tables 1–3, despite being a key baseline in Wanda (Tables 2, 3). Its inclusion would contextualize the method’s superiority over second-order pruning approaches.
- LoRA (Hu et al., 2021): Wanda uses LoRA for fine-tuning (Section 5), showing performance recovery. The absence of fine-tuning comparisons here misses a practical recovery baseline.
- Yin et al. (2024) - Junk DNA Hypothesis: Cited, but its implications (irreversible loss at high sparsity) could be explored more deeply with zero-shot task evaluations, as Wanda does.
These omissions limit the paper’s positioning against state-of-the-art recovery and pruning methods.
Other Strengths and Weaknesses
- Strengths:
- Originality: Creative integration of sparsity and low-rank refinement, distinct from Wanda’s metric-based pruning.
- Significance: Addresses high-sparsity performance degradation, a critical issue for LLM deployment.
- Clarity: Well-written with clear figures (e.g., Figure 1) and algorithmic exposition.
- Weaknesses:
- Evaluation Scope: Lacks zero-shot and few-shot task evaluations (cf. Wanda’s Tables 2, 21), limiting practical relevance.
- Reproducibility: Missing details on calibration data and iteration specifics (e.g., $T$ selection).
- Comparison Depth: Omits SparseGPT and fine-tuning baselines, reducing comparative strength against Wanda.
Other Comments or Suggestions
- Typos:
- Page 7, Table 2: "Dense" perplexity should be 5.09 (per Wanda), not 4.57.
- Suggestion: Include a runtime comparison (e.g., vs. Wanda’s Table 4) to quantify “computationally efficient.”
We thank the reviewer for the constructive comments on this work. We would like to clarify the following points:
W1: Evaluation Scope: Lacks zero-shot and few-shot task evaluations (cf. Wanda’s Tables 2, 21), limiting practical relevance.
Q3: Zero-shot Task Evaluation: Why were zero-shot accuracies (e.g., as in Wanda’s Table 2, also Tables in the appendix e.g., Table 23) not included? Adding these could align your evaluation with Wanda’s, enhancing practical relevance. A justification or additional results would influence my view on the method’s applicability.
Below, we provide additional results on more benchmarks.
Table: Performance of Llama-7B on three additional benchmarks, along with the results from Table 3.
| Model | HellaSwag | WinoGrande | ARC-e | TruthfulQA | GSM8K | ARC-c | MMLU |
|---|---|---|---|---|---|---|---|
| Dense baseline | 76.2 | 70.0 | 72.8 | 34.1 | 10.3 | 44.7 | 32.1 |
| Magnitude 50% | 60.9 | 59.3 | 54.3 | 35.3 | 1.0 | 33.5 | 24.6 |
| Magnitude 50% + Zero-shot SVD | 69.2 | 65.5 | 63.6 | 34.3 | 1.5 | 36.9 | 26.0 |
| Magnitude 50% + Ours | 69.8 | 65.3 | 64.3 | 34.2 | 3.4 | 41.5 | 26.0 |
W2: Missing details on calibration data and iteration specifics (e.g., $T$ selection).
- The proposed iterative refinement method is entirely data-free and does not require calibration data. As shown in Algorithm 1, the only inputs are:
- Dense weight matrix $W$
- Binary mask $P$ (from pruning)
- Target rank $k$
- Number of iterations $T$
We use WikiText-2 for perplexity evaluation and 'allenai/c4' as the calibration data for Wanda pruning as well as Wanda + Ours.
- We consistently use $T=50$ across experiments, which is sufficient to achieve most of the potential error reduction while maintaining computational efficiency, since the overall time complexity is $O(T \cdot mn^2)$ (assuming $m \ge n$), where $m$ and $n$ are the number of rows and columns of $W$.
W3: Omits SparseGPT and fine-tuning baselines, reducing comparative strength against Wanda.
Q4: SparseGPT Comparison: Why was SparseGPT excluded from experimental comparisons despite its relevance (cf. Wanda’s Table 3)? Including it could better position your method; its absence weakens the competitive analysis.
Thank you for the suggestion. We provide the results of SparseGPT 2:4 and SparseGPT 2:4 + Ours in the following table.
Table: Performance of SparseGPT 2:4 and SparseGPT 2:4 + Ours using Llama-7B on three additional benchmarks, along with the results from Table 3.
| Method | HellaSwag | WinoGrande | ARC-e | TruthfulQA | GSM8K | ARC-c | MMLU |
|---|---|---|---|---|---|---|---|
| SparseGPT 2:4 | 58.6 | 63.9 | 56.6 | 36.5 | 2.0 | 33.1 | 25.4 |
| SparseGPT 2:4 + Ours | 65.1 | 66.9 | 59.9 | 33.8 | 2.7 | 36.1 | 29.1 |
Q1: Calibration Data Details: What calibration data (size, source) was used for perplexity experiments? Wanda specifies 128 C4 sequences; this omission affects reproducibility. A response detailing this would strengthen confidence in the results.
Thank you for the suggestion; we will add these details in the revised version. We also use 128 C4 sequences as the calibration data for Wanda pruning as well as Wanda + Ours. For perplexity evaluation, we use 128 sequences from the WikiText-2 dataset.
Q2: Performance Target for Parameter Reduction: The abstract claims an 8.6% parameter reduction for a specific target at 50% sparsity. What is this target, and can you provide supporting data? Without this, the claim feels unsubstantiated.
The performance target is the perplexity metric on the WikiText-2 dataset; we will revise the abstract to clarify this. We provide the detailed results in Figure 3(a).
Thank you for adding the new results. What is the latency for this method? Compared to the other methods, what is the computational time?
Inference Latency Analysis
Thank you for raising this important question about comparing inference latency with other methods.
Without Hardware Acceleration
First, to ensure a fair comparison with the dense baseline model, we adopted a conservative evaluation strategy: all matrices are stored in their full dense format (as torch.Tensor objects), retaining all zero elements, without sparse matrix representations or hardware-specific optimizations for speed or memory footprint. The overhead is only slightly higher than that of the dense models, which suggests that the low-rank refinement is computationally efficient despite the additional parameters.
Table: Parameter Count and Evaluation Time for Complete ARC-Challenge Benchmark.
| Model Configuration | Non-Zero Parameter Count | Parameter Count | ARC-C Eval Time | Relative Time |
|---|---|---|---|---|
| Llama-2-7B | ||||
| Dense baseline | 50.5s | 1.0x | ||
| Magnitude 50% sparse | 3.5B | 6.7B | 50.7s | ~1.0x |
| Proposed Method (k=128, 50%) | 3.8B | 7.1B | 50.5s | ~1.0x |
| Llama-2-13B | ||||
| Dense baseline | 13.0B | 13.0B | 73.2s | 1.0× |
| Magnitude 2:4 sparse | 6.7B | 13.0B | 73.9s | ~1.01× |
| Proposed Method (k=128, 2:4) | 7.2B | 13.5B | 77.9s | ~1.06× |
With Sparse Matrix Formats and Hardware Acceleration
To leverage the benefits of sparsity, we convert the weight matrices into a compressed sparse format. Specifically, we apply torch.sparse.to_sparse_semi_structured to transform the weights into torch.sparse.SparseSemiStructuredTensor objects, which are optimized for efficient storage and computation. The table below compares the memory footprint of our method with and without sparse matrix representations.
Inference performance also varies depending on workload characteristics such as batch size and input sequence length. When using hardware-accelerated N:M structured sparsity (supported on NVIDIA Ampere GPUs and later architectures), we observe an average inference speedup of ~1.1x in wall-clock time over dense models and a 38% reduction in GPU memory consumption. However, this acceleration is highly dependent on hardware support—unstructured sparsity patterns or GPUs lacking dedicated sparse tensor cores may lead to slower inference compared to dense matrix operations. Thus, the efficiency gains depend on both the pruning structure and the underlying hardware capabilities.
Table: A comparison of memory usage with and without end-to-end inference acceleration.
| Model Configuration | Memory Usage | Relative Memory Usage |
|---|---|---|
| Llama-2-13B | ||
| Proposed Method (k=128, 2:4), weight matrices are torch.Tensor | 2x14.6GB | 1.0x |
| Proposed Method (k=128, 2:4), weight matrices are torch.sparse.SparseSemiStructuredTensor | 2x9.1GB | ~0.62x |
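For reference, the conversion described above follows the pattern of PyTorch's semi-structured sparsity support, a prototype feature whose availability depends on the PyTorch version and on a CUDA GPU with 2:4 sparse tensor cores; the shapes and the 2:4 masking step below are illustrative and not the authors' code.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Build a toy 2:4-pruned half-precision weight: keep the 2 largest-magnitude entries in each group of 4.
W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
groups = W.abs().view(-1, 4)
mask = torch.zeros_like(groups).scatter_(1, groups.topk(2, dim=1).indices, 1.0).view_as(W)
W_24 = W * mask

# Compress to the semi-structured format and use it inside a linear layer.
linear = torch.nn.Linear(4096, 4096, bias=False, device="cuda", dtype=torch.float16)
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(W_24))

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
y = linear(x)  # the matmul dispatches to the sparse semi-structured kernel
```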
In this work, the authors propose a low-rank refinement method that factorizes a dense matrix into a sparse matrix and a low-rank matrix, bridging the performance gap between dense and sparse models. This approach iteratively improves the sparse weight matrix through a low-rank adjustment, thereby increasing model accuracy, particularly at higher levels of sparsity.
update after rebuttal
I will keep my score.
Questions for the Authors
- The rank k is manually set in the paper; will a poor choice of k lead to unnecessary computational overhead or insufficient recovery of pruned weights?
- Given fixed inputs, does the iterative refinement method always converge to a stable solution?
Claims and Evidence
Yes, the claims are supported by clear and convincing evidence.
Methods and Evaluation Criteria
Yes, in this work, the proposed methods and evaluation criteria make sense.
Theoretical Claims
Yes. I checked their correctness and found no issues. The iterative refinement process looks right. The theoretical analysis of the convergence property and the error bound is well established.
Experimental Design and Analysis
Yes, I have reviewed the soundness and validity of the experimental designs and analyses related to large language model pruning. The experiments are carried out on the well-known Llama 7B and 13B models at various levels of sparsity. In addition to evaluating PPL, several standard benchmarks are also assessed. The experiments are comprehensive.
Supplementary Material
Yes, I reviewed all the supplementary materials, including additional theoretical analysis and experimental results.
Relation to Prior Literature
The key contributions of the paper on Targeted Low-rank Refinement are closely related to prior findings in pruning, low-rank approximation, and post-pruning performance recovery.
The study builds on the extensive body of research on pruning techniques, particularly magnitude pruning, which removes low-magnitude weights to reduce model size [1]. It addresses a known limitation of pruning: performance degradation due to the loss of important information, especially at high sparsity levels, as discussed in works such as the Junk DNA Hypothesis [2].
The idea of using low-rank approximations to restore lost model capacity has been explored in previous research [3], but prior methods often struggle with maintaining the structured sparsity patterns needed for hardware efficiency, which this paper explicitly addresses.
[1] Han, S., et al. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.
[2] Yin, L., et al. Junk DNA hypothesis: Pruning small pre-trained weights irreversibly and monotonically impairs "difficult" downstream tasks in LLMs. ICML 2024.
[3] Zhou, T. and Tao, D. Godec: Randomized low-rank & sparse matrix decomposition in noisy case. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011.
Missing Essential References
No. The authors have discussed almost all essential references in this work.
Other Strengths and Weaknesses
Strengths:
- The proposed approach is a data-free, plug-and-play method, orthogonal to existing pruning methods.
- The proposed iterative refinement method addresses a key limitation of existing low-rank refinement techniques, namely that prior methods often struggle to maintain the structured sparsity patterns needed for hardware efficiency.
- Experiments on LLaMa models demonstrate improvements over conventional magnitude pruning and Wanda pruning.
Weaknesses:
- While the paper discusses the parameter efficiency of low-rank refinement, it does not thoroughly analyze the computational cost of the iterative update algorithm compared to alternative approaches such as the optimization-based PCP method.
- The proposed method is theoretically hardware-efficient as it enables structured N:M pruning. However, the inference latency and memory consumption of the low-rank refinement are not measured.
Other Comments or Suggestions
In the abstract, “Nonetheless, these methods often create a gap between the original dense and the pruned sparse model, …”: “create a gap” is slightly awkward; “introduce” would be a smoother expression.
Thank you for your time and effort in reviewing our paper. We appreciate your constructive feedback and suggestions.
W1 (Computational Complexity Analysis): While the paper discusses parameter efficiency of low-rank refinement, it does not thoroughly analyze the computational cost of the iterative update algorithm compared to alternative approaches such as the optimization-based PCP method.
The primary computational bottleneck in the proposed algorithm is the SVD computation performed in each iteration (line 171, page 4). For an $m \times n$ weight matrix $W$, the time complexity of SVD is $O(mn^2)$ (assuming $m \ge n$). With $T$ iterations, the overall time complexity is $O(T \cdot mn^2)$.
In contrast, the PCP baseline would typically use an iterative optimization method (such as Adam) that also requires an SVD computation in each iteration to handle the nuclear norm. Furthermore, the PCP baseline requires substantially more iterations ($T \approx 5000$) to achieve results comparable to the proposed method ($T = 50$).
Table: Computational Efficiency Comparison
| Method | Time Complexity | Typical Iterations | Relative Computational Cost |
|---|---|---|---|
| Proposed Algorithm | $O(T \cdot mn^2)$ | 50 | 1× |
| PCP Baseline | $O(T \cdot mn^2)$, larger $T$ | 5000 | ~100× |
| Zero-shot SVD | $O(mn^2)$ | 1 | ~0.02× |
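As a back-of-the-envelope check of the iteration ratios above (the layer shape and the one-SVD-per-iteration cost model are illustrative assumptions, not figures from the paper):

```python
def refine_cost_flops(m, n, T):
    # Rough cost model: one thin SVD per iteration at ~m * n^2 operations (m >= n assumed).
    return T * m * n * n

m, n = 11008, 4096  # e.g., an MLP projection shape in a 7B-scale model (illustrative)
ours = refine_cost_flops(m, n, T=50)
pcp = refine_cost_flops(m, n, T=5000)
zero_shot = refine_cost_flops(m, n, T=1)
print(pcp / ours, zero_shot / ours)  # -> 100.0 0.02, matching the relative costs above
```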
W2 (Inference Latency and Memory Consumption): The proposed method is theoretically hardware-efficient as it enables structured N:M pruning. However, the inference latency and memory consumption of low-rank refinement is not measured.
Below, we provide some missing measurements and analyses. Note that we deliberately employed a conservative evaluation approach to ensure fair comparison with the dense baseline model. We maintained all matrices in their dense representation format, preserving zero elements rather than utilizing sparse matrix formats or specialized hardware acceleration.
Table: Parameter Count and Evaluation Time for Complete ARC-Challenge Benchmark.
| Model Configuration | Non-Zero Parameter Count | Parameter Count | ARC-C Eval Time | Relative Time |
|---|---|---|---|---|
| Llama-2-7B | ||||
| Dense baseline | 50.5s | 1.0x | ||
| Magnitude 50% sparse | 3.5B | 6.7B | 50.7s | ~1.0x |
| Proposed Method (k=128, 50%) | 3.8B | 7.1B | 50.5s | ~1.0x |
| Llama-2-13B | ||||
| Dense baseline | 13.0B | 13.0B | 73.2s | 1.0× |
| Magnitude 2:4 sparse | 6.7B | 13.0B | 73.9s | ~1.01× |
| Proposed Method (k=128, 2:4) | 7.2B | 13.5B | 77.9s | ~1.06× |
Q1: The rank k is manually set in the paper, will a poor choice of k lead to unnecessary computational overhead or insufficient recovery of pruned weights?
Yes. The choice of k leads to a trade-off between performance recovery and computational cost. The low-rank component adds parameters and FLOPs, so an unnecessarily high k value directly increases these costs with diminishing returns. Figure 2(b) in the paper shows the cumulative energy retention for different k values; the curve begins to flatten as k increases, indicating diminishing returns.
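A quick way to inspect this trade-off for any given layer's residual (a hypothetical helper, not the authors' code) is to compute the cumulative singular-value energy directly:

```python
import torch

def energy_retention(residual, ks=(64, 128, 256, 512)):
    """Fraction of the residual's squared Frobenius norm captured by the top-k singular values."""
    sigma = torch.linalg.svdvals(residual)
    energy = torch.cumsum(sigma ** 2, dim=0) / torch.sum(sigma ** 2)
    return {k: round(energy[k - 1].item(), 3) for k in ks if k <= sigma.numel()}

# Toy usage on a random matrix; in practice the input would be the residual W - S of a pruned layer.
print(energy_retention(torch.randn(1024, 512)))
```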
Q2: If given fixed inputs, does the iterative refinement method always converge to a stable solution?
Yes. As shown in Theorem 4.2, Theorem 4.3, and Corollary 4.4 in the manuscript, the iterative refinement method always converges to a stable solution as $t \to \infty$, and the approximation error is bounded by
$$\left\|W - \left(S^{(t)} + L_k^{(t)}\right)\right\|_F \leq \sqrt{(r-k)}\;\sigma_{k+1}\!\left(L^{(t)}\right),$$
where $r$ is the rank of the weight matrix $W$ and $\sigma_{k+1}(L^{(t)})$ is the $(k+1)$-th largest singular value of $L^{(t)}$.
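The right-hand side is the standard rank-k truncation error bound for $L^{(t)}$; a small standalone numeric check of that inequality on a random matrix (shapes are illustrative, and $r$ is taken as $\min(m, n)$ for a full-rank random matrix) could look like:

```python
import torch

torch.manual_seed(0)
m, n, k = 256, 128, 16
L = torch.randn(m, n)                                # stands in for L^(t)

U, sigma, Vh = torch.linalg.svd(L, full_matrices=False)
L_k = U[:, :k] @ torch.diag(sigma[:k]) @ Vh[:k, :]   # best rank-k approximation of L

err = torch.linalg.norm(L - L_k)                     # Frobenius truncation error
bound = ((min(m, n) - k) ** 0.5) * sigma[k]          # sqrt(r - k) * sigma_{k+1}, with r = min(m, n) here
assert err <= bound + 1e-4                           # Eckart-Young: err^2 = sum of tail sigma_i^2
```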
This paper introduces a novel approach to improve the performance of sparse language models through low-rank refinement. The main contribution of the paper is a method that refines sparse models using a low-rank refinement, which leads to improved precision. This approach is theoretically grounded, with proofs and additional lemmas provided in the appendix to support the claims.
update after rebuttal
I will keep my ratings since most of my concerns are solved.
Questions for the Authors
- In line 6 of Algorithm 1, what is the intuition behind this equation?
- How should the hyperparameters be selected? See weakness 3.
Claims and Evidence
The claims made in the submission appear to be supported by theoretical analysis and experimental results.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria appear to be well-suited for the problem of enhancing sparse language models.
Theoretical Claims
Yes. The paper includes theoretical analysis with proofs and additional lemmas to support its claims, particularly regarding the iterative weight update algorithm. The theoretical analysis demonstrates the favorable convergence properties of the proposed method and provides a rigorous foundation for its effectiveness.
However, the paper does not include the full details of Lemma A4.2 in Appendix A.
Experimental Design and Analysis
Yes, I checked the soundness and validity of the experimental designs. The paper presents experimental results to validate the proposed method, particularly focusing on the Llama models.
Supplementary Material
Yes. Particularly the proofs for the theoretical claims.
Relation to Prior Literature
This paper addresses the challenge of improving the performance of sparse language models, a well-studied problem in machine learning. Prior work has explored techniques like unstructured pruning and N:M structured pruning to reduce parameter count while maintaining performance. The paper's experimental results on the Llama models align with prior findings that higher sparsity levels can lead to more severe performance degradation. This observation is consistent with the literature on sparse model performance, where sparsity is often traded off against computational cost and precision.
Missing Essential References
While the paper mentions magnitude pruning and N:M structured pruning, it does not discuss structured sparsity techniques, such as block sparsity or channel pruning, which have been shown to improve hardware efficiency and model performance.
Other Strengths and Weaknesses
Strengths:
- This paper is well motivated and this approach effectively bridges the gap between dense and sparse models.
- The paper is well-structured, with a clear presentation.
- The theoretical analysis is solid.
- Code is provided for reproduction.
Weaknesses:
- While the paper mentions magnitude pruning and N:M structured pruning, it does not discuss structured sparsity techniques, such as block sparsity or channel pruning, which have been shown to improve hardware efficiency and model performance.
- This paper does not include the full details of lemma A4.2 in the Appendix A.
- The iterative refinement process introduces new hyperparameters (e.g., rank k, number of iterations T), but the paper does not provide a clear guideline on how these should be selected across different models and sparsity levels.
- The computational cost of iterative updates may be high as the size of the weight matrix increases, which may limit the applicability of the method to very large models.
Other Comments or Suggestions
The proposed method is effective at high sparsity levels, but at low sparsity levels, the performance gain is not as significant.
W1: While the paper mentions magnitude pruning and N:M structured pruning, it does not discuss structured sparsity techniques, such as block sparsity or channel pruning, which have been shown to improve hardware efficiency and model performance.
We thank the reviewer for pointing out the importance of structured sparsity techniques. It is straightforward to apply the proposed method to structured sparsity techniques such as block sparsity or channel pruning by simply changing the binary mask matrix P. The core algorithm is compatible with any arbitrary binary mask pattern, including structured ones. For structured patterns like block sparsity (e.g., 4×4 blocks) or channel pruning, P would have the corresponding pattern of 0s and 1s.
Theoretically, Theorem 4.1 (Sparsity Preservation) guarantees that the method preserves whatever sparsity pattern is defined by $P$.
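To illustrate, structured masks of this kind are straightforward to construct; the helpers below (names, block size, and keep ratio are illustrative assumptions) produce a 2:4 mask and a block mask that could be passed to the refinement in place of an unstructured one:

```python
import torch

def mask_2to4(W):
    """2:4 mask: keep the 2 largest-magnitude weights in every consecutive group of 4 within a row."""
    groups = W.abs().view(-1, 4)
    idx = groups.topk(2, dim=1).indices
    return torch.zeros_like(groups).scatter_(1, idx, 1.0).view_as(W)

def mask_block(W, block=4):
    """Block mask: keep or drop whole block x block tiles by their average magnitude (keep ~half)."""
    m, n = W.shape
    tiles = W.abs().view(m // block, block, n // block, block).mean(dim=(1, 3))
    keep = tiles >= tiles.median()
    return keep.repeat_interleave(block, 0).repeat_interleave(block, 1).to(W.dtype)

W = torch.randn(64, 64)
P = mask_2to4(W)  # or mask_block(W); either pattern is preserved per Theorem 4.1
```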
W2: This paper does not include the full details of lemma A4.2 in the Appendix A.
The proof of Lemma A4.2 currently appears before the lemma itself. In the revised version, we will relocate the proof to its proper position following Lemma A4.2.
Proof of Lemma A4.2: Let and be the elements of and respectively. By definition of the Frobenius inner product and element-wise multiplication:
W3: The iterative refinement process introduces new hyperparameters (e.g., rank k, number of iterations T), but the paper does not provide a clear guideline on how these should be selected across different models and sparsity levels.
- Rank parameter k: We consistently set $k = 128$ across the experiments to demonstrate the effectiveness of low-rank refinement. In Figure 4(a), we visualize the singular value spectrum of the residual matrix for different values of $k$ (from 64 to 512). As $k$ increases from 64 to 512, we see higher magnitudes for the top singular values, indicating that larger $k$ values allow the method to retain more information. But when $k$ is too large, the performance gain diminishes and the computational cost increases. Therefore, the choice of $k$ should strike a balance between performance and computational cost.
- Number of iterations T: The primary computational bottleneck in the proposed algorithm is the SVD computation performed in each iteration (line 171, page 4). For an $m \times n$ weight matrix $W$, the time complexity of SVD is $O(mn^2)$ (assuming $m \ge n$), so with $T$ iterations the overall time complexity is $O(T \cdot mn^2)$. On the other hand, as shown in Figure 4(d) and by the diminishing convergence speed stated in Theorem 4.5 (Error Bound), $T = 50$ appears to be a reasonable choice for the number of iterations, and further iterations yield diminishing returns in terms of error reduction.
W4: The computational cost of iterative updates may be high as the size of the weight matrix increases, which may limit the applicability of the method to very large models.
As the size of the weight matrix increases, the computational cost of the proposed method increases accordingly. The primary computational bottleneck in the proposed algorithm is the SVD computation performed in each iteration (line 171, page 4). For an $m \times n$ weight matrix $W$, the time complexity of SVD is $O(mn^2)$ (assuming $m \ge n$). With $T$ iterations, the overall time complexity is $O(T \cdot mn^2)$.
In contrast, the PCP baseline would typically use an iterative optimization method (such as Adam) that also requires an SVD computation in each iteration to handle the nuclear norm. Furthermore, the PCP baseline requires substantially more iterations ($T \approx 5000$) to achieve results comparable to the proposed method ($T = 50$).
Table: Computational Efficiency Comparison
| Method | Time Complexity | Typical Iterations | Relative Computational Cost |
|---|---|---|---|
| Proposed Algorithm | $O(T \cdot mn^2)$ | 50 | 1× |
| PCP Baseline | $O(T \cdot mn^2)$, larger $T$ | 5000 | ~100× |
| Zero-shot SVD | $O(mn^2)$ | 1 | ~0.02× |
Q1: In line 6 of Algorithm 1, what is the intuition behind this equation?
This equation in Algorithm 1 defines a linear schedule for increasing the rank from 2 to k across T iterations. Starting with low rank forces the algorithm to capture the most important singular components first, and each iteration gradually incorporates more subtle details from higher singular components. The gradual increase in rank helps maintain numerical stability and helps the algorithm to converge faster.
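Since the exact expression in line 6 is not reproduced in this thread, one plausible form of such a linear schedule (an assumption, not the paper's equation) is

$$k_t = \min\!\left(k,\; 2 + \left\lceil (k-2)\,\frac{t}{T} \right\rceil \right), \qquad t = 0, 1, \dots, T,$$

so that $k_0 = 2$ and $k_T = k$; with $k = 128$ and $T = 50$, the rank grows by roughly 2 to 3 per iteration.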
Q2: How to select the hyperparameters? See weaknesses 3.
Please refer to the response to W3.
Solved most of my concerns. I will keep my ratings.
Magnitude pruning removes weights that have the smallest absolute values. However, traditional pruning methods require re-training the model to recover performance, which is computationally expensive and requires extensive data or a teacher model. To address this, the authors propose approximating the dense matrix as the sum of a sparse matrix with a maintained sparsity pattern and a low-rank matrix. The main contributions are as follows:
- The weight $W$ is decomposed into a sparse part $S$ and a low-rank part $L$. The authors transform the search for $S$ and $L$ into an optimization problem, as in Eq. (3).
- The authors propose to incorporate the binary mask $P$ into the optimization process to ensure that the sparsity pattern of $S$ is fixed by $P$.
- The sparse weight $S$ is iteratively refined with an adaptive rank increase.
In the experiments, the proposed method is validated on WikiText-2, TruthfulQA, GSM8K, ARC-C and MMLU with Llama models. Experimental results demonstrate that low-rank refinement significantly enhances model performance, particularly at high sparsity levels.
Questions for the Authors
In general, this paper performs well in its clarity, structure, and completeness. The idea is innovative and the improvement is significant. However, end-to-end inference acceleration is missing. I will recommend a weak accept, and I suggest the authors pay attention to the inference part, as well as to results on the latest language models. Should those weaknesses and questions be addressed, I will raise my scores accordingly.
Claims and Evidence
Most of the claims are clear and convincing.
Methods and Evaluation Criteria
Most of the methods and evaluation criteria make sense.
Theoretical Claims
Most of the proofs make sense.
Experimental Design and Analysis
Most of the experimental designs are valid.
Supplementary Material
There is no supplementary material.
Relation to Prior Literature
This study is closely related to previous post-training pruning works, such as SparseGPT and Wanda.
Missing Essential References
No missing related works.
Other Strengths and Weaknesses
Strengths:
- The paper is well organized and well written. The technical content is explained in sufficient detail. The equations are very clear and easy to understand.
- Comprehensive experiments are performed in this paper. The reviewer appreciates the authors' effort in validating generation tasks such as GSM8K, rather than simply providing perplexity results. Since generation tasks are usually harder, the improvement is significant.
Weaknesses:
- End-to-end inference acceleration is missing. It's better to report speedup for completeness.
Other Comments or Suggestions
Comments:
- I suggest that the authors add some frontier models, such as Llama 3.1 8B. Llama is quite old and may lack reasoning and mathematics abilities. This may affect the GSM8K results.
Thank you for the insightful feedback and constructive suggestions.
W1: End-to-end inference acceleration is missing. It's better to report speedup for completeness.
- Without sparse matrix formats or specialized hardware acceleration, the inference time is as follows:
Firstly, for a fair comparison with the dense baseline model, we took a conservative approach in our evaluation.
Specifically, we stored all matrices in their full dense format and kept all zero elements (the weight matrices are torch.Tensor objects), without leveraging any sparse matrix formats or hardware-specific optimizations for acceleration.
Table: Parameter Count and Evaluation Time for Complete ARC-Challenge Benchmark.
| Model Configuration | Non-Zero Parameter Count | Parameter Count | ARC-C Eval Time | Relative Time |
|---|---|---|---|---|
| Llama-2-7B | ||||
| Dense baseline | 50.5s | 1.0x | ||
| Magnitude 50% sparse | 3.5B | 6.7B | 50.7s | ~1.0x |
| Proposed Method (k=128, 50%) | 3.8B | 7.1B | 50.5s | ~1.0x |
| Llama-2-13B | ||||
| Dense baseline | 13.0B | 13.0B | 73.2s | 1.0× |
| Magnitude 2:4 sparse | 6.7B | 13.0B | 73.9s | ~1.01× |
| Proposed Method (k=128, 2:4) | 7.2B | 13.5B | 77.9s | ~1.06× |
- With sparse matrix formats and hardware acceleration, we convert the weight matrices to their sparse format. Specifically, we use
torch.sparse.to_sparse_semi_structured to convert the weight matrices to torch.sparse.SparseSemiStructuredTensor objects. The following table shows the memory usage of the proposed method with and without sparse matrix formats. The actual inference speedup varies based on factors like batch size and input sequence length. With hardware-accelerated N:M structured pruning (on NVIDIA Ampere GPUs and newer), we observe approximately 1.1x faster inference in wall-clock time compared to dense models. However, it is important to note that for unstructured pruning patterns or on GPUs without dedicated sparse acceleration support, using sparse operations can actually result in slower inference times compared to dense computation.
Table: A comparison of memory usage with and without end-to-end inference acceleration.
| Model Configuration | Memory Usage | Relative Memory Usage |
|---|---|---|
| Llama-2-13B | ||
| Proposed Method (k=128, 2:4), weight matrices are torch.Tensor | 2x14.6GB | 1.0x |
| Proposed Method (k=128, 2:4), weight matrices are torch.sparse.SparseSemiStructuredTensor | 2x9.1GB | ~0.62x |
W2: I suggest that the authors add some frontier models, such as Llama 3.1 8B. Llama is quite old and may lack reasoning and mathematics abilities. This may affect the GSM8K results.
Thank you for the suggestion. We have added Llama-3.1-8B to the experiments. The results are shown in the following table.
Table: Llama-3.1-8B results.
| Model | HellaSwag | WinoGrande | ARC-e | TruthfulQA | GSM8K | ARC-c | MMLU |
|---|---|---|---|---|---|---|---|
| Dense baseline | 78.9 | 73.6 | 81.1 | 45.2 | 49.8 | 53.4 | 63.5 |
| Pruning Method | |||||||
| Magnitude 50% | 56.4 | 57.6 | 56.7 | 42.9 | 1.3 | 35.8 | 35.3 |
| Magnitude 50% + Ours | 66.8 | 67.8 | 68.7 | 38.9 | 6.5 | 42.6 | 45.7 |
Thank you for adding inference acceleration results and Llama 3.1 results. The results are reasonable, and I will recommend a weak accept for your paper.
The paper gives a new way to improve and fine-tune a pruned model, using low-rank matrices during the refinement. The paper is well written and has fairly extensive experiments, along with some theoretical convergence analyses. Reviewers are all fairly positive about the paper, and the authors are encouraged to incorporate comments from the review and rebuttal into the final version.