PaperHub
Score: 8.2/10
Poster · 4 reviewers
Ratings: 4, 5, 5, 6 (min 4, max 6, std 0.7)
Confidence: 4.0
Originality: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.3
NeurIPS 2025

3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce 3BASiL-TM, a highly efficient one-shot post-training method for Sparse plus Low-Rank decomposition of LLMs that reduces the WikiText2 perplexity gap to dense model by over $30\%$ compared to prior methods.

Abstract

Sparse plus Low-Rank $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design a transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal, as it can enhance any $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{L}\mathbf{R})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
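To make the decomposition concrete: at inference, each compressed linear layer computes $y = (S + LR)x$, i.e., a structured-sparse matmul plus a low-rank correction. The sketch below is an illustrative PyTorch mock-up of such a layer; the class and variable names are ours, not from the paper's repository.

```python
import torch


class SparsePlusLowRankLinear(torch.nn.Module):
    """Illustrative (S + LR) linear layer: y = (S + L @ R) x.

    S is the sparse component (e.g., a 2:4-pruned weight), L is (d_out, r)
    and R is (r, d_in). This is a hypothetical sketch, not the authors' code;
    a real deployment would use a structured-sparse kernel for S.
    """

    def __init__(self, S_dense: torch.Tensor, L: torch.Tensor, R: torch.Tensor):
        super().__init__()
        self.S = S_dense.to_sparse()          # COO stand-in for a 2:4 sparse kernel
        self.L = torch.nn.Parameter(L)        # (d_out, r)
        self.R = torch.nn.Parameter(R)        # (r, d_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> (batch, d_out)
        sparse_out = torch.sparse.mm(self.S, x.t()).t()
        low_rank_out = (x @ self.R.t()) @ self.L.t()
        return sparse_out + low_rank_out
```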
Keywords
Sparse plus Low-Rank, Model Compression, Large Language Models, LoRA, PEFT, ADMM, Optimization

Reviews and Discussion

Review
Rating: 4

In this paper, the authors propose 3BASiL, a unified algorithmic framework for compressing LLMs by decomposing weight matrices into sparse + low-rank form. Its key innovations include:

  1. A 3-block ADMM algorithm that jointly updates the sparse, low-rank, and dual variables, with a closed-form update for each.

  2. A block-level weight fine-tuning step that further improves performance.

Strengths and Weaknesses

Strengths:

  1. The method offers closed-form updates for each component (sparse, low-rank, dual), and is supported by theoretical convergence guarantees.

  2. Compared to prior work, it achieves up to 40% lower perplexity and 2–3× faster compression, demonstrating both efficiency and effectiveness.

Weaknesses:

  1. The necessity of adding the term $\|\hat{W} - (S + L)\|$ in Equation (1) is not clearly explained. When computing updates, doesn't it behave similarly to $\|(X + \lambda)\hat{W} - (X + \lambda)(S + L)\|$?

  2. While the method focuses on N:M sparsity, it seems generalizable. Have you evaluated its performance under unstructured or other structured sparsity formats? Additionally, could it adapt to non-uniform sparsity distributions, such as OWL[1]?

  3. The paper mentions 80 alternating minimization steps during compression. Have you conducted sensitivity analysis on the number of steps?

[1] Lu et al. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity.

Questions

Please see the weakness.

Limitations

The authors address the limitations in the Conclusion and Limitations section.

Final Justification

My concerns are addressed. I lean to accept.

Formatting Issues

The equations between L117 and L122 should have indices.

Author Response

We would like to thank the reviewer for the high-quality review, highlighting the strengths of our paper, and the useful suggestions to improve the manuscript that led to an insightful finding regarding using OWL for (S+LR) methods.

Weaknesses:

W1: You are making a very good point. If one expands the expression of equation (1), it is equivalent to minimizing $\frac{1}{2}\mathrm{Tr}\big((\hat{W} - S - L)^T (X^TX + \lambda I) (\hat{W} - S - L)\big)$. So it behaves like adding a regularization term to the local layer-wise reconstruction error Hessian $X^TX$. The aim here is to make the Hessian matrix $X^TX + \lambda I$ invertible (which is satisfied for any $\lambda > 0$) and reduce the numerical errors when applying equation (4).

From a global optimization perspective, we aim to minimize the layer-wise reconstruction error *without deviating too far* from the original weights. It has also been used in previous (S + LR) compression methods like Hassle-free.
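As a quick numerical sanity check of this expansion (an illustrative snippet of ours, not code from the paper), the identity $\|X\Delta\|_F^2 + \lambda\|\Delta\|_F^2 = \mathrm{Tr}\big(\Delta^T (X^TX + \lambda I)\Delta\big)$ with $\Delta = \hat{W} - S - L$ can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, lam = 32, 16, 8, 1e-2

X = rng.standard_normal((n, d_in))           # calibration activations
Delta = rng.standard_normal((d_in, d_out))   # residual W_hat - (S + L)

lhs = np.linalg.norm(X @ Delta, "fro") ** 2 + lam * np.linalg.norm(Delta, "fro") ** 2
rhs = np.trace(Delta.T @ (X.T @ X + lam * np.eye(d_in)) @ Delta)

print(np.isclose(lhs, rhs))  # True: the lambda term simply damps the Hessian X^T X
```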

W2.1: Our method is indeed generalizable to unstructured sparsity patterns. We experiment with a "less aggressive compression" configuration (0.5 + 128) and study the differences after further LoRA fine-tuning. We believe that it is interesting to further explore these settings as they are "near-lossless" configurations.

Table 1: Comparison of the perplexity of (S+LR) algorithms before and after LoRA fine-tuning under the (0.5 + 128) configuration on Llama3.2-1B. For perplexity, lower is better (↓). If a dataset has the suffix "-LFT", we report the perplexity on that dataset after further LoRA fine-tuning on the C4 dataset [this can decrease performance on the WT2 and PTB datasets if the model has performance comparable to the dense model--3BASiL seems to be the only one to have this property].

| Method | Config | C4 ↓ | C4-LFT ↓ | WT2 ↓ | WT2-LFT ↓ | PTB ↓ | PTB-LFT ↓ |
|---|---|---|---|---|---|---|---|
| OATS | 0.5 + 128 | 17.99 | 16.71 | 12.29 | 11.76 | 21.68 | 20.97 |
| Hassle-free-SparseGPT | 0.5 + 128 | 17.25 | 16.38 | 11.91 | 11.53 | 21.01 | 20.53 |
| Hassle-free-ALPS | 0.5 + 128 | 16.81 | 16.12 | 11.62 | 11.41 | 20.59 | 20.54 |
| 3BASiL | 0.5 + 128 | 16.17 | 15.69 | 11.17 | 11.13 | 19.84 | 20.00 |
| 3BASiL-TM | 0.5 + 128 | 15.78 | 15.44 | 10.89 | 10.91 | 19.19 | 19.43 |
| dense | -- | 14.01 | 14.01 | 9.75 | 9.75 | 17.59 | 17.59 |

W2.2: Our method can also support non-uniform sparsity distributions and we have added support for OWL for unstructured sparsity. This is a great insight and we would like to thank the reviewer for raising this important remark. We have tried to activate OWL for the configuration (0.7 + 64) for Llama3-8B. It turns out that this does improve the results of 3BASiL. This is not trivially true because OWL deltas were optimized for pure pruning methods. We believe that this is an insightful finding for the community. It opens further research directions as to how to create similar deltas for the low-rank components, so thanks to the reviewer for this suggestion!

Table 2: Comparison of the performance of our proposed method 3BASiL with and without activating OWL on the configuration (0.7 + 64) on Llama3-8B. For perplexity (lower is better, ↓). For accuracy (higher is better, ↑).

| Method | Config | C4 ↓ | WT2 ↓ | PTB ↓ | PIQA ↑ | ARC-E ↑ | ARC-C ↑ | BoolQ ↑ | HellaSwag ↑ | OpenBookQA ↑ | RTE ↑ | WinoGrande ↑ | Avg ZS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3BASiL | 0.7 + 64 | 25.33 | 20.51 | 28.01 | 71.16 | 53.49 | 31.83 | 74.22 | 55.32 | 32.60 | 55.96 | 65.90 | 55.06 |
| 3BASiL-OWL | 0.7 + 64 | 23.32 | 19.84 | 27.09 | 71.93 | 59.81 | 32.68 | 74.71 | 58.50 | 33.80 | 53.79 | 66.22 | 56.43 |

We intend to add a section showcasing the value of activating OWL for highly sparse unstructured configurations [e.g., (0.7 + 64)] and to add a discussion motivating the question of non-uniform sparsity and non-constant rank allocation as a future research direction.
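For readers who want to experiment with this, a rough sketch of how OWL-style non-uniform allocation could be wired into an (S+LR) pipeline is shown below. It only illustrates the idea (layers with more outliers are pruned less while the average sparsity matches the global target); the scaling rule and the `outlier_ratio` inputs are our own simplification, not OWL's reference implementation.

```python
import numpy as np


def owl_style_sparsities(outlier_ratio, target_sparsity=0.7, max_shift=0.08):
    """Map per-layer outlier ratios to per-layer sparsity levels.

    Simplified OWL-style allocation: layers with more outliers keep more
    weights, the mean sparsity stays at `target_sparsity`, and no layer
    moves more than `max_shift` away from the uniform value.
    """
    r = np.asarray(outlier_ratio, dtype=float)
    shift = -(r - r.mean())                               # more outliers -> less sparsity
    shift = max_shift * shift / (np.abs(shift).max() + 1e-12)
    return np.clip(target_sparsity + shift, 0.0, 1.0)


# Example: 4 layers; the first has the most outliers and is pruned the least.
print(owl_style_sparsities([0.09, 0.04, 0.03, 0.02]))  # mean stays at 0.70
```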

W3: 80 alternating minimization steps is the default value in the OATS and Hassle-free papers. A sensitivity analysis was conducted in OATS, for instance [see Figure 1 of the OATS paper [1]], showing that performance plateaus after 80 minimization steps. For these alternating-minimization-based methods, increasing the number of steps generally lowers objective (1) and improves performance. If we were to reduce the number of steps, performance would degrade. If we were to increase it, the performance gain is negligible (plateau reached) in our experiments while the runtime still increases.

Our algorithm is different from alternating minimization [used by OATS/HASSLE-free] and the number of iterations is not directly comparable to OATS and HASSLE-free. That being said, 3BASiL is both faster and achieves a better performance than OATS and HASSLE-free [see figure 2 in our submission].

[1] Zhang et al., OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition (ICLR'25)

Comment

Thank you to the authors for their response.

Most of my concerns have been adequately addressed. I will maintain my score, which leans toward acceptance.

Comment

We sincerely appreciate your time and valuable feedback, which helped us improve the paper. Thank you for maintaining your positive assessment of our work!

Review
Rating: 5

This paper proposes a sparse + low-rank approach for model compression and introduces cross-layer joint optimization to improve the performance of the compressed model.

Strengths and Weaknesses

Strengths

1. This work proposes a novel three-block variable decomposition framework that jointly optimizes the sparsity and low-rank constraints, which addresses the theoretical limitations of existing alternating minimization methods.

2. It introduces a TM fine-tuning step after model compression to enhance the performance of the compressed model.

Weaknesses

1. The method is only applicable under semi-structured conditions, which limits its scope of application.

2. Experiments were only conducted on models of 1-8B size. The performance on larger-scale models remains to be further verified.

Questions

1. The paper primarily presents experimental results for the semi-structured sparse component. To demonstrate the method's general applicability, could additional comparisons with unstructured sparsity be attempted?

2. Could the actual acceleration of the compressed model in practical usage be provided, to show that the low-rank + sparse method is indeed well adapted to hardware?

3. If conditions permit, a more extensive evaluation could be conducted.

Limitations

Yes

Final Justification

These responses have addressed most of my concerns. Given that the current evaluation is already positive, we will not modify the score.

Formatting Issues

NO

Author Response

We would like to thank the reviewer for the extensive feedback, highlighting the strengths of our paper, and the valuable suggestions to improve the manuscript.

Weaknesses and Questions

W1 & Q1: Our method is indeed generalizable to unstructured sparsity patterns. We experiment with a "less aggressive compression" configuration (0.5 + 128) and study the differences after further LoRA fine-tuning. We believe that it is interesting to further explore these settings as they are "near-lossless" configurations.

Table 1: Comparison of the perplexity of (S+LR) algorithms before and after LoRA fine-tuning under the (0.5 + 128) configuration on Llama3.2-1B. For perplexity, lower is better (↓). If a dataset has the suffix "-LFT", we report the perplexity on that dataset after further LoRA fine-tuning on the C4 dataset [this can decrease performance on the WT2 and PTB datasets if the model has performance comparable to the dense model--3BASiL seems to be the only one to have this property].

| Method | Config | C4 ↓ | C4-LFT ↓ | WT2 ↓ | WT2-LFT ↓ | PTB ↓ | PTB-LFT ↓ |
|---|---|---|---|---|---|---|---|
| OATS | 0.5 + 128 | 17.99 | 16.71 | 12.29 | 11.76 | 21.68 | 20.97 |
| Hassle-free-SparseGPT | 0.5 + 128 | 17.25 | 16.38 | 11.91 | 11.53 | 21.01 | 20.53 |
| Hassle-free-ALPS | 0.5 + 128 | 16.81 | 16.12 | 11.62 | 11.41 | 20.59 | 20.54 |
| 3BASiL | 0.5 + 128 | 16.17 | 15.69 | 11.17 | 11.13 | 19.84 | 20.00 |
| 3BASiL-TM | 0.5 + 128 | 15.78 | 15.44 | 10.89 | 10.91 | 19.19 | 19.43 |
| dense | -- | 14.01 | 14.01 | 9.75 | 9.75 | 17.59 | 17.59 |

W2 & Q2: Thank you for pointing out this very important point! We are currently running experiments on the model OPT-30B which we intend to include in the revised paper [see Table 2 below for a comparison between 3BASiL and Hf-SparseGPT--it seems that this type of compression has less impact on performance for larger models].

According to SLOPE [1], we should expect an inference acceleration of up to 1.25× and a memory reduction of up to 0.63× for this model if we use their custom-designed CUDA kernels during inference.

Table 2: Comparison of the perplexity of Hf-SparseGPT and 3BASiL under the (2:4 + 64) configuration for an OPT-30B model.

| Method | Config | C4 ↓ | WT2 ↓ | PTB ↓ |
|---|---|---|---|---|
| Hf-SparseGPT | 2:4 + 64 | 11.64 | 10.34 | 14.54 |
| 3BASiL | 2:4 + 64 | 11.55 | 10.04 | 14.32 |
| Dense | -- | 11.44 | 9.56 | 14.04 |

Q3: In the revised paper, we are going to expand our experiments to report more unstructured sparsity (S+LR) configurations similar to Table 1 above. We intend to include more pure pruning results to showcase the universality of Transformer Matching (see Table 2 of our response to reviewer JUTM). In addition, we are going to include a subsection showcasing the applicability of our method to non-uniform transformer compression (see Table 2 of our response to reviewer J8EX). Finally, we will report some results on the model OPT-30B to see the performance of 3BASiL on larger models [similar to Table 2 above].

[1] Mozaffari et al., SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs (ICLR'25).

Comment

These responses have addressed most of my concerns. Given that the current evaluation is already positive, we will not modify the score.

Comment

We sincerely appreciate your time and valuable feedback, which helped us improve the paper. Thank you for maintaining your positive assessment of our work!

Review
Rating: 5

In this paper, an ADMM algorithm is studied for compressing dense weights into sparse-plus-low-rank (S+LR) representations. With particular sparse patterns, such as 2:4 sparsity, S+LR matrices can be computed efficiently on GPUs. The paper proposes a multi-block ADMM procedure to initialize the factors using calibration data, which provides a good initialization and thus improves the model quality after fine-tuning. Each factor is obtained by minimizing the augmented Lagrangian while fixing the others, with convergence guarantees as the penalty term rho increases. To address misalignment in layer-wise output matching, the paper also introduces module-wise output matching. Experimental results demonstrate that 3BASiL improves both one-shot and fine-tuned compression outcomes across various structured sparsity patterns on large language models.

Strengths and Weaknesses

Strengths

  1. The paper presents both theoretical and practical contributions. The proposed algorithm appears to outperform other S+LR decomposition models, and the experimental results show that the compressed models maintain decent performance. Theorem 1 provides convergence guarantees, which is a strong point.
  2. The writing quality is good, making the paper easy to understand.

Weaknesses

Most of my concerns relate to the experimental results.

  1. Although Theorem 1 proves convergence of the proposed ADMM steps, it is unclear whether convergence occurs in practice. Please consider including plots showing the diminishing difference between L and S at each step or the convergence of the augmented Lagrangian.
  2. The actual inference speedup is not reported. While it seems likely that the compressed network will run efficiently on GPUs supporting 2:4 or 4:8 sparsity, reporting real inference speed measurements would help emphasize the practical impact of the proposed method.

Questions

  1. Why is 3BASiL designed with a 3-block ADMM? In the Robust PCA literature, an ADMM algorithm to solve robust PCA, $\min_{L,S} \|L\|_* + \|S\|_1$ s.t. $L+S=W$, can be formulated as a 2-block method. Is the use of 3 blocks in 3BASiL due to the hard sparsity constraint, unlike the $\ell_1$ loss in RPCA?
  2. The last inequality of Equation 18 uses $\|S^{(t+1)} - S^{(t)}\|_F \le \frac{C_A}{\rho_t}$, whereas the bound in Equation 17 is $\frac{3C_A}{\rho_{t-1}}$. How is the former inequality obtained? Can't the bound in Equation 17 be used directly?
  3. Transformer-level Matching is proposed to update the factors with a better proxy than layer-level matching. If the quality of this proxy is a concern, couldn’t the S and L factors be updated directly based on the original training loss (e.g., cross-entropy) or a knowledge distillation loss on a subset of C4, instead of splitting the process into Transformer-level Matching and LoRA fine-tuning?

Limitations

Yes

Final Justification

All of my concerns are resolved.

Formatting Issues

No

Author Response

We would like to thank the reviewer for the thorough review, for highlighting the strengths of our submission, and for raising important fundamental questions about our compression algorithm.

Weaknesses & Questions

W1: Thanks for pointing this out. We also think that including plots of the convergence of the objective and iterates will further improve the quality of the paper; we will add such plots in the revised version.

Since we are not allowed to include plots in this rebuttal per NeurIPS policy, we wanted to share the simple logs we used to follow the convergence of 3BASiL [for the compression of self_attn.q_proj of the first transformer block of Llama3-8B under a (2:4 + 128) configuration].

For iteration 009, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 1952.1258544921875

For iteration 049, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 664.7432861328125

For iteration 149, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 176.23971557617188

For iteration 199, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 119.56678771972656

For iteration 299, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 87.93989562988281

For iteration 399, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 87.01194763183594

For iteration 479, the true loss ||X(W_old - (W_S + B @ A))||_F^2 = 86.82907104492188

[479 is the last iteration; note that iterations in our algorithm are extremely efficient and are not comparable to the default value of 80 alternating-minimization steps used in HASSLE-free and OATS].
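For completeness, the quantity in these logs is the layer-wise reconstruction loss itself; a minimal sketch of how it can be computed per iteration is given below (tensor shapes and names are our assumptions, not the authors' code).

```python
import torch


def reconstruction_loss(X, W_old, W_S, B, A):
    """||X (W_old - (W_S + B @ A))||_F^2 on calibration activations X.

    Assumed shapes (illustrative): X (n, d_in), W_old and W_S (d_in, d_out),
    B (d_in, r), A (r, d_out). Logging this value every few iterations
    produces traces like the ones above.
    """
    residual = W_old - (W_S + B @ A)
    return torch.linalg.matrix_norm(X @ residual, ord="fro") ** 2
```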

W2: Thank you for raising this very important point! We are currently running experiments on the OPT-30B model, which we intend to include in the revised paper [see Table 2 below for a comparison between 3BASiL and Hf-SparseGPT--it seems that this type of compression has less impact on performance for larger models].

According to SLOPE [1], we should expect an inference acceleration of up to 1.25× and a memory reduction of up to 0.63× for this model if we use their custom-designed CUDA kernels during inference.

Table 2: Comparison of the perplexity of Hf-SparseGPT and 3BASiL under the (2:4 + 64) configuration for an OPT-30B model.

| Method | Config | C4 ↓ | WT2 ↓ | PTB ↓ |
|---|---|---|---|---|
| Hf-SparseGPT | 2:4 + 64 | 11.64 | 10.34 | 14.54 |
| 3BASiL | 2:4 + 64 | 11.55 | 10.04 | 14.32 |
| Dense | -- | 11.44 | 9.56 | 14.04 |

Q1: Thank you for making the connection between our optimization formulation and the RPCA literature. Our problem is quite different from RPCA, as we aim to match the **outputs** of each weight matrix, $\|X(W - S - L)\|_F$, as opposed to matching the actual weight matrices, $\|W - S - L\|_F$. A block in ADMM is introduced to this end. Please also note [line 154 of our paper] that our 3-block ADMM approach can be reformulated as a standard 2-block method because the Lagrangian is separable with respect to the 2 blocks introduced for the optimization.
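For intuition, a schematic of the kind of splitting involved (written in our own notation, not the paper's exact formulation) introduces an auxiliary variable $D$ that carries the output-matching term, with the sparse and low-rank constraints handled in their own blocks:

$$
\min_{S,\,L,\,D}\ \tfrac{1}{2}\,\|X(\hat{W} - D)\|_F^2 + \tfrac{\lambda}{2}\,\|\hat{W} - D\|_F^2
\quad \text{s.t.}\quad D = S + L,\ \ S \in \mathcal{S}_{N:M},\ \ \operatorname{rank}(L) \le r,
$$

$$
\mathcal{L}_\rho(S, L, D, \Lambda) = \tfrac{1}{2}\,\|X(\hat{W} - D)\|_F^2 + \tfrac{\lambda}{2}\,\|\hat{W} - D\|_F^2 + \langle \Lambda,\, D - S - L \rangle + \tfrac{\rho}{2}\,\|D - S - L\|_F^2 .
$$

The exact blocks used by 3BASiL differ from this sketch, but it illustrates why such splittings can sometimes be regrouped into a standard 2-block scheme when the augmented Lagrangian decouples across a pair of the primal variables, which is the point referenced at line 154 of the paper.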

Q2: This is a typo, thanks for catching it. The correct inequality should be

$\| L^{(t+1)} - L^{(t)} \|_F \leq \frac{3 C_F^2 C_A}{\rho_{t-1}},$

which directly uses the bound in Eq. (17). The constant $C$ should be correspondingly changed to $C = 3 C_F^2 C_A$ (since $C_F \ge 1$). We deeply appreciate the reviewer's careful examination and have corrected this typo in our revised manuscript.

Q3: This is a fundamental question related to one-shot compression methods! It is true that $S$ and $L$ can be updated with the original loss function ($L$ is updated during LoRA fine-tuning, for example) or with a knowledge distillation loss from the original model. However, if one aims to update $S$, one needs to run full back-propagation through the entire LLM, which is very memory-intensive. Our entire pipeline--compression of Llama3-8B with 3BASiL, refinement with TM, and LoRA fine-tuning--can be run on a single A100 GPU, whereas full back-propagation, even with batch size 1, results in a CUDA out-of-memory error. That is why Transformer Matching introduces an intermediate [memory-efficient] loss function, which provides a good trade-off between the local layer-wise reconstruction loss and the original training loss.
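To illustrate the memory argument, here is a minimal sketch under our own assumptions (not the paper's exact TM objective): block-level matching only needs gradients through one transformer block at a time.

```python
import torch


def tm_refine_block(dense_block, compressed_block, calib_batches, lr=1e-4):
    """Illustrative transformer-level matching loop (not the paper's exact TM loss).

    The dense block is frozen and only provides targets; gradients flow through
    the compressed block alone, so peak memory is bounded by a single
    transformer block rather than the full LLM.
    """
    opt = torch.optim.Adam(
        [p for p in compressed_block.parameters() if p.requires_grad], lr=lr
    )
    for x in calib_batches:                     # x: (batch, seq, hidden)
        with torch.no_grad():
            target = dense_block(x)             # assumes the block returns a tensor
        loss = torch.nn.functional.mse_loss(compressed_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return compressed_block
```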

[1] Mozaffari et al., SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs (ICLR'25).

Comment

I appreciate your thorough response.

The loss log makes sense and highlights that the algorithm works in practice. Thank you for providing it!

Regarding the inference speedup, does the 2:4+r64 3BASiL configuration correspond to the row "OPT-30B" and the column "1.56% Adapter"? It seems like the expected speedup is 1.53x in Table 1 of [1]. Could the author provide a pointer to the reference value in [1]?

I think this is a strong paper. I'll keep my score the same as I have already championed the paper towards acceptance.

[1] Mozaffari et al., SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs (ICLR'25).

Comment

Rank 64 corresponds to ~0.9% (the hidden dimension is 7168 for OPT-30B), and hence we expect improved speedups and memory savings compared to the reference value for the 1.56% adapter. We will launch the same experiments with rank 112, which corresponds to exactly 1.56%, and report the reference value from [1] in the revised manuscript.

We sincerely appreciate the time you took to review our work and the positive feedback on our submission; we believe it has improved the quality of the paper!

[1] Mozaffari et al., SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs (ICLR'25).

Review
Rating: 6

The large scale of Large Language Models (LLMs) makes them a big target for compression, and one common technique is sparsity which, when coupled with low-rank decomposition, shows promise at matching the dense model's quality. Despite many developments in this direction, there still exists a large gap in model quality. The authors present 3BASiL-TM to help close this gap. First, 3BASiL is a 3-block ADMM algorithm which alternates between updating the sparse version of the weights, the low-rank component, and the sparse mask. Treating L and D as a single variable block reformulates the approach as standard 2-block ADMM and helps guarantee convergence. Next, the authors propose Transformer-level Matching (denoted by a -TM suffix), which serves to further refine the sparse and low-rank components, using the nonlinear behavior of each transformer block, so that the low-rank components are well-initialized for LoRA adaptation. Empirical studies show that TM generalizes to many one-shot S+LR methods, that its application to 3BASiL delivers top-quality results, and that the benefit of 3BASiL-TM is maintained, even after LoRA fine-tuning.

Strengths and Weaknesses

Strengths

This submission is outstanding. It presents a novel collection of related ideas, clearly motivating and proving the benefit of each. Treating each metric in turn:

Quality

I couldn't find any missing links in the chain of reasoning from claim to claim. Each claim was supported by theory or empirical results, as appropriate, with one exception (noted below). This is clearly a complete package and not a work in progress. (Future work is indicated, but well beyond the scope of a single endeavor.)

Clarity

I believe that sufficient detail is presented such that a dedicated practitioner could implement the technique and reproduce results. Fully understanding each step does take careful reading, but the organization and exposition is not to blame; the mechanisms used are simply nontrivial.

Significance

3BASiL-TM represents a major step forward in recovering model quality. There are still gaps to the dense baselines, but they are greatly reduced without undue computational load. Care has been taken not only to find compelling new techniques, but to make them practical for use - runtimes in single-digit hours are entirely reasonable, so users shouldn't need to shy away from one-time applications to their models due to limited resources. (Larger models will require more time, but will also require more resources to train and deploy.) As I discuss immediately below, there are findings within the results that I consider very useful for advancing the field.

Originality

I only noted one missing piece of related work (below); otherwise, the comparison to prior work is complete and shows that the collection of techniques in 3BASiL-TM is both new and important.

A couple observations show the importance of the submission goes beyond its ability to improve zero-shot model quality:

  • Table 3 is particularly interesting: for a given compression rate, it's important to find the right balance between sparsity and the rank of the LR.
  • Figure 4 is also important: I've observed "head-starts" by clever one-shot schemes become washed out after fine-tuning. It's good to see this doesn't happen to 3BASiL-TM (at least for the limited fine-tuning performed).

Weaknesses

I noted one missing comparison - EoRA (1), which seems like it might compete with the "smart initialization" of LoRA parameters offered by transformer matching.

I also could not find evidence to support the claim on line 181 that TM can apply to sparse, but not LR, networks. I can imagine how it will apply, but this is not supported in any of the presented results.

Minor issues:

  • Inconsistent citation styles are confusing - particularly (but not limited to) lines 213-221. Some citations are missing years (e.g. lines 290, 292).
  • The wrong entry in Table 2's 2:4+64LR RTE results is bolded (or there's a typo in one of the 3BASiL results).
  • Citing the N:M sparse format (and the hardware advancements that make it useful) (e.g., 2) would make the background information complete.
  1. Liu et al., EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation, https://arxiv.org/abs/2410.21271
  2. Mishra et al., Accelerating Sparse Deep Neural Networks, https://arxiv.org/abs/2104.08378

Questions

These two questions directly relate to the quality of the submission:

  1. How does TM improve, say, the 2:4 and 4:8 results in Table 3?
  2. How does EoRA interact with 3BASiL as a whole, and specifically with using TM as a smart initialization for LoRA fine-tuning?

Limitations

The limitations are satisfactorily discussed in Section 6.

Final Justification

The authors' rebuttal entirely satisfactorily answered my questions, and reading other reviews and responses has not changed my initial review's conclusions.

Formatting Issues

No concerns

Author Response

We would first like to thank the reviewer for the thorough examination of the paper, highlighting the strengths of our submission and proposing missing pieces that would improve the manuscript.

Weaknesses -- Questions

W1 & Q2: Thank you very much for pointing EoRA which should indeed be included in the related works of our paper. Upon inspection of this important method, we noted that HASSLE-free [a considered competing method] with a special configuration (alternating minimization steps=1 and using our improved implementation--see Table 4 discussion) reduces to EoRA. In that case, Hf-SparseGPT-ours would (i) compress the model weights to 2:4 sparsity using the SparseGPT algorithm and (ii) use the closed-form rank-update (equation 4) for "compensation" as discussed in the EoRA paper [1]. In EoRA, QQ^\prime corresponds to H1/2H^{1/2} in our paper (up to a minor notation difference, as they consider XXTXX^T to be the hessian whereas we use XTXX^TX).

We launched experiments with EoRA, Hf-SparseGPT-ours (default AM steps = 80), and 3BASiL for Llama3-8B under a (2:4 + 128) configuration [similar to Table 2 in the EoRA paper] (we also compare the effect of using TM on top of EoRA to show universality).

We obtain the following results:

Table 1: Comparison of (S+LR) methods with (2:4 + 128) configuration on Llama3-8B. For perplexity (lower is better, ↓). For accuracy (higher is better, ↑).

| Method | C4 PPL ↓ | WT2 PPL ↓ | PTB PPL ↓ | ARC-C ↑ | HellaSwag ↑ | WinoGrande ↑ | Avg. ↑ |
|---|---|---|---|---|---|---|---|
| EoRA | 20.34 | 14.49 | 21.86 | 35.75 | 61.02 | 67.17 | 58.23 |
| EoRA-TM | 16.00 | 11.41 | 17.32 | 38.82 | 65.11 | 66.54 | 59.91 |
| Hf-SparseGPT-ours | 15.46 | 10.50 | 16.05 | 40.61 | 69.27 | 70.72 | 63.30 |
| 3BASiL | 14.45 | 9.56 | 14.46 | 43.94 | 71.29 | 70.65 | 65.06 |
| 3BASiL-TM | 13.30 | 8.82 | 13.92 | 44.88 | 72.28 | 71.59 | 65.40 |

Note that the results of our implementation of EoRA are slightly different than reported in the EoRA paper.

(i) Our EoRA implementation has a slightly worse WikiText ppl than reported in the EoRA paper (11.07) because our calibration data (X matrix) is C4 whereas they consider WikiText2 (training data). Despite this fact, 3BASiL-TM [with calibration C4] largely outperforms EoRA [with calibration WikiText2] on the WikiText2 test perplexity task (from 11.07 to 8.82) under the same constraints.

(ii) Our EoRA implementation has a slightly better ARC-C accuracy than reported in the original EoRA paper. For this metric, we both use the same calibration data C4. That being said, we use more calibration data (128 segments vs. 64 in EoRA paper). Our method 3BASiL-TM improves the accuracy post-compression from 35.75 to 44.88.

We would like to thank the reviewer again for pointing us to this important method. We intend to add it to the related works discussion (Exact Low-Rank updates for layer-wise compression). We also think it is helpful for the compression community if we add a paragraph discussing the underlying connections between our improved implementation of Hf-SparseGPT-ours and EoRA as well as more experiments in the Appendix.

W2 & Q1: This is a very good remark. We have launched preliminary results for the universality of Transformer Matching to pure pruning methods. The results in Table 3 of the paper for pure pruning methods have been extracted from HASSLE-free paper [2]. During the rebuttal, we have added support for SparseGPT and ALPS with TM. The results are below.

Table 2: Comparison of pure pruning methods with 2:4 and 4:8 configurations on Llama3-8B. For perplexity (lower is better, ↓). For accuracy (higher is better, ↑).

| Method | Config | C4 ↓ | WT2 ↓ | PTB ↓ | PIQA ↑ | ARC-E ↑ | ARC-C ↑ |
|---|---|---|---|---|---|---|---|
| SparseGPT | 2:4 | 22.62 | 16.14 | 25.31 | 71.49 | 56.19 | 34.04 |
| SparseGPT-TM | 2:4 | 15.31 | 10.82 | 17.07 | 76.50 | 66.04 | 40.70 |
| ALPS | 2:4 | 19.80 | 14.59 | 22.42 | 73.56 | 59.60 | 35.15 |
| ALPS-TM | 2:4 | 14.99 | 10.69 | 16.53 | 77.31 | 66.88 | 40.87 |
| SparseGPT | 4:8 | 17.56 | 12.27 | 18.63 | 74.86 | 62.92 | 38.82 |
| SparseGPT-TM | 4:8 | 13.68 | 9.25 | 14.51 | 77.97 | 69.99 | 44.11 |
| ALPS | 4:8 | 16.08 | 11.23 | 16.58 | 75.90 | 65.07 | 40.27 |
| ALPS-TM | 4:8 | 13.61 | 9.18 | 14.28 | 78.35 | 69.74 | 42.75 |

We agree with the reviewer that further experiments on pure sparsity algorithms should be included to showcase the universality of our proposed Transformer Matching procedure. In the revised paper, we will add support for all reported pruning methods and their enhanced TM versions.

W3: Thank you for pointing out these issues. They have been fixed in the revised manuscript.

We want to reiterate our appreciation for the thorough feedback of the reviewer. We believe that incorporating the suggestions and addressing the questions and weaknesses discussed above will greatly improve the quality of the revised paper.

[1] Liu et al., EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation.

[2] Makni et al., A unified framework for Sparse plus Low-Rank Matrix Decomposition for LLMs (CPAL'25).

Comment

I thank the authors for their response, which I find adds to the already very strong submission. I'll keep my score at a strong accept.

Comment

We sincerely appreciate the time you took to review our work and are glad that our revisions addressed your concerns!

Final Decision

The paper proposes 3BASiL, an efficient one-shot post-training method for decomposing large language models (LLMs). Section 2 formulates the optimization problem, which is solved using the ADMM framework. A key component of the method is the Transformer Matching (TM) step, which jointly optimizes all sparse and low-rank components across layers within a transformer block to better approximate the original model’s output. The paper presents an extensive set of experiments demonstrating the effectiveness of the proposed approach.

All reviewers appreciated the novelty of the method and the rigor of the experimental evaluation. Initial concerns raised during the review process were addressed during the discussion phase, including clarifications and additional results. These responses were well-received, and there was broad consensus among reviewers that the paper meets the standards for acceptance.

I recommend acceptance. The paper offers a well-motivated and technically sound contribution, supported by thorough experimentation. For the camera-ready version, the authors should incorporate all requested changes, including the additional experiments presented during the rebuttal.