3.8

/10

Rejected4 位审稿人

最低3最高6标准差1.3

4.5

置信度

正确性2.3

贡献度2.0

表达2.8

ICLR 2025

TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition

Mingxue Xu,Yao Lei Xu,Danilo Mandic

OpenReview PDF

提交: 2024-09-26更新: 2025-02-05

TL;DR

Language model compression based on low-rank factorization for low-end devices.

摘要

The Small Language Models (SLMs, or on-device LMs) is a concept corresponding to the Large Language Model (LLM), which has significantly fewer parameters and is typically deployed on low-end devices, like mobile phones and single-board computers (e.g. Raspberry Pi). Unlike LLMs, which utilize the increasing model size for better generalization, SLMs are expected to adjust the exact deployment environment changes. Furthermore, most edge applications have battery life concerns, which have never been considered in the GPU servers for data centres. Targeting these two issues, this paper focuses on the token embedding compression for adaptivity and low energy requirements in edge applications. We propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD), whereby each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We then comprehensively investigate the low-rank structures extracted by this approach, regarding the compression ratio, language task performance, latency and energy consumption on a typical low-end device (i.e. Raspberry Pi). Taking the sub-billion parameter versions of GPT-2/Cerebres-GPT and OPT as examples, the model compressed with our approach can achieve a comparable language task performance to the original model with around $2.0\times$ embedding layer compression, while the energy consumption of single query drops by half.

关键词

model compressionlow-rank factorizationtensor decomposition

评审与讨论

审稿意见

评分: 6置信度: 32024-11-02

The paper presents TensorGPT, a novel approach for compressing LLMs through tensor-train decomposition of embedding layers. The key innovation is applying tensor-train decomposition to individual token embeddings without requiring additional training data or computation. The authors evaluate their method on GPT family models (GPT-2 and CerebrasGPT), demonstrating meaningful parameter reduction while maintaining or sometimes improving model performance. The work provides comprehensive evaluations on GPT-family models and demonstrates practical applicability on edge devices.

优点

Novel training-free compression method specifically targeting embedding layers. Strong practical value for edge device deployment
Comprehensive experiments across multiple tasks and model sizes. Solid theoretical foundation with clear mathematical derivations. Thorough analysis of compression vs. performance trade-offs

缺点

Limited comparison with existing compression methods and baselines. The comparison would be more comprehensive by comparing with more baselines which are train-free and trained.
Evaluation focused mainly on GPT-family models and mainly focus on small models.
It would be a great to combine the proposed embedding compression method with other model compression methods to check compatibility.

问题

Could this approach be extended to show performance of other model architectures beyond the GPT family?
Could this approach compatible with other model compression methods?
Have you investigated the effect on model robustness like in the multilingual setting using more diverse tokens?

评论- Thank you so much for your detailed comments and suggestions.

2024-11-25

Thank you so much for your detailed comments and suggestions, our response is as follows.

Weakness 1. Limited comparison with existing compression methods and baselines.

The comparison would be more comprehensive by comparing with more baselines which are train-free and trained.

Thanks for the suggestion. However, we think training is too heavy for low-end devices, since memory usage can be three to four times as much as inference [21].

We have added SVD-based compression as our baseline in General Response G3.

Weakness 2. Evaluation is limited on small GPT models.

Evaluation focused mainly on GPT-family models and mainly focus on small models.

We have extended our experiments to OPT series models, please refer to General Response G3.

Regarding the small model size, please refer to General Response G1.

Weakness 3. Compability with other model compression approaches.

It would be a great to combine the proposed embedding compression method with other model compression methods to check compatibility.

Thanks for the suggestion; we are considering quantization and plan to take it as future work.

Question 1. Language models outside GPT family.

We have extended to OPT, please refer to General Response G3.

Question 2. Could this approach compatible with other model compression methods?

Yes, quantization is the easiest one, and it is orthogonal to our approach.

The combinations with weight tying mentioned by Reviewer JBLc, and pruning like that in SliceGPT[6] are a bit difficult, but we believe it can be solved by changing product/multiplication sequences.

Question 3. Have you investigated the effect on model robustness like in the multilingual setting using more diverse tokens?.

Thanks for the suggestion. At the moment, we suspect that the multilingual tasks are too complex for the small language models, as we discussed in Tab.xaFK.3, Weakness 3 with Reviewer xaFK. MGSM [19] in Tab.xaFK.3 is also a multilingual dataset, but all the investigated sub-billion models cannot perform well. We feel that it is better to evaluate robustness only when the language models perform well on the task.

However, we acknowledge that multilingual tasks are rather valuable for the robustness of the edge application, and plan to take this as our future work. Yes, quantization is the easiest one and is orthogonal to our approach.

Reference

[19] "Language models are multilingual chain-of-thought reasoners." arXiv preprint arXiv:2210.03057 (2022).

[21] Zhao, Jiawei, et al. "Galore: Memory-efficient llm training by gradient low-rank projection." arXiv preprint arXiv:2403.03507 (2024).

审稿意见

评分: 3置信度: 52024-11-04

The paper presents "TensorGPT," a model compression method using Tensor-Train Decomposition (TTD) to reduce the parameter size of Large Language Models (LLMs), particularly in embedding layers, without additional training. This training-free approach preserves model performance while significantly reducing memory requirements, demonstrated on low-end devices like Raspberry Pi.

优点

The approach does not require extra training, making it applicable for scenarios with limited resources or when extra data is unavailable.
TensorGPT achieves substantial compression with low memory overhead, retaining performance across language modeling and sentiment analysis tasks.
The paper includes experiments on low-end hardware (Raspberry Pi) and larger models, evaluating compression ratios, latency, and task-specific performance.

缺点

The novelty is limited. Tensor-Train Decomposition is explored in several language model compression works. What are the differences between the proposed method and existing works?
Lack important baselines. Compressing token embedding layers is studied in some papers and the paper does not compare with them. For example, [1] and baselines used in [1] should be set as baselines.
The empirical evaluation is primarily on language modeling and sentiment classification, with potential limitations in representing other complex NLP tasks, such as math reasoning and multi-hop question answering.

[1] Wang, Haoyu, et al. "LightToken: A task and model-agnostic lightweight token embedding framework for pre-trained language models." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023.

问题

refer to weaknesses

评论- Thank you so much for your detailed comments and reference.

2024-11-25

Thank you so much for your comments, our response is as follows.

Weakness 1. The novelty is limited.

Tensor-Train Decomposition is explored in several language model compression works. What are the differences between the proposed method and existing works?

As far as we know, [11,16,17] are the only works that use Tensor-Train Decomposition for the language model compression. All of them require extra training, which is unrealistic for low-end devices. Our training-free approach is suitable for meeting the requirements of adaptivity and low energy in edge applications, as we discussed in General Response G0.

We would highly appreciate it if the Reviewer were aware of such work and possibly gave more references for it, so we could discuss them in our paper.

Weakness 2. Lack important baselines.

Compressing token embedding layers is studied in some papers and the paper does not compare with them. For example, [1] and baselines used in [1] should be set as baselines.

Thanks for the reference. We have discussed the works about token embedding layer compression ([7,10-12, 14, 17]) in the General Response G2. Given that we only focus on small language models deployed on low-end devices, only [10] solves the same problem as ours and should be compared. However, [10] still requires training a meta-model, which should be fine-tuned when the deployment environment changes. So, we only take the SVD-based approach, the same as that in LightToken [7] (reference [1] in the Reviewer's original reviews), as our baseline. The experimental comparison of this part is in the General Response G3.

Weakness 3. Limited evaluation.

The empirical evaluation is primarily on language modelling and sentiment classification, with potential limitations in representing other complex NLP tasks, such as math reasoning and multi-hop question answering.

Thanks for pointing out. We report the results of zero-shot reasoning in General Response G3.

We suspect that math reasoning and multi-hop question answering, which are still hard problems for LLMs, are too complex for sub-billion language models on low-end devices.

We have evaluated some sub-billion language models on MGSM [19] for math reasoning and DROP [20] for multi-hop question answering, and it came out that all the models could not perform well. Our view is that there is no need to compress models for these two tasks on low-end devices.

Tab. xaFK.3 Small language model scores on math reasoning (MGSM) and multi-hop question answering (DROP)

Task (metric or filter)	OPT-125M	OPT-350M	Qwen2.5-0.5B	CerebrasGPT-111M	CerebrasGPT-256M	CerebrasGPT-590M	distilgpt2	gpt2	gpt2-medium
MGSM (flexible extract)	0.63	0.93	5.77	0.23	0.7	0.57	0.3	0.6	0.73
MGSM (remove whitespace)	0	0	0.23	0	0	0	0	0	0
DROP (EM)	0.09	0.35	0.04	0	0.07	0.05	0	0.07	0.05
DROP (F1)	2.26	2.83	0.98	15.3	1.88	1.88	1.39	2.72	3.69

Reference

[7] "LightToken: A task and model-agnostic lightweight token embedding framework for pre-trained language models." Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2023.

[10] "Direction is what you need: improving word embedding compression in large language models." arXiv preprint arXiv:2106.08181 (2021).

[11] "Tensorized embedding layers for efficient model compression." arXiv preprint arXiv:1901.10787 (2019).

[17] "Efficient gpt model pre-training using tensor train matrix representation." arXiv preprint arXiv:2306.02697 (2023).

[19] "Language models are multilingual chain-of-thought reasoners." arXiv preprint arXiv:2210.03057 (2022).

[20] "DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs." arXiv preprint arXiv:1903.00161 (2019).

审稿意见

评分: 3置信度: 52024-11-04

The authors applied tensor train to an edge device application, and found we can use that to compress some NLP models. It reads to me that the author is simply applying a pre-existing, well-known tensor decomposition method on a problem. And I didn't see any new knowledge developed or presented (I might be wrong, so hopefully the authors could point out clearly what's the real technical contribution other than yet another application).

优点

Working on a problem that's important.

缺点

The authors claim "- As far as we know, we are the first to compress LLMs with low-rank factorization, specifically", which is not true to me. The architecture is the same for most NLP models and thus pardon me I really don't know what's added. I think all the tensor-train stuff has been developed and applied before. It's not immediately obvious to me what's new conclusions or findings drawn from this paper.

问题

Pardon me that I really don't know what's the biggest contribution. Other than your claim of try things on LLM, what's the real technical contribution in your paper not discussed by previous works?

评论- Thanks for your comments.

2024-11-25

Thanks for your comments, our response is as follows.

Weakness 1. Overclaim

The authors claim "- As far as we know, we are the first to compress LLMs with low-rank factorization, specifically", which is not true to me.

Thanks for pointing out. We acknowledge that it is confusing and misleading here. We wanted to convey "As far as we know, we are the first to compress Small Language Models (SLMs) [9] for low-end devices use cases, with low-rank factorization'', which has been updated in line 78-80 of the updated submission.

The architecture is the same for most NLP models and thus pardon me I really don't know what's added.

We guess the Reviewer wanted to express that our approach has no obvious contribution in algorithms, since model compression is to work on the models with existing architectures rather than creating a new architecture.

For a short answer, our approach is specifically designed for small language models on low-end devices, to meet the requirements of adaptivity and low energy in edge applications. We systematically analysed energy, latency, etc, which common LLM compression rarely considers.

For a detailed answer, please refer to the General Response G0, G2 and our answers to Weakness 4 raised by Reviewer JBLc.

Weakness 2. About Tensor-Train

I think all the tensor-train stuff has been developed and applied before.

General Response G1 includes all the relevant tensor-train stuff [11,16,17] we know. All [11,16,17] contain extra training, which is unsuitable for low-end devices. In the updated submission, this part has been emphasized in line 189-193.

We would highly appreciate it if the Reviewer were aware of such work and possibly gave more references for such work, so that we could discuss them in our paper.

Weakness 4. Conclusion

It's not immediately obvious to me what's new conclusions or findings drawn from this paper.

Conclusion: our approach is suitable for low-end devices in edge applications.

Question

Pardon me that I really don't know what's the biggest contribution. Other than your claim of try things on LLM, what's the real technical contribution in your paper not discussed by previous works?

As we mentioned in the response to Weakness 2, our contribution is the first work for low-end devices to design a low-rank model compression strategy to satisfy the adaptivity and low energy requirements in edge applications.

For the comparison between relevant works and ours, please refer to General Response G2.

Reference

[8] "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases", ICML 2024

[9] "Small Language Models: Survey, Measurements, and Insights." arXiv preprint arXiv:2409.15790 (2024).

[11] "Tensorized embedding layers for efficient model compression." arXiv preprint arXiv:1901.10787 (2019).

[17] "Efficient gpt model pre-training using tensor train matrix representation." arXiv preprint arXiv:2306.02697 (2023).

审稿意见

评分: 3置信度: 52024-11-06

This paper proposes using tensor-train decomposition to compress the token embedding matrix, aiming to reduce model size and accelerate inference, particularly in edge device scenarios. The technique leverages a higher-order tensor approximation method, built upon singular value decomposition (SVD), to efficiently represent embeddings while maintaining performance. This tailored approach is well-suited for resource-constrained environments, promising benefits in storage and computational efficiency.

优点

The paper is well-structured and easy to follow, with a clear presentation of the methodology and its applications.

缺点

Limited Novelty: The method largely builds upon the existing TT_SVD approach, with Algorithm 1 in this paper replicating methods already established in prior literature [1]. The methodological advancements or generalizations beyond TT_SVD appear minimal based on the methodology section.

Unconvincing Experiments: The experiments are conducted on older and relatively small models, such as GPT-2 and models up to only 1.3B parameters. This limits the relevance of the results, as they don’t reflect performance on contemporary large language models (LLMs). Additionally, the evaluation setup lacks modern benchmarking practices; for instance, LLM harness [2] would provide a more standardized evaluation framework.

Unintuitive Rationale: The foundational rationale for embedding a matrix using a tensor is unclear. The method requires reshaping the matrix into a higher-order tensor before applying tensor decomposition, but the paper does not provide an intuitive explanation for why this approach is effective or reasonable. Also, the embedding matrix only occupies a small part of the model even for a mid size of model, and a common way to reduce parameter size is through weight tie, which directly reduce the parameter sizes of embedding into half. How does you method performs in this case?

Overstatement of Contribution: Some claims appear overstated. For instance, the authors state, "As far as we know, we are the first to compress LLMs with low-rank factorization, specifically for low-end devices." However, the cited reference [3] already demonstrates compression of LLMs using SVD for similar purposes. Furthermore, there are numerous existing works that apply low-rank decomposition for LLM compression, such as [4-6].

Limited Impact on Overall Model Size: The embedding matrix occupies only a small portion of the model's parameters, even in mid-sized models. A commonly used technique, weight tying, can directly halve the embedding parameter size, offering a straightforward compression approach. The paper does not address how the proposed method compares to weight tying or performs when weight tying is already applied, raising questions about its practical impact on overall model size reduction.

[1] V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. doi: 10.1137/090752286.

[2] Gao, Leo, et al. "A framework for few-shot language model evaluation." Version v0. 0.1. Sept 10 (2021): 8-9.

[3] Yen-Chang Hsu, Ting Hua, Sungen Chang, Qian Lou, Yilin Shen, and Hongxia Jin. Language model compression with weighted low-rank factorization. In International Conference on Learning Representations, 2022.

[4] Yuan, Zhihang, et al. "Asvd: Activation-aware singular value decomposition for compressing large language models." arXiv preprint arXiv:2312.05821 (2023).

[5] Lin, Chi-Heng, et al. "MoDeGPT: Modular Decomposition for Large Language Model Compression." arXiv preprint arXiv:2408.09632 (2024).

[6] Ashkboos, Saleh, et al. "Slicegpt: Compress large language models by deleting rows and columns." arXiv preprint arXiv:2401.15024 (2024).

问题

How does your method perform with weight tying? What is the percentage of overall compression for model sizes larger than 13B?

伦理问题详情

评论- Thank you so much for the very detailed comments and references.

2024-11-25

(1/2)

Thank you so much for the very detailed comments and references! Our response is as follows.

Weakness 1. Limited Novelty

... The methodological advancements or generalizations beyond TT_SVD appear minimal based on the methodology section.

Though we did not change the exact implementation of TT-SVD, we did adjust its working unit (vectors rather than the whole matrix) and workflow to cope with the unique issues (adaptivity and low energy, as stated in General Response G0) for low-end devices. We also have a comprehensive systematic analysis regarding latency, energy and flops, and the impacts (latency and accuracy) of the tensor orders in the experimental section.

Also, we do not believe a simple methodology necessarily means no novelty. "Attention is All You Need" and "LoRA: Low-rank Adaptation of Large Language Models" are two examples. Their methodologies are rather simple (as stated in their abstract or introduction), but both revolutionised the community.

Weakness 2. Unconvincing Experiments:

The experiments are conducted on older and relatively small models, such as GPT-2 and models up to only 1.3B parameters...

Please refer to General Response G1, we only focus on SLMs on low-end device.

... the evaluation setup lacks modern benchmarking practices...LLM harness [2] ...

Thank you so much for the helpful evaluation resources, our new experiment results are shown in General Response G3.

Weaknesses 3. Unintuitive Rationale:

The foundational rationale for embedding a matrix using a tensor is unclear. The method requires reshaping the matrix into a higher-order tensor before applying tensor decomposition, but the paper does not provide an intuitive explanation for why this approach is effective or reasonable.

Tensors can model implicit high-dimensional representations (as well as the interactions among orders) of the model weights. In this sense, tensors have better expressivity than matrices, which is also a good solution for small-size models to express complex functionality with limited parameter space.

The results in G3 have empirically proved this point, as tensor-based approaches have higher chances of retaining the language task performance than the matrix-based approach.

... the embedding matrix only occupies a small part of the model even for a mid size of model, and a common way to reduce parameter size is through weight tie, which directly reduce the parameter sizes of embedding into half. How does you method performs in this case?

Firstly, apart from reasoning and language modelling, classification is also an important task for edge applications, which we have investigated in the paper. Weight tying cannot be compiled with the classification layer.

Secondly, we acknowledge that weight tying is a common approach to reducing memory, but we do not reckon it is a necessary part of LMs. Especially for the adaptivity requirements in edge applications (Section 2.1 in the updated submission), there should be a different approach to compress the fully connected layers (i.e. further amplifying the signal from the transformer) rather than directly reusing the weights of the embedding layer.

Weakness 4. Overstatement of Contribution:

Some claims appear overstated. For instance, the authors state, "As far as we know, we are the first to compress LLMs with low-rank factorization, specifically for low-end devices."

Thanks for pointing out. We acknowledge it is confusing and misleading. We wanted to convey "As far as we know, we are the first to compress Small Language Models (SLMs) for low-end devices use cases, with low-rank factorization'', which has been updated in line 82-83 of the updated submission.

However, the cited reference [3] already demonstrates the compression of LLMs using SVD for similar purposes. Furthermore, there are numerous existing works that apply low-rank decomposition for LLM compression, such as [4-6].

Thanks for the references. [3-6] are discussed in General Response G2.

Furthermore, we want to clarify the inconsistency of the term "LLMs". The use of the term "LLM'' in the original submission is to follow the term used in MobileLLM [8]. To avoid confusion, we emphasized that we only focus on sub-billion parameter models in the abstract and the introduction of our original submission. We have updated the paper title in our new submission accordingly, TensorSLM: Sub-billion Parameter Language Model Compression for Low-end Devices based on Tensor-Train Decomposition. This clarifies our focus is not "LLMs" but Small Language Models (SLMs) [9].

2024-11-25

(2/2)

Weakness 5. Limited Impact on Overall Model Size:

The embedding matrix occupies only a small portion of the model's parameters, even in mid-sized models.

Please refer to the General Response G1.

A commonly used technique, weight tying, can directly halve the embedding parameter size, offering a straightforward compression approach. The paper does not address how the proposed method compares to weight tying or performs when weight tying is already applied, raising questions about its practical impact on overall model size reduction.

Please refer to the last response to Weakness 3.

Question 1. How does your method perform with weight tying?

Please refer to the last response to Weakness 3.

Question 2. What is the percentage of overall compression for model sizes larger than 13B?

Please refer to the General Response G1.

Reference

[1] "Tensor-train decomposition." SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. doi: 10.1137/090752286.

[2] "A framework for few-shot language model evaluation." Version v0. 0.1. Sept 10 (2021): 8-9.

[3] "Language model compression with weighted low-rank factorization." ICLR 2022.

[4] "Asvd: Activation-aware singular value decomposition for compressing large language models." arXiv preprint arXiv:2312.05821 (2023).

[5] "MoDeGPT: Modular Decomposition for Large Language Model Compression." arXiv preprint arXiv:2408.09632 (2024).

[6] "Slicegpt: Compress large language models by deleting rows and columns." arXiv preprint arXiv:2401.15024 (2024).

[8] "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases", ICML 2024

[9] "Small Language Models: Survey, Measurements, and Insights." arXiv preprint arXiv:2409.15790 (2024).

2024-11-27

Thank you for your response. However, the response did not fully address my concerns regarding two fundamental aspects:

Compression of small language models (SLMs) is less general than for large language models (LLMs): LLMs are more widely used, and their compression techniques can often be applied to smaller models, whereas the reverse is not always true. For instance, methods like LLM-Pruner [1] and layer pruning strategies [2] demonstrate fast and effective compression, even for large models.

Lack of comparisons with state-of-the-art (SOTA) methods: The additional comparisons with SVD are insufficient to justify the proposed method. SVD performs significantly worse compared to other SOTA approaches, making it an inadequate baseline.

In summary, while the authors have clarified the paper's applicability, I believe the method lacks sufficient generality and empirical support to justify its contribution for publication in this venue. Therefore, I will retain my score.

[1] Ma, Xinyin, Gongfan Fang, and Xinchao Wang. "LLM-Pruner: On the Structural Pruning of Large Language Models." Version v1. May 19 (2023). arXiv:2305.11627.

[2] Men, Xin, et al. "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect." Version v1. March 6 (2024). arXiv:2403.03853.

2024-12-02

(3/3)

Tab.JBLc.3. Zero-shot performance of OPT-1.3B after compression.

OPT-1.3B	Params %	ARC-c	ARC-e	BoolQ	HellaS.	PIQA	WinoG.	Avg.
Original	100	22.35	50.88	59.33	32.55	66.49	51.93	47.26
SparseGPT 2:4	-	19.88	44.82	57.34	32.97	63.49	55.49	45.67
SVD	96.16	20.56	44.87	57.34	28.04	63.28	51.30	44.23
	98.14	22.01	50.13	61.80	30.69	66.76	56.51	47.98
	99.73	23.55	53.96	60.28	36.35	69.59	57.77	50.25

SliceGPT	96.29	24.15	53.66	46.91	37.18	67.46	55.41	47.46
	97.81	24.15	55.39	47.95	39.08	68.72	56.75	48.67
	99.91	23.72	55.22	48.13	38.34	68.44	55.72	48.26

Ours	96.04	21.08	26.35	54.04	25.91	53.81	48.93	38.35
	97.71	25.26	52.86	57.98	38.78	69.48	58.48	50.47
	99.59	23.38	55.22	51.68	40.43	71	59.43	50.19

We also give the results of SparseGPT. Since SparseGPT only freezes the weights rather than "deletes" them, we do not list its parameter ratio. From Tab Tab.JBLc.2 and Tab.JBLc.3, we can observe that the different compression approaches have different superiorities for different zero-shot reasoning tasks and compression ratios. Even SVD-based sometimes outperforms the others.

These results further indicate that compression approaches of LLMs may not be easily migrated to SLMs, as we discussed in (1/3).

References

[6] "Slicegpt: Compress large language models by deleting rows and columns." arXiv preprint arXiv:2401.15024 (2024).

[22] Ma, Xinyin, Gongfan Fang, and Xinchao Wang. "LLM-Pruner: On the Structural Pruning of Large Language Models." Version v1. May 19 (2023). arXiv:2305.11627.

[23] Men, Xin, et al. "ShortGPT: Layers in Large Language Models are More Redundant Than You Expect." Version v1. March 6 (2024). arXiv:2403.03853.

[24] Frantar, Elias, and Dan Alistarh. "Sparsegpt: Massive language models can be accurately pruned in one-shot." International Conference on Machine Learning. PMLR, 2023.

评论- Thank you so much for your reply and new references

2024-12-02

(1/3)

Thank you so much for your reply and new references for LLM-Pruner [22] and ShortGPT [23]. Our responses to your reply are as follows:

1. Compression of small language models (SLMs) is less general than for large language models (LLMs).

LLMs are more widely used, ...

We acknowledge that "LLMs are more widely used" at the moment, but we do think SLM compression is also important. SLMs are for on-device applications (e.g. mobile phones and Raspberry Pi), and are suitable for applications without stable networks, sufficient GPU resources or continuous power charging. The compression of SLMs directly impacts battery life and, therefore, the user experience.

... and their compression techniques can often be applied to smaller models, whereas the reverse is not always true.

There has yet to be a consensus that the compression approaches of LLMs can be easily migrated to SLMs. On the contrary, there exists empirical evidence that LLM compression approaches cannot maintain the accuracy of SLMs as they can for LLMs:

In Tab.2 of ShortGPT [23], for the same model series (Llama2 and Baichuan2), 13B models maintain more model accuracy than 7B models with ShortGPT, as summarized as follows

Tab.JBLc.1. Average score degradation after compression with ShortGPT.

$\Delta$ Avg.	7B	13B
Llama2	-6.54	-4.61
Baichuan2	-8.59	-7.88

As stated in the second last paragraph of the introduction of SparseGPT [24], "larger models are more compressible". In Fig.2 of [24], with the same compression settings, the models with fewer parameters have a more severe accuracy drop (larger slope in Fig.2).
In line 418-420 and line 445-448 of our current submission, we addressed that our compression approach performs better on the larger-sized models, which can be observed from Fig. 3 (a-b,d-j,m) in our current submission.

For instance, methods like LLM-Pruner [1] and layer pruning strategies ShortGPT [2] demonstrate fast and effective compression, ...

Thanks for this information. In the referred LLM-Pruner (7B as their smallest tested models) and ShortGPT (2.8B as their smallest tested models) in detail, we did not find their results on sub-billion parameter models. As we stated in line 58 of our current submission, running an uncompressed Gemma-2B on Raspberry Pi leads to a system crash. Thus we only consider models around or less than 1B.

... , even for large models.

Our paper only focuses on SLMs running on low-end devices, which normally cannot hold LLMs. Thus we do not consider the "large models".

2024-12-02

(2/3)

2. Lack of comparisons with state-of-the-art (SOTA) methods.

The additional comparisons with SVD are insufficient to justify the proposed method. SVD performs significantly worse compared to other SOTA approaches, making it an inadequate baseline.

Thanks for this emphasis. To find an appropriate baseline, we investigated the references in our General Response, the newly referred LLM-Pruner[22], ShortGPT[23] and a commonly used baseline SparseGPT[24]. Among these references, only SliceGPT[6] is training-free and compresses the embedding layers. The comparisons are in Tab.JBLc.2 and Tab.JBLc.3. Given that SliceGPT also compresses other layers, we only listed the results of similar overall parameter ratios after the compression. The bold numbers are the best performance for each parameter ratio setting.

Tab.JBLc.2. Zero-shot performance of OPT-125M after compression.

OPT-125M	Params %	ARC-c	ARC-e	BoolQ	HellaS.	PIQA	WinoG.	Avg.
Original	100	23.38	57.11	57.74	41.53	71.71	59.35	50.91
SparseGPT 2:4	-	19.03	37.12	58.59	27.77	58.32	51.7	42.09
SVD	77.36	20.05	31.14	37.83	26.59	56.31	50.59	37.09
	85.51	18.17	34.26	37.83	26.96	57.56	52.88	37.94
	97.74	18.77	43.14	47.37	28.51	63.00	51.30	42.02

SliceGPT	77.15	19.20	35.14	37.86	27.38	55.33	51.93	37.81
	86.20	19.11	38.55	37.92	28.04	58.00	50.20	38.64
	99.16	20.39	41.46	40.00	28.84	61.59	50.28	40.43

Ours	78.16	20.22	28.66	39.14	26.17	53.54	49.88	36.27
	84.83	21.25	29.55	40.15	26.19	54.90	50.12	37.03
	99.76	20.05	38.68	45.41	28.86	61.53	49.88	40.74

评论- General Response

2024-11-25

General Response (1/6)

We sincerely appreciate the time, effort, and detailed comments from the Reviewers.

We first respond to the four common issues in the reviews and then respond separately to each Reviewer.

G0. What is the novelty and contribution of this work?

This paper focuses on compressing the Small Language Models (SLMs) [8,9] deployed on low-end devices (i.e. Raspberry Pi) in edge applications. The edge applications pose two requirements to our compressing approach, which are not common in LLM applications:

Adaptivity: the approach should dynamically adjust the model to the environmental changes (e.g. tokens registered or deregistered);
Low energy: the computation and memory operations should consider the energy consumption (i.e. for longer battery life).

Centred on these two issues (detailed discussion is in Section 2 of the updated submission), our approach based on Tensor-Train Decomposition (TTD) is specifically designed for compressing SLMs:

Adaptivity: TTD works on embedding vector level, which allows the application to update the vocabulary without operating on the whole compressed embedding matrix;
Low energy: As computation operations are "cheaper" than memory operations regarding energy consumption, we chose to "exchange" the memory with computation to save energy during the forwarding passes, with negligible extra latency.

As far as we know, none of the current LLM (at least for those designed for GPUs) compression work has these concerns, though these concerns are critical for low-end devices and edge applications. TT format has an expressive and flexible form, which makes it easier for us to analyze and satisfy the two requirements.

Apart from common metrics like compression ratio and language task performance, we also give the estimated energy costs (with a similar approach in [18]) of our approach and SVD-based approach. At a quick glance, the following is the estimated inference energy costs with their best language task performance are

Tab.G.0 Inference energy costs of an input text of 100 tokens. (% percentage of uncompressed model energy costs, the lower, the better)

Model	OPT-125M	OPT-350M	DistilGPT2	GPT-2	GPT2-M	GPT2-L	CerebrasGPT-111M	CerebrasGPT-256M	CerebrasGPT-590M
SVD	84.44%	43.35%	70.28%	70.28%	54.01%	44.24%	84.44%	81.80%	75.34%
Ours	59.21%	52.73%	66.84%	59.21%	52.74%	43.23%	61.51%	51.71%	50.45%

The details of this part can be found in line 137-186, line 281-300, line 515-522 of the updated submission.

G1. The investigated LLMs are too small.

We feel there might be some misunderstanding here, especially since our paper focuses on the sub-billion language models (as stated in MobileLLM [8]) running on low-end devices, and we had stated this in the abstract and introduction of our original submission.

We have therefore decided to change the title of our paper to avoid this ambiguity. The new title is "TensorSLM: Sub-billion Parameter Language Model Compression for Low-end Devices based on Tensor-Train Decomposition". We have also rephrased our paper to further emphasise the focus (mainly Section 1,2,3 in the updated submission).

2024-11-25

General Response (6/6)

Reference

[1] "Tensor-train decomposition". SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. doi: 10.1137/090752286.