PaperHub
Overall rating: 6.0 / 10 (Poster, 3 reviewers)
Individual ratings: 5, 7, 6 (min 5, max 7, std dev 0.8)
Confidence: 3.7
Soundness: 3.0
Contribution: 2.7
Presentation: 3.7
NeurIPS 2024

FM-Delta: Lossless Compression for Storing Massive Fine-tuned Foundation Models

Submitted: 2024-05-14 · Updated: 2024-12-23
TL;DR

To mitigate cloud model storage overhead, we propose a novel lossless compression scheme FM-Delta to compress massive fine-tuned models stored in cloud, significantly saving cloud storage costs.

Abstract

Keywords

model compression, lossless compression, cloud storage

Reviews and Discussion

Official Review (Rating: 5)

This paper proposes FM-Delta, which addresses significant storage overhead caused by fine-tuned LLMs. It maps model parameters into integers and entropy codes their small differences from pre-trained models, reducing cloud storage consumption by about 50% on average.

Strengths

  • This paper proposes FM-Delta, a method for jointly compressing pre-trained and fine-tuned models, achieving a compression ratio of up to 2x for fine-tuned models.
  • Compressing large models is crucial, and the approach of jointly compressing fine-tuned and pre-trained models makes sense. Traditional compression algorithms struggle with large models, so a delta-based method intuitively provides a feasible compression rate and is an interesting direction.
  • The paper offers a formula for the distribution shift between fine-tuned and pre-trained models under given assumptions, providing valuable guidance.

Weaknesses

  • Despite the interesting concept of delta compression, I have some concerns about its practical applications. Could the author please elaborate on specific scenarios where this method can be applied? What kind of situations require frequent downloading of finetuned models from the network?

  • Has the author considered comparing the proposed method with quantization, for example, quantizing the delta? There are existing studies showing that quantized LLMs still maintain some compressibility, roughly around half the compression rate [1]. In other words, my question is: what is the necessity of performing lossless compression in this setting? The compression-decompression process inevitably introduces additional overhead, especially at the user end.

  • While the first half is well-written, the writing from the Method section onwards needs improvement. Some technical terms need to be explained in detail, such as "most significant bit" mentioned in 4.1 and "range" in "difference range" mentioned in 4.2. Understanding these concepts is crucial for comprehending the algorithm, yet I couldn't find corresponding explanations in the text. The subsequent method descriptions are also somewhat confusing.

  • Why use range coding instead of other entropy coding methods? I noticed the experiments used tools like Gzip, but currently, methods like zstd are more popular. Could the author consider trying zstd compression?

  • I noticed that the reported compression throughput in the final experiments is approximately 100MB/s, which is not particularly fast in practical terms. There seems to be significant room for improvement in this speed.

  • Some related works are not mentioned, such as [1] [2].

[1] Mao Y, Wang W, Du H, et al. On the compressibility of quantized large language models[J]. arXiv preprint arXiv:2403.01384, 2024.

[2] Hershcovitch M, Choshen L, Wood A, et al. Lossless and Near-Lossless Compression for Foundation Models[J]. arXiv preprint arXiv:2404.15198, 2024.

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

We thank Reviewer eKhr for the detailed and useful feedback. We address your concerns point by point below, and the analyses will be incorporated into our paper.

Q1: Could the author please elaborate on specific scenarios where this method can be applied? What kind of situations require frequent downloading of finetuned models from the network?

A1: Regarding application scenarios, as shown in Figure 1 and Table 1 of the original paper, we focus on a new cloud storage issue resulting from the rapid development of LLMs. Our method aims to compress the massive fine-tuned large models stored on cloud platforms like HuggingFace, which currently hosts over 800,000 models, leading to substantial storage costs.

| Time | Total Num. of Models in HuggingFace |
| --- | --- |
| 2022-03 | 33,187 |
| 2023-03 | 157,082 |
| 2024-03 | 574,270 |
| 2024-07 | 805,291 |

Regarding the situation of frequently downloading fine-tuned models: as a public model storage platform, HuggingFace enables numerous end-users to download the models they need from the network for further testing.

It is important to note that our primary concern is to reduce cloud storage overhead rather than downloading frequency. In fact, a large portion of models in the cloud (89%, as reported in Table 1) are inactive, with fewer than 10 downloads per month. If our method is applied to these models, at least 40% of the total storage costs can be saved, as discussed in Appendix D.7.

Q2: Has the author considered comparing the proposed method in this paper with quantization? For example, quantizing delta. What is the necessity of performing lossless compression in this setting?

A2: Regarding quantizing the delta, we select two user-uploaded GPT2 models from HuggingFace and present the quantization results in Table 1 of the uploaded PDF.

We can see from the quantization results that lossy compression inherently alters the model evaluation results. We believe lossy compression goes against the original uploader's desire to store their model safely (unchanged), even though the change might be small (in practice, it is hard to guarantee the performance of every model in the hub, given their diversity).

Just as when we upload a model to HuggingFace today, we would not want to download it later and find any of its evaluation results inconsistent with the original model. In this setting, FM-Delta is exactly lossless and has a rather competitive compression rate.
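To illustrate this point, here is a minimal, hypothetical Python sketch (not the experiment reported in the uploaded PDF) showing that even a fine-grained 8-bit quantization of the delta loses bit-exactness:

```python
import numpy as np

# Hypothetical pre-trained and fine-tuned weights with a small delta.
rng = np.random.default_rng(0)
pre = rng.standard_normal(8).astype(np.float32)
ft = pre + rng.standard_normal(8).astype(np.float32) * 1e-3
delta = ft - pre

# Lossy 8-bit quantization of the delta (illustrative scheme, not the paper's).
scale = np.abs(delta).max() / 127
q = np.round(delta / scale).astype(np.int8)
ft_rec = pre + q.astype(np.float32) * scale

print(np.array_equal(ft, ft_rec))   # typically False: bit-exactness is lost
print(np.abs(ft - ft_rec).max())    # small but non-zero reconstruction error
```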

Q3: Some technical terms need to be explained in detail, such as "most significant bit" mentioned in 4.1 and "range" in "difference range" mentioned in 4.2. The subsequent method descriptions are also somewhat confusing.

A3: We explain the mentioned terms in the following.

  • "most significant bit": The most significant bit (MSB) is the bit in a binary number that has the highest value position.
  • "difference range": Range of differences between fine-tuned and pre-trained model parameters.

We also improve the method description with a more detailed workflow figure. Please check Figure 1 of the uploaded PDF.
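To make these terms concrete, the following is a hypothetical Python sketch of mapping float parameters to order-preserving unsigned integers, taking their delta, and reading off its sign and most significant bit. The sign-flip mapping used here is a standard trick and not necessarily the exact mapping used in FM-Delta.

```python
import numpy as np

def float_to_ordered_uint(x: np.ndarray) -> np.ndarray:
    """Map float32 values to uint32 so that numeric order is preserved
    (flip all bits of negatives, flip only the sign bit of non-negatives)."""
    bits = x.view(np.uint32)
    mask = np.where((bits >> 31) == 1, 0xFFFFFFFF, 0x80000000).astype(np.uint32)
    return bits ^ mask

# Hypothetical pre-trained and fine-tuned parameter tensors.
rng = np.random.default_rng(0)
pretrained = rng.standard_normal(1_000).astype(np.float32)
finetuned = pretrained + rng.standard_normal(1_000).astype(np.float32) * 1e-3

delta = (float_to_ordered_uint(finetuned).astype(np.int64)
         - float_to_ordered_uint(pretrained).astype(np.int64))

sign = (delta < 0).astype(np.uint8)   # the sign symbol s
mag = np.abs(delta)

# "Most significant bit" k: position of the highest set bit of |delta|;
# it tells the coder how many raw lower-order bits must still be stored.
msb = np.zeros_like(mag)
nz = mag > 0
msb[nz] = np.floor(np.log2(mag[nz])).astype(np.int64) + 1

# The "difference range" is simply the span of the observed deltas.
print("max |delta|:", mag.max(), "average MSB position:", msb.mean())
```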

Q4: Why use range coding instead of other entropy coding methods? I noticed the experiments used tools like Gzip, but currently, methods like zstd are more popular. Could the author consider trying zstd compression?

A4: We use range coding because it can dynamically adapt to the probability distribution of the data and is well suited to compressing floating-point data that requires fine granularity and precision.

| Model | FM-Delta | ZSTD | ZSTD-Delta (on unsigned int delta) |
| --- | --- | --- | --- |
| Jorgeutd/bert-large-uncased-finetuned-ner | 68% | 92% | 81% |
| rajkumarrrk/gpt2-fine-tuned-on-imdb-positive-reviews | 68% | 90% | 81% |
| mikesmodels/Waltz_with_Bashir_Diffusion | 59% | 93% | 72% |
| Comp. Throughput (MB/s) | 109 | 520 | 346 |
| Decomp. Throughput (MB/s) | 100 | 560 | 380 |

Regarding trying ZSTD, we present its results in the above table. We see that while ZSTD has a higher throughput, its compression rate is worse than FM-Delta's.

We further present a hybrid approach, ZSTD-Delta, that combines the mapping of FM-Delta with ZSTD, i.e., applying ZSTD to the mapped unsigned int delta. ZSTD-Delta serves as a practical compromise, combining the strengths of both FM-Delta (compression rate) and ZSTD (throughput).
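For concreteness, such a ZSTD-Delta pipeline could look like the following hypothetical sketch using the `zstandard` package; the exact mapping and function names here are ours, not the paper's implementation.

```python
import numpy as np
import zstandard as zstd  # pip install zstandard

def zstd_delta_compress(pre: np.ndarray, ft: np.ndarray, level: int = 3) -> bytes:
    """Hypothetical ZSTD-Delta sketch: reinterpret both float32 tensors as
    uint32, take their element-wise (wrapping) delta, and zstd-compress it."""
    delta = ft.view(np.uint32) - pre.view(np.uint32)   # wraps modulo 2**32
    return zstd.ZstdCompressor(level=level).compress(delta.tobytes())

def zstd_delta_decompress(blob: bytes, pre: np.ndarray) -> np.ndarray:
    """Exact (lossless) reconstruction of the fine-tuned tensor."""
    delta = np.frombuffer(zstd.ZstdDecompressor().decompress(blob), dtype=np.uint32)
    return (pre.view(np.uint32) + delta).view(np.float32)

# Usage on hypothetical tensors:
rng = np.random.default_rng(0)
pre = rng.standard_normal(4096).astype(np.float32)
ft = pre + rng.standard_normal(4096).astype(np.float32) * 1e-3
blob = zstd_delta_compress(pre, ft)
assert np.array_equal(zstd_delta_decompress(blob, pre), ft)
print("compressed size ratio:", len(blob) / ft.nbytes)
```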

Q5: I noticed that the reported compression throughput in the final experiments is approximately 100MB/s, which is not particularly fast in practical terms. There seems to be significant room for improvement in this speed.

A5: The trade-off between speed and compression rate is indeed a critical consideration in practice. As mentioned in Answer A4, we provide a hybrid approach ZSTD-Delta that combines the mapping of FM-Delta with ZSTD as a compromise for cloud customers.

We believe that although the throughput of FM-Delta is not the highest, its superior compression rate and 40% cost savings are still valuable in practice. We will strive for greater improvements in lossless coding in our future work.

Q6: Some related works are not mentioned, such as [1] [2].

A6: Thank you for your kind suggestion. The authors of [1] investigate the compressibility of quantized LLMs, whereas FM-Delta focuses on lossless compression of floating-point LLMs. The authors of [2] propose a byte-grouping method and apply standard compressors such as zstd to compress float models; however, its compression rate is not as good as FM-Delta's, and the actual compression speed is not reported.

These two works are contemporaneous with ours. We will add a discussion of them in the Related Work section of our final manuscript.

Comment

Thank you for the clarification and the new sensitivity analysis, which address most of my concerns. I still have questions about the application scenario. As mentioned, FM-Delta is proposed to address the storage challenge for model providers such as HuggingFace. However, in my opinion, FM-Delta can only compress fine-tuned models. Therefore, the "Total Num of Models in HuggingFace" may not reflect the importance of compressing fine-tuned models, since models trained from scratch may currently constitute a large portion of HuggingFace.

Comment

We thank Reviewer eKhr for the further response. Regarding the concern about the proportion of pre-trained models, we roughly estimate the proportions of pre-trained and fine-tuned models in HuggingFace.

Specifically, based on model creation time, we iterate over 10,000 models each in ascending (old to new) and descending (new to old) order. Among these models, we count only those that explicitly state their identity (i.e., pre-trained or fine-tuned) in their "README.md" file.
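As an illustration of how such README-based counting could be done, here is a hypothetical sketch using `huggingface_hub`; the keyword rules and helper names are ours, not the exact procedure behind the numbers below.

```python
from huggingface_hub import hf_hub_download

# Hypothetical keyword rules; the actual classification criteria may differ.
FINETUNE_HINTS = ("fine-tuned", "finetuned", "fine tuned")
PRETRAIN_HINTS = ("pretrained from scratch", "trained from scratch")

def classify_model(repo_id: str) -> str:
    """Return 'finetuned', 'pretrained', or 'unknown' based on README text."""
    try:
        path = hf_hub_download(repo_id=repo_id, filename="README.md")
    except Exception:
        return "unknown"
    text = open(path, encoding="utf-8", errors="ignore").read().lower()
    if any(hint in text for hint in FINETUNE_HINTS):
        return "finetuned"
    if any(hint in text for hint in PRETRAIN_HINTS):
        return "pretrained"
    return "unknown"

# Example usage on a hypothetical list of repo ids (e.g., the 10,000 oldest models):
# counts = collections.Counter(classify_model(r) for r in repo_ids)
```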

  • Ascending (old to new):

|  | # Pretrained | # Finetuned |
| --- | --- | --- |
| Num. | 501 | 2,082 |
| Portion | 19% | 81% |

  • Descending (new to old):

|  | # Pretrained | # Finetuned |
| --- | --- | --- |
| Num. | 52 | 4,295 |
| Portion | 1% | 99% |

We can see that fine-tuned models occupy a significant portion (81% and 99%) of the model hub. Furthermore, the descending-order results indicate that fine-tuned models have recently become overwhelmingly dominant, which suggests that their prevalence will continue to rise. For instance, as shown in Table 1 of our original paper, the pre-trained model "meta-llama/Llama-2-7b-hf" already has over 6,000 fine-tuned variants.

We would again like to thank Reviewer eKhr, and we hope that our explanation addresses your concern. Please let us know if you have any further questions, and we would be very delighted to follow up.

Comment

Thanks for the detailed reply. I'll raise my score to 5.

Official Review (Rating: 7)

This paper proposes a novel lossless compression scheme FM-Delta specifically for storing massive fine-tuned models in cloud.

FM-Delta maps fine-tuned and pre-trained model parameters into integers with the same bits, and entropy codes their integer delta. In this way, cloud only needs to store one uncompressed pre-trained model and other compressed fine-tuned models.

Extensive experiments demonstrated that FM-Delta efficiently reduces cloud storage consumption for massive fine-tuned models by an average of around 50% with only negligible additional time in most end-to-end cases.

Strengths

The research topic of compressing fine-tuned large models on the cloud is novel and has strong practical meaning in the age of foundation models. The statistics provided in Figure 1 and Table 1 clearly show the necessity of developing such a method to compress the large number of fine-tuned models.

I like all the statistics of those large models on HuggingFace. The authors devote a lot of effort to providing a systematic study on this novel research topic. The experiments are extensive.

Theoretical results are given in Theorem 1 to analyze the growing rate of the model difference, which serves as a strong motivation for the proposed delta coding method.

The bit redundancy of the model difference is analyzed in Theorem 2, providing a solid theoretical foundation for the robustness of the proposed method.

Weaknesses

It is not clear how range coding is applied. Range coding needs a probability table for the coding symbols. Did the paper use the symbol frequency as the probability? If this is the case, the probability table should also be transmitted in the compressed stream to ensure correct decoding on user side.

Following the previous question, it would be interesting to see how the probability estimation affects the compression ratio. For example, in Boosting Neural Representations for Videos with a Conditional Decoder, CVPR 2024, a Gaussian Entropy Model is proposed to model the distribution of neural network weights for entropy coding, where only two scalar values are transmitted for each weight or embedding.

Questions

Please refer to the weakness part.

Limitations

The limitations are well addressed in the manuscript.

Author Response

We thank the Reviewer SDxT for the insightful feedback and address your concerns below, and the analyses and clarifications will be incorporated into our paper.

Q1: Did the paper use the symbol frequency as the probability?

A1: Yes, we use a quasi-static probability modeler as in [1][2]. Initially, each symbol is assigned an equal frequency. As we encode or decode the data, we update the symbol frequencies dynamically based on the processed data. This approach ensures that the frequency table is dynamically built during the encoding and decoding processes, without the need for transmission of the probability table.

Regarding range coding, we update the workflow figure of FM-Delta in the uploaded PDF, involving the illustration of range coding, which includes:

  • Symbolization. We regard the sign s and the most significant bit k of the delta as the symbol pair ⟨s, k⟩ for range coding.
  • Probability model. As mentioned above, we use a quasi-static probability modeler to periodically update the symbol frequencies.
  • Encoding. Range coding encodes the symbols and the raw bits on all delta elements through range scaling, leading to the compressed fine-tuned model.
  • Decoding. Range coding maps the encoded value back to the original symbol range and periodically updates the probability model. Then we obtain the original floating-point fine-tuned model by reverse-mapping the delta.
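For concreteness, a quasi-static frequency model of the kind described above could look like the following hypothetical sketch; the increment and rescaling constants are illustrative, not the values used in FM-Delta.

```python
class QuasiStaticModel:
    """Hypothetical sketch of a quasi-static frequency model: encoder and
    decoder both start from uniform counts and update them identically as
    symbols are processed, so no probability table has to be transmitted."""

    def __init__(self, num_symbols: int, increment: int = 32, rescale_at: int = 1 << 16):
        self.freq = [1] * num_symbols       # uniform initial frequencies
        self.total = num_symbols
        self.increment = increment          # illustrative constants only
        self.rescale_at = rescale_at

    def cum_range(self, sym: int):
        """Cumulative interval (low, high, total) the range coder scales by."""
        low = sum(self.freq[:sym])
        return low, low + self.freq[sym], self.total

    def update(self, sym: int):
        """Called after each encoded/decoded symbol to adapt the model."""
        self.freq[sym] += self.increment
        self.total += self.increment
        if self.total >= self.rescale_at:   # periodically halve the counts
            self.freq = [max(1, f >> 1) for f in self.freq]
            self.total = sum(self.freq)
```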

[1] M. Schindler. Range Encoder version 1.3, 2000. URL http://www.compressconsult.com/rangecoder/.

[2] Peter Lindstrom, et al. Fast and Efficient Compression of Floating-Point Data. TVCG 2006.

Q2: Following the previous question, it would be interesting to see how the probability estimation affects the compression ratio. For example, in Boosting Neural Representations for Videos with a Conditional Decoder, CVPR 2024 [3], a Gaussian Entropy Model is proposed to model the distribution of neural network weight for entropy coding, where only two scalar values are transmitted for each weight or embedding.

A2: Thanks for your valuable reference. The authors in [3] build a network-free Gaussian model for the probability estimation of the quantized weights in INR, with tiny metadata transmission overhead. In comparison, FM-Delta losslessly compresses diverse floating-point fine-tuned models using a quasi-static probability modeler without transmission of the probability table.

Both methods focus on model compression. The example in [3] inspires us to further investigate model characteristics for more fine-grained probability estimation in future work. We will add a discussion of [3] in the Related Work section of our final manuscript.

[3] Boosting Neural Representations for Videos with a Conditional Decoder, CVPR 2024

Comment

Thanks for the reply.

Official Review (Rating: 6)

This paper proposes a method to compress the differences between a pretrained model and a full fine-tuned model to save storage space on cloud servers where the full fine-tuned model is stored. For compression, the pretrained weights and full fine-tuned weights are first converted to unsigned integers and then subtracted from each other. Subsequently, range coding is applied to compress redundant zeros, while non-zero values are retained as raw bits. Consequently, this method demonstrates a higher compression ratio than existing lossless compression methods and also shows robustness across various models and datasets.

优点

  • The proposed method for addressing the storage space issues of massive fine-tuned models is highly novel.
  • It achieves a higher compression ratio compared to existing lossless methods and demonstrates robustness across various models. Additionally, it illustrates the trade-off in compression rates resulting from fine-tuning, highlighting the robustness of the proposed method.
  • The paper provides both theoretical and experimental evidence to demonstrate the robustness of the proposed method.
  • From the perspective of practical users, the proposed method is shown to be efficient, and the paper suggests further ways to accelerate the process.

Weaknesses

The proposed method is effective only for full fine-tuned models and not for PEFT (parameter-efficient fine-tuning) models. The impact of the proposed method may vary depending on the proportion of full fine-tuned models. Although Table 1 shows the number of models for the six most popular models, it does not provide information on the overall proportion of full fine-tuned models within the entire Hugging Face model repository.

Questions

What is the proportion of full fine-tuned models within the entire Hugging Face model repository?

Limitations

Included in Weaknesses

Author Response

We thank Reviewer yfTq for the insightful feedback and address your concern below; the analyses will be incorporated into our paper.

Q1: What is the proportion of full fine-tuned models within the entire Hugging Face model repository?

A1: Since it is hard to distinguish full fine-tuned models from pre-trained models in HuggingFace based on model meta information, we count the total number of PEFT models and their proportion in HuggingFace, as shown below.

| # PEFT Models | # All Models | Proportion of PEFT |
| --- | --- | --- |
| 53,186 | 805,291 | 6.60% |

It should be noted that "All Models" includes PEFT, full fine-tuned and pretrained models. We can see that PEFT models only occupy a small proportion of all models.

For your further review, we provide statistical results for the following additional model families and show the proportion of full fine-tuned models in these families.

| Model | # Full | # PEFT | Proportion of Full |
| --- | --- | --- | --- |
| Gemma-9b | 315 | 121 | 72% |
| Gemma-2b | 3,836 | 279 | 93% |
| Bloom-7b1 | 163 | 105 | 60% |
| Bloom-1b7 | 130 | 61 | 68% |
| Pythia-12b | 120 | 131 | 47% |
| Pythia-6.9b | 316 | 93 | 77% |
| T5-xxl | 106 | 62 | 63% |
| T5-large | 1,277 | 203 | 86% |
| Llama-2-70b | 214 | 96 | 69% |
| Mistral-7b | 6,972 | 2,027 | 77% |
| AVG |  |  | 71% |

It can be seen that full fine-tuned models still constitute the majority (71% on average) of these families in the HuggingFace repository. Moreover, given their full size, the storage space required for these models takes up a significant portion of cloud storage.

Comment

Thank you for your thorough investigation and detailed response. Based on your explanation, it is evident that a significant proportion of the models were fully fine-tuned. This observation further substantiates the effectiveness of the proposed method. I appreciate your efforts in addressing this point.

Author Response

We sincerely appreciate all the reviewers for dedicating their time to review our manuscript.

The uploaded PDF includes an updated workflow figure of FM-Delta for the Reviewer SDxT, and the results of quantizing delta for the Reviewer eKhr.

Final Decision

Summary:

Due to their billion-level parameters, pre-trained foundation models (e.g., LLMs) often require significant cloud storage. Many fine-tuned models exhibit only small differences from their pre-trained counterparts, leading to the development of FM-Delta, a novel lossless compression scheme. FM-Delta reduces cloud storage needs by compressing fine-tuned models while keeping one uncompressed pre-trained model, cutting storage requirements by around 50% with minimal additional processing time.

Strengths:

  • Novelty:
    • "addressing the storage space issues of massive fine-tuned models is highly novel" (yfTq)
    • "is novel and has strong practical meaning in the age of foundation model" (SDxT)
    • "is an interesting direction" (eKhr)
  • Insight:
    • "both theoretical and experimental evidence" (yfTq)
    • "a systematic study on this novel research topic" (SDxT)
    • "intuitively provides a feasible compression rate" (eKhr)
  • Theory:
    • "Theoretical results are given in Theorem 1 to analyze the growing rate of model difference, which serves a great motivation" (SDxT)
    • "The bit redundancy of model difference is analyzed in Theorem 2, proving a solid theoretical foundational" (SDxT)
  • Results:
    • "higher compression ratio" (yfTq)
    • "achieving a compression ratio of up to 2x for fine-tuned models" (eKhr)

Weaknesses:

Reviewer yfTq initially contended that the proposed method might not be suitable for parameter-efficient fine-tuning (PEFT) models, as PEFT models typically require only a small portion of parameters for fine-tuning. In response, the authors presented statistical evidence showing that 71% of the models in the HuggingFace repository are full models. I believe this is a strong rationale for the motivation behind this work, and it remains a valid approach to reducing various costs. Including the specifics of this analysis, along with the date it was conducted, in the Appendix would be beneficial for readers (the proportions may shift over time). The other issues raised were either minor or effectively addressed during the author-reviewer discussions.

Reasons to Recommend for Acceptance:

The AC believes this work explores a novel and important area of research focused on saving storage space on cloud servers. By potentially reducing computing costs, conserving energy, and possibly lowering carbon emissions, the work makes a significant contribution. Built on strong statistical motivations and theoretical analysis, the work has been well received by the reviewers, prompting the AC to recommend acceptance for its innovative direction in modern computing and its emphasis on cost and energy efficiency and, potentially, environmental care.

As a quantitative measure, the work received mostly positive evaluations, with good or excellent ratings in soundness (3, 3, 3), presentation (4, 4, 3), and contribution (3, 3, 2), resulting in strong overall ratings of 6, 7, and 5.