BackSlash: Rate Constrained Optimized Training of Large Language Models
Abstract
Reviews and Discussion
This work proposes Rate-Constrained Training (RCT), a novel training method that trains LLMs in a way that allows their weights to be effectively compressed at the end of training. The main idea is to derive a weight regularizer by assuming a specific distribution over the model weights, the generalized Gaussian (GG) distribution (with a shape parameter that is adapted during training). The authors argue for this choice empirically by showing that the trained model weights have a better fit under that distribution. This regularizer essentially becomes an Lν norm on the weights of the network, with the shape parameter ν being dynamically updated during training. After model training, the authors use exp-Golomb (EG) codes in order to compress the resulting weights; the argument for this choice is that EG codes can nearly achieve the entropy limit for GG sources. The authors experiment with RCT on various LLM architectures and simple classification tasks.
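For concreteness, a minimal sketch of such an Lν weight penalty is given below (PyTorch; the function name, the λ and ν values, and the exact form of the paper's DGGR regularizer, which is derived from a rate term, are illustrative assumptions rather than the authors' implementation):

```python
import torch

def dggr_penalty(model, lam: float, nu: float) -> torch.Tensor:
    """Illustrative L_nu weight penalty: lam * sum_i |w_i|**nu.
    With nu = 2 this reduces to weight decay (Gaussian prior) and with
    nu = 1 to an L1 penalty (Laplace prior); RCT adapts nu during training."""
    reg = torch.zeros((), device=next(model.parameters()).device)
    for w in model.parameters():
        reg = reg + w.abs().pow(nu).sum()
    return lam * reg

# loss = task_loss + dggr_penalty(model, lam=1e-4, nu=0.8)  # added to the cross-entropy loss
```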
Questions for Authors
No other questions besides the above.
Claims and Evidence
The paper claims are moderately supported by the evidence. While the authors do show improvements upon "normal training" with RCT, I think what is missing is a more convincing evaluation against the Gaussian model (or alternatively the GG model with a shape parameter of two) and the Laplace model (i.e., the GG model with a shape parameter of one), which are the two most important baselines. Furthermore, it is unclear what the "normal training" for the models presented in, e.g., Table 4 is, which makes the interpretation of the results more difficult. For example, since the authors assume that the Gaussian model is widely used, I would expect an evaluation against a "normal training" run that uses weight decay for the parameters (along with codes that are appropriate for such a distribution).
Methods and Evaluation Criteria
The evaluation criteria make sense, but, given that LLMs are more or less used as general purpose models, I would have expected some other tasks, besides classification, to be used (such as language modeling with next-token prediction).
Theoretical Claims
I checked the mathematical details and they were mostly OK, albeit with some assumptions that do not necessarily hold in practice (e.g., for the transition from Eq. 3 to Eq. 4). More specifically, the argument that the quantization step is typically small only makes sense when we have a large bit-width. When the bit-width is small, e.g., 2, the quantization step can be quite large in practice. As a result, the approximation in Eq. 4 might be off in such cases. Why did the authors not use the GG CDF for proper discretization into a quantization grid?
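To make this suggestion concrete, equal-probability discretization from a fitted GG CDF could look like the sketch below (Python with scipy's gennorm; this illustrates the reviewer's proposed alternative, not the paper's quantization scheme, and the function name and level count are assumptions):

```python
import numpy as np
from scipy.stats import gennorm

def gg_cdf_quantize(weights, nu: float, scale: float, n_levels: int = 16):
    """Equal-probability quantization: bin edges are GG quantiles, so each
    bin carries the same probability mass under the fitted generalized
    Gaussian (shape nu, scale), regardless of the bit-width."""
    probs = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]      # interior quantile levels
    edges = gennorm.ppf(probs, nu, loc=0.0, scale=scale)   # n_levels - 1 bin edges
    return np.digitize(weights, edges)                     # bin index in 0..n_levels-1
```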
Experimental Design and Analysis
The experimental design is ok but this work is missing critical baselines as mentioned above. More specifically:
- A "normal training" run where a model is trained with an L2 regularizer (i.e., weight decay)
- A more standard "sparse training run" where a model is trained with an L1 regularizer
Besides those results, I am also wondering a couple of other things:
- Is the adaptivity of the shape parameter necessary for good compression? Since you train the model from scratch, the parameters can adapt to the specific penalty you employ and thus can be easily compressed via appropriate codes for that norm.
- The authors apply entropy coding to the indices of quantized values (page 5, line ~240 second column), however it is unclear if those indices actually follow the desired discretized GG distribution. I would have expected some analysis there.
Supplementary Material
I did not check the supplementary material.
Relation to Broader Literature
The two main novel contributions of this work relative to the broader literature are:
- Proposing an Lν norm as a regularizer for the weights during training, with a shape parameter ν that is automatically updated during training. This is a relatively minor contribution, but a novel one nonetheless.
- Using EG codes for compression after training. This is the main novel contribution of the work relative to prior art.
Essential References Not Discussed
Nothing as far as I am aware of.
Other Strengths and Weaknesses
Overall, I found the paper interesting and the combination of the GG distribution and the EG codes in the LLM setting is novel and yielded good results in the experiments shown. The authors also considered several architectures during the evaluation, which is another bonus.
One more weakness that the authors could work on (besides what is mentioned above) is the argument for the extensive application of the Gaussian distribution. It is based on old references about Gaussian initialization of neural network weights. One critical point here is that in those works the weights are only initialized to be Gaussian; it is not assumed that they remain Gaussian after training. In fact, the weights after training become more heavy-tailed (which also acts as a supporting argument as to why GG distributions with shape < 2 make sense) [1]. It would be good if the authors provided more supporting evidence for this claim.
[1] Fortuin et al., Bayesian neural network priors revisited, https://openreview.net/pdf?id=xkjqJYqRJy
Other Comments or Suggestions
- It is unclear what the bit-width is for the experiments in Fig. 7; knowing this would allow for a better understanding of the effects of the quantization steps.
- While I agree that training a model for a specific RD will yield better benefits at the end, it is a bit impractical for large-scale model training, as one would need to retrain the (expensive) large model multiple times to find the desired tradeoff (e.g., to pick an appropriate λ). This is why, in practice, post-training compression is more desirable.
- It would be good to show the Gaussian fit at Figure 1 for comparison purposes.
- What is the performance if you fix the shape parameter at the specific value you found at the end of training and then train again? This would highlight whether the adaptivity of the shape parameter is necessary.
Comment #1: Evaluation against the Gaussian model and the Laplace model.
Response: We thank the reviewer for the valuable suggestion. We have conducted experiments with L0.5, L1, and L2 regularization; the latter two correspond to the Laplacian and Gaussian models, respectively. As can be seen, RCT with EG (DGGR) achieved the best balance of accuracy and compression. We also found that the shape parameter of the latest deepseek-7B is 0.85, also significantly different from 0.5, 1, or 2.
| Method | Final Shape Parameter | Accuracy | EG Code | FL Code |
|---|---|---|---|---|
| DGGR | 0.13 | 91.18% | 1.37 | 10.00 |
| L0.5 | 0.22 | 91.88% | 2.90 | 10.00 |
| L1 | 0.15 | 90.65% | 1.52 | 10.00 |
| L2 | 0.10 | 88.29% | 1.16 | 10.00 |
Comment #2: It is unclear what "normal training" refers to.
Response: We thank the reviewer for the comment. "Normal training" refers to fine-tuning models with only the cross-entropy loss.
Comment #3: I would have expected some other tasks to be used.
Response: Many thanks to the reviewer for this advice. If accepted, we will also add results from a text generation experiment using the latest deepseek-7B, evaluating performance by next-token accuracy. Results show 74% compression (average code length reduced from 11.00 to 2.90 bits) using RCT and EG0 over an already highly optimized model. In addition, EG0, the same code used for BERT, Llama, and GPT, achieved nearly identical coding efficiency to Huffman coding, which requires higher complexity. Additionally, the measured shape parameter (0.85) of deepseek-7B further validates the GG distribution assumption.
| Method | FL | EG | Huffman | EG Compression | Huffman Compression | Accuracy |
|---|---|---|---|---|---|---|
| Normal Training | 11.00 | 5.93 | 4.70 | 46% | 57% | 99.97% |
| RCT | 11.00 | 2.90 | 2.81 | 74% | 74% | 99.97% |
Comment #4: The quantization step can be quite large in practice.
Response: We appreciate the feedback. The symbol in question represents the quantization step, and Fig. 1 shows that almost all parameters are concentrated in a narrow range around zero, so the quantization step may in practice be much smaller than 2. We will clarify this in the paper if accepted.
Comment #5: Is the adaptivity of the shape parameter necessary for good compression?
Response: We are profoundly thankful to the reviewer for the comment. As shown in the response to Comment #1, experiments with L0.5, L1, L2, and the DGGR show that adapting the shape parameter did lead to performance and compression gains.
Comment #6: It is unclear if those indices actually follow the desired discretized GG distribution.
Response: Many thanks to the reviewer for this advice; it is very valuable to us. We checked the distribution of the indices and found that they do follow the GG distribution, although the shape parameter is slightly different due to the value mapping. For example, the shape parameters of the parameters and indices of BERT are 1.36 and 1.47 under normal training, and 0.26 and 0.30 under RCT. This does not affect the entropy coding because the EG code is robust to GG sources of various shapes.
Comment #7: The references do not assume that weights are Gaussian after training.
Response: We thank the reviewer for the feedback. We revised the description in Section 3 to "Most research assumes that LLM parameters follow the Gaussian distribution at initialization and rarely discusses the distribution after training.", and cited reference [1] in the paper. Thank you again for such a valuable argument.
Comment #8: It is unclear what the bit-width is for the experiments at Fig.7.
Response: Many thanks to the reviewer for this advice. We sincerely apologize for the omission and provide the bit-widths below.
| Bit-width at each quantization step (increasing left to right) | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| Normal Training | 7 | 10 | 13 | 16 | 19 | 23 | 26 | 26 | 26 |
| RCT | 7 | 10 | 13 | 16 | 19 | 22 | 25 | 25 | 26 |
Comment #9: One would need to retrain the (expensive) LLMs multiple times for finding the desired tradeoff.
Response: We greatly appreciate your helpful suggestion. Indeed, the setting of λ needs more research. However, retraining (expensive) LLMs many times is not necessary, since λ can be adjusted during training. Specifically, if the shape parameter of the distribution remains large during training, λ can be adjusted. We also found that λ could be set to the same empirical value to achieve the best results in our experiments.
Comment #10: It would be good to show the Gaussian fit at Figure 1 for comparison purposes.
Response: We are profoundly thankful to the reviewer for bringing this issue to our attention and we have added the Gaussian fit and GG fit to Fig. 1.
The paper presented Rate-Constrained Training (RCT) for Large Language Models (LLMs), exploring model compression in the training stage. The paper showed that parameters of representative LLMs typically followed generalized Gaussian instead of vanilla Gaussian. The paper further enforced the distribution constraints (DGGR) for model training, which eased the parameter entropy encoding with exp-Golomb (EG) codes. RCT demonstrated promising compression performance with different model architectures and parameter scales.
Update after rebuttal
The reviewer appreciated the authors' rebuttal, which resolved some of the concerns, e.g. comparisons to vanilla training + vanilla Huffman coding and L1/L2 regularization. The reviewer therefore raised the score.
Questions for Authors
N.A.
Claims and Evidence
Please refer to the strengths and weaknesses
Methods and Evaluation Criteria
Please refer to the strengths and weaknesses
Theoretical Claims
Please refer to the strengths and weaknesses
Experimental Design and Analysis
Please refer to the strengths and weaknesses
Supplementary Material
The supplementary material contains only code, which was not verified by the reviewer
Relation to Broader Literature
The literature review looks sufficient
Essential References Not Discussed
No
Other Strengths and Weaknesses
Strengths
i) The exploration of compression rate aware LLM training is interesting.
ii) The paper provided various analyses, e.g. shape parameter changes during training, analysis with different λ values, etc.
iii) The paper is easy to follow
Weaknesses
The main concern of the reviewer is on the evaluations
i) Baselines. While RCT was claimed to consider compression during training, the parameter encoding was applied after the model was trained, similar to vanilla solutions, e.g. Huffman coding. In the paper, fixed-length coding was listed as the only baseline, which is insufficient. The authors are encouraged to include baselines such as vanilla training + vanilla Huffman coding.
ii) Ablations.
Readers would be interested in understanding the advantages of several experimental design choices, e.g. the dynamic shape parameter vs. vanilla L1/L2 regularization.
iii) Model architectures.
Most of the experiments/analyses were performed on BERT and classification tasks. Readers may wonder if RCT scales/generalizes well.
iv) Presentation
Providing diagrams like Figures 2-5, while helping with interpretation, is not sufficient. For example, it won't be easy for follow-up works to make comparisons. The authors are encouraged to also provide the actual numbers shown in the figures.
Other Comments or Suggestions
L205, right column, "difficult" -> "different" ?
Comment #1: Baseline: include baselines like, vanilla training + vanilla Huffman coding, etc.
Response: We greatly appreciate the suggestion and will incorporate Huffman coding as a baseline. We found that although Huffman coding shows a small efficiency advantage over EG when applied to model parameters obtained with conventional training, there is no advantage to using Huffman coding with RCT. At the same time, a Huffman code must be designed for each round of training and for each model, with complexity that cannot be parallelized, while EG has very simple software and hardware implementations and can be applied to all models and tasks. In all our experiments, the same EG0 code was used.
| Model | EG | Huffman | FL |
|---|---|---|---|
| BERT | 2.64 | 2.42 | 10.00 |
| GPT | 2.46 | 2.25 | 11.00 |
| Llama | 1.72 | 1.66 | 10.00 |
| Gemma | 1.16 | 1.15 | 11.00 |
| Task | EG | Huffman | FL |
|---|---|---|---|
| Sentiment | 2.64 | 2.42 | 10.00 |
| Spam | 2.42 | 2.19 | 10.00 |
| Topic | 3.61 | 3.18 | 10.00 |
| Q-A | 2.90 | 2.81 | 11.00 |
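For reference, a minimal order-0 exp-Golomb (EG0) codec is sketched below. In the RCT pipeline the code would be applied to the non-negative indices produced by value mapping; the helper names are illustrative, and the paper's actual implementation (GolombCode.py in the supplementary material) may differ.

```python
def eg0_encode(n: int) -> str:
    """Order-0 exp-Golomb code for a non-negative integer n:
    floor(log2(n+1)) leading zeros followed by the binary form of n+1."""
    b = bin(n + 1)[2:]                 # e.g. n = 3 -> '100' -> code '00100'
    return "0" * (len(b) - 1) + b

def eg0_decode(bits: str, pos: int = 0):
    """Decode one EG0 codeword starting at `pos`; returns (value, next position)."""
    k = 0
    while bits[pos + k] == "0":        # count leading zeros
        k += 1
    value = int(bits[pos + k: pos + 2 * k + 1], 2) - 1
    return value, pos + 2 * k + 1

# Example: values 0, 1, 2, 3 encode to '1', '010', '011', '00100'.
```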
Comment #2: Ablations: dynamic shape parameter vs vanilla L1/L2 regularization.
Response: We are grateful for your valuable suggestion and have added L0.5, L1, and L2 regularization to BERT fine-tuning on IMDB, as shown below. In principle, L0.5, L1, and L2 assume that the parameters follow GG distributions with shapes 0.5, 1, and 2, while in reality the fitted shape parameters were below 0.22. The latest deepseek-7B model has a shape parameter of 0.85. Such differences call for taking the shape parameter into consideration during RCT, which achieved good results.
| Method | Final Shape Parameter | Accuracy | EG | FL |
|---|---|---|---|---|
| DGGR | 0.13 | 91.18% | 1.37 | 10.00 |
| L0.5 | 0.22 | 91.88% | 2.90 | 10.00 |
| L1 | 0.15 | 90.65% | 1.52 | 10.00 |
| L2 | 0.10 | 88.29% | 1.16 | 10.00 |
Comment #3: Architectures: The readers may wonder if RCT scales/generalizes well.
Response: We thank the reviewer for pointing this out. Since submitting the paper, we have conducted experiments with the latest DeepSeek model and obtained similar results, which we will include in the final manuscript if accepted. Below are the results from text generation using deepseek-7B, which further confirm the viability of RCT. Again, Huffman coding has an advantage over EG when used to compress a model that has already been trained conventionally, at additional training and implementation cost, whereas for RCT training there is virtually no difference. Additionally, the measured shape parameter (0.85) of deepseek-7B further validates the necessity of shape parameter estimation.
| Method | FL | EG | Huffman | EG Compression | Huffman Compression | Accuracy |
|---|---|---|---|---|---|---|
| Normal Training | 11.00 | 5.93 | 4.70 | 46% | 57% | 99.97% |
| RCT | 11.00 | 2.90 | 2.81 | 74% | 74% | 99.97% |
Comment #4: Presentation: The authors are encouraged to provide also the actual numbers shown in Figures.
Response: We thank the reviewer for the suggestion and will annotate each data point in Figs. 3-4 with its value for clarity, as shown below.
| Lagrange multiplier | 0 | 1 | 10 | 100 | 1000 |
|---|---|---|---|---|---|
| EG Code | 7.31 | 6.66 | 5.78 | 4.03 | 2.64 |
| Huffman Code | 5.47 | 5.25 | 4.81 | 3.59 | 2.42 |
| Fixed Length Code | 10.00 | 10.00 | 10.00 | 10.00 | 10.00 |
| Train Accuracy | 99.54% | 99.69% | 99.56% | 99.47% | 99.52% |
| Test Accuracy | 93.63% | 93.85% | 93.93% | 93.68% | 92.59% |
Comment #5: L205, right column, "difficult" -> "different"?
Response: We thank the reviewer for catching the typo and will proof-read the paper very thoroughly if accepted.
The paper introduces Rate-Constrained Training (RCT), which can integrate compression during the training phase using rate-distortion optimization. The authors observed that LLM parameters follow a generalized Gaussian distribution (GG) with shape parameters less than 2. Thus, the proposed idea is to use rate-distortion theory during training, model parameters with GG distributions, and then apply exp-Golomb coding. The authors conducted experiments on various models and downstream tasks with LLMs. They demonstrate RCT can reduce memory usage by 60-90% without accuracy loss and is better than post-training compression.
Questions for Authors
See the above weaknesses.
- Why omit it in Eq. 7?
- Is the quantity in line 320 the same as the one in Eq. 2? If not, please check the notation carefully.
Claims and Evidence
The main claims are largely supported by theoretical and empirical results.
Weaknesses:
- The authors claim "EG codes is robust with regard to parameter mismatches." in lines 61-62, but there is no analysis or experimental result supporting it.
- The authors mentioned that "different regulations during training may impact parameter distribution" in lines 110-112. I would like to see any experimental support or observations.
- The soft gradient clipping is not evaluated.
Methods and Evaluation Criteria
Yes, but there are weaknesses:
- Limited discussion on convergence time
- Only classification tasks are evaluated; it would be better to run some experiments on generation tasks.
Theoretical Claims
Equations and algorithms are sound.
Experimental Design and Analysis
Weaknesses:
- No comparison to other entropy coders in the same setup.
- No ablation study separating the contribution of EG coding from that of RCT alone.
- The selection of λ may be sensitive to different tasks, but there are no experiments or analysis.
Supplementary Material
Yes, GolombCode.py
Relation to Broader Literature
The paper synthesizes ideas from information theory, statistics, and deep learning into a novel framework. It can be a bridge between classical compression theory and LLM optimization.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
Comments on writing-related:
- The contributions are too long; it is better to summarize the main contributions in very few sentences.
- Line 85 (section 2.2), "other works" are missing citations.
- line 164: tick -> trick, line 206: difficult -> different?
- Figure 6 is hard to interpret. Please consider enlarging the critical part or normalizing the values for better visualization.
- I think it is better to explicitly describe the ν estimation step in Algo 1.
Ethics Review Issues
N/A
Comment #1: Analysis of "EG codes is robust with regard to parameter mismatches."
Response: We are sincerely grateful to the reviewer for providing such valuable advice. Table 3 demonstrates the 0-order EG code's optimality across models with varying shape parameters, confirming its robustness. A theoretical analysis can be found in Wen & Villasenor, https://ieeexplore.ieee.org/document/761289.
Comment #2: Analysis of "different regulations during training may impact parameter distribution".
Response: We greatly appreciate your helpful suggestion and agree with the advice. We added L0.5, L1, and L2 regularization to BERT fine-tuning on IMDB, shown below. The original BERT has shape parameter 1.36 and was fine-tuned to 0.22, 0.15, and 0.10 under L0.5, L1, and L2 regularization, respectively, with differences in size and performance. This shows the importance of shape parameter estimation. The shape parameter of the latest deepseek-7B model is 0.85, also significantly different from 0.5, 1, or 2.
| Method | Shape | Accuracy | EG | FL |
|---|---|---|---|---|
| DGGR | 0.13 | 91.18% | 1.37 | 10.00 |
| L0.5 | 0.22 | 91.88% | 2.90 | 10.00 |
| L1 | 0.15 | 90.65% | 1.52 | 10.00 |
| L2 | 0.10 | 88.29% | 1.16 | 10.00 |
Comment #3: The soft gradient clipping is not evaluated.
Response: We are sincerely grateful for your valuable suggestion regarding soft gradient clipping. The clipping prevents gradient explosion, and the threshold of 1 was an empirical choice. Our experiments with other threshold values showed that some settings brought unstable gradients while others led to slow compression, and 1 was a good compromise.
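The exact clipping function is not restated in this thread, so the sketch below is only one plausible reading: a smooth saturation of the DGGR gradient magnitude at the threshold of 1 mentioned above (Python/numpy; the tanh form, the eps guard, and the function name are assumptions, not the paper's definition).

```python
import numpy as np

def soft_clipped_dggr_grad(w, lam: float, nu: float, g_max: float = 1.0, eps: float = 1e-12):
    """Hypothetical soft clipping of the DGGR gradient.
    The raw gradient lam*nu*|w|**(nu-1)*sign(w) diverges as |w| -> 0 when
    nu < 1; a smooth tanh saturation keeps its magnitude below g_max."""
    raw = lam * nu * np.sign(w) * np.maximum(np.abs(w), eps) ** (nu - 1.0)
    return g_max * np.tanh(raw / g_max)
```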
Comment #4: Limited discussion on convergence time.
Response: Thank you for pointing out this problem. The number of rounds RCT needs to converge relates to the selection of λ. We trained for 10 epochs in Fig. 2 and Fig. 3, with convergence at about 700 steps (3 epochs); a smaller λ took longer.
Comment #5: Run some experiments on generation tasks.
Response: Thank you for the suggestion. We added a text generation experiment with the latest deepseek-7B and evaluated it by next-token accuracy, shown below. With RCT, the Huffman and EG codes have nearly the same code length, but the EG code is more efficient to implement. Furthermore, the shape parameter of deepseek-7B is 0.85.
| Method | FL | EG | Huffman | EG Compression | Huffman Compression | Accuracy |
|---|---|---|---|---|---|---|
| Normal Training | 11.00 | 5.93 | 4.70 | 46% | 57% | 99.97% |
| RCT | 11.00 | 2.90 | 2.81 | 74% | 74% | 99.97% |
Comment #6: No comparison to other entropy coders.
Response: We found that although Huffman coding shows a small efficiency advantage over EG when applied to model parameters obtained with conventional training, there is no advantage to using Huffman coding with RCT. At the same time, a Huffman code must be designed for each round of training and for each model, with complexity that cannot be parallelized, while EG has very simple software and hardware implementations and can be applied to all models and tasks. In all our experiments, the same EG0 code was used.
| Model | EG | Huffman | FL |
|---|---|---|---|
| BERT | 2.64 | 2.42 | 10.00 |
| GPT | 2.46 | 2.25 | 11.00 |
| Llama | 1.72 | 1.66 | 10.00 |
| Gemma | 1.16 | 1.15 | 11.00 |
| Task | EG | Huffman | FL |
|---|---|---|---|
| Sentiment | 2.64 | 2.42 | 10.00 |
| Spam | 2.42 | 2.19 | 10.00 |
| Topic | 3.61 | 3.18 | 10.00 |
| Q-A | 2.90 | 2.81 | 11.00 |
Comment #7: No ablation study on EG's contribution and RCT solely.
Response: We thank the reviewer for the feedback. We will incorporate more comparisons of RCT with Huffman coding, conventional training (no entropy coding), and L0.5, L1, L2 regularization to demonstrate the usefulness of EG.
Comment #8: (1) Contributions are too long. (2) "other works" misses citations. (3) tick -> trick, difficult -> different. (4) Figure 6 is hard to interpret. (5) Describe the ν estimation in Algo. 1. (6) Why omit it in Eq. 7? (7) Is the quantity in line 320 the same as the one in Eq. 2?
Response: We are extremely grateful to the reviewer for catching these issues. We will fix these typos and clarify throughout the paper if it is accepted.
Comment #9: The selection of λ may be sensitive to different tasks, but there are no experiments and analysis.
Response: We are sincerely grateful to the reviewer for the comment. We agree that the optimal setting of λ needs more research, which we are currently conducting. In our extensive experiments, it was set to the same empirical value to achieve the best results.
The paper introduces Rate-Constrained Training (RCT), a method integrating rate-distortion optimization into the training process of Large Language Models (LLMs). RCT leverages a generalized Gaussian (GG) distribution to accurately model LLM parameters and uses exp-Golomb (EG) coding for entropy-efficient parameter encoding. The method dynamically optimizes both model performance and compression rate during training, and reduces model size (60%-90% memory reduction) while maintaining or slightly improving accuracy.
Questions for Authors
Could the authors elaborate on the computational complexity introduced by the required "value mapping" before entropy coding, and provide insights into practical implementations?
Claims and Evidence
yes
Methods and Evaluation Criteria
yes
Theoretical Claims
yes
Experimental Design and Analysis
yes
Supplementary Material
no
Relation to Broader Literature
A novel approach to integrate rate distortion optimization methodology to the LLM training.
Essential References Not Discussed
no
Other Strengths and Weaknesses
strengths:
- Introduces a novel rate-distortion optimization framework integrated directly into the training phase, addressing a significant gap in existing compression techniques.
- Experiments show that LLM parameters follow a generalized Gaussian distribution (shape parameter ν < 2), significantly improving modeling accuracy over traditional Gaussian assumptions.
- Employs exp-Golomb codes which closely approach entropy limits, are robust to parameter distribution changes, and simplify implementation on hardware and software platforms.
weaknesses:
- The optimal selection of the Lagrange multiplier (λ) is fairly empirical; there is no theoretical guidance or adaptive strategy to systematically select λ across different tasks.
- Incorporating iterative shape parameter estimation and rate calculations within every training step introduces non-trivial computational overhead; more discussion would be needed.
- Although partially addressed via soft gradient clipping, instability in gradient computation remains a concern, especially when ν is close to 1.
Other Comments or Suggestions
update after rebuttal: I appreciate the responses from the author. I'd like to keep my score.
Comment #1: The optimal selection of the Lagrange multiplier (λ) is pretty empirical, there is no theoretical guidance or any adaptive strategies to systematically select λ across different tasks.
Response: We thank the reviewer for the valuable feedback. Indeed, the theoretical foundation for the optimal setting of λ is critical and currently under investigation. In our extensive experiments, however, we found that the same empirical value of λ could be used for different models and tasks to achieve the best results.
Comment #2: Shape parameter estimation and rate calculations within every training step introduce non-trivial computational overhead.
Response: We thank the reviewer for the comment. In our experiments, the statistics for estimating the shape parameter are collected during the previous training round, incurring little additional overhead. Overall, the current implementation of RCT fine-tuning of BERT on A100 GPUs using the IMDB dataset required 8.63 min/epoch, compared with conventional training at 7.92 min/epoch. This overhead stems mainly from DGGR gradient descent, as shape parameter estimation added negligible time. Further optimization is possible, such as storing quantized DGGR gradients and replacing explicit computations with lookup tables.
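As an illustration of why the estimation step is cheap, a sketch of measuring the shape parameter of a weight tensor is given below. The paper's own estimator (which reuses statistics from the previous round) is not reproduced here; a scipy gennorm maximum-likelihood fit on a random subsample is only a stand-in, and the function name and sample size are assumptions.

```python
import numpy as np
from scipy.stats import gennorm

def estimate_shape(weights, sample_size: int = 100_000, seed: int = 0) -> float:
    """Fit a zero-mean generalized Gaussian to a random subsample of the
    weights and return the fitted shape parameter nu."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    rng = np.random.default_rng(seed)
    if w.size > sample_size:
        w = rng.choice(w, size=sample_size, replace=False)
    nu, _, _ = gennorm.fit(w, floc=0.0)   # returns (shape, loc, scale)
    return nu
```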
Comment #3: Although partially addressed via soft gradient clipping, instability in gradient computation remains a concern, especially for the cases when v is close to 1.
Response: We thank the reviewer for highlighting potential gradient instability in soft gradient descent. Our retraining experiments with and without RCT show that soft gradient clipping effectively mitigates the impact of unstable gradients. For instance, with clipping, RCT achieves a mean of 1.70 (std 2.18) versus normal training's mean of 1.89 (std 1.23). Extensive preliminary experiments confirm that the gradients did not cause significant performance fluctuations. Should soft gradient clipping fail, other means such as fixed-value gradient clipping exist.
Comment #4: Could the authors elaborate on the computational complexity introduced by the required "value mapping" before entropy coding? and provide insights into practical implementations.
Response: Value mapping converts parameters to sorted indices via hash tables, with O(N) time/space complexity, and the process is highly parallelizable. In our serial CPU implementation, the value mapping step for 110M parameters takes 0.16 s, while the entire encoding process takes 3.20 s in total.
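A sketch of one frequency-ranked value mapping consistent with this description is shown below (Python; hash-table based; the actual sort key and implementation in the paper may differ, so this is an illustrative assumption):

```python
from collections import Counter

def value_mapping(quantized_values):
    """Hypothetical value mapping: rank distinct quantized values by frequency
    so the most frequent value gets index 0, the next index 1, and so on,
    giving frequent values the shortest EG codewords.  Roughly O(N) overall."""
    counts = Counter(quantized_values)                        # value -> count
    rank = {v: i for i, (v, _) in enumerate(counts.most_common())}
    indices = [rank[v] for v in quantized_values]             # what gets EG-encoded
    return indices, rank                                      # rank table kept for decoding
```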
The review process identified a number of missing baselines, ablation studies and the reviewers had requested a number of clarifications in their comments. The authors have responded to these concerns in their rebuttal. I would advise the authors to incorporate this discourse in the camera-ready version of the manuscript.