Matryoshka Quantization
Matryoshka Quantization (MatQuant) trains a single model that can operate at multiple bit-widths (e.g., int8, int4, int2) simultaneously by leveraging the nested structure of integer data types, outperforming models quantized independently at each precision.
Abstract
Reviews and Discussion
The paper presents a method for multi-scale quantization of large language models across multiple precisions (int8, int4, int2). By utilizing the nested structure of integer data types, the proposed technique allows different precision levels to be nested within one another. The resulting quantized model can then be served at different precisions based on deployment requirements.
Questions for Authors
A clarifying question: it is stated that upon quantization by MatQuant, the weights shift to higher values. Does the choice of quantization bins agree with the information-theoretically optimal approach to scalar quantization utilized by, e.g., the NormalFloat quantization method?
Claims and Evidence
The paper's claims of accuracy comparable to existing int4 and int8 quantization schemes, and of improved int2 performance, are supported, although the experiments were restricted to the C4 dataset.
Methods and Evaluation Criteria
Use of the C4 dataset is meaningful but limiting; including more comprehensive results on e.g. the Pile or at least WikiText-2 is strongly recommended.
Theoretical Claims
This is an empirical paper with no major theoretical results.
Experimental Design and Analysis
The experiments are appropriately constructed and conducted.
Supplementary Material
Yes, all of it.
Relation to Existing Literature
The proposed method belongs to the category of learning-based quantization methods and can work with other such existing techniques, including QAT and OmniQuant.
Essential References Not Discussed
The proposed MatQuant outperforms existing methods only at the extreme int2 quantization level. It is therefore critically important to compare and contrast MatQuant against other schemes specifically targeting such regimes. The work that comes to mind is Egiazarian et al., "Extreme Compression of Large Language Models via Additive Quantization", ICML 2024, which reports strong performance in 2-bit (vector) quantization settings.
Other Strengths and Weaknesses
On the one hand, at quantization levels higher than int2, the proposed method does not perform as strongly as existing techniques that target a single quantization level. On the other hand, at int2, MatQuant does work better than the considered (scalar quantization) alternatives, but its performance is still much weaker than that of higher-bit-width schemes. To establish the utility of MatQuant, the authors should compare and contrast it with SOTA methods for low-bit-width quantization (e.g., AQLM in the reference mentioned above).
Other Comments or Suggestions
None.
Ethics Review Issues
None.
We thank the reviewer for their thorough feedback. Below are our responses to the comments/questions:
Comparison with SOTA int2 methods: We would like to clarify that the main goal of this work is to propose an adaptive quantization method that produces a single model that does well at any precision and provides accurate mixed-precision models (Mix'n'Match). The int2 result demonstrates that the technique can potentially help train a significantly more accurate low bit-width model, but it is not the main goal, so the baselines are set accordingly.
For Mix'n'Match, we compare against the baseline of training quantized models independently and then stitching layers together to produce a model with the same per-layer quantization. The table below shows that MatQuant-based Mix'n'Match can be roughly 21 accuracy points better than a Mix'n'Match between separately trained baselines.
AQLM is a complementary technique to MatQuant, so AQLM + MatQuant might provide a more accurate 2-bit model than AQLM alone. However, as mentioned above, that is orthogonal to the goal of this work and we leave it for future investigation. Matryoshka-style nesting is possible across multiple axes, such as the number of codebooks per group and the number of elements in each codebook (i.e., 2^bits).
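To make the per-layer configuration strings in the table below easier to read: each digit is the bit-width assigned to one transformer layer (42 layers for Gemma-2 9B), and the average bit-width is the mean over layers. A minimal sketch under those assumptions; the helper names and stitching code are illustrative, not the paper's implementation:

```python
# Illustrative reading of a Mix'n'Match configuration string:
# one digit per layer, e.g. 12 layers at 2 bits and 30 layers at 4 bits.
def average_bits(config: str) -> float:
    bits = [int(c) for c in config]
    return sum(bits) / len(bits)

def stitch(sub_models: dict, config: str) -> list:
    # Take layer i from the sub-model at that layer's bit-width. With MatQuant,
    # all sub-models are slices of one shared int8 model; with the baseline,
    # they come from independently trained models.
    return [sub_models[int(c)][i] for i, c in enumerate(config)]

cfg = "222222" + "4" * 30 + "222222"   # 42 layers
print(round(average_bits(cfg), 2))      # -> 3.43
```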
Table: Mix'n'Match on MatQuant + OmniQuant for Gemma-2 9B. Mix'n'Match with MatQuant sub-models does substantially better than a Mix'n'Match between baseline models trained explicitly for 2 and 4 bits.
| Config. | Method | ARC-c | ARC-e | BoolQ | HellaSwag | PIQA | Winogrande | Average |
|---|---|---|---|---|---|---|---|---|
| 222222444444444444444444444444444444222222 = 3.43 avg bits | Mixnmatch using Baseline 2, 4 bit | 30.03 | 48.02 | 62.26 | 44.47 | 67.30 | 59.75 | 51.97 |
| 222222444444444444444444444444444444222222 = 3.43 avg bits | Mixnmatch using MatQuant 2, 4 bit | 52.56 | 79.04 | 78.99 | 75.66 | 80.20 | 69.38 | 72.64 |
| 222444444444444444444444444444444444444222 = 3.71 avg bits | Mixnmatch using Baseline 2, 4 bit | 31.48 | 50.34 | 62.14 | 43.97 | 67.19 | 60.30 | 52.57 |
| 222444444444444444444444444444444444444222 = 3.71 avg bits | Mixnmatch using MatQuant 2, 4 bit | 56.83 | 81.06 | 80.73 | 76.34 | 81.07 | 67.88 | 73.98 |
| int4 | Baseline | 58.79 | 78.37 | 83.55 | 76.71 | 81.45 | 67.09 | 74.33 |
| int4 | MatQuant | 57.25 | 77.36 | 84.86 | 75.52 | 81.50 | 66.77 | 73.88 |
| int2 | Baseline | 39.16 | 63.43 | 72.11 | 52.24 | 72.63 | 61.88 | 60.24 |
| int2 | MatQuant | 48.72 | 72.18 | 79.20 | 68.11 | 76.17 | 66.77 | 68.52 |
Relation to the NF quantization data type: MatQuant does not explicitly try to push the weight distribution towards an information-theoretically optimal quantization scheme. The fact that we slice bits in MatQuant to obtain sub-models incentivizes gradient descent to carefully optimize the MSBs. Since the 2-bit sub-model corresponds to the top-2 MSBs, MatQuant optimizes these bits to maximize the int2 sub-model's quality; this ends up pushing the peak of the weight distribution towards the right and, in the process, allocating weight more uniformly to the higher buckets. This is evident in Figure 1(c), where the peaks for a model trained with MatQuant are lower than those for the baseline and the weight distribution under MatQuant is somewhat more uniform.
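As a rough illustration of the bit slicing discussed above, the sketch below reads off a 2-bit sub-model's codes as the top-2 MSBs of an unsigned 8-bit code. It assumes a simple asymmetric min-max quantizer; the paper's exact quantizer, clipping, and rounding may differ.

```python
import numpy as np

def quantize_uint8(w):
    # Simple asymmetric min-max quantizer to unsigned 8-bit codes (illustrative).
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    q8 = np.clip(np.round((w - lo) / scale), 0, 255).astype(np.uint8)
    return q8, scale, lo

def slice_top_bits(q8, bits):
    # The nested sub-model keeps only the `bits` most significant bits.
    return (q8 >> (8 - bits)).astype(np.uint8)

w = np.random.randn(4096).astype(np.float32)
q8, scale, lo = quantize_uint8(w)
q2 = slice_top_bits(q8, 2)                      # codes in {0, 1, 2, 3}
w2 = q2.astype(np.float32) * (scale * 64) + lo  # dequantize with the shared int8 scale
```

Because the int2 sub-model sees only the top-2 bits, gradients that improve its loss push mass out of the lowest code and toward the higher buckets, which is the distribution shift described above.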
Datasets other than C4: Popular quantization papers such as GPTQ, QuIP, and SpinQuant utilize C4. To be consistent with popular literature, we use C4 for our experiments.
We hope that the rebuttal clarifies questions raised by the reviewer. We would be very happy to discuss any further questions about the work, and would really appreciate an appropriate increase in score if reviewers’ concerns are adequately addressed.
This paper introduces Matryoshka Quantization, a novel multi-scale quantization technique that enables single-model multi-bitwidth operation. The low-precision models extracted by MatQuant can have significant improvement compared with standard quantization.
Update after rebuttal
The authors address most of my concerns, and I keep my original rating to lean toward accepting the paper.
Questions for Authors
- What is the runtime overhead of MatQuant vs. standard QAT methods?
- How does MatQuant perform on very large-scale models?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The method makes sense for the problem.
Theoretical Claims
There are no theoretical results in this paper.
Experimental Design and Analysis
There is a lack of inference runtime benchmarks, which are critical for deployment. Also, does MatQuant need more GPU memory and time during training? It would be better to provide comparison results.
Supplementary Material
No code is available.
Relation to Existing Literature
Unlike existing works that need to optimize for each precision, MatQuant can train and maintain a single quantized model but serve it with the precision demanded by the deployment.
Essential References Not Discussed
Though QAT and OmniQuant are effective quantization tools, there are more recent works, such as the one below, that show better performance. Considering the results of MatQuant combined with such newer QAT methods would strengthen the paper's contribution.
Bondarenko Y., Del Chiaro R., Nagel M. Low-Rank Quantization-Aware Training for LLMs.
Also, as mentioned in the previous section, it is unclear whether MatQuant introduces more GPU memory usage or training time. There are several works that focus on the efficiency of QAT.
Chen M., Shao W., Xu P., Wang J., Gao P., Zhang K., Luo P. EfficientQAT: Efficient Quantization-Aware Training for Large Language Models.
Other Strengths and Weaknesses
There is one more weakness: the main results are on LMs with fewer than 10B parameters. I have a scalability concern about how MatQuant performs on larger models.
Other Comments or Suggestions
No extra comments.
We thank the reviewer for their insightful review. Below are our responses to the comments/questions:
Training memory/compute requirements: For both QAT and OmniQuant, MatQuant can be up to 30% cheaper than training three separate baselines (one each for 2, 4, and 8 bits). By recomputing some of the intermediate tensors generated during the forward pass when they are needed for gradient computation, MatQuant can run within the same memory budget as a single baseline run. Overall, from a training-cost point of view, MatQuant uses up to 30% fewer GPU hours and is more effective than having separate training runs for different bit-widths.
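For intuition, a minimal PyTorch-style sketch of this rematerialization: each bit-width's forward pass is checkpointed so its activations are recomputed during backward rather than held for all precisions at once. The `quantize(model, bits)` hook and the `weights=` argument are hypothetical placeholders, not the paper's actual API.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def multi_bitwidth_step(model, quantize, batch, target, bit_widths=(8, 4, 2)):
    # Hypothetical sketch: accumulate the losses of all nested bit-widths in one
    # step while keeping peak activation memory close to a single-precision run.
    total_loss = 0.0
    for bits in bit_widths:
        def run(x, b=bits):
            # `quantize` and `weights=` are assumed hooks, not the paper's API.
            return model(x, weights=quantize(model, bits=b))
        logits = checkpoint(run, batch, use_reentrant=False)  # rematerialized in backward
        total_loss = total_loss + F.cross_entropy(logits, target)
    total_loss.backward()
    return total_loss
```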
Training curves: Our experiments indicate that MatQuant requires fewer samples/iterations to converge to the same training perplexity. Specifically,
- 2-bit QAT on Gemma-2 9B: MatQuant and S.P. MatQuant reach with 20M tokens the same training perplexity that the baseline reaches with 100M tokens.
- 2-bit OmniQuant on Gemma-2 9B: MatQuant's layer-wise reconstruction error at 8M tokens and S.P. MatQuant's at 5.8M tokens match that of the baseline at 20M tokens.
Inference cost: A MatQuant-trained 2/4/8-bit model has the same inference cost as the corresponding baseline (QAT or OmniQuant) 2/4/8-bit model, i.e., inference budgets for homogeneous precisions are exactly the same as the baseline's, since we do not change the base quantization algorithm. With Mix'n'Match, we can further trade off quality against latency to cater to several serving environments, since our models lie on the Pareto-optimal memory-vs-quality curve.
Plugging MatQuant into the latest methods: We thank the reviewer for pointing us to the latest QAT-style techniques. Most SOTA QAT methods build on top of the baseline QAT setup by adding auxiliary losses, distillation, or some form of pretraining. We therefore adopted a general baseline QAT setup in our experiments to demonstrate the core advantage of MatQuant. MatQuant is complementary to the above-mentioned methods. Furthermore, as we have shown, MatQuant works with a wide variety of quantization techniques (QAT, OmniQuant), so we believe MatQuant can be combined with the latest QAT-style methods.
Scaling MatQuant to larger models: We performed preliminary experiments on Gemma-2 27B with QAT. As shown in the table below, our observations from the 2B and 9B scales hold for the 27B model as well. That is, MatQuant's quantized models are quality-neutral (or better by up to 5% for 2-bit) for all the bit-widths we trained for, as well as for the interpolated bit-widths 3 and 6. We will add these results to the next version of the manuscript.
Table: MatQuant + QAT on Gemma-2 27B performs on par with the baseline for int4 and int8 and significantly outperforms it for int2. Also, the int3, int6 models obtained for free via interpolation perform comparably to the explicitly trained baselines.
| DType | Method | ARC-c | ARC-e | BoolQ | HellaSwag | PIQA | Winogrande | Avg. | log pplx. |
|---|---|---|---|---|---|---|---|---|---|
| bf16 | - | 60.49 | 78.96 | 83.18 | 82.24 | 84.60 | 75.69 | 77.53 | 2.199 |
| int8 | Baseline | 60.49 | 79.12 | 80.34 | 83.15 | 84.60 | 76.01 | 77.28 | 2.169 |
| int8 | MatQuant | 60.41 | 79.88 | 78.41 | 82.54 | 85.04 | 75.77 | 77.01 | 2.141 |
| int4 | Sliced int8 | 59.30 | 77.90 | 83.94 | 81.69 | 83.68 | 73.88 | 76.73 | 2.232 |
| int4 | Baseline | 59.39 | 79.38 | 83.79 | 82.45 | 84.44 | 75.30 | 77.46 | 2.250 |
| int4 | MatQuant | 59.73 | 78.96 | 76.06 | 81.68 | 84.49 | 74.82 | 75.96 | 2.201 |
| int2 | Sliced int8 | 25.85 | 27.99 | 55.72 | 25.07 | 51.25 | 49.41 | 39.21 | 15.601 |
| int2 | Baseline | 32.25 | 55.56 | 67.58 | 55.50 | 69.59 | 59.04 | 56.59 | 3.061 |
| int2 | MatQuant | 44.8 | 71.25 | 70.31 | 67.61 | 76.33 | 64.64 | 65.82 | 2.674 |
| int6 | Sliced int8 | 60.15 | 79.12 | 81.31 | 83.13 | 84.60 | 76.64 | 77.49 | 2.111 |
| int6 | Baseline | 60.41 | 79.84 | 80.18 | 82.77 | 84.60 | 76.48 | 77.38 | 2.170 |
| int6 | MatQuant | 59.98 | 79.21 | 75.57 | 82.4 | 84.49 | 75.45 | 76.18 | 2.146 |
| int3 | Sliced int8 | 52.05 | 72.69 | 63.73 | 73.42 | 79.98 | 69.46 | 68.55 | 2.797 |
| int3 | Baseline | 57.17 | 79.38 | 73.61 | 78.91 | 83.24 | 72.85 | 74.19 | 2.373 |
| int3 | MatQuant | 54.95 | 76.18 | 63.18 | 77.36 | 82.37 | 72.30 | 71.06 | 2.494 |
We hope that the rebuttal clarified your questions about the paper and hopefully leads to an even more positive evaluation of the work. We are happy to further discuss if you have any questions about the paper.
Thanks to the authors for the detailed response to the concerns. I have no other concerns and think this work brings a clear contribution to the related community. So I keep the positive rating.
The paper introduces Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique designed to jointly learn multiple bit-width quantized representations in a single training run and to improve low-bit precision quantization for LLMs. The method is evaluated on Gemma-2 and Mistral models, and the results demonstrate better perplexity and downstream performance compared to the standard QAT baseline. The paper also presents variations of MatQuant, including Single Precision and quantizing both the FFN and Attention weights.
Update after rebuttal: My latest reply reflected my final update.
Questions for Authors
N/A
Claims and Evidence
[Claims are supported]
- MatQuant maintains a single quantized model but serves it with different precision demands.
- Int2-precision models of MatQuant achieve a larger improvement.
- MatQuant can interpolate int3 and int6 representations without explicitly training on them.
Methods and Evaluation Criteria
Yes. The models (Gemma, Mistral), evaluation tasks (C4, ARC, BoolQ, HellaSwag,...), and baselines (QAT, OmniQuant) make sense.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
Yes.
- The main results on int2 are validated.
- There are experiments ablating the hyperparameter.
- Single Precision experiment shows better performance.
[Weakness]
- The paper doesn't show the figure of performance vs training samples (or training time) for MatQuant. MatQuant might require more (or maybe fewer) samples or time to converge.
- Isn't Single Precision MatQuant the same as standard QAT, since it sets the loss weights for 4 and 8 bits to 0? Why does it still show a performance improvement?
- Maybe it's just me, but I don't understand the config and data type notation of Table 4. Why are there two 8s in "[8, 4, 2, 8] -> [4;2]", and what does "4;2" mean?
Supplementary Material
No Supplementary Material.
Relation to Existing Literature
- The paper is related to Matryoshka-style training.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We thank the reviewer for their insightful review. Below are our responses to the comments/questions:
Training memory/compute requirements: For both QAT and OmniQuant, MatQuant can be up to 30% cheaper than training three separate baselines (one each for 2, 4, and 8 bits). By recomputing some of the intermediate tensors generated during the forward pass when they are needed for gradient computation, MatQuant can run within the same memory budget as a single baseline run. Overall, from a training-cost point of view, MatQuant uses up to 30% fewer GPU hours and is more effective than having separate training runs for different bit-widths.
Training curves: Our experiments indicate that MatQuant requires fewer samples/iterations to converge to the same training perplexity. Specifically,
- 2-bit QAT on Gemma-2 9B: MatQuant and S.P. MatQuant reach with 20M tokens the same training perplexity that the baseline reaches with 100M tokens.
- 2-bit OmniQuant on Gemma-2 9B: MatQuant's layer-wise reconstruction error at 8M tokens and S.P. MatQuant's at 5.8M tokens match that of the baseline at 20M tokens. We will add the training curves to the next version of the manuscript.
Single Precision (S.P.) MatQuant vs Baseline:
- S.P. MatQuant first quantizes the model to 8 bits and then slices out the top-2 MSBs, whereas the baseline directly quantizes the model to 2 bits.
- The scaling factor and the zero point in S.P. MatQuant are derived from the 8-bit quantized model. Here, the model has 6 additional bits with which to better optimize the scaling factor and the zero point: the quantized weight corresponds to only the top-2 bits, and training can use the remaining 6 bits to improve overall quality. We hypothesize that this additional entropy (overparameterization) gives S.P. MatQuant an edge over the baseline (see the sketch below).
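A minimal numpy sketch of the difference in parameterization; the min-max quantizers and variable names here are illustrative, and the actual method may use learned clipping and per-channel scales:

```python
import numpy as np

def minmax_params(w, levels):
    lo, hi = float(w.min()), float(w.max())
    return (hi - lo) / (levels - 1), lo   # scale, zero point

w = np.random.randn(4096).astype(np.float32)

# Baseline: a 2-bit quantizer fit directly to a 4-level grid.
s2, z2 = minmax_params(w, levels=4)
q2_direct = np.clip(np.round((w - z2) / s2), 0, 3)

# S.P. MatQuant: scale/zero point come from the 256-level (8-bit) grid; the
# 2-bit code is the top-2 MSBs of the 8-bit code, so the 6 discarded bits and
# the shared scale remain available to training as extra degrees of freedom.
s8, z8 = minmax_params(w, levels=256)
q8 = np.clip(np.round((w - z8) / s8), 0, 255).astype(np.uint8)
q2_sliced = q8 >> 6
```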
Table 4 notation, [8, 4, 2, 8 -> 4;2]: this means that the standard cross-entropy (or layer-wise reconstruction) loss is applied to the 8-bit model and to the 4- and 2-bit sub-models; in addition, the 4- and 2-bit sub-models are distilled from the 8-bit model's logits (or layer outputs). The notation is described in the caption of Table 4, but we will further clean it up to make it easier to parse.
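For readers parsing the notation, one plausible way to write the [8, 4, 2, 8 -> 4;2] objective; the weights \(\lambda_b, \mu_b\), the KL form of the distillation term, and the symbols \(p_b\) are illustrative choices, not necessarily the paper's exact formulation:

```latex
\mathcal{L}
  = \sum_{b \in \{8,4,2\}} \lambda_b \,\mathrm{CE}\!\left(y,\, p_b\right)
  + \sum_{b \in \{4,2\}} \mu_b \,\mathrm{KL}\!\left(p_8 \,\Vert\, p_b\right)
```

Here \(p_b\) denotes the output distribution of the b-bit (sub-)model, \(y\) the ground truth, and \(p_8\) acts as the teacher; roughly speaking, for the OmniQuant variant the two terms become layer-wise reconstruction losses against the original and the 8-bit layer outputs, respectively.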
We hope that the rebuttal clarified your questions about the paper and hopefully leads to an even more positive evaluation of the work. We are happy to further discuss if you have any questions about the paper.
Thanks to the authors for addressing my questions. I kept my original rating to lean toward accepting the paper.
The paper introduces Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique designed to jointly learn multiple bit-width quantized network representations in a single training run and to improve low-bit precision quantization for LLMs. The method is evaluated on Gemma-2 and Mistral models, and the results demonstrate better perplexity and downstream performance compared to the standard QAT baseline.
Two of the three reviewers were mildly positive, with one reviewer slightly negative.
The negative reviewer raised essentially one issue, namely that the proposed MatQuant outperforms existing methods only at the extreme int2 quantization level, and asked for a comparison of MatQuant against other schemes specifically targeting such regimes. The authors correctly respond that the main goal of the work is to propose an adaptive quantization method that generates a single model that does well at any precision and provides accurate mixed-precision models. The int2 result demonstrates that the technique can potentially help train a significantly more accurate low bit-width model, but it is not the main goal.
When training multiple bit-widths, a single MatQuant run uses up to 30% fewer GPU hours and is more effective than having separate training runs for different bit-widths. Overall, the data provided in the rebuttal answered the reviewers' concerns about training time; for example, experiments indicate that MatQuant requires fewer samples/iterations to converge to the same training perplexity. Similarly, rebuttal data indicate that the method scales to even larger models than those in the original submission.