Training and inference of large language models using 8-bit floating point
A methodology to leverage FP8 formats for training and inference of large language models.
Abstract
Reviews and Discussion
This paper contains thorough details on inference/fine-tuning with FP8 quantized linear layers in the context of language models. I believe it will be useful to the community. The main part of the method is how to choose the correct per-tensor scaling bias.
Strengths
This paper is a key piece missing from the large scale FP8 literature. Figure 1 in particular is presented clearly and contains important details for successful FP8 inference/fine-tuning.
Weaknesses
The largest weakness in this paper is in transparency -- it is claimed throughout (e.g., the title) that FP8 training will be demonstrated. However, results are only provided for fine-tuning. I would suggest changing the title/intro to be more clear.
Questions
- Do the authors think that their results will hold for FP8 training from scratch?
- Do the authors believe it's possible to use any of their methodology for other layers in the network such as attention?
We thank the reviewer for remarking that our paper “is a key piece missing from the large scale FP8 literature” and that Figure 1 “contains important details for successful FP8 inference/fine-tuning”.
Concerning the weakness about transparency and the first question, we understand how the term “training” can be confusing and have modified the title and introduction to better reflect the scope of the paper. Let us clarify that we don’t claim to have experiments for FP8 pre-training from scratch. However, from a conceptual point of view, our methodology can be applied to both pre-training and fine-tuning. That’s why we encapsulate both under the term “training”. In the experiments, we clarify that we are only doing fine-tuning (see the title of subsection 4.5, for instance). We would love to have pre-training results, but they are really expensive and only affordable for a handful of AI labs worldwide, especially for models of the sizes we tested, up to 13 billion parameters. As a result, we don’t include them here and leave them for future work.
Concerning the second question about extending the FP8 methodology to other layers such as attention, we think it is possible, but we need more time to obtain publishable results. We therefore leave this question for future work.
However, from a conceptual point of view, our methodology can be applied to both pre-training and fine-tuning. That’s why we encapsulate both under the term “training”.
My concern is that while it could be applied conceptually, it might not actually work in practice. Thank you for changing the title to be more clear, but there are still many places where I would recommend this to be changed (e.g., first sentence of abstract). I will leave my score as is, which is to still recommend accept.
The paper presents a new methodology for selecting a scaling factor value (exponent bias) when representing numbers with an 8-bit floating point format in deep learning training and inference. This methodology is based on roughly matching the dynamic range of the parameters to that of the 8-bit floating point numerical format. In particular, the exponent bias is either selected dynamically for each parameter (FP8-AMAX) or selected uniformly for all parameters (FP8-CSCALE). The paper explores the training of two types of large language models, namely GPT and Llama 2, using FP8 representation for model sizes ranging from 111M to 70B. The results indicate that the performance is on par with the FP16 representation.
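To make the two scaling schemes concrete, a minimal NumPy sketch of an amax-derived per-tensor bias (the FP8-AMAX idea) versus a single shared bias (the FP8-CSCALE idea) might look as follows. This is not code from the paper: the 448 maximum, the helper names, and the crude rounding are simplifying assumptions for an E4M3-style format.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # assumed largest finite magnitude of an E4M3-style FP8 format


def amax_scale_bias(x: np.ndarray) -> int:
    """Per-tensor (FP8-AMAX-style) bias: shift max|x| close to the top of the FP8 range."""
    amax = float(np.max(np.abs(x)))
    return int(np.floor(np.log2(FP8_E4M3_MAX / amax)))


def simulate_fp8(x: np.ndarray, bias: int) -> np.ndarray:
    """Crude FP8 simulation: scale by 2**bias, clip, round to ~3 mantissa bits, rescale back."""
    scaled = np.clip(x * 2.0**bias, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    binade = np.floor(np.log2(np.maximum(np.abs(scaled), 2.0**-9)))
    step = 2.0 ** (binade - 3)  # value spacing within each binade for 3 mantissa bits
    return np.round(scaled / step) * step / 2.0**bias


weights = 0.02 * np.random.randn(4096).astype(np.float32)
dynamic_bias = amax_scale_bias(weights)  # FP8-AMAX: recomputed from this tensor's statistics
constant_bias = 10                       # FP8-CSCALE: one fixed bias shared across tensors
for b in (dynamic_bias, constant_bias):
    err = np.max(np.abs(weights - simulate_fp8(weights, b)))
    print(f"bias={b:3d}  max abs quantisation error={err:.2e}")
```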
Strengths
1- The paper is well-written and organized.
2- The new methodology for FP8 has been evaluated on various large language models, demonstrating that this approach is generalizable.
Weaknesses
1- The paper's contributions and novelty are not immediately clear. The methodology for calculating the exponent bias resembles the asymmetric quantization process for INT8, where the scaling factor is determined using a max operation. Furthermore, even within the scope of 8-bit floating point representation, determining the exponent bias based on the max operation has been explored in prior research. The authors are encouraged to clarify the paper's unique contributions, especially in comparison to the following studies:
[1] Tambe, Thierry, et al. "Algorithm-hardware co-design of adaptive floating-point encodings for resilient deep learning inference." 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 2020.
[2] Sun, Xiao, et al. "Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks." Advances in Neural Information Processing Systems 32 (2019).
[3] Kuzmin, Andrey, et al. "Fp8 quantization: The power of the exponent." Advances in Neural Information Processing Systems 35 (2022): 14651-14662.
[4] Lee, Janghwan, and Jungwook Choi. "Optimizing Exponent Bias for Sub-8bit Floating-Point Inference of Fine-tuned Transformers." 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022.
2- Comparisons with other numerical formats, such as INT8, block floating point, logarithmic number systems, and posit, are not discussed. For instance, the results in [5,6] indicate that INT8 performance is superior for inference, even for models like the Transformer.
[5] van Baalen, Mart, et al. "FP8 versus INT8 for efficient deep learning inference." arXiv preprint arXiv:2303.17951 (2023).
[6] Zhang, Yijia, et al. "Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models." arXiv preprint arXiv:2305.12356 (2023).
Questions
What is the reason for using the max operation to compute the exponent bias? Why didn't the author consider other statistical metrics such as mean or mode? The max operation typically works best when the distribution is symmetric. However, the distribution in deep learning is often asymmetric.
We thank the reviewer for remarking that our paper is “well-written and organized” and that our FP8 approach is “generalizable” to other large language models in addition to the GPT and Llama models of the paper.
Concerning the weaknesses, the reviewer has caveats about our novelty and contribution. We acknowledge in the paper that the max approach to choose the scales builds upon quantisation ideas from INT formats (see section 2.3). However, our paper is novel because we adapt the max approach to FP8 and detail its application to quantize weights, activations and gradients for both FP8 training and inference. Moreover, we shed light upon scaling decisions taken in popular FP8 software implementations (e.g. Transformer Engine). The papers suggested by the reviewer have been important contributions for FP8, but in our view don’t cover: 1) details for practitioners to reproduce the FP8 methodology that has gained traction in the community, illustrated by the Transformer Engine library, Noune et al. [2022] and Micikevicius et al. [2022]; 2) training and inference validation for large language models of more than 1 billion parameters; 3) evolution of the scales during training and inference.
Concerning the second weakness about lacking “comparisons with other numerical formats”, we refer the reader to our Appendix C, where we discuss them. It’s true that we don’t run experiments with those other formats, but we believe that our paper is self-contained with the focus on FP8. Other papers cited in our manuscript already cover those comparisons, see for instance Kuzmin et al. [2022] and van Baalen et al. [2023].
Regarding the question about why we use the max operation and not other statistics, we focus on max since it is simpler than computing the mode or mean, while being sufficient for FP8 training and inference to match the validation performance of higher-precision baselines (see section 3). Moreover, max is the methodology gaining traction in FP8 libraries like Transformer Engine. Other papers that we cite already cover those comparisons; see Micikevicius et al. [2022] and van Baalen et al. [2023], which employ other methods such as MSE or percentile.
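To make this concrete, the max-based choice of bias can be written schematically as follows (a simplified form rather than the exact expression of subsection 2.3; $F_{\max}$ denotes the largest finite FP8 magnitude):

$$
b = \left\lfloor \log_2 \frac{F_{\max}}{\max_i |x_i|} \right\rfloor, \qquad \hat{x}_i = \operatorname{cast}_{\mathrm{FP8}}\!\left(2^{b} x_i\right),
$$

which guarantees $|2^{b} x_i| \le F_{\max}$ for every element, i.e. no clipping, whereas a mean- or mode-based statistic offers no such guarantee and would need an additional safety margin. Since FP8 is a sign-magnitude format and therefore symmetric around zero, only the magnitude $\max_i |x_i|$ matters and no zero-point is required, unlike in asymmetric INT8 quantisation.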
I appreciated the authors' response to my comments. The study itself is valuable to the community, and I have increased my score to 5. However, as most of the approaches used in this paper are well-established in previous studies, I still believe that the paper's novelty is not sufficient for acceptance.
- The paper conducts experiments for FP8 inference and finetuning in the context of LLMs.
- It also provides various implementation details such as scaling factor calculations.
Strengths
- The paper goes into great detail on how exactly quantization is carried out and how scaling factors are computed.
- Additional general discussion and statistics studies are presented in the Appendix.
Weaknesses
- The paper repeatedly emphasizes "training" in FP8 (e.g., in the title, in the abstract, etc.), yet I could not find any actual training experiments, that is, training a large LLM from scratch, in the paper. The paper only performs some finetuning on GLUE tasks, which is significantly less interesting: finetuning is comparatively cheap, so FP8 speedups are not so crucial there, and in many cases even more affordable finetuning techniques like QLoRA also work well. I think significant presentation changes are required to clarify that the paper focuses on inference and finetuning.
- Most of the methodology appears to me like standard low-precision quantized training techniques, e.g., using additional scales that are determined dynamically, adapted directly to FP8. Could you explain more precisely what exactly is new? I also did not find a Related Work section discussing this in more detail.
- FP8 inference has been studied extensively by e.g. [3, 4] and also in the context of LLMs [1, 2]. Further, [5] finetunes (and even trains from scratch) large Transformers in FP8. Hence, the overall novelty of the work appears very low.
Unfortunately, as the paper overall does not seem to contain significant novelty, neither in methodology nor in results, I cannot recommend acceptance at this point.
[1] ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, Wu et al.
[2] Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models, Zhang et al.
[3] FP8 Quantization: The Power of the Exponent, Kuzmin et al.
[4] FP8 versus INT8 for efficient deep learning inference, van Baalen et al.
[5] FP8 Formats for Deep Learning, Micikevicius et al.
Questions
- What real-world speedups do you observe when finetuning and running inference in FP8 using your setup?
We thank the reviewer for remarking that our paper “goes into great detail on how exactly quantization is carried out” and “how scaling factors are computed”. We also appreciate their interest in the additional explanations and plots in the appendix.
Concerning the weakness about “could not find any actual training experiments”, we understand how the term “training” can be confusing and have modified the title and introduction to better reflect the scope of the paper. Let us clarify that we don’t claim to have experiments for FP8 pre-training from scratch. However, from a conceptual point of view, our methodology can be applied to both pre-training and fine-tuning. That’s why we encapsulate both under the term “training”. In the experiments, we clarify that we are only doing fine-tuning (see the title of subsection 4.5, for instance). We would love to have pre-training results, but they are really expensive and only affordable for a handful of AI labs worldwide, especially for models of the sizes we tested, up to 13 billion parameters. As a result, we don’t include them here and leave them for future work.
Concerning the other two weaknesses about the novelty, the reviewer rightly points out that we take ideas from standard quantization techniques. For example, our max approach to choose the scales builds upon quantisation ideas from INT formats (see section 2.3). However, our paper is novel because we adapt the max approach to FP8 and detail its application to quantize weights, activations and gradients for both FP8 training and inference. Moreover, we shed light upon scaling decisions taken in popular FP8 software implementations (e.g. Transformer Engine), where some of the fine-grained details and justifications are not made explicit. We hope that by opening up our methodology and testing it in the experiments in Section 4, other FP8 researchers can build on top of it. The paper [5] that the reviewer mentions claims to train large transformers in FP8 from scratch, but leaves many details of the implementation missing, hindering reproducibility. In contrast, we provide pseudocode of the full forward and backward pass (see figure 1), explain the nuances between quantizing the weights, activations and gradients to FP8, and show the mathematical formulas to compute the scaling biases (see subsection 2.3).
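To give a flavour of that structure here, the following simplified NumPy sketch (our own illustration under stated assumptions, not the pseudocode of Figure 1: the cast only scales and clips, and bx, bw, bg stand in for the activation, weight and gradient scaling biases) shows where the separate biases enter a linear layer in the forward and backward pass.

```python
import numpy as np

FP8_MAX = 448.0  # assumed E4M3-style maximum


def fp8_cast(x: np.ndarray, bias: int) -> np.ndarray:
    """Placeholder cast: scale by 2**bias and clip; a real kernel would also round the mantissa."""
    return np.clip(x * 2.0**bias, -FP8_MAX, FP8_MAX)


def linear_fwd(x, w, bx, bw):
    """Forward: matmul of FP8-cast activations and weights, rescaled back to higher precision."""
    return (fp8_cast(x, bx) @ fp8_cast(w, bw).T) / 2.0 ** (bx + bw)


def linear_bwd(grad_y, x, w, bg, bx, bw):
    """Backward: the output gradient gets its own scaling bias before the two matmuls."""
    grad_x = (fp8_cast(grad_y, bg) @ fp8_cast(w, bw)) / 2.0 ** (bg + bw)
    grad_w = (fp8_cast(grad_y, bg).T @ fp8_cast(x, bx)) / 2.0 ** (bg + bx)
    return grad_x, grad_w


x, w = np.random.randn(8, 64), 0.05 * np.random.randn(32, 64)
y = linear_fwd(x, w, bx=4, bw=8)  # in practice the biases would come from amax statistics
```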
Lastly, concerning the question about real-world speedups, the hardware that we employ doesn’t have native FP8 support. This is mentioned in the hardware section (Appendix H.2). As a result, we don’t claim any speedup numbers and leave them for future work.
Thank you for making adjustments regarding the use of the term "training". I understand that pretraining is expensive, but that is also why reducing costs via FP8 is so crucial there. There are many methods that could be applied "from a conceptual point of view" but fail in practice in many interesting setups. Similarly, I understand that the authors may not have access to expensive Hopper GPUs to implement and demonstrate actual speedups. However, these aspects could have been strengths of the paper if they were present.
I appreciate that the authors describe technical details to aid reproducibility, but I do not think that this constitutes sufficient novelty to recommend acceptance, as most of the paper's key results have been established before, using similar techniques (see references in my review). Additionally, I think a lot of the pseudocode, diagrams and formulas are essentially standard practice in the low-precision literature and are thus hard to count as significant novelty. Finally, there are no comparisons demonstrating the impact of those precise details that are actually different, relative to existing work.
Hence, I maintain my initial score.
We thank the reviewers for reading the manuscript and providing useful comments to improve it. We were encouraged by the reviewers' remarks that our paper “is a key piece missing from the large scale FP8 literature”, is “well-written and organized”, and “goes into great detail on how exactly quantization is carried out”.
A point raised by two reviewers relates to the fact that we don't have pre-training results with FP8. Conceptually, our methodology applies to both pre-training and fine-tuning, but due to resource constraints we only provide results for fine-tuning. We understand how the term “training” can be confusing, and have modified the title, abstract and introduction to better reflect the scope of the paper.
Another comment raised by two reviewers is the lack of novelty, especially given the ideas we build upon from quantisation strategies in INT formats. We acknowledge this in the manuscript (see section 2.3). However, our paper is novel because we adapt the max approach to FP8 and detail its application to quantize weights, activations and gradients for both FP8 training and inference. Moreover, we shed light upon scaling decisions taken in popular FP8 software implementations (e.g. Transformer Engine), which are not always clearly justified. These details to reproduce FP8 training and inference were missing from the literature; as one of the reviewers puts it, our paper “is a key piece missing from the large scale FP8 literature”.
Overall, we hope that by opening up our methodology and testing it in the experiments in Section 4, other FP8 researchers can build on top of it and answer some of the other questions asked by the reviewers, for instance extending the results to cover pre-training or providing speedup numbers on real hardware with FP8 support.
This work outlines how 8-bit floating point quantisation can be carried out and applied to LLMs. The procedure is leveraged to fine-tune LLMs efficiently. Conceptually, the contribution is timely and valuable to the community. However, the reviewers found that the novelty of this work was modest, as it relies on well-established approaches from previous studies. More importantly, while conceptually the methodology could be used to train LLMs from scratch, and while I understand only a few, mostly industrial, labs can afford to do this, it is unclear whether the outlined methodology would be beneficial or even work in that setting. To put it differently, it is unclear what the impact of starting from pre-trained weights would be, as opposed to learning them from scratch. Hence, the contributions and insights of this work are not sufficient to warrant acceptance.
Why not a higher score
While conceptually the contributions are interesting and valuable, the fact that this work did not evaluate training from scratch limits the conclusions that can be drawn from it. Contrary to what the authors suggest, I do not think this part should be left for future work. I understand many labs don't have the resources to train LLMs from scratch; perhaps there is an opportunity to collaborate with an industrial lab that can afford to do so.
Why not a lower score
N/A
Reject