IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models
This paper presents integral low-rank adaptation (IntLoRA), which adapts quantized diffusion models through integer multiplication or bit-shifting.
Abstract
Reviews and Discussion
This paper proposes the IntLoRA quantization method, which adapts quantized diffusion models with integer-type low-rank parameters, yielding inference efficiency in addition to training efficiency. The proposed IntLoRA enables the pre-trained weights to remain quantized during training, and the IntLoRA weights can be seamlessly merged into the pre-trained weights to obtain the quantized downstream weights without additional PTQ.
Questions to Authors
In Figure 1(a), why are the quantized merged weights in FP16?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
I have checked the proofs for the proposed IntLoRA quantization adaptation.
Experimental Design and Analysis
I have checked the soundness/validity of the quantitative comparison on different generation tasks and the comparison of training and inference efficiency.
Supplementary Material
Yes, I took a glance at the code.
Relationship to Existing Literature
The proposed method solves the problem of additional PTQ for the adaptation of quantized diffusion models.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths: This paper is well-written and easy to follow. The proposed IntLoRA is a novel method that solves the problem of additional PTQ for the adaptation of quantized diffusion models. Extensive experiments have been conducted to verify the effectiveness of the proposed method. Besides, the code is provided in the supplementary material.
Weaknesses: It would be better to validate the effectiveness of IntLoRA on diffusion models other than Stable Diffusion.
Other Comments or Suggestions
(1) It would be better to also compare the inference speed of the SHIFT and MUL variants of IntLoRA.
(2) Figures 4-6 should be termed qualitative comparisons rather than quantitative comparisons.
Results on More Diffusion Models.
Good comments! As suggested, we evaluate our IntLoRA on the FLUX.1-dev model. Since FLUX is notoriously costly to fine-tune even with LoRA, and our computational resources are limited, we only report results for FP16 vanilla LoRA and our IntLoRA on 15 text-subject pairs from Dreambooth. The results are shown below.
\begin{array}{l|ccccc} \hline \text{methods} & \text{nbits} & \text{DINO} & \text{CLIP-I} & \text{CLIP-T} & \text{LPIPS} \\ \hline \text{FLUX-LoRA(FP)} & W16A16 & 0.3564 & 0.6490 & 0.2501 & 0.8383 \\ \text{FLUX-Ours(MUL)} & W8A8 & 0.3150 & 0.6272 & 0.2348 & 0.8372 \\ \hline \end{array}
It can be seen that the performance drop of our IntLoRA compared to the original LoRA is acceptable, while inference speedup is achieved. This preliminary result shows the potential of the proposed method to generalize to other diffusion backbones.
Efficiency Comparison.
In Tab.3 of the paper, we give the forward time cost of MUL and SHIFT: 0.87s/img for MUL and 0.84s/img for SHIFT. One can see that SHIFT is somewhat faster than MUL. It is worth noting that the log2-quantized model can be accelerated further through hardware-level optimizations. Since this work focuses primarily on the algorithmic level, we leave further log2 acceleration for future work.
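For intuition on why the SHIFT variant can be cheaper at the hardware level, below is a minimal toy illustration of our own (not the paper's kernel) of replacing multiplication by a log2-quantized, power-of-two scale with an integer bit shift; real speedups of course depend on kernel and hardware support.

```python
import torch

# Toy example: multiplying an integer tensor by a power-of-two scale (2**k)
# can be realized as a left bit shift instead of an integer multiplication.
x = torch.randint(0, 128, (4,), dtype=torch.int32)
k = 3                                  # log2-quantized scale, i.e., scale = 2**k

mul_result = x * (1 << k)              # MUL-style: integer multiply
shift_result = x << k                  # SHIFT-style: bit shift

assert torch.equal(mul_result, shift_result)
```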
Typos.
We will fix the typos in the captions of Figures 4-6 in the revision. Thanks for your suggestion!
Response to Other Questions.
In the weight merging of previous methods, i.e., W' = W_q + AB, the pre-trained W_q consists of quantized INT8 weights, while the low-rank term AB is in FP16. This arithmetic inconsistency between FP16 and INT8 forces W_q to be upcast back to FP16 before it can be added to the LoRA weights. As a result, the merged weight is in FP16 and requires PTQ to accelerate inference.
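A minimal sketch of this dtype mismatch (illustrative shapes and variable names of ours, not the authors' or QLoRA's code); FP32 is used for CPU portability where FP16 would be used in practice:

```python
import torch

d, r = 768, 4
w_int8 = torch.randint(-128, 128, (d, d), dtype=torch.int8)   # quantized pre-trained weight
scale = torch.rand(d, 1) * 0.01                                # per-channel dequantization scale
A = torch.zeros(d, r)                                          # zero-initialized LoRA factor (FP16 in practice)
B = torch.randn(r, d) * 0.01

# INT8 and floating-point tensors cannot be summed directly: the backbone
# must be dequantized (upcast) first, so the merged weight is floating-point
# and needs an extra PTQ pass before integer-only inference.
w_merged = w_int8.float() * scale + A @ B
print(w_merged.dtype)   # torch.float32 here (FP16 on GPU in practice)
```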
The paper introduces IntLoRA, which uses integer-type LoRA weights to fine-tune quantized models directly, for both training and inference efficiency. To achieve this, the authors propose three novel techniques. First, the Adaptation-Quantization Separation (AQS) allows the coexistence of zero-initialized gradients and a quantization-friendly distribution. Then, the Variance Matching Control (VMC) mechanism fine-tunes the channel-aware variance, ensuring a favorable distribution shape for log2 quantization. Last, the Multiplicative Low-rank Adaptation (MLA) allows independent optimization of the pre-trained and adaptation terms, enlarging the parameter space of LoRA fine-tuning. Extensive experimental results indicate that IntLoRA achieves state-of-the-art results in both efficiency and performance.
Questions to Authors
- What is the effect of the rank of the low-rank parameters, and how is it determined?
- What is the reason for using Variance Matching Control (VMC) to adjust the distribution of the adaptation term?
- Can the proposed method be combined with other model compression methods, such as knowledge distillation?
Claims and Evidence
The claims of this paper are well supported by evidence, such as the distribution visualizations of the quantized weights and the quantitative ablation results for the method's design rationale.
Methods and Evaluation Criteria
• The proposed method makes sense. In the Method section, the motivation of each module is clearly stated before the technical details.
• The authors also propose a log2-quantized fine-tuning pipeline, which is hardware efficient.
• The evaluation metrics, such as FID and CLIP score, are commonly used in image generation tasks, and the authors also give sufficient qualitative results.
Theoretical Claims
The process of quantization is well-justified with Adaptation-Quantization Separation, the Variance Matching Control, and Multiplicative Low-rank Adaptation.
Experimental Design and Analysis
• This work validates the effectiveness of IntLoRA on various diffusion generation tasks, including Dreambooth fine-tuning, ControlNet fine-tuning, and style customization, all of which are widely adopted in previous diffusion fine-tuning works.
• The performance is noteworthy. IntLoRA achieves state-of-the-art FID and CLIP scores on Dreambooth fine-tuning, even with the more challenging integer-type low-rank weights.
• The experimental analysis is well supported by the empirical evidence.
Supplementary Material
The authors have attached supplementary material, which contains the code for reproduction.
Relationship to Existing Literature
Integer-type LoRA fine-tuning without additional PTQ is almost unexplored in previous work. IntLoRA can inspire future work towards a LoRA tuning pipeline that is efficient in both training and inference.
Missing Important References
The authors give a detailed comparison with a highly related work, i.e., EfficientDM, with both functional analysis and solid experiments on the diffusion network quantization tasks.
Other Strengths and Weaknesses
Strengths:
- The paper is well-motivated, and the sufficient experiments validate the effectiveness of the proposed method.
- The presentation is easy to follow and well-organized.
Weaknesses:
- It would be better to give more discussion of the different parts of the proposed method, such as the Adaptation-Quantization Separation (AQS).
- The relationship with other model compression methods is not clear.
Other Comments or Suggestions
None
More discussions of Adaptation Quantization Separation (AQS).
The main motivation of the proposed AQS is to address the effect of the zero-initialized LoRA weights on quantization. Specifically, the zero-initialized weight produces infinite results during quantization due to division by zero (please refer to Eq.1 in the original paper). To maintain the zero-initialized weights while achieving a quantization-friendly distribution, we propose AQS, which decomposes the LoRA weights into a zero-initialized part that requires gradients (corresponding to vanilla LoRA) and a non-zero part that does not require gradients (for ease of quantization). In this way, AQS facilitates both model learning and quantization.
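A schematic sketch of this separation as we read it (variable names and the toy quantizer are ours, not the released code): the trainable branch stays zero-initialized, while the term that actually gets quantized is shifted by a frozen, non-zero auxiliary matrix.

```python
import torch

def uniform_quant(x, n_bits=8):
    # toy symmetric uniform quantizer used only for illustration
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax          # would divide by zero if x were all zeros
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

d, r = 768, 4
A = torch.zeros(d, r, requires_grad=True)        # zero-initialized, trainable (vanilla LoRA behavior)
B = torch.randn(r, d, requires_grad=True) * 0.01
R = torch.randn(d, d)                            # non-zero auxiliary term, frozen

adaptation = A @ B                               # all zeros at initialization: not directly quantizable

# AQS-style separation: quantize the shifted, non-zero term instead of the
# all-zero adaptation. (In actual training, a straight-through estimator
# would be used so that gradients still reach A and B.)
quant_term = uniform_quant(adaptation + R)
```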
Correlation with Other Compression Methods, e.g., Knowledge Distillation.
Our approach focuses primarily on quantization to reduce the memory footprint when fine-tuning large models. However, since quantization and other compression methods are technically orthogonal, our approach can potentially be combined with methods such as distillation to achieve further computational cost reduction.
Specifically, we can first distill the large pre-trained model, used as a teacher, into a small student model. Afterwards, we can apply the proposed IntLoRA to the student model to obtain quantized downstream task weights, thus realizing further speedup.
Since knowledge distillation requires complete back-propagation of the model gradients, we cannot complete this process with our limited computational resources. However, given the promising performance of IntLoRA on existing small models (which can be viewed as distilled student models), combining IntLoRA with knowledge distillation is practical. We will give a more detailed discussion in the revision.
Response to Other Questions.
- We assume you refer to the rank in LoRA. Following common practice, we determine the LoRA rank through the ablation experiments in Fig.9 of the Appendix. Performance generally improves as the rank increases, but the rate of growth varies. To balance this tradeoff, we choose a fixed rank in all of our experiments.
- The motivation of VMC is to adjust the distribution shape of the adaptation term for effective log2 quantization. In detail, log2 quantization requires most values to be distributed near zero so that as many log buckets as possible are used, reducing quantization error. As shown in Fig.8 of the Appendix, the proposed VMC controls the distribution shape to achieve sharp peaks and light tails, thus utilizing more buckets on the logarithmic scale and reducing the quantization error, as sketched below.
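To make the log2-quantization mechanics concrete, here is a small self-contained sketch (our own simplified quantizer and an illustrative sigma target, not the authors' implementation) of channel-wise variance matching before power-of-two quantization:

```python
import torch

def log2_quant(x, n_bits=4, eps=1e-8):
    # simplified power-of-two quantization: magnitudes are rounded to 2**k
    # relative to the tensor maximum; the sign is kept separately
    m = x.abs().max() + eps
    exp = torch.round(torch.log2(x.abs() / m + eps)).clamp(-(2 ** n_bits - 1), 0)
    return torch.sign(x) * m * torch.pow(2.0, exp)

# VMC (schematic): rescale each channel so its standard deviation matches a
# common target, concentrating mass near zero (sharp peak, light tails),
# which is the shape log2 quantization prefers.
W = torch.randn(768, 768) * torch.logspace(-2, 0, 768).unsqueeze(1)  # channels with mismatched variances
sigma_target = 0.05                                                  # illustrative value, not the paper's
W_matched = W / W.std(dim=1, keepdim=True) * sigma_target

print(W.std(dim=1)[:3], W_matched.std(dim=1)[:3])  # per-channel std before / after matching
W_q = log2_quant(W_matched)
```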
Based on the authors' responses, my concerns have been addressed, and I can raise my scores based on these considerations:
- The significance of directly obtaining the quantized merged weight is noticeable. I have noticed that the authors also demonstrated in the rebuttal that low-rank FP matmul is even slower than INT8 dense matmul. This evidence makes the paper stronger.
- The technique is impressive. The auxiliary matrix in AQS to address the zero-initialization problem, as well as the reformulation of LoRA in MLA, is novel, and these designs are well supported by empirical evidence.
- I appreciate the inclusion of more experiments on the generalization of this method during the rebuttal, which provides a more comprehensive evaluation.
Given the growing popularity of combining different model compression methods for further acceleration, I recommend that the authors incorporate this discussion in the revision, as it could serve as a useful reference for future researchers.
Thank you very much for your positive feedback and raised rating score. We are delighted that our responses have addressed your concerns.
We also appreciate your recommendation to include more content on combining different acceleration methods. In the next version, we will provide a detailed discussion of how other compression methods relate to ours. Your valuable suggestions have significantly contributed to the improvement of our work, and we sincerely thank you once again!
This work proposes a LoRA-based method to fine-tune the quantized weights of diffusion models. It consists of Adaptation-Quantization Separation (AQS), which addresses the issue of zero-initialized weights in LoRA tuning of the quantized pre-trained model, and Variance Matching Control (VMC), which determines an appropriate R to balance quantization difficulty against information retention. After using AQS and VMC to preprocess the weights, fine-tuning is performed on the adaptation term, and the result is merged into the original pre-trained term by either int-multiplication or bit-shifting to yield the quantized merged weights. The proposed method was tested on a number of diffusion-based tasks along with a couple of LLM tasks.
Questions to Authors
- Was there any divide-by-zero issue in Eq. (3) in practical cases?
Claims and Evidence
The claims are somewhat confusing and may be misleading. This work focuses mainly on an adaptation method for pre-trained diffusion models. Only a couple of experiments with LLMs were presented, so it is hard to justify extending the scope of this work beyond diffusion models given the lack of experiments and comparisons on LLMs. However, the introduction and the related work sections are very confusing, since many of the cited prior works are either about LLMs adapted with LoRA variants or about quantized diffusion models without adaptation, with one exception (EfficientDM, He et al., 2023). This confusion persists through the conclusion, making the claims read as if this work had solved the problem for LLMs in general.
The proposed method was not fully investigated on LLMs in my view. However, the conclusion does not state that this work is limited to diffusion models, which may mislead readers into thinking it works for all cases, including LLMs. The contributions of this work should be clearly described along with all supporting evidence.
Methods and Evaluation Criteria
This work reports accuracy results and the quantization settings used. However, it does not report computation time and memory usage during fine-tuning and preprocessing, which seem important for fully understanding the capability of the proposed method. It is also important to see these results not only for the tuning stage, but also for the preprocessing stage and the final merging stage. It seems that the proposed method has a fairly heavy preprocessing stage with AQS and VMC.
Theoretical Claims
N/A
Experimental Design and Analysis
It is unclear why the selected experiments require quantization. Subject-driven generation and style-customized image generation do not seem to involve large datasets (25 subjects, 18 styles), so it would be great if this work justified performing these tasks with quantized models and reported clear, realistic evidence for doing so.
Supplementary Material
I did not review the supplementary material; the main paper was clear enough to assess this work.
Relationship to Existing Literature
This work was very confusing due to mixing references on LLMs with adaptation, diffusion models without adaptation, and so on. It would be great if this work focused on a single topic, justified and investigated it with full experiments, and drew concrete conclusions without over-claims. Alternatively, this work could focus on LLMs, like other related works on LoRA variants, to clearly show its advantage over prior work.
Missing Important References
- It seems that this work missed the most important related work by Han Guo et al., LQ-LoRA: Low-Rank plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, ICLR 2024.
While the present submission is not about LLMs, LQ-LoRA addressed the issue of zero-initialization in LoRA by proposing a matrix decomposition, which might be related to the method proposed here. Comparing with this work seems critical for me to properly assess this submission. There is also another work called GPTQ-LoRA by Chai et al.; see the above work for more information. Thus, it seems very important to properly discuss and compare with these works.
- While the fundamental basis of this work is in diffusion models and their adaptations, this work failed to properly cite and discuss these works. Without proper justification of quantized adaptation, it will be difficult to appreciate the current work. For example, subject-driven generation does not have to fine-tune the LoRA weights. See the following work:
Rinon Gal et al., An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion, ICLR 2023
This work achieved subject-driven generation by fine-tuning only a single token, which is much smaller than the LoRA weights. For cases like this, it will be very hard to justify the necessity of using quantized models for fine-tuning. Thus, this work should survey the literature on adapting diffusion models to downstream tasks, to see whether the proposed method is indeed useful in many cases.
Other Strengths and Weaknesses
- Does R introduce additional memory? If so, the goal of quantizing diffusion models may not be well satisfied due to the increased memory usage. R appears to be quite large, actually the same size as W, so the comparison with other methods in Table 1 and the other tables may not be fair. Table 1 only lists nbits, which cannot reflect this additional memory, and thus the proposed method may use more memory. Please add information about actual GPU memory usage and the computation time of preprocessing, fine-tuning, merging, and so on.
- Sigma_R in Eq. (6) seems important to determine. How can we guarantee that an optimal one exists? How sensitive is it to other tasks and other pre-trained baselines? These properties do not seem well investigated, since similar baselines were used (SD and SDXL). Introducing an additional parameter on top of the LoRA rank seems undesirable.
Other Comments or Suggestions
- It is unclear how important it is to solve the problem of fine-tuning quantized diffusion models, so it is important to justify this with concrete examples and applications in the introduction.
- It is unclear whether Figure 1 is accurate: A hat and B hat are INT4, but tuning with FP16 activations may require floating-point operations. Figure 1 should accurately reflect these facts if they are true.
- I strongly recommend revising this work to clearly differentiate among LLM works with adaptation, diffusion works without adaptation, and so on, and then to clearly indicate the contribution of this work.
- In Figure 3, is the Adaptation Term also INT4 or FP16? Is the fine-tuning then operating in FP16?
Why Introduce Many LLM-Related Works?
Thanks for your comment! Although there is a lot of work on diffusion quantization, very little work explores the adaptation of quantized diffusion models, which is also recognized by Reviewer wfKN. Therefore, we introduce related methods from quantized LLM adaptation to ensure a comprehensive baseline comparison.
Clarification of the Claims.
Consistent with the title, this work mainly focuses on diffusion models. The experiments on NLP tasks are preliminary explorations to answer potential concerns about the method's generality. As suggested, we will make the relationship between diffusion models and LLMs clearer to avoid any misleading statements or over-claims.
Efficiency of Different Stages.
In Tab.3, we give the training speed and memory usage during the fine-tuning stage. Our IntLoRA is similar to QLoRA in this respect, but we can directly obtain the quantized merged weights.
For the pre-processing and post-processing stage, we give the time cost as follows.
\begin{array}{l|ccc} \hline \text{stage} & \text{pre-processing} & \text{fine-tuning} & \text{post-processing} \\ \hline \text{time} & 28.8s & 128.2s & 0.5s \\ \hline \end{array}
The pre-processing accounts for 18% of the total time. However, this cost can be amortized across different tasks.
For weight merging, the latency is negligible.
Justification of Quantized Adaptation & Why Dreambooth Needs Quantization & Comparison with Textual Inversion.
Our quantized adaptation enjoys benefits in both tuning and inference. Specifically,
- The low-bit adaptation reduces GPU memory usage during tuning. Although Dreambooth involves few samples, fine-tuning large models is still demanding. For example, loading FP16 FLUX consumes over 40GB, let alone fine-tuning it with Textual Inversion or LoRA. In contrast, our 4-bit quantized IntLoRA reduces the loading memory to <15GB, facilitating tuning on consumer-level GPUs.
- The quantized downstream model boosts inference efficiency. As shown in the response to Reviewer 1Zzg, INT GEMM is even more efficient than FP low-rank matmul, allowing for fast inference. In comparison, Textual Inversion has the same latency as the pre-trained large model.
As shown in Tab.1, our IntLoRA can achieve acceleration with negligible performance loss. We will include more discussion and related works in the revision!
Comparison with LQ-LoRA and GPTQ-LoRA.
Both LQ-LoRA and our IntLoRA share the idea of avoiding zero-initialized low-rank weights, but they have several essential differences. 1) The formulation is different: LQ-LoRA uses a matrix decomposition to obtain non-zero LoRA weights, while our IntLoRA introduces the auxiliary matrix R for this purpose. 2) LQ-LoRA requires multiple iterations to search for the optimal approximation, while our IntLoRA does not. 3) We also compare with LQ-LoRA on Dreambooth below; one can see that our IntLoRA achieves better performance.
\begin{array}{l|ccccc} \hline \text{methods} & \text{nbits} & \text{DINO} & \text{CLIP-I} & \text{CLIP-T} & \text{LPIPS} \\ \hline \text{LQ-LoRA} & W8A8 & 0.4056 & 0.6624 & 0.2824 & 0.8126 \\ \text{Ours-MUL} & W8A8 & 0.4498 & 0.6882 & 0.2858 & 0.8062 \\ \text{LQ-LoRA} & W4A8 & 0.4022 & 0.6797 & 0.2680 & 0.8198 \\ \text{Ours-MUL} & W4A8 & 0.4242 & 0.6913 & 0.2710 & 0.8181 \\ \hline \end{array}
As for GPTQ-LoRA, it appends LoRA weights to GPTQ-quantized models, which requires additional PTQ for inference deployment. Although GPTQ-LoRA has not been formally published and is not open-source, we are happy to add a related discussion in the revision.
The Cost of R & Ablation on sigma_R.
- As stated in Line 215, R can be generated on-the-fly under the same random seed and can be deleted once used. The performance in all tables already includes the effect of R.
- Since R is first normalized and then rescaled by sigma_R, the ablation in Fig.7 reflects the role of sigma_R. One can see that the performance is robust over a range of values. We adopt a fixed sigma_R, i.e., the same value, in all experiments and the results are satisfactory.
Clarification on Fig.1 & Dtype of Adaptation-Term.
Similar to common QAT practice, the weights and activations (and the adaptation term) during training are simulated-quantized (fake-quantized) in FP16 dtype to allow accurate gradient backpropagation. However, inference is strictly in INT dtype (or the 2^x format for SHIFT). Fig.1 depicts the post-tuning weight-merging stage, so it is correct that both W and A are INT there. We will clarify this in the revision.
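A minimal sketch of what simulated (fake) quantization means here, in generic QAT style (our own straight-through helper, not the authors' training code): values live in floating point but pass through a quantize-dequantize pair in the forward pass, while gradients flow as if the rounding were identity.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in forward, straight-through estimator in backward."""

    @staticmethod
    def forward(ctx, x, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = x.abs().max() / qmax
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # gradient passes through the rounding unchanged

w = torch.randn(16, 16, requires_grad=True)
loss = FakeQuant.apply(w, 8).sum()
loss.backward()
print(w.grad is not None)          # training stays in floating point; INT kernels are used only at inference
```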
Solution to Divide-by-Zero.
Because the denominator is ensured to be an integer, we replace the zeros with ones and then zero-mask the division results at the corresponding slots.
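A small sketch of this replace-then-mask trick (illustrative tensors and variable names of ours):

```python
import torch

numer = torch.tensor([6, 0, 12, 0])
denom = torch.tensor([3, 0, 4, 0])                                    # integer denominator that may contain zeros

safe_denom = torch.where(denom == 0, torch.ones_like(denom), denom)   # replace 0 with 1
result = numer // safe_denom
result = torch.where(denom == 0, torch.zeros_like(result), result)    # zero-mask the affected slots
print(result)                                                         # tensor([2, 0, 3, 0])
```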
This paper proposes IntLoRA, which allows for seamless weight merging after efficient low-bit parameter-efficient fine-tuning (PEFT). The paper is motivated by the observation that existing low-bit PEFT (e.g., QLoRA) requires an additional round of PTQ due to a mismatch in precision between the pre-trained and (low-rank) adapter weights. Towards this goal, the authors propose various techniques such as task-agnostic auxiliary weights, variance matching control, and multiplicative/bit-shifting LoRA. Finally, the authors demonstrate the compute/memory efficiency of IntLoRA as well as the superior quality of images generated by diffusion models trained with IntLoRA against several popular baselines.
Questions to Authors
See "Claims And Evidence."
Claims and Evidence
Technical claims/evidence regarding IntLoRA look impressive to me. However, I am not entirely sure if PTQ is necessary in e.g., QLoRA. Given that adapter weights are generally much smaller than pre-trained weights, I believe the additional compute cost from the adapter forward-pass at inference time can be completely hidden by overlapping its computation with the forward-pass from pre-trained weights. In theory, this of course incurs more FLOPs and GPU memory usage from loading adapter weights, but this can be quite marginal compared to memory/compute costs from pre-trained weights in practice. If authors can demonstrate the significance/necessity of weight-merging in general, it would make the paper stronger.
Methods and Evaluation Criteria
The proposed techniques (auxiliary weights, multiplicative/bit-shifting LoRAs, and VMC) are all well-motivated. Furthermore, the authors evaluated the effectiveness of IntLoRA in terms of compute/memory efficiency and image generation quality against reasonably chosen baselines.
Theoretical Claims
N/A
Experimental Design and Analysis
Yes. They look good to me.
Supplementary Material
I appreciate that authors provided more qualitative analysis (i.e., image/text generation quality) in Appendix.
Relationship to Existing Literature
N/A
Missing Important References
I believe authors have cited major relevant work. However, I am not very familiar with quantization research, so I would count on other reviewers' opinions.
Other Strengths and Weaknesses
See "Claims And Evidence."
Other Comments or Suggestions
See "Claims And Evidence."
The significance of weight merging
Good comments! In fact, INT-type matrix multiplication is extremely efficient with the highly optimized GEMM kernels on modern GPUs, and we demonstrate below that an INT8 matmul can even be faster than an FP32 low-rank matrix multiplication.
Given a weight matrix W, an activation x, and the low-rank decomposition AB, we adopt the layer shapes from SD-1.5. We use an INT8 GEMM CUDA kernel to compute the quantized product Q(x)Q(W), where Q denotes the quantization operator, and use the common FP32 torch.matmul in PyTorch to compute the low-rank branch x(AB). The experiments were run on an NVIDIA A5000 GPU.
Although the full-rank matrix multiplication has 40x higher FLOPs than its low-rank counterpart, the INT matrix multiplication shows even faster speed and a smaller GPU memory footprint, thanks to the quantized INT8 type and integer-optimized GEMM kernels.
From the above analysis, the low-rank FP32 matmul is not negligible compared with the INT8 matmul of the pre-trained weights. Therefore, it makes sense to apply the proposed IntLoRA to directly obtain the quantized merged weights.
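For readers who want to reproduce this kind of comparison, below is a rough timing sketch under our own assumptions: illustrative 4096-dimensional shapes rather than the exact SD-1.5 shapes, and PyTorch's private torch._int_mm (available in recent CUDA builds) as the INT8 GEMM entry point; the rebuttal's exact kernel and numbers may differ.

```python
import time
import torch

assert torch.cuda.is_available()
d, r, runs = 4096, 16, 100                       # illustrative shapes, not the SD-1.5 ones

w_int8 = torch.randint(-128, 128, (d, d), dtype=torch.int8, device="cuda")
x_int8 = torch.randint(-128, 128, (d, d), dtype=torch.int8, device="cuda")
x_fp32 = torch.randn(d, d, device="cuda")
A = torch.randn(d, r, device="cuda")             # FP32 low-rank factors
B = torch.randn(r, d, device="cuda")

def bench(fn):
    fn(); torch.cuda.synchronize()               # warm-up
    start = time.time()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / runs

t_int8 = bench(lambda: torch._int_mm(x_int8, w_int8))   # full-rank INT8 GEMM
t_lora = bench(lambda: (x_fp32 @ A) @ B)                # low-rank FP32 matmul
print(f"INT8 full-rank: {t_int8 * 1e3:.3f} ms | FP32 low-rank: {t_lora * 1e3:.3f} ms")
```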
This paper introduces IntLoRA, a LoRA-based method for efficiently fine-tuning quantized models using integer-type LoRA weights. It proposes three key techniques: Adaptation-Quantization Separation (AQS) to support zero-initialized gradients within quantization-friendly distributions, Variance Matching Control (VMC) to align channel-wise variance for log2 quantization, and Multiplicative Low-rank Adaptation (MLA) for flexible parameter optimization. Fine-tuning is conducted on the adaptation term and merged via integer operations. Experiments on both diffusion and LLM tasks demonstrate strong performance and efficiency.
Three out of four reviewers recognized the soundness of the proposed techniques and the strength of the experimental results. One reviewer raised concerns about the paper’s clarity and focus, noting that although the work claims to target diffusion models, it includes substantial content and experiments on LLMs. Additionally, the reviewer pointed out a lack of discussion and comparison with important LLM baselines such as GPTQ-LoRA and LQ-LoRA. The authors have acknowledged these points and committed to addressing them in their revision.
Given the overall contributions and the authors’ willingness to improve the manuscript, we believe the strengths of this submission outweigh its limitations. We therefore recommend acceptance, and encourage the authors to revise the paper in accordance with the reviewers’ feedback and their rebuttal commitments.