PaperHub
Overall rating: 5.0/10 (withdrawn)
4 reviewers; ratings 3, 6, 5, 6 (min 3, max 6, std. dev. 1.2)
Confidence: 3.3 | Correctness: 2.5 | Contribution: 2.0 | Presentation: 3.3
ICLR 2025

Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2024-12-17

Abstract

Keywords
Large Language Models; Memory Efficient Training; Low Rank

Reviews and Discussion

Official Review
Rating: 3

The paper proposes a method to accelerate GaLore, which is time-consuming and memory-intensive. It cleverly exploits observed training behaviors to reduce the time and memory required by GaLore. The structure of the paper is clear, and the experiments are thorough.

Strengths

  1. The paper proposes an acceleration method to address the time and memory requirements of the GaLore process. It cleverly observes an early stopping phenomenon in some layers during GaLore training, which reduces the number of SVD decompositions needed. Additionally, the paper employs quantization to lower memory usage.
  2. The structure of the paper is clear, making it easy to understand.
  3. The experiments are comprehensive.

Weaknesses

  1. The paper emphasizes reducing memory usage during the GaLore process. However, the main source of memory consumption in GaLore seems to be the need for SVD on large matrices, which currently only supports 32-bit precision. Without reducing the dimensionality of the decomposition matrices, the proposed method does not seem to genuinely reduce memory requirements.
  2. The proposed adaptive lazy update is interesting, but it lacks further explanation. For instance, why does this phenomenon occur, and what is its relationship with different depths and types of layers?

Questions

  1. I do not fully understand how the memory requirement is reduced. The memory bottleneck of GaLore seems to be the 32-bit decomposition of large matrices. Could you provide further clarification? If clarified well, I will improve my score.
  2. Could you provide a more in-depth explanation of the adaptive lazy update, such as the reasons behind it or its behavior across different layers and matrices? Does this phenomenon occur with different training datasets? Is the lazy matrix the same for the same model under different training datasets?
  3. Could the authors provide a more intuitive comparison of the time required to demonstrate the effectiveness of the acceleration?
Comment

Thank you for the insightful questions and constructive suggestions, which have been immensely helpful in improving the quality of our work. Below, we provide detailed responses:

[Q1: Clarification of memory reduction] Thanks for the question. We’d like to clarify that the memory overhead of performing SVD on large matrices is minimal compared to the memory required for model weights and optimization states. This is because SVD is conducted in a layerwise manner. For example, consider one “mlp.up” layer of LLaMA-2-7B, which has a shape of 4096×11008. In 32-bit precision, this layer consumes approximately 0.17 GB of memory. During the SVD operation, the resulting matrices U, S, V have shapes 4096×4096, 4096, and 11008×11008, respectively, consuming a total of approximately 0.51 GB of memory. In comparison, the entire model weights require around 14 GB of memory. After completing the SVD for each layer, only the projection matrix is retained, and any intermediate memory usage is released. This approach ensures that the memory cost of the SVD operation is negligible. The memory savings in Q-GaLore come from reductions in weight and optimization states compared to GaLore, as illustrated in Figure 5.
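For readers who want to verify this arithmetic, here is a minimal back-of-the-envelope sketch (ours, not the authors' code); it assumes 32-bit (4-byte) floats and a full SVD of the 4096×11008 layer.

```python
# Back-of-the-envelope check of the per-layer SVD memory figures quoted above.
# Assumes fp32 (4 bytes) and a full SVD of one 4096 x 11008 "mlp.up" weight.
m, n = 4096, 11008
BYTES_FP32 = 4
GiB = 1024 ** 3

weight_gb = m * n * BYTES_FP32 / GiB        # ~0.17 GB: the layer weight itself
u_gb = m * m * BYTES_FP32 / GiB             # U: 4096 x 4096
s_gb = min(m, n) * BYTES_FP32 / GiB         # S: 4096 singular values
v_gb = n * n * BYTES_FP32 / GiB             # V: 11008 x 11008
print(f"weight: {weight_gb:.2f} GB, SVD outputs: {u_gb + s_gb + v_gb:.2f} GB")
# -> weight: 0.17 GB, SVD outputs: 0.51 GB (released once the projection is kept)
```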

[Q2: Explanation about lazy SVD update] Great suggestion. We’ve added more results in Section D of the updated draft. As illustrated in Figure 8, the gradient subspace shows the most drastic changes in the q and k projections (which have the smallest cosine similarity). Intuitively, these projections are responsible for generating the attention patterns and are more influenced by the token relationships within each training sample. Consequently, their gradient information is more diverse, leading to a more diverse gradient subspace. Comparing different layers, the middle layers tend to exhibit a more consistent gradient subspace than the initial and final layers. This behavior may be related to the oversmoothing issue in transformers [1-2], where the middle layers are not well-optimized, resulting in oversmoothed token representations.

[Q3: Comparison of the training time] Thanks for the question. We measure the latency of training a 1B model on two A6000 GPUs, using the largest possible batch size for each method. GaLore can use a batch size of 32 without OOM, while Q-GaLore can use a batch size of 64. In this setting, the average throughput for Q-GaLore is 36.87 samples/s, higher than GaLore’s 35.56 samples/s. Additionally, we’d like to clarify the benefits of memory-efficient training across three key scenarios: (i) Single-GPU training: reducing memory requirements for 7B models makes training more accessible on consumer-grade GPUs. For example, an RTX 4060 Ti with 16GB memory costs approximately $499, while an RTX 4090 with 24GB memory costs around $1599. (ii) Distributed training with data parallelism: with the same number of GPUs, reducing memory overhead allows training with larger batch sizes, which can reduce communication costs between GPUs and improve overall efficiency. (iii) Model parallelism: for very large models that exceed the memory capacity of a single GPU even for one batch, model parallelism becomes necessary. By lowering memory overhead, it is possible to avoid the complexity of model parallelism and scale up training to larger models more effectively.

[1] PanGu-π: Enhancing Language Model Architectures via Nonlinearity Compensation.

[2] FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping.

Comment

Dear Reviewer Ajpa,

Thank you for taking the time to review our work. We have provided detailed, point-by-point responses to address your concerns. As the discussion period deadline approaches, could you kindly review our responses and let us know if you have any further questions?

Thank you again for your thoughtful feedback!

Best regards,

Authors

Comment

Dear Reviewer Ajpa,

Thank you again for your time and efforts in reviewing our work. As the discussion period deadline is approaching, could you kindly review our responses and let us know if you have any further questions?

Thank you!

Sincerely,

The Authors

Comment

Dear Reviewer Ajpa,

Thank you again for your detailed suggestions and comments. As the discussion period deadline is approaching, could you kindly review our responses and let us know if you have any further questions?

Thank you!

Sincerely,

The Authors

Official Review
Rating: 6

This work proposes Q-GaLore, a memory-efficient training method based on GaLore. Q-GaLore further reduces the memory footprint by quantizing the model weights to 8-bit and the low-rank gradient terms to 4-bit. Q-GaLore also adaptively reduces the number of SVD operations on gradients throughout the training process. This work also performs experiments decomposing the memory footprint, which verify the efficacy of Q-GaLore.

Strengths

Q-GaLore enhances GaLore with quantization.

  • Q-GaLore halves the memory footprint of weights by quantizing weights to 8-bit.
  • Q-GaLore adaptively reduces the frequency of the computation-intensive SVD operation on gradients (projection matrices), based on the enlightening observations that there are layers with diminishing gradients.
  • The memory footprint breakdown clarifies the improvement of Q-GaLore.

Weaknesses

  • Limited novelty with increased complexity. This work mainly applies quantization to GaLore with adaptively reduced SVD operations on projection matrices. However, it introduces more hyper-parameters to tune, such as the threshold for determining update intervals.
  • A lack of end-to-end latency/time evaluation. Q-GaLore introduces extra operations like the dequantization and the calculation of cosine similarity of projection matrices. The extra cost of these operations, in terms of latency and computation, remains unclear compared to the baselines. If possible, please consider a breakdown of time spent on different operations including the new ones introduced by Q-GaLore.
  • Insufficient and vague evaluation on fine-tuning/training experiments.
    • No description of hyper-parameters like learning rates, training epochs, and batch sizes.
    • If the trained model performance heavily depends on the threshold of cosine similarity, a discussion on the threshold selection is necessary. If not, relevant experiments and a recommended value can be provided.

Questions

  1. In line 201, how is the "cosine similarity across matrices" defined? Why not use other ways of measuring distances between matrices like Frobenius norm? Usually the cosine similarity measures the similarity between two vectors of an inner product space. Are the matrices flattened? If so, along which dimension?
  2. In line 315, the statement "no baseline involves quantization and all data are maintained in BF16 format" is confusing, since in the QLoRA baseline weights are quantized. Please clarify this statement or correct it if it's inaccurate.
  3. In Tab 1, for the 60M model, why does LoRA (ReLoRA) use the same amount of memory (0.36GB) as full fine-tuning? Similarly, for the 130M model, LoRA takes up more memory (0.80GB) than full fine-tuning (0.76GB).
  4. In Tab 1, if QLoRA is added as a baseline (QLoRA was included in Tab 2 as a baseline), how much memory is consumed and what is the perplexity? As Q-GaLore adopts quantization, a comparison with QLoRA would be reasonable for evaluating memory efficiency.
  5. In Tab 1, 2, and 3, how are the hyper-parameters determined? Different methods may have different optimal hyper-parameters, such as the learning rate.
  6. As stated in the weaknesses: does the trained model performance heavily depend on the threshold for determining update intervals of projection matrices? If so, how do users determine the value for a given model and task? If not, what is the recommended value?
  7. In line 472, the time cost is reduced by over 32 hours. What is the total time of training in this case?
Comment

Thank you for clarification. I've adjusted my rating, and here are my follow-up questions.

  • I am still a bit concerned with the training perplexity - memory & runtime tradeoff. It will be clearer if Tab.1 includes runtime results like training time or throughput like #batches/sec.
  • Is the cosine similarity threshold model-dependent? Q-Galore will be a practical method if the recommended value of 0.4 works for other model architectures and sizes.
Comment

Thanks for the questions. The throughput results across different model sizes are provided in Table R1, where we use the largest possible batch size for each method on two A6000 GPUs without causing out-of-memory (OOM) errors. The results show that as model size increases, Q-GaLore consistently achieves higher throughput than GaLore. This improvement is due to the memory reduction in Q-GaLore, which allows for larger batch sizes and enhances overall throughput for training, which is typically memory-bound.

Additionally, we found the cosine similarity threshold of 0.4 to be highly generalizable across various model architectures and sizes. In Tables 1, 2, and 3, we use the same threshold of 0.4 and demonstrate its effectiveness across model sizes ranging from 60M to 8B and architectures such as LLaMA, Mistral, RoBERTa, and Gemma.

Table R1. Throughput (# samples / second) comparison across different model sizes.

Models     | Method   | Batch Size | Throughput
LLaMA-60M  | GaLore   | 256        | 1023.50
LLaMA-60M  | Q-GaLore | 256        | 951.91
LLaMA-130M | GaLore   | 256        | 514.75
LLaMA-130M | Q-GaLore | 256        | 480.18
LLaMA-350M | GaLore   | 64         | 156.50
LLaMA-350M | Q-GaLore | 128        | 157.80
LLaMA-1B   | GaLore   | 32         | 35.56
LLaMA-1B   | Q-GaLore | 64         | 36.87
LLaMA-3B   | GaLore   | 16         | 14.68
LLaMA-3B   | Q-GaLore | 32         | 15.23
Comment

Thank you for the additional experiments.

Compared to GaLore, the doubled batch size does not effectively benefit the training throughput. Is this because of the extra computation introduced by Q-Galore?

However, considering the saved global memory, the generalizable value of the cosine threshold, and the comparable fine-tuned accuracy/perplexity, I would like to raise my score to 6.

Comment

Thank you for your responses. The extra quantization/dequantization overhead in Q-GaLore could reduce throughput, while we also believe that the significant global memory savings offer substantial practical benefits, particularly for training on memory-limited devices or improving efficiency in model parallelism.

Additionally, thank you for all your constructive feedback and detailed suggestions, which have been incredibly helpful in enhancing the quality of our work. Wishing you a wonderful week!

Comment

We sincerely appreciate Reviewer CuTC for the detailed and constructive suggestions. Below, we provide pointwise responses.

[Q1: Limited novelty with increased complexity.] We respectfully disagree that our novelty is modest: Firstly, training LLMs with weight quantization and low-rank gradients is non-trivial. The main challenge is the substantial reduction in gradient information. This introduces additional complexity in identifying optimal low-rank subspaces. For example, directly quantizing GaLore’s weights to 8-bit results in a significant performance degradation, with perplexity increasing by as much as 7.86 when pre-training a LLaMA-60M model. To address this, we investigated stochastic rounding and for the first time demonstrated the possibility of training large-scale quantized LLMs with low-rank gradients.

Second, our investigation into the adaptive convergence properties of gradient subspaces across layers revealed novel insights. These findings not only provide an effective solution for reducing the computational overhead of expensive SVD operations but also deepen our understanding of transformer training dynamics. Specifically, we observed: (i) some layers’ gradient subspaces converge early in training; (ii) others exhibit stable subspaces within specific windows; (iii) certain layers only converge later in the training process. These novel observations connect to prior research on the diverse low-rank properties of weight spaces [1,2]. For instance, a stable gradient subspace over an extended training window may cause weights to settle into the same low-rank space, offering valuable insights for future studies on transformer training dynamics.

Additionally, we conducted an ablation study on the threshold for update intervals, as presented in Figure 7. The x-axis in the figure represents the relative SVD count. In this study, we performed a grid search on the cosine similarity threshold within the range [0, 1], reporting the corresponding SVD counts and perplexity values. The results indicate that a threshold of 0.4 (marked as the red star in Figure 7) is a turning point. Below this threshold, performance improves significantly as the threshold decreases. This experiment demonstrates that the extra hyperparameter, the cosine similarity threshold, can be set to 0.4 without performance degradation.

[1] JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention.

[2] The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction.

[Q2: Latency breakdown of different operations.] Thanks for the question. We compare the end-to-end latency on two A6000 GPUs using distributed training. One key benefit of memory-efficient training is that by reducing memory requirements, we can use larger batch sizes during training, which enhances throughput since training is generally memory-bound.

In our experiments, we trained a 1B model using the largest possible batch sizes without causing an out-of-memory (OOM) error. GaLore supports a batch size of 32, while Q-GaLore enables a batch size of 64. The average throughput for Q-GaLore is 36.87 samples/s, higher than GaLore’s 35.56 samples/s. Under the same batch size, the additional computation required for quantization and dequantization in Q-GaLore results in a throughput of 26.60 samples/s. These additional computation costs, however, are compensated for by the ability to use larger batch sizes.

For cosine similarity comparisons, these operations are executed only during projection updates, which occur every 200 iterations. The latency of this operation is negligible compared to the SVD computation; it is smaller than the time fluctuation observed across different projection update stages.

Additionally, by lowering memory overhead, Q-GaLore makes training more accessible. For example, an RTX 4060 Ti with 16GB memory (priced at approximately $499) becomes sufficient, compared to an RTX 4090 with 24GB memory, which costs around $1599. Furthermore, for extremely large models that cannot fit a single batch on a single GPU, model parallelism is typically required. By reducing memory requirements, Q-GaLore can potentially avoid the complexity of model parallelism, enabling more efficient scaling for training larger models.

[Q3: Experiment details and threshold for cosine similarity.] Thanks for pointing it out. We’ve provided all the training and fine-tuning hyperparameters in Section C of the appendix in the updated draft. Additionally, we’ve conducted an ablation study on the threshold for cosine similarity. The results are presented in Figure 7, from which a threshold of 0.4 (marked as the red star in Figure 7) is a turning point. Below this threshold, performance improves significantly as the threshold decreases. Thus, we recommend using 0.4 as the threshold for cosine similarity.

Comment

[Q4: Definition of cosine similarity.] Thanks for the question. We use the first row of the projection matrix to measure the cosine similarity. The first row corresponds to the largest singular value in the gradient subspace, allowing us to focus on the most significant component without incurring additional memory overhead from saving an extra projection matrix. Additionally, cosine similarity effectively captures the directional differences between projection matrices, making it a more suitable method for identifying variations in the gradient subspace.
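Purely as an illustration of the check described above (this is our hypothetical sketch, not the authors' implementation; the function name, tensor layout, and exact decision rule are our assumptions), the comparison of consecutive projection matrices might look like this:

```python
import torch
import torch.nn.functional as F

def projection_has_converged(prev_proj: torch.Tensor,
                             new_proj: torch.Tensor,
                             threshold: float = 0.4) -> bool:
    """Hypothetical sketch of the lazy SVD update check: compare the first rows
    of consecutive projection matrices (the direction associated with the
    largest singular value) via cosine similarity. A high similarity suggests
    the gradient subspace for this layer is stable, so its SVD frequency can
    be reduced."""
    sim = F.cosine_similarity(prev_proj[0], new_proj[0], dim=0)
    return sim.item() >= threshold
```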

[Q5: Data format of baselines.] Thanks for pointing it out. It was a typo, and we have corrected it in the updated draft to clarify: “Note that all baseline methods, except QLoRA, are maintained in 16-bit precision, while the base models in QLoRA are kept in 8-bit precision for a fair comparison.”

[Q6: Memory overhead in Table 1.] ReLoRA is a stage-wise version of LoRA that periodically merges low-rank adapters into the original model and reinitializes a new adapter. As a result, its memory overhead is equivalent to that of LoRA. Note that we did not account for the memory overhead of its warm-start stage, as it only occupies a short period of the training process. Additionally, the memory cost of LoRA is mn + 3mr + 3nr, while the memory cost of the Full baseline is 3mn. When the rank r is relatively large, the memory cost of LoRA can exceed that of Full, i.e., when 3r(m+n) > 2mn.
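A quick numeric sanity check of this inequality (with hypothetical dimensions of our choosing, not figures from the paper):

```python
# LoRA memory = mn + 3r(m + n); Full memory = 3mn (as counted in the reply above).
# LoRA exceeds Full exactly when 3r(m + n) > 2mn.
def lora_mem(m: int, n: int, r: int) -> int:
    return m * n + 3 * r * (m + n)

def full_mem(m: int, n: int) -> int:
    return 3 * m * n

m, n, r = 768, 768, 512                     # hypothetical layer with a relatively large rank
print(lora_mem(m, n, r), full_mem(m, n))    # 2949120 vs 1769472
print(3 * r * (m + n) > 2 * m * n)          # True -> LoRA uses more memory than Full here
```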

[Q7: QLoRA for pretraining.] Thanks for the question. QLoRA is primarily designed as an effective method for fine-tuning, but it is not well-suited for pre-training. During pre-training, QLoRA fixes the randomly initialized base model in low-bit precision, which introduces significant noise into both the forward and backward processes. Additionally, as a quantized version of LoRA, QLoRA demonstrates performance similar to LoRA, and LoRA clearly underperforms GaLore, Q-GaLore, and the full baseline. For these reasons, we did not include QLoRA in Table 1, but we do compare Q-GaLore with QLoRA in the fine-tuning scenarios (Table 2).

[Q8: Hyperparameters in Table 1,2,3.] For experiments in Table 1 and 3, we adopt the same settings from GaLore for fair comparison. For Table 2, we sweep the learning rate from 1e-4 to 5e-6 and report the best performance for each method.

[Q9: Training time for 7B models.] Thanks for the question. Training a 7B model on a single A6000 GPU requires over 5876 hours. However, the SVD operation is challenging to parallelize across multiple GPUs, whereas training can be efficiently handled using data parallelism. Thus, the time cost of SVD is nearly constant across different numbers of GPUs. For a more practical scenario, training a 7B model on 64 A100 GPUs requires around 110 hours, reducing the SVD cost by over 25 hours.

Official Review
Rating: 5

The paper proposes Q-GaLore to reduce the memory footprint when training LLMs. Q-GaLore is built upon GaLore and further quantizes the weights and projection matrices into lower precision to improve memory efficiency. Q-GaLore adaptively updates the subspace to reduce the overhead of SVD decomposition. This approach achieves better training performance than other low-rank training methods and significantly reduces the memory requirement for pre-training LLaMA-1B and fine-tuning 7B-scale models.

Strengths

  1. Q-GaLore achieves substantial memory reduction compared to the original Galore paper by applying 8-bit quantization to model weights and 4-bit quantization to the projection matrix.
  2. Q-GaLore reduces training time associated with Singular Value Decomposition (SVD) by decreasing the frequency of SVD updates for layers where the low-rank subspace remains relatively stable over time, which is well-motivated.

Weaknesses

  1. "pre-training a LLaMA 7B with single batch size" is very strange in Section 1. Discussing the motivation in the fine-tuning scenario is a better choice.
  2. The modifications to the GaLore framework are relatively incremental (for the quantization part).
  3. How the quantization of weight and projection matrices contribute to the degradation of the models? Which factor contributes more to the performance degradation? For example, the comparison of (FP16 weight, FP16 projection) - (FP16 weight, INT8 projection) - (FP16 weight, INT4 Projection) - (INT8 weight, FP16 projection) - (INT8 weight, INT8 Projection) - (INT8 weight, INT4 Projection) will help to understand how each component and the choice of quantization bit-width affect the final performance.

Questions

See the Weaknesses section.

Comment

We thank Reviewer 1Bwe for acknowledging that our method “achieves substantial memory reduction” and that the lazy SVD update strategy is “well-motivated”. To further address Reviewer 1Bwe’s concerns, we provide point-wise responses below:

[Q1: Motivation in the fine-tuning scenario] Thanks for the suggestion. We’ve strengthened the motivation in Section 1 of the updated draft, as follows: “Even without factoring in any considerations for product efficiency, fine-tuning a LLaMA 7B model with 16-bit precision necessitates at least 56 GB of memory for maintaining the model weights, Adam optimizer states, and weight gradients, which is prohibitively expensive.”
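As a back-of-the-envelope check of the 56 GB figure (our sketch, not from the paper; it assumes a nominal 7B parameters, 16-bit storage for weights, gradients, and both Adam moments, and decimal gigabytes):

```python
# Rough memory accounting for 16-bit fine-tuning of a nominal 7B-parameter model.
params = 7e9
BYTES_16BIT = 2
GB = 1e9                                      # decimal gigabytes

weights = params * BYTES_16BIT / GB           # 14 GB
grads = params * BYTES_16BIT / GB             # 14 GB
adam_states = 2 * params * BYTES_16BIT / GB   # 28 GB (first and second moments)
print(weights + grads + adam_states)          # 56.0 GB, before activations and buffers
```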

[Q2: Incremental modification about quantization section] Thanks. We respectfully disagree that the quantization aspect in Q-GaLore is incremental compared to GaLore. In the original GaLore, quantization was limited to the use of an 8-bit Adam optimizer, which is a plug-and-play quantized optimizer from [1]. In contrast, Q-GaLore incorporates quantization from two distinct perspectives:

Weight Quantization: different from post-training quantization, a primary challenge of weight quantization during training lies in the significant reduction of gradient information. During each iteration, small gradient updates are often lost due to the rounding operation in quantization, leading to ineffective training. We address this with stochastic rounding, which effectively preserves gradient information, and we demonstrate the efficacy of this approach across a wide range of experimental scales.
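For intuition, a generic stochastic-rounding sketch is shown below (our illustration; the authors' actual quantization kernel may differ):

```python
import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    """Round up with probability equal to the fractional part, down otherwise,
    so small updates survive quantization in expectation rather than being
    deterministically rounded away."""
    floor = torch.floor(x)
    frac = x - floor
    return floor + (torch.rand_like(x) < frac).to(x.dtype)

# Toy example: an update of +0.3 on an integer grid is retained ~30% of the time.
w = torch.zeros(100_000)
print(stochastic_round(w + 0.3).mean())   # ~0.3 in expectation
```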

Projection Quantization: We show that the projection matrix is particularly friendly to quantization, offering a dual benefit. First, it provides an effective solution for further reducing memory costs. Second, it reveals insights into transformer training dynamics. Specifically, we observe that the gradient subspaces of certain layers exhibit an “early-bird” property, converging quickly during the initial stages of training. This empirical analysis provides valuable insights into the training dynamics of transformers.

Given these, we believe the quantization components in Q-GaLore represent a significant advancement rather than incremental improvements.

[Q3: Ablation study of the quantization of weight and projection matrices]: Great question. We conducted additional experiments to validate the individual effect of weight and projection quantization. As shown in Table R1, the final performance is slightly more sensitive to the weight quantization than projection matrix quantization.

Table R1: Comparison of different precisions for weight and projection quantization. Results are reported in perplexity; “W” denotes the weight precision and “P” the projection-matrix precision.

Methods | LLaMA-60M | LLaMA-130M | LLaMA-350M
W16P16  | 34.88     | 25.36      | 18.95
W8P16   | 34.91     | 25.66      | 19.41
W16P8   | 34.88     | 25.28      | 19.29
W8P8    | 34.85     | 25.66      | 19.61
W16P4   | 34.52     | 25.23      | 19.22
W8P4    | 34.88     | 25.53      | 19.79

[1] 8-bit Optimizers via Block-wise Quantization

Comment

Thanks for your clarification.

Comment

Thank you for taking the time to review our rebuttal. We would like to know if our responses have addressed your concerns or if you have any follow-up questions. Thank you!

Official Review
Rating: 6

This paper improves the memory and compute efficiency of GaLore with a combination of aggressive quantization and lazy layerwise subspace exploration techniques. Authors motivate these techniques by thoughtfully designed empirical analyses (e.g., layerwise gradient subspace analysis), and validate its effectiveness through diverse experiments. Notably, Q-GaLore reduces memory consumption by up to 50% and saves over 32 hours for training a 7B model compared to the baselines like LoRA and GaLore.

Strengths

This paper is tackling the important problem of democratizing large-scale ML training given limited compute resources. To achieve this goal, the authors have performed insightful analyses, upon which they carefully develop quantization and layerwise adaptive SVD computations. Finally, the authors have demonstrated the effectiveness of Q-GaLore in extensive empirical experiments, encompassing language model pretraining and fine-tuning.

Weaknesses

In Table 1, it seems that the performance gap between GaLore and Q-GaLore tends to increase as the model size increases. While it still outperforms LoRA, I personally believe the 0.5 perplexity difference is not negligible. In addition, some hyperparameters, like the 0.4 cosine similarity threshold, look somewhat arbitrary, and it's unclear how important these hyperparameters are to the final performance.

Questions

See above.

Comment

We sincerely thank Reviewer PunU for the helpful comments and positive initial feedback on our work. Below, we provide detailed responses.

[Q1: Performance gap] Thanks for pointing it out. In Table 1, the perplexity difference for the 1B model is slightly smaller than that of the 350M model but slightly larger than the 130M model, indicating no consistent trend of an increasing gap as the model size scales up. Furthermore, compared to the baselines of Low-Rank, LoRA, and ReLoRA, the perplexity gap with GaLore is significantly larger than that of Q-GaLore. For instance, in the case of the Llama-1B model, the perplexity differences are 126.89, 3.57, and 2.69, respectively, which are considerably higher compared to the 0.61 of Q-GaLore. This suggests that Q-GaLore maintains comparable pre-training performance, as mentioned in Line 328.

[Q2: Threshold for cosine similarity] Thanks for pointing it out. We’ve conducted an ablation study on the threshold for cosine similarity, as shown in Figure 7, where the x-axis represents the relative SVD count. In this study, we performed a grid search for the cosine similarity threshold within the range [0, 1] and reported the corresponding SVD counts along with the perplexity. The results indicate that setting the threshold to 0.4 (marked as the red star in Figure 7) serves as a turning point, below which performance begins to increase significantly. We have included the details of this experiment in the updated draft to avoid potential misunderstanding.

Comment

Thanks for your response. I still evaluate this work positively, but I do not think I can further increase the score.

[Q1. Performance Gap] I believe there are two issues here.

  1. Noting that the performance degradation starts at 350M, I believe it's possible that the effectiveness of Q-GaLore decreases with larger models.

  2. While the author noted that the absolute perplexity difference decreased at 1B (compared to 350M), I want to point out that the loss scale generally decreases as the model size increases following the scaling laws. I am not sure about the best way to measure the performance degradation across different model scales, but my concern 1 still remains valid.

[Q2. Threshold for Cosine Similarity] Thanks for providing additional details!

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.