PaperHub
Overall rating: 6.0 / 10 (Poster; 3 reviewers, min 6, max 6, std 0.0)
Individual ratings: 6, 6, 6
Confidence: 3.7 | Soundness: 3.0 | Contribution: 2.3 | Presentation: 2.7
NeurIPS 2024

DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2025-01-17
TL;DR

DropBP, randomly dropping backward propagation based on layer sensitivity, significantly accelerates fine-tuning in Large Language Models (LLMs) with considerable memory reduction.

Abstract

Large language models (LLMs) have achieved significant success across various domains. However, training these LLMs typically involves substantial memory and computational costs during both forward and backward propagation. While parameter-efficient fine-tuning (PEFT) considerably reduces the training memory associated with parameters, it does not address the significant computational costs and activation memory. In this paper, we propose Dropping Backward Propagation (DropBP), a novel approach designed to reduce computational costs and activation memory while maintaining accuracy. DropBP randomly drops layers during backward propagation, which is essentially equivalent to training shallow submodules generated by undropped layers and residual connections. Additionally, DropBP calculates the sensitivity of each layer to assign an appropriate drop rate, thereby stabilizing the training process. DropBP is not only applicable to full fine-tuning but can also be orthogonally integrated with all types of PEFT by dropping layers during backward propagation. Specifically, DropBP can reduce training time by 44% with comparable accuracy to the baseline, accelerate convergence to the same perplexity by 1.5$\times$, and enable training with a sequence length 6.2$\times$ larger on a single NVIDIA A100 GPU. Furthermore, our DropBP enabled a throughput increase of 79% on an NVIDIA A100 GPU and 117% on an Intel Gaudi2 HPU. The code is available at [https://github.com/WooSunghyeon/dropbp](https://github.com/WooSunghyeon/dropbp).
Keywords
Training Acceleration, Memory-Efficient Fine-Tuning, Large Language Models, Backpropagation Optimization

Reviews and Discussion

Review (Rating: 6)

The paper introduces DropBP, an innovative approach to accelerate the fine-tuning of Large Language Models (LLMs) by selectively dropping layers during backward propagation. This method is presented as a means to reduce computational costs and activation memory, significant challenges in the efficient fine-tuning of LLMs. The authors have provided a clear implementation of DropBP as a PyTorch extension and demonstrated its effectiveness through experiments on various LLMs and datasets.

Strengths

  • The concept of dropping backward-propagation layers to reduce computational overhead is distinct from previous work and addresses an important issue in training large models.

  • The paper includes extensive experiments that validate the effectiveness of DropBP in reducing training time and memory usage while maintaining accuracy.

  • The development of a PyTorch extension for DropBP facilitates easy integration with existing training codes, enhancing the practical applicability of the method.

Weaknesses

  • The motivation is not well illustrated. I agree that dropping sublayers could improve training efficiency, as the model effectively turns into a shallower counterpart. However, previous work like LayerDrop and others omits the layer computation in the forward pass; the corresponding computation could then also be removed from the subsequent backward pass with the necessary engineering effort. Thus the paper lacks a clear distinction in terms of technical innovation compared to these previous works.

  • While the paper proposes omitting sublayer computation in the backward pass, it's unclear why the forward pass computation remains unchanged. Justifying this choice or exploring alternatives would strengthen the contribution.

  • The faster convergence observed in Figure 5 with DropBP compared to the vanilla model is counterintuitive. Since the backward pass optimizes only a partial computation graph, concerns about overfitting arise. The paper would benefit from a discussion of any regularization techniques employed to address this, and from a comparison with related work (e.g., [1]) that uses sublayer dropping as regularization when training deep Transformer models.

    [1] Li et al., 2021 (AAAI) Learning Light-Weight Translation Models from Deep Transformer

Questions

Some typos:

  • Line 49: As a results -> As a result
  • Line 62: a effective -> an effective

Limitations

Yes

Author Response

We thank the reviewer for carefully reviewing our submission and providing valuable feedback. Please see below for our response to the questions and comments.

Q3.1. Previous work like LayerDrop and others omits the layer computation in the forward pass; the computation could then also be removed from the subsequent backward pass with the necessary engineering effort. Thus the paper lacks a clear distinction in terms of technical innovation compared to these previous works.

Q3.2. It's unclear why the forward pass computation remains unchanged.

A3.1-2. We appreciate the reviewer's insightful comments. We understand the reviewer's questions as follows:

  1. Motivation: What is the reason for focusing on reducing computations in the backward propagation rather than the entire computation process?
  2. Methodology: Why does the approach omit only backward propagation computations while keeping forward propagation unchanged?
  3. Related Works: How does this differ from algorithms like LayerDrop [2], which drop computations during forward propagation as well?

First, the goal of LayerDrop is to ultimately perform layer-wise pruning to reduce inference time by dropping layers during forward propagation. In contrast, the aim of our DropBP is to enhance training efficiency directly by reducing FLOPs and training memory usage. Therefore, our approach does not require strictly dropping the forward path. Of course, while it is possible to drop the forward propagation, we chose to avoid this because the training process is highly sensitive to dropping the forward propagation, as shown in Fig. E.

  • Figure E in the attached PDF.

In Fig. E, we compared the loss curves of DropBP and Progressive Layer Dropping (PLD) [3], a representative layer dropping algorithm, when the same drop rate was applied to training computations. The results demonstrate that DropBP achieves much more stable training compared to PLD. This is because dropping the forward path can cause output deviations, which can negatively impact loss and all gradients.

Furthermore, in Appendix D of the manuscript, we compared our DropBP with Layer Dropping algorithms for fine-tuning LLMs. As shown in Table 7 of the manuscript, our DropBP achieved higher accuracy compared to traditional Layer Dropping algorithms.

In summary, while Layer Dropping algorithms aim to improve inference efficiency by randomly dropping the forward path and pruning it during inference, our DropBP focuses on selectively dropping backward paths, which are relatively less sensitive than the forward path, to accelerate training and reduce memory usage while maintaining high accuracy.
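
As a conceptual illustration of this mechanism, the sketch below shows a residual block whose backward pass is dropped while its forward output is left unchanged. This is not the released DropBP library: the wrapper class, the per-step drop decision, and the use of `torch.no_grad()` are our own simplification of the idea.

```python
# Conceptual sketch only: drop the backward pass of a residual sub-layer
# while keeping its forward output. Names are illustrative, not the API of
# the released DropBP library.
import torch
import torch.nn as nn


class BackwardDropBlock(nn.Module):
    def __init__(self, sublayer: nn.Module, drop_rate: float = 0.5):
        super().__init__()
        self.sublayer = sublayer
        self.drop_rate = drop_rate

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.drop_rate:
            # Dropped step: run the sub-layer without building an autograd graph,
            # so its activations are not stored and its backward pass is skipped.
            with torch.no_grad():
                h = self.sublayer(x)
            return x + h  # gradients flow only through the skip connection
        return x + self.sublayer(x)  # normal step: full forward and backward


# Tiny usage example with made-up dimensions.
block = BackwardDropBlock(
    nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)),
    drop_rate=0.75,
)
x = torch.randn(8, 64, requires_grad=True)
block(x).sum().backward()  # works whether or not this step was dropped
```

On a dropped step the output seen by deeper layers is identical to the undropped case, but neither the sub-layer's gradients nor its activations are computed or stored, which is where the FLOP and memory savings come from.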

Following the reviewer's advice, we will incorporate these clarifications into the paper to reduce confusion and strengthen the contribution.

Q3.3. The faster convergence observed in Figure 5 with DropBP compared to the vanilla model is counterintuitive. Since the backward pass optimizes only a partial computation graph, concerns about overfitting arise. The paper would benefit from a discussion of any regularization techniques employed to address this, and from a comparison with related work (e.g., [1]) that uses sublayer dropping as regularization when training deep Transformer models.

A3.3. We agree with the reviewer's observation that Fig. 5 of the manuscript appears to show rapid convergence, which may suggest overfitting. However, this is because the x-axis is plotted against training time. When plotted against training steps, a completely different pattern emerges, as shown below:

  • Figure B in the attached PDF.

As shown in Fig. B, when the drop rate is 0.5, the convergence of loss per step is almost identical to the baseline. However, with drop rates of 0.75 and 0.875, the convergence speed per step is slower. Nonetheless, DropBP significantly reduces the time consumed per training step because it skips the backward propagation computations for the dropped layers. Consequently, the convergence speed per training time is actually faster for DropBP compared to the baseline.

Moreover, our DropBP does not optimize a partial computation graph but instead randomly drops layers according to the drop rate, meaning the layers being trained change with each step. This can be interpreted as structured dropout, leading to an ensemble effect where multiple models are effectively trained simultaneously. As a result, it can serve as a regularization technique to mitigate overfitting issues. In fact, when applying DropBP, we did not observe overfitting, where training loss decreases but validation loss worsens.

Additionally, following [1], our DropBP can also be interpreted as reducing co-adaptation among layers. Specifically, [1] claims that Layer Dropping (LD) [2] successfully achieves regularization by reducing co-adaptation among layers through dropping layers during training. This effect also applies to DropBP, where some backward paths are randomly dropped instead of strictly training all layers.

Thanks to the reviewer's insight, we have confirmed that DropBP can also be interpreted as a regularization technique. In the revised manuscript, we will incorporate this analysis, and in future research, we plan to specifically analyze the regularization effects of applying DropBP and work on improving it.

Q3.4. Some typos exist.

A3.4. Thank you for catching the typos. We will make sure to correct them in the revised version.

[1] Li et al., "Learning Light-Weight Translation Models from Deep Transformer."
[2] Fan et al., "Reducing Transformer Depth on Demand with Structured Dropout."
[3] Zhang and He, "Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping."

Comment

Thank you for the effort the authors have put into addressing my concerns. To this end, most of my concerns have been satisfactorily addressed. Specifically:

  • Regarding motivation, I agree that the authors have demonstrated that PLD-like models face convergence challenges during the training phase if the forward pass is dropped. While I still have some doubts about the results, more detailed discussion should be included in the next version to clarify this issue. It's not entirely clear to me whether these challenges are solely due to the increased model capacity, as larger models can sometimes be more robust to the training data than smaller ones.

  • The training curve plotted against training time, rather than training steps, seems meaningful. A better illustration of this would help others understand your work more clearly.

Overall, I would like to slightly raise my score.

Comment

Dear Reviewer NAjC,

We would like to express our sincere gratitude to the reviewer for the careful review and for the increased score. We fully agree with the reviewer’s insightful suggestions, particularly the need for a comparison of Layer Dropping algorithms and DropBP when the model scales up, and the importance of showing the training curve according to the training steps to clarify our arguments. We will incorporate these valuable insights into the final version of our paper.

Sincerely,

Authors of Paper # 9316

Review (Rating: 6)

The paper proposes a novel method to reduce the computational and memory costs associated with fine-tuning large language models (LLMs). The authors introduce DropBP, a technique that randomly drops layers during backward propagation, effectively reducing the computational operations (FLOPs) and activation memory needed. This method assigns drop rates based on the sensitivity of each layer to ensure stable training. The approach is applicable to both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. The paper reports significant improvements in training time, convergence speed, and maximum sequence length when fine-tuning LLaMA2 models with DropBP.

Strengths

  • DropBP introduces a novel method for reducing the computational and memory costs associated with fine-tuning LLMs. This is an important contribution to the field, given the increasing size and complexity of these models.

  • The paper provides empirical evidence that DropBP significantly reduces training time (by 44%), accelerates convergence (1.5× faster), and increases the maximum sequence length (up to 6.2×) on a single NVIDIA A100 GPU. These results demonstrate the effectiveness of the approach. The authors conduct thorough experiments on multiple datasets and models, providing a robust evaluation of DropBP's performance across different scenarios.

Weaknesses

  • The paper mentions that the sensitivity calculation is done only once and has negligible overhead. However, more details on this process and its potential impact on training time would provide a clearer understanding of any trade-offs involved.

  • The paper could benefit from a more detailed theoretical analysis of why DropBP works as effectively as it does. This would strengthen the paper by providing a deeper understanding of the underlying principles.

Questions

  • Can you provide more details on the sensitivity calculation process? Specifically, how is the sensitivity of each layer computed, and what is the computational overhead associated with this step?

  • What are the best practices for tuning the drop rates in DropBP? Are there guidelines or heuristics that practitioners can follow to optimize performance for their specific use cases?

  • How well does DropBP integrate with other recent advancements in efficient training techniques, such as mixed precision training or distributed training frameworks? Have you explored these combinations in your experiments?

Limitations

Limitations have been discussed.

Author Response

We thank the reviewer for carefully reviewing our submission and providing valuable feedback. Please see below for our response to the questions and comments.

Q2.1. Can you provide more details on the sensitivity calculation process? Specifically, how is the sensitivity of each layer computed, and what is the computational overhead associated with this step?

A2.1. Please refer to the global response GA2 above.

Q2.2. The paper could benefit from a more detailed theoretical analysis of why DropBP works as effectively as it does. This would strengthen the paper by providing a deeper understanding of the underlying principles.

A2.2.

  • Figure C in the attached PDF.

That's a good point. We interpret transformer models as a collection of numerous blocks, each composed of various modules with residual connections. Our hypothesis is that we can fine-tune LLMs well by training only certain shallow submodules. To theoretically analyze this hypothesis, we measured the impact of submodules based on their path lengths in LLaMA2-7B, as shown in Fig. C. Specifically, we followed these steps as suggested in [1]:

  1. We first perform a forward pass through the entire network.
  2. During the backward pass, we randomly sample $k$ residual blocks, which are back-propagated without passing through skip connections, while the remaining $n-k$ blocks are bypassed through the skip connections.
  3. We then measure the norm of the gradient at the input.

We take 100 measurements for each path length $k$. Subsequently, we multiply by the distribution of all possible path lengths, which follows a Binomial distribution, to quantify the gradient contribution from paths of a specific length.
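
To make this estimate concrete, one way to write it (our notation and normalization, following Veit et al. [1]; the paper may use a slightly different form) is

$$ C(k) \;\approx\; \binom{n}{k}\left(\tfrac{1}{2}\right)^{n} \cdot \frac{1}{100}\sum_{j=1}^{100} \left\lVert \nabla_{x}\mathcal{L}^{(k,j)} \right\rVert, $$

where $n$ is the number of residual blocks and $\nabla_{x}\mathcal{L}^{(k,j)}$ is the input gradient measured in the $j$-th run that back-propagates through exactly $k$ blocks; the binomial factor is the fraction of the $2^{n}$ possible paths that have length $k$.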

In Fig. C(b), we observed that the gradient per path length decreases as the path length increases. Consequently, Fig. C(c) demonstrates that shorter path lengths have a greater impact on the gradient. These observations are consistent with the findings in [1], which attributed this phenomenon to vanishing gradients. We confirmed that this also occurs in transformers, where the paths that significantly influence training in LLMs are relatively short. Therefore, DropBP enables effective training by focusing on these short submodules.

Thanks to the reviewer's advice, this analysis will be included in the final version, and we plan to conduct a more theoretical analysis of this phenomenon.

Q2.3. Are there guidelines or heuristics that practitioners can follow to optimize performance for their specific use cases?

A2.3. We are very pleased that the reviewer has shown interest in the use case of DropBP. Empirically, using the identical settings as the baseline is sufficient to achieve good convergence of loss and high accuracy when applying DropBP. At higher drop rates such as p=0.75 and p=0.875, however, increasing the learning rate by about 1.5 times can slightly improve accuracy. Thanks to the reviewer, we will incorporate these guidelines into the code we release.

Q2.4. How well does DropBP integrate with other recent advancements in efficient training techniques, such as mixed precision training or distributed training frameworks?

A2.4. We have developed a DropBP library that can be easily integrated into PyTorch, allowing it to be readily combined with most efficient training techniques that can be applied on a single GPU, such as parameter-efficient fine-tuning and mixed precision training. As shown in Fig. 4 and Table 2 of the manuscript, we incorporated these combinations into our experiments.

Additionally, we have confirmed that our library works well with recent distributed training frameworks based on PyTorch, such as FSDP [2]. However, we have encountered some errors when integrating with the DeepSpeed [3] framework. We are currently debugging these issues and plan to resolve them. In the future, we intend to analyze the experimental results once these issues are addressed.

[1] Veit et al., "Residual Networks Behave Like Ensembles of Relatively Shallow Networks."
[2] Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel."
[3] Rajbhandari et al., "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models."

Comment

I want to thank the authors for the detailed responses. After reading the response and other reviews, I would like to keep my original score.

Comment

Dear Reviewer Ep4o,

We would like to express our sincere gratitude to the reviewer for the careful and thorough review of our manuscript. We highly appreciate the insightful comments and suggestions provided, and we will incorporate them into the final version of our paper.

Sincerely,

Authors of Paper # 9316

Review (Rating: 6)

The paper proposes dropping layers during backward propagation (BP) based on layer sensitivity. The method aims to reduce the cost of gradient computation and of storing intermediate activations in full BP.

Strengths

  1. Reducing the cost of full BP in PEFT has been an important challenge.
  2. The method is simple and easy to integrate into either full fine-tuning or PEFT.
  3. Experiments demonstrate that DropBP can speed up the training while retaining the accuracy. The resulting memory reduction makes longer sequence modeling accessible.

Weaknesses

  1. The idea of optimizing NNs with sparse gradients is not new. The paper needs more discussion of, and comparison with, related work on sparse learning, e.g., [1-3].
  2. Table 1 only shows results on two datasets and a limited set of benchmarks. It is unclear whether the method works well for generation tasks and domain-specific transfer learning.
  3. It is unclear which algorithm is used to solve the constraint minimization problem, i.e., to determine the layer-specific rates based on sensitivity, and its extra computational cost.
  4. (Minor) In fine-tuning, DropBP drops a set of layers. However, the sensitivity of a set of layers may not be accurately represented by the direct summation of the sensitivities of individual layers in the set.

[1] Sun, Xu, et al., "meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting."
[2] Sung, Yi-Lin, Varun Nair, and Colin A. Raffel, "Training Neural Networks with Fixed Sparse Masks."
[3] Brock, Andrew, et al., "FreezeOut: Accelerate Training by Progressively Freezing Layers."

Questions

  1. What is the long context modeling performance after applying DropBP?
  2. Could the authors present Figure 5 with # of steps as the x-axis to demonstrate faster convergence?
  3. I wonder if the sensitivities would evolve, and the drop rate needs to be re-allocated through training.

Limitations

The authors have addressed the limitations

Author Response

We thank the reviewer for carefully reviewing our submission and providing valuable feedback. Please see below for our response to the questions and comments.

Q1.1. The idea of optimizing NNs with sparse gradient is not new.

A1.1. We acknowledge that the idea of optimizing neural networks with sparse gradients is not novel. However, our method differs significantly from sparse gradient methods like meProp, FISH Mask, and Freezeout in terms of purpose, methodology, and effect. Specifically:

  • meProp: While meProp accelerates training by masking the output gradient of individual layers, our DropBP skips entire transformer blocks by leveraging skip connections. As shown in Table G.1.2 in the global response, our DropBP achieves higher accuracy than meProp. Furthermore, meProp requires computing top-K gradient masks at each iteration, whereas DropBP only needs to calculate the sensitivities once for the entire training process, reducing the overhead on training time. Additionally, meProp must store activations for all layers during training, while DropBP saves activation memory by not storing activations for dropped layers.
  • FISH Mask: FISH Mask, a parameter-efficient fine-tuning method like LoRA, reduces communication costs by updating sparse parameters without decreasing FLOPs. In contrast, our DropBP directly reduces computational costs by dropping backward operations. Furthermore, while FISH Mask has to store all activations for backward propagation, our DropBP eliminates the need to store activations for dropped layers, reducing activation memory. DropBP and FISH Mask are complementary methods, which makes it possible to apply them simultaneously, just as DropBP and LoRA can be. We are working on this implementation but facing delays due to the complexity of FISH Mask's code. We will include related experiments in the paper after completion.
  • Freezeout: Freezeout accelerates training by gradually freezing earlier layers, while DropBP randomly drops layers from the start, regardless of their order. Consequently, Freezeout requires storing all activations initially, which complicates increasing sequence length and parallel processing. In contrast, DropBP maintains low and consistent memory allocation as shown in Table G.1.2, facilitating easier management, longer sequences, and better parallel processing.

We will include these distinctions and a more comprehensive comparison to suggested related works in our revised paper.

Q1.2. It is unclear if the method works well for generation tasks and domain-specific transfer learning.

A1.2. Please refer to the global response GA1 above.

Q1.3. It is unclear which algorithm is used to solve the constraint minimization problem, i.e., to determine the layer-specific rates based on sensitivity, and its extra computational cost.

A1.3. Please refer to the global response GA2 above.

Q1.4. (Minor) The sensitivity of a set of layers may not be accurately represented by the direct summation of the sensitivities of individual layers in the set.

A1.4. We agree with the reviewer's opinion that the total sensitivities of the dropped network do not strictly equal the sum of each individual layer's sensitivity. However, given the practical constraints of calculating the sensitivities for the vast number of possible combinations of dropped networks in a deep neural network, we have made the assumption that the total sensitivities of a dropped network can be approximated by the sum of the sensitivities of its individual layers. In our future research, we will explore more accurate methods to calculate the network's sensitivity and determine optimal drop rates.
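
For concreteness, the additive approximation under discussion can be written as follows (our formalization, not necessarily the exact objective stated in the paper):

$$ \min_{p_1,\dots,p_L} \; \sum_{i=1}^{L} p_i S_i \quad \text{s.t.} \quad \sum_{i=1}^{L} p_i F_i \;\ge\; F_{\text{target}}, $$

where $p_i$ is the drop rate of layer $i$, $S_i$ its measured sensitivity, $F_i$ its backward FLOPs, and $F_{\text{target}}$ the desired FLOP reduction. The expected sensitivity of a drop configuration is approximated by the weighted sum $\sum_i p_i S_i$ rather than being measured jointly over sets of dropped layers, which is exactly the approximation the reviewer points out.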

Q1.5. What is the long context modeling performance after applying DropBP?

A1.5.

  • Table R.1.1. Perplexity (PPL) of LLaMA2-7B-chat on a long-context test subset (16K evaluation length), without fine-tuning (No-tune) and after fine-tuning on LongAlpaca with DropBP (p=0.875) at training sequence lengths of 16K and 32K.

| Method | No-tune | DropBP (p=0.875), 16K | DropBP (p=0.875), 32K |
|---|---|---|---|
| PPL | NaN | 6.81 | 8.32 |
  • Figure A in the attached PDF.

In response to the reviewer's request, we trained the LLaMA2-7B-chat model on the LongAlpaca dataset using sequence lengths of 16K and 32K. The settings followed those of LongLoRA [1]. As shown in Fig. A, our experiments demonstrated that the model successfully converged on loss with long-sequence data of 16K or more. Additionally, when evaluating the fine-tuned model on a subset of the PG19 test set with a sequence length of 16K, we achieved lower perplexity (PPL) compared to the non-fine-tuned model, as shown in Table R.1.1, confirming that our DropBP method enables effective long-sequence modeling.

Due to limited GPU resources and review time, we conducted experiments on smaller training and test sets. Once we secure sufficient resources and time, we plan to obtain more robust experimental results for the revised version and to integrate DropBP into LongLoRA as future work.

Q1.6 Could the authors present Figure 5 with # of steps as the x-axis to demonstrate faster convergence?

A1.6 Please refer to the global response GA3 above.

Q1.7 I wonder if the sensitivities would evolve, and the drop rate needs to be re-allocated through training.

A1.7

  • Figure D in the attached PDF.

As shown in Fig. D, our experiments demonstrated that the sensitivity of each layer converges as training progresses. Accordingly, we calculate sensitivity just once, at 10% of the training process, minimizing the overhead from sensitivity calculations. This approach proved effective in most experiments. However, we believe the reviewer's concern is also valid, and we plan to add a feature to the DropBP library that periodically recalculates sensitivity to re-allocate the drop rates.

[1] Chen et al., "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models."

Comment

I thank the authors for providing new results and analysis in the updated version; I would like to keep my original score.

Comment

Dear Reviewer nmtJ,

We would like to express our sincere gratitude to the reviewer for the careful and thorough review of our manuscript. We highly appreciate the insightful comments and suggestions provided, and we will incorporate them into the final version of our paper.

Sincerely,

Authors of Paper # 9316

Author Response (Global)

We thank all the reviewers for carefully reviewing our submission and providing valuable feedback. We address several common and important questions in the following global response.

GQ1. It is unclear if the method works well for generation tasks and domain-specific transfer learning.

GA1. Following the reviewers' advice, we have conducted additional experiments to evaluate our method on generation tasks and domain-specific transfer learning, as shown below:

  • Table G.1.1. Results of Generation Tasks (MT-Bench) with LLaMA3-8B fine-tuned on the OASST1 dataset.

| Method | Drop Rate | Memory | Time | Humanities | STEM | Roleplay | Extraction | Writing | Reasoning | Coding | Math | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No-tune | – | – | – | 6.25 | 5.70 | 5.45 | 5.20 | 4.85 | 4.40 | 3.20 | 1.95 | 4.62 |
| LoRA | – | 57G | 27m | 7.00 | 6.40 | 5.80 | 5.70 | 5.30 | 4.55 | 3.25 | 2.95 | 5.12 |
| LoRA+DropBP | 0.5 | 42G | 21m | 6.55 | 6.25 | 6.05 | 5.50 | 5.05 | 4.45 | 3.75 | 3.25 | 5.11 |
| LoRA+DropBP | 0.75 | 36G | 17m | 6.75 | 5.90 | 5.80 | 5.70 | 5.35 | 4.30 | 3.60 | 3.30 | 5.09 |
| LoRA+DropBP | 0.875 | 32G | 16m | 6.60 | 6.55 | 5.90 | 5.70 | 5.70 | 3.95 | 3.40 | 2.80 | 5.08 |
  • Table G.1.2. Results of Domain-Specific Learning with LLaMA3-8B on the IMDB Dataset.
| Method | Drop Rate | Memory | Time | Accuracy (%) |
|---|---|---|---|---|
| LoRA | – | 40G | 539s | 91.7 |
| meProp | 0.5 | 40G | 507s | 88.5 |
| Freezeout | – | 40G | 445s | 91.5 |
| LoRA+DropBP | 0.5 | 34G | 438s | 92.3 |
| LoRA+DropBP | 0.75 | 32G | 392s | 91.5 |
| LoRA+DropBP | 0.875 | 31G | 362s | 91.3 |

The experimental results show that our DropBP can reduce training memory and time while maintaining comparable accuracy in generation tasks and domain-specific learning tasks. Thanks to the Reviewer nmtJ, we will include the improved results in the revision.

GQ2. I am curious about the specific method for calculating sensitivity and the associated overhead.

GA2. Sensitivity calculation process in DropBP involves two main steps:

  • Step 1. Sensitivity Calculation: We define the sensitivity of a layer as the variance in gradient normalization between when the layer is dropped and when it is not. Therefore, calculating all layer sensitivities requires $L$ iterations, corresponding to the number of layers, which is typically fewer than the iterations needed for fine-tuning. For example, when fine-tuning LLaMA2-70B, 160 iterations are required to calculate sensitivities, which is significantly fewer than the size of training datasets, such as the Alpaca dataset (52K).

  • Step 2. Drop Rate Allocation: We employ the simple greedy algorithm from [1] to efficiently determine drop rates. Initially, all layers start with a drop rate of 0, which is gradually increased to achieve the target FLOPs. At each step, we increase the drop rate of one layer by 0.1, selecting the layer that minimizes the increase in total sensitivity (a sketch of this greedy allocation is given below). By using a binary heap for optimal move selection, the algorithm runs with a complexity of $O(L \log L)$. The overhead is negligible, considering that the computation cost for a transformer's attention layer and linear layer is $O(bsh^2L)$ and $O(bs^2hL)$ respectively, where $b$, $s$, and $h$ represent the batch size, sequence length, and hidden dimension.
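
A minimal sketch of this greedy allocation, assuming per-layer sensitivities and backward FLOPs have already been measured (function and variable names are ours, not the released library's):

```python
# Hypothetical greedy drop-rate allocation with a binary heap (heapq).
# Each move raises one layer's drop rate by 0.1, picking the layer whose
# move adds the least sensitivity, until the target FLOP reduction is met.
import heapq


def allocate_drop_rates(sensitivity, flops, target_reduction, step=0.1):
    """Greedy allocation: repeatedly raise by `step` the drop rate of the layer
    whose raise adds the least total sensitivity, until the expected backward
    FLOP reduction reaches `target_reduction`."""
    num_layers = len(sensitivity)
    rates = [0.0] * num_layers
    saved = 0.0
    # Under the additive model, raising layer i by one step costs step * sensitivity[i].
    heap = [(step * sensitivity[i], i) for i in range(num_layers)]
    heapq.heapify(heap)
    while saved < target_reduction and heap:
        cost, i = heapq.heappop(heap)
        rates[i] = round(rates[i] + step, 10)
        saved = round(saved + step * flops[i], 10)  # expected FLOPs saved so far
        if rates[i] < 1.0:
            heapq.heappush(heap, (cost, i))  # this layer can still be raised further
    return rates


# Usage example with made-up numbers: low-sensitivity layers get dropped first.
print(allocate_drop_rates(sensitivity=[0.9, 0.2, 0.1, 0.4],
                          flops=[1.0, 1.0, 1.0, 1.0],
                          target_reduction=2.0))
# -> [0.0, 1.0, 1.0, 0.0]
```

Because the per-move cost of each layer is constant under the additive model, the heap simply keeps popping the least-sensitive layer that is not yet fully dropped, consistent with the $O(L \log L)$ complexity noted above.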

As explained, both Step 1 and Step 2 incur very low computational overhead and occur only once during the entire training process. Therefore, the overhead from sensitivity computation is negligible. The experimental results in Table G.2 below support this claim. Thanks to reviewers nmtJ and Ep4o, we will include detailed explanations of the sensitivity calculation process in the final version.

  • Table G.2. Processing Time Analysis for Calculating Sensitivities When Fine-Tuning LLMs on the Alpaca Dataset.
| Model | Precision | PEFT | Sensitivity calculation | Training (p=0) | Training (p=0.5) | Training (p=0.75) | Training (p=0.875) |
|---|---|---|---|---|---|---|---|
| LLaMA2-7B | Mixed | LoRA | 10s | 2.2h | 1.7h | 1.4h | 1.3h |
| LLaMA2-7B | BF16 | FFT | 10s | 2.0h | 1.3h | 1.0h | 0.8h |
| LLaMA2-13B | BF16 | LoRA | 21s | 2.9h | 2.1h | 1.7h | 1.5h |
| LLaMA2-70B | BF16 | QLoRA | 6m | 29.6h | 22.2h | 18.4h | 16.5h |

GQ3. Could the authors present Figure 5 with training steps as the x-axis to demonstrate faster convergence?

GA3.

  • Figure B in the attached PDF.

In response to the reviewer's request, we plotted the training curves over training steps in Fig. B(a). When the drop rate is 0.5, the convergence of loss per step is almost identical to the baseline. However, with drop rates of 0.75 and 0.875, the convergence speed per step is slower. Nonetheless, DropBP significantly reduces the time consumed per training step because it skips the backward propagation computations for the dropped layers. Consequently, the convergence speed per training time is actually faster for DropBP compared to the baseline, as shown in Fig. B(b). Thanks to reviewers nmtJ and NAjC, we will include an analysis of the training loss curve in DropBP, using both training time and training steps as the x-axis.

[1] Chen et al., "ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training"

Final Decision

The paper proposes a new training method that randomly drops backpropagation in transformer layers to improve training efficiency and reduce memory consumption. The drop rate is determined by a sensitivity analysis of each layer. As a result, the authors achieve faster wall-clock convergence with a smaller compute budget.

All three reviewers agree that 1) the paper tackles an important problem, 2) the proposed method is simple and easy to integrate into existing frameworks, and 3) the supporting experimental results show good improvements that reinforce the authors' claims. The reviewers commonly asked for comparisons with pre-existing methods such as meProp, LayerDrop, etc. The authors provided these comparisons and promised to add them in the final version. The reviewers also questioned whether the method works for different tasks, and the authors provided results on generation tasks and a domain-specific learning task.

Overall, the paper proposes a simple yet effective method to make LLM training more efficient. This could be beneficial to the community, and no major concerns remain in the reviewers' assessments.