D-LLM: A Token Adaptive Computing Resource Allocation Strategy for Large Language Models
Abstract
Reviews and Discussion
The paper proposes D-LLMs, a novel dynamic inference framework for large language models (LLMs) that adaptively allocates computing resources based on the importance of individual tokens. By introducing a decision module for each transformer layer, D-LLMs can decide whether to execute or skip specific layers for each token, optimizing computational efficiency. The framework also includes a KV-cache eviction strategy to maintain compatibility with existing acceleration techniques and reduce storage overhead. Experimental results demonstrate that D-LLMs can significantly reduce computational costs and KV-cache storage by up to 50% without compromising performance across various tasks, including Q&A, summarization, and commonsense reasoning.
Strengths
The paper demonstrates significant originality by introducing D-LLMs, a dynamic inference framework that adaptively allocates computational resources in large language models. This novel approach addresses a critical challenge in deploying LLMs on resource-constrained platforms, offering a creative solution that dynamically determines the necessity of executing each transformer layer based on token importance. The quality of the work is evident in the thorough design and implementation of the decision modules and the KV-cache eviction strategy, which ensure compatibility with existing acceleration techniques. The clarity of the paper is commendable, as it systematically explains the motivation, methodology, and experimental setup, making it accessible to a broad audience. The significance of this research lies in its potential to drastically reduce computational costs and storage requirements for LLMs, making high-performance language models more practical for a wider range of applications. This work not only advances the field by improving efficiency but also opens up new avenues for optimizing model deployment in real-world scenarios.
Weaknesses
One key limitation of this work is the lack of detailed discussion on the granularity of tokens. The effectiveness of the D-LLMs framework heavily relies on accurately assessing the importance of each token during the inference process. However, the paper does not elaborate on how tokenization is handled, whether different granularities were considered, or the impact of token granularity on the model's performance and computational efficiency. This omission leaves questions about the generalizability and robustness of the proposed method across different types of tokens and tasks.
Additionally, the conditions under which the decision module determines to skip or execute layers are not clearly defined, leaving some ambiguity about the decision-making process. The paper would benefit from a more detailed explanation of the hyper-parameters used in the experiments, as this information is crucial for reproducing the results and understanding the model's behavior. Furthermore, there are instances of inconsistent terminology, such as the use of "dynamic decision module" versus "execution decision module," which can cause confusion.
Empirical analysis or ablation studies comparing different levels of token granularity, along with more precise definitions of the decision criteria and detailed hyper-parameter settings, would have strengthened the validation of the approach and provided clearer guidance for practical implementation.
Questions
- Token Granularity:
- How does the framework handle tokenization? Have different levels of granularity (e.g., subwords, words, characters) been considered, and how do they impact the model's performance and efficiency?
- Can you provide empirical results or ablation studies that compare different levels of token granularity to validate the chosen approach?
- Decision Module Criteria:
- What specific criteria or metrics does the decision module use to determine whether a transformer layer should be executed or skipped for a given token?
- Are there any thresholds or heuristics involved in this decision-making process? If so, how are they set and optimized?
- Hyper-Parameters:
- Can you provide more details on the hyper-parameters used in your experiments, such as those related to the dynamic decision module and the KV-cache eviction strategy?
- How sensitive is the performance of D-LLMs to these hyper-parameters, and are there any guidelines for tuning them?
- Compatibility with Different LLM Architectures:
- While the paper demonstrates the framework's effectiveness on certain LLMs (e.g., LLaMA, GPT), how well does the approach generalize to other architectures, especially newer ones?
- Have you tested D-LLMs on any other LLM architectures or benchmarks not mentioned in the paper? If so, can you share the results?
- Impact on Latency:
- Does the dynamic decision-making process introduce any latency during inference? If so, how significant is this latency, and what are its implications for real-time applications?
- Are there any optimizations or future directions planned to minimize this latency?
- Comprehensive Evaluation:
- The experiments focus on specific benchmarks. Have you considered evaluating D-LLMs on a broader set of tasks to better understand its generalizability and robustness?
- Are there any plans to conduct further experiments that encompass a wider variety of NLP tasks and datasets?
- Implementation Details:
- Can you provide more detailed implementation guidelines, including pseudocode or a step-by-step description of the D-LLMs framework?
- Are there any particular challenges or considerations to be aware of when implementing this framework on different hardware or software environments?
Limitations
The authors have partially addressed the limitations of their work, primarily focusing on the technical aspects of the D-LLMs framework. However, there are several areas where the discussion of limitations could be expanded and more detailed. Additionally, the paper does not explicitly address potential negative societal impacts, which is an important consideration for any work involving large language models.
- The paper does not discuss the impact of token granularity on the effectiveness of the D-LLMs framework. Given that token granularity is crucial for the dynamic allocation of computational resources, a more thorough examination is needed. The authors should include an analysis of different granularity levels (e.g., subwords, words, characters) and their impact on model performance and efficiency.
- The evaluation is limited to specific benchmarks. A broader set of tasks and datasets would provide a more comprehensive understanding of the framework's generalizability and robustness.
- Dynamic inference mechanisms might inadvertently amplify biases present in the training data, as decisions on token importance might favor certain types of content over others. The authors should discuss how they mitigate such biases.
We truly appreciate the reviewer for the valuable feedback. We have carefully considered the comments and suggestions and would like to address each of the concerns raised.
Q1: Token Granularity
We follow the tokenization released with the corresponding LLMs. Today's LLMs each use their own tokenizer. For example, Llama 2[1] employs a byte-pair encoding (BPE) tokenizer with a vocabulary of 32k tokens, while Llama 3[2] uses the tiktoken tokenizer, expanding the vocabulary size to 128k tokens. At the same time, each LLM is trained on its specific tokenization. Because of this binding between tokenizer and LLM, simply changing the tokenization would destroy the model's capability. Comparing different levels of token granularity would therefore require training extra LLMs from scratch, which is unaffordable. Our method has been verified on Llama 2 and Llama 3, which demonstrates that D-LLM is applicable and effective across different tokenizers (a brief illustration follows the references below).
[1] Llama 2: Open Foundation and Fine-Tuned Chat Models
[2] The Llama 3 Herd of Models
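As a small illustration of how each model is bound to its own tokenizer, here is a hedged sketch assuming the Hugging Face `transformers` library and access to the gated `meta-llama/Llama-2-7b-hf` checkpoint; it is not part of the released D-LLM code:

```python
from transformers import AutoTokenizer

# Each LLM ships with its own tokenizer; the model's embeddings are tied to this
# vocabulary, so swapping tokenizers would require retraining the LLM itself.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tok.vocab_size)  # 32000 (BPE vocabulary)
print(tok.tokenize("Dynamic inference allocates compute per token."))
```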
Q2: Decision Module Criteria
- The decision module outputs a two-dimensional probability vector over {execute, skip}, whose entries sum to 1. The transformer layer is executed or skipped according to the argmax of this vector: if the execute probability is the larger entry, the layer is executed; otherwise, it is skipped (see the sketch below).
- Decision-making depends only on the argmax of the probability vector, so we do not have to set or optimize any thresholds.
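For concreteness, a minimal PyTorch sketch of such a decision head; the module name `LayerDecision`, the bottleneck width, and the two-layer structure are illustrative assumptions, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn

class LayerDecision(nn.Module):
    """Illustrative per-layer decision head: maps a token's hidden state
    to a 2-way distribution over {execute, skip}."""
    def __init__(self, hidden_dim: int, proj_dim: int = 64):
        super().__init__()
        # Small bottleneck MLP; the real module may differ.
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, 2),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) hidden state of the current token
        return torch.softmax(self.net(h), dim=-1)  # [p_exec, p_skip]

# Decision rule: execute the layer iff the execute probability wins the argmax.
h = torch.randn(1, 4096)
p = LayerDecision(4096)(h)
execute = bool(p.argmax(dim=-1).item() == 0)  # index 0 assumed to mean "execute"
```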
Q3: Hyper-Parameters
Hyper-parameters in our method include the target acceleration rate, the number of reserved initial tokens, and the weight factor α of the acceleration-rate loss in Eq. 14. We provide additional descriptions and details for these hyper-parameters as follows.
- The target acceleration rate is user-defined according to the expected computation overhead. We report the performance under different target rates in Figure 2 (lines 198-199). More details are available in Table 4 in A.2 (lines 532-540).
- The importance of initial tokens has been pointed out in recent works on LLMs' generalization to long contexts[1,2]. Initial tokens encode strong absolute position information and collect significant attention scores, which makes it necessary to keep them when pruning the context. Our experiments in Table 3 (lines 270-271) demonstrate that reserving initial tokens in the proposed eviction policy leads to better performance.
- We conduct additional analysis on the hyper-parameter α in Eq. 14 (line 213). Details are available in global rebuttal #2.
Q4: Compatibility with Different LLM Architectures
- Our proposed D-LLM is designed for decoder-only transformers, the mainstream architecture of current LLMs. For other architectures such as encoder-decoder models, our method needs further adaptation, especially the KV-cache eviction strategy.
- We are currently focused on the decoder-only architecture but are open to adapting our method to other LLM architectures in the future.
Q5: Impact on Latency
- We provide a detailed analysis of the latency and other overheads introduced by the decision modules in global rebuttal #1. In real-time applications, D-LLM with an actual acceleration rate of 40% raises the generation speed from 32 tokens/s to about 41 tokens/s. The latency is negligible compared with the gain in generation speed.
- We have also considered several optimization directions to minimize this latency, for example: reducing the complexity of the decision module; using a global decision module to predict the routes for all transformer layers at once; and reducing the frequency of invoking the decision module during generation.
Q6: Comprehensive Evaluation
- In terms of benchmark selection, we mainly follow the popular evaluation datasets used by mainstream LLMs[1]. Currently, D-LLM is designed and verified on specific domain tasks, such as commonsense reasoning and math solving. We use a considerable number of benchmarks compared with other research on LLMs[2,3]. We plan to expand the scale of D-LLM training in the future, including large-scale common datasets for LLM training and larger model sizes. We also plan to assess D-LLM's generalizability and robustness across a broader range of tasks, for example coding and reading comprehension.
[1] Llama 2: Open Foundation and Fine-Tuned Chat Models
[2] DoRA: Weight-Decomposed Low-Rank Adaptation
[3] Not all Layers of LLMs are Necessary during Inference
Q7: Implementation Details
- We have released our implementation code; the web address is provided in the checklist (line 690). The framework of our proposed D-LLM can be found in the directory `./llama`. We also provide code for training and inference in the repository.
- There are no particular challenges or restrictions to implementing D-LLM on different hardware or software environments. We provide the list of required libraries for training and inference in the file `./requirements.txt` in the repository.
The paper introduces D-LLMs, a dynamic inference paradigm for large language models that adaptively allocates computing resources based on token importance. D-LLMs reduce computational costs and KV-cache storage by up to 45% and 50% on various tasks.
Strengths
- The concept of dynamically adjusting the execution of transformer layers at the token level is both novel and impactful, providing a promising direction for reducing computational costs in LLMs.
- The paper addresses the practical challenge of integrating dynamic inference with KV-cache methods, which is essential for real-world applications.
- The authors conducted extensive experiments, demonstrating significant reductions in computational resources without sacrificing model performance across diverse NLP tasks.
Weaknesses
- While the experimental results are impressive, the paper lacks a deep theoretical analysis of the dynamic decision module's behavior and its impact on model performance.
- The computational overhead introduced by the dynamic decision module itself is not sufficiently analyzed.
- The manuscript requires a deeper examination of the impact of the various components of the loss function. Specifically, a thorough analysis is needed of how varying the value of α affects the performance of the model.
- Although the paper demonstrates the effectiveness of the overall approach, it would benefit from more detailed ablation studies to isolate the impact of different components of the dynamic decision module and eviction policy.
- The manuscript necessitates a more elaborate description of the evaluation metrics. For instance, a detailed explanation of the computation methodology for FLOPs is required to provide readers with a comprehensive understanding of the evaluation process.
Questions
See the above
Limitations
See the above
We truly appreciate the reviewer for the valuable feedback. We have carefully considered the comments and suggestions and would like to address each of the concerns raised.
W1: While the experimental results are impressive, the paper lacks a deep theoretical analysis of the dynamic decision module's behavior and its impact on model performance.
We understand your expectation of theoretical analysis and would like to make the following clarifications:
- Theoretically, dynamic inference is a conditional execution paradigm, similar to mixture-of-experts and sparsity algorithms. The key idea is to reduce network redundancy and customize a network topology for each sample.
- We formulate dynamic inference as a sequential decision problem: each decision module decides whether a transformer block should be executed or not.
- We provide a detailed analysis of the dynamic decision module's behavior in relation to the properties of grammatical terms (lines 279-293), its relationship with FLOPs in A.1 (lines 522-531), and instance complexity in A.5 (lines 563-573).
In future work, we will explore more theoretical tools and methods to provide a more in-depth explanation of the model's behavior.
W2: The computational overhead introduced by the dynamic decision module itself is not sufficiently analyzed.
We provide computational overhead analysis taken by the decision module of D-LLM in global rebuttal #1.
W3: The manuscript requires a deeper examination of the impact of various components of the loss function. Specifically, a thorough analysis is needed of how varying the value of α affects the performance of the model.
We perform a comprehensive analysis of the hyper-parameter α; details are available in global rebuttal #2.
W4: Although the paper demonstrates the effectiveness of the overall approach, it would benefit from more detailed ablation studies to isolate the impact of different components of the dynamic decision module and eviction policy.
We would like to kindly remind you that we have performed ablation studies to demonstrate the impact of the dynamic decision module and the eviction policy in our paper (lines 254-270). We select two baselines for comparison: 'LoRA finetune' only utilizes LoRA modules to finetune the LLM, and 'D-LLM w/o Evl. Str.' utilizes the dynamic decision module without the eviction policy. The results show that the decision module can effectively achieve acceleration through layer skipping while maintaining comparable performance on the benchmarks. With the eviction policy, D-LLM further reduces the computation overhead due to fewer KV calculations and less computation in the attention operation.
Additionally, we consider that you might be interested in the components of the dynamic decision module at a finer granularity. We use a one-layer network ('Linear & Softmax') as the decision module for an additional ablation study. The results in the following table demonstrate that the module designed in our D-LLM has non-linear fitting capability, which helps the model control the actual acceleration rate more precisely and achieve better performance.
Values in table represent (FLOPs, Accuracy).
| | D-LLM (Linear & Softmax) | D-LLM |
|---|---|---|
| MaWPS | (0.58, 0.70) | (0.56, 0.74) |
| OBQA | (0.57, 0.79) | (0.53, 0.80) |
W5: The manuscript necessitates a more elaborate description of the evaluation metrics. For instance, a detailed explanation of the computation methodology for FLOPs is required to provide readers with a comprehensive understanding of the evaluation process.
We give a detailed description of all metrics mentioned in our paper as follows:
- PPL measures the divergence between the model's predicted probability distribution and the ground truth[1]. Given a sentence $w_1,\dots,w_N$ and the model's predicted probabilities, PPL is computed as $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)$.
- Accuracy: For math-solving tasks, an answer is marked correct only if the numerical difference between the predicted result and the ground truth is below a small tolerance. For commonsense reasoning tasks, an answer is marked correct only if the output matches the correct option in lowercase.
- FLOPs: FLOPs (floating point operations) measure the complexity of algorithms/models[2]. We calculate the FLOPs of the fully connected layers and attention operations in the transformer. For a fully connected layer without bias, the FLOPs are proportional to the product of the input dimensionality and the output dimensionality. The FLOPs of attention operations are proportional to the cache size, so we report the average FLOPs over generating the 1st to the 1K-th token. The FLOPs of the baselines are normalized to 1 for better comparison. (A small computational sketch follows the references below.)
We will add the description in the revised version.
[1] Perplexity—a measure of the difficulty of speech recognition tasks, 1977
[2] Pruning Convolutional Neural Networks for Resource Efficient Inference, 2017
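For concreteness, a minimal sketch of the PPL and FLOPs accounting described above; the factor-of-two multiply-accumulate convention and the attention cost model are assumptions on our part, not taken from the paper:

```python
import math
import torch

def perplexity(log_probs: torch.Tensor) -> float:
    """log_probs[i] = log p(w_i | w_<i) for one sentence of N tokens."""
    return math.exp(-log_probs.mean().item())

def linear_flops(d_in: int, d_out: int) -> int:
    """Bias-free fully connected layer, per token; one multiply-accumulate
    counted as two floating point operations (assumed convention)."""
    return 2 * d_in * d_out

def attention_flops(d_model: int, cache_len: int) -> int:
    """Rough per-token attention cost: QK^T scores plus the weighted sum over
    values, both proportional to the current KV-cache length."""
    return 2 * 2 * d_model * cache_len

# Average attention FLOPs while generating the 1st to the 1000th token,
# since the KV-cache (and hence the cost) grows with every step.
avg_attn = sum(attention_flops(4096, t) for t in range(1, 1001)) / 1000
```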
This paper proposes D-LLM, a novel layer-skipping framework for LLM inference to reduce computation costs. It designs and trains (fine-tuning) the additional decision module per transformer layer and implements KV-cache eviction by adjusting self-attention masks upon the execution decision. The framework presents ~50% reduction in computing costs.
Strengths
- The study defines a clear problem to solve, which is important for current trends in model inference
- The consideration of token-level importance and resource allocation, which results in a layer-skipping mechanism, is commendable
- D-LLM carefully contemplates several hyperparameters, such as the number of preserved tokens, to achieve reasonable accuracy compared to the baselines
- Evaluation contains measurements in diverse aspects
Weaknesses
- D-LLM requires additional computation costs/resources to train (fine-tune) the decision modules and adaptors (LoRA modules). Presenting such additional overheads from the decision module training would be interesting.
- Baselines do not include “dynamic inference mechanisms,” which are more closely related to D-LLM
- D-LLM introduces the concept of a “user-defined target acceleration rate;” however, the experimental results do not include its use case or the changing behavior depending on the user-defined rate.
- D-LLM claims that recent research shows preserving initial tokens is crucial, but I recommend the authors add references or clear evidence to support this statement (line 210).
- The paper’s writing needs to be carefully revisited. For example, AdaInfer, MoD, and Shorten-LlaMa (baselines) have two duplicated citations (e.g., 14 and 15), which could be confusing.
Questions
- What is the difference between 'dynamic pruning' (line 99) and 'dynamic inference mechanism' (line 105)? I understand the explanations in the paper, but the structure of the current paper distinguishes the related work into 1) LLMs acceleration methods and 2) dynamic inference mechanisms. I wonder why the above dynamic pruning belongs to the first.
- The idea of inserting a decision module as proposed by D-LLM seems to be applicable even without a PEFT like LoRA. Discussion on any specific use of LoRA in this study or applying the module in other situations would be interesting.
- If the LoRA modules (low-rank adaptors) are not trained together, will the desired accuracy not be achieved?
- As mentioned in the weaknesses section, additional training is required to achieve similar performance to the baseline. How much overhead is this additional training? Is it small enough that it doesn't need to be shown in the paper?
- If we want to infer on a completely new dataset, does D-LLM need to train further? If we use the decision module and low-rank adaptor already trained on a different dataset without additional training, how will the performance change?
- How much does the target acceleration rate differ from the actual LLM inference acceleration rate (skipped-layers ratio)?
- In equation 11, why is the average acceleration rate calculated with b^tilde rather than using b?
- The evaluation parts measure performance mainly by FLOPS. Can FLOPS be interpreted for various purposes of evaluation? For example, how do FLOPS (instead of actual memory consumption) explain the overhead in KV-cache?
Limitations
- Although the authors mentioned that the difficulty has not been modeled, a precise definition of the concept is required to generalize this design to be applied to other contexts
- Currently, D-LLM is only applicable to parameter-efficient fine-tuning workloads that use LoRA (even if LoRA is a widely adopted PEFT algorithm) and requires additional training for new datasets or benchmarks
We truly appreciate the reviewer for the valuable feedback. We would like to address each of concerns raised as follows.
W1 & Q4: Additional overheads from the decision module training
We provide an analysis of overheads taken by decision modules in global rebuttal #1.
W2: Baselines do not include “dynamic inference mechanisms”
We would like to address the choice of baselines as follows:
- D-LLM without the eviction strategy, shown in Table 2 (line 260), can be regarded as a baseline dynamic inference mechanism based on layer skipping. We propose the KV-cache eviction strategy on top of layer skipping for better adaptation to LLMs and prove its effectiveness in our ablation study.
- Dynamic inference mechanisms share similar methods with LLM acceleration. AdaInfer, one of our baselines, introduces an early-exit mechanism that stops inference at intermediate layers, which is an important class of dynamic inference mechanisms.
W3: Add references or clear evidence for preserving initial tokens
The importance of initial tokens has been pointed out in recent works on LLMs' generalization to long contexts[1,2]. Initial tokens encode strong absolute position information and collect significant attention scores, which makes it necessary to keep them when pruning the context. We will include the references in the revision.
[1] LM-Infinite: Simple on-the-fly length generalization for large language models.
[2] Efficient streaming language models with attention sinks.
W4: Writing needs revision. Duplicated citations.
We will correct the duplicate citations and carefully check over other mistakes in revision.
Q1: Difference between 'dynamic pruning' (line 99) and 'dynamic inference mechanism' (line 105)
We discuss related works in other fields, such as computer vision, in the context of dynamic inference mechanisms. 'Dynamic pruning' methods like AdaInfer and MoD are designed specifically for LLMs. Although some dynamic inference mechanisms are shared between computer vision and LLMs, we discuss LLM acceleration methods separately to emphasize the domain of our work. We will make this clearer in the revision.
Q2-3: Specific use of LoRA
LoRA primarily helps the LLM adapt to the layer-skipping computation mode and to specific domain tasks. We perform an ablation study on D-LLM with and without LoRA under the same target acceleration rate. The following table shows that the accuracy decreases significantly when the LoRA modules are not trained together. The model also exhibits less control over the acceleration rate.
Values in table represent (FLOPs, Accuracy).
| | w/o LoRA | D-LLM |
|---|---|---|
| MaWPS | (0.63, 0.00) | (0.56, 0.74) |
| OBQA | (0.58, 0.25) | (0.53, 0.80) |
Q5: D-LLM on a completely new dataset
Currently, we conduct experiments on specific tasks with few-shot training. To infer on a completely new dataset, further training of D-LLM is required to adapt the model to the domain knowledge and task templates. The table below shows that D-LLM maintains stable acceleration rates but exhibits decreasing performance on new datasets, which also occurs for PEFT methods (LoRA). We will study the generalization of D-LLM with more common datasets in the future.
Values in table represent (FLOPs, Accuracy).
| | D-LLM_PIQA | D-LLM_SIQA | LoRA_PIQA | LoRA_SIQA |
|---|---|---|---|---|
| PIQA | (0.52, 0.84) | (0.53, 0.68) | (1.00, 0.84) | (1.00, 0.68) |
| SIQA | (0.51, 0.54) | (0.54, 0.82) | (1.00, 0.53) | (1.00, 0.81) |
Q6: Difference between target and actual acceleration rate
We give a comparison between the actual acceleration rate and the target rate in the following table. The results show that a larger target rate leads to a larger gap. The gap is also influenced by the difficulty of the task and the hyper-parameter α in Eq. 14 (line 213). For example, the actual rate on OBQA is closer to the target because making choices is easier than summarization (SAMSum). Analysis of α is available in global rebuttal #2.
| Target | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 |
|---|---|---|---|---|---|---|
| SAMSum | 0.19 | 0.28 | 0.36 | 0.44 | 0.52 | 0.59 |
| OBQA | 0.19 | 0.28 | 0.39 | 0.48 | 0.61 | 0.66 |
Q7: Why is the average acceleration rate calculated with b^tilde rather than using b? (Eq.11)
We use $\tilde{b}$ instead of $b$ for gradient backpropagation. $b$ is produced by an argmax operation, which is not differentiable, so using it would make the objective in Eq. 12 impossible to optimize.
In practical implementation, $\tilde{b}$ can be made to approximate one-hot vectors in two ways:
- Anneal the temperature parameter of the softmax toward 0, so that the softmax becomes 'sharp' and $\tilde{b}$ is numerically close to the one-hot $b$.
- Use a straight-through estimator: the forward pass uses the hard one-hot $b$, while gradients are routed through the soft output; the stop-gradient operation contributes zero gradients during backward propagation.
Both ways are viable in our framework. We will clarify details in revision.
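A minimal PyTorch sketch of the straight-through option, assuming the soft decision `p` is the softmax output of the decision module; the exact formulation used in D-LLM may differ:

```python
import torch

def straight_through(p: torch.Tensor) -> torch.Tensor:
    """Forward pass yields the hard one-hot decision b = onehot(argmax p);
    backward pass routes gradients through the soft p, because the detached
    terms contribute zero gradient."""
    b = torch.zeros_like(p).scatter_(-1, p.argmax(dim=-1, keepdim=True), 1.0)
    return b.detach() + p - p.detach()

logits = torch.randn(1, 2, requires_grad=True)
p = torch.softmax(logits, dim=-1)
b_tilde = straight_through(p)  # one-hot in value, differentiable w.r.t. the logits
b_tilde[0, 0].backward()       # gradients flow back to the logits
```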
Q8: Can FLOPS be interpreted for various purposes of evaluation?
In addition to computation, FLOPs essentially also measure the rate of memory overhead in the KV-cache, which is exactly the executed-layers rate, since only executed layers store KV-cache entries in D-LLM. We provide the exact values on SAMSum in the following table to show the slight difference between FLOPs and the other metrics, which is caused by the small extra computation of the decision modules.
| | Executed-layers rate | KV-cache overhead | FLOPs |
|---|---|---|---|
| D-LLM () | 0.72 | 0.72 | 0.73 |
| D-LLM () | 0.54 | 0.54 | 0.55 |
| LLM | 1.00 | 1.00 | 1.00 |
I appreciate the authors' efforts in addressing the comments, including the comprehensive new measurements. I have changed my decision from borderline reject to borderline accept.
- Q1) I understood the authors' answer. Since the technical depth and level of dynamic pruning and dynamic inference are somewhat different, I suggest organizing them carefully.
- Q8) I agree that FLOPS might reveal the improvement on the memory side by considering the operations involved, but for model practitioners or system engineers, it is also important to understand the actual amount of increase or decrease in memory, especially given the limited memory of GPUs.
Thank you for your kind response and support. We are pleased that we have addressed most of your concerns.
[Q1)] Thanks for your suggestion. We will carefully organize the writing of related work in the revision.
[Q8)] We understand your comments and advice. While the precise memory overhead is closely related to specific implementations, such as operators and quantization techniques, providing example figures is still informative for system engineers assessing the model's deployability under various GPU memory constraints.
We provide the actual memory overhead in the table below, including the KV-cache overhead (1K context) and the running memory. We use D-LLM based on llama2-7b, with parameters stored as float16. For example, the KV-cache of D-LLM () with a context length of 1K costs about 276 MB, while that of the original LLM costs 512 MB. The running memory when generating each token is only 7.92 GB, since skipped layers do not need to be loaded into the GPU, which is significantly less than the 13.90 GB needed by the original LLM using all layers. (A back-of-the-envelope check of the 512 MB figure follows the table below.)
We will incorporate these details in the revision. Once again, we extend our gratitude for your constructive suggestions.
| | KV-cache memory (MB) | Running memory (GB) |
|---|---|---|
| D-LLM () | 368.64 | 10.26 |
| D-LLM () | 276.48 | 7.92 |
| LLM | 512.00 | 13.90 |
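For reference, the 512 MB baseline figure is consistent with a back-of-the-envelope calculation from the public LLaMA-2-7B configuration (32 layers, 32 heads, head dimension 128, float16); the breakdown below is our own sketch, not the authors' accounting:

```python
# Full KV-cache for LLaMA-2-7B at a 1K context, stored in float16.
layers, heads, head_dim = 32, 32, 128       # hidden size 4096 = 32 heads * 128
context_len, bytes_per_value = 1024, 2      # float16 = 2 bytes
kv_bytes = 2 * layers * heads * head_dim * context_len * bytes_per_value  # keys + values
print(kv_bytes / 2**20)  # 512.0 MB; caching only ~54% of layers gives ~276 MB
```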
This manuscript introduces a new dynamic inference paradigm for LLMs called D-LLMs, which adaptively allocates computing resources during token processing. With the dynamic decision module, each network unit is decided to be executed or skipped on the fly. The KV-cache eviction policy is proposed to exclude skipped layers from subsequent calculations, reducing storage overhead while maintaining compatibility with KV-cache methods. The experimental results demonstrate notable reductions in computation and KV-cache storage.
Strengths
- The assumption that "not every word is equally important ..." is reasonable.
- The experiments showcase the notable reduction in computation cost.
Weaknesses
- Highlighting how D-LLMs differ significantly from or improve upon existing methods like AdaInfer and SkipNet would strengthen the novelty claim. Providing a more detailed comparison with these methods in terms of theoretical underpinnings or empirical performance could clarify the unique contributions of D-LLMs.
- The ablation on hyper-parameter \alpha in Eq. 14 is not provided. Please discuss its influence on task performance and computation reduction.
- The font in Fig. 2-4 is too small for visualization.
Questions
Please see weakness part.
Limitations
Yes.
We truly appreciate the reviewer for the valuable feedback. We have carefully considered the comments and suggestions and would like to address each of the concerns raised.
W1: Highlighting how D-LLMs differ significantly from or improve upon existing methods like AdaInfer and SkipNet would strengthen the novelty claim. Providing a more detailed comparison with these methods in terms of theoretical underpinnings or empirical performance could clarify the unique contributions of D-LLMs.
We make a detailed description of existing comparison methods in our paper as follows.
- ShortenLlaMA prunes unimportant layers based on designed metrics such as Taylor expansion and PPL, and then applies LoRA to finetune the pruned LLM on specific tasks. The pruning is static, which is fundamentally different from our dynamic inference method.
- AdaInfer is an early-exit method that stops inference at intermediate layers. It underperforms on complex tasks such as Q&A.
- Mixture-of-Depths selects the top-k tokens for computation only at specific layers. Our method provides a more comprehensive dynamic inference mechanism and performs better on various benchmarks.
- Dynamic inference mechanisms in computer vision, such as SkipNet, utilize layer skipping. Building on a similar mechanism, our method introduces the KV-cache eviction strategy to further reduce memory and computation overhead during LLM inference.
We will incorporate these details in the revision to provide a clearer clarification of our contributions.
W2: The ablation on hyper-parameter \alpha in Eq. 14 is not provided. Please discuss its influence on task performance and computation reduction.
We analyze the hyper-parameter α and discuss its influence on task performance and computation reduction. Details are available in global rebuttal #2.
W3: The font in Fig. 2-4 is too small for visualization.
We have increased the font size in Fig. 2-4 of our paper to enhance readability. The revised figures can be seen in Figure 2-4 of the PDF in the global rebuttal. We will update these figures in the revision.
Thank you for your detailed response, it addresses most of my concerns. I will raise my rating to weak accept.
We are pleased that we have addressed most of your concerns. Thank you for your kind response and support.
We are deeply grateful to the reviewers for their valuable feedback. We have carefully read the comments, and these insightful suggestions are crucial for enhancing our research. Given the limited length of individual rebuttals, we have chosen several key questions or concerns of interest to most reviewers for discussion in the global rebuttal.
The experiments and analysis in rebuttal are performed on L20 GPUs with D-LLM based on llama2-7b.
#1. Additional costs/overheads/latency for decision modules in D-LLM. (To Reviewer fCWH, 9XGZ, Y7Gp)
We provide extra overhead taken by decision modules of D-LLM in the table below, including information about training and inference. We give a detailed description as follows.
- Params/FLOPs overheads: The decision modules are parameter-efficient, accounting for only 1.0% of the LLM's parameters and 0.9% of the FLOPs of an LLM forward pass.
- Training memory overheads: During training, we store the parameters to be updated in float32 and the others in float16. The GPU memory usage of training LoRA on the LLM is about 26.2 GB, and training D-LLM takes about 32 GB, which means the extra memory cost of the decision modules is only 5.8 GB.
- Inference memory overheads: During inference, the additional GPU memory cost of the decision modules is only 0.3 GB. Furthermore, D-LLM requires less memory when generating each token, since layers decided to be skipped do not need to be loaded into GPU memory. For example, D-LLM with an acceleration rate of 50% requires only 7.4 GB during inference, significantly less than the 13.9 GB needed by the original LLM using all layers.
- Latency: As for inference latency, computing through one decision module costs only 0.1 ms, which is lightweight compared with the roughly 1.2 ms needed to compute through a transformer block. The acceleration gained by layer skipping is therefore significant. (A measurement sketch follows the table below.)
We will include the overhead analysis in the revision.
| | Params | FLOPs | Training memory (GB) | Inference memory (GB) | Latency per block (ms) |
|---|---|---|---|---|---|
| Decision Modules | 0.9% | 0.8% | 5.8 | 0.3 | 0.1 |
| LLM | 1 | 1 | 26.2(LoRA) | 13.9 | 1.2 |
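A minimal sketch of how the per-block latencies above could be measured, assuming PyTorch with CUDA events; `decision_module` and `transformer_block` are placeholders for the corresponding D-LLM components, not names from the released code:

```python
import torch

def gpu_time_ms(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Average GPU execution time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)                               # warm up kernels and caches
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# e.g. gpu_time_ms(decision_module, hidden_state)    # ~0.1 ms reported above
#      gpu_time_ms(transformer_block, hidden_state)  # ~1.2 ms reported above
```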
#2. Hyper-parameter analysis on α in Eq. 14. (To Reviewer RWhA, fCWH, 9XGZ, Y7Gp)
We use the hyper-parameter α as the weight of the acceleration-rate loss term in the loss function Eq. 14 (line 213) to control the importance of the acceleration-rate loss during training. α significantly influences performance on specific tasks and the ability to control the acceleration rate in D-LLM. Under the same training conditions, a higher value of α results in an actual acceleration rate closer to the target rate, while the performance of the model decreases because the cross-entropy loss becomes less important.
We perform the parameter analysis on MaWPS. In the table below, we show the performance of D-LLM under different settings of α, with the user-defined target acceleration rate ranging from 0.5 to 0.9. The results show that a larger α provides better control over the acceleration rate. For example, with the target acceleration rate set to 80%, the model trained with α=0.1 achieves only 47% actual acceleration, whereas with α=5, D-LLM achieves 72%. In addition, an excessively large α can decrease the performance of D-LLM, even when the trained models share the same computation overhead. For example, comparing the case (α=5, target 0.6) with the case (α=1, target 0.7), both trained models achieve approximately 60% acceleration; however, the accuracy in the α=5 case decreases by 11% compared with the α=1 case. We also provide a visualization of the table below as Figure 1 in the PDF in the global rebuttal.
We will include the analysis experiments in the appendix of the revision.
Values in table represent (FLOPs, Accuracy).
| Target rate | α=0.1 | α=1 | α=5 |
|---|---|---|---|
| 0.5 | (0.63, 0.75) | (0.56, 0.74) | (0.52, 0.62) |
| 0.6 | (0.57, 0.73) | (0.47, 0.72) | (0.42, 0.62) |
| 0.7 | (0.56, 0.75) | (0.40, 0.73) | (0.34, 0.61) |
| 0.8 | (0.53, 0.73) | (0.36, 0.71) | (0.28, 0.57) |
| 0.9 | (0.49, 0.72) | (0.33, 0.70) | (0.22, 0.50) |
This paper introduces a dynamic inference paradigm based on token importance for LLMs to reduce computation costs. The framework also includes an effective eviction policy that transforms the causal self-attention masks to make it compatible with KV-cache methods and save storage overhead at inference. Experiments show that the framework can notably reduce computational costs and KV-cache storage by up to 45% and 50% on various tasks.
Research on reducing computational costs and storage requirements for LLMs is critical and meaningful for making high-performance language models practical for a wider range of applications, and this paper offers a creative solution. Some reviewers raised concerns regarding the comparison between this approach and existing methods like AdaInfer and SkipNet in terms of theoretical underpinnings or empirical performance, and the lack of a deep theoretical analysis of the dynamic decision module's behavior and its impact on model performance. The authors made clarifications in their rebuttal with some extra results, which resolved some of the reviewers' concerns. The paper would benefit greatly from the reviewers' suggestions, and the revision would be stronger if these discussions are included.
Overall, this paper falls on the borderline of acceptance, leaning towards acceptance. While I think the paper does have some room for improvement, it would be beneficial to include it in NeurIPS if there is sufficient space for presentation.