PaperHub
NeurIPS 2024 · Poster · 5 reviewers
Overall rating: 5.6 / 10 (scores: 6, 4, 5, 7, 6; min 4, max 7, std 1.0)
Confidence: 3.8 · Correctness: 3.2 · Contribution: 2.6 · Presentation: 3.0

Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting

Submitted: 2024-05-06 · Updated: 2024-11-06

Abstract

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models (LLMs) while maintaining an identical sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly and impractical. In this paper, we propose a novel self-speculative decoding framework Kangaroo with double early exiting strategy, which leverages the shallow sub-network and the LM Head of the well-trained target LLM to construct a self-drafting model. Then, the self-verification stage only requires computing the remaining layers over the early-exited hidden states in parallel. To bridge the representation gap between the sub-network and the full model, we train a lightweight and efficient adapter module on top of the sub-network. One significant challenge that comes with the proposed method is that the inference latency of the self-draft model may no longer be negligible compared to the big model. To boost the token acceptance rate while minimizing the latency of the self-drafting model, we introduce an additional early exiting mechanism for both single-sequence and the tree decoding scenarios. Specifically, we dynamically halt the small model's subsequent prediction during the drafting phase once the confidence level for the current step falls below a certain threshold. This approach reduces unnecessary computations and improves overall efficiency. Extensive experiments on multiple benchmarks demonstrate our effectiveness, where Kangaroo achieves wall-time speedups up to 2.04$\times$, outperforming Medusa-1 with 88.7% fewer additional parameters. The code for Kangaroo is available at https://github.com/Equationliu/Kangaroo.
Keywords
speculative decoding, LLMs, self-drafting, early exiting

Reviews and Discussion

Review (Rating: 6)

This paper proposes a novel self-speculative decoding framework Kangaroo with a double early exiting strategy for accelerating LLM inference. It addresses the challenge of inference latency and shows effectiveness through extensive experiments, achieving significant speedups and outperforming the competitors with fewer parameters.

Strengths

  • The paper introduces a novel double early exiting strategy Kangaroo, which is a unique approach to improving the efficiency of LLMs without compromising performance.
  • The framework significantly accelerates inference by reducing the computational overhead associated with autoregressive decoding in LLMs, and achieves remarkable speedups compared to existing methods, with empirical results showing up to 2.04× wall-time speedups.
  • The proposed method can achieve lossless acceleration. The method maintains the same sampling distribution as the target LLM, ensuring that the quality of the generated text is not sacrificed for speed.

Weaknesses

  • This paper introduces an extension to tree decoding for Kangaroo, but the implementation details are not fully provided. More explanation would be helpful for readers to understand the mechanism.
  • The optimal settings for the early exit layer and the dynamic threshold η may require careful tuning, which could be resource-intensive and may not generalize across different tasks or datasets.
  • This paper focuses on Vicuna-7B and Vicuna-13B models. More experiments on a wider range of model sizes could strengthen the evidence of Kangaroo's scalability and generalizability.

Questions

See questions in the weaknesses.

Limitations

See questions in the weaknesses.

Author Response

We thank the reviewer for the detailed comments and valuable suggestions. We answer each of these comments below.

More explanation of the extension to tree decoding for Kangaroo

Thank you for your valuable suggestions. We will provide a more detailed formal expression and description of this extension in the final version. Kangaroo uses a static threshold to determine the timing of the second early stopping, based on the observation that the confidence of draft tokens in the small model is strongly correlated with their acceptance by the large model. Therefore, we approximate the probability of a token being accepted by the large model using the top-1 confidence of each token in the token tree. Considering the contextual dependency of tokens in speculative decoding, we model the probability of a token being accepted by the large model as the product of the top-1 confidences from the root node to that token. This approach is inspired by similar ideas in [1].

[1] Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding.
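As an illustration only, the sketch below shows how the path-product rule described above could be used to prune a token tree during drafting: a branch is kept only if the product of top-1 confidences along its root-to-node path stays above a threshold. The `DraftNode` structure, function name, and threshold handling are our own assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DraftNode:
    token_id: int
    top1_conf: float          # top-1 confidence of the drafting sub-network for this token
    children: List["DraftNode"] = field(default_factory=list)

def prune_tree(node: DraftNode, threshold: float, path_conf: float = 1.0) -> None:
    """Keep only branches whose estimated acceptance probability
    (product of top-1 confidences from the root) stays above `threshold`."""
    path_conf *= node.top1_conf
    node.children = [c for c in node.children if path_conf * c.top1_conf >= threshold]
    for child in node.children:
        prune_tree(child, threshold, path_conf)
```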

The optimal settings for the early exit layer and the dynamic threshold $\eta$ may require careful tuning

Please refer to section 1.1 in the global response.

More experiments on a wider range of model sizes.

To demonstrate the generalizability of Kangaroo’s approach (reusing shallow parameters + learning a lightweight adapter), we conducted experiments on Llama2 and Llama3 as suggested by the reviewer. The results are as follows:

| Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| Draft & Verify | 1.22× (2.61) | 1.02× (2.36) | 1.13× (2.84) | 1.08× (2.47) | 1.15× (2.44) | 1.12× (2.46) | 1.12× |
| SpS | 1.26× (1.63) | 1.34× (1.69) | 1.14× (1.47) | 1.34× (1.77) | 1.32× (1.81) | 1.28× (1.67) | 1.28× |
| Medusa* w/o Tree | 1.53× (1.85) | 1.27× (1.55) | 1.19× (1.48) | 1.36× (1.76) | 1.25× (1.54) | 1.43× (1.72) | 1.34× |
| Kangaroo w/o Tree | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |

Speedup comparison of various speculative decoding methods on Spec-Bench [22] for Llama2-13B-Chat. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR). * denotes a reproduced result. SpS uses Llama-68M as the draft model.

| Model | Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Instruct | Kangaroo w/o Tree | 1.46× (2.10) | 1.49× (2.21) | 1.47× (2.33) | 1.61× (2.51) | 1.38× (2.44) | 1.64× (2.44) | 1.51× |
| Llama-3-8B-Instruct | Kangaroo | 1.57× (2.32) | 1.62× (2.43) | 1.61× (2.61) | 1.92× (2.95) | 1.87× (2.85) | 1.93× (2.87) | 1.75× |

Comment

I thank the authors for the careful rebuttal. My concerns about unclear details and experiments are well addressed. Thus, I keep my original rating to accept this paper.

Comment

We sincerely appreciate the time you took to provide valuable comments on our paper.

Review (Rating: 4)

The authors introduce "Kangaroo," a novel self-speculative decoding framework designed to accelerate LLMs using a double early exiting strategy. This approach leverages the shallow sub-network and LM Head of the target LLM to construct a self-drafting model and employs a dynamic early exiting mechanism to enhance token acceptance rates and overall efficiency. The method shows promising results, achieving significant speedups and outperforming existing techniques with fewer additional parameters.

Strengths

  1. New Self-Speculative Decoding: The Kangaroo framework introduces a novel double early exiting strategy that effectively combines self-drafting and verification stages within the same model, significantly enhancing decoding efficiency.

  2. Efficiency Gains: Experimental results demonstrate substantial wall-time speedups over previous methods, with Kangaroo outperforming Medusa while using 88.7% fewer additional parameters.

  3. Dynamic Drafting: The introduction of a dynamic early exiting mechanism tailored to both single-sequence and tree decoding scenarios ensures optimal balance between token acceptance rate and drafting efficiency.

  4. Comprehensive Empirical Validation: Extensive experiments on Spec-Bench provide robust evidence of Kangaroo's superior performance across multiple tasks, including mathematical reasoning and retrieval-augmented generation.

Weaknesses

  1. Adapter-Network Design: The design of the adapter network is heuristic and seems verified only on Llama2 and Vicuna models. More justification and exploration across different model architectures would enhance the generalizability of the approach (e.g., new results for Gemma 1-2, Phi-3, Llama 3, etc.).

  2. Size of Shallow Network: The paper does not provide a rational basis for selecting the size of the shallow network. Clear criteria or experimental validation for this choice are needed.

  3. Task-Specific Performance: While Medusa performs better on translation tasks, Kangaroo excels in mathematical reasoning. A deeper discussion on the strengths and weaknesses of each method across different tasks would provide more insights.

  4. Training Time Justification: The justification for the chosen number of training epochs (10 epochs) compared to previous methods is unclear. A comparative analysis of training times and their impact on performance would be beneficial.

  5. Speedup Verification: Speedup claims are tricky and should be verified over multiple GPUs to ensure robustness. More comprehensive benchmarking in diverse hardware settings is required.

  6. Experimental Results with Temperature > 0: The paper lacks discussion on the impact of different temperature settings during inference. Additional experimental results with temperature > 0 would provide a more complete evaluation of the method.

Questions

  1. Adapter-Network Design:

    • Question: Can you provide more detailed justification and exploration for the design of the adapter network? Specifically, how did you determine the architecture (one multi-head attention layer and two normalization layers) and its suitability across different model architectures?
    • Suggestion: Include additional experiments verifying the effectiveness of the adapter network on a broader range of LLMs beyond Llama2 and Vicuna. This could enhance the generalizability of your approach.
  2. Size of Shallow Network:

    • Question: What criteria or experimental validations did you use to select the size of the shallow sub-network? Is there an optimal depth, and how does it vary with different models or tasks?
    • Suggestion: Provide a more rational basis or empirical analysis for the chosen size of the shallow network. Include ablation studies or sensitivity analyses showing the impact of varying the depth on performance and efficiency.
  3. Task-Specific Performance:

    • Question: Why does Medusa perform better on translation tasks compared to Kangaroo, and how does Kangaroo's performance in mathematical reasoning differ? Can you elaborate on the strengths and weaknesses of each method across different tasks?
    • Suggestion: Include a more detailed discussion and comparative analysis of the performance of Kangaroo and other methods across various tasks. Highlight the specific attributes that make Kangaroo excel in mathematical reasoning and discuss potential improvements for translation tasks.
  4. Training Time Justification:

    • Question: How did you justify the chosen number of training epochs (10 epochs) compared to previous methods? How does this training regimen impact efficiency and performance?
    • Suggestion: Provide a comparative analysis of training times and their impact on performance. Include benchmarks or references to previous methods to justify the training duration and its effectiveness.
  5. Speedup Verification:

    • Question: Have you verified the speedup claims over multiple GPUs? How does the performance scale with different hardware configurations?
    • Suggestion: Conduct and report additional experiments verifying the speedup over multiple GPUs and diverse hardware setups. This would strengthen the robustness and reliability of your speedup claims.
  6. Experimental Results with Temperature > 0:

    • Question: How does Kangaroo perform with different temperature settings during inference? Have you evaluated the impact of varying the temperature on token acceptance rates and overall efficiency?
    • Suggestion: Include experimental results and analysis with different temperature settings (e.g., temperature > 0) to provide a more comprehensive evaluation of the method. Discuss how temperature variations influence the performance and stability of Kangaroo.
  7. Generalizability and Scalability:

    • Question: How does the proposed method generalize to even larger models and more diverse application scenarios? Have you explored its scalability and potential limitations in real-world deployments?
    • Suggestion: Extend your experimental evaluations to include larger models and a wider range of application scenarios. Discuss any potential scalability issues and provide insights into how the method can be adapted or improved for broader applicability.

Limitations

See Weaknesses.

Author Response

We thank the reviewer for the detailed comments and valuable suggestions. We answer these comments below.

How did you determine the architecture and its suitability across different model architectures?

The iterative process of designing the adapter network in Kangaroo is reflected in Table 2 of our paper, as follows:

  1. Initial Approach with MLP: We initially considered using a simple MLP to map shallow features to the LM head. However, this approach did not yield satisfactory results as it failed to leverage token context effectively.
  2. Transformer Layer: Next, we replaced the MLP with a single transformer layer to enhance expressive power. This improved performance but still fell short of our expectations.
  3. Removing FFN Module: Observing that the FFN module in the LLM Decoder significantly contributes to latency without leveraging token context, we removed the FFN module from the transformer and introduced a new LM Head to maintain parameter count. This strategy increased the speedup from 1.37 to 1.44.
  4. Reusing LLM's Language Head: Finally, we found that reusing the LLM's own language head was highly effective, significantly boosting acceleration performance; a minimal sketch of this final design is shown below.
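To make the final design above concrete, here is a minimal, hypothetical PyTorch sketch of an adapter in the spirit described: one multi-head attention layer framed by two normalization layers, no FFN, with the target model's own LM head reused for drafting. Class names, parameter choices, the normalization type, and the `target_model.lm_head` attribute are our assumptions, not the Kangaroo codebase.

```python
import torch
import torch.nn as nn

class KangarooStyleAdapter(nn.Module):
    """Illustrative adapter: one multi-head attention block between two
    normalization layers, with no FFN. Hyperparameters are placeholders."""
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.post_norm = nn.LayerNorm(hidden_size)

    def forward(self, shallow_hidden: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # shallow_hidden: (batch, seq_len, hidden_size) early-exited hidden states
        x = self.pre_norm(shallow_hidden)
        out, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
        return self.post_norm(shallow_hidden + out)

# Drafting would then reuse the target model's frozen LM head (name illustrative):
# logits = target_model.lm_head(adapter(shallow_hidden))
```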

To demonstrate the generalizability of Kangaroo’s approach (reusing shallow parameters + learning a lightweight adapter), we conducted experiments on Llama2 and Llama3 as suggested by the reviewer. The results are as follows:

| Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| Draft & Verify | 1.22× (2.61) | 1.02× (2.36) | 1.13× (2.84) | 1.08× (2.47) | 1.15× (2.44) | 1.12× (2.46) | 1.12× |
| SpS | 1.26× (1.63) | 1.34× (1.69) | 1.14× (1.47) | 1.34× (1.77) | 1.32× (1.81) | 1.28× (1.67) | 1.28× |
| Medusa* w/o Tree | 1.53× (1.85) | 1.27× (1.55) | 1.19× (1.48) | 1.36× (1.76) | 1.25× (1.54) | 1.43× (1.72) | 1.34× |
| Kangaroo w/o Tree | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |

Speedup comparison of various speculative decoding methods on Spec-Bench [22] for Llama2-13B-Chat. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR). * denotes a reproduced result. SpS uses Llama-68M as the draft model.

| Model | Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Instruct | Kangaroo w/o Tree | 1.46× (2.10) | 1.49× (2.21) | 1.47× (2.33) | 1.61× (2.51) | 1.38× (2.44) | 1.64× (2.44) | 1.51× |
| Llama-3-8B-Instruct | Kangaroo | 1.57× (2.32) | 1.62× (2.43) | 1.61× (2.61) | 1.92× (2.95) | 1.87× (2.85) | 1.93× (2.87) | 1.75× |

Provide a more rational basis or empirical analysis for the chosen size of the shallow network.

Please refer to section 1.1 in the global response, where we conducted multiple comparative experiments across models with different architectures and sizes. It can be seen that deeper models (with depth $N$) have larger optimal early exit layers $\ell$, but the ratio $\ell/N$ is relatively consistent. This implies that the latency cost for the self-speculative small model compared to the full model is fairly stable. Therefore, we recommend setting $\ell/N$ between 1/16 and 1/10 in practical applications.

Regarding whether different tasks should have different optimal early exit layers $\ell$, we summarized the changes in speedup across different datasets with varying early exit layers for Kangaroo@Vicuna-7B, as shown in the table below:

| Exit Layer $\ell$ | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| 1 | 1.10× | 1.35× | 1.23× | 1.43× | 1.31× | 1.49× | 1.32× |
| 2 | 1.24× | 1.43× | 1.50× | 1.61× | 1.52× | 1.68× | 1.50× |
| 3 | 1.19× | 1.41× | 1.53× | 1.62× | 1.49× | 1.63× | 1.47× |
| 4 | 1.12× | 1.34× | 1.47× | 1.56× | 1.44× | 1.60× | 1.43× |
| 5 | 1.11× | 1.29× | 1.39× | 1.46× | 1.37× | 1.49× | 1.35× |

It can be seen that the optimal early exit layers for different subtasks are not exactly the same but are generally close.

Verifying the speedup over multiple GPUs and diverse hardware setups.

We conducted inference on Vicuna-33B using 4 NVIDIA V100 GPUs, achieving a speed of approximately 7.6 tokens per second. With Kangaroo, using early exit at the 5th layer and training the adapter, the inference speed increased to 11.4 tokens per second, resulting in a speedup of about 1.5x.

Sincerely,

Paper 2124 Authors

Comment

Thank you for the detailed rebuttal, and I appreciate the effort in addressing the comments. However, I still have concerns and a few follow-up questions and suggestions:

  1. Tree-Attention Experiments:
  • I noticed that your added results primarily focus on the non-tree attention variant of Kangaroo. Could you clarify why the tree attention variant was not included in the comparisons? My hypothesis is that adding tree attention might shift the computation from being memory-bound to more compute-bound, potentially diminishing some of Kangaroo's advantages. Could you provide insights or results regarding this?
  2. Temperature Sampling Experiments:
  • The lack of experiments involving temperature > 0 during inference remains a concern. In real-world applications, varying temperature settings are commonly used, and understanding how Kangaroo performs under these conditions is crucial. Could you elaborate on why these experiments were not conducted or share any preliminary results if available?
  3. Interpretation of the Ratio $\ell/N$:
  • You mentioned that the optimal early exit layers for different subtasks are "generally close." However, looking at the results, it seems that other methodologies might actually perform better in certain scenarios. Doesn't this indicate that the differences are more significant than "generally close"? Could you clarify this point?
  4. Task-Specific Performance:
  • The rebuttal tables highlight Kangaroo's strengths effectively, but I'm interested in understanding if there are scenarios or tasks where Kangaroo does not perform as well. Acknowledging these limitations is important, as no single method can excel in every task. It would be valuable to have a deeper discussion on where Kangaroo excels and where it might need further improvement. This would provide a more balanced and comprehensive perspective on the method's contributions.

I will definitely raise your score once I receive answers to these questions. However, for now, I will lower your score because I still have additional doubts and concerns.

Comment

Dear Reviewer J1nq,

Thank you for your ongoing efforts in helping us improve the quality of this manuscript. We greatly appreciate the time and attention you have dedicated.

We have responded to your latest comments. As the discussion period draws to a close, we would like to reach out to see if you have any remaining questions or unresolved issues. If everything is now clear, we would be grateful if you could consider updating your evaluation to reflect this.

Once again, thank you for your constructive feedback and for your invaluable contribution to the development of this manuscript. We look forward to hearing from you soon.

Comment

Interpretation of the Ratio $\ell/N$

By "generally close," we refer to cases where the early exit layer falls within our recommended robust range of $[1/16, 1/10]$. For the 32-layer Vicuna-7B, only the second and third layers meet this criterion. On Vicuna-7B, the best early exit layers for the Summarization and Math tasks fall on the third layer, but their optimal speedup performance (1.53× and 1.62×) is generally close to the speedup obtained when exiting at the second layer (1.50× and 1.61×). It's unfair to compare the performance of other methodologies against Kangaroo when it exits early at the fifth layer on Vicuna-7B.

Task-Specific Performance

After a thorough analysis, we could explain why Medusa performs better on translation tasks, while Kangaroo excels in mathematical reasoning, summarization, and retrieval-augmented generation from two perspectives:

  1. The usage of an auto-regressive adapter. It should be noted that the draft tokens in Kangaroo are generated in an auto-regressive manner, while Medusa uses time-independent FFN heads. Since Kangaroo can better leverage long-distance contextual relationships, it excels in subtasks with high similarities between input and output, such as summarization, retrieval-augmented generation, and mathematical reasoning. For the translation subtask, the input and output are the least similar.

  2. The depth of the feature used for drafting. Medusa uses second-to-top-layer features, while Kangaroo performs early exits at shallow layers and trains a lightweight adapter to generate draft tokens. As Reviewer Tuhh also pointed out [3,4], the shallow layers of LLM may learn relatively simple tasks, while deep-layer features learn more abstract capabilities. Therefore, for the more abstract translation task, Medusa shows higher accuracy, reflected in its compression rate (CR) being higher than Kangaroo's. On the other hand, Medusa's compression rate is lower than Kangaroo's in all other subtasks.

[1] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

[2] Benjamin Spector and Chris Re. 2023. Accelerating LLM inference with staged speculative decoding

[3] The Unreasonable Ineffectiveness of the Deeper Layers

[4] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

We would like to express our sincere appreciation again for the valuable suggestions from the reviewer, which have greatly enriched the quality of this manuscript.

Sincerely,

Paper 2124 Authors

Comment

Dear Reviewer J1nq,

Sorry to bother you again. We appreciate the time and attention you have dedicated to this manuscript.

As you mentioned in your last comment, "I will definitely raise your score once I receive answers to these questions." With only one day left in the discussion period, we are eager to hear your feedback on whether our recent response has addressed your concerns.

If you have any remaining concerns, please do not hesitate to let us know; we are more than happy to clarify and respond. Engaging in this discussion with you has been a rewarding experience, and your feedback has significantly improved the quality of this manuscript.

We look forward to your feedback.

Best regards,

Authors

Comment

Thank you for your responses. I'm glad to see some of my concerns addressed, particularly the explanation on task-specific performance, which makes sense.

I still believe it would be beneficial to compare temperature sampling results more extensively with other methods and explore further differences with Medusa. I’ll raise the score to 5 and discuss these points further in the reviewer-AC discussion.

Comment

We have conducted experiments for the case where temperature > 0 and explored a comparative analysis of the performance of Kangaroo and other methods across various tasks. We hope that the following responses can address your concerns.

Tree-Attention Experiments

In most of the rebuttal tables, we only reported the non-tree attention variant of Kangaroo. This is because most of the reviewers' concerns in the rebuttal focused on the robustness of Kangaroo's two hyperparameters (i.e., $\eta$ and $\ell$), which are not closely related to the use of tree attention. Therefore, in the comparison table for Llama2-13B-Chat, we compared Kangaroo and three other algorithms (Draft & Verify [1], SpS, Medusa-1) that also do not use tree attention. It's indeed true that adding tree attention might shift the computation from being memory-bound to more compute-bound. However, there is no evidence that tree attention might potentially diminish some of Kangaroo's advantages. To alleviate your concerns, we have summarized the performance of Kangaroo with and without tree attention across different model architectures:

| Model | Tree | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Vicuna-7B | False | 1.24× (1.41) | 1.43× (1.87) | 1.50× (1.87) | 1.61× (2.14) | 1.52× (2.05) | 1.68× (2.22) | 1.50× |
| Vicuna-7B | True | 1.43× (1.76) | 1.71× (2.32) | 1.68× (2.31) | 2.04× (2.76) | 1.75× (2.37) | 1.93× (2.67) | 1.72× |
| Llama2-13B-Chat | False | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |
| Llama2-13B-Chat | True | 1.62× (2.23) | 1.67× (2.07) | 1.64× (2.15) | 1.76× (2.53) | 1.63× (2.37) | 1.81× (2.59) | 1.69× |
| Llama-3-8B-Instruct | False | 1.46× (2.10) | 1.49× (2.21) | 1.47× (2.33) | 1.61× (2.51) | 1.38× (2.44) | 1.64× (2.44) | 1.51× |
| Llama-3-8B-Instruct | True | 1.57× (2.32) | 1.62× (2.43) | 1.61× (2.61) | 1.92× (2.95) | 1.87× (2.85) | 1.93× (2.87) | 1.75× |

The second column Tree denotes whether tree attention is used. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR).

Temperature Sampling Experiments

In line 118 of the main paper, we claimed that in this work we focus on greedy decoding while our methodology can be easily extended to the rejection sampling case. To further address your concern, we followed the setting in Draft & Verify and set temperature = 0.2 to conduct two sets of comparative experiments on Kangaroo (without tree attention). One set used the original top-1 confidence as the variable to determine early stopping, while the other used the adjusted top-1 confidence (the softmax of the logits divided by the temperature). The results are shown in the table below:

| Temperature | Confidence | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| 0.0 | Original | 1.24× (1.41) | 1.43× (1.87) | 1.50× (1.87) | 1.61× (2.14) | 1.52× (2.05) | 1.68× (2.22) | 1.50× |
| 0.2 | Original | 1.23× (1.41) | 1.41× (1.88) | 1.48× (1.88) | 1.58× (2.19) | 1.50× (2.01) | 1.67× (2.21) | 1.48× |
| 0.2 | Adjusted | 1.04× (1.45) | 1.25× (1.90) | 1.31× (2.00) | 1.48× (2.32) | 1.40× (2.18) | 1.53× (2.41) | 1.34× |
| 0.5 | Original | 1.18× (1.41) | 1.38× (1.86) | 1.43× (1.88) | 1.55× (2.21) | 1.46× (2.01) | 1.62× (2.23) | 1.43× |

From the analysis of the table, it can be seen that Kangaroo still achieves a speedup very similar to that of greedy decoding when $T = 0.2$. Moreover, the top-1 confidence used for the second early stopping mechanism should remain unadjusted, because when using temperatures less than 1, the adjusted top-1 confidence tends to be overestimated, making it more difficult to trigger the early stopping mechanism (even if we use a large threshold of $\eta = 0.96$ for the adjusted confidence). Besides, the acceleration effect of Kangaroo decreases with an increase in sampling temperature. This is attributed to the increased computational complexity of the speculative sampling criterion at higher temperatures, as revealed in prior research [2].
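The two confidence variants compared in the table can be illustrated with a small sketch; the names `original` and `adjusted` are ours, and the computation is only our reading of the description above, not the released code.

```python
import torch

def top1_confidences(logits: torch.Tensor, temperature: float = 0.2):
    """Return the two top-1 confidence variants discussed above.
    'original' uses the unscaled distribution; 'adjusted' rescales the
    logits by the sampling temperature before the softmax."""
    original = torch.softmax(logits, dim=-1).max(dim=-1).values
    adjusted = torch.softmax(logits / temperature, dim=-1).max(dim=-1).values
    return original, adjusted

# For temperature < 1 the adjusted confidence is sharper (closer to 1),
# so a fixed threshold eta triggers the early stop less often.
```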

Comment

Dear Reviewer J1nq,

We greatly appreciate your ongoing efforts in helping us improve the quality of this manuscript. We will add more baseline methods and conduct a comprehensive comparison for temperature sampling in the final version as suggested.

We would like to point out that Medusa uses typical sampling when the temperature > 0, which is a lossy method that does not guarantee consistency with the original model's sampling distribution. In contrast, Kangaroo is a lossless speculative decoding approach that exhibits a larger speedup ratio than Medusa. As claimed in Section 2.3.1 of Medusa [1]: "We ascertain that it is typically unnecessary to match the distribution of the original model. Thus, we propose employing a typical acceptance scheme to select plausible candidates rather than using rejection sampling."

[1] MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

By the way, as you mentioned, could you please adjust the score to 5 in your review?

Sincerely,

Paper 2124 Authors

Comment

I updated the score.

To all reviewers: Please respect the authors' efforts by providing feedback. Regardless of score changes, they put significant work into their rebuttals. If you're also an author, remember that we should all support each other by offering constructive comments. This applies to all submissions.

Review (Rating: 5)

The authors propose a new method to speed up large language model inference called self-speculative decoding. Instead of training a separate, costly draft model to maintain token acceptance rates, Kangaroo uses a shallow sub-network of the large model itself as the draft model. A lightweight adapter module is trained to bridge the representation gap between the sub-network and the full model. To improve efficiency, the method introduces an early exiting mechanism, halting predictions if the confidence level for the current token is too low. Experiments show that Kangaroo achieves up to 1.68× speedup with significantly fewer additional parameters compared to another method, Medusa-1.

Strengths

  • Compared to the prior work Medusa-1, the model is extremely lightweight (88.7% fewer parameters) while still outperforming it.

  • The proposed auto-regressive self-drafting model design is novel and low-cost. Their simple adapter network design sounds like a kick. Verification on the large language model can start where the self-drafting model has stopped, and this can be beneficial as well.

  • Self-drafting models can share KV cache thus can be memory efficient.

  • Well-written paper.

Weaknesses

  • It will be nice to evaluate more LLM families other than Vicuna.

  • Evaluation is not comprehensive. It misses evaluations of the double early exiting effect – see Questions.

Questions

  • How frequently does the top-1 probability of the small model go below the predefined threshold? Do you have some observations on the efficacy of double early exiting?

Limitations

  • Requires some hyperparameter search (e.g., the exit layer $\ell$)
Author Response

We thank the reviewer for the detailed comments and answer each of these comments below.

Evaluate more LLM families other than Vicuna

To validate the robustness of Kangaroo across different architectures, we additionally trained adapter networks on Llama2-13B-Chat and Llama-3-8B-Instruct and evaluated the performance on Spec-Bench. The results are shown in the tables below:

| Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| Draft & Verify | 1.22× (2.61) | 1.02× (2.36) | 1.13× (2.84) | 1.08× (2.47) | 1.15× (2.44) | 1.12× (2.46) | 1.12× |
| SpS | 1.26× (1.63) | 1.34× (1.69) | 1.14× (1.47) | 1.34× (1.77) | 1.32× (1.81) | 1.28× (1.67) | 1.28× |
| Medusa* w/o Tree | 1.53× (1.85) | 1.27× (1.55) | 1.19× (1.48) | 1.36× (1.76) | 1.25× (1.54) | 1.43× (1.72) | 1.34× |
| Kangaroo w/o Tree | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |

Speedup comparison of various speculative decoding methods on Spec-Bench [22] for Llama2-13B-Chat. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR). * denotes a reproduced result. SpS uses Llama-68M as the draft model.

| Model | Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Instruct | Kangaroo w/o Tree | 1.46× (2.10) | 1.49× (2.21) | 1.47× (2.33) | 1.61× (2.51) | 1.38× (2.44) | 1.64× (2.44) | 1.51× |
| Llama-3-8B-Instruct | Kangaroo | 1.57× (2.32) | 1.62× (2.43) | 1.61× (2.61) | 1.92× (2.95) | 1.87× (2.85) | 1.93× (2.87) | 1.75× |

How frequently does the top-1 probability of the small model go below the predefined threshold? Do you have some observations on the efficacy of double early exiting?

The frequency with which the small model's top-1 confidence falls below the predefined threshold varies depending on the model and the evaluation dataset. For instance, in Figure 4 of the main paper, about 40% of tokens have a confidence lower than the threshold $\eta$, but most of these tokens are "hard tokens" that would not be accepted by the large model. More confidence distribution graphs are available in the Figures in the global response PDF.

The purpose of the second early exit mechanism is to reduce the time spent on difficult tokens, allowing the model to speculate further on simpler token sequences. As shown in Figure 6(b), when the stride $\gamma = 6$ but the dynamic early exit mechanism is not used ($\eta = 0$), the speedup is 1.2. When $\eta = 0.6$, the speedup increases to 1.5, indicating that the second early exit mechanism can provide approximately a 25% performance improvement.
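For concreteness, the following is a hypothetical sketch of a drafting loop with this second, dynamic early exit: drafting halts once the self-draft model's top-1 confidence drops below $\eta$ or the stride is exhausted. The `draft_step` callable, the default values, and the exact stopping convention are our assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def draft_with_dynamic_exit(draft_step, input_ids, eta: float = 0.6, max_steps: int = 6):
    """Propose tokens with the self-draft model until its top-1 confidence
    drops below `eta` or `max_steps` (the stride gamma) is reached.
    `draft_step` is assumed to map the current (1, seq_len) ids to a
    next-token probability vector of shape (vocab_size,)."""
    draft_tokens, confidences = [], []
    for _ in range(max_steps):
        probs = draft_step(input_ids)
        conf, token = probs.max(dim=-1)
        if conf.item() < eta:            # second (dynamic) early exit
            break
        draft_tokens.append(token.item())
        confidences.append(conf.item())
        input_ids = torch.cat([input_ids, token.view(1, 1)], dim=-1)
    return draft_tokens, confidences
```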

Requires some hyperparameter search (e.g., the exit layer $\ell$)

We argue that in practical applications, the early exit layer $\ell$ can be set based on an empirical ratio derived from the target model's depth. To determine the optimal depth for shallow sub-networks relative to the full model depth, we conducted multiple comparative experiments across models with different architectures and sizes. The table below records the average speedup achieved on Spec-Bench with early exits at various depths, and the final column shows the optimal ratio of the early exit layer to the total model depth.

| Model | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Optimal Ratio |
|---|---|---|---|---|---|---|
| Vicuna-7B (32 layers) | 1.32× @ layer 1 | 1.50× @ layer 2 | 1.47× @ layer 3 | 1.43× @ layer 4 | 1.35× @ layer 5 | 2 / 32 = 0.0625 |
| Vicuna-13B (40 layers) | 1.26× @ layer 1 | 1.36× @ layer 2 | 1.44× @ layer 3 | 1.34× @ layer 4 | 1.31× @ layer 5 | 3 / 40 = 0.075 |
| Vicuna-33B (60 layers) | - | 1.44× @ layer 4 | 1.49× @ layer 5 | 1.38× @ layer 6 | - | 5 / 60 = 0.083 |
| Llama2-13B-Chat (40 layers) | 1.18× @ layer 1 | 1.42× @ layer 2 | 1.45× @ layer 3 | 1.39× @ layer 4 | 1.27× @ layer 5 | 3 / 40 = 0.075 |

It can be seen that deeper models (with depth $N$) have larger optimal early exit layers $\ell$, but the ratio $\ell/N$ is relatively consistent. This implies that the latency cost for the self-speculative small model compared to the full model is fairly stable. Therefore, we recommend setting $\ell/N$ between 1/16 and 1/10 in practical applications.

Comment

Dear Reviewer oUCN,

With the discussion phase nearing the end, we would like to know whether the responses have addressed your concerns.

Should this be the case, we are encouraged that you raise the final rating to reflect this.

If there are any remaining concerns, please let us know. We are more than willing to engage in further discussion and address any remaining concerns to the best of our abilities.

We are looking forward to your reply. Thank you for your efforts in this paper.

Best regards,

Authors

Comment

I appreciate the authors' feedback and acknowledge the generalizability of the method as well as the impact of its various hyperparameters. I will maintain my original rating for now but am open to revising the score if necessary during the reviewer-AC discussion phase.

Comment

Dear Reviewer oUCN,

Thank you for your continued efforts in helping us improve the quality of this manuscript. We greatly appreciate the time and attention you have dedicated to this process.

Sincerely,

Paper 2124 Authors

Review (Rating: 7)

The paper presents Kangaroo, a novel self-speculative decoding framework designed to accelerate the inference of LLMs while maintaining an identical sampling distribution: a self-speculative decoding method that uses the model's own sub-network as the speculative small model. The drafting stage is dynamic, based on the confidence level for the current step. To fully utilize the GPU, it proposes two decoding modes for the verification phase: single-sequence and tree decoding. Extensive experiments are done to demonstrate the effectiveness of the method.

Strengths

  1. This is a pretty novel speculative decoding method, smartly utilizing self-drafting by the model's own layers. It resolves both the issue of having to accommodate another drafting model by reducing extra parameters needed and the accuracy of the self-drafted model.
  2. Extensive experiments on multiple tasks and baselines
  3. The proposed tree decoding method to enhance parallelism and improve GPU utilization is also pretty interesting
  4. Clear presentation and explanation

Weaknesses

  1. This paper seems largely similar to: Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding. What is the main difference?

Questions

Is the optimal layer dataset specific or domain specific? The early exit strategy adopted in self-drafting here reminds me of the depth of learnt knowledge/concepts in LLMs [1, 2]. Thus, in the ablation study on the depth of the shallow sub-network, I wonder how general the results are. Also, is it possible that the early exit layer also becomes dynamic?

[1] The Unreasonable Ineffectiveness of the Deeper Layers
[2] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

Limitations

N/A

Author Response

We thank the reviewer for the detailed comments and answer each of these comments below.

The main difference between Kangaroo and Draft & Verify

Both Draft & Verify and Kangaroo are self-speculative decoding algorithms that construct a small model by reusing the parameters of the original model. However, the starting points differ: Draft & Verify aims to approximate the full model by removing redundant middle layers, leaving a small model with about half the layers of the original model. This creates a trade-off between the small model's accuracy and inference latency. In contrast, Kangaroo is designed to minimize the layers of shared parameters while maintaining small model accuracy. Therefore, Kangaroo reuses only the initial shallow layers of the LLM, which are crucial for feature extraction [3]. Additionally, the lightweight adapter in Kangaroo is designed to enhance the small model's accuracy without significantly impacting latency.

[3] Tang Y, Liu F, Ni Y, et al. Rethinking optimization and architecture for tiny language models[J]. ICML, 2024.

For the benefits of reusing the initial layers, please also refer to the question "Why starting from shallow layers works better?" raised by Reviewer NQcw.

Is the optimal layer dataset specific or domain specific?

Thank you for pointing out this insightful question and for the valuable references. According to [2], LLMs learn different difficulty levels of tasks at different layers. This raises the question of whether the optimal early exit layer varies across datasets (domains). To address this, we summarized the changes in speedup across different datasets with varying early exit layers for Kangaroo@Vicuna-7B, as shown in the table below:

| Exit Layer $\ell$ | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| 1 | 1.10× | 1.35× | 1.23× | 1.43× | 1.31× | 1.49× | 1.32× |
| 2 | 1.24× | 1.43× | 1.50× | 1.61× | 1.52× | 1.68× | 1.50× |
| 3 | 1.19× | 1.41× | 1.53× | 1.62× | 1.49× | 1.63× | 1.47× |
| 4 | 1.12× | 1.34× | 1.47× | 1.56× | 1.44× | 1.60× | 1.43× |
| 5 | 1.11× | 1.29× | 1.39× | 1.46× | 1.37× | 1.49× | 1.35× |

It can be seen that the optimal early exit layers for different subtasks are not exactly the same but are generally close.

Is it also possible that the early exit layer also becomes dynamic?

Early exiting methods often exit at different layers based on the difficulty of the samples [4]. These methods save inference costs by exiting earlier for easier samples. However, in self-speculative inference, the small model only needs to align with the large model on simple tokens. Pursuing higher accuracy by exiting at deeper layers increases the small model's inference cost. Therefore, we chose to exit at a fixed shallow layer to balance inference cost and framework simplicity. Nevertheless, considering different exit layers within the shallow sub-network for different tokens is a valuable direction for future research.

[4] Teerapittayanon S, McDanel B, Kung H T. Branchynet: Fast inference via early exiting from deep neural networks[C]//2016 23rd international conference on pattern recognition (ICPR). IEEE, 2016: 2464-2469.

Sincerely,

Paper 2124 Authors

Comment

Dear Reviewer Tuhh,

With the discussion phase nearing the end, we would like to know whether the responses have addressed your concerns.

Should this be the case, we are encouraged that you raise the final rating to reflect this.

If there are any remaining concerns, please let us know. We are more than willing to engage in further discussion and address any remaining concerns to the best of our abilities.

We are looking forward to your reply. Thank you for your efforts in this paper.

Best regards,

Authors

Review (Rating: 6)

The authors propose a new self-speculative decoding framework, Kangaroo, for accelerating large language models (LLMs) by leveraging a double early exiting strategy. Kangaroo is able to enhance inference efficiency without the need for a separate draft model. It utilizes the shallow sub-network and the LM Head of the target LLM to construct a self-drafting model and introduces a dynamic token-level early exiting mechanism to minimize latency. The proposed approach achieves significant speedups, up to 2.04x, compared to existing methods, with substantially fewer additional parameters. The paper provides empirical validation mainly on Spec-Bench for Vicuna, showing Kangaroo achieves better wall-clock speedup.

Strengths

  1. The paper is well-written and clearly presents the proposed framework. The related works are well-discussed, and the proposed method is well-motivated.

  2. Novelty: the proposed token-level early exiting mechanism is interesting. It dynamically adjusts the decoding process based on the confidence levels of the predictions, which leads to better inference efficiency.

  3. The authors conduct extensive experiments on multiple benchmarks, providing a thorough comparison with state-of-the-art methods.

Weaknesses

  1. Insufficient explanation and comparison to existing early-exiting methods. The authors suggest that their work differs from the previous early exiting methods [1] [2] by using early exiting from shallow layers. However, the paper lacks a detailed explanation of the differences and a comparison with these methods empirically. In Table 1, the main evaluation doesn't include the comparison with these early exiting methods. It would be great to have a detailed comparison.

  2. Double early exiting strategy involves additional hyperparameters, such as the exiting layer and the threshold for early exiting. Although the authors provide ablation studies on the selection of hyperparameters, it would be helpful to discuss their sensitivity across different models and tasks.

[1] Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & Verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168, 2023.
[2] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, and Sophia Shao. SPEED: Speculative pipelined execution for efficient decoding. arXiv preprint arXiv:2310.12072, 2023.

Questions

  1. According to Figure 4, there seems to be no clear separation between "Accept" and "Reject" draft tokens. This makes choosing an appropriate threshold difficult. Do different downstream tasks have different favorable thresholds and distributions?

  2. How does the proposed double early exiting mechanism differ fundamentally from existing early exiting strategies in speculative decoding? Why starting from shallow layers works better?

  3. How robust is Kangaroo in terms of hyperparameters under different models and tasks?

Limitations

  1. Reproducibility: the paper doesn't provide source code, which makes it difficult to reproduce the results.

  2. Evaluation: the paper mainly evaluates the proposed method on Spec-Bench for Vicuna. It would be great to see more evaluation on other benchmarks and models to provide a more comprehensive comparison.

Author Response

We thank the reviewer for the detailed comments and answer each of these comments below.

1. Insufficient explanation and comparison to existing early-exiting methods.

We highlight the fundamental differences between Kangaroo and existing early-exiting methods in the following table:

| Method | Early Exiting | Architecture | Plug and Play |
|---|---|---|---|
| Kangaroo | shallow layers | General LLM | True |
| Draft & Verify [1] | skipping redundant layers | General LLM | True |
| SPEED [2] | speculative pipelined execution | Parameter-sharing LLM | False |

SPEED specifically targets parameter-sharing (block-reuse) architectures, combining speculative decoding with pipelined execution for inference acceleration. Parameter sharing allows deeper decoders to enhance model performance. The combination of speculative decoding and pipelined execution offers SPEED a speed advantage over larger models without parameter sharing. However, Draft & Verify and Kangaroo are architecture-agnostic and can be plugged into mainstream large language models (LLMs). Draft & Verify employs Bayesian optimization to pre-select less critical layers in the target LLM for skipping, while Kangaroo implements early exits in the continuous shallow layers of the target LLM and uses a lightweight adapter to bridge the expressiveness gap between the self-speculative small model and the large model. Due to SPEED's limited applicability and the lack of released code, we focus on a detailed performance comparison between Kangaroo and Draft & Verify.

| Model | Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| Vicuna-7B | Draft & Verify | 1.22× (2.61) | 1.02× (2.36) | 1.13× (2.84) | 1.08× (2.47) | 1.15× (2.44) | 1.12× (2.46) | 1.12× |
| Vicuna-7B | Kangaroo | 1.24× (1.41) | 1.43× (1.87) | 1.50× (1.87) | 1.61× (2.14) | 1.52× (2.05) | 1.68× (2.22) | 1.50× |
| Llama2-13B-Chat | Draft & Verify | 1.11× (2.34) | 1.06× (2.07) | 1.02× (2.25) | 1.08× (2.39) | 1.05× (2.32) | 1.07× (2.34) | 1.07× |
| Llama2-13B-Chat | Kangaroo | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |

We used the official code from the Draft & Verify repository to optimize the redundant layers for Vicuna-7B and used the official optimized layers for Llama2-13B-Chat. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR).

While Draft & Verify achieves a higher compression rate, indicating the retained layers in the draft model better approximate the full model, the high latency of the small model limits its end-to-end speedup.

2. Why starting from shallow layers works better?

Theoretical analysis of speculative inference [3] highlights two key factors: the accuracy of the small model and its inference latency. Although Draft & Verify benefits from higher accuracy by leveraging deeper features, its small model incurs a high cost, almost half that of the large model (compared to about 1/10 in Kangaroo), limiting its overall speedup.

The reasons for the effectiveness of shallow layers in Kangaroo may include:

  • The initial layers directly connect token embeddings to subsequent decoder layers, playing a critical role in feature representation. References provided by Reviewer Tuhh [4, 5] suggest that the shallow layers of LLMs are sufficient for understanding simple tasks, aligning with the requirement for speculative decoding where the draft model needs only to align with the large model on simple tokens.
  • The autoregressive adapter used in Kangaroo enhances the expressiveness of shallow layers, and its lightweight design reduces inference latency.

[3] Leviathan Y, Kalman M, Matias Y. Fast inference from transformers via speculative decoding[C]. ICML 2023.

[4] The Unreasonable Ineffectiveness of the Deeper Layers

[5] Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

3. Do different downstream tasks have different favorable thresholds and distributions?

As shown in Figure 4 in the main paper, "Accept" and "Reject" draft tokens cannot be perfectly separated by a fixed threshold. However, selecting the intersection of these distributions as the threshold allows us to retain a high proportion of tokens likely to be accepted while minimizing the proportion of tokens likely to be rejected. To explore the sensitivity of this threshold, we visualized the conditional distributions of the small model's Top-1 confidence across different model architectures and subtasks (see Figures in the global response PDF). The confidence distributions across different subtasks suggest that selecting $\eta \in [0.6, 0.8]$ is a robust choice.

4. Robustness and sensitivity of Kangaroo's hyperparameters

Please refer to the common concerns in the global response.

Reproducibility

We will make our training and evaluation code publicly available in the camera-ready version.

Comment

Dear Reviewer NQcw,

With the discussion phase nearing the end, we would like to know whether the responses have addressed your concerns.

Should this be the case, we are encouraged that you raise the final rating to reflect this.

If there are any remaining concerns, please let us know. We are more than willing to engage in further discussion and address any remaining concerns to the best of our abilities.

We are looking forward to your reply. Thank you for your efforts in this paper.

Best regards,

Authors

Comment

Thanks for your detailed response, and this addresses most of my concerns. I am happy to raise my score based on the rebuttal and discussion with ACs and other reviewers (especially the upcoming response from reviewer J1nq).

Comment

Dear Reviewer NQcw,

Thank you for your continued efforts in helping us improve the quality of this manuscript. We greatly appreciate the time and attention you have dedicated to this process.

You mentioned that you would consider raising your score based on the upcoming response from Reviewer J1nq. We are pleased to report that Reviewer J1nq has expressed satisfaction with our response, indicating that the concerns have been also adequately addressed.

As the discussion period draws to a close, we would like to reach out to see if you have any remaining questions or unresolved issues. If everything is now clear, we would be grateful if you could consider updating your evaluation to reflect this.

Once again, thank you for your constructive feedback and for your invaluable contribution to the development of this manuscript.

We look forward to hearing from you soon.

Sincerely,

Paper 2124 Authors

Author Response

We thank all the reviewers for their insightful comments and valuable feedback. We are encouraged by the positive reception of Kangaroo, as reflected in comments like "The proposed method is well-motivated" and "the proposed token-level early exiting mechanism is interesting" from Reviewer NQcw, "This is a pretty novel speculative decoding method, smartly utilizing self-drafting by the model's own layers." from Reviewer Tuhh, and "Their simple adapter network design sounds like a kick." from Reviewer oUCN.

Below, we briefly summarize our response to the common concerns raised by reviewers.

1. Robustness and sensitivity of Kangaroo's hyperparameters across different models

Both Reviewer xb3n and Reviewer oUCN raise concerns that the optimal early exit layer $\ell$ and the early-stopping threshold $\eta$ may require careful tuning. We argue that in practical applications, the early exit layer $\ell$ can be set based on an empirical ratio derived from the target model's depth, while the optimal threshold $\eta$ is robust across different model architectures and datasets.

1.1 The choice of the exit layer $\ell$

To determine the optimal depth for shallow sub-networks relative to the full model depth, we conducted multiple comparative experiments across models with different architectures and sizes. The table below records the average speedup achieved on Spec-Bench with early exits at various depths, and the final column shows the optimal ratio of the early exit layer to the total model depth.

| Model | Group 1 | Group 2 | Group 3 | Group 4 | Group 5 | Optimal Ratio |
|---|---|---|---|---|---|---|
| Vicuna-7B (32 layers) | 1.32× @ layer 1 | 1.50× @ layer 2 | 1.47× @ layer 3 | 1.43× @ layer 4 | 1.35× @ layer 5 | 2 / 32 = 0.0625 |
| Vicuna-13B (40 layers) | 1.26× @ layer 1 | 1.36× @ layer 2 | 1.44× @ layer 3 | 1.34× @ layer 4 | 1.31× @ layer 5 | 3 / 40 = 0.075 |
| Vicuna-33B (60 layers) | - | 1.44× @ layer 4 | 1.49× @ layer 5 | 1.38× @ layer 6 | - | 5 / 60 = 0.083 |
| Llama2-13B-Chat (40 layers) | 1.18× @ layer 1 | 1.42× @ layer 2 | 1.45× @ layer 3 | 1.39× @ layer 4 | 1.27× @ layer 5 | 3 / 40 = 0.075 |

It can be seen that deeper models (with depth $N$) have larger optimal early exit layers $\ell$, but the ratio $\ell/N$ is relatively consistent. This implies that the latency cost for the self-speculative small model compared to the full model is fairly stable. Therefore, we recommend setting $\ell/N$ between 1/16 and 1/10 in practical applications.
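As a convenience, the rule of thumb above can be expressed as a tiny helper that lists candidate exit layers for a given depth. This is a heuristic sketch of our own, not part of the paper's code.

```python
def suggested_exit_layers(num_layers: int, lo: float = 1 / 16, hi: float = 1 / 10):
    """Candidate early-exit layers from the ratio rule of thumb above."""
    return [l for l in range(1, num_layers + 1) if lo <= l / num_layers <= hi]

# e.g. suggested_exit_layers(32) -> [2, 3]; suggested_exit_layers(40) -> [3, 4]
```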

1.2 The early-stopping threshold $\eta$

Kangaroo uses a static threshold to determine the timing of the second early stopping, based on the observation that the confidence of draft tokens in the small model is strongly correlated with their acceptance by the large model. For example, Figure 4 in the main paper shows that the Top-1 confidence distribution of the speculative small model on the mathematical reasoning subtask exhibits a clear bimodal distribution. Selecting the intersection of these distributions as the threshold allows us to retain a high proportion of tokens likely to be accepted while minimizing the proportion of tokens likely to be rejected. To explore the sensitivity of this threshold, we visualized the conditional distributions of the small model's Top-1 confidence across different model architectures and subtasks (see Figures in the global response PDF). Interestingly, we found this threshold to be stable (ranging from 0.6 to 0.8) across various models and subtasks.
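A rough sketch of how such an intersection-based threshold could be estimated offline from collected top-1 confidences is given below; the binning scheme, crossing rule, and function name are our own simplifications, not the procedure used in the paper.

```python
import numpy as np

def intersection_threshold(conf_accepted, conf_rejected, bins: int = 50) -> float:
    """Estimate eta at the crossing point of the two empirical top-1
    confidence distributions (accepted vs. rejected draft tokens).
    Inputs are 1-D arrays of confidences collected offline."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    acc_hist, _ = np.histogram(conf_accepted, bins=edges, density=True)
    rej_hist, _ = np.histogram(conf_rejected, bins=edges, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # first bin where the accepted density overtakes the rejected density
    crossing = np.argmax(acc_hist - rej_hist > 0)
    return float(centers[crossing])
```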

2. More comprehensive comparison

Following the suggestions from Reviewer xb3n and Reviewer oUCN, we conducted experiments on a wider range of model architectures and larger models on Spec-Bench by including Llama2-13B-Chat.

| Method | Translation | QA | Summarization | Math | RAG | MT Bench | Avg. |
|---|---|---|---|---|---|---|---|
| Draft & Verify | 1.22× (2.61) | 1.02× (2.36) | 1.13× (2.84) | 1.08× (2.47) | 1.15× (2.44) | 1.12× (2.46) | 1.12× |
| SpS | 1.26× (1.63) | 1.34× (1.69) | 1.14× (1.47) | 1.34× (1.77) | 1.32× (1.81) | 1.28× (1.67) | 1.28× |
| Medusa* w/o Tree | 1.53× (1.85) | 1.27× (1.55) | 1.19× (1.48) | 1.36× (1.76) | 1.25× (1.54) | 1.43× (1.72) | 1.34× |
| Kangaroo w/o Tree | 1.46× (1.97) | 1.40× (1.87) | 1.35× (1.97) | 1.52× (2.22) | 1.36× (2.05) | 1.58× (2.28) | 1.45× |

Speedup comparison of various speculative decoding methods on Spec-Bench [22] for Llama2-13B-Chat. Values outside the parentheses indicate speedup, while those inside parentheses indicate the compression rate (CR). * denotes a reproduced result. SpS uses Llama-68M as the draft model.

From the experimental results in the table, we further validate the effectiveness and generalization of Kangaroo. Compared to Table 1 in the main paper, an interesting phenomenon emerges: most algorithms applied to Vicuna-13B show lower speedup in translation tasks relative to other datasets, but when the target model is switched to Llama2-13B-Chat, the speedup significantly improves. This may be related to the different SFT data used by these models.

We will incorporate the valuable suggestions provided by the reviewers in our final version and further refine the ablation experiments and model evaluations. We sincerely hope our response addresses the concerns raised by the reviewers.

Sincerely,

Paper 2124 Authors

Comment

Thanks!

Your AC.

Final Decision

For this paper, reviewers are in general positive about the novelty of the methods (i.e., using the first few layers of the pre-trained model to build a self-drafting model for speculative decoding). After rebuttals, the authors have also addressed most of the concerns (e.g., novelty against Draft & Verify / SPEED, being more lightweight than existing works, and speedup claims on different types of GPUs).