PaperHub

Overall rating: 5.3/10 · Poster · 3 reviewers (min 4, max 6, std dev 0.9)
Individual ratings: 6, 4, 6
Confidence: 4.0 · Soundness: 2.7 · Contribution: 2.3 · Presentation: 3.0

NeurIPS 2024

Accelerating Blockwise Parallel Language Models with Draft Refinement

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We improve blockwise parallel decoding in language models by analyzing token distributions and developing refinement algorithms for block drafts, achieving significantly higher block efficiency across various tasks.

Keywords

Blockwise parallel decoding · Language models · Speculative decoding

Reviews and Discussion

Review (Rating: 6)

The paper analyzes the block drafts generated by multiple independent prediction heads of blockwise parallel language models and observes three key features: consecutive repetitions, confidence of different heads, and efficiency gap with oracle top-k block. To address these issues, the paper proposes two algorithms to leverage the top-k predictions at each head: local rescoring via (small) neural LMs and global rescoring via n-gram LMs with multi-drafts. Experimental results show that the proposed two algorithms can improve block efficiency.

Strengths

  1. The paper identifies a weakness in the existing blockwise parallel decoding algorithms: the predictions are made independently. The paper proposes two algorithms: local rescoring via neural models and global n-gram rescoring to address the weakness.
  2. The paper analyzes the block drafts and gives several observations. The observed strong correlation between block efficiency and the largest head index up to which the average per-head entropy increases monotonically is especially interesting.

Weaknesses

  1. The experiments are only conducted on a 1.5B LM pretrained on 200B tokens of C4, without the alignment stage. This is far from the common practice in current LLMs. The authors should consider adding results on more open LLMs with different sizes.
  2. The Medusa paper [1] already proposes using top-k predictions for different heads. The contribution of the paper mainly focuses on the two rescoring algorithms. However, the algorithm of local rescoring with a small LM is very similar to speculative decoding with a small LM [2]. The contribution of the global rescoring algorithm with n-gram models is not sufficient for a NeurIPS paper.

[1] Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. http://arxiv.org/abs/2401.10774

[2] Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192

Questions

  1. It is not clear how the observations in Sections 6.2 and 6.3 contribute to the design of the algorithms.
  2. The neural draft is generated based on the logits of the original prediction, not the rescored ones. If the rescored top-1 token differs from the original top-1 token, the subsequent rescoring will be incorrect.

Limitations

The authors have adequately addressed the limitations.

Author Response

Thank you for your valuable feedback in helping refine our work.

  • Note: Before reading further, we kindly ask you to check the Author Rebuttal by Authors and the attached PDF for detailed explanations and additional results.

W1. Limited Experimental Scope

We acknowledge the limitation of our initial experiments. To address this, we have included additional results on more open LLMs with different sizes and more datasets. Specifically, we conducted experiments with Vicuna 7B and 13B with 4 block draft heads in the attached PDF. These new results also show that our approach scales effectively and consistently improves performance across different model sizes, achieving an additional speedup of over ~20% relative to BPD/Medusa and a ~300% speedup relative to vanilla decoding. Here, the blockwise parallel LMs extended from Vicuna 7B and 13B are trained with alignment stages.

W2. Difference with Existing Work

We believe our contributions, including rescoring methods as well as findings, are both valuable and innovative compared to existing works. Here’s why:

Medusa

  • Similarity: Both approaches explore potential gains from TopK predictions of blockwise parallel LM heads.
  • Technical Differences:
    • Medusa searches for the best candidate in block drafts without altering the original logits
    • Our method involves rescoring the block drafts to obtain the best candidate.
  • Exploratory Differences
    • Medusa does not explicitly address the reasons behind its decoding method.
    • Our work is the first to explicitly address issues such as (1) consecutive repetition, (2) confidence of different heads, and (3) top-k oracle block efficiency.
  • Our contributions are orthogonal and different from Medusa, and our method integrated with Medusa shows even better results, detailed in the attached PDF.

Speculative Decoding

First of all, BPD (presented in 2018) is a predecessor and an instance of speculative decoding.

  • Similarity: Both approaches use small language models for efficient LLM inference.

  • Technical Differences

    • Speculative decoding uses a small drafter to predict the next multiple tokens
    • Our method uses the small drafter (for both local and global rescoring) to rescore the logits of block drafts with a top-k mask, which are then used for speculative inference.
  • TL;DR

    • Speculative decoding uses an independent module for drafting
    • Our method corrects the drafts from a dependent speculative module (e.g., BPD and Medusa), which is fundamentally different from how speculative decoding uses its small LM (a toy sketch of the rescoring step is given below).
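The sketch below is illustrative only: the function name, the interpolation weight, and the independent per-position scoring are simplifying assumptions made for this response, not the exact implementation in the paper.

```python
import numpy as np

def local_rescore(head_logits, small_lm_logits, top_k=4, lam=0.5):
    """Toy sketch of local rescoring (illustrative only).

    head_logits:     [num_heads, vocab] logits from the BPD draft heads.
    small_lm_logits: [num_heads, vocab] logits from a small LM scoring the
                     same future positions given the accepted prefix.
    For each head, only its top-k tokens are kept (mask), and the head and
    small-LM scores are interpolated before re-taking the argmax.
    """
    rescored_draft = []
    for h, s in zip(head_logits, small_lm_logits):
        topk = np.argsort(h)[::-1][:top_k]          # candidate set per head
        mask = np.full_like(h, -np.inf)
        mask[topk] = 0.0                             # restrict to top-k
        scores = mask + (1.0 - lam) * h + lam * s    # interpolated score
        rescored_draft.append(int(np.argmax(scores)))
    return rescored_draft

# Hypothetical usage with random logits for 4 heads and a 10-token vocabulary.
rng = np.random.default_rng(0)
draft = local_rescore(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)))
print(draft)
```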

W2-2. Regarding the Contribution of the Global Rescoring with N-gram Models

We believe the global rescoring with n-gram models is valuable and innovative. Here’s why:

  1. Efficiency: The n-gram rescoring is highly efficient, taking only 1.6 ms per lattice via OpenFST, making it practical for real-time applications (Tables 12 & 13 in Appendix H).
  2. Effectiveness: Despite being a classical approach, n-gram models remain effective. Our results show significant improvements in block efficiency and speedup when integrated with BPD.

While the concern about the novelty of n-gram models themselves is understandable, our results clearly demonstrate their practical value and effectiveness when used for global draft rescoring.
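For intuition, here is a deliberately simplified Python sketch of global rescoring over the top-k lattice. It enumerates candidate paths explicitly instead of using OpenFST's weighted-FST machinery, and the function names, the stand-in n-gram scorer, and the interpolation weight are assumptions for illustration only.

```python
from itertools import product

def global_ngram_rescore(topk_candidates, ngram_logprob, prefix, lam=0.5):
    """Toy sketch of global rescoring over a block-draft lattice.

    topk_candidates: list over head positions, each a list of
                     (token, head_logprob) pairs (the per-head top-k).
    ngram_logprob:   callable(context_tuple, token) -> log-prob from an
                     n-gram LM (here a stand-in for the real model).
    prefix:          already-accepted tokens used as n-gram context.
    Enumerates every path through the lattice (fine for small k and few
    heads; the paper uses weighted FSTs instead) and returns the path with
    the best interpolated score.
    """
    best_path, best_score = None, float("-inf")
    for path in product(*topk_candidates):
        tokens = [tok for tok, _ in path]
        head_score = sum(lp for _, lp in path)
        lm_score, context = 0.0, list(prefix)
        for tok in tokens:
            lm_score += ngram_logprob(tuple(context[-2:]), tok)  # trigram-style
            context.append(tok)
        score = (1.0 - lam) * head_score + lam * lm_score
        if score > best_score:
            best_path, best_score = tokens, score
    return best_path

# Hypothetical usage with a uniform stand-in "n-gram model".
cands = [[("the", -0.1), ("a", -0.9)], [("cat", -0.2), ("dog", -0.8)]]
print(global_ngram_rescore(cands, lambda ctx, tok: -1.0, prefix=["saw"]))
```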

Q1. Clarity on How Observations Contribute to Algorithm Design

  • Section 6.2 (Confidence across multiple heads): This section examines the confidence levels across different prediction heads in BPD, providing insights into how well the parallel heads are trained. These observations, while not directly informing the current design of our rescoring algorithm, are included to align with our paper's goal of "Exploring and Improving Multi-Token Prediction (block draft)." They suggest potential future improvements, such as using variable interpolation weights based on head confidence levels. We will refine the manuscript to clarify this point.

  • Section 6.3 (Top-k oracle block efficiency): The concept of top-k oracle block efficiency serves as a theoretical upper bound for the potential improvements achievable with rescoring. By quantifying the maximum possible efficiency, we gain a benchmark against which we can measure the actual performance of our algorithms. This observation led to the realization that current methods, such as BPD and Medusa, have significant room for improvement. As a result, we developed our rescoring methods to approach this upper bound more closely, thereby optimizing block acceptance rates and reducing redundancy. A toy sketch of how this oracle acceptance can be computed is given below.
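The sketch assumes that block efficiency counts the accepted draft prefix plus the one token produced by the verification pass; the helper name and toy inputs are hypothetical, not taken from the paper.

```python
def oracle_topk_block_efficiency(target_tokens, topk_per_head):
    """Toy sketch of the oracle top-k acceptance count for one block.

    target_tokens: tokens the target LM would produce autoregressively for
                   the next len(topk_per_head) positions (the ground truth).
    topk_per_head: for each draft head, the set of its top-k candidate tokens.
    An oracle that could always pick the right candidate from each head's
    top-k accepts a prefix of the block up to the first position where the
    target token is missing from the candidates; the returned value adds one
    for the extra token produced by the verification step.
    """
    accepted = 0
    for target_tok, candidates in zip(target_tokens, topk_per_head):
        if target_tok in candidates:
            accepted += 1
        else:
            break
    return accepted + 1

# Hypothetical example: 4 heads, k = 2.
print(oracle_topk_block_efficiency(
    ["the", "cat", "sat", "on"],
    [{"the", "a"}, {"cat", "dog"}, {"ran", "slept"}, {"on", "in"}],
))  # -> 3 (two accepted draft tokens + one verified token)
```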

Q2. Potential Rescoring Issues

To clarify, the original prediction is used for the next-token prediction, and rescoring happens after that, from the 2nd future position onward (see Algorithm 2 in Section 7; we will make this clearer in the camera-ready version). The rescored top-1 token can differ from the original top-1 token, which is intentional. The original top-1 token for future positions often fails to be accepted during speculative inference, and our findings show that rescoring improves this. Evidence is provided by the improvements in block efficiency and speedup.

Additionally, BPD follows a draft-verify-accept structure. Incorrect predictions in the draft phase do not pass verification, ensuring that BPD's and Medusa's outputs are identical to vanilla decoding. We focused on the number of tokens accepted, using block efficiency as a hardware-agnostic metric. Based on reviewer feedback, we have also included latency experiments in the attached PDF. Appendix H provides a detailed discussion on LLM inference speed improvements, covering low-level aspects such as KV-Cache and Parameter I/O memory bandwidth in TPU/GPU experiments.
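As an illustration of the draft-verify-accept structure, the following toy sketch shows why accepted outputs match vanilla greedy decoding. The function and the toy target LM are hypothetical, and in practice verification is a single parallel forward pass rather than a Python loop.

```python
def verify_block(draft_tokens, target_next_token):
    """Toy greedy verification step in BPD/Medusa-style decoding.

    draft_tokens:      tokens proposed by the (rescored) draft heads.
    target_next_token: callable(block_prefix) -> the target LM's greedy next
                       token (the surrounding context is implicit here).
    Draft tokens are accepted only while they match what the target LM would
    have generated, so the final output is identical to vanilla greedy
    decoding; rescoring only changes how many tokens get accepted per step.
    Returns the accepted tokens plus the one correction/bonus token.
    """
    prefix, accepted = [], []
    for tok in draft_tokens:
        expected = target_next_token(prefix)
        if tok != expected:
            return accepted + [expected]     # first mismatch: take the target's token
        accepted.append(tok)
        prefix.append(tok)
    return accepted + [target_next_token(prefix)]  # all accepted + bonus token

# Hypothetical target LM that always continues "a b c d e".
alphabet = "a b c d e".split()
fake_target = lambda prefix: alphabet[len(prefix)]
print(verify_block(["a", "b", "x"], fake_target))  # -> ['a', 'b', 'c']
```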

We believe these responses and additional results in the attached PDF address your concerns comprehensively and highlight the novelty of our work.

Comment

Thanks for the additional results. I decided to raise my score.

Comment

Dear Reviewer vMaK,

Thank you for your positive assessment and the time you've dedicated to reviewing our work. We appreciate your decision to raise your score and are glad that the additional data provided was helpful in addressing your concerns. Your time and attention in evaluating our work mean a great deal to us.

Best regards,

Authors

Comment

Dear Reviewer vMaK,

In response to the feedback, we have enhanced the depth and robustness of our work with detailed explanations and additional experimental results. These include:

  • Evaluations with open-sourced 7B & 13B LLMs [Figure A and Tables A-D in the author-response PDF]
  • 5 additional datasets (MT-Bench, Natural Questions, GSM8K, RAG, Summarization) [Figure A and Tables A-D in the author-response PDF]
  • Clarification of the differences from existing works
  • Comparison with recent efficient LLM inference methods [Table E in the author rebuttal]
  • Integration with Medusa decoding (a recent extended version of BPD, presented at ICML 2024) [Figure A and Tables A-D in the author-response PDF, and Table E in the author rebuttal]
  • Additional experiments with temperature sampling (T=0.7, 1.0), beyond greedy decoding [Tables A-D in the author-response PDF]

Given the tight timeline, with the discussion phase concluding on Aug 13, we kindly request you to review our responses. We believe our detailed responses provide clarity on the concerns raised. Your feedback is pivotal to the quality of our work, and we earnestly await your thoughts, especially since we have less than 2 days remaining.

Thank you for your efforts on this paper.

Best regards,

Authors

Review (Rating: 4)

This paper proposes new ways to improve blockwise parallel decoding (BPD), a method to reduce inference latency in large language models. It first analyzes the token distributions produced by multiple prediction heads and then leverages this analysis to develop algorithms to improve BPD inference speed by refining the block drafts using n-gram and language models.

Strengths

  • The paper thoroughly studies BPD's behavior, including issues like consecutive token repetition and varying confidence levels across prediction heads, providing new insights for efficiency improvement. It further introduces the oracle top-k block efficiency as a useful metric for understanding the potential headroom for improvement in block drafts.

  • The proposed refinement algorithms (local neural rescoring and global n-gram rescoring) demonstrate improvements in block efficiency across multiple tasks, with gains of up to 21.30% in some cases.

  • The evaluation considers a variety of tasks (language modeling, question answering, and summarization) and datasets.

  • The paper is well-structured and easy to follow, with helpful illustrations and examples.

Weaknesses

  • The evaluation mainly compares the proposed improvements over the existing BPD baselines but doesn't compare them with other approaches for reducing inference latency, such as quantization or model pruning. A more thorough comparison is suggested.

  • The evaluation is conducted mainly on a 1.5B parameter model. It's unclear how well these findings and improvements would generalize to larger, state-of-the-art models. Further, while the paper focuses on improving block efficiency, there's limited discussion on the additional computational cost of the rescoring methods. It's unclear whether the efficiency gains outweigh any increased computational requirements.

  • The improvements in block efficiency vary significantly across tasks, with some showing little to no improvement. A deeper analysis of why certain tasks benefit more than others would be valuable.

Questions

Please address the questions listed in the weakness section.

Limitations

The paper has discussed the limitations sufficiently.

Author Response

Thank you for your valuable feedback in helping refine our work.

  • Note: Before reading further, we kindly ask you to check the Author Rebuttal by Authors and the attached PDF for detailed explanations and additional results.

W1. Comparison with Other Latency Reduction Approaches

We appreciate the suggestion to include comparisons with other latency reduction approaches, such as quantization or model pruning. Our paper's primary focus is "Exploring and Improving Multi-Token Prediction (Block Draft)", which aligns more closely with speculative inference methods. While speculative inference methods like ours are orthogonal to pruning and quantization, we recognize the importance of providing a thorough comparison for a comprehensive understanding.

Recent works such as Medusa [1,2] (presented at ICML 2024, last month; an extended version of BPD) are examples of speculative inference techniques similar to ours. Although our paper was submitted before many of these studies were published or open-sourced, we have now included a comparative analysis in Table E (see the Author Rebuttal by Authors) and the attached PDF. This analysis demonstrates that our method consistently enhances the efficiency of blockwise parallel LMs (BPD/Medusa), outperforming other speculative inference approaches and achieving an additional speedup of over ~20% relative to BPD/Medusa and a ~300% speedup relative to vanilla decoding. This inclusion aims to provide a more comprehensive understanding of our method's effectiveness relative to other approaches in the field.

[1] Tianle Cai, et al. "Medusa: Simple LLM inference acceleration framework with multiple decoding heads." ICML 2024.

[2] Ankner, Zachary, et al. "Hydra: Sequentially-dependent draft heads for medusa decoding." arXiv (2024).

W2. Generalization to Larger Models and Computational Cost

We understand the need to evaluate our methods on larger, state-of-the-art models and to discuss the computational costs associated with our rescoring methods. To address this, we have conducted additional experiments with larger models, specifically Vicuna 7B and 13B with 4 block draft heads, as detailed in the attached PDF. These results demonstrate that our approach scales effectively, maintaining its performance gains across different model sizes.

Regarding computational costs, we would like to emphasize that Appendix H already provides a detailed discussion on LLM inference speed improvements, covering low-level aspects such as KV-Cache and Parameter I/O memory bandwidth in TPU/GPU experiments. Our findings show that the efficiency gains from our rescoring methods outweigh the increased computational requirements. The attached PDF supports this statement in terms of wall-clock speedups relative to standard autoregressive decoding.

W3. Variability in Block Efficiency Improvements

We have extended our experiments to include a wider range of datasets and tasks, as detailed in the attached PDF. While most tasks show significant improvements, we acknowledge that not all tasks benefit equally.

Achieving consistent improvements across all tasks and datasets is a challenge faced by many LLM studies, not just our own. Various factors, including the characteristics of specific datasets, can influence the results. This is not necessarily a limitation of our method but rather a common challenge in the field [3].

However, our results demonstrate that our method consistently enhances performance, even as model sizes increase (e.g., 7B and 13B parameters). The improvements are consistent for larger models, as highlighted in the attached PDF.

[3] Xia, Heming, et al. "Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding." ACL Findings (2024).

We hope this mitigates your concern and demonstrates the robustness and effectiveness of our proposed methods.

Comment

Dear Reviewer rNoJ,

In response to the feedback, we have enhanced the depth and robustness of our work with additional experimental results. These include:

  • Evaluations with open-sourced 7B & 13B LLMs [Figure A and Tables A-D in the author-response PDF]
  • Comparison with recent efficient LLM inference methods [Table E in the author rebuttal]
  • 5 additional datasets (MT-Bench, Natural Questions, GSM8K, RAG, Summarization) [Figure A and Tables A-D in the author-response PDF]
  • Integration with Medusa decoding (a recent extended version of BPD, presented at ICML 2024) [Figure A and Tables A-D in the author-response PDF, and Table E in the author rebuttal]
  • Additional experiments with temperature sampling (T=0.7, 1.0), beyond greedy decoding [Tables A-D in the author-response PDF]

Given the tight timeline, with the discussion phase concluding on Aug 13, we kindly request you to review our responses. We believe our detailed responses provide clarity on the concerns raised. Your feedback is pivotal to the quality of our work, and we earnestly await your thoughts, especially since we have less than 2 days remaining.

Thank you for your efforts on this paper.

Best regards,

Authors

Comment

Dear Reviewer rNoJ,

With the discussion phase nearing the end, we would appreciate knowing if our responses have adequately addressed your concerns.

If you have any remaining concerns, please do let us know. We are eager to refine and enhance our research based on your valuable feedback.

We look forward to your reply and thank you for your efforts on this paper.

Best regards,

Authors

Review (Rating: 6)

This paper provides an improved solution for block drafting, a potentially useful way to improve the inference efficiency of LLMs. The work begins with observations of the problems of current block drafting, revealing that consecutive repetition and drafting confidence are related to the quality of the draft. Rescoring methods are then employed to improve the drafting process accordingly.

Strengths

The observations are persuasive and the solution is intuitive.

The experiments show an impressive improvement (up to 20%) in block efficiency, which is potentially useful.

Weaknesses

The rescoring phase uses yet another model to score the candidates. I am wondering whether using different models affects the final performance, because different models have different token generation distributions.

Questions

See the weaknesses section.

Limitations

N/A

Author Response

Thank you for your valuable feedback. We understand your concern regarding the use of different models for the rescoring phase and its potential impact on final performance due to varying token generation distributions.

  • Note: Before reading further, we kindly ask you to check the Author Rebuttal by Authors and the attached PDF for detailed explanations and additional results.

W1. Additional experiments on different models

To address this, we conducted additional experiments demonstrating that local neural rescoring methods are robust and consistently improve performance across different target models (the instruction-tuned open LLMs Vicuna 7B and 13B with 4 block draft heads). Specifically, we have shown that:

  1. Our approach significantly speeds up BPD even further for existing, open-sourced, instruction-tuned LLMs.

  2. Local rescoring also accelerates decoding for different model architectures, including the very recent Medusa model [1] presented at ICML 2024, which employs tree-attention to enhance performance in blockwise parallel LMs, achieving an additional speedup of over ~20% relative to BPD/Medusa and ~300% speed-up relative to vanilla decoding.

In the attached PDF, we provide detailed results showing consistent performance improvements when our model is applied on top of Medusa decoding. These results highlight the versatility and robustness of our rescoring approach for multi-token prediction, confirming that it effectively enhances performance regardless of the underlying model used for token generation.

We hope this addresses your concern and illustrates the robustness and effectiveness of our proposed methods.

[1] Tianle Cai, et al. "Medusa: Simple LLM inference acceleration framework with multiple decoding heads", ICML, 2024.

Comment

Dear Reviewer FLJ6,

In response to the feedback, we have enhanced the depth and robustness of our work with additional experimental results. These include:

  • Evaluations with open-sourced 7B & 13B LLMs [Figure A and Tables A-D in the author-response PDF]
  • 5 additional datasets (MT-Bench, Natural Questions, GSM8K, RAG, Summarization) [Figure A and Tables A-D in the author-response PDF]
  • Comparison with recent efficient LLM inference methods [Table E in the author rebuttal]
  • Integration with Medusa decoding (a recent extended version of BPD, presented at ICML 2024) [Figure A and Tables A-D in the author-response PDF, and Table E in the author rebuttal]

Given the tight timeline, with the discussion phase concluding on Aug 13, we kindly request you to review our responses. We believe our detailed responses provide clarity on the concerns raised. Your feedback is pivotal to the quality of our work, and we earnestly await your thoughts, especially since we have less than 2 days remaining.

Thank you for your efforts on this paper.

Best regards,

Authors

Comment

Dear Reviewer FLJ6,

With the discussion phase nearing the end, we would appreciate knowing if our responses have adequately addressed your concerns.

If you have any remaining concerns, please do let us know. We are eager to refine and enhance our research based on your valuable feedback.

We look forward to your reply and thank you for your efforts on this paper.

Best regards,

Authors

Author Response

We extend our gratitude to all the reviewers for providing comprehensive and thoughtful feedback on our manuscript. We appreciate your valuable insights into the strengths and areas for improvement of our work.

Core Contributions of Our Work

  • Novel Findings: This work explicitly addresses key issues in blockwise parallel LMs, such as consecutive repetitions and the confidence of different heads, and introduces the concept of oracle top-k block efficiency (the potential headroom for speedup).
  • Novel Technical Contributions: Building on BPD, our methods effectively remove repetitions and improve block efficiency, achieving up to a 21.30% improvement. In most of the rebuttal experiments, our approach also yields (a) a ~15% speedup relative to BPD and Medusa decoding and (b) a ~300% speedup relative to vanilla decoding.
  • Objective and Scalability: Our main objective is “Exploring and Improving Multi-token Prediction (Block Draft)”. The framework is easy to plug in and scale up because any type of rescoring LM can be used, regardless of the size of the target blockwise parallel LM.
  • Oracle Top-k Block Efficiency: To the best of our knowledge, this metric has not been measured in the field, yet it provides a valuable upper bound indicating the potential room for improvement.
  • Instruction-Tuned LLMs: As shown in the attached PDF, this framework performs well even with instruction-tuned LLMs, including models with 7B and 13B parameters.

Summary of Strengths Cited by Reviewers

  • Impact: We appreciate Reviewer FLJ6 and Reviewer rNoJ for noting the important motivation and impact of our work. The improvements in block efficiency (up to 21.30%) are significant and have practical applications.
  • Soundness: Reviewers FLJ6 and rNoJ acknowledged the technical soundness of our approach and the thorough study of BPD behavior, including issues like consecutive token repetition and varying confidence levels across prediction heads.
  • Integration: All reviewers highlighted the well-structured and integrated nature of our paper, with a clear presentation and helpful illustrations.
  • Experiments: All reviewers appreciated our comprehensive experiments across a variety of tasks (language modeling, question answering, and summarization) and datasets, demonstrating the robustness and effectiveness of our methods.

Additional Experiments in PDF

  1. Local rescoring improves BPD with tree-attention
  2. Additional experiments with temperature sampling (T=0.7, 1.0), beyond greedy decoding
  3. Instruction-tuned openLLM results with Vicuna 7B & 13B target model with 4 block draft heads
  4. Additional tasks
    • MT-Bench (including writing, roleplay, reasoning, math, coding, extraction, STEM, humanities)
    • Natural Questions (QA) [3]
    • GSM8K [5]
    • RAG [2]
  5. Local rescoring vs. vanilla Medusa decoding (extended version of BPD; presented at ICML 2024) [1]

TL;DR for PDF

  • We have conducted extensive experiments with additional open-source LLMs and downstream tasks.
  • Our results consistently show improved performance from local neural rescoring, in terms of both block efficiency and wall-clock time, for LLMs with up to 13 billion parameters.

Additional results comparing our method with other approaches for efficient LLM inference:

Table E. Speedup ratio (wall-clock latency) relative to standard autoregressive decoding, measured on an NVIDIA A100 GPU.

Vicuna 7B:

| Relative Speedup      | MT-Bench | Sum   | QA    | GSM8K | RAG   |
|-----------------------|----------|-------|-------|-------|-------|
| Sps [6]               | 1.432    | 1.394 | 1.417 | 1.364 | 1.568 |
| Lookahead [7]         | 1.818    | 1.645 | 1.503 | 1.865 | 1.475 |
| PLD [8]               | 1.676    | 2.707 | 1.162 | 1.605 | 1.909 |
| BPD                   | 1.752    | 1.509 | 1.489 | 1.696 | 1.409 |
| + Our local rescoring | 1.843    | 1.534 | 1.555 | 1.780 | 1.501 |
| Medusa                | 2.254    | 2.002 | 2.045 | 2.317 | 1.833 |
| + Our local rescoring | 2.482    | 2.076 | 2.114 | 2.357 | 2.000 |

Vicuna 13B:

| Relative Speedup      | MT-Bench | Sum   | QA    | GSM8K | RAG   |
|-----------------------|----------|-------|-------|-------|-------|
| Sps [6]               | 1.417    | 1.424 | 1.362 | 1.448 | 1.606 |
| Lookahead [7]         | 1.118    | 1.007 | 1.011 | 1.324 | 0.963 |
| PLD [8]               | 1.528    | 2.384 | 1.050 | 1.646 | 1.876 |
| BPD                   | 1.745    | 1.530 | 1.488 | 1.794 | 1.483 |
| + Our local rescoring | 1.819    | 1.522 | 1.519 | 1.819 | 1.501 |
| Medusa                | 2.232    | 2.000 | 1.986 | 2.507 | 1.945 |
| + Our local rescoring | 2.467    | 2.136 | 2.154 | 2.519 | 2.068 |

References

[1] Tianle Cai, et al. "Medusa: Simple LLM inference acceleration framework with multiple decoding heads." ICML, 2024.

[2] Vladimir Karpukhin, et al. "Dense passage retrieval for open-domain question answering." arXiv, 2020.

[3] Tom Kwiatkowski, et al. "Natural questions: a benchmark for question answering research." Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

[4] Lianmin Zheng, et al. "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." NeurIPS, 2024.

[5] Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv (2021).

[6] Chen, Charlie, et al. "Accelerating large language model decoding with speculative sampling." arXiv (2023).

[7] Fu, Yichao, et al. "Break the sequential dependency of llm inference using lookahead decoding." arXiv (2024).

[8] Apoorv Saxena. "Prompt lookup decoding." 2023.

We believe these additions and clarifications address the reviewers' concerns comprehensively and strengthen our manuscript. Thank you for your constructive feedback, which has significantly contributed to the improvement of our work. We look forward to your favorable consideration.

Comment

Dear Reviewers,

Could you please go over our responses and the additional results in the author-response PDF, since today is the last day of the author-reviewer discussion period? We have addressed the questions through additional experiments and clarifications, demonstrating improvements in both block efficiency and wall-clock time on five additional tasks, with open-sourced 7B and 13B parameter LLMs.

We sincerely thank you for your time and efforts in reviewing our paper, and your insightful and constructive comments.

Thanks, Authors

Comment

Dear reviewers,

The discussion period will end soon. If you haven't responded to the authors' rebuttal, please do so and kick off the discussion.

Best, SACa

Final Decision

The paper addresses block-drafting, an approach to improve the inference efficiency of language models. In particular, the authors proposed a new algorithm involving top-k prediction and local/global rescoring.

In general, reviewers found the approach interesting and effective in the reported experiments. However, the language model chosen is somewhat small compared with current practice.