PaperHub
Rating: 3.0 / 10 (withdrawn; 4 reviewers)
Individual ratings: 3, 3, 3, 3 (min 3, max 3, std. dev. 0.0)
Confidence: 4.0 · Correctness: 1.8 · Contribution: 1.8 · Presentation: 2.3
ICLR 2025

CASD: Enhancing Generation Accuracy via Context-Aware Speculative Decoding

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2024-11-18
TL;DR

This paper introduces a general decoding algorithm to enhance the generation accuracy and efficiency of LMs.

Abstract

Keywords
Language model, Natural language processing, Long context generation, Speculative decoding

Reviews and Discussion

Review
Rating: 3

This paper proposes a modified version of the REST algorithm for speculative decoding, called CASD: 1) it uses the input context as the pool for draft retrieval, and 2) it decides whether to accept draft tokens by comparing their conditional probability with a predefined threshold. Experimental results show that the proposed method achieves better accuracy on question answering, summarization, and code completion, with speed-up ratios from 1.63 to 1.99.
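For concreteness, here is a minimal sketch of the threshold-based acceptance rule described above; the function name and signature are hypothetical illustrations, not the authors' implementation:

    import torch

    def conditional_verify(draft_ids, logits, threshold=0.1):
        # Keep the longest draft prefix in which every token clears the
        # conditional-probability threshold under the target model.
        # draft_ids: proposed token ids; logits: one row of target-model
        # logits per draft position, shape (len(draft_ids), vocab_size).
        probs = torch.softmax(logits, dim=-1)
        accepted = []
        for pos, tok in enumerate(draft_ids):
            if probs[pos, tok].item() >= threshold:
                accepted.append(tok)
            else:
                break  # reject this token and everything after it
        return accepted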

Strengths

  1. It is reasonable to use the input context as the draft pool for retrieval when the input context is long and the task involves extraction or summarization.
  2. Experimental results show that the method can improve accuracy while speeding up inference, which is not possible with previous speculative decoding methods.

Weaknesses

  1. Overall, the method is a minor modification of REST that simply replaces the external database with the input context; it is only a new way to build the retrieval database.
  2. The modification is less general than the original REST algorithm, since it only applies to tasks whose prompts are long and whose answers are likely to appear in the prompts. When the input prompts are short, the modification will not work.
  3. Some important experimental results are missing. No comparison with traditional speculative decoding (with a small LM) on speed-up ratios is provided. To ablate the effect of conditional verification with the threshold, the method should be compared with REST's acceptance method, not top-k verification.

Questions

  1. In Section 5.1, the threshold on NQ is set to 1e-5, while the threshold on the other two datasets is 0.1. What is the reason for this? I note that in Section 4.1 the threshold is 0.1 on all the datasets. Also, the speed-up ratio on NQ is the highest. Is that because the threshold is lower than on the other datasets?
  2. On line 369, the phrase "mean accuracy length" should be "mean acceptance length".
Review
Rating: 3

The paper presents Context-Aware Speculative Decoding (CASD), a decoding method to improve LLM performance on tasks involving extensive input contexts. The method is simple, involving three steps: (1) retrieve high-quality drafts from the parts of the context relevant to the input prompt; (2) compute the next-token distributions for all tokens in the draft; (3) accept the longest draft whose tokens match tokens sampled from the LLM above a certain probability threshold.
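As a rough illustration of step (1), the sketch below matches the current output suffix against the input context and proposes the continuation as a draft; the function and its parameters are hypothetical (REST-style systems typically use a suffix index rather than a linear scan):

    def retrieve_draft(context_ids, generated_ids, n_gram=4, draft_len=8):
        # Toy retrieval: find the last n_gram generated tokens inside the
        # input context and return the tokens that follow as a draft.
        if len(generated_ids) < n_gram:
            return []
        suffix = generated_ids[-n_gram:]
        for i in range(len(context_ids) - n_gram, -1, -1):
            if context_ids[i:i + n_gram] == suffix:
                return context_ids[i + n_gram:i + n_gram + draft_len]
        return []  # no match: fall back to plain autoregressive decoding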

Strengths

The method shows promising improvements and speedups.

Weaknesses

  1. Limited novelty: This paper primarily focuses on draft retrieval and employs a probability threshold to accept or reject tokens, which I consider an incremental contribution.

  2. Limited soundness: Unlike original speculative decoding, CASD uses the original model as the draft model to compute probabilities, which seems like a step backward. The draft model in speculative decoding plays a crucial role in the speed-up; relying on the original model for probabilities may yield only minor speed gains (see the sketch after this list for the single-pass verification mechanism at issue). Additionally, CASD may violate the traditional autoregressive decoding style, since it accepts the longest draft, which may cause unwanted generation behavior.

  3. Limited baseline comparison: The paper does not compare CASD with other speculative decoding baselines such as the original SD. In addition, its experiments are conducted with only Llama.

  4. Threshold sensitivity: The results indicate that performance varies significantly with different threshold settings.

  5. Limited writing: The writing can be significantly improved. For example, the details of prompt compression are limited: it is unclear why LLMLingua-2 is used and why it works.
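For context on the speed question raised in point 2, the mechanism retrieval-based methods rely on is scoring an entire draft in a single forward pass rather than one pass per token; a sketch assuming a Hugging Face-style causal LM interface (illustrative only):

    import torch

    def score_draft_in_one_pass(model, prefix_ids, draft_ids):
        # One forward pass yields target-model logits for every draft
        # position at once; logits[len(prefix_ids) - 1 + i] is the
        # distribution that predicts draft_ids[i].
        input_ids = torch.tensor([prefix_ids + draft_ids])
        logits = model(input_ids).logits[0]
        start = len(prefix_ids) - 1
        return logits[start:start + len(draft_ids)]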

Questions

See weaknesses

Review
Rating: 3

This paper introduces a novel speculative decoding strategy named CASD. CASD combines context-aware speculative decoding with a conditional verification approach, allowing LLMs to reuse relevant context at the token level without additional training or models. CASD achieves superior performance across tasks like question answering, summarization, and code completion, demonstrating a +3.3 increase in average generation scores and a nearly 2x increase in inference speed.

Strengths

  1. Enhances generation accuracy and nearly doubles inference speed without additional training or models.
  2. Works well with context compression.

Weaknesses

  1. The difference between the proposed method and REST [1] is minimal; the exact same retrieval algorithm is used, and the general database in REST could also be replaced by the input context.
  2. Training-free / model-free speculative decoding baselines (e.g., [2]) are not compared.
  3. Different templates are needed for different datasets. It is not clear whether CASD is suitable for the most common chatbot scenarios.
  4. This method tends to make the output distribution more inclined to restate the input context, which is advantageous for extractive QA tasks where the answer is present in the input. However, its impact on more general scenarios remains unclear.

[1] REST: Retrieval-Based Speculative Decoding
[2] Inference with Reference: Lossless Acceleration of Large Language Models

Questions

  1. More baselines can be introduced.
  2. How do you choose the 0.1 threshold for all experiments?
Review
Rating: 3

This paper proposes a method called CASD (Context-Aware Speculative Decoding) for leveraging input context effectively and efficiently. The method uses the original text as the draft source, requiring no additional training or draft models. The authors also propose replacing the strict verification of conventional speculative decoding with conditional verification, where pieces of the relevant context are verified and accepted at the token level. They conduct experiments on 8 datasets and show better accuracy and a speed-up ratio of up to 1.99.
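To make the strict-versus-conditional distinction concrete, the sketch below contrasts a strict check (the draft token must equal the target model's greedy choice) with the conditional check described above; both functions are hypothetical illustrations, not the authors' code:

    import torch

    def strict_accepts(probs, pos, tok):
        # Strict verification: accept only the target model's own
        # greedy choice at this position.
        return tok == int(torch.argmax(probs[pos]))

    def conditional_accepts(probs, pos, tok, threshold=0.1):
        # Conditional verification: accept any token whose probability
        # clears the threshold, so plausible non-argmax tokens survive.
        return probs[pos, tok].item() >= threshold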

Strengths

  • CASD is a plug-and-play method that requires no additional training or draft models, making it easy to implement. It can also be effectively combined with prompt compression methods, which are used to reduce the computational burden of long contexts.

  • The proposed conditional verification allows for flexibility in balancing context reliance and generation quality. Experimental results on various datasets show that the generation scores of CASD are improved.

  • Besides improved accuracy, CASD also reduces inference latency. It achieves a speed-up ratio of up to 2x compared to traditional autoregressive decoding.

Weaknesses

  • One limitation of CASD is its generalizability across tasks. While it excels in tasks that benefit from direct use of the context, its effectiveness may be limited in tasks requiring abstractive or creative generation.

  • As stated in the paper, CASD conflicts with common sampling techniques in autoregressive models, which makes it less practical for many scenarios that require multiple generations per sample.

  • Although the experiments are conducted on multiple datasets, the baseline is too limited: it is only vanilla decoding with the Llama-3.1-8B-Instruct model. While it is understandable to skip decoding methods that require additional training or draft models, there are other training-free speculative decoding methods available to compare with. Comparisons with these alternative approaches, such as REST, are crucial for a more comprehensive evaluation.

  • The choice of the probability threshold for accepting draft tokens during verification significantly impacts CASD's performance. The authors acknowledge that a lower threshold can enhance accuracy with reliable contexts but might compromise fluency. However, the experiments rely primarily on fixed or manually tuned thresholds, which lack generalizability and require prior knowledge of the dataset's characteristics. It would be better if the threshold could be determined automatically by a generalizable heuristic rule (see the sketch after this list).

  • The settings of vanilla decoding are not specified anywhere in the paper, i.e., temperature, sampling size, maximum context length, and maximum generation length. These can affect the model's performance and should therefore be described in the experiments section.
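Purely as an illustration of the kind of heuristic the fourth point asks for (not something proposed in the paper), the threshold could be made relative to the model's confidence at each step, in the spirit of the typical-acceptance rules used in some speculative decoding work; the function below is hypothetical:

    def adaptive_threshold(probs, pos, alpha=0.3, floor=1e-5):
        # Hypothetical heuristic: scale the acceptance threshold by the
        # target model's peak probability at this position, so confident
        # steps demand more of a draft token than uncertain ones do.
        return max(alpha * probs[pos].max().item(), floor)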

Questions

  • One main contribution of the proposed method is improving generation accuracy. In this paper, only one model is evaluated. How does the base model's performance affect the improvement?

  • In Table 1, the Fixed and Mixed methods show different performance trends on different tasks. Which one is recommended and more generalizable?

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.