SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Based on MDP theory, we use a trained acceptance prediction head to determine when to stop the current proposal round for speculative decoding.
Abstract
Reviews and Discussion
The paper proposes an approach to dynamically adjust the draft length in speculative decoding, which avoids unnecessary calls to the large model on tokens that will be rejected. To achieve this, the paper formulates an MDP for deciding whether the next token from the drafter is likely to be accepted. This is instantiated via a simple binary classification head on top of the drafter's outputs. Experiments show this approach can speed up speculative decoding by 5-10%.
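For concreteness, here is my own minimal sketch (not the authors' code) of what such a threshold-based stopping rule could look like; the interfaces of `draft_model` and `accept_head`, the threshold value, and the exact point at which the head is applied are all illustrative:

```python
import torch

def draft_with_adaptive_length(draft_model, accept_head, context,
                               stop_threshold=0.5, max_draft_len=20):
    """Hypothetical sketch: draft tokens until the predicted risk of a
    rejection gets too high, then hand the draft to the target for verification.

    Assumes draft_model(context) returns next-token logits and the final
    hidden state of the last position, and accept_head maps that hidden
    state to a logit for "this draft token will be accepted by the target".
    """
    draft_tokens = []
    prob_all_accepted = 1.0  # predicted P(no drafted token has been rejected)
    for _ in range(max_draft_len):
        logits, last_hidden = draft_model(context)            # (1, vocab), (1, d)
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        draft_tokens.append(next_token)
        context = torch.cat([context, next_token], dim=-1)
        # Predicted acceptance probability of the newly drafted token.
        p_accept = torch.sigmoid(accept_head(last_hidden)).item()
        prob_all_accepted *= p_accept
        # Threshold policy: stop once the predicted probability that at least
        # one drafted token gets rejected exceeds the threshold.
        if 1.0 - prob_all_accepted > stop_threshold:
            break
    return draft_tokens  # subsequently verified by the target model as usual
```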
Reasons To Accept
- Logical method, with a nice combination of formal grounding and empirical results.
- Very well written, likely useful for future work to build upon.
Reasons To Reject
- Missing comparison to baseline adaptive draft length using confidence of small model (e.g., per BiLD).
- Experiments on more datasets would (as ever) be welcome.
Questions To The Authors
- Empirically, how well does the proposed approach compare to using the confidence of the small model?
- Fundamentally, the learnability of Equation 4.1 depends on how well we can predict the relative likelihood of the large model given information from the small model. Comment about the viability of this task might be useful. One might also imagine leveraging information from the larger model on previously rejected tokens, e.g., per Aishwarya PS, "Tandem Transformers for Inference Efficient LLMs", ICML 2024.
- "take constant time, which is true when we have enough computational resources to" → constant in what? What kind of computational resources – compute or memory?
- "In the typical setting of speculative decoding where the draft model and the target model align reasonably well" → what does "align" mean here? Why is this the typical setting? See also Appendix D.4.
- Algorithm 1, what is the end index of the for loop?
- Similar ideas may have been considered in contemporaneous work: Zhong et al., "Speeding up Speculative Decoding via Approximate Verification", arXiv 2025.
Thank you for your positive feedback on our work! We sincerely appreciate the time and effort you have invested in reviewing our paper and providing valuable feedback. We would like to provide detailed responses below:
Comment 1: Missing comparison to baseline adaptive draft length using confidence of small model (e.g., per BiLD).
Response 1: We have discussed the sub-optimality of heuristic methods (e.g., using only entropy or top-K probability as an uncertainty metric) in Appendix B. Empirically, these heuristics may achieve certain performance gains, as discussed by Liu et al. (2024), Kim et al. (2024), and Xu et al. (2023) (lines 118-121). But still, these methods are fundamentally flawed.
The goal of our paper is not to demonstrate empirical superiority over these heuristic methods, as we have clearly discussed how they can fail in certain cases. Instead, our goal is to develop a robust and theory-backed algorithm that practitioners can trust when they need to optimize the speculative decoding pipeline.
Comment 2: Fundamentally, the learnability of Equation 4.1 depends on how well we can predict the relative likelihood of the large model given information from the small model. Comment about the viability of this task might be useful.
Response 2: It is hard and potentially impossible to give a rigorous argument about the learnability of Equation 4.1. A plausible situation is when the degree of alignment (measured by total variation distance) between the draft distribution and the target distribution correlates with the existing context. In that case, the acceptance prediction head can be trained to utilize the existing tokens to predict the acceptance.
Comment 3: One might also imagine leveraging information from the larger model on previously rejected tokens, e.g., per Aishwarya PS, “Tandem Transformers for Inference Efficient LLMs”, ICML 2024
Response 3: Thank you for the pointer! This is indeed a valid operation. We will try to explore this in the future.
Comment 4: “take constant time, which is true when we have enough computational resources to” → constant in what? What kind of computational resources – compute or memory?
Response 4: We meant that "the time of the forward pass remains constant when we change the length of the input tokens". The computational resources include both compute units (e.g., the number of tensor cores) and memory (e.g., HBM). When we increase the sequence length, the forward pass may take longer if the additional tokens cannot be processed in parallel with the available compute resources. This can involve both (1) more rounds of tensor-core job scheduling and (2) longer waits on HBM communication.
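As a rough way to see this empirically, one can time forward passes at different input lengths (a generic timing sketch, not code from our paper; the model is assumed to be a standard PyTorch module on a CUDA device):

```python
import time
import torch

def time_forward(model, seq_len, vocab_size=32000, device="cuda", n_runs=10):
    """Measure average forward-pass latency for a given input length.
    With enough parallel compute, latency stays nearly flat as seq_len grows;
    once compute or HBM bandwidth saturates, it starts increasing."""
    x = torch.randint(vocab_size, (1, seq_len), device=device)
    with torch.no_grad():
        model(x)                              # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```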
Comment 5: “In the typical setting of speculative decoding where the draft model and the target model align reasonably well” → what does “align” mean here? Why is this the typical setting? See also Appendix D.4.
Response 5: By saying "the two models align well", we mean that the distance between the distributions given by the two models is small, which can be formally defined as the expected total variation distance over a distribution of natural text (a formula is given after the two cases below). This is the typical setting because both the draft model and the target model are approximating the same distribution:
- (1) One case is when the draft model and the target model are obtained from a set of pretrained models. These models share the same pretraining corpus and are trained using the same objective --- to approximate the distribution of the pretraining corpus.
- (2) Another case is when a smaller model is distilled from a larger model. In this case, the small model is trained to mimic the distribution of the target model.
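Concretely, the alignment measure we have in mind can be written as follows (notation ours; $p$ and $q$ denote the target and draft distributions, and the expectation is over contexts $x$ drawn from natural text):

$$\mathbb{E}_{x}\Big[\mathrm{TV}\big(p(\cdot \mid x),\, q(\cdot \mid x)\big)\Big] \;=\; \mathbb{E}_{x}\Big[\tfrac{1}{2}\textstyle\sum_{y}\big|\,p(y \mid x) - q(y \mid x)\,\big|\Big].$$

Recall that under standard speculative sampling, the per-token acceptance probability is exactly $1 - \mathrm{TV}\big(p(\cdot \mid x), q(\cdot \mid x)\big)$, so a small expected total variation distance directly translates into long accepted runs.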
Comment 6: Algorithm 1, what is the end index of the for loop?
Response 6: In our experiments, we set the upper limit of the for loop to 20 (line 285).
Comment 7: Similar ideas may have been considered in contemporaneous work (Zhong et al. 2025)
Response 7: Zhong et al.'s method cannot guarantee that the output distribution matches the target distribution exactly. We will properly cite and discuss this in our next revision.
We sincerely hope that our responses can address your concerns. Thank you for your positive overall evaluation again!
Thanks for the response. You may wish to include some of the discussion from Comments 2-6 in the final version.
Regarding Comment 1, indeed methods that do not rely on the verifier's scores are fundamentally limited (as Appendix B.1 notes). Per Comment 2, the present method fundamentally relies on estimating some function of the verifier's scores based on the drafter; but the salient property is that it still operates in the speculative decoding regime, thus inheriting the distribution match guarantee (which I agree that works such as Zhong et al. do not guarantee). The present method's speedups thus rely on how well such estimation is possible, but that seems unavoidable and is likely acceptable. Further discussion of this in the text would be good.
Still, I'd encourage the authors to consider adding at least one heuristic method (e.g., BiLD) to their empirical comparison. If it turns out that some of these methods work well empirically, that could be worth calling out -- it doesn't detract from the nice contributions of the present paper, and could suggest some interesting directions for future work.
Thank you for your additional suggestions! We will incorporate the discussions of Comments 2-6 in our next version, and elaborate more on the learnability of Eq. 4.1 (Comment 2).
As for the heuristic methods (e.g., BiLD, which uses the top-1 probability of the current draft token as a counterpart of our "acceptance probability"), we will work on a more systematic empirical comparison with this baseline.
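To be concrete about what such a baseline looks like, a BiLD-style stopping rule can be approximated by something as simple as the following (the threshold value and interface are illustrative, not BiLD's exact implementation):

```python
def heuristic_should_stop(draft_probs, drafted_token_id, fallback_threshold=0.3):
    """Stop drafting when the draft model's own probability for the token it
    just emitted is low, i.e., the drafter is "unconfident". Unlike a learned
    acceptance prediction head, this uses no signal about the target model."""
    return draft_probs[drafted_token_id] < fallback_threshold
```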
Thank you again for the positive overall evaluation!
This paper presents SpecDec++, a method for dynamically determining the draft length in speculative decoding. The problem of determining the draft length is cast as a Markov decision process, where the action space consists of whether to generate another draft token or halt and verify the current draft, and the state space consists of all possible drafted contexts. Under this framing, minimizing the expected decoding time amounts to halting once the probability that at least one of the drafted tokens is rejected exceeds a threshold (which depends on the latency of the draft and target models). SpecDec++ trains an acceptance prediction head and uses the above halting criterion, based on the predicted acceptance probability, to adjust the draft length. Some exploration went into training: the prediction head is trained with a class-weighted binary cross-entropy loss, and a sample of target tokens is replaced with draft tokens. Dynamically adjusting the draft length is shown to increase throughput relative to vanilla speculative sampling on three datasets.
I appreciated the paper's clear framing of the draft length prediction problem. My main concern is the lack of baselines that update the draft length dynamically.
Reasons To Accept
The main contribution of this work is in framing the draft length prediction problem as a Markov decision process and analytically solving for the halting threshold. I found the writing to be very clear, although more space is spent describing the mechanics of speculative decoding than is probably necessary.
Reasons To Reject
The experiments are light, with only vanilla speculative sampling considered as a baseline. This paper makes no claim as to how this proposed method compares empirically to other methods for dynamically adjusting the draft length [1]. Although there is a discussion in Appendix B about why heuristics such as token-level draft distribution statistics are not appropriate, these methods are never evaluated.
[1] Mamou, Jonathan, et al. "Dynamic speculation lookahead accelerates speculative decoding of large language models." arXiv preprint arXiv:2405.04304 (2024).
Questions To The Authors
- line 111: "...including block paralleling sampling..."
- line 130: "...we first generates candidate tokens..."
- Appendix D.3 varies the weight on the rejection class, w_{rej}, but does not touch w_{acc}. Was w_{acc} tuned, or was it fixed to 1 in experiments?
- How important was tuning w_{rej}, w_{acc}, and the stopping threshold to achieve acceptance prediction performance above baseline speculative sampling?
We sincerely appreciate the time and effort you have invested in reviewing our paper and providing valuable feedback. We would like to provide detailed responses below:
Comment 1: This paper makes no claim as to how this proposed method compares empirically to other methods for dynamically adjusting the draft length (Mamou, Jonathan, et al.) [1].
Response 1:
Thank you for the pointer. Compared to [1], our paper has the unique advantages of a theoretical analysis and clearer, self-contained writing, whereas the main text of [1] is only four pages long. We will properly cite [1] and include the following discussion in our next revision:
Discussion: The authors of [1] propose to train an FFN that takes several statistics of the draft model's probability vector to predict the total variation distance between the draft model and the target model, which is highly related to the acceptance probability. Our method differs from [1] in that:
- (1) Our acceptance prediction head is built on top of the final hidden state of the last token, which contains richer information than merely the statistics of probability distribution (top_K and entropy).
- (2) We predict the acceptance of a draft token once it has been fed back into the draft model, i.e., while the distribution of the next token is being computed. In contrast, [1] predicts the acceptance of the next token before its actual sampling result is determined, which may limit the empirical performance as the target is noisier. (A sketch contrasting the two kinds of inputs is given below.)
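The following sketch makes the contrast between (1) and (2) concrete (the architectures are illustrative; our actual head and the FFN of [1] may differ in details):

```python
import torch
import torch.nn as nn

# (1) Our acceptance prediction head consumes the draft model's final hidden
# state of the last token (minimal illustrative architecture).
class AcceptancePredictionHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, last_hidden_state):             # (batch, hidden_dim)
        return torch.sigmoid(self.proj(last_hidden_state)).squeeze(-1)

# (2) A predictor in the style of [1] would instead consume only summary
# statistics of the draft distribution, e.g., top-k probabilities and entropy.
def distribution_statistics(draft_probs: torch.Tensor, k: int = 10):
    topk = draft_probs.topk(k, dim=-1).values                        # (batch, k)
    entropy = -(draft_probs * draft_probs.clamp_min(1e-12).log()).sum(-1, keepdim=True)
    return torch.cat([topk, entropy], dim=-1)                        # (batch, k + 1)
```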
Comment 2: The experiments are light, with only vanilla speculative sampling considered as a baseline. Although there is a discussion in Appendix B about why heuristics such as token-level draft distribution statistics are not appropriate, these methods are never evaluated.
Response 2:
We have discussed the sub-optimality of heuristic methods (e.g., using only entropy or top-K probability as an uncertainty metric) in Appendix B. Empirically, these heuristics may achieve certain performance gains, as discussed by Liu et al. (2024), Kim et al. (2024), and Xu et al. (2023) (lines 118-121). But still, these methods are fundamentally flawed.
The goal of our paper is not to demonstrate empirical superiority over these heuristic methods, as we have clearly discussed how they can fail in certain cases. Instead, our goal is to develop a robust and theory-backed algorithm that practitioners can trust when they need to optimize the speculative decoding pipeline.
Comment 3: line 111: "...including block paralleling sampling...", line 130: "...we first generates candidate tokens..."
Response 3: Thank you for your meticulous evaluation of the paper. We have fixed the typos!
Comment 4: Appendix D.3 varies the weight on the rejection class, w_{rej}, but does not touch w_{acc}. Was w_{acc} tuned, or was it fixed to 1 in experiments?
Response 4: Thank you for pointing this out. It was fixed to 1 in the experiments. This setting is stated in the main text (line 283). But we agree with you that it should be repeated in Appendix D.3. We have updated our manuscript.
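For clarity, the training objective is a class-weighted binary cross-entropy, which (up to normalization and notation) takes the form below, where $y_i = 1$ if the $i$-th draft token is accepted, $\hat{p}_i$ is the predicted acceptance probability, and $w_{\text{acc}}$ is fixed to 1 in our experiments:

$$\mathcal{L} \;=\; -\frac{1}{N}\sum_{i=1}^{N}\Big[\, w_{\text{acc}}\, y_i \log \hat{p}_i \;+\; w_{\text{rej}}\,(1 - y_i)\log\big(1 - \hat{p}_i\big)\Big].$$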
Comment 5: How important was tuning w_{rej}, w_{acc}, and the stopping threshold to achieve acceptance prediction performance above baseline speculative sampling?
Response 5: It is easy to achieve performance better than the baseline speculative decoding. In our experiments, the best throughput achieved by the baseline SpecDec is 17.62 tokens/s, 18.55 tokens/s, and 19.14 tokens/s for Alpaca, HumanEval, and GSM8k, respectively. Cross-referencing these numbers with Table 3 in Appendix D.3, we can see that multiple values of w_{rej} lead to higher performance than the baseline SpecDec. The stopping threshold only requires light tuning (e.g., varying it among 0.3, 0.5, and 0.7), which can be done efficiently without re-training the prediction heads.
We sincerely hope that our responses can resolve the misunderstandings and address your concerns, and we would greatly appreciate it if you would like to re-evaluate our work given the responses.
Dear Reviewer 3xMF,
As we approach the end of the discussion period, we would like to follow up to see if our previous responses have sufficiently addressed your concerns and questions. We greatly value your feedback and are more than happy to provide any further clarifications or additional information if needed.
We also respectfully note that the other reviewers have both provided positive assessments and would be grateful if you could consider re-evaluating your score.
Thank you again for your time and consideration!
Best regards,
Authors
This paper introduces SpecDec++, an enhanced speculative decoding algorithm designed to improve the efficiency of LLMs by adaptively determining the candidate length for each round of decoding. The authors formulate the problem of choosing the candidate length as a Markov Decision Process and theoretically demonstrate that the optimal policy is a threshold policy: speculation should stop when the probability of rejecting at least one token exceeds a certain threshold. Based on this theory, SpecDec++ dynamically adjusts K during inference by training an acceptance prediction head on the draft model to estimate the probability of token acceptance. The proposed method is shown to be effective across different datasets and can be integrated with other speculative decoding improvements, making it a viable and powerful tool for accelerating LLM inference.
Reasons To Accept
- The proposed method is well-motivated, and the authors have theoretically proven its superiority.
- The method proposed in the paper is practical and can be well integrated into existing chain-based SpecDec methods, showing great application potential.
- The paper validates the effectiveness of the method on multiple datasets and demonstrates a stable improvement in acceleration compared to the SpecDec baseline.
- The paper is clearly written and easy to follow.
Reasons To Reject
In terms of the comparative experimental study, the paper lacks comparisons with other speculative decoding paradigms and methods (like Medusa, EAGLE, Sequoia, DySpec, or any comparable methods). Additionally, the experimental setup is relatively simple and does not explore acceleration for target models and draft models of different scales.
Thank you for recognizing our efforts and for your positive evaluation of our work! We sincerely appreciate the time and effort you have invested in reviewing our paper and providing valuable feedback. We would like to provide additional explanations below:
Comment 1: In terms of comparison experimental study, the paper lacks comparisons with other speculative decoding paradigms and methods (like medusa, eagle, sequoia, DySpec, or any comparable methods).
Response 1: We have discussed this in the related work section. Specifically:
- (1) The improvement of our method is achieved through algorithmic innovations, and can be plugged into methods like EAGLE. (Lines 104-109)
- (2) For methods like Medusa, as the generated tokens may deviate from the target model’s distribution under the general stochastic sampling setting, we choose not to compare against this line of methods in our paper. (Lines 110-115)
- (3) For token tree methods like Sequoia and DySpec, our method is not directly comparable. However, our work can serve as a starting point towards the problem of optimal token tree construction, as the candidate length K can be viewed as the depth of a token tree with only one branch. Please refer to our discussion in Appendix B (Lines 555-568).
Comment 2: Additionally, the experimental setup is relatively simple and does not explore the acceleration of target models and draft models of different scales.
Response 2: Due to the page limit, we deferred an additional experiment with Gemma models (2B vs. 27B) to Appendix D.4, which shows a similar speedup for our method.
We sincerely hope that our responses can address your concerns. Thank you for your positive overall evaluation again!
Thanks for your detailed response, I would like to keep my score unchanged.
Reviewers do not find the ideas to deviate significantly from other contemporaneous work on speculative decoding, but are mostly on the positive side due to good execution, with both positive experimental results and a theoretical analysis of the method. Although similar ideas of using a second model to accelerate a more expensive one are being explored by different teams, there is still no principled way to accomplish this, and this paper may contribute to this recent body of knowledge. However, one reviewer noted during the rebuttal discussion that some comparisons to other heuristic-based work could have been incorporated. The AC is still overall leaning positive.