AutoJudge: Judge Decoding Without Manual Annotation
Automatically detecting task-specific important tokens to accelerate speculative decoding
Summary
Reviews and Discussion
This paper proposes AutoJudge, a lossy speculative decoding framework that learns to selectively verify only the tokens deemed “important” for final answer quality. The key idea is to avoid full token-level validation by training a lightweight classifier to identify which mismatches matter. The method runs a semi-greedy search over tokens to label their importance, trains a classifier on these labels, and integrates it into the speculative decoding pipeline. Experiments on GSM8K and LiveCodeBench show promising 2× speed-ups with minor quality degradation.
Strengths and Weaknesses
I think the paper tackles a valuable and timely problem — making speculative decoding more efficient — and it introduces a relatively clean and automatable way to identify "important" mismatches. I also appreciate that the proposed method integrates smoothly into existing decoding systems and shows solid improvements on two benchmarks.
I have several concerns:
- I think the baselines chosen for comparison are somewhat weak. Speculative decoding is a very active area, and there are stronger follow-ups to the original algorithm that are not considered here. This makes it harder to gauge the real benefit of AutoJudge.
- More critically, I feel that the experimental setup — focusing only on GSM8K and code generation — somewhat hides the underlying simplicity of the classification problem. These domains have very rigid output formats and answer structures, so the classifier might just be learning dataset-specific patterns. In fact, I suspect that a simple rule-based filter could likely perform similarly within-domain. This makes me question the generality and transferability of the approach, especially for more open-ended or subjective generation tasks.
- I also noticed that there’s no proper ablation on the classifier itself. Since the core claim is about learning to identify important tokens, I would have expected comparisons with alternative classifiers, feature sets, or even rule-based methods to understand what’s really driving performance.
Questions
See Weaknesses.
Limitations
Yes
Final Justification
I have read the author rebuttal carefully and reviewed the other reviews and responses as well.
The concerns I initially raised, such as the missing classifier ablations and the possibility of overfitting to rigid tasks, were directly addressed with new results (e.g., evaluation on Qwen models, rule-based comparisons, and additional cross-dataset tests such as GSM8K → MATH_hard). These updates alleviate my major reservations.
That said, I maintain a lower confidence score, as I am not deeply familiar with the latest developments in speculative decoding. In particular, I cannot fully assess whether the chosen baselines (e.g., Top-K) are sufficient, or whether comparison with more recent methods is critical for a fair evaluation.
I raised my score accordingly (3 -> 4).
Formatting Issues
N/A
Thank you for your feedback and suggestions! We are glad that you appreciated the importance of more efficient speculative decoding and how our work contributes to improving it, as well as the simplicity of integrating AutoJudge into inference pipelines. Below, we discuss the concerns and suggestions you raised in the review, providing additional experiments to address them.
Speculative decoding is a very active area, and there are stronger follow-ups to the original algorithm that are not considered here. This makes it harder to gauge the real benefit of AutoJudge.
We agree that considering additional speculative decoding algorithms and setups would greatly strengthen the paper. To explore this direction, we integrated AutoJudge with the popular EAGLE-2 algorithm [1,2]. Unlike the original speculative decoding, EAGLE-2 does not have a separate draft model, but trains a lightweight “head” to predict future tokens, which makes it one of the few speculative algorithms fit for batched inference.
For AutoJudge decoding, we use only the target model hidden states (collected during verification) when training the classifier, but otherwise follow the same pipeline as in Section 4.1 for the Llama 3.1 8B target model, with two exceptions: 1) we use the EAGLE draft head instead of a separate draft model, and 2) we use a shorter draft window size of 8. The shorter window is needed because the EAGLE draft ‘head’ is not a standalone model and does not generate long coherent drafts; it is designed for speed rather than standalone accuracy.
Below, we report GSM8K accuracy and the average number of accepted tokens when integrating AutoJudge with the official PyTorch implementation of EAGLE 2. We also report real-world inference speed (tokens per second) using the vLLM implementation on a single A100-SXM4-80GB GPU. Since PyTorch and vLLM implementations are slightly different, we set PyTorch hyperparameters to match the parameters used in vLLM integration (depth = total tokens - 1).
| Method \ Tokens Bracket | 2.5-3 | 3-3.5 | 3.5-4 | 4-5 | 5+ |
|---|---|---|---|---|---|
| AutoJudge (threshold) | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Accepted Tokens | 2.872 | 3.128 | 3.656 | 4.458 | 5.119 |
| Baseline (nearest bracket) | |||||
| Accuracy, % | 82.18 | 74.68 | 63.08 | 45.11 | 29.95 |
| Accepted Tokens | 2.718 | 3.018 | 3.615 | 4.462 | 5.248 |
Next, we measure the real-world inference speed (tokens per second) with vLLM on a single A100-SXM4-80GB GPU. Similarly to Section 4.3, we also report relative speed-ups compared to standard Speculative Decoding (150.5 tokens/s).
| AutoJudge Threshold | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
|---|---|---|---|---|---|
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Speed, tokens/s | 158.8 | 170.5 | 195.7 | 221.0 | 225.5 |
| Speedup vs. Speculative Decoding | 1.05x | 1.13x | 1.30x | 1.47x | 1.5x |
The results suggest that AutoJudge can generalize to advanced follow-ups to speculative decoding algorithms. However, we agree that further exploration of various successors of Speculative Decoding would further improve our work. We will include these and additional experiments with speculative decoding follow-ups in the final version of the paper.
More critically, I feel that the experimental setup — focusing only on GSM8K and code generation — somewhat hides the underlying simplicity of the classification problem. These domains have very rigid output formats and answer structures, so the classifier might just be learning dataset-specific patterns. This makes me question the generality and transferability of the approach, especially for more open-ended or subjective generation tasks.
Thank you for your suggestion! For GSM8K and coding problems, it is indeed tempting to try simpler rule-based alternatives to AutoJudge (as you suggest later in the review), and we do that later in our response. We also agree that it would be interesting to evaluate AutoJudge with open-ended problems such as creative writing. In L135-137, we briefly address this scenario by describing how AutoJudge can detect important tokens in these problems using an LLM-as-a-Judge approach to compare answer quality. Due to the limited time of the author response phase, we prioritized your other suggestions, but we will explore AutoJudge for open-ended tasks in the final version of the paper.
I also noticed that there’s no proper ablation on the classifier itself. Since the core claim is about learning to identify important tokens, I would have expected comparisons with alternative classifiers, feature sets, or even rule-based methods to understand what’s really driving performance.
We perform the ablation of the AutoJudge classifier in Appendix B (L974 in the supplementary material, referenced in L214). Overall, Figures 5 and 6 suggest that using specifically the target-token hidden representations (as opposed to the hidden representations at the preceding position, which predicted that token) yields substantial gains in classifier AUC. Using both draft and target hidden representations yields a further, smaller (but essentially free) improvement. In turn, using “deeper” classifiers (e.g., MLP, RandomForest) led to no improvements and increased overfitting. We also consider alternative feature sets in Figure 6 (left).
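For illustration, here is a minimal sketch of this classifier setup (toy shapes and random placeholder data; in practice, the features are the draft/target hidden states collected at mismatching positions, and the labels come from the search described in Section 3.1):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, hidden_dim = 1000, 256  # toy sizes; real hidden sizes are model-specific

# Placeholder features: hidden states gathered at mismatching positions
# from the target model (during verification) and from the draft model.
X_target = rng.normal(size=(n, hidden_dim)).astype(np.float32)
X_draft = rng.normal(size=(n, hidden_dim)).astype(np.float32)
y = rng.integers(0, 2, size=n)  # 1 = mismatch is "important", 0 = harmless

X = np.concatenate([X_target, X_draft], axis=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# At decoding time, a mismatching draft token is rejected if the predicted
# probability of being important exceeds a threshold (the accuracy/speed knob).
p_important = clf.predict_proba(X[:5])[:, 1]
print(p_important > 0.5)
```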
(continued) … rule-based methods to understand what’s really driving performance.
To explore how AutoJudge decoding compares to rule-based methods for mathematical reasoning, we compare it against two heuristic alternatives. The first heuristic approach is to consider only the mathematical tokens as ‘important’. To that end, we filter the mismatching tokens that contain numbers, operations (e.g., + - * / =, etc), as well as some common variables. Mismatches in these mathematical tokens are rejected, whereas mismatches in non-mathematical text are allowed.
As it turns out, this algorithm misses important planning and logical steps that do not contain math explicitly, which results in poor accuracy. To address this, we introduce a second, more complex heuristic approach that combines the mathematical rule with the Top-K baseline we use in Section 4. This works somewhat better, but still does not outperform the learned AutoJudge classifier.
GSM8K (0-shot), rule-based methods, 1B draft / 8B target models
| Heuristic method | Math Only | Math & Top-1024 | Math & Top-128 |
|---|---|---|---|
| Accuracy, % | 65.05 | 72.27 | 80.36 |
| Accepted Tokens | 39.7 | 22.3 | 15.5 |
| AutoJudge (nearest) | |||
| Accuracy, % | - | 79.37 | 84.15 |
| Accepted Tokens | - | 22.7 | 15.7 |
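For concreteness, here is a minimal sketch of the “Math Only” mismatch filter described above (the token patterns below are illustrative rather than the exact list we used):

```python
import re

# Treat a mismatching draft token as "important" (i.e., reject it) if it looks
# mathematical: digits, arithmetic/comparison operators, or common variables.
MATH_PATTERN = re.compile(r"[0-9+\-*/=<>%]|\b[xyzn]\b")

def is_important_mismatch(draft_token: str) -> bool:
    return bool(MATH_PATTERN.search(draft_token))

# Mismatches in plain text are accepted; mismatches in math tokens are rejected.
print(is_important_mismatch(" therefore"))  # False -> keep the draft token
print(is_important_mismatch(" 42"))         # True  -> fall back to the target token
```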
To summarize, we believe that we have discussed the concerns raised in the review and conducted additional experiments following your suggestions. We respectfully ask you to reevaluate your original score based on these additional results. If we missed something, please inform us in the follow-up response, so we can address it in the final version of the paper.
[1] Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning.
[2] Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024, November). EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 7421-7432).
Thank you for the detailed and thoughtful rebuttal. I appreciate the authors' efforts in addressing the concerns I raised.
Most of my initial questions have been satisfactorily answered.
That said, I still have some reservations about cross-domain generalization—specifically, how well the trained classifier might transfer to domains with different output structures or task semantics, or whether a jointly trained classifier across diverse domains could generalize more robustly. I acknowledge, however, that such generalization is fundamentally difficult: a classifier trained on math-style reasoning is unlikely to be effective for tasks like summarization or dialogue, which involve entirely different notions of “importance.” I don’t view this as a flaw, but rather a natural limitation of task-specific methods.
Given the expanded empirical evidence, I will raise my score.
The paper proposes AutoJudge, a fully automated protocol for task-specific acceleration of speculative decoding. AutoJudge identifies tokens on which the (small) surrogate and (large) target LLMs disagree and characterizes whether that disagreement is critical. The paper proposes to reframe the inference from detecting individual important tokens to finding combinations of tokens that jointly affect the final answer, using a trained classifier (based on logistic regression). The evaluation shows 1) a ~2x speedup over speculative decoding at the cost of under a 1% drop in accuracy; 2) the improvement of AutoJudge over Top-K inference in terms of accuracy and run-time for both mathematical reasoning and programming tasks.
Strengths and Weaknesses
Strengths:
-
Given the emergence of speculative LLM decoding, finding ways to preserve answer correctness while reducing LLM inference time is valuable. The paper automates several steps of a recent strategy, Judge Decoding (from ICLR'25).
-
The proposed approach presents promising initial results. The comparison on two standard datasets shows the improvement over the Top-K baseline (which was used in the ICLR'25 paper too).
-
The evaluation shows that the design of AutoJudge could generalize across different tasks. I encourage the authors to continue investigating this question.
Weaknesses:
-
No comparison with JudgeDecoding and/or generalization to other advanced speculative decoding algorithms (as the authors mentioned in the Future work paragraph). The current baseline is weak, even though the initial results are encouraging. Having stronger alternative baselines would lead to a better comparison to state-of-the-art in this nascent problem domain. While I understand that the full comparison with Judge Decoding may not be possible without their annotations, can you re-create a smaller set of annotations based on that paper and run a study with those?
-
The proposed classifier design, a logistic regression, is rather simple, and it is unclear how its training generalizes across different tasks. Without comparing to Judge Decoding or other methods, it is unclear whether it yields sufficient accuracy. Further, some details from Appendix B would be useful to move to the main paper to motivate the model's use. Overall, a study of different families of classifiers would help understand this problem better.
-
(minor) Some of the existing results can be improved too: (1) The current results are on only several pairs of LLMs. More combinations would give a better picture of the tradeoff space between the overheads and the model's accuracy. The results presented in the appendix (i.e., other sizes of Llamas) are good, but it would be even better if the impact of those (and some other combinations of small open-source LLMs) on accuracy and run time were presented in a separate figure in the main paper. (2) In addition to the time speedups, include estimates of the consumed energy while running the target and surrogate models in parallel. The work is acceptable even if energy consumption is larger than for a single model, but characterizing the tradeoff makes the paper stronger.
Questions
None
Limitations
Yes
Final Justification
I am satisfied with the authors' detailed response to my questions.
Formatting Issues
None
Thank you for your detailed review and suggestions for improving our work! To address these suggestions, we conducted additional experiments, reporting their results below.
The evaluation shows that the design of AutoJudge could generalize across different tasks. I encourage the authors to continue investigating this question.
We agree that cross-task generalization is an interesting direction. While we don’t expect AutoJudge to generalize between completely unrelated tasks, it is interesting to see how it generalizes to adjacent task types. To that end, we evaluate the AutoJudge classifier trained on the GSM8K dataset (Section 4.1) on a subset of the MATH benchmark (math_hard from Llama evals). While both benchmarks involve mathematical reasoning, GSM8K has integer answers, whereas MATH_hard is answered with mathematical formulae and has generally harder problems.
We follow the evaluation protocol from the official “Llama-3.1-8B-Instruct-evals” by Meta. We use Llama 3.1-8B Instruct as the target model and Llama 3.2-1B Instruct as the draft model. Since the author response rules prohibit us from uploading images, we reformat our results as a table, reporting the best results in every accepted-tokens bracket.
MATH_hard (256 samples) accuracy and accepted tokens (cross-domain evaluation)
| Accepted token bracket | <25 | 25-30 | 30-35 | >35 |
|---|---|---|---|---|
| AutoJudge (trained on GSM8K) | ||||
| Accuracy, % | 17.6 | 16.4 | 14.1 | 12.1 |
| Accepted Tokens | 24.8 | 28.16 | 33.89 | 36.72 |
| Top-K baseline | ||||
| Accuracy, % | 15.5 | 15.2 | 11.3 | 9.8 |
| Accepted Tokens | 24.36 | 28.24 | 32.59 | 37.77 |
While the improvement is not as high as on the dataset that the classifier was trained on, the results suggest that AutoJudge can indeed generalize between these two similar tasks. We agree that this is an important evaluation scenario and will add these results and explore additional cross-task generalization further in the final version of the paper.
No comparison with JudgeDecoding and/or generalization to other advanced speculative decoding algorithms (as the authors mentioned in the Future work paragraph).
While we originally intended this as future work, we agree that testing AutoJudge with more advanced speculative decoding algorithms would be beneficial. To that end, we compare AutoJudge with two such advanced algorithms: EAGLE-2 and offloaded speculative decoding.
AutoJudge with EAGLE-2
Our first advanced algorithm is EAGLE-2 [1,2] that trains a draft ‘head’ on top of target model hidden representations to predict future states. We run our experiments using both the official EAGLE codebase (for training, accuracy, and accepted tokens) and its vLLM integration (for tokens/s), using the hyperparameters that match the behavior of the vLLM version (i.e., depth = total_tokens-1).
Below, we report our evaluation results on the GSM8K dataset in the same evaluation setup as in Section 4.1, except with EAGLE instead of vanilla speculative decoding and a smaller window size of 8. This is because EAGLE was trained with shorter drafts and becomes incoherent after drafting 8-10 tokens. To comply with the author response rules, we format our results as a table.
| Method \ Tokens Bracket | 2.5-3 | 3-3.5 | 3.5-4 | 4-5 | 5+ |
|---|---|---|---|---|---|
| AutoJudge (threshold) | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Accepted Tokens | 2.872 | 3.128 | 3.656 | 4.458 | 5.119 |
| Baseline (nearest bracket) | |||||
| Accuracy, % | 82.18 | 74.68 | 63.08 | 45.11 | 29.95 |
| Accepted Tokens | 2.718 | 3.018 | 3.615 | 4.462 | 5.248 |
Additionally, we report vLLM inference speed (tokens per second) for a single A100-SXM4 GPU and relative speed-ups for the same thresholds as above. The speed-ups are relative to Speculative Decoding (150.5 tokens/s), similar to how they are reported in Table 1.
| AutoJudge Threshold | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
|---|---|---|---|---|---|
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Speed, tokens/s | 158.8 | 170.5 | 195.7 | 221.0 | 225.5 |
| Speedup vs. Spec. Decoding | 1.05x | 1.13x | 1.30x | 1.47x | 1.5x |
The results suggest that AutoJudge decoding can indeed generalize to this more advanced speculative decoding algorithm.
AutoJudge with Offloaded Speculative Decoding
Another advanced use case for speculative decoding is when the target model does not fit on the device. Recent works that target this scenario tend to generate significantly longer drafts [3,4]. This is because running the offloaded target model takes significantly longer, regardless of how many draft tokens it needs to verify.
For consistency, we use the same GSM8K benchmark with Llama 3.1 8B draft / Llama 3.1 70B target models as in the original Table 1 (right) from Section 4.3. The main purpose of this evaluation is to compare speedups in this more advanced technical scenario. We evaluate on a single A100-SXM4 GPU, which cannot fit the target model without offloading. We evaluate AutoJudge with a threshold of 0.26 (see Table 1 in Section 4.3), and report the average inference speed (tokens per second) for different window sizes below.
| Method \ Draft Length | 8 | 12 | 16 | 32 | 64 |
|---|---|---|---|---|---|
| AutoJudge (offloading) (toks/s) | 0.54 | 0.69 | 0.81 | 0.78 | 0.70 |
| Leviathan (offloading) (toks/s) | 0.35 | 0.35 | 0.36 | 0.36 | 0.31 |
| Speedup | 1.54x | 1.97x | 2.25x | 2.17x | 2.26x |
While I understand that the full comparison with Judge Decoding may not be possible without their annotations, can you re-create a smaller set of annotations based on that paper and run a study with those?
While it would be difficult, we generally agree that it would be valuable to compare AutoJudge automated annotation against human-labelled data. However, unlike our previous experiments that only required integrating AutoJudge into existing benchmarks, human annotation requires a longer timeframe than the duration of the Author Response phase. Hence, we will only be able to conduct this experiment after the discussion to include it in the final version of the paper.
The proposed classifier design, a logistic regression, is rather simple and it is unclear how its training generalizes across different tasks. Further, some details from Appendix B would be useful to move to the main paper to motivate the model's use. Overall, a study of different families of classifiers would help understand this problem better.
We explore several alternative classifier designs and feature sets in Appendix B (summarized and referenced in S3.2) and find that a simple linear model is sufficient. One possible explanation for this is that the linear classifier has access to the feature representations of the target LLM. This coincides with prior findings from Judge Decoding that also use a linear model. If the paper is accepted, we will be happy to describe these results in the main paper using the additional content page.
The current results are on only several pairs of LLMs. More combinations would give a better picture of the tradeoff space between the overheads and model's accuracy.
To alleviate this concern, we report results with another model pair, but this time from a different family: Qwen 2.5 Instruct models. We use Qwen 2.5 0.5B Instruct as the draft and 7B Instruct target model and summarize our results below.
| Accepted token bracket | <9 | 9-13 | 13-25 | >25 |
|---|---|---|---|---|
| AutoJudge (trained on GSM8K) | ||||
| Accuracy, % | 87.4 | 87.6 | 83.2 | 79.8 |
| Accepted Tokens | 8.95 | 12.30 | 24.00 | 27.53 |
| Top-K baseline | ||||
| Accuracy, % | 86.2 | 82.0 | 52.8 | 46.8 |
| Accepted Tokens | 8.96 | 9.13 | 21.95 | 27.30 |
The results show similar speed-ups (up to 2.68x more accepted tokens) with minor accuracy drawdowns. We will include this and additional experiments from the Qwen model family in the final version of the paper.
In addition to the time speedups, include the estimates of consumed energy while running the target and surrogate models in parallel. The work is acceptable even if energy consumption is larger than a single model, but characterizing the tradeoff makes the paper stronger.
To estimate the energy consumption of AutoJudge, vanilla speculative decoding, and sequential decoding, we run each inference method on a 10% sample of the GSM8K test set using vLLM. We measure the real-world GPU power draw (in Watts), as reported by nvidia-smi, for the Llama 3.2 1B draft / Llama 3.1 8B target pair on a single A100-SXM4-80GB GPU, and multiply it by the mean inference time. This represents the GPU-reported power consumption before adjusting for PSU inefficiency (which applies equally to AutoJudge and the baselines) and will vary between GPU types. For convenience, we convert all results to kJ, similar to [5].
| Method | Autoregressive | Speculative Decoding | AutoJudge |
|---|---|---|---|
| Energy Consumption, kJ | 87 | 43 | 37 |
Note that both AutoJudge and vanilla speculative decoding turned out to be more energy efficient than simple autoregressive decoding. We attribute this to the fact that LLM inference power consumption is affected not just by ‘FLOPs’, but also by memory reads/writes to load the LLM parameters. Since the autoregressive decoding requires more sequential operations to generate a given sequence, it has a higher total GPU energy consumption per task.
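For reference, here is a minimal sketch of this kind of measurement (not our exact script: the fixed 10-second loop below stands in for the actual benchmark run, and it requires a machine with nvidia-smi available):

```python
import subprocess, time

def gpu_power_watts(gpu_index: int = 0) -> float:
    # Query the instantaneous board power draw (Watts) via nvidia-smi.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits",
         "-i", str(gpu_index)],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())

# Sample power while decoding runs, then approximate energy as
# mean power (W) * elapsed time (s), reported in kJ.
samples, start = [], time.time()
while time.time() - start < 10.0:  # placeholder for "while the benchmark runs"
    samples.append(gpu_power_watts())
    time.sleep(0.5)
elapsed = time.time() - start
energy_kj = (sum(samples) / len(samples)) * elapsed / 1000.0
print(f"~{energy_kj:.1f} kJ over {elapsed:.0f} s")
```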
In summary, we believe we addressed the main concerns raised in your review with additional experiments and clarifications. If our response answers your questions, we would kindly ask you to reevaluate your score for the submissions. If you have further suggestions, we are happy to address them during the Reviewer-Author Discussion phase.
[1] Li et al. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024.
[2] Li et al. EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. EMNLP 2024
[3] Svirschevski et al. SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices. NeurIPS 2024.
[4] Miao et al. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. ASPLOS '24
[5] Maliakel et al. Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings, arXiv:2501.08219
Thank you for the detailed response. I will increase my rating as the result.
This paper introduces AutoJudge, a lossy speculative decoding framework that leverages unimportant-token classification to boost the draft model's acceptance length. To accomplish this, AutoJudge identifies sets of tokens that don't alter a verifiable result and trains a linear classifier with intermediate representations as inputs. Evaluation on two combinations of Llama-3 models on math and programming showcases a 1.4-2x speedup at 1pp of accuracy degradation.
Strengths and Weaknesses
Strengths
- The paper is nicely motivated and tackles a very timely and important problem with a practical solution that is more scalable than existing baselines.
- The paper showcases the generalisability of the trained classifier across different tasks.
- The method has been integrated with vLLM and shows significant improvements over the selected baselines.
Weaknesses
- The paper has only assessed a single family of models (Llama-3) on two verifiable tasks. It would have been interesting to see additional speculative decoding baselines and how autojudge can be applied on top of such setups.
- It is unclear how the method would perform under large context sizes and different decoding strategies.
- The overhead of the classifier module has not been quantified.
Questions
- How does the proposed method perform on top of different speculative decoding methods, like EAGLE [a,b,c] or HASS [d] or multi-token prediction [e]?
- Are there any bounds in the distribution shifts that can happen in the lossy AutoJudge approach?
- How does the draft model confidence correlate with the classification module's output?
- Is the semi-greedy search ordered somehow (e.g. left-to-right) or do replacements happen randomly?
[a] Li, Y., Wei, F., Zhang, C., & Zhang, H. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning.
[b] Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024, November). EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 7421-7432).
[c] Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). Eagle-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840.
[d] Zhang, L., Wang, X., Huang, Y., & Xu, R. (2024). Learning Harmonized Representations for Speculative Sampling. arXiv preprint arXiv:2408.15766.
[e] Mehra, S., Garcia, J. A., & Mauch, L. (2025). On multi-token prediction for efficient LLM inference. arXiv preprint arXiv:2502.09419.
Limitations
The method's main limitation is its applicability on verifiable tasks. While the authors touch upon interesting topics of judge evaluation for non-verifiable tasks, this is left as future work.
Additionally, while the authors state that their method can be implemented on top of different speculative decoding schemes, this has not been shown or evaluated.
Last, testing the limits of the method's generalisability on long-context tasks would have been an welcome and relevant addition in the evaluation.
Final Justification
I am happy with the paper and the rebuttal, which did not affect my score.
Formatting Issues
It would be beneficial to include forward references from the main text to the appendix for better content discoverability.
Thank you for your feedback! Below, we answer the questions raised in your review and provide additional experiment results to address your concerns.
The paper has only assessed a single family of models (Llama-3) on two verifiable tasks.
We agree that our analysis would benefit from exploring additional model families and tasks. To that end, we explore an extra model family of Qwen-2.5. Below, we report a pair of Qwen2.5-0.5B-Instruct draft model and Qwen2.5-7B-Instruct target model on GSM8K benchmark using the same evaluation setup as in Section 4.1. Due to the restrictions for this year’s author response (no images), we had to reformat our results as a table.
| Accepted token bracket | <9 | 9-13 | 13-25 | >25 |
|---|---|---|---|---|
| AutoJudge (trained on GSM8K) | ||||
| Accuracy, % | 87.4 | 87.6 | 83.2 | 79.8 |
| Accepted Tokens | 8.95 | 12.30 | 24.00 | 27.53 |
| Top-K baseline | ||||
| Accuracy, % | 86.2 | 82.0 | 52.8 | 46.8 |
| Accepted Tokens | 8.96 | 9.13 | 21.95 | 27.30 |
The results suggest that AutoJudge decoding offers significantly better accuracy to speed (token acceptance rate) trade-offs than the Top-K criterion. To further explore different tasks, we also evaluate on a subset of the MATH benchmark in our response to Reviewer vcrD. We will explore additional Qwen model pairs and tasks in the final version of the paper.
It would have been interesting to see additional speculative decoding baselines and how autojudge can be applied on top of such setups. … How does the proposed method perform on top of different speculative decoding methods, like EAGLE [a,b,c] or HASS [d] or multi-token prediction [e]?
To address this, we evaluate and compare AutoJudge with EAGLE-2 [a,b], a popular speculative decoding algorithm that trains a predictive draft ‘head’ instead of using a separate draft model. We evaluate the Llama 3.1-8B Instruct model using the “EAGLE-LLaMA3.1-Instruct-8B” draft head published with the original paper [a]. We use the official PyTorch implementation to report accuracy and accepted tokens, and run the vLLM EAGLE integration for real-world inference speed (tokens per second) with the same configuration as in Section 4.3. We configure the hyperparameters to match the behavior of the vLLM model (i.e., depth = total_tokens - 1).
AutoJudge with EAGLE, GSM8K (0-shot) accuracy and inference speed
| Method \ Tokens Bracket | 2.5-3 | 3-3.5 | 3.5-4 | 4-5 | 5+ |
|---|---|---|---|---|---|
| AutoJudge (threshold) | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Accepted Tokens | 2.872 | 3.128 | 3.656 | 4.458 | 5.119 |
| Baseline (nearest bracket) | |||||
| Accuracy, % | 82.18 | 74.68 | 63.08 | 45.11 | 29.95 |
| Accepted Tokens | 2.718 | 3.018 | 3.615 | 4.462 | 5.248 |
We also report vLLM inference speed (tokens/s) on one A100-SXM4 GPU and the relative speed-up compared to Speculative Decoding (150.5 tokens/s). For this evaluation, we use the same 5 thresholds as above and otherwise follow the same protocol as in Section 4.3.
| AutoJudge Threshold | 0.01 | 0.08 | 0.5 | 0.8 | 0.9 |
|---|---|---|---|---|---|
| Accuracy, % | 82.34 | 80.36 | 73.77 | 60.12 | 46.70 |
| Speed, tokens/s | 158.8 | 170.5 | 195.7 | 221.0 | 225.5 |
| Speedup vs. Speculative Decoding | 1.05x | 1.13x | 1.30x | 1.47x | 1.5x |
The results above correspond to AutoJudge decoding on the GSM8K dataset using the same setup as in Section 4.1, but with the EAGLE draft head and a smaller window size of 8, since the EAGLE draft ‘head’ was not trained to generate longer drafts. We use the same window size and other hyperparameters for both AutoJudge classifier training and inference. The results suggest that AutoJudge can generalize to at least one significantly different speculative decoding setup.
It is unclear how the method would perform under large context sizes and different decoding strategies.
It would indeed be interesting to explore alternative decoding strategies. Note that some of our benchmarks (e.g., LiveCodeBench in the main paper) already require long-form reasoning with thousands of generated tokens, though there are tasks with significantly longer inputs. Due to the limited time for author responses, we prioritized your other suggestions, but we will explore more long-context tasks in the final version of the paper.
The overhead of the classifier module has not been quantified.
Since we claim that the overhead of using a linear classifier is negligible (L194), we agree that it would be best to more formally quantify the overhead. Compute-wise, the classifier head can be merged with the existing LM “head” (L212) of draft and target models to reduce its overhead. In that case, the actual multiplication is done in the same kernel call as the next token prediction. However, using the predicted classifier outputs does incur a minor overhead.
We evaluate the runtime overhead of running the classifier on an A100-SXM4-80GB GPU in the same setup as in Section 4.3. We found that the compute overhead is within 0.3% even without merging (0.026 ms for the classifier vs. 6.992 ms for the 8B model forward pass). Likewise, the memory overhead of storing the classifier weights is within 10^-5 of the target model weights (12 KiB for the 1B/8B pair and 24 KiB for the 8B/70B pair, versus tens of GiB for the model weights themselves).
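For intuition, here is a sketch of the head-merging idea (PyTorch, toy sizes, illustrative names): the classifier's weight vector is appended as one extra row of the LM head, so its logit comes out of the same matmul as the next-token logits.

```python
import torch

vocab_size, hidden_dim = 32_000, 512    # toy sizes; real models are much larger

lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)
clf_weight = torch.randn(hidden_dim)    # trained logistic-regression weights
clf_bias = torch.randn(())              # classifier bias, applied after the matmul

# Merged head: one extra output row holds the classifier, so its logit is
# produced by the same kernel call as the next-token logits.
merged = torch.nn.Linear(hidden_dim, vocab_size + 1, bias=False)
with torch.no_grad():
    merged.weight[:vocab_size] = lm_head.weight
    merged.weight[vocab_size] = clf_weight

h = torch.randn(1, hidden_dim)          # hidden state at a mismatching position
logits = merged(h)
next_token_logits = logits[:, :vocab_size]
p_important = torch.sigmoid(logits[:, vocab_size] + clf_bias)
```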
Are there any bounds in the distribution shifts that can happen in the lossy AutoJudge approach?
The natural upper bound on the token distribution shift is the shift between draft and target models on the tokens deemed important by the classifier. Establishing a tighter guarantee on the distribution shift would be difficult due to the complex nature of LLMs. However, if you would like to explore this problem from a specific angle, we are happy to discuss this further.
How does the draft model confidence correlate with the classification module's output?
Thank you for suggesting this sanity check. When measuring this, we found that the instruct-tuned draft models are often overconfident, and the probability of the chosen draft token in these models is not well correlated with that token’s importance to the downstream task. More specifically, we measured the correlation between the Llama 3.2 1B Instruct draft model’s probability of the chosen draft token and the corresponding AutoJudge classifier probability on the 1B/8B model pair on the GSM8K dataset (0-shot). The resulting Pearson correlation varies within ±0.3 per generation, and the full sample covariance is -0.073.
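For completeness, the per-generation statistic we report can be computed as in this sketch (the arrays below are placeholders, not our actual measurements; in practice, both quantities are logged during decoding):

```python
import numpy as np
from scipy.stats import pearsonr

# Draft-token probability vs. AutoJudge importance score at mismatching positions.
draft_token_prob = np.array([0.91, 0.55, 0.98, 0.40, 0.87, 0.73])
judge_importance = np.array([0.20, 0.35, 0.15, 0.30, 0.45, 0.10])

r, _ = pearsonr(draft_token_prob, judge_importance)
print(f"Pearson r = {r:.3f}")
```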
Is the semi-greedy search ordered somehow (e.g. left-to-right) or do replacements happen randomly?
In Algorithm 1, we order the search from left to right, as you mentioned (line number 9 in the algorithm). This is because every time we consider a token, there is a chance that we will need to replace it and re-generate all subsequent tokens given the new prefix. Hence, changing an earlier token can invalidate the mismatches found in subsequent tokens. To avoid this, AutoJudge does indeed explore tokens left-to-right. We will state this more clearly in Section 3.1.
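To make the ordering concrete, here is a simplified sketch of the left-to-right search (the callables `generate_target`, `generate_draft_token`, and `final_answer_matches` are placeholders for the components described in Section 3.1, not actual function names from our code):

```python
def label_important_mismatches(prompt, generate_target, generate_draft_token,
                               final_answer_matches):
    """Simplified left-to-right sketch of the labeling search.

    Walks the target generation left to right; at each mismatch, tentatively
    keeps the draft token and regenerates the suffix. If the final answer is
    preserved, the mismatch is labeled unimportant, otherwise important.
    """
    response = generate_target(prompt, prefix=[])  # reference generation (token list)
    labels = []                                    # (position, is_important)
    i = 0
    while i < len(response):
        prefix = response[:i]
        draft_tok = generate_draft_token(prompt, prefix)
        if draft_tok != response[i]:               # a mismatching position
            candidate = prefix + [draft_tok]
            continuation = generate_target(prompt, prefix=candidate)
            if final_answer_matches(response, candidate + continuation):
                labels.append((i, False))          # unimportant: keep the draft token
                response = candidate + continuation  # continue on the new branch
            else:
                labels.append((i, True))           # important: keep the target token
        i += 1
    return labels
```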
We hope that these new results address the concerns raised in your review. In the final version of the paper, we plan to use the extra space to further explore the directions you suggested, as detailed above.
I would like to thank the authors for their valuable clarifications and additional experiments. I will keep my score as is.
LLM inference is expensive. This method proposes a classifier that detects which tokens are important in the context of speculative decoding of LLM generations. If a token produced by speculative decoding doesn't exactly match the original one, the classifier can still tag it as "unimportant" and we can continue the speculative decoding without significantly affecting the generation quality. This is evaluated in relevant LLM benchmarks, showing improved decoding speed at the expense of small accuracy degradation.
Strengths and Weaknesses
Strengths
- The method is novel and well-principled.
- Automatically constructs training data.
- The empirical results are compelling.
- The application (speeding up LLM decoding in real-world tasks without degrading accuracy) is relevant. The integration with vLLM is useful and proof that authors take real-world applications seriously. This integration, if released, increases the likelihood that the method will be used.
- The design choices are well-motivated. For example, "2. Which token alternative to use?" in line 196.
Weaknesses
- The task-specificity of the method might limit its applications. Plus, as acknowledged by the authors, this method requires having a pre-determined way to extract the final answer, which might not work in less "templatic" scenarios.
- Relatively small novelty with respect to Bachmann.
Questions
"a semi-greedy search that starts from the target model response and iteratively tries to replace mismatching target model tokens with their draft counterparts": does the method go token-by-token? Did you consider any heuristic to skipping tokens?
The way of extracting the final answer is task-specific. However, once the classifier has been already trained, would it work for other tasks?
Limitations
The main limitation of this work is already acknowledged by the authors. An additional one I observe is that there aren't many benchmarks (or models) represented in the results, which might pose some doubts on whether this would work in other cases.
Final Justification
I have updated the “Significance” score given the additional results provided by the authors, which show some potential for cross-task generalization and results for additional tasks. The rebuttal is compelling and I recommend acceptance, and that they include the additional results in the camera-ready version.
However, my original global score was already strong, a 5, and while I do think it's a high-quality paper, I don't think it's enough for updating my global score to 6.
Formatting Issues
Thank you for your valuable feedback and insightful questions! Please find our responses below. For convenience, we answer the questions in the order they are raised in the review.
"a semi-greedy search that starts from the target model response and iteratively tries to replace mismatching target model tokens with their draft counterparts": does the method go token-by-token? Did you consider any heuristic to skipping tokens?
Algorithm 1 does indeed go token-by-token, but it only considers the “mismatching” tokens (L7 in Algorithm 1). These are tokens where the draft and target models would predict different next tokens given the same (current) prefix. Across different models and benchmarks, we found that only one in 10-20 generated tokens is a mismatch. This is because many of the tokens can be unambiguously predicted from the prefix (e.g. following grammar / punctuation rules or repeating an earlier sentence). Another contributing factor is that the draft and target models come from the same family (both Llama in the main paper, Qwen below) in the sense that they were trained on the same data and predict similar next tokens.
In a setup with substantially more mismatches (e.g. weak draft model), one way to speed up Algorithm 1 is to prune obviously bad draft tokens early. This could be implemented, for instance, by forbidding any draft tokens that have very low probability according to the target model (e.g. <0.001).
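A sketch of this early-pruning rule (the threshold and shapes are illustrative):

```python
import torch

def prune_bad_draft_tokens(draft_token_ids: torch.Tensor,
                           target_logits: torch.Tensor,
                           min_prob: float = 1e-3) -> torch.Tensor:
    """Mask of mismatching draft tokens worth exploring in the labeling search.

    draft_token_ids: [num_mismatches] proposed draft tokens.
    target_logits:   [num_mismatches, vocab] target-model logits at those positions.
    Draft tokens the target model considers extremely unlikely are skipped.
    """
    target_probs = torch.softmax(target_logits, dim=-1)
    p_draft = target_probs.gather(-1, draft_token_ids.unsqueeze(-1)).squeeze(-1)
    return p_draft >= min_prob  # True -> worth trying the replacement

# Example with random logits and a toy vocabulary of 100 tokens:
print(prune_bad_draft_tokens(torch.tensor([5, 17]), torch.randn(2, 100)))
```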
The way of extracting the final answer is task-specific. However, once the classifier has been already trained, would it work for other tasks?
The answer would likely depend on how similar this new task is. If the classifier was trained on GSM8K with no code in the training set, it will not generalize to coding problems. However, it might be able to generalize to other problems that involve mathematical reasoning.
To quantify this, we evaluate the AutoJudge classifier trained on the GSM8K dataset (from Section 4.1) on a subset of the MATH benchmark (256 problems from math_hard) using the protocol from the official “Llama-3.1-8B-Instruct-evals” dataset by Meta.
While both benchmarks involve mathematical reasoning, the problems in MATH_hard are harder in the sense that they require longer chains of thought. Furthermore, the answer to these problems is a mathematical expression, whereas GSM8K problems have an integer answer.
We use Llama 3.1-8B Instruct as the target model and Llama 3.2-1B Instruct as a draft model. Since the NeurIPS author response rules prohibit us from uploading images, we reformat our results as a table, reporting the best results in every bracket of accepted token range.
MATH_hard (256 samples) accuracy and accepted tokens (cross-domain evaluation)
| Accepted token bracket | <25 | 25-30 | 30-35 | >35 |
|---|---|---|---|---|
| AutoJudge (trained on GSM8K) | ||||
| Accuracy, % | 17.6 | 16.4 | 14.1 | 12.1 |
| Accepted Tokens | 24.8 | 28.16 | 33.89 | 36.72 |
| Top-K baseline | ||||
| Accuracy, % | 15.5 | 15.2 | 11.3 | 9.8 |
| Accepted Tokens | 24.36 | 28.24 | 32.59 | 37.77 |
While the advantage is not as high as on the task that the classifier was trained on, this suggests that AutoJudge can indeed generalize between these two similar tasks. We agree that this is an important evaluation scenario and will add these results and explore additional cross-task generalization further in the final version of the paper.
The main limitation of this work is already acknowledged by the authors. An additional one I observe is that there aren't many benchmarks (or models) represented in the results, which might pose some doubts on whether this would work in other cases.
To alleviate this concern, we report additional model evaluations and extra benchmarks in addition to the MATH_hard evaluation above. To see how our algorithm generalizes beyond the Llama model family, we evaluate AutoJudge with the Qwen 2.5 Instruct model family, with the 0.5B draft model and 7B target model. These evaluations follow the same protocol as in Section 4.1, except that we use the recommended system prompts for Qwen 2.5 models from the official technical report [1]. The results are summarized in the table below.
| Accepted token bracket | <9 | 9-13 | 13-25 | >25 |
|---|---|---|---|---|
| AutoJudge (trained on GSM8K) | ||||
| Accuracy, % | 87.4 | 87.6 | 83.2 | 79.8 |
| Accepted Tokens | 8.95 | 12.30 | 24.00 | 27.53 |
| Top-K baseline | ||||
| Accuracy, % | 86.2 | 82.0 | 52.8 | 46.8 |
| Accepted Tokens | 8.96 | 9.13 | 21.95 | 27.30 |
The results show similar speed-ups (up to 2.68x more accepted tokens) with minor accuracy drawdowns. We will include this and additional experiments from the Qwen model family in the final version of the paper. To further explore alternative use cases, we combine AutoJudge with the EAGLE decoding algorithm (see response to Reviewer DSqA).
If you have any further questions or suggestions, we are happy to address them in the Reviewer-Author discussion phase.
[1] Qwen Team (Yang et al.), Qwen2.5 Technical Report, arXiv:2412.15115, 2024a.
Thank for your comments and additional results. The cross-task generalization results are promising.
I also appreciate the clarification on how the algorithm works.
I'm happy with clarifications and I'll update my scores accordingly. I recommend you include these additional results in the final paper (along with some discussion).
Despite some limitations of the proposed approach, the paper presents a novel, automatable, and practical solution for improving speculative decoding. Its demonstrated integration with a real-world system (vLLM), competitive speedups, and clean framing make it a valuable contribution to the efficient-inference community. The limitations are acknowledged and partially addressed during the rebuttal phase. While the method is simple in design, it is effective, and its simplicity enhances scalability and ease of deployment. I recommend acceptance.