Rating: 8.2/10 · Spotlight · 4 reviewers
Scores: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.3

NeurIPS 2025

SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Submitted: 2025-05-07 · Updated: 2025-10-29
TL;DR

A model-free speculative decoding method that accelerates agentic AI workloads using suffix trees. Achieves 5.3x speedup on multi-agent tasks.

Abstract

Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 3.9$\times$, outperforming state-of-the-art methods -- 2.2$\times$ faster than model-based approaches like EAGLE-2/3 and 1.6$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced.
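As a concrete illustration of the mechanism described above, the sketch below caches past token sequences in a trie over bounded suffixes, matches the longest suffix of the current context, and greedily drafts the most frequent continuation. This is a minimal, hypothetical simplification (all names are illustrative), not the authors' open-sourced implementation:

```python
# Minimal sketch of suffix-tree speculation: index suffixes of past outputs,
# match the longest suffix of the current context, draft frequent continuations.

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # token -> Node
        self.count = 0       # how often this continuation was observed

class SuffixTrie:
    def __init__(self, max_depth=64):
        self.root = Node()
        self.max_depth = max_depth

    def insert(self, tokens):
        # Index every (depth-bounded) suffix so any recent context can match later.
        for i in range(len(tokens)):
            node = self.root
            for tok in tokens[i:i + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def speculate(self, context, max_spec=16):
        # Try suffixes of the context from longest to shortest; on the first
        # full match, emit up to max_spec draft tokens by observed frequency.
        for start in range(max(0, len(context) - self.max_depth), len(context)):
            node, matched = self.root, True
            for tok in context[start:]:
                child = node.children.get(tok)
                if child is None:
                    matched = False
                    break
                node = child
            if matched:
                draft = []
                while node.children and len(draft) < max_spec:
                    tok, node = max(node.children.items(),
                                    key=lambda kv: kv[1].count)
                    draft.append(tok)
                return draft
        return []

tree = SuffixTrie()
tree.insert([1, 2, 3, 4, 2, 3, 4, 5])   # a previous completion
print(tree.speculate([9, 2, 3]))        # -> [4, 2, 3, 4, 5]
```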
Keywords

speculative decoding, LLM agents, model-free speculation, SWE-Bench, LLM inference

Reviews and Discussion

Review (Rating: 5)

This paper proposes a novel model-free speculative decoding algorithm for the LLM agent domain. By leveraging a prefix tree, the proposed algorithm can effectively find previous tokens for new speculation. Even though this method cannot generalize to all domains, it serves well for the agent domain, where generation often follows certain patterns. Overall, I think this paper proposes an efficient solution to this niche challenge.

Strengths and Weaknesses

Strengths:

  1. The proposed method is sound and carefully designed for the target scenario
  2. The paper is clearly written and easy to follow

Weaknesses:

  1. The technical depth of this work could be further improved. For example, the current adaptive speculation-length algorithm is quite straightforward; as a main contribution of the proposed method, it merits further investigation, such as whether the past information in each trajectory can help select a better speculation length.

Questions

  1. How will the proposed algorithm impact memory consumption?
  2. What is the performance of the proposed algorithm on a more general domain benchmark like MT-Bench?

Limitations

yes

Final Justification

I have read the authors' feedback and tend to keep my original rating of accepting this paper.

Formatting Issues

NA

Author Response

Thank you for your positive review and detailed feedback, our response is below.

How will the proposed algorithm impact memory consumption?

Suffix Decoding uses CPU memory on the host during serving, which is usually plentiful and under-utilized. To show Suffix Decoding’s memory consumption and other performance stats, we performed an additional micro-benchmark on its suffix tree, shown below:

| Total # entries | Total # tokens | Memory footprint | Avg update time per token (μs) | Avg lookup time per speculated token (μs) |
|---|---|---|---|---|
| 1 K | 27 M | 1670 MB | 3.95 | 12.18 |
| 5 K | 143 M | 2980 MB | 4.06 | 10.93 |
| 10 K | 285 M | 4070 MB | 4.05 | 12.00 |
| 15 K | 429 M | 5110 MB | 4.07 | 11.62 |
| 20 K | 572 M | 6150 MB | 4.04 | 11.64 |

As shown, Suffix Decoding can cache 572M tokens using only 6.15 GB of CPU memory, and its memory usage scales linearly. In comparison, optimized serving systems can generate on the order of ~5000 tokens/s for Llama-8B on an A100 GPU [2]. Typical A100 systems (e.g. AWS p4d.24xlarge) have 144GB of CPU memory per A100 GPU. This means that Suffix Decoding can potentially cache 31 days of generated tokens (see the math below) before requiring cache eviction.

Time to fill up CPU memory:

  1. Memory per token: $\frac{6.15\ \text{GB}}{572\text{M tokens}} = 1.075 \times 10^{-8}$ GB/token
  2. Total token capacity: $\frac{144\ \text{GB}}{1.075 \times 10^{-8}\ \text{GB/token}} = 1.34 \times 10^{10}$ tokens
  3. Time to fill RAM: $\frac{1.34 \times 10^{10}\ \text{tokens}}{5000\ \text{tok/s}} = 2.68 \times 10^{6}\ \text{seconds} \approx 31\ \text{days}$
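The same arithmetic as a quick Python check (illustrative script; values copied from the micro-benchmark above):

```python
# Verify the capacity math: micro-benchmark memory figures plus the
# ~5000 tok/s throughput figure for Llama-8B on an A100.
gb_per_token = 6.15 / 572e6               # ~1.075e-08 GB per cached token
capacity_tokens = 144 / gb_per_token      # ~1.34e10 tokens in 144 GB of CPU RAM
seconds_to_fill = capacity_tokens / 5000  # ~2.68e6 s at ~5000 tok/s
print(seconds_to_fill / 86400)            # ~31 days
```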

General domain (e.g. MT-Bench) performance.

Suffix Decoding's MT-Bench performance is reflected in the SpecBench tables in Appendix A.1.2. These tables contain the 8 MT-Bench subsets (Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities). As SpecBench samples all 8 categories equally, Suffix Decoding's overall MT-Bench performance can be calculated by averaging those columns.
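As an illustration of that averaging, using the suffix row of the SpecBench table below (a sketch, not a result reported in the paper):

```python
# Suffix Decoding's per-category speedups on the 8 MT-Bench subsets,
# taken from the suffix row of the SpecBench table below.
mt_bench = {"writing": 1.327, "roleplay": 1.156, "reasoning": 1.389,
            "math": 1.967, "coding": 1.828, "extraction": 1.630,
            "stem": 1.386, "humanities": 1.383}
print(sum(mt_bench.values()) / len(mt_bench))  # ~1.51x overall on MT-Bench
```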

Additionally, we provide evaluation results that include hybrid Suffix/Spec Decoding (end of Sec 3) using Suffix Decoding with EAGLE-3:

SpecBench - Speedup (x)

| System | coding | extraction | humanities | math | math_reasoning | qa | rag | reasoning | roleplay | stem | summarization | translation | writing | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hybrid (τ=7) | 3.441 | 2.795 | 2.800 | 3.139 | 2.866 | 2.404 | 2.542 | 2.496 | 2.461 | 2.871 | 2.573 | 1.759 | 2.833 | 2.500 |
| eagle3 | 3.115 | 2.634 | 2.713 | 2.910 | 2.888 | 2.172 | 2.474 | 2.446 | 2.394 | 2.764 | 2.520 | 1.447 | 2.622 | 2.367 |
| recycling | 2.619 | 2.234 | 2.209 | 2.660 | 2.577 | 1.951 | 2.057 | 2.150 | 1.979 | 2.334 | 2.218 | 1.932 | 2.057 | 2.169 |
| eagle2 | 2.548 | 1.998 | 1.958 | 2.215 | 2.220 | 1.733 | 1.877 | 1.968 | 1.751 | 2.007 | 1.756 | 1.336 | 1.791 | 1.825 |
| eagle | 2.427 | 1.956 | 1.848 | 2.142 | 2.112 | 1.600 | 1.841 | 1.871 | 1.703 | 1.922 | 1.702 | 1.305 | 1.753 | 1.752 |
| suffix | 1.828 | 1.630 | 1.383 | 1.967 | 1.544 | 1.773 | 1.890 | 1.389 | 1.156 | 1.386 | 1.618 | 1.618 | 1.327 | 1.659 |
| pld | 1.712 | 1.484 | 1.231 | 1.753 | 1.323 | 1.326 | 1.807 | 1.320 | 1.055 | 1.275 | 1.628 | 1.219 | 1.232 | 1.448 |
| vanilla | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

AgenticSQL - Speedup (x)

| System | CATEGORIZATION | FEATURE_EXTRACTION | QUESTION_SUGGESTION | SQL_COMBINE | SQL_FANOUT1 | SQL_FANOUT2 | SQL_FANOUT3 | Overall |
|---|---|---|---|---|---|---|---|---|
| suffix | 3.016 | 9.854 | 10.406 | 4.848 | 3.205 | 2.839 | 3.211 | 5.345 |
| hybrid (τ=1) | 2.309 | 7.025 | 8.368 | 3.382 | 2.024 | 2.421 | 2.072 | 3.947 |
| recycling | 2.588 | 3.472 | 2.672 | 2.640 | 2.502 | 2.604 | 2.492 | 2.710 |
| pld | 1.330 | 3.695 | 1.255 | 3.298 | 1.936 | 1.311 | 1.905 | 2.105 |
| eagle2 | 1.591 | 2.751 | 1.673 | 1.855 | 1.527 | 2.119 | 1.527 | 1.864 |
| eagle3 | 1.328 | 1.720 | 2.025 | 1.619 | 1.108 | 2.496 | 1.056 | 1.623 |
| eagle | 1.324 | 2.246 | 1.598 | 1.669 | 1.229 | 1.955 | 1.138 | 1.595 |
| vanilla | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

As shown, the hybrid Suffix Decoding and EAGLE-3 method performs similarly to plain Suffix Decoding on the more structured/repetitive tasks (e.g. AgenticSQL), while being faster than plain EAGLE-3 on the more general-domain SpecBench. We will include these additional results in the updated version of our paper.

[1] TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput | Liu et al., 2025 | arXiv:2406.14066

[2] Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) | SGLang Team, Jul 25, 2024 | LMSYS Org

Comment

Thank you for acknowledging our response and updating your rating; we hope that our response sufficiently addressed your concerns.

Review (Rating: 5)

This paper introduces SuffixDecoding, a novel, model-free speculative decoding method designed to reduce LLM inference latency for agentic AI applications that produce repetitive and predictable sequences. SuffixDecoding uses efficient suffix trees to cache and rapidly match long token sequences from prompts and past outputs, enabling fast draft token generation without GPU overhead. A key innovation is its adaptive nature; it speculates more tokens when pattern matches are long and confidence is high, and fewer when matches are short, thereby conserving computational resources. Evaluations on agentic benchmarks like SWE-Bench and a Text-to-SQL pipeline show that SuffixDecoding achieves speedups of up to 3.9x, outperforming SOTA model-based (EAGLE-2/3) and model-free (Token Recycling) methods on these agentic tasks.

Strengths and Weaknesses

Strengths:

  1. The paper addresses the critical and timely issue of LLM inference latency, focusing on the important and growing domain of agentic applications.
  2. The method proposed in the paper is simple and straightforward while being practically useful.
  3. Significant speedup on agentic workloads compared to SOTA model-based and model-free methods.
  4. The ablation experiments answer most of my questions about the method.

Weaknesses:

  1. The primary benefit of the proposed method is in repetitive workloads. While highly effective for agentic applications characterized by repetitive token sequences, its advantages may be less pronounced in purely open-ended or less repetitive scenarios when compared to specialized model-based methods designed for those tasks. However, the paper does show it can learn useful patterns even in open-ended workloads and can fall back to the SOTA model-based method.

Questions

It would be great if the authors could also briefly discuss how to trade off the number of draft tokens per request within a batch, especially when all requests might have high confidence in generating long draft sequences, and how the proposed method can be properly integrated into an online serving scenario.

Limitations

yes

Final Justification

The paper addresses an important problem and demonstrates superior performance. The authors further discussed the possibility of integrating the method with common serving scenarios and provided results for open-ended tasks.

Formatting Issues

N/A

Author Response

Thank you for your positive review and detailed feedback, our response is below.

Performance on open-ended or less repetitive token sequences.

As the reviewer mentioned, Suffix Decoding can still learn useful patterns even in open-ended workloads and can fall back to the SOTA model-based method. The hybrid suffix/speculative decoding (end of Sec 3) also helps address this scenario. We provide additional evaluations that show the performance of this hybrid method below:

SpecBench - Speedup (x)

| System | coding | extraction | humanities | math | math_reasoning | qa | rag | reasoning | roleplay | stem | summarization | translation | writing | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hybrid (τ=7) | 3.441 | 2.795 | 2.800 | 3.139 | 2.866 | 2.404 | 2.542 | 2.496 | 2.461 | 2.871 | 2.573 | 1.759 | 2.833 | 2.500 |
| eagle3 | 3.115 | 2.634 | 2.713 | 2.910 | 2.888 | 2.172 | 2.474 | 2.446 | 2.394 | 2.764 | 2.520 | 1.447 | 2.622 | 2.367 |
| recycling | 2.619 | 2.234 | 2.209 | 2.660 | 2.577 | 1.951 | 2.057 | 2.150 | 1.979 | 2.334 | 2.218 | 1.932 | 2.057 | 2.169 |
| eagle2 | 2.548 | 1.998 | 1.958 | 2.215 | 2.220 | 1.733 | 1.877 | 1.968 | 1.751 | 2.007 | 1.756 | 1.336 | 1.791 | 1.825 |
| eagle | 2.427 | 1.956 | 1.848 | 2.142 | 2.112 | 1.600 | 1.841 | 1.871 | 1.703 | 1.922 | 1.702 | 1.305 | 1.753 | 1.752 |
| suffix | 1.828 | 1.630 | 1.383 | 1.967 | 1.544 | 1.773 | 1.890 | 1.389 | 1.156 | 1.386 | 1.618 | 1.618 | 1.327 | 1.659 |
| pld | 1.712 | 1.484 | 1.231 | 1.753 | 1.323 | 1.326 | 1.807 | 1.320 | 1.055 | 1.275 | 1.628 | 1.219 | 1.232 | 1.448 |
| vanilla | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

As shown, hybrid Suffix Decoding + EAGLE-3 performs similarly to plain Suffix Decoding on more structured/repetitive tasks (e.g. AgenticSQL), while slightly improving upon plain EAGLE-3 on the open-ended tasks (SpecBench). We will include these additional results in the updated version of our paper. Given space limitations, we are including here only the wall-clock speedups compared to the baseline. A comprehensive breakdown of results, including per-token latency (TPOT), mean accepted tokens per step (MAT), mean acceptance rate, and speculation time per generated token, will be incorporated into the revised paper.

How to set the number of draft tokens per request within a batch.

We agree that optimizing the speculation per request in a batch is crucial for many practical online deployments. While SuffixDecoding focuses on what tokens to speculate, it is also compatible with existing works that explore how much to speculate per request. For example, TurboSpec [1] uses a "goodput" metric to balance speculation lengths and batch size, noting that optimal speculation decreases as batch size increases. By dynamically adjusting the number of speculative tokens generated by the SuffixTree per request, SuffixDecoding can be integrated with TurboSpec to maximize the batch-wise goodput metric.

Furthermore, the statistics-based scoring used by Suffix Decoding can potentially better inform methods like TurboSpec. For example, choosing to speculate more from sequences that have a higher marginal score. These are interesting and important directions for future work.
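As a hedged sketch of what such an integration could look like (purely illustrative; this is not part of SuffixDecoding or TurboSpec), one could split a fixed per-batch draft-token budget across requests in proportion to their suffix-tree match scores:

```python
# Hypothetical batch-wise draft-budget allocation: requests whose suffix-tree
# matches score higher receive more of the shared speculation budget.
def allocate_draft_budget(scores, total_budget):
    """scores: per-request match scores (higher = more confident drafts)."""
    total = sum(scores) or 1.0
    return [int(total_budget * s / total) for s in scores]

# Example: 32 draft tokens shared by three requests in one batch.
print(allocate_draft_budget([0.9, 0.3, 0.1], 32))  # -> [22, 7, 2]
```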

[1] TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput | Liu et al., 2025 | arXiv:2406.14066

Comment

Thank you for acknowledging our response. Please let us know if there are any further questions, clarifications, or updates to our paper that would impact your assessment and scoring of our paper.

Review (Rating: 5)

The paper introduces SuffixDecoding, a model-free speculative decoding framework tailored for agentic LLM pipelines where generations exhibit high repetitiveness and structured loops. Instead of relying on a smaller draft model, SuffixDecoding constructs a suffix tree over past prompts and completions to identify the longest matching context at each decoding step. It then greedily expands a “speculation tree” of frequent continuations, verifies the selected prefix in a single forward pass of the full LLM, and adjusts its maximum speculation length dynamically based on match depth. Evaluated on two real-world agentic benchmarks (SWE-Bench, AgenticSQL) and in a live vLLM deployment, SuffixDecoding achieves up to 3.9× speedups on offline tasks and 4.5× end-to-end gains, while preserving exact token fidelity.

Strengths and Weaknesses

Quality

  • The authors compare against both model-based speculators (EAGLE-2/3) and other model-free baselines (Token Recycling, Prompt Lookup), demonstrating consistent speed and throughput gains.
  – The paper reports FLOP-based efficiency but omits wall-clock latency and energy measurements on GPU hardware, leaving system-level impact unquantified.

Clarity

  • The algorithmic pipeline is described clearly, with pseudocode guiding the reader through tree construction, speculation, and verification steps. Figures illustrate key data structures and control flow.
  – Details around memory management (suffix-tree storage, updates under concurrency) and fallback threshold tuning (τ) are only sketched in prose, making reproduction nontrivial.

Significance

  • By decoupling speculative decoding from auxiliary draft models, SuffixDecoding opens a new direction for model-free inference acceleration in settings where prompt/response patterns repeat.
  – Its applicability to more open-ended or non-agentic workloads remains unclear: acceptance rates vary and no clear decision rule is given for when to revert to traditional draft-model approaches.

Originality

  • The use of suffix trees for speculative decoding is novel and exploits structural redundancies in agent-driven loops.
  – Prior work has explored cache-based and retrieval-augmented inference; the paper could more sharply delineate how suffix-tree speculation fundamentally differs in algorithmic guarantees or failure modes.

Questions

  1. Latency and profiling
  • Please report real-world latency and energy consumption comparisons between SuffixDecoding and a standard draft-model speculative decoder on GPU. This will verify that FLOP savings translate into actual system-level speedups.
  2. Memory overhead
  • How large do the suffix trees grow in practice, and what are the RAM and update-time costs under concurrent agentic requests? Empirical profiling or a detailed complexity analysis would clarify deployment feasibility.
  3. Fallback threshold
  • The hybrid threshold τ governs when to abandon tree-based speculation. Can you characterize its sensitivity across mixed workloads? A small study varying τ on open-ended versus highly repetitive streams would help practitioners tune the system.
  4. Generalization beyond agentic workloads
  • Have you evaluated SuffixDecoding on less structured or single-turn tasks (e.g., open-ended question answering)? Understanding its failure modes and identifying break-even points would guide broader application.

Note: In particular, providing detailed latency and memory-overhead measurements across diverse tasks, along with a thorough analysis of τ’s sensitivity, will likely eliminate any remaining concerns and could substantially raise the paper’s overall score.

Limitations

yes

Final Justification

Well-structured clarifications of my previous weaknesses and questions.

Formatting Issues

None

Author Response

Thank you for your detailed review and feedback, our response is below.

Latency and profiling (the paper reports FLOP-based efficiency but omits wall-clock latency and energy measurements on GPU hardware).

We believe there is a misunderstanding regarding the metrics reported in our paper. To clarify, all speedup measurements in our paper are based on wall-clock latency, not FLOP counts. The speedup in Fig 4 is the ratio of real-world wall-clock time per output token measured on H100 GPUs. For example, in SWE-Bench, Suffix Decoding achieves 31.093 ms per token vs vanilla's 49.836 ms per token (see Appendix A.1.2), resulting in a speedup of 49.836 / 31.093 = 1.6x.

FLOPs and energy consumption are less relevant for evaluating speculative decoding methods because speculative decoding typically expends more FLOPs (via speculation) to obtain lower wall-clock latency (via parallel verification).

Memory overhead: How large do the suffix trees grow in practice, and what are the RAM and update-time costs under concurrent agentic requests?

The suffix trees employed by Suffix Decoding are highly time and memory efficient. To illustrate this, we present a microbenchmark below showing the memory footprint and per-token update cost across various tree sizes (using requests from the SWE-Bench dataset). We will include this additional microbenchmark in the revised paper.

| Total # entries | Total # tokens | CPU memory | Avg update time per token over last 10 entries (μs) |
|---|---|---|---|
| 1 K | 27 M | 1670 MB | 3.95 |
| 5 K | 143 M | 2980 MB | 4.06 |
| 10 K | 285 M | 4070 MB | 4.05 |
| 15 K | 429 M | 5110 MB | 4.07 |
| 20 K | 572 M | 6150 MB | 4.04 |

Overall, the memory consumption scales linearly with the size of the suffix tree, while the update time remains consistently low. For comparison, optimized inference engines running Llama-8B on an A100 GPU can typically generate ~5000 tokens per second [1], or ~432M tokens per day. Typical A100 systems (e.g. AWS p4d.24xlarge) have 144GB of CPU memory per A100 GPU. This means that Suffix Decoding can potentially cache 31 days of generated tokens per A100 GPU before hitting CPU memory limits. Beyond this limit, it is also straightforward to incorporate cache eviction (e.g. LRU) into Suffix Decoding to avoid out-of-memory problems.
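A minimal sketch of what such LRU eviction could look like, assuming per-request entries are evicted oldest-first once a token budget is exceeded (illustrative only; the paper does not specify an eviction policy):

```python
from collections import OrderedDict

class LRUSuffixCache:
    """Tracks cached request entries and evicts the least-recently-used."""
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.entries = OrderedDict()  # request_id -> token list
        self.total = 0

    def add(self, request_id, tokens):
        self.entries[request_id] = tokens
        self.entries.move_to_end(request_id)   # mark as most recently used
        self.total += len(tokens)
        while self.total > self.max_tokens:
            _, evicted = self.entries.popitem(last=False)  # drop oldest entry
            self.total -= len(evicted)
            # A real implementation would also remove the evicted entry's
            # suffixes from the suffix tree itself.
```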

Generalization beyond Agentic workloads: Have you evaluated SuffixDecoding on less structured or single-turn tasks (e.g., open-ended question answering)?

To clarify, the Spec-Bench dataset used in our evaluation (Sec 4.1, Fig 4) consists of mainly non-agentic, open-ended tasks (e.g. single-turn question-answering). We included this dataset explicitly to stress-test Suffix Decoding in a scenario where we did not expect it to perform well, as evaluating performance under these extreme scenarios is important to understand the limitations and robustness of Suffix Decoding.

The tables in Appendix A.1.2 detail performance results across all 13 Spec-Bench categories. For example, the qa (question answering) column represents open-ended, single-turn questions. Among the 13 categories, qa, rag, math_reasoning, summarization, and translation are single-turn tasks.

Although Suffix Decoding is not specifically designed for these open-ended tasks, it can be combined with other speculative decoding methods in a hybrid mode to achieve the best of both worlds. We have performed additional experiments evaluating this hybrid-mode, which we present below.

SpecBench - Speedup (x)

| System | coding | extraction | humanities | math | math_reasoning | qa | rag | reasoning | roleplay | stem | summarization | translation | writing | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| hybrid (τ=7) | 3.441 | 2.795 | 2.800 | 3.139 | 2.866 | 2.404 | 2.542 | 2.496 | 2.461 | 2.871 | 2.573 | 1.759 | 2.833 | 2.500 |
| eagle3 | 3.115 | 2.634 | 2.713 | 2.910 | 2.888 | 2.172 | 2.474 | 2.446 | 2.394 | 2.764 | 2.520 | 1.447 | 2.622 | 2.367 |
| recycling | 2.619 | 2.234 | 2.209 | 2.660 | 2.577 | 1.951 | 2.057 | 2.150 | 1.979 | 2.334 | 2.218 | 1.932 | 2.057 | 2.169 |
| eagle2 | 2.548 | 1.998 | 1.958 | 2.215 | 2.220 | 1.733 | 1.877 | 1.968 | 1.751 | 2.007 | 1.756 | 1.336 | 1.791 | 1.825 |
| eagle | 2.427 | 1.956 | 1.848 | 2.142 | 2.112 | 1.600 | 1.841 | 1.871 | 1.703 | 1.922 | 1.702 | 1.305 | 1.753 | 1.752 |
| suffix | 1.828 | 1.630 | 1.383 | 1.967 | 1.544 | 1.773 | 1.890 | 1.389 | 1.156 | 1.386 | 1.618 | 1.618 | 1.327 | 1.659 |
| pld | 1.712 | 1.484 | 1.231 | 1.753 | 1.323 | 1.326 | 1.807 | 1.320 | 1.055 | 1.275 | 1.628 | 1.219 | 1.232 | 1.448 |
| vanilla | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

AgenticSQL - Speedup (x)

| System | CATEGORIZATION | FEATURE_EXTRACTION | QUESTION_SUGGESTION | SQL_COMBINE | SQL_FANOUT1 | SQL_FANOUT2 | SQL_FANOUT3 | Overall |
|---|---|---|---|---|---|---|---|---|
| suffix | 3.016 | 9.854 | 10.406 | 4.848 | 3.205 | 2.839 | 3.211 | 5.345 |
| hybrid (τ=1) | 2.309 | 7.025 | 8.368 | 3.382 | 2.024 | 2.421 | 2.072 | 3.947 |
| recycling | 2.588 | 3.472 | 2.672 | 2.640 | 2.502 | 2.604 | 2.492 | 2.710 |
| pld | 1.330 | 3.695 | 1.255 | 3.298 | 1.936 | 1.311 | 1.905 | 2.105 |
| eagle2 | 1.591 | 2.751 | 1.673 | 1.855 | 1.527 | 2.119 | 1.527 | 1.864 |
| eagle3 | 1.328 | 1.720 | 2.025 | 1.619 | 1.108 | 2.496 | 1.056 | 1.623 |
| eagle | 1.324 | 2.246 | 1.598 | 1.669 | 1.229 | 1.955 | 1.138 | 1.595 |
| vanilla | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Here, we used a fallback threshold of τ=7 for SpecBench, and τ=1 for AgenticSQL. For more open-ended tasks such as QA (question answering), the hybrid approach with Suffix Decoding outperforms the speculative decoding method alone. This indicates that while Suffix Decoding can greatly improve agentic applications, it does not have to come at the expense of other open-ended applications.

Note that all baselines are slightly improved compared to the table in our submission due to performance improvements in the benchmark framework.

Given space limitations, we are presenting only the wall-clock speedups compared to the baseline. A comprehensive breakdown of results, including per-token latency (TPOT), mean accepted tokens per step (MAT), mean acceptance rate, and speculation time per generated token, will be incorporated into the revised paper.

Hybrid Fallback Threshold: Can you characterize its sensitivity across mixed workloads?

We report the results of SpecBench and AgenticSQL experiments using the Hybrid Suffix Decoding method with various thresholds. In the table below, we show the measured wall-clock speedups with respect to the vanilla (no speculation) baseline.

SpecBench - Speedup (x)

| System | coding | extraction | humanities | math | math_reasoning | qa | rag | reasoning | roleplay | stem | summarization | translation | writing | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| τ=7 | 3.441 | 2.795 | 2.800 | 3.139 | 2.866 | 2.404 | 2.542 | 2.496 | 2.461 | 2.871 | 2.573 | 1.759 | 2.833 | 2.500 |
| τ=5 | 3.247 | 2.701 | 2.682 | 2.702 | 2.605 | 2.280 | 2.402 | 2.351 | 2.358 | 2.735 | 2.461 | 1.763 | 2.697 | 2.366 |
| τ=6 | 3.186 | 2.595 | 2.614 | 2.875 | 2.567 | 2.216 | 2.360 | 2.296 | 2.304 | 2.675 | 2.380 | 1.719 | 2.610 | 2.314 |
| τ=4 | 3.072 | 2.561 | 2.552 | 2.719 | 2.402 | 2.219 | 2.278 | 2.249 | 2.276 | 2.569 | 2.285 | 1.728 | 2.641 | 2.249 |
| τ=3 | 2.875 | 2.407 | 2.352 | 2.536 | 2.195 | 2.086 | 2.166 | 2.099 | 2.161 | 2.387 | 2.161 | 1.733 | 2.503 | 2.126 |
| τ=2 | 2.734 | 2.256 | 2.218 | 2.401 | 1.980 | 2.041 | 2.061 | 2.011 | 1.990 | 2.194 | 2.026 | 1.791 | 2.362 | 2.028 |
| τ=1 | 2.284 | 1.977 | 1.935 | 2.014 | 1.620 | 1.858 | 1.870 | 1.694 | 1.745 | 1.897 | 1.786 | 1.723 | 2.117 | 1.802 |
| τ=0 (suffix alone) | 1.828 | 1.630 | 1.383 | 1.967 | 1.544 | 1.773 | 1.890 | 1.389 | 1.156 | 1.386 | 1.618 | 1.618 | 1.327 | 1.659 |

AgenticSQL - Speedup (x)

| System | CATEGORIZATION | FEATURE_EXTRACTION | QUESTION_SUGGESTION | SQL_COMBINE | SQL_FANOUT1 | SQL_FANOUT2 | SQL_FANOUT3 | Overall |
|---|---|---|---|---|---|---|---|---|
| τ=0 (suffix alone) | 3.016 | 9.854 | 10.406 | 4.848 | 3.205 | 2.839 | 3.211 | 5.345 |
| τ=1 | 2.309 | 7.025 | 8.368 | 3.382 | 2.024 | 2.421 | 2.072 | 3.947 |
| τ=2 | 2.243 | 7.137 | 7.965 | 3.327 | 1.993 | 2.483 | 1.405 | 3.799 |
| τ=7 | 2.059 | 6.794 | 7.096 | 3.440 | 1.692 | 2.812 | 1.714 | 3.663 |
| τ=8 | 2.080 | 6.848 | 4.959 | 3.358 | 1.691 | 2.854 | 1.259 | 3.298 |
| τ=3 | 2.217 | 5.505 | 5.679 | 3.517 | 1.913 | 2.256 | 1.424 | 3.220 |
| τ=4 | 2.173 | 5.669 | 5.404 | 3.526 | 1.892 | 2.237 | 1.390 | 3.189 |
| τ=5 | 2.023 | 5.211 | 5.139 | 3.575 | 1.868 | 2.376 | 1.326 | 3.078 |
| τ=6 | 2.027 | 5.113 | 5.199 | 3.491 | 1.782 | 2.351 | 1.287 | 3.040 |

For open-ended generation, a good heuristic is to set the fallback threshold (τ) to a value close to or slightly exceeding the mean number of accepted tokens (MAT) for the model-based speculator used with the suffix-tree. For instance, we used EAGLE-3, which has a MAT of approximately 4.65 tokens/step in SpecBench. The table shows that τ values of 5, 6, and 7 yield the best overall speedups, with the results being quite similar across these values.
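Our understanding of the fallback rule, as a minimal sketch (function names are illustrative, not the open-source API): the suffix tree's draft is used only when it is at least τ tokens long, so τ=0 reduces to standalone Suffix Decoding and larger τ falls back to the model-based speculator more often:

```python
def hybrid_draft(context, suffix_tree, model_speculator, tau=7):
    # Use the suffix tree when its draft is long enough to beat the
    # model-based speculator's typical accepted length (MAT); otherwise
    # fall back to the model-based speculator (e.g. EAGLE-3).
    draft = suffix_tree.speculate(context)
    if len(draft) >= tau:
        return draft
    return model_speculator.draft(context)
```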

Conversely, in agentic workloads like AgenticSQL, the standalone SuffixDecoding implementation (τ=0) is optimal because it can confidently speculate much longer sequences than model-based speculators such as EAGLE-3.

[1] Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) | SGLang Team, Jul 25, 2024 | LMSYS Org

Comment

I’m happy to raise my score in light of these clarifications.

Separately, the paper would benefit from citing prior work on independent multi-token heads in its Related Works, for example [1,2,3]:

[1] Better & Faster LLMs via Multi-Token Prediction, Gloeckle et al., ICML 24.

[2] Accelerating Blockwise Parallel Language Models with Draft Refinement, Kim et al., NeurIPS 24.

[3] Blockwise Parallel Decoding for Deep Autoregressive Models, Stern et al., NIPS 18.

Comment

Thank you for reviewing our response and raising your score. We shall cite the additional works provided in the updated version of our paper.

Review (Rating: 5)

The paper introduces SuffixDecoding, a model-free speculative decoding method that saves compute at LLM inference time by reusing tokens predicted on previous prompts/contexts. The novelty lies in the trie/suffix tree, which allows lookup of past sequences (from prompts or completions) in a dynamic/adaptive system. Its use in agent-based systems with structured outputs is highlighted due to the type of repetition present in those settings, which enables this method's differentiated improvement in storing/retrieving past tokens. Like standard speculative decoding, the inference result is exact, with no accuracy loss, and only gains from leveraging the previous cache. The experiments show significant speedups compared to existing methods in agent/structured settings. The method has also been open-sourced and integrated into vLLM.

Strengths and Weaknesses

Strengths

  • Paper is clear and well-written
  • Motivation is clear, and emphasized through the mention of patterns that are particularly useful for the priors of this method: self-consistency, self-refinement, multi-agent workflows
  • Method:
    • Simple: easy to understand, no added complexity that isn’t justified, model-free
    • Practical/general: doesn’t require strict homogeneity of LLMs within the multi-agent setting or training an auxiliary model; demonstrated use outside of multi-agent systems as well (with the caveat of reduced benefit in these settings)
    • Open-sourced integration into vLLM – good community contribution
    • Handles the tradeoff in number of consecutive draft tokens vs likelihood of them being incorrect in an adaptive rather than purely heuristic fashion, adapting specifically to the distribution driven by that agentic system.
    • Mentions practical additions, like maintaining two trees – one that’s global and one that’s smaller, per-request
    • Significant gains, including comparison to and performance against EAGLE
    • Admits to limitations compared to existing methods in certain settings

Weaknesses

  • The scoring/lookup is the key challenge to overcome and could have been highlighted/explained better
  • The title is strange to me. 'Extreme' in the title seems a bit extreme -- I'd expect that word only with orders of magnitude difference. I also didn't understand what 'Emerging' means here. Are you referring to agents/structured outputs?
  • Proprietary benchmarks will make it hard for the community to reproduce/compare
  • Doesn't address scenarios when cpu memory is filled -- many practical scenarios at scale when this might happen.

Questions

  1. How does lookup time, speedup, and scoring approximation (compared to best lookup) scale with the number of records? (within some distributional assumptions)
  2. Can you explain the title?
  3. 'Related Works' section is in a bit of a weird spot. Why not together with ‘Background’? If you were trying to point out tradeoffs or highlight most recent works in particular, I suggest renaming the section.

Limitations

yes

Final Justification

The original review stands, and the decision is only strengthened by the degree to which the authors addressed each of my questions and concerns.

Formatting Issues

No major issues I saw.

Author Response

Thank you for your positive review and detailed feedback, our response is below.

Scalability of lookup time, speedup and scoring.

For lookup time vs number of records, we provide an additional micro-benchmark of the suffix tree below, where we measure its lookup time, update time, and memory consumption for various tree sizes. We will include it in our final paper.

| Total # entries | Total # tokens | Memory footprint | Avg update time per token (μs) | Avg lookup time per speculated token (μs) |
|---|---|---|---|---|
| 1 K | 27 M | 1670 MB | 3.95 | 12.18 |
| 5 K | 143 M | 2980 MB | 4.06 | 10.93 |
| 10 K | 285 M | 4070 MB | 4.05 | 12.00 |
| 15 K | 429 M | 5110 MB | 4.07 | 11.62 |
| 20 K | 572 M | 6150 MB | 4.04 | 11.64 |

In short, the total memory consumption scales linearly with the size of the tree, and both the update and lookup times remain fast even with larger trees.

For speculative speedup and accuracy vs number of records, we provided a brief ablation study in Fig 8. Speedup consistently improves as suffix tree size increases, demonstrating that more historical data leads to better performance. Importantly, acceptance rates remain stable (~10-20%) even as tree size varies dramatically, indicating that our adaptive speculation mechanism (MAX_SPEC = αp) effectively manages the quality-quantity tradeoff.
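For reference, a minimal reading of the adaptive rule MAX_SPEC = αp, assuming p is the matched suffix length (an illustrative sketch, not the exact implementation):

```python
def max_spec_tokens(match_len, alpha=2.0, cap=64):
    # Longer suffix-tree matches imply higher acceptance likelihood, so
    # speculate proportionally more tokens, up to a fixed cap.
    return min(cap, int(alpha * match_len))

print(max_spec_tokens(3))   # short match -> 6 draft tokens
print(max_spec_tokens(40))  # long match  -> 64 (capped)
```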

If there are specific questions beyond what’s provided in our ablation study, we are happy to answer during the discussion period.

Scenarios when CPU memory is filled up.

We agree that the CPU memory can be filled up, but we would like to work through a practical scenario. Our micro-benchmark results above show that Suffix Decoding is highly memory efficient, caching 572M tokens using only 6.15 GB of memory, and scaling linearly. At the same time, optimized serving systems can generate on the order of ~5000 tokens/s for Llama-8B on an A100 GPU [1]. Typical A100 systems (e.g. AWS p4d.24xlarge) have 144GB of CPU memory per A100 GPU. This means that Suffix Decoding can potentially cache 31 days of generated tokens per A100 GPU before hitting CPU memory limits (see the math below). Beyond this limit, it is also straightforward to incorporate cache eviction (e.g. LRU) into Suffix Decoding to avoid out-of-memory problems.

Time to fill up CPU memory:

  1. Memory per token: $\frac{6.15\ \text{GB}}{572\text{M tokens}} = 1.075 \times 10^{-8}$ GB/token
  2. Total token capacity: $\frac{144\ \text{GB}}{1.075 \times 10^{-8}\ \text{GB/token}} = 1.34 \times 10^{10}$ tokens
  3. Time to fill RAM: $\frac{1.34 \times 10^{10}\ \text{tokens}}{5000\ \text{tok/s}} = 2.68 \times 10^{6}\ \text{seconds} \approx 31\ \text{days}$

Can you explain the title?

We used the term "extreme" in reference to the significantly longer speculation sequences possible with suffix trees compared to typical speculative decoding, which is also crucial for exploiting repetition in agentic applications such as SWE-Bench and AgenticSQL. We used the term "emerging" in reference to the growing field of agentic AI applications that involve multiple steps in answering a user query, often with highly repetitive LLM calls. We will add additional explanations and a footnote to our introduction to explain these choices.

'Related Works' section is in a bit of a weird spot.

We thank the reviewer for pointing out the non-standard location of the related works section and we will move it closer to the background section as suggested.

[1] Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) | SGLang Team, Jul 25, 2024 | LMSYS Org

Comment

Thank you for addressing the questions, especially with the quantitative values backing up the practical issue of CPU memory.

Thanks for explaining 'extreme' as well -- I think it's important to back up such a strong word in the title.

Great work; looking forward to seeing it at NeurIPS.

Comment

We hope our response has sufficiently addressed your main concerns. Please let us know if there are any further clarifications, information, or updates to our paper that would impact your assessment and rating of our paper.

Final Decision

(a) Summary: This paper proposes SuffixDecoding, a training-free speculative decoding method based on suffix trees. The method leverages the high repetitiveness and structured loops of agentic LLM pipelines to achieve acceleration. Experiments on SWE-Bench and AgenticSQL show the power of the proposed method.

(b) Strengths:

  1. Reviewers recognize the novelty in this paper.

  2. The paper is well-written.

  3. Experiments on agentic tasks show the advantages of the proposed method.

  4. The training-free method is easier to apply to more models at lower cost.

(c) Weaknesses:

  1. The applicability of SuffixDecoding to more open-ended or non-agentic workloads remains unclear. Acceptance rates vary across datasets. Multiple reviewers have concerns that for non-agentic tasks and non-repetitive workloads, the proposed method might lead to limited speedup.

  2. How suffix-tree speculation fundamentally differs from cache-based and retrieval-augmented inference in algorithmic guarantees or failure modes remains unclear.

  3. Some references are missing.

  4. The reported numbers of baseline methods seem to be worse than those reported in the original papers. It is known that the choice of hyper-parameters matters. It is unclear how this paper tunes the hyper-parameters for the baseline methods (e.g., tree width, tree depth, tree size, etc., when using tree attention).

(d) Why the decision: Overall, the paper is of good quality. All reviewers vote for acceptance. The AC stands with the reviewers and recommends acceptance with a spotlight.

(e) Summary of discussions: In the first round of review, all reviewers were positive about the paper. The rebuttal further strengthened the paper's case. The AC recommends acceptance as well.