PaperHub
Rating: 6.7/10 · Poster · 3 reviewers
Scores: 6, 6, 8 (min 6, max 8, std 0.9)
Confidence: 4.0 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

Efficient Inference for Large Language Model-based Generative Recommendation

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-03-30

Abstract

Keywords
LLM-based Generative Recommendation · Speculative Decoding · Decoding Acceleration

Reviews & Discussion

Official Review
Rating: 6

This paper introduces a speculative decoding approach for generative recommendation. In traditional speculative decoding, a cheap draft LLM is given a prefix, and at each decoding step it suggests $N$ tokens among which the token predicted by the target LLM should be. This enables the target model to verify all the drafted predictions in parallel through a single call, conditioning on the draft model's previous predictions. The sequence of drafted predictions is then accepted up to the decoding step where the draft model and the target model disagree in their predictions.

In top-$K$ generative recommendation, a beam search is performed to generate the $K$ item identifiers. Applying speculative decoding to this problem then requires the $N$ tokens suggested by the draft model to contain the $K$ tokens predicted by the target model in order to accept the current decoding step. This condition is hard to fulfill in practice. To address this, the paper defines a framework named AtSpeed, which trains the draft model to be more aligned with the target model and optionally relaxes the acceptance condition to allow the $N$ drafted tokens to contain tokens that are similar (rather than identical) to the $K$ target tokens. AtSpeed was validated through experiments on the Amazon Beauty and Games datasets.
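A minimal sketch of this strict N-to-K acceptance check (names are illustrative, not from the paper):

```python
# Sketch only: a drafted beam-search step passes strict N-to-K verification
# exactly when every top-K target sequence is among the N drafted sequences.
def strict_topk_accept(drafted: set, target_topk: set) -> bool:
    # drafted: the N sequences proposed by the draft model at this step
    # target_topk: the K sequences the target LLM ranks highest at this step
    return target_topk.issubset(drafted)

print(strict_topk_accept({"a", "b", "c"}, {"a", "b"}))  # True: step accepted
print(strict_topk_accept({"a", "b", "c"}, {"a", "d"}))  # False: step rejected
```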

Strengths

  • The approach proposed for applying speculative decoding to generative recommendation is both novel and intuitive.
  • The experiments are comprehensive and provide convincing evidence of the benefits of AtSpeed from an empirical standpoint.
  • The source code was made available.

Weaknesses

  • The paper suffers from an overall lack of mathematical rigor, both in the proofs and notations, which raises concerns about the theoretical soundness of the approach. In particular, some key formulas, such as the proposed alignment objective, appear to be either incorrect or lack sufficient derivation to validate their correctness (see specific points in the 'Questions' section).
  • The paper is overall lacking polish and contains multiple typos in mathematical formulas, which makes it hard for the reader to follow the details of the approach. Thorough proofreading would be needed to improve the paper's readability. Certain parts of the paper could also benefit from being further clarified, as pointed out in the 'Questions' section.

Questions

  • In Section 3.1, paragraph "Acceptance Rate", the condition given for having $\beta = 1$, namely that $p(\mathbf{y}) \geq p(\mathbf{y}_K)$ for all $\mathbf{y}$ in $\mathcal{Y}_q$, doesn't seem to be fulfillable. By definition of $\mathbf{y}_K$, there exist only $K-1$ sequences with greater probability according to $p$, yet the condition requires that all $N$ ($\geq K$) sequences in $\mathcal{Y}_q$ have a probability greater than that of $\mathbf{y}_K$. The only possibility for this condition to be realized is if $\mathcal{Y}_q$ contains duplicates, which should not happen in the case of beam search.
  • Derivation of Eq 15 in App A.2 (which leads to the definition of the alignment objective in Eq 3) seems incorrect or at least misses important steps to be understandable for the reader. More steps or explanations are needed.
  • Eq 3 introduces $\mathcal{D}'$ in which $\mathcal{Y}$ is the top-$K$ of a mixture of $q$ and $p$. Following Gu et al (2024) is mentioned as the motivation for this, but it would be appreciated to have more intuitions on this choice. Moreover, the $\mathcal{D}'$ used later in the relaxed alignment objective (Eq 8) is different. What is the rationale to have different $\mathcal{D}'$ in the strict and relaxed objectives?
  • What loss is used in practice for $\mathcal{L}_{Rec}$? It is only mentioned to be a recommendation loss but additional details would be appreciated (at least in the appendix).
  • The relaxation sampling verification strategy misses some intuition: why is $p(\mathbf{y}) \geq q(\mathbf{y})$ the right criterion for accepting a sequence $\mathbf{y}$? Why are such $\mathbf{y}$'s good candidates? Moreover, in the case of rejection, $\mathbf{y}$ is drawn from $p' = \text{norm}(\max(0, p(\mathbf{y}) - q(\mathbf{y})))$ but it is unclear whether this is an actual distribution and what the norm operator exactly consists of. Is $p'$ denoting the uniform distribution over the $\mathbf{y}$'s such that $p(\mathbf{y}) \geq q(\mathbf{y})$? If so, how does one sample from this in practice? The definition of this distribution would deserve more clarifications and details.
  • The "with-replacement sampling approximation" paragraph in App A.2 is difficult to follow so it would benefit from being reworked.
  • The proof of Lemma 1 in App A.2, which corresponds to Lemma 3.3 from Leviathan et al (2023), is missing the step with the term $1 - \sum_{\mathbf{y}} \frac{p(\mathbf{y})+q(\mathbf{y})-|p(\mathbf{y})-q(\mathbf{y})|}{2}$.
  • In Eq 7, what does $\sum_K$ denote? It seems like there is an index variable missing there.
  • The strategy relying on tree-based attention to speed up inference would benefit from being described in more detail (at least in the Appendix). Section 3.3 only refers to Figure 2(c) to describe the strategy, but this figure is not self-explanatory.
  • The WS metric represents the walltime speedup, but with respect to which baseline? I assume this is in comparison to directly running the target model without speculative decoding, but it would be helpful for the reader to mention this when defining the metric.
  • Table 1 reports the results for AtSpeed-S and AtSpeed-R on both the strict and the relaxed settings. How can AtSpeed-S be applied to the relaxed setting and AtSpeed-R to the strict setting?
  • In Figure 5 of the Appendix, it seems that a larger value for $\alpha$ is always beneficial for AtSpeed-R, whereas Figure 3 (c) showed that $\alpha$ should neither be too small nor too large for AtSpeed-S. Are there any intuitions on these different behaviors between AtSpeed-R and AtSpeed-S?
Comment

Dear Reviewer Up12,

Thanks for your comments. Your review is very detailed and thorough, and we greatly appreciate it! We have provided detailed clarifications, explanations, intuitions behind the method design, and step-by-step derivations to address each of your concerns. We have also updated our manuscript accordingly (marked in blue). If there is any misunderstanding, please feel free to leave further comments. We eagerly anticipate our discussion with you!

Q1: In Section 3.1, paragraph "Acceptance Rate", the condition given for having $\beta = 1$, namely that $p(\mathbf{y}) \geq p(\mathbf{y}_K)$ for all $\mathbf{y}$ in $\mathcal{Y}_q$, doesn't seem to be fulfillable. By definition of $\mathbf{y}_K$, there exist only $K-1$ sequences with greater probability according to $p$, yet the condition requires that all $N$ ($\geq K$) sequences in $\mathcal{Y}_q$ have a probability greater than that of $\mathbf{y}_K$. The only possibility for this condition to be realized is if $\mathcal{Y}_q$ contains duplicates, which should not happen in the case of beam search.

Reply: Thank you for pointing out this problem. The condition for the acceptance rate in Section 3.1 would indeed be unfulfillable if $N>K$. The only possibility to achieve this condition is when $N=K$ and the top-$K$ ($N=K$) token sequences from the draft model are also the top-$K$ token sequences from the target model. To correct this, we revise the condition as:

$\beta=1$ if $\exists\, \mathcal{Y}'_{q}\in\mathcal{Y}_{q}$ such that $p(\mathbf{y})\ge p(\mathbf{y}_K)$ for all $\mathbf{y}\in\mathcal{Y}'_{q}$, where $|\mathcal{Y}'_{q}| = K$, $K$ is the target LLM beam size, and $\mathbf{y}_K$ is the sequence that has the $K$-th highest probability under $p$.

Based on this condition, the alignment objective in Eq.(2) is unaffected: it encourages the top-$N$ sequences generated by the draft model to have high probabilities under the target LLM distribution (a high $\frac{p(\mathbf{y})}{p(\mathbf{y}_K)}$). Precisely, if a drafted sequence fails to be in the top-$K$ sequences according to $p$ (i.e., $p(\mathbf{y})<p(\mathbf{y}_K)$), we still encourage it to have a probability close to that of $\mathbf{y}_K$ (i.e., a high $\frac{p(\mathbf{y})}{p(\mathbf{y}_K)}$), aiming to push the top-$N$ distribution of the draft model to cover the top-$K$ distribution of the target LLM.

Q2: Eq 3 introduces D\mathcal{D'} in which Y\mathcal{Y} is the top-KK of a mixture of qq and pp. Following Gu et al (2024) is mentioned as the motivation for this, but it would be appreciated to have more intuitions on this choice.

Reply: Thanks for your valuable suggestions. For the choice of the mixture of $q$ and $p$ in Eq.(3), we have two main considerations:

  • 1) Higher training efficiency. The original $\mathcal{D}$ should include the sequences sampled from the draft model (i.e., $\mathbf{y}\in\mathcal{Y}_q$ in Eq.(2)). However, training over draft model-generated sequences requires an online learning process, which leads to high computational and time costs, because we would need to continuously sample sequences from the draft model at every epoch or even every batch during alignment training. Therefore, to alleviate the reliance on frequent sampling from the draft model $q$, we use a mixture of the draft model $q$ and the target LLM $p$. In practice, the mixture of $p$ and $q$ is achieved by alternating between sequences sampled from $p$ and from $q$, where the alternation is controlled by the mixture coefficient $\lambda$ (see the sketch after this list). Since the target LLM-generated sequences can be pre-stored, we improve training efficiency by significantly reducing the number of sequences sampled from the draft model.

  • 2) Mitigation of the low-quality training data issue. During alignment training, the draft model might generate low-quality sequences (e.g., repeated phrases). However, such low-quality sequences will be rejected by the target LLM during inference since they are invalid identifiers. As such, pushing $q(\mathbf{y})$ closer to $p(\mathbf{y})$ over low-quality sequences would lead to unnecessary and suboptimal alignment. Therefore, we utilize target LLM-generated data to ensure high-quality training data for alignment.
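A minimal sketch of one way such a mixed dataset $\mathcal{D}'$ could be assembled (all helper names are hypothetical, and the alternation between $p$- and $q$-sampled sessions is reduced here to a per-prompt coin flip with probability $\lambda$):

```python
import random

def build_mixed_dataset(prompts, target_beams, sample_draft_beams, lam=0.5):
    # prompts: user histories x
    # target_beams: dict x -> pre-stored top-K sequences from the target LLM p
    # sample_draft_beams: callable x -> top-K sequences from the current draft model q
    mixed = []
    for x in prompts:
        # with probability lam reuse cheap pre-stored target beams,
        # otherwise pay for fresh draft-model sampling
        ys = target_beams[x] if random.random() < lam else sample_draft_beams(x)
        mixed.append((x, ys))
    return mixed
```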

Comment

Q3: Moreover, the $\mathcal{D}'$ used later in the relaxed alignment objective (Eq 8) is different. What is the rationale to have different $\mathcal{D}'$ in the strict and relaxed objectives?

Reply: Thanks for your insightful question. The different choices of $\mathcal{D}'$ essentially result from the different expressions of the acceptance rate and the different alignment objectives. Specifically,

  • For the strict verification, since we hope every drafted sequence can achieve a high probability under the target LLM distribution, the alignment objective is to maximize $\frac{p(\mathbf{y})}{p(\mathbf{y}_K)}$ over the output distribution of the draft model. Based on the considerations of higher training efficiency and mitigation of the low-quality training data issue mentioned in the previous response, we adopt the mixed sampling for AtSpeed-S.

  • For the relaxed verification, we are inspired by Eq.(1) in [1] to maximize the sequence acceptance rate over the output distribution of the target LLM. This is because, with each approximated with-replacement sequence sampling step (Eq.(6)), the expectation over the drafted sequence is essentially an expectation over the model vocabulary given the prefix. While the vocabulary is the same for both the draft model and the target LLM, the prefix of $\mathbf{y}$ in Eq.(6) has been accepted by the target LLM at inference time and thus follows the with-replacement sampling distribution of the target LLM. Therefore, the $\mathbf{y}$ in Eq.(7) of our paper is considered to be sampled from the target LLM distribution for AtSpeed-R.

[1] Yongchao Zhou, et al. DistillSpec: Improving Speculative Decoding via Knowledge Distillation. ICLR 2024.

Q4: What loss is used in practice for $\mathcal{L}_\text{Rec}$? It is only mentioned to be a recommendation loss but additional details would be appreciated (at least in the appendix).

Reply: Thanks for your valuable suggestions. The recommendation loss used in our work is defined as

$$\mathcal{L}_\text{Rec} = - \frac{1}{N} \sum_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}} \sum_{t=1}^{|\mathbf{y}|} \log \mathcal{M}_p(y_t|\mathbf{y}_{<t}, \mathbf{x}),$$

where $\mathbf{x}$ is the user's historical interactions, $\mathbf{y}$ is the identifier of the user's next interacted item, and $\mathcal{D}=\{(\mathbf{x}, \mathbf{y})\}$ denotes the original recommendation dataset. We have also added this to the Appendix of our manuscript.
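A minimal sketch of this loss as token-level negative log-likelihood (tensor names are assumptions, and PyTorch's default token averaging stands in for the paper's exact $\frac{1}{N}$ normalization):

```python
import torch
import torch.nn.functional as F

def rec_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = -100) -> torch.Tensor:
    # logits: (batch, seq_len, vocab) scores from the target model M_p given x and y_{<t}
    # target_ids: (batch, seq_len) token ids of the item identifier y; padding = pad_id
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,  # skip padded positions when averaging over t = 1..|y|
    )
```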

Q5: The relaxation sampling verification strategy misses some intuition: why is $p(\mathbf{y}) \geq q(\mathbf{y})$ the right criterion for accepting a sequence $\mathbf{y}$? Why are such $\mathbf{y}$'s good candidates? Moreover, in the case of rejection, $\mathbf{y}$ is drawn from $p' = \text{norm}(\max(0, p(\mathbf{y}) - q(\mathbf{y})))$ but it is unclear whether this is an actual distribution and what the norm operator exactly consists of. Is $p'$ denoting the uniform distribution over the $\mathbf{y}$'s such that $p(\mathbf{y}) \geq q(\mathbf{y})$? If so, how does one sample from this in practice? The definition of this distribution would deserve more clarifications and details.

Reply: Thanks for your valuable questions and we appreciate your careful review.

  • The intuition behind accepting a sequence $\mathbf{y}$ if $p(\mathbf{y}) \ge q(\mathbf{y})$ is that, given a candidate sequence $\mathbf{y}$ with a high probability according to the draft model, if the target LLM is even more confident about generating the sequence (i.e., $p(\mathbf{y}) \ge q(\mathbf{y})$), the candidate sequence is likely to be generated by the LLMs and should be accepted. We have added this intuition to our updated manuscript.

  • In the case of rejection, the normalized probability distribution is adjusted as follows:

$$p' = \text{norm}(\max(0, p(\mathbf{y}) - q(\mathbf{y}))) = \frac{\max(0, p(\mathbf{y})-q(\mathbf{y}))}{\sum_{\mathbf{y}}\max(0, p(\mathbf{y})-q(\mathbf{y}))}.$$

We then sample a sequence from this normalized distribution to replace the rejected one. The clarification of the normalized distribution has been added to our updated manuscript.
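A minimal numeric sketch of this residual distribution and the resampling step (toy probabilities over three candidate sequences):

```python
import numpy as np

def residual_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    # zero out candidates the draft model already over-weights, then renormalize
    resid = np.maximum(0.0, p - q)
    return resid / resid.sum()

p = np.array([0.5, 0.3, 0.2])  # target LLM probabilities
q = np.array([0.6, 0.1, 0.3])  # draft model probabilities
p_prime = residual_distribution(p, q)  # -> [0., 1., 0.]: all mass on the 2nd candidate
replacement = np.random.default_rng(0).choice(len(p), p=p_prime)  # index replacing the rejected y
```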

Q6: The proof of Lemma 1 in App A.2, which corresponds to Lemma 3.3 from Leviathan et al (2023), is missing the step with the term $1 - \sum_{\mathbf{y}} \frac{p(\mathbf{y})+q(\mathbf{y})-|p(\mathbf{y})-q(\mathbf{y})|}{2}$.

Reply: Thank you for pointing out this typo. Your careful review is greatly appreciated. We have revised it in our updated manuscript.

Comment

Step 3. Relate $K_i$ to $N$ and $p_i$.

We then relate $K_i$ to $N$ and $p_i$. Since $K_i=Np_i$, we have $\ln \frac{K_i}{k_i}=\ln \frac{Np_i}{k_i}$. We also have $n=\sum_{i=1}^{r}k_i$. Then, we can express the logarithm of the hypergeometric distribution in terms of $p_i$ and $k_i$ as

$$\begin{aligned}
\ln P_\text{hyper} & \approx \sum_{i=1}^{r} \Big[k_i\ln \frac{K_i}{k_i}+ \frac{k_i^2}{K_i}\Big] - \Big[n\ln \frac{N}{n} + \frac{n^2}{N}\Big] \\
& = \sum_{i=1}^{r} \Big[k_i\ln \frac{Np_i}{k_i} + \frac{k_i^2}{Np_i}\Big] - \Big[n \ln \frac{N}{n}+\frac{n^2}{N}\Big] \\
& = \sum_{i=1}^{r} \Big[k_i (\ln N + \ln p_i - \ln k_i)+ \frac{k_i^2}{Np_i}\Big] - \Big[n\ln \frac{N}{n} + \frac{n^2}{N}\Big] \\
& = \sum_{i=1}^{r} k_i\ln N + \sum_{i=1}^{r}\Big[k_i(\ln p_i - \ln k_i)+ \frac{k_i^2}{Np_i}\Big] - n\ln \frac{N}{n} - \frac{n^2}{N} \\
& = n\ln N - n\ln N + n\ln n - \frac{n^2}{N} + \sum_{i=1}^{r} \Big[k_i(\ln p_i - \ln k_i) + \frac{k_i^2}{Np_i}\Big] \quad \Big(\text{using } \sum_{i=1}^{r}k_i=n\Big) \\
& = \sum_{i=1}^{r} \Big[k_i(\ln p_i - \ln k_i)+ \frac{k_i^2}{Np_i}\Big] + n\ln n - \frac{n^2}{N} \\
& = n\ln n - \sum_{i=1}^{r}k_i \ln k_i + \sum_{i=1}^{r} k_i \ln p_i - \frac{n^2}{N} + \sum_{i=1}^{r}\frac{k_i^2}{Np_i}.
\end{aligned}$$

This completes the approximation of the logarithm of the multivariate hypergeometric distribution.

Step 4. Compare with the multinomial distribution.

Similarly, we approximate the multinomial distribution with Stirling's approximation. We expand the logarithm of the multinomial distribution as

$$\begin{aligned}
\ln P_\text{multi}(k_1, k_2, \dots, k_r) & = \ln \frac{n!}{\prod_{i=1}^{r}k_i!}\prod_{i=1}^{r}p_i^{k_i} \\
& = \ln n! - \sum_{i=1}^{r} \ln k_i! + \sum_{i=1}^{r}k_i \ln p_i.
\end{aligned}$$

Using Stirling's approximation, we have $\ln n! \approx n\ln n - n$ and $\ln k_i! \approx k_i \ln k_i - k_i$. We then substitute $\ln n!$ and $\ln k_i!$ with these approximations and obtain

$$\begin{aligned}
\ln P_\text{multi} &\approx n\ln n - n -\sum_{i=1}^{r} (k_i \ln k_i - k_i) + \sum_{i=1}^{r} k_i \ln p_i \\
& = n\ln n - n - \sum_{i=1}^{r}k_i\ln k_i + \sum_{i=1}^{r}k_i + \sum_{i=1}^{r}k_i \ln p_i.
\end{aligned}$$

Since $\sum_{i=1}^{r}k_i = n$, we have

$$\begin{aligned}
\ln P_\text{multi} &\approx n\ln n - n - \sum_{i=1}^{r}k_i\ln k_i + n + \sum_{i=1}^{r}k_i \ln p_i \\
& = n\ln n - \sum_{i=1}^{r}k_i \ln k_i + \sum_{i=1}^{r}k_i\ln p_i.
\end{aligned}$$

Now, comparing the approximated logarithm of the multinomial distribution with that of the multivariate hypergeometric distribution, we have

$$\ln P_\text{multi} \approx \ln P_\text{hyper} + \frac{n^2}{N} - \sum_{i=1}^{r}\frac{k_i^2}{Np_i}.$$

Note that the term $\frac{n^2}{N} - \sum_{i=1}^{r}\frac{k_i^2}{Np_i}$ is negligible when $N$ is large and $k_i$ and $n$ are small compared to $N$. Therefore, we show that when the population size $N$ is large (e.g., all possible sequences for sampling) and $k_i, n$ are small (e.g., $k_i=1$ or $0$ since each sequence represents a category, and $n$ is usually less than 20 in LLM-based recommendation), the multivariate hypergeometric distribution can be approximated by the multinomial distribution. That is, sampling without replacement is approximately equivalent to sampling with replacement.
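A quick numeric sanity check of this result (toy sizes of our choosing, using SciPy's implementations of the two distributions):

```python
import numpy as np
from scipy.stats import multivariate_hypergeom, multinomial

K = np.array([50_000, 30_000, 20_000])  # K_i: items per category in the population
N = K.sum()                             # population size N = 100000
n = 5                                   # sample size n << N
p = K / N                               # p_i = K_i / N
k = np.array([2, 2, 1])                 # one draw with sum(k_i) = n

print(multivariate_hypergeom.pmf(k, m=K, n=n))  # sampling without replacement
print(multinomial.pmf(k, n=n, p=p))             # sampling with replacement; nearly equal
```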

Comment

Thank you very much for taking the time to provide such a detailed answer, I greatly appreciate the amount of work you have put into answering my concerns.

In particular, I have carefully checked the derivation of Eq 15 that you have detailed in your response and in the updated version of the paper -- which was my primary concern. While some steps remain slightly unclear to me (especially the expectation/sum swapping in step 3 and the length normalization in step 4), it looks overall reasonable and convincing to me. Given the numerous, non-straightforward steps of this derivation, I do think that it was indeed crucial to include it in the paper.

Some additional minor comments I have:

  • Line 202 in the updated paper, I think $\mathcal{Y}'_p \in \mathcal{Y}_p$ should be replaced with $\mathcal{Y}'_p \subset \mathcal{Y}_p$.
  • The answer to Q7 on $\sum_K$ did not help clarify my understanding. The explanation that "Eq. (7) minimizes TVD with the strength of K for the same sequence." from the paper is hard to understand and should be reformulated. Additionally, using the sum notation in this context is not standard, as a sum should be over an index and, unless I am mistaken, $K$ here does not seem to play the role of an index. If it is, please specify the range of this index in the sum.
  • The notation of $q(\mathbf{y})$ for $\prod_t q(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ seems slightly confusing to me, shouldn't this be noted as $q(\mathbf{y} \mid \mathbf{x})$ instead?

Nonetheless, I am overall happy with the authors' comprehensive response and I will update my score to 6.

Comment

Dear Reviewer Up12,

Thanks for your positive reply. Your hard work is greatly appreciated! Your valuable suggestions really help us to improve our paper.

For the additional comments:

Comment 1: Line 202 in the updated paper, I think $\mathcal{Y}'_p \in \mathcal{Y}_p$ should be replaced with $\mathcal{Y}'_p \subset \mathcal{Y}_p$.

Reply: Thank you for your comments. We agree with you and we will revise our manuscript accordingly.

Comment 2: The answer to Q7 on $\sum_K$ did not help clarify my understanding. The explanation that "Eq. (7) minimizes TVD with the strength of K for the same sequence." from the paper is hard to understand and should be reformulated. Additionally, using the sum notation in this context is not standard, as a sum should be over an index and, unless I am mistaken, $K$ here does not seem to play the role of an index. If it is, please specify the range of this index in the sum.

Reply: Thank you for pointing out the notation usage issue. Indeed, we agree with you that the sum notation should be over an index. To facilitate standard notation usage and better understanding, we will revise the sum notation into a multiplier $K$, which represents "the strength of $K$ for the same sequence".

Comment 3: The notation of $q(\mathbf{y})$ for $\prod_t q(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ seems slightly confusing to me, shouldn't this be noted as $q(\mathbf{y} \mid \mathbf{x})$ instead?

Reply: Thanks for your valuable comments. Your understanding is correct. The notation $q(\mathbf{y})$ should be $q(\mathbf{y} \mid \mathbf{x})$. We omitted the condition $\mathbf{x}$ in $q(\mathbf{y}\mid\mathbf{x})$ in our previous version, which could lead to confusion. We will keep the conditioning in $q(\mathbf{y}\mid\mathbf{x})$ and $\prod_t q(y_t \mid \mathbf{x}, \mathbf{y}_{<t})$ consistent and revise our manuscript accordingly.

Thanks again for taking the time to give us such a detailed and insightful review.

Comment

Thank you for your fast reply and additional clarifications! I again appreciate your engagement in this discussion.

Comment

Theorem 1. When the population size $N$ is large and the sample size $n$ is small compared to $N$ (i.e., $n\ll N$), the multivariate hypergeometric distribution approximates the multinomial distribution:

$$P_\text{hyper}(k_1, k_2, \dots, k_r) \approx P_\text{multi}(k_1, k_2, \dots, k_r).$$

Proof.

Step 1. Apply Stirling's approximation to the logarithm of the multivariate hypergeometric distribution.

We can expand the logarithm of the hypergeometric probability in terms of factorials as follows:

$$\begin{aligned}
\ln P_\text{hyper}(k_1,k_2,\dots,k_r) &= \ln \frac{\prod_{i=1}^{r}\frac{K_i!}{k_i!(K_i-k_i)!}}{\frac{N!}{n!(N-n)!}} \\
&= \ln \prod_{i=1}^{r}\frac{K_i!}{k_i!(K_i-k_i)!} - \ln \frac{N!}{n!(N-n)!} \\
&= \sum_{i=1}^{r}\ln \frac{K_i!}{k_i!(K_i-k_i)!} - \ln \frac{N!}{n!(N-n)!} \\
&= \sum_{i=1}^{r} [\ln K_i! - \ln k_i! - \ln (K_i-k_i)!] - [\ln N! - \ln n! - \ln (N-n)!].
\end{aligned}$$

Using Stirling's approximation, we have:

$$\left\{ \begin{aligned}
\ln K_i! &\approx K_i \ln K_i - K_i + \tfrac{1}{2} \ln 2\pi K_i,\\
\ln k_i! &\approx k_i \ln k_i - k_i + \tfrac{1}{2} \ln 2\pi k_i,\\
\ln (K_i-k_i)! &\approx (K_i-k_i) \ln (K_i-k_i) - (K_i-k_i) + \tfrac{1}{2} \ln 2\pi (K_i-k_i),\\
\ln N! &\approx N\ln N - N + \tfrac{1}{2}\ln 2\pi N,\\
\ln n! &\approx n\ln n - n + \tfrac{1}{2}\ln 2\pi n,\\
\ln (N-n)! &\approx (N-n)\ln (N-n) - (N-n) + \tfrac{1}{2}\ln 2\pi (N-n).
\end{aligned} \right.$$

Then, we substitute the logarithms of the factorials with their approximations:

$$\begin{aligned}
&\ln P_\text{hyper}(k_1,k_2,\dots,k_r) \\
=& \sum_{i=1}^{r} [\ln K_i! - \ln k_i! - \ln (K_i-k_i)!] - [\ln N! - \ln n! - \ln (N-n)!] \\
=& \sum_{i=1}^{r} \Big[K_i \ln K_i - K_i + \tfrac{1}{2}\ln 2\pi K_i - \big(k_i \ln k_i - k_i + \tfrac{1}{2}\ln 2\pi k_i\big) \\
&\quad\quad - \big((K_i-k_i)\ln (K_i-k_i) - (K_i-k_i) + \tfrac{1}{2}\ln 2\pi (K_i-k_i)\big)\Big] \\
&\quad\quad - \Big[N\ln N - N + \tfrac{1}{2}\ln 2\pi N - \big(n\ln n - n + \tfrac{1}{2}\ln 2\pi n\big) \\
&\quad\quad - \big((N-n)\ln (N-n) - (N-n) + \tfrac{1}{2}\ln 2\pi (N-n)\big)\Big] \\
=& \sum_{i=1}^{r} \Big[K_i \ln K_i - k_i \ln k_i - (K_i-k_i)\ln (K_i-k_i) + \tfrac{1}{2}\ln 2\pi K_i - \tfrac{1}{2}\ln 2\pi k_i - \tfrac{1}{2}\ln 2\pi (K_i-k_i)\Big] \\
&\quad\quad - \Big[N\ln N - n\ln n - (N-n)\ln (N-n) + \tfrac{1}{2}\ln 2\pi N - \tfrac{1}{2}\ln 2\pi n - \tfrac{1}{2}\ln 2\pi (N-n)\Big].
\end{aligned}$$

Since $N$ is a very large number and $n \ll N$, $k_i \ll K_i$, we have $\frac{1}{2}\ln 2\pi K_i - \frac{1}{2}\ln 2\pi k_i - \frac{1}{2}\ln 2\pi (K_i-k_i) \approx 0$ and $\frac{1}{2}\ln 2\pi N - \frac{1}{2}\ln 2\pi n - \frac{1}{2}\ln 2\pi (N-n) \approx 0$. Then, the logarithm of the multivariate hypergeometric distribution is approximated as:

$$\ln P_\text{hyper} \approx \sum_{i=1}^{r} [K_i \ln K_i - k_i \ln k_i - (K_i-k_i)\ln (K_i-k_i)] - [N\ln N - n\ln n - (N-n)\ln (N-n)].$$

Step 2. Approximate $\ln (K_i-k_i)$ and $\ln (N-n)$ using Taylor expansion.

Since $k_i$ is small compared to $K_i$, we can expand $\ln (K_i-k_i)$ using a Taylor expansion:

$$\ln (K_i-k_i) = \ln K_i - \frac{k_i}{K_i} - \frac{1}{2}\Big(\frac{k_i}{K_i}\Big)^{2} + \dots,$$

where we can neglect the higher-order terms and obtain

$$\ln (K_i-k_i) \approx \ln K_i - \frac{k_i}{K_i}.$$

Similarly, for $\ln (N-n)$, we have $\ln (N-n) \approx \ln N - \frac{n}{N}$.

We can then substitute $\ln (K_i-k_i)$ and $\ln (N-n)$ into the logarithm of the multivariate hypergeometric distribution and obtain

$$\begin{aligned}
\ln P_\text{hyper} &\approx \sum_{i=1}^{r} [K_i \ln K_i - k_i \ln k_i - (K_i-k_i)\ln (K_i-k_i)] - [N\ln N - n\ln n - (N-n)\ln (N-n)] \\
&= \sum_{i=1}^{r} \Big[K_i\ln K_i - k_i \ln k_i - (K_i-k_i)\Big(\ln K_i + \frac{k_i}{K_i}\Big)\Big] - \Big[N\ln N - n\ln n - (N-n)\Big(\ln N + \frac{n}{N}\Big)\Big] \\
&= \sum_{i=1}^{r} \Big[K_i\ln K_i - k_i\ln k_i - \Big(K_i\ln K_i - k_i\ln K_i + k_i - \frac{k_i^2}{K_i}\Big)\Big] - \Big[N\ln N - n\ln n - \Big(N\ln N - n\ln N + n - \frac{n^2}{N}\Big)\Big] \\
&= \sum_{i=1}^{r} \Big[k_i \ln \frac{K_i}{k_i} + \frac{k_i^2}{K_i} - k_i\Big] - \Big[n\ln \frac{N}{n} + \frac{n^2}{N} - n\Big] \\
&= \sum_{i=1}^{r} \Big[k_i \ln \frac{K_i}{k_i} + \frac{k_i^{2}}{K_i}\Big] - \Big[n \ln \frac{N}{n} + \frac{n^2}{N}\Big]. \quad\quad \Big(\text{since } -\sum_{i=1}^{r}k_i + n = 0\Big)
\end{aligned}$$
Comment

Q13: The "with-replacement sampling approximation" paragraph in App A.2 is difficult to follow so it would benefit from being reworked.

Reply: Thanks for your valuable comments. The with-replacement sampling approximation is mainly based on Stirling's approximation, and we provide a detailed step-by-step proof to facilitate understanding. The detailed derivation has also been added to the Appendix of our updated manuscript.

We aim to prove that the distribution of sampling without replacement is approximately equivalent to that of sampling with replacement. Our proof is mainly based on Stirling's approximation. In the following, we first clarify notations and introduce the multivariate hypergeometric distribution and the multinomial distribution, which model sampling without replacement and sampling with replacement, respectively. We then present Stirling's approximation and give the step-by-step proof.

Notations. To model the sampling, we have the total population size $N$, the sample size $n$, the number of categories $r$, the number of items of category $i$ in the population $K_i$, and the number of items of category $i$ in the sample $k_i$. In the case of sequence sampling in LLM decoding, the population includes every possible sequence, each possible sequence is a unique category, the sample size $n$ is the beam size, and the population size is the total number of possible sequences at each beam search step. We assume the population size grows large with the category proportions fixed at $p_i=\frac{K_i}{N}$, and we have $\sum_{i=1}^{r}k_i=n$ and $\sum_{i=1}^{r} K_i = N$.

Multivariate hypergeometric distribution. Formally, when sampling without replacement, the probability of drawing $k_1, k_2, \dots, k_r$ items from the respective categories is given by the multivariate hypergeometric distribution

$$P_\text{hyper}(k_1,k_2,\dots,k_r) = \frac{\prod_{i=1}^{r}\binom{K_i}{k_i}}{\binom{N}{n}} = \frac{\prod_{i=1}^{r}\frac{K_i!}{k_i!(K_i-k_i)!}}{\frac{N!}{n!(N-n)!}}.$$

Multinomial distribution. Formally, when sampling with replacement, the probability follows the multinomial distribution as

$$P_\text{multi}(k_1, k_2, \dots, k_r) = \frac{n!}{\prod_{i=1}^{r}k_i!}\prod_{i=1}^{r}p_i^{k_i}.$$

Stirling's approximation. Stirling's approximation gives the following approximation of the logarithm of a factorial:

$$\ln n! \approx n\ln n - n + \frac{1}{2}\ln(2\pi n).$$
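A quick numeric check of this approximation (the relative gap shrinks as $n$ grows):

```python
import math

for n in (10, 100, 1000):
    exact = math.lgamma(n + 1)  # ln(n!) computed exactly via the log-gamma function
    approx = n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)
    print(n, exact, approx)
```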
Comment

Step 3. Move to an expectation over $\mathbf{y}_{<t}$.

We aim to align the draft model with the target LLM at every beam search step. Notably, at each beam search step $T$, the sequence lengths are fixed and independent of $t$. Therefore, we can rewrite the objective as:

$$\begin{aligned}
& - \sum_t \mathbb{E}_{\mathbf{y}_{<t}\sim q} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big] \\
=& - \mathbb{E}_{\mathbf{y}_{\le T}\sim q} \sum_{t=1}^{T} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big].
\end{aligned}$$

Step 4. Normalize by length and rewrite the objective.

Since we aim to align at every step $T\in\{1,\dots,L\}$, where $L$ is the length of the item identifier in LLM-based recommendation, we further normalize the expression by sequence length to prevent different scales of the alignment loss across steps. As such, for every step $T\in\{1,\dots,L\}$, the objective can be rewritten as

$$- \mathbb{E}_{\mathbf{y}_{\le T}\sim q} \frac{1}{|\mathbf{y}_{\le T}|} \sum_{t=1}^{|\mathbf{y}_{\le T}|} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big].$$

However, the expectation over the sequence space (i.e., $\mathbb{E}_{\mathbf{y}_{\le T}\sim q}$) is intractable, so we follow previous work (Wen et al., 2023; Kim & Rush, 2016) and approximate it by sampling the top-$K$ sequences generated by the draft model $\mathcal{M}_q$. Now we can rewrite our alignment objective for strict top-$K$ verification and obtain Eq.(3) in our paper:

$$\begin{aligned}
& \mathop{\arg\max}_{\theta\in\Theta} - \mathbb{E}_{(\mathbf{x},\mathcal{Y})\sim \mathcal{D}'} \sum_{\mathbf{y}\in \mathcal{Y}} \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big] \\
=& \mathop{\arg\min}_{\theta\in\Theta} \mathbb{E}_{(\mathbf{x},\mathcal{Y})\sim \mathcal{D}'} \sum_{\mathbf{y}\in \mathcal{Y}} \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big],
\end{aligned}$$

where $\mathcal{Y}$ denotes the top-$K$ beam sequences generated from the mixture distribution of the draft model and the target LLM (as stated in the "Alignment Objective" paragraph in Section 3.1, lines 216-218).
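A minimal sketch of the per-sequence term inside the brackets (tensor shapes and names are our assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def align_s_loss(q_logits, p_logits, p_anchor_logprob):
    # q_logits, p_logits: (|y|, vocab) draft/target scores for one sequence y given c_{<t}
    # p_anchor_logprob: (|y|,) log p(y_{K,t} | y_{K,<t}) of the K-th target beam
    log_q = F.log_softmax(q_logits, dim=-1)
    log_p = F.log_softmax(p_logits, dim=-1)
    q = log_q.exp()
    rkl = (q * (log_q - log_p)).sum(-1)                         # sum_{y_t} q log(q/p)
    anchor = (q * (log_q - p_anchor_logprob[:, None])).sum(-1)  # sum_{y_t} q log(q/p(y_K))
    return (rkl - anchor).mean()  # the 1/|y| * sum_t [...] term for this sequence
```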

Comment

Q7: In Eq 7, what does $\sum_K$ denote? It seems like there is an index variable missing there.

Reply: Thanks for your question. The $\sum_{K}$ in Eq.(7) is written as a summation to denote the derivation from the joint distribution of $K$ sequences, i.e., the $K$ in $\log \beta = - \sum_{K} \log \text{TVD}(q, p)$ in the "Alignment Objective" paragraph of Section 3.2. To clarify this, we explained the meaning of $K$ in the context of Eq.(7) in lines 281-284 of our submitted manuscript, i.e., lines 284-286 of our updated manuscript ("Eq.(7) minimizes TVD with the strength of $K$ for the same sequence. Nevertheless, considering that beam search with sampling would obtain $K$ different sequences, we alternatively leverage the top-$K$ sequences from the target LLM"). This also partially explains our choice of $\mathcal{D}'$ to sample the top-$K$ sequences from the target LLM for alignment training under relaxed sampling verification.

Q8: The strategy relying on tree-based attention to speed up inference would benefit from being described in more detail (at least in the Appendix). Section 3.3 only refers to Figure 2(c) to describe the strategy, but this figure is not self-explanatory.

Reply: Thanks for your insightful comments. We provide more details and a concrete example of how tree-based attention is implemented in practice. We have also added them to our updated manuscript.

The main idea of tree-based attention is to eliminate the repeated self-attention calculation for the same prefix of the beam sequences during beam search.

  • Detailed explanation of tree-based attention. In the tree-based attention strategy, we first compress the $N$ beam sequences into a single flattened sequence, and then construct a sparse tree-based attention mask for efficient target LLM verification. More precisely, given $\gamma N$ drafted beam sequences of different lengths, where $N$ is the draft beam size and $\gamma$ is the number of drafted beam steps,

    • we first flatten the beam sequences by sequentially appending the newly generated tokens of the beam sequences at each step. We denote the length of the flattened sequence as $L_f$.

    • Then, based on the flattened sequence, we construct an attention mask of shape $L_f \times L_f$. Specifically, each row in the attention mask represents a specific beam sequence; for each row, we set the columns of the last token and the preceding tokens of that beam sequence to 1, and all other columns to 0.

  • Example. Here, we give a concrete example to illustrate how tree-based attention is implemented to save repeated calculation. We set the beam size to 2 and collect the beam sequences over 3 steps. The collected beam sequences are as follows:

    step 1 beam 1: a1

    step 1 beam 2: a2

    step 2 beam 1: a1 b1

    step 2 beam 2: a2 b2

    step 3 beam 1: a1 b1 c1

    step 3 beam 2: a1 b1 c2

    The flattened sequence will be a1 a2 b1 b2 c1 c2. The constructed tree-based attention is shown in Figure 2(c) of our manuscript, where each row represents a specific beam sequence. For each row, the ticked cells mark the preceding tokens and the last token of that beam, and the different colors represent different beam search steps. This flattened sequence and sparse tree-based attention enable efficient verification since they avoid repeated calculation over the same prefix across different beam sequences, e.g., a1 b1. (A small sketch of this mask construction follows below.)
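A minimal sketch of this flattening and mask construction for the example above (token strings and index bookkeeping are illustrative):

```python
import numpy as np

flattened = ["a1", "a2", "b1", "b2", "c1", "c2"]  # L_f = 6 tokens, in generation order
beam_paths = [          # each new token's beam sequence, as indices into `flattened`
    [0],                # step 1 beam 1: a1
    [1],                # step 1 beam 2: a2
    [0, 2],             # step 2 beam 1: a1 b1
    [1, 3],             # step 2 beam 2: a2 b2
    [0, 2, 4],          # step 3 beam 1: a1 b1 c1
    [0, 2, 5],          # step 3 beam 2: a1 b1 c2 (prefix a1 b1 computed only once)
]

L_f = len(flattened)
mask = np.zeros((L_f, L_f), dtype=int)
for row, path in enumerate(beam_paths):
    mask[row, path] = 1  # each row attends to its own beam prefix plus itself
print(mask)
```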

Q9: The WS metric represents the walltime speedup, but with respect to which baseline? I assume this is in comparison to directly running the target model without speculative decoding, but it would be helpful for the reader to mention this when defining the metric.

Reply: Thanks for your valuable question and suggestion. Your understanding is correct. The walltime speedup is measured against running the original target LLM without speculative decoding, and is defined as $WS=\frac{T}{T'}$, where $T$ is the time for running the target LLM without speculative decoding and $T'$ is the time for running it with speculative decoding using a specific draft model. We have also added this clarification to our updated manuscript.

Comment

Step 2. Decompose $\sum_{\mathbf{y}}q(\mathbf{y})$ and obtain a step-wise alignment.

Since $\mathbf{y}=(y_1, y_2, \dots, y_n) = (\mathbf{y}_{<t}, y_t, \mathbf{y}_{>t})$ is a sequence, we can decompose $\sum_{\mathbf{y}} q(\mathbf{y})$ into the nested sum $\sum_{\mathbf{y}_{<t}} \sum_{y_t} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}) = \sum_{\mathbf{y}_{<t}} \sum_{y_t} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{<t}, y_t, \mathbf{y}_{>t})$, where the nested sum over the three parts of $\mathbf{y}$, i.e., $\mathbf{y}_{<t}$, $y_t$, and $\mathbf{y}_{>t}$, covers all possible sequences $\mathbf{y}$.

Since $q(\mathbf{y}_{<t}, y_t, \mathbf{y}_{>t})=q(\mathbf{y}_{<t})\,q(y_t|c_{<t})\,q(\mathbf{y}_{>t}|c_{\le t})$, we can rewrite the nested sum of $q(\mathbf{y}_{<t}, y_t, \mathbf{y}_{>t})$ over $\mathbf{y}$:

$$\begin{aligned}
\sum_{\mathbf{y}} q(\mathbf{y}) &= \sum_{\mathbf{y}_{<t}} \sum_{y_t} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{<t})q(y_t|c_{<t})q(\mathbf{y}_{>t}|c_{\le t}) \\
&= \sum_{\mathbf{y}_{<t}} \sum_{y_t} q(\mathbf{y}_{<t})q(y_t|c_{<t}) \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}) \\
&= \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}).
\end{aligned}$$

We can then substitute $\sum_{\mathbf{y}} q(\mathbf{y})$ with $\sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t})$ in the previous expansion and obtain:

$$\begin{aligned}
& - \sum_t \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} + \sum_t \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \\
=& - \sum_t \Big[\sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t})\Big] \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} + \sum_t \Big[\sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t})\Big] \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \\
=& - \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}) + \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}).
\end{aligned}$$

Since $\sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t})=1$, we can remove this factor:

$$\begin{aligned}
& - \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}) + \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \sum_{\mathbf{y}_{>t}} q(\mathbf{y}_{>t}|c_{\le t}) \\
=& - \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} + \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \\
=& - \sum_t \sum_{\mathbf{y}_{<t}} q(\mathbf{y}_{<t}) \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big] \\
=& - \sum_t \mathbb{E}_{\mathbf{y}_{<t}\sim q} \Big[ \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} - \sum_{y_t} q(y_t|c_{<t}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \Big].
\end{aligned}$$

Now we have a step-wise alignment over all sequences from $q$, i.e., $\mathbb{E}_{\mathbf{y}_{<t}\sim q}$.

Comment

Q10: Table 1 reports the results for AtSpeed-S and AtSpeed-R on both the strict and the relaxed settings. How can AtSpeed-S be applied to the relaxed setting and AtSpeed-R to the strict setting?

Reply: Thanks for your valuable questions. AtSpeed-S and AtSpeed-R are essentially two alignment methods for training a draft model. After alignment training, the trained draft model can be applied under both strict top-$K$ and relaxed sampling verification. The difference between AtSpeed-S and AtSpeed-R is that their training losses are specifically designed to improve the acceptance rate under strict top-$K$ and relaxed sampling verification, respectively. Therefore, the alignment effectiveness can differ: AtSpeed-S achieves better alignment under strict verification while AtSpeed-R achieves better alignment under relaxed sampling verification (the empirical results in Table 1 of the manuscript also validate this).

Q11: In Figure 5 of the Appendix, it seems that a larger value for α is always beneficial for AtSpeed-R, whereas Figure 3 (c) showed that α should neither be too small nor too large for AtSpeed-S. Are there any intuitions on these different behaviors between AtSpeed-R and AtSpeed-S?

Reply: Thanks for your insightful questions. We suspect the different behaviors are due to the different scales of $L_\text{Align-S}$ and $L_\text{Align-R}$. Specifically, given the sequence distribution $q$ from the draft model and $p$ from the target LLM, we have the RKLD $q\log \frac{q}{p}$ in AtSpeed-S and the TVD $\frac{|q-p|}{2}$ in AtSpeed-R to align the two models. When the sequence probability $q$ becomes small as the sequence length increases, $\frac{q}{p}$ can become very large, so AtSpeed-S may yield a larger loss value than AtSpeed-R (i.e., $\frac{|p-q|}{2}$). Therefore, for the same $\alpha=0.7$, the strength is still not too large to hurt the alignment for AtSpeed-R.

  • Empirical results. To validate this, we continued increasing $\alpha$ to 1 for AtSpeed-R and present the results in the following table. We find that AtSpeed-R yields lower WS@20 and AS@20 when we increase $\alpha$ to 0.9. Moreover, the performance is worst at $\alpha=1$, i.e., when the recommendation loss is removed. This overall behavior is consistent with AtSpeed-S, where an extremely large $\alpha$ hurts the alignment. We have also added these results and observations to our latest manuscript.
AtSpeed-R under varying $\alpha$:

| $\alpha$ | WS@10 | WS@20 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- |
| 0 | 2.0421 | 2.3801 | 1.9893 | 1.4866 |
| 0.1 | 2.0282 | 2.3284 | 1.9916 | 1.5336 |
| 0.3 | 2.0340 | 2.3909 | 1.9975 | 1.6085 |
| 0.5 | 2.0270 | 2.3665 | 1.9932 | 1.5840 |
| 0.7 | 2.0394 | 2.4521 | 2.0144 | 1.6116 |
| 0.9 | 2.1055 | 2.3100 | 2.0419 | 1.5398 |
| 1 | 1.5002 | 1.9245 | 1.0639 | 0.9491 |
Comment

Q12: Derivation of Eq 15 in App A.2 (which leads to the definition of the alignment objective in Eq 3) seems incorrect or at least misses important steps to be understandable for the reader. More steps or explanations are needed.

Reply: Your thorough review is greatly appreciated. We provide a detailed step-by-step derivation of Eq.(3) to facilitate the reader's understanding, and we have also updated the Appendix of our revised manuscript with this derivation.

Step 1. Decompose $p(\mathbf{y})$ and $q(\mathbf{y})$.

We start from Eq.(2) in our paper, i.e., the alignment objective under strict top-$K$ verification:

$$-\sum_{\mathbf{y}}q(\mathbf{y}) \log \frac{q(\mathbf{y})}{p(\mathbf{y})} + \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(\mathbf{y})}{p(\mathbf{y}_K)},$$

where $p(\mathbf{y}_K)=p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})$. Denoting the context $c_{<t}=(\mathbf{x},\mathbf{y}_{<t})$, we have $q(\mathbf{y})=\prod_{t}q(y_t|\mathbf{x},\mathbf{y}_{<t})=\prod_{t}q(y_t|c_{<t})$ and $p(\mathbf{y})=\prod_{t}p(y_t|\mathbf{x},\mathbf{y}_{<t})=\prod_{t}p(y_t|c_{<t})$. We can then substitute $q(\mathbf{y})$ with $\prod_{t}q(y_t|c_{<t})$ and $p(\mathbf{y})$ with $\prod_{t}p(y_t|c_{<t})$ and rewrite the objective as:

$$\begin{aligned}
&-\sum_{\mathbf{y}}q(\mathbf{y}) \log \frac{q(\mathbf{y})}{p(\mathbf{y})} + \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(\mathbf{y})}{p(\mathbf{y}_K)} \\
=& -\sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{\prod_t q(y_t|c_{<t})}{\prod_t p(y_t|c_{<t})} + \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{\prod_t q(y_t|c_{<t})}{\prod_t p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \\
=& -\sum_{\mathbf{y}} q(\mathbf{y}) \Big[\log \prod_t q(y_t|c_{<t}) - \log \prod_t p(y_t|c_{<t})\Big] + \sum_{\mathbf{y}} q(\mathbf{y}) \Big[\log \prod_t q(y_t|c_{<t}) - \log \prod_t p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})\Big] \\
=& -\sum_{\mathbf{y}} q(\mathbf{y}) \Big[\sum_t \log q(y_t|c_{<t}) - \sum_t \log p(y_t|c_{<t})\Big] + \sum_{\mathbf{y}} q(\mathbf{y}) \Big[\sum_t \log q(y_t|c_{<t}) - \sum_t \log p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})\Big] \\
=& -\sum_{\mathbf{y}} q(\mathbf{y}) \sum_t \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} + \sum_{\mathbf{y}} q(\mathbf{y}) \sum_t \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})} \\
=& -\sum_t \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(y_t|c_{<t})}{p(y_t|c_{<t})} + \sum_t \sum_{\mathbf{y}} q(\mathbf{y}) \log \frac{q(y_t|c_{<t})}{p(\mathbf{y}_{K,t}|\mathbf{y}_{K,<t})}.
\end{aligned}$$
Official Review
Rating: 6

This paper proposes AtSpeed, a framework to accelerate LLM-based generative recommendation systems through speculative decoding. The main challenge addressed is the inherent inefficiency of autoregressive beam search in generating top-K recommendations. The authors identify that traditional SD methods, which focus on N-to-1 verification, are insufficient for recommendation tasks that require N-to-K verification to output a ranked list of K items. To tackle this, the paper introduces two methods: AtSpeed-S for strict top-K alignment and AtSpeed-R for relaxed sampling verification, which aims to reduce the number of LLM calls while preserving accuracy. Experimental results demonstrate that AtSpeed achieves significant speedups (up to 2.5×) with minimal degradation in recommendation accuracy.

Strengths

  • The paper extends the N-to-1 verification to the N-to-K verification in the context of generative recommendation, which fits the setting of real-world recommender systems.
  • The authors provide a solid theoretical foundation for AtSpeed to align the draft and target models. The optimization objectives are clearly motivated and mathematically sound.
  • Empirical results are compelling, showing that AtSpeed can achieve up to a 2.5× speedup while maintaining competitive recommendation accuracy. The results are well presented and highlight the practical utility of the proposed method.

Weaknesses

  • The paper may overclaim its contribution as the first to propose the speculative decoding task for LLM-based recommender acceleration. There already exists prior work on speculative decoding for LLM-based recommendation: A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems (Xi, Yunjia, et al), which needs to be compared and discussed.
  • The experiments are somewhat limited to two datasets (Amazon Beauty and Games). While these datasets are commonly used, the paper would benefit from broader validation across additional domains or larger-scale datasets.

Questions

  • [Q1] The paper primarily compares AtSpeed with KD-based baselines. Existing SD-based baselines should also be compared including the paper I mentioned before (A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems).
  • [Q2] Have the authors tested their framework on any larger and more diverse datasets, such as MovieLens or more complex datasets like Goodreads? If not, can the authors comment on the expected performance and generalization ability of AtSpeed in these contexts?
  • [Q3] The paper mentions the use of LLaMA-7B, but it would be interesting to know how AtSpeed scales with even larger models (e.g., LLaMA-13B or GPT-3). Do the authors anticipate any bottlenecks or limitations when scaling to larger models, especially concerning memory usage and GPU efficiency?

Details of Ethics Concerns

NA

Comment

Dear Reviewer WeyY,

Thanks for your valuable comments. We greatly appreciate your effort in reviewing. We have provided detailed explanations and additional experimental results to address your concerns, and we have included the additional results in the Appendix of the updated manuscript (marked in orange). Please feel free to leave further comments or questions if there is any misunderstanding, and we will reply promptly.

Weakness 1. The paper may overclaim their contribution to be the first to propose the speculative decoding task for LLM-based recommender acceleration. There already exists prior work on speculative decoding for LLM-based recommendation: A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems (Xi, Yunjia, et al), which needs to be compared and discussed.

Reply: Thanks for your valuable comments. We also recognized this work and discussed it in our paper. It is great work that focuses on speculative decoding for feature generation. Although this paper also adopts speculative decoding, it aims to accelerate feature generation (LLM as feature generator) for a downstream non-LLM recommender, while we aim to accelerate item generation (LLM as item recommender). Specifically,

  • This related work (named DARE) applies SD to the inference of LLMs for user/item feature generation. The generated user/item features are then utilized in downstream conventional recommender models for Click-Through-Rate (CTR) tasks. However, the user/item feature generation process only requires a single output sequence, which simply follows the traditional N-to-1 verification of standard SD, verifying one drafted token at each step.

  • In contrast, our work applies SD to the inference of LLMs for top-$K$ item recommendation, which emphasizes the challenge of the difficult $N$-to-$K$ verification at each step, accepting $K$ sequences out of $N$ drafted sequences instead of a single token.

Considering the significant difference in tasks between DARE and our work, we classify DARE as SD for LLM-enhanced recommendation (LLM as feature generator) and ours as SD for LLM-based recommendation (LLM as item recommender). The distinction between LLMs for feature generation and LLMs for item generation is also widely recognized in the current literature [1][2]. Therefore, we claim that ours is the first work to apply SD to LLM-based recommendation, and we discuss DARE in the related work of our manuscript.

[1] Wu Likang, et al. A Survey on Large Language Models for Recommendation. In World Wide Web 2024.

[2] Lin Jianghao, et al. How Can Recommender Systems Benefit from Large Language Models: A Survey. In TOIS 2024.

Comment

Q2: Have the authors tested their framework on any larger and more diverse datasets, such as MovieLens or more complex datasets like Goodreads? If not, can the authors comment on the expected performance and generalization ability of AtSpeed in these contexts?

Reply: Thanks for your valuable questions. Following your suggestions, we ran additional experiments on the MovieLens-1M and Goodreads datasets to validate the generalization ability of our proposed methods. Please refer to the Reply to Weakness 2 for detailed results. We have also included the additional results in the Appendix of our updated manuscript (marked in orange on page 23).

Q3: The paper mentions the use of LLaMA-7B, but it would be interesting to know how AtSpeed scales with even larger models (e.g., LLaMA-13B or GPT-3). Do the authors anticipate any bottlenecks or limitations when scaling to larger models, especially concerning memory usage and GPU efficiency?

Reply: Thanks for your insightful questions.

Currently, research studies on LLM-based recommenders typically run experiments on 7B models, such as LC-Rec [1], PALR [2], TALLRec [3], CoLLM [4], LLaRA [5], and BIGRec [6]. As such, we follow the current literature and validate the effectiveness of our method for inference acceleration on a 7B LLM recommender. The bottleneck of using larger LLMs (e.g., LLaMA-13B or GPT-3) mainly lies in the large resource cost of fine-tuning them on recommendation data. If we continue scaling up the model size, the computational costs (e.g., memory and GPUs) and time costs of fine-tuning LLMs on recommendation data become significantly higher.

Nevertheless, given sufficient resources for fine-tuning a larger LLM recommender (e.g., LLaMA-13B), our method is expected to remain practically feasible and effective in accelerating inference. This is because our work trains a draft model to generate drafts that align well with the target model's output. Given a fine-tuned target LLM and the training data, the draft model's training cost remains constant, thus enabling acceleration regardless of the target LLM's size.

[1] Bowen Zheng et al., Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation. In ICDE 2024.

[2] Fan Yang et al., PALR: Personalization Aware LLMs for Recommendation. Arxiv 2023.

[3] Keqin Bao et al., TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation. In RecSys 2023.

[4] Yang Zhang et al., CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation. In TKDE.

[5] Jiayi Liao et al., LLaRA: Large Language-Recommendation Assistant. In SIGIR 2024.

[6] Keqin Bao et al., A Bi-Step Grounding Paradigm for Large Language Models in Recommendation Systems. Arxiv 2023.

Comment

Q1: The paper primarily compares AtSpeed with KD-based baselines. Existing SD-based baselines should also be compared including the paper I mentioned before (A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems).

Reply: Thanks for your valuable comments. Following your suggestions, we extended the related work you mentioned to our setting and report the performance comparison as follows. We have also added these results to our latest manuscript (Table 1 on page 8 and Table 6 on page 23).

From the results, we observe that 1) our proposed method consistently outperforms the extended DARE. This is reasonable since DARE's candidate items are uniformly sampled from the valid items, which may not align well with the top-K sequence distribution of the target LLM, leading to a low acceptance rate and a less satisfying speedup. Notably, 2) DARE has a constant accept step of zero on Games, mainly due to the large valid item pool during retrieval: uniform samples from a large population are unlikely to be accepted by the target LLM (a back-of-the-envelope illustration follows the tables below). In contrast, DARE achieves a constant accept step of one on MovieLens-1M and Goodreads. A possible reason is that these two datasets have relatively small item pools, so the number of distinct first valid tokens can be smaller than K; as such, all retrieved valid items are verified and accepted at the first step.

Beauty

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 1.07 | 1.06 | 1.15 | 1.48 | 0.44 | 0.26 | 0.05 | 0.00 |
| Strict Top-K | AtSpeed-S | 1.97 | 1.84 | 1.87 | 1.84 | 2.20 | 2.00 | 1.64 | 0.57 |
| Relaxed Sampling | DARE | 1.65 | 1.70 | 1.53 | 1.95 | 2.00 | 1.97 | 1.14 | 1 |
| Relaxed Sampling | AtSpeed-R | 1.94 | 1.94 | 2.16 | 2.47 | 2.19 | 2.13 | 2.01 | 1.77 |
Games

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 0.95 | 0.99 | 1.13 | 1.44 | 0.00 | 0.00 | 0.00 | 0.00 |
| Strict Top-K | AtSpeed-S | 1.77 | 1.78 | 1.85 | 1.76 | 2.02 | 1.96 | 1.69 | 0.68 |
| Relaxed Sampling | DARE | 1.64 | 1.68 | 1.19 | 1.42 | 2.00 | 1.96 | 0.37 | 0 |
| Relaxed Sampling | AtSpeed-R | 1.92 | 2.00 | 2.05 | 2.20 | 2.18 | 2.17 | 1.98 | 1.35 |
MovieLens-1M

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 1.26 | 1.28 | 1.39 | 1.72 | 1.00 | 1.00 | 1.00 | 1.00 |
| Strict Top-K | AtSpeed-S | 1.86 | 1.79 | 1.80 | 1.75 | 2.08 | 2.03 | 1.98 | 1.09 |
| Relaxed Sampling | DARE | 2.01 | 1.84 | 1.35 | 1.44 | 2.16 | 2.02 | 1.00 | 0.35 |
| Relaxed Sampling | AtSpeed-R | 2.24 | 2.22 | 2.14 | 1.64 | 2.44 | 2.38 | 2.15 | 0.95 |
Goodreads

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 1.30 | 1.32 | 1.44 | 1.75 | 1.00 | 1.00 | 1.00 | 1.00 |
| Strict Top-K | AtSpeed-S | 2.25 | 2.26 | 2.20 | 2.48 | 2.32 | 2.18 | 1.81 | 1.08 |
| Relaxed Sampling | DARE | 1.84 | 1.83 | 1.35 | 1.43 | 2.06 | 2.02 | 1.00 | 0.35 |
| Relaxed Sampling | AtSpeed-R | 2.45 | 2.39 | 1.77 | 2.36 | 2.50 | 2.32 | 1.10 | 0.87 |
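To illustrate the accept-step observations above with a back-of-the-envelope calculation: if a retrieval-based drafter picks N items uniformly without replacement from M valid items, the chance that they cover all K of the target's top-K items is C(M-K, N-K)/C(M, N). The sketch below (our illustration; the pool sizes are only indicative of the reported dataset scales) shows this probability is vanishingly small for a Games-sized pool:

```python
from math import comb

def coverage_prob(M: int, N: int, K: int) -> float:
    """Probability that N distinct uniform draws from M valid items
    contain all K items of the target's top-K list."""
    return comb(M - K, N - K) / comb(M, N)

# Pool sizes roughly at the scale of Games (~17k items) and
# MovieLens-1M (~3k items); N = 40 drafts, K = 3 targets.
print(coverage_prob(17_332, 40, 3))  # ~1.1e-08: uniform drafts almost never pass
print(coverage_prob(3_017, 40, 3))   # ~2.2e-06: still tiny beyond the first step
```

This simple model ignores prefix structure and the first-step effect discussed above, but it is consistent with uniform retrieval drafting rarely surviving strict top-K verification.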
Comment

For the other existing SD-based methods:

From the perspective of verification strategy, existing SD-based methods are typically designed for N-to-1 verification, including the aforementioned DARE. However, top-K item generation requires N-to-K sequence verification. Therefore, from this perspective, prior SD-based methods cannot be directly adopted for performance comparison.

On the other hand, from the perspective of drafting strategy, current N-to-1 SD-based methods can be broadly categorized into self-drafting, external language model drafting, and external retrieval-based drafting [1]. Since self-drafting and external retrieval-based drafting cannot be directly adopted in SD for LLM-based recommendation, we mainly compare our method with baselines from external language model drafting. Specifically,

  1. Self-drafting methods typically leverage the target LLM itself to efficiently generate multiple tokens for each future step (e.g., via multi-head prediction [2][3]). However, it is non-trivial to adopt these self-drafted tokens for top-K item generation via beam search: SD under beam search requires each candidate at every future step to be a token sequence rather than a single token.

    For example, with γ = 3 drafted future steps and N = 5 candidates per step, the self-drafting approach gives N drafted tokens per step:

    step 1: a1, a2, a3, a4, a5

    step 2: b1, b2, b3, b4, b5

    step 3: c1, c2, c3, c4, c5

    Since N-to-K verification requires a sequence for each step, we need to construct sequences from these tokens. An intuitive way is to enumerate all possible combinations, e.g., 25 possible sequences a1b1, a1b2, …, a1b5, a2b1, …, a2b5, …, a5b1, …, a5b5 at step 2, and similarly 5^3 sequences at step 3 (see the sketch at the end of this reply). However, N is usually set to a large number, e.g., 40, which makes verifying all possible combinations infeasible. Designing an effective strategy to combine tokens across steps into sequences that align well with the target LLM would therefore require extensive additional work.

  2. External language model drafting aligns with our work: it utilizes a small language model as the draft model and mainly leverages KD for better alignment. Therefore, we compare against representative KD-based methods.

  3. External retrieval-based drafting retrieves tokens from an external corpus. The related work "A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems" lies in this line, retrieving tokens from existing user features. Nevertheless, DARE focuses on user/item feature generation, where retrieval is based on previously generated user/item features, and thus cannot be used directly in our setting. To compare with DARE, we borrow its concept and devise a retrieval-based drafting method that retrieves all valid sequences from the item identifiers.

[1] Heming Xia, et al., Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. In ACL 2024.

[2] Tianle Cai, et al., Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In ICML 2024.

[3] Yuhui Li, et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. In ICML 2024.
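As a concrete illustration of the combinatorial blow-up described in point 1 above, the following snippet (placeholder tokens; not the authors' code) enumerates the naive per-step sequence candidates:

```python
from itertools import product

# Per-step token candidates from a self-drafting head (gamma = 3 future
# steps, N = 5 candidates per step); token names are placeholders.
step_tokens = [
    [f"a{i}" for i in range(1, 6)],  # step 1
    [f"b{i}" for i in range(1, 6)],  # step 2
    [f"c{i}" for i in range(1, 6)],  # step 3
]

# Naively combining tokens yields N**t candidate sequences at step t:
# 5, 25, 125 here -- and 40**3 = 64,000 at a realistic N = 40.
for t in range(1, len(step_tokens) + 1):
    seqs = list(product(*step_tokens[:t]))
    print(f"step {t}: {len(seqs)} candidate sequences, e.g., {seqs[0]}")
```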

Comment

Weakness 2. The experiments are somewhat limited to two datasets (Amazon Beauty and Games). While these datasets are commonly used, the paper would benefit from broader validation across additional domains or larger-scale datasets.

Reply: Thanks for your valuable comments. Following your suggestions in Question 2, we added new experiments on the MovieLens-1M and Goodreads datasets. The results are as follows.

Table 1. Performance comparison of AtSpeed and baselines on the MovieLens-1M dataset under strict top-K verification and relaxed sampling verification.

MovieLens-1M

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 1.26 | 1.28 | 1.39 | 1.72 | 1.00 | 1.00 | 1.00 | 1.00 |
| Strict Top-K | SFT | 1.65 | 1.35 | 1.48 | 1.76 | 1.99 | 1.26 | 1.14 | 1.03 |
| Strict Top-K | WordKD | 1.29 | 1.29 | 1.39 | 1.69 | 1.07 | 1.02 | 1.00 | 1.00 |
| Strict Top-K | TVDKD | 1.73 | 1.72 | 1.25 | 1.24 | 1.98 | 1.90 | 1.04 | 1.00 |
| Strict Top-K | SeqKD | 1.77 | 1.78 | 1.42 | 1.50 | 2.03 | 2.00 | 1.24 | 1.05 |
| Strict Top-K | AtSpeed-S | 1.86 | 1.79 | 1.80 | 1.75 | 2.08 | 2.03 | 1.98 | 1.09 |
| Strict Top-K | AtSpeed-R | 1.76 | 1.78 | 1.54 | 1.74 | 2.01 | 1.99 | 1.29 | 1.08 |
| Relaxed Sampling | DARE | 2.01 | 1.84 | 1.35 | 1.44 | 2.16 | 2.02 | 1.00 | 0.35 |
| Relaxed Sampling | SFT | 2.08 | 2.03 | 2.02 | 1.61 | 2.28 | 2.20 | 2.06 | 0.93 |
| Relaxed Sampling | WordKD | 1.97 | 1.87 | 1.48 | 1.28 | 2.15 | 2.08 | 1.23 | 0.00 |
| Relaxed Sampling | TVDKD | 2.00 | 1.98 | 1.68 | 1.29 | 2.30 | 2.21 | 1.59 | 0.00 |
| Relaxed Sampling | SeqKD | 2.13 | 2.08 | 1.93 | 1.56 | 2.19 | 2.14 | 2.01 | 0.78 |
| Relaxed Sampling | AtSpeed-S | 2.23 | 2.16 | 2.11 | 1.65 | 2.41 | 2.33 | 2.16 | 1.02 |
| Relaxed Sampling | AtSpeed-R | 2.24 | 2.22 | 2.14 | 1.64 | 2.44 | 2.38 | 2.15 | 0.95 |

Table 2. Performance comparison of AtSpeed and baselines on the Goodreads dataset under strict top-K verification and relaxed sampling verification.

Goodreads

| Verification | Method | WS@3 | WS@5 | WS@10 | WS@20 | AS@3 | AS@5 | AS@10 | AS@20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | DARE | 1.30 | 1.32 | 1.44 | 1.75 | 1.00 | 1.00 | 1.00 | 1.00 |
| Strict Top-K | SFT | 1.83 | 1.81 | 2.17 | 2.46 | 2.04 | 1.98 | 1.72 | 1.07 |
| Strict Top-K | WordKD | 1.83 | 1.92 | 2.07 | 2.38 | 2.00 | 1.96 | 1.58 | 1.00 |
| Strict Top-K | TVDKD | 1.89 | 1.93 | 2.17 | 2.46 | 2.07 | 1.97 | 1.70 | 1.07 |
| Strict Top-K | SeqKD | 1.82 | 1.89 | 2.19 | 2.48 | 2.00 | 1.96 | 1.73 | 1.08 |
| Strict Top-K | AtSpeed-S | 2.25 | 2.26 | 2.20 | 2.48 | 2.32 | 2.18 | 1.81 | 1.08 |
| Strict Top-K | AtSpeed-R | 2.11 | 2.07 | 2.20 | 2.49 | 2.24 | 2.09 | 1.80 | 1.12 |
| Relaxed Sampling | DARE | 1.84 | 1.83 | 1.35 | 1.43 | 2.06 | 2.02 | 1.00 | 0.35 |
| Relaxed Sampling | SFT | 2.15 | 2.09 | 1.70 | 1.91 | 2.27 | 2.08 | 1.01 | 0.10 |
| Relaxed Sampling | WordKD | 2.01 | 2.04 | 1.68 | 1.92 | 2.15 | 2.05 | 1.00 | 0.15 |
| Relaxed Sampling | TVDKD | 2.27 | 2.22 | 1.71 | 2.02 | 2.36 | 2.18 | 1.03 | 0.25 |
| Relaxed Sampling | SeqKD | 1.90 | 1.96 | 1.66 | 1.85 | 2.08 | 2.01 | 1.00 | 0.02 |
| Relaxed Sampling | AtSpeed-S | 2.18 | 2.13 | 1.71 | 1.93 | 2.28 | 2.12 | 1.02 | 0.17 |
| Relaxed Sampling | AtSpeed-R | 2.45 | 2.39 | 1.77 | 2.36 | 2.50 | 2.32 | 1.10 | 0.87 |
Comment

From the results, we can observe that

    1. AtSpeed-S and AtSpeed-R outperform the baselines in most cases under strict top-K verification and relaxed sampling verification, respectively. This validates the effectiveness and generalization ability of our proposed method on diverse datasets and is consistent with the observations on Amazon Beauty and Games (Table 1 in our manuscript).
    2. The relaxed sampling verification generally shows superior speedup over strict top-K verification when K = 3, 5, while yielding inferior speedup when K is large (e.g., K = 20 on MovieLens-1M and Goodreads). One possible reason is that the item pools of these two datasets are relatively small (3,017 movies and 4,667 books) compared to Beauty (12,035 products) and Games (17,332 products), which may result in a long-tailed draft distribution where the top-K valid sequences have overwhelmingly high probabilities (i.e., q ≥ p), thus leading to high rejection probabilities. We have also included the results on the two additional datasets in the Appendix of our updated manuscript on page 23 (marked in orange).
Comment

Dear Reviewer WeyY,

We would like to kindly follow up to see if our response addresses your concerns. We are happy to take any further questions and we eagerly anticipate our discussion with you! Please feel free to let us know if there's any misunderstanding. Thanks for your time and review.

Comment

Thanks for addressing most of my concerns, and I decided to raise my score.

Comment

Thanks for your positive feedback and your hard work on the review. We sincerely appreciate it! Your constructive and insightful feedback has been very helpful in improving our paper.

Review
8

This paper focuses on accelerating inference in LLM-based generative recommendation using speculative decoding. It highlights the challenges of applying speculative decoding directly to generative recommendation due to the N-to-K issue. The authors introduce improvements in both the drafting and verification stages. Experiments on two public datasets demonstrate the efficiency of the proposed method.

Strengths

  1. A timely study addressing inference efficiency in generative recommendation.
  2. The ideas of (1) fine-tuning the draft model to align with top-K items of the target model and (2) adding a probability to accept rejected items in a more flexible verification setting are both novel and interesting.
  3. The paper is well-written and easy to follow.
  4. Experiments on two public datasets demonstrate the efficiency of the proposed method.
  5. Code is available during the review phase, enhancing reproducibility.

Weaknesses

  1. Performance metrics (e.g., NDCG and Recall) are not well-presented. Only Table 2 includes some ranking metrics (also only Recall, without NDCG), which may raise doubts about the proposed method's ranking performance. Including these metrics in Table 1 and Figure 3 would strengthen the results. The limited metrics reported make it challenging to fully assess AtSpeed's ranking performance.
  2. Presentation issues. Reporting WS@K and AS@K for all values of K in {1, 3, 5, 10, 20} in Table 1 seems unnecessary, especially since the discussion focuses on average WS and AS. I suggest presenting only representative Ks alongside the averages, freeing space for additional ranking metrics.

Questions

Please refer to "Weaknesses" for more details.

Lastly, I want to share a related paper, "Inductive Generative Recommendation via Retrieval-based Speculation". This paper was released after the ICLR submission deadline and is currently available only as a preprint. While this is not a critique or question, it’s relevant as it discusses speculative decoding in generative recommendation. It proposes a dynamic N for the draft model and introduces a technique to perform limited beam search, using prefixes generated by the first few steps of the target model to guide the draft model. I believe this work shares some conceptual similarities with this paper, so I thought it worth mentioning.

Comment

Related work: Lastly, I want to share a related paper, "Inductive Generative Recommendation via Retrieval-based Speculation". This paper was released after the ICLR submission deadline and is currently available only as a preprint. While this is not a critique or question, it’s relevant as it discusses speculative decoding in generative recommendation. It proposes a dynamic N for the draft model and introduces a technique to perform limited beam search, using prefixes generated by the first few steps of the target model to guide the draft model. I believe this work shares some conceptual similarities with this paper, so I thought it worth mentioning.

Reply: Thank you for sharing this interesting related work. We also noticed this work after its recent release, and we will add it to our manuscript. It is a nice piece of work that cleverly leverages the "draft-then-verify" paradigm to introduce high-quality unseen items into the system, thereby addressing the cold-start issue. Though conceptually similar to ours, we would like to discuss the differences between our work and this related work (named SpecGR).

While SpecGR focuses on drafting unseen items for generative models to rerank, we aim at drafting beam sequences at every step that align well with the target LLM to reduce decoding steps. Specifically,

  • SpecGR aims to address the challenge for generative models to generate unseen items. It uses a draft model to allow unseen items to be reranked by generative models. As you mentioned, SpecGR introduces a guided re-drafting strategy, which essentially aligns the draft model with the target LLM on the unseen items (i.e., low-probability area of the target distribution).
  • In contrast, we focus on addressing the N-to-K verification challenge when applying speculative decoding for inference acceleration. The key to this problem lies in strong alignment of the top-K generated sequences between the draft model and the target LLM. Therefore, our work proposes an alignment training framework to strengthen this alignment on the top-K high-probability area of the distribution.

We will also add the discussion to our latest manuscript.

Comment

Weakness 2. Presentation issues. Reporting WS@K and AS@K for all values of K in {1, 3, 5, 10, 20} in Table 1 seems unnecessary, especially since the discussion focuses on average WS and AS. I suggest presenting only representative Ks alongside the averages, freeing space for additional ranking metrics.

Reply: Thanks for your great comments. Following your suggestion, we will add the ranking performance to Table 1 alongside the acceleration performance. Since this work mainly focuses on inference acceleration, we report the acceleration metrics WS@K and AS@K with K = 5, 10, 20 alongside the average over K = 1, 3, 5, 10, 20, and present Recall@5 and NDCG@5 in Table 1. The comprehensive ranking results (as presented in the Reply to Weakness 1) and acceleration results will be presented in the Appendix.

The adjusted Table 1 is shown below. We will promptly update the manuscript with these adjustments once we finish the revision.

Beauty

| Verification | Method | WS@5 | WS@10 | WS@20 | Avg WS | AS@5 | AS@10 | AS@20 | Avg AS | Recall@5 | NDCG@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | SFT | 1.43 | 1.37 | 1.55 | 1.56 | 1.32 | 0.66 | 0.09 | 1.18 | 0.0056 | 0.0051 |
| Strict Top-K | WordKD | 1.58 | 1.52 | 1.58 | 1.68 | 1.60 | 1.03 | 0.16 | 1.40 | 0.0056 | 0.0051 |
| Strict Top-K | TVDKD | 1.44 | 1.37 | 1.57 | 1.55 | 1.31 | 0.65 | 0.09 | 1.17 | 0.0056 | 0.0051 |
| Strict Top-K | SeqKD | 1.75 | 1.67 | 1.68 | 1.83 | 1.85 | 1.27 | 0.30 | 1.60 | 0.0056 | 0.0051 |
| Strict Top-K | AtSpeed-S | 1.84 | 1.87 | 1.84 | 1.97 | 2.00 | 1.64 | 0.57 | 1.80 | 0.0056 | 0.0051 |
| Strict Top-K | AtSpeed-R | 1.70 | 1.71 | 1.74 | 1.76 | 1.82 | 1.33 | 0.43 | 1.56 | 0.0056 | 0.0051 |
| Relaxed Sampling | SFT | 1.80 | 2.06 | 2.36 | 1.95 | 2.03 | 1.99 | 1.48 | 1.94 | 0.0057 (+0.0001) | 0.0041 (-0.0010) |
| Relaxed Sampling | WordKD | 1.81 | 1.99 | 2.05 | 1.87 | 2.01 | 1.87 | 1.07 | 1.82 | 0.0066 (+0.0010) | 0.0043 (-0.0008) |
| Relaxed Sampling | TVDKD | 1.81 | 2.06 | 2.35 | 1.96 | 2.03 | 1.99 | 1.45 | 1.94 | 0.0057 (+0.0001) | 0.0045 (-0.0006) |
| Relaxed Sampling | SeqKD | 1.90 | 2.11 | 2.31 | 2.01 | 2.10 | 2.01 | 1.40 | 1.97 | 0.0055 (-0.0001) | 0.0045 (-0.0006) |
| Relaxed Sampling | AtSpeed-S | 1.89 | 2.12 | 2.51 | 2.07 | 2.09 | 2.03 | 1.71 | 2.05 | 0.0060 (+0.0004) | 0.0046 (-0.0005) |
| Relaxed Sampling | AtSpeed-R | 1.94 | 2.16 | 2.47 | 2.11 | 2.13 | 2.01 | 1.77 | 2.10 | 0.0058 (+0.0002) | 0.0049 (-0.0002) |
Games

| Verification | Method | WS@5 | WS@10 | WS@20 | Avg WS | AS@5 | AS@10 | AS@20 | Avg AS | Recall@5 | NDCG@5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Strict Top-K | SFT | 1.43 | 1.40 | 1.58 | 1.53 | 1.49 | 1.32 | 0.91 | 1.20 | 0.0074 | 0.0065 |
| Strict Top-K | WordKD | 1.31 | 1.35 | 1.47 | 1.48 | 1.49 | 1.10 | 0.80 | 1.13 | 0.0074 | 0.0065 |
| Strict Top-K | TVDKD | 1.24 | 1.32 | 1.50 | 1.42 | 1.26 | 0.95 | 0.66 | 0.99 | 0.0074 | 0.0065 |
| Strict Top-K | SeqKD | 1.60 | 1.46 | 1.77 | 1.71 | 1.95 | 1.67 | 1.05 | 1.55 | 0.0074 | 0.0065 |
| Strict Top-K | AtSpeed-S | 1.78 | 1.85 | 1.76 | 1.83 | 2.02 | 1.96 | 1.69 | 1.72 | 0.0074 | 0.0065 |
| Strict Top-K | AtSpeed-R | 1.76 | 1.76 | 1.60 | 1.74 | 1.98 | 1.95 | 1.53 | 1.59 | 0.0074 | 0.0065 |
| Relaxed Sampling | SFT | 1.84 | 1.97 | 1.69 | 1.86 | 2.12 | 2.05 | 1.89 | 1.78 | 0.0073 (-0.0001) | 0.0060 (-0.0005) |
| Relaxed Sampling | WordKD | 1.78 | 1.84 | 1.56 | 1.76 | 2.05 | 1.99 | 1.68 | 1.63 | 0.0072 (-0.0002) | 0.0058 (-0.0007) |
| Relaxed Sampling | TVDKD | 1.81 | 1.90 | 1.55 | 1.80 | 2.08 | 2.02 | 1.80 | 1.69 | 0.0069 (-0.0005) | 0.0061 (-0.0004) |
| Relaxed Sampling | SeqKD | 1.90 | 2.03 | 2.05 | 1.95 | 2.13 | 2.10 | 1.98 | 1.93 | 0.0071 (-0.0003) | 0.0059 (-0.0006) |
| Relaxed Sampling | AtSpeed-S | 1.91 | 2.04 | 2.13 | 2.04 | 2.19 | 2.10 | 1.98 | 2.00 | 0.0080 (+0.0006) | 0.0068 (+0.0003) |
| Relaxed Sampling | AtSpeed-R | 2.00 | 2.05 | 2.20 | 2.05 | 2.18 | 2.17 | 1.98 | 2.02 | 0.0076 (+0.0002) | 0.0063 (-0.0002) |
Comment

In addition, following your suggestion, we evaluated the ranking performance for the ablation study. The results are as follows. From the results, we find that

  • Under strict top-K verification, our proposed method and its ablation variants have ranking performance identical to the target LLM, which is expected.

  • Moreover, under relaxed sampling verification, our method and its ablation variants show only a limited performance drop relative to the target LLM. This is consistent with the ranking-performance observations across the baselines and meets our expectation: we theoretically show that the output distribution under relaxed sampling is approximately equivalent to the original output distribution of the target LLM, and the empirical results further confirm the limited drop. We will also add the ranking performance to Figure 3 in our manuscript. (A reference sketch of how these ranking metrics are computed follows the tables below.)

Table 1. Ranking performance of our method and the ablation variants under strict top-K verification.

Beauty

| Verification | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| Strict Top-K | w/o CA | 0.0056 | 0.0098 | 0.0051 | 0.0066 |
| Strict Top-K | w/o DR | 0.0056 | 0.0098 | 0.0051 | 0.0066 |
| Strict Top-K | w/o TA | 0.0056 | 0.0098 | 0.0051 | 0.0066 |
| Strict Top-K | AtSpeed-S | 0.0056 | 0.0098 | 0.0051 | 0.0066 |

Table 2. Ranking performance of our method and the ablation variants under relaxed sampling verification.

Beauty

| Verification | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 |
| --- | --- | --- | --- | --- | --- |
| Relaxed Sampling | Target LLM (topK) | 0.0056 | 0.0098 | 0.0051 | 0.0066 |
| Relaxed Sampling | w/o CA | 0.0061 | 0.0094 | 0.0047 | 0.0058 |
| Relaxed Sampling | w/o TopK | 0.0059 | 0.0094 | 0.0044 | 0.0057 |
| Relaxed Sampling | w/o TA | 0.0060 | 0.0097 | 0.0049 | 0.0063 |
| Relaxed Sampling | AtSpeed-R | 0.0058 | 0.0092 | 0.0049 | 0.0063 |
| Relaxed Sampling | Average | 0.0059 | 0.0094 | 0.0047 | 0.0060 |
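For reference, the Recall@K and NDCG@K values reported above can be computed as in the sketch below for the single-ground-truth next-item setting (a standard formulation included for clarity; not code from the paper):

```python
import math

def recall_at_k(ranked_items: list[str], target: str, k: int) -> float:
    """1.0 if the ground-truth next item appears in the top-k list, else 0.0."""
    return float(target in ranked_items[:k])

def ndcg_at_k(ranked_items: list[str], target: str, k: int) -> float:
    """With a single relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) at the hit position (0.0 on a miss)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["item_7", "item_2", "item_9", "item_4", "item_1"]  # toy ranking
print(recall_at_k(ranked, "item_9", 5))  # 1.0
print(ndcg_at_k(ranked, "item_9", 5))    # 0.5 (hit at rank 3: 1/log2(4))
```

Per-user values are then averaged over the test set, which is why NDCG is sensitive to the ordering within the top K while Recall is not.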
Comment

Dear Reviewer y4np,

Thanks for your positive comments; we greatly appreciate your reviewing effort. We have provided additional experiments in support. If there is any misunderstanding, please feel free to let us know and we will reply promptly.

Weakness 1. Performance metrics (e.g., NDCG and Recall) are not well-presented. Only Table 2 includes some ranking metrics (also only Recall, without NDCG), which may raise doubts about the proposed method's ranking performance. Including these metrics in Table 1 and Figure 3 would strengthen the results. The limited metrics reported make it challenging to fully assess AtSpeed's ranking performance.

Reply: Thanks for your valuable comments. Following your suggestions, we added the recommendation performance of all methods in terms of Recall and NDCG, in comparison to the target LLM with top-K beam search and with sampling-based beam search. The results on Beauty and Games are as follows. We will also add these results to our manuscript.

Table 1. Performance comparison under strict top-K verification on Beauty.

Beauty

| Setting | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 | WS@5 | WS@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without SD | Target LLM (topK) | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1 | 1 |
| Strict Top-K Verification | SFT | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.43 | 1.37 |
| Strict Top-K Verification | WordKD | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.58 | 1.52 |
| Strict Top-K Verification | TVDKD | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.44 | 1.37 |
| Strict Top-K Verification | SeqKD | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.75 | 1.67 |
| Strict Top-K Verification | AtSpeed-S | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.84 | 1.87 |
| Strict Top-K Verification | AtSpeed-R | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1.70 | 1.71 |

Table 2. Performance comparison under relaxed sampling verification on Beauty.

Beauty

| Setting | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 | WS@5 | WS@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without SD | Target LLM (topK) | 0.0056 | 0.0098 | 0.0051 | 0.0066 | 1 | 1 |
| Without SD | Target LLM (sampling) | 0.0056 | 0.0082 | 0.0043 | 0.0066 | 1 | 1 |
| Relaxed Sampling Verification | SFT | 0.0057 | 0.0091 | 0.0041 | 0.0063 | 1.80 | 2.06 |
| Relaxed Sampling Verification | WordKD | 0.0066 | 0.0105 | 0.0043 | 0.0058 | 1.81 | 1.99 |
| Relaxed Sampling Verification | TVDKD | 0.0057 | 0.0083 | 0.0045 | 0.0054 | 1.81 | 2.06 |
| Relaxed Sampling Verification | SeqKD | 0.0055 | 0.0116 | 0.0045 | 0.0067 | 1.90 | 2.11 |
| Relaxed Sampling Verification | AtSpeed-S | 0.0060 | 0.0096 | 0.0046 | 0.0060 | 1.89 | 2.12 |
| Relaxed Sampling Verification | AtSpeed-R | 0.0058 | 0.0092 | 0.0049 | 0.0063 | 1.94 | 2.16 |
| Relaxed Sampling Verification | Average | 0.0059 | 0.0097 | 0.0045 | 0.0061 | / | / |

Table 3. Performance comparison under strict top-K verification on Games.

Games

| Setting | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 | WS@5 | WS@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without SD | Target LLM (topK) | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1 | 1 |
| Strict Top-K Verification | SFT | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.43 | 1.40 |
| Strict Top-K Verification | WordKD | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.31 | 1.35 |
| Strict Top-K Verification | TVDKD | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.24 | 1.32 |
| Strict Top-K Verification | SeqKD | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.60 | 1.46 |
| Strict Top-K Verification | AtSpeed-S | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.78 | 1.85 |
| Strict Top-K Verification | AtSpeed-R | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1.76 | 1.76 |

Table 4. Performance comparison under relaxed sampling verification on Games.

Games

| Setting | Method | Recall@5 | Recall@10 | NDCG@5 | NDCG@10 | WS@5 | WS@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Without SD | Target LLM (topK) | 0.0074 | 0.0125 | 0.0065 | 0.0083 | 1 | 1 |
| Without SD | Target LLM (sampling) | 0.0075 | 0.0115 | 0.0066 | 0.0079 | 1 | 1 |
| Relaxed Sampling Verification | SFT | 0.0073 | 0.0112 | 0.0060 | 0.0074 | 1.84 | 1.97 |
| Relaxed Sampling Verification | WordKD | 0.0072 | 0.0113 | 0.0058 | 0.0073 | 1.78 | 1.84 |
| Relaxed Sampling Verification | TVDKD | 0.0069 | 0.0108 | 0.0061 | 0.0074 | 1.81 | 1.90 |
| Relaxed Sampling Verification | SeqKD | 0.0071 | 0.0110 | 0.0059 | 0.0073 | 1.90 | 2.03 |
| Relaxed Sampling Verification | AtSpeed-S | 0.0080 | 0.0131 | 0.0068 | 0.0085 | 1.91 | 2.04 |
| Relaxed Sampling Verification | AtSpeed-R | 0.0076 | 0.0123 | 0.0063 | 0.0080 | 2.00 | 2.05 |
| Relaxed Sampling Verification | Average | 0.0073 | 0.0116 | 0.0062 | 0.0077 | / | / |
Comment

From the tables, we have the following observations:

  • The ranking performance under strict top-K verification is lossless (Tables 1 and 3). This is expected since strict verification only accepts drafts that perfectly match the top-K sequences from the target LLM; hence generation results are identical with and without speculative decoding. On top of these lossless results, our proposed AtSpeed-S achieves up to an average 1.85x speedup.

  • The ranking performance under relaxed sampling verification across the different alignment methods shows only limited drops compared to the target LLM's top-K results (comparable performance for AtSpeed-S, AtSpeed-R, and the "Average" line in Tables 2 and 4), which is consistent with the results in Table 2 of our manuscript. This also meets our expectations, since sampling-based verification ensures an approximately equivalent distribution between the SD output and the target LLM output under sampling-based generation. We calculate the average over all methods because we care about how relaxed sampling verification affects recommendation accuracy. In other words, baseline draft models are also expected to show a limited ranking performance drop even when they are less aligned with the target LLM and have a relatively low speedup (e.g., SFT in Table 2).

  • Compared to NDCG, Recall under relaxed sampling verification usually reaches comparable or even better values than the target LLM's. This is reasonable since this work aims to align the top-K sequence distribution between the draft model and the target LLM: we emphasize getting the top-K drafted sequences accepted at a higher rate (i.e., a high recall of top-K sequences), which does not explicitly require the draft model to distinguish the ranking among the top-K sequences (potentially leading to relatively limited NDCG). Nonetheless, pursuing non-trivial explicit probability ordering during alignment is worthwhile, and we leave it for future work.
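As background for the "approximately equivalent distribution" point above, the sketch below shows the standard single-token speculative sampling acceptance rule from the SD literature, which the paper's relaxed sampling verification generalizes to K sequences; it is a simplified reference, not the authors' exact procedure:

```python
import random

def speculative_accept(p, q, drafted):
    """Standard speculative sampling: accept the drafted token with
    probability min(1, p(x)/q(x)); on rejection, resample from the
    normalized residual max(0, p - q). The returned token is then
    distributed exactly according to the target distribution p."""
    if random.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    residual = {x: max(0.0, p[x] - q.get(x, 0.0)) for x in p}
    tokens = list(residual)
    return random.choices(tokens, weights=[residual[t] for t in tokens])[0]

p = {"a": 0.6, "b": 0.3, "c": 0.1}  # target distribution (toy numbers)
q = {"a": 0.3, "b": 0.6, "c": 0.1}  # draft distribution (toy numbers)
drafted = random.choices(list(q), weights=list(q.values()))[0]
print(speculative_accept(p, q, drafted))
```

This distribution-preserving property is what keeps the accuracy drop bounded even with an imperfect draft model, matching the limited drops observed in Tables 2 and 4.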

Comment

Thank you to the author(s) for taking the time to address my concerns. The additional results and discussion have made the paper more self-contained, and I will raise my score.

Comment

Dear Reviewer y4np,

Thank you for your positive feedback. We greatly appreciate the time and effort you dedicated to the review! Your valuable and insightful comments are really helpful in guiding us to improve our work.

Comment

We sincerely appreciate the thoughtful and constructive feedback provided by the three reviewers. We are delighted that all reviewers recognized our idea to be novel (Reviewers y4np, Up12), intuitive (Reviewer Up12), and mathematically sound (Reviewer WeyY). Furthermore, we are grateful for the reviewers' acknowledgment of the effectiveness of our experimental results and the reproducibility of our work (Reviewers y4np, WeyY, Up12).

We also greatly value the detailed feedback regarding potential weaknesses in our study, which provided an opportunity to further improve our work. The major concerns raised by the reviewers included:

  • Insufficient results on ranking performance (Reviewer y4np): In response, we supplemented the paper with comprehensive experimental results on recommendation performance across all methods. The results further validate the ranking capability of our proposed relaxed sampling verification strategy, as acknowledged by the reviewer.

  • Generalization ability across diverse datasets and comparison with more SD-based baselines (Reviewer WeyY): To address this concern, we conducted experiments on two additional datasets (MovieLens-1M and Goodreads), comparing our method against all baselines and an additional SD-based method, DARE. The consistently superior performance of our approach substantiates its strong generalization ability, addressing the reviewer's concerns regarding datasets and baselines.

  • Mathematical rigor in proofs and notations (Reviewer Up12): We provided detailed clarifications, including step-by-step derivations of the alignment objective for AtSpeed-S and the proof of the with-replacement sampling approximation. The reviewer carefully examined these derivations and the updated manuscript, ultimately finding the mathematics of our work reasonable and convincing.

Following the discussion period, we are pleased that all reviewers expressed satisfaction with our responses, which effectively addressed most of their concerns. Overall, we extend our heartfelt thanks to the reviewers for their hard work and active engagement during the discussion phase. Their valuable and constructive comments have been instrumental in refining and enhancing the quality of our paper.

AC Meta-Review

This paper studies the problem of speeding up LLM-based models for top-K recommendation, based on the framework of speculative decoding.

Strength:

  1. A potentially useful technique for speeding up LLM-based generative top-K recommendation, with some promising results.

Weakness:

  1. Presentation and experiments need improvement.
  2. Experimental results are somewhat limited to two Amazon datasets. (The authors did expand to MovieLens in the rebuttal.)

Additional Comments from Reviewer Discussion

All reviewers agreed that this is a good contribution to the conference. During the rebuttal period, concerns were addressed and the reviewers were satisfied with the responses.

Final Decision

Accept (Poster)