Approximately Aligned Decoding
Efficiently generate text avoiding constraint violations, with less distortion of LLM output distribution.
Abstract
Reviews and Discussion
The paper considers the problem of how to let LLMs efficiently and effectively generate outputs that satisfy certain hard constraints. The proposed algorithm is essentially built upon ASAp, a baseline mentioned in the paper. It accelerates ASAp with speculative decoding to re-utilize the tokens generated in previous iterations. The experimental results confirm it can reduce the number of model calls needed to generate grammatically correct outputs.
Strengths and Weaknesses
S1. Using speculative decoding to accelerate ASAp is an interesting and good application of speculative decoding.
W1. The paper is poorly written. I suggest the authors learn from the writing of the Grammar-Aligned Decoding paper [1].
W2. Although I agree using speculative decoding to accelerate ASAp is an interesting and good combination in practice, the algorithm proposed in the paper is an incremental improvement over ASAp. I don't think this paper has enough original contribution and it is more suitable for a workshop paper.
W3. Based on the results in Table 1, I suspect the code implementation of AprAD might be buggy. Speculative decoding does not change the sampling distribution, so AprAD and ASAp should have the same sampling distribution and thus the same quality. But in Table 1, the KL divergence of AprAD is significantly worse than that of ASAp, which is weird. Can the authors explain this?
W4. Instead of using Generation Ratio, I suggest the authors directly compare the speed-up of AprAD w.r.t. ASAp to show the efficiency advantage.
W5. I don't think BigCodeBench is a proper benchmark for evaluating this kind of problem, because state-of-the-art LLMs can achieve high accuracy on it using autoregressive decoding. Perhaps the authors could slightly modify this benchmark by adding some additional constraints (e.g., the length of the response must be smaller than n tokens, or certain keywords must not appear in the code) to further show the advantage of the proposed method.
[1] Park, Kanghee, et al. "Grammar-aligned decoding." Advances in Neural Information Processing Systems 37 (2024): 24547-24568.
Questions
I believe AprAD and ASAp have identical sampling distributions because speculative decoding does not change the sampling distribution. Is my claim correct? If so, I suggest the authors add another theorem in the paper to explicitly state this.
Limitations
yes
Final Justification
My concerns have been resolved by the authors, so I have decided to increase my score.
Formatting Issues
NA
Thank you for your review. We would like to respectfully correct a misunderstanding: while our method is inspired by speculative decoding, it is not ASAp plus speculative decoding---there is no small speculative model, and if no errors are encountered during generation, the prefix selection algorithm is never invoked.
The main insight behind AprAD is that instead of re-using the entire prefix (constrained decoding), or re-using none of the prefix (ASAp), it is possible to re-use part of the prefix in an error-free generation task. The speculative decoding algorithm has some properties (1) that make it particularly suited as a prefix selection algorithm between these two extremes, but other prefix selection algorithms could work in its place (see Section 6.2 and Appendix C).
If AprAD were just ASAp with speculative decoding, then we would agree with most of the stated weaknesses. Speculative decoding would not change the sampling distribution of ASAp, and we would have structured our evaluation environment so that only one experiment runs at a time per machine, to prioritize consistent wall-clock times over machine utilization. Speculative decoding usually increases the total amount of computation in terms of FLOPs, and derives a latency improvement by parallelizing the compute and reducing memory reads---ASAp with speculative decoding would have a higher generation ratio compared to the original version of ASAp.
In contrast, AprAD derives its performance improvement by reducing the total amount of computation relative to ASAp; the tradeoff is that our method introduces a distribution shift (2). This leads to a lower generation ratio, but higher KL-divergence, though not as much as constrained decoding.
If desired, traditional speculative decoding could be layered on top of the generation phase of AprAD, in between backtracking iterations, as an orthogonal improvement to improve its latency at the expense of additional computation.
We will update the wording to clarify the distinction between AprAD versus ASAp with speculative decoding.
(1) In particular, that it scales prefix selection likelihood by the amount of probability mass that is "lost" along a given path after incorporating an error. It also usually selects a prefix that ends in a high-entropy token.
(2) The distribution is skewed because a non-violating sequence may either be generated directly, or by re-sampling after a violating sequence occurs. In contrast, with speculative decoding, a sequence is never generated directly---the output is always resampled after any speculative sequence (even non-violating ones). See discussion with reviewer 5S8W for more analysis.
Thank you for your response. Here is my understanding of the ASAp and AprAD algorithms, please correct me if there is any mistake.
Let's consider the problem statement in [1], which I assume is equivalent to this paper's. Ideally, we want to generate the next token from the distribution stated in Eq (2) of the ASAp paper [1], call it $P^{\mathcal{B}}$. But it is intractable to compute $P^{\mathcal{B}}$. So ASAp uses an approximation $\hat{P}^{\mathcal{B}}$, defined in Eq (3) of [1], to approximate $P^{\mathcal{B}}$. $\hat{P}^{\mathcal{B}}$ is updated every time a sample is rejected, so there is a slight difference between the sampling distribution of the $i$-th sampling and the $(i+1)$-th sampling.
Let $\hat{P}_i$ denote the sampling distribution at the $i$-th sampling. The insight of AprAD is that $\hat{P}_i$ and $\hat{P}_{i+1}$ are similar, so the sequence sampled at the $i$-th sampling can serve as a good draft sequence for the $(i+1)$-th sampling. So AprAD applies the verification algorithm of speculative decoding to determine how many tokens in the draft sequence should be directly accepted. Then AprAD will continue to generate future tokens based on the accepted tokens.
If my understanding is correct, then there shouldn't be any distribution shift between AprAD and ASAp. Let $x_1, \dots, x_k$ be the tokens obtained via speculative decoding verification. Due to the unbiased nature of speculative decoding, $x_1, \dots, x_k$ follow the target distribution $\hat{P}_{i+1}$. After that, each token $x_j$ is directly sampled from $\hat{P}_{i+1}(\cdot \mid x_{<j})$ for $j > k$, so the overall distribution of the sequence at the $(i+1)$-th sampling should still follow $\hat{P}_{i+1}$.
So I don't think the distribution is skewed. Is there any mistake in my understanding?
[1] Park, Kanghee, et al. "Grammar-aligned decoding." Advances in Neural Information Processing Systems 37 (2024): 24547-24568.
That was our first thought as well. We had originally implemented the testbench described in Section 5.1 to check our implementation of speculative decoding. We were surprised at first to find that while our implementation of speculative decoding in a traditional setting behaved as expected, there was some distribution skew when the sampling subroutine was used to resample subsequent distributions after encountering an error. It took us some time before we realized that this is due to the algorithm itself, rather than an implementation bug.
Your first two paragraphs ("Let's consider the problem statement...", "Let $\hat{P}_i$ denote...") are correct. The tricky part is this:
Let $x_1, \dots, x_k$ be the tokens obtained via speculative decoding verification. Due to the unbiased nature of speculative decoding, $x_1, \dots, x_k$ follow the target distribution $\hat{P}_{i+1}$.
Assuming that $x$ refers to the initial sample from $\hat{P}_i$, and $x'$ refers to the backtracked and re-sampled sequence from $\hat{P}_{i+1}$ after speculative decoding:
$x'$ would follow the target distribution if SpecSample() was always invoked for every possible generation $x$ from $\hat{P}_i$. But SpecSample is only invoked in cases where $x$ is an error, and even then, the target $\hat{P}_{i+1}$ is different depending on the actual $x$ that was sampled.
It is helpful to use a more concrete example---we'll use the running example from the paper, where the vocabulary is $\{A, B\}$, all sequences have length two, $P$ is uniform ($P(AA) = P(AB) = P(BA) = P(BB) = 1/4$), and the error set is $\mathcal{B} = \{AA\}$, leading to $P^{\mathcal{B}}(AB) = P^{\mathcal{B}}(BA) = P^{\mathcal{B}}(BB) = 1/3$.
With traditional speculative decoding using $P$ as the speculative model and $P^{\mathcal{B}}$ as the main model, the process looks like this:
$P$ selects AA with probability 0.25. SpecSample keeps prefix A with probability 2/3, because $\min(1, P^{\mathcal{B}}(A)/P(A)) = \min(1, (1/3)/(1/2)) = 2/3$. This entire probability is given to AB, since $P^{\mathcal{B}}(AB \mid A) = 1$. In the remaining 1/3 of cases, prefix B is chosen, and the probability is split evenly between BA and BB.
In AprAD, if AA is selected first, exactly the same analysis as with speculative decoding applies so far.
Total probability so far (both methods):
AB: 1/4 * 2/3 * 1 = 1/6 # P(AA) * SpecSample(A|AA, P, P^\mathcal{B}) * P^\mathcal{B}(AB|A)
BA: 1/4 * 1/3 * 1/2 = 1/24 # P(AA) * SpecSample(B|AA, P, P^\mathcal{B}) * P^\mathcal{B}(BA|B)
BB: 1/4 * 1/3 * 1/2 = 1/24 # Similar derivation for remainder of example
Total: 0.25
With speculative decoding, the SSM selects AB with probability 0.25. SpecSample still keeps prefix A with probability 2/3 and switches to prefix B in 1/3 of cases; same analysis as if the SSM selected AA.
Total probability so far (Speculative Decoding):
AB: 1/6 + (1/4 * 2/3) = 1/3 # Prob from when AA was selected by P + (Prob from when AB was selected)
BA: 1/24 + (1/4 * 1/3 * 1/2) = 1/12
BB: 1/24 + (1/4 * 1/3 * 1/2) = 1/12
Total: 0.5
In contrast, in the 1/4 of cases where AB is selected in the first round of AprAD, AB is always kept without invoking SpecSample, because it is not a violation. AprAD doesn't "know" that AA is a violation, and it can't do any renormalization, because it hasn't even encountered AA in this case.
This is the cause of the shift: AprAD doesn't invoke SpecSample in cases where an error was never generated.
Total probability so far (AprAD):
AB: 1/6 + 1/4 = 5/12
BA: 1/24
BB: 1/24
Total: 0.5
With speculative decoding, the SSM selects BA with probability 0.25. SpecSample is invoked, but because $P^{\mathcal{B}}(B) \geq P(B)$ and $P^{\mathcal{B}}(A \mid B) = P(A \mid B)$, the prefix is always kept.
Total probability so far (Speculative Decoding):
AB: 1/3
BA: 1/12 + 1/4 = 1/3
BB: 1/12
Total: 0.75
With AprAD, when BA is selected, it is kept without ever invoking SpecSample because it isn't a violation. AprAD doesn't know about the violation in AA:
Total probability so far (AprAD):
AB: 5/12
BA: 1/24 + 1/4 = 7/24
BB: 1/24
Total: 0.75
The same thing happens for both methods in the 1/4 of cases where BB is initially selected:
Total probability so far (Speculative Decoding):
AB: 1/3
BA: 1/3
BB: 1/12 + 1/4 = 1/3
Total: 1
Total probability so far (AprAD):
AB: 5/12
BA: 7/24
BB: 1/24 + 1/4 = 7/24
Total: 1
If there were multiple possible errors, then when AprAD encounters one of those errors and backtracks, the target distribution is based only on the errors it has actually encountered so far, and not any errors that it has not seen yet.
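As a sanity check on the arithmetic above, the following short enumeration script (ours, not from the paper) reproduces both final distributions with exact fractions. It simplifies SpecSample slightly: after the first token is verified, the second token is drawn directly from P^B(· | prefix) rather than verified, which changes nothing about the final distributions.

```python
from fractions import Fraction as F

# Toy setup: uniform P over length-2 sequences from {A, B}; error set B = {AA}.
P = {"AA": F(1, 4), "AB": F(1, 4), "BA": F(1, 4), "BB": F(1, 4)}
BAD = {"AA"}
PB = {s: p / (1 - sum(P[b] for b in BAD)) for s, p in P.items() if s not in BAD}

def pref(dist, c):
    # Total probability mass of sequences starting with prefix c.
    return sum(v for s, v in dist.items() if s.startswith(c))

def spec_resample(draft):
    # Output distribution of SpecSample(draft) with speculative model P and
    # main model P^B: verify the first token (keep with prob min(1, P^B/P),
    # else resample it from the normalized residual), then draw the second
    # token directly from P^B(. | prefix).
    keep = min(F(1), pref(PB, draft[0]) / pref(P, draft[0]))
    out = {s: F(0) for s in PB}
    for s in out:  # first token kept
        if s.startswith(draft[0]):
            out[s] += keep * PB[s] / pref(PB, draft[0])
    resid = {c: max(F(0), pref(PB, c) - pref(P, c)) for c in "AB"}
    z = sum(resid.values())
    if z > 0:      # first token rejected and resampled
        for s in out:
            out[s] += (1 - keep) * (resid[s[0]] / z) * PB[s] / pref(PB, s[0])
    return out

spec = {s: F(0) for s in PB}
aprad = {s: F(0) for s in PB}
for draft, p in P.items():
    for s, q in spec_resample(draft).items():
        spec[s] += p * q          # speculative decoding always re-verifies
    if draft in BAD:              # AprAD backtracks only on violations...
        for s, q in spec_resample(draft).items():
            aprad[s] += p * q
    else:                         # ...and keeps non-violating drafts outright
        aprad[draft] += p

print({s: str(v) for s, v in spec.items()})   # {'AB': '1/3', 'BA': '1/3', 'BB': '1/3'}
print({s: str(v) for s, v in aprad.items()})  # {'AB': '5/12', 'BA': '7/24', 'BB': '7/24'}
```

Speculative decoding recovers P^B exactly, while AprAD's totals match the skewed 5/12, 7/24, 7/24 distribution above.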
Thank you for your detailed response to resolve my questions. Now I understand why there is a distribution shift. My concerns are resolved and I will update my score.
This paper addresses the problem of violation-free generation by proposing to combine techniques from speculative decoding with ASAp, which enables selective and more efficient backtracking upon hitting a violation during generation.
Strengths and Weaknesses
Strengths:
- The subject of study -- violation-free generation -- is an important topic that has wide real-world implications.
- I really like the thorough background provided on speculative decoding and violation-free generation; even though I am reasonably familiar with the relevant literature, I feel like I'm learning something from reading this well-written background.
- Very thorough, rigorous and reasonably comprehensive evaluation of the proposed technique.
Weaknesses:
Let me preface by saying that I find the paper well-written, technically interesting, and a valuable contribution in its current form. I'm respectfully adding the following comments only because my initial review—left without weaknesses—was flagged for being too short. To the authors: please feel entirely free to disregard the following.
- The paper would be clearer if there can be more discussions about how to efficiently encode P^B in real world algorithm implementations (see question for more details).
- For the same reason as above, it may be helpful to also report the runtime for each benchmark as an assessment of overhead associated with the algorithm implementation.
- It would be a really interesting technical contribution if there can be analysis as to what is the expected or maximum probability amplification.
Questions
How does one represent P^B in practice? For example, in addBadSample in Algorithm 3, how do you keep track of these modifications to language modeling probabilities in your implementation?
Limitations
yes
Final Justification
After reading the author rebuttal as well as the other reviewers' comments, I continue to believe that this work represents a valuable addition to the NeurIPS program, so I maintain my recommendation of acceptance.
Formatting Issues
n/a
Thank you for your review.
Representing $P^{\mathcal{B}}$ in an implementation
The main structure we use to track token probabilities is a trie. However, note that there are times where the program must use both $P^{\mathcal{B}}$ and $P^{\mathcal{B} \cup \{x\}}$. To avoid the need to copy the trie, which can grow quite large, our program first collects the probabilities that will later be required from $P^{\mathcal{B}}$ during SpecSample (the token probabilities along the "path" to $x$) before calling AddBadSample to transform the trie into $P^{\mathcal{B} \cup \{x\}}$.
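For intuition, here is a minimal sketch of one way such a trie could work (identifiers and details are ours and hypothetical, not the actual implementation): each node caches the raw next-token probabilities from one LM call, plus the conditional probability mass of known error sequences routed through each child.

```python
class TrieNode:
    def __init__(self, probs):
        self.probs = probs    # raw P(token | prefix), cached from one LM call
        self.bad_mass = {}    # token -> conditional mass of known errors below it
        self.children = {}    # token -> TrieNode

    def adjusted(self):
        # Next-token distribution under P^B: subtract bad mass and renormalize.
        adj = {t: p - self.bad_mass.get(t, 0.0) for t, p in self.probs.items()}
        z = sum(adj.values())
        return {t: p / z for t, p in adj.items()}

def add_bad_sample(root, tokens):
    # Transform the trie from P^B into P^(B ∪ {x}) for the error x = tokens,
    # by adding x's conditional probability to bad_mass along its path.
    node, path = root, []
    for t in tokens:
        path.append((node, t))
        node = node.children[t]
    mass = 1.0                # P(remaining suffix of x | prefix at this node)
    for node, t in reversed(path):
        mass *= node.probs[t]
        node.bad_mass[t] = node.bad_mass.get(t, 0.0) + mass
```

In this sketch, the probabilities SpecSample needs from P^B along the path to x would be read off via adjusted() and cached before add_bad_sample mutates the trie, matching the copy-avoidance described above.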
Additional implementation details are located in Appendix F. We will release our code along with the camera-ready version.
Runtime reporting
Unfortunately, the wall-clock runtimes would not be meaningful as we maximized machine utilization by running several experiments in parallel, leading to varied times depending on exact resource contention at the time of a given experiment. As the runtime is dominated by the LLM forward pass, we used generation ratio as a proxy for the required amount of computation.
Analysis of probability amplification
The recursive/backtracking nature of AprAD meant that we did not have complete theoretical results available at submission time, so we omitted a theoretical discussion of the method and focused on the empirical results. If the reviewers feel that this additional analysis would be helpful, we will add it to the Appendix, with a disclaimer:
Intuitively, during a traditional speculative decoding procedure with speculative model $q$ and main model $p$, there are two ways that a sequence $y$ may be generated at the end of the process:
- $y$ is drawn directly from $q$, and then it is kept during the speculative sampling process.
- Some other sample $x \neq y$ is drawn from $q$, and then during the speculative sampling process, it is "re-sampled" to $y$.
In AprAD we use $P$ as the "speculative" model. $P^{\mathcal{B}}$ would ideally be the "main" model, though we must approximate it with $P^{\{x\}}$ after the first iteration.
If a non-error sequence $y$ is initially drawn, AprAD always keeps it, in contrast to its possible discarding by the speculative sampling procedure. This increases the overall probability of generating $y$, but only by at most the probability of sampling $y$ in the first place (in the cases where speculative sampling would have discarded it). The second way of generating $y$ does not increase in probability with AprAD. Therefore, the probability of generating $y$ is at most $P(y) + P^{\mathcal{B}}(y)$. As $P^{\mathcal{B}}(y) \geq P(y)$ for non-error sequences, there is a maximum amplification of 2.
Note that this amplification is per-iteration, where an iteration is defined as encountering an error sequence, potentially backtracking, and resampling. In practice, the per-iteration amplification is likely much less, as several of the inequalities involved are very loose. However, there is a cumulative effect as more iterations are required in dense error sets.
In contrast, the probability amplification after encountering even a single error sequence in constrained decoding is unbounded, as the best non-error token may have an arbitrarily low probability.
A more in-depth explanation is as follows:
Let $\mathrm{SRS}_{P, P^{\mathcal{B}}}(y \mid x)$ (SpeculativeReSample) be the probability of eventually selecting sequence $y$ with speculative decoding, conditional on having drawn $x$ from speculative model $P$, with main model $P^{\mathcal{B}}$. The original version of speculative decoding implicitly relies on the identity $\sum_x P(x) \cdot \mathrm{SRS}_{P, P^{\mathcal{B}}}(y \mid x) = P^{\mathcal{B}}(y)$.
If $x \in \mathcal{B}$, then $\mathrm{SRS}_{P, P^{\mathcal{B}}}(y \mid x) \geq \mathrm{SRS}_{P, P^{\{x\}}}(y \mid x)$ for $y \notin \mathcal{B}$. Intuitively, $P^{\mathcal{B}}$ excludes more invalid sequences compared to $P^{\{x\}}$, so the probability mass of these sequences during the re-sampling process should be distributed among the non-error sequences, including $y$.
For similar reasons, $P^{\mathcal{B}}(y) \geq P(y)$ for $y \notin \mathcal{B}$, as excluding additional errors and renormalizing the probability means that all other sequences become more likely.
The probability of sequence $y \notin \mathcal{B}$ being generated after the first iteration of AprAD is equal to the probability that $y$ is generated directly, plus the probability that $y$ is re-sampled after some invalid sequence $x$ is generated:
$$P_{\mathrm{AprAD}}(y) = P(y) + \sum_{x \in \mathcal{B}} P(x) \cdot \mathrm{SRS}_{P, P^{\{x\}}}(y \mid x)$$
We apply both inequalities described above:
$$P_{\mathrm{AprAD}}(y) \leq P^{\mathcal{B}}(y) + \sum_{x \in \mathcal{B}} P(x) \cdot \mathrm{SRS}_{P, P^{\mathcal{B}}}(y \mid x)$$
As $P(x)$ and $\mathrm{SRS}$ are always nonnegative, we can add additional elements to the sum---expanding it to include all sequences, rather than just error sequences---while maintaining the inequality:
$$P_{\mathrm{AprAD}}(y) \leq P^{\mathcal{B}}(y) + \sum_{x} P(x) \cdot \mathrm{SRS}_{P, P^{\mathcal{B}}}(y \mid x)$$
And finally use the speculative decoding identity:
$$P_{\mathrm{AprAD}}(y) \leq P^{\mathcal{B}}(y) + P^{\mathcal{B}}(y) = 2 P^{\mathcal{B}}(y)$$
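As a concrete instantiation of the bound (our arithmetic, using the toy example from the other thread, where $P$ is uniform over $\{AA, AB, BA, BB\}$ and $\mathcal{B} = \{AA\}$):
$$P_{\mathrm{AprAD}}(AB) = \frac{5}{12} \leq P(AB) + P^{\mathcal{B}}(AB) = \frac{1}{4} + \frac{1}{3} = \frac{7}{12} \leq 2 P^{\mathcal{B}}(AB) = \frac{2}{3}$$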
I have read the author's rebuttal and decide to maintain my recommendation for acceptance.
The paper introduces Approximately Aligned Decoding (AprAD), a method for sampling sequences under constraints from an autoregressive LM. The algorithm leverages speculative decoding and some insights from Adaptive Sampling with Approximate Expected Futures (ASAp). Experiments are conducted that show their method requires fewer model evaluations (generation ratio).
Strengths and Weaknesses
Strengths:
- Tackles an important problem, is well motivated, starting with simple toy examples. I liked "Simulated Model with Known Ideal Distribution"
- Experimental results show that the method requires fewer model evaluations than baselines. Multiple models -- Mistral7b, Starcoder2 -- are used.
- Baselines compared against are reasonable.
Weaknesses:
- The error set assumption could be emphasized a bit more -- italicized, and put in the limitations of this work
- Algorithm 1 is not really necessary -- just takes up space
- nit: CFG (classifier-free guidance) is not defined on line 90. Same with HMM on 274.
- Wall-clock times are not shown, but I understand this metric can be very implementation dependent.
Questions
The algorithm is general purpose; are there more problem / errors sets you can experiment with?
Limitations
yes
Formatting Issues
none
Thank you for your comments---we will incorporate this feedback into the final version.
The algorithm is general purpose; are there more problem / errors sets you can experiment with?
You are correct that the method is general purpose; however, we focused on only a few evaluation domains to allow for a more in-depth comparison between methods---BigCodeBench as a representative coding task that relies on multiple libraries, and Lipogram as a free-form generation task with a dense but easily understandable error set.
One potentially interesting application of AprAD is to enforce syntactically correct tool use in an agentic system---we leave this for future work.
Thanks for the rebuttal -- keeping my score as is.
Generating text under certain constraints with LLMs is still an open problem. Retraining LLMs to accommodate the constraints of every task is infeasible. Several decoding methods have been developed to generate text without needing to retrain the model. The paper argues that the existing methods often severely distort the original output distribution, or are inefficient as they need to do an unnecessary amount of re-sampling after rejections.
The paper proposes Approximately Aligned Decoding (AprAD), a novel method that balances distortion of the output distribution against computational efficiency.
Through a variety of experiments, the authors demonstrate the effectiveness of their approach. AprAD achieves performance comparable to methods that don't distort the output distribution but are computationally inefficient.
Strengths and Weaknesses
The paper is very well written. I enjoyed reading this paper. The authors have done a fantastic job of discussing relevant work in relevant places to thoroughly explain their motivations while also building on and explaining their contributions. The paper nicely explains how it takes inspiration from constrained generation, ASAp, and speculative decoding methods. Relevant information is provided in the appendix as well, without reducing the readability of the main paper.
While AprAD takes inspiration from constrained generation, ASAp, and speculative decoding methods, the presented approach is novel, interesting, and very effective. On the constructed test bench for a simulated LM setup, AprAD is closer to the ideal distributions compared to constrained decoding, and invokes the LM fewer times than ASAp.
The paper also describes the limitations of their approach at various places.
There were a few missing ablations that would have been interesting to see in the paper. For example, a more detailed comparison with Fudge or Ctrl-G on some of the real-world controlled generation tasks.
Questions
Why were more detailed comparisons to Fudge or Ctrl-G not covered in the paper?
We observe that P^B and P^B∪{x} are almost always near-identical distributions, with P^B∪{x} generally a "more accurate" distribution because it incorporates an additional error sample
How does this comparison hold with different values of x? For example, when the model is certain about x vs when the model is not certain about it, or, when x is a function word vs when it is not a function word?
Minor:
Please bold the best numbers in Table 1.
Please report CIs for all the numbers in Table 3.
Limitations
Yes
Formatting Issues
NA
Thank you for your review.
Why were more detailed comparisons to Fudge or Ctrl-G not covered in the paper?
Due to time constraints, we focused our main comparison on training-free methods that work with black-box constraints. Both FUDGE and Ctrl-G require an extra training step to learn a discriminator/distilled HMM. Additionally, Ctrl-G is limited to the types of constraints that are expressible as a DFA, and as such couldn't be used for the hallucination avoidance task (the allowed text depends on variable assignments, imports from the Python environment, etc.).
While we could not run a full evaluation on these methods, the discussion in Section 6.1 was informed in part by preliminary experiments on posterior-estimation based methods that did not make it into the paper.
We observe that P^B and P^B∪{x} are almost always near-identical distributions, with P^B∪{x} generally a "more accurate" distribution because it incorporates an additional error sample
How does this comparison hold with different values of x? For example, when the model is certain about x vs when the model is not certain about it, or, when x is a function word vs when it is not a function word?
Generally, the more likely x was to be generated by P^B, the larger the change in the distribution between P^B and P^B∪{x}---if x was already low probability under P^B, then adding it as a violating sample will not change the distribution very much. This holds irrespective of whether the last token of x is a function word or a content word, or whether the domain is natural language at all.
More concretely, consider a lipogram generation task excluding the letter "u":
Prefix: "Attention Is All"
Next token probabilities:
You - 0.5
Important - 0.4
Natural - 0.1
P^B∪{"Attention Is All Natural"} is going to be closer to P^B compared to P^B∪{"Attention Is All You"}, because the latter requires excluding a sequence of much higher probability. However, both are "more accurate" than P^B because they have both excluded a sequence that is incorrect according to the lipogram constraint.
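To put rough numbers on this (a quick sketch of ours, not from the paper), one can renormalize after excluding each candidate and compare the total-variation distance to the original next-token distribution:

```python
# Next-token distribution after the prefix "Attention Is All".
probs = {"You": 0.5, "Important": 0.4, "Natural": 0.1}

def exclude(bad):
    # Renormalized distribution after removing one violating continuation.
    z = 1.0 - probs[bad]
    return {t: (0.0 if t == bad else p / z) for t, p in probs.items()}

for bad in ("Natural", "You"):
    q = exclude(bad)
    tv = 0.5 * sum(abs(q[t] - probs[t]) for t in probs)
    print(f"excluding {bad!r}: TV distance = {tv:.2f}")
# excluding 'Natural': TV distance = 0.10
# excluding 'You': TV distance = 0.50
```

Excluding the low-probability continuation barely moves the distribution, while excluding the high-probability one moves it five times as far.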
Minor
Thank you, we will make these clarity changes.
Dear Reviewer,
Your active participation in the review process is crucial. Please respond to the authors' response and acknowledge you have done so.
Thanks.
-AC
The paper proposes approximately aligned decoding, a method that uses speculative decoding to accelerate adaptive sampling with approximate expected futures, in order to perform more efficient constrained decoding.
The reviewers believe the method to be novel, interesting, and effective, and they appreciate the thorough background section.
Main criticisms of the paper from reviewers are:
- Lack of confidence intervals in Table 3
- Evaluation is on a limited set of tasks, reducing the strength of the claims
- Missing experiments comparing against other relevant methods (FUDGE/Ctrl-G)
- Some concerns about writing quality.
While the review pointing out writing quality issues did not elaborate, some concerns I have are:
- Very short introduction that isn't sufficiently standalone to be a useful read. It does not give any details on the "existing methods" it mentions multiple times, or provide an intuitive explanation for why these methods cause such significant deviation from the original output distribution. Developing intuitions in the introduction would make for a stronger paper.
- The Related Work section lists a lot of papers and topics but doesn't do a very good job explaining how they are actually relevant to the paper's proposed method. The current Related Work section is not adding much to the paper.