PaperHub
7.3
/10
Poster4 位审稿人
最低4最高5标准差0.5
4
5
4
5
3.8
置信度
创新性3.0
质量3.5
清晰度3.3
重要性3.0
TL;DR

Efficiently generate text avoiding constraint violations, with less distortion of LLM output distribution.

摘要

关键词
Constrained DecodingLarge Language Models

评审与讨论

审稿意见
4

The paper considers the problem of how to let LLMs efficiently and effectively generate outputs that satisfy certain hard constraints. The proposed algorithm is essentially built upon ASAp, a baseline mentioned in the paper. It accelerates ASAp with speculative decoding to re-utilize the tokens generated in previous iterations. The experiment results confirm it can reduce the number of model calls to generate a grammarly correct outputs.

优缺点分析

S1. Using speculative decoding to accelerate ASAp is an interesting and good application of speculative decoding.

W1. The paper is poorly written. I suggest authors to learn the writing of the Grammar-Aligned Decoding paper [1].

W2. Although I agree using speculative decoding to accelerate ASAp is an interesting and good combination in practice, the algorithm proposed in the paper is an incremental improvement over ASAp. I don't think this paper has enough original contribution and it is more suitable for a workshop paper.

W3. Based on the results in Table 1, I suspect the code implementation of AprAD might be buggy. Speculative decoding does not change the sampling distribution, so AprAD and ASAp should have same sampling distribution and thus same quality. But in Table 1, the KL divergence of AprAD is significantly worse than that of ASAp, which is weird. Can authors explain this?

W4. Instead of using Generation Ratio, I suggest authors directly compare the speed-up of AprAD w.r.t. ASAp to show the efficiency advantage.

W5. I don't think BigCodeBench is a proper benchmark for evaluating this kind of problem, because state-of-the-art LLMs can get high accuracy using autoregressive decoding. Perhaps authors could slightly modify this benchmark by adding some additional constraints (e.g., length of the response smaller than n tokens, do not use certain key words in the code) to further show the advantage of the proposed method.

[1] Park, Kanghee, et al. "Grammar-aligned decoding." Advances in Neural Information Processing Systems 37 (2024): 24547-24568.

问题

I believe AprAD and ASap have identical sampling distributions because speculative decoding does not change the sampling distribution. Is my claim correct? If so, I suggest authors adding another theorem in the paper to explicitly state this.

局限性

yes

最终评判理由

My concerns are resolved by the authors. So I decided to increase my score.

格式问题

NA

作者回复

Thank you for your review. We would like to respectfully correct a misunderstanding: while our method is inspired by speculative decoding, it is not ASAp plus speculative decoding---there is no small speculative model, and if no errors are encountered during generation, the prefix selection algorithm is never invoked.

The main insight behind AprAD is that instead of re-using the entire prefix (constrained decoding), or re-using none of the prefix (ASAp), it is possible to re-use part of the prefix in an error-free generation task. The speculative decoding algorithm has some properties (1) that make it particularly suited as a prefix selection algorithm between these two extremes, but other prefix selection algorithms could work in its place (see Section 6.2 and Appendix C).

If AprAD were just ASAp with speculative decoding, then we would agree with most of the stated weaknesses. Speculative decoding would not change the sampling distribution of ASAp, and we would have structured our evaluation environment so that only one experiment runs at a time per machine, to prioritize consistent wall-clock times over machine utilization. Speculative decoding usually increases the total amount of computation in terms of FLOPs, and derives a latency improvement by parallelizing the compute and reducing memory reads- ASAp with speculative decoding would have a higher generation ratio compared to the original version of ASAp.

In contrast, AprAD derives its performance improvement by reducing the total amount of computation relative to ASAp; the tradeoff is that our method introduces a distribution shift (2). This leads to a lower generation ratio, but higher KL-divergence, though not as much as constrained decoding.

If desired, traditional speculative decoding could be layered on top of the generation phase of AprAD, in between backtracking iterations, as an orthogonal improvement to improve its latency at the expense of additional computation.

We will update the wording to clarify the distinction between AprAD versus ASAp with speculative decoding.

(1) In particular, that it scales prefix selection likelihood by the amount of probability mass that is "lost" along a given path after incorporating an error. It also usually selects a prefix that ends in a high-entropy token.

(2) The distribution is skewed because a non-violating sequence may either be generated directly, or by re-sampling after a violating sequence occurs. In contrast, with speculative decoding, a sequence is never generated directly---the output is always resampled after any speculative sequence (even non-violating ones). See discussion with reviewer 5S8W for more analysis.

评论

Thank you for your response. Here is my understanding of the ASAp and AprAD algorithms, please correct me if there is any mistake.

Let's consider the problem statement in [1], which I assume is equivalent to this paper. Ideally, we want to generate the next token from the distribution stated in Eq (2) of the ASAp paper [1]. But it is intractable to compute c(w1:i)c(w_{1:i}). So ASAp uses tildec_S(w_1:i)\\tilde{c}\_S(w\_{1:i}) to approximate c(w1:i)c(w_{1:i}), which is defined in Eq (3) of [1]. tildec_S\\tilde{c}\_S is updated every time the sample is rejected, so there is a slight difference between the sampling distribution of tt-th sampling and t+1t+1-th sampling.

Let PtP_{t} denote the sampling distribution at tt-th sampling. The insight of AprAD is that PtP_{t} and Pt+1P_{t+1} are similar, so the sequence sampled at tt-th sample can serve as a good draft sequence for t+1t+1-th sampling. So AprAD applies the verification algorithm of speculative decoding to determine how many tokens in the draft sequence should be directly accepted. Then the AprAD will continue to generate future tokens based on the accepted tokens.

If my understanding is correct, then there shouldn't be any distribution shift between AprAD and ASAp. Let w1:kw_{1:k} be the tokens obtained via speculative decoding verification. Due to the unbiased nature of speculative decoding, w1:kw_{1:k} follow the target distribution P_t+1(w_1:k)P\_{t+1}(w^{'}\_{1:k}). After that, each token wk+iw_{k+i} is directly sampled from Pt+1(wk+iw1:k+i1)P_{t+1}(w_{k+i}|w_{1:k+i-1}), for i=1,,ni=1,\cdots,n. so the overall distribution of w1:nw_{1:n} at t+1t+1-th sampling should still follow Pt+1P_{t+1}.

So I don't think the distribution is skewed. Is there any mistake in my understanding?

[1] Park, Kanghee, et al. "Grammar-aligned decoding." Advances in Neural Information Processing Systems 37 (2024): 24547-24568.

评论

That was our first thought as well. We had originally implemented the testbench described in Section 5.1 to check our implementation of speculative decoding. We were surprised at first to find that while our implementation of speculative decoding in a traditional setting behaved as expected, there was some distribution skew when the sampling subroutine was used to resample subsequent distributions after encountering an error. It took us some time before we realized that this is due to the algorithm itself, rather than an implementation bug.

Your first two paragraphs (Let's consider the problem statement, Let PtP_t denote) are correct. The tricky part is this:

Let w1:kw_{1:k} be the tokens obtained via speculative decoding verification. Due to the unbiased nature of speculative decoding, w1:kw_{1:k} follows the target distribution Pt+1(w1:k)P_{t+1}(w'_{1:k}).

Assuming that w1:iw_{1:i} refers to the initial sample from PtP_t, and w1:kw_{1:k}' refers to the the backtracked and re-sampled sequence from Pt+1P_{t+1} after speculative decoding:

w1:kw_{1:k}' would follow the target distribution if SpecSample(w1:kw1:i,Pt,Pt+1w_{1:k}' | w_{1:i}, P_t, P_{t+1}) was always invoked for every possible generation in PtP_t. But SpecSample is only invoked in cases where w1:iw_{1:i} is an error, and even then, Pt+1P_{t+1} is different depending on the actual w1:iw_{1:i} that was sampled.

It is helpful to use a more concrete example- we'll use the running example from the paper where P(AA)=P(AB)=P(BA)=P(BB)=0.25P(\text{AA})=P(\text{AB})=P(\text{BA})=P(\text{BB})=0.25, and B=AA\mathcal{B}=\\{\text{AA}\\}, leading to PB(AB)=PB(BA)=PB(BB)=0.33P^{\mathcal{B}}(\text{AB}) = P^{\mathcal{B}}(\text{BA}) = P^{\mathcal{B}}(\text{BB}) = 0.33

With traditional speculative decoding using PP as the speculative model and PBP^\mathcal{B} as the main model, the process looks like this:

PP selects AA with probability 0.25. SpecSample keeps prefix A with probability 2/3, because PB(A)P(A)=0.330.5=23\frac{P^{\mathcal{B}}(\text{A})}{P(\text{A})} = \frac{0.33}{0.5} = \frac{2}{3}. This entire probability is given to AB. In the remaining 1/3 of cases, prefix B is chosen, and is split evenly between BA and BB.

In AprAD, if AA is selected first, the same exact analysis applies so far as with speculative decoding.

Total probability so far (both methods):

AB: 1/4 * 2/3 * 1 = 1/6     # P(AA) * SpecSample(A|AA, P, P^\mathcal{B}) * P^\mathcal{B}(AB|A)
BA: 1/4 * 1/3 * 1/2 = 1/24  # P(AA) * SpecSample(B|AA, P, P^\mathcal{B}) * P^\mathcal{B}(BA|B)
BB: 1/4 * 1/3 * 1/2 = 1/24  # Similar derivation for remainder of example
Total: 0.25

With speculative decoding, the SSM selects AB with probability 0.25. SpecSample still keeps prefix A with probability 2/3 and switches to prefix B in 1/3 of cases; same analysis as if the SSM selected AA.

Total probability so far (Speculative Decoding):

AB: 1/6 + (1/4 * 2/3) = 1/3  # Prob from when AA was selected by P + (Prob from when AB was selected)
BA: 1/24 + (1/4 * 1/3 * 1/2) = 1/12
BB: 1/24 + (1/4 * 1/3 * 1/2) = 1/12
Total: 0.5

In contrast, in the 1/4 of cases AB is selected in the first round of AprAD, AB is always kept without invoking SpecSample, because it is not a violation. AprAD doesn't "know" that AA is a violation, and it can't do any renormalization, because it hasn't even encountered AA in this case.

This is the cause of the shift: AprAD doesn't invoke SpecSample in cases where an error was never generated.

Total probability so far (AprAD):

AB: 1/6 + 1/4 = 5/12
BA: 1/24 
BB: 1/24
Total: 0.5

With speculative decoding, the SSM selects BA with probability 0.25. SpecSample is invoked, but because PB(B)>P(B)P^{\mathcal{B}}(\text{B}) > P(\text{B}), the prefix is always kept.

Total probability so far (Speculative Decoding):

AB: 1/3
BA: 1/12 + 1/4 = 1/3
BB: 1/12
Total: 0.75

With AprAD, when BA is selected, it is kept without ever invoking SpecSample because it isn't a violation. AprAD doesn't know about the violation in AA:

Total probability so far (AprAD):

AB: 5/12
BA: 1/24 + 1/4 = 7/24
BB: 1/24
Total: 0.75

The same thing happens for both methods in the 1/4 of cases where BB is initially selected:

Total probability so far (Speculative Decoding):

AB: 1/3
BA: 1/3
BB: 1/12 + 1/4 = 1/3
Total: 1

Total probability so far (AprAD):

AB: 5/12
BA: 7/24
BB: 1/24 + 1/4 = 7/24
Total: 1

If there were multiple possible errors, then when AprAD encounters one of those errors and backtracks, the target distribution is based only on the errors it has actually encountered so far, and not any errors that it has not seen yet.

评论

Thank you for your detailed response to resolve my questions. Now I understand why there is a distribution shift. My concerns are resolved and I will update my score.

审稿意见
5

This paper addressed the problem of violation-free generation by proposing to combine techniques from speculative decoding with ASAP which enables selective and more efficient backtracking upon hitting a violation during generation.

优缺点分析

Strength:

  • The subject of study -- violation-free generation is an important topic that has wide real world implications.
  • I really like the thorough background provided to spec dec and violation free generation; even though i am reasonable familiar with relevant literature, I feel like I'm learning something from reading these well-written background.
  • Very thorough, rigorous and reasonably comprehensive evaluation of the proposed technique.

Weakness:

Let me preface by saying that I find the paper well-written, technically interesting, and a valuable contribution in its current form. I'm respectfully adding the following comments only because my initial review—left without weaknesses—was flagged for being too short. To the authors: please feel entirely free to disregard the following.

  • The paper would be clearer if there can be more discussions about how to efficiently encode P^B in real world algorithm implementations (see question for more details).
  • For the same reason as above, it may be helpful to also report the runtime for each benchmark as an assessment of overhead associated with the algorithm implementation.
  • It would be a really interesting technical contribution if there can be analysis as to what is the expected or maximum probability amplification.

问题

how does one represent P^B in practice? for example, in addBadSample in Algorithm 3, how do u keep track of these modifications to language modeling probabilities in your implementation?

局限性

yes

最终评判理由

After reading author rebuttal as well as other reviewer's comments, I continue to believe that this work represents a valuable addition to the Neurips program so I maintain my recommendation of acceptance.

格式问题

n/a

作者回复

Thank you for your review.

Representing PBP^B in an implementation

The main structure we use to track token probabilities is a trie. However, note that there are times where the program must use both PBP^B and PBxP^{B \cup \\{x\\}}. To avoid the need to copy the trie, which can grow quite large, our program first collects the probabilities that will later be required from PBP^B during SpecSample (the token probabilities along the "path" to xx) before calling AddBadSample to transform the trie into PBxP^{B \cup \\{x\\}}.

Additional implementation details are located in Appendix F. We will release our code along with the camera-ready version.

Runtime reporting

Unfortunately, the wall-clock runtimes would not be meaningful as we maximized machine utilization by running several experiments in parallel, leading to varied times depending on exact resource contention at the time of a given experiment. As the runtime is dominated by the LLM forward pass, we used generation ratio as a proxy for the required amount of computation.

Analysis of probability amplification

The recursive/backtracking nature of AprAD meant that we did not have complete theoretical results available at submission time, and omitted a theoretical discussion of the method to focus on the empirical results. If the reviewers feel that this additional analysis would be helpful, we will add it to the Appendix, with a disclaimer:

Intuitively, during a traditional speculative decoding procedure with speculative model SS and main model PP, there are two ways that yy may be generated at the end of the process:

  1. yy is drawn directly from SS, and then it is kept during the speculative sampling process.
  2. Some sample xyx \neq y is drawn from SS, and then during the speculative sampling process, the it is "re-sampled" to yy.

In AprAD we use PP as the "speculative" model. PBP^\mathcal{B} would ideally be the "main" model, though we must approximate it with PxP^{\\{x\\}} after the first iteration.

If yBy \notin \mathcal{B} is initially drawn, AprAD always keeps it, in contrast to its possible discarding by the speculative sampling procedure. This increases the overall probability of generating yy, but only by as much as the probability of sampling yy in the first place (when speculative sampling would be likely to discard yy). The second way of generating yy does not increase in probability with AprAD. Therefore, the probability of generating yy is at most 2Px(y)2 P^{\\{x\\}}(y). As Px(y)PB(y)P^{\\{x\\}}(y) \leq P^{\mathcal{B}}(y) for yBy \notin \mathcal{B}, there is a maximum amplification of 2.

Note that this amplification is per-iteration, where an iteration is defined as encountering an error sequence, potentially backtracking, and resampling. In practice, the per-iteration amplification is likely much less, as several of the inequalities involved are very loose. However, there is a cumulative effect as more iterations are required in dense error sets.

In contrast, the probability amplification after encountering even a single error sequence in constrained decoding is unbounded, as the best non-error token may have an arbitrarily low probability.

A more in-depth explanation is as follows:

Let SRS(yx,S,P)SRS(y|x, S, P) (SpeculativeReSample) be the probability of eventually selecting sequence yy with speculative decoding, conditional on having drawn xx from speculative model SS. The original version of speculative decoding implicitly relies on the identity P(y)=xΣS(x)SRS(yx,S,P)P(y) = \sum_{x \in \Sigma^*} S(x) SRS(y|x, S, P).

If BBB \subseteq \mathcal{B}, then SRS(yx,S,PB)SRS(yx,S,PB)SRS(y|x, S, P^B) \leq SRS(y|x, S, P^\mathcal{B}) for yBy \notin \mathcal{B}. Intuitively, PBP^\mathcal{B} excludes more invalid sequences compared to PBP^B, so the probability mass of these sequences during the re-sampling process should be distributed among the non-error sequences, including yy.

For similar reasons, PB(x)PB(x)P^B(x) \leq P^\mathcal{B}(x) for BB,xBB \subseteq \mathcal{B}, x \notin \mathcal{B}, as excluding additional errors BB\mathcal{B} \setminus B and renormalizing the probability means that all other sequences become more likely.

The probability of sequence yBy \notin \mathcal{B} being generated after the first iteration of AprAD is equal to the probability that yy is generated directly, plus the probability that yy is re-sampled after some invalid sequence is generated:

AprAD(yP,B)=P(y)+xBP(x)SRS(yx,P,Px)AprAD(y|P, \mathcal{B}) = P(y) + \sum_{x \in \mathcal{B}} P(x) SRS(y|x, P, P^{\\{x\\}})

We apply both inequalities described above:

PB(y)+xBP(x)SRS(yx,P,PB)\leq P^\mathcal{B}(y) + \sum_{x \in \mathcal{B}} P(x) SRS(y|x, P, P^\mathcal{B})

As P(x)P(x) and SRS(yx,P,PB)SRS(y|x, P, P^\mathcal{B}) are always nonnegative, we can add additional elements to the sum---expanding it to include all sequences, rather than just error sequences---while maintaining the inequality:

PB(y)+xΣP(x)SRS(yx,P,PB)\leq P^\mathcal{B}(y) + \sum_{x \in \Sigma^*} P(x) SRS(y|x, P, P^\mathcal{B})

And finally use the speculative decoding identity:

2PB(y)\leq 2 P^\mathcal{B}(y)

评论

I have read the author's rebuttal and decide to maintain my recommendation for acceptance.

审稿意见
4

The paper introduces Approximately Aligned Decoding (AprAD), a method for sampling from an autoregressive LM sequences under some constraints. The algorithm leverages speculative decoding and some insights from Adaptive Sampling with Approximate Expected Futures (ASAp). Experiments are conducted that show their method requires fewer model evaluations (generation ratio).

优缺点分析

Strengths:

  • Tackles an important problem, is well motivated, starting with simple toy examples. I liked "Simulated Model with Known Ideal Distribution"
  • Experimental results show that the method requires few model evaluations than baselines. Multiple models -- Mistral7b, Starcoder2 -- used.
  • Baselines compared against are reasonable.

Weaknesses:

  • The error set assumption could be emphasized a bit more -- italicized, and put in the limitations of this work
  • Algorithm 1 is not really necessary -- just takes up space
  • nit: CFG (classifier-free guidance) is not defined on line 90. Same with HMM on 274.
  • Wall-clock times are not shown, but I understand this metric can be very implementation dependent.

问题

The algorithm is general purpose; are there more problem / errors sets you can experiment with?

局限性

yes

格式问题

none

作者回复

Thank you for your comments- we will incorporate this feedback into the final version.

The algorithm is general purpose; are there more problem / errors sets you can experiment with?

You are correct that the method is general purpose; however, we focused on only a few evaluation domains to allow for a more in-depth comparison between methods---BigCodeBench as a representative coding task that relies on multiple libraries, and Lipogram as a free-form generation task with a dense but easily understandable error set.

One potential interesting application of AprAD is to enforce syntactically correct tool use in an agentic system- we leave this for future work.

评论

Thanks for the rebuttal -- keeping my score as is.

审稿意见
5

Generating text under certain constraints with LLMs is still an open problem. Retraining LLMs to accommodate the constraints of every task is infeasible. Several decoding methods have been developed to generate text without needing to retrain the model. The paper argues that the existing methods often severely distort the original output distribution, or are inefficient as they need to do an unnecessary amount of re-sampling after rejections.

The paper proposes a novel Approximately Aligned Decoding (AprAD) which balances the distortion of the output distribution along with computational efficiency.

Through a variety of experiments, the authors demonstrate the effectiveness of their approach. AprAD achieves comparable performances to the methods that don’t distort the output distribution and are computationally inefficient.

优缺点分析

The paper is very well written. I enjoyed reading this paper. The authors have done a fantastic job in discussing relevant work at relevant places to thoroughly explain their motivations while also building on and explaining their contributions. The paper nicely explains how it takes inspirations from constrained generation, ASAp and Speculative decoding methods. Relevant information is provided in the appendix as well, without reducing the readability of the main paper.

While AprAD takes inspiration from constrained generation, ASAp and Speculative decoding methods, the presented approach is novel, interesting and very effective. On the constructed test bench for a simulated LM setup, AprAD is closer to the ideal distributions, compared to constrained decoding, and invokes LM fewer times than ASAp.

The paper also describes the limitations of their approach at various places.

There were few missing ablations that would have been interesting to see in the paper. For example, more detailed comparison with Fudge or Ctrl-G, on some of the real-world control generation tasks.

问题

Why were more detailed comparisons to Fudge or Ctrl-G not covered in the paper?

-> We observe that PˆB and PˆB∪{x} are almost always near-identical distributions, with PˆB∪{x} generally as a “more accurate” distribution because it incorporates an additional error sample

How does this comparison hold with different values of x? For example, when the model is certain about x vs when the model is not certain about it, or, when x is a function word vs when it is not a function word?

Minor:

Please bolden the best numbers in Table 1.

Please reports CIs for all the numbers in Table 3.

局限性

Yes

格式问题

NA

作者回复

Thank you for your review.

Why were more detailed comparisons to Fudge or Ctrl-G not covered in the paper?

Due to time constraints, we focused our main comparison on training-free methods that work with black-box constraints. Both FUDGE/Ctrl-G require an extra training step to learn a discriminator/distilled HMM. Additionally, Ctrl-G is limited in the types of constraints that are expressible as a DFA, and as such couldn't be used for the hallucination avoidance task (the allowed text depends on variable assignments, imports from the Python environment, etc).

While we could not run a full evaluation on these methods, the discussion in Section 6.1 was informed in part by preliminary experiments on posterior-estimation based methods that did not make it into the paper.

We observe that PˆB and PˆB∪{x} are almost always near-identical distributions, with PˆB∪{x} generally as a “more accurate” distribution because it incorporates an additional error sample

How does this comparison hold with different values of x? For example, when the model is certain about x vs when the model is not certain about it, or, when x is a function word vs when it is not a function word?

Generally, the more likely that xx was to be generated by PBP^B, the larger the change in the distribution between PBP^B and PBxP^{B\cup \\{x\\}}- if xx was already low probability in PBP^B, then adding it as a violating sample will not change the distribution very much. This is irrespective of if the last token of xx is a function word, content word, or if the domain isn't natural language at all.

More concretely, consider a lipogram generation excluding the letter "u"

Prefix: "Attention Is All"

Next token probabilities:

You - 0.5
Important - 0.4
Natural - 0.1

P^B∪{"Attention Is All Natural"} is going to be closer to P^B compared to P^B∪{"Attention Is All You"}, because the latter requires excluding a sequence of much higher probability. However, both are "more accurate" than PBP^B because they have both excluded a sequence that is incorrect according to the lipogram constraint.

Minor

Thank you, we will make these clarity changes.

评论

Dear Reviewer,

Your active participation in the review process is crucial. Please respond to the authors' response and acknowledge you have done so.

Thanks.

-AC

最终决定

The paper proposes approximately aligned decoding, a method that uses speculative decoding to accelerate adaptive sampling with approximate expected futures, in order to perform more efficient constrained decoding.

The reviewers believe the method to be novel, interesting, and effective, and they appreciate the thorough background section.

Main criticisms of the paper from reviewers are:

  • Lack of confidence intervals in Table 3
  • Evaluation is on a limited set of tasks, reducing the strength of the claims
  • Missing experiments comparing against other relevant methods (FUDGE/Ctrl-G)
  • Some concerns about writing quality.

While the review pointing out writing quality issues did not elaborate, some concerns I have are:

  • Very short introduction that isn't sufficiently standalone to be a useful read. it does not give any details on the 'existing methods' it mentions multiple times, or provide an intuitive explanation for why these methods cause such significant deviation from the original output distribution. Developing intuitions in the introduction would make for a stronger paper.

  • The Related Work section lists a lot of papers and topics but doesn't do a very good job explaining how they are actually relevant to the paper' proposed method. The current Related Work section is not adding much to the paper