PaperHub
Overall score: 6.4 / 10 · Decision: Rejected · 3 reviewers
Ratings: 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.3 · Novelty: 3.3 · Quality: 2.7 · Clarity: 2.7 · Significance: 2.7
NeurIPS 2025

GOOD: Decoding-Time Black-Box LLM Alignment

OpenReview | PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose a decoding-time alignment method that does not require access to model parameters or vocabulary, achieving performance comparable to fine-tuning-based alignment methods while offering faster speed than vanilla decoding.

Abstract

Keywords

Large language models, Alignment, Black-Box, Speculative Decoding

Reviews and Discussion

Review (Rating: 4)

The paper proposes GOOD (Guided Online Optimal Decoding), a novel method for aligning large language models (LLMs) at decoding time without accessing model parameters or vocabularies. GOOD utilizes a pair of guiding models to identify critical positions related to alignment and adjusts the model's output dynamically during the decoding phase. Experiments show that in weak-to-strong alignment, GOOD can achieve performance comparable to direct fine-tuning in terms of comprehensive capability and harmless generation, reaching relative scores of up to 102% and 99% without sacrificing decoding efficiency.

Strengths and Weaknesses

Strengths:

  1. The proposed GOOD is claimed as the first method to achieve black-box LLM alignment at decoding time. Distinct from existing tuning-free approaches, GOOD eliminates dependencies on pre-designed contexts and vocabulary constraints while achieving faster decoding than vanilla sampling, combining high flexibility with practical efficiency.

  2. Solves critical limitations of tuning-based alignment (resource intensity, parameter access) and tuning-free methods (vocabulary constraints, prompt dependency).

  3. Extensive experiments across 4 tasks (comprehensive capability, harmlessness, model enhancement, speed) using diverse models (Gemma, Llama, Qwen) and benchmarks (MT-Bench, HH, HumanEval).

Weaknesses:

  1. Alignment discrimination relies on heuristic functions (Max-Match/Top-P overlap, §3.2) without theoretical justification for sensitivity thresholds.

  2. Limited exploration of failure modes: no analysis of scenarios where guiding models provide misleading signals (e.g., stylistic mismatches harming task performance).

Questions

  1. The heuristic discrimination methods (Max-Match/Top-P overlap, §3.2) lack validation. Why are these sufficient? How do hyperparameters (e.g., threshold τ) affect performance? Provide an ablation study on discrimination functions (e.g., compare precision/recall of alignment position detection).

  2. Token-mapping mechanics (§3.3) are technically dense, and string-level token conversion may fail for vocabulary mismatches (e.g., subword splits causing semantic drift). A concrete example in the main text would aid accessibility.

Limitations

The authors have partially addressed limitations but neglected societal impacts. Here’s a constructive assessment and suggestions:

  1. Negative Societal Impacts. GOOD could circumvent safety guards of black-box models (e.g., realigning GPT-4 to generate harmful content).

Final Justification

The authors have provided corresponding responses in the rebuttal, which have addressed my main concerns. As a result, I am assigning a final score of 4.

Formatting Concerns

Section 3 (Method), especially 3.3 Guidance Transformation, could be explained with some concrete examples to aid understanding.

Author Response

Thanks for your valuable feedback and constructive suggestions. We have addressed your comments on the justification and robustness of our heuristic discrimination methods, the need for an analysis of failure modes, and the clarity of our token-mapping explanation. In response, we provide a detailed theoretical grounding for our approach, present a new ablation study on several discrimination functions, and will add both a qualitative analysis of failure cases and a concrete token-mapping example to the revised paper. Please refer to the specific responses below for a detailed discussion.


To Q1-1: Justification for Using a Simple Heuristic

We sincerely thank the reviewer for highlighting this critical point and appreciate the opportunity to further validate our choice of alignment discrimination. Our choice of a simple heuristic like Max-Match is theoretically grounded in the Superficial Alignment Hypothesis.

  • Superficial Alignment Hypothesis: This hypothesis posits that the core knowledge of an LLM is acquired during pre-training, and alignment primarily teaches the model a specific style or sub-distribution of responses, rather than altering its fundamental capabilities.

This theoretical foundation is supported by some empirical evidence.

  • For example, by analyzing the differences between a base model and its aligned version, URIAL found that knowledge-intensive content overwhelmingly occurs at unshifted positions, where the token chosen by the aligned model is also the most probable choice for the base model. Across thousands of examples, a vast majority of tokens (92.2%) were found to be either unshifted or only marginally shifted. This strongly suggests that the significant distribution shifts, the ones that truly constitute the "alignment," are not in the core knowledge but are concentrated in other parts of the response, such as linguistic style and formatting.
    • Their definition of a shifted position (where the aligned model's chosen token is ranked low by the base model) conceptually mirrors our own condition for triggering alignment in Max-Match (where the top-ranked tokens of the two models disagree).
  • We also observed a similar pattern (Appendix A): when comparing different sizes of aligned models within the same family (Gemma2), we found a very high overlap (over 70-80%) in their most frequently altered tokens.

These findings suggest that the most critical alignment decisions often happen at predictable, high-frequency "choice points" where the model selects a stylistic or safety-related token. A simple method like Max-Match, which checks for a disagreement in the single most probable token (argmax) between the unaligned model (A) and its aligned counterpart (A_it), is therefore highly effective at capturing these most significant and frequent alignment-related shifts.

It is effective because it directly targets the primary mechanism of superficial alignment without the need for more complex distributional comparisons, which might be diluted by the vast majority of tokens that remain unchanged.


To W1 & Q1-2: Threshold Tuning and Robustness

We appreciate the concern regarding hyperparameter sensitivity. In our experiments, we chose Max-Match as the default configuration precisely because it is hyperparameter-free. The rule is simple: if the most likely token predicted by model A differs from that of model A_{it}, the position is flagged for alignment. This eliminates the need for any manual threshold tuning, making the method robust and easy to deploy.
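
For concreteness, a minimal sketch of the Max-Match rule as stated above (illustrative naming, not our released implementation):

```python
import torch

def max_match_needs_alignment(logits_a: torch.Tensor, logits_a_it: torch.Tensor) -> bool:
    """Flag a decoding position for alignment when the unaligned guiding model A
    and its aligned counterpart A_it disagree on the single most probable token."""
    return int(logits_a.argmax(dim=-1)) != int(logits_a_it.argmax(dim=-1))
```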

To address the broader question of "How do hyperparameters affect performance?", we direct the reviewer to Figure 6 (previously Figure 4) in our paper, which can also effectively serve as a hyperparameter ablation study. In this analysis, we moved from the hyperparameter-free Max-Match (equivalent to a Top_p threshold of 0) to a Top_p-based overlap method and varied the Top_p_A threshold.

The results clearly demonstrate that the model's performance on MT-Bench remains remarkably stable across a range of thresholds, even as the "Guided Decoding Ratio" (the percentage of replaced tokens) decreases from 30% to 23%. This indicates that GOOD's effectiveness is not very sensitive to the exact quantity of guided tokens but rather hinges on the accuracy of identifying the correct positions for guidance. The stability shown in this experiment validates our choice of a simple, robust heuristic.
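
As a complement, a minimal sketch of the Top_p-based overlap variant under one plausible reading, namely that alignment is triggered when A_it's top token falls outside A's top-p nucleus (illustrative; the exact formulation in the paper may differ):

```python
import torch

def top_p_overlap_needs_alignment(logits_a: torch.Tensor,
                                   logits_a_it: torch.Tensor,
                                   top_p_a: float = 0.3) -> bool:
    """Trigger alignment when the aligned guiding model's top token is NOT inside
    the top-p nucleus of the unaligned guiding model A.  With top_p_a = 0 the
    nucleus collapses to A's argmax, recovering the Max-Match rule."""
    probs_a = torch.softmax(logits_a, dim=-1)
    sorted_probs, sorted_ids = probs_a.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    nucleus_size = int((cumulative < top_p_a).sum()) + 1  # smallest prefix exceeding top_p_a
    nucleus = set(sorted_ids[:nucleus_size].tolist())
    return int(logits_a_it.argmax(dim=-1)) not in nucleus
```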


To Q1-3: Ablation Study on Discrimination Functions

To provide a direct quantitative validation of our default Max-Match function, we analyzed its performance in detecting alignment-critical positions. For our ground truth, we decoded responses on the MT-Bench dataset using Qwen2.5-7b-it and identified every position where an alignment occurred by comparing its output to its base model, Qwen2.5-7b, via Max-Match. We then evaluated how well the alignment decisions made by our guiding pair (Qwen2.5-3b-it vs. Qwen2.5-3b) correlated with this ground truth. The performance of this predictive task was as follows:

  • Accuracy: 0.8169
  • Precision: 0.5122
  • Recall: 0.5789
  • F1 Score: 0.5435

These metrics show that our simple heuristic achieves a strong balance. It correctly identifies over half of the true alignment-critical positions (Recall: 57.9%) while ensuring that the positions it flags are indeed alignment-related more than half the time (Precision: 51.2%).
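
For reference, the comparison against the ground-truth flags can be scored with standard classification metrics; a minimal sketch (illustrative naming, using scikit-learn):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def score_discrimination(ground_truth_flags, guiding_pair_flags):
    """ground_truth_flags: per-position 0/1 flags from the target pair
    (Qwen2.5-7b vs. Qwen2.5-7b-it, via Max-Match); guiding_pair_flags: flags
    predicted by the guiding pair (Qwen2.5-3b vs. Qwen2.5-3b-it) at the same positions."""
    return {
        "accuracy": accuracy_score(ground_truth_flags, guiding_pair_flags),
        "precision": precision_score(ground_truth_flags, guiding_pair_flags),
        "recall": recall_score(ground_truth_flags, guiding_pair_flags),
        "f1": f1_score(ground_truth_flags, guiding_pair_flags),
    }
```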

Furthermore, to explore whether more sophisticated discrimination functions would yield significant gains, we conducted an additional ablation study comparing Max-Match to three advanced, logits-based methods. The results on AlpacaEval are summarized below:

| Discrimination Method                  | LC Win Rate (%) on AlpacaEval |
|----------------------------------------|-------------------------------|
| Max Match (Ours, hyperparameter-free)  | 29.55                         |
| KL Divergence                          | 27.41                         |
| Entropy Difference                     | 24.68                         |
| Cosine Similarity                      | 27.52                         |

(For these three new methods, we used Qwen2.5-7b as the guided model, and Qwen2.5-3b / Qwen2.5-3b-it as the guiding model pair. Since these methods require hyperparameter tuning, we first analyzed token-level statistics from MT-Bench with Qwen2.5-7b-it. Specifically, we collected logits at every decoding position, computed the above metrics, and labeled positions requiring alignment. We then searched optimal thresholds for each method by maximizing F1 score.)

Interestingly, while these advanced methods also perform well, our simple, hyperparameter-free Max-Match method remains highly competitive and even outperforms them in this configuration. This reinforces our finding that a straightforward heuristic is surprisingly effective for capturing superficial alignment shifts and offers an excellent trade-off between simplicity, robustness, and performance.


To W2: Analysis of Misleading Signals

That is a very insightful point, and we thank the reviewer for raising it. Understanding the failure modes, especially scenarios where the guiding models might provide misleading or harmful signals, is crucial for a complete picture of GOOD's robustness and limitations.

The reviewer is correct that our main paper did not explicitly analyze these failure cases. This is a valuable area for exploration. To address this, we will add a new section to the appendix dedicated to a qualitative analysis of failure modes.

In this new section, we will provide and discuss several concrete examples from our experiments, illustrating:

  1. Successful Guidance: Cases where the guiding signal correctly identifies an alignment-related issue (e.g., a missing safety disclaimer or a stylistic deviation) and successfully corrects the output of the base model.
  2. Misleading Guidance: Scenarios where the stylistic preferences of the guiding model do not align well with the requirements of a specific task, potentially leading to a suboptimal but not factually incorrect response.
  3. Potentially Harmful Guidance (Negative Cases): We will also present instances where the guidance might have been detrimental, for example, by incorrectly suppressing a correct fact or introducing an unhelpful conversational filler.

By presenting these case studies, we aim to provide a more nuanced understanding of when and why GOOD performs well, and what the potential pitfalls are when the guiding signal is imperfect. We believe this analysis will not only strengthen the paper but also provide valuable insights for future research on decoding-time alignment.

Thank you again for this excellent suggestion.


To Q2: Concrete Example of Token-mapping

We thank the reviewer for these comments on the clarity of our token-mapping mechanics. We agree that the process described in Section 3.3 is technically dense and that a concrete example would significantly improve its accessibility. We will incorporate this suggestion into our revision.
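
To give a flavor of the example we plan to add, here is a minimal sketch of the string-level idea: guidance is exchanged as text and re-tokenized by the guided model, so the two vocabularies never need to match. (The model names below are illustrative picks from this discussion, and the snippet is a simplified illustration of the idea rather than the exact mechanism of §3.3.)

```python
from transformers import AutoTokenizer

tok_guide = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tok_main = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

guidance = "Sure, here is a safer way to phrase that:"
guide_ids = tok_guide(guidance, add_special_tokens=False)["input_ids"]

# The guiding model's tokens are converted back to a plain string ...
guidance_text = tok_guide.decode(guide_ids)

# ... and the guided model re-tokenizes that string with its own vocabulary.
# Subword boundaries may differ, which is exactly why the conversion happens
# at the string level rather than the token-id level.
main_ids = tok_main(guidance_text, add_special_tokens=False)["input_ids"]
print(len(guide_ids), len(main_ids))  # typically different token counts
```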


Finally

We sincerely thank you for your thoughtful and constructive feedback on our work. If you believe our replies have resolved the issues raised in your review, we would greatly appreciate your reconsideration of the manuscript’s evaluation, for a potentially higher rating. If there are any remaining concerns or further suggestions, we would be more than happy to address them.

Comment

Thanks for the rebuttal. After reading the response, my concerns are mostly solved. I tend to keep my positive rating.

Review (Rating: 4)

This paper proposes Guided Online Optimal Decoding (GOOD), a training-free method for aligning large language models at decoding time. The key insight is that different aligned models exhibit similar patterns in identifying alignment. GOOD uses a pair of guiding models, an unaligned model and its aligned version, to identify critical positions during decoding where alignment is needed, and then replaces the guided model's predictions with those from the aligned guiding model. GOOD integrates with speculative decoding to achieve efficiency gains. Experiments on MT-Bench, the HH dataset, and HumanEval show performance comparable to direct fine-tuning.

Strengths and Weaknesses

Strengths:

  1. GOOD utilizes two guiding models to determine critical alignment positions, enhancing the guided model at decoding time with negligible cost compared to training the guided model (especially for large models, such as 70B models).
  2. Unlike existing methods that require access to parameters or logits of the guided model, GOOD operates purely at the string level, making it applicable to API-based, closed-source models (although it remains challenging to apply to actual commercial models).
  3. The clever integration with speculative decoding achieves a certain degree of speedup, enhancing the method's practical applicability.
  4. The experiments in the paper cover multiple scenarios (weak-to-strong alignment, harmless generation, and code enhancement) and are compared with relevant baselines (e.g., proxy-tuning). Many detailed explanations are also provided in the appendices.

Weaknesses:

  1. Relatively outdated models and baselines: The paper mainly conducts experiments on Qwen2, Llama2, and Gemma2. While these models achieved decent results, their performance still has gaps compared to newer models, such as Qwen2.5/3, Llama3, and Gemma3. Given the rapid iteration of large language model training techniques, I am uncertain whether GOOD can achieve similar results on these latest models. Additionally, the baselines used in the paper are proxy-tuning and GaC, lacking comparison with newer methods in the field. Similarly, the Related Work section lacks newer related work, making it difficult to assess GOOD's actual contribution.
  2. Overly simple discrimination methods: The discrimination methods for identifying critical positions (Max Match, Top-P/K Overlap) are too basic. More sophisticated alignment-critical position identification methods could significantly improve performance, such as at least basic discrimination methods like logits similarity.
  3. Instability of MT-Bench: Although the experiments cover multiple benchmarks, a crucial aspect relies on MT-Bench, which contains only 80 data points and exhibits low stability. I suggest using averages from multiple runs, but the paper lacks these details. I recommend adding at least one of AlpacaEval or Arena-Hard as a supplement in key sections, as these benchmarks contain several times more data than MT-Bench, while the usage cost is less than 2x that of MT-Bench (such as AlpacaEval).
  4. Deployment difficulties: The method requires running multiple models simultaneously and maintaining continuous communication between them, which may not be practical in many real-world scenarios, especially for commercial models, which almost completely limits the practical application of this method.
  5. Lack of theoretical analysis: The paper's method lacks basic theoretical analysis, such as explaining why alignment decisions can be transferred. Although GOOD's approach is relatively straightforward, adding theoretical analysis could significantly improve the paper's completeness.

Questions

  1. Are there more supplementary, stable benchmarks to address MT-Bench's limitations?
  2. Would you consider comparing more discrimination methods?
  3. On stronger models, is GOOD's performance still competitive?
  4. Even a very brief theoretical analysis could improve the paper's completeness.

I would be willing to raise my scores if the authors address the weaknesses mentioned.

Limitations

Yes.

Final Justification

The authors have provided additional experiments and analyses in the rebuttal, which have enhanced the credibility of the paper and effectively addressed my main concerns. As a result, I am assigning a final score of 4.

Formatting Concerns

N/A.

Author Response

Thanks for your valuable feedback and constructive suggestions. We have addressed your comments regarding the selection of models, benchmarks and baseline methods, the exploration of advanced discrimination methods, the practicality of deployment, and the need for theoretical grounding. To address these concerns, we have conducted extensive new experiments on AlpacaEval, compared GOOD against important baselines, and evaluated new logits-based discrimination methods. Please refer to the specific responses below for the new results and our reasoning.


To W1, W3, Q1, Q3: Models, Benchmark & Baseline Method Concern

We sincerely thank the reviewer for highlighting these critical points. In response, we provide new experiments to address the concerns about models, benchmarks, and baselines.

For Models & Benchmark Concern

We expanded our evaluations by:

  • Including a more recent and powerful model family (Qwen2.5) alongside part of our original model choices (Gemma2).
  • Conducting experiments on AlpacaEval 2.0, a more comprehensive and stable benchmark widely recognized in recent alignment literature.

The extended evaluations on AlpacaEval 2.0 are shown below:

| Guidance Setup             | GOOD-aligned Performance | Fine-tuned Target     |
|----------------------------|--------------------------|-----------------------|
| Gemma-2b-it → Gemma2-9b    | 44.82                    | Gemma2-9b-it: 51.90   |
| Qwen2.5-3b-it → Qwen2.5-7b | 29.55                    | Qwen2.5-7b-it: 30.99  |

These results confirm that our method can successfully transfer alignment even to state-of-the-art models (recovering up to 95.35% of the fine-tuned target) and generalize well beyond a single benchmark scenario.

For Baseline Method Concern

Regarding baseline methods, we agree with the reviewer’s observation and acknowledge the importance of comparisons against newer methods.

Actually, in Table 4 of the original manuscript, we already provided comparative results with recent methods (ARGS (Jan 2024), Transfer-Q (May 2024), CARDS (Jun 2024), and GenARM (Oct 2024)) evaluated on the HH dataset. Despite leveraging significantly smaller guiding models, GOOD achieves competitive alignment performance, outperforming several reward-based methods (ARGS, Transfer-Q, CARDS) and approaching the performance of GenARM.

Additionally, we have now added a direct comparison on AlpacaEval 2.0 (evaluated with the official default setup). In this new benchmark, we evaluate GOOD against a broad spectrum of recent and established alignment baselines, including PPO, DPO, BoN, Item-level RS, RAIN, and TreeBoN, to further substantiate its efficacy:

| Model    | Method                    | LC Win Rate (%) | Win Rate (%) |
|----------|---------------------------|-----------------|--------------|
| llama-7b | Vanilla LLM               | 0.770           | 0.352        |
|          | PPO (Jul 2017)            | 0.485           | 0.195        |
|          | DPO (May 2023)            | 0.396           | 0.159        |
|          | BoN (Feb 2023)            | 0.763           | 0.358        |
|          | Item-level RS (Nov 2022)  | 1.387           | 0.702        |
|          | ARGS (Jan 2024)           | 0.544           | 0.238        |
|          | RAIN (Sep 2023)           | 1.252           | 0.619        |
|          | TreeBoN (Oct 2024)        | 0.599           | 0.271        |
|          | CARDS (Jun 2024)          | 1.609           | 0.878        |
|          | GOOD                      | 1.680           | 1.503        |

(Note: For this comparison, GOOD used TinyLlama-1.1B-Chat and TinyLlama-1.1B as the guiding models, and Llama-7B was used as the base model for all methods. Transfer-Q's implementation is currently unavailable, and GenARM's repository has become inactive, with no direct comparative data on AlpacaEval provided in the original paper.)

The significant improvements demonstrated by GOOD relative to these advanced methods underscore its competitive advantage and practical applicability.


To W2, Q2: Improving and Comparing more Discrimination Methods

Thanks for this insightful suggestion. To address this, we conducted additional experiments exploring three advanced discrimination methods based on logits:

  • KL divergence between logits distributions.
  • Entropy difference between the logits distributions.
  • Cosine similarity between embedding vectors (obtained by multiplying logits with the embedding matrix).

In these experiments, we used Qwen2.5-3b and Qwen2.5-3b-it to guide Qwen2.5-7b. Since these methods require hyperparameter tuning, we first analyzed token-level statistics from MT-Bench with Qwen2.5-7b-it. Specifically, we collected logits at every decoding position, computed the above metrics, and labeled positions requiring alignment. We then searched optimal thresholds for each method by maximizing F1 score (a minimal sketch of these metrics follows the list below):

  • For KL divergence and entropy difference, alignment is triggered if the metric is above the threshold.
  • For cosine similarity, alignment is triggered if the metric is below the threshold.
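
For reference, a minimal sketch of how these three metrics and the threshold rule can be computed per decoding position (our illustration; the exact formulation used in the experiments may differ):

```python
import torch
import torch.nn.functional as F

def discrimination_score(logits_a, logits_a_it, method, embedding_matrix=None):
    """Per-position score for the three logits-based discrimination methods."""
    p = F.softmax(logits_a_it, dim=-1)  # aligned guiding model A_it
    q = F.softmax(logits_a, dim=-1)     # unaligned guiding model A
    if method == "kl":        # KL divergence between the two distributions
        return F.kl_div(q.log(), p, reduction="sum")
    if method == "entropy":   # absolute difference of the two entropies
        entropy = lambda d: -(d * d.clamp_min(1e-12).log()).sum()
        return (entropy(p) - entropy(q)).abs()
    if method == "cosine":    # cosine similarity of logits-weighted embeddings
        return F.cosine_similarity(logits_a @ embedding_matrix,
                                    logits_a_it @ embedding_matrix, dim=-1)
    raise ValueError(method)

def needs_alignment(score, method, threshold):
    # KL / entropy difference: align when the metric exceeds the threshold;
    # cosine similarity: align when the metric falls below it.
    return score > threshold if method in ("kl", "entropy") else score < threshold
```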

The final evaluation on AlpacaEval is summarized below:

| Discrimination Method    | LC Win Rate (%) |
|--------------------------|-----------------|
| Max Match (Ours)         | 29.55           |
| KL Divergence            | 27.41           |
| Entropy Difference       | 24.68           |
| Cosine Similarity        | 27.52           |
| Qwen2.5-7b (Base Model)  | 6.7             |

The results show that all three advanced methods perform well and significantly outperform the base model. It also confirms the strength and simplicity of our Max Match method, which works reliably without extra hyperparameters. At the same time, the findings suggest promising future directions, such as hybrid or learning-based discrimination methods, to further enhance alignment accuracy and stability.


To W4: Deployment Concern

We appreciate the reviewer for raising this important point about deployment practicality. While multi-model systems introduce complexity, we would like to offer a different perspective: we believe that GOOD is in line with a growing trend in LLM deployment practices, where multi-model designs are increasingly explored as a means to enhance—rather than compromise—efficiency, performance, and modularity.

Here are several lines of evidence from recent research and deployed systems that support the practicality of our approach:

  1. Multi-Model Systems for Inference Acceleration: For instance, SpecInfer [1], a system for accelerating LLM serving, utilizes multiple small draft models in parallel to generate a token tree, which is then verified by a single large target model. In this highly practical system, the multi-model approach is not a deployment burden, but the core mechanism for achieving significant serving acceleration.

  2. Prevalence and Conceptual Analogy to Mixture-of-Experts (MoE): This trend extends beyond academic research and into widely-used commercial models. The MoE architecture, exemplified by models like Mixtral 8x7B, is fundamentally a multi-model system at inference time, routing tokens through different expert sub-networks to combine their specialized knowledge.

    From this perspective, GOOD can be viewed as a dynamic MoE-like system for alignment, where the guiding model pair (A, A_it) serves as an alignment expert, and the discrimination function acts as a router.

  3. Ensembles and Hierarchies for High-Stakes Applications: In many real-world scenarios, especially high-stakes domains like medicine or finance, using a single model is often insufficient. The LLM‑Synergy framework [2] shows that combining a clinically fine-tuned LLM with a general-purpose one can significantly reduce hallucinations and improve factual accuracy in medical QA, underscoring the practicality and necessity of LLM ensembles in high-stakes domains.

Viewed in this context, GOOD may be seen not as an outlier, but as a novel example within the broader multi-model paradigm—one that aligns with trends in advanced LLM system design and deployment.

[1] SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

[2] Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study


To W5, Q4: Theoretical Analysis

Thank you for this suggestion. Our method is theoretically grounded in the Superficial Alignment Hypothesis. This hypothesis posits that alignment primarily teaches the model a specific style or sub-distribution of responses, rather than altering its core knowledge and fundamental capabilities.

This theoretical foundation is supported by some empirical evidence:

  • Evidence from Prior Work: Prior work (URIAL) shows that alignment primarily affects stylistic tokens, with over 92% of tokens remaining "unshifted" from the base model's top choice, which strongly suggests that the "alignment signal" is concentrated in the few positions where the top choice diverges.
  • Evidence from Our Observations: Our own analysis (Appendix A) reveals the implication of this for transferability. When comparing different sizes of aligned models within the same family, we found a high overlap (>70-80%) in their most frequently altered tokens, which implies a consistent, transferable alignment behavior.

These two points provide a direct theoretical justification for why GOOD's approach is effective:

  • The high overlap in alignment related tokens across different models suggests that the "rules" of alignment (e.g., which phrases to use for politeness, how to format code blocks) are not arbitrary or model-specific but follow predictable, transferable patterns.
  • An "alignment expert" derived from one model pair learns decision patterns that are largely applicable to another model from a similar pre-training distribution. This makes the alignment signal fundamentally transferable.

Finally

We sincerely thank you for your thoughtful and constructive feedback on our work. If you believe our replies have resolved the issues raised in your review, we would greatly appreciate your reconsideration of the manuscript’s evaluation, for a potentially higher rating. If there are any remaining concerns or further suggestions, we would be more than happy to address them.

Comment

Thank you for the additional experiments and analyses provided in the rebuttal. These have significantly enhanced the credibility of the paper and addressed my main concerns effectively. Based on this, I am raising my score from 3 to 4. I hope the authors will include these experiments in the final version of the paper to further improve its quality.

Comment

Thank you very much for your thoughtful feedback and for reconsidering our rebuttal! We sincerely appreciate your time and effort in evaluating our work. We are glad to hear that our responses have addressed your major concerns, and we truly value your positive recommendation. As you suggested, we will include these experiments in the final version of the paper to further improve its quality.

Review (Rating: 4)

This paper proposes a decoding-time alignment method that enables aligning pretrained language models without any fine-tuning. The core idea is to use two guiding models—one aligned and one unaligned—to identify token positions that require alignment by comparing their logits. The aligned model's generations are then selectively incorporated into the original model's output at those identified positions. Unlike existing tuning-free methods, this approach does not rely on pre-defined context formatting or access to model internals such as logits or parameters. The method can further be enhanced with speculative decoding to improve efficiency.

The main experiments are conducted on MT-Bench, where the method achieves performance comparable to fine-tuning baselines, including in safety evaluations using HH-RLHF. The authors also demonstrate the method’s effectiveness when aligning an already aligned model (e.g., in code generation) and when using guiding models from different model families.

Strengths and Weaknesses

Strength:

  • The proposed method is a flexible extension of existing discrimination-based decoding-time algorithms that eliminates the need for access to logits or vocabularies. It enables interaction between the guiding and guided models via strings instead of token-level manipulation.
  • The approach shows strong results across several settings.
  • The paper includes thoughtful analyses, such as varying guidance ratios, examining token-level changes, and mixing guiding models from different families.

Weakness:

  • The main experimental validation relies heavily on MT-Bench, raising concerns about the generalizability of the method to broader benchmarks such as MMLU, mathematical reasoning, or complex QA tasks.
  • The presentation of the method—both in text and figures—needs significant improvement. The description of the core technique is difficult to follow, and several figures lack clarity. For instance, the speculative decoding variant is illustrated via a figure in the main text without a thorough explanation of how it works (while it’s best to first illustrate the main original method) — see more comments below

Questions

  1. Table 1: I think the URIAL method needs special prompt design but no extra training.
  2. Line 78 & Sec. 4.4: the claimed speedup is mostly due to speculative decoding and not attributable to your proposed method. In fact, a fair comparison would be augmenting both regular decoding and your method with speculative decoding and comparing the runtimes.
  3. Figure 1 can be improved to be more self-explanatory. The purpose of a method figure is usually to help readers understand the method more easily, but here the figure is more confusing because its notation is not defined: what are n_matches_main and n_matches_align? I'd suggest the authors include a figure for their original GOOD as opposed to one with speculative decoding. I found Figure 8 and Appendix C more informative than this one.
  4. It would be nice to include the algorithm in the main text, especially since the section is not detailed. For example, it's not clear how the last step, "verification", is done. Is this a step used only with speculative decoding?
  5. Figure 2: A bit more detail on this figure would be useful. Is this on MT-Bench? How are the scores computed, and what is the metric?
  6. Figure 4: Again, this figure is hard to read and understand. The legend has more than 10 colors, and it's not clear what Top_p_ori means. The caption is uninformative, as is the text.
  7. Same with Figure 5. As a reader looking at this figure, what can I infer without checking the text?

Limitations

Yes, discussed

Final Justification

Based on my review and the rebuttal, I will maintain my original acceptance score of 4. This is a good paper and will likely benefit the broader community in aligning models in the black-box setting.

Formatting Concerns

n/a

Author Response

Thanks for your valuable feedback and constructive suggestions. We've addressed your comments on the generalizability of our evaluation, the clarity of the method's presentation, and the fairness of the decoding speed comparison. To address these concerns, we have conducted new experiments on AlpacaEval 2.0 and a direct runtime comparison against standard speculative decoding. We will also thoroughly revise the paper for clarity as suggested. Please refer to the specific responses below for detailed discussions and results.


To W1: Benchmark Concern

We thank the reviewer for the suggestion to broaden the scope of our evaluation, and we appreciate the opportunity to further validate the generalizability of our method GOOD.

Following this suggestion, we conducted new experiments that expand our evaluation in two key dimensions. First, we incorporated evaluations on AlpacaEval 2.0 (also recommended by reviewer fBoL), a widely used benchmark for alignment assessment in LLMs. AlpacaEval covers diverse open-ended and reasoning tasks and is commonly used in recent alignment literature [1]. Second, to demonstrate GOOD's effectiveness on current state-of-the-art LLMs, this new evaluation incorporates a more recent and powerful model family (Qwen2.5). In each experimental setup, we use a pair of guiding models to align the behavior of the pretrained guided model.

Our results show that GOOD also achieves competitive performance on AlpacaEval 2.0 (evaluated with gpt-4o-mini using the official setup):

| Guidance Setup             | GOOD-aligned Performance | Fine-tuned Target     |
|----------------------------|--------------------------|-----------------------|
| Gemma-2b-it → Gemma2-9b    | 44.82                    | Gemma2-9b-it: 51.90   |
| Qwen2.5-3b-it → Qwen2.5-7b | 29.55                    | Qwen2.5-7b-it: 30.99  |

We believe these results provide strong additional evidence for the robustness of GOOD. The competitive performance (recovering up to 95.35% of the fine-tuned target) on a benchmark with a distinct methodology demonstrates that our method's ability to transfer alignment is not specific to a single evaluation format.

We hope this new data effectively addresses the reviewer's comment. We thank the reviewer again for their insightful feedback.

[1] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators


To Q1: Table Presentation of URIAL Method

Thanks for your comment. In Table 1, the green color indicates "No" (an advantage), and the red color indicates "Yes" (a disadvantage), which may not have been sufficiently clear. We share the same understanding of URIAL as you described. We acknowledge that the current presentation in the table may cause confusion and will revise it to improve clarity in the next version.


To Q2: Compare Speed with Standard Speculative Decoding

We thank the reviewer for this insightful suggestion. The reviewer is correct that the speedup we presented is relative to vanilla decoding and that a direct comparison with a standard speculative decoding baseline is necessary for a complete and fair assessment.

Following the reviewer's suggestion, we have conducted this exact runtime comparison. The results are presented below:

| Model Family | Method                                                                    | Time per Token (s/token, lower is better) |
|--------------|---------------------------------------------------------------------------|-------------------------------------------|
| Gemma2       | Vanilla Decoding (Gemma2-27b-it)                                          | 0.234                                     |
|              | GOOD (Gemma2-2b-it → Gemma2-27b)                                          | 0.203                                     |
|              | Standard Speculative Decoding (Gemma2-2b-it → Gemma2-27b-it)              | 0.138                                     |
| Llama2       | Vanilla Decoding (Llama-2-70b-chat)                                       | 0.270                                     |
|              | GOOD (Llama-2-7b-chat → Llama-2-70b)                                      | 0.251                                     |
|              | Standard Speculative Decoding (Llama-2-7b-chat → Llama-2-70b-chat)        | 0.175                                     |
| Qwen2        | Vanilla Decoding (Qwen-2-72B-Instruct)                                    | 0.274                                     |
|              | GOOD (Qwen-2-7B-Instruct → Qwen-2-72B)                                    | 0.266                                     |
|              | Standard Speculative Decoding (Qwen-2-7B-Instruct → Qwen-2-72B-Instruct)  | 0.200                                     |

These results confirm the reviewer's intuition. While GOOD is faster than vanilla decoding (achieving up to a 13% speedup), it is slower than standard speculative decoding in the current configurations.

This is an expected result, as GOOD utilizes a pair of guiding models (A and A_it) to perform its unique alignment discrimination, whereas the standard speculative decoding baseline here uses only a single draft model. The overhead of running this second guiding model contributes to the performance difference.

However, we would like to argue that this is not a fundamental limitation but rather a matter of configuration and optimization. The core idea that using multiple smaller models can still yield significant speedups is supported by recent work. For instance, SpecInfer[1] successfully uses multiple small draft models and ultimately achieves a 1.5-2.8x speedup. This demonstrates that multi-model overhead can be effectively amortized in a well-designed system.

Analogously, we believe that with further optimization—such as carefully selecting the relative sizes of the guiding pair and the target model—it is plausible that GOOD's runtime could become comparable to that of standard speculative decoding.

[1] SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification


To W2, Q3, Q4, Q5, Q6, Q7

We appreciate the reviewer for their constructive feedback. These suggestions are very helpful for improving our paper’s clarity and quality, and we will incorporate these improvements in our revision. Here, we provide additional information to clarify the specific points raised.

To Q3: Explanation of Figure 1

Thanks for pointing this out. The notations relate to our two-step speculative process:

  • n_matches_align: This refers to the number of tokens accepted in the first guess-and-verify step (between the guiding model pair). It counts how many initial tokens from the aligned guiding model (A_{it}) are generated without requiring an alignment.
  • n_matches_main: This is the number of tokens accepted in the second guess-and-verify step. It counts how many tokens (which come from the first step) are accepted by the guided model (B) in standard speculative decoding.
  • In both cases, “match” means the decoding results are consistent between the compared models.

We have included a detailed explanation and illustration of this process in Appendix D and Figure 9. We also plan to release the codebase soon, which will further help readers understand the implementation details.

To Q4: Explanation of the Algorithm Description

The final "verification" step integrates both alignment and speculative decoding, taking the output from the first verification as its input. The logic is as follows:

  • If n_matches_main < n_matches_align, the guided model disagrees with the predicted sequence before an alignment related token is reached. We thus accept the n_matches_main matching tokens.
  • If n_matches_main ≥ n_matches_align, the guided model accepts the sequence up to the alignment point. We therefore accept n_matches_align tokens and then substitute the next one with the guidance from A_{it}.

A detailed illustration of this process is provided in Appendix D and Figure 9. The upcoming release of our codebase will also help clarify the implementation for interested readers.
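
For clarity, a minimal sketch of this acceptance rule (variable names follow the notation above; this is an illustration, not our released code):

```python
def final_verification(draft_tokens, n_matches_main, n_matches_align, aligned_token):
    """draft_tokens: tokens proposed in the first guess-and-verify step;
    n_matches_align: tokens accepted before an alignment-related position;
    n_matches_main: draft tokens the guided model B agrees with;
    aligned_token: the guidance token from A_it at the alignment position."""
    if n_matches_main < n_matches_align:
        # B diverges from the draft before the alignment point:
        # keep only the tokens B agrees with.
        return draft_tokens[:n_matches_main]
    # B accepts the draft up to the alignment point: keep those tokens
    # and substitute the next one with the guidance from A_it.
    return draft_tokens[:n_matches_align] + [aligned_token]
```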

To Q5: Explanation of Figure 2

This figure shows scores from the MT-Bench benchmark (as mentioned in the last paragraph of Section 4.3). The scores were computed using the official evaluation script of MT-Bench, which uses GPT-4 for automated assessment. We will ensure the figure caption is more self-contained in the revision.

To Q6: Explanation of Figure 4

We apologize for the unclear legend. Top_p_ori should have been written as Top_p_A. The purpose of this analysis was to show how tuning the alignment sensitivity (by adjusting Top_p_A) impacts the Guided Decoding Ratio (the proportion of tokens replaced) and the overall model performance on MT-Bench.

To Q7: Explanation of Figure 5

We acknowledge this comment and agree that Figure 5 should also be more self-explanatory. Both figures will be revised with more informative captions and clearer legends.


Finally

We sincerely thank you for your thoughtful and constructive feedback on our work. If you believe our replies have resolved the issues raised in your review, we would greatly appreciate your reconsideration of the manuscript’s evaluation, for a potentially higher rating. If there are any remaining concerns or further suggestions, we would be more than happy to address them.

Comment

Thank you for the detailed response and additional results, which addressed most of my concerns. I would strongly suggest that the authors include these in the future version, as they significantly enhance the clarity and quality of the work. I will maintain my original score of 4.

Final Decision

The paper introduces GOOD, a tuning-free method for aligning large language models during decoding, which operates at the string level without accessing the target model’s parameters or vocabulary. This makes GOOD particularly suitable for black-box LLM alignment. The approach uses two guiding models—one aligned and one unaligned—to compare logits and identify token positions in the target model that require alignment, selectively incorporating outputs from the aligned model at these positions. Experiments on MT-Bench demonstrate that GOOD achieves performance comparable to traditional fine-tuning baselines.

This submission is borderline and presents a challenging decision. While reviewers acknowledge the soundness of the method, they raise concerns about limited methodological novelty, practical relevance, insufficient comprehensiveness in experiments and analysis, and writing quality. Although the rebuttal provides additional experiments, analyses, and clarifications, these improvements—such as using the latest models and baselines, discussing discrimination methods based on logits, and addressing practical deployment concerns—should have been included in the original submission, as NeurIPS does not allow post-submission revisions. Addressing these issues would require a major revision.

Furthermore, the submitted version does not adequately address ethical concerns, including potential bias and misuse risks highlighted by the ethics reviewers, although the authors have indicated plans to add a section on this topic in their rebuttal.

No reviewer championed the paper during the discussion period. Given these considerations, I am currently inclined to recommend rejection, as the paper falls slightly below the standard expected for NeurIPS.