PaperHub
Overall score: 5.5 / 10 (Rejected; 4 reviewers)
Individual ratings: 5, 6, 8, 3 (min 3, max 8, std 1.8)
Average confidence: 3.5
Soundness: 3.0 | Contribution: 2.3 | Presentation: 3.0
ICLR 2025

A Critical Look At Tokenwise Reward-Guided Text Generation

Submitted: 2024-09-27 | Updated: 2025-02-05
TL;DR

We analyse some of the pitfalls of contemporary reward guided text generation methods, and present a principled approach with strong performance on several language generation benchmarks.

Abstract

Keywords
LLM, RLHF, Alignment, Model Efficiency, Reward Models, Sampling

Reviews and Discussion

Official Review (Rating: 5)

The paper demonstrates the deficiency of using a bandit-setting RM for alignment via reward-guided sampling (ARGS), and proposes to train the RM on partial sentences from preference corpora for ARGS. Experimental results validate the effectiveness of their method.

Strengths

  1. The analysis of the deficiency of the original bandit RM in ARGS is enlightening.
  2. The performance improvement compared to the baselines is inspiring.

Weaknesses

  1. The assumption that "the partial sequence $y_{1:i}^w$ is also preferred to the partial sequence $y_{1:i}^l$" is quite strong, since in most cases like reasoning, undesirable parts of the response just occur in specific steps or tokens.

  2. Some experiments are missing; see Questions 4 and 5.

  3. The training cost for the RM would be multiplied by the number of tokens, which might be prohibitive given that responses are generally hundreds of tokens long.

Questions

  1. In Eq. 3 and the corresponding context, $\beta$ should be $1/\beta$. Otherwise, change $\beta$ to $1/\beta$ in Eq. 2.
  2. Better clarify what the subscripts $i$, $i-1$ of $\pi_{\text{RLHF}}$ mean in Theorem 3.
  3. Since the prefix lengths in the training objective are the same for the chosen and rejected responses as in Eq. 4, it does not seem possible to directly derive Lemma 2 with different prefix lengths ($i$ for $y_1$, $j$ for $y_2$). Can you provide a more detailed derivation of Lemma 2?
  4. How do you implement Best-of-N sampling? Do you use your reward model the same way as in the bandit setting? For a fair comparison, you could try using your RM to score partial sentences and determine the final score of a sentence by the minimum/sum/mean of these partial scores.
  5. Would PARGS perform stably under different $\beta, k$ in Algorithm 1? Would different $k, \beta$ affect generative diversity?
Comment

Thank you for your review and detailed comments. We are glad that you found our analysis enlightening and our experimental results inspiring.

Assumption too strong We acknowledge that for some problems this assumption would not work and we would need additional information, e.g. from human annotators or an LLM, to generate the partial-sequence preferences. We included an experiment in our original submission that demonstrates how PARGS improves baseline machine translation when we have a post-edit dataset which provides a token-level reward (Table 4). Therefore, for situations where the assumption is too strong, we can collect additional data (from humans or an LLM) and still apply our algorithm.

Lack of some experiments Based on your suggestion we have included an experiment looking at the performance of PARGS and diversity of the generated text when we vary $\beta$ and $k$. The results are in Appendix F (Tables 13 and 14). In summary, $\beta$ values between 1 and 2 perform well, $k = 10$ performs the best, and $k = 5$ performs better than $k = 15$. Diversity improves with larger $k$ as expected and does not vary much with $\beta$.
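To make the roles of $\beta$ and $k$ concrete, here is a minimal sketch of a single reward-guided decoding step as we understand Algorithm 1; `lm.next_token_logprobs` and `reward_model.score` are hypothetical placeholder APIs, not the paper's released code.

```python
def pargs_decode_step(prompt_ids, generated_ids, lm, reward_model, beta=1.5, k=10):
    """One reward-guided decoding step (sketch, not the official implementation).

    lm           -- hypothetical wrapper returning next-token log-probabilities
    reward_model -- hypothetical partial-sequence reward model r(x, y_{1:i})
    beta         -- weight of the partial-sequence reward in the combined score
    k            -- number of top LM candidates that get rescored
    """
    # 1. Take the top-k candidate next tokens under the base language model.
    logprobs = lm.next_token_logprobs(prompt_ids + generated_ids)
    topk = sorted(range(len(logprobs)), key=lambda t: logprobs[t], reverse=True)[:k]

    # 2. Combine the LM log-probability with the partial-sequence reward:
    #    log p_LM(token) + beta * r(prompt, prefix + token).
    def combined_score(tok):
        return logprobs[tok] + beta * reward_model.score(prompt_ids, generated_ids + [tok])

    # 3. Pick the candidate with the highest combined score (greedy variant).
    return max(topk, key=combined_score)
```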

Training costs In order to maintain a reasonable training cost, we sample a subset from the list of all possible partial sequences. The number of training samples is about 2x the original dataset size for summarization and 1.5x for HH-dialogue. We included the exact numbers in Table 6 of Appendix B. In summary, we keep the training cost comparable to that of the full reward model.
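For illustration, here is a minimal sketch (with illustrative names and sampling rates, not the exact procedure from Appendix B) of how full-sequence preference pairs could be turned into a subsampled set of partial-sequence pairs:

```python
import random

def build_partial_pairs(preference_data, pairs_per_example=2, seed=0):
    """Turn full-sequence preference pairs (x, y_w, y_l) into a subsampled set of
    partial-sequence pairs (x, y_w[:i], y_l[:i]), assuming the full-sequence
    preference carries over to shared-length prefixes (the paper's assumption)."""
    rng = random.Random(seed)
    partial_pairs = []
    for x, y_w, y_l in preference_data:          # token lists
        max_len = min(len(y_w), len(y_l))
        # Sample a few prefix lengths instead of using all of them,
        # keeping the training set roughly 1.5-2x the original size.
        for i in rng.sample(range(1, max_len + 1), k=min(pairs_per_example, max_len)):
            partial_pairs.append((x, y_w[:i], y_l[:i]))
        partial_pairs.append((x, y_w, y_l))      # keep the full pair as well
    return partial_pairs
```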

Eq. 2 and Eq. 3 We have made the change in Equation 2. Thanks!

Subscripts in Theorem 3 $\pi_{\text{RLHF},i}$ and $\pi_{\text{RLHF},i-1}$ are two distinct policies over prefix sequences of length $i$ and $i-1$ (as mentioned in line 275).

Lemma 2 We added a mathematical derivation for the proof of Lemma 2 in Appendix A.

Best-of-N sampling We first sample $N$ complete responses from the language model and then rank them using the full-sequence reward model. The sample with the highest reward is chosen.
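For clarity, a minimal sketch of this baseline (`lm.sample` and `reward_model.score` are hypothetical placeholder APIs):

```python
def best_of_n(prompt, lm, reward_model, n=8):
    """Best-of-N sampling (sketch): draw N complete responses from the LM and keep
    the one that the full-sequence reward model scores highest."""
    candidates = [lm.sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_model.score(prompt, y))
```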

Thank you for your suggestion! We compared Best-of-N using the terminal reward from the full-sequence reward model, the terminal reward from the partial-sequence reward model, and the mean of partial rewards over the sequence, on the TLDR dataset. The results are as follows:

Best-of-N strategy           | Average Reward Score
Terminal Reward - Full RM    | 2.2 ± 0.19
Terminal Reward - Partial RM | 2.02 ± 0.18
Mean of Partial Rewards      | -2.04 ± 0.39

This shows that using partial rewards for Best-of-N is not a good strategy. In contrast our work shows that using partial rewards to guide generation at inference time works well.

Comment

Thank you for your detailed response. After reading your feedback, my main concerns are still not well addressed. I list them below:

  • Too strong assumption

Although the authors mention that we can collect additional data from humans or an LLM and still apply the algorithm in situations where the assumption is too strong, there seems to be no universal solution for the collection strategy of additional data, at least none mentioned in the current version of the paper. Meanwhile, additional data collection would bring more computational cost, which further decreases the practical usefulness of PARGS.

  • Computational overheads

Based on your response, the experiments only utilize a subset of the list of all possible partial sequences. However, it seems that the authors did not mention this in the main paper. Meanwhile, the specific sampling process is also unknown. Does it generally outperform the baselines with random sampling? How does the performance change across different sampling sizes? Is there a good sampling strategy that can further assist PARGS? These are all significant points for the concern about computational overheads, but there is no demonstration of any of them yet.

Therefore, I currently would maintain my score.

Comment

Thank you for your quick reply and the follow-up questions.

Too Strong Assumption As a reminder, the no-free lunch theorem says that we can't learn without making some assumption in machine learning. In our paper, we make the assumption clear by explicitly stating that partial sequences are ranked the same as their full extension in the data. All other works also make some assumption, but it is not always clear and it is not always stated. For instance, ARGS assumes that the reward model will generalize from full sequences to partial sequences based on whatever is the choice of architecture of the reward model, which is unclear and arbitrary. It is easy to argue that our assumption is "too strong" because we made it explicit, while previous work may seem better simply because their assumptions are not always clear and may not be stated. Ultimately, assumptions that better reflect reality will lead to better generalization and better results. We evaluate PARGS on three diverse datasets, Reddit TLDR (summarization), Anthropic-HH (harmless and helpful text), and Ultra Feedback (instruction following, truthfulness, honesty, and helpfulness). Ultra Feedback in particular collects prompts from diverse sources such as UltraChat, ShareGPT, Evol-Instruct, TruthfulQA, FalseQA, and FLAN. On all these datasets without collecting any additional data, PARGS shows strong performance, beating RGTG methods and being comparable to offline RLHF such as DPO and PPO, which is very expensive.

In our previous response, we suggested that our technique could be extended by collecting preference annotations for partial sequences when possible since this would mitigate any issue with the assumption itself. Note that this is the case for any technique. Since all assumptions are imperfect in practice, additional data is needed to rectify those imperfections. In fact, this would be needed even more for the techniques that perform worse than PARGS. Naturally, this increases computational costs for all techniques, not just our technique.

Computational overheads

Thanks for pointing this out. We have added Section I to the appendix where we discuss this and also present an ablation on the effect of different subset sizes on PARGS performance. To summarize, we randomly sample a subset of all possible partial sequences. Therefore yes, we outperform baselines using random sampling. We observe from the ablation that 1.5 to 2x of the original dataset size is enough for good performance.

The use of other sampling strategies is an interesting suggestion. We may be able to incorporate active learning similar to [1] to improve our sampling strategy. But this would be an orthogonal direction to our current work.

We hope that your concerns are addressed and that you can revisit your score!

[1] Muldrew, William, et al. "Active Preference Learning for Large Language Models." ICML 2024

Comment

Thanks for your response. Your response has basically addressed my concern about computational overheads. I think the computational cost is bearable now according to section I.

However, I found a point which I previously missed. There are many published works on reward/discriminator-biased decoding. In addition to the comments of Reviewer K21J, I list some of them below:

FUDGE: Controlled Text Generation With Future Discriminators (https://aclanthology.org/2021.naacl-main.276.pdf)
Decoding-time Realignment of Language Models (https://arxiv.org/pdf/2402.02992)

I also saw the authors' comments to Reviewer K21J, which claim that PARGS uses the BT loss to train partial rewards. However, the authors do not seem to explain why the BT loss is better or necessary for modeling partial-sequence rewards compared to other objectives, especially considering that the theoretical assumption behind applying the BT objective to partial-sequence rewards does not seem to hold in practice, and training with the BT loss is not computationally efficient, as mentioned in my previous feedback.

Comment

We are glad that your concerns, so far, have been addressed! As for your additional questions:

Previous Works

FUDGE [1] uses an attribute classifier to condition the output of an LLM. But how can one use FUDGE to align with preference data? One approach is precisely what CD [2], a recent ICML paper, does: given a full-sequence reward model trained on a preference dataset, and a dataset produced by rollouts from the baseline LLM, they can train a FUDGE prefix scorer for alignment according to an RL objective they define. They call this method CD-FUDGE, and as we mentioned in our original submission (line 385 in the current revision), we compared against it in all our experiments and showed that PARGS performed better. Note that CD does not demonstrate a connection to RLHF.

DeRa [3] does not avoid expensive offline RLHF such as PPO. Instead, given an offline RLHF model and a baseline fine-tuned model, it adjusts the KL regularization strength during decoding/inference. Their claim is that they can avoid retraining the RLHF model if the KL regularization needs to be changed for a particular task. Our method, on the other hand, avoids offline RLHF altogether. DeRa therefore cannot be categorized as an RGTG method and solves a different problem.

Bradley Terry

As we mentioned to reviewer K21J, RLHF trained with Bradley-Terry (BT) is one of the most prominent approaches for LLM alignment in the literature. For a comprehensive review of the many reasons that justify the BT model, we recommend having a look at "The many routes to the ubiquitous Bradley-Terry model" by Hamilton, Tawn and Firth (2023) [4]. This paper shows how several axiomatic derivations, optimization objectives, discriminal processes and standard models all lead to the BT model. The BT model is a principled and natural model for ranked pairs of outcomes. Here, it does not matter whether this pair consists of full sequences, partial sequences, or other objects. In RLHF, the BT model can be trained to approximate the empirical distribution of the full sequences in preference data. Similarly, we train the BT model to approximate the empirical distribution of partial sequences (i.e., sequence prefixes). Using BT, PARGS can show a clear connection to RLHF. Secondly, our strong empirical results back our choice of training using BT. Thirdly, competing non-BT-based methods, such as CD-FUDGE, which require rollouts from the baseline LLM during training are computationally more intensive.
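For reference, a sketch of the two objectives in LaTeX, using our reading of the paper's notation (the prefix-length distribution and exact equation form are assumptions on our part):

```latex
% Standard Bradley-Terry reward-model loss on full sequences:
\mathcal{L}_{\mathrm{BT}}(\phi) =
  -\,\mathbb{E}_{(x,\,y^w,\,y^l)}
  \left[\log \sigma\!\left(r_\phi(x, y^w) - r_\phi(x, y^l)\right)\right]

% The same likelihood applied to shared-length prefixes (partial sequences),
% under the assumption that y^w_{1:i} is preferred to y^l_{1:i}:
\mathcal{L}_{\mathrm{partial}}(\phi) =
  -\,\mathbb{E}_{(x,\,y^w,\,y^l)}\,\mathbb{E}_{i}
  \left[\log \sigma\!\left(r_\phi(x, y^w_{1:i}) - r_\phi(x, y^l_{1:i})\right)\right]
```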

We hope that we addressed all the concerns that you had and that you can update your score accordingly. Thanks!

References

[1] Yang, Kevin, and Dan Klein. "FUDGE: Controlled Text Generation With Future Discriminators." NAACL-HLT 2021.

[2] Liu, Tianlin, et al. "Decoding-time Realignment of Language Models." ICML 2024

[3] Mudgal, Sidharth, et al. "Controlled Decoding from Language Models." ICML 2024.

[4] Hamilton, Ian, Nick Tawn, and David Firth. "The many routes to the ubiquitous Bradley-Terry model." arXiv 2023

Comment

Thank you for the response!

To summarize, I do think that this paper poses an interesting question and theoretically analyzes the weakness of previous works. However, the proposed approach seems too naive, simply transforming pairwise preferences from $(y_w, y_l)$ to $(y_w^{1:i}, y_l^{1:i})$ without any further adaptation, especially considering that it introduces additional computational costs and is based on an assumption that does not seem to hold in practice. I would suggest emphasizing the sampling strategy (or other methods to reduce this computational overhead) more in the main paper, as it addresses a key challenge of your proposed approach. Currently, however, this is only discussed in Section I. (Though the paper cannot be changed at this point, I suggest revising it in the final version if possible.)

Currently, I would like to increase the soundness score, since the computational cost seems bearable according to your response. For the overall score, I currently maintain it.

Comment

Thanks for your prompt response. We appreciate your continued engagement during the review process.

We are happy that we addressed your previous concerns and you have increased the score on the soundness of our approach. Please note that the ablation on the sampling size, performance and computational impact is in Appendix I because putting it in the main paper during the review process could have led to incorrect section and line references in our responses to the other reviewers. Rest assured that we still have space and will put it in the main paper. We thank you again for suggesting this experiment.

To reiterate, our approach is theoretically sound, empirically strong, and shows a clear connection to RLHF. Moreover, as the experiments showed and you agreed earlier, we are computationally reasonable as well. On this evidence our assumption is justified.

Official Review (Rating: 6)
  • The paper argues that existing inference-time reward-guided text generation (RGTG) has a flaw: it provides arbitrary rewards for partial sequences, as these reward models are trained on full sequences.
  • To address this issue, the authors propose training reward models (RMs) using partial sequences.
  • Experiments demonstrate that these RMs perform better in reward-guided text generation across four tasks: summarization, dialogue, fine-grained text generation, and machine translation.

Strengths

  • The paper proposes a straightforward method for training RMs on partial sequences.
  • The presentation is clear and well-structured, covering classic RLHF concepts, DPO, and decode-time RGTG.
  • The experiments evaluating performance and additional inference costs are thorough and balanced.

Weaknesses

  • Beyond theoretical analysis, the claim that full reward models produce arbitrary rewards would be stronger with empirical evidence, such as human experiments.
    • For instance, an ablation study demonstrating the sensitivity of full RMs to varying sequence lengths could strengthen this argument.
  • To validate that token-level RMs (using partial sequences) are more effective than full-sequence RMs, it would be beneficial to train PPO with each reward model and compare the results. One might expect token-level rewards to lead to faster convergence, ultimately resulting in superior performance.

Questions

  • Interestingly, the win rate of PARGS over PPO is high, even though the reward scores suggest otherwise.
  • For HH dialogue, why was PPO omitted?
  • Regarding GPT evaluation, what are the evaluation criteria? It would be helpful to include the exact prompts used for evaluation.
Comment

Thank you for your positive review and comments! We have updated the paper as you suggested, e.g. ablations and human evaluation. Details below.

Human evaluation We designed a human experiment to compare the full sequence reward model ranking and human ranking of a pair of TLDR summaries at different sequence lengths. We observe that the reward model has a high agreement for full sequences and an arbitrary agreement for partial sequences. This corroborates our theoretical result that a full sequence reward model can give arbitrary rewards to partial sequences. The details are in Table 12 in Appendix Section E (line 904).

Effectiveness of token-level RMs Across all our experiments we demonstrate that a token-level RM (PARGS) is better than sequence-level RM for RGTG.

Prior work [1] has demonstrated that dense rewards can lead to faster convergence and a better local optimum for PPO-based RLHF. However, it involves making a single call to the reward model upon termination of the sequence and using the attention weights to redistribute the rewards. The policy update is also at the sequence level. If we were to do this token-wise for our (or any conventional) reward model, we would need to call it token by token over the length of the sequence, leading to much longer training times. Additionally, we may need to update the policy several times over the course of a sequence.

High win rate, but not reward score Note that evaluation by a reward model and evaluation by GPT-4 are different metrics. The reward scores for PPO and PARGS are close for the summarization experiment (2.41 vs 2.36) and within the standard error. The win rate does not tell us the difference in scores between the winner and the loser. It is possible that the average score of PPO responses is higher even though the win rate is slightly lower.
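A toy illustration with made-up numbers (not from the paper) of how a method can win most head-to-head comparisons while having a lower average score:

```python
# Toy illustration: per-prompt reward scores for two methods on three prompts.
a = [2.0, 2.0, 5.0]   # e.g. method A
b = [2.1, 2.1, 1.0]   # e.g. method B

win_rate_b = sum(rb > ra for ra, rb in zip(a, b)) / len(a)   # 2/3, B wins most comparisons
mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)            # 3.0 vs ~1.73, A has the higher average
```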

PPO omitted for HH This is because updating the Llama-2-7b model with PPO required more compute than our infrastructure could handle. Instead, we compared PARGS with DPO. Note that we did not claim to beat RLHF in our work; instead, we provide an online alternative for alignment without the need to fine-tune the target LLM.

Exact prompts for evaluation We included the exact prompts in Appendix F in the original submission (Appendix H in the updated draft).

References

[1] Chan, Alex James, et al. "Dense Reward for Free in Reinforcement Learning from Human Feedback." ICML 2024

Comment

Thank you for your comments and the time you spent reviewing our work. As the rebuttal phase draws to a close, we welcome any further questions that you may have. If we have addressed your concerns we would be thankful if you can consider updating your score accordingly.

Comment

Thank you for your time and constructive suggestions. We hope all your concerns are addressed. Please review the updated draft which includes a human evaluation to evaluate the effectiveness of full-sequence reward models for RGTG (as per your suggestion). We hope that you can update your score.

Official Review (Rating: 8)

This paper studies RGTG, which aims to improve LLMs without expensive fine-tuning. The authors identify an issue: existing approaches use a full-sequence reward model to score partial sequences during decoding. To alleviate this issue, the paper proposes training a Bradley-Terry reward model on partial sequences.

Strengths

The theoretical analysis is great and the empirical evaluation is comprehensive. It is important to improve the text generation without expensive fine-tuning. The proposed method gave a practical approach with strong empirical experimental results.

Weaknesses

The experiments only focus on automated metrics and GPT-4 evaluation. The paper could benefit from some human evaluation, even on a small scale, which would provide additional validation of the claimed improvements.

Questions

  1. Could you include an ablation study to show the impact of the different components?

  2. It would be great if the paper could provide some examples of generated text compared to the baseline models. It would give readers a better sense of the improvements.

Comment

Thank you for your positive review!

Human evaluation We took your suggestion and designed a human evaluation to compare PARGS with ARGS, DPO and CD on the Ultra Feedback dataset. PARGS has a higher win rate compared to the other methods. Details are in Appendix E.

Ablation In our original submission we had an ablation on the impact of $\beta$ on the performance of PARGS. We have updated the section (Appendix F in the updated draft) with the impact of both $\beta$ and $k$ (Table 13), i.e. the number of candidates from the language model in top-$k$ generation. We observe that $\beta = 2$ is the best followed by $\beta = 1.5$, and $k = 10$ is the best followed by $k = 5$. We also looked at the effect of different $\beta$ and $k$ on diversity (Table 14). As expected, higher $k$ leads to more diversity, whereas $\beta$ values do not have a significant effect on generation diversity.

Examples of generated responses Thanks for the suggestion. We have added examples of generated responses on the Ultra Feedback and TLDR test set prompts in Appendix G.

Comment

Thank you for your positive review and suggestions. Please review the updated draft. Based on your suggestions, we included a human evaluation which demonstrates the superiority of PARGS over baselines, added an ablation on components of PARGS (β\beta and kk) and added examples of generated text for different models. We hope that we have further strengthened our work.

Official Review (Rating: 3)

This paper points out that current RGTG methods struggle because they rely on full-sequence reward models to score partial sequences. To address this, the authors augment training with partial sequences, assuming that the ranking does not change, to achieve good tokenwise guidance. The method outperforms previous RGTG techniques and bridges the quality gap to fine-tuning solutions such as the DPO and PPO baselines.

Strengths

  • This work gives a good recap of RLHF and DPO, so readers who are not familiar with the field can connect the dots.
  • This work shows that the simple approach of truncating responses and training on the partial rewards can improve inference-time generation quality (as judged by an LLM).

Weaknesses

I think training a reward model on partial inputs is not novel, as there are quite a few works that use partial observations to train reward models. For example, Learning to Rank Generation with Pairwise Partial Rewards (https://aclanthology.org/2023.emnlp-main.371.pdf), Teacher Forcing Recovers Reward Functions for Text Generation (https://openreview.net/pdf?id=1_gypPuWUC3), Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model (https://arxiv.org/pdf/2310.09520), etc.

In line 323, the authors claimed "These algorithms are different from our work as they do not align language models using reward models or preference datasets." I think this is not a valid claim. Many works such as PPLM and GeDi use a classifier (as a way to provide a "reward", either through the classifier's final logits or forward gradient computation), which is the same as this work, since the authors here also do not use the reward model to fine-tune the base LLM.

Questions

n/a

Comment

Thank you for your comments. We have updated the Related Work section accordingly and made the claim on Line 351 (previously Line 323) more precise. Below we address your concerns in detail. Please let us know if you have further comments and suggestions!

We acknowledge that there are works in the literature that solve some of the same sub-problems that we solve in this work. As you rightly point out, GeDi [1] and PPLM [2] do not modify the language model but instead use a class-conditional discriminator and an attribute classifier, respectively, to guide the language model at inference time. Moreover, the idea of using partial rewards or a dense reward function for reinforcement learning is not new. Hao et al. (2022) [3] and Lee et al. (2023) [4] are two works that do this when using reinforcement learning for language modeling. However, our underlying problem (and approach) is different from all of these works.

The problem under consideration in this paper is the task of aligning language models to datasets of human preferences. This task has become a key component of most modern LLMs. In the literature, RLHF via a Bradley-Terry reward model has emerged as the most prominent and principled approach to LLM alignment. Thus, we are interested in an efficient, inference-time algorithm that is equivalent to RLHF. When assessing related works that use partial rewards or guide the decoding process, an important question is whether they can be used for alignment with preference data and, more precisely, in RLHF. The answer is no, and therefore there is still a need for what we propose in this paper. More precisely, Hao et al. (2022) [3] assume that expert demonstrations are provided instead of preference data, and therefore their approach is not applicable to preference alignment. Lee et al. (2023) [4] use a prefix tree to estimate some partial rewards, but do not learn a reward model that could be used for alignment by RLHF. Deng and Raffel (2023) [5] consider preference alignment for a different model than the Bradley-Terry model.

Our paper contributes a technique to train a partial sequence reward model using the Bradley-Terry preference model. We also demonstrate that using a full-sequence reward model for this (e.g. as in [6]) can lead to arbitrary intermediate rewards. Moreover, we derive a clear connection to RLHF.

In the guided decoding literature the closest methods to our work, to the best of our knowledge, are RAD [5] and CD [7]. We have discussed and compared them in our paper. Even though they use partial rewards at inference time to align language models, they have a different training objective for the reward model and do not demonstrate any connection to RLHF.

We hope our contribution is clearer now and that you can revise your score.

References

[1] Krause, Ben, et al. "GeDi: Generative Discriminator Guided Sequence Generation." EMNLP Findings 2021.

[2] Dathathri, Sumanth, et al. "Plug and Play Language Models: A Simple Approach to Controlled Text Generation." ICLR 2020

[3] Hao, Yongchang, Yuxin Liu, and Lili Mou. "Teacher forcing recovers reward functions for text generation." NeurIPS 2022

[4] Lee, Youngwon, Jinu Lee, and Seung-won Hwang. "Learning to Rank Generation with Pairwise Partial Rewards." EMNLP Main 2023

[5] Deng, Haikang, and Colin Raffel. "Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model." EMNLP 2023

[6] Khanov, Maxim, Jirayu Burapacheep, and Yixuan Li. "ARGS: Alignment as Reward-Guided Search." ICLR 2024

[7] Mudgal, Sidharth, et al. "Controlled Decoding from Language Models." ICML 2024.

Comment

Thank you for your comments and the time you spent reviewing our work. As the rebuttal phase draws to a close, we welcome any other questions that you may have. If we have addressed your concerns we would be thankful if you can review your score accordingly.

Comment

Again, thank you for the time you spent reviewing our work. We hope that your concerns are addressed. Please review our response and the updated draft, and revise your score accordingly. We would be grateful.

Comment

Dear Reviewers,

The rebuttal discussion period is coming to a close and the paper currently has a mix of positive and negative reviewers. The authors have spent a lot of time responding to each concern -- can you take a look at the author responses and let them know any remaining concerns you have?

Best, AC

Comment

Summary

Our work analyzes the pitfalls of current reward-guided text generation (RGTG) and presents PARGS, an inference-time RGTG method which is empirically strong, computationally reasonable, and shows a clear connection to RLHF. Our method avoids updating the LLM with expensive training such as DPO and PPO. Below we discuss the salient points of the review process.

Reviewer K21J

  • The reviewer discussed some prior works in guided decoding and language modeling
  • We point out that, although these works solve some of the same sub-problems as our work, they do not apply to learning from preference data or RLHF.
  • The guided decoding methods that learn from preference data are already discussed in our paper and we compare against them.
  • Nonetheless, we updated the Related Works section to clarify this
  • We did not hear back from the reviewer and assume that their concerns are addressed.

Reviewer XMHs

  • The reviewer suggested two ablations which would show the superiority of partial sequence reward model over full-sequence rewards beyond RGTG.
  • We created a human evaluation for the first ablation (results in Appendix E Table 12).
  • For the second experiment we argued that it was computationally not feasible and pointed to supporting evidence in the literature.
  • We did not hear back from them.

Reviewer Rbb2

  • The reviewer believed that a human evaluation comparing PARGS to other baselines would further strengthen our paper. Moreover, some examples of generated responses would also be useful.
  • We added both of these in Appendices E and G respectively
  • We also added an ablation on the effect of changing different components of our method (Appendix F) as requested by the reviewer.
  • We did not hear back from them.

Reviewer KTXT

  • The reviewer requested an ablation on the impact of $\beta$ and $k$ on the performance and diversity of PARGS. We provided this in Appendix F.
  • The reviewer asked for implementing a baseline (Best-of-N) differently. We ran experiments with the new implementation, and demonstrated that the original way (standard) was better.
  • In the follow-up they asked for an ablation on different sampling sizes for PARGS. They were concerned about the computational costs.
  • We provided the ablation in Appendix I and got confirmation that their concerns were addressed.
  • Their only outstanding concern is that our approach is "a bit too naive", i.e. training a Bradley Terry reward model on partial sequences may not be practically viable. We contend that we have strong empirical results across the board, avoid the pitfalls of previous methods and can establish a clear connection to RLHF.
AC Meta-Review

Summary

The paper proposes a new method PARGS to address the limitations of reward-guided text generation (RGTG) by training a Bradley-Terry (BT) reward model on partial sequences instead of full sequences. This theoretically principled method provides a clear connection to Reinforcement Learning from Human Feedback (RLHF) while avoiding the computationally expensive fine-tuning of large language models (LLMs). Empirical evaluations demonstrate strong performance across several benchmarks, including summarization, dialogue, and fine-grained text generation tasks, with competitive results against RLHF methods like PPO and DPO.

Strengths

The paper provides a clear and well-structured theoretical analysis, highlighting the pitfalls of full-sequence reward models in RGTG. The authors' proposed BT-based partial sequence reward model is computationally reasonable and provides an elegant alternative to offline RLHF methods. Comprehensive experiments validate the method's effectiveness, showing strong empirical results across diverse datasets.

Weaknesses

Reviewers thought the paper's primary assumption, that partial sequences inherit the same preference as full sequences, may not hold in all cases, particularly in tasks like reasoning where undesirable responses often occur at specific tokens or steps. While computational overheads were addressed through sampling subsets of partial sequences, this process and its implications were not sufficiently emphasized in the main paper.

Conclusion

The paper presents PARGS as a computationally efficient, inference-time RGTG method, but reviewer concerns about its novelty, assumptions, and practicality remain. The authors strengthened the work during the rebuttal with added experiments, human evaluations, and further clarification. Future improvements should clarify the sampling strategies and discuss partial-sequence limitations and alternatives to the Bradley-Terry framework. Given the mixed reviewer feedback, I recommend further refinement and resubmission.

Additional Comments from Reviewer Discussion

Authors have made significant efforts to address reviewers' concerns and questions. Only reviewer kTXt deeply engaged in the discussion and maintained their original score.

Final Decision

Reject