PaperHub
Overall score: 6.5/10 · Poster · 4 reviewers
Ratings: 7, 6, 6, 7 (min 6, max 7, std 0.5) · Average confidence: 3.0
COLM 2024

Don't throw away your value model! Generating more preferable text with Value-Guided Monte-Carlo Tree Search decoding

Submitted: 2024-03-13 · Updated: 2024-08-26
TL;DR

We adapt and apply AlphaGo-style MCTS decoding to post-RLHF language models, and achieve superior generation quality.

Keywords
text generation · decoding · search · reinforcement learning · PPO · Monte-Carlo tree search

Reviews and Discussion

Review 1 (Rating: 7)

This paper proposes using the value model to guide text generation from the PPO policy, specifically in the form of Monte-Carlo Tree Search (MCTS). The experimental results demonstrate the effectiveness of value-guided MCTS on controlled text generation tasks. The idea is simple, but it is a good contribution in reminding the community to make use of value models, which have gone unused in practice. The paper is clearly written and well-organized.

接收理由

  • Guiding with the value model is simple but has been overlooked in practice. It is a good contribution to bring this direction to attention.
  • The experiments on four text generation datasets and the well-designed ablation studies demonstrate the effectiveness of the proposed decoding method.
  • The paper is clearly written and well-organized.

拒绝理由

  • The advantage of MCTS itself is not sufficiently verified. The ablation study "Do we need MCTS?" partially shows the effectiveness of MCTS, but the baseline, which decodes based only on values, seems weak. Why not additionally use $p_\theta(x_t \mid w, x_{<t})$, as in existing controllable text decoding methods, e.g., FUDGE [1]? In addition, I would like to see value-model guiding combined with other search algorithms, such as beam search.
  • DPO, which is widely used these days, is not accompanied by a value model. This situation may reduce the utility of the proposed method.

[1] FUDGE: Controlled Text Generation With Future Discriminators. NAACL 2021.

Questions to the Authors

  • Why was no human evaluation conducted for the helpful and harmless chatbot task?
  • I understand that using only the value model is the benefit, but did you try MCTS based on the reward models used for the evaluation in each dataset? If a user has a clear idea of the reward he or she wants to maximize, it may be most effective to generate sentences using that reward model rather than a value model, as in existing methods.
  • Is there any way to apply the proposed decoding in DPO as well?
Author Response

Thanks for recognizing our contribution to the direction of guided decoding. Below is our response to your concerns:

Alternatives to MCTS: First, we would like to clarify that our "stepwise-value" baseline (used to compare with MCTS) uses not just values but a combination of policy and values. It first selects the top-k candidate tokens from the policy model's proposal, $p_\theta(x_t \mid w, x_{<t})$, and then decides the token based on the values of these k candidates. It would be interesting to explore applying other search algorithms (e.g., FUDGE, beam search) to PPO models, and we leave this as future research.
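
For concreteness, here is a minimal sketch of such a stepwise-value decoder (interfaces and names are hypothetical, not the authors' implementation; the policy is assumed to return HuggingFace-style logits and the value model a scalar estimate per sequence):

```python
import torch

def stepwise_value_decode(policy, value_model, input_ids, max_new_tokens=64, k=10):
    """Policy proposes top-k next tokens; the PPO value model picks one of them."""
    ids = input_ids                                           # shape [1, prompt_len]
    for _ in range(max_new_tokens):
        logits = policy(ids).logits[:, -1, :]                 # next-token logits from the PPO policy
        topk = torch.topk(logits, k, dim=-1).indices[0]       # k candidate tokens proposed by the policy
        # score each one-token continuation with the PPO value model
        scores = [float(value_model(torch.cat([ids, t.view(1, 1)], dim=-1))) for t in topk]
        best = topk[int(torch.tensor(scores).argmax())]       # commit to the highest-value candidate
        ids = torch.cat([ids, best.view(1, 1)], dim=-1)
    return ids
```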

MCTS cannot be applied to DPO models: Indeed, DPO does not produce the value model required by MCTS, so MCTS can only be applied out-of-the-box to actor-critic methods like PPO. It is possible to train a value model post hoc after the DPO model is trained, but that requires extra effort. Meanwhile, some recent work [1, 2] showed that online algorithms (like PPO) tend to produce better policy models than offline algorithms (like DPO). Building on such advocacy for PPO, our proposed method can further boost the generation quality of PPO models.

Using the reward model in MCTS: In our submission we reported this ablation of using the reward model instead of the value model for MCTS. The result is in Table 6 (under the name "PPO-MCTS[R]"), and we discussed it in the first paragraph of Sec 5.1. Using the reward model for MCTS resulted in lower goal satisfaction and lower fluency. We offered two potential explanations: the reward model is not trained to process partial sequences, and the reward model is not adapted to the PPO policy model.

Comment

I thank the authors for the response. The concerns are mostly addressed, so I raised the score.
Could you also answer the first question?

Comment

Thank you. Re your Q1: Thanks for the suggestion; we will add human evaluation for the helpful and harmless chatbot task in the next paper revision. We note that in our submission we reported human evaluation on two other tasks, sentiment steering and toxicity reduction.

Review 2 (Rating: 6)

This paper proposes a method to apply Monte-Carlo Tree Search (MCTS) decoding over PPO-trained policy and value models for controlled text generation tasks. The key idea is to use the value model produced by PPO training as the evaluation function in MCTS decoding. Empirical evaluations are conducted on four text generation tasks. The proposed method outperforms a baseline of direct decoding with the PPO policy model, as well as a best-of-n decoding baseline. Further ablation experiments are analyzed, including the effectiveness of the value model vs. the reward model, MCTS vs. decoding with stepwise value, MCTS vs. more PPO, and so on.

Reasons to Accept

  1. In general the idea of this paper is clear and straightforward. Empirical evaluations demonstrate the effectiveness of the proposed method over several different controlled text generation tasks.
  2. The proposed method is an inference-time algorithm, meaning no further SFT or RLHF is needed. It should be easy to apply to any LLM after RLHF.

Reasons to Reject

One concern I have is the efficiency of the proposed method. I like the discussion of the inference-time overhead in Sec. 5, but I do not think it is adequate on its own: I would love to see some empirical numbers, such as wall-clock time. This should be explained further.

Questions to the Authors

  1. I think some of the writing could be further polished for better fluency and consistency. For example, in the explanation of Eq 3 on page 5, $\bar{V}(s)$ does not appear in the equation.
  2. Table 1: any idea why PPO+best-of-n is worse than PPO (51.47 vs. 52.44) in % of desired sentiment for positive sentiment?
Author Response

Thanks for appreciating that our proposed method can be easily applied to post-RLHF LLMs. Below is our response to your concerns:

Discussion of efficiency: Thanks for the suggestion. We will further expand our discussion on the efficiency of PPO-MCTS in our next paper revision, and include empirical results for decoding latency.

Clarity and the issue with $\bar{V}(s)$: We will improve the clarity in our next paper revision. $\bar{V}(s)$ appears indirectly in Eq 3, as part of the computation of the $Q(s, a)$ term defined in Eq 4. We will defer the mention of $\bar{V}(s)$ to reduce confusion around Eq 3.
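
For reference, one standard AlphaGo-style way to write these two equations (illustrative only; the paper's exact definitions and notation may differ) is

$$a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\text{puct}}\; p_\theta(a \mid s)\, \frac{\sqrt{N(s)}}{1 + N(s,a)} \,\right] \quad \text{(cf. Eq 3)}$$

$$Q(s,a) \approx r(s,a) + \bar{V}(s'), \qquad \bar{V}(s') = \frac{1}{N(s')} \sum_{\text{backups through } s'} V \quad \text{(cf. Eq 4)}$$

so $\bar{V}$ enters the selection rule only through $Q$, as stated above.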

PPO + best-of-n being worse than PPO: We were also quite surprised by this result. However, from Tables 1 and 2 we see that PPO + best-of-n is consistently comparable with PPO. We think this implies that vanilla sampling is unable to produce outputs with a diverse range of rewards, and that guided decoding is a better solution.
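
For context, a minimal best-of-n sketch (interfaces hypothetical): it only reranks complete samples with a reward model after generation, so if vanilla sampling yields outputs with similar rewards, reranking has little to choose from, consistent with the explanation above.

```python
def best_of_n(policy, reward_model, prompt, n=16):
    """Sample n complete continuations from the PPO policy and keep the highest-reward one."""
    candidates = [policy.sample(prompt) for _ in range(n)]         # n full generations, no step-level guidance
    rewards = [reward_model.score(prompt, c) for c in candidates]  # one scalar reward per candidate
    return candidates[max(range(n), key=lambda i: rewards[i])]     # rerank only after generation
```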

Review 3 (Rating: 6)

Quality:

  1. The paper presents a novel method (PPO-MCTS) for controlled text generation by combining Monte-Carlo Tree Search (MCTS) with Proximal Policy Optimization (PPO) models.
  2. The method is theoretically well-motivated and empirically evaluated on multiple tasks, showing significant improvements over baselines.
  3. The experiments are comprehensive, and the results are analyzed in-depth with relevant ablation studies.

Clarity:

  1. The paper is well-written and structured, making it easy to follow the proposed method and the experiments.
  2. The background and notations are clearly explained, aiding in understanding the technical details.
  3. The figures and tables effectively illustrate the key ideas and results.

Originality:

  1. The idea of using the value model from PPO for guided decoding with MCTS is novel and has not been explored before.
  2. The paper introduces a critical modification to the MCTS algorithm (initializing the Q of children actions from the V of their parent node) to improve exploration; see the sketch after this list.
  3. The work contributes to the under-explored area of utilizing value models for inference-time improvements.
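
A minimal sketch of the Q-initialization mentioned in point 2 above (node structure and names are hypothetical): on expansion, every child edge starts with Q equal to the parent's value estimate rather than zero, so unvisited actions are not unduly penalized during selection.

```python
class Edge:
    def __init__(self, prior, parent_value):
        self.prior = prior           # p_theta(a | s) from the PPO policy
        self.visit_count = 0
        self.q_value = parent_value  # initialize Q(s, a) from V(s) of the parent node, not from 0

def expand(parent_value, policy_priors):
    """policy_priors: dict mapping candidate token -> policy probability p_theta(a | s)."""
    return {token: Edge(prior, parent_value) for token, prior in policy_priors.items()}
```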

Significance:

  1. The proposed method achieves substantial improvements in generating preferable text across multiple tasks, including sentiment steering, toxicity reduction, knowledge introspection, and creating helpful and harmless chatbots.
  2. The results demonstrate the promise of search algorithms like MCTS even on top of aligned language models from PPO.
  3. The work highlights the valuable potential of value models, which are often discarded in practice, and recommends saving them for enhanced inference.

Reasons to Accept

  1. The paper proposes a novel and theoretically well-motivated method (PPO-MCTS) that combines MCTS with PPO models for controlled text generation, which is a significant contribution to the field of language modeling and controlled generation.
  2. The extensive experiments and analysis on multiple tasks provide strong empirical evidence for the effectiveness of the proposed method, demonstrating its practical value and potential for real-world applications.
  3. The paper explores an under-explored area of utilizing value models for inference-time improvements, unlocking the valuable potential of these models, which are often discarded in practice.
  4. The work is well-written, clear, and easy to follow, making it accessible to a broad audience in the computational linguistics and natural language processing communities.
  5. The proposed method addresses an important problem in controlled text generation, which is a critical area of research with potential impacts on various applications, such as dialogue systems, content generation, and language model alignment.

Reasons to Reject

  1. One potential concern is the computational overhead introduced by the PPO-MCTS method during inference time.
  2. While the authors provide some discussion and mitigation strategies, a more in-depth analysis and comparison of the computational costs and trade-offs with other methods could further strengthen the paper.
  3. The font of figures is too small.
Author Response

Thanks for your feedback! Below is our response to your concerns:

Computational overhead: In our submission we discussed the inference time overhead of PPO-MCTS in Sec 5.1. We see PPO-MCTS as a method to boost quality with more compute at inference time, and would like to explore techniques of reducing this overhead in the future.

Figure font too small: We will improve this as well as the overall presentation in the next paper revision.

Review 4 (Rating: 7)

The PPO objective used during RLHF training simultaneously learns two models: a policy model and a value model. Often, the learned value model is ignored and only the policy model is used during inference. The paper proposes a novel guided decoding algorithm called PPO-MCTS, which integrates the value model with the policy model at inference time using Monte-Carlo Tree Search (MCTS). The paper is very well written and nice to read. The proposed method is thoroughly evaluated on four different controlled generation benchmarks. The results demonstrate the advantages of PPO-MCTS compared to PPO-only approaches, using both automatic and human evals.

Reasons to Accept

The proposed approach is simple, yet very effective. The chosen text generation benchmarks are very relevant for the study and the results clearly show the advantages of PPO-MCTS compared to PPO-only approaches. The LLM community will find the paper very interesting.

The paper is well written. The authors have done a great job of clearly motivating the proposed approach.

Reasons to Reject

There are a couple of questions (see below) that were left unaddressed. It would be great to have the authors respond to them and also improve the paper accordingly.

  • I did not understand why the PPO-MCTS approach was not directly compared with prior guided decoding approaches using either hand-crafted rules or separate discriminative models as value functions. I feel the value functions specifically tailored for the generative LM (as used here) will work better than prior guided approaches, but it would be important to show this comparison.

  • It would be interesting to see the cost comparison between PPO and PPO-MCTS during inferences. Is it feasible to use PPO-MCTS when generating a long sequence?

  • Additionally, the paper would benefit from comparisons to DPO (https://arxiv.org/abs/2305.18290) or SLiC (https://arxiv.org/abs/2210.00045) trained models. But these comparisons might be okay to skip in this paper.

Questions to the Authors

Please see my three points/comments raised in 'Reasons to Reject'. It would be great to have you respond to them. I would be happy to reconsider my scores.

Author Response

Thanks for recognizing the importance of our studied problem and the effectiveness of our method. Below is our response to your concerns:

Comparison with using hand-crafted rules or separate discriminative models: In our submission we experimented with using the reward model to guide MCTS. (For the tasks we evaluate on, the reward model would be the separate discriminative model.) This method is reported as “PPO-MCTS[R]” in Table 6, and it resulted in lower goal satisfaction and lower fluency than our PPO-MCTS. We did not try hand-crafted rules because it is unclear how to design rule-based classifiers for these tasks.

Inference cost comparison: We mentioned the inference time of PPO-MCTS in the last paragraph of Sec 5.1. Naively, PPO-MCTS has a 2S-times overhead compared to standard decoding from PPO models (S is the number of simulations), and this overhead can be mitigated by various techniques such as caching and pruning. In some internal experiments, we have successfully generated sequences of 1k tokens with PPO-MCTS. We see PPO-MCTS as a method to boost quality with more compute at inference time.
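
As a rough back-of-the-envelope count (assuming S simulations per generated token, each requiring one policy forward pass and one value forward pass, and ignoring caching), decoding a sequence of T tokens gives

$$\frac{\text{forward passes (PPO-MCTS)}}{\text{forward passes (standard decoding)}} \approx \frac{2\,S\,T}{T} = 2S.$$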

Comparison with DPO and SLiC: We did not compare with these methods because they do not produce a value model, which is required to apply MCTS. In addition, some recent work [1, 2] showed that online algorithms (like PPO) tend to produce stronger policy models than offline algorithms (like DPO).

Final Decision

This is a pretty interesting paper that proposes using the value network learned in PPO for guided decoding.

Most reviewers are pretty positive about the paper. There are some flaws in the paper (no human evaluation, not enough justification for MCTS, inference-speed trade-offs, etc.). However, I think it is interesting enough to warrant acceptance.

This is probably the best (decent) paper in my batch. I recommend acceptance.