Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment
This work introduces a new Policy Gradient algorithm that updates directly from relative rewards and showcases its effectiveness in fine-tuning LLMs.
Abstract
Reviews and Discussion
This paper proposes a new algorithm for RLHF intended to replace PPO. Experiments with full LM fine-tuning are conducted.
Overall, I feel this paper studied an interesting and important problem. However, the paper is not very well written. It is very hard to identify the key contribution and understand where the benefit is coming from. The authors made many comparisons: absolute feedback vs. relative feedback; trajectory-wise vs. token-wise; MDP vs. CB; but without directly connecting them to the proposed algorithm.
The algorithm seems to follow the standard approach: learn a reward model first, and then optimize the policy against the reward model using some optimization method. It is unclear how significant it is to replace PPO with PG. In practice, the more important issue is that the reward model is not good enough.
Do you still need to train a reward model first? I think it would be helpful to write down the pseudo-code. If so, comparing with DPO is not fair, since the main benefit of DPO is to avoid the expensive separate reward modeling + RL optimization steps.
In the DPO paper, the SFT baseline is trained on the preferred responses of the preference feedback dataset. In the current work, the SFT baseline is not further trained on the preference feedback dataset. I hope the authors can reproduce the result in the DPO paper so that we can make sure the implementation is correct.
Strengths
see above
Weaknesses
see above
Questions
see above
Thank you for your feedback. We appreciate your recognition of the importance of the problem our paper addresses. We acknowledge that the clarity of the paper can be improved, and we will endeavor to address your concerns in our revised draft.
Comment 1:
Overall, I feel this paper studied an interesting and important problem. However, the paper is not very well written. It is very hard to identify the key contribution and understand where the benefit is coming from. The authors made many comparisons: absolute feedback vs. relative feedback; trajectory-wise vs. token-wise; MDP vs. CB; but without directly connecting them to the proposed algorithm.
Response:
To clarify the distinctions and contributions of our work, we will focus on delineating the key features of P3O compared to PPO and DPO:
- P3O is online and policy-gradient-like; it utilizes relative feedback and employs trajectory-wise rewards (CB reward formulation)
- PPO is also online and policy-gradient-like, but it relies on absolute feedback and utilizes token-wise rewards (MDP reward formulation)
- DPO diverges from the above: it is offline and not policy-gradient-like. It employs relative feedback and trajectory-wise rewards (CB reward formulation)
We will emphasize these distinctions more clearly in the updated version of our paper to help readers better understand the position of P3O in this landscape.
Comment 2:
The algorithm seems to follow the standard approach: learn a reward model first, and then optimize the policy against the reward model using some optimization method. It is unclear how significant it is to replace PPO with PG. In practice, the more important issue is that the reward model is not good enough.
Response:
Regarding the experimental results, our findings on the Anthropic HH and TL;DR datasets demonstrate that P3O can optimize rewards more effectively while maintaining better control over KL divergence compared to PPO. In direct comparisons evaluated by GPT-4, responses generated by P3O were preferred approximately 60% of the time over those generated by PPO. This comparison is significant, considering that the same dataset was used and that our algorithm required minimal parameter tuning. We also posit that the performance gap could widen with an improved reward model, highlighting P3O's potential for even greater effectiveness.
Comment 3:
Do you still need to train a reward model first? I think it would be helpful to write down the pseudo-code. If so, comparing with DPO is not fair, since the main benefit of DPO is to avoid the expensive separate reward modeling + RL optimization steps.
Response:
Yes, P3O requires training a reward model first. We have included a concise pseudo-code in the appendix for both P3O and Policy Gradient methods. In the revised draft, we will make this aspect more prominent. Our comparison with DPO is fair and methodologically sound. We have adapted DPO into an online algorithm, enhancing its capability to continuously tune with an online reward model, which is typically more effective than solely relying on offline tuning.
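For readers of this thread, the overall pipeline being described can be summarized roughly as follows (a commented sketch in our own words, with hypothetical helper names such as train_reward_model, sample_prompts, and pairwise_policy_update; the exact pseudo-code is in the paper's appendix):

```python
# Rough sketch of the described pipeline (a paraphrase, not the appendix pseudo-code).
# `train_reward_model`, `sample_prompts`, and `pairwise_policy_update` are hypothetical helpers.

def p3o_pipeline(preference_data, prompts, sft_policy, num_iterations):
    # Step 1: fit a reward model on pairwise human comparisons (Bradley-Terry loss).
    reward_model = train_reward_model(preference_data)

    policy = sft_policy
    for _ in range(num_iterations):
        for x in sample_prompts(prompts):
            # Step 2: sample a *pair* of responses from the current policy.
            y1, y2 = policy.sample(x), policy.sample(x)
            # Step 3: the update only consumes the reward *difference*, which is
            # what makes the method invariant to per-prompt reward shifts.
            delta = reward_model(x, y1) - reward_model(x, y2)
            policy = pairwise_policy_update(policy, x, y1, y2, delta)
    return policy
```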
Comment 4:
In the DPO paper, the SFT baseline is trained on the preferred responses of the preference feedback dataset. In the current work, the SFT baseline is not further trained on the preference feedback dataset. I hope the authors can reproduce the result in the DPO paper so that we can make sure the implementation is correct.
Response:
In our work, the SFT baseline was not fine-tuned on the preferred responses from the preference feedback dataset. This decision aligns with the standard baseline approach followed in InstructGPT, APA, and RAFT. The Preferred-FT algorithm, which fine-tunes the SFT model on the chosen responses, is not considered a standard baseline for our comparisons. We have focused our comparisons on the most significant baselines, such as SFT, PPO, and DPO, which are widely recognized in the field.
We appreciate your feedback and hope that these clarifications will make our revised draft more comprehensible and informative. Thank you for your engagement, and we look forward to further discussions.
This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O) that operates directly on comparative rewards. The authors show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO. Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods.
Strengths
- The studied problem, i.e., policy optimization algorithms with human feedback, is very well-motivated and important in LLM alignment.
- The authors provide a rigorous theoretical guarantee on invariance for their algorithm P3O, and conduct experiments to show the good performance of algorithm P3O in the KL-reward trade-off.
Weaknesses
- The authors mention that P3O enjoys the invariance property while PPO does not. Why is the invariance property important? The authors should elaborate more on how this invariance property of P3O helps improve its performance in LLM alignment.
- The authors should give a more detailed comparison between P3O and DPO. They both satisfy the invariance property. Why does P3O perform better in the KL-reward trade-off?
- Why should the KL-reward trade-off be the performance metric for LLM alignment, not the reward? Is this a standard criterion in the literature?
I will consider raising my score if my concerns are well addressed.
Questions
Please see the weaknesses above.
Thank you for your insightful questions. We appreciate your recognition of the importance of the problem that our paper addresses. We value the opportunity to further illustrate the key aspects of our research.
Comment 1:
The authors mention that P3O enjoys the invariance property while PPO does not. Why is the invariance property important? The authors should elaborate more on how this invariance property of P3O helps improve its performance in LLM alignment.
Response:
The invariance property is crucial in the context of P3O and its application in Language Model (LM) alignment. To illustrate this, consider two responses, $y_1$ and $y_2$, with $y_1$ being preferred 73.1% of the time by human evaluators. When minimizing the Bradley-Terry loss, the difference in their fitted rewards is 1, regardless of the absolute values the reward model assigns to each response. Algorithms lacking invariance to reward translation, like PPO, might be misled by the absolute scale into treating a response as an excellent choice, even though it may not be optimal. Although clipping rewards can mitigate this to some extent, it is not a comprehensive solution.
In practice, we often observe initial reward drops in PPO, possibly due to the value function's difficulty in accurately capturing the scale of the rewards. P3O addresses these challenges by being immune to shifts in reward and by reducing the complexity of approximating additional functions, such as the value function in PPO.
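To make this concrete, below is a minimal numpy sketch (our own illustration, with reward values chosen only for the example) showing that the Bradley-Terry loss sees only the reward difference, while a PPO-style advantage computed against an imperfect value baseline is sensitive to the absolute scale:

```python
import numpy as np

def bt_loss(r_preferred, r_rejected):
    # Bradley-Terry negative log-likelihood: depends only on the reward difference.
    return -np.log(1.0 / (1.0 + np.exp(-(r_preferred - r_rejected))))

# Two reward assignments whose values differ by a constant shift of 100.
print(bt_loss(1.0, 0.0))       # ~0.313  (sigmoid(1) ~ 73.1% preference)
print(bt_loss(101.0, 100.0))   # ~0.313  -> the loss cannot distinguish the two

# A PPO-style advantage uses the raw reward minus a learned value baseline V(x);
# with an imperfect baseline, the shifted reward looks dramatically "better".
value_estimate = 0.5           # hypothetical value-function output
print(1.0 - value_estimate)    # 0.5
print(101.0 - value_estimate)  # 100.5
```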
Comment 2:
The authors should give a more detailed comparison between P3O and DPO. They both satisfy the invariance property. Why does P3O perform better in the KL-reward trade-off?
Response:
For the comparison between P3O and DPO, both of which satisfy the invariance property, the key difference lies in their approach to optimizing the policy. While both algorithms update along the direction $\nabla_\theta \log \pi_\theta(y_1 \mid x) - \nabla_\theta \log \pi_\theta(y_2 \mid x)$, the weighting of this direction differs. DPO's weight term is derived from a specific supervised loss function, while P3O's weight term stems from the derivation of the policy gradient. We hypothesize that DPO falls short in the KL-reward trade-off because it directly minimizes an alternative "distance", represented by the DPO loss, between the current policy and the goal policy; the intermediate policies along this optimization trajectory might not directly optimize the reward the way policy gradient methods do. In contrast, P3O ensures strict policy improvement, a characteristic of policy-gradient-like algorithms, leading to its superior performance in this trade-off.
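As a rough, schematic illustration of this difference (under our own simplifications; not the exact objectives from either paper), the per-pair coefficient multiplying the common update direction could be computed as:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_weight(beta, logp_1, logp_2, logp_ref_1, logp_ref_2):
    # DPO: the weight comes from its logistic (supervised) loss and is driven by
    # the implicit reward gap beta * log(pi / pi_ref) between the two responses.
    implicit_gap = beta * ((logp_1 - logp_ref_1) - (logp_2 - logp_ref_2))
    return beta * sigmoid(-implicit_gap)

def pairwise_pg_weight(reward_1, reward_2):
    # P3O-style pairwise policy gradient: the weight is (proportional to) the
    # learned reward difference itself, so each update directly optimizes the reward.
    return 0.5 * (reward_1 - reward_2)
```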
Comment 3:
Why should the KL-reward trade-off be the performance metric for LLM alignment, not the reward? Is this a standard criterion in the literature?
Response:
Regarding the choice of the KL-Reward frontier as a performance metric for LLM alignment, it stems from the need to balance preference optimization with maintaining linguistic coherence. While reward maximization is a common goal in reinforcement learning, in language generation a sole focus on reward can lead to a policy that generates nonsensical content (see, for reference, "Scaling Laws for Reward Model Overoptimization"). The KL-Reward metric effectively captures the algorithm's ability to optimize user preferences (represented by the reward) while preserving the original policy's language generation capabilities (represented by the KL divergence), and is thus more suitable than reward alone in our case.
Besides using the KL-Reward frontier to evaluate generation quality, we also employ GPT-4 evaluation to consolidate our findings. Previous studies have shown that GPT-4's assessments correlate strongly with human evaluations, sometimes even exceeding the consistency found in human-to-human evaluations. The results confirm that P3O aligns better than the other baselines.
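For reference, one point on the KL-Reward frontier can be estimated along the following lines (a minimal sketch with assumed helper callables; not the paper's evaluation code):

```python
import numpy as np

def kl_reward_point(prompts, sample_fn, logp_policy_fn, logp_sft_fn, reward_fn):
    """Estimate one (KL, reward) point for a fine-tuned policy.

    The callables are assumed helpers: sample_fn draws a response for a prompt,
    the logp functions return summed token log-probabilities of a response,
    and reward_fn scores a (prompt, response) pair with the learned reward model.
    """
    kls, rewards = [], []
    for x in prompts:
        y = sample_fn(x)
        # Single-sample Monte Carlo estimate of KL(pi || pi_sft) at this prompt.
        kls.append(logp_policy_fn(x, y) - logp_sft_fn(x, y))
        rewards.append(reward_fn(x, y))
    return float(np.mean(kls)), float(np.mean(rewards))
```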
We hope these clarifications address your concerns and provide a deeper understanding of our work. We are grateful for your engagement and look forward to further discussions.
Thank you for your response! I raised my score to 6. But I am not very familiar with the deep RL and LLM literature, and would like to listen to the opinions of other reviewers and AC during the discussion period.
This paper presents a novel policy learning method for learning from human feedback for LLM alignment. The main idea is to derive a pairwise policy gradient method that can improve LLM alignment directly based on comparisons between rewards. The authors also compare the newly proposed P3O method with previous methods such as PPO and DPO, through both mathematical analysis and empirical comparison, indicating that P3O is a competitive method for LLM alignment.
Strengths
- P3O is derived in a principled way by adding a baseline constant/function to the naive policy gradient, resulting in a pairwise policy gradient method for LLM alignment.
- The authors discuss the connections and differences between P3O and previous methods such as PPO and DPO, which helps readers understand the underlying relationships between different alignment algorithms.
- Empirical experiments show that the newly proposed method P3O can achieve better performance on both reward and automatic evaluation scores by GPT-4.
Weaknesses
- Compared with DPO, P3O still needs to learn a reward function, which increases the complexity of the overall algorithm pipeline. At the same time, judging from the GPT-4 automatic evaluation, the improvement over DPO appears marginal.
- I feel the overall experiments are good, but at the same time, the authors might miss some baselines. For example, given the same reward function learned from the dataset, it would also be good to compare with RAFT, since it is also a very simple algorithm for policy improvement.
- What if we apply the same clip trick for DPO? Will DPO also achieve similar performance to P3O?
- If I understand correctly, the algorithm design is based on the assumption that we can sum over $\sum_{y} \pi(y \mid x)\, r(x, y)$ to obtain a constant baseline w.r.t. $y$. But in practice, especially for the latter algorithm design, we only have a one-sample estimate of the summation. This could be problematic since $y$ is a piece of a sentence (tokens, or discrete variables); usually we might need many samples to obtain such an unbiased estimate. Will this cause a problem for gradient estimation?
Questions
Please answer my question listed above.
Thank you for your comments. We greatly appreciate the opportunity to clarify these points and are committed to addressing your concerns.
Comment 1:
Compared with DPO, P3O still needs to learn a reward function, which increases the complexity of the overall algorithm pipeline. At the same time, judging from the GPT-4 automatic evaluation, the improvement over DPO appears marginal.
Response:
Regarding the complexity of learning a reward function in P3O as compared to DPO, we acknowledge this concern. While having to train a reward model does add complexity, it also provides a more robust and adaptive framework for policy optimization. This is evidenced by our results, where P3O achieves approximately a 55% preferred rate over DPO evaluated by GPT-4.
Comment 2:
I feel the overall experiments are good, but at the same time, the authors might miss some baselines. For example, given the same reward function learned from the dataset, it would also be good to compare with RAFT, since it is also a very simple algorithm for policy improvement.
Response:
We appreciate this insightful recommendation to include RAFT in our comparisons. We agree that benchmarking P3O against RAFT would provide a more comprehensive evaluation of our algorithm. We are committed to conducting this experiment. Currently, our comparisons include three significant algorithms – SFT, PPO, and DPO – which we believe provide a strong baseline for evaluating P3O's performance.
Comment 3:
What if we apply the same clip trick for DPO? Will DPO also achieve similar performance to P3O?
Response:
Regarding applying the clipping technique to DPO, it is important to note that clipping is a strategy tailored to policy gradient algorithms. In the paper Trust Region Policy Optimization, each policy gradient (PG) update includes an explicit KL divergence constraint. This constraint ensures that the updated policy remains close to the previous one. PPO simplifies this by replacing the explicit KL constraint, which is difficult to estimate from samples, with a more straightforward clipping technique. This approach has been shown to yield similar performance.
However, DPO differs fundamentally from PG-based algorithms. DPO relies on a supervised loss rather than performing policy gradient updates. Therefore, there may be no need to apply clipping in this context. In fact, clipping might actually be detrimental to DPO, because it can introduce bias into the gradients and potentially harm the algorithm's effectiveness.
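For reference, the standard PPO-style clipped surrogate being described could be sketched as follows (an illustrative PyTorch-style snippet, not code from the paper):

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    # Standard PPO clipping: restrict the importance ratio pi_new / pi_old to
    # [1 - eps, 1 + eps], which keeps the updated policy close to the old one
    # without estimating an explicit KL constraint.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()  # loss to minimize
```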
Comment 4:
If I understand correctly, the algorithm design is based on the assumption that we can sum over $\sum_{y} \pi(y \mid x)\, r(x, y)$ to obtain a constant baseline w.r.t. $y$. But in practice, especially for the latter algorithm design, we only have a one-sample estimate of the summation. This could be problematic since $y$ is a piece of a sentence (tokens, or discrete variables); usually we might need many samples to obtain such an unbiased estimate. Will this cause a problem for gradient estimation?
Response:
We would like to clarify that we do not rely on summing over all possible outcomes for our estimations. Instead, our approach utilizes expectations to create an unbiased estimator. Hence the estimator is unbiased regardless of the number of samples.
The original gradient in summation form is $\sum_{y} \pi_\theta(y \mid x)\,\big(r(x, y) - \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}[r(x, y')]\big)\,\nabla_\theta \log \pi_\theta(y \mid x)$, and this is equivalent to $\tfrac{1}{2}\,\mathbb{E}_{y_1, y_2 \sim \pi_\theta(\cdot \mid x)}\big[\big(r(x, y_1) - r(x, y_2)\big)\big(\nabla_\theta \log \pi_\theta(y_1 \mid x) - \nabla_\theta \log \pi_\theta(y_2 \mid x)\big)\big]$. In practice we use the estimator $\tfrac{1}{2}\,\big(r(x, y_1) - r(x, y_2)\big)\big(\nabla_\theta \log \pi_\theta(y_1 \mid x) - \nabla_\theta \log \pi_\theta(y_2 \mid x)\big)$, with two samples $y_1, y_2$ drawn from the old policy.
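A minimal PyTorch-style sketch of this two-sample estimator (our own illustration; the full P3O algorithm additionally involves importance ratios and clipping):

```python
import torch

def pairwise_pg_loss(logp_y1, logp_y2, r1, r2):
    """Two-sample, trajectory-wise pairwise policy-gradient surrogate.

    logp_y1 / logp_y2: summed log-probabilities (tensors) of two responses sampled
    for the same prompt under the current policy; r1 / r2: their scalar rewards
    from the reward model. Differentiating this loss yields
        -0.5 * (r1 - r2) * (grad logp_y1 - grad logp_y2),
    an unbiased estimate of the policy gradient with the per-prompt expected
    reward as baseline.
    """
    reward_gap = (r1 - r2).detach()  # rewards are treated as constants
    return -0.5 * (reward_gap * (logp_y1 - logp_y2)).mean()
```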
We hope this response adequately addresses your queries and provides a clearer understanding of our work. We are grateful for your engagement and look forward to further discussions.
This paper studies Reinforcement Learning with Human Feedback (RLHF), where the default optimizer, Proximal Policy Optimization (PPO), is replaced by a new algorithm, Pairwise Proximal Policy Optimization (P3O), that is invariant to equivalent rewards. The authors named the resulting framework 'reinforcement learning with relative feedback' and empirically show that P3O is better than existing methods.
Strengths
The following are the key strengths of the paper:
- The problem studied in the paper is interesting, as making RLHF more efficient and practical has many real-world applications, especially for LLM fine-tuning.
- The authors show that PPO (a commonly used optimizer in RL, especially for RLHF) is not invariant to equivalent rewards and then propose a new algorithm, P3O, which overcomes this shortcoming.
Weaknesses
The following are the key weaknesses of the paper:
- Reward Equivalence: Since there are no constraints (e.g., that Definition 1 should hold uniformly across all prompts, or that the offset needs to be an increasing function or a constant), Definition 1 holds for each prompt separately. For me, saying two reward functions are equivalent implies that one reward function magnifies (either positively or negatively) the other for all prompts, but Definition 1 does not guarantee this.
- Relative feedback: As the relative feedback is derived from the difference in rewards (as shown in Figure 1), it implies access to the rewards. First, it is not clear in which scenarios one can access the true rewards in LLMs. Second, it is unclear why not directly train the RL model using the available rewards. Methods designed for dueling bandits or RLHF are generally useful when it is hard to get the reward but easier to get pairwise preferences.
Questions
Please address the above weaknesses.
Typo: Page 4, second paragraph: terminates with with a <eos> -> terminates with a <eos>
Details of Ethics Concerns
Not applicable
Thank you for your feedback. We deeply appreciate the opportunity to enhance the clarity and understanding of our work.
Comment 1:
Reward Equivalence: Since there are no constraints (e.g., that Definition 1 should hold uniformly across all prompts, or that the offset needs to be an increasing function or a constant), Definition 1 holds for each prompt separately. For me, saying two reward functions are equivalent implies that one reward function magnifies (either positively or negatively) the other for all prompts, but Definition 1 does not guarantee this.
Response:
Thank you for raising this important point regarding reward equivalence. The function $\delta(x)$ can be an arbitrary function that takes in a prompt and returns a scalar. To offer further clarification, consider two reward functions, $r$ and $r'$. They incur the same Bradley-Terry loss on a response pair $(y_1, y_2)$ for a prompt $x$ if and only if $r(x, y_1) - r(x, y_2)$ and $r'(x, y_1) - r'(x, y_2)$ are equal; this difference represents the relative preference of $y_1$ over $y_2$. This means that $r$ and $r'$ may differ by a constant on the prompt $x$. However, it does not mean $\delta$ must be a constant or an increasing function (indeed, we cannot even define an increasing function here, since its argument is a sequence, not a scalar), because the constant by which $r$ and $r'$ differ can vary across prompts.
For instance, take two prompts $x_1$ and $x_2$, each with a pair of responses, and shift the rewards for $x_1$ by one constant and the rewards for $x_2$ by a different constant. The resulting reward function is equivalent to the original, and the offset $\delta(x)$ in this example varies between prompts.
Your mention of reward function magnification is insightful. However, scaling a reward function (e.g., $r'(x, y) = 2\,r(x, y)$) does not yield the same Bradley-Terry loss, so it does not fit our original definition of equivalence.
Comment 2:
Relative feedback: As the relative feedback is derived from the difference in rewards (as shown in Figure 1), it implies access to the rewards. First, it is not clear in which scenarios one can access the true rewards in LLMs. Second, it is unclear why not directly train the RL model using the available rewards. Methods designed for dueling bandits or RLHF are generally useful when it is hard to get the reward but easier to get pairwise preferences.
Response:
Thank you for your question about relative feedback and reward access. We do have a trained reward model, following the previously proposed RLHF pipeline. Note that the reward model is trained from pairwise comparison data, where a human chooses a preferred response from two options for the same query. The reward learning is facilitated by the Bradley-Terry Loss function.
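As a reference point for this discussion, the pairwise reward-model training objective can be sketched as follows (a minimal illustration assuming a reward_model callable that returns a scalar score for a prompt/response pair; not the paper's training code):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, prompt, chosen, rejected):
    # The reward model scores each (prompt, response) pair with a scalar; the loss
    # maximizes the log-likelihood that the human-preferred response receives the
    # higher score. Only the score *difference* enters the loss, which is why the
    # learned reward is identified only up to a per-prompt offset.
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```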
Regarding your question about training the RL model directly using the rewards: approaches like PPO, which are not invariant to reward equivalence, can exhibit instability and are vulnerable to noise in the reward model. This is particularly problematic when the scale of the reward can be misleading. For instance, consider two responses, $y_1$ and $y_2$, with $y_1$ being preferred 73.1% of the time by human evaluators. When minimizing the Bradley-Terry loss, the difference in their fitted rewards is 1, regardless of the absolute values the reward model assigns. Algorithms lacking invariance to reward translation might be misled by the absolute scale into treating a response as an excellent choice, even though it may not be optimal.
P3O, our proposed algorithm, addresses this issue by focusing on relative feedback. It discerns the relative preference of $y_1$ over $y_2$ and prioritizes generating the preferred response accordingly. This approach provides a more nuanced and accurate reflection of human preferences, avoiding the pitfalls of absolute reward scaling.
In this paper, the authors propose a trajectory-based policy gradient algorithm, called pairwise proximal policy optimization (P3O), that operates directly on comparative rewards. They show theoretically that unlike PPO that requires to calibrate the reward scale, P3O is invariant to equivalent reward functions, and thus, argue that it is a better choice for RLHF. They provide empirical evidence in support of P3O.
The problem of policy optimization with human feedback, studied in the paper, is important in alignment of LLMs. The results and derivations seem to be correct. The method is well-motivated and is compared with the main competitors: PPO and DPO.
However, the reviewers do not seem to be convinced about the advantages of P3O over existing algorithms. The experiments are simple and do not show a clear performance improvement. The reviewers suggest more diverse experiments (besides summarization) and a more holistic evaluation of P3O (w.r.t. PPO and DPO) that takes into account its complexity/cost in terms of implementation and number of preference feedback comparisons, and its ease of tuning and optimization for different problems. I am confident that adding these results, plus slightly improving the writing to make the paper more accessible, would significantly improve its quality.
Why not a higher score
The reviewers do not seem to be convinced about the advantages of P3O over the existing algorithms. The experiments are simple and do not show clear performance improvement. They ask for more diverse experiments (besides summarization) and more holistic evaluation of P3O (w.r.t. PPO and DPO) that takes into account its complexity/cost in terms of implementation and number of preference feedback (comparisons), and its ease of tuning and optimizing for different problems.
Why not a lower score
None
Reject