Selective Preference Optimization via Token-Level Reward Function Estimation
We propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection.
Abstract
Reviews and Discussion
This paper proposes Selective Preference Optimization (SePO). SePO selects the top-k tokens that dominate the final reward and trains DPO on these tokens to eliminate noise and improve efficiency. Experiments show that SePO outperforms a number of direct preference learning methods.
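As a rough illustration of the selection step described above, here is a minimal sketch of DPO-style token scoring and top-k selection; the function name, the 0.3 keep ratio, and the simple log-ratio score are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def select_key_tokens(logp_oracle, logp_ref, beta=0.1, keep_ratio=0.3):
    # Token-level reward estimate as a scaled log-ratio between a DPO-trained
    # oracle and its reference model (a common DPO-style parameterization;
    # the paper's exact estimator may differ).
    scores = beta * (logp_oracle - logp_ref)
    k = max(1, int(keep_ratio * scores.numel()))   # keep the top 30% by default
    topk = torch.topk(scores, k).indices           # indices of dominant tokens
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[topk] = True                              # only these tokens enter the loss
    return mask

# Toy usage: random stand-ins for per-token log-probabilities of one response.
torch.manual_seed(0)
logp_oracle = -torch.rand(12) * 3.0
logp_ref = -torch.rand(12) * 3.0
print(select_key_tokens(logp_oracle, logp_ref))
```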
Strengths
The strengths of this paper are as follows:
- This paper observes that the total reward of a generated utterance is usually dominated by a few tokens. This observation is interesting and motivates the method well.
- This paper proposes a token-selection-based training method, which is new and interesting to me.
- The experiments are comprehensive and the results look good.
Weaknesses
My concerns are listed as follows:
- My major concern is about the token selection mechanism. The motivation behind using the DPO-based token-level reward estimate as a proxy for the true token-level reward is unclear to me. Theorem 1 only proves that the two agree in aggregate, i.e., that the sum of the estimated token-level rewards and the sum of the true token-level contributions are the same (up to a constant). However, their value distributions might still be drastically different. Therefore, token selection based on the estimated rewards does not make sense given the current illustration (see the worked equation after this list).
- It looks like SePO is quite sensitive to the token selection ratio. Its search space looks odd, and performance fluctuates as the ratio varies within this set. This is an issue given that the improvement over the baselines is not that significant.
- (Minor issue): There appears to be a typo in Equation (6).
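For reference, the worked equation mentioned in the first weakness: in generic DPO notation (the paper's own symbols may differ), the sequence-level reward decomposes as

```latex
% Generic DPO notation; not necessarily the paper's exact symbols.
r_\phi(x, y)
  \;=\; \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  \;=\; \sum_{t=1}^{|y|}
        \underbrace{\beta \log
          \frac{\pi_\phi(y_t \mid x, y_{<t})}
               {\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}}_{\hat{r}_t}
  \qquad \text{(up to an additive, prompt-only constant)}
```

so the equality constrains only the sum of the token-level estimates $\hat{r}_t$, not their individual values, which is exactly the distribution-mismatch concern raised above.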
Questions
See the Weaknesses section.
The authors propose SePO—a method that utilizes selected tokens from an oracle model to perform preference optimization. The approach is evaluated across a wide range of models and general assistant benchmarks. The authors report that by optimizing only 30% of the tokens, they were able to surpass other methods for preference optimization.
Strengths
- The idea is clear and novel.
- The reported results indicate the promise of the approach.
Weaknesses
- The proof of Theorem 1, which asserts that after training a DPO model the reward function can be expressed as a decoupled token-level reward, inherits this property (Line 810) from the assumption that the reward can be written in such a manner (Assumption 1). This raises the question of whether all reward functions can be expressed in a decoupled way. From a naive perspective, a decoupled reward is not normalized, and longer texts might have larger absolute reward values. In my attempts to learn reward models in online settings with a decoupled approach, I found that without normalization their accuracy dropped dramatically; normalizing the sum of the per-token rewards led to improvements. Therefore, I strongly suspect that not all rewards can be expressed in this way. Moreover, the training objective (Equation 11) uses a normalized reward, making it unclear why Theorem 1 was presented.
- Some of the writing is ambiguous. The SePO objective (Equation 11) is hard to parse visually and would benefit from a human-understandable explanation before the equation.
- The experiments lack an exploration of how performance depends on the KL divergence from the reference policy. It is evident that training a policy with the SePO objective could cause it to diverge significantly, similar to the observations in Rafailov et al. [1]. For instance, could selecting a lower value of the KL-regularization coefficient enable DPO or the other baselines to perform better than SePO? (A sketch of one way to run such a matched-KL comparison follows the reference below.)
[1] Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
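A minimal sketch of the matched-KL comparison suggested above, assuming per-token log-probabilities of policy-sampled responses are available from both the trained policy and the reference model (the function name and inputs are illustrative):

```python
import torch

def estimate_seq_kl(logp_policy, logp_ref):
    # Monte-Carlo estimate of KL(pi_theta || pi_ref) at the sequence level:
    # average of sum_t [log pi_theta(y_t|...) - log pi_ref(y_t|...)] over
    # responses sampled from the trained policy.
    per_seq = [(lp - lr).sum() for lp, lr in zip(logp_policy, logp_ref)]
    return torch.stack(per_seq).mean()

# Toy usage with random stand-ins for real per-token log-probabilities.
torch.manual_seed(0)
logp_policy = [-torch.rand(n) for n in (8, 12, 10)]
logp_ref = [-torch.rand(n) - 0.2 for n in (8, 12, 10)]
print(float(estimate_seq_kl(logp_policy, logp_ref)))
```

Win rates of SePO and the baselines could then be compared at similar estimated KL budgets.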
Questions
See weaknesses
The paper introduces Selective Preference Optimization (SePO), a novel strategy for aligning large language models (LLMs) at the token level by selectively optimizing only on key tokens. Leveraging Direct Preference Optimization (DPO), SePO identifies and optimizes high-impact tokens, reducing supervision costs while improving alignment performance on benchmarks.
Strengths
SePO offers a cost-efficient alignment strategy by focusing on a subset of high-reward tokens, which reduces annotation costs. The method demonstrates better performance on several benchmarks, surpassing existing token-level and response-level alignment methods. SePO’s weak-to-strong generalization enables effective supervision from smaller, weaker oracle models, showing scalability across varying model sizes.
Weaknesses
- The method is limited by the requirement that oracle and policy models share the same vocabulary and tokenizer, which reduces flexibility across different model architectures.
- The use of the DPO reward format as an automatic credit-assignment mechanism has been attempted by other works, and the paper's contribution is weaker because it only converts the results of this assignment into weights for the DPO loss (an illustrative form of such a token-weighted objective is sketched after this list).
- Suppose the confidence given by the oracle model is used as the gold label for the credit distribution. In that case, we do not need DPO to fit the reward distribution given by the optimal policy (https://arxiv.org/abs/2404.12358, https://arxiv.org/abs/2408.14874). Alternatively, the paper needs to discuss the approximation error introduced by this step to further justify the need for SePO.
- Performance increases in the experiments are relatively minor, and the comparison model used is GPT-4-0314, which is not the current strongest model; moreover, there is no comparison against the untrained model itself, which would give a more intuitive sense of the improvement.
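For concreteness, the token-weighted objective alluded to in the second weakness could take the following generic form, where $s_t^{w}, s_t^{l} \in \{0, 1\}$ are hypothetical selection masks over the chosen response $y^{w}$ and rejected response $y^{l}$; this is an illustrative sketch, not necessarily the paper's exact Equation 11.

```latex
% Illustrative token-selected DPO-style objective (not the paper's exact form).
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{(x, y^{w}, y^{l})}\!\left[
  \log \sigma\!\Big(
    \beta \sum_{t} s_t^{w} \log
      \frac{\pi_\theta(y^{w}_t \mid x, y^{w}_{<t})}{\pi_{\mathrm{ref}}(y^{w}_t \mid x, y^{w}_{<t})}
    \;-\;
    \beta \sum_{t} s_t^{l} \log
      \frac{\pi_\theta(y^{l}_t \mid x, y^{l}_{<t})}{\pi_{\mathrm{ref}}(y^{l}_t \mid x, y^{l}_{<t})}
  \Big)\right]
```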
Questions
- How does SePO perform when win rates are computed against the untrained models themselves?
- The models used for the experiments are somewhat old; SePO needs to demonstrate its performance on newer open-source models.
- The theoretical part needs further refinement. I would like to see a discussion of whether DPO credit assignment needs to be realized through token-level weighted training, and whether better ways exist to take advantage of this feature of the DPO reward format.
This paper introduces Selective Preference Optimization (SePO) which optimizes model performance by selectively training only on key tokens with high token-level reward values using Direct Preference Optimization (DPO).
This approach significantly reduces the data requirements by focusing on 30% of the tokens, avoiding noise from less informative tokens and improving computational efficiency.
Additionally, this paper also explores weak-to-strong generalization, demonstrating that weaker oracle models can provide useful supervision for larger, more powerful policy models.
Experimental results across three benchmarks show that SePO outperforms baseline methods in alignment tasks, supporting its effectiveness and adaptability.
Strengths
This paper introduces a novel token-level reward function estimator using DPO.
SePO reduces the need for extensive token optimization, demonstrating improved alignment performance while training on only 30% of tokens. This is valuable for scaling LLMs and reducing computational overhead.
The weak-to-strong generalization capability of SePO allows smaller models to supervise larger ones.
Weaknesses
The experiments primarily involve relatively moderate-sized models. Testing SePO on stronger models, such as LLaMA2-Chat-70B, would provide further insights into its scalability and potential bottlenecks, especially for the weak-to-strong generalization experiment.
Compared to other methods, the improvement seems to be slight.
Questions
How does SePO scale with very large policy models?
How can the token selection threshold be chosen more adaptively across different datasets and models?
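One possible heuristic for this second question, offered purely as an illustrative sketch and not something from the paper: instead of a fixed top-k ratio, keep the smallest set of tokens whose absolute estimated rewards cover a target fraction of the total reward mass.

```python
import numpy as np

def adaptive_token_mask(token_rewards, coverage=0.8):
    # Keep the fewest tokens whose |reward| accounts for `coverage` of the
    # total absolute reward mass (a hypothetical alternative to a fixed top-k%).
    r = np.abs(np.asarray(token_rewards, dtype=float))
    order = np.argsort(-r)                            # largest |reward| first
    cum = np.cumsum(r[order]) / max(r.sum(), 1e-12)   # cumulative reward mass
    n_keep = int(np.searchsorted(cum, coverage)) + 1  # smallest covering prefix
    mask = np.zeros(r.shape, dtype=bool)
    mask[order[:n_keep]] = True
    return mask

# Example: a few dominant tokens carry most of the reward mass.
print(adaptive_token_mask([0.05, -0.02, 1.3, 0.1, -0.9, 0.03]))
```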