Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
We address non-stationary preference drift in LLMs using an exponential reweighting strategy.
Abstract
Reviews and Discussion
The paper introduces Non-stationary Direct Preference Optimization (NS-DPO) for preference learning on non-stationary offline datasets. Under an assumption bounding the preference drift and with log-linear policies, the paper also derives a sample-complexity guarantee for the method that depends on the amount of drift, with an improved rate when the upper bound of drift goes to zero. In experiments on synthetically created, temporally varying preference datasets, the paper shows that NS-DPO achieves better reward accuracy than vanilla DPO. In experiments on stationary datasets, the two achieve comparable reward accuracy.
Questions For Authors
- Could the authors provide more discussion on the implications of the theoretical results?
- Could the authors additionally evaluate using win rate?
- Could the authors include SW-DPO for further comparison?
Claims And Evidence
- The theoretical claims are well-supported, with careful enumeration of assumptions.
- The claim that NS-DPO achieves comparable performance empirically is based on reward accuracy, which doesn't necessarily reflect how the policy model is used in practice. For instance, Rafailov et al. 2024 show that reward accuracy correlates poorly with win rate (Fig 4).
Methods And Evaluation Criteria
- See above for note on the use of reward accuracy alone to assess performance in the experiments.
- Even after reading the supplement, I was still a bit confused about the dataset construction and setup. Just double checking, for the gradual drift experiments, the datasets were constructed such that the model sees increasing new data during training, and evaluation occurs on a held-out timestep that's later than all the training data?
Theoretical Claims
I skimmed but did not carefully check the proofs in the appendix.
Experimental Designs Or Analyses
- See above for extending the evaluation beyond reward accuracy.
- The sliding window DPO (SW-DPO) method seems like an important one to mention in the main paper. I appreciate the synthetic experiment comparing with it, but a further comparison seems important to justify the use of all the data and NS-DPO over a subset of the data and vanilla DPO. For instance, a characterization of when NS-DPO is preferred would improve this paper substantially.
Supplementary Material
I skimmed through the entire supplement.
Relation To Broader Scientific Literature
I appreciate the comparison to the literature on parameter drift in bandits, as well as other theoretical literature on DPO.
Essential References Not Discussed
To my knowledge, the related work does a good job comparing to recent work studying and extending DPO.
Other Strengths And Weaknesses
Strengths:
- The theoretical analysis is precise and well-situated within the existing literature (e.g., comparable setup to Chowdhury et al. 2024).
Weaknesses:
- More discussion about the practical implications of the theorems would be helpful.
- The experimental analysis could be strengthened with the use of other evaluation metrics (e.g., win rate) and inclusion of the SW-DPO baseline in the main experiments.
Other Comments Or Suggestions
The conclusion mentions ICL but I do not see any mention elsewhere. Is this perhaps a typo?
We thank the reviewer for their time and recognising the precision of our theoretical analysis. We address the questions and suggestions of the reviewer as follows.
Response to the questions
(1) “For the gradual drift experiments, were the datasets constructed such that the model sees increasing new data during training? ”
Please note that our setting considers offline learning scenarios, so all datapoints, together with their time step information, are provided before training starts. Because the reward that determines preferences changes gradually, preferences between responses are more likely to have changed at datapoints with later timesteps. It is correct that the evaluation uses preferences at a time step later than all of the training datapoints.
(2) “Does the evaluation occur on a held-out timestep that's later than all the training data?”
Yes, the evaluation is done with the preferences at a time step that is later than all the training datapoints.
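To make this setup concrete, here is an illustrative sketch of how a time-stamped, gradually drifting preference dataset can be generated under a Dynamic Bradley-Terry model. The drift schedule, reward margins, and all names below are illustrative assumptions for exposition, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100  # number of training time steps; evaluation uses a step later than T

def preference_prob(t, margin_old=2.0, margin_new=-2.0):
    """P(response A preferred over B) at time step t under a Dynamic
    Bradley-Terry model whose reward margin drifts gradually from
    margin_old to margin_new (illustrative construction, not the paper's)."""
    alpha = t / T                                    # gradual drift schedule
    margin = (1 - alpha) * margin_old + alpha * margin_new
    return 1.0 / (1.0 + np.exp(-margin))             # Bradley-Terry / logistic

# Offline dataset: every datapoint carries its time step; preference labels
# are sampled from the time-dependent probability, so labels at later time
# steps are more likely to flip relative to the early (stationary) preference.
dataset = [(t, "A" if rng.random() < preference_prob(t) else "B")
           for t in range(1, T + 1)]
```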
(3) “The conclusion mentions ICL but I do not see any mention elsewhere”
On page 7, left column, lines 369-372, we explain that tDPO refers to the in-context learning baseline: a variant of DPO where the time step information is explicitly included in the prompt. As demonstrated in the experimental results (Figures 2 and 9), tDPO did not significantly improve over stationary DPO.
Response to the suggestions
(1) “Could the authors provide more discussion on the implications of the theoretical results?”
We believe the core implications of the theoretical results are sufficiently discussed in the paper. To recap and summarize:
(1-a) Preference drift negatively affects the performance of policies, in the form of tracking error (Theorem 2, page 5).
(1-b) When we only know the total amount of preference drift (the drift budget in the paper) and the time steps of the datapoints, we can use exponential weighting to address the preference drift. The optimal value of the discounting parameter can be computed when the drift budget is known (lines 293-294, page 6); a schematic form of this weighted objective is given below, after (1-c).
(1-c) As the drift budget approaches zero, NS-DPO with the optimally chosen discounting parameter becomes identical to stationary DPO (Corollary 4, page 6).
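For concreteness, the weighted objective in (1-b) can be written schematically as follows; the notation is illustrative and the exact normalisation and parameterisation follow the paper. With discount parameter $\gamma \in (0,1]$ and datapoints $(x_t, y_t^w, y_t^l)$ collected at time steps $t \le T$,

$$
\mathcal{L}_{\text{NS-DPO}}(\theta) \;=\; -\sum_{t=1}^{T} \gamma^{\,T-t}\,
\log \sigma\!\left(\beta\left[
\log\frac{\pi_\theta(y_t^w \mid x_t)}{\pi_{\mathrm{ref}}(y_t^w \mid x_t)}
- \log\frac{\pi_\theta(y_t^l \mid x_t)}{\pi_{\mathrm{ref}}(y_t^l \mid x_t)}
\right]\right),
$$

where $\sigma$ is the logistic function and $\beta$ the usual DPO temperature. As the drift budget shrinks, the optimal $\gamma$ approaches 1 and the objective reduces to stationary DPO, consistent with (1-c).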
(2) “The experimental analysis could be strengthened with the use of other evaluation metrics (e.g., win rate)”
Win rate evaluation is in the appendix, where we find it correlates well with reward accuracy. We point the reviewer to the UltraFeedback-LM dataset explained in Appendix C.2 and Appendix C.6, where we present the results of the win rate experiments. We used llama-3.2-1b-it to fully fine-tune the policy within our compute budget. NS-DPO outperforms stationary DPO in terms of Length-Controlled Win Rate (LCWR), consistent with the reward accuracy comparisons.
(3) "inclusion of the SW-DPO baseline in the main experiments”
We point the reviewer to the additional experiment we conducted with SW-DPO (link 1), where we added SW-DPO results to Table 1 on page 18 of the paper. We trained policies with SW-DPO using optimal window sizes, so that only the datapoints after the preference change are used. Despite using information not allowed in the setting of our work, SW-DPO significantly underperforms NS-DPO in the Length-Controlled Win Rate (LCWR) of AlpacaEval. We attribute this to SW-DPO ignoring all datapoints outside the window. We also present results for the UltraFeedback-RM datasets, extending Figure 9: SW-DPO shows worse reward accuracy when the window size is not optimal (link 2) and similar accuracy when optimal window sizes are assigned (link 3).
We hope our responses have addressed all the questions the reviewer had about our work. We respectfully ask the reviewer to reconsider their evaluation of our paper in light of our answers and additional experimental results. We are eager to help clarify any further questions the reviewer might have as we believe in the importance of the problem this paper seeks to address.
Thank you for the response. The additional experiments as well as the clarifications to my questions are great, and I have raised my score.
This paper discusses a new approach called NS-DPO to address the issue of temporal preference drift in LLMs. Current LLM preference optimization algorithms do not account for changes in preferences over time, leading to misalignment. NS-DPO models time-dependent reward functions using a Dynamic Bradley-Terry model and introduces a computationally efficient solution by adding a single discount parameter in the loss function. This parameter enables exponential weighting, focusing learning on more recent and relevant data. The paper provides theoretical analysis of NS-DPO's convergence, offering upper bounds on estimation error and regret caused by non-stationary preferences. Experimental results demonstrate that NS-DPO fine-tuned LLMs remain robust under drifting preferences, outperforming baseline algorithms that ignore temporal changes, while maintaining performance in stationary scenarios.
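A minimal sketch of how such an exponentially weighted DPO-style loss could look in a PyTorch-style implementation; the function and argument names, the default values, and the normalisation below are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_logps_w, policy_logps_l,
                ref_logps_w, ref_logps_l,
                timesteps, final_t, gamma=0.95, beta=0.1):
    """Exponentially weighted DPO-style loss (illustrative sketch).

    policy_logps_* / ref_logps_*: summed log-probs of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference model.
    timesteps: time step of each preference datapoint; final_t: latest step T.
    gamma: discount parameter; older datapoints get weight gamma**(T - t).
    """
    # Standard DPO logits: implicit reward margin between chosen and rejected.
    logits = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    per_example = -F.logsigmoid(beta * logits)

    # Exponential time weighting: recent datapoints count more.
    weights = gamma ** (final_t - timesteps).float()
    return (weights * per_example).sum() / weights.sum()
```

Setting gamma = 1 recovers the standard (stationary) DPO loss up to normalisation, consistent with the claim that NS-DPO matches DPO when preferences do not drift.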
Questions For Authors
Can NS-DPO be aware of the preference change of users?
Claims And Evidence
The problem setting and direction of this research are intriguing, as current studies in LLMs are increasingly focusing on more complex and challenging scenarios. The overall claims made in the paper are valid, and the identified challenges, as well as the discussion of related works, are well-aligned with the evolving landscape of LLM research.
Methods And Evaluation Criteria
However, the proposed method does not fully make sense to me for this challenging problem setting. The main difference between NS-DPO and DPO is the introduction of a decaying factor that depends on the time step. While this design allows the LLM to focus more on recently collected data, it may fail to account for sudden changes in user preferences, because the decaying factor cannot detect or adapt to abrupt shifts in user preferences. If the LLM places more weight on learning recent data, historical data may become irrelevant; conversely, sudden changes in preferences could negatively impact the model's alignment. As a result, it is challenging to determine an appropriate value of the decaying factor that effectively resolves this dilemma.
Theoretical Claims
I appreciate the analysis in this paper, especially its use of regret analysis from the multi-armed bandit literature, which provides new insights for the RLHF community.
Experimental Designs Or Analyses
The experimental design makes sense, creating datasets with different strengths of preference change.
Supplementary Material
I briefly reviewed the proofs.
Relation To Broader Scientific Literature
LLM, RLHF, and DPO.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
No.
Other Comments Or Suggestions
.
We thank the reviewer for their time and recognising the intriguing nature of this problem. We believe this will become a far more prevalent problem as LLM providers gather bespoke datasets over longer time frames and are keen to raise this within the LLM research community. As such, we are glad the reviewer agrees with us that this is aligned with the evolving direction of LLM research.
Response to the Concern: “Adopting a decay factor may fail to account for sudden changes in user preferences.”
We address this at two points in our manuscript:
(a) Firstly, we direct the reviewer to Assumption 1 [Section 4]. This assumption places no constraint on how the preferences change; as such, our theoretical analysis addresses the case the reviewer raises, where preferences change suddenly. As seen in Theorem 3, the value of the discount parameter is set by the drift budget, which can be thought of as the ‘size’ of the preference change. A large sudden change results in a smaller discount parameter, focusing the algorithm’s learning on points closer to the present. We also note that this setting and the exponential weighting technique are thoroughly analyzed in the multi-armed bandit literature; we point the reviewer to lines 99-105 of page 2, left column, where we discuss the references.
(b) Secondly, we explicitly test the setting where preferences change suddenly in our experiment section. We direct the reviewer to Figure 3a and Figure 4, in which we test NS-DPO when preferences change suddenly at a specific changepoint, for a variety of preference change strengths. We note that whilst DPO and IPO fail as the reviewer predicted, NS-DPO maintains its strong performance, even when we fix the value of the discount parameter across all sudden preference drifts!
Response to the Question: “Can NS-DPO be aware of the preference change of users?”
Yes, although the available information is limited. The setting we consider in our work only allows knowledge of the total amount of parameter drift, i.e. the drift budget. NS-DPO requires knowledge of this budget to set its exponential weight for the best possible performance. In practice, we additionally provide experimental results where we test NS-DPO under various strengths of preference drift with a fixed value of the discount parameter, which still shows a significant performance improvement over the stationary baselines.
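To illustrate the fixed-discount point numerically, here is a toy calculation with assumed values for the discount parameter, horizon, and changepoint (none of these figures are from the paper):

```python
import numpy as np

# Illustrative only: how much a fixed discount parameter downweights old data.
gamma, T = 0.95, 200                      # assumed values, not from the paper
t = np.arange(1, T + 1)                   # time steps of the offline datapoints
w = gamma ** (T - t)                      # exponential weights, newest = 1.0

changepoint = 150                         # hypothetical sudden-drift location
stale_mass = w[t <= changepoint].sum() / w.sum()
print(f"fraction of total weight on pre-changepoint data: {stale_mass:.3f}")
# ~0.08 here, so datapoints from before the drift contribute little to the loss.
```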
We hope our responses have addressed all the questions the reviewer had about our work and ask that they raise their score. We are eager to help clarify any further questions the reviewer might have as we believe in the importance of the problem this paper seeks to address.
This paper addresses the issue of temporal preference drift in training large language models (LLMs) using human feedback. The authors propose Non-Stationary Direct Preference Optimization (NS-DPO), which incorporates a Dynamic Bradley-Terry model to account for time-dependent reward functions. This method introduces an exponential weighting parameter to focus learning on more recent data points, improving model robustness under preference changes.
Questions For Authors
see Experimental Designs Or Analyses
Claims And Evidence
The claims made in the submission are generally supported by clear and convincing evidence. The authors provide theoretical analysis with proofs for the convergence and regret bounds of NS-DPO, showing its effectiveness in handling non-stationary preferences. They also conduct extensive experiments on various datasets with controlled preference drift, demonstrating that NS-DPO outperforms existing methods like DPO and IPO in different scenarios, especially when significant preference changes occur.
Methods And Evaluation Criteria
The proposed methods and evaluation criteria in the paper are well-suited for addressing non-stationary preference optimization in LLM training.
Theoretical Claims
While I did not verify the detailed proofs, the derivations and arguments presented are clear and logically consistent.
Experimental Designs Or Analyses
The experimental designs demonstrate both merits and drawbacks. On the positive side:
- The synthetic experiments offer controlled settings to assess NS-DPO under known preference shifts
- Testing across multiple datasets showcases the method's broad applicability
- Ablation studies effectively clarify the impact of key parameters
However, limitations exist:
- The paper primarily uses synthetic data for main experiments
- Real-world datasets have artificially induced preference drift rather than naturally occurring shifts
Supplementary Material
N/A
Relation To Broader Scientific Literature
N/A
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
see Experimental Designs Or Analyses
Other Comments Or Suggestions
see Experimental Designs Or Analyses
We thank the reviewer for their response and recognising that our conclusions are well supported by both theoretical evidence and extensive experiments that show NS-DPO’s effectiveness in handling non-stationary preferences. We think this is a super important problem, unaddressed within the literature, and that it will become more prevalent as companies build proprietary datasets over longer time frames.
Response to Limitations
(1) To the best of our knowledge, we are the first to address the problem of non-stationary preferences within the LLM fine-tuning literature. As such, we were unable to find any publicly available datasets that exhibit this property, as most are well curated for specific goals, e.g. Helpful-Harmless, and not for preference drift.
(2) To account for this, we created datasets by introducing preference drift into pre-existing, realistic preference datasets. We specifically tested challenging preference drifts in our experiments, i.e. drifts where the preference change is strong and occurs suddenly. In Figures 3a and 4 we explicitly test NS-DPO on such preference drifts (a sudden shift from one reward model to another, at varying changepoints and with increasing strengths) and show that NS-DPO is robust to such changes even in the most challenging settings; a brief sketch of this construction is given after this list.
(3) By drawing attention to this problem and providing a simple and effective solution, we hope to attract the attention of the wider research community and collaborate on creating more realistic datasets; however, that is beyond the scope of this initial work, which clearly demonstrates the problems that non-stationary preference drift can cause for algorithms designed from a stationary perspective.
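As a brief illustration of the sudden-drift construction described in point (2): the changepoint, reward margins, and labelling rule below are illustrative assumptions, not the exact pipeline used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, changepoint = 100, 60   # assumed horizon and sudden-drift location

def label_with(reward_margin):
    """Sample a Bradley-Terry preference label from a reward margin."""
    return "A" if rng.random() < 1.0 / (1.0 + np.exp(-reward_margin)) else "B"

# Before the changepoint, labels come from reward model 1; afterwards, from
# reward model 2 (reduced here to fixed margins of opposite sign, so the
# preferred response flips abruptly at the changepoint).
margin_rm1, margin_rm2 = 2.0, -2.0
dataset = [(t, label_with(margin_rm1 if t < changepoint else margin_rm2))
           for t in range(1, T + 1)]
```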
Once again we thank the reviewer for their time and ask that they consider raising their score further in light of these arguments. This problem is exciting and underexplored though it will be of great interest in the future, and we believe that our work addressed this in a simple and effective way.
Thanks for your effort and reply; I will maintain my score.
The reviewers agreed that this paper examines a relevant problem in the area of preference optimization for LLMs (in particular, the problem that human preferences can change over time, and standard RLHF methods may be slow to react to shifts in these preferences) and were generally convinced by both the theoretical arguments and the empirical evidence in favor of the proposed solution (NS-DPO). All reviewers were in favor of acceptance.