$\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs
Abstract
Reviews and Discussion
This paper extends SimPO to an adaptive setting where the reference model is a mixture of the uniform distribution and the original reference model. The authors provide some theoretical justification, and experiments demonstrate the performance of the proposed algorithm.
Strengths
- This paper is well-written and easy to follow; the authors provide a good motivation for extending SimPO to a mixture setting.
- The authors try to provide some justification by building a connection between SimPO and the proposed method.
- Extensive experiments are conducted for verification.
Weaknesses
- It seems to me that this method is just a one-step extension of SimPO that makes the reference model a tunable mixture of the uniform distribution and the original reference model. The authors might want to demonstrate the technical challenges more clearly, so that this extension does not appear to be merely 'tuning the parameter in a super set'.
- The theoretical justification in Definition 4.1 and later is not well supported. For example, in Definition 4.1, the authors write the expectation as . I assume the authors mean that the prompt is drawn from the prompt dataset and the responses are independently sampled from the policy, according to this definition. Even setting aside the flaw in this notation, there are still some issues in this definition: the roles that the two responses play are not equal, so in online DPO, another oracle needs to be called to compare them and generate the preference label. Also, an online DPO justification might not be convincing enough to justify offline DPO methods.
- Following up on the previous theoretical justification, the transition between eq (15) and eq (16) is not convincing. It seems that eq (15) is well supported by Definition 4.1, but eq (16) is connected with the proposed algorithm, so the authors directly draw the connection by flipping the two terms. I would suggest the authors improve the justification for Section 4 and reconsider whether the contribution of this part is overclaimed.
Questions
- It seems that when $\alpha = 0$, the algorithm becomes SimPO. I wonder when the algorithm becomes DPO?
- A Theoretical Justification via Reward Difference Estimation
The original DPO objective maximizes the preference probability $p(y_{\mathrm{win}} \succ y_{\mathrm{lose}} \mid x)$, which, under the Bradley-Terry (BT) model, can be formulated as maximizing the log-likelihood $\mathbb{E}\big[\log \sigma\big(r(x, y_{\mathrm{win}}) - r(x, y_{\mathrm{lose}})\big)\big]$, where $\sigma$ is the sigmoid function. Defining the reward difference $\delta = r(x, y_{\mathrm{win}}) - r(x, y_{\mathrm{lose}})$, the optimization objective simplifies to $\mathbb{E}[\log \sigma(\delta)]$. Suppose we start with an initial estimate $\delta_0$. A possible improved estimate that maximizes the likelihood is given by:
$$\delta_1 = \delta_0 + \alpha\, G\big(\sigma(-\delta_0)\big),$$
where $G$ is a monotonic function preserving the direction of the update. This implies that, for sufficiently small $\alpha$, $\delta_1$ yields a higher likelihood and can be viewed as a more optimistic estimate of the reward difference than $\delta_0$.
In the case of DPO, the reward difference can be reformulated as:
$$\delta_0 = \log \frac{\pi_{\theta}(y_{\mathrm{win}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x)} - \log \frac{\pi_{\theta}(y_{\mathrm{lose}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)}.$$
Updating $\delta_0$ with the rule above leads to:
$$\begin{aligned}
\delta_1 &= \log \frac{\pi_{\theta}(y_{\mathrm{win}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x)} - \log \frac{\pi_{\theta}(y_{\mathrm{lose}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)} + \alpha\, G\!\left(\sigma\!\left(- \log \frac{\pi_{\theta}(y_{\mathrm{win}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x)} + \log \frac{\pi_{\theta}(y_{\mathrm{lose}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)}\right)\right) \\
&= \log \frac{\pi_{\theta}(y_{\mathrm{win}}|x)}{\pi_{\theta}(y_{\mathrm{lose}}|x)} + \left(\gamma - \log \frac{\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)}\right) - \left[\gamma - \alpha\, G\!\left(\sigma\!\left(-\log \frac{\pi_{\theta}(y_{\mathrm{win}}|x)\,\pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)}{\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x)\,\pi_{\theta}(y_{\mathrm{lose}}|x)}\right)\right)\right]
\end{aligned}$$
If we assume that $\pi_{\mathrm{ref}}(y_{\mathrm{win}}|x) = \pi_{\mathrm{ref}}(y_{\mathrm{lose}}|x)$ holds per the uniform assumption, then the second term in the above equation reduces to $\gamma$, and the remaining part follows by choosing $G$ appropriately while maintaining monotonicity.
In conclusion, our proposed method can be viewed as a new reward difference estimator (i.e., $\delta_1$ instead of $\delta_0$), which is more optimistic than its vanilla counterpart under the uniform assumption.
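To make the "more optimistic estimate" claim concrete, below is a minimal numerical sketch (assuming, purely for illustration, that $G$ is the identity and that the reward differences are arbitrary scalars): for any initial $\delta_0$ and small $\alpha > 0$, the update $\delta_1 = \delta_0 + \alpha\, G(\sigma(-\delta_0))$ increases the per-pair BT log-likelihood $\log \sigma(\delta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def improved_estimate(delta0, alpha, G=lambda u: u):
    # delta_1 = delta_0 + alpha * G(sigmoid(-delta_0)); G monotonic, here the identity.
    return delta0 + alpha * G(sigmoid(-delta0))

delta0 = np.array([-1.5, -0.2, 0.0, 0.8, 2.0])  # arbitrary initial reward-difference estimates
delta1 = improved_estimate(delta0, alpha=0.05)

# log sigma(delta) is the per-pair Bradley-Terry log-likelihood; delta1 scores higher.
gain = np.log(sigmoid(delta1)) - np.log(sigmoid(delta0))
assert np.all(gain > 0)
print(gain)
```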
Q5: I wonder when the algorithm becomes DPO?
A5: Currently, the $\alpha$-DPO algorithm cannot be transformed into DPO merely through parameter adjustments, similar to how SimPO cannot be converted to DPO by altering its parameters. However, we believe this topic presents significant promise, allowing us to propose a more generalized formulation. This expression encompasses several specific cases:
- Setting and results in the DPO formulation with .
- Setting recovers the SimPO expression.
- Setting yields the $\alpha$-DPO formulation.
We hope these additional clarifications address your concerns comprehensively. Thank you again for your thoughtful review and the opportunity to improve our work.
Q4: The transition between eq(15) and eq(16) is not convincing.
A4: To elucidate the rationale behind this transition, we offer the following explanations.
- Necessity of the Transformation:
When the condition arises—potentially due to the overrepresentation of less preferred responses in the offline dataset—the importance weight becomes disproportionately large. This disproportionate weight amplifies the impact of less preferred responses within the expectation, which could consequently result in suboptimal policy updates. To mitigate this issue, an adjustment is made by placing the corresponding term in the denominator, thereby inversely scaling the weight associated with the overrepresented less preferred responses. This adjustment serves to diminish the undue influence of responses that have higher probabilities under one model than under the other.
- Supporting the Transition Between Equations (15) and (16) Through Simulation of Online DPO Training Steps:
Consider the scenario at step $t$ of the optimization process, where we refine the policy. In this scenario, the updated model assigns a higher score to the preferred response and a lower score to the less preferred one. Consequently, when new samples are generated using the updated policy, the probability of generating the preferred response to the same prompt increases, while the probability of generating the less preferred response decreases. This aligns with the direction of the corrected importance weight. We attribute the perceived inconsistency to a lack of rigorous symbol usage, which can be clarified by considering the following conditional distributions, as mentioned in A1:
- For the response with the highest score , it adheres to the conditional distribution:
- For the response with the lowest score , it adheres to the conditional distribution:
We thank the reviewer for their thoughtful feedback and constructive suggestions. Below, we address your comments in detail.
Q1: It seems to me that this method is just a one-step extension to the SimPO to make the reference model a tunable mixture of the uniform distribution and original reference model.
A1: While $\alpha$-DPO has a mixture form, we do not view this as a weakness. Recent work, such as [1], highlights the importance of the reference model and its impact on performance, with SimPO introducing the concept of being reference-free. Our work aims to explore the optimal reference model needed for offline preference learning without additional gold RM models. We hope to provide new insights, for example, by integrating online algorithms and KL divergence control.
[1] Liu et al. 2024. Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709
Q2: In online DPO, another oracle needs to be called to compare the two responses and generate the preference label.
A2: Thank you for your suggestion. We acknowledge the inconsistency in the notation and will revise this section in the draft for clarity. The corrected description of the online process is as follows:
Let us revisit the acquisition method for these samples (e.g., using the princeton-nlp/llama3-ultrafeedback dataset): for each prompt $x$, we generate 5 responses using the SFT model with a sampling temperature of 0.8. We then score these responses with llm-blender/PairRM and select the highest-scoring one as $y_{\mathrm{win}}$ and the lowest-scoring one as $y_{\mathrm{lose}}$. For the highest-scoring sample $y_{\mathrm{win}}$, it follows a new conditional distribution:
where the last part uses the fact that, for a given $x$, the generative model has a uniform chance of generating all possible $y$. Similarly, for the lowest-scoring sample $y_{\mathrm{lose}}$, it follows an analogous conditional distribution. We have ensured that the notation is now consistent and precise across the descriptions.
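For concreteness, the sampling-and-ranking procedure described above can be sketched as follows; `sample_response` and `score_response` are hypothetical stand-ins for the SFT generation call and the PairRM-style scorer, not the actual APIs used in the paper.

```python
def build_preference_pair(prompt, sample_response, score_response, k=5, temperature=0.8):
    """Draw k candidate responses for one prompt and keep the best/worst as (y_win, y_lose)."""
    responses = [sample_response(prompt, temperature=temperature) for _ in range(k)]
    scores = [score_response(prompt, r) for r in responses]
    y_win = responses[max(range(k), key=lambda i: scores[i])]   # highest-scoring response
    y_lose = responses[min(range(k), key=lambda i: scores[i])]  # lowest-scoring response
    return y_win, y_lose
```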
For online DPO, ideally, we would utilize an oracle reward model to evaluate each sampled instance. However, this approach incurs significant computational costs. Therefore, we aim to approximate this effect through importance sampling, as described in Equation 14.
Q3: An online DPO justification might not be convincing enough to justify the offline DPO methods
A3:
It is well known that preference optimization with online ingredients, particularly methods like online AI feedback (OAIF) [1] and self-play [2,3], enhances model alignment by generating new data during training.
However, due to the computational cost of sample regeneration and training stability issues, purely online preference optimization methods are not widely applied in industrial practice. Based on this observation, we propose to study an interesting research question: can we mimic the online feature in the classic offline setting? Our theoretical analysis attempts to close this gap with the importance sampling trick.
In particular, the underlying logic first defines the expression for online SimPO, which is characterized by the continuous updating of sampled data during training. In this process, the data transitions from the offline set to the online set. This operation can be viewed as an importance sampling method. Interestingly, this key technique of importance sampling aligns with the optimization direction of the $\alpha$-DPO loss, with both approaches converging in the scenario where $\alpha \to 0$. Consequently, Lemma 4.2 aims to establish a connection between $\alpha$-DPO and the online SimPO loss, thereby facilitating the integration of these concepts.
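As a generic illustration of the importance sampling trick referenced here (not the paper's exact Equation 14), the sketch below reweights samples drawn from a behavior distribution so that they estimate an expectation under a different target distribution; the toy distributions and the quantity `f` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_ref = np.array([0.5, 0.3, 0.2])    # behavior distribution (offline data source)
pi_theta = np.array([0.2, 0.3, 0.5])  # target distribution (current policy)
f = np.array([1.0, 2.0, 3.0])         # any per-sample quantity, e.g. a loss term

samples = rng.choice(3, size=200_000, p=pi_ref)   # offline samples drawn from pi_ref
weights = pi_theta[samples] / pi_ref[samples]     # importance weights pi_theta / pi_ref
is_estimate = np.mean(weights * f[samples])       # reweighted offline estimate

print(is_estimate, np.dot(pi_theta, f))           # both approximate E_{pi_theta}[f]
```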
[1] Guo et al. Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792 (2024)
[2] Chen et al. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[3] Wu et al. Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675 (2024)
We greatly appreciate the opportunity to engage in a substantive dialogue with you, aimed at further improving the quality and clarity of our work. Below, we provide a detailed summary addressing your key concerns. We hope that this explanation will sufficiently address your questions and merit your endorsement.
1. The proposed reference model is a tunable mixture of the uniform distribution and the original reference model.
In our response A1, we clarify that this formulation is neither a weakness nor merely a trivial extension achieved by "tuning the parameter in a superset."
First, let us clarify that the primary motivation for proposing the reference policy in Eq. (8) arises from the limitations of the conventional setup, which often exhibits significant randomness and errors. Addressing this limitation is a core objective of our work, a perspective that is similarly reflected in SimPO [1] (Section 3.2 and Figure 4b) and Liu et al. [2] (Section 5).
Our goal is thus to design a more robust and effective reference policy, as expressed in Eq. (8). To provide additional context, we elaborate on the following key points:
-
Alignment with online SimPO optimization. Preference optimization with online elements (e.g., online AI feedback [3] and self-play [4][5]) is widely acknowledged to improve model alignment by generating new data during training. However, the high computational cost and instability of purely online methods limit their practicality in industrial applications. Our work explores whether online-like behavior can be emulated in a classical offline setting. Theoretical analysis based on importance sampling is provided to bridge this gap.
-
Structural advantages. A major contribution of token-level DPO is the introduction of a form similar to DPO with an offset, which helps control the KL divergence. Our $\alpha$-DPO framework leverages a related form and enhances performance. Appendix Table 6 provides empirical evidence, showing the performance advantages of $\alpha$-DPO over TDPO.
-
Mitigating the impact of label flipping noise. Within the SimPO framework, the gradient is expressed as a weighted update,
where the weight is determined by the current reward estimate.
This formulation amplifies the weights when the reward estimate is inaccurate, increasing noise in the gradients. In contrast, $\alpha$-DPO introduces an additional term:
This term increases the weight when the reward estimate is accurate and decreases it when the reward estimate is inaccurate, mitigating noise amplification.
-
Improving reference policy quality. This work directly addresses the unreliability of reference policies (Section 3.1). By integrating the policy model into the reference model's design, we enhance the quality of the reference model and improve fine-tuning performance. Similar ideas have been explored in recent works [2][6], highlighting the broader relevance of this approach.
References:
[1] Meng et al. (2024): Simpo: Simple preference optimization with a reference-free reward. NeurIPS 2024.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
[3] Guo et al. (2024): Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792.
[4] Chen et al. (2024): Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[5] Wu et al. (2024): Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675.
[6] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
2. The significance of theoretical guarantees
In our response A2, we revisited the distributions of $y_{\mathrm{win}}$ and $y_{\mathrm{lose}}$ and revised the original text to ensure consistency throughout.
In our response A3, we explained the motivation for establishing connections with online algorithms. The core rationale is that online algorithms remain the most effective choices due to their superior performance. However, purely online preference optimization methods are often impractical for industrial applications due to computational cost, sample regeneration, and training stability issues. As a compromise, importance sampling is frequently employed, and similar approaches are widely adopted in the reinforcement learning literature [1][2][3][4].
Finally, we would like to emphasize that SimPO has already demonstrated its effectiveness in various experimental settings. Our method extends SimPO to personalized scenarios, with the parameter $\alpha$ acting as a key control factor. Specifically:
- The choice of $\alpha$ balances two competing objectives. When $\alpha = 0$, the method reduces to SimPO, where all training samples are assigned the same target margin.
- As $\alpha$ increases, the model places greater emphasis on samples with accurate reward estimates.
- However, excessively large values of $\alpha$ cause the model to focus solely on a few samples with the most accurate reward estimates, neglecting the contributions of other samples, which can hinder the training process.
This trade-off highlights the importance of selecting an appropriate $\alpha$ to achieve a balance between conservativeness and aggressiveness.
[1] Sergey Levine et al.: "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems," CoRR abs/2005.01643 (2020).
[2] Alberto Maria Metelli et al.: "Policy Optimization via Importance Sampling," NeurIPS 2018.
[3] Tengyang Xie et al.: "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling," NeurIPS 2019.
[4] Philip J. Ball et al.: "Efficient Online Reinforcement Learning with Offline Data," ICML 2023.
3. The transition between eq(15) and eq(16) is not convincing.
In our response A4, we provided a detailed analysis of the necessity and rationale behind this transition from multiple perspectives.
Your constructive feedback has been invaluable, and we are committed to leveraging this discussion to improve our work. Thank you again for the opportunity to address your questions and for considering our responses.
DPO and SimPO are two popular offline methods for LLM alignment. The authors demonstrate through theoretical analysis that SimPO is a special case of DPO where the reference model is assumed to be a uniform policy. As a result, DPO does not use an optimal reference policy, while SimPO does not take into account data-specific variance in defining a target margin. The authors address these limitations by proposing a novel loss function ($\alpha$-DPO) that relies on data-specific differences to compute a dynamic reward margin. They theoretically demonstrate that the $\alpha$-DPO objective is a lower bound of the SimPO loss. They demonstrate through empirical results that the performance of $\alpha$-DPO is better than the baselines on multiple alignment benchmarks.
Strengths
- The paper is easy to understand. The theoretical analysis showing the limitations of DPO and SimPO, along with the proposed improvements, seems logical.
- The empirical results show that this method outperforms DPO and SimPO with minimal additional complexity
- The ablation studies are very useful in understanding the contributions of different changes to the loss functions
Weaknesses
- As discussed in the limitations, this requires tuning an additional parameter $\alpha$. It is not clear if a single value is used for each pair of benchmark and model. If LC and raw WR values come from different $\alpha$ values, then the results are slightly misleading, since it's not just one model being used for comparison against the benchmarks.
Questions
N/A
We thank the reviewer for their thoughtful feedback and constructive suggestions. Below, we address your comments in detail.
Q1: If LC and raw WR values come from different α values, then the results are slightly misleading since it's not just one model being used for comparison against the benchmarks.
A1: Thank you for your insightful suggestion. We would like to clarify that all LC and raw WR values reported in our experiments were obtained using a fixed $\alpha$ value across all benchmarks.
We observed that introducing $\alpha$-DPO leads to performance improvements across all benchmarks and maintains state-of-the-art performance, albeit with varying degrees of relative improvement. We attribute this phenomenon to current limitations in benchmark evaluation methods, possibly due to inconsistencies introduced by using GPT-4's judgment as a metric.
This trend is not unique to $\alpha$-DPO; it also occurs with other methods, meaning each method has different rankings depending on the benchmark. We believe that developing a truly rigorous and effective benchmark represents an intriguing future research direction.
We hope these additional clarifications address your concerns comprehensively. Thank you again for your thoughtful review and the opportunity to improve our work.
This paper studies how to align LLMs with human preferences. The authors propose a new algorithm, $\alpha$-DPO, which adaptively sets the reward margin based on the ratio between the reference model and the policy model. They prove that the objective of $\alpha$-DPO is a lower bound on the online SimPO loss. They also conduct experiments on AlpacaEval 2 and Arena-Hard to validate the empirical performance of $\alpha$-DPO.
Strengths
a. The authors propose a new objective for LLM alignment and conduct extensive experiments on several benchmarks. The proposed algorithm outperforms baselines on these benchmarks.
b. The presentation is clear and easy to follow.
Weaknesses
a. The proposed method is not technically sound. The derivation begins with an implicit reference model, but this model is neither well-motivated nor justified. First, Equation 8 does not have a normalization factor, and tuning the hyperparameter very likely results in an invalid distribution. Additionally, it’s unclear why this implicit reference model is necessary instead of using a standard SFT model. The authors consider a special case incorporating the ratio between the policy model and the reference model, but there is no clear rationale for why such a ratio is required.
b. The theoretical analysis is problematic. First, relating the objective to the online SimPO loss is not meaningful, as the online SimPO loss itself lacks theoretical guarantees. Second, Lemma 4.2 claims that the objective of $\alpha$-DPO provides a lower bound on the online SimPO loss. However, minimizing a lower bound is questionable, since the gap between the true value and the lower bound is unknown; minimizing an upper bound would be more meaningful. The statement, “the lower-bounding property provides theoretical guarantees that … not perform worse than online SimPO loss, ensuring convergence to a well-generalized policy,” is confusing. How can minimizing a lower bound provide such a strong theoretical guarantee?
c. The experimental improvement is marginal, typically less than 1.5%. Given that benchmark evaluations such as AlpacaEval 2 and Arena-Hard rely on GPT-4’s judgment, which can vary by 1-2%, these improvements may not be convincing, especially when the method involves at least one more hyperparameter than the baselines. Furthermore, the authors evaluate performance on only two benchmarks, which is limited, particularly in LLM alignment experiments. More evaluations on academic benchmarks like MT-Bench, MMLU, GSM8K, and TruthfulQA are required. Additionally, most experiments involve models already trained using RLHF methods. Testing on models without RLHF, such as Llama-3-8B, would be necessary to confirm that the proposed algorithm does not rely on pre-existing alignment.
Questions
See Weaknesses.
Q4: First, relating the objective to the online SimPO loss is not meaningful, as the online SimPO loss itself lacks theoretical guarantees.
A4: It is well known that preference optimization with online ingredients, particularly methods like online AI feedback (OAIF) [1] and self-play [2,3], enhances model alignment by generating new data during training.
However, due to the computational cost of sample regeneration and training stability issues, purely online preference optimization methods are not widely applied in industrial practice. Based on this observation, we propose to study an interesting research question: can we mimic the online feature in the classic offline setting? Our theoretical analysis attempts to close this gap with the importance sampling trick.
In particular, the underlying logic first defines the expression for online SimPO, which is characterized by the continuous updating of sampled data during training. In this process, the data transitions from the offline set to the online set. This operation can be viewed as an importance sampling method. Interestingly, this key technique of importance sampling aligns with the optimization direction of the $\alpha$-DPO loss, with both approaches converging in the scenario where $\alpha \to 0$. Consequently, Lemma 4.2 aims to establish a connection between $\alpha$-DPO and the online SimPO loss, thereby facilitating the integration of these concepts.
[1] Guo et al. Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792 (2024)
[2] Chen et al. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[3] Wu et al. Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675 (2024)
Q5: However, minimizing a lower bound is questionable since the gap between the true value and the lower bound is unknown. Minimizing an upper bound would be more meaningful.
A5: We apologize for any confusion. A more precise statement is needed: for any policy model and reference model, there exists a sufficiently small $\alpha$ such that the following inequalities hold:
This result demonstrates that optimizing $\alpha$-DPO provides a tight approximation of online SimPO, which is crucial for achieving performance improvements. In response to this feedback, we have revised Lemma 4.2 and included the supporting proofs. Thank you for your valuable suggestion.
Q6: The experimental improvement is marginal, typically less than 1.5%.
A6: We believe $\alpha$-DPO's improvements are significant. For example, AlpacaEval2 (LC) shows improvements of nearly 6% on most models. Arena-Hard benchmarks also demonstrate over 5% gains, except for gemma2-9b. We conducted a comparison of the relevant results from Table 1, focusing on SimPO and $\alpha$-DPO. Detailed results are provided below:
| Metric | LC | WR | SC | LC | WR | LC | WR | SC | LC | WR |
|---|---|---|---|---|---|---|---|---|---|---|
| SimPO | 43.8 | 38.0 | 33.5 | 33.5 | 32.6 | 30.2 | 32.1 | 25.6 | 19.8 | 20.1 |
| $\alpha$-DPO | 46.6 | 38.1 | 34.1 | 34.2 | 33.3 | 32.3 | 32.6 | 27.2 | 21.5 | 21.5 |
| improv. | 6.39% | 0.26% | 1.79% | 2.09% | 2.15% | 6.95% | 1.56% | 6.25% | 8.59% | 6.97% |
| Metric | LC | WR | SC | LC | WR | LC | WR | SC | LC | WR |
|---|---|---|---|---|---|---|---|---|---|---|
| SimPO | 55.6 | 49.6 | 28.5 | 34.0 | 33.6 | 72.4 | 65.0 | 45.0 | 56.1 | 57.8 |
| $\alpha$-DPO | 58.7 | 51.1 | 30.8 | 36.3 | 35.7 | 73.4 | 66.1 | 48.6 | 59.3 | 60.8 |
| improv. | 5.58% | 3.02% | 8.07% | 6.76% | 6.25% | 1.38% | 1.69% | 8.00% | 5.70% | 5.19% |
We thank the reviewer for their thoughtful feedback and constructive suggestions. Below, we address your comments in detail.
Q1: First, Equation 8 does not have a normalization factor, and tuning the hyperparameter very likely results in an invalid distribution.
A1: We recognize that the expression lacks rigor. A more precise formulation is:
This expression combines the benefits of DPO and SimPO. We have revised the manuscript to reflect this clarification.
Q2: Additionally, it’s unclear why this implicit reference model is necessary instead of using a standard SFT model.
A2: The assumption that the SFT reference policy has large errors serves as the motivation for proposing $\alpha$-DPO. This assumption is discussed in Section 3.1, where we highlight the unreliability of the reference policy in DPO.
Q3: The authors consider a special case incorporating the ratio between the policy model and the reference model, but there is no clear rationale for why such a ratio is required.
A3: The motivation for the proposed reference policy can be clarified as follows:
-
Utility Theory Perspective: The proposed reference policy is designed with the uniform distribution as a baseline. The policy-to-reference ratio term dynamically adjusts the reward margin by balancing contributions from the policy and reference models. This mechanism can be interpreted through the lens of utility theory as relative attractiveness, enabling adaptive, instance-specific reward modeling.
-
Gradient Perspective: By introducing this ratio term, the framework mitigates the label-flipping issues found in DPO and SimPO. In the SimPO framework, the gradient is expressed as a weighted term, where the weight depends on the current reward estimate. This formulation may amplify the weights when the reward estimate is incorrect. By contrast, under $\alpha$-DPO, the additional component increases the weight when the reward estimate is accurate, ensuring a more robust reward signal.
-
Motivational Core: The central goal of the proposed $\alpha$-DPO is to address the unreliability of the reference policy, as outlined in Section 3.1. By integrating the policy model into the reference model design, the quality of the reference model is enhanced, improving fine-tuning performance. Similar concepts have been explored in recent works [1][2].
We have also provided a theoretical justification through reward difference estimation. For more details, please refer to our response (A4) to Reviewer 8xwP.
[1] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
Q7: Furthermore, the authors evaluate performance on only two benchmarks, which is limited, particularly in LLM alignment experiments. More evaluations on academic benchmarks like MT-Bench, MMLU, GSM8K, and TruthfulQA are required. Testing on models without RLHF, such as Llama-3-8B, would be necessary to confirm that the proposed algorithm does not rely on pre-existing alignment.
A7: Thank you for your suggestion. We have incorporated a new benchmark into our study. Please note that all experiments have been conducted using the Llama-3-8B-BASE model to address your concerns regarding pre-existing alignment.
| Method | DPO (Llama3-8B-Base) | SimPO (Llama3-8B-Base) | $\alpha$-DPO (Llama3-8B-Base) |
|---|---|---|---|
| truthfulqa_mc2 | 53.66 | 60.03 | 62.89 |
| gsm8k | 52.90 | 52.84 | 53.90 |
| mmlu | 62.14 | 62.05 | 62.43 |
| MT-Bench | 6.5 | 6.6 | 6.9 |
| LC(AlpacaEval2) | 14.92 | 17.97 | 22.69 |
| WR(AlpacaEval2) | 13.02 | 15.60 | 20.47 |
These results confirm stable performance improvements, and the additional experiments will be included in the paper.
We hope these additional clarifications address your concerns comprehensively. Thank you again for your thoughtful review and the opportunity to improve our work.
Thanks for the detailed response. After reading it, I still have the following questions and concerns.
-
From the authors' response, it seems the advantage of the new reference policy is that it increases the weight of the gradient when the reward signal is correct. However, I am unclear why this ensures a more robust reward signal. Would it decrease the weight when the reward signal is incorrect? Additionally, there are no experimental results demonstrating that $\alpha$-DPO achieves better performance when the reward signal contains noise. If the advantage is simply to increase the weight, why not just use a larger weight? The motivation for introducing the new reference policy is still unclear to me.
-
The theoretical guarantees of online algorithms largely rely on iteratively updating the preference dataset. Proving that the loss values of online SimPO and $\alpha$-DPO are close for a fixed dataset does not seem particularly meaningful. Furthermore, a close loss value does not imply much; at the very least, you need to show that the gradients are close. Moreover, is there any theoretical guarantee provided for SimPO?
-
Regarding the experimental results, my concern is that the win rate is judged by GPT-4, where the absolute judgment noise can vary by 1-2%. This makes the reported improvement appear marginal. Since the win rate is typically small, reporting relative improvement feels more like a way to inflate the percentage, which I find not very meaningful.
Q12: Furthermore, a close loss value does not imply much—at the very least, you need to show that the gradients are close.
A12: The gradient directions of online SimPO and $\alpha$-DPO are consistent. At optimization step $t$, refining the policy updates its implicit reward, assigning higher probabilities to preferred responses and lower probabilities to less preferred ones. This adjustment ensures that newly generated samples align with the corrected importance weight direction. Below, we explicitly compare the gradients of the two methods:
For Online SimPO, the gradient is given by:
where
For $\alpha$-DPO, the gradient is:
where
The primary difference lies in the importance weights (one for online SimPO and one for $\alpha$-DPO). While online SimPO adjusts weights based solely on the importance sampling correction term, $\alpha$-DPO incorporates an additional term to further modulate the reward signal. This allows $\alpha$-DPO to refine alignment more effectively in scenarios where reward noise exists.
Thus, both methods adjust the importance weights during training, prioritizing positive samples and penalizing negative samples. However, the gradient modulation introduced in $\alpha$-DPO provides a more robust mechanism for alignment and diversity control.
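As a small sanity check of the weighting view used in this comparison, the sketch below verifies that the per-pair gradient of a generic $-\log\sigma(u)$ preference loss is scaled by $\sigma(-u)$; it is deliberately generic to both losses and does not reproduce the exact reward-difference or importance-weight terms, which are defined in the paper.

```python
import torch
import torch.nn.functional as F

# u stands for the (margin-adjusted) reward difference fed into the logistic loss.
u = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)
loss = -F.logsigmoid(u).sum()   # sum of per-pair losses -log sigma(u_i)
loss.backward()

# Analytically, d/du [-log sigma(u)] = -sigma(-u): pairs with small or negative u
# (poorly fit preferences) receive the largest gradient weight.
print(u.grad)                   # autograd gradient
print(-torch.sigmoid(-u))       # matches the analytic weight
```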
Q13: Moreover, is there any theoretical guarantee provided for SimPO?
A13: In Section 3.1, we demonstrate that SimPO can be interpreted as DPO with a uniform reference distribution. This aligns with SimPO’s motivation to address the unreliability of the reference model by adopting a reference-free approach. Furthermore, SimPO has shown to be the most effective variant of DPO, justifying further exploration of its validity.
Q14: Regarding the experimental results, my concern is that the win rate is judged by GPT-4, where the absolute judgment noise can vary by 1–2%.
A14: We respectfully disagree with this concern. Benchmarks such as AlpacaEval2 and ArenaHard are widely considered the most reliable for evaluating LLM performance. Their adoption is largely due to the high consistency of GPT-4 judgments with human evaluations, exceeding 98% agreement.
For instance, on AlpacaEval2 (LC metric), $\alpha$-DPO demonstrates significant improvements:
- On Llama3-Instruct v0.2 (8B): 51.9 (DPO, NeurIPS 2023) → 55.6 (SimPO, NeurIPS 2024) → 58.7 ($\alpha$-DPO).
- On Llama3-Instruct (8B): 40.2 (DPO, NeurIPS 2023) → 40.3 (R-DPO, ACL 2024) → 43.8 (SimPO, NeurIPS 2024) → 46.6 ($\alpha$-DPO).
We believe these improvements are substantial and meaningful.
We greatly appreciate the opportunity to engage in a substantive dialogue with you, aimed at further improving the quality and clarity of our work. Below, we provide a detailed summary addressing your key concerns. We hope that this explanation will sufficiently address your questions and merit your endorsement.
1. Motivation for the proposed reference policy (Eq. 8)
First, let us clarify that the primary motivation for proposing the reference policy in Eq. (8) arises from the limitations of the conventional setup, which often exhibits significant randomness and errors. Addressing this limitation is a core objective of our work, a perspective that is similarly reflected in SimPO [1] (Section 3.2 and Figure 4b) and Liu et al. [2] (Section 5).
Our goal is thus to design a more robust and effective reference policy, as expressed in Eq. (8). To provide additional context, we elaborate on the following key points:
-
Alignment with online SimPO optimization. Preference optimization with online elements (e.g., online AI feedback [3] and self-play [4][5]) is widely acknowledged to improve model alignment by generating new data during training. However, the high computational cost and instability of purely online methods limit their practicality in industrial applications. Our work explores whether online-like behavior can be emulated in a classical offline setting. Theoretical analysis based on importance sampling is provided to bridge this gap.
-
Structural advantages. A major contribution of token-level DPO is the introduction of a form similar to DPO with an offset, which helps control the KL divergence. Our $\alpha$-DPO framework leverages a related form and enhances performance. Appendix Table 6 provides empirical evidence, showing the performance advantages of $\alpha$-DPO over TDPO.
-
Mitigating the impact of label flipping noise. Within the SimPO framework, the gradient is expressed as a weighted update,
where the weight is determined by the current reward estimate.
This formulation amplifies the weights when the reward estimate is inaccurate, increasing noise in the gradients. In contrast, $\alpha$-DPO introduces an additional term:
This term increases the weight when the reward estimate is accurate and decreases it when the reward estimate is inaccurate, mitigating noise amplification.
-
Improving reference policy quality. This work directly addresses the unreliability of reference policies (Section 3.1). By integrating the policy model into the reference model's design, we enhance the quality of the reference model and improve fine-tuning performance. Similar ideas have been explored in recent works [2][6], highlighting the broader relevance of this approach.
References:
[1] Meng et al. (2024): Simpo: Simple preference optimization with a reference-free reward. NeurIPS 2024.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
[3] Guo et al. (2024): Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792.
[4] Chen et al. (2024): Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[5] Wu et al. (2024): Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675.
[6] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
Q8: However, I am unclear why this ensures a more robust reward signal. Would it decrease the weight when the reward signal is incorrect?
A8: Yes. When the reward signal is incorrect, the additional term decreases the weight, as defined by the weighting expression in our formulation.
Q9: Additionally, there are no experimental results demonstrating that $\alpha$-DPO achieves better performance when the reward signal contains noise.
A9: The existence of noise in reward signals has been well established in the literature, and addressing this issue is the primary motivation for methods such as cDPO, rDPO and SimPO. Our work assumes this general observation and focuses on demonstrating improvements under standard settings.
Q10: If the advantage is simply to increase the weight, why not just use a larger weight? The motivation for introducing the new reference policy is still unclear to me.
A10: First, the core motivation for introducing the proposed reference policy is to address the unreliability of existing reference policies (consistent with the motivation of SimPO). Additionally, we aim to alleviate the limitation of SimPO’s fixed target-margin setting for all samples. To this end, we propose a novel design for the reference policy.
Second, we have shown that the proposed term can be interpreted as a reweighting mechanism. Furthermore, we believe it can be analogized to a dynamic-margin approach. However, since no existing work has implemented such a technique, we consider it a non-trivial contribution.
Finally, we do not claim that the advantage of $\alpha$-DPO is simply to increase the weight. This is just one intuitive explanation among others. For example, additional advantages include:
-
Addressing the unreliability of the reference policy (Section 3.1): By integrating the policy model into the reference model, the quality of the reference model is improved, leading to enhanced fine-tuning performance. Similar concepts have been explored in recent studies [1][2].
-
Balancing alignment and diversity via KL divergence control (Section 4.1): We demonstrated that $\alpha$-DPO shares the same advantage as TDPO, effectively achieving KL divergence control.
[1] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
Q11: Proving that the loss value of online SimPO and $\alpha$-DPO is close for a fixed dataset does not seem particularly meaningful.
A11: We respectfully disagree with the assertion that simulating online algorithms on a fixed dataset is not meaningful. Due to computational cost, sample regeneration, and training stability issues, purely online preference optimization methods are often not practical for industrial applications. A common alternative is to employ importance sampling. Similar approaches are prevalent in reinforcement learning literature [3][4][5][6].
[3] Sergey Levine et al.: "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems," CoRR abs/2005.01643 (2020).
[4] Alberto Maria Metelli et al.: "Policy Optimization via Importance Sampling," NeurIPS 2018.
[5] Tengyang Xie et al.: "Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling," NeurIPS 2019.
[6] Philip J. Ball et al.: "Efficient Online Reinforcement Learning with Offline Data," ICML 2023.
2. The significance of theoretical guarantees
In our response A11, we emphasized that importance sampling is a widely adopted strategy in the literature on preference optimization and reinforcement learning.
In our response A12, we analyzed the advantages of $\alpha$-DPO from a gradient perspective, highlighting its alignment with the optimization direction of online SimPO.
Finally, we would like to emphasize that SimPO has already demonstrated its strengths in various experiments. Our method extends SimPO to personalized scenarios, with the parameter $\alpha$ serving as a critical control factor. Specifically:
- The choice of $\alpha$ strikes a balance between two competing objectives. When $\alpha = 0$, the method simplifies to SimPO, where all training samples share the same target margin.
- As $\alpha$ increases, the model places greater weight on samples with accurate reward estimates.
- However, excessively large values of $\alpha$ cause the model to focus solely on a few samples with the most accurate reward estimates, neglecting the contributions of other samples, which can hinder the training process.
This trade-off underscores the necessity of selecting an appropriate $\alpha$ to balance conservativeness and aggressiveness effectively.
3. The improvements are not significant.
In our response A14, we first highlighted that AlpacaEval2 and Arena-Hard are two of the most widely used benchmarks. These benchmarks have been validated using GPT-4 Turbo, achieving over 98% agreement with human evaluations. Therefore, we believe the scores derived from these benchmarks are sufficiently reliable.
Furthermore, considering the development trends of current methodologies—from DPO to various approaches such as IPO, KTO, CPO, and the state-of-the-art SimPO introduced this year—our improvements over SimPO are both substantial and stable. We encourage the reviewers to re-examine the progress in existing techniques and recognize the significance of our contributions within this context.
Your constructive feedback has been invaluable, and we are committed to leveraging this discussion to improve our work. Thank you again for the opportunity to address your questions and for considering our responses.
This paper finds that SimPO shares the same offset across all samples, which leads to suboptimal performance. To address this limitation, the authors construct a novel reference model based on SimPO and DPO, resulting in a novel method ($\alpha$-DPO). The authors provide theoretical analysis of the lower bound and its connections to TDPO. Extensive experiments reveal the effectiveness of the proposed method over various baselines and across different LLM structures.
Strengths
-
Extensive experiments reveal that $\alpha$-DPO outperforms various baselines across different LLM structures.
-
The authors try to connect the proposed methods with other existing alignment methods, which is interesting.
Weaknesses
-
My major concern is that the motivation of the proposed reference policy Eq. (8) is not so clear. I understand that the authors want to construct a new reference policy that takes advantage of both DPO and SimPO. However, I think the authors should discuss in more detail why it takes this form and what its advantages are. I understand that the authors provide some theoretical analysis to convince the readers, but I also think the motivation is important.
-
The proofs should be discussed in more detail. For example, in Lemma 4.2, in line 862, the authors use the Taylor expansion of $\log(A - \alpha B)$ at $\alpha = 0$. This makes the lower bound hold only around zero, which limits the theoretical contribution. Moreover, if $\alpha = 0$, then according to (8), we recover the uniform reference. But as discussed in lines 152-153, this is the reference function in SimPO? Then, does $\alpha$-DPO degenerate to SimPO? If so, I think Lemma 4.2 does not provide a sufficiently informative lower bound for $\alpha$-DPO.
Moreover, in lines 913 and 914, the authors state that this holds "under the assumption that the reference policy has large errors", but it seems that I do not find this assumption in the lemma; the authors should mention it directly in the lemma and discuss why this assumption is mild.
Questions
Please answer my questions mentioned above.
We thank the reviewer for their thoughtful feedback and constructive suggestions. Below, we address your comments in detail.
Q1: My major concern is that the motivation of the proposed reference policy Eq.(8) is not so clear.
A1: The motivation for the proposed reference policy can be clarified as follows:
-
Utility Theory Perspective: The proposed reference policy is designed with the uniform distribution as a baseline. The policy-to-reference ratio term dynamically adjusts the reward margin by balancing contributions from the policy and reference models. This mechanism can be interpreted through the lens of utility theory as relative attractiveness, enabling adaptive, instance-specific reward modeling.
-
Gradient Perspective: By introducing this ratio term, the framework mitigates the label-flipping issues found in DPO and SimPO. In the SimPO framework, the gradient is expressed as a weighted term, where the weight depends on the current reward estimate. This formulation may amplify the weights when the reward estimate is incorrect. By contrast, under $\alpha$-DPO, the additional component increases the weight when the reward estimate is accurate, ensuring a more robust reward signal.
-
Motivational Core: The central goal of the proposed $\alpha$-DPO is to address the unreliability of the reference policy, as outlined in Section 3.1. By integrating the policy model into the reference model design, the quality of the reference model is enhanced, improving fine-tuning performance. Similar concepts have been explored in recent works [1][2].
We have also provided a theoretical justification through reward difference estimation. For more details, please refer to our response (A4) to Reviewer 8xwP.
[1] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
Q2: The authors use the Taylor expansion of $\log(A - \alpha B)$ at $\alpha = 0$. This makes the lower bound only established around zero, which limits the theoretical contribution.
A2: Thank you for pointing this out. We apologize for the imprecise explanation in the original submission. The correct statement is:
By expanding around using the first-order Taylor expansion, we obtain: ...
- The expansion is expanded around , not .
- Using the first-order Taylor expansion for at , we derive:
where the corresponding derivative is used. This derivation ensures mathematical validity and is not limited to the neighborhood of $\alpha = 0$. We have updated the manuscript to reflect this clarification.
Q3: Moreover, in lines 913 and 914, the authors state that "under the assumption that the reference policy has large errors", but it seems that I do not find it in the lemma. The authors should mention it directly in the lemma and discuss why this assumption is mild.
A3: Thank you for pointing this out. The assumption that the reference policy has large errors serves as the motivation for proposing $\alpha$-DPO. This assumption is discussed in Section 3.1, where we highlight the unreliability of the reference policy in DPO. To make this point clearer, we have revised the manuscript to include this assumption explicitly in the lemma and expanded the discussion to emphasize its practical relevance and mildness. Specifically:
- The unreliability of the reference policy arises due to the static nature of its design, which may not generalize well across diverse datasets or instances.
- Incorporating the policy model into the reference policy design mitigates these limitations and improves the alignment between the reward model and policy fine-tuning objectives.
This revision will ensure that the lemma is self-contained and directly addresses the context of the assumption.
We hope these additional clarifications address your concerns comprehensively. Thank you again for your thoughtful review and the opportunity to improve our work.
Thank you to the authors for the detailed responses. However, I still have some concerns:
-
I understand that when the reward estimate is accurate, the additional term increases the weight, ensuring a more robust reward signal. However, when the reward estimate is inaccurate, could this hinder training? Is there any empirical evidence demonstrating that the reward estimate is consistently accurate (or accurate most of the time)?
-
I understand that $\alpha$ can be non-zero, but lines 891 and 841 suggest that $\alpha$ still needs to be very small. This means that $\alpha$-DPO still needs to be quite similar to SimPO for the theory to hold.
Thank you for your comments. Below are our responses to your concerns:
Q4: However, when the reward estimate is inaccurate, could this hinder training? Is there any empirical evidence demonstrating that the reward estimate is consistently accurate (or accurate most of the time)?
A4: When the reward estimate is inaccurate, it does not hinder training. Specifically, when the reward signal is incorrect, the additional term decreases the weight accordingly.
This reduces the influence of incorrect signals, aligning with the design motivation of $\alpha$-DPO.
Although we could not provide additional experiments within the rebuttal period due to time constraints, we have strong reasons to believe that the reward estimate is consistently accurate. Specifically, the quality of the reference model improves with the fine-tuned policy, as shown in our theoretical and empirical analyses.
Q5: I understand that $\alpha$ can be non-zero, but lines 891 and 841 suggest that $\alpha$ still needs to be very small. This makes $\alpha$-DPO still quite similar to SimPO for the theory to hold.
A5: Based on both theoretical considerations (to maintain similarity to SimPO and support the theoretical framework) and empirical results (validated by the parameter selections in Appendix Table 3 and Table 4), $\alpha$ appears to be effective when kept small. Our findings suggest that using a small $\alpha$, such as the default value, provides stable and significant performance improvements, aligning well with theoretical expectations and experimental observations.
The choice of $\alpha$ reflects a trade-off. When $\alpha$ is set to 0, the approach reduces to SimPO. As $\alpha$ increases, the model places greater emphasis on samples where the reward estimate is correct. However, if $\alpha$ becomes too large, the model may become overly conservative, which can negatively affect training by limiting its ability to explore diverse solutions. This trade-off underscores the importance of selecting an appropriate $\alpha$ to balance alignment and adaptability.
We greatly appreciate the opportunity to engage in a substantive dialogue with you, aimed at further improving the quality and clarity of our work. Below, we provide a detailed summary addressing your key concerns. We hope that this explanation will sufficiently address your questions and merit your endorsement.
1. Motivation for the proposed reference policy (Eq. 8)
First, let us clarify that the primary motivation for proposing the reference policy in Eq. (8) arises from the limitations of the conventional setup, which often exhibits significant randomness and errors. Addressing this limitation is a core objective of our work, a perspective that is similarly reflected in SimPO [1] (Section 3.2 and Figure 4b) and Liu et al. [2] (Section 5).
Our goal is thus to design a more robust and effective reference policy, as expressed in Eq. (8). To provide additional context, we elaborate on the following key points:
-
Alignment with online SimPO optimization. Preference optimization with online elements (e.g., online AI feedback [3] and self-play [4][5]) is widely acknowledged to improve model alignment by generating new data during training. However, the high computational cost and instability of purely online methods limit their practicality in industrial applications. Our work explores whether online-like behavior can be emulated in a classical offline setting. Theoretical analysis based on importance sampling is provided to bridge this gap.
-
Structural advantages. A major contribution of token-level DPO is the introduction of a form similar to DPO with an offset, which helps control the KL divergence. Our $\alpha$-DPO framework leverages a related form and enhances performance. Appendix Table 6 provides empirical evidence, showing the performance advantages of $\alpha$-DPO over TDPO.
-
Mitigating the impact of label flipping noise. Within the SimPO framework, the gradient is expressed as a weighted update,
where the weight is determined by the current reward estimate.
This formulation amplifies the weights when the reward estimate is inaccurate, increasing noise in the gradients. In contrast, $\alpha$-DPO introduces an additional term:
This term increases the weight when the reward estimate is accurate and decreases it when the reward estimate is inaccurate, mitigating noise amplification.
-
Improving reference policy quality. This work directly addresses the unreliability of reference policies (Section 3.1). By integrating the policy model into the reference model's design, we enhance the quality of the reference model and improve fine-tuning performance. Similar ideas have been explored in recent works [2][6], highlighting the broader relevance of this approach.
References:
[1] Meng et al. (2024): Simpo: Simple preference optimization with a reference-free reward. NeurIPS 2024.
[2] Liu et al. (2024): Understanding Reference Policies in Direct Preference Optimization. arXiv preprint arXiv:2407.13709.
[3] Guo et al. (2024): Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792.
[4] Chen et al. (2024): Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[5] Wu et al. (2024): Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675.
[6] Gorbatovski et al. (2024): Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.
2. How does the reweighting term affect the training process?
As elaborated in A4, we provide an intuitive understanding of how the reweighting term aids training. When the reward signal is correct, the model assigns a higher weight to the sample for updating. Conversely, when the reward signal is incorrect, the model reduces the weight, mitigating the impact of erroneous signals.
3. Why does $\alpha$ need to be very small?
The parameter $\alpha$ is not necessarily better when smaller; rather, it represents a trade-off. Specifically:
The choice of $\alpha$ balances two competing objectives. When $\alpha = 0$, the method simplifies to SimPO, where all training samples are assigned the same target margin. As $\alpha$ increases, the model places greater emphasis on samples with accurate reward estimates. However, excessively large values of $\alpha$ cause the model to focus solely on a few samples with the most accurate reward estimates, neglecting the contributions of other samples, which can hinder the training process. This trade-off highlights the critical need to select an appropriate $\alpha$ to balance conservativeness and aggressiveness.
Finally, as shown in Figure 4(a) of our experiments, model performance initially improves and then declines as $\alpha$ increases, aligning well with the theoretical understanding of this trade-off.
Your constructive feedback has been invaluable, and we are committed to leveraging this discussion to improve our work. Thank you again for the opportunity to address your questions and for considering our responses.
The paper proposes a new loss for fine-tuning language models on preference data in a reward-free fashion, along the lines of DPO and SimPO. In particular, the paper aims to address the limitations of the DPO and SimPO losses by introducing an instance-dependent offset in the SimPO loss. The loss and its relations to SimPO and the token-level DPO loss have been studied. Experiments have been conducted using standard benchmarks, along with ablations over various hyperparameters.
Strengths
Strength:
- Considering an instance-specific margin compared to SimPO is interesting.
- The experimental setup and analysis, along with ablations, are exhaustive.
- Reward differences between chosen and rejected responses and log probabilities of the chosen response w.r.t. the fine-tuned model are shown. Further, the KL divergence from the base model has been studied.
- Limitations of DPO and SimPO are discussed.
I have read the rebuttal by the authors and thank them for clarification. Those helped me better understand and as such I raised my score.
Weaknesses
- To the best of my understanding, Theorem 3.1 seems not correct? In particular, the margin term has been defined to be the log-difference of the chosen vs. rejected response w.r.t. a uniform policy. However, as a uniform policy assigns equal probabilities to all responses, this directly implies that this log-difference is zero. As the entire Theorem 3.1 depends on this to connect SimPO to a uniform reference distribution, this theorem seems meaningless. Can the authors please elaborate and explain this point? Maybe I misunderstood something?
- In their derivation of the DPO objective, they again use the log-difference of the chosen vs. rejected response w.r.t. a uniform policy and then claim that when it is 0, they get back the SimPO loss. That is again confusing, as detailed above. Can you kindly elaborate?
- Their margin term is the same as the inner term in the DPO loss. So essentially, their proposed loss in Eqn. 12 combines the SimPO and DPO losses with an extra stop gradient on the DPO loss. Is this correct?
- In Section 4.1, they aim to connect the DPO and SimPO losses. However, they introduce an importance sampling correction term for the online SimPO loss. It is unclear how that is relevant to Lemma 4.2 or any other aspect of the paper. Can you please help me better understand this part?
- I am not sure how Lemma 4.3 contributes. The margin term M is the same as the inner term of DPO, so it is evident that the proposed margin is close to the corresponding term in the token-level DPO loss. Further, their claim that they improve upon token-level DPO is not correct. Token-level DPO was proposed to improve upon DPO, but at a higher computational cost. One cannot revert back to a sequence-level loss, as in DPO, from a token-level loss and claim improvement/novelty. All previous works use this sequence-level loss.
- It feels that several parts of the paper, including the discussions, may have been written using AI tools, e.g., the discussion of Theorem 3.1 and Lemma 4.1. I could very well be wrong; I am just trying to understand.
Questions
Please see above
We thank the reviewer for their thoughtful feedback and constructive suggestions. Below, we address your comments in detail.
Q1: However, as a uniform policy assigns equal probabilities to all responses, this directly implies γ = 0.
A1: We first want to justify that the uniform distribution assumption is appropriate and that it does not lead to γ = 0. The uniform distribution assumption on an LLM describes a model that has no alignment with human preferences, which commonly occurs in LLMs without sufficient high-quality instruction tuning. The positive/negative samples are generated via a rejection-sampling procedure on the LLM, and the corresponding distributions are different.
Let us revisit the acquisition procedure for these samples (e.g., using the princeton-nlp/llama3-ultrafeedback dataset): for each prompt x, we generate 5 responses using the SFT model with a sampling temperature of 0.8. We then score these responses with llm-blender/PairRM and select the highest-scoring one as y_win and the lowest-scoring one as y_lose.
For the highest-scoring sample y_win, the induced conditional distribution is no longer that of the generator: it is the distribution of the best-of-5 response under the PairRM scores, where the derivation uses the assumption that, for a given x, the generative model has a uniform chance of producing every candidate y. Similarly, the lowest-scoring sample y_lose follows the worst-of-5 distribution. Each response is initially generated from the same conditional distribution π_SFT(·|x), implying identical distributions at generation time. However, scoring and selecting with llm-blender/PairRM introduces an inherent selection bias, and in this setting the resulting value of γ is nonzero.
By further assuming that this situation holds uniformly over all x, we reduce to the SimPO configuration. Thus, the inference that "uniform policy assigns equal probabilities to all responses" forces γ = 0 is inaccurate.
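To make the selection-bias argument concrete, here is a small, purely illustrative simulation (not from the paper): even when the 5 candidates per prompt are drawn uniformly from a toy response set, picking the best- and worst-scoring candidates induces clearly different distributions for y_win and y_lose. The scoring function is a synthetic stand-in for llm-blender/PairRM.

```python
# Illustrative simulation: best-of-5 / worst-of-5 selection under a scorer
# induces distinct distributions for y_win and y_lose, even though every
# candidate is drawn uniformly at generation time.
import numpy as np

rng = np.random.default_rng(0)
num_responses = 10                      # toy discrete response space
score = rng.normal(size=num_responses)  # fixed synthetic quality scores (PairRM stand-in)

win_counts = np.zeros(num_responses)
lose_counts = np.zeros(num_responses)
for _ in range(100_000):
    cand = rng.integers(0, num_responses, size=5)   # 5 i.i.d. uniform draws
    win_counts[cand[np.argmax(score[cand])]] += 1   # highest-scoring -> y_win
    lose_counts[cand[np.argmin(score[cand])]] += 1  # lowest-scoring  -> y_lose

p_win = win_counts / win_counts.sum()
p_lose = lose_counts / lose_counts.sum()
print(np.round(p_win, 3))    # far from uniform (0.1 each)
print(np.round(p_lose, 3))   # different again, so the induced gamma need not be 0
```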
Q2: They claim that when γ is 0, they get back the SimPO loss.
A2: Please note that, per the discussion in A1, the uniform assumption does not lead to γ = 0. Based on the definition in Eq. (8), when α = 0, the reference policy becomes the uniform distribution. As in A1, the uniform distribution can mitigate the unreliability of π_ref, but it still lacks sample-level personalization and remains suboptimal. Our α-DPO objective is designed to incorporate the strengths of both DPO and SimPO. We thank the reviewer for pointing this out and will add a discussion in Section 3.1 to further clarify it.
Q3: Their proposed loss in Eq. (12) combines the SimPO and DPO losses with an extra stop gradient on the DPO loss.
A3: Indeed, from a formal perspective, α-DPO appears as DPO with an additional stop gradient on the margin term. However, this result demonstrates that SimPO with an offset (the stop-gradient margin term) can further enhance performance, presenting theoretical benefits (c.f. Section 4) and relationships that, although concise, are effective.
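Under this reading, and only as a sketch, the loss can be written as a SimPO-style length-normalized objective whose target margin is shifted by a detached DPO log-ratio term. The function name, the placement of β, γ, α, and the sign of the shift are our assumptions for illustration, not a verbatim transcription of Eq. (12).

```python
# Minimal PyTorch sketch of the reading confirmed above: a SimPO-style,
# length-normalized objective whose target margin is shifted by a detached
# DPO log-ratio term. Names, default hyperparameters, and the sign of the
# shift are illustrative assumptions.
import torch
import torch.nn.functional as F

def alpha_dpo_like_loss(logp_w, logp_l,          # summed log-probs under pi_theta
                        ref_logp_w, ref_logp_l,  # summed log-probs under pi_ref
                        len_w, len_l,            # response lengths (tensors)
                        beta=2.0, gamma=0.5, alpha=0.1):
    # SimPO part: length-normalized log-probability difference
    simpo_diff = beta * (logp_w / len_w - logp_l / len_l)
    # DPO inner term, detached so it only shifts the per-sample target margin
    dpo_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    margin = gamma + alpha * dpo_margin.detach()
    return -F.logsigmoid(simpo_diff - margin).mean()
```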
We have also provided a theoretical justification through reward difference estimation. For more details, please refer to our response (A4) to Reviewer 8xwP.
Q4: They introduce an importance sampling correction term for the online SimPO loss. How is that relevant to Lemma 4.2 or any other part of the paper?
A4: It is well known that preference optimization with online ingredients, particularly methods like online AI feedback (OAIF) [1] and self-play [2,3], enhances model alignment by generating new data during training.
However, due to the computational cost of sample regeneration and training stability issues, purely online preference optimization methods are not widely adopted in industrial practice. Based on this observation, we propose to study an interesting research question: can we mimic the online behavior within the classic offline setting? Our theoretical analysis attempts to close this gap via an importance sampling trick.
In particular, the underlying logic first defines the expression for online SimPO, which is characterized by the continuous updating of sampled data during training. In this process, the data transitions from the offline set to the online set, an operation that can be viewed as importance sampling. Interestingly, this importance sampling technique aligns with the optimization direction of the α-DPO loss, with both approaches coinciding in a limiting case. Consequently, Lemma 4.2 establishes a connection between α-DPO and the online SimPO loss, thereby facilitating the integration of these concepts.
[1] Guo et al. Direct Language Model Alignment from Online AI Feedback. CoRR abs/2402.04792 (2024)
[2] Chen et al. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024.
[3] Wu et al. Self-Play Preference Optimization for Language Model Alignment. CoRR abs/2405.00675 (2024)
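For intuition only, the importance-sampling idea mentioned in A4 can be sketched as reweighting per-sample offline losses by the likelihood ratio between the current policy and the behavior policy that produced the data; all names below are illustrative, and this is not the derivation used in Lemma 4.2.

```python
# Sketch of the importance-sampling idea: reweight offline per-pair losses so
# the offline objective mimics sampling from the current (online) policy.
import torch

def importance_weighted_loss(per_sample_loss,              # offline per-pair losses
                             logp_theta, logp_behavior):   # log-probs of each pair's data
    # w(y|x) = pi_theta(y|x) / pi_behavior(y|x), computed in log space
    log_w = logp_theta - logp_behavior
    w = torch.exp(log_w - log_w.max()).detach()            # detach + shift for stability
    w = w / w.mean()                                        # self-normalize the weights
    return (w * per_sample_loss).mean()
```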
Q5: I am not sure how Lemma 4.3 contributes. Further, their claim that they improve upon token-level DPO is not correct.
A5: There seems to be a misconception that needs clarification. The core contribution of token-level DPO is the introduction of an offset term into the DPO objective, enabling control over the KL divergence. α-DPO instead follows the form of SimPO with an offset, providing a performance enhancement and showing that the SimPO-style offset is more effective than the token-level one. Appendix Table 6 supports this with performance comparisons between TDPO and α-DPO. These findings illustrate:
- Adding an offset to DPO and its variants is a successful strategy, applicable to both token-level DPO and α-DPO.
- The choice of offset is still an open question. Under the premise of an unreliable reference model, as in the SimPO setting, the offset used in α-DPO outperforms that of token-level DPO, offering valuable insights.
We hope these additional clarifications address your concerns comprehensively. Thank you again for your thoughtful review and the opportunity to improve our work.
We sincerely thank you for your support and for raising the score of our work. We deeply appreciate your recognition of the contributions of α-DPO. Regarding your valuable suggestions on improving the presentation, we will incorporate them into the final version. Thank you again for your thoughtful feedback, which has been instrumental in enhancing the quality of our work.
We thank all reviewers for their valuable and insightful feedback.
We are encouraged that the reviewers found our paper meaningful (Reviewers , , , ). Furthermore, we are grateful that the reviewers recognized the simplicity and effectiveness of our proposed α-DPO algorithm (Reviewers , , , , ). We also appreciate that several reviewers found our paper well-written and easy to follow (Reviewers , , ).
We acknowledge the reviewers' constructive comments and critiques, which helped us to identify areas for improvement. Below, we summarize our discussions and detailed responses to each reviewer’s feedback:
- Reviewer : We provided a detailed explanation of the uniform policy and clarified why it does not result in γ = 0. Considering your positive rating, we believe that your concerns have been addressed. Thank you for your encouraging feedback.
- Reviewer : We elaborated on the motivation behind the proposed reference policy in Eq. (8) from several perspectives, including its connection to online algorithms, the advantage of DPO with an offset, and its ability to mitigate label-flipped noise. Thank you for your constructive and valuable comments.
- Reviewer : We extended the explanation of the proposed reference policy in Eq. (8) by highlighting its connections to online algorithms, the offset advantages of DPO, and its robustness to label-flipped noise. Additionally, we included experimental results on benchmarks such as MT-Bench, MMLU, GSM8K, and TruthfulQA. We believe these extensive comparisons provide strong evidence for the reliability of our method. Thank you for your constructive feedback.
- Reviewer : We clarified the performance trends of different metrics under varying α values, demonstrating that α-DPO achieves superior performance across all benchmarks with a fixed α value. Considering your positive rating, we believe your concerns have been adequately addressed. Thank you for your encouraging feedback.
- Reviewer : We provided an in-depth explanation of the motivation behind the proposed reference policy in Eq. (8), focusing on its connection to online algorithms, the advantages of DPO with an offset, and its capability to mitigate label-flipped noise. Thank you for your constructive comments.
As the author-reviewer discussion phase comes to a close, we sincerely hope that our responses and improvements have addressed your concerns effectively. If there are any remaining questions, we are more than happy to provide further clarifications. Once again, we thank all reviewers for their thoughtful efforts in improving the quality of our work.
This paper proposes a novel strategy for LLM alignment designed to address the limitations of simple preference optimization (SimPO) and direct preference optimization (DPO). There were remaining concerns about the clarity of the exposition, the scope of the theoretical contributions, and the significance of the empirical results. Thus, the paper still requires significant revision before it can be considered for acceptance.
Additional Comments from the Reviewer Discussion
There was significant discussion during the rebuttal period. During this time, the authors addressed some of the technical concerns regarding the correctness of the paper. However, there were lingering concerns about the scope of the contribution, especially from Reviewer RjJZ but also to a lesser degree from the other reviewers.
Reject