Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model
a better alternative to the Bradley-Terry preference model
Abstract
Reviews and Discussion
This paper introduces a novel alternative to DPO for training LLMs, using an energy-based preference model (EBM). The proposed method, Energy Preference Alignment (EPA), seemingly guarantees a unique MLE and consistently outperforms DPO in empirical benchmarks, providing better alignment with human preferences and faster convergence during training.
Questions To Authors
1. My primary concern and question lie in Eq. (11) and Eq. (12). The authors introduce cross-pairs of responses between mini-batches to increase the number of negatives (mismatched responses y_j with index j ≠ i). Does this imply that the proposed training objective relies on a perfectly clean dataset? In other words, does the reward of the chosen response y_i always need to be greater than that of the rejected response and of the mismatched responses y_j when j ≠ i?
2. What is the rationale behind computing the reward for irrelevant x and y pairs? Could the authors provide a more solid theoretical justification for introducing negatives in Eq. (11) and Eq. (12)?
3. Could the authors provide an ablation study regarding the batch size? I noticed that the global batch size is set to 64; does the performance improvement over DPO depend on the use of a large batch size?
4. In the current experiments, the introduction of more negatives is confined to a mini-batch. Could the authors explore the possibility of randomly sampling more mismatched responses from the training data to increase the number of negatives in the mini-batch, thereby achieving further performance improvements?
5. Furthermore, could the authors provide a performance comparison between EPA and DPO under the same training time? Can Figure 2(b) be interpreted as showing that, given the same computational budget, EPA achieves faster convergence and better performance compared to DPO?
6. Additionally, some performance results in the current experiments seem relatively weak; only the AlpacaEval and MT-Bench results are reported. I would expect to see more benchmark evaluations, such as Arena-Hard.
Claims And Evidence
Yes.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Please refer to Questions 1 & 2.
Experimental Design And Analyses
Please refer to Questions 3-6.
Supplementary Material
Yes. Code.
Relation To Broader Scientific Literature
Please refer to the questions.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
Strengths:
- The combination of strong and weak negatives in the EPA loss function shows a novel and effective way to improve the model’s performance, as supported by ablation studies.
Weaknesses:
Relatively weak experiments. Please refer to the questions.
Other Comments Or Suggestions
Please refer to the Questions.
Concern#1: Does this imply that the proposed training objective relies on a perfectly clean dataset?
Thank you for raising this interesting question. No, it is not a requirement to have a perfectly clean dataset. The reason for the weak negative samples is to provide high variance. Even if some weak negative y_j is accidentally better than the chosen response y_i in the training dataset, it should not be a problem. However, if all or too many such y_j become better than y_i, we might end up with lower variance as a result, which is not desired according to the theorems (see our response to Concern #2 for more information about weak negatives and the variance of negatives).
Concern#2: What is the rationale behind computing the reward for irrelevant x and y pairs? Could the authors provide a more solid theoretical justification for introducing negatives in Eq. (11) and Eq. (12)?
The rationale, as mentioned in the response to Concern #1, is that the irrelevant x and y pairs can provide high variance. The premise of Theorem 3.2 and Theorem 3.3 only states that Var[Y|Z] has to be positive. However, from the proof of Theorem 3.2, we can also see that the value of Var[Y|Z] is positively associated with the 2nd-order derivative of the energy discrepancy. Therefore, a higher variance leads to a more convex energy-discrepancy functional/loss.
We also discuss their intuitive usefulness as a regularizer in section 4.3. The negative gradients in DPO might push the policy parameters towards unexpected directions because they are likely not pointing to the RLHF minimizer. Therefore, the additional negative gradients introduced by the weak negatives can alleviate this issue.
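As a rough illustration of the in-batch construction (a minimal sketch under our own assumptions, not the exact Eq. (11)/(12) of the paper; the softmax cross-entropy form, the beta-scaled log-ratios, and the tensor layout are illustrative choices):

```python
# Minimal sketch (illustrative, not the paper's exact Eq. (11)/(12)): the chosen
# response competes against its rejected counterpart (strong negative) and against
# responses cross-paired from other prompts in the mini-batch (weak negatives).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(chosen_lr, rejected_lr, weak_lr, beta=0.01):
    """
    chosen_lr:   (B,)   log pi_theta(y_w|x) - log pi_ref(y_w|x) per prompt
    rejected_lr: (B,)   same log-ratio for the rejected response (strong negative)
    weak_lr:     (B, K) log-ratios for K responses borrowed from other prompts
    """
    logits = beta * torch.cat(
        [chosen_lr.unsqueeze(1), rejected_lr.unsqueeze(1), weak_lr], dim=1
    )  # shape (B, 2 + K); column 0 holds the positive
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# toy usage: 4 prompts, 2 weak negatives each
loss = contrastive_alignment_loss(torch.randn(4), torch.randn(4), torch.randn(4, 2))
```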
Concern#3: does the performance improvement over DPO depend on the use of a large batch size?
No, because we use 64 as the global batch size for all the models and methods (including DPO) reported in the paper. As for why 64 in particular, this is because it is the highest possible number we can set to prevent our most memory-consuming setting (i.e., EPA's 1:3:10 setting in Table 3) from OOM.
Concern#4: sampling more mismatched responses from the training data instead of the minibatch.
We believe an irrelevant response from a mini-batch is not meaningfully different from an irrelevant response elsewhere, because the mini-batches are randomly sampled from the dataset in the first place. Also, as we mentioned in the response to Concern#2, the value of weak negatives is from the high variance as opposed to where they are sampled from.
Concern#5: EPA and DPO under the same training time?
The x axis in Figure 2(b) is proportional to the number of prompts (i.e., x in the datasets). However, since EPA (1:1:2) uses exactly twice the number of responses (i.e., y in the datasets) used in DPO, we think scaling the dynamics curve of EPA rightward by a coefficient of 2 is close to what you ask for. In such a graph, we do not think EPA achieves faster convergence compared to DPO. However, we also do not intend to make such a claim in the paper. Instead, we do agree the computation cost of EPA is its disadvantage compared to DPO.
Concern#6: more benchmark evaluations, such as Arena-Hard
Thank you for the suggestion. We quickly ran Arena-Hard for EPA against the two most important baselines, DPO and DPO-PL, using the same checkpoints we use in Table 2. The results are as follows, and we can see that EPA is still better. We will add them to the final version of the paper. However, we believe the widely used MT-Bench and Alpaca-Eval 2.0 are sufficient for most of the experiments in the paper.
| Training Data | Method | Arena-Hard (Win Rate, %) |
|---|---|---|
| UF-binarized | DPO | 12.0 |
| UF-binarized | EPA | 16.3 |
| UF-all | DPO-PL | 13.0 |
| UF-all | EPA | 16.9 |
Thanks for the authors' feedback; I don't have other concerns.
This paper identifies a major limitation in Direct Preference Optimization (DPO): the underlying Bradley-Terry Model (BTM) does not always guarantee a unique Maximum Likelihood Estimator (MLE), leading to suboptimal alignment in Reinforcement Learning from Human Feedback (RLHF). To address this, the authors propose an Energy-Based Model (EBM), termed the Infinite Preference Model (IPM), which inherently ensures a unique MLE. They introduce Energy Preference Alignment (EPA), a contrastive loss function that better enforces the required slope-1 linearity between learned and true rewards.
Questions To Authors
Could you clarify how robust that uniqueness advantage remains if your negative sampling is incomplete or unrepresentative (e.g., limited coverage in real-world data)? Specifically, do suboptimal negative samples risk undermining uniqueness—leading to the same drawbacks that can occur under BTM—and, if so, how severe is the impact?
Claims And Evidence
The paper’s claims are well-supported by a mix of theoretical proofs, empirical benchmarks, and ablation studies. The primary claims regarding the limitations of BTM, the uniqueness of EBM’s MLE, and EPA’s improved alignment are convincingly demonstrated.
Minor Weaknesses: 1.Higher computational cost (acknowledged but not mitigated). 2.Tested mainly on Mistral-7B (broader validation needed).
Methods And Evaluation Criteria
The paper employs a well-justified Energy-Based Preference Model (EBM) to address Direct Preference Optimization (DPO) limitations, with strong theoretical support (Theorem 3.1). The Energy Preference Alignment (EPA) loss effectively enforces slope-1 linearity, outperforming DPO, IPO, KTO, and NCA.
Theoretical Claims
The paper's theoretical claims are largely well-justified with rigorous mathematical proofs. Theorem 3.1, which guarantees the uniqueness of the Maximum Likelihood Estimator (MLE) for the Infinite Preference Model (IPM), is correctly derived and logically sound. The equivalence between slope-1 linearity and the minimizer of the RLHF loss (Theorems A.3 and A.4) follows standard derivations in reinforcement learning theory and is correctly structured. The proof of Proposition B.5, which demonstrates the non-uniqueness of the MLE in the Bradley-Terry Model (BTM) under certain conditions, is valid and aligns with prior work on ranking models.
Experimental Design And Analyses
The experimental design is generally sound, with a clear methodology comparing the proposed Energy Preference Alignment (EPA) loss against strong baselines, including Direct Preference Optimization (DPO), IPO, KTO, and NCA. The use of intrinsic (Pearson correlation, slope-1 regression error) and extrinsic (MT-Bench, Alpaca-Eval 2.0) evaluation metrics provides a well-rounded assessment of alignment performance.
Supplementary Material
I am able to inspect their code in the Supplementary Material but unable to run it. Their code seems to align with the paper. However, I wish they could provide the checkpoints from their training for the community.
Relation To Broader Scientific Literature
The paper builds on the foundation of preference optimization for RLHF, particularly Direct Preference Optimization (DPO), which was introduced as a reward-model-free approach to preference alignment. It identifies a major limitation of DPO stemming from the Bradley-Terry Model's (BTM) non-uniqueness of the Maximum Likelihood Estimator (MLE), a problem well-documented in the ranking literature. By introducing the Infinite Preference Model (IPM), an energy-based alternative, the paper aligns with prior work in energy-based modeling and constructs the Energy Preference Alignment (EPA) loss.
Essential References Not Discussed
None
Other Strengths And Weaknesses
Strengths: 1.Clarity of Motivation: The paper clearly explains the limitations of BTM and how the proposed EBM circumvents them, making a compelling argument for its adoption.
2.Robust Evaluation: The authors conduct extensive experiments, comparing against strong baselines, and include multiple evaluation perspectives.
Weaknesses: 1.Lack of experiments on diverse model architectures beyond Mistral-7B limits the generalizability of its findings.
2.Discussion on computational trade-offs is relatively brief.
Other Comments Or Suggestions
None
Concern #1: generalizability to more datasets and more base models
(Please refer to our response to reviewer xiBU for our additional experiments using a different dataset and a different model)
Concern #2: limited discussion on computational trade-offs
Thank you for pointing out the limitation of this paper. As the main focus of this paper is on the theoretical benefit of using our EBM instead of BTM to model human preference, we have not included many possible further analyses regarding the computational cost of EPA (which is one among many other possible ways to fit the EBM). However, there is an explicit question raised by reviewer RQg9 (Concern#5) about a computational comparison between DPO and EPA. We hope our response there can provide additional insights.
Concern #3: "Could you clarify how robust that uniqueness advantage remains if your negative sampling is incomplete or unrepresentative (e.g., limited coverage in real-world data)? Specifically, do suboptimal negative samples risk undermining uniqueness—leading to the same drawbacks that can occur under BTM—and, if so, how severe is the impact?"
Thank you for raising these very interesting questions, which we think could motivate many future research directions.
From a general point of view, we agree that suboptimal negative samples do risk undermining the uniqueness. Although there are no designated experiments in the paper to show this, we do think there are some related observations in the paper.
- Observation 1. In Figure 2(b), EPA also degenerates (slightly) after some epochs, which means the dataset might still not be enough.
- Observation 2. In Table 5, as we stated in the paper, the fact that there are two tricks (adding a margin and using a dynamic weight on the loss) that can further boost the performance of EPA means it is empirically not perfect.
- Observation 3. In Table 1, the slope-1 linear regression error is non-zero.
However, in all these cases, we see EPA is still better than DPO. Therefore, although EPA is not robust enough, it is safe to say that it is more robust than DPO. In other words, the error caused by using EPA to estimate the unique MLE of the EBM is milder than the intrinsic flaw of BTM of not being guaranteed to have a unique MLE in the first place.
I confirm that I have read the author response to my review and will update my review in light of this response as necessary.
The authors highlight a fundamental issue with Bradley-Terry DPO, namely the lack of uniqueness in its solutions. To address this, they propose a novel approach based on energy-based models, termed Energy Preference Alignment. They demonstrate that their model overcomes the non-uniqueness problem inherent in Bradley-Terry DPO. Additionally, they approximate the model’s objective to develop a practical algorithm, which they empirically validate, showing its effectiveness compared to other DPO-based methods.
Questions To Authors
- Does DPO-PL suffer from the same issue as DPO, particularly in terms of the non-uniqueness of solutions?
- Could you clarify the relationship between the proposed approach and DPO-PL?
Claims And Evidence
- The theoretical results appear to be well-supported.
- The connection between the theoretical and practical results could be strengthened. For instance, in Theorem 3.3, the result holds only asymptotically as both M and N approach infinity. Providing insights into the rate of convergence would make the theoretical findings more informative.
- The practical evaluation seems limited, as the proposed approach was tested on only one dataset. Expanding the experiments to multiple datasets would help better assess its generalizability.
Methods And Evaluation Criteria
- The theoretical results appear to be valid.
- The current practical evaluation is insufficient and could benefit from further validation.
Theoretical Claims
I thoroughly checked the proof of B.5 but only skimmed the rest.
Experimental Design And Analyses
See above.
Supplementary Material
Only the appendix.
Relation To Broader Scientific Literature
- The authors discuss various related works, particularly different objectives for DPO.
- The final algorithm can be interpreted as DPO-PL with a technique to expand the dataset for K-wise comparisons. Specifically, given a dataset with k responses per prompt, the approach adds k' unrelated responses from other queries and then solves the resulting ranking problem. This formulation effectively leads to the final practical objective derived from DPO-PL (a sketch of this reading is given below).
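A minimal sketch of that interpretation, under assumptions of my own (the reward values and the helper below are illustrative, not the paper's implementation): treat the k' cross-prompt responses as extra, lowest-ranked candidates and apply a Plackett-Luce likelihood over the expanded list.

```python
# Plackett-Luce negative log-likelihood over an expanded candidate list, where the
# k' responses borrowed from other prompts are appended as the lowest-ranked items.
import torch

def plackett_luce_nll(rewards_ranked: torch.Tensor) -> torch.Tensor:
    """rewards_ranked: (K,) rewards ordered best-to-worst (cross-prompt responses last)."""
    nll = rewards_ranked.new_zeros(())
    for i in range(rewards_ranked.numel() - 1):
        # probability of picking item i first among the remaining candidates
        nll = nll - torch.log_softmax(rewards_ranked[i:], dim=0)[0]
    return nll

# k = 2 ranked responses for the prompt, k' = 2 unrelated responses appended last
loss = plackett_luce_nll(torch.tensor([1.2, 0.3, -2.0, -2.5]))
```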
Essential References Not Discussed
Not to my knowledge.
Other Strengths And Weaknesses
See above!
Other Comments Or Suggestions
In A.5, I believe the argument relies on the condition that the reference policy π_ref(y|x) must be non-zero for the reasoning to hold.
Concern #1 (also raised by reviewer ZCWk): generalizability to more datasets and more base models
As this is a point mentioned by multiple reviewers, we managed to run quick experiments on another relatively small dataset and on another base model. Although the experiments can never be exhaustive, we believe the results (as follows) should be additional evidence for the generality of our findings. We will include them in the final version of the appendix.
- dataset:
  - source: https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs
  - reference: Álvaro Bartolomé Del Canto et al., 2024; https://github.com/argilla-io/distilabel
  - description: a cleansed version of the widely used dataset (Intel/orca_dpo_pairs)
- results (MT-Bench scores for epoch #1/#2/#3):

| base model | EPA | DPO |
|---|---|---|
| Llama3-8B | 6.84/6.98/7.04 | 6.94/6.83/6.71 |
Concern #2: A.5 requires an assumption of the non-negativity of π_ref(y|x)
Thank you for the suggestion. Yes, the assumption is required, which is true for both DPO (as explicitly stated in (Rafailov et al., 2023)) and EPA. Although this is a mild assumption that does not affect the main theories of our paper, we will add this to the final version of the appendix for comprehensiveness.
Concern #3: Does DPO-PL suffer from the same issue as DPO, particularly in terms of the non-uniqueness of solutions?
Yes, we think so. The proof will be analogous to that of DPO. It is not difficult to show that proposition B.5 is also true for the Plackett-Luce Model by only checking that the reconstructed reward also leads to the same expected likelihood of data. We will add the exact proof to the final version of the appendix.
Besides the proof per se, there is an intuitive understanding of why this is the case. As we have mentioned in the first paragraph of subsection 3.1., the Plackett-Luce Model also just models human preference among a finite number of candidates. This is problematic when the actual space of candidates is infinite.
Concern #4 (also raised by reviewer PHXo): detailed differences between EPA and other multi-response losses such as DPO-PL, infoNCA, etc.
We are thankful that some reviewers point out the similarity shared by EPA and other multi-response losses. This is exactly the reason why we include many such losses as baselines for comparison in Table 2. However, for a clearer description of the differences, we hope the following table can suffice. We will add it to the final version of the paper.
| name | probabilistic model | data | loss |
|---|---|---|---|
| EPA | IPM | chosen and rejected responses per prompt, plus weak negatives cross-paired from other prompts | Eq. (11)/(12) |
| DPO-PL | Plackett-Luce Model | K ranked responses per prompt | |
| InfoNCA | EBM for the ideal policy | K responses per prompt with explicit rewards | |
The Bradley-Terry model (BTM) has been the default modelling assumption linking rewards and preferences, used in RLHF and DPO. This paper challenges the BTM, claiming that it does not guarantee unique minimizers in DPO training. To solve this issue, they propose the Energy-based Preference Model (EBM), which models preferences with a Boltzmann distribution. The paper claims that the EBM has unique minimizers towards the global optimum of the RLHF objective.
Because the EBM is not tractable, the paper proposes Energy Preference Alignment (EPA) to approximate the EBM objective. Specifically, they propose to use two types of negative responses: negative responses from the same prompt and negative responses from different prompts.
They conducted thorough experiments to compare EPA and other baselines like DPO. Their experiments show superior performances of EPA in benchmarks like MT-bench and AlpacaEval. They also included ablation studies to investigate the impact of "weak negative examples" and "other tricks".
Questions To Authors
As mentioned before, my main concern is that I am not convinced that adding responses from irrelevant prompts will add any benefit to model training.
Claims And Evidence
- Non-unique minimums of DPO training. As argued in this paper, one main problem of DPO is that its minimizers are not unique. Reading their Proposition B.5, it is based on the assumption that some y* were never sampled within the preference dataset. Then one can build a translated reward model by adding an arbitrary constant to the target RM for any such y*. I don't understand why this is a problem. If y* is never sampled in the preference dataset, then it is an issue with the dataset instead of with the DPO algorithm.
- Can the authors explain why the EBM still has unique minimizers when some y* never appear in the training data? What about the approximation, EPA? Does it guarantee unique minimizers?
- Difference between EPA and InfoNCA / RPO. The proposed EPA method is very similar to InfoNCA [1] and Multi-Response RPO [2], except that the weak responses come from other prompts. Can the authors further clarify the differences? Intuitively, I am not convinced that adding responses from irrelevant prompts will add any benefit to model training.
[1] Noise Contrastive Alignment of Language Models with Explicit Rewards
[2] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
Methods And Evaluation Criteria
For the proposed methods, please see "Claims and Evidence".
The evaluation criteria are standard for comparing preference optimization techniques.
Theoretical Claims
I checked the proof of Proposition B.5, which is correct. However, as discussed in "Claims and Evidence", I don't agree with the implications of that proposition.
Experimental Design And Analyses
In general, this paper presents comprehensive studies to understand the impact of the proposed approach.
- Figure 2 is pretty good. It shows that EPA achieves a better reward-KL tradeoff compared to DPO.
- Table 3 studies the impact of the number of weak negative samples.
Some questions:
- For Table 2, they fixed the same β for all methods. This feels problematic because different methods might have different optimal β. How β is selected is not mentioned either.
- In Table 4, how are the weak negatives added to DPO or DPO-PL training? Can you also include InfoNCA for a comparison?
Supplementary Material
I checked the proof of Proposition B.5, which is correct. However, as discussed in "Claims and Evidence", I don't agree with the implications of that proposition.
Relation To Broader Scientific Literature
This paper aims to push forward preference optimization algorithm, which could be useful in many problems like LLM post-training and off-policy RL.
Essential References Not Discussed
None
Other Strengths And Weaknesses
This paper is clearly written.
I especially like how they conduct multiple ablation studies to investigate the performance of each element in their algorithm.
One weakness is that the proposed approach incurs larger computational and memory costs due to the additional weak negative samples. Can the authors present specific experimental data points or analysis regarding them?
Other Comments Or Suggestions
None
Concern #1: The implication of Proposition B.5: the issue of the dataset or the issue of the DPO algorithm?
We believe a reasonable definition of a good algorithm is one that works for datasets that are "easy" to collect. If there are too many constraints on collecting the datasets, it is indeed an issue of the algorithm.
B.5 gives a necessary constraint which is itself not easy to meet, i.e., we can hardly sample all possible y's from an infinite space given a finite data-point budget.
What makes the datasets "even harder" to collect is that the B.5 constraint is not sufficient. This means that even if we somehow were able to sample all y's in an infinite space, we still would not be guaranteed that such a dataset allows the DPO algorithm to work. As mentioned in the paragraphs before B.5, one can refer to (Ford, 1957; Bong & Rinaldo, 2022) for more situations in which BTM has non-unique MLEs.
Concern #2: "why does the EBM still have unique minimizers when some y* never appear in the training data?" & "what about EPA?"
As shown in Theorem 3.1 (also B.4), the fact that the EBM has a unique minimizer for its maximum likelihood estimation does not depend on any requirement on the structure of the training data. Even when some y* never appears in the training data, the proof we provide for Theorem 3.1 still holds.
For EPA, we acknowledge that there will be an error in the approximation (illustrated as the fuzzy dashed curve in Figure 1). This is why we need to check empirically whether or not the error introduced in EPA is smaller than that of DPO. For example, Table 1 explicitly shows that the slope-1 linear regression error is smaller for EPA, although the error is non-zero.
Concern #3: Difference between EPA and RPO, infoNCA, etc.
We noticed that the RPO paper was posted to arxiv.org after the submission of our paper. So, we were not aware of this concurrent work. Although EPA itself may or may not be a special case of RPO, we argue that our paper is more about the underlying energy-based preference model as opposed to EPA (which is only one of many possible ways to fit the EBM).
(see our response to xiBU for EPA vs other losses)
Concern #4: Why set β = 0.01 for Table 2?
As mentioned in the paper's subsection 5.1.4, the philosophy for setting β in Table 2 is to treat it as a given part of the RLHF objective (Eq. 1). As for why we choose 0.01 in particular, the reason is two-fold.
- Firstly, we did tune β over multiple values ranging from 0.01 to 0.5 (see Figure 2(a)). A β close to 0.01 (the low-KL region) makes both EPA and DPO achieve higher rewards.
- Secondly, β = 0.01 is also the default setting for NCA and InfoNCA (Chen et al., 2024a), and a recommended setting for Mistral-based DPO (Tunstall et al., 2023) and KTO (Ethayarajh et al., 2024). For IPO's best β, we did preliminary experiments with β = 0.1 and β = 0.01, and found β = 0.01 works better. Therefore, we believe setting β = 0.01 is a reasonable choice. We will add these results (MT-Bench scores) in the final version of the appendix:

| β | IPO (epoch #1/#2/#3) |
|---|---|
| 0.1 | 6.73/6.88/6.87 |
| 0.01 | 7.20/7.31/7.23 |
Concern #5: "how are the weak negatives added to DPO or DPO-PL training?" & "what about InfoNCA?"
In the caption of Table 4, we have briefly stated how the weak negatives are added. The following are some examples for "+UF-WEAKx1" (a small code sketch follows the list).
Suppose the original pairwise dataset has 3 data points: (x_1, y_1^w, y_1^l), (x_2, y_2^w, y_2^l), (x_3, y_3^w, y_3^l).
- For DPO, weak negatives are added "as additional y_l's to be paired with the original y_w's". This means we add pairs like (x_1, y_1^w, y'_1), (x_2, y_2^w, y'_2), (x_3, y_3^w, y'_3) to the original dataset, where each y'_i is a response taken from a different prompt.
- For DPO-PL, weak negatives are added "as additional negatives ranked after the y_w and y_l". This means we replace the original dataset with something like (x_1, y_1^w, y_1^l, y'_1), (x_2, y_2^w, y_2^l, y'_2), (x_3, y_3^w, y_3^l, y'_3). The K-wise ranking information required by DPO-PL can be implicitly inferred here. For example, for (x_1, y_1^w, y_1^l, y'_1), the ranking is y_1^w ≻ y_1^l ≻ y'_1.
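An illustrative sketch of this augmentation (the tuple layout and the choice of which cross-prompt response serves as the weak negative are assumptions, not the exact recipe used in the paper):

```python
# Build a "+UF-WEAK x1" style augmentation: one weak negative per prompt, drawn from
# a response that belongs to a different prompt. The specific sampling rule is assumed.
import random

def add_weak_negatives(pairs, mode="dpo", seed=0):
    """pairs: list of (x, y_w, y_l) tuples. Returns the augmented dataset."""
    rng = random.Random(seed)
    out = []
    for i, (x, y_w, y_l) in enumerate(pairs):
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        y_weak = pairs[j][1]            # a chosen response from another prompt
        if mode == "dpo":               # extra pair: original chosen vs. weak negative
            out += [(x, y_w, y_l), (x, y_w, y_weak)]
        else:                           # "dpo-pl": append as the lowest-ranked candidate
            out.append((x, [y_w, y_l, y_weak]))
    return out

augmented = add_weak_negatives(
    [("x1", "y1w", "y1l"), ("x2", "y2w", "y2l"), ("x3", "y3w", "y3l")], mode="dpo"
)
```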
We do not consider InfoNCA because the purpose of Table 4 is to study whether EPA's main advantage comes from introducing more computation alone or from being the product of a better preference model. However, InfoNCA is not based on a "preference model". Therefore, we do not think it is relevant for the purpose here.
Concern #6: more data points regarding the computational cost.
We provide our actual training time to support our linear-time-complexity argument in sec 5.2.3:
| #resp | setting | training time |
|---|---|---|
| 8 | 1:3:4 | 44h41min |
| 6 | 1:3:2 | 33h51min |
| 4 | 1:1:2 | 23h0min |
| 2 | 1:1:0 | 12h53min |
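As a quick sanity check on the linear-scaling claim (the fit below is ours, computed only from the numbers in the table above):

```python
# Fit training time against the number of responses per prompt (table above).
import numpy as np

n_resp = np.array([2, 4, 6, 8])
hours = np.array([12 + 53 / 60, 23.0, 33 + 51 / 60, 44 + 41 / 60])

slope, intercept = np.polyfit(n_resp, hours, 1)
print(f"~{slope:.2f} h per additional response, ~{intercept:.2f} h fixed overhead")
# roughly 5.3 h per response plus ~2 h of overhead, i.e., close to linear scaling
```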
Concern #7: the benefit of weak negatives
(please refer to our response to reviewer RQg9's concern#2)
To address the limitation that the DPO loss may admit multiple minimizers, this paper introduces an energy-based preference model (EBM) that guarantees a unique maximum likelihood estimator (MLE), inherently satisfying the desired linearity condition. To demonstrate the practical benefits of replacing DPO with the proposed EBM in offline alignment, the authors adopt a simple yet scalable objective function inspired by recent advances in EBM training, which they term Energy Preference Alignment (EPA). Empirical results on standard benchmarks show that EPA consistently outperforms DPO. The paper received four reviews. The authors provided rebuttals to address the concerns raised by the reviewers. Following the rebuttal and discussion, all four reviewers reached a consensus to accept the paper, recognizing its innovation, strong theoretical justification, and promising empirical results. The Area Chair concurs with the reviewers’ assessment and recommends acceptance.