Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO
Abstract
Reviews and Discussion
This paper addresses the limitation that Direct Preference Optimization (DPO) oversimplifies the regularization term. To highlight this issue, the authors present a series of theoretical results. In response, they propose Proximalized Preference Optimization (PRO), a variant of DPO designed to resolve likelihood underdetermination and provide a unified framework for aligning language models across diverse types of preference feedback. Theoretical analysis and empirical experiments are conducted to demonstrate the effectiveness of PRO.
Strengths and Weaknesses
Strengths:
- The paper clearly identifies a key limitation of sample-based DPO methods, namely, the oversimplification of the regularizer term, which leads to likelihood underdetermination.
- To address this issue, the authors propose a principled variant of DPO, called Proximalized Preference Optimization (PRO). They support their method with theoretical analysis and empirical results demonstrating its effectiveness.
Weaknesses:
- Some of the theoretical proofs lack rigor. For example, the proof of Theorem 3.2 could be more formally structured by applying the KKT conditions.
- Several theorems are presented without sufficient explanation or intuition, making them difficult to interpret. (see Questions section for specifics)
Questions
- The corollary establishes a bound involving a constant, but it would be helpful if the authors could further clarify its valid range or typical values, and how it depends on properties of the policy or data distribution.
- In Theorem 4.2, a lower bound on the regularization coefficient α is given. However, if the required α is too large, then the regularization term may dominate the PRO loss, potentially hindering learning from the comparison data. Could the authors provide more discussion on how α is determined?
- If I understand correctly, the existence of an interior solution in Theorems 4.2 and 4.3 seems to address the degeneracy problem of the policy learned via the DPO loss, as highlighted in the IPO paper. Could the authors clarify whether this connection is intentional, and if so, provide experimental results or discussion to support this claim?
- Theorem 3.1 shows the equivalence between the reformulated loss and the standard DPO formulation. While the formal result is clear, it would be helpful if the authors could provide additional intuition behind this equivalence.
Limitations
The discussion of limitations is included in Appendix C.
Final Justification
The authors have addressed all my concerns through new experimental results and a clearer intuition for the proposed formulation. I maintain my score of acceptance.
Formatting Issues
N/A
Thanks for your insightful comments and helpful questions!
W1: Some of the theoretical proofs lack rigor. For example, the proof of Theorem 3.2 could be more formally structured by applying the KKT conditions.
A1: We would like to clarify why the KKT conditions are not directly applied, and how our derivation still captures their underlying principles.
The classical KKT conditions are designed for optimization problems with equality and non-strict inequality (i.e., $\le$ or $\ge$) constraints. However, the problem in Theorem 3.2 is defined over the open simplex, since the logarithm in the loss requires all probabilities to be strictly positive. This makes the KKT conditions not directly applicable.
Briefly, the KKT conditions consist of:
- Primal feasibility: The optimum satisfies all original constraints.
- Stationarity & complementary slackness: At the optimum, the gradient of the objective is a linear combination of the gradients of active constraints (those satisfied with equality), signifying that any descent direction will violate some constraint.
- Dual feasibility: Lagrange multipliers for inequality constraints are non-negative.
In our case, provided that the optimal solution exists (as assumed in Theorem 3.2):
- The optimum necessarily lies in the interior of the simplex (i.e., it satisfies the two conditions following Line 677).
- This in turn means that sufficiently small moves in any direction keep the strict inequality constraints satisfied, and thus these constraints are, in effect, inactive. We then only need to consider the equality constraint; the stationarity condition reduces to the gradient of the objective being collinear with the gradient of this equality constraint—precisely as stated in equation (16).
- The auxiliary variable in (16) serves as the Lagrange multiplier associated with this equality constraint and, in line with the KKT framework for equality constraints, is unconstrained in sign.
Therefore, each element above corresponds exactly, in a one-to-one fashion, to a component of the KKT framework. Based on this derivation, we further establish Theorem 3.2 by analyzing this multiplier and showing that it must be zero.
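To make the correspondence concrete, here is the resulting stationarity condition in generic notation (a sketch; $L$ denotes the objective, $\pi^\star$ the interior optimum, and $\lambda$ the unconstrained multiplier—these symbols are placeholders rather than the paper's notation):

$$
\nabla_{\pi} L(\pi^\star) \;=\; \lambda\,\nabla_{\pi}\Big(\textstyle\sum_{y}\pi(y)-1\Big) \;=\; \lambda\,\mathbf{1}, \qquad \lambda\in\mathbb{R},
$$

i.e., at an interior optimum the objective gradient is constant across coordinates, matching equation (16) with the multiplier playing the role described above.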
We appreciate the reviewer’s attention to the rigor of our proofs and are glad to provide further clarification if needed.
Q1: The corollary establishes a bound involving a constant, but it would be helpful if the authors could further clarify its valid range or typical values, and how it depends on properties of the policy or data distribution.
A2: The constant depends on the optimal probabilities of the responses that appear in the data as preferred or dispreferred. However, as established in Theorem 3.2, the optimum is determined by an equation system involving the non-linear sigmoid and logarithm functions, which makes a closed-form analysis of this constant intractable.
To gain more intuition about what influences the constant, recall that eDPO includes a regularizer that encourages the policy to remain close to the reference, alongside an optimizer that adjusts response probabilities upward or downward depending on the preference signal. According to (2), the preference signal is amplified by a response-dependent factor at the optimum. If the dataset contains many more preferred responses with large amplification factors than dispreferred responses, the increases in probability for these preferred responses are likely to outweigh the decreases for dispreferred ones. Since probabilities must sum to one, this results in lower probabilities for unobserved responses—thus, the constant tends to be smaller. In general, its typical scale is inversely related to the total amplification factors of the preferred and dispreferred responses present in the data.
Moreover, it is important to emphasize that the main purpose of Corollary 3.3 is to show that eDPO enforces an ordering on the probability changes among preferred, unobserved, and dispreferred responses. This means that the probability of an unobserved response cannot be increased or decreased arbitrarily, as it is bounded by the changes to the preferred and dispreferred responses. In other words, the corollary ensures that the underdetermination issue is resolved, regardless of the specific value of the constant.
Q2: If the required α is too large, then the regularization term may dominate the PRO loss, potentially hindering learning from the comparison data. Could the authors provide more discussion on how α is determined?
A3: We would like to clarify that α and β jointly affect the strength of regularization relative to the learning signal. To elaborate, the PRO loss can be rewritten so that α and β appear only inside the regularizer term, which lets us analyze their roles by examining that term's gradient directly.
The analysis reveals that α determines the maximum gradient magnitude of the regularizer, while β controls how rapidly the gradient grows as its argument moves away from zero. Therefore, even if a large α is required, the overall regularization effect can be modulated by adjusting β.
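As an illustration (a sketch, not necessarily the exact function used in the paper), consider a regularizer of the KL-to-Bernoulli form that appears later in this thread, applied to the reward margin $r = r_\theta(y_1) - r_\theta(y_2)$ with $r_\theta = \beta\log\frac{\pi_\theta}{\pi_\text{ref}}$:

$$
\frac{\partial}{\partial r}\,\alpha\, D_\text{KL}\Big(\mathcal{B}\big(\tfrac{1}{2}\big)\,\Big\|\,\mathcal{B}\big(\sigma(r)\big)\Big) \;=\; \alpha\Big(\sigma(r)-\tfrac{1}{2}\Big),
$$

whose magnitude is capped at $\alpha/2$, while $\beta$ sets how quickly $r$—and hence this gradient—grows as the policy departs from the reference.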
We have empirically validated this in our response A1 to Reviewer 5FRF, where additional experiments are conducted by increasing α beyond a value that already produces strong performance. The results show that, even for 3x or 9x larger values of α, reducing β appropriately maintains competitive performance.
These results also indicate that, once α exceeds the required threshold, further increases have minimal effect, as we can always find a suitable β to retain strong performance. Thus, in practice, it suffices to choose any α above this threshold, without the need for precise tuning. Considering this, we present a constructive method to determine a sufficient α in our response A3 to Reviewer CDcH. With this analytical α, hyperparameter tuning primarily focuses on β—similar to standard DPO. For details, we kindly refer the reviewer to that response due to space constraints.
Q3: The existence of an interior solution in Theorems 4.2 and 4.3 seems to address the degeneracy problem of the policy learned via the DPO loss, as highlighted in the IPO paper. Could the authors clarify whether this connection is intentional, and if so, provide experimental results or discussion to support this claim?
A4: Thanks for the insightful observation! While our initial derivation was not specifically intended to address the degeneracy issue highlighted in the IPO paper, Theorems 4.2 and 4.3 do guarantee that our proposed method resolves this problem.
To demonstrate this empirically, we conducted the same experiment as presented in the IPO paper: a bandit environment with three actions and the preference dataset used there. The policy to be optimized is parameterized directly over the three actions, with the reference policy being uniform. We trained the policy for 18,000 steps using Adam with batch size 9. All methods converged before training completed; the summarized results are as follows:
- PRO (we only present the result for a single α due to space constraints): converged action probabilities (table omitted)
- IPO: converged action probabilities (table omitted)
- DPO: converged action probabilities (table omitted)
As shown above, DPO degenerates to assigning all probability mass to a single response, regardless of the value of β. In contrast, both IPO and our PRO method maintain meaningful, non-trivial distributions over the actions.
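For reference, here is a minimal sketch of the DPO side of this bandit setup, assuming a softmax-over-logits policy; the preference pairs and learning rate below are placeholders rather than the exact IPO dataset, and the PRO/IPO losses are not reproduced:

```python
import torch
import torch.nn.functional as F

beta = 0.1
logits = torch.zeros(3, requires_grad=True)        # policy over the 3 actions
ref_logp = torch.log(torch.full((3,), 1.0 / 3.0))  # uniform reference policy

# Placeholder preference pairs (winner, loser); the IPO paper's dataset differs.
pairs = torch.tensor([[0, 1], [0, 2], [1, 2]] * 3)

opt = torch.optim.Adam([logits], lr=0.01)          # placeholder learning rate
for step in range(18_000):
    logp = F.log_softmax(logits, dim=0)
    ratios = logp - ref_logp
    margin = beta * (ratios[pairs[:, 0]] - ratios[pairs[:, 1]])
    loss = -F.logsigmoid(margin).mean()            # standard pairwise DPO loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(F.softmax(logits, dim=0))  # mass tends to collapse onto the top-ranked action
```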
We will incorporate the above discussion into the revision.
Q4: It would be helpful if the authors could provide additional intuition behind the equivalence in Theorem 3.1.
A5: Theorem 3.1 essentially utilizes an interesting property of the log-sigmoid function. Concretely,
$$
\begin{aligned}
\nabla_\delta\big[a\log\sigma(\delta)+(1-a)\log\sigma(-\delta)\big] &= a\sigma(-\delta)-(1-a)\sigma(\delta)\\
&= a-\sigma(\delta)=\nabla_\delta\big[a\delta+\log\sigma(-\delta)\big]\\
&= a-1+\sigma(-\delta)=\nabla_\delta\big[(a-1)\delta+\log\sigma(\delta)\big].
\end{aligned}
$$
In other words, the convex combination of log-sigmoid gradients can be decoupled into two parts: one that depends on the data-relevant signal $a$, and another that is independent of $a$ and can thus be rearranged to serve as a regularizer.
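A quick numerical check of this identity (purely illustrative, using PyTorch autograd):

```python
import torch
import torch.nn.functional as F

a = 0.3                                   # arbitrary label weight in [0, 1]
delta = torch.tensor(1.7, requires_grad=True)

exprs = [
    a * F.logsigmoid(delta) + (1 - a) * F.logsigmoid(-delta),
    a * delta + F.logsigmoid(-delta),
    (a - 1) * delta + F.logsigmoid(delta),
]
grads = [torch.autograd.grad(e, delta)[0] for e in exprs]
print(grads)                              # each gradient equals a - sigmoid(delta)
print(a - torch.sigmoid(delta))
```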
We will incorporate the above explanation in the revision to better convey the intuition underlying our proposed method.
Thank you for the thorough response. All my concerns are addressed, so I will keep my current score for acceptance.
We are pleased to hear that all the raised concerns have been addressed. Thanks again for your valuable comments!
This work identifies the issue of likelihood underdetermination in direct alignment methods for LLMs, which causes decreased absolute likelihoods of example responses and undesired outputs. It revisits and reformulates the loss in direct preference optimization (DPO), revealing the oversimplification of a regularizer as the underlying cause. Based on this, the authors introduce PRO, which resolves the underdetermination issue through an efficient approximation of the complete regularizer and demonstrates superior performance in handling various feedback types.
Strengths and Weaknesses
Strengths:
- The paper provides a novel decomposed reformulation of the Direct Preference Optimization (DPO) loss function. This reformulation broadens the applicability of DPO to a wider range of feedback types, and offers new insights into the underlying cause of likelihood underdetermination.
- The authors identify that the standard DPO implementation oversimplifies a regularizer, leading to likelihood underdetermination. They demonstrate that reinstating the complete regularizer effectively resolves this issue.
Weaknesses and Questions:
- Appendices seem to be missing.
- It is surprising to see the bad performance of DPO; it would be helpful to explain why this occurs.
- The paper does not provide detailed information on how hyperparameters were chosen. More transparency in the hyperparameter tuning process would enhance the reproducibility and credibility of the results.
Questions
See the section above
Limitations
See the section above
Final Justification
This work addresses the issue of likelihood underdetermination within direct alignment methods for Large Language Models (LLMs). This problem leads to a decrease in the absolute likelihood of example responses and results in undesired outputs. By revisiting and reformulating the loss function of Direct Preference Optimization (DPO), the authors identify an oversimplified regularizer as the root cause. Building on this insight, they introduce Proximalized Preference Optimization (PRO), which resolves the underdetermination issue by efficiently approximating the complete regularizer, demonstrating superior performance across various feedback types. All my concerns are addressed.
Formatting Issues
Appendices are missing.
Thank you for your review and feedback!
We would like to clarify that all appendices are included in the supplementary material, in line with NeurIPS guidelines, which state that extensive appendices can be provided either in the supplementary material or the main submission. For your convenience, the appendices cover the following sections:
- A Related Work (p. 15)
- B Proof of Theorems (p. 16)
- C Comparison of PRO Loss and RLHF Objective (p. 25)
- D Implementation Details of PRO Loss (p. 28)
- E Performance Degeneration of KTO (p. 29)
- F The Role of α and β in the PRO Regularizer (p. 29)
- G Additional Experimental Setup (p. 29)
- H Additional Experimental Results (p. 31)
We understand how important complete appendices are for a thorough evaluation, and apologize for any confusion regarding their location. We would be grateful if you could further assess our submission in light of the availability of appendices.
Below, we respond to the remaining concerns raised in the review.
Q1: It is surprising to see the bad performance of DPO.
A1: We agree that DPO's underperformance on the Anthropic-HH dataset is noteworthy. We believe this is due to the following differences between our setup and the original DPO paper:
- Evaluation Criteria: Unlike the DPO paper, which assesses helpfulness alone, our evaluation incorporates helpfulness, harmlessness, and conciseness. As noted in [1], the Anthropic-HH dataset is annotated for both helpfulness and harmlessness (hence the "HH" in its name). Moreover, LLM judges are known to favor overly verbose responses; several benchmarks (e.g., AlpacaEval 2) and previous works (e.g., KTO) have explicitly included conciseness as an evaluation dimension. These metrics therefore more closely match the intent of the dataset as well as real-world alignment objectives.
- Sequence Length: Our experiments utilize more modern context windows (prompt: 1024, sequence: 2048) compared to the original DPO setting (256/512, respectively). As reward hacking often manifests as generating overly verbose outputs (please see the right panel of Figure 1), increasing the context window can further exacerbate this phenomenon. Given that modern LLMs typically operate with even longer contexts, DPO's vulnerability to this issue is likely even greater.
Furthermore, the underwhelming results of DPO we observe are consistent with earlier findings in the KTO paper [2], where Figure 2 shows that SFT alone outperforms SFT+DPO across multiple Pythia model sizes.
We will revise the manuscript accordingly to reflect these points.
[1] Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
[2] Model alignment as prospect theoretic optimization, 2024.
Q2: The paper does not provide detailed information on how hyperparameters were chosen. More transparency in the hyperparameter tuning process would enhance the reproducibility and credibility of the results.
A2: We appreciate the reviewer’s concern regarding transparency in hyperparameter tuning. To clarify:
- General hyperparameters: These are specified in Appendix G (Line 954) and align with the official KTO repository (commit ID: 306ed27). We did not perform additional tuning beyond these canonical settings.
- Method-specific hyperparameters (β and α):
- β selection: We conduct a hyperparameter sweep over a range of β values, as described in Lines 324-326. Figure 2 summarizes the results across this range on the Anthropic-HH dataset. For the UltraFeedback dataset, Table 1 reports the best performance across these values for each method, as described in Lines 332-335. For imbalanced binary feedback (from the Anthropic-HH dataset), we adopt the optimal β identified in Figure 2, as mentioned in Lines 345-346 and 358-359. For scalar feedback, we similarly perform a sweep over the same range of β; we will add this clarification in the revision.
- α selection: Details of the α settings are provided in Appendix D (Lines 907-909, 917-918), and a reference is also provided in Line 301 of the main paper. Additionally, based on the comment from Reviewer 5FRF, we have added further experiments during the rebuttal phase to examine the performance across a range of α and β values (see our response A1 to Reviewer 5FRF).
We hope these clarifications sufficiently address the reviewer's concerns. If there are any additional points or specific aspects the reviewer would like us to elaborate on, we would be glad to provide additional clarification.
Dear Reviewer o7Vr,
As the discussion phase is nearing its end, we would like to kindly follow up and confirm whether our responses have addressed your concerns. If there's anything unclear or if you'd like us to elaborate further, we're eager to do so.
Thanks once again for your time and insights on our work!
This paper presents Proximalized Preference Optimization (PRO), a new method for aligning large language models (LLMs) with human feedback that addresses key limitations of the widely used Direct Preference Optimization (DPO) method. The authors first identify a critical issue in DPO they term "likelihood underdetermination," where the model's likelihood for both preferred and dispreferred responses decreases during training. This can lead to reward-hacking effects, causing the model to generate outputs that deviate from expected patterns. The core contribution is a theoretical reformulation of the DPO loss function, decomposing it into two distinct parts: an optimizer term that reorganizes feedback into a pointwise signal, naturally extending its applicability beyond pairwise comparisons to other formats like binary and scalar feedback; and a regularizer term that is independent of preference labels. Through this new perspective, the paper argues that likelihood underdetermination in standard DPO arises from an oversimplification of this regularizer. The proposed PRO method addresses this by incorporating a more complete, yet computationally efficient, approximation of the regularizer. This is achieved through a novel "hyper response" mechanism, which groups unobserved responses to make the calculation tractable. PRO offers a unified framework for aligning LLMs with diverse feedback types (pairwise, binary, and scalar) while simultaneously resolving the underdetermination issue. Extensive experiments demonstrate that PRO successfully mitigates reward hacking and achieves performance that is comparable to or better than specialized methods like DPO, KTO, and NCA across various feedback scenarios, including challenging cases with extremely imbalanced data.
Strengths and Weaknesses
Strengths
- This paper first proves that the population-based DPO's gradient is equivalent to the eDPO's gradient. The reformulation decomposes the loss into an optimizer and a regularizer, and only the optimizer relies on the preference feedback. This decomposition therefore provides greater flexibility in developing sample-based losses, since preference feedback is usually limited.
- A major contribution is the unification of alignment across pairwise, binary, and scalar feedback within a single framework. While methods like KTO and NCA were developed for specific feedback types , PRO's design, derived from its reformulated optimizer, naturally accommodates this diversity.
Weaknesses
- Some sentences are not easy to understand. See questions.
- PRO introduces a new hyperparameter, α, which balances the optimizer and regularizer. While the experiments on imbalanced data show this to be a powerful and necessary lever, it also adds to the complexity of hyperparameter tuning compared to standard DPO. From Table 2, the impact of α is significant. How should the best α be selected? The paper could be strengthened by including a more systematic sensitivity analysis for α across different settings to provide practitioners with better guidance.
- The "hyper response" is an elegant approximation, but it is still an approximation. It prevents the model from differentiating the probabilities of individual responses within the hyper set. The paper argues this is not problematic because the total probability mass is constrained. However, it is possible that this could mask certain failure modes where the model learns to assign disproportionate probability to a specific "bad" but unobserved response.
- In Table 2, other baseline methods are missing.
Questions
- I think this paper does not clearly define the two response distributions. I think one is the distribution over responses with feedback, and the other is over all responses. Is that correct? I think the authors should state this more clearly around Line 164. Otherwise, what is the "full regularizer"?
- Why does Corollary 3.3 solve the underdetermination issue? I think the underdetermination is the importance weight diminishing. But the difference of probability ratios of chosen and rejected pairs could still be large under Corollary 3.3.
- I'm not sure I understand "unobserved" correctly. I think it means responses without feedback. So Sec 4.1 basically treats unobserved responses evenly in the regularizer's expectation. I think you could make this easier to understand for readers.
- I think you use the hyper response because sampling is intractable. But can this hyper-response method only be used in PRO? I think in the population-based formulation of DPO (Line 144), you can also use the hyper response, is that correct?
I give the score "Borderline reject". But I am willing to raise my score if my questions are resolved.
Limitations
yes
Final Justification
The rebuttal clarifies my concerns on the hyperparameter selection, underdetermination, importance weight diminishing, and hyper responses. I think this is a very good paper. The writing needs some improvement to make it clearer to follow.
Formatting Issues
No
Thanks for the constructive feedback and helpful questions!
W1: How to select the best α? The paper could be strengthened by including a more systematic sensitivity analysis for α across different settings to provide practitioners with better guidance.
A1: We conducted additional experiments to analyze the sensitivity of PRO to α, with the results summarized as follows.
- Win rate (%) under pairwise feedback: (table omitted)
- Win rate (%) under 1%-desired binary feedback: (table omitted)
For reference, the α values used in our paper for pairwise feedback and for 1%-desired binary feedback are already sufficiently large to yield strong performance.
In the new experiments, we further increased α by factors of 3 and 9 to examine its impact. The results show that, once α surpasses the required threshold, there always exists a suitable β such that competitive performance is maintained. In other words, the performance is insensitive to further increases in α beyond this point. This raises the opportunity to simplify hyperparameter tuning. Due to space constraints, we refer the reviewer to our response A2 to Reviewer CDcH, where we explain the above phenomenon and discuss a practical hyperparameter tuning strategy that brings the overhead on par with DPO.
W2: The "hyper response" is an elegant approximation, but it is still an approximation. It prevents the model from differentiating the probabilities of individual responses within the hyper set. The paper argues this is not problematic because the total probability mass is constrained. However, it is possible that this could mask certain failure modes where the model learns to assign disproportionate probability to a specific "bad" but unobserved response.
A2: We would like to clarify that enumerating every unlabeled response during alignment is fundamentally computationally intensive, and this limitation is not unique to PRO. In fact, existing approaches such as DPO, KTO, NCA, and IPO only include labeled responses in their losses, leaving all other unobserved responses unconstrained. As a result, due to the absence of training signals, the probabilities of unobserved responses may drift arbitrarily during optimization, which can lead to a specific "bad" but unobserved response being assigned disproportionately high probability. PRO makes this limitation transparent, rather than introducing a new one.
Moreover, even if additional computation for unlabeled responses were feasible, existing methods do not specify how to include them in the loss. By contrast, PRO allows for straightforward exclusion of these responses from the hyper-response set, and the regularizer can directly optimize their probability mass. This provides a flexible way to balance computational cost and regularization granularity across the response space.
W3: In Table 2, other baseline methods are missing.
A3: Among the baselines, only KTO supports binary feedback. Its results, based on the optimal β identified in Figure 2, are reported in Lines 345-347 and 358-359 (since the primary focus of Table 2 is to investigate how α affects PRO's performance, we did not include them there).
In the 1%-desired setting, KTO performs poorly and appears to suffer from reward hacking: the aligned model generates numerous duplicated and meaningless tokens (Line 347). We thus increased β in an attempt to bring the optimized model closer to the reference model, but the results remained unsatisfactory (Line 350). For this reason, we did not report the win rates in the submission. We provide the corresponding results below and will include them in the revision:
Win rate (%): (table omitted)
If the reviewer has more suggestions for adapting other baselines to binary feedback, we are happy to include them in the revision.
Q1: I think this paper does not clearly define the two response distributions. I think one is the distribution over responses with feedback, and the other is over all responses. Is that correct? I think the authors should state this more clearly around Line 164. Otherwise, what is the "full regularizer"?
A4: We agree that this distinction is critical and will explicitly clarify it around Line 164. Specifically:
- The former refers to the empirical distribution over responses that are labeled with feedback (i.e., the empirical distribution of the preference dataset, as described in Line 167).
- The latter refers to the distribution over all possible responses (typically, the underlying distribution from which responses are sampled for preference annotation, as stated in Line 89).
Q2: Why does Corollary 3.3 solve the underdetermination issue? I think the underdetermination is the importance weight diminishing. But the difference of probability ratios of chosen and rejected pairs could still be large under Corollary 3.3.
A5: Let us clarify the terminology and the role of Corollary 3.3 in detail:
- Underdetermination means that adding a constant to the log-probabilities of both the preferred and dispreferred responses does not alter the value of the DPO loss (the identity is written out below).
As a result, when the constant is negative, the probability mass can "leak" from the labeled responses to the unobserved responses, causing the "reward hacking" phenomenon observed in prior works.
- Importance weight diminishing provides a more specific explanation of how, in practice, both preferred and dispreferred log-probabilities decrease in DPO. Specifically, once the log-probability gap is large enough, there is no incentive to update the model, even if the log-probabilities of preferred responses have dropped (which is highly likely to happen due to catastrophic forgetting).
In other words, underdetermination creates the possibility for reward hacking, while importance weight diminishing makes it actually occur during DPO training.
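For concreteness, the invariance behind underdetermination can be written out for the standard pairwise DPO loss (a sketch in generic notation; $c$ denotes the added constant):

$$
-\log\sigma\Big(\beta\big[\log\pi_\theta(y_w)+c-\log\pi_\text{ref}(y_w)\big]-\beta\big[\log\pi_\theta(y_l)+c-\log\pi_\text{ref}(y_l)\big]\Big)
= -\log\sigma\Big(\beta\log\tfrac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)}-\beta\log\tfrac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)}\Big),
$$

so shifting both log-probabilities by the same (possibly negative) $c$ leaves the loss unchanged.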
Corollary 3.3 addresses the underdetermination by enforcing an order among the log-probability changes after optimization. Specifically, it ensures that the log-probability change for any unobserved response must lie between those for the preferred and dispreferred responses. Unlike before, simultaneous probability decreases for preferred and dispreferred responses would then also require a reduction in the probability of unlabeled responses, which is impossible due to the constraint of fixed total probability.
Lastly, we note that the log-probability difference between paired responses becoming unboundedly large is a separate issue of DPO, often referred to as degeneracy. We thank the reviewer (and Reviewer iesQ's Q3) for bringing our attention to the fact that PRO also addresses this issue. For further details, please refer to our response A4 to Reviewer iesQ.
Q3: I'm not sure I understand "unobserved" correctly. I think it means responses without feedback. So Sec 4.1 basically treats unobserved responses evenly in the regularizer's expectation. I think you could make this easier to understand for readers.
A6: "Unobserved" indeed refers to the responses without feedbacks, i.e., those not represented in the preference dataset.
To clarify, treating unobserved responses evenly in the regularizer's expectation would still require enumerating every such response, which is computationally infeasible. In Section 4.1, we instead aggregate all unobserved responses into a single hyper response, treating it as a unified group. This reduces the expectation to a small number of terms: one for each labeled response and one for the hyper response. Computing the regularizer then requires evaluating $\pi_\theta$, $\pi_\text{ref}$, and $\mu$ on the hyper response. To enable this, we define $\mu(\mathcal{H}) = \sum_{y\in\mathcal{H}}\mu(y)$ for any distribution $\mu$. This avoids the need for explicit enumeration and makes the computation tractable.
We will revise accordingly to make these points easier to understand.
Q4: I think you use the hyper response because sampling is intractable. But can this hyper-response method only be used in PRO? I think in the population-based formulation of DPO (Line 144), you can also use the hyper response, is that correct?
A7: We appreciate this insightful question. The hyper response is indeed designed to address the intractability of sampling the full response space. According to Theorem 4.1, the hyper response can only be constructed from unlabeled/unobserved responses. In contrast, the population-based DPO loss assumes access to preference feedback for every pair of responses, which means the hyper response cannot be directly applied to it.
However, as stated in Theorem 4.3, under specific choices of the hyperparameters, PRO recovers a form similar to DPO, except that it is based on an augmented empirical preference distribution and involves the hyper-response mechanism. It can thus be treated as an enhanced variant of sample-based DPO, where a pseudo preference is introduced for the hyper response.
Dear Reviewer 5FRF,
As the discussion phase is nearing its end, we would like to kindly follow up and confirm whether our responses have addressed your concerns. If there's anything unclear or if you'd like us to elaborate further, we're eager to do so.
Thanks once again for your time and insights on our work!
Thank you sincerely for your positive feedback and for raising the score. We appreciate your helpful suggestions and will make sure to clarify "underdetermination" and "importance weight diminishing" in the final draft as you advised.
Regarding the question on "hyper response": The hyper response is a conceptual aggregate representing unlabeled responses. In practice, we do not need to explicitly enumerate its elements for approximating the regularizer. We illustrate this below with a simple example.
Assume we have only a pair of labeled responses, denoted by $y_w$ and $y_l$. Given that the hyper response $\mathcal{H}$ consists of all unlabeled responses, the overall response space is $\{y_w, y_l, \mathcal{H}\}$. In the regularizer, the expectation over all pairs can be explicitly written as (omitting constant terms):
$$
\begin{aligned}
&\frac{\alpha}{2}\,\mathbb{E}_{y_1,y_2\,\dot\sim\,\mu}\bigg[D_\text{KL}\bigg(\mathcal{B}\Big(\frac{1}{2}\Big) ~\Bigg|\Bigg|~ \mathcal{B}\Big(\sigma\big(r_\theta(y_1) - r_\theta(y_2)\big)\Big)\bigg)\bigg] \\
&= \alpha\, \mu(y_w)\mu(y_l)\bigg[D_\text{KL}\bigg(\mathcal{B}\Big(\frac{1}{2}\Big) ~\Bigg|\Bigg|~ \mathcal{B}\Big(\sigma\big(r_\theta(y_w) - r_\theta(y_l)\big)\Big)\bigg)\bigg] \\
&\quad + \alpha\, \mu(y_w)\mu(\mathcal{H})\bigg[D_\text{KL}\bigg(\mathcal{B}\Big(\frac{1}{2}\Big) ~\Bigg|\Bigg|~ \mathcal{B}\Big(\sigma\big(r_\theta(y_w) - r_\theta(\mathcal{H})\big)\Big)\bigg)\bigg] \\
&\quad + \alpha\, \mu(y_l)\mu(\mathcal{H})\bigg[D_\text{KL}\bigg(\mathcal{B}\Big(\frac{1}{2}\Big) ~\Bigg|\Bigg|~ \mathcal{B}\Big(\sigma\big(r_\theta(y_l) - r_\theta(\mathcal{H})\big)\Big)\bigg)\bigg],
\end{aligned}
$$
where
$$
\begin{aligned}
\mu(\mathcal{H}) &= 1 - \mu(y_w) - \mu(y_l), \\
r_\theta(\mathcal{H}) &= \beta\log\frac{\pi_\theta(\mathcal{H})}{\pi_\text{ref}(\mathcal{H})} = \beta\log\frac{1-\pi_\theta(y_w) - \pi_\theta(y_l)}{1-\pi_\text{ref}(y_w) - \pi_\text{ref}(y_l)}.
\end{aligned}
$$
This enables us to succinctly represent and efficiently compute the regularizer, without instantiating every possible unlabeled response.
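For illustration, a small numerical sketch of this computation in Python (the probabilities, α, and β below are made-up placeholders, not values from the paper):

```python
import math

alpha, beta = 1.0, 0.1                       # placeholder hyperparameter values
mu = {"y_w": 0.2, "y_l": 0.1}                # placeholder data distribution
pi_theta = {"y_w": 0.30, "y_l": 0.05}        # placeholder policy probabilities
pi_ref = {"y_w": 0.25, "y_l": 0.10}          # placeholder reference probabilities

def sigma(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_half_vs_bernoulli(p):
    # D_KL( B(1/2) || B(p) )
    return 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1.0 - p))

# Hyper-response quantities: aggregate everything that is not labeled.
mu_H = 1.0 - sum(mu.values())
r = {y: beta * math.log(pi_theta[y] / pi_ref[y]) for y in pi_theta}
r["H"] = beta * math.log((1.0 - sum(pi_theta.values())) / (1.0 - sum(pi_ref.values())))

# The three pairwise terms from the expansion above (constants omitted, as in the math).
regularizer = (
    alpha * mu["y_w"] * mu["y_l"] * kl_half_vs_bernoulli(sigma(r["y_w"] - r["y_l"]))
    + alpha * mu["y_w"] * mu_H * kl_half_vs_bernoulli(sigma(r["y_w"] - r["H"]))
    + alpha * mu["y_l"] * mu_H * kl_half_vs_bernoulli(sigma(r["y_l"] - r["H"]))
)
print(regularizer)
```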
We will further clarify and illustrate the function of the hyper response in the revision. Thank you again for your insightful question.
Thanks for your response. I notice that you imply this in Lines 224-226, but you should make it clearer for first-time readers.
Thanks for your reply. It resolves my concerns. My suggestions include clarifying "underdetermination" and "importance weight diminishing" in the final draft. I raise my score to "Accept". One additional question: can you give an example of a "hyper response"? How is the hyper response constructed?
Thank you for this constructive suggestion! We plan to add a figure to visually illustrate how the modified response space and the approximated expectation relate to the original ones. We believe this will make our explanation more accessible.
This paper proposes proximalized preference Optimization (PRO) by reformulating the standard DPO. The reformulation incorporates a complete regularizer to fix likelihood underdetermination and robustly align large language models across pairwise, binary, and scalar feedback. To make the optimization scalable, PRO uses a hyper-response approximation to efficiently maintain absolute likelihoods. It outperforms existing methods by preventing reward hacking and ensuring stable, high-quality model alignment.
Strengths and Weaknesses
Strengths:
- The paper studies an equivalent form of DPO and answers the question of why standard DPO has the problem of likelihood underdetermination, which stems from the oversimplified regularizer.
- The new loss derived can now handle multiple types of feedback, including pairwise, binary, and scalar feedback. Therefore, it is a unified framework.
- PRO is robust to extreme data imbalance.
- The training of PRO is still contrastive, and therefore inherits the advantage of DPO that no reward model training is needed.
Weaknesses:
- The idea of the hyper response is interesting and saves compute. However, I am concerned that this also forces rare or novel responses to share one probability mass, which can suppress diversity.
- PRO now introduces an additional hyperparameter α, which also needs to be tuned jointly with β. Therefore, the tuning might become harder in practice. From the imbalanced experiment, the performance of PRO seems to be affected by α.
Questions
- Can the required α in Thm 4.2 be determined in practice?
- How do you compute the expression in Line 223? This also corresponds to the hyper response.
- How is the hyperresponse constructed?
Limitations
Yes.
Final Justification
The authors have addressed my concerns. I raised my score from 4 to 5.
Formatting Issues
No.
Thanks for your thoughtful comments and appreciating the idea of hyperresponse!
W1: The idea of hyperresponse is interesting and saves compute. However, I am concerned that this also forces rare or novel responses to share one probability mass, which can suppress diversity.
A1: Hyperresponse aggregates a group of unlabeled responses, which prevents us from distinguishing their individual probabilities. However, as established in Theorem 4.1, this aggregation does not alter the total probability mass assigned to this group in the optimal solution; only the allocation within the group is not explicitly managed. This means that while rare or novel responses may sometimes receive low probabilities, the effect arises from limited intra-group control, not from increased competition among these responses.
We would like to further clarify two important points:
- The inability to control the distribution over unlabeled responses is a challenge shared by all existing alignment methods, not one caused by the hyper response. Prior approaches such as DPO, KTO, NCA, and IPO optimize the probabilities only for labeled responses, leaving the rest under-specified. The hyper response simply makes this explicit and transparent, instead of introducing a new issue.
- Accounting for the individual probabilities of unlabeled responses would require explicitly computing them, which is intrinsically computationally intensive. Moreover, existing methods—even if computational resources allowed—typically do not incorporate these probabilities into the loss function. In contrast, PRO takes a step forward by offering a flexible way to trade off computational cost against the granularity of probability allocation over the response space. Specifically, if certain unlabeled responses are deemed important (e.g., we want to preserve the model's capacity for generating them), they can be excluded from the aggregation. In this way, the regularizer is computed over a finer-grained response space, preserving more detailed likelihood knowledge from the reference model.
We appreciate this valuable comment and will incorporate the above discussion in the revision.
W2: PRO now introduces an additional hyperparameter α, which also needs to be tuned jointly with β. Therefore, the tuning might become harder in practice. From the imbalanced experiment, the performance of PRO seems to be affected by α.
A2: We would like to clarify that, in practice, tuning α and β is considerably simpler than it might appear. Specifically, it is not necessary to jointly tune α and β: any value of α above a certain threshold suffices, and for each such α there exist corresponding β values that allow our method to consistently attain strong performance. We first provide a detailed explanation for this assertion, followed by supporting empirical evidence.
By rewriting the PRO loss, the hyperparameters α and β turn out to enter only through a single function inside the regularizer, which allows us to analyze their impact by examining that function directly. Its gradient shows that α determines the maximum gradient magnitude, while β governs how rapidly the gradient grows as its argument departs from 0. If α is too small, the gradient of the overall loss can be dominated by the optimizer, causing the probabilities of unpreferred responses to be pushed towards zero and thus compromising our theoretical guarantees. Theorem 4.2 establishes that there exists a threshold such that, for any α above it, the regularizer maintains its effectiveness and prevents probabilities from vanishing.
Importantly, a sufficiently large α implies that the magnitude of the regularizer gradient should rarely hit its saturation regime during optimization. Now, fix a sufficiently large α together with the β tuned accordingly. Since α and β provide adequate flexibility to shape the gradient curve, for any even larger α we can always select a β such that the curve closely resembles the tuned one prior to its saturation region. For instance, in Figure 3 (right column) of Appendix F, the gradient curves for the two settings shown there largely overlap before saturation. In this way, the optimization procedure progresses similarly under the two settings.
Empirical evidence (please see our response A1 to Reviewer 5FRF for details) also confirms that, given a sufficiently large α, there always exists a suitable β that lets PRO achieve similar performance. In other words, once α exceeds the required threshold, it has only a marginal effect on the performance.
Therefore, the tuning process involves first determining a sufficiently large α, after which β can be tuned independently. Theorem 4.3 provides a valid threshold for α in the case of pairwise feedback, and we elaborate another one for general forms of feedback in A3. With that, we only need to tune β, and the tuning overhead becomes similar to that of DPO.
Q1: Can the required α in Thm 4.2 be determined in practice?
A3: Yes—a sufficient α can be constructively determined in practice.
As shown in the proof of Theorem 4.2 (Line 773), to prevent the PRO loss from decreasing indefinitely as the solution approaches a specified boundary point $\pi_\infty$ of the feasible region, it suffices to select α above a threshold that depends on $\pi_\infty$ (the explicit expression is given in the proof). Once α satisfies this condition for all boundary points, the loss cannot decrease continuously anywhere on the boundary, thus ensuring the existence of an optimal solution within the feasible region.
This can be achieved by further strengthening the above condition to make it independent of $\pi_\infty$:
- Instead of enforcing the inequality only at the specific boundary point, we can require it uniformly.
- Since the relevant quantity admits a simple lower bound, we can safely use that bound for further simplification.
Putting these together, we obtain a sufficient and easily computable choice of α, given by the strengthened condition above.
PS: As discussed in A2, the value of α does not perceptibly affect the performance once it exceeds the required threshold. Therefore, there is no need to pursue the minimal α.
Q2: How do you compute the expression in Line 223? This also corresponds to the hyper response.
A4: The general function in Line 223 is used to illustrate how an expectation can be efficiently computed under the hyper-response approximation. In the case of the PRO loss, it is instantiated as a function involving $\pi_\theta$ and $\pi_\text{ref}$ evaluated on the hyper response. By definition (6), these quantities are obtained by summing the corresponding probabilities over the hyper response, i.e., $\pi(\mathcal{H}) = 1 - \sum_{y\,\text{labeled}}\pi(y)$ for $\pi\in\{\pi_\theta,\pi_\text{ref}\}$.
Since $\mathcal{H}$ is typically constructed (in Lines 215-219) to include all unlabeled responses, its complement corresponds exactly to the labeled responses. As their probabilities are necessarily computed for alignment anyway, computing the above terms incurs negligible additional cost.
Q3: How is the hyperresponse constructed?
A5: Theorems 4.1-4.3 require that the hyper response consist exclusively of unlabeled responses. By default, we construct it by including all such unlabeled responses (Lines 215-219), and this construction is consistently used in our reported experiments.
In initial experiments on the HH dataset, we also explored the effect of excluding 2 or 4 unlabeled, reference-model-generated responses from the hyper response. The results show negligible changes in win rate (less than 0.5%).
However, in scenarios where certain unlabeled responses generated by the reference model are of particular interest, excluding them from the hyper response allows for more targeted regularization and helps maintain the model's ability to generate them.
Dear Reviewer CDcH,
As the discussion phase is nearing its end, we would like to kindly follow up and confirm whether our responses have addressed your concerns. If there's anything unclear or if you'd like us to elaborate further, we're eager to do so.
Thanks once again for your time and insights on our work!
Thank you for raising this insightful follow-up question.
To clarify, the modified response space after introducing the hyper response $\mathcal{H}$ consists of the labeled responses together with $\mathcal{H}$, which acts as a single aggregated response containing all unobserved responses. Therefore, its cardinality is $n+1$, where $n$ is the number of labeled responses and the extra one accounts for the hyper response.
Since the number of labeled responses is usually small in practice (e.g., just a few per prompt), the required sums and expectations over this modified space can be efficiently computed by direct enumeration.
We appreciate the reviewer’s attention to implementation details and will clarify this point explicitly in the revision to preclude any ambiguity.
Thanks for the further explanation. I don't have any concern now.
Thanks for the authors' responses. I still have one more question regarding the sufficient choice of α: since the quantities involved are defined over the response space, can these be easily computed?
Thank you for your follow-up. It’s great to hear that all concerns have been addressed. We appreciate your time and valuable feedback!
Dear Reviewers,
Thank you for your initial reviews. I’d like to remind everyone to actively engage in the author–reviewer discussion (and thank you if you’ve already done so!).
- If authors have resolved your (rebuttal) questions, do tell them so.
- If authors have not resolved your (rebuttal) questions, do tell them so too.
As per NeurIPS review policy this year, please make sure to submit the “Mandatory Acknowledgement” only after you have read the rebuttal and participated in the discussion.
Thank you for your efforts,
AC
This paper proposes Proximalized Preference Optimization (PRO), a variant of DPO designed to resolve likelihood underdetermination and provide a unified framework for aligning language models across diverse types of preference feedback. Reviewers appreciated the clear decomposed formulation, theoretical grounding, and promising empirical results. Given the overall positive sentiment, I recommend acceptance.