AlphaPO: Reward Shape Matters for LLM Alignment
Use a parameter alpha to control the shape of the reward function for alignment training, in order to improve performance
Abstract
Reviews and Discussion
The authors start from likelihood displacement, a phenomenon in which both preferred and dispreferred responses drop in likelihood during optimization. Building on f-DPO, the authors add length normalization to the alpha-divergence reward and show via gradient analysis that alpha controls the strength of likelihood displacement: a larger alpha is less aggressive.
Results show that AlphaPO outperforms DPO and SimPO, with a proper positive alpha achieving the best performance.
Questions for Authors
My main concern is that AlphaPO is less theoretically convincing (see the reviews above). Also, given that the experiments were conducted only on chat tasks, it is hard to convince me that AlphaPO is generally useful.
I would be happy to raise scores if the authors can address my concerns.
Claims and Evidence
alpha in AlphaPO controls the strength of likelihood displacement: a larger alpha gives less aggressive displacement.
Evidence: Figure 2 and Theorem 3.1.
- The theorem does not provide a proof of monotonicity. Instead, it analyzes the extreme cases where alpha goes to -inf or +inf. Therefore, the claim is not fully supported by the theorem. However, the experiment looks good.
AlphaPO places fewer constraints on the range of alpha
Evidence: Not found
- It is unclear why the range of alpha in AlphaPO is wider, given that its theoretical foundation is the same as Wang et al. (2024a)
AlphaPO performs better than DPO and SimPO, along with the corresponding side claims in the ablation study
Evidence: Table 1, and other related figures
- This is well supported by the experiment.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
I checked Theorems 3.1 and 3.2 and Corollary 3.3. I read the proof of Theorem 3.1 and had a quick glance at Theorem 3.2 and Corollary 3.3.
The proof is correct, to my knowledge. For the correctness of the claims, please see the sections above.
Experimental Design and Analysis
The experiment is overall sound and complete.
Several things are unclear to me:
- Do the authors apply hyperparameter tuning for SimPO too?
- Is the hyperparameter search applied to each dataset, or do the authors search on one dataset and reuse the result on another dataset without searching again?
Supplementary Material
N/A
Relation to Prior Literature
Prior works address this problem by adding an NLL loss [1], reducing the learning rate [2], or using an adaptive margin [3].
[1] https://arxiv.org/pdf/2404.19733
[2] https://arxiv.org/pdf/2409.02392
[3] https://arxiv.org/pdf/2402.13228
Essential References Not Discussed
Reducing the learning rate can mitigate the drop in the probability of preferred responses [2], which is not discussed in the paper.
Other Strengths and Weaknesses
Other Strengths:
- The gradient analysis is interesting and gives a picture of how AlphaPO works.
Other Weaknesses:
- The introduction of length normalization is intuitive but lacks theoretical grounding. Specifically, it is unclear what divergence the modified objective represents, and whether it makes sense from a theoretical perspective.
- The authors say that prior work (Wang et al., 2024) shows that varying alpha does not bring much performance gain, but their experiments show that different alpha values affect results substantially. What causes the difference? Is it because of length normalization? More discussion is needed here.
- Experiments are conducted only on chat tasks, making it unclear whether the method is still effective for other tasks such as reasoning.
Other Comments or Suggestions
- It would be better to add a label to the y-axis in Figure 3 for easier understanding without reading the text.
update after rebuttal
I raise my score to 3 since the authors addressed my major concerns.
The theorem does not provide proof ...
We acknowledge that the theorem can be made clearer. In particular, monotonicity can be proved for certain values of α; see the summary table here. In general, the gradient magnitude is not monotonic, as presented in the Illustrations. We present a 3D plot of the relation between the gradient, the response length, and α here, which demonstrates the non-monotonicity clearly.
We will update the draft with the proof of monotonicity.
It is unclear why the range of alpha in ...
In order to derive an explicit formula for the reward without normalization, Wang et al. require that 0 be excluded from the relevant domain, which constrains the allowable range of α. In contrast, our reward construction—similar to SimPO—is not based on divergences and is thus not subject to that constraint. Moreover, the AlphaPO reward function involves length normalization—something Wang et al. do not consider.
Does the author apply hyperparam ...
Yes, we performed comprehensive tuning. The hyperparameters published by the SimPO authors proved optimal for all three models, and our AlpacaEval (AE) and Arena-Hard (AH) results closely match theirs.
Is the hyperparam searching ...
HP search was done independently for every model. The instruct models use on-policy data, so datasets across models are not directly comparable (details in section 4.1).
The HPs introduced in SimPO need to be tuned for every model; however, extensive tuning of α is not required. See the response to Reviewer 2gDT for additional details.
Reducing the learning rate ....
Although a small LR can mitigate the drop in the probabilities of preferred responses, it may result in worse generalization. We verify this by decreasing the LR for Mistral below its optimal value: the result is LC 29.36 / WR 30.28, compared to LC 33.03 / WR 34.12 obtained with the optimal LR. A similar behavior was observed in the SimPO paper (Table 17) for the Gemma-2-9b model.
In reference [2], the authors propose a smaller LR based on a gradient analysis in the paragraph "On-policy sampling and small learning ....". Furthermore, they explicitly acknowledge: "A more comprehensive understanding of the training dynamic of the direct preference learning algorithms remains largely open." Providing such an understanding is a key contribution of our paper.
Additionally, incorporating an NLL loss term [1] harms generalization because it inhibits likelihood displacement of preferred responses. Methods such as RRHF, SLiC-HF, CPO, and ORPO include an NLL loss term, yet their LC / WR scores in Table 12 are considerably lower than those of SimPO. Moreover, as noted in Appendix H of SimPO under the paragraph “Incorporating SFT regularization in SimPO,” while NLL regularization may benefit certain tasks, it leads to a decrease in performance on AE2.
Adaptive Margin [3] is an interesting idea. We will acknowledge this in the paper.
The introduction of length normalization is intuitive but...
The reward function is designed—rather than derived from a divergence—based on its impact on training dynamics and generalization. This approach is similar to SimPO [1] and ORPO [2]. R-DPO [3] uses a length penalty, similar in spirit to [1] and [2]. Tulu 3 [4] successfully uses a length-normalized variant of DPO. Although length normalization is appealing and empirically beneficial, the literature has not yet succeeded in deriving it as an emergent property of any specific divergence.
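For readers less familiar with the distinction drawn here, the reference-based DPO implicit reward (derived from a divergence) and the reference-free, length-normalized SimPO reward (designed) are, in the notation of those papers:

```latex
r_{\mathrm{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
r_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x).
```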
References:
[1] "Simpo"
[2] "Orpo"
[3] "Disentangling length from quality in direct preference optimization."
[4] "Tulu 3: Pushing frontiers..."
Authors say that prior work (Wang et al 2024) shows that α ....
Due to length normalization, the effective α′ in the fDPO exponent becomes α divided by the response length. Based on Theorem 3.1 and our empirical study, a small positive α leads to improved generalization. In the context of fDPO, this translates to a tiny effective α′, especially considering that responses can be as long as 1000 tokens, a regime the authors of fDPO did not consider.
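To make the scale concrete, a small worked example (the specific α and length values are illustrative, drawn from the numbers mentioned in this discussion):

```latex
\alpha'_{\mathrm{eff}} = \frac{\alpha}{|y|}, \qquad
\text{e.g., } \alpha = 0.25,\ |y| = 1000 \ \Rightarrow\ \alpha'_{\mathrm{eff}} = 2.5 \times 10^{-4}.
```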
We also refer the reviewer to Figure 2 of the SimPO paper, which also applies to AlphaPO. Due to length normalization, AlphaPO, like SimPO, consistently achieves a positive reward margin for all response pairs (irrespective of length differences) and consistently improves the margin over the SFT model. This is another key reason why AlphaPO does better than the variant without LN (fDPO).
Experiments are only conducted on chat tasks..
Details + additional experiments can be found in the response to Reviewer G3bY.
It is better to add label to y axis...
We have fixed this.
My main concern is that the AlphaPO is less...
Please see above our response to your specific questions on the theoretical aspects of AlphaPO. Also, please see our answer to Reviewer G3bY for experiments on reasoning and other tasks where AlphaPO still outperforms SimPO.
Thanks for the authors' rebuttal.
If my understanding is correct, the AlphaPO formulation is less theoretically supported and leans more on empirical evidence, such as the range of alpha and length normalization.
I've raised my score to 3 accordingly.
Thank you for acknowledging important aspects of our work, such as the choice of the alpha range and length normalization. Although AlphaPO does not rely on a divergence-based formulation, it is grounded in training dynamics and supported by both theoretical insights and empirical findings. We will further clarify the theoretical aspects in the camera-ready version.
The paper introduces AlphaPO, a variant of f-DPO that adopts the α-divergence and length normalization. The paper shows that varying α affects the shape of the implicit reward. With an appropriate value of α, AlphaPO can mitigate the over-optimization issue of Direct Alignment Methods.
Questions for Authors
- The paper mainly focuses on RLHF tasks. How does AlphaPO perform on reasoning tasks, e.g., math or coding?
- According to Figure 2, the median of the margin stays around 0. However, a well-trained model should have a positive margin between preferred and dispreferred responses, if I understand correctly.
Claims and Evidence
Yes.
- The authors perform extensive experiments over 2 RLHF benchmarks and 3 base models. The effects of crucial hyper-parameters (α and the margin γ) are studied.
- The authors theoretically analyze the effect of varying α on the gradient scale and the likelihood of preferred responses.
There are a few questions though:
- In Section 3.3 the authors mention that "large α values impose a regularization effect on alignment training due to the vanishing gradient for samples with positive length-normalized margins". Methods like IPO [1] and SLiC [2] also limit the margin between preferred and dispreferred responses through an L2 loss or a hinge loss. It would be better if the paper included a discussion of these works.
- Section 3.2 states that length normalization is a crucial element, but the importance of length normalization is not thoroughly discussed in the paper. For example, in Illustration 1 the length is fixed to 1.
[1] Azar, Mohammad Gheshlaghi, et al. "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036 (2023).
[2] Zhao, Yao, et al. "SLiC-HF: Sequence Likelihood Calibration with Human Feedback." arXiv:2305.10425 (2023).
Methods and Evaluation Criteria
Yes. The authors demonstrate that AlphaPO with an appropriate α regularizes the magnitude of the positive margin and thus mitigates over-optimization.
Theoretical Claims
I have checked the main claims, which look alright to me.
Experimental Design and Analysis
The paper follows the standard evaluation protocol of RLHF. The effects of crucial hyper-parameters (α and the margin γ) are studied.
Supplementary Material
I have checked the additional experiments in the appendix. The results support the claim of the paper.
Relation to Prior Literature
The paper studies the problem of over-optimization, a well-known issue in Direct Alignment Algorithms.
Essential References Not Discussed
See "Claims And Evidence".
Other Strengths and Weaknesses
Strengths
- There is a significant improvement on AlpacaEval 2 and Arena-Hard over previous methods like DPO and SimPO.
Weaknesses
- AlphaPO introduces an additional hyper-parameter α. It seems that the optimal value of α depends on the base model and must be found via grid search.
Other Comments or Suggestions
I don't have other comments for this draft.
In section 3.3 the authors mention "large values impose a regularization effect ...
We thank the reviewer for this insightful comment. The reviewer is right to point out that IPO and SLiC are important methods. The reasons we did not include them in the paper are: (1) the SimPO paper already compares against SLiC and IPO and demonstrates that SimPO is substantially better; (2) neither SLiC nor IPO uses length normalization, which is key for generalization; (3) SLiC contains a separate log-likelihood term for the winning response, which can affect likelihood displacement (some likelihood displacement is actually necessary for good generalization), while IPO enforces its margin constraints solely via the L2 loss, which lacks a mechanism to encourage controlled likelihood displacement.
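For comparison, the margin-limiting terms referred to above take roughly the following forms in the original papers (notation adapted; τ denotes IPO's regularization parameter and δ SLiC's margin; SLiC's additional log-likelihood regularization term is omitted):

```latex
\mathcal{L}_{\mathrm{IPO}} =
\left( \log \frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}
               {\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)}
  - \frac{1}{2\tau} \right)^{2},
\qquad
\mathcal{L}_{\mathrm{SLiC}} =
\max\bigl(0,\ \delta - \log \pi_\theta(y_w \mid x) + \log \pi_\theta(y_l \mid x)\bigr).
```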
We will update the paper to include these details in the draft.
Section 3.2 states that length normalization is a crucial element.....
We do want to highlight that we indeed study the effect of generation length on generalization: (1) we show in Figure 3 (left panel) and Figure 8 that longer responses usually result in lower quality (AE 2.0 length-controlled win rate) as we vary α; (2) Section 2.2 of the SimPO paper comprehensively discusses that length normalization is a key element in preventing the reward formulation from biasing the model towards generating longer but lower-quality sequences.
The reviewer is right to point out that Illustration 1 uses a fixed length of 1. We have created a few more illustrations with length > 1: a 3D plot with the gradient norm (log scale) on the z-axis, alpha on the y-axis, and the length of y_w, y_l on the x-axis. We use the following parameters for the plot: log_pi_w = -5, log_pi_l = -10, beta = 5, ∂π_w/∂v = ∂π_l/∂v = 1. The resulting plot can be found here. From the plot: (a) the gradient goes to zero whenever alpha goes to +/- infinity, as proved in Theorem 3.1; (b) in general, the gradient is not a monotonic function of alpha.
We will be happy to include this improved illustration in the camera-ready version.
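To make the described illustration reproducible in spirit, here is a minimal plotting sketch. The reward form r_α(y) = -(β/α)(p^{-α} - 1) with length-normalized probability p, the sigmoid pairwise loss, and the zero target margin are assumptions chosen so that the surface reproduces the qualitative behavior stated above in (a) and (b); they are not formulas copied from the paper. Only the numerical parameters come from the response above.

```python
# Sketch of the "gradient magnitude vs. alpha and response length" surface described above.
# Assumed for illustration (NOT taken from the paper):
#   reward r_alpha(y) = -(beta/alpha) * (p**(-alpha) - 1), with p = exp(log_pi / L),
#   pairwise loss -log sigmoid(r_w - r_l) with zero target margin, and d(pi)/dv = 1.
import numpy as np
import matplotlib.pyplot as plt

beta, log_pi_w, log_pi_l = 5.0, -5.0, -10.0   # parameters quoted in the response

def grad_magnitude(alpha, length):
    p_w = np.exp(log_pi_w / length)            # length-normalized probabilities
    p_l = np.exp(log_pi_l / length)
    margin = -(beta / alpha) * (p_w ** (-alpha) - p_l ** (-alpha))   # r_w - r_l
    # dr/dv = beta * p**(-alpha) / (L * pi) under the assumptions above
    dr_w = beta * p_w ** (-alpha) / (length * np.exp(log_pi_w))
    dr_l = beta * p_l ** (-alpha) / (length * np.exp(log_pi_l))
    # |dLoss/dv| = sigmoid(-margin) * |dr_w - dr_l|, computed stably
    return np.exp(-np.logaddexp(0.0, margin)) * np.abs(dr_w - dr_l)

alphas = np.linspace(-2.0, 2.0, 201)
alphas = alphas[np.abs(alphas) > 1e-2]         # skip the removable singularity at alpha = 0
lengths = np.linspace(1.0, 50.0, 100)
A, L = np.meshgrid(alphas, lengths)
Z = np.log10(np.maximum(grad_magnitude(A, L), 1e-30))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(L, A, Z, cmap="viridis")
ax.set_xlabel("length of y_w, y_l")
ax.set_ylabel("alpha")
ax.set_zlabel("log10 gradient magnitude")
plt.show()
```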
AlphaPO introduces an additional hyper-parameter α. ....
Please see our response to Reviewer 2gDt for a detailed treatment.
The paper mainly focuses on RLHF tasks. How does AlphaPO perform on reasoning tasks, e.g., math or coding?
Thank you for the question. Our evaluation is on AlpacaEval 2.0 (AE2.0) and ArenaHard. Both benchmarks include math and coding questions. Below is one example from each category of each benchmark.
AE2.0
Coding
- Question 1:
Write a C++ function that takes a reference to a std::string containing markdown formatted text and returns a std::string containing HTML formatted text.
Math
- Question 1:
Given that f(x) = 5x³ - 2x + 3, find the value of f(2).
ArenaHard
Coding
- Question 1:
I have a Python script that scrapes a webpage using Playwright. Now I want to start ten instances of that script in parallel on one AWS EC2 instance, but so that each script binds to a different IP address. How can I do that with Terraform?
Math
- Question 1:
What is the 95% confidence interval for the sum of 100 fair six-sided dice?
Based on the reviewer’s questions, we conducted additional experiments:
- HellaSwag – A commonsense reasoning benchmark.
- TruthfulQA – Carefully designed questions to test models' susceptibility to common misconceptions and their factual accuracy.
We compare AlphaPO to SimPO on these datasets:
HellaSwag (10 Shot)
| Model | AlphaPO | SimPO |
|---|---|---|
| Llama3-8b-instruct | 0.7694 | 0.7576 |
| Mistral-7b-instruct | 0.8638 | 0.8610 |
TruthfulQA
| Model | AlphaPO | SimPO |
|---|---|---|
| Llama3-8b-instruct | 0.6142 | 0.6078 |
| Mistral-7b-instruct | 0.7127 | 0.7061 |
Based on these results, it is clear that AlphaPO outperforms SimPO.
According to figure 2, the median of the margin stays around 0 ..
In Figure 2, we plot the margin of length-normalized log probability, i.e., (1/|y_w|) log π_θ(y_w|x) − (1/|y_l|) log π_θ(y_l|x), instead of the reward margin (which carries an additional factor of β), to better track the training dynamics. That is why the absolute value of the margin is small. Our results align with the training curve open-sourced by the SimPO authors (link) for the Gemma2 model (the Mistral training curve is not published by the authors).
At the end of training, the SimPO authors reported an average eval margin of 0.48, while our average eval margin is 0.528 and the median is 0.372 (Figure 6). Similarly, the positive but small margin of length-normalized log probability is as expected.
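As a small, self-contained illustration of the quantity tracked in Figure 2 (the function and variable names here are hypothetical, not code from the paper):

```python
import numpy as np

def length_normalized_margin(chosen_token_logps, rejected_token_logps):
    """Margin of length-normalized log probability between a preferred
    and a dispreferred response (no beta factor, as plotted in Figure 2)."""
    chosen = np.mean(chosen_token_logps)      # (1/|y_w|) * log pi(y_w | x)
    rejected = np.mean(rejected_token_logps)  # (1/|y_l|) * log pi(y_l | x)
    return chosen - rejected

# Toy example: per-token log-probs for a 4-token preferred and a 6-token dispreferred response.
print(length_normalized_margin([-0.2, -0.5, -0.1, -0.4],
                               [-0.9, -0.7, -1.1, -0.8, -0.6, -1.0]))  # -> 0.55
```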
The paper introduces AlphaPO, a novel preference training algorithm designed to improve the alignment of large language models (LLMs) by modifying the shape of the reward function used in preference optimization.
Unlike existing methods such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), which often suffer from likelihood displacement and reward over-optimization, AlphaPO leverages an α parameter to adjust the shape of the reward function, allowing better control over these issues.
The authors demonstrate that AlphaPO achieves improvements in alignment performance, with relative gains of 7% to 10% over SimPO and 15% to 50% over DPO on models like Mistral-7B and Llama3-8B. Comparable performance to SimPO is obtained on Gemma models.
Through gradient analysis and experiments, the paper shows that AlphaPO's reward function shapes influence the training dynamics, leading to better generalization and reduced over-optimization. The results highlight the importance of reward function design in LLM alignment and suggest that AlphaPO's approach can be extended to other alignment methods.
update after rebuttal
My position still stands at weak accept.
Questions for Authors
Can you discuss in more detail how you made the hyper-parameter choices for your experiments? Given that your results are very close to SimPO's, could hyper-parameters be the differing factor?
Also, does reward shaping help stabilize training? Finding the right hyper-parameters for SimPO has been a pain point for practitioners.
Claims and Evidence
- Reward shaping affects the training dynamics, and a properly shaped reward can alleviate the likelihood displacement issue. This claim is supported by the analysis in Figure 2 tracking the likelihood for different alpha values.
- The paper claims better generalizability after reward shaping for the SimPO method. The claim is supported by superior results on widely acknowledged benchmarks in Table 1.
Methods and Evaluation Criteria
Makes sense. AlpacaEval and Arena-Hard are the right benchmarks for preference alignment.
Theoretical Claims
I checked the proof of Theorem 3.1, and it looks good.
Experimental Design and Analysis
The experiment design is good.
Supplementary Material
NA
Relation to Prior Literature
It relates to offline preference training literature.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
The main weakness is that the absolute improvement over SimPO is limited. If the paper could also address the stability issues of SimPO-like methods (the choice of hyper-parameters), it would make a more impactful contribution.
Other Comments or Suggestions
NA
The main weakness is that the absolute improvement over SimPO is limited. If the paper could also address the stability issues of SimPO-like methods (the choice of hyper-parameters), it would make a more impactful contribution.
We thank the reviewer for the comment. We would like to point out that our improvements over SimPO are actually significant and in line with those reported in other well-established papers:
- Reviewer G3bY pointed out that AlphaPO provides a significant improvement on AlpacaEval 2 and Arena-Hard over methods like DPO and SimPO.
- We also compared our gains to the gains of established, popular methods like ORPO [1] and SimPO [2].
- ORPO achieves a relative gain of ~10% over the Zephyr model, which was the SOTA model at the time of publication (Table 1 in the ORPO paper).
- From Table 4 of the SimPO paper, it is clear that many established methods such as SLiC-HF [3], IPO [4], ORPO [1], and KTO [5] perform at par or worse than both DPO and SimPO on many benchmarks.
- Thus, improving upon SimPO's performance is a challenging problem.
We thank the reviewer for the important comment about stability issues for SimPO related to choice of hyperparameters. We address this in the responses below.
Can you discuss more on how you made the hyper-parameter choices for your experiment. Given that the differences between your result is very close to SimPO, could hyper-parameters be the differing factor?
As discussed above, we would like to emphasize that our improvements over SimPO are significant and consistent with observations from other papers (please refer to the discussion above).
From Table 1 in our paper, we observe the following:
- For Mistral-Instruct, AlphaPO achieves an LCWR of 33.03, while SimPO results in 29.71. This is a substantial improvement.
- For LLaMA-Instruct, AlphaPO yields an LCWR of 45.37, compared to 42.05 from SimPO — again, a significant improvement.
To tune AlphaPO, we started with the best-tuned SimPO baseline and then lightly tuned gamma and the learning rate (LR). In many cases, we did not need to modify gamma or the LR at all. As for alpha, both theoretical insights and ablation studies indicate that extremely large positive or negative values suppress the gradient. In practice, a small positive value, typically around 0.1 or 0.25, consistently improves performance over SimPO.
We believe AlphaPO could perform even better with more extensive tuning, which we were unable to pursue due to compute limitations.
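A schematic of this tuning recipe (all config keys and values below are illustrative placeholders, not the exact settings used in the paper):

```python
# Sketch: reuse the best SimPO hyper-parameters and sweep only alpha.
# All names/values are illustrative; see the paper/appendix for the actual HPs.
simpo_best_hps = {
    "learning_rate": 1e-6,   # placeholder value
    "beta": 10.0,            # placeholder value
    "gamma": 5.0,            # placeholder value (SimPO target-margin term)
}

alpha_grid = [0.1, 0.25]     # small positive values, per the discussion above

configs = [{**simpo_best_hps, "alpha": a} for a in alpha_grid]
for cfg in configs:
    print(cfg)               # each config would be passed to an AlphaPO training run
```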
We appreciate the reviewer's question and conducted an additional experiment to investigate further. Specifically, we took the SimPO-tuned hyperparameters, changed only alpha to a positive value, and trained an AlphaPO model. Note that this setup is not necessarily optimal for AlphaPO. The results are:
LLaMA-3-Instruct:
- AlphaPO (SimPO HPs, α = 0.25) → LCWR: 43.33, WR: 38.51
- SimPO → LCWR: 42.05, WR: 36.90
LLaMA-3-Instruct ArmoRM:
- AlphaPO (SimPO HPs, α = 0.25) → LCWR: 52.91, WR: 47.21
- SimPO → LCWR: 51.66, WR: 46.54
These results demonstrate that simply modifying alpha, without changing any other hyperparameters, can lead to clear performance gains over SimPO.
Also, does reward shaping help with stabilizing training, because find right hyper-parameter for SimPO has been a pain point for people.
Tuning the hyperparameters for SimPO has long been a challenge. While the reward shaping introduced in our method does help stabilize training by regularizing the gradient (see Theorem 3.1), empirically SimPO (α = 0) typically exhibits the most aggressive increase in the margin (as demonstrated in Figures 14 and 15). However, despite its stabilizing effect against reward over-optimization, this does not always translate into improved generalization. Therefore, while reward shaping aids training stability, it does not fully remove the need for hyperparameter tuning to achieve optimal generalization; rather, it further enhances generalization on top of SimPO.
[1] Hong, Jiwoo, Noah Lee, and James Thorne. "Orpo: Monolithic preference optimization without reference model." arXiv preprint arXiv:2403.07691 (2024).
[2] Meng, Yu, Mengzhou Xia, and Danqi Chen. "Simpo: Simple preference optimization with a reference-free reward." Advances in Neural Information Processing Systems 37 (2024): 124198-124235.
[3] Zhao, Yao, et al. "Slic-hf: Sequence likelihood calibration with human feedback." arXiv preprint arXiv:2305.10425 (2023).
[4] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.
[5] Ethayarajh, Kawin, et al. "Kto: Model alignment as prospect theoretic optimization." arXiv preprint arXiv:2402.01306 (2024).
Summary
The paper, titled "AlphaPO - Reward Shape Matters for LLM Alignment", proposes a new method called AlphaPO that falls under the category of Direct Alignment Algorithms (DAAs) for aligning large language models (LLMs) to human preferences. The paper shows that varying the α of the α-divergence affects the shape of the implicit reward. The authors demonstrate that AlphaPO improves alignment performance, with relative gains of 7% to 10% over SimPO and 15% to 50% over DPO on models like Mistral-7B and Llama3-8B. Comparable performance to SimPO is obtained on Gemma models.
Decision
Overall, the paper is well written. The topic is timely and important. The reviewers raised several concerns and provided feedback to the authors. The authors did a good job during the rebuttal, addressing the reviewers' concerns, and some of the reviewers increased their scores as a result. The results reported in the paper are good; there is a significant improvement on AlpacaEval 2 and Arena-Hard over other methods like DPO and SimPO. The proposed approach is reasonable. I recommend that the authors address the main issues raised by the reviewers during the rebuttal period in the final version of the paper. This paper presents a new approach with a simple modification of the pre-existing SimPO loss; thus, I believe it is a meaningful contribution to the community. As a result, I recommend this paper for acceptance.