PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 3, 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

ROPO: Robust Preference Optimization for Large Language Models

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We propose an iterative alignment framework that mitigates the impact of preference noise by effectively identifying and filtering noisy samples.

Abstract

Keywords
noise tolerance · large language models · preference optimization

Reviews and Discussion

Review
Rating: 3

The paper considers the alignment problem of large language models (LLMs) trained on noisy preference data, where human preferences are flipped with a certain probability $\eta$. To align the LLM in a robust manner and mitigate performance degradation due to noisy data, the authors propose an iterative alignment framework that alternates between training the model in a noise-tolerant manner and filtering out noisy samples.

They begin by formulating this as a constrained optimization problem, aiming to minimize a weighted loss over the samples, where the weights are dynamic and intended to be smaller for noisy samples. Moreover, they introduce a constraint ensuring that the sum of these weights remains a fraction of the total number of samples, thereby reducing the effect of noisy data.
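For concreteness, a schematic way to write this constrained formulation (notation is illustrative and mine, not the paper's exact expression; $w_i$ are the dynamic per-sample weights and $\rho$ the assumed fraction of samples to down-weight) is:

$$\min_{\theta,\, w}\ \sum_{i=1}^{N} w_i\,\ell\!\left(\theta;\, x_i, y_i^{+}, y_i^{-}\right) \quad \text{s.t.}\quad \sum_{i=1}^{N} w_i = (1-\rho)\,N,\ \ w_i \in [0, 1].$$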

Next, they analyze the noise tolerance of the Direct Preference Optimization (DPO) loss and demonstrate that it is ineffective in distinguishing between noisy and clean samples. Specifically, DPO aggressively updates the model parameters to fit noisy samples during gradient descent. To address this issue, the authors propose an alternative loss function, $\ell_{\rm na}$, which exhibits greater noise tolerance and better differentiates between clean and noisy samples. Furthermore, they introduce the ROPO loss, $\ell_{\rm ropo}$, which combines $\ell_{\rm na}$ and $\ell_{\rm dpo}$ with a trade-off parameter $\alpha$. A detailed discussion is provided on the choice of this parameter and its practical implications.
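For intuition, here is a minimal sketch of the two losses as functions of the ($\beta$-scaled) implicit reward margin, using the definitions the authors give later in the rebuttal ($\ell_{\rm dpo} = -\log P$, $\ell_{\rm na} = 1 - P$ with $P = \sigma(\text{margin})$). The weighted combination below is only an assumed illustrative form, not the paper's exact expression:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(margin: float) -> float:
    """DPO loss -log sigma(margin); unbounded as the margin -> -inf."""
    return -math.log(sigmoid(margin))

def na_loss(margin: float) -> float:
    """Noise-tolerant loss 1 - sigma(margin); bounded in [0, 1]."""
    return 1.0 - sigmoid(margin)

def ropo_loss(margin: float, alpha: float = 14.0) -> float:
    """Illustrative trade-off between the two terms (assumed form: alpha * l_na + l_dpo)."""
    return alpha * na_loss(margin) + dpo_loss(margin)

# A likely-noisy pair (large negative margin) blows up l_dpo but keeps l_na bounded.
for m in (-5.0, 0.0, 5.0):
    print(f"margin={m:+.1f}  l_dpo={dpo_loss(m):.3f}  l_na={na_loss(m):.3f}  l_ropo={ropo_loss(m):.3f}")
```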

Finally, they propose a rejection sampling strategy that generates new responses for samples identified as noisy and creates candidate samples using both the generated responses and the original responses from the dataset. They then select the candidate sample with minimal loss and include it in the next stage of training.
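A rough pseudocode sketch of this selection step as I understand it (function and variable names are mine, not the authors', and the exact candidate construction may differ from the paper):

```python
def rejection_sample(model, loss_fn, noisy_sample, num_candidates=4):
    """For a sample flagged as noisy, build candidate preference pairs from newly
    generated responses plus the original ones, and keep the lowest-loss candidate."""
    prompt, y_chosen, y_rejected = noisy_sample
    generated = [model.generate(prompt) for _ in range(num_candidates)]

    # Candidate pairs mix generated responses with the original responses.
    candidates = [(prompt, y_chosen, g) for g in generated]
    candidates += [(prompt, g, y_rejected) for g in generated]
    candidates.append((prompt, y_chosen, y_rejected))  # keep the original pair as a fallback

    # Select the candidate the current model is most confident about (minimal loss).
    return min(candidates, key=lambda pair: loss_fn(model, pair))
```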

In the experimental section, they evaluate the effectiveness of their proposed method against baselines across three datasets and two base models under different levels of artificial noise injected into the data. Additionally, they conduct ablation studies to examine the impact of different framework components and hyperparameters on overall performance.
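The artificial noise here is symmetric label flipping with probability $\eta$, as described at the start of the summary; a minimal sketch of such an injection step (helper name and data layout are illustrative assumptions):

```python
import random

def inject_preference_noise(dataset, eta: float, seed: int = 0):
    """Flip the chosen/rejected responses of each preference pair with probability eta."""
    rng = random.Random(seed)
    noisy = []
    for prompt, chosen, rejected in dataset:
        if rng.random() < eta:
            chosen, rejected = rejected, chosen  # flipped (noisy) preference label
        noisy.append((prompt, chosen, rejected))
    return noisy
```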

Questions for Authors

Kindly refer to weaknesses above.

Claims and Evidence

All theorems stated in the paper are accompanied by proper proofs in the appendix. However, the robustness-guided rejection sampling strategy presented in Section 3.3 lacks a rigorous theoretical foundation and is primarily heuristic in nature. Specifically, the authors sample multiple responses from the model and generate candidate samples using both the newly generated responses and the original responses from the dataset. They then select the candidate sample with minimal loss and include it in the next stage of training.

However, a formal justification for the effectiveness and robustness of this strategy in practice is lacking. In particular, there is no guarantee that this strategy would not introduce noisy samples into the data. A more thorough theoretical analysis would strengthen the validity of this approach.

Methods and Evaluation Criteria

The proposed method has been properly evaluated on multiple datasets and models. Further, ablation studies have been performed to study the utility of the different components of the method. However, one limitation is that the noise has been artificially injected into the dataset to demonstrate the utility of the approach (in the 0% noise-injection case, the performance difference is minimal except on the TL;DR dataset).

Theoretical Claims

All theorems stated in the paper have proper proofs in the appendix.

Experimental Designs or Analyses

The proposed method has been properly evaluated on multiple datasets and models. Further, ablation studies have been performed to study the utility of the different components of the method. However, in Tables 2 and 3, other baselines like cDPO and rDPO are missing. In particular, it would be interesting to compare the performance of rDPO against different components of their proposed ROPO framework in the ablation studies.

Supplementary Material

Yes, particularly reviewed the proofs in Appendix F.

Relation to Broader Scientific Literature

The paper analyzes the effect of noisy preference data in the alignment of large language models (LLMs). Prior work on this problem primarily focuses on alternative loss functions that are robust to noise and often require prior knowledge of the percentage of noisy data. In this work, the authors first propose a noise-tolerant loss function that facilitates the identification of noisy samples while preventing overfitting to noisy data. Building on this, they introduce an effective noise-filtering strategy to remove noisy samples from the dataset. Furthermore, they propose a robustness-guided rejection sampling technique to introduce new clean samples into the data.

Essential References Not Discussed

A few references in the area of robust preference optimization of LLMs are missing. Namely,

  1. Choi, Eugene, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, and Mohammad Gheshlaghi Azar. "Self-improving robust preference optimization." arXiv preprint arXiv:2406.01660 (2024).
  2. Bukharin, Alexander, Ilgee Hong, Haoming Jiang, Zichong Li, Qingru Zhang, Zixuan Zhang, and Tuo Zhao. "Robust reinforcement learning from corrupted human feedback." arXiv preprint arXiv:2406.15568 (2024).
  3. Yan, Yuzi, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, and Yuan Shen. "Reward-robust rlhf in llms." arXiv preprint arXiv:2409.15360 (2024).
  4. Wu, Junkang, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. "Towards robust alignment of language models: Distributionally robustifying direct preference optimization." arXiv preprint arXiv:2407.07880 (2024).
  5. Ramesh, Shyam Sundhar, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic. "Group robust preference optimization in reward-free rlhf." Advances in Neural Information Processing Systems 37 (2024): 37100-37137.

Other Strengths and Weaknesses

Strengths

The paper analyzes the effect of noisy preference data in LLM alignment, addressing an important problem in this domain. The authors propose multiple strategies to tackle this issue in an iterative manner, collectively forming a novel framework in this area. Specifically, they first introduce a noise-tolerant loss function that facilitates the identification of noisy samples while preventing overfitting to noisy data. Building on this, they propose an effective noise-filtering strategy to remove noisy samples from the dataset. Furthermore, they introduce a robustness-guided rejection sampling technique to incorporate new clean samples into the data.

Weaknesses

  1. The robustness-guided rejection sampling strategy in Section 3.3 lacks a rigorous theoretical foundation and is primarily heuristic in nature. Specifically, the authors sample multiple responses from the model and generate candidate samples using both the newly generated responses and the original responses from the dataset. They then select the candidate sample with the minimal loss and include it in the next stage of training. However, a formal justification for the effectiveness and robustness of this strategy in practice is lacking. In particular, is there any guarantee that this strategy would not introduce noisy samples into the data?

  2. Unlike prior noise-robust approaches that require knowledge of the percentage of noisy data, the proposed method iteratively estimates and filters noisy samples. However, the approach introduces a new hyperparameter, $\alpha$, which governs the trade-off between $\ell_{\rm na}$ and $\ell_{\rm dpo}$, and must be either estimated or predefined. Additionally, the method requires estimating the filtering ratio $\rho$, which may vary across different datasets and applications.

Other Comments or Suggestions

Consider adding the baseline methods' performance in Tables 2 and 3.

Author Response

Dear Reviewer GZqo,

Thank you for your valuable review. We respond to each comment as follows and sincerely hope that our response can properly address your concerns.

Tables can be found in GZqo.md in https://anonymous.4open.science/r/ICML25-ROPO-F6CD

Claims And Evidence

C1: A formal justification and theoretical analysis for the effectiveness and robustness of the rejection sampling (RS) technique would strengthen the validity of this approach.

Res: The reliability of RS can be guaranteed by the noise identification capability of our loss function, similar to Theorem 3.5. Specifically, since both our RS and noisy sample filtering use loss values as the criterion for sample selection/filtering, their effectiveness is guaranteed similarly.

Due to the rebuttal length limit, we will include the detailed analysis in the paper. We would greatly appreciate your understanding.

Methods And Evaluation Criteria

M1: The noise has been artificially injected in the dataset to demonstrate the utility of the approach (under 0% noise injected, the performance difference is minimal except for the TL;DR dataset).

Res: We understand your concern that artificial noise may not align well with real-world scenarios. For this, please see the experiments in Appendices E.3.1 and E.3.2, which use practical noise arising from annotators' trust in larger models over smaller ones and from LLM comparisons, respectively. For your convenience, the results are also provided in Tables GZqo-1 and GZqo-2 in the anonymous link. As shown, ROPO significantly outperforms the baselines in both settings.

While ROPO's improvement may not always be substantial at 0% artificial noise, we humbly believe that experiments across several practical and artificial noise settings sufficiently demonstrate its advantages over baselines and its contributions to preference alignment.

Experimental Designs Or Analyses

E1: In Tables 2 and 3, cDPO and rDPO are missing. In particular, ... rDPO against different components of ROPO.

Res: We have added them into Tables 2 and 3. Please see Tables GZqo-3 and GZqo-4 in the anonymous link.

Essential References Not Discussed

R1: A few references [3-7] on robust preference optimization of LLMs are missing.

Res: We will expand the Related Work section as follows.

Robust Preference Alignment of LLMs. Many efforts have been made from various perspectives to achieve robust preference alignment [1-7]. Specifically, [1,2] use label smoothing to mitigate the impact of preference noise. [3] improves the model's adaptability to different preference distributions and enables iterative output refinement by jointly optimizing a self-improvement policy and a generative policy. [4] models potentially corrupted preference labels as sparse outliers and solves an $\ell_1$-regularized maximum likelihood estimation problem, thereby consistently learning the true underlying reward. [5] introduces a multi-head reward model (RM) that reflects each head's confidence in the output reward using the standard deviation of a Gaussian distribution, effectively addressing the challenge of RM imperfections in RM-based RLHF. [6] focuses on different forms of noise and enhances DPO's resilience to both pointwise and pairwise noise in LLM alignment by leveraging Distributionally Robust Optimization (DRO). [7] robustly aligns LLMs to the preferences of diverse individual groups by incorporating group information into the LLM context and optimizing against the worst-case alignment performance across all groups. Compared to them, our method integrates noise-tolerance and noise-identification capabilities without external models, offering a novel paradigm for robust preference alignment.

Other Strengths And Weaknesses

W1: Please see C1.

W2: ROPO introduces a hyperparameter $\alpha$ and requires estimating $\rho$, which may vary across different datasets and applications.

Res: Our ablations show that ROPO is insensitive to $\alpha$ and $\rho$ within our recommended range and does not require extensive tuning for different tasks.

As stated in Section 4.2, we fix $\alpha=14$ and $\rho=0.2$ without tuning them in most experiments, after observing that ROPO is insensitive to $\alpha$ and $\rho$. For readers, we also recommend $\rho=0.2$ in practice. As for $\alpha$, we recommend $\alpha=14$ or $30$ on relatively objective tasks (e.g., summarization), and $\alpha=6$ or $14$ on relatively subjective tasks (e.g., dialogue).

Other Comments Or Suggestions

S1: Please see E1.


[1] Provably Robust DPO: Aligning Language Models with Noisy Feedback

[2] A Note on DPO with Noisy Preferences & Relationship to IPO

[3] Self-Improving Robust Preference Optimization

[4] Robust Reinforcement Learning from Corrupted Human Feedback

[5] Reward-Robust RLHF in LLMs

[6] Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

[7] Group Robust Preference Optimization in Reward-Free RLHF

Review
Rating: 4

This paper addresses the problem of robustly learning preferences from noisy preference data. It proposes the ROPO framework, which iteratively filters noisy preference data and aligns the LLM with the filtered data. The ROPO framework consists of three key modules: 1) a noise-aware DPO loss for preference alignment, 2) noisy-preference filtering based on the noise-aware loss, and 3) response resampling based on the noise-aware loss.

The empirical results show that ROPO is better than other noisy-preference-learning baselines on extensive datasets.

Questions for Authors

Line 245, is this a typo? What's the difference between $\ell_{\rm na}$ and $\ell_{\rm dpo}$?

Claims and Evidence

Yes, as far as I can see, the claims are well supported.

Methods and Evaluation Criteria

The proposed iterative framework does make sense for robustly refining and learning the preference, and the empirical results support this. However, I have one concern: as the framework heavily relies on enhancing its internal preference judgement, would the following cases undermine its effectiveness?

  1. If the LLM is aligned to the noisy preference in the beginning, would the error be carried and enlarged throughout the iteration?
  2. If the target preference is very different from the initial model's judgement, would the preference learning be very inefficient?

Theoretical Claims

Yes, as far as I can see, the claims are accurate.

Experimental Designs or Analyses

The experiment setup is extensive; I am especially glad to see the ablation study on the noise ratio $\rho$ in Figure 3 and on the different components in Table 3.

One potential concern is that all compared baselines are noisy-preference-learning methods; is there any noisy-data-filtering baseline that can be compared?

Supplementary Material

Appendix A, Algorithm 1. This is the main algorithm of the proposed ROPO framework. This is clear.

Appendix B, Related works. This is also clear.

Relation to Broader Scientific Literature

As far as I can see, this paper fits into the literature well. There hasn't been prior work on an iterative framework for robust preference learning. The noise-aware loss also makes sense.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  • The content of this paper is substantial and the experiments and ablation study are extensive.

Weakness:

  • The presentation of this paper is debatable. The related work section shouldn't be put in the appendix, there isn't a clear description of the overall framework in the main paper, and the overall algorithm is relegated to Appendix A.

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer tkDe,

Thank you for your valuable review. We respond to each comment as follows and sincerely hope that our response can properly address your concerns.

Figures and Tables can be found in tkDe.md in https://anonymous.4open.science/r/ICML25-ROPO-F6CD


Methods And Evaluation Criteria

M1: Would the following cases undermine the effectiveness of the framework? (1) If the LLM is aligned to the noisy preference in the beginning, would the error be carried and enlarged throughout the iteration? (2) If the target preference is very different from the initial model's judgement, would the preference learning be very inefficient?

Res: These two cases will undermine the effectiveness of various preference optimization methods, not just ROPO.

Please note that in the standard practice of DPO-like methods, preference optimization typically requires the initial model to have undergone supervised fine-tuning (SFT) on data that is in-distribution for the DPO algorithm beforehand [1,2]. In other words, the model should possess a basic ability to judge preferences before preference optimization is performed. Based on this prerequisite, we derive the gradient and corresponding loss function for ROPO.

If the initial SFT model lacks basic capability, a reasonable suggestion is to perform SFT first to enhance its fundamental preference judgment ability, rather than proceeding directly with preference training. Otherwise, both DPO and ROPO would be ineffective.


Experimental Designs Or Analyses

E1: Is there any noisy data filtering baseline that can be compared?

Res: Our experiments in Appendix E.5 have tested a confidence-based data-filtering method, and we find that it underperforms standard DPO, and thus underperforms ROPO. This further supports our claim that the widely used cross-entropy loss (i.e., the DPO loss) cannot serve as a reliable measure of model confidence in scenarios containing noisy preferences. The details are as follows.

According to [3], confidence-based data filtering is a popular approach to combating noisy preferences. Since our paper focuses on DPO-like methods that leverage implicit rewards, a natural choice is to use the implicit rewards and the corresponding loss to reflect confidence. To this end, we conduct experiments combining DPO with noisy-sample filtering (NSF) and rejection sampling (RS), using Mistral-7B as the base model and UFB as the training dataset. For your convenience, the results are shown in Table tkDe-1 in the anonymous link. As can be seen, the incorporation of noisy-sample filtering and rejection sampling degrades the performance of DPO, especially at 20% artificial noise.
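A minimal sketch of such a loss-as-confidence filtering baseline (function name, data layout, and the drop ratio are illustrative assumptions, not the exact configuration used):

```python
def filter_by_loss(samples, losses, drop_ratio=0.2):
    """Drop the drop_ratio fraction of samples with the largest DPO loss,
    treating low loss as high model confidence."""
    keep_count = int(len(samples) * (1.0 - drop_ratio))
    ranked = sorted(zip(samples, losses), key=lambda pair: pair[1])  # ascending loss
    return [sample for sample, _ in ranked[:keep_count]]
```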


Other Strengths And Weaknesses

W1: The related work section shouldn't be put in the appendix. There isn't a clear description of the overall framework in the main paper. The overall algorithm is put in Appendix A.

Res: We will put the related work section and the overall algorithm in the main text by reorganizing the presentation.


Questions For Authors

Q1: Is $\ell_{\rm na}$ in Line 245 a typo? What's the difference between $\ell_{\rm na}$ and $\ell_{\rm dpo}$?

Res: This is not a typo. If we denote $P = \sigma\!\left(\beta\log\frac{\pi_\theta(y_1 \mid x)}{\pi_{\rm ref}(y_1\mid x)} - \beta \log\frac{\pi_\theta(y_2\mid x)}{\pi_{\rm ref}(y_2\mid x)}\right)$, then $\ell_{\rm dpo} = -\log P$ and $\ell_{\rm na} = 1-P$.
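A quick derivative check from these definitions (an illustrative derivation, not a quote from the paper) makes the practical difference concrete. Writing $z$ for the scaled margin inside $\sigma(\cdot)$,

$$\frac{\partial \ell_{\rm dpo}}{\partial z} = -(1-P), \qquad \frac{\partial \ell_{\rm na}}{\partial z} = -P(1-P),$$

so on a pair the model is confident is mislabeled ($P \to 0$), the DPO gradient magnitude approaches 1 while the $\ell_{\rm na}$ gradient vanishes, which is why $\ell_{\rm na}$ does not aggressively fit such samples.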


[1] Direct Preference Optimization: Your Language Model is Secretly a Reward Model

[2] https://huggingface.co/docs/trl/dpo_trainer

[3] Impact of Preference Noise on the Alignment Performance of Generative Language Models

Reviewer Comment

Thanks to the authors for the rebuttal. I agree with most of it, but I think there's a misunderstanding of my question about the method.

I'm not questioning what happens if the SFT model disagrees with the preference data. The question is more about the ordering of the preference data. Because the proposed method iteratively enhances the model's own judgement, if the model is trained with noisy data in the beginning, it would be more difficult to adjust.

Author Comment

Dear Reviewer tkDe and the other three reviewers who may read this response,

We would like to take this final chance to express our gratitude for your insightful comments, valuable suggestions, and positive feedback. Your time and dedication have made a significant contribution to the improvement of our paper.


We respond to Reviewer tkDe's Rebuttal Comment as follows, and we sincerely hope that our response provides an appropriate answer to the question.

M1 in the initial review: Would the following cases undermine the effectiveness ... would the preference learning be very inefficient?

Rebuttal Comment: I'm not questioning what happens if the SFT model disagrees with the preference data. The question is more about the ordering of the preference data. Because the proposed method iteratively enhances the model's own judgement, if the model is trained with noisy data in the beginning, it would be more difficult to adjust.

Res: Thanks for patiently pointing out our misunderstanding of your question.

TL;DR: (i) The performance of ROPO will degrade if the model is trained with noisy data in the beginning, but based on an analysis of the loss functions, the impact is smaller than that on DPO. (ii) Furthermore, we provide a probabilistic analysis to show that such a case is unlikely to occur in practice after shuffling the dataset.

  1. The impact on ROPO is smaller than that on DPO. Suppose that all the early samples are noisy. According to our assumption that the SFT model possesses basic preference judgment ability (see our rebuttal), the model is likely to assign a large implicit reward margin $\Delta(y_2, y_1, x)=\hat{r}(y_2,x) - \hat{r}(y_1,x)$ to noisy samples $(x, y_1, y_2, y_1 \succ y_2)$. Based on our analysis in Section 3.2 (starting from Line 200 in the right column), DPO aggressively increases the gradient weights, leading to stronger learning from noisy samples; whereas ROPO learns noisy samples with more conservative gradient weights, rather than blindly trusting the preference labels of the noisy samples. Therefore, DPO is more influenced by the large number of early noisy samples, while ROPO is less affected.

  2. Such a case is unlikely to occur in practice. Suppose we have a dataset containing $N$ samples, where a fraction $\alpha$ are noisy and the remaining $1-\alpha$ are clean. After shuffling the dataset, we are interested in the probability that the first $\beta$ fraction of the samples contains at least $k\beta N$ noisy samples. Note that:

    • The case we are concerned with is when $\beta < \min(\alpha, 1/2)$ and $k \in (1/2, 1)$, as it represents the scenario where noisy samples dominate in the early stages of training.
    • Without loss of generality, we assume $\alpha N$, $\beta N$, and $k\beta N$ are integers for ease of computation.

    We model it using the hypergeometric distribution, which describes the probability of drawing a specific number of "successes" (noisy samples) in a subset of the dataset without replacement. Let $X$ denote the number of noisy samples among the first $\beta N$ samples. Since the dataset is randomly shuffled, $X$ follows a hypergeometric distribution with population size $N$, number of noisy samples $\alpha N$, and sample size $\beta N$. Thus, the probability of interest is:

    $$P(X \ge k \beta N) = \sum_{x=k\beta N}^{\beta N} \frac{\binom{\alpha N}{x} \binom{(1-\alpha)N}{\beta N - x}}{\binom{N}{\beta N}}.$$

    For large $N$, we can approximate the hypergeometric distribution with a normal distribution. The mean and variance are:

    $$\mu = \beta N \cdot \frac{\alpha N}{N} = \alpha \beta N, \quad \sigma^2 = \beta N \cdot \alpha(1-\alpha) \cdot \frac{N - \beta N}{N-1} \approx \alpha(1-\alpha)\beta(1-\beta) N.$$

    Therefore, the probability is approximately:

    $$P(X \ge k \beta N) \approx 1 - \Phi\left( \frac{k \beta N - \mu}{\sigma} \right) = 1 - \Phi \left( \frac{(k-\alpha)\sqrt{\beta N}}{\sqrt{\alpha(1-\alpha)(1-\beta)}} \right),$$

    where $\Phi$ is the CDF of the standard normal distribution. The probability $P(X \ge k \beta N)$ is very small for the following reason. Because $k > \alpha$, the threshold $k\beta N$ exceeds the expected number of noisy samples in the subset ($\mu=\alpha\beta N$). This corresponds to a rare right-tail event in the distribution, where the probability diminishes sharply as the threshold moves further from the mean. The normal approximation quantifies this rarity via the rapidly decaying tail of the Gaussian distribution.

    A numerical verification: The larger $\alpha$ and $\beta$ are and the smaller $k$ is, the larger the probability $P(X \ge k\beta N)$ will be. However, when $\alpha=0.49$, $\beta=0.48$, $k=0.51$, we have $P(X \ge 0.51\cdot 0.48 \cdot N) < 0.004$ for $N \ge 5000$ and $P(X \ge 0.51\cdot 0.48 \cdot N) < 0.00007$ for $N \ge 10000$. Therefore, we can see that the probability is really very small in practice.
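This check can be reproduced with the exact hypergeometric tail; a minimal script (helper name and the rounding of counts to integers are illustrative choices):

```python
from scipy.stats import hypergeom

def prob_noisy_dominated_prefix(N, alpha, beta, k):
    """P(X >= k*beta*N), X ~ Hypergeometric(population N, alpha*N noisy, beta*N drawn)."""
    noisy = round(alpha * N)
    drawn = round(beta * N)
    threshold = round(k * beta * N)
    # sf(x) = P(X > x), hence P(X >= threshold) = sf(threshold - 1)
    return hypergeom(N, noisy, drawn).sf(threshold - 1)

for N in (5000, 10000):
    p = prob_noisy_dominated_prefix(N, alpha=0.49, beta=0.48, k=0.51)
    print(f"N={N}: P(X >= k*beta*N) = {p:.1e}")
# The values should come out below the 0.004 and 0.00007 bounds quoted above.
```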

Review
Rating: 4

This paper tackles the important problem of learning from noisy offline preference data. Motivated by the observation that previous noise-aware preference optimization methods either only partially mitigate the noise problem or require costly invocation of a separate LLM during the training process, the authors propose an iterative noise-aware preference alignment method, RObust Preference Optimization (ROPO). ROPO combines a robust loss, a noisy-sample filtering process, and rejection sampling. On the common preference tuning benchmark datasets UFB, Alpaca, and TL;DR, the authors demonstrate that ROPO consistently outperforms previous methods, establishing a practical method for handling preference noise.

Questions for Authors

  1. How would ROPO perform in scenarios with non-uniform or clustered noise patterns (e.g., where certain types of queries are more prone to noisy preferences)?
  2. Could the robustness-guided rejection sampling be extended to incorporate more diverse negative examples beyond the model's own generations? Would it be beneficial to include negative samples from additional sources?
  3. How does ROPO handle cases where legitimate preferences might appear contradictory due to subjective differences rather than noise? It might be good to include a discussion of the relationship with pluralistic alignment.

Claims and Evidence

The authors claim that noise in preference data is prevalent and properly handling the noisy samples is critical for preference alignment. The experiments designed in this study adequately support this claim and the proposed methods that mitigate the noisy preference data issue led to a clear performance improvement.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are sound and align with the evaluation processes widely adopted by the alignment research community.

Theoretical Claims

I did not check the proof of the theoretical claims.

Experimental Designs or Analyses

I carefully reviewed the experiment design and ablation study. The experiment designs are sound.

Supplementary Material

I did not review the supplementary material.

Relation to Broader Scientific Literature

This paper tackles the important problem of handling noisy preference samples.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The paper is well-written and structured. The theoretical analyses are clearly presented with appropriate mathematical formulations. The experimental methodology is thoroughly explained, though some sections could be more concise.

The technical quality is high, with rigorous proofs and comprehensive experiments. The ablation studies effectively isolate the contributions of different components.

The approach is novel and represents a significant advancement over existing methods. While it builds on DPO, the integration of noise-tolerance, filtering, and rejection sampling is innovative and well-executed.

Other Comments or Suggestions

I don't have additional comments/suggestions beyond those provided in other review sections.

Author Response

Dear Reviewer mD6s,

Thank you for your valuable review. We respond to each comment as follows and sincerely hope that our response can properly address your concerns.

Figures and Tables can be found in mD6s.md in https://anonymous.4open.science/r/ICML25-ROPO-F6CD


Q1: How would ROPO perform in scenarios with non-uniform or clustered noise patterns?

Res: In addition to the artificial uniform noise, our experiments also contain two practical settings that include non-uniform or clustered noise patterns. ROPO still outperforms all baselines under both of the following settings.

  1. (Appendix E.3.1) Practical noise coming from human annotators' trust in larger models over smaller ones. It is common practice to treat the response from a larger model as the preferred one and the response from a smaller model as the dis-preferred one. This leads to non-uniform and clustered noise patterns, as preference noise tends to appear in problems that large models do not handle well but small models handle effectively, rather than being uniformly distributed.
  2. (Appendix E.3.2) Practical noise coming from LLM comparisons. We use Llama3-70B-Instruct to relabel the preferences in UFB, where the original preference label in the UFB dataset comes from GPT-4 rating. Then, we observe that about 30% of the labels are different from the original ones. The noise here is non-uniform and clustered rather than uniform, as differing preference labels only arise in cases where Llama-3-70B-Instruct and GPT-4 fail to reach a consensus.

Q2: Could the robustness-guided rejection sampling be extended to incorporate more diverse negative examples beyond the model's own generations (e.g., samples from additional sources)?

Res: We have added an experiment training Mistral-7B on UFB with 0% and 20% artificial noise, where the rejection sampling phase uses outputs from Llama-2-7B to obtain negative examples. However, as shown in Table mD6s-1 in the anonymous link, we observe a decrease in performance. We speculate on the reasons as follows.

  1. Recent studies [1,2] suggest that on-policy training, where responses are sampled from the model's distribution, generally outperforms off-policy training, where responses are sampled from other distributions. Therefore, the on-policy training paradigm of standard ROPO is naturally superior to the off-policy training paradigm that uses outputs from other models (e.g., Llama-2-7B) as dispreferred responses.
  2. Another line of research [3,4] indicates that the relationship between "the value of preference samples $(x, y_1, y_2)$" and "the reward margin between $y_1$ and $y_2$" remains inconclusive. It is unclear which reward margins are effective for preference alignment. We speculate that this is also one of the reasons why using generations from other models as dispreferred responses is often ineffective: we lack a clear understanding of the distribution and impact of reward margins in this scenario.

Q3: How does ROPO handle cases where legitimate preferences might appear contradictory due to subjective differences rather than noise? It might be good to include a discussion on relationship with pluralistic alignment.

Res: We will include the following discussion on pluralistic alignment in Appendix C.

Human preferences in the real world are often multi-dimensional and vary significantly due to differences in cultural background, education level, age, and region. This diversity in preferences has prompted the study of pluralistic alignment [5,6].

However, defining "noise" in pluralistic alignment is challenging. When there is more than one preference dimension, there is no "gold" latent reward model to rely on, and preference modeling based on the Bradley-Terry model becomes infeasible. Therefore, no ground-truth preference label exists between two responses, so we cannot define "preference noise". To address this challenge, two promising directions can be considered: (1) Inject multiple preference dimensions into the prompt, enabling alignment conditioned on specific preference dimensions [7]. (2) Introduce an additional explanatory text in samples $(x, y_1 \succ y_2)$ to describe in what sense $y_1$ is superior to $y_2$. Although pluralistic alignment is beyond the scope of the paper, we look forward to engaging in interesting discussions with readers on this topic.


[1] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

[2] RLHF Workflow: From Reward Modeling to Online RLHF

[3] Larger or Smaller Reward Margins to Select Preferences for Alignment?

[4] Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning

[5] Group Robust Preference Optimization in Reward-Free RLHF

[6] Aligning to Thousands of Preferences via System Message Generalization

[7] Rewards-in-Context: Multi-Objective Alignment of Foundation Models with Dynamic Preference Adjustment

Review
Rating: 3

LLM alignment has shown great potential for several applications. However, popular techniques such as DPO are highly sensitive to which sample is labeled positive vs. negative, and therefore any noise in the training preference data can significantly impact performance. To alleviate this issue, the paper proposes an optimization framework for selecting noisy samples and then develops an augmented DPO loss function that is noise tolerant and can distinguish the noisy samples. Experimental results on several academic benchmark datasets demonstrate that the proposed ROPO technique can almost always outperform base DPO and its variants.

Questions for Authors

  1. Do you have performance analysis on larger models (except for 7B)?
  2. What is the computational complexity of ROPO?

Claims and Evidence

The claims about noisy data filtering and robust framework for LLM preference optimization have been validated both theoretically and experimentally.

Methods and Evaluation Criteria

Experiments are conducted on 3 well-known academic benchmark datasets and the results were compared against SOTA baseline DPO methods. Evaluation metrics and overall experimental settings make sense to me.

Theoretical Claims

The claim that ROPO is noise-tolerant, as opposed to DPO, has been proved theoretically using Theorems 3.1-3.5.

Experimental Designs or Analyses

Experiments are carefully designed. The performance is tested on the Alpaca, UFB, and TL;DR datasets and compared against DPO, IPO, cDPO, and rDPO. Table 1 demonstrates that ROPO always outperforms the other baselines (albeit by a small margin). More interestingly, the performance of ROPO increases monotonically over the iterations. Ablation studies show the importance of the different steps in the ROPO framework.

Supplementary Material

I have read the supplementary material at a high level and might have missed some of the mathematical proofs in Appendix F.

Relation to Broader Scientific Literature

Preference selection and domain alignment of LLMs is an important problem and will be of broad interest to the community. This work improves the performance of traditional DPO methods and will be of interest to both academic and industry audiences.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Overall, ROPO is an interesting framework for improving performance over DPO. The concept of noisy-data filtering and rejection sampling in DPO is novel. Experimental and theoretical results are also solid. Having said that, there are some concerns:

  1. Experiments are only conducted on two 7B models; therefore, it is not clear how ROPO's performance generalizes to other, bigger models.
  2. ROPO is an iterative method; therefore, in terms of computational complexity and cost, it should be much higher than DPO. It would be good to add a cost comparison between ROPO, IPO, and DPO.
  3. Experimental results demonstrate that the performance gain on two datasets (UFB and Alpaca) is only 1-2%, yet ROPO seems less cost-efficient than DPO. Results on the TL;DR dataset seem very promising, so any discussion or insights on where ROPO shines over DPO and vice versa would be a good addition.

Other Comments or Suggestions

The structure of the paper requires a lot of changes. Algorithm 1 and the related work section should go into the main text.

Author Response

Dear Reviewer smYA,

Thank you for your valuable review. We respond to each comment as follows and sincerely hope that our response can properly address your concerns.

Figures and Tables can be found in smYA.md in https://anonymous.4open.science/r/ICML25-ROPO-F6CD


Other Strengths And Weaknesses

W1: It is not clear how ROPO performance generalizes to models larger than 7B.

Res: In Appendix E.1 of our initial submission, we show the performance of ROPO on Llama-2-13B and Llama-3-70B (trained on UFB and evaluated on AlpacaEval). We will make references to Appendix E.1 in the main text more obvious. For your convenience, we also provide the results in Table smYA-1 in the anonymous link. As can be seen from the table, ROPO significantly outperforms the baselines at the 13B and 70B scales.


W2: It would be good to add the cost comparison between ROPO, IPO, and DPO.

Res: In Appendix A of our initial submission, we provide the analysis of the computational cost for ROPO and non-iterative methods (e.g., DPO, IPO, rDPO, and cDPO). We will make references to Appendix A in the main text more obvious. For your convenience, we quote the important content as follows.

ROPO introduces additional costs for the noisy sample filtering and robustness-guided rejection sampling stages compared with non-iterative methods. We estimate that the cost of ROPO is approximately 1.6 times that of non-iterative methods. The additional costs of ROPO mainly come from forward computations, which are acceptable compared to the training (backward) cost and almost negligible in the entire chain of real-world large-scale LLM training. For details, please refer to Appendix A.


W3: Any discussion or insights on where ROPO shines over DPO (like on TL;DR) and vice versa would be a good addition.

Ans: We speculate that the extent of ROPO's advantage over DPO depends on whether the task is subjective.

  • For tasks that are relatively more objective (such as the TL;DR summarization task), the groundtruth preference ranking labels are usually more deterministic, as the criteria for evaluating the quality of a summary are typically objective and quantifiable, such as whether it contains complete information and whether the numbers and other details are accurate. In such tasks, flipping preference labels can easily provide the model with incorrect information. Therefore, ROPO has greater potential compared to DPO in such tasks.
  • For tasks that are relatively more subjective (such as dialogue generation), it is often difficult to definitively say that one response is better than another for most questions, as the conclusions of preference comparisons can be influenced by factors such as the evaluator's cultural background, education level, age, etc. In such tasks, flipping preference labels does not necessarily introduce "incorrect" information to the model. Therefore, the advantage of ROPO over DPO may not be as significant in these tasks.

Additionally, we would like to share an observation that may help explain why ROPO still demonstrates a significant advantage on TL;DR with 0% artificial noise. According to the estimation in Table 5 of Appendix C, the original TL;DR dataset inherently contains 21.3%-27.0% noise. We also observe that, due to the use of different annotators to label preferences for TL;DR, 5.8% of the posts exhibit "cyclic preferences" among multiple summaries. That is, for a given post $x$, the preference ranking among three summaries is $y_1 \succ y_2$, $y_2 \succ y_3$, $y_3 \succ y_1$, which is evidently a form of noise. In such cases, ROPO naturally outperforms DPO.
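For illustration, such cyclic preferences can be detected per post by building a directed "preferred over" graph among its summaries and checking for cycles; a minimal sketch (names are illustrative, and this is not necessarily the exact procedure used for the 5.8% estimate):

```python
def has_preference_cycle(pairs):
    """pairs: list of (winner, loser) summary ids for one post.
    Returns True if the 'preferred over' relation contains a cycle."""
    graph = {}
    for winner, loser in pairs:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True  # back edge found -> cycle
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and dfs(node) for node in graph)

# Example: y1 > y2, y2 > y3, y3 > y1 forms a cycle.
print(has_preference_cycle([("y1", "y2"), ("y2", "y3"), ("y3", "y1")]))  # True
```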


Other Comments Or Suggestions

C1: Algorithm 1 and Related Work should go into main text.

Res: We will put them in the main text by reorganizing the presentation.


Questions For Authors

Q1: Do you have performance analysis on larger models (except for 7B)?

Res: Please refer to our response to W1.


Q2: What is the computational complexity of ROPO?

Res: Please refer to our response to W2.

Final Decision

LLM alignment techniques such as DPO can be highly sensitive to noisy preference data. To alleviate this problem, the paper proposes ROPO (RObust Preference Optimization), an iterative framework that combines a noise-aware loss, dynamic sample weighting, and rejection sampling to effectively filter out and learn from noisy preferences. Extensive experiments on several benchmarks show that ROPO consistently outperforms DPO and other baselines.

Reviewers initially were concerned about the size of the models (7B), the lack of cost comparison, and some perceived lack of rigorous theoretical foundation, but these concerns have been mostly addressed. During the reviewer-AC discussion, one of the two reviewers who recommended “weak accept” finally acknowledged this paper offers a good contribution. Considering the reviewers’ consensus, I recommend accepting the paper. As discussed with the reviewers, the authors should probably improve the paper to avoid the confusions that had to be sorted out during the discussions.