Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
We propose the first RLHF and DPO algorithms that can simultaneously mitigate corruption, overoptimization, and verbosity in large language model (LLM) alignment, with theoretical generalization error rates.
Abstract
Reviews and Discussion
This paper studies how to fine-tune LLMs to align them with human preference data, aiming to address the challenges of corrupted preferences, reward overoptimization, and verbosity. Inspired by previous works on each of these challenges, this work proposes new algorithms named RLHF-COV and DPO-COV that mitigate all three issues simultaneously. Theoretical results prove that, under suitable assumptions, the DPO-COV algorithm enjoys converging length-regularized generalization error rates when trained on corrupted data. Experimental results further demonstrate the effectiveness of the proposed algorithms.
Strengths
Clarity and Quality:
- The paper is clearly written and the theoretical results are sound.
Originality and Significance:
- The paper is the first to consider a joint solution that mitigates data corruption, reward over-optimization, and verbosity simultaneously while being theoretically supported.
- New technical methods to handle corrupted data in the RLHF setup.
- The experiments show the competitiveness of the proposed algorithm in aligning LLMs to some extent.
Weaknesses
From the reviewer's perspective, although the joint handling of these challenges is new in the RLHF theory literature, the main idea of the algorithm design, namely the value-guided method to handle over-optimization and encourage exploration (see [1, 2, 3, 4, 5]), is not novel given existing works, apart from the new components for handling corrupted data and verbosity; this weakens the theoretical contributions of the paper. For this reason, the earlier works [1, 2] in this line of standard RL also need to be mentioned in the paper for the completeness of the literature review.
Another weakness is that the experiments are limited and do not quite demonstrate the overall effectiveness of the proposed algorithm; e.g., see Question 3 below.
References:
[1] Xie T, Cheng C A, Jiang N, et al. Bellman-consistent pessimism for offline reinforcement learning[J]. Advances in neural information processing systems, 2021, 34: 6683-6694.
[2] Liu Z, Lu M, Xiong W, et al. Maximize to explore: One objective function fusing estimation, planning, and exploration[J]. Advances in Neural Information Processing Systems, 2023, 36.
[3] Liu Z, Lu M, Zhang S, et al. Provably mitigating overoptimization in rlhf: Your sft loss is implicitly an adversarial regularizer[J]. arXiv preprint arXiv:2405.16436, 2024.
[4] Cen S, Mei J, Goshvadi K, et al. Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF[J]. arXiv preprint arXiv:2405.19320, 2024.
[5] Xie T, Foster D J, Krishnamurthy A, et al. Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF[J]. arXiv preprint arXiv:2405.21046, 2024.
Questions
- In Theorem 1, the choice of the parameter depends on the partial coverage coefficient, which seems impractical. How does the learner know the exact value of the partial coverage coefficient? Even though in practice the parameter is typically chosen via empirical search, this theoretical choice still does not make sense to me.
- Beyond the experiments in the paper, how can we tell whether the proposed algorithm indeed handles the challenges of corrupted data, reward over-optimization, and verbosity? In other words, is there an additional experiment supporting that the superior performance of the new algorithm is exactly brought about by the mitigation of these issues? For example, for reward over-optimization, can the authors show that, compared with an algorithm that does not consider this explicitly, the "reward" is indeed not over-optimized? Similar questions apply to corruption and verbosity.
- What about the capabilities of the LLMs trained by DPO-COV in complex reasoning tasks, e.g., math and coding?
- Typos: inconsistent notation for the partial coverage coefficient, see Section 3.3.
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly as shown in red. Please let us know if further clarifications are needed.
Weaknesses: The earlier works [1, 2] in this line of standard RL also need to be mentioned in the paper for the completeness of the literature review.
A: Thanks for your suggestion. We have added [1,2] (respectively as examples of pessimistic offline RL and optimistic online RL) to the end of the "Overoptimization" part in the introduction, as shown in red.
Q1: In Theorem 1, the choice of the parameter depends on the partial coverage coefficient, which seems impractical. How does the learner know the exact value of the partial coverage coefficient? Even though in practice the parameter is typically chosen via empirical search, this theoretical choice still does not make sense to me.
A: Thanks for your suggestion. In the revised Theorem 1, we have removed the dependence on the partial coverage coefficient from the stepsize and changed the error bound accordingly.
Q2: Beyond the experiments in the paper, how can we tell whether the proposed algorithm indeed handles the challenges of corrupted data, reward over-optimization, and verbosity? In other words, is there an additional experiment supporting that the superior performance of the new algorithm is exactly brought about by the mitigation of these issues? For example, for reward over-optimization, can the authors show that, compared with an algorithm that does not consider this explicitly, the "reward" is indeed not over-optimized? Similar questions apply to corruption and verbosity.
A: Thank you for your suggestion. We are working on these additional experiments.
Q3: What about the capabilities of the LLMs trained by DPO-COV in complex reasoning tasks, e.g., math and coding?
A: Thank you for your suggestion. We compare our DPO-COV algorithm with other DPO variants on the Grade School Math 8K (GSM8K) and AI2 Reasoning Challenge (ARC) tasks by following [1] and adapting the code from [2]. The accuracy results below indicate that our DPO-COV algorithm also outperforms the other variants on these complex reasoning tasks. We will add these experiments to the revised paper.
[1] Liu, Z., et al. (2024). Provably mitigating overoptimization in RLHF: Your SFT loss is implicitly an adversarial regularizer. arXiv:2405.16436.
[2] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Shengyi Huang, Kashif Rasul, Alexander M. Rush, and Thomas Wolf. The alignment handbook. https://github.com/huggingface/alignment-handbook, 2023.
| Model | GSM8K | ARC (Easy) | ARC (Challenge) |
|---|---|---|---|
| Our DPO-COV | 46.78 | 72.52 | 49.32 |
| Robust DPO | 46.25 | 72.14 | 47.35 |
| Pessimistic DPO | 45.19 | 72.14 | 46.16 |
| Length-regularized DPO | 44.50 | 72.31 | 46.16 |
| Vanilla DPO | 45.26 | 71.89 | 46.50 |
| Reference Model | 42.38 | 71.72 | 45.14 |
Q4: Typos: inconsistent notation for the partial coverage coefficient, see Section 3.3.
A: Thanks for pointing that out. In the revised paper, we have made the notation for the partial coverage coefficient consistent throughout.
Thank you for your response to my questions and concerns. The new experiment seems promising that the proposed algorithm has positive effects for reasoning tasks. I have raised my score to 6.
Dear Reviewer RwZ8,
We thank you very much for reviewing our new experimental results and raising your score. We are conducting experiments investigating our algorithm's ability to tackle corrupted data, reward over-optimization, and verbosity, respectively, as suggested by your Q2.
Authors
This paper studies how to deal with corrupted preferences, overoptimization, and verbosity in RLHF and DPO. The authors propose the RLHF-COV and DPO-COV algorithms, which address these issues simultaneously in both offline and online settings. Their approach combines existing techniques, and they provide theoretical guarantees for their method. Experimental results demonstrate the efficacy of their approach compared to methods that only deal with a subset of the three issues.
Strengths
The literature review is detailed and the presentation is clear.
Weaknesses
The novelty of this paper is very limited and not sufficient for ICLR.
- Algorithmically, the proposed algorithms are a simple combination of existing works. The anti-corruption component comes from (Bukharin et al., 2024), the anti-overoptimization component from (Liu et al., 2024c), and the anti-verbosity component from (Park et al., 2024). The online version adds (Bukharin et al., 2024) and (Park et al., 2024) on top of (Xie et al., 2024). I see no obvious novelty in the algorithm design.
- Theoretically, the contribution is also relatively small. Most of the proof utilizes the techniques in (Bukharin et al., 2024), (Liu et al., 2024c), and (Xie et al., 2024). The authors claim that the novelty lies in the order in which the noise terms are analyzed, which does not seem significant to me.
Questions
In Theorem 1, it should be rather than ?
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly as shown in red. Please let us know if further clarifications are needed.
Q1: There is no obvious novelty in the algorithm design and theorems.
A: Our combined algorithm can, for the first time, solve the corruption, overoptimization, and verbosity issues simultaneously within one simple implementation, which is important in practice.
Q2: In Theorem 1, it should be rather than ?
A: Thanks for pointing that out. We have corrected this throughout the revised paper.
Thank you for the response.
Dear Reviewer 25JV,
Do you have any more questions we need to address, or have we addressed all of your questions? Thanks.
Authors
Thank you for your feedback! I already acknowledged in the review that you address these three issues simultaneously. My opinion is that the approach to each of them has already been proposed in the literature and that the way you combine them is trivial. Could the authors point out more specific novelty that I missed in the review, such as how these three issues interact with each other and why combining them is difficult? Otherwise, I would consider this work a simple A+B+C.
Thanks for raising this question. Upon further investigation, we found the following additional novelties.
In the algorithm design, note that the length-penalized reward is applied only to the relative value function, not to the log-likelihood function.
In the theoretical analysis of the offline algorithm (i.e., Theorem 1), the true and estimated noise terms are analyzed not only in different stages, as mentioned in the paper, but also in different ways. Specifically, in (b) of Eq. (74) at the end of the proof of Theorem 1, the first term is replaced by its equivalent for the estimated noise, while the second term is upper bounded for the true noise.
Thank you for your feedback! In my opinion these novelties are relatively minor. For the algorithmic design, the offline DPO version you proposed in Eq. (24) is exactly the same as the offline DPO algorithm in (Park et al., 2024, Eq. 9) without the anti-corruption and anti-overoptimization terms. Therefore, I don't quite agree that not using the length-regularized reward in the log-likelihood loss is a novelty. For the theoretical analysis, the techniques for analyzing the noise terms have been proposed in the literature, and here it seems that you only applied them to the estimated and true noise terms, respectively. That said, I appreciate the authors' responses and have thus decided to increase my score. There is no 4 in the evaluation system, so I will select 5, but my true score is around 4.
Dear Reviewer,
Please engage with the authors in a non-dismissive way. They spent time writing a rebuttal. Can you please go through their arguments and respond to them instead of just saying "yes".
Thanks!
Thank you very much for reevaluating our work and raising your rating.
You state that our techniques for analyzing the true and estimated noise terms have already been proposed in the literature. Could you tell us which paper uses the same techniques?
To our knowledge, such noise terms in RLHF/DPO algorithm design are only used in [1]. Their theoretical analysis and result are different from ours. Specifically, as to the analysis technique, right below Eq. (A.1) of [1], they apply a second-order Taylor expansion (translated to our notation) followed by a strong convexity argument to obtain linear and quadratic terms in the reward estimation error and the noise estimation error, and they compare the estimated noise with the true noise directly. In contrast, we compare the estimated and true noise terms with zero noise, respectively, and do not use strong convexity. As to the theoretical result, [1] only obtains an upper bound on the reward estimation error, while we obtain an upper bound on the optimization error of the policy we finally aim to learn.
Sorry for the misunderstanding. I mean that the way you bound the estimation error in your analysis (Lemma 8) basically follows the proof of Lemma B.1 in (Liu et al., 2024c), and the techniques are almost the same. The only difference is that you now have the additional noise terms, and you use two inequalities (Lemmas 4 and 5) to get rid of them. I agree that these two inequalities are new, but I think they are only minor changes to Lemma B.1 in (Liu et al., 2024c). That's why I decided to raise the score to 4. I would suggest the authors mention the similarity between their techniques and those of (Liu et al., 2024c), and emphasize that Lemmas 4 and 5 are novel.
Thanks for your great suggestion.
We will mention the similarity between our techniques and those of (Liu et al., 2024c), and emphasize that Lemmas 4 and 5 are novel.
Authors
This paper combines several techniques for addressing preference corruption, reward overoptimization, and the preference for verbosity into one objective, and shows the effectiveness of the objective both theoretically and empirically.
Strengths
The paper is sound in its theory part. The derivations of the proposed objective, along with the mathematical proofs, are sound. The objective is novel to the best of my knowledge.
Weaknesses
The objective itself, although reduced from a minimax problem to a single minimization problem, has many hyperparameters. The authors mention that they used a grid search over three of the four hyperparameters in their experiment section. This is concerning due to the uncertainty and cost behind the hyperparameter search. Moreover, comparing all the methods against one much more powerful model (GPT-4 in this case) on just one test set (AlpacaEval 2.0) is not very convincing evidence of the efficacy of the proposed objective. I would suggest the authors add head-to-head win rates between pairs of the compared algorithms on more test sets.
Questions
How fine should we set the grids and how sensitive are the results to the hyperparameters?
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly as shown in red. Please let us know if further clarifications are needed.
Q: How fine should we set the grids and how sensitive are the results to the hyperparameters?
A: Great question. We first perform a coarse-grained search over grids for the three hyperparameters. Then we perform a fine-grained search around the best result obtained from the coarse-grained search.
The algorithm is sensitive to the hyperparameters in the coarse-grained search stage but is quite stable in the fine-grained stage. In general, two of the searched hyperparameters should be chosen to be small, and the third should be smaller than but close to 1.
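For concreteness, here is a minimal Python sketch of such a two-stage search (the `evaluate` callback, hyperparameter names, grid values, and refinement rule are illustrative placeholders, not our actual training pipeline):

```python
import itertools

def coarse_then_fine_search(evaluate, coarse_grids, refine_factor=2.0):
    """Two-stage grid search: pick the best configuration on a coarse grid,
    then re-search a finer grid centered on that configuration.

    `evaluate` maps a dict of hyperparameters to a validation score
    (higher is better).
    """
    def best_on(grids):
        best_score, best_cfg = float("-inf"), None
        for values in itertools.product(*grids.values()):
            cfg = dict(zip(grids.keys(), values))
            score = evaluate(cfg)
            if score > best_score:
                best_score, best_cfg = score, cfg
        return best_cfg, best_score

    # Stage 1: coarse-grained search over the tuned hyperparameters.
    coarse_best, _ = best_on(coarse_grids)

    # Stage 2: fine-grained search on a small grid around the coarse optimum.
    fine_grids = {
        name: [coarse_best[name] / refine_factor,
               coarse_best[name],
               coarse_best[name] * refine_factor]
        for name in coarse_grids
    }
    return best_on(fine_grids)

# Hypothetical usage with placeholder hyperparameter names and grids:
# best_cfg, best_score = coarse_then_fine_search(
#     evaluate=lambda cfg: run_training_and_eval(**cfg),  # user-supplied
#     coarse_grids={"lambda_corruption": [0.01, 0.1, 1.0],
#                   "eta_pessimism": [0.01, 0.1, 1.0],
#                   "alpha_length": [0.5, 0.9, 0.99]},
# )
```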
This paper proposes designs to mitigate corrupted preferences, reward overoptimization, and the bias towards verbosity observed in RLHF and DPO. The authors use a noise model to mitigate corruption, pessimistic and optimistic regularization to mitigate overoptimization, and length regularization to mitigate verbosity. They show a generalization error guarantee for the proposed methods, and present experiments on AlpacaEval 2.0 to support them.
Strengths
Tackling the corruption, overoptimization, and verbosity issues of LLM preference alignment at the same time has not been achieved before, and this work makes a reasonable contribution.
The paper is well written and easy to understand.
Weaknesses
The experimental results are not satisfactory.
Questions
Using Llama-3-8B-Instruct, it seems the current win rate can be 20%+ (see Table 1 of https://arxiv.org/pdf/2405.00675). The authors report a ~7% win rate using Llama-3-8B. Could you explain the reason here?
Thank you very much for reviewing our manuscript and providing valuable feedback. Below is a response to the review questions/comments. We have revised the manuscript accordingly as shown in red. Please let us know if further clarifications are needed.
Q: Using Llama-3-8B-Instruct, it seems the current win rate can be 20%+ (see Table 1 of [1]). The authors report a ~7% win rate using Llama-3-8B. Could you explain the reason here?
[1] https://arxiv.org/pdf/2405.00675
A: Great question. The results are not directly comparable for two reasons. First, we apply LoRA to compress the policy update (the change of the large model) to save time, while [1] did not. Second, Table 1 of [1] and our experiments for online algorithms use different models for preference labeling and for the initial policy. Specifically, we use Llama-3-8B only to generate the preference labels and the zephyr-7b-gemma-sft-v0.1 model to initialize the policy. In contrast, Table 1 of [1] uses GPT-4 to generate the labels and Llama-3-8B-Instruct to initialize the policy.
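As a rough illustration of what this LoRA setup looks like (the rank, alpha, dropout, target modules, and Hugging Face repository path below are assumptions for the sketch, not our exact configuration):

```python
# A minimal sketch of LoRA fine-tuning with the `peft` library; the rank,
# alpha, dropout, target modules, and repository path are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# SFT model used to initialize the policy (repository path assumed).
base = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-gemma-sft-v0.1")

lora_config = LoraConfig(
    r=16,                     # low-rank dimension of the adapter matrices
    lora_alpha=32,            # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

# Only the low-rank adapters are trained; the base weights stay frozen,
# so the trainable parameter count (and training cost) drops substantially.
policy = get_peft_model(base, lora_config)
policy.print_trainable_parameters()
```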
This work proposes two methods to tackle several problems in RLHF at once. These issues include corrupted preferences, reward overoptimization, and verbosity bias. Because the proposed methods cover both online and offline settings, the authors are able to show both theoretical and experimental results. The reviewing process raised some concerns about the effectiveness and soundness of the results, in particular the high number of hyperparameters required, the fact that the algorithms appear to be a mere combination of existing algorithms from related works, and the limited nature of the experiments. Hence, I cannot recommend acceptance.
Additional Comments on Reviewer Discussion
After the reviewing process, concerns remained such as the incremental nature of the results, the limited nature of the experiments, and how the methodological innovations seem to be the result of combining several existing methods into one.
Reject