PaperHub
Overall rating: 4.0/10 (Rejected; 4 reviewers; min 3, max 5, std 0.7)
Individual ratings: 4, 4, 5, 3
Confidence: 3.8
Correctness: 2.5 · Contribution: 2.0 · Presentation: 3.0
NeurIPS 2024

Direct Language Model Alignment from Online AI Feedback

Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

This work improves the alignment of LLMs by directly optimising models with preferences from online AI feedback.


Keywords
LLM Alignment; AI Feedback; On-policy Learning; Online Feedback

Reviews and Discussion

Review
Rating: 4

This paper proposes OAIF, an online method to align language models with human preferences, where feedback from language models serves as a surrogate for human feedback. The key to OAIF is to use online-generated preference pairs throughout the training process. Experimental results show that, by switching from an offline preference dataset to an online dataset labeled by other language models, the generated responses become more aligned with human preferences.
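
As a purely illustrative sketch of the online loop this summary describes (names such as policy, llm_annotator, and dap_update are placeholders, not the authors' implementation):

```python
# Minimal, illustrative sketch of one OAIF training step as summarized above.
# `policy`, `llm_annotator`, and `dap_update` are placeholders, not the authors' code.

def oaif_step(policy, llm_annotator, dap_update, prompt):
    # Sample two responses on-policy from the current model.
    y1 = policy.generate(prompt)
    y2 = policy.generate(prompt)

    # Online AI feedback: the LLM annotator labels the freshly sampled pair.
    y_w, y_l = (y1, y2) if llm_annotator.prefers_first(prompt, y1, y2) else (y2, y1)

    # One direct-alignment update (e.g., a DPO gradient step) on the fresh pair.
    dap_update(policy, prompt, chosen=y_w, rejected=y_l)
```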

Strengths

The strengths of the paper are listed below:

  1. This paper introduces OAIF, which is characterized by on-the-fly generated preference pairs and AI-provided labels.
  2. The authors conducted experiments on various direct alignment methods, and the results support their claims.

Weaknesses

My questions and concerns are listed as follows:

  1. My first concern is regarding the novelty of the paper. It seems that the language model annotator is essentially a preference model; therefore, OAIF can be seen as an online direct alignment algorithm with access to a preference model. The authors mention several previous works with on-policy generation and online feedback that require a reward model. How is OAIF different from these methods if we simply plug in the language model annotator as the reward model?
  2. At line 118 the authors pointed out that the RM might suffer from distribution shift because the training data of the RM might not share the same distribution as $\pi_\theta$. However, it seems to me that using a language model as the preference annotator cannot bypass this problem, since the language model's pretraining corpus, or the fine-tuning corpus related to preference labeling, does not necessarily share the same distribution as $\pi_\theta$ either.
  3. How does OAIF's performance compare to other online methods like RSO and iterative DPO? I think these methods should also be included as baselines, since their reward model can also be replaced by an AI annotator.

Questions

See the Weaknesses section.

Limitations

The limitations are discussed by the authors.

Author Response

We reply to each of the reviewer's concerns below.

My first concern is regarding the novelty of the paper. It seems that the language model annotator is essentially a preference model; therefore, OAIF can be seen as an online direct alignment algorithm with access to a preference model. The authors mention several previous works with on-policy generation and online feedback that require a reward model. How is OAIF different from these methods if we simply plug in the language model annotator as the reward model?

Please note that our work is novel in the following aspects:

  • We are the first to propose online AI feedback to address the offline and distribution-shift issues of DAP methods.
  • We are the first to demonstrate the adaptability of alignment via prompting the AI feedback model (Section 4.6).
  • We are the first to show the feasibility of weak-to-strong alignment via OAIF (Section 4.7).

At line 118 the authors pointed out that the RM might suffer from distribution shift because the training data of the RM might not share the same distribution as $\pi_\theta$. However, it seems to me that using a language model as the preference annotator cannot bypass this problem, since the language model's pretraining corpus, or the fine-tuning corpus related to preference labeling, does not necessarily share the same distribution as $\pi_\theta$ either.

This is a great question! We assume that large-scale pretraining/fine-tuning and model scaling equip the LLM with strong generalization, allowing it to generalize over responses from different policy models when used as an AI annotator. This is evidenced by the RewardBench results (https://huggingface.co/spaces/allenai/reward-bench; see the Gemini 1.5 Pro performance based on zero-shot prompting), where LLMs achieve comparable or even better performance than dedicated reward models through prompting alone.
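
As a rough illustration of zero-shot preference annotation via prompting (the exact prompt wording used in the paper and on RewardBench may differ; call_llm is a placeholder for any text-generation API):

```python
# Illustrative zero-shot annotation prompt; the actual wording used in the paper
# may differ. `call_llm` stands in for any text-generation API call.

ANNOTATION_TEMPLATE = """You are given a user prompt and two candidate responses.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with a single letter, A or B."""

def llm_preference(call_llm, prompt, response_a, response_b):
    """Return 0 if the annotator prefers response A, else 1."""
    answer = call_llm(ANNOTATION_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    return 0 if answer.strip().upper().startswith("A") else 1
```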

How does OAIF's performance compare to other online methods like RSO and iterative DPO? I think these methods should also be included as baselines, since their reward model can also be replaced by an AI annotator.

Please note that OAIF is orthogonal to RSO, as RSO develops a sampling strategy rather than preference labeling, and we consider iterative DPO to be concurrent work.

Review
Rating: 4

This work extends offline preference learning methods, i.e., DPO, to an online variant by using an LLM as an annotator to collect new data for further preference learning. The results show that the online direct alignment from preferences (DAP) methods achieve win rates above 60% over the offline methods.

Strengths

  1. The paper is well written and easy to follow.
  2. The online variant demonstrates significant performance improvements over offline DAP and RLHF methods through comprehensive evaluations.

Weaknesses

  1. The improvement from going online is to be expected, as it introduces more data and a larger training budget.
  2. The contribution is limited. The only difference from prior methods (iterative DPO) is substituting LLMs for the reward model, though I agree that an explicitly static reward model may introduce the distributional-shift problem.
  3. Some figures or comparisons are not entirely fair. (a) Table 1 glosses over a limitation of this method: the feedback comes from an LLM, which is itself another variant of a "reward model". (b) In Figure 3, the training step is not an appropriate x-axis, as the online DPO variant has been heavily fine-tuned offline.
  4. There are no theoretical foundations or new plausible explanations, aside from more data and the online budget, for the further improvement of the online DPO variant.

Questions

n/a

Limitations

see Weaknesses

Author Response

We reply to each of the reviewer's concerns individually below.

The improvement from going online is to be expected, as it introduces more data and a larger training budget.

We respectfully disagree with this argument, as it is unclear how much improvement would be considered "expected". We showed empirical results, from both human and LLM evaluation, that OAIF performs better than several strong baselines, which we believe is solid evidence of OAIF's superiority.

The contribution is limited. The only difference from prior methods (iterative DPO) is substituting LLMs for the reward model, though I agree that an explicitly static reward model may introduce the distributional-shift problem.

We consider the distributional-shift problem, an overlooked problem in the literature, to be a major bottleneck when applying DAP methods, and we consider iterative DPO to be concurrent work.

Besides, we would like to remind the reviewer that online learning with AI feedback is just one of our contributions. The high controllability of OAIF via prompting the LLM annotator (as shown by the length-control experiments in Section 4.6) and the feasibility of weak-to-strong alignment (Section 4.7) are, we believe, solid contributions to the community.

Some figures or comparisons are not entirely fair. (a) Table 1 glosses over a limitation of this method: the feedback comes from an LLM, which is itself another variant of a "reward model". (b) In Figure 3, the training step is not an appropriate x-axis, as the online DPO variant has been heavily fine-tuned offline.

As stated in your comments, "an explicitly static reward model may introduce the distributional-shift problem", which highlights the significance of leveraging feedback from large off-the-shelf LLMs that are not specialized to generations from a single policy.

We do not fully understand question (b): we perform OAIF on SFT-ed models rather than on offline DAP models. Could the reviewer clarify their point?

There are no theoretical foundations or new plausible explanations, aside from more data and the online budget, for the further improvement of the online DPO variant.

First, we would like to highlight that our work strictly follows the theoretical foundations of the corresponding DAP methods, since online learning itself does not break those foundations.

Secondly, we consider our work as an empirical study. We believe our empirical evidence is solid enough to support all our claims.

Comment

Thanks for the rebuttal! However, leveraging feedback from LLMs rather than reward models to address the model distributional-shift problem is not an ideal approach, and it is not validated by additional experiments or proofs beyond the final performance, since it introduces other knowledge such as expert information. I strongly suggest that the authors provide additional experimental validation, if this is the preferred method, to support the above claim.

I also have the following two comments:

  1. Iterative DPO should not be considered concurrent work. Taken from the NeurIPS FAQ: "Papers appearing less than two months before the submission deadline are generally considered concurrent to NeurIPS submissions. Authors are not expected to compare to work that appeared only a month or two before the deadline."
  2. (b) Figure 3: the training step is not an appropriate x-axis if the online DPO variant has been fine-tuned offline. I apologize for the misunderstanding; I thought the online variant continued training from the offline policy.

Comment

We did not follow the reviewer's point about the expert-information issue. It seems that the reviewer has a significant misunderstanding about the signal from our LLM annotators. We would like to clarify that the feedback from the LLM annotators is the same kind of preference signal as that obtained from reward models, and there is no expert information involved. Could the reviewer further clarify the question here?

We believe there are also misunderstandings regarding the distributional-shift problem in our study. In DAP methods, this problem results from the use of pre-collected offline preference data (sometimes from other policy models), rather than from the use of reward models. The significant log-likelihood gap between online and offline responses in Figure 3 clearly illustrates this problem.

We consider iterative DPO and our work to be concurrent, as both were finished around the end of last year. Note that this is a resubmission from ICML.

Review
Rating: 5

This paper applies direct alignment from preferences (DAP) methods, particularly DPO, to online settings where responses are sampled in an on-policy manner and feedback is provided by an LLM annotator in real time. Extensive experiments demonstrate the effectiveness of these simple ideas.

Strengths

The paper is well written, with detailed explanations of the introduced definitions and discussions of existing methods.

The experiments are well-designed, supporting the main idea of the paper. The proposed prompt-controllable approach is particularly commendable.

Weaknesses

The rationale for why on-policy learning brings performance gains is not well clarified. The cited reference [1] does not provide strong support for this claim. There is no experimental evidence that on-policy sampling encourages exploration.

Most experiments are conducted with the closed-source LLM PaLM; evaluating state-of-the-art open-source LLMs would enhance generalizability.

It is unclear how much of the performance gains are due to on-policy sampling versus online feedback.

The reasons why utilizing online on-policy data can avoid overfitting and improve performance should be further analyzed and discussed.

References: [1] Lambert, N., Wulfmeier, M., Whitney, W., Byravan, A., Bloesch, M., Dasagi, V., Hertweck, T., and Riedmiller, M. The challenges of exploration for offline reinforcement learning. arXiv preprint arXiv:2201.11861, 2022.

Questions

Is it correct to categorize RSO and iterative DPO as on-policy generation in Table 1?

What new opportunities and challenges arise when applying DAP to online settings? Did you encounter common issues of DAP methods, such as overfitting, in the online setting? What are the differences in these issues between online and offline settings?

Where are the experimental results to support the superiority of using LLMs over RMs to provide online feedback in Line 267?

Limitations

The computational overhead introduced by on-policy sampling and online feedback is not discussed.

Author Response

Thanks for recognizing our contribution to addressing the off-policy sampling issue via the proposed online AI feedback (OAIF) method. We address each concern below.

The rationale for why on-policy learning brings performance gains is not well clarified. The cited reference [1] does not provide strong support for this claim. There is no experimental evidence that on-policy sampling encourages exploration.

Our empirical analysis in Figure 8 (Appendix) demonstrates a distributional shift for off-policy responses: off-policy responses receive significantly lower generation probability than on-policy responses. Intuitively, using off-policy responses forces the model to adapt to these examples in addition to learning the preferences, which increases the learning difficulty.
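
As a purely illustrative sketch (not the authors' analysis code), such a gap could be measured by comparing the summed log-probability the current policy assigns to on-policy versus offline responses, assuming a Hugging Face-style causal LM and simplified tokenization:

```python
# Illustrative only: summed log-probability a policy assigns to a response,
# so on-policy and offline responses can be compared as in Figure 8.
# Assumes a Hugging Face-style causal LM/tokenizer; tokenization of the
# concatenated prompt+response is simplified here.
import torch
import torch.nn.functional as F

def response_logprob(model, tokenizer, prompt, response):
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                      # [1, T, vocab]
    logp = F.log_softmax(logits[:, :-1], dim=-1)        # position t predicts token t+1
    token_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logp[:, prompt_len - 1:].sum().item()  # response tokens only

# The claim above predicts response_logprob(policy, tok, x, y_on_policy)
# to be much larger than response_logprob(policy, tok, x, y_offline).
```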

Most experiments are conducted with the closed-source LLM PaLM; evaluating state-of-the-art open-source LLMs would enhance generalizability.

Please note that PaLM is available for use and can be fine-tuned on Vertex. We are also aware of a reproduction of this work by HuggingFace, which supports our conclusions and will be announced soon.

It is unclear how much of the performance gains are due to on-policy sampling versus online feedback.

We included a comparison between on-policy sampling (with a fixed reward model) and OAIF in lines 265-266 of Section 4.4, which shows that OAIF obtains significantly better performance.

The reasons why utilizing online on-policy data can avoid overfitting and improve performance should be further analyzed and discussed.

Thanks for this suggestion! We believe that online feedback on dynamically sampled on-policy responses offers timely learning signals, which helps avoid overfitting to some extent. Please also note that OAIF allows high and straightforward controllability of the alignment by modifying the prompt, as demonstrated by the length-control experiments (Section 4.6), and enables weak-to-strong generalization (Section 4.7).
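
As an illustration of the kind of prompt modification involved (the exact control instruction used in Section 4.6 may differ from this hypothetical wording):

```python
# Hypothetical wording; the actual control instruction in Section 4.6 may differ.
BASE_QUESTION = "Which response is better? Answer with a single letter, A or B."
LENGTH_CONTROLLED_QUESTION = (
    "Which response is better? When both are equally helpful, prefer the shorter "
    "one. Answer with a single letter, A or B."
)
```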

Is it correct to categorize RSO and iterative DPO as on-policy generation in Table 1?

Yes, both RSO and iterative DPO use on-policy sampling, i.e. the responses are sampled from the current policy model during training.

What new opportunities and challenges arise when applying DAP to online settings? Did you encounter common issues of DAP methods, such as overfitting, in the online setting? What are the differences in these issues between online and offline settings?

These are great questions! We believe online learning is the right way to unlock the potential of DAP algorithms, but how to select responses and collect preference data from the current policy model and the feedback model during online learning can be challenging.

Online DAP, such as OAIF, still suffers from overfitting, though the behavior and biases differ. For example, offline DPO often shows clear overfitting patterns with decreasing win rates against the SFT baseline; OAIF, in contrast, often produces much longer outputs, which we consider a form of overfitting. However, this overfitting can be adjusted by modifying the AI feedback prompt.

Where are the experimental results to support the superiority of using LLMs over RMs to provide online feedback in Line 267?

We performed automatic evaluation with Gemini Pro and showed that OAIF outperforms LLM+RM with a win rate above 70%.

Comment

Thanks for your response. I would keep my previous evaluation.

Review
Rating: 3

The paper presents a new method called Online AI Feedback (OAIF) for direct alignment from preferences (DAP) that addresses the limitations of existing DAP methods, which rely on static, offline feedback datasets. By using an LLM as an online annotator to provide real-time feedback during each training iteration, OAIF ensures the alignment process remains on-policy and adapts dynamically to the evolving model. Through human evaluations across various tasks, the authors demonstrate that OAIF outperforms traditional offline DAP and reinforcement learning from human feedback (RLHF) methods.

Strengths

OAIF uses LLMs for preference annotation, eliminating the need for a separate reward model and large datasets typically required for RLHF methods. It introduces a new way to address off-policy issues in policy optimization, a significant problem in traditional DPO methods.

The paper is well-written and easy to understand. OAIF outperforms offline DPO and other offline RLHF methods.

Weaknesses

  1. The idea is straightforward but lacks theoretical proof. The proposed method combines DPO and AI feedback, unlike the Constitutional AI paper, which integrates PPO with AI feedback. However, this point is minor. Given the abundance of concurrent work [1-7], the authors should further develop the theoretical analysis of their approach to strengthen their method.

  2. Different methods should use an equal amount of training data. In the second epoch of online DPO, although the prompts remain the same as in the first epoch, the responses and ranking information differ due to online generation.

  3. Recent results on RewardBench indicate that small reward models are more effective than LLM critiques. The iterative DPO methods are similar to OAIF DPO. A performance comparison between OAIF and various iterative DPO methods using cheaper reward models, as both address the off-policy issue, is essential and should be included.

[1] Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint.

[2] RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models.

[3] RSO: Statistical Rejection Sampling Improves Preference Optimization.

[4] Some Things Are More Cringe than Others: Preference Optimization with the Pairwise Cringe Loss. arXiv preprint arXiv:2312.16682.

[5] Hoang Tran, Chris Glaze, and Braden Hancock. 2023. Iterative DPO Alignment. Technical report, Snorkel AI.

[6] Self-Rewarding Language Models. arXiv preprint arXiv:2401.10020.

[7] Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. arXiv preprint arXiv:2404.10719.

Questions

N/A

Limitations

N/A

Author Response

We respond to the questions you have as follows.

The idea is straightforward but lacks theoretical proof. The proposed method combines DPO and AI feedback, unlike the Constitutional AI paper, which integrates PPO with AI feedback. However, this point is minor. Given the abundance of concurrent work [1-7], the authors should further develop the theoretical analysis of their approach to strengthen their method.

While we appreciate the recognition of our work's simplicity, which we deem valuable for both academia and industry, we would like to highlight that our work identifies the offline issue of DAP methods (not just DPO) and demonstrates the significance of online learning via OAIF with solid experimental evidence. We consider this a convincing contribution to the community rather than a "minor point"; the concurrent works listed further support the significance of the problem investigated in our work.

We are surprised by the argument that our work lacks theoretical proof, since we did not develop new DAP algorithms but instead made their learning online and on-policy. Take DPO as an example: the equivalence between the optimal policy and the optimal reward function is built on online, on-policy learning, yet traditional DPO often adopts preference data collected and fixed ahead of training, sometimes even from other policy models, leading to offline and even off-policy learning. Our work ensures the optimality of these DAP algorithms.
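
For reference, the equivalence in question is the standard DPO one (the notation below follows the original DPO formulation and may differ from the paper's): minimizing

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

fits the implicit reward $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, whose KL-regularized optimal policy is $\pi_\theta$ itself; the point made above is that this correspondence presumes preference pairs that reflect the current policy, which pre-collected offline data does not guarantee.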

Different methods should use an equal amount of training data. In the second epoch of online DPO, although the prompts remain the same as in the first epoch, the responses and ranking information differ due to online generation.

The ability to leverage newly generated responses and new ranking information is exactly the strength of OAIF; this is infeasible for offline DAP methods.

We believe this should be considered as an advantage of OAIF rather than a weakness.

Recent results on RewardBench indicate that small reward models are more effective than LLM critiques. The iterative DPO methods are similar to OAIF DPO. A performance comparison between OAIF and various iterative DPO methods using cheaper reward models, as both address the off-policy issue, is essential and should be included.

We also noticed recent results on RewardBench, particularly that generative LLMs perform on par with or even better than dedicated reward models. This is exciting evidence supporting the reliability of AI feedback and thus the potential of OAIF.

A comparison between a reward model and OAIF is provided in lines 265-266 of Section 4.4. Our experiment showed that the reward model (trained once up front) significantly underperformed OAIF with a generative feedback model.

We also highlight that iterative DPO requires frequently retraining reward models, which not only complicates the training process but also incurs significant cost, making the "cheaper" argument debatable. In contrast, OAIF supports quick reward prototyping, as shown in our length-controllability experiments (Section 4.6), where using a reward model would be non-trivial.

Final Decision

This paper adapts direct alignment from preferences (DAP) methods, particularly DPO, to online settings. Feedback (preference pairs) from language models serves as a surrogate for human feedback and is generated online. Experimental results show that, by switching from an offline preference dataset to an online dataset labeled by other language models, the generated responses become more aligned with human preferences. The paper is well written. However, reviewers have concerns regarding the novelty of the approach and the limited discussion/comparison in the submission. For example, compared to iterative DPO, the main difference of this work seems to be just substituting LLMs for the reward model. There are other related works pointed out by the reviewers. Some reviewers are also concerned about the theoretical foundations of this approach.