Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
KL regularization doesn't guarantee good outcomes in RLHF.
Abstract
Reviews and Discussion
The authors present a study on the reward model (RM) in RLHF. The theoretical results of the paper assume that the reward from the RM consists of a true reward plus a noise term, which is modeled as a random variable. Under this model, the authors show that:
- If the noise terms are heavy-tailed, there exist policies with vanishing KL divergence but infinite proxy reward.
- If the noise terms are IID and light-tailed, then KL divergence suffices to guarantee good outcomes.
The authors also investigate the actual rewards given by real RMs. They find that the rewards on random sequences appear not to be heavy-tailed.
Strengths
- The paper is well written and easy to read.
- The theoretical results are relevant.
Weaknesses
- There is probably some discrepancy between the theory and practice. The results are asymptotic; it would be better if the statements could hold in non-asymptotic regimes.
- The experimental results seem to show that the rewards are not heavy-tailed, and thus the theory seems to suggest that KL divergence should work well. We already know that it does work well from practice, so there is no novel take-home message.
Questions
- The theoretical results are related to asymptotically heavy tails. What would happen if I take my RM and clip the output to be in [-10000, +10000]? I'm guessing in practice nothing would happen, but would the theoretical results be invalid since the tails disappear?
- In practice the output of the RM isn't a random variable; it's deterministic. Can you comment on the validity of the assumption that the rewards are random?
Limitations
na
Thanks for the thoughtful criticism. We considered writing more about how asymptotic results relate to real life, and will do so in the next version of the paper. We additionally hope that this addresses your concerns:
- Asymptotic vs. bounded/clipped: With a bounded RM the asymptotic results would indeed not apply. In practice, the same pattern of rare failures producing most of the reward and resulting in near-zero utility would likely still appear, with sufficiently regularized policies having, say, a 0.1% chance of 10000 reward and otherwise matching the base policy. This could change if error and utility are not independent and the base policy is close to optimal, so that it actually encounters lots of clipped rewards. (See also the next paragraph, and the last bullet point of our rebuttal to SFxG.) Though it would be possible to prove things about the bounded case, we presented the asymptotic results because we thought they were much cleaner.
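To make the "rare failures" intuition concrete with a back-of-the-envelope calculation (the numbers here are hypothetical, chosen only for illustration): suppose the perturbed policy matches the base policy except that it moves probability $\epsilon = 10^{-3}$ onto a rare set of outputs to which the base policy assigns probability $p = 10^{-9}$, and those outputs receive the clipped reward of 10000. Then
$$D_{\mathrm{KL}}(\pi \,\|\, \pi_0) \approx \epsilon \log\frac{\epsilon}{p} = 10^{-3}\log 10^{6} \approx 0.014,$$
while the expected proxy reward rises by roughly $\epsilon \cdot 10000 = 10$: almost all of the reward gain comes from a 0.1% tail event while the KL penalty stays small, and utility on the remaining 99.9% of outputs matches the base policy.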
- Take-home message: We think there is an interesting take-home message. It is true that KL divergence penalties work sufficiently well to control overoptimization in current language models, though overoptimization has been demonstrated in practice by the results of Gao et al. But our theoretical results show that light tails plus independence prevent overoptimization. Therefore, in such settings we conclude that overoptimization is most likely a result of independence being violated, perhaps because textual features that increase error also decrease utility. We have improved Section 7.4, which discusses this, in the latest version.
- Deterministic RM: Our theorems still apply when the reward model is deterministic. The randomness in our theorems comes from the sentences generated by the LLM (which are random) being passed through the deterministic reward model, which results in a joint distribution over utility and error. One of the contributions of our paper is the framing that, to study the final achieved reward, it is sufficient to examine this 2D distribution. Did we understand your question correctly?
We welcome any more comments you might have. Thank you for the review!
This paper analyzes a phenomenon called "catastrophic Goodhart": in RL training, suppose the learned reward function is the sum of the true utility function and some noise, then if (1) the true utility and the noise are independent and light-tailed, maximizing reward with KL regularization can also give high utility, while if (2) the noise is heavy-tailed, then there exists a policy with arbitrarily high reward and arbitrarily small KL penalty, but its true utility is close to the initial policy. Empirical results suggest that open-source reward models could be light-tailed.
Strengths
This paper considers whether KL regularization is enough to handle reward misspecification, which is an important problem. This paper also gives clean answers to this question: if utility and noise are independent and light-tailed, then KL regularization is enough, while if noise is heavy-tailed, we may have very little improvement in utility even with high proxy reward and small KL penalty. I think these results are nice and enhance our understanding of this problem.
Weaknesses
One missing piece is the implicit regularization effects of the optimization algorithms: although there exists a policy with low utility even under a KL penalty, the training algorithm can probably avoid those bad policies and still find a good one. It would be nice to understand whether certain optimization techniques can solve this issue.
Some additional comments:
- The notation for the true utility and the proxy reward is used inconsistently in the paper: the symbol that denotes the true utility on line 34 denotes the proxy reward on line 71, and vice versa.
Questions
I have one question regarding Theorem 4: the final conclusion is that the expected utility can be arbitrarily large, but if the utility function is bounded, I don't understand why its expectation can be arbitrarily large.
I also have a more open-ended question: if we make the proxy reward function bounded (e.g., by clipping), do we have similar results?
Limitations
N/A
Thanks for the positive feedback. To address the weaknesses and questions:
- Implicit regularization: we acknowledge this as a weakness, and have some ideas for how implicit regularization can prevent Goodharting. But there is not really any evidence yet, and we would be excited to see follow-up work like this.
- Typo on line 34: we have fixed this in the latest version. Thank you for pointing it out!
- Question about Theorem 4: Thank you for pointing out that we need to assume unbounded, light-tailed utility; we inadvertently dropped this assumption and will state it explicitly in the final version. For the result to hold, the proxy reward should be unbounded so that the expected reward in the proof can diverge as the regularization coefficient of the KL term goes to zero.
- When the proxy reward is bounded: The optimal policy is Boltzmann rational. So how much utility we get will be determined by whether, for state-actions which are close to optimal, the reward comes from utility or error. We roughly think that if you clip at large enough values, the asymptotic results in this paper will basically transfer. With more aggressive clipping, you tend to get both error and utility components when the reward is optimized, so neither of the scenarios in our theorems holds (neither wholly error in the heavy-tailed case nor arbitrarily high utility in the light-tailed case). It is possible we can characterize how much utility or error the solution will have based on their ratios for the largest rewards, but we'll leave that to future work.
Clipping can also distort the reward (see response to reviewer G3eM), so while it somewhat prevents overoptimization, it also rewards incorrect behavior.
We would welcome any more questions or feedback you have. Thank you for your review!
We wanted to add some information about the bounded proxy reward case:
- A correction: above, we said the optimal policy was Boltzmann rational, but in addition to the exponential term it still includes the base policy's likelihood.
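For concreteness, a standard way to write this (our notation, not necessarily the paper's): with KL coefficient $\beta$, the maximizer of $\mathbb{E}_{\pi}[U] - \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0)$ is the exponentially tilted base policy
$$\pi^*(y) \;\propto\; \pi_0(y)\,\exp\!\big(U(y)/\beta\big),$$
rather than a pure Boltzmann distribution over rewards.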
- The best-of-N results from the comment to ngHC look very similar when reward is clipped to [-100,100], though in that case utility will increase again for very large N, and also when noise is Levy-distributed. When reward is clipped to [-10, 10], we no longer see overoptimization if utility is heavy-tailed.
This paper investigates the effectiveness of using KL divergence for regularization in reinforcement learning from human feedback (RLHF), particularly when dealing with reward functions that have misspecified errors. It introduces the concept of "catastrophic Goodhart," a scenario where policies can achieve extremely high rewards with heavy-tailed errors, but without actual improvement in utility.
Strengths
The paper addresses a critically important problem within the domain of RLHF --- the challenge of dealing with reward misspecification, particularly in scenarios where the error distribution is heavy-tailed. The authors defined the problem as "catastrophic Goodhart" and provided analysis accordingly.
Weaknesses
[writing / presentation] The paper contributes to the field of LLM safety/alignment, yet the writing could be made more engaging and accessible. Improving clarity, enhancing the depth of discussions, and ensuring visual and structural appeal could greatly increase its impact and readability. These changes would not only cater to experts in the field but also to readers who may be new to this area of study.
[Experiments] The experimental evidence could be further bolstered to validate the presented claims comprehensively. Specifically:
- The experiments primarily focus on simulated environments. Including real-world applications could demonstrate the practical implications and limitations of the findings more effectively.
- Exploring a wider variety of reward functions, particularly those commonly used in commercial or industrial settings, could help ascertain the generalizability of the catastrophic Goodhart phenomenon across different contexts.
- Additional experiments comparing KL divergence with other regularization or penalty methods could offer a clearer understanding of its relative efficacy and limitations.
- How about other optimization methods like Best-of-N, where the KL divergence term is not used?
- Will implementation choices like LoRA affect the proposed effect?
Questions
Please see the weaknesses section.
Limitations
Please see the weaknesses section.
Thank you for your review, which gives several intriguing suggestions for experiments, as well as presentation improvements.
As for readability and clarity, we have made various edits in the latest revision of the paper, including streamlining the background section for readers unfamiliar with AI alignment, clarifying the relationship of DMRMDPs to RLHF, and improving the presentation of graphs. Also see several changes we have already made or plan to make, described in the rebuttal to reviewer G3eM.
We agree that further experimental evidence could be valuable, with some caveats.
- Wider variety of reward functions / real-world experiments: We think work here would be valuable, but it could risk being redundant with the wealth of examples of overoptimization in the existing literature (see concluding paragraph).
- Best-of-N experiments: This would complement our theorems 5 and 6 on optimization by conditioning, which describe the best-of-N distribution. We have already done the experiment to show that best-of-N succeeds in a toy light-tailed regime but fails in a toy heavy-tailed regime, and we will include this in the appendix of the final version.
- Additional experiments comparing KL divergence with other regularization or penalty methods: We will add a synthetic experiment with KL divergence, in response to this review and that of reviewer G3eM. The experiment tests whether a reward function with artificially heavy-tailed error causes catastrophic Goodhart in the KL divergence setting. We are not sure which other regularization schemes we should try: the RLHF literature uses KL divergence, and Best-of-N is an alternative proposed precisely to avoid overoptimization. Perhaps an integral probability metric like the Wasserstein distance, or quantilizing optimization, which only goes up to a percentile of the misspecified reward? It would be good if you could clarify which schemes would be important to try here.
- LoRA: We think studying LoRA would be valuable to nail down exactly when inductive bias prevents catastrophic Goodhart, but considering that a search over LoRA hyperparameters and environments (see paragraph 4 of rebuttal to G3eM for why a single environment is as yet insufficient) would increase the compute requirements by orders of magnitude, this would be valuable follow-up work.
As for why we do not plan to implement all experiments suggested, we see this work as primarily providing a theoretical framework for experimental work already done by others, especially Gao et al [1] and specification gaming [2, 3]. The latter gives 60 examples of real-world utility loss due to specification gaming, several of which we think are examples of "catastrophic Goodhart". The link to previous specification gaming work will be clarified in the latest version.
We thank reviewer ngHC for their feedback. We hope that the theorems, existing experiments, and additional experiments listed above (demonstration of catastrophic Goodhart, and best-of-N study) address the given concerns. Please let us know what weaknesses remain and what we can do to address them, and consider revising your review if we have addressed your concerns to your satisfaction. We would welcome any further feedback and comments. Thank you!
[1] Gao, L., Schulman, J., and Hilton, J. Scaling laws for reward model overoptimization. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 10835–10866. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/gao23h.html
I appreciate the authors' detailed response. Can the authors please also provide the results they promised to provide?
The best-of-n experiment results are as follows:
We created a synthetic experiment by letting reward U = X + V, where X and V are independent and sampled from different probability distributions, consistent with our theoretical assumptions. We vary N from 1 to 65536, do 100 trials of taking the best-of-N sample with highest U, and note whether V goes towards 0 (overoptimization) or not.
- Possible distributions for V are normal and t-distribution with df=10.
- Possible distributions for X are normal, t with df=3, t with df=5, lognormal, and Levy. All of these are heavy-tailed except for normal.
- V is scaled to a standard deviation of 2 and X has s.d. of 1 (except for the Levy distribution, which has infinite variance), representing that in ordinary regimes most of the variance comes from utility rather than error.
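For reproducibility, here is a minimal sketch of the setup just described (the exact distribution parameters and the rescaling step are our own assumptions where the description above leaves them open):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Candidate error distributions X; only "normal" is light-tailed.
ERROR_DISTS = {
    "normal": stats.norm(),
    "t_df3": stats.t(df=3),
    "t_df5": stats.t(df=5),
    "lognormal": stats.lognorm(s=1.0),
    "levy": stats.levy(),
}

def scaled_rvs(dist, size, sd):
    """Sample and rescale to a target standard deviation when the variance is finite."""
    samples = dist.rvs(size=size, random_state=rng)
    s = dist.std()
    return samples * (sd / s) if np.isfinite(s) else samples

def best_of_n_utility(n, error="lognormal", trials=100):
    """Mean utility V of the sample with the highest proxy reward U = X + V."""
    picked = []
    for _ in range(trials):
        v = scaled_rvs(stats.t(df=10), n, sd=2.0)      # utility, s.d. ~2
        x = scaled_rvs(ERROR_DISTS[error], n, sd=1.0)  # error, s.d. ~1
        u = x + v                                      # proxy reward
        picked.append(v[np.argmax(u)])                 # utility of the selected sample
    return float(np.mean(picked))

for n in [1, 64, 1024, 16384, 65536]:
    print(n, round(best_of_n_utility(n), 2))
```

We rescale by the distribution's standard deviation only when it is finite; the Levy distribution is left unscaled since its variance is infinite.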
Results (See image at: https://imgur.com/a/mo02KET )
- When the error X is normal and thus light-tailed, V increases monotonically with N, consistent with our Theorem 6.
- In 5 of 6 cases when X is lognormal or student-t, V first increases then starts to decline around N=10^2 or 10^3. When X is (t, df=5) and V is (t, df=10), V instead peaks around N=65536 but declines by N=1048576. This is consistent with Theorem 5.
- When X is Levy-distributed, utility never goes significantly above zero because Levy distribution is too heavy-tailed. In this scenario optimization completely fails.
RLHF experiments are still in progress and we will share results soon. We will be using Llama3 on OpenRLHF and adding noise to an open reward model which we deem "true".
In addition to the best-of-N results in the comment above, we examined PPO with artificially heavy-tailed rewards.
- We used OpenRLHF to train Pythia 1B with a reward model derived also from Pythia 1B, on the default OpenRLHF prompt dataset.
- We used the reward model to represent true utility, and a heavy-tailed error term based on the number of "the" tokens was added to get the proxy reward.
- The kl_target=0.5 option was used to dynamically adjust KL penalty, as we mention is done in Ziegler et al (2020).
- Rewards were not clipped. (Reward clipping can be useful to prevent overoptimization, but is not always used in PPO, see point 1 of our rebuttal to G3eM)
- Response length was limited to 256. Other hyperparameters were unremarkable but can be provided if needed.
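For reference, a hedged sketch of how such a proxy reward could be composed (the function, the Pareto noise, and its scaling are illustrative assumptions on our part, not the exact construction used in the run above):

```python
import numpy as np

def proxy_reward(rm_score: float, response_text: str, rng: np.random.Generator) -> float:
    """Proxy reward = reward-model score (treated as true utility) + heavy-tailed error.

    The error grows with the number of "the" tokens in the response; the
    Pareto(a=1.5) noise (infinite variance) and unit scale are illustrative.
    """
    n_the = sum(tok.lower() == "the" for tok in response_text.split())
    error = n_the * rng.pareto(1.5)
    return rm_score + error
```

In training, a value like this would replace the raw reward-model score fed to PPO, while the KL penalty against the base policy is left unchanged.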
Results:
- Initially, the policy balanced maximizing error and utility, generating completions like "the suggestions are shown in the images of the Progress Spinner of the completed list view." (It's only 1B, so not very coherent.)
- Later in training, the policy achieved extremely large values of reward with similar or lower KL divergence by generating normal text followed by long strings of repeated "the" tokens. (Due to the negative correlation, utility became negative. Note that Theorems 2 and 3 are still relevant because they have no independence assumptions.)
We think this validates that the basic pattern of catastrophic Goodhart can occur in RLHF under conditions of heavy-tailed error. Limitations include the artificial nature of the reward and the small size of the models, but because the theorems are quite general, we expect the same pattern to hold in other conditions, and we hope this has addressed your concerns.
The paper first considers a theoretical stylized model of jointly distributed utility and (misspecified) rewards and proves a novel result that for any heavy-tailed reference distribution, it is possible to find another distribution that approximates it arbitrarily well in terms of the KL divergence, yet has an unbounded reward. Beginning with this basic result, the authors then attempt to apply these insights to the popular technique of regularizing the policy optimization procedure in RLHF, using a learned reward model with a KL divergence penalty to the base SFT policy. Experiments on open-source LLM reward models are conducted to investigate how closely the assumptions made in the theory match current state-of-the-art models.
Strengths
- The paper identifies a novel conceptual insight around the heavy tailedness of the utility/rewards as being an important factor in reward hacking.
- Theoretical results are sound and novel, along with some attempts to also ground this in experimental validation.
Weaknesses
- The clarity in some sections can be improved -- see questions below.
- The notion of heavy-tailed versus light-tailed, while theoretically appealing and convenient, seems quite sensitive to parameterization and/or scale in practice. While the authors do mention this briefly in footnote 2 (page 6), it appears that simple reward reparameterizations can transform the resulting distributions in a way that changes the heavy/light-tailed classification.
Questions
- The definition of DMRMDP in Section 4 is unclear. Is there any precedent for this particular definition? For instance, are you always required to end a trajectory in one of the sink states? If not, are rewards also defined for partial trajectories?
- Is the converse of Theorem 1 also true? I.e., if Q is not heavy-tailed, does the maximum achievable mean stay bounded as the KL divergence goes to 0?
- In the experimental results section, random sampling and adversarial sampling are used to generate the reward samples, which are estimated from another model. However, it is hard to know the true utility, unlike in the case of Gao et al. (2023), who perform a synthetic study with known, controlled true utilities. Since the paper's insights and results are more conceptual than practical at the moment, wouldn't it be interesting to study a synthetic experimental setup similar to Gao et al., where both true utilities and proxy utilities can be measured explicitly?
- When you speak of a "heavy/light tail under a given policy", one possibility is to consider the action-conditioned expected utility, or the unconditional version where the action is simply a latent variable that determines the final reward. Is this distinction significant for any of the results? If so, which one is being assumed, and what are the implications for the alternate version on the basic result?
Nits:
- Figure reference in L110 seems like a typo.
- Figure 1 has fonts which are too small to read.
Limitations
Yes
We thank the reviewer for raising several questions and highlighting where the paper lacks clarity.
- On reparameterizing rewards to change whether they are heavy-tailed: The reward can be reparameterized; however, in settings where the true reward is heavy-tailed, making the reward artificially light-tailed or bounded can reward behavior that is unintended.
- For example, a stock-trading agent should be rewarded by profit, but stock returns are known to be heavy-tailed. If we cap or otherwise transform rewards into a bounded range, the agent will have no incentive to take into account huge gains or losses. Since RLHF rewards as implemented in Ziegler et al. are unbounded, clipping or transforming rewards could itself cause reward misspecification. We think this also applies when rewards are bounded but potentially very large.
- In some cases, mostly when the reward is not the true intended one, it is possible to reparameterize the reward without adverse effects. In the RL literature for Atari games, rewards are changes in score clipped to [-1, 1] [1].
Thank you for highlighting this; we will discuss when the reward can and cannot be reparameterized in Section 7.
- On precedent for DMRMDP:
- As for precedent, deterministic MDPs are common in the literature, e.g. in [2], but we have not seen the Markovian-returns property in this exact form. With our definition, we are not trying to introduce a new setting, but rather to list the properties of RLHF that imply our theorems, so that the relevance of our results is clear.
- We intend that trajectories must terminate in a sink state. Thank you for pointing out that this is unclear; we will clarify the definition.
- Is the converse of Theorem 1 also true?
Theorem 3 is effectively a converse to Theorem 1, and shows that if Q is light-tailed, the maximum achievable mean is no higher than the mean of Q in the limit as the KL divergence goes to 0. We will restate this theorem as an equivalence statement. Very good point!
- Synthetic experimental setup:
We agree that a synthetic experimental setup would add value, and plan to have one in the camera ready. One difficulty is in the modeling choices to make for a realistic setup, especially if we want to relax independence between utility V and error X. (A big takeaway for us was that, because independence + light-tailed error precludes overoptimization in theory, but we observed RMs to have light tails, the overoptimization observed in experiments like Gao et al is likely from broken independence, e.g. a negative correlation between error and utility. This broken independence could take many forms and different models could easily result in different outcomes.)
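As one concrete (and entirely hypothetical) modeling choice, utility and error could be coupled through a shared latent factor, which gives a tunable negative correlation while keeping the error heavy-tailed:

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_utility_and_error(n, rho=-0.5):
    """Sample (V, X) whose correlation has the sign of rho via a shared Gaussian factor.

    V is light-tailed utility; X gets heavier-than-exponential tails by cubing a
    Gaussian. Both the coupling and the tail construction are illustrative, not
    taken from the paper.
    """
    z = rng.standard_normal(n)                                    # shared factor
    v = z + rng.standard_normal(n)                                # utility V
    g = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    x = np.sign(g) * np.abs(g) ** 3                               # heavy-tailed error X
    return v, x
```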
- Precise meaning of heavy/light tail under a policy:
We are taking actions as latent variables that determine the reward, but taking the expectation over randomness in the reward assigned by the environment for each possible trajectory (note that in Theorem 2, the relevant quantity is the average reward of each trajectory). So if a certain policy in a certain environment has a 50/50 chance of producing trajectory $\tau_1$ or trajectory $\tau_2$, the final state of $\tau_1$ gets a reward drawn from a normal distribution with mean 10, and the final state of $\tau_2$ gets a reward drawn from a Student-t distribution with mean 20, then the relevant distribution will be discrete with a 50/50 chance between 10 and 20.
All these questions, as well as the two nitpicks, will be (or have already been) addressed in the next version of the paper. We welcome any more comments or questions you might have, as your feedback has improved the paper. In addition, we invite you to revise your score if your concerns have been answered to your satisfaction. Thank you for a thoughtful review!
[1] Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents." Journal of Artificial Intelligence Research 61 (2018): 523-562. https://arxiv.org/abs/1709.06009
[2] Ronald Ortner. "Online regret bounds for Markov decision processes with deterministic transitions." In Algorithmic Learning Theory, 19th International Conference, ALT 2008, Proceedings, 2008, pp. 123–137.
Thank you for the response and the clarifications. I have updated my initial review's score accordingly.
Thank you to all the reviewers for their thoughtful feedback, constructive criticism, and recognition of our paper's contributions. We appreciate the time and effort invested in evaluating our work and suggesting improvements.
The reviewers unanimously agreed the paper is technically sound and that it is relevant to an important problem in RLHF: reward misspecification. The reviewers' main concerns centered around clarity, experimental depth, and the relationship between theory and practice. We have addressed these as follows:
- Clarity and presentation:
- We have improved the overall clarity and readability of the paper, particularly in the background section and graph presentations (see resp. to ngHC).
- We clarified the relationship between DMRMDPs and RLHF (see resp. to G3eM).
- We've fixed notation inconsistencies and typos (e.g., the missing unboundedness assumption in Theorem 4, thank you SFxG for pointing it out).
- Synthetic experiments: we've conducted new experiments demonstrating our results with Best-of-N on a synthetic reward. We also plan to do experiments optimizing the synthetic reward under KL divergence regularization, and possibly other kinds of regularization (see responses to ngHC and G3eM).
- Implicit regularization/inductive bias in optimization: reviewers SFxG and ngHC asked about studies of how the RLHF setup regularizes the solutions found, beyond what is simply optimal. As part of the synthetic experiments in point 2, we will study the implicit regularization of LLM optimization. However, we believe studying LoRA should be part of future work.
- Bounded rewards: in responses to znAw and SFxG, we have provided more intuition on what happens with bounded or clipped rewards, and will add that to the paper. Concrete theoretical results on this are difficult and should be part of future work.
We believe these changes and clarifications significantly improve the paper while maintaining its core contributions. The theoretical framework we provide offers a new lens for understanding existing experimental work on specification gaming and overoptimization (see response to ngHC).
We would be excited to continue discussing, and remain committed to addressing any remaining concerns. We kindly ask the reviewers to consider revising their ratings if they feel we've adequately addressed their initial concerns.
Thank you again for your valuable input, which has undoubtedly strengthened our paper.
This paper studies whether KL regularization is sufficient to mitigate the (potentially catastrophic) errors due to reward misspecification in RL. The authors prove that KL regularization works under a fairly natural assumption on the type of reward misspecification (independent noise with light tails). Conversely, they prove that even KL regularization can be catastrophic if this assumption is broken (i.e., if the noise is heavy-tailed).
Reviewers agreed that the work presents a clean and novel theoretical insight to the important problem of reward misspecification. There were several concerns about clarity and experimental support, which the authors addressed in their rebuttal. After the rebuttal, all reviewers recommended acceptance. I thus recommend acceptance.