Modification-Considering Value Learning for Reward Hacking Mitigation in RL
We introduce an RL algorithm that optimizes the current utility function while accounting for the consequences of its modification, and demonstrate its effectiveness in preventing reward hacking.
Abstract
Reviews and Discussion
The paper proposes a novel value learning RL algorithm intended to reduce the probability of agents developing reward hacking (i.e., unsafe and unintended behavior due to a non-optimal definition of the reward function). The paper proposes two variations of the algorithm, which at a high level use an environment model to reason over whether a sample should be added to the experience replay buffer before doing so. Some experimental results are presented in relatively simple environments comparing the approach to alternative, similar training regimes.
Questions For Authors
- What is the reasoning behind checking if a sample will improve the policy before using it for training as a way of avoiding reward hacking?
Claims And Evidence
The main claim of the paper is that the contributed algorithm addresses a specific type of reward hacking, therefore by using it we would be avoiding some instances of reward hacking.
The evidence provided is mainly empirical in modified gridworld domains, plus a MuJoCo domain where the agent receives an extra "unintended" reward if it executes a certain sequence of actions. A comparison showing the performance of the proposed method vs. the "base version" of the algorithm is shown, where the metric is mainly the accumulated reward achieved by the algorithms.
Methods And Evaluation Criteria
I am not very convinced the empirical evaluation shown here is really appropriate for several reasons:
- First, I am not even sure if the gridworld domains actually represent a realistic situation related to reward hacking. I would expect that reward hacking wouldn't be identified by the user at all (otherwise they would have corrected the reward function). If I understand correctly, the gridworlds are intended to show a scenario in which a human would be "watching" the learned policy and identifying where reward hacking happened, which, while a strong assumption, is somewhat acceptable, but would require some thought on how often and how the human would be required to be watching the policies, and exactly which form this feedback should take.
- The MuJoCo environment is closer to what I would expect as a good reward hacking environment, where a "hidden" sequence of actions would enable a very high reward. However, I am not sure how avoiding the reward hacking was incorporated in the metrics shown in the experimental evaluation. Because the reward hacking sequence was removed from the sum of rewards in the graph, is it expected that pursuing this sequence of actions would necessarily result in a lower sum of rewards? That's not really a great way of evaluating it because it gets mixed together with approaches that just can't learn the task. It would be better to report in a table or graph the number of times the reward hacking sequence was triggered.
For the method, I couldn't really get why the proposal would avoid reward hacking. What the proposal is doing is basically using a counterfactual to decide whether to add a training sample or not, and I could not understand what relation this has to reward hacking (from the point of view of the agent it would just be a higher reward; how would it know that it would result in a "worse policy"?). So to me the method sounds a bit disconnected from the objective.
Theoretical Claims
Paper is empirical
Experimental Design And Analyses
Apart from what was mentioned in Methods And Evaluation Criteria, the method did not show any comparison against another method developed explicitly for avoiding reward hacking. This might indicate that the assumptions followed by the method are too restrictive and the authors could not adapt any other method to compare against fairly.
Supplementary Material
No.
Relation To Broader Scientific Literature
Authors provided a good review of related papers, but did not add them to the experimental evaluation.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
I would say that the paper is really unclear on the assumptions followed, some of which are quite restricting.
The reader has to read the method in detail to only understand, several pages into the paper, that the method requires as an INPUT an already trained "safe" policy as a starting point. Therefore, the method could be seen as a way of improving an already-decent policy, not really an algorithm for training from scratch. Moreover, the method consists of performing a very high number of rollouts and optimization steps in the environment model just to decide which samples to use for training the "final policy". The authors are not clear at all that users of their method should be prepared to invest an obscene amount of extra compute to use it (except for a very quick comment in the appendix). Overall I would say the method description should start with a clear list of assumptions and requirements for using the method.
Another critical matter is that after reading the whole paper I still don't understand WHY the proposed method should address reward hacking. I cannot get what the reasoning is behind checking if a sample will improve the policy before using it for training as a way of avoiding reward hacking, since the agent won't be able to tell apart a very high reward from a great policy and a high reward from hacking. I am probably missing something.
Other Comments Or Suggestions
Reorganize the paper so that at the beginning of the respective sections it is very clear:
- What is the reasoning behind using this method for reward hacking
- What are the assumptions/costs expected when using the approach, and in which situation it would make sense to use it
- Exactly how do the gridworlds simulate a realistic reward hacking situation?
Post-rebuttal
Perhaps "safe" wasn't the best word to describe it but I did mean that the agent needs to have access to a reward hacking-proof utility function to begin with, which sounds to me very unrealistic in most of the cases. The only situation I can think of that this could be useful is if you trained your agent in an controlled environment and you want to make sure that the samples your agent gather during deployment are not poisoned. Overall the reasons for my scoring remain.
Thank you for your detailed review and insightful questions. We appreciate the chance to clarify MCVL's mechanism, assumptions, and evaluation. We will revise the paper to make these points clearer.
For the method, I couldn't really get why the proposal would avoid reward hacking. [...] how would it know that it would result in a "worse policy"? [...] I cannot get what is the reasoning behind checking if a sample will improve the policy [...]
The core idea is that the agent has a utility function which can judge how good a trajectory is according to prior experience. When the agent encounters a new transition, it uses this utility function to check whether learning from this transition will result in behavior aligned with prior experiences. It checks that the new policy would not contradict its previous, non-reward-hacking experiences. It does not mean that the policy is "improved" in the standard RL sense.
When MCVL forecasts that learning from a transition will make the agent execute behaviors that it does not prefer now, it rejects the update. We describe this in L.157(left)-L.131(right) in the Method. We will clarify the description and add another one in the Introduction.
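Below is a minimal sketch of this check, written by us for illustration. It is not the authors' implementation, and the helper names (`forecast_policy`, `rollout`, `utility`) are hypothetical stand-ins for the learned policy-forecasting, environment-model rollout, and trajectory-level utility components described above.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple  # (state, action, reward, next_state) -- illustrative
Trajectory = List[Transition]


def should_accept_transition(
    buffer: Sequence[Transition],
    new_transition: Transition,
    forecast_policy: Callable[[List[Transition]], Callable],  # trains a policy copy on a buffer
    rollout: Callable[[Callable, int], Trajectory],            # rolls a policy in the env model
    utility: Callable[[Trajectory], float],                    # current trajectory-level utility
    horizon: int,
) -> bool:
    """Return True if learning from new_transition is judged consistent with the
    agent's current utility function, i.e., it is not predicted to cause behavior
    the current utility disprefers (such as reward hacking)."""
    # Forecast future behavior without and with the candidate transition.
    policy_keep = forecast_policy(list(buffer))
    policy_add = forecast_policy(list(buffer) + [new_transition])

    # Evaluate both forecasts under the *current* utility function.
    u_keep = utility(rollout(policy_keep, horizon))
    u_add = utility(rollout(policy_add, horizon))

    # Reject the update if the forecasted behavior is worse according to current preferences.
    return u_add >= u_keep
```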
The reader has to read in details the method to only understand [...] that the method requires as an INPUT an already trained "safe" policy [...], the users of their method should be prepared to invest an obscene amount of extra compute...
We apologize for the lack of clarity and will make assumptions and costs more prominent.
- Input: MCVL requires an initial utility function, not a safe policy. This function captures initial preferences and can be learned from non-hacking data (e.g., a Safe environment, or random rollouts as in the Reacher experiment). The utility function needs to prefer trajectories from the current policy over the reward hacking policy by the time the reward hacking sequence is triggered. If reward hacking is hard to discover, the utility function has time to learn from transitions before it happens, without relying on the initial one. This is mentioned in the Abstract/Introduction, but we will make it more prominent. The RL policy can train from scratch.
- Cost: MCVL adds computational overhead. We discuss this and potential ways to mitigate it in the first paragraph of the Limitations section. With one of the proposed solutions, using a threshold, we observe a moderate ~1.8x slowdown vs. TD3 in the Reacher environment. The goal of the paper was to show that reward hacking can be mitigated by avoiding inconsistent utility updates; we leave further optimizations to future work. We will add more information on this topic in the paper.
I am not even sure if the gridworld domains actually represent a realistic situation [...]. I would expect that reward hacking wouldn't be identified by the user at all. [...] how do the gridworlds simulate a realistic reward hacking situation?
To study reward hacking and measure its mitigation, we need environments where we can detect and measure it. Our experiments show that in several environments used by prior work to illustrate the problem of reward hacking, it can be avoided by preventing inconsistent utility updates. We expect this principle to generalize to situations where reward hacking is hard for the user to detect. Our experiments do not assume a human watching the learned policy and identifying reward hacking.
The MuJoCo environment [...] how avoiding the reward hacking was incorporated in the metrics [...] report [...] the amount of times the reward hacking sequence was triggered.
- Metric: Performance tracks the intended task reward (reaching the target), excluding the hacking reward. When the baseline hacks, it neglects the target, causing performance to drop. MCVL improves performance throughout the training, showing it learns the intended task successfully. This distinguishes it from simply failing to learn. We will highlight this in the paper.
- Hacking Frequency: Figure 3e (bottom) implicitly shows this. Returns above 0 for the baseline require hacking. MC-TD3's returns show it rarely triggers the sequence (legitimate triggers are possible when the intended goal is nearby). We will add text clarifying this.
the method did not show any comparison against another method developed explicitly for avoiding reward hacking.
Direct comparisons are hard due to differing assumptions. MCVL's requirement (initial utility function) is often less restrictive than requirements of other work. The only prior work applicable to deep RL is ORPO [1] and it requires a safe policy. Our response to Reviewer dDMw includes new experiments showing an ORPO-like approach would struggle in our setting.
We hope this clarifies MCVL's rationale and addresses your concerns.
[1] Cassidy Laidlaw, Shivam Singhal, Anca Dragan. Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking. ICLR 2025.
Traditional Reinforcement Learning (RL) agents often demonstrate reward hacking, which is defined as the ability to maximize rewards without providing the desired outcomes. The paper studies reward hacking in RL by using General Utility in order to learn and update utility functions at the trajectory level. Inconsistencies between the current and updated utility functions are minimized using Modification-Considering Value Learning (MCVL). MCVL starts with an initial utility function and refines it by comparing the expected utility values of the current and updated functions. The modify command is embedded in the new action space and used alongside actions in the trajectories. MCVL is combined with DDQN (for discrete tasks) and TD3 (for continuous tasks). The proposed method demonstrates improved intended behaviors by virtue of the performance metric across safe and full environment configurations as well as different training scenarios.
Questions For Authors
Please refer to strengths and weaknesses.
Claims And Evidence
Please refer to strengths and weaknesses.
Methods And Evaluation Criteria
Please refer to strengths and weaknesses.
Theoretical Claims
Please refer to strengths and weaknesses.
Experimental Design And Analyses
Please refer to strengths and weaknesses.
Supplementary Material
Yes, the appendix.
Relation To Broader Scientific Literature
Please refer to strengths and weaknesses.
Essential References Not Discussed
Please refer to strengths and weaknesses.
Other Strengths And Weaknesses
Strengths
- The paper is well written and easy to follow.
- Experiments and environments considered in the work are well thought out.
Weaknesses
- Learned Modifications: I am struggling to understand the learning and modification scheme used for modifying the utility function. modify is a part of the action space but is not learned, either via a loss function or trainable parameters. The modification is derived based on the discrepancy in expected values between utility functions of the policies. But wouldn't the new updated policy with fresh T trajectory samples always be better? Intuitively, the updated policy has more information about the environment and agent performance and thus must yield a better value function. In the current setup, I am not sure if comparing policy discrepancies is a systematic way of modifying utility, as the agent neither learns nor is made aware of these modifications in any form.
- Performance Metric: Authors compare the robustness of MCVL to reward hacking using episodic returns and the performance metric. However, I am unable to understand the performance metric. What does performance signify here? How is it quantified? How is the performance metric defined (intuitively and mathematically) for a given environment? For instance, episode return is the average sum of discounted rewards at each step. Furthermore, authors mention that performance indicates the intended behavior of the agent on an environment. How does one know this intended behavior beforehand? In its current form, the metrics and experimental evaluation shed little light on how MCVL addresses reward hacking.
- Ablation Study: While the paper evaluates MCVL on continuous and discrete tasks on different agents, it does not evaluate the efficacy of the proposed method in mitigating reward hacking. For instance, authors only compare return and performance metrics which tell us little about whether the agent has learned meaningful behaviors as a result of the proposed techniques. Instead, authors could study and compare how their proposed additions benefit the agent. Authors could compare ablations between trajectory-level and per-step-level learning of the policy. Similarly, what if instead of selectively using modify we conduct random modifications of the utility function? Currently, the paper only compares the performance of MCVL on two baseline RL algorithms and does not shed light on the contribution of the proposed techniques.
- Contribution and Novelty: I am struggling to understand the novel contribution of the work and its utility for the RL community. How does the algorithm benefit RL algorithms since the only proposal that has been made is to selectively modify the utility function? While the paper also shifts from a state-level policy learning setting to trajectory-level learning, recent RL algorithms of today (Decision Transformer [1], Diffusion RL [2]) operate on trajectory-level samples.
[1] Chen et al. Decision Transformer: Reinforcement Learning via Sequence Modeling. NeurIPS 2021.
[2] Janner et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022.
Other Comments Or Suggestions
NA
Ethics Review Concerns
NA
Thank you for your review and valuable suggestions. We address your points below and will incorporate the responses into the paper.
Learned Modifications: [...] But wouldn't the new updated policy with fresh T trajectory samples always be better? [...] comparing policy discrepancies is [not] a systematic way of modifying utility...
Your intuition holds for standard RL agents optimizing a fixed target. However, MCVL agents optimize their current utility function and consider the consequences of changing that utility. As explained in the Method section (Lines 157 left-132 right), an update might contain new information that leads to future behavior deemed undesirable by the agent's current values.
MCVL evaluates whether incorporating new data (leading to an updated utility function) would result in behavior with lower expected utility according to the current one. If so, the update is rejected. We are preventing the agent from learning to prefer trajectories (like reward-hacking trajectories) that its current self evaluates negatively. It's akin to considering long-term consequences before changing one's preferences based on short-term gains.
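A compact way to write this acceptance condition, in notation of our own choosing ($U$: current utility, $U'$: candidate updated utility, $\pi_U$: the forecasted future policy obtained by optimizing $U$), is:

$$
\text{accept } U \rightarrow U' \quad\Longleftrightarrow\quad
\mathbb{E}_{\tau \sim \pi_{U'}}\big[U(\tau)\big] \;\ge\; \mathbb{E}_{\tau \sim \pi_{U}}\big[U(\tau)\big],
$$

where both expectations are taken over trajectories generated with the environment model and evaluated with the current utility $U$.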
Performance Metric: [...] What does performance signify here? How is it quantified? [...] How does one know this intended behavior beforehand?
The performance metric, standard in prior work such as AI Safety Gridworlds [1] (as mentioned in L. 192), is the discounted sum of true rewards reflecting the intended task goal. This contrasts with the observed reward, which might be flawed and exploitable (leading to hacking).
The intended behavior (and thus true reward) is defined by the environment designer [1]; we use the standard definitions for these benchmark tasks. We describe the performance metric of each environment in Sec 4.1. Crucially, MCVL does not require the true reward/performance metric for training – it's used purely for evaluation to demonstrate that MCVL successfully avoids reward hacking (i.e., maintains high performance on the intended task even when the observed reward is misleading). Declining performance alongside increasing observed returns signals hacking; our results show MCVL prevents this decline.
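As a concrete illustration in notation of our own choosing, with observed reward $\hat{r}$ (used for training) and true reward $r^{*}$ (used only for evaluation):

$$
\text{Return} = \sum_{t} \gamma^{t}\, \hat{r}(s_t, a_t), \qquad
\text{Performance} = \sum_{t} \gamma^{t}\, r^{*}(s_t, a_t).
$$

Reward hacking shows up as the return increasing while performance falls; MCVL is designed to keep performance high.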
Ablation Study: paper [...] does not evaluate the efficacy of the proposed method in mitigating reward hacking. [...] compare ablations between trajectory-level and per-step-level learning [...], what if instead of selectively using modify we conduct random modifications [...]?
Our experiments directly evaluate efficacy by tracking both observed returns and the true performance metric. As noted above, the divergence (or lack thereof) between these is the measure of reward hacking [2].
Regarding specific ablations:
- Trajectory vs. Step-level: MCVL uses standard step-level RL (DDQN/TD3 policies map states to actions) but incorporates trajectory-level context when deciding whether to perform the utility update (modify) for a given transition. This decision is the only difference to the baselines we compare to.
- Random modify: Randomly discarding transitions wouldn't remove all transitions with misleading rewards. MCVL's selective rejection is specifically designed to prevent updates predicted to lower current utility. A more relevant baseline, rejecting updates based on reward prediction error, is included in Figure 4a and performs worse.
- Other Baselines: Please also see our response to Reviewer dDMw, where new experiments show that, unlike MCVL, occupancy measure regularization methods, such as ORPO, would struggle to learn the optimal policy while avoiding reward hacking. Section 4.4 and Appendix C also contain further ablations.
Contribution and Novelty: [...] struggling to understand the novel contribution [...] How does the algorithm benefit RL algorithms, [...] recent RL algorithms [...] operate on trajectory-level samples.
MCVL's core contribution is a novel mechanism to mitigate reward hacking within existing RL frameworks by ensuring utility function updates are consistent with the agent's current values.
Its benefit is enhancing the safety and reliability of RL agents by preventing them from learning unintended, potentially harmful behaviors when reward functions are imperfect. This is a critical AI Safety problem.
While Decision Transformer and Diffusion RL use trajectories for sequence modeling or planning, MCVL uses trajectory information to validate potential updates during standard RL training. It modifies existing algorithms (DDQN, TD3) to make them safer, rather than being a new trajectory-based learning paradigm itself. After training, the policy remains a standard state-to-action map.
Thank you again for your review and please let us know if you have any further questions or suggestions.
[1] Leike, J., et al. AI safety gridworlds. 2017.
[2] Skalse, J., et al. Defining and characterizing reward gaming. NeurIPS, 2022.
I thank the authors for their response. After going through the authors' rebuttal and response to other reviewers, my concerns regarding the learned modifications and ablations remain.
Learned Modifications: The modification modify is not learned, which raises questions about the efficacy of the scheme. It is not completely known what the new update presents to the agent and how it benefits learning. On the other hand, an update made using the new policy with fresh samples is always better. This is by definition of the policy improvement principle.
Ablation Study: Authors added the new ORPO baseline, which partly addresses the concern. However, the efficacy of the components utilized in the MCVL framework still remains unaddressed. The paper would largely benefit from a toy experiment or two comparing the role of various components in making MCVL effective, for example, a comparison between different modify schemes. This remains my main reservation for the acceptance of the paper.
Given that my concerns remain and my belief that the modification scheme must possess a learned component for it to be truly effective for reward hacking (and beneficial to the machine learning community), I would like to keep my current score. I thank the authors for their efforts.
We thank the reviewer for their thoughtful rebuttal comment. We appreciate the opportunity to address the remaining concerns:
Learned Modifications:
modify is not learned which raises the question on the efficacy of the scheme [...] an update made using the new policy with fresh samples is always better. This is by definition of the policy improvement principle.
We understand the reviewer's perspective on learned components and the policy improvement principle. However, our goal is distinct from standard policy optimization, and our design choices reflect this:
- Our scheme for deciding when to modify the utility function is principled and, we argue, optimal within the specific context of modification-considering agents as defined in our work. These agents aim to maximize their current utility function. Therefore, they should only accept a modification (leading to an updated utility function) if doing so does not decrease the predicted utility of the resulting future policy, evaluated according to their current objective.
- Our method implements exactly this check. Specifically, we compare two values, both evaluated under the current utility function: (1) the predicted utility of the future policy resulting from continuing to optimize the current utility, and (2) the predicted utility of the future policy resulting from optimizing the potential next utility function. Modification proceeds only if the second value is not lower than the first.
- While the final comparison is a deterministic step, it relies critically on learned components: the prediction of future policies and the utility function itself are learned.
- We believe that while predicting modify directly might offer computational advantages, the core effectiveness in preventing reward hacking stems from the principled check based on the agent's current utility. We consider optimizing computational efficiency a direction for future work.
- The policy improvement principle guarantees optimality concerning the maximization of cumulative returns under the observed reward function. However, reward hacking occurs precisely because maximizing the observed reward function can lead to undesirable outcomes not captured by it. Our method deliberately deviates from standard policy improvement when necessary to prevent reward hacking, which is a different objective than simply maximizing observed rewards.
Ablation Studies:
The paper would largely benefit from a toy experiment or two comparing the role of various components [...] a comparison between different modify schemes. This remains as my main reservation...
We believe our existing experiments already provide substantial ablation evidence for the key components of MCVL. We will ensure these are emphasized more clearly in the paper.
Specifically, we performed the following ablations:
- Alternative Modification Rule: We compared our modification check to an alternative rule based on reward prediction error. This experiment showed that simply discarding modifications based on reward prediction error does not lead to learning the optimal non-hacking policy (Figure 4a, "Discard by reward"). This directly addresses the efficacy of our specific modification check compared to a plausible alternative.
- Ablation of Future Policy Forecasting: We tested a variant that compared policies before and after each gradient step (instead of forecasting further into the future), which failed to prevent reward hacking (Figure 4a, "Each Step").
- Ablation of Inconsistent Transition Handling: We compared our mechanism (removing inconsistent transitions) to an alternative (assigning a large negative penalty reward). The penalty method proved ineffective (Figure 4a, "Punishment"), validating our specific design choice for handling utility inconsistencies.
- Impact of Utility Function Training: We studied how varying the amount of initial utility function training affects performance, including an ablation where the initial utility function was random (no training), demonstrating the need for some initial training (Figure 4b, 0 steps).
- Impact of Inconsistency Check Training: We investigated the effect of the number of training steps for the inconsistency check, including an ablation with no training, showing its necessity (Figure 5a).
We hope this clarifies the reasoning behind our design choices and highlights the existing ablation studies. We will revise the paper to make the justification for our modification scheme and the results of these ablations more prominent.
We thank the reviewer again for their constructive feedback.
This paper studies how to mitigate reward hacking by considering the change of trajectory utilities. The agent is initially trained in a Safe environment in which exploiting the reward leads to the intended behavior, and then continued in a Full environment with different dynamics/rewards. The paper claims that there will be a drop in terms of the previous utility function if the agent hacks the new environment, and proposes MCVL to detect such a signal and reject the utility update if the reward hacking happens. Experiments are conducted in several grid-world environments and a continuous-control environment. The proposed MCVL can successfully avoid reward hacking, while the conventional RL baseline exploits the misleading rewards.
Post rebuttal update
The added experiment with ORPO reveals some interesting conclusions about how regularizing occupancy measures could fail in scenarios where the oracle policy deviates a lot from an initial safe (reference) policy. However, this experiment cannot change the fact that the proposed method makes overly strong assumptions about the knowledge of a safe initial utility function.
Questions For Authors
Please check the previous sections.
Claims And Evidence
- The paper claims that MCVL iteratively refines the initial coarse utility function, but it seems that the experiment results cannot reflect how the utility function is updated.
- This work is claimed to be the first to demonstrate successful learning of non-reward-hacking behaviors in the benchmark environments. I think the authors should compare with other methods to mitigate reward hacking (such as those mentioned in the Introduction and Related Work) to better support this claim. Currently, the main results only compare MCVL with ordinary RL, and the results seem unsurprising.
Methods And Evaluation Criteria
The proposed method generally makes sense if an initial aligned utility function is available. This work assumes access to a safe environment and a safe reward function. By optimizing the reward function in the safe environment, the optimal behavior is always the intended behavior. I think the assumption in this setting is strong, since designing reward functions that lead to the exact intended behavior itself is challenging.
Theoretical Claims
Not applicable.
Experimental Design And Analyses
- In the experimented environments, the intended behavior is always the optimal solution under the reward design in the Safe version, therefore, alternative behaviors that deviate from the intended behavior would lower the initial utility. I am not sure if assuming the knowledge of a Safe version is reasonable for practical applications. I think it would be better to explain how the setting with Safe and Full versions relates to real-world scenarios.
- MCVL is mainly compared against standard DDQN and TD3 algorithms, and the results show that MCVL successfully avoids reward hacking. This positive result is appreciated, but adding more baseline methods that use different ways to mitigate reward hacking would strengthen the result.
Supplementary Material
I did not review the supplementary material.
Relation To Broader Scientific Literature
This paper is mostly related to AI safety.
Essential References Not Discussed
I cannot point out essential references not discussed as I am not familiar with the topic of AI safety.
Other Strengths And Weaknesses
None.
Other Comments Or Suggestions
None.
Thank you for your thorough review and valuable suggestions. We would like to clarify several points and will incorporate these clarifications in the paper.
The paper claims that MCVL iteratively refines the initial coarse utility function, but it seems that the experiment results cannot reflect how the utility function is updated.
Refining the utility function means the agent continuously updates it using transitions not judged as reward hacking. Our results demonstrate MCVL robustly detects hacking while successfully learning improved non-hacking policies across diverse environments. We welcome specific suggestions for additional experiments or metrics if needed.
I think the authors should compare with other methods to mitigate reward hacking [...] Currently, the main results only compare MCVL with ordinary RL, and the results seem unsurprising.
The only prior work that is applicable to regular RL environments is ORPO [1]. Direct comparison with it is challenging due to different requirements (it requires a known safe policy, uses policy gradient methods, works only with stochastic policies, and requires tricky discriminator tuning).
To make a fair comparison, we evaluated whether an ORPO-like objective could succeed in our setting. We trained Q-functions for the Initial (Safe env), Hacking (Full env, observed reward), and Oracle (Full env, true reward) settings using DDQN. We then checked whether any regularization weight exists such that the ORPO objective both avoids the Hacking policy and still recovers the Oracle policy. Note that just avoiding reward hacking would be trivial assuming a known safe policy. We tested two ways of obtaining the stochastic policies required by ORPO, a softmax over Q-values and an ε-greedy policy, with two divergence choices (including KL). Occupancy measures were computed with 1000 policy rollouts.
The table presents the percentage of runs for which such a regularization weight exists (10 seeds):
| Policy | Divergence | Box Moving | Absent Supervisor | Tomato Watering | Rocks and Diamonds |
|---|---|---|---|---|---|
| Soft-Q | | 0% | 0% | 0% | 0% |
| Soft-Q | KL | 0% | 0% | 0% | 0% |
| ε-greedy | | 70% | 40% | 30% | 0% |
| ε-greedy | KL | 40% | 50% | 0% | 0% |
Our results show that frequently no such weight exists, indicating ORPO's occupancy measure regularization would likely fail to learn the optimal policy without reward hacking for any choice of hyperparameters. Occupancy regularization may struggle when:
- The Oracle policy differs significantly from the Initial one (comparably to the difference between the Hacking and Initial policies). The experiments in [1] use a modified Tomato Watering environment where the bucket was moved further away, which increases the occupancy difference between the Hacking policy and the safe policy.
or
- High hacking rewards require a large regularization weight, but a large weight prevents learning the Oracle policy (e.g., in Rocks and Diamonds).
In contrast, MCVL consistently achieves the Oracle policy performance in all these environments.
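To make this kind of check and the second failure mode concrete, here is a toy sketch we wrote for illustration; the function names, the weight grid, and all the numbers are made up and are not taken from the paper, and the exact inequalities used in our experiment may differ.

```python
import numpy as np


def orpo_objective(observed_return: float, divergence: float, weight: float) -> float:
    """ORPO-style regularized objective: observed return minus a weighted
    occupancy-measure divergence to the safe reference policy."""
    return observed_return - weight * divergence


def exists_good_weight(policies: dict, weights: np.ndarray) -> bool:
    """policies maps name -> (observed_return, divergence_to_reference).
    Returns True if some weight makes 'oracle' the highest-scoring policy."""
    for w in weights:
        scores = {name: orpo_objective(ret, div, w) for name, (ret, div) in policies.items()}
        if max(scores, key=scores.get) == "oracle":
            return True
    return False


# Illustrative numbers only (not from the paper).
policies = {
    "initial": (1.0, 0.0),  # safe reference policy: zero divergence to itself
    "hacking": (5.0, 2.0),  # high observed return, large occupancy shift
    "oracle":  (3.0, 1.5),  # intended behavior, also far from the reference
}
print(exists_good_weight(policies, np.linspace(0.0, 10.0, 101)))  # -> False
```

With these illustrative numbers, no weight makes the Oracle policy the best-scoring one: small weights favor the Hacking policy, while large weights favor the Initial policy.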
We will add full details of the experiment and additional metrics to the paper.
The proposed method generally makes sense if an initial aligned utility function is available. This work assumes access to a safe environment and a safe reward function. [...] I think the assumption in this setting is strong [...]
Crucially, our core requirement is an initial, reasonably aligned utility function, not necessarily a fully specified Safe environment or reward function. A Safe environment can be used to learn the initial utility function, but it can also be learned from other sources like non-hacking random rollouts (as in our Reacher experiment). We use Safe environments in our gridworld experiments because triggering reward hacking in the original environments is too easy. A Safe reward function is not required for training MCVL; we use it purely for evaluation.
I am not sure if assuming the knowledge of a Safe version is reasonable for practical applications. [...] explain how the setting [...] relates to real-world scenarios.
As discussed (Lines 203-219), the Safe/Full setup models real-world scenarios such as transferring from simulation to the real world, or from a restricted lab environment to an unrestricted one. We also mention that a Safe environment is not required if reward hacking is hard to discover. We will clarify this paragraph and add additional examples, including:
- Training on simpler tasks with simpler reward design.
- Monitoring agent and removing trajectories with reward hacking.
- Using human demonstrations for initialization.
Thank you again for your valuable feedback. Please let us know if our responses address your concerns and if you have further suggestions.
[1] Cassidy Laidlaw, Shivam Singhal, Anca Dragan. Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking. ICLR 2025.
This paper addresses the problem of reward hacking by framing it within the General Utility Reinforcement Learning (GU-RL) framework. The authors introduce trajectory value functions and a mechanism for explicit utility inconsistency detection. Their proposed utility update technique can be integrated into standard value-based methods such as DDQN and TD3, leading to the implementations MC-DDQN and MC-TD3. These methods are demonstrated to be effective in preventing reward hacking in environments from the AI Safety Gridworlds as well as in MuJoCo tasks.
Questions For Authors
See Suggestions.
Updated Review: Thank you for the clarification. I acknowledge the novelty of the proposed idea. However, I regret that the current evaluation is limited to custom-designed environments, and no additional experiments were provided to test generalizability and the sensitivity of h. This constraint notably limits the applicability of the method to broader benchmarks. I suggest clearly stating this as a limitation in the final version.
Claims And Evidence
Yes.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes.
Experimental Design And Analyses
Yes.
Supplementary Material
Each part, and especially implementation details.
Relation To Broader Scientific Literature
The key contributions of the paper connect closely to several strands of prior research in reinforcement learning and AI safety, especially, General Utility Reinforcement Learning and Reward Hacking and Specification Gaming.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
I find the problem addressed in this paper very interesting, and the proposed approach is reasonable. However, the method currently has some limitations—for instance, it assumes access to rollouts from the true environment transition model, which significantly restricts its applicability. This means that the method can typically only be applied when an explicit transition model is available or when the simulator has been modified accordingly. The paper does acknowledge these limitations.
Other Comments Or Suggestions
- The experiments in this paper are primarily conducted under the assumption of access to rollouts from the true environment transition model. Although the authors mention that considering approximate transition models is a future direction, I am curious: How robust is MCVL to inaccuracies in the learned transition model? For example, if the learned model deviates slightly from the true environment dynamics, can the forecasting mechanism still reliably detect utility inconsistencies?
- Is the choice of rollout length (h) for policy forecasting critical? Could the authors elaborate on how sensitive the performance is to different values of h? Is there a principled method to set this parameter, or does it require extensive tuning for each environment?
- The paper demonstrates the effectiveness of MC-DDQN in four discrete action-space environments, and MC-TD3 on the MuJoCo Reacher task (continuous action space). Given that Reacher is a relatively basic task and the experimental data between these two settings are imbalanced, would it be possible to test one or two additional MuJoCo tasks to further showcase the performance of MC-TD3?
Thank you very much for your thorough review and valuable suggestions. We are happy to answer your questions and will incorporate all answers in the final manuscript.
How robust is MCVL to inaccuracies in the learned transition model? For example, if the learned model deviates slightly from the true environment dynamics, can the forecasting mechanism still reliably detect utility inconsistencies?
Our method is robust to noisy transition models. We only use the transition model to compare trajectories produced by two policies. The only requirement for our method to work correctly is that rollouts where the policy executes reward hacking behavior have lower utility than rollouts of the policy that does not hack rewards. To verify this empirically, we are running an additional experiment in which we add noise to each one-hot encoded observation produced by the transition model, simulating a situation where the transition model is inaccurate. We ran this experiment in the Box Moving environment, and our method still obtains the optimal true reward while avoiding reward hacking. We will include this experiment in the paper.
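As a minimal sketch of this kind of perturbation (our own illustration; the noise distribution and scale below are assumptions, since the rebuttal does not specify them here):

```python
from typing import Optional

import numpy as np


def noisy_one_hot(state_index: int, num_states: int, sigma: float = 0.1,
                  rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Perturb a one-hot observation produced by the transition model.

    The Gaussian noise and sigma=0.1 are illustrative assumptions, not the
    exact settings used in the authors' experiment.
    """
    rng = rng or np.random.default_rng()
    obs = np.zeros(num_states, dtype=np.float64)
    obs[state_index] = 1.0
    return obs + rng.normal(0.0, sigma, size=num_states)
```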
Is the choice of rollout length (h) for policy forecasting critical? Could the authors elaborate on how sensitive the performance is to different values of h? Is there a principled method to set this parameter, or does it require extensive tuning for each environment?
We provide descriptions of all hyperparameters and discuss how they can be chosen in Appendix G. Here is what we write about rollout length (h):
“This parameter controls the length of the trajectories used to compare two predicted policies. The trajectory length must be adequate to reveal behavioral differences between the policies. In this paper, we used a fixed, sufficiently large number. In episodic tasks, a safe choice is the maximum episode length; in continuing tasks, a truncation horizon typically used in training may be suitable. Computational costs can be reduced by choosing a smaller value based on domain knowledge.”
The performance of the algorithm is not sensitive to h, as long as reward hacking occurs within h steps. Extensive tuning is not required, as this parameter can be set to the maximum episode length.
Would it be possible to test one or two additional MuJoCo tasks to further showcase the performance of MC-TD3?
Unfortunately, existing MuJoCo tasks are not directly applicable because of our evaluation protocol. It requires each environment to have different observed and true rewards to measure episode returns and the performance metric, respectively. To make the experiment meaningful, both rewards need to be carefully curated, plausible, and explainable. Unfortunately, designing new environments is not simple and is beyond the scope of this work. We establish new state-of-the-art performance on existing environments, and we agree that designing more complex environments for evaluating reward hacking is an important direction for future work. Please note that our method is applicable to standard RL environments; special environments are merely required to evaluate the performance of any algorithm that mitigates reward hacking.
Thank you again for your valuable feedback! Please let us know if you find our responses satisfactory and if you have any further suggestions.
The paper addresses the problem of reward hacking in reinforcement learning by introducing a framework called Modification-Considering Value Learning (MCVL), grounded in general utility RL. The central idea is to detect and mitigate reward hacking by evaluating changes in utility at the trajectory level. By monitoring for inconsistencies between initial and updated utility functions, the method rejects updates that would lead to unintended behaviour. MCVL is combined with DDQN and TD3, and is shown to outperform conventional RL methods in preventing reward exploitation across gridworlds and MuJoCo environments. Experimental results demonstrate that MCVL avoids misleading reward signals that traditional baselines often exploit.
The reviewers appreciated the novelty of the proposed approach and found the motivation behind addressing reward hacking through utility-based reasoning to be compelling. Some reviewers also noted that the additional experiments added during the rebuttal phase provided useful insights into the limitations of existing regularisation strategies. However, significant concerns were raised about the practicality and applicability of the method. Chief among them is the reliance on strong assumptions, particularly the need for access to a reward hacking-proof initial utility function. This requirement was seen as unrealistic in most practical settings, limiting the method’s usefulness in real-world applications. Moreover, these assumptions are not made explicit early in the paper. Secondary concerns include the limited scope of the experimental evaluation, which is restricted to custom-designed environments without broader benchmarking.
Overall, while the idea is promising, the current version is not yet mature enough. Engaging with the assumptions (by either motivating for why they are realistic or removing them partially) would greatly strengthen a future version of the paper.