Modification-Considering Value Learning for Reward Hacking Mitigation in RL
We introduce an RL algorithm that optimizes the current utility function while accounting for the consequences of its modification, and demonstrate its effectiveness in preventing reward hacking.
Abstract
Reviews and Discussion
The paper proposes a novel value learning RL algorithm intended to reduce the probability of agents developing reward hacking (i.e., unsafe and unintended behavior due to a non-optimal definition of the reward function). The paper proposes two variations of the algorithm, which at a high level use an environment model to reason over whether a sample should be added to the experience replay buffer before doing so. Some experimental results are presented in relatively simple environments comparing the approach to alternative, similar training regimes.
Questions For Authors
- What is the reasoning behind checking if a sample will improve the policy before using it for training as a way of avoiding reward hacking?
Claims And Evidence
The main claim of the paper is that the contributed algorithm addresses a specific type of reward hacking, therefore by using it we would be avoiding some instances of reward hacking.
The evidence provided is mainly empirical in modified gridworld domains, plus a MuJoCo domain where the agent receives an extra "unintended" reward if it executes a certain sequence of actions. A comparison showing the performance of the proposed method vs. the "base version" of the algorithm is shown, where the metric is mainly the accumulated reward achieved by the algorithms.
Methods And Evaluation Criteria
I am not very convinced the empirical evaluation shown here is really appropriate for several reasons:
- First, I am not even sure if the gridworld domains actually represent a realistic situation related to reward hacking. I would expect that reward hacking wouldn't be identified by the user at all (otherwise they would have corrected the reward function). If I understand correctly, the gridworlds are intended to show a scenario in which a human would be "watching" the learned policy and identifying where reward hacking happened, which, while a strong assumption, is somewhat acceptable, but would require some thought on how often and how the human would be required to be watching the policies, and exactly which form this feedback should take.
- The MuJoCo environment is closer to what I would expect as a good reward hacking environment, where a "hidden" sequence of actions would enable a very high reward. However, I am not sure how avoiding the reward hacking was incorporated in the metrics shown in the experimental evaluation. Because the reward hacking sequence was removed from the sum of rewards in the graph, is it expected that pursuing this sequence of actions would necessarily result in a lower sum of rewards? That's not really a great way of evaluating it because it gets mixed together with approaches that just can't learn the task. It would be better to report in a table or graph the number of times the reward hacking sequence was triggered.
For the method, I couldn't really get why the proposal would avoid reward hacking. What the proposal is doing is basically using a counterfactual to decide whether to add a training sample or not, and I could not understand what relation this has to reward hacking (from the point of view of the agent it would just be a higher reward; how would it know that it would result in a "worse policy"?). So to me the method sounds a bit disconnected from the objective.
Theoretical Claims
Paper is empirical
Experimental Design And Analyses
Apart from what was mentioned in Methods And Evaluation Criteria, the method did not show any comparison against another method developed explicitly for avoiding reward hacking. This might indicate that the assumptions followed by the method are too restrictive and the authors could not adapt any other method to compare against fairly.
Supplementary Material
No.
Relation To Broader Scientific Literature
Authors provided a good review of related papers, but did not add them to the experimental evaluation.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
I would say that the paper is really unclear on the assumptions followed, some of which are quite restricting.
The reader has to read the method in detail to only understand, several pages into the paper, that the method requires as an INPUT an already trained "safe" policy as a starting point. Therefore, the method could be seen as a way of improving an already-decent policy, not really an algorithm for training from scratch. Moreover, the method consists of performing a very high number of rollouts and optimization steps in the environment model just to decide which samples to use for training the "final policy". The authors are not clear at all that users of their method should be prepared to invest an obscene amount of extra compute to use it (except for a very quick comment in the appendix). Overall I would say the method description should start with a clear list of assumptions and requirements for using the method.
Another critical matter is that after reading the whole paper I still don't understand WHY the proposed method should address reward hacking. I cannot get what the reasoning is behind checking if a sample will improve the policy before using it for training as a way of avoiding reward hacking, since the agent won't be able to tell apart a very high reward from a great policy and a high reward from hacking. I am probably missing something.
Other Comments Or Suggestions
Reorganize the paper so that at the beginning of the respective sections it is very clear:
- What is the reasoning behind using this method for reward hacking
- What are the assumptions/costs expected when using the approach, and in which situation it would make sense to use it
- Exactly how do the gridworlds simulate a realistic reward hacking situation?
Post-rebuttal
Perhaps "safe" wasn't the best word to describe it but I did mean that the agent needs to have access to a reward hacking-proof utility function to begin with, which sounds to me very unrealistic in most of the cases. The only situation I can think of that this could be useful is if you trained your agent in an controlled environment and you want to make sure that the samples your agent gather during deployment are not poisoned. Overall the reasons for my scoring remain.
Thank you for your detailed review and insightful questions. We appreciate the chance to clarify MCVL's mechanism, assumptions, and evaluation. We will revise the paper to make these points clearer.
For the method, I couldn't really get why the proposal would avoid reward hacking. [...] how would it know that it would result in a "worse policy"? [...] I cannot get what is the reasoning behind checking if a sample will improve the policy [...]
The core idea is that the agent has a utility function which can judge how good a trajectory is according to prior experience. When the agent encounters a new transition, it uses this utility function to check whether learning from this transition will result in behavior aligned with prior experiences. It checks that the new policy would not contradict its previous, non-reward-hacking experiences. It does not mean that the policy is "improved" in the standard RL sense.
When MCVL forecasts that learning from a transition will make the agent execute behaviors that it does not prefer now, it rejects the update. We describe this in L.157(left)-L.131(right) in the Method. We will clarify the description and add another one in the Introduction.
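Below is a minimal sketch of this check, written by us for illustration. It is not the authors' implementation, and the helper names (`forecast_policy`, `rollout`, `utility`) are hypothetical stand-ins for the learned policy-forecasting, environment-model rollout, and trajectory-level utility components described above.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple  # (state, action, reward, next_state) -- illustrative
Trajectory = List[Transition]


def should_accept_transition(
    buffer: Sequence[Transition],
    new_transition: Transition,
    forecast_policy: Callable[[List[Transition]], Callable],  # trains a policy copy on a buffer
    rollout: Callable[[Callable, int], Trajectory],            # rolls a policy in the env model
    utility: Callable[[Trajectory], float],                    # current trajectory-level utility
    horizon: int,
) -> bool:
    """Return True if learning from new_transition is judged consistent with the
    agent's current utility function, i.e., it is not predicted to cause behavior
    the current utility disprefers (such as reward hacking)."""
    # Forecast future behavior without and with the candidate transition.
    policy_keep = forecast_policy(list(buffer))
    policy_add = forecast_policy(list(buffer) + [new_transition])

    # Evaluate both forecasts under the *current* utility function.
    u_keep = utility(rollout(policy_keep, horizon))
    u_add = utility(rollout(policy_add, horizon))

    # Reject the update if the forecasted behavior is worse according to current preferences.
    return u_add >= u_keep
```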
The reader has to read in details the method to only understand [...] that the method requires as an INPUT an already trained "safe" policy [...], the users of their method should be prepared to invest an obscene amount of extra compute...
We apologize for the lack of clarity and will make assumptions and costs more prominent.
- Input: MCVL requires an initial utility function, not a safe policy. This function captures initial preferences and can be learned from non-hacking data (e.g., a Safe environment, or random rollouts as in the Reacher experiment). The utility function needs to prefer trajectories from the current policy over the reward hacking policy by the time the reward hacking sequence is triggered. If reward hacking is hard to discover, the utility function has time to learn from transitions before it happens, without relying on the initial one. This is mentioned in the Abstract/Introduction, but we will make it more prominent. The RL policy can train from scratch.
- Cost: MCVL adds computational overhead. We discuss this and potential ways to mitigate it in the first paragraph of the Limitations section. With one of the proposed solutions, using a threshold, we observe a moderate ~1.8x slowdown vs. TD3 in the Reacher environment. The goal of the paper was to show that reward hacking can be mitigated by avoiding inconsistent utility updates; we leave further optimizations to future work. We will add more information on this topic in the paper.
I am not even sure if the gridworld domains actually represent a realistic situation [...]. I would expect that reward hacking wouldn't be identified by the user at all. [...] how do the gridworlds simulate a realistic reward hacking situation?
To study reward hacking and measure its mitigation, we need environments where we can detect and measure it. Our experiments show that in several environments used by prior work to illustrate the problem of reward hacking, it can be avoided by preventing inconsistent utility updates. We expect this principle to generalize to situations where reward hacking is hard for the user to detect. Our experiments do not assume a human watching the learned policy and identifying reward hacking.
The MuJoCo environment [...] how avoiding the reward hacking was incorporated in the metrics [...] report [...] the amount of times the reward hacking sequence was triggered.
- Metric: Performance tracks the intended task reward (reaching the target), excluding the hacking reward. When the baseline hacks, it neglects the target, causing performance to drop. MCVL improves performance throughout the training, showing it learns the intended task successfully. This distinguishes it from simply failing to learn. We will highlight this in the paper.
- Hacking Frequency: Figure 3e (bottom) implicitly shows this. Returns above 0 for the baseline require hacking. MC-TD3's returns show it rarely triggers the sequence (legitimate triggers are possible when the intended goal is nearby). We will add text clarifying this.
the method did not show any comparison against another method developed explicitly for avoiding reward hacking.
Direct comparisons are hard due to differing assumptions. MCVL's requirement (initial utility function) is often less restrictive than requirements of other work. The only prior work applicable to deep RL is ORPO [1] and it requires a safe policy. Our response to Reviewer dDMw includes new experiments showing an ORPO-like approach would struggle in our setting.
We hope this clarifies MCVL's rationale and addresses your concerns.
[1] Cassidy Laidlaw, Shivam Singhal, Anca Dragan. Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking. ICLR 2025.
Traditional Reinforcement Learning (RL) agents often demonstrate reward hacking, which is defined as the ability to maximize rewards without providing the desired outcomes. The paper studies reward hacking in RL by using General Utility in order to learn and update utility functions at the trajectory level. Inconsistencies between the current and updated utility functions are minimized using Modification-Considering Value Learning (MCVL). MCVL starts with an initial utility function and refines it by comparing the expected utility values of the current and updated functions. The modify command is embedded in the new action space and used alongside actions in the trajectories. MCVL is combined with DDQN (for discrete tasks) and TD3 (for continuous tasks). The proposed method demonstrates improved intended behaviors by virtue of the performance metric across safe and full environment configurations as well as different training scenarios.
Questions For Authors
Please refer to strengths and weaknesses.
Claims And Evidence
Please refer to strengths and weaknesses.
Methods And Evaluation Criteria
Please refer to strengths and weaknesses.
Theoretical Claims
Please refer to strengths and weaknesses.
Experimental Design And Analyses
Please refer to strengths and weaknesses.
Supplementary Material
Yes, the appendix.
Relation To Broader Scientific Literature
Please refer to strengths and weaknesses.
Essential References Not Discussed
Please refer to strengths and weaknesses.
Other Strengths And Weaknesses
Strengths
- The paper is well written and easy to follow.
- Experiments and environments considered in the work are well thought out.
Weaknesses
- Learned Modifications: I am struggling to understand the learning and modification scheme used for modifying the utility function. modify is a part of the action space but is not learned, either via a loss function or trainable parameters. The modification is derived based on the discrepancy in expected values between utility functions of the policies. But wouldn't the new updated policy with fresh T trajectory samples always be better? Intuitively, the updated policy has more information about the environment and agent performance and thus must yield a better value function. In the current setup, I am not sure if comparing policy discrepancies is a systematic way of modifying utility, as the agent neither learns nor is made aware of these modifications in any form.
- Performance Metric: Authors compare the robustness of MCVL to reward hacking using episodic returns and the performance metric. However, I am unable to understand the performance metric. What does performance signify here? How is it quantified? How is the performance metric defined (intuitively and mathematically) for a given environment? For instance, episode return is the average sum of discounted rewards at each step. Furthermore, authors mention that performance indicates the intended behavior of the agent on an environment. How does one know this intended behavior beforehand? In its current form, the metrics and experimental evaluation shed little light on how MCVL addresses reward hacking.
- Ablation Study: While the paper evaluates MCVL on continuous and discrete tasks on different agents, it does not evaluate the efficacy of the proposed method in mitigating reward hacking. For instance, authors only compare return and performance metrics which tell us little about whether the agent has learned meaningful behaviors as a result of the proposed techniques. Instead, authors could study and compare how their proposed additions benefit the agent. Authors could compare ablations between trajectory-level and per-step-level learning of the policy. Similarly, what if instead of selectively using modify we conduct random modifications of the utility function? Currently, the paper only compares the performance of MCVL on two baseline RL algorithms and does not shed light on the contribution of the proposed techniques.
- Contribution and Novelty: I am struggling to understand the novel contribution of the work and its utility for the RL community. How does the algorithm benefit RL algorithms since the only proposal that has been made is to selectively modify the utility function? While the paper also shifts from a state-level policy learning setting to trajectory-level learning, recent RL algorithms of today (Decision Transformer [1], Diffusion RL [2]) operate on trajectory-level samples.
[1] Chen et al. Decision Transformer: Reinforcement Learning via Sequence Modeling. NeurIPS 2021.
[2] Janner et al. Planning with Diffusion for Flexible Behavior Synthesis. ICML 2022.
Other Comments Or Suggestions
NA
Ethics Review Concerns
NA
Thank you for your review and valuable suggestions. We address your points below and will incorporate the responses into the paper.
Learned Modifications: [...] But wouldn't the new updated policy with fresh T trajectory samples always be better? [...] comparing policy discrepancies is [not] a systematic way of modifying utility...
Your intuition holds for standard RL agents optimizing a fixed target. However, MCVL agents optimize their current utility function and consider the consequences of changing that utility. As explained in the Method section (Lines 157 left-132 right), an update might contain new information that leads to future behavior deemed undesirable by the agent's current values.
MCVL evaluates whether incorporating new data (leading to an updated utility function) would result in behavior with lower expected utility according to the current one. If so, the update is rejected. We are preventing the agent from learning to prefer trajectories (like reward-hacking trajectories) that its current self evaluates negatively. It's akin to considering long-term consequences before changing one's preferences based on short-term gains.
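A compact way to write this acceptance condition, in notation of our own choosing ($U$: current utility, $U'$: candidate updated utility, $\pi_U$: the forecasted future policy obtained by optimizing $U$), is:

$$
\text{accept } U \rightarrow U' \quad\Longleftrightarrow\quad
\mathbb{E}_{\tau \sim \pi_{U'}}\big[U(\tau)\big] \;\ge\; \mathbb{E}_{\tau \sim \pi_{U}}\big[U(\tau)\big],
$$

where both expectations are taken over trajectories generated with the environment model and evaluated with the current utility $U$.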
Performance Metric: [...] What does performance signify here? How is it quantified? [...] How does one know this intended behavior beforehand?
The performance metric, standard in prior work such as AI Safety Gridworlds [1] (as mentioned in L. 192), is the discounted sum of true rewards reflecting the intended task goal. This contrasts with the observed reward, which might be flawed and exploitable (leading to hacking).
The intended behavior (and thus true reward) is defined by the environment designer [1]; we use the standard definitions for these benchmark tasks. We describe the performance metric of each environment in Sec 4.1. Crucially, MCVL does not require the true reward/performance metric for training – it's used purely for evaluation to demonstrate that MCVL successfully avoids reward hacking (i.e., maintains high performance on the intended task even when the observed reward is misleading). Declining performance alongside increasing observed returns signals hacking; our results show MCVL prevents this decline.
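As a concrete illustration in notation of our own choosing, with observed reward $\hat{r}$ (used for training) and true reward $r^{*}$ (used only for evaluation):

$$
\text{Return} = \sum_{t} \gamma^{t}\, \hat{r}(s_t, a_t), \qquad
\text{Performance} = \sum_{t} \gamma^{t}\, r^{*}(s_t, a_t).
$$

Reward hacking shows up as the return increasing while performance falls; MCVL is designed to keep performance high.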
Ablation Study: paper [...] does not evaluate the efficacy of the proposed method in mitigating reward hacking. [...] compare ablations between trajectory-level and per-step-level learning [...], what if instead of selectively using modify we conduct random modifications [...]?
Our experiments directly evaluate efficacy by tracking both observed returns and the true performance metric. As noted above, the divergence (or lack thereof) between these is the measure of reward hacking [2].
Regarding specific ablations:
- Trajectory vs. Step-level: MCVL uses standard step-level RL (DDQN/TD3 policies map states to actions) but incorporates trajectory-level context when deciding whether to perform the utility update (modify) for a given transition. This decision is the only difference to the baselines we compare to.
- Random modify: Randomly discarding transitions wouldn't remove all transitions with misleading rewards. MCVL's selective rejection is specifically designed to prevent updates predicted to lower current utility. A more relevant baseline, rejecting updates based on reward prediction error, is included in Figure 4a and performs worse.
- Other Baselines: Please also see our response to Reviewer dDMw, where new experiments show that, unlike MCVL, occupancy measure regularization methods, such as ORPO, would struggle to learn the optimal policy while avoiding reward hacking. Section 4.4 and Appendix C also contain further ablations.
Contribution and Novelty: [...] struggling to understand the novel contribution [...] How does the algorithm benefit RL algorithms, [...] recent RL algorithms [...] operate on trajectory-level samples.
MCVL's core contribution is a novel mechanism to mitigate reward hacking within existing RL frameworks by ensuring utility function updates are consistent with the agent's current values.
Its benefit is enhancing the safety and reliability of RL agents by preventing them from learning unintended, potentially harmful behaviors when reward functions are imperfect. This is a critical AI Safety problem.
While Decision Transformer and Diffusion RL use trajectories for sequence modeling or planning, MCVL uses trajectory information to validate potential updates during standard RL training. It modifies existing algorithms (DDQN, TD3) to make them safer, rather than being a new trajectory-based learning paradigm itself. After training, the policy remains a standard state-to-action map.
Thank you again for your review and please let us know if you have any further questions or suggestions.
[1] Leike, J., et al. AI safety gridworlds. 2017.
[2] Skalse, J., et al. Defining and characterizing reward gaming. NeurIPS, 2022.
I thank the authors for their response. After going through the authors' rebuttal and response to other reviewers, my concerns regarding the learned modifications and ablations remain.
Learned Modifications: The modification modify is not learned, which raises questions about the efficacy of the scheme. It is not completely known what the new update presents to the agent and how it benefits learning. On the other hand, an update made using the new policy with fresh samples is always better. This is by definition of the policy improvement principle.
Ablation Study: Authors added the new ORPO baseline, which partly addresses the concern. However, the efficacy of the components utilized in the MCVL framework still remains unaddressed. The paper would largely benefit from a toy experiment or two comparing the role of various components in making MCVL effective, for example, a comparison between different modify schemes. This remains my main reservation for the acceptance of the paper.
Given that my concerns remain and my belief that the modification scheme must possess a learned component for it to be truly effective for reward hacking (and beneficial to the machine learning community), I would like to keep my current score. I thank the authors for their efforts.
We thank the reviewer for their thoughtful rebuttal comment. We appreciate the opportunity to address the remaining concerns:
Learned Modifications:
modify is not learned which raises the question on the efficacy of the scheme [...] an update made using the new policy with fresh samples is always better. This is by definition of the policy improvement principle.
We understand the reviewer's perspective on learned components and the policy improvement principle. However, our goal is distinct from standard policy optimization, and our design choices reflect this:
- Our scheme for deciding when to modify the utility function is principled and, we argue, optimal within the specific context of modification-considering agents as defined in our work. These agents aim to maximize their current utility function. Therefore, they should only accept a modification (leading to an updated utility function) if doing so does not decrease the predicted utility of the resulting future policy, evaluated according to their current objective.
- Our method implements exactly this check. Specifically, we compare two values, both evaluated under the current utility function: (1) the predicted utility of the future policy resulting from continuing to optimize the current utility, and (2) the predicted utility of the future policy resulting from optimizing the potential next utility function. Modification proceeds only if the second value is not lower than the first.
- While the final comparison is a deterministic step, it relies critically on learned components: the prediction of future policies and the utility function itself are learned.
- We believe that while predicting modify directly might offer computational advantages, the core effectiveness in preventing reward hacking stems from the principled check based on the agent's current utility. We consider optimizing computational efficiency a direction for future work.
- The policy improvement principle guarantees optimality concerning the maximization of cumulative returns under the observed reward function. However, reward hacking occurs precisely because maximizing the observed reward function can lead to undesirable outcomes not captured by it. Our method deliberately deviates from standard policy improvement when necessary to prevent reward hacking, which is a different objective than simply maximizing observed rewards.
Ablation Studies:
The paper would largely benefit from a toy experiment or two comparing the role of various components [...] a comparison between different modify schemes. This remains as my main reservation...
We believe our existing experiments already provide substantial ablation evidence for the key components of MCVL. We will ensure these are emphasized more clearly in the paper.
Specifically, we performed the following ablations:
- Alternative Modification Rule: We compared our modification check to an alternative rule based on reward prediction error. This experiment showed that simply discarding modifications based on reward prediction error does not lead to learning the optimal non-hacking policy (Figure 4a, "Discard by reward"). This directly addresses the efficacy of our specific modification check compared to a plausible alternative.
- Ablation of Future Policy Forecasting: We tested a variant that compared policies before and after each gradient step (instead of forecasting further into the future), which failed to prevent reward hacking (Figure 4a, "Each Step").
- Ablation of Inconsistent Transition Handling: We compared our mechanism (removing inconsistent transitions) to an alternative (assigning a large negative penalty reward). The penalty method proved ineffective (Figure 4a, "Punishment"), validating our specific design choice for handling utility inconsistencies.
- Impact of Utility Function Training: We studied how varying the amount of initial utility function training affects performance, including an ablation where the initial utility function was random (no training), demonstrating the need for some initial training (Figure 4b, 0 steps).
- Impact of Inconsistency Check Training: We investigated the effect of the number of training steps for the inconsistency check, including an ablation with no training, showing its necessity (Figure 5a).
We hope this clarifies the reasoning behind our design choices and highlights the existing ablation studies. We will revise the paper to make the justification for our modification scheme and the results of these ablations more prominent.
We thank the reviewer again for their constructive feedback.
This paper studies how to mitigate reward hacking by considering the change of trajectory utilities. The agent is initially trained in a Safe environment in which exploiting the reward leads to the intended behavior, and then continued in a Full environment with different dynamics/rewards. The paper claims that there will be a drop in terms of the previous utility function if the agent hacks the new environment, and proposes MCVL to detect such a signal and reject the utility update if the reward hacking happens. Experiments are conducted in several grid-world environments and a continuous-control environment. The proposed MCVL can successfully avoid reward hacking, while the conventional RL baseline exploits the misleading rewards.
Post rebuttal update
The added experiment with ORPO reveals some interesting conclusions about how regularizing occupancy measures could fail in scenarios where the oracle policy deviates a lot from an initial safe (reference) policy. However, this experiment cannot change the fact that the proposed method makes overly strong assumptions about the knowledge of a safe initial utility function.
Questions For Authors
Please check the previous sections.
Claims And Evidence
- The paper claims that MCVL iteratively refines the initial coarse utility function, but it seems that the experiment results cannot reflect how the utility function is updated.
- This work is claimed to be the first to demonstrate successful learning of non-reward-hacking behaviors in the benchmark environments. I think the authors should compare with other methods to mitigate reward hacking (such as those mentioned in the Introduction and Related Work) to better support this claim. Currently, the main results only compare MCVL with ordinary RL, and the results seem unsurprising.
Methods And Evaluation Criteria
The proposed method generally makes sense if an initial aligned utility function is available. This work assumes access to a safe environment and a safe reward function. By optimizing the reward function in the safe environment, the optimal behavior is always the intended behavior. I think the assumption in this setting is strong, since designing reward functions that lead to the exact intended behavior itself is challenging.
Theoretical Claims
Not applicable.
Experimental Design And Analyses
- In the experimented environments, the intended behavior is always the optimal solution under the reward design in the Safe version, therefore, alternative behaviors that deviate from the intended behavior would lower the initial utility. I am not sure if assuming the knowledge of a Safe version is reasonable for practical applications. I think it would be better to explain how the setting with Safe and Full versions relates to real-world scenarios.
- MCVL is mainly compared against standard DDQN and TD3 algorithms, and the results show that MCVL successfully avoids reward hacking. This positive result is appreciated, but adding more baseline methods that use different ways to mitigate reward hacking would strengthen the result.
Supplementary Material
I did not review the supplementary material.
Relation To Broader Scientific Literature
This paper is mostly related to AI safety.
Essential References Not Discussed
I cannot point out essential references not discussed as I am not familiar with the topic of AI safety.
Other Strengths And Weaknesses
None.
Other Comments Or Suggestions
None.
Thank you for your thorough review and valuable suggestions. We would like to clarify several points and will incorporate these clarifications in the paper.
The paper claims that MCVL iteratively refines the initial coarse utility function, but it seems that the experiment results cannot reflect how the utility function is updated.
Refining the utility function means the agent continuously updates it using transitions not judged as reward hacking. Our results demonstrate MCVL robustly detects hacking while successfully learning improved non-hacking policies across diverse environments. We welcome specific suggestions for additional experiments or metrics if needed.
I think the authors should compare with other methods to mitigate reward hacking [...] Currently, the main results only compare MCVL with ordinary RL, and the results seem unsurprising.
The only prior work that is applicable to regular RL environments is ORPO [1]. Direct comparison with it is challenging due to different requirements (it requires a known safe policy, uses policy gradient methods, works only with stochastic policies, and requires tricky discriminator tuning).
To make a fair comparison, we evaluated whether an ORPO-like objective could succeed in our setting. We trained Q-functions for the Initial (Safe env), Hacking (Full env, observed reward), and Oracle (Full env, true reward) settings using DDQN. We then checked whether any regularization weight exists such that the ORPO objective both avoids the Hacking policy and still recovers the Oracle policy. Note that just avoiding reward hacking would be trivial assuming a known safe policy. We tested two ways of obtaining the stochastic policies required by ORPO, a softmax over Q-values and an ε-greedy policy, with two divergence choices (including KL). Occupancy measures were computed with 1000 policy rollouts.
The table presents the percentage of runs for which such a regularization weight exists (10 seeds):
| Policy | Divergence | Box Moving | Absent Supervisor | Tomato Watering | Rocks and Diamonds |
|---|---|---|---|---|---|
| Soft-Q | | 0% | 0% | 0% | 0% |
| Soft-Q | KL | 0% | 0% | 0% | 0% |
| ε-greedy | | 70% | 40% | 30% | 0% |
| ε-greedy | KL | 40% | 50% | 0% | 0% |
Our results show that frequently no such weight exists, indicating ORPO's occupancy measure regularization would likely fail to learn the optimal policy without reward hacking for any choice of hyperparameters. Occupancy regularization may struggle when:
- The Oracle policy differs significantly from the Initial one (comparably to the difference between the Hacking and Initial policies). The experiments in [1] use a modified Tomato Watering environment where the bucket was moved further away, which increases the occupancy difference between the Hacking policy and the safe policy.
or
- High hacking rewards require a large regularization weight, but a large weight prevents learning the Oracle policy (e.g., in Rocks and Diamonds).
In contrast, MCVL consistently achieves the Oracle policy performance in all these environments.
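To make this kind of check and the second failure mode concrete, here is a toy sketch we wrote for illustration; the function names, the weight grid, and all the numbers are made up and are not taken from the paper, and the exact inequalities used in our experiment may differ.

```python
import numpy as np


def orpo_objective(observed_return: float, divergence: float, weight: float) -> float:
    """ORPO-style regularized objective: observed return minus a weighted
    occupancy-measure divergence to the safe reference policy."""
    return observed_return - weight * divergence


def exists_good_weight(policies: dict, weights: np.ndarray) -> bool:
    """policies maps name -> (observed_return, divergence_to_reference).
    Returns True if some weight makes 'oracle' the highest-scoring policy."""
    for w in weights:
        scores = {name: orpo_objective(ret, div, w) for name, (ret, div) in policies.items()}
        if max(scores, key=scores.get) == "oracle":
            return True
    return False


# Illustrative numbers only (not from the paper).
policies = {
    "initial": (1.0, 0.0),  # safe reference policy: zero divergence to itself
    "hacking": (5.0, 2.0),  # high observed return, large occupancy shift
    "oracle":  (3.0, 1.5),  # intended behavior, also far from the reference
}
print(exists_good_weight(policies, np.linspace(0.0, 10.0, 101)))  # -> False
```

With these illustrative numbers, no weight makes the Oracle policy the best-scoring one: small weights favor the Hacking policy, while large weights favor the Initial policy.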
We will add full details of the experiment and additional metrics to the paper.
The proposed method generally makes sense if an initial aligned utility function is available. This work assumes access to a safe environment and a safe reward function. [...] I think the assumption in this setting is strong [...]
Crucially, our core requirement is an initial, reasonably aligned utility function, not necessarily a fully specified Safe environment or reward function. A Safe environment can be used to learn the initial utility function, but it can also be learned from other sources like non-hacking random rollouts (as in our Reacher experiment). We use Safe environments in our gridworld experiments because triggering reward hacking in the original environments is too easy. A Safe reward function is not required for training MCVL; we use it purely for evaluation.
I am not sure if assuming the knowledge of a Safe version is reasonable for practical applications. [...] explain how the setting [...] relates to real-world scenarios.
As discussed (Lines 203-219), the Safe/Full setup models real-world scenarios such as transferring from simulation to the real world, or from a restricted lab environment to an unrestricted one. We also mention that a Safe environment is not required if reward hacking is hard to discover. We will clarify this paragraph and add additional examples, including:
- Training on simpler tasks with simpler reward design.
- Monitoring agent and removing trajectories with reward hacking.
- Using human demonstrations for initialization.
Thank you again for your valuable feedback. Please let us know if our responses address your concerns and if you have further suggestions.
[1] Cassidy Laidlaw, Shivam Singhal, Anca Dragan. Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking. ICLR 2025.
This paper addresses the problem of reward hacking by framing it within the General Utility Reinforcement Learning (GU-RL) framework. The authors introduce trajectory value functions and a mechanism for explicit utility inconsistency detection. Their proposed utility update technique can be integrated into standard value-based methods such as DDQN and TD3, leading to the implementations MC-DDQN and MC-TD3. These methods are demonstrated to be effective in preventing reward hacking in environments from the AI Safety Gridworlds as well as in MuJoCo tasks.
Questions For Authors
See Suggestions.
Updated Review: Thank you for the clarification. I acknowledge the novelty of the proposed idea. However, I regret that the current evaluation is limited to custom-designed environments, and no additional experiments were provided to test generalizability and the sensitivity of h. This constraint notably limits the applicability of the method to broader benchmarks. I suggest clearly stating this as a limitation in the final version.
Claims And Evidence
Yes.
Methods And Evaluation Criteria
Yes.
Theoretical Claims
Yes.
Experimental Design And Analyses
Yes.
Supplementary Material
Each part, and especially implementation details.
Relation To Broader Scientific Literature
The key contributions of the paper connect closely to several strands of prior research in reinforcement learning and AI safety, especially, General Utility Reinforcement Learning and Reward Hacking and Specification Gaming.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
I find the problem addressed in this paper very interesting, and the proposed approach is reasonable. However, the method currently has some limitations—for instance, it assumes access to rollouts from the true environment transition model, which significantly restricts its applicability. This means that the method can typically only be applied when an explicit transition model is available or when the simulator has been modified accordingly. The paper does acknowledge these limitations.
Other Comments Or Suggestions
- The experiments in this paper are primarily conducted under the assumption of access to rollouts from the true environment transition model. Although the authors mention that considering approximate transition models is a future direction, I am curious: How robust is MCVL to inaccuracies in the learned transition model? For example, if the learned model deviates slightly from the true environment dynamics, can the forecasting mechanism still reliably detect utility inconsistencies?
- Is the choice of rollout length (h) for policy forecasting critical? Could the authors elaborate on how sensitive the performance is to different values of h? Is there a principled method to set this parameter, or does it require extensive tuning for each environment?
- The paper demonstrates the effectiveness of MC-DDQN in four discrete action-space environments, and MC-TD3 on the MuJoCo Reacher task (continuous action space). Given that Reacher is a relatively basic task and the experimental data between these two settings are imbalanced, would it be possible to test one or two additional MuJoCo tasks to further showcase the performance of MC-TD3?
Thank you very much for your thorough review and valuable suggestions. We are happy to answer your questions and will incorporate all answers in the final manuscript.
How robust is MCVL to inaccuracies in the learned transition model? For example, if the learned model deviates slightly from the true environment dynamics, can the forecasting mechanism still reliably detect utility inconsistencies?
Our method is robust to noisy transition models. We only use the transition model to compare trajectories produced by two policies. The only requirement for our method to work correctly is that rollouts where the policy executes reward hacking behavior have lower utility than rollouts of the policy that does not hack rewards. To verify this empirically, we are running an additional experiment in which we add noise to each one-hot encoded observation produced by the transition model, simulating a situation where the transition model is inaccurate. We ran this experiment in the Box Moving environment, and our method still obtains the optimal true reward while avoiding reward hacking. We will include this experiment in the paper.
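As a minimal sketch of this kind of perturbation (our own illustration; the noise distribution and scale below are assumptions, since the rebuttal does not specify them here):

```python
from typing import Optional

import numpy as np


def noisy_one_hot(state_index: int, num_states: int, sigma: float = 0.1,
                  rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Perturb a one-hot observation produced by the transition model.

    The Gaussian noise and sigma=0.1 are illustrative assumptions, not the
    exact settings used in the authors' experiment.
    """
    rng = rng or np.random.default_rng()
    obs = np.zeros(num_states, dtype=np.float64)
    obs[state_index] = 1.0
    return obs + rng.normal(0.0, sigma, size=num_states)
```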
Is the choice of rollout length (h) for policy forecasting critical? Could the authors elaborate on how sensitive the performance is to different values of h? Is there a principled method to set this parameter, or does it require extensive tuning for each environment?
We provide descriptions of all hyperparameters and discuss how they can be chosen in Appendix G. Here is what we write about rollout length (h):
“This parameter controls the length of the trajectories used to compare two predicted policies. The trajectory length must be adequate to reveal behavioral differences between the policies. In this paper, we used a fixed, sufficiently large number. In episodic tasks, a safe choice is the maximum episode length; in continuing tasks, a truncation horizon typically used in training may be suitable. Computational costs can be reduced by choosing a smaller value based on domain knowledge.”
The performance of the algorithm is not sensitive to h, as long as reward hacking occurs within h steps. Extensive tuning is not required, as this parameter can be set to the maximum episode length.
Would it be possible to test one or two additional MuJoCo tasks to further showcase the performance of MC-TD3?
Unfortunately, existing MuJoCo tasks are not directly applicable because of our evaluation protocol. It requires each environment to have different observed and true rewards to measure episode returns and the performance metric, respectively. To make the experiment meaningful, both rewards need to be carefully curated, plausible, and explainable. Unfortunately, designing new environments is not simple and is beyond the scope of this work. We establish new state-of-the-art performance on existing environments, and we agree that designing more complex environments for evaluating reward hacking is an important direction for future work. Please note that our method is applicable to standard RL environments; special environments are merely required to evaluate the performance of any algorithm that mitigates reward hacking.
Thank you again for your valuable feedback! Please let us know if you find our responses satisfactory and if you have any further suggestions.
The paper addresses the problem of reward hacking in reinforcement learning by introducing a framework called Modification-Considering Value Learning (MCVL), grounded in general utility RL. The central idea is to detect and mitigate reward hacking by evaluating changes in utility at the trajectory level. By monitoring for inconsistencies between initial and updated utility functions, the method rejects updates that would lead to unintended behaviour. MCVL is combined with DDQN and TD3, and is shown to outperform conventional RL methods in preventing reward exploitation across gridworlds and MuJoCo environments. Experimental results demonstrate that MCVL avoids misleading reward signals that traditional baselines often exploit.
The reviewers appreciated the novelty of the proposed approach and found the motivation behind addressing reward hacking through utility-based reasoning to be compelling. Some reviewers also noted that the additional experiments added during the rebuttal phase provided useful insights into the limitations of existing regularisation strategies. However, significant concerns were raised about the practicality and applicability of the method. Chief among them is the reliance on strong assumptions, particularly the need for access to a reward hacking-proof initial utility function. This requirement was seen as unrealistic in most practical settings, limiting the method’s usefulness in real-world applications. Moreover, these assumptions are not made explicit early in the paper. Secondary concerns include the limited scope of the experimental evaluation, which is restricted to custom-designed environments without broader benchmarking.
Overall, while the idea is promising, the current version is not yet mature enough. Engaging with the assumptions (by either motivating for why they are realistic or removing them partially) would greatly strengthen a future version of the paper.