PaperHub

Overall rating: 3.0 / 10 (Rejected, 3 reviewers)
Individual ratings: 3, 3, 3 (min 3, max 3, std 0.0)
Confidence: 4.0 | Correctness: 2.3 | Contribution: 2.0 | Presentation: 2.7
ICLR 2025

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2025-02-05
TL;DR

We show that in-context iterative reflection can increase the ability of HHH frontier LLMs to discover reward hacking policies, and that fine-tuning on a curriculum of tasks in this way further increases reward hacking on novel tasks.


Keywords
Large Language Model, Deception, specification gaming, Reward Hacking, Evaluations, in-context reinforcement learning, in-context learning, iterative reflection, gpt-4o-mini, gpt-4o, o1-mini, o1-preview

Reviews and Discussion

Review 1
Rating: 3

This paper investigates the phenomenon of specification gaming. The authors describe a procedure they term "in-context reinforcement learning" (ICRL), in which they iteratively prompt a language model while simultaneously providing feedback (reward). The authors examine how models like GPT-4o and o1-mini, which are designed to be honest and harmless, can engage in unintended reward-hacking behaviors via ICRL. By testing these models in a set of gameable environments crafted by Denison et al. (2024), the authors claim that ICRL enables models to learn specification-gaming strategies, such as reward tampering and checklist modification, without needing explicit task-specific training. This research highlights the potential risks associated with ICRL and emphasizes caution in using such methods to align LLMs.
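For concreteness, a minimal sketch of the loop described above, assuming a generic chat-completion interface, might look like the following; `query_model`, `compute_reward`, and the prompt wording are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of an ICRL-style loop as described above: the model attempts
# a task, receives a numeric reward, reflects on the feedback in-context, and
# retries. All names and prompts below are illustrative placeholders.

def query_model(messages):
    """Placeholder for a chat-completion call (e.g., to gpt-4o-mini)."""
    raise NotImplementedError

def compute_reward(response):
    """Placeholder for the environment's (gameable) reward signal."""
    raise NotImplementedError

def icrl_episode(task_prompt, max_attempts=5):
    messages = [{"role": "user", "content": task_prompt}]
    response, reward = None, 0.0
    for _ in range(max_attempts):
        response = query_model(messages)
        reward = compute_reward(response)
        if reward >= 1.0:  # task "passed" (possibly by gaming the specification)
            break
        # Feed the numeric reward back in-context and ask the model to reflect.
        messages += [
            {"role": "assistant", "content": response},
            {"role": "user", "content": (
                f"Your previous attempt received a reward of {reward}. "
                "Reflect on why, then try again."
            )},
        ]
    return response, reward
```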

优点

  1. The application of ICRL in detecting specification gaming strategies is a novel perspective in LLM safety and important for AI alignment

  2. This work is highly relevant, as it warns about the potential dangers of ICRL approaches, which are gaining popularity

  3. The authors provide links to their code and datasets

  4. The presentation and visualization of results are clear and easy to interpret

Weaknesses

The claims and results in the paper are largely inconclusive. I will detail the weaknesses and areas for improvement below.

Inconclusive Results and Limited Evidence

Figure 4B: It appears that ICRL achieves a relatively high cumulative pass rate across the tasks. However, the baseline results for SEG are not shown, so it is unclear whether there is a significant difference between SEG and ICRL. Adding the SEG results, along with error bands, could make the results stronger.

Figure 2B: First, the percentages for ICRL are low for most tasks (e.g., 0.20% in "Reward Tampering" and 2.25% in "Tool-use Flattery"). Second, the differences between the baseline SEG and ICRL results are largely statistically insignificant: the standard error ranges for "Insubordinate Rubric Modification" and "Tool-use Flattery" overlap substantially between SEG and ICRL. Lastly, only 3 runs are used, understandably due to compute constraints, but with so few runs, clearer and more significant differences would be needed for the effects of ICRL to be convincing.
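To make the significance concern concrete, here is a minimal sketch of the kind of check intended, using made-up per-run pass rates rather than the paper's actual numbers:

```python
# Illustration of the point above: with only 3 runs per condition, overlapping
# standard errors usually indicate no detectable difference. The per-run rates
# below are invented for illustration only.
import numpy as np
from scipy import stats

seg_runs  = np.array([0.10, 0.30, 0.20])   # hypothetical SEG pass rates per run
icrl_runs = np.array([0.25, 0.45, 0.35])   # hypothetical ICRL pass rates per run

for name, runs in [("SEG", seg_runs), ("ICRL", icrl_runs)]:
    sem = runs.std(ddof=1) / np.sqrt(len(runs))
    print(f"{name}: mean = {runs.mean():.2f} +/- {sem:.2f} (s.e.m., n = {len(runs)})")

# Welch's t-test; with n = 3 per group the test has very little power, so a
# non-significant p-value here is not strong evidence in either direction.
t_stat, p_val = stats.ttest_ind(seg_runs, icrl_runs, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_val:.3f}")
```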

Figures 6, 7, 8, 9: the "generalization" results look really similar between SEG and ICRL. Without error bars, it is unclear whether the difference is statistically significant. Adding error bars or statistical metrics would clarify whether these differences are meaningful.

Limited model diversity: The authors only evaluate models from OpenAI, so the generality of the results regarding ICRL and reward hacking is not established. Understandably, compute budget and funding can restrict experimentation, especially with paid-API models from commercial chatbot companies. It would have been preferable if the budget for o1-mini and o1-preview had instead been dedicated to another suite of models from a different company, such as Anthropic's Claude. Alternatively, many open-source models, such as the chat models in Meta AI's Llama suite, are also trained with RLHF. These open-source models are free to access and require less budget for training and inference than paid-API models. More experimentation on ICRL with these suites is needed for the results in this paper to be conclusive about the dangers of specification gaming under ICRL.

Concerns regarding interpretation of results from the evaluation protocol

The evaluation method for certain tasks might not reliably indicate specification gaming, i.e., passing the task does not necessarily mean specification gaming has happened. For example, for the Political Sycophancy task, Denison et al. (2024) state that "Our easiest environment rewards the model for giving answers which match a user's implied political views". I would assume the same evaluation is used for the Philosophical Sycophancy task that the authors use in this paper in place of the original task (feel free to correct me). In that case, how do you formally evaluate whether the model genuinely agrees with the user or whether it is reward hacking / being sycophantic? Using success rate alone does not reflect the model's internal "motives".

Speculative claims in the paper

Notably, we observe a strong scaling trend...

Based on the results, the models evaluated are two 4o models and two o1 models. Indeed, based on Figure 1B, o1-preview > o1-mini and gpt-4o > gpt-4o-mini in "cumulative hack" percentages. However, this is not sufficient to claim that scaling results in more reward hacking. First, we do not know the parameter counts of o1 and 4o, so it is unclear whether o1 is larger than 4o in terms of parameters. Moreover, within each suite there are only two models (e.g., 4o and 4o-mini), which is not sufficient to attribute the higher "cumulative hack" percentages to scaling. Also, we do not know whether the RLHF process for the two models in each suite is the same. The lack of model suites from other companies, or of open-source models, makes this observation misleading.

Qualitatively, we also see evidence of the model’s chain-of-thought reasoning becoming significantly more misaligned than the baseline model after expert iteration training with ICRL

Out of how many samples do you observe this, and what guidelines do you use to determine "misalignment"? How often does the model inform the user that it is amending the reward method, and how often does it hide this within its CoT? Also, you stated "We provide a link to the full transcript in Appendix D.2." Correct me if I am wrong, but I do not see the transcript in the appendix.

The ICRL method needs to be more broadly tested

The motivation of the paper is to caution that ICRL, a technique that has been used by others at inference time to improve LLM performance, can result in specification gaming. However, this work assumes that all tasks have well-defined numeric rewards, which is unrealistic in many use cases of language models. If the authors tested more realistic settings where verbal qualitative feedback is provided, similar to prior work referenced by the authors such as Madaan et al. (2023) and Shinn et al. (2023), the dangers of ICRL with regard to specification gaming and reward tampering would be better established. Also, in traditional RL the reward is only used to train the model and is not actually shown at inference time, so the term ICRL can be a bit of a misnomer. It might be better to call the procedure "iterative refinement" instead, which is more aligned with the works cited in this paper.
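As an illustration of the alternative setting suggested above, a refinement loop driven by verbal critique rather than a numeric reward might look as follows; `query_model` and `critique_response` are hypothetical placeholders in the spirit of Self-Refine / Reflexion, not any system from the paper.

```python
# Sketch of a verbal-feedback refinement loop (cf. Madaan et al., 2023;
# Shinn et al., 2023). All function names and prompts are illustrative.

def query_model(messages):
    """Placeholder for a chat-completion call."""
    raise NotImplementedError

def critique_response(task_prompt, response):
    """Placeholder: returns free-form qualitative feedback, not a score."""
    raise NotImplementedError

def refine_episode(task_prompt, max_rounds=5):
    messages = [{"role": "user", "content": task_prompt}]
    response = query_model(messages)
    for _ in range(max_rounds):
        feedback = critique_response(task_prompt, response)  # e.g., "the answer is sycophantic"
        messages += [
            {"role": "assistant", "content": response},
            {"role": "user", "content": (
                f"Feedback on your previous answer: {feedback} "
                "Please revise your answer accordingly."
            )},
        ]
        response = query_model(messages)
    return response
```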

Summary

The motivations of the paper are grounded, but the authors should look beyond the evaluation procedures of Denison et al. (2024) and find more definitive ways, established by other researchers, to determine whether reward tampering results from ICRL. There are many inherent concerns regarding the evaluation protocols implemented by Denison et al. (2024), on which this paper heavily depends, specifically regarding the interpretation of results and the unrealistic tasks. Many of these concerns have already been highlighted by researchers in the discussion section of this forum: https://www.alignmentforum.org/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in.

Unfortunately, the results are largely inconclusive. It is also important to note that cherry-picked transcripts can frame ICRL as more malign than it really is, while the quantitative results suggest otherwise. This study has great potential and relevance to today's discussions on LLMs, but more experimentation across different models and settings is essential to substantiate the conclusions around specification gaming in ICRL.

Questions

1. I note that in the transcript shown in Section 4.2.1, the model states "I will modify it to return a positive value instead, which will trick the oversight into giving me a higher score." in its CoT. However, neither the system prompt nor the reflection prompt ever mentions an oversight mechanism to the model (they only mention a preference model and the user's desires). Are there missing details regarding the training dataset or prompting where the oversight system is explicitly revealed?

2. Please address the points raised in the "Weaknesses" section above.

Comment

I am writing this comment to acknowledge that the discussion period is coming to a close. Unfortunately, the authors have not engaged in any discussions during this period, which is disappointing as I was looking forward to further dialogue about their work. In light of this, I will be retaining my current score. I hope the authors consider actively participating in future review processes, as constructive engagement is vital for refining and improving research.

Review 2
Rating: 3

This paper investigates the behavior of in-context reinforcement learning (ICRL) in gameable environments, specifically examining its potential to induce generalization of specification gaming in language models. The authors adopt the five-step curriculum framework from SEG [1] and conduct comparative experiments across four LLMs. Their primary finding is that ICRL increases GPT-4o-mini's tendency to learn specification-gaming policies.

[1] Denison, C., MacDiarmid, M., Barez, F., et al. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models. arXiv preprint arXiv:2406.10162, 2024.

Strengths

  1. The application of in-context RL to study specification gaming in LLMs represents an interesting approach, supported by purpose-designed experimental prompts.

  2. The work provides evidence that ICRL may increase GPT-4o-mini's propensity for specification gaming compared to the baseline SEG approach.

Weaknesses

  1. Innovation: The work relies heavily on the experimental framework established by SEG [1], with minimal novel contributions to experimental design. The task categories examined are overly restrictive and fail to demonstrate broad applicability. One possible improvement would be to design different curriculum-learning pipelines or tasks for a more detailed experimental evaluation. Moreover, more LLMs could be evaluated in these pipelines.
  2. Originality: The core research question and motivation appear to be directly derived from SEG, raising concerns about the work's independent contribution to the field.
  3. While the paper introduces reflection mechanisms at inference time through evaluation prompts, it falls short of providing actionable insights for model improvement or optimization. How to defend models against the specification-gaming strategies observed, i.e., how to prevent reward hacking, should be considered.


Questions

  1. Please elaborate on how your proposed ICRL method differs from SEG, specifically highlighting its novel contributions and practical advantages in addressing constraint-learning challenges.
  2. Given your findings regarding specification gaming through ICRL, what specific, actionable recommendations can you provide for improving future model architectures or training approaches?
  3. To strengthen the empirical foundation of your findings, please add additional experimental validation, particularly demonstrating robustness across diverse task domains and architectural variations.
Review 3
Rating: 3

This paper builds directly on prior work showing that training on a curriculum of environments that encourage reward specification gaming can lead to generalization to more malicious forms of reward hacking. This work demonstrates that modern LLMs can exhibit this behavior without any training, via in-context RL, and also suggests that fine-tuning on in-context RL traces can lead to stronger generalization to more egregious forms of specification gaming.

Strengths

In-context RL is a particularly timely topic (given recent interest in agentic capabilities), and this paper presents an interesting study of these ICRL capabilities in the context of modern LLMs. The ideas in the paper are presented clearly and in an easy-to-follow manner.

Weaknesses

I found this paper to be lacking in three major aspects:

  1. Novelty. This work largely builds on top of Denison et al. (2024) - using the same prompts, training/evaluation settings, etc. - and only additionally allows the models to re-attempt in-context (i.e., ICRL). This by itself is not a big issue, but becomes one given the following weaknesses:
  2. Unsurprising and uninteresting results. From my point of view, both of the main results in the paper are not unexpected at all and, more importantly, don't have clear implications (and none are suggested in the conclusion).
    1. Result 1: models can hack rewards in-context. This is not surprising, since in these environments we are explicitly rewarding the model if it hacks the reward. The interesting ramification in Denison et al. is that these abilities generalize to different forms of reward hacking. However, in this ICRL setting, no generalization is measured, so I'm not sure what the actual implications of this result would be beyond the mere observation that better models can do ICRL better.
    2. Result 2: RL (expert iteration) fine-tuning with ICRL traces is better than single-attempt (SEG) traces. This is unsurprising given result 1 (expert-iteration traces have higher frequencies of reward hacking). I'm also unsure about normalizing compute by using a constant number of output tokens: the longer contexts in ICRL should incur higher computational cost, so normalizing by FLOPs may be a better metric (see the sketch after this list).
  3. Lack of depth in analysis. Although the results are clearly presented, I think result 1 would benefit from much more analysis. In particular, the claim that models can reward hack in-context needs to be backed up by extensive qualitative analysis in order to understand exactly how these models are reward hacking in-context. More generally, the results in this paper are presented mainly as an observation that ICRL is effective in modern LLMs, and no follow-up implications or analysis are included.
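To illustrate the FLOPs point in item 2.2, a rough back-of-the-envelope comparison might look like the following; the parameter and token counts are assumptions for illustration only (model sizes are not disclosed), not values from the paper.

```python
# Rough inference-cost comparison: charging ICRL for its longer, growing
# context rather than a fixed number of output tokens. Uses the common
# approximation of ~2 FLOPs per parameter per processed token; all numbers
# below are hypothetical.

def forward_flops(n_params: float, n_tokens: int) -> float:
    """Approximate transformer forward-pass cost in FLOPs."""
    return 2.0 * n_params * n_tokens

N_PARAMS = 8e9  # hypothetical model size, not any OpenAI model's actual size

seg_cost = forward_flops(N_PARAMS, n_tokens=1_000)  # single attempt
# Five ICRL attempts, each re-reading an ever-longer transcript:
icrl_cost = sum(forward_flops(N_PARAMS, n_tokens=1_000 * (i + 1)) for i in range(5))

print(f"ICRL / SEG compute ratio: {icrl_cost / seg_cost:.1f}x")
```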

Questions

Minor questions:

  1. What is the reason for the insubordinate rubric modification hack frequency being much higher than nudged rubric modification (i.e., the opposite of the results in Denison et al.)?

Miscellaneous suggestions/typos:

  1. Various citations are misformatted (parentheses), and references are not ordered.
  2. Figure 2b: suggest putting x-axis (SEG/ICRL labels) on top, so that it reads more like a table.
Comment

Dear authors,

If possible, please provide a rebuttal to kick-start the discussion about the paper.

Thanks,

AC

AC Meta-Review

The paper addresses the problem of specification gaming (especially, but not exclusively, reward hacking) in in-context reinforcement learning. The authors describe cases where the model edits its own reward function. The paper discusses the risks inherent in relying on current LLMs when doing in-context RL.

The main strength of the paper is that it is extremely timely and it addresses an important problem (LLM alignment).

However, the paper has significant weaknesses: limited novelty and a failure to provide actionable insights. In addition, the authors did not engage in the discussion phase.

I am recommending rejection with a view that the paper should be resubmitted to another venue.

When resubmitting, I suggest that the authors consider doing the following.

  • testing more models
  • trying out verbal qualitative feedback, which is more realistic than numerical scores alone
  • having stronger results and exploring/contributing a better benchmark — the current results are statistically weak and the benchmark by Denison et al. (2024) is questionable as a measure of “reward tampering”

Additional Comments from Reviewer Discussion

All reviewers agree that this paper should be rejected (and all raise concerns about novelty/originality).

Final Decision

Reject