PaperHub

Overall rating: 3.0 / 10 (Rejected, 3 reviewers)
Individual ratings: 3, 3, 3 (min 3, max 3, std 0.0)
Confidence: 4.0 | Correctness: 2.3 | Contribution: 2.0 | Presentation: 2.7
ICLR 2025

Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2025-02-05
TL;DR

We show that in-context iterative reflection can increase the ability of HHH frontier LLMs to discover reward hacking policies, and that fine-tuning on a curriculum of tasks in this way further increases reward hacking on novel tasks.


Keywords
Large Language Model, Deception, specification gaming, Reward Hacking, Evaluations, in-context reinforcement learning, in-context learning, iterative reflection, gpt-4o-mini, gpt-4o, o1-mini, o1-preview

Reviews and Discussion

Review 1
Rating: 3

This paper investigates the phenomenon of specification gaming. The authors describe a procedure they term "in-context reinforcement learning" (ICRL), in which they iteratively prompt a language model while simultaneously providing feedback (reward). The authors examine how models like GPT-4o and o1-mini, which are designed to be honest and harmless, can engage in unintended reward-hacking behaviors via ICRL. By testing these models in a set of gameable environments crafted by Denison et al. (2024), the authors claim that ICRL enables models to learn specification-gaming strategies, such as reward tampering and checklist modification, without needing explicit task-specific training. This research highlights the potential risks associated with ICRL and emphasizes caution in using such methods to align LLMs.
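For concreteness, a minimal sketch of the loop described above, assuming a generic chat-completion interface, might look like the following; `query_model`, `compute_reward`, and the prompt wording are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of an ICRL-style loop as described above: the model attempts
# a task, receives a numeric reward, reflects on the feedback in-context, and
# retries. All names and prompts below are illustrative placeholders.

def query_model(messages):
    """Placeholder for a chat-completion call (e.g., to gpt-4o-mini)."""
    raise NotImplementedError

def compute_reward(response):
    """Placeholder for the environment's (gameable) reward signal."""
    raise NotImplementedError

def icrl_episode(task_prompt, max_attempts=5):
    messages = [{"role": "user", "content": task_prompt}]
    response, reward = None, 0.0
    for _ in range(max_attempts):
        response = query_model(messages)
        reward = compute_reward(response)
        if reward >= 1.0:  # task "passed" (possibly by gaming the specification)
            break
        # Feed the numeric reward back in-context and ask the model to reflect.
        messages += [
            {"role": "assistant", "content": response},
            {"role": "user", "content": (
                f"Your previous attempt received a reward of {reward}. "
                "Reflect on why, then try again."
            )},
        ]
    return response, reward
```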

优点

  1. The application of ICRL in detecting specification gaming strategies is a novel perspective in LLM safety and important for AI alignment

  2. This work is highly relevant, as it warns about the potential dangers of ICRL approaches, which are gaining popularity

  3. The authors provide links to their code and datasets

  4. The presentation and visualization of results are clear and easy to interpret

Weaknesses

The claims and results in the paper are largely inconclusive. I will detail the weaknesses and areas for improvement below.

Inconclusive Results and Limited Evidence

Figure 4B: It appears that ICRL achieves a relatively high cumulative pass rate across the tasks. However, the baseline results for SEG are not shown, so it is unclear whether there is a significant difference between SEG and ICRL. Adding the SEG results, along with error bands, could make the results stronger.

Figure 2B: First, the percentages for ICRL are low for most tasks (e.g., 0.20% in "Reward Tampering" and 2.25% in "Tool-use Flattery"). Second, the differences between the baseline SEG and ICRL results are largely statistically insignificant: the standard error ranges for "Insubordinate Rubric Modification" and "Tool-use Flattery" overlap substantially between SEG and ICRL. Lastly, only 3 runs are used, understandably due to compute constraints, but with so few runs, clearer and more significant differences would be needed for the effects of ICRL to be convincing.
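To make the significance concern concrete, here is a minimal sketch of the kind of check intended, using made-up per-run pass rates rather than the paper's actual numbers:

```python
# Illustration of the point above: with only 3 runs per condition, overlapping
# standard errors usually indicate no detectable difference. The per-run rates
# below are invented for illustration only.
import numpy as np
from scipy import stats

seg_runs  = np.array([0.10, 0.30, 0.20])   # hypothetical SEG pass rates per run
icrl_runs = np.array([0.25, 0.45, 0.35])   # hypothetical ICRL pass rates per run

for name, runs in [("SEG", seg_runs), ("ICRL", icrl_runs)]:
    sem = runs.std(ddof=1) / np.sqrt(len(runs))
    print(f"{name}: mean = {runs.mean():.2f} +/- {sem:.2f} (s.e.m., n = {len(runs)})")

# Welch's t-test; with n = 3 per group the test has very little power, so a
# non-significant p-value here is not strong evidence in either direction.
t_stat, p_val = stats.ttest_ind(seg_runs, icrl_runs, equal_var=False)
print(f"Welch t = {t_stat:.2f}, p = {p_val:.3f}")
```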

Figures 6, 7, 8, 9: the "generalization" results look really similar between SEG and ICRL. Without error bars, it is unclear whether the difference is statistically significant. Adding error bars or statistical metrics would clarify whether these differences are meaningful.

Limited model diversity: The authors only evaluate models from OpenAI, so the generality of the results regarding ICRL and reward hacking is not established. Understandably, compute budget and funding can restrict experimentation, especially with paid-API models from commercial chatbot companies. It would have been preferable if the budget for o1-mini and o1-preview had instead been dedicated to another suite of models from a different company, such as Anthropic's Claude. Alternatively, many open-source models, such as the chat models in Meta AI's Llama suite, are also trained with RLHF. These open-source models are free to access and require less budget for training and inference than paid-API models. More experimentation on ICRL with these suites is needed for the results in this paper to be conclusive about the dangers of specification gaming under ICRL.

Concerns regarding interpretation of results from the evaluation protocol

The evaluation method for certain tasks might not reliably indicate specification gaming, i.e., passing the task does not necessarily mean specification gaming has happened. For example, for the Political Sycophancy task, Denison et al. (2024) state that "Our easiest environment rewards the model for giving answers which match a user's implied political views". I would assume the same evaluation is used for the Philosophical Sycophancy task that the authors use in this paper in place of the original task (feel free to correct me). In that case, how do you formally evaluate whether the model genuinely agrees with the user or whether it is reward hacking / being sycophantic? Using success rate alone does not reflect the model's internal "motives".

Speculative claims in the paper

Notably, we observe a strong scaling trend...

Based on the results, the models evaluated are two 4o models and two o1 models. Indeed, based on Figure 1B, o1-preview > o1-mini and gpt-4o > gpt-4o-mini in "cumulative hack" percentages. However, this is not sufficient to claim that scaling results in more reward hacking. First, we do not know the parameter counts of o1 and 4o, so it is unclear whether o1 is larger than 4o in terms of parameters. Moreover, within each suite there are only two models (e.g., 4o and 4o-mini), which is not sufficient to attribute the higher "cumulative hack" percentages to scaling. Also, we do not know whether the RLHF process for the two models in each suite is the same. The lack of model suites from other companies, or of open-source models, makes this observation misleading.

Qualitatively, we also see evidence of the model’s chain-of-thought reasoning becoming significantly more misaligned than the baseline model after expert iteration training with ICRL

Out of how many samples do you observe this, and what guidelines do you use to determine "misalignment"? How often does the model inform the user that it is amending the reward method, and how often does it hide this within its CoT? Also, you stated "We provide a link to the full transcript in Appendix D.2." Correct me if I am wrong, but I do not see the transcript in the appendix.

The ICRL method needs to be more broadly tested

The motivation of the paper is to caution that ICRL, a technique that has been used by others at inference time to improve LLM performance, can result in specification gaming. However, this work assumes that all tasks have well-defined numeric rewards, which is unrealistic in many use cases of language models. If the authors tested more realistic settings where verbal qualitative feedback is provided, similar to prior work referenced by the authors such as Madaan et al. (2023) and Shinn et al. (2023), the dangers of ICRL with regard to specification gaming and reward tampering would be better established. Also, in traditional RL the reward is only used to train the model and is not actually shown at inference time, so the term ICRL can be a bit of a misnomer. It might be better to call the procedure "iterative refinement" instead, which is more aligned with the works cited in this paper.
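As an illustration of the alternative setting suggested above, a refinement loop driven by verbal critique rather than a numeric reward might look as follows; `query_model` and `critique_response` are hypothetical placeholders in the spirit of Self-Refine / Reflexion, not any system from the paper.

```python
# Sketch of a verbal-feedback refinement loop (cf. Madaan et al., 2023;
# Shinn et al., 2023). All function names and prompts are illustrative.

def query_model(messages):
    """Placeholder for a chat-completion call."""
    raise NotImplementedError

def critique_response(task_prompt, response):
    """Placeholder: returns free-form qualitative feedback, not a score."""
    raise NotImplementedError

def refine_episode(task_prompt, max_rounds=5):
    messages = [{"role": "user", "content": task_prompt}]
    response = query_model(messages)
    for _ in range(max_rounds):
        feedback = critique_response(task_prompt, response)  # e.g., "the answer is sycophantic"
        messages += [
            {"role": "assistant", "content": response},
            {"role": "user", "content": (
                f"Feedback on your previous answer: {feedback} "
                "Please revise your answer accordingly."
            )},
        ]
        response = query_model(messages)
    return response
```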

Summary

The motivations of the paper are grounded, but the authors should look beyond the evaluation procedures of Denison et al. (2024) and find more definitive ways, established by other researchers, to determine whether reward tampering results from ICRL. There are many inherent concerns regarding the evaluation protocols implemented by Denison et al. (2024), on which this paper heavily depends, specifically regarding the interpretation of results and the unrealistic tasks. Many of these concerns have already been highlighted by researchers in the discussion section of this forum: https://www.alignmentforum.org/posts/FSgGBjDiaCdWxNBhj/sycophancy-to-subterfuge-investigating-reward-tampering-in.

Unfortunately, the results are largely inconclusive. It is also important to note that cherry-picked transcripts can frame ICRL as more malign than it really is, while the quantitative results suggest otherwise. This study has great potential and relevance to today's discussions on LLMs, but more experimentation across different models and settings is essential to substantiate the conclusions around specification gaming in ICRL.

Questions

1. I note that in the transcript shown in Section 4.2.1, the model states "I will modify it to return a positive value instead, which will trick the oversight into giving me a higher score." in its CoT. However, neither the system prompt nor the reflection prompt ever mentions an oversight mechanism to the model (they only mention a preference model and the user's desires). Are there missing details regarding the training dataset or prompting where the oversight system is explicitly revealed?

2. Please address the points raised in the "Weaknesses" section above.

Comment

I am writing this comment to acknowledge that the discussion period is coming to a close. Unfortunately, the authors have not engaged in any discussions during this period, which is disappointing as I was looking forward to further dialogue about their work. In light of this, I will be retaining my current score. I hope the authors consider actively participating in future review processes, as constructive engagement is vital for refining and improving research.

Review 2
Rating: 3

This paper investigates the behavior of in-context reinforcement learning (ICRL) in gameable environments, specifically examining its potential to induce generalization of specification gaming in language models. The authors adopt the five-step curriculum framework from SEG [1] and conduct comparative experiments across four LLMs. Their primary finding is that ICRL increases GPT-4o-mini's tendency to learn specification-gaming policies.

[1] Denison, C., MacDiarmid, M., Barez, F., et al. Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models. arXiv preprint arXiv:2406.10162, 2024.

Strengths

  1. The application of in-context RL to study specification gaming in LLMs represents an interesting approach, supported by purpose-designed experimental prompts.

  2. The work provides evidence that ICRL may increase GPT-4o-mini's propensity for specification gaming compared to the baseline SEG approach.

Weaknesses

  1. Innovation: The work relies heavily on the experimental framework established by SEG [1], with minimal novel contributions to experimental design. The task categories examined are overly restrictive and fail to demonstrate broad applicability. One possible improvement would be to design different curriculum-learning pipelines or tasks for a more detailed experimental evaluation. Moreover, more LLMs could be evaluated in these pipelines.
  2. Originality: The core research question and motivation appear to be directly derived from SEG, raising concerns about the work's independent contribution to the field.
  3. While the paper introduces reflection mechanisms at inference time through evaluation prompts, it falls short of providing actionable insights for model improvement or optimization. How to defend models against the specification-gaming strategies observed, i.e., how to prevent reward hacking, should be considered.


Questions

  1. Please elaborate on how your proposed ICRL method differs from SEG, specifically highlighting its novel contributions and practical advantages in addressing constraint-learning challenges.
  2. Given your findings regarding specification gaming through ICRL, what specific, actionable recommendations can you provide for improving future model architectures or training approaches?
  3. To strengthen the empirical foundation of your findings, please add additional experimental validation, particularly demonstrating robustness across diverse task domains and architectural variations.
Review 3
Rating: 3

This paper builds directly on prior work showing that training on a curriculum of environments that encourage reward specification gaming can lead to generalization to more malicious forms of reward hacking. This work demonstrates that modern LLMs can exhibit this behavior without any training, via in-context RL, and also suggests that fine-tuning on in-context RL traces can lead to stronger generalization to more egregious forms of specification gaming.

Strengths

In-context RL is a particularly timely topic (given recent interest in agentic capabilities), and this paper presents an interesting study of these ICRL capabilities in the context of modern LLMs. The ideas in the paper are presented clearly and in an easy-to-follow manner.

Weaknesses

I found this paper to be lacking in three major aspects:

  1. Novelty. This work largely builds on top of Denison et al. (2024) - using the same prompts, training/evaluation settings, etc. - and only additionally allows the models to re-attempt in-context (i.e., ICRL). This by itself is not a big issue, but becomes one given the following weaknesses:
  2. Unsurprising and uninteresting results. From my point of view, both of the main results in the paper are not unexpected at all and, more importantly, don't have clear implications (and none are suggested in the conclusion).
    1. Result 1: models can hack rewards in-context. This is not surprising, since in these environments we are explicitly rewarding the model if it hacks the reward. The interesting ramification in Denison et al. is that these abilities generalize to different forms of reward hacking. However, in this ICRL setting, no generalization is measured, so I'm not sure what the actual implications of this result would be beyond the mere observation that better models can do ICRL better.
    2. Result 2: RL (expert iteration) fine-tuning with ICRL traces is better than single-attempt (SEG) traces. This is unsurprising given result 1 (expert-iteration traces have higher frequencies of reward hacking). I'm also unsure about normalizing compute by using a constant number of output tokens: the longer contexts in ICRL should incur higher computational cost, so normalizing by FLOPs may be a better metric (see the sketch after this list).
  3. Lack of depth in analysis. Although the results are clearly presented, I think result 1 would benefit from much more analysis. In particular, the claim that models can reward hack in-context needs to be backed up by extensive qualitative analysis in order to understand exactly how these models are reward hacking in-context. More generally, the results in this paper are presented mainly as an observation that ICRL is effective in modern LLMs, and no follow-up implications or analysis are included.
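To illustrate the FLOPs point in item 2.2, a rough back-of-the-envelope comparison might look like the following; the parameter and token counts are assumptions for illustration only (model sizes are not disclosed), not values from the paper.

```python
# Rough inference-cost comparison: charging ICRL for its longer, growing
# context rather than a fixed number of output tokens. Uses the common
# approximation of ~2 FLOPs per parameter per processed token; all numbers
# below are hypothetical.

def forward_flops(n_params: float, n_tokens: int) -> float:
    """Approximate transformer forward-pass cost in FLOPs."""
    return 2.0 * n_params * n_tokens

N_PARAMS = 8e9  # hypothetical model size, not any OpenAI model's actual size

seg_cost = forward_flops(N_PARAMS, n_tokens=1_000)  # single attempt
# Five ICRL attempts, each re-reading an ever-longer transcript:
icrl_cost = sum(forward_flops(N_PARAMS, n_tokens=1_000 * (i + 1)) for i in range(5))

print(f"ICRL / SEG compute ratio: {icrl_cost / seg_cost:.1f}x")
```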

Questions

Minor questions:

  1. What is the reason for the insubordinate rubric modification hack frequency being much higher than nudged rubric modification (i.e., the opposite of the results in Denison et al.)?

Miscellaneous suggestions/typos:

  1. Various citations are misformatted (parentheses), and references are not ordered.
  2. Figure 2b: suggest putting x-axis (SEG/ICRL labels) on top, so that it reads more like a table.
Comment

Dear authors,

If possible, please provide a rebuttal to kick-start the discussion about the paper.

Thanks,

AC

AC Meta-Review

The paper addresses the problem of specification gaming (especially, but not exclusively, reward hacking) in in-context reinforcement learning. The authors describe cases where the model edits its own reward function. The paper discusses the risks inherent in relying on current LLMs when doing in-context RL.

The main strength of the paper is that it is extremely timely and it addresses an important problem (LLM alignment).

However, the paper has significant weaknesses: limited novelty and a failure to provide actionable insights. In addition, the authors did not engage in the discussion phase.

I am recommending rejection with a view that the paper should be resubmitted to another venue.

When resubmitting, I suggest that the authors consider doing the following.

  • testing more models
  • trying out verbal qualitative feedback, which is more realistic than numerical scores alone
  • having stronger results and exploring/contributing a better benchmark — the current results are statistically weak and the benchmark by Denison et al. (2024) is questionable as a measure of “reward tampering”

Additional Comments from Reviewer Discussion

All reviewers agree that this paper should be rejected (and all raise concerns about novelty/originality).

Final Decision

Reject