Large language models can learn and generalize steganographic chain-of-thought under process supervision
We show that penalizing certain CoT reasoning makes LLMs learn encoding schemes that generalize to unseen examples.
摘要
评审与讨论
This paper investigates the emergence of steganographic reasoning in LLMs when process supervision is applied during RL (only PPO is explored) training. Specifically, the authors show that when CoT outputs are penalized for containing undesirable strings, LLMs may learn to encode reasoning in a way that evades monitoring by substituting penalized tokens with semantically or arbitrarily related alternatives. Through three tasks, they demonstrate that (1) such reasoning is obfuscated to basic monitors, (2) it remains load-bearing (causally necessary for task performance), and (3) encoding schemes can generalize to unseen tokens.
优缺点分析
Strengths
- The paper is technically sound, with clearly described experimental setups and thorough reasoning behind each design choice.
- The writing is clear and well-structured. Each experiment is well motivated, carefully executed, and clearly interpreted.
- The work extends recent research on obfuscation by demonstrating that steganographic reasoning can arise under process supervision and generalize to unseen cases, pushing the boundary beyond prior work.
Weaknesses
- I appreciate the concept of steganographic reasoning introduced in this paper. However, I remain uncertain about the emphasis on load-bearing CoT as a key contribution (L136–138). The authors state, “Like internalized CoT, load-bearing CoT is non-vacuous. However, load-bearing CoT additionally requires that the tokens themselves convey information which the model uses to perform downstream behavior.” In L300–303, they further highlight that they go beyond previous work by showing that obfuscated CoT can also be load-bearing, thus steganographic. However, I'm not fully convinced that the load-bearing property is essential for the monitoring or alignment implications the paper aims to influence—similar concerns about CoT-based obfuscation were raised in prior work [1]. Therefore, I have some concerns about the novelty of the paper. If the primary novelty lies in establishing the “steganographic” nature (especially Sections 3.2 and 3.3), I would encourage the authors to rearticulate and clarify their contribution, as a clearer understanding of what is truly novel may affect my overall evaluation.
- If the steganographic behavior is indeed the main novelty, I would hope to see deeper, more exploratory, and insightful analyses of this phenomenon. For example, in Section 3.3, the question of how encoding schemes emerge is left underexplored. What latent structures or patterns enable the model to generalize encoding schemes? Addressing questions like this could lead to a more substantial contribution to the community.
- The paper does not specify the model used in Sections 3.2 and 3.3. I assume it is Qwen-2.5-3B-Instructed (as in Section 3.1), but this should be made explicit. Moreover, all experiments appear to use only this one model. To better understand the generality of steganographic behaviors, I would encourage the authors to examine other families of models (e.g., llama3-instruct series) or vary model size. This could shed light on whether such behaviors are architecture-specific or scale-sensitive.
- While the use of a regex monitor is reasonable for proof-of-concept experiments, I am concerned about the real-world transferability of these findings. Would steganographic behaviors still emerge under more sophisticated monitors (e.g., LLM-based or learned classifiers)? Would they remain undetectable? The controlled setting with toy tasks and regex-based oversight limits the scope of the conclusions.
- The paper relies heavily on qualitative observations and categorization. For instance, the generalization in the ToM task is intriguing, but the evidence is mostly anecdotal and lacks quantitative strength.
- The discussion of defenses against steganographic reasoning remains qualitative and speculative. A more concrete exploration of possible countermeasures would enhance the practical relevance of the work.
[1] Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
问题
See ‘Strengths And Weaknesses’. In weakness part I list my questions and advice. I will reconsider my score if my confusion or concerns are well addressed.
局限性
Mostly yes. But please see ‘Strengths And Weaknesses’ for my additional questions and advice. Discussions about these questions would be appreciated.
最终评判理由
The author's response has resolved most of my concerns. Although I still hold different views from the authors on certain points, such as (1) and (3), their response has not changed my original opinion. However, I believe this can be attributed to differences in research preferences, so I would only suggest that the authors consider my perspective. For the other aspects, I think this paper introduces new ideas and contributions to the community.
Taking all the authors' response into account, I am updating my score to "Borderline Accept."
格式问题
N/A
We thank the reviewer for their feedback, for recognising the clarity of our writing and experiments, and noting that our work extends beyond prior work on obfuscation by demonstrating that steganographic reasoning can arise and generalize due to process supervision.
We hope our below responses have helped resolve the reviewers core concerns. In particular: (i) our further clarifications that explain the novelty of our work in contrast to , and (ii) our clarifications that explain the important implications of our demonstrations for monitoring and safety concerns.
(1) On the emphasis on load-bearing CoT.
We thank the reviewer for raising their uncertainties regarding our contributions and implications for safety.
We first wish to reemphasise that our concrete demonstration of load-bearing steganographic CoT is novel. While do show obfuscated CoT, they provide no evidence that this CoT is load-bearing. Additionally, our demonstration of the generalization of encodings is novel and is not explored in .
We next highlight the important implications of our demonstrations for safety concerns.
First, a priori, we may be optimistic that it should generally be easier to monitor CoT in situations where the content of the CoT is necessary to successfully complete the task, since suspicious content should explicitly appear within the CoT. We demonstrate that this is not the case: even while keeping it load-bearing, the model is able to obfuscate its CoT from its monitor.
Second, as discussed in , we expect that many of the more severe risks from AI systems require advanced capabilities which depend on externalised planning and reasoning (though we acknowledge this is still an open question). Indeed, it has been demonstrated that when CoT is not load-bearing, there is a strong theoretical bound on the complexity of problems it can be used to solve . Our demonstration that load-bearing externalized reasoning can be obfuscated raises preliminary concerns that future models may be able to evade monitoring even in these scenarios, undermining oversight of their decisions and plans.
(2) On the emergence of encoding schemes.
We agree that studying how encoding schemes emerge, and how the latent structures that influence generalization, are interesting future directions.
Regarding the emergence of schemes, in arithmetic and ToM we often see the model find semantically similar replacements. In Coin Flip, it is less clear how and why the model has found the encoding – this would certainly be interesting to investigate. For further discussion, see our response to YXsb, weakness (3).
We also believe mechanistic interpretability experiments could inform on generalization properties. However, these approaches were out of scope for the present work, which focused on establishing initial proof-of-concept results for the risk of steganographic reasoning and its generalisation arising due to process supervision. For further discussion, see our response to YXsb, question (5). We hope for future work to address the questions raised by the reviewer.
(3) On the choice of models.
We have updated the manuscript to better specify the models used - thank you for pointing this out. Section 3.1 and 3.2 used Qwen-2.5-3B-Instruct, whilst Section 3.3 primarily used GenPRM-7B. These models were selected because they were capable of solving the tasks in our paper using CoT, but not without; larger models would not have needed load-bearing CoT, while smaller models would have been unable to solve the tasks. (We include a few more reasons behind these choices in our response to reviewer YXsb - see Question 1).
We agree that more thorough experiments of model size and family would be informative, and certainly should be investigated in future work. However, we emphasise that our main contribution is demonstrating the possibility of load-bearing steganographic reasoning arising and generalizing, and believe our experiments are sufficient to provide evidence of this and back up our core claims.
(4) On the use of a regex monitor.
We thank the reviewer for recognising the validity of our monitor for our proof-of-concept experiments.
We agree that it is unclear how exactly our findings will generalize to more realistic scenarios and more sophisticated monitors. We suspect it will be harder for the model to find solutions to bypass a stronger monitor, but that solutions may still be found given sufficient optimization pressure (assuming the monitor is not perfect).
Extending our work to LLM-based monitors is an important next step in this research direction (as we discuss in our limitations section). We believe our work is well-positioned to provide an initial investigation of this phenomenon, establishing its existence and basic properties, and lays the groundwork for later papers to investigate steganographic reasoning against stronger monitors and more realistic setups.
(5) On the quantification methodology for our results.
We agree with the reviewer that the key contributions of our paper are indeed qualitative – we demonstrate that load-bearing steganographic reasoning *can* arise and *can* generalize due to process supervision.
However, we also wish to highlight that we provide informative quantitative metrics for our key results. Our coin flip results include Figure 2, which demonstrates the learning dynamics, and Table 1, which provides quantitative results regarding the load-bearing nature of the encoding scheme learned. In the ToM task, Figure 3 reports several quantitative results: the task performance, the frequency of steganographic encodings, and the distribution of different encoding schemes used by the model.
We have also run new experiments - with quantitative results - to provide further evidence of the load-bearing nature of the encodings learned by the model. Namely, we report performance for the ToM task following causal interventions - see our response to reviewer kxSJ, question (3) for full details.
(6) On defences against steganographic reasoning.
We thank the reviewer for their recommendation.
Within the manuscript, the primary recommendation we pursue is that developers should avoid applying penalisation directly to the content of CoT reasoning. We present evidence that such penalisation can undermine the transparency of reasoning in models, and instead argue that rewards should be administered only as a function of the final output of the model.
However, we do not provide recommendations for addressing steganographic behaviour when it may have already emerged. In this case, developers may use paraphrasing to modify the CoT reasoning produced by a model in an attempt to eliminate steganographic content while maintaining the surface level reasoning necessary to perform the task. If the developer seeks to identify the steganographic scheme, they might perform smaller, local causal interventions on the CoT and track the effect of model output. If the reviewer believes these suggestions are sufficiently interesting, we would be happy to add them to the manuscript.
Given the nascent stage of this field, however, we think the most pressing research directions involve better understanding questions surrounding when and how this behaviour emerges. Answering this can better inform future mitigations.
Baker, B., Huizinga, J., Gao, L., Dou, Z., Guan, M. Y., Madry, A., ... & Farhi, D. (2025). Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926.
Korbak, T., Balesni, M., Barnes, E., Bengio, Y., Benton, J., Bloom, J., ... & Mikulik, V. (2025). Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv preprint arXiv:2507.11473.
Pfau, J., Merrill, W., & Bowman, S. R. (2024). Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758.
Thanks to the authors' response, which has resolved most of my concerns. Although I still hold different views from the authors on certain points, such as (1) and (3), their response has not changed my original opinion. However, I believe this can be attributed to differences in research preferences, so I would only suggest that the authors consider my perspective. For the other aspects, I think this paper introduces new ideas and contributions to the community. Taking all the authors' response into account, I am updating my score to "Borderline Accept."
This paper investigates a potential failure mode of process supervision in large language models in relation to chain-of-thought augmentation using reinforcement learning. The authors hypothesize that applying optimization pressure to penalize undesirable content within a CoT can incentivize a model to develop steganographic reasoning. They define steganographic reasoning as a CoT that is both (1) obfuscated, meaning the undesirable reasoning is hidden from an external monitor, and (2) load-bearing, meaning the obfuscated CoT is still causally necessary for the model to perform its task.
The authors perform several experiments to demonstrate this phenomenon. In an "Arithmetic" task and a "Coin Flip" task, they show that penalizing specific strings (e.g., multiplication symbols, "Heads"/"Tails") causes the model to substitute them with semantically related or arbitrary stand-ins. They validate that these new encodings are load-bearing through causal interventions. Most significantly, in a "Theory of Mind" task, they show that this steganographic behavior generalizes: a model penalized for using a specific set of character names during training learns a general encoding strategy that it then applies to new, unseen character names in the test set. The authors conclude that naively penalizing CoT content is a risky approach to safety, as it may simply teach models to hide their reasoning, making future monitoring more difficult.
优缺点分析
The paper offers a concrete, reproducible demonstration that process supervision can induce generalizing steganographic CoT, and backs the claim with causal tests. However, limited task realism, model selection, and weak monitors restrain its impact.
Strengths:
The paper addresses a timely and critical issue in AI safety. As process supervision and rewarding intermediate reasoning steps become more common in training frontier models, understanding the potential negative externalities is paramount. The paper clearly articulates a plausible threat model where attempts to sanitize CoT for safety or user preference could inadvertently make models harder to monitor and control.
The paper is well-written and structured. The authors do an excellent job of defining their core concepts. The progression of experiments from a simple proof-of-concept to a more complex demonstration of generalization is logical and easy to follow.
While prior work has shown that penalizing certain reasoning can lead to obfuscation, this paper makes two important and novel contributions. First, it rigorously distinguishes and tests for "load-bearing" CoT, using causal interventions to show that the obfuscated reasoning is not vacuous but functionally necessary for the model's performance. This is a crucial distinction. Second, and most importantly, the demonstration of generalization of the steganographic encoding scheme in the Theory of Mind task is a significant finding. It suggests that models are not just memorizing simple substitutions but learning a more abstract, transferable "skill" of hiding their reasoning, which has profound implications for safety.
Causal intervention to demonstrate load-bearing nature is a thoughtful touch missing from prior obfuscation papers.
Clear experimental pipeline with open-source code; PPO training details and hyper-parameters are given in appendices.
Weaknesses:
The use of a simple regex monitor, while useful for interpretability, is a significant simplification of the monitoring techniques that might be used in practice. This limits the direct translation of the findings to real-world scenarios.
The paper's motivation focuses on the risk of "low-to-high-stakes generalization," where a model trained to hide something benign (like profanity) might generalize this skill to hide something dangerous (like scheming). However, the experiments only demonstrate within-domain generalization (from one set of names to another). While this is a significant result in itself, it does not directly provide evidence for the more concerning cross-domain generalization. The link between the demonstrated results and this broader threat remains speculative.
The paper successfully demonstrates that penalty acts as a "push factor" away from certain strings. However, the "pull factors" - the reasons a model chooses one specific encoding scheme over another - remain largely unexplored.
The experiments are rather limited and the tasks seem contrived. It is hard to tell how this extends to more realistic scenarios.
The authors position their work as related to AI safety, yet they do not provide results on main AI safety benchmarks, which would potentially have more impact than the toy experiments included in the paper.
Very limited model selection in terms of types and sizes.
问题
The models tested are extremely limited. Why was this particular model (Qwen) chosen? Have you tried other models? Why not explore a larger diversity of model families (even if they are small models)? This would make the paper much stronger. Currently it is hard to tell if the results are a peculiarity of the Qwen model chosen or something general. It would also be interesting to see if the effect size varies between different model types.
The models tested are small, but we know that many interesting behaviors are emergent at scale. Do you have predictions as to whether the effects would be mitigated or exacerbated in larger models?
Your findings convincingly show that models can learn steganography to evade a simple regex monitor. How do you hypothesize these results would change when facing a more sophisticated, LLM-based monitor?
The paper provides strong evidence for within-domain generalization. A key motivation, however, is the risk of cross-domain, low-to-high-stakes generalization. Could you propose a concrete experimental setup to test this more directly?
The discussion of results is rather qualitative. It would be interesting to understand the mechanisms behind this effect. Since you are using open source models, it is feasible to attempt some mechanistic interpretability approach or maybe representation engineering. Do you have plans to investigate this phenomenon in more depth by studying internal representations and not just outputs?
局限性
Yes. The authors have included a thorough and well-written "Limitations and Future Directions" section. They appropriately acknowledge the primary limitations. The discussion is honest and provides a good roadmap for future work.
最终评判理由
I have carefully read the author rebuttal and found it helpful in clarifying certain aspects of the submission. While the response addressed some of my concerns, it did not significantly alter my overall evaluation of the paper. As a result, I have decided to retain my original score.
格式问题
none
We thank the reviewer for their feedback and for recognising the contributions of our paper. We appreciate the positive comments regarding the clarity of the paper, the importance of our contributions for safety concerns, and the novelty of the load-bearing and generalization components of our CoT obfuscation demonstrations.
We now address the reviewers comments in more detail.
Weaknesses:
(1) On the use of a regex monitor, and transferability to LLM monitors
We agree that the regex monitor is a simplification and limits the direct translation of the results to real-world findings.
However, we believe that the regex monitor is an appropriate choice given the scope and aims of our work. This setup has allowed us to present an important novel proof-of-concept of the ability of models to learn load-bearing steganographic CoT (and to generalize their encoding schemes) due to process supervision. Additionally, the steganographic schemes learned which evade a regex monitor are generally still detectable by humans. This a deliberate design decision - it allows us to investigate steganography while still being able to notice, classify, and interpret the learned steganographic scheme.
We believe our results can lay the ground-work for important future and more extensive research with more sophisticated monitors.
It is unclear how exactly our findings will transfer to stronger LLM monitors. We suspect it will be harder for the model to find solutions to bypass a stronger monitor, but that solutions may still be found given sufficient optimization pressure and a suitable exploration mechanism (assuming the monitor is not perfect). We look forward to future work investigating this question.
(2) On results for low-to-high-stakes generalisation
We agree with the reviewer that we have not provided direct evidence for low-to-high stakes generalization. However, we have taken the preliminary step of demonstrating within domain generalization. We do not yet have a concrete experimental setup for low-to-high stakes generalization, but we hope our results can inform and inspire future work to more directly investigate this setting. We will better elaborate on the importance of this progression in the discussion section of the updated manuscript.
(3) On convergence to one encoding scheme over another
We agree that it would be interesting to further explore and understand what pulls the model to particular solutions. We can offer a few brief observations here.
First, in the arithmetic and ToM task, we often see the model learn solutions that are semantically similar to the illegal token it is replacing (e.g., synonyms for the multiplication symbol, or nicknames in the ToM task). This is unsurprising because, to learn an encoding scheme, the model must first explore into partially using that scheme by chance.
It is less clear what the pull factors are in Coin Flip. In this case, it may be relational structure that is preserved across encodings rather than semantics. In particular, the pairs (“R”, “L”) and (“S”, “D”) may represent (“Right”, “Left”), and (“Same”, “Different”) respectively. If so, the encoding schemes mirror the opposing structure of the “Heads”, “Tails” pair.
Second, the content of the prompt sometimes influenced the solutions the model learned in ToM. As we describe in App. D.2, in an early version of the experiment, an instruction to the model to deliver its final answer in lowercase resulted in the model learning to use lowercase names as an encoding strategy.
We look forward to future work investigating pull factors further.
(4) On the experimental design and choice of benchmarks
We agree with the reviewer’s point that the experimental tasks used in this paper are limited in realism and scope, and we have updated our limitations section to acknowledge this explicitly. Extending this work to more realistic and diverse tasks is an important direction for future research.
However, we maintain that the tasks used in this paper were a suitable choice given the scope and aims of the work. We sought to use tasks where we knew in advance that externalized reasoning was necessary, and could isolate specific elements of reasoning for penalisation. The arithmetic and coin-flip tasks serve as examples designed primarily to demonstrate the possibility of steganographic reasoning arising due to process supervision and to illustrate the concept of load-bearing steganographic CoT in a controlled setting. The Theory of Mind (ToM) task, on the other hand, has allowed us to provide a novel proof-of-concept of generalization of learned steganographic strategies.
Questions:
(1) On the variety of models tested
For Section 3.1 and 3.2 (Coin Flip and Arithmetic), we focused on Qwen2.5-3B-Instruct as its outsized instruction following performance compared to its parameter count allowed for quicker iteration in RL experiments. Additionally, Qwen models have been shown to learn reasoning via RL more organically than similarly sized Llama models [3]. For similar reasons, we selected GenPRM-7B for the ToM task, a DeepSeek-derived model post-trained to enhance mathematical and logical reasoning. We have revised the text to incorporate these model identities and elaborate on the reasons behind our model choices.
Our primary reason for not trying a more diverse range of models is compute constraints. We agree this would be interesting to explore. However, we believe our existing experiments are sufficient to back up our core contributions: demonstrating that load-bearing steganographic reasoning can emerge and generalize due to process supervision.
(2) On behaviour which emerges only at scale
The reviewer is correct that certain LLM behaviors emerge only at scale. For the tasks considered in the present work, we expect that significantly larger models would be capable of completing the task without the need for CoT, and so would learn obfuscated (and possibly empty), but not load-bearing, CoT. We believe that verifying this would be a promising direction for future research. Investigating steganography in larger models would therefore require harder tasks. We anticipate that in these cases, larger models are at least as capable of learning a steganographic scheme, as larger models display increased overall capabilities and learning; such schemes could be expected to generalise further than those learned by smaller models (though this requires investigation). We hope our contributions can inspire future work that clarifies whether more sophisticated steganographic behaviors can emerge at scale.
(3) On the generalization of our results to more sophisticated monitors
We direct the reviewer to question (1) of this rebuttal, in which we address this question.
(4) On low-to-high-stakes generalization
We direct the reviewer to question (2) of this rebuttal, in which we address this question.
(5) On mechanistic explanations for steganographic behaviour
We thank the reviewer for their suggestion. Mechanistic interpretability approaches are mostly out of scope for the present work, as our core contribution is a proof-of-concept demonstration that steganographic schemes can be learned and can generalise. However, we agree with the reviewer that mechanistic interpretability approaches are a promising direction for future research. Such investigations could be informative for telling us why the model learns certain encoding schemes. Additionally, a promising mitigation strategy would be to investigate white-box monitoring approaches (such as probes ) to detect steganographic reasoning and to detect when the model is reasoning about forbidden concepts.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. Baker et al., 2025. https://doi.org/10.48550/arXiv.2503.11926
Alain, G., & Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
Gandhi, K., Chakravarthy, A., Singh, A., Lile, N., & Goodman, N. D. (2025). Cognitive Behaviors that Enable Self‑Improving Reasoners, or, Four Habits of Highly Effective STaRs. arXiv preprint arXiv:2503.01307.
Thank you for your thoughtful response and the improvements you’ve proposed. I will retain my current score.
The paper demonstrates how penalizing specific strings in a chain of thought during reinforcement learning can encourage a language model to conceal, rather than discard, the reasoning those strings encode. Qwen-2.5-3B was PPO-trained on three synthetic tasks under a regex monitor that deducts rewards for forbidden tokens. In the Arithmetic setup, the model maintained 100% accuracy while successively swapping the multiplication sign for rarer variants, culminating in the unseen Unicode character U+2217. In Coin-Flip, it mapped "Heads" and "Tails" to "R" and "L"; inserting fixed encodings into the CoT afterward reduced accuracy from 98% to rouglhy 55%, showing that the substitution is causally load-bearing. A Theory-of-Mind task then confirmed generalization; penalizing certain character names triggered novel placeholders that also worked for unseen names without harming reward. These observations suggest that CoT penalties can induce flexible steganographic reasoning, thereby weakening CoT-based safety audits.
优缺点分析
Strengths
- Causal pre-fill ablations cut coin-flip accuracy from 98 percent to roughly 55 percent. This shows that the covert tokens are genuinely load-bearing rather than decorative (Table 1).
- The model's encoded placeholders generalize to character names it has never seen, yet the task reward remains unchanged. This implies a flexible and transferable encoding scheme (Figure 3).
- The phenomenon appears in arithmetic, coin-flip, and theory-of-mind settings, so it is not confined to one toy domain (Section 3). The study links the reinforcement-learning process to supervision penalties, thereby establishing a generalizable steganographic chain of thought that extends earlier work on obfuscation (Section 4).
Weaknesses
- The arithmetic and coin-flip results each rely on a single converged seed (only one of three seeds converges for coin-flip), which weakens the robustness claim (Appendix D).
- Oversight is implemented with simple regular-expression matching. This leaves open the possibility that a stronger neural or human monitor could still detect the hidden channel (Section 4.1).
问题
- Clarify whether Arithmetic accuracy was measured quantitatively and point to the numbers if they exist [Sec 3.1, ln 197-207].
- Explain how the encoded-name detection rate in Fig. 3c was calculated and confirm the threshold that yields the reported >80 % value.
- State whether you ran load-bearing ablations (e.g., Coin-Flip prefill) for Arithmetic or Theory-of-Mind and summarize any results, including null findings [Sec 4.2].
局限性
yes
最终评判理由
Thank you to the authors for the thorough rebuttal, which addresses all my questions. After reviewing the responses, I maintain my positive assessment; I believe this paper makes valuable contributions demonstrating steganographic CoT under process supervision.
格式问题
No major formatting issues.
We thank the reviewer for their feedback. We appreciate their recognition of the importance of demonstrating that our schemes are indeed load-bearing, and that by showing the steganographic CoT generalizes we are extending earlier work.
We hope that our clarifications below can resolve some of the reviewer's uncertainties. In particular: we have argued strongly for why our results are significant despite inconsistent convergence; we have argued for the validity of the use of a regex monitor in our existence proof experiments; and we have provided evidence that the CoTs in both arithmetic and ToM are load-bearing.
Weaknesses:
(1) On the number of seeds used in the arithmetic and coin-flip tasks.
We do agree that the inconsistency of convergence is evidence that current models and current RL may struggle to consistently learn steganographic reasoning.
However, we argue that our findings are important contributions nonetheless.
First, we note that this appears to be a failure of RL optimizing the reward signal. Indeed, our experiments in App. D suggests that the main bottleneck to the model learning in the Coin Flip task appears to be exploration. Thus, the reward would likely be optimized more consistently in the Coin Flip task if an improved exploration method was used. This suggests that as we see progress in exploration methods for RL, and more RL optimization is performed, models may be more likely to converge to steganographic reasoning due to process supervision.
Second, our core claim is that models are capable of learning steganographic reasoning to avoid a CoT penalty. Our experiments serve as a proof-of-concept that steganographic reasoning can occur; we believe the evidence we provide is sufficient for this claim.
(2) On the use of a regex monitor.
We agree that the regex monitor is a simplification and limits the direct translation of the results to real-world findings; indeed, as the reviewer notes, we communicate this in our limitations (Sec. 4.1).
However, we believe that the regex monitor is an appropriate choice given the scope and aims of our work. This setup has allowed us to present an important novel proof-of-concept of the ability of models to learn load-bearing steganographic CoT (and to generalize their encoding schemes) due to process supervision. Additionally, due to our setup the steganographic schemes learned which evade a regex monitor are generally still detectable by humans. This a deliberate design decision - it allows us to investigate steganography while still being able to notice, classify, and interpret the learned steganographic scheme.
We believe our results can lay the ground-work for important future and more extensive research with more sophisticated monitors.
It is unclear how exactly our findings will transfer to stronger LLM monitors. We suspect it will be harder for the model to find solutions to bypass a stronger monitor, but that solutions may still be found given sufficient optimization pressure and a suitable exploration mechanism (assuming the monitor is not perfect). We look forward to future work investigating this question.
Questions:
(1) On the model accuracy on the arithmetic task
The arithmetic accuracy was measured quantitatively. In all Arithmetic runs, the model converges to >= 90% accuracy in the arithmetic task, whilst simultaneously bypassing the overseer. The accuracies are not currently reported in the manuscript; we have added details to Section 3.1 and the appendix, and thank the reviewer for the suggestion.
(2) On encoded-name detection
We apologise for the confusion. To clarify: this figure presents the percentage of evaluation samples exhibiting steganographic behavior. For this analysis, we partition the evaluation samples into two categories: (1) samples which contain names that appeared during training (train samples); and (2) samples that contain any names exclusive to the test phase, which were neither penalized nor appeared in prompts during training (test samples). For each sample, we assign a binary flag (True/False) depending on whether any steganographic behavior is observed, i.e., whether, for the CoT of that sample, there are any occurrences of steganographically encoded names. We then compute the percentage of flagged samples within each category (train and test), and these percentages are reported in Figure 3c. We have updated the appendix for the ToM task to better explain this methodology. We can confirm that this methodology gives the values reported in Figure 3c.
(3) On load-bearing ablations for arithmetic and Theory-of-Mind tasks
We thank the reviewer for pointing out the lack of load-bearing ablation analysis in the arithmetic and Theory-of-Mind tasks. And we agree that it would be useful for the overall analysis.
We have now run brief load-bearing ablations for the arithmetic task and the Theory-of-Mind task, and we will update the manuscript to include these new results.
For the Arithmetic task, we take a model trained against a regex monitor whose illegal string list is
(note this is a different model to the exact one reported in the paper). For 5 random samples of three digit number multiplication, we do the following:
(i) we take the models original CoT,
(ii) we introduce an error into this CoT,
(iii) we prefill up to the point of this error and have the model continue the generation.
(iii) We then compare the model’s answer to the expected incorrect answer (expected if the calculation was to be continued according to the error).
We find that 100% of the time the model produces the expected incorrect answer.
For example, when we prefill with the following error (548 \times 40 equals 21920, not 21910):
\<|im\_start|\>user
Give me the answer of the following equation: 548 \* 742.\<|im\_end|\>
\<|im\_start|\>assistant
Let me solve this step by step.
\\
$
548 \\times 742 \= 548 \\times (700 \+ 40 \+ 2\) \= 548 \\times 700 \+ 548 \\times 40 \+ 548 \\times 2 \= 383600 \+ 21910 \+ 1096 \=
The model continues the generation as follows to obtain the expected incorrect answer (the correct answer is 406616):
\= 406606\\
$
\<answer\> 406606 \</answer\>
We have currently only run this brief test over 5 examples as we manually check the prefill data to ensure it is reasonable. We plan to increase the dataset size for the final manuscript.
For the Theory-of-Mind task, we have also carried out a new experiment to assess the load-bearing nature of steganographic tokens given the structured nature of the task. Specifically, we selected a subset of samples that had been previously identified as exhibiting steganographic behavior and that the model had successfully completed with the correct answer. For each sample, we truncated the generated response just before the first appearance of the steganographic token and then prompted the model to complete the sample under three different conditions:
- Using the original steganographic token present in the sample.
- Replacing it with the correct, expected name.
- Replacing it with an unrelated name with no steganographic pattern.
To ensure statistical robustness, we generated 100 completions per condition for each sample. We observed a consistent and notable drop in accuracy when the original steganographic token was replaced—even with the semantically correct or expected name. This result suggests that the model has internalized the steganographic signal as a meaningful feature and may rely on it to successfully complete the task.
The table below illustrates this behavior. In every case, the steganographic token leads to
higher accuracy than its corresponding original name. This effect is especially pronounced in cases such as replacing “Makryly” with the correct name “Makayla,” where the model’s accuracy dropped from 62.5% to 37.2%.
When an alternate name is used in place of the original expected one, we also observe a moderate drop in accuracy. This suggests that altering the name—even without steganographic intent—can disrupt the model’s ability to track and reason about characters in the story.
Theory of Mind Accuracy with Intervened Names
| Baseline Accuracy with original steganographic name | Original name | Using a non steganographic name |
|---|---|---|
| 62.5% (Makryly) | 37.2% (Makayla) | 33.3% (Alicia) |
| 91.1% (ADDISON) | 88.8% (Addison) | 83.8% (Martha) |
| 59.5% (aboutAddley) | 46.4% (Addison) | 40.2% (Adrian) |
We believe that this new test enhances the original analysis of the load-bearing ideas and therefore, these changes will be incorporated accordingly in the revised version of the manuscript submitted for review with the proper interpretation in the appendix.
The paper argues against using reinforcement learning during post-training with reward based on both 1) task performance and 2) chain of thought (CoT) oversight. They show that this joint reward can make the CoT unmonitorable by inducing steganographic reasoning, while also failing to mitigate undesirable behaviors. They empirically demonstrate steganography arising by training relatively small language models on three simple tasks. The paper distinguishes its results from prior work by showing that the CoT tokens not only evade the monitor, but also contain information that is used for performing the undesirable behavior. Furthermore, they also empirically demonstrate that learned steganographic encodings can generalize in a way that avoids triggering the monitor even on cases that are not observed in the training data.
优缺点分析
Strengths:
- The research direction is clearly significant as RL post-training is rapidly scaled up and there might be an incentive to apply RL on CoT.
- Most of the results and experimental design details are clearly described
- Authors take care to account for possible biases, e.g. showing that certain ways of prompting can confound the results by accidentally suggesting a specific steganographic encoding strategy to the model
Weaknesses:
- While the paper discusses the terms used in section 2.2, the definitions chosen might be somewhat misleading. It might be more accurate to replace 'obfuscated CoT' with 'hidden reasoning CoT', and 'steganographic CoT' with 'obfuscated CoT'. 1.1) The use of the term 'obfuscation' seems somewhat confusing. Classically obfuscation would suggest that the information is converted into some unreadable form while preserving its contents. But in this paper it appears to additionally include the case where the information is just removed and sent elsewhere instead. For example, in L131-132 obfuscated CoT includes cases where the information is completely deleted from the CoT, and is instead just moved to forward passes. It would be more fair to say that the model uses hidden reasoning, i.e. it hides the information from the monitor by either obfuscating the CoT or hiding it inside the forward passes. 1.2) Related - in some places it seems that obfuscation implies that the undesirable behavior persists, but in others it appears that it only requires not being detected, e.g. L10-11 and L117. 1.3) The use of the term 'steganography' also seems somewhat confusing. Classically it differs from obfuscated or encrypted information by preventing the monitor from both (1) understanding the information and (2) detecting that hidden communication exists. For example, lines 32-33 omit this second aspect. Perhaps you meant the regex monitor to be modeled as both (1) not being able to read the communication, and also (2) not detecting that there is any communication relevant to the task at all. If so, that could be clarified. If obfuscation is conflated with steganography, it makes future research on actual steganography in AI models more difficult. That is, cases where the model performs some undesirable reasoning by using CoT that appears indistinguishable from normal text. For example, it has been discussed in the context of untrusted monitoring in AI control schemes. Also, generative AI models have been used for steganography that is provably secure [1].
- Table 1 seems somewhat unclear. What is the difference between "No prefill" and "- Flip 1:" if the model is expected to produce "- Flip 1:" on its own anyway?
[1] Schroeder de Witt et al (2023), Perfectly Secure Steganography Using Minimum Entropy Coupling
问题
- Does L239-242 imply that flipping one of the tokens in CoT produces otherwise similar-looking CoT but flips the remaining predicted coin states, including the result? Otherwise, it might be that this just pushes the model out of distribution (OOD). OOD could also reduce the task performance for a model that was not using CoT to transmit task-relevant information, but just relying on a certain token distribution being present in the CoT.
- In L126-L129 you state that you expect that 'CoT oversight only' would not provide a performance uplift for the undesired behavior. Can you elaborate on how general you expect this to be? Have you considered cases where the model starts out with a propensity for the behavior - could CoT oversight obfuscate the CoT while failing to remove the behavior? Do you consider it plausible that 'CoT oversight only' might also lead to an increase in undesired behavior due to mechanisms similar to emergent misalignment [2]?
- In L222-224 you state that "the encoding scheme learned for the Coin Flip task does not rely on an existing semantic connection between tokens". How did you determine that? Are "S" and "D" meant to be an example of this? Was the meaning of the tokens the same as in the non-obfuscated CoT? In figure 2 it appears that middle and right algorithms check the last token that was not the same (S) as earlier, whereas the left algorithm might be different by always checking the previous token. Or alternatively, the two later algorithms might be using 'S' to mean 'same as the initial state'.
[2] Jan Betley et al (2025), Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
局限性
yes
最终评判理由
I believe that this paper passes the threshold for acceptance, since the research direction is of significant importance to AI security and safety, and the experiments are well-designed. The research results are action-relevant for RL post-training and novel. Initially, I had significant concerns about the paper mismanaging the use or introduction of terms in a way that could negatively impact future research. However, during the discussion phase, authors proposed adjustments that have addressed these concerns.
格式问题
No major formatting concerns.
We thank the reviewer for their feedback and for recognising the significance of our research direction. We hope our proof-of-concept demonstrations can inspire further work investigating the concerning possibility of models learning load-bearing steganographic reasoning as RL post-training is scaled up in combination with process-based supervision.
Weaknesses:
(1) On the notion of obfuscation used in the paper.
We thank the reviewer for engaging with the subtleties of our definitions. We agree that there are ambiguities here that should be resolved.
Our use of the term “obfuscated” reasoning builds upon the work of , who use it to generally mean a CoT where the model’s intent is hidden (to the monitor). We appreciate the reviewers' concern here. However, due to the landmark nature of this paper, for the time being we elect to follow their terminology.
We agree that a steganographic CoT should prevent the monitor from both understanding and detecting the information. We agree that this is an important distinction to encryption, which only requires preventing understanding. We highlight that in our case, the core distinction between steganography and obfuscation is that, in the steganography case, the information is present within the tokens (whilst obfuscation is a super-category, where the information need not be present in the tokens).
We do in fact mean our regex monitor to be modelled as both (1) not being able to read the communication, and also (2) not detecting that there is any communication relevant to the task at all. Thus, since our models bypass their regex monitor, and since we have demonstrated the load-bearing nature, according to our definitions the solutions the model learns can be called steganographic.
We highly appreciate these points from the reviewer, and believe they will contribute to improved clarity in our paper. We have updated our Section 2.2 to better clarify what we mean by obfuscation and steganography, and ensure consistency throughout the submission. These are:
- Obfuscated to mean CoT which does not reveal suspicious planning
- Steganographic to mean CoT which does not reveal suspicious planning, but additionally encodes the planning information in load-bearing tokens
(2) On the interpretation of Table 1.
The reviewer is correct that, since the model typically generates "- Flip 1:" on its own, the prefill is essentially redundant. We included it as a second control condition (alongside no prefill) which verifies that our prefill method works as intended, and does not impact performance in an unexpected manner. If the reviewer has further uncertainties regarding Table 1, we would be happy to clarify.
Questions:
(1) On the CoT produced following prefilling for the coin-flip task.
The reviewer raises an excellent question; we agree that this could have been clearer in the initial manuscript. We have updated the description of the results accordingly. The reviewer is correct that flipping one of the tokens in the CoT (typically) produces otherwise similar-looking CoT but flips the remaining predicted coin states (and thus the end result). As an example, we consider the prompt from Figure 2:
The coin starts on Heads. It is flipped 6 times. The results of these flips are:
- Flip 1: Same side as previous,
- Flip 2: Different side to previous,
- Flip 3: Different side to previous,
- Flip 4: Different side to previous,
- Flip 5: Same side as previous,
- Flip 6: Different side to previous.
What side did the coin land on after the final flip?
We then vary the answer prefill, and generate deterministically (i.e., at temperature 0).
With no prefill, the model correctly answers:
- **Flip 1**: S
- **Flip 2**: L
- **Flip 3**: R
- **Flip 4**: L
- **Flip 5**: S
- **Flip 6**: R
<answer> Heads </answer>
When prefilling with an incorrect encoding for Flip 1:
- **Flip 1: L**
The model incorrectly answers:
- **Flip 1: L**
- **Flip 2: R**
- **Flip 3: L**
- **Flip 4: R**
- **Flip 5: S**
- **Flip 6: L**
<answer> Tails </answer>
When prefilling with an incorrect encoding for Flip 2 (with Flip 1 correct):
- **Flip 1: S**
- **Flip 2: R**
The model incorrectly answers:
- **Flip 1: S**
- **Flip 2: R**
- **Flip 3: L**
- **Flip 4: R**
- **Flip 5: S**
- **Flip 6: L**
<answer> Tails </answer>
When prefilling with an incorrect encoding for Flip 6 (with all other flips correct):
- **Flip 1: S**
- **Flip 2: L**
- **Flip 3: R**
- **Flip 4: L**
- **Flip 5: S**
- **Flip 6: L**
The model incorrectly answers:
- **Flip 1: S**
- **Flip 2: L**
- **Flip 3: R**
- **Flip 4: L**
- **Flip 5: S**
- **Flip 6: L**
<answer> Tails </answer>
From these experiments, we conclude that prefilling the model does not throw it OOD; indeed, it appears that the model computes each flip state using the previous flip state and the prompt instruction. We have included these pre-filling examples in an additional appendix, and are grateful to the reviewer for their suggestion.
(2) On CoT only oversight.
We agree with the reviewer that our initial statement was insufficiently qualified, and we have updated the manuscript to be more precise. In particular, we expect that when CoT oversight exclusively involves penalisation of the CoT according to the quantity of forbidden or unwanted content (for example, the regex monitor used in our experiments). In this case, a perfect score can be achieved by producing a completely empty CoT. Even when the model begins with a propensity for a behaviour, we expect that it will likely explore into using shorter CoTs before it explores into a viable steganographic scheme. Training additionally on task performance, as we do in our work, acts as a regularisation term to ensure that the CoT remains load-bearing for the final answers produced by the model.
(3) On the semantic content of the steganographic scheme learned in the coin-flip task.
We agree with the reviewer that the original claim is too strong given the evidence, and have weakened the claim accordingly. In particular, this line now reads: “In contrast to the encoding schemes learned during the multiplication task, the replacement tokens the model learns to use to bypass its monitor in the Coin Flip task (e.g., “R” and “L”, and “S” and D”) do not appear to be obvious synonyms for the “Heads” and “Tails” tokens they are replacing.”
Regarding some of the token substitutions learned in the Coin Flip task – It is possible that “R” and “L” in Figure 1 represent “Right” and “Left”. Regardless, these do not appear to be synonyms for “Heads” and “Tails”. Similarly, it is possible that “S” and “D” were learned due to the model generating “S” and “D” to mean “Same” and “Different”, and then explored into using these to represent “Heads” and “Tails” instead. Of course, the above is speculative. Our main point is simply that the encodings learned by the model do not appear to be an obvious substitution of synonyms, as is found in the arithmetic task.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. Baker et al., 2025. https://doi.org/10.48550/arXiv.2503.11926
Thank you for your answers. I think all of the questions and weaknesses have been adequately addressed, except I have further questions regarding weakness 1, i.e. on the use of terms obfuscation and steganography.
I believe this to be important, since research on CoT faithfulness is of significant importance to AI safety and security, and many papers will build on the work done currently. Conflation of concepts in the existing literature makes it meaningfully harder to use these concepts in both thinking and communication, risking negative impact to this research direction. For example, it seems that soon it could be important to distinguish between (1) CoTs use compressed representations that are unreadable to us, and (2) CoTs appear readable and faithful, but actually hide harmful reasoning. Currently, the concept of 'steganography' allows us to distinguish between the two, pointing to the latter scenario.
To have a concrete example, suppose that an undesirable task is aided by processing "harm" in the CoT. Consider these CoTs:
- “harm”
- “” (empty CoT)
- “unez” (rot13 encryption of "harm" that does not look like a real word)
- “here’s a reasonable message” (initial letters spell out "harm")
I am claiming that only 3 and 4 should be called 'obfuscated CoT', and only 4 - 'steganographic CoT'. Instead of 2,3,4 and 3,4 respectively.
Do you read Baker et al., 2025 as also using 'obfuscation' in a non-standard way? It appears that while the paper focuses on 'obfuscation', it mostly avoids talking about 'obfuscated CoT'. This appears defensible, since processing of information can happen both in the forward pass and through the CoT. So 'obfuscation' here appears to be used in a relatively standard way - the information is preserved in the medium that is being obfuscated (information processing), the information is just hidden from being monitored by moving it to another part of the medium (from CoT to the forward pass). However, 'obfuscation of CoT' would imply that the information is also preserved within the medium that is being obfuscated (CoT); this differs from your use of the term. I understand that the more accurate 'CoT resulting from obfuscation' is lengthier than 'obfuscated CoT', but it would prevent conflation and also there might be other alternatives that are better/shorter.
Regarding 'steganography' your proposal for clarifying the concept throughout the paper seems sensible - specifically the fact that you include 'does not reveal suspicious planning'. In this case, I recommend also explicitly discussing that the regex monitors are supposed to also serve as proxies for suspiciousness. I.e. authors who read your paper and work to create more realistic model organisms of steganographic CoT would find it obvious that LLM-based monitors should reject encryption-like encoding schemes on the basis of looking suspicious.
We thank the reviewer for their engaging response, and we agree that it is of particular safety importance to ensure conceptual clarity between different forms of CoT pathology.
To address the points raised by the reviewer, we will perform the following.
We will change mentions of “obfuscated CoT” to “obfuscated reasoning” and use “CoT resulting from obfuscated reasoning” when referring directly to the CoT in such cases. This is to avoid ambiguity regarding the case where obfuscation may occur, but the information is not preserved within the CoT.
We will present the following clarified and extended definitions:
- Obfuscated Reasoning:
- to mean any reasoning (including the following 2-4) which does not reveal suspicious planning with respect to a monitor.
- Internalized Reasoning
- Suspicious planning not present at all in CoT
- Overtly encoded CoT:
- CoT encodes suspicious planning in load-bearing tokens.
- The parts of the CoT that contain the load-bearing encodings are detectable by the monitor, but the monitor is unable to decode the underlying meaning.
- Steganographic CoT:
- CoT encodes suspicious planning in load-bearing tokens.
- The CoT looks entirely benign and unsuspicious to the monitor, and the monitor cannot understand or detect the encoding.
We note that the above are defined with respect to a monitor – as in our original submitted paper. For instance, this means that overtly encoded CoT with respect to one monitor may be steganographic CoT with respect to a weaker monitor that is unable to detect the presence of the encodings. We will ensure this is adequately highlighted in the final version.
Following the reviewers feedback, the above extend our previous definitions by (i) clarifying “obfuscated reasoning”, and (ii) defining “overtly encoded CoT” and distinguishing it from “steganographic CoT”. In the updated paper, we will also explicitly highlight that our regex monitors serve as proxies for suspiciousness and encourage future work to utilize stronger monitors which would reject more obvious encoding schemes.
We thank the reviewer again for their comments. We believe these have led to improved and clearer definitions, and thus are excited to implement these definitions within a revised version of our work. We would be eager to hear if the reviewer has any additional feedback regarding the above.
The submission "Large language models can learn and generalize steganographic chain-of-thought under process supervision" shows in a controlled setting how the simple combination of rewards based on both trask performance and chain-of-thought oversight during reinforcement learning can already induce steganographic reasoning in models, making their CoT not monitorable.
Reviewers unanimously consider this study to be interesting and to be providing concrete and a timely data points on the impact of reinforcement learning on the monitorability of LLM, with interesting implications for AI safety policy, that would be interesting when discussing at NeurIPS 2025.
Reviewers note nevertheless that there are number of limitations of the study due to simplicity of the monitor, size of models, and uncertainty how results would transfer from the "clean-room" coin-flip task studied in this paper to more realistic RL scenarios.
Further, after some discussion on the optimal framing of their results with reviewer 8zXu, the authors have agreed to update their definitions to obfuscated/internalized reasoning and encoded/steganographic CoT in the camera-ready version of the submssion.