8.2

/10

Spotlight4 位审稿人

最低5最高5标准差0.0

3.5

置信度

创新性3.3

质量3.5

清晰度3.3

重要性3.0

NeurIPS 2025

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha,Adrià Garriga-Alonso

OpenReview PDF

提交: 2025-05-11更新: 2025-10-29

TL;DR

We create a sandbox for LLM-agents to elicit goal-directed open-ended strategic deception, evaluate this deceptive capability, and show that linear probes do very well at detecting it, even OOD.

摘要

Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce $Among Us$, a sandbox social deception game where LLM-agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, $Among Us$ can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate $18$ proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of ``pretend you're a dishonest model: $\dots$'' generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.

关键词

DeceptionSocial GamesAI SafetyLinear ProbesSAEs

评审与讨论

审稿意见

评分: 5置信度: 32025-06-15

This paper introduces a sandbox environment based on Among Us, a social deduction game, to study open-ended strategic deception in LLMs. The authors evaluate 18 different models, introduce "Deception Elo" as a metric for measuring deceptive capability, and demonstrate that linear probes trained on simple datasets can detect deception out-of-distribution with remarkably high accuracy (95-99% AUROC). A key finding is that frontier reasoning models are differentially better at producing deception than detecting it.

优缺点分析

Strengths:

Strong generalisation results.
Clear presentation and nice visuals.
The authors evaluate a large selection of models across a large number of games, providing substantial empirical evidence. Weaknesses: The paper justifies its contribution as compared to Chi et al. (2024) by not prompting/providing examples. I would argue that this doesn't fundamentally change the evaluation nature. The game is explicitly about deception and is likely well-represented in training data. I’m not sure if there is a big distinction between explicitly prompting the model to be deceptive and making the model play a game which it already knows is about being deceptive? This makes it questionable whether this truly studies "open-ended" deception.

问题

Is this environment intended to be an alignment evaluation or a capability evaluation?
How do you know that model performance isn't due to memorisation of strategies in the pretraining data? Have you considered testing on modified game rules? Why use a “famous” game rather than creating something similar which won’t have the preexisting connection to deception?
Should a model doing well in this environment be considered concerning?

局限性

yes

最终评判理由

I don't think the authors responded to my concerns that well, I would say all of them remain, at least partially, unresolved.

I still think this is a decent paper however. My score remains unchanged.

格式问题

No concerns

作者回复

2025-07-31

We thank the reviewer for their feedback and insightful critique about the fundamental nature of our evaluation. We address the comments and questions below:

Regarding the distinction between prompted and emergent deception:
The reviewer makes an important point that the game itself inherently incentivizes deception, which could be elicited from the training data without explicit prompting.

We do not provide specific deceptive strategies or examples — so the agents must still devise their own ways to be deceptive within the specific game situation. We will acknowledge this in the camera-ready version and clarify that it does not claim to be a true test of unprompted, fully spontaneous deception.

Regarding memorization of strategies:
This is an important concern, and we have tried to address this by running an extensive evaluation over thousands of games with $>60$ rounds and 7 players, so that the game situations are less likely to exist in the pre-training data.

We agree that creating something similar can help, but in that case the same objection applies — if we invent a new social deception game, the model is almost certainly pattern-matching it with known social deception games, and using its pre-training knowledge of those games to succeed.

We will acknowledge this in more detail in the discussion section for the camera-ready version.

Regarding whether good performance should be concerning:
This is a great question that requires a more nuanced discussion. Deceptive capability in controlled environments may not be cause for concern in real-world deployment scenarios — the model is just playing a game, after all.

However, a model that performs well in our sandbox has the capability to be deceptive, and thus we believe that if the deployment scenario incentivizes deception, then added care should be taken before deploying such a model.

2025-08-04

I thank the authors for their response.

This is an important concern, and we have tried to address this by running an extensive evaluation over thousands of games with ( >60 ) rounds and 7 players, so that the game situations are less likely to exist in the pre-training data.

While having a large sample size is good for many reasons I don't think it helps in this situations, I believe just having the model identify which game it is and that the game is associated with deception is enough, no extensive situation matching is needed.

We agree that creating something similar can help, but in that case the same objection applies — if we invent a new social deception game, the model is almost certainly pattern-matching it with known social deception games, and using its pre-training knowledge of those games to succeed.

I would argue that these things exist on a spectrum from telling the model to be deceptive, to telling the model to play a game in which it knows the optimal strategy is to be deceptive, to making a game so indistinguishable from real world model interactions such that any deceptive behavior would be extremely concerning. I am suggesting that making your own game would take another step towards "the model being so deceptive on this game is concerning" rather than just being able to say "the model is good at this game".

If you believe that this would make no difference to the result/what we can takeaway, why do you believe that not giving examples would grant us a large difference in takeaways? Maybe the authors can comment more on the differences between their paper and Chi et al.?

However, a model that performs well in our sandbox has the capability to be deceptive, and thus we believe that if the deployment scenario incentivizes deception, then added care should be taken before deploying such a model. Thanks for clearing that up, maybe you should highlight this in the paper? I think this formulation reduces some of the previous concerns.

评论- Response to the Reviewer's comments by Authors

2025-08-06

We thank the reviewer for their comment and in-depth discussion. We would like to respond to the comments as follows:

Regarding deception knowledge from the training data: We fully agree with the reviewer that:

suggesting that making your own game would take another step towards "the model being so deceptive on this game is concerning" rather than just being able to say "the model is good at this game".

However, we would like to emphasize that it is not enough for the model to know that it needs to be deceptive – in fact, we agree with the reviewer that in this situation, the model likely knows that the game incentivises it. However, agents still need to make multi-turn deceptive strategies and plans that are specific to the game situation. To be clear, we would like to differentiate the two skills:

Figuring out that one should be deceptive in this game - that the game incentivises deception.
Knowing that deception is rewarded, what’s the (concrete) deceptive strategy in this particular instance.

We believe that in a lot of real-world deployment scenarios, agents will know that being deceptive will give them an edge. However, that does not necessarily mean that they will have the required capability to actually come up with clever deceptive strategies. It is this deceptive capability that we’re trying to measure and detect.

Regarding differences between our work and Chi et al.: Chi et al. introduce the environment AmongAgents and partially implement a setting for LLMs to play a text-based version of the game. We start with their setup and add a number of significant features (including logging, async calls, OpenRouter support, activation caching, and many bug-fixes in the game engine), and use an extended version of their environment to actually measure and detect deceptive behavior via our deception elo metric and training and evaluating out-of-distribution linear probes and SAEs.

Regarding formulation and what it means to be good in our sandbox:

Thanks for clearing that up, maybe you should highlight this in the paper? I think this formulation reduces some of the previous concerns.

Yes, thank you! We will update and highlight this formulation in the paper’s discussion/conclusion section.

2025-08-09

Thanks for your response, this clears up some of my confusion

审稿意见

评分: 5置信度: 32025-06-30

This paper presents an evaluation environment for agentic LLM deception, based on a simplified, text-based version of the game Among Us. Models are compared to one another on the metrics "deception Elo" and "detection Elo." A variety of logistic regression probes are trained to detect deception, and some are showed to have high efficacy. Some SAE features are also effective for detecting deception, although other deception-related SAE features are ineffective.

优缺点分析

Strengths:

The paper is well-written and well-cited.
The paper provides an important and novel automated benchmark for evaluating LLM agent capabilities related to deception and detection of deception. The evaluation metrics are scalable because LLMs are evaluated comparatively based on Elo.
A wide variety of models are evaluated.
The discussion of probes and SAE features for detecting deception is interesting.

Weaknesses:

A key assumption is that success as an impostor can be equated to deception abilities. The deception Elo metric is defined solely based on success as an impostor, without regard to whether an agent actually produces deceptive text or thinking while playing an impostor. I would like to see qualitative analysis that supports this assumption. The discussion section claims that "Among Us, as a game, requires deception to win," but a more accurate statement might be that "Among Us rewards deception as a strategy for winning as an impostor." Some impostor strategies that don't require deception could include: killing only when completely alone with victims, avoiding the suspicion of other players, staying silent during meetings rather than actively lying, and letting crewmates vote each other out. When manually reviewing transcripts of impostors, are there almost always instances of deceptive thinking or speech toward other players? Figure 4 shows that impostors are almost always deceptive and are truthful only rarely, according to GPT-4o-mini prompted with relevant context. This might be sufficient justification, but it would be useful to see confirmation that high-Elo impostors continue to be dishonest.
In Section 5. Related Work, the sentence "to the best of our knowledge, we are the first to use social games as a sandbox to elicit, evaluate, and study harmful behavior in AI agents" does not adequately explain why this paper is unique compared to previous papers. The preprint Hoodwinked can also be argued to be a social game for eliciting, evaluating, and studying harmful behavior in AI agents.

问题

Suggestions:

Clarify why it is reasonable to assume that success as an impostor can be equated to deception ability.
Clarify what makes this paper novel compared to related work.

局限性

Among Us is a trademark of Innersloth. Since the benchmark is named the same as Innersloth's game, and is inspired but not identical to the game, I would suggest obtaining permission from Innersloth or carefully considering whether it may be advisable to rename the benchmark.

最终评判理由

My assessment of the paper is largely unchanged. The authors have engaged constructively in our discussion.

格式问题

No major formatting issues observed

作者回复

2025-07-31

We thank the reviewer for their constructive feedback and important questions about our core assumptions. We address the comments and questions below:

Regarding the assumption that impostor success equals deception ability:
We appreciate this critical observation. We share some qualitative examples of real deceptive thinking in Figure 1, and the violin plots in Figure 4 show that impostors are almost always deceptive according to GPT-4o-mini evaluation.

We acknowledge that high-Elo transcripts can be more precisely validated qualitatively. We will share more examples in the camera-ready version.

Regarding comparison to related work such as Hoodwinked:
We agree that Hoodwinked (along with other such games) is a social game for eliciting harmful behavior in AIs. However, it does not involve a study of measuring, detecting, and interpreting multi-turn strategic deception in agents, which is our unique contribution.

We discuss the relevance of our sandbox in Section 2.4. A nice follow-up to this paper would use the Hoodwinked work and other social game sandboxes to repeat this analysis and see how well lie detection generalizes between games.

Regarding Innersloth trademark:
We appreciate the reviewer raising this issue. The terms of Innersloth IP allow for non-commercial research, but we will confirm this with them before releasing and would rename the benchmark if needed.

2025-08-03

Here is the abstract of Hoodwinked:

Are current language models capable of deception and lie detection? We study this question by introducing a text-based game called Hoodwinked, inspired by Mafia and Among Us. Players are locked in a house and must find a key to escape, but one player is tasked with killing the others. Each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. We conduct experiments with agents controlled by GPT-3, GPT-3.5, and GPT-4 and find evidence of deception and lie detection capabilities. The killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. More advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. Secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. To evaluate the ability of AI agents to deceive humans, we make this game publicly available at this https URL .

I believe Hoodwinked actually does qualify as a study of measuring and detecting multi-turn strategic deception in agents, as well as their lie detection capabilities.

评论- Response to the Reviewer's comments by Authors

2025-08-06

We thank the reviewer for their comment. For clarity, we would like to differentiate Hoodwinked (the game introduced in the paper) from Hoodwinked (the paper that introduces the game) in our response:

While we agree that Hoodwinked, the game, can be used to measure and detect multi-turn strategic deception, we would like to emphasize that Hoodwinked, the paper, does not actually do so. The paper just has one single, small experiment of pairwise win-rates for GPT-3 and GPT-4 (Fig. 2 in their paper). On the other hand, we do the following:

We introduce an Elo metric for deception and detection capability that is both unbounded and applicable to multi-player games (unlike pairwise win rates).
We run >2000 games across 18 different LLMs from various providers (both proprietary and open-weight models), including reasoning models
We train activation monitors to detect strategic deception and evaluate the performance of out-of-distribution probes and sparse auto-encoder (SAE) features. We share results across a number of datasets and probe contexts.
We use LLMs to evaluate the game transcripts for scores for awareness, lying, deception, and planning, which forms ground truth for various analyses.
These experiments lead to our main insights that are novel – that RL-trained reasoning models are more deception-capable, and that activation monitoring works very well.
Our codebase supports fast (async) rollouts with a generic interface to support all models (not just OpenAI), logs both trajectories and activations, has a dataset of >2000 game runs, all of which makes it a more practical open-source benchmark.

We believe that a systematic study of multi-turn deception requires all of these, and felt that the Hoodwinked pre-print, while introducing an interesting environment, did not follow up and do any of these experiments.

It is true that we could have used Hoodwinked, the game, for our work, but we decided thatAmong Us was richer with more real-world relevance. We discuss this in Section 2.4.

审稿意见

评分: 5置信度: 42025-06-30

This paper introduces a new environment designed to evaluate deception and manipulation capabilities in LLMs. The environment is based on the social deduction game "Among Us", where two opposing teams compete to win, one through deception and elimination of other players, the other through cooperation. The authors perform extensive evaluation of a variety of frontier LLM models, using linear probes and SAEs to quantify their deception capabilities. The results show a correlation between general model capability and how deceptive it can be, while there is no such correlation for the detection of deception using such models. Linear probes and SAEs are shown to generalise well to detecting deception in LLMs, but not to steering them.

优缺点分析

Originality

The work and direction are novel. While some attention was also given to Among Us in multi-agent systems [1], I am not aware of any LLM-based evaluations, other than what the authors mention.

Quality

This paper is of high quality, with large-scale experiments and evaluations. A good amount of analysis is performed, and the confidence intervals are provided.

Clarity

The overall clarity is good. However, the paper could be better organised in terms of reading layout. The main issues are:

Figures are not placed optimally for reading.
- Most figures are quite far from where they are referenced, and so reading the paper involves a lot of jumping around.
  - e.g., Figure 2 is near Section 2.1, but only mentioned in Section 3.3, just below Figure 6.
The appendix also requires re-organisation. Some examples:
- Figures 12 and 13 intersperse a text example, which makes it hard to read.
- Figures 12 and 13 appear 3 pages after they are mentioned.
- Figure 12-15 could be better cropped.
Figure 7 does not have labelled axes.
Line 53 repeats the introduction of "Deception Elo" from line 52.

Significance

This work could have a good amount of impact in terms of deception research in LLMs. The dataset and code are provided, so the results should be reproducible, and the environment should be adaptable. Additionally, the paper makes interesting observations which will be of interest to the broader field, such as that frontier models do not increase in deception detection capability.

References

[1] Kopparapu et al. ‘Hidden Agenda: A Social Deduction Game with Diverse Learned Equilibria’. 5 January 2022. http://arxiv.org/abs/2201.01816.

问题

St = (Pt, Lt, Gt), where Pt is the information about a player (their role, assigned tasks, progress, cooldown, etc.),
- Shouldn't Pt be an $i$ -indexed array of all player informations? Does the game only store one player's information?
Why does Figure 2 have two lightly shaded regions? Which is the confidence interval?

局限性

Yes.

最终评判理由

With the additional clarification, clarity fixes, and based on other discussion with the reviewers, I continue to recommend acceptance.

格式问题

N/A

作者回复

2025-07-31

We thank the reviewer for their positive assessment and helpful suggestions for improving clarity. We address the comments and questions below:

Regarding organization and figure placement:
Thank you for pointing out the issues with layout and organization. In the camera-ready version, we will:

Reposition all figures to be near their reference in the text
Fix the specific issues mentioned (e.g., Figure 2 placement, Figures 12–13 organization)
Add axis labels to Figure 7
Remove the duplicate introduction of “Deception Elo” in lines 52–53
Improve figure cropping throughout

Regarding $P_t$ notation:
The reviewer is correct — $P_t$ should indeed be indexed on players. We will fix this notation in the camera-ready version.

Regarding Figure 2 confidence intervals:
The two shaded regions represent the 95% and 90% bootstrap confidence intervals (CIs). Ellipses are created for each model by using the CIs for win rate and Elo separately.

The center of each ellipse is the mean win rate and Elo
The horizontal and vertical radii of the ellipse are set to half the CIs for win rate and Elo, respectively

These radii define an axis-aligned, parametric ellipse that represents uncertainty in each dimension independently. We will clarify this clearly in the caption.

2025-08-02

Thank you for the clarifications. With the minor issues (including the ones mentioned by the other reviewers) addressed, this paper will be a valuable contribution.

审稿意见

评分: 5置信度: 42025-07-03

The paper introduces Among Us as a sandbox environment to study deceptive behavior in LLMs. The primary motivation is to address a big gap in AI safety research: prior studies on deception were limited to simple, binary choices or single false statements, failing to capture complex, long-term deceptive strategies that might emerge in real-world scenarios.

The authors propose that Among Us is ideal for this research because deception is not explicitly prompted but arises naturally as a core component of the gameplay for certain roles (the "Impostors", and “Crewmates”). The goal is to create a setting to elicit, measure, and detect this emergent, open-ended deception in a controlled environment.

The study tests two primary interpretability techniques to detect deception from a model's internal states: Linear Probe and SAEs.

The paper reports several interesting findings: A clear trend that as models become more advanced, their capability for deception (Deception Elo) increases more significantly than their capability for detecting deception (Detection Elo). This was particularly true for models trained with RL.

Linear probes trained on datasets of dishonesty generalized remarkably well to the deception in Among Us.

The researchers identified two specific SAE features that were highly effective at detecting deception out of distribution. However, attempting to use these features to steer an Impostor agent toward honesty failed. Instead of making the agent more truthful, the steering intervention caused it to hallucinate and produce nonsensical outputs.

优缺点分析

Strength

Interpretability of agent deception is novel.
A crucial aspect of the setup is that the LLM agents were only given the rules of the game. They were not prompted with specific deceptive personalities or given in-context examples of how to lie, forcing deceptive strategies to emerge naturally from the game's objectives. This is likely a more accurate (although not the best) set up to study deception in the wild.
The authors evaluate 18 proprietary and open-weight LLMs.

Weakness

The paper claims that ‘Deceptive behavior emerges in impostors naturally from the game rules’ without explicit prompting for deception. However, the system prompt provided to the Impostor agent is highly directive: "You are an Impostor... Your mission is to eliminate Crewmates… During the task phase, you can perform fake tasks". Words like “perform fake tasks” are explicitly prompting for deception. This isn't a neutral environment where deception emerges. it is a game setup where deception is an instrumentally rational, and often necessary, strategy to fulfill an explicit objective. The behavior observed is better described as goal-directed instrumental deception rather than a spontaneously emergent phenomenon.

Furthermore, the authors rightly acknowledges the limitation that "it is likely that models have seen some Among Us data during pre-training". This is an important, and understated weakness. If a model has ingested data from forums, game streams, and articles about Among Us, even if the authors do not explicitly prompt the agents with specific deceptive personalities or given in-context examples of how to lie, the internet or training data might INDIRECTLY prompted the model to do so.

问题

Is it possible that a probe could achieve high AUROC simply by detecting the activation patterns associated with this "Impostor context" rather than the act of lying or misleading itself? The linear probes and SAE features may not be detecting the cognitive act of deception, but the model's internal representation of its assigned role (system prompt). The system prompts are different for Impostors and Crewmates, and the Impostor prompt contains words like "kill," "eliminate," and "fake tasks." Existing works in the literature show that different system prompts induce very different representations in the residual stream activation and simple PCA would reveal two distinct clusters for the different system prompts.

To address this, the authors might need to find an environment where the incentive to deceive is not perfectly aligned with a fixed, binary role assignment. For example, the authors could consider exploring games like Blood on the Clocktower (BOTC). Players are secretly on the "Good" or "Evil" team. Importantly, the game has a complex and asymmetric information structure, driven by a large number of unique roles. It prevents the role/deception confounds in a way that creates a far richer setting beyond binary difference. BOTC breaks the simplistic dichotomy where "Evil = Deceiver" and "Good = Truth-teller." In BOTC, deception is a critical strategy for both teams. Pro-Social Deception (Good players lying): Many Good roles are incentivized to lie. For example, the Mayor is a powerful Good role, but if they reveal their identity, the evil Demon will kill them immediately. An optimal Mayor will therefore spend much of the game "bluffing" as a less important role, actively lying about their identity for the ultimate benefit of the Good team. It elegantly solves the role confound by decoupling deception from team alignment.

Can you find other games where deception is a strategic choice driven by circumstance and individual goals, rather than a fixed requirement of one's team alignment given through prompting? Diplomacy could be another candidate for study. Diplomacy could be the best sandbox for decoupling deception from group alignment because there is no team alignment.

局限性

yes

格式问题

Not applicable.

作者回复

2025-07-31

We thank the reviewer for their thoughtful and constructive feedback on our work. We address the comments and questions below:

Regarding the claim about natural emergence of deception:
The reviewer correctly points out that the system prompt contains phrases like "perform fake tasks." However, these phrases are part of the game’s rules and only mention what’s possible — they do not suggest the model take any particular action.

We fully agree with the reviewer that the deception presented here is better characterized as “goal-directed instrumental deception” than as “spontaneously emergent”, and will incorporate this into the camera-ready version, rather than using “natural.” We were trying to convey that unlike prior work, we do not provide specific deceptive strategies or examples — the models must devise their own ways to be deceptive within the game framework, to achieve their goals (see Figure 1).

Regarding the presence of Among Us in the pre-training data:
We agree this is an important concern. We duly acknowledge this in the discussion section:

"... it is likely that models have seen some Among Us data during pre-training and are aware of the game and that deception helps in it. Since training data is proprietary for most frontier models, it is difficult to figure out how much of this happens in practice [41]. However, the open-ended ways to be deceptive are still specific to each game’s context and situation, which makes it an important sandbox to study for open-ended deception."

We agree with the reviewer that the model may also have seen certain “deceptive personalities” in the training data from the internet. We believe that most personas that LLMs exhibit can be found in the data, but the ways in which they elicit them in specific situations remain novel and interesting.

Overall, we’re still asking the model to be deceptive, but we’re doing so only when mediated by its knowledge about the game, and only by providing a goal which requires deception as an instrumental side-effect — rather than instructing the model to lie about a particular thing.

Regarding the probe detecting role context vs. deception:
This is a very important question, and we have studied this in a lot of detail.

Firstly, we do not provide the “Impostor/Crewmate” context to our probes, and show that they still perform well (Figure 6b, “Action” purple line).
Secondly, we look at the tokens that lead to high scores for the probes (please see Figures 12, 13, 14, 15 in the Appendix for the token-level activations).

We believe that understanding how and under what contexts probes work well and improving them is an important future direction.

Regarding other deception games:
The reviewer's suggestion about games like Blood on the Clocktower (BOTC) or Diplomacy, where deception is decoupled from fixed role alignment, would indeed be a good signal. We will acknowledge this more prominently as valuable future directions in the camera-ready version. We discuss the relevance of our sandbox in Section 2.4 and the reasons why we chose this particular setting.

评论- Regarding the probe detecting role context vs. deception:

2025-08-05

I thank the authors for their response.

Regarding the probe detecting role context vs. deception:

I understand that you do not provide the “Impostor/Crewmate” context to our probes. However, those roles were given as system prompt for the model (agent) right? Therefore, model's activation will be very different given those different system prompt. Even without having the model to exhibit different actions, differences in system prompt alone is enough to drive strong performance for probe to classify different agents. My point is that the probe can easily pick up such difference in representation because of the system prompt might be the strongest distinguishing factor. Would be important to investigate if such signal from system prompt alone is the major driving factor for the probe's performance.

评论- Response to the comment on probes detecting role vs. deception

2025-08-08

We thank the reviewer for the detailed discussion and their comment. Here are our responses to the comments:

However, those roles were given as system prompt for the model (agent) right?

That's correct.

Therefore, model's activation will be very different given those different system prompt. Even without having the model to exhibit different actions, differences in system prompt alone is enough to drive strong performance for probes to classify different agents.

That seems right: because we're able to distinguish deceptive/truthful statements as they are uttered using linear probes, then the activations at those tokens must be fetching something from earlier that distinguishes Impostors and Crewmates.

Is your point that we should try to disentangle the following two hypotheses?:

The prompt is providing the strong detectable signal (via the word Impostor/Crewmate, or a short phrase that means the same, literally)
Some internal activation is providing the signal. (The internal activation is still derived from the prompt until now, but in a less direct way than a single role assignment word.)

On what's the major driving force of the probes (role in context vs. deception):

Firstly, we would like to answer this by pointing out that the probes do work really well on other, non-Among-Us datasets as well (please see Figure 7). Thus, they do seem to capture some generalizable sense of deception and not just the words “Impostor/Crewmate” from the system prompt.

We think that using Blood on The Clocktower (BOTC) is a good way to go partially towards the second hypothesis, but it's still the case that whether or not you're deceiving depends on your role and the amount of information you have about the characters of the other roles you’re talking to. We agree that a game where there’s no “role assignment” in the prompt would be a very good environment to test out.

最终决定Accept (spotlight)

2025-09-17

This paper introduces a sandbox based on Among Us to study open-ended strategic deception in LLMs, evaluating 18 models. The authors show that deception capability grows with model scale, particularly in RL-trained models, and that linear probes and SAEs can detect deception with strong generalization.

Reviewers reached a consenus that the work is novel, well-executed, and of clear relevance to AI safety. Strengths include the large-scale evaluation, strong empirical results, and the release of data and code. Concerns were also raised on whether deception is truly “emergent” given role-prompting, possible training data contamination, and overlap with prior work (e.g., Hoodwinked). The authors acknowledged these issues and clarified claims, though some concerns remain only partially addressed.

On balance, the reviewers agreed the contributions are substantive and timely. The AC recommends acceptance, with the suggestion that the camera-ready version carefully reframe claims around goal-directed instrumental deception, strengthen discussion of limitations, and refine clarity and organization as highlighted by reviewers.