Impact index

43.55/100

Top 11.9%

Site-wide rank: #7,669

Papers published: 8

Average rating: 5.1

Average output: 2.7 papers/year

Francis Rhys Ward

PhD student @ Imperial College London · OpenReview

Research areas

Deception · causality · game theory · alignment · reinforcement learning from human feedback

Password-Activated Shutdown Protocols for Misaligned Frontier Agents

ICLR 2026 · Rejected

How does information access affect LLM monitors' ability to detect sabotage?

ICLR 2026 · Rejected

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

NeurIPS 2025 · Spotlight

The Elicitation Game: Evaluating Capability Elicitation Techniques

ICML 2025 · Poster

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

ICLR 2025 · Poster

Collaborators (20)

Felix Hofstätter

Teun van der Weij

Samuel F. Brown

Rohan Subramani

Raja Mehta Moreno