Impact Index

65.75/100

Top 3.4%

Site-wide rank: #2,211

Papers published: 17

Average rating: 5.6

Average annual output: 5.7 papers/year

Ethan Perez

Researcher @ Anthropic · United States · OpenReview

Research Areas

AI Safety

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

ICLR 2026 · Poster

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

ICLR 2026 · Poster

Unsupervised Elicitation of Language Models

ICLR 2026 · Rejected

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

ICLR 2025 · Poster

Best-of-N Jailbreaking

NeurIPS 2025 · Poster

Looking Inward: Language Models Can Learn About Themselves by Introspection

ICLR 2025 · Poster

Language Models Learn to Mislead Humans via RLHF

ICLR 2025 · Poster

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

ICLR 2025 · Poster

Quantifying Elicitation of Latent Capabilities in Language Models

NeurIPS 2025 · Poster

Rapid Response: Mitigating LLM Jailbreaks With A Few Examples

ICLR 2025 · Rejected

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

ICLR 2025 · Rejected

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

ICLR 2025 · Withdrawn

Plan B: Training LLMs to fail less severely

ICLR 2025 · Withdrawn

Attacking Audio Language Models with Best-of-N Jailbreaking

ICLR 2025 · Rejected

Collaborators (20)

Collaborator · 11 papers

Rylan Schaeffer