Impact Index

65.35/100

Top 3.5%

Site-wide rank: #2,260

Papers published: 18

Average rating: 5.4

Average output: 6.0 papers/year

Henry Sleight

Researcher @ Constellation · United States · OpenReview

6.7

All Code, No Thought: Language Models Struggle to Reason in Ciphered Language

ICLR 2026 Poster

Second author

5.5

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

ICLR 2026 Poster

Third author

5.5

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

ICLR 2026 Rejected

4.5

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

ICLR 2026 Rejected

4.5

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

ICLR 2026 Desk Rejected

Second author

7.0

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

ICLR 2025 Poster

6.8

Best-of-N Jailbreaking

NeurIPS 2025 Poster

6.5

Looking Inward: Language Models Can Learn About Themselves by Introspection

ICLR 2025 Poster

6.3

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

ICLR 2025 Poster

6.0

Quantifying Elicitation of Latent Capabilities in Language Models

NeurIPS 2025 Poster

5.8

Rapid Response: Mitigating LLM Jailbreaks With A Few Examples

ICLR 2025 Rejected

Third author

4.8

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

ICLR 2025 Rejected

4.0

Plan B: Training LLMs to fail less severely

ICLR 2025 Withdrawn

3.7

Attacking Audio Language Models with Best-of-N Jailbreaking

Collaborators (20)

Henry Sleight

All Code, No Thought: Language Models Struggle to Reason in Ciphered Language

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Unsupervised Elicitation of Language Models

Abstractive Red-Teaming of Language Model Character

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Best-of-N Jailbreaking

Looking Inward: Language Models Can Learn About Themselves by Introspection

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Quantifying Elicitation of Latent Capabilities in Language Models

Rapid Response: Mitigating LLM Jailbreaks With A Few Examples

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Plan B: Training LLMs to fail less severely

Attacking Audio Language Models with Best-of-N Jailbreaking