Dylan Hadfield-Menell

Associate Professor @ Massachusetts Institute of Technology · United States · OpenReview

Research Interests

Value Alignment · Inverse Reinforcement Learning · Preference Elicitation · Human-Robot Interaction · Sequential Decision Making · Planning · Motion Planning · Markov Decision Processes · Reinforcement Learning

5.0

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

ICLR 2025 · Rejected

3.5

Altared Environments: The Role of Normative Infrastructure in AI Alignment

ICLR 2025 · Rejected

3.0

Inverse Prompt Engineering for Task-Specific LLM Safety

Collaborators (20)

Dylan Hadfield-Menell

Activation Steering via Contrastive Causal Mediation

Steering Vector Transfer via Orthonormal Transformations and Semantic Pairing

Leveraging Sparse Autoencoders for Passive Scoping

Diverse Preference Learning for Capabilities and Alignment

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Altared Environments: The Role of Normative Infrastructure in AI Alignment

Inverse Prompt Engineering for Task-Specific LLM Safety