Impact Index

91.62/100

Top 0.5%

Site-wide rank: #297

Papers published: 28

Average rating: 5.9

Average output: 9.3 papers/year

Jacob Steinhardt

Assistant Professor @ University of California, Berkeley · OpenReview

Research areas

theory · science · value learning · human-compatible AI · adversarial examples · security · robustness

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

ICLR 2026 Poster

Understanding In-context Learning of Addition via Activation Subspaces

ICLR 2026 Rejected

Eliciting Language Model Behaviors with Investigator Agents

ICML 2025 Poster

Monitoring Latent World States in Language Models with Propositional Probes

ICLR 2025 Spotlight

Uncovering Gaps in How Humans and LLMs Interpret Subjective Language

ICLR 2025 Spotlight

Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision

ICLR 2025 Spotlight

Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts

ICML 2025 Poster

LLM Layers Immediately Correct Each Other

NeurIPS 2025 Poster

Interpreting the Second-Order Effects of Neurons in CLIP

ICLR 2025 Poster

What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?

ICML 2025 Poster

Language Models Learn to Mislead Humans via RLHF

ICLR 2025 Poster

Which Attention Heads Matter for In-Context Learning?

ICML 2025 Poster

Teaching LLMs to Decode Activations Into Natural Language

ICLR 2025 Rejected

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models

ICLR 2025 Poster

Adversaries Can Misuse Combinations of Safe Models

ICML 2025 Poster

Evaluating Model Robustness Against Unforeseen Adversarial Attacks

ICLR 2025 Rejected

Which Attention Heads Matter for In-Context Learning?

ICLR 2025 Rejected

Adversaries Can Misuse Combinations of Safe Models

ICLR 2025 Rejected

Pre-Memorization Train Accuracy Reliably Predicts Generalization in LLM Reasoning

ICLR 2025 Rejected

SmartBackdoor: Malicious Language Model Agents that Avoid Being Caught

ICLR 2025 Withdrawn

Collaborators (20)