Jacob Steinhardt
~Jacob_Steinhardt1
26
Total papers
13.0
Submissions per year
Average rating
Accepted: 17/26
Venue distribution
ICLR
18
ICML
5
NeurIPS
3
Papers (26)
2025 (18 papers)
6
Which Attention Heads Matter for In-Context Learning?
ICML 2025 Poster
5
Which Attention Heads Matter for In-Context Learning?
ICLR 2025 Rejected
4
Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts
ICML 2025 Poster
4
Interpreting the Second-Order Effects of Neurons in CLIP
ICLR 2025 Poster
4
Monitoring Latent World States in Language Models with Propositional Probes
ICLR 2025 Spotlight
4
Uncovering Gaps in How Humans and LLMs Interpret Subjective Language
ICLR 2025 Spotlight
4
Adversaries Can Misuse Combinations of Safe Models
ICML 2025 Poster
4
Adversaries Can Misuse Combinations of Safe Models
ICLR 2025 Rejected
4
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
ICLR 2025 Spotlight
4
Teaching LLMs to Decode Activations Into Natural Language
ICLR 2025 Rejected
4
What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?
ICML 2025 Poster
6
SmartBackdoor: Malicious Language Model Agents that Avoid Being Caught
ICLR 2025 Withdrawn
4
LLM Layers Immediately Correct Each Other
NeurIPS 2025 Poster
4
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
ICLR 2025 Poster
4
Pre-Memorization Train Accuracy Reliably Predicts Generalization in LLM Reasoning
ICLR 2025 Rejected
4
Language Models Learn to Mislead Humans via RLHF
ICLR 2025 Poster
3
Eliciting Language Model Behaviors with Investigator Agents
ICML 2025 Poster
4
Evaluating Model Robustness Against Unforeseen Adversarial Attacks
ICLR 2025 Rejected
2024 (8 papers)
4
How do Language Models Bind Entities in Context?
ICLR 2024 Poster
3
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
ICLR 2024 Spotlight
4
Interpreting CLIP's Image Representation via Text-Based Decomposition
ICLR 2024 Oral
4
Explainable, Steerable Models with Natural Language Parameters and Constraints
ICLR 2024 Rejected
3
Explaining Datasets in Words: Statistical Models with Natural Language Parameters
NeurIPS 2024 Poster
4
Approaching Human-Level Forecasting with Language Models
NeurIPS 2024 Poster
3
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
ICLR 2024 Rejected
4
Evaluating Robustness to Unforeseen Adversarial Attacks
ICLR 2024 Rejected