Impact Index

71.04/100

Top 2.4%

Site-wide rank: #1,573

Papers published: 16

Average rating: 5.4

Average annual output: 5.3 papers/year

Samuel Marks

Researcher @ Anthropic · United States · OpenReview

Research areas

interpretability · large language models · model editing

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

ICLR 2026 Poster

Liars' Bench: Evaluating Deception Detectors for AI Assistants

ICLR 2026 Rejected

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

ICLR 2026 Rejected

Unsupervised Elicitation of Language Models

ICLR 2026 Rejected

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

ICLR 2026 Rejected

Eliciting Secret Knowledge from Language Models

ICLR 2026 Rejected

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

ICLR 2026 Rejected

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

ICLR 2025 Poster

Erasing Conceptual Knowledge from Language Models

NeurIPS 2025 Poster

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

ICML 2025 Poster

Erasing Conceptual Knowledge from Language Models

ICLR 2025 Rejected

Collaborators (20)

Postdoc advisor · 5 papers

Senthooran Rajamanoharan