Impact Index

94.05/100

Top 0.3%

Site-wide rank: #208

Papers published: 35

Average rating: 5.7

Average output: 11.7 papers/year

Neel Nanda

Researcher @ Google DeepMind · United States · OpenReview

Research areas

Mechanistic interpretability

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

ICLR 2026 · Poster

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

ICLR 2026 · Poster

Thought Branches: Interpreting LLM Reasoning Requires Resampling

ICLR 2026 · Poster

Emergent Misalignment is Easy, Narrow Misalignment is Hard

ICLR 2026 · Poster

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

ICLR 2026 · Poster

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

ICLR 2026 · Rejected

Thought Anchors: Which LLM Reasoning Steps Matter?

ICLR 2026 · Rejected

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

ICLR 2026 · Rejected

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

ICLR 2026 · Rejected

Real-Time Detection of Hallucinated Entities in Long-Form Generation

ICLR 2026 · Rejected

Base Models Know How to Reason, Thinking Models Learn When

ICLR 2026 · Withdrawn

Eliciting Secret Knowledge from Language Models

ICLR 2026 · Rejected

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

ICML 2025 · Poster

Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

NeurIPS 2025 · Poster

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

ICLR 2025 · Poster

Sparse Autoencoders Do Not Find Canonical Units of Analysis

ICLR 2025 · Poster

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

ICML 2025 · Poster

Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval

NeurIPS 2025 · Poster

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

ICML 2025 · Poster

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

ICML 2025 · Poster

Scaling Sparse Feature Circuits For Studying In-Context Learning

ICLR 2025 · Rejected

Interpreting Attention Layer Outputs with Sparse Autoencoders

ICLR 2025 · Rejected

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

ICLR 2025 · Rejected

Scaling Sparse Feature Circuits For Studying In-Context Learning

ICML 2025 · Poster

Collaborators (20)

Senthooran Rajamanoharan

Joseph Isaac Bloom

Callum Stuart McDougall

Oscar Balcells Obeso