Neel Nanda
~Neel_Nanda1
23
Total papers
11.5
Submissions per year
Average rating
Accepted: 16/23
Venue distribution
ICLR
12
NeurIPS
6
ICML
5
Papers (23)
2025: 13 papers
4
Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
ICML 2025 Poster
4
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
ICLR 2025 Poster
4
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
ICML 2025 Poster
4
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
ICLR 2025 Oral
4
Scaling Sparse Feature Circuits For Studying In-Context Learning
ICLR 2025 Rejected
4
Scaling Sparse Feature Circuits For Studying In-Context Learning
ICML 2025 Poster
4
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
NeurIPS 2025 Poster
4
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
ICML 2025 Poster
3
Interpreting Attention Layer Outputs with Sparse Autoencoders
ICLR 2025 Rejected
4
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
NeurIPS 2025 Poster
4
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
ICLR 2025 Rejected
4
Sparse Autoencoders Do Not Find Canonical Units of Analysis
ICLR 2025 Poster
4
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
ICML 2025 Poster
2024: 10 papers
3
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
ICLR 2024 Poster
5
Neuron to Graph: Interpreting Language Model Neurons at Scale
ICLR 2024 Rejected
3
Language Models Linearly Represent Sentiment
ICLR 2024 Rejected
4
Transcoders Find Interpretable LLM Feature Circuits
NeurIPS 2024 Poster
4
Summing Up the Facts: Additive Mechanisms behind Factual Recall in LLMs
ICLR 2024 Rejected
3
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
ICLR 2024 Poster
4
Copy Suppression: Comprehensively Understanding an Attention Head
ICLR 2024 Rejected
4
Confidence Regulation Neurons in Language Models
NeurIPS 2024 Poster
4
Refusal in Language Models Is Mediated by a Single Direction
NeurIPS 2024 Poster
4
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders
NeurIPS 2024 Poster