David Krueger
~David_Krueger1
28
Total papers
14.0
Submissions per year
Average rating
Accepted: 17/28
Venue distribution
ICLR
18
NeurIPS
7
ICML
2
COLM
1
Papers (28)
2025 (18 papers)
4
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
NeurIPS 2025 Poster
4
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
ICLR 2025 Rejected
4
Distributional Training Data Attribution: What do Influence Functions Sample?
NeurIPS 2025 Spotlight
4
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
ICLR 2025 Rejected
6
Protecting against simultaneous data poisoning attacks
ICLR 2025 Poster
3
Adversarial Robustness of In-Context Learning in Transformers for Linear Regression
ICLR 2025 Rejected
3
Input Space Mode Connectivity in Deep Neural Networks
ICLR 2025 Poster
5
Detecting High-Stakes Interactions with Activation Probes
NeurIPS 2025 Poster
4
Interpreting Emergent Planning in Model-Free Reinforcement Learning
ICLR 2025 Oral
4
Mitigating Goal Misgeneralization via Minimax Regret
ICLR 2025 Rejected
3
Influence Functions for Scalable Data Attribution in Diffusion Models
ICLR 2025 Oral
4
Towards Interpreting Visual Information Processing in Vision-Language Models
ICLR 2025 Poster
4
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
ICLR 2025 Rejected
4
PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data
ICML 2025 Poster
4
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
ICML 2025 Poster
4
Towards Meta-Models for Automated Interpretability
ICLR 2025 Withdrawn
3
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
COLM 2025 Poster
6
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
ICLR 2025 Rejected
2024 (10 papers)
4
Predicting Future Actions of Reinforcement Learning Agents
NeurIPS 2024 Poster
4
Reward Model Ensembles Help Mitigate Overoptimization
ICLR 2024 Poster
4
Stress-Testing Capability Elicitation With Password-Locked Models
NeurIPS 2024 Poster
-
Delayed Generalization: Bridging Double Descent and Grokking
ICLR 2024 Withdrawn
4
Meta- (out-of-context) learning in neural networks
ICLR 2024 Rejected
4
BaDLoss: Backdoor Detection via Loss Dynamics
ICLR 2024 Rejected
4
Interpreting Learned Feedback Patterns in Large Language Models
NeurIPS 2024 Poster
4
A Generative Model of Symmetry Transformations
NeurIPS 2024 Poster
3
Towards Meta-Models for Automated Interpretability
ICLR 2024 Rejected
3
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
ICLR 2024 Poster