Impact Index

69.21/100

Top 2.8%

Site-wide rank #1,795

Papers published: 23

Average rating: 4.7

Average output: 7.7 papers/year

Fazl Barez

Principal Researcher @ University of Oxford · United Kingdom · OpenReview

Research Areas

Mechanistic Interpretability · Safety · Alignment · AI Governance · AI Ethics

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

ICLR 2026 Poster

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

ICLR 2026 Poster

Query Circuits: Explaining How Language Models Answer User Prompts

ICLR 2026 Rejected

Measuring Sparse Autoencoder Feature Space Similarities Across Large Language Models

ICLR 2026 Withdrawn

Understanding Addition and Subtraction in Transformers

ICLR 2026 Rejected

VAL-Bench: Measuring value alignment in Language Models

ICLR 2026 Rejected

Chain-of-Thought Hijacking

ICLR 2026 Desk Rejected

Towards Interpreting Visual Information Processing in Vision-Language Models

ICLR 2025 Poster

Best-of-N Jailbreaking

NeurIPS 2025 Poster

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

COLM 2025 Poster

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

ICLR 2025 Rejected

Scaling Sparse Feature Circuits For Studying In-Context Learning

ICLR 2025 Rejected

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

ICLR 2025 Rejected

PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data

ICML 2025 Poster

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

ICLR 2025 Rejected

Plan B: Training LLMs to fail less severely

ICLR 2025 Withdrawn

Attacking Audio Language Models with Best-of-N Jailbreaking

ICLR 2025 Rejected

Collaborators (20)

Postdoc advisor: 9 papers

PhD advisor: 4 papers