Impact Index

98/100

Top 0.1%

Site-wide rank #53

Papers published: 56

Average rating: 5.5

Average annual output: 18.7 papers/year

Graham Neubig

Associate Professor @ Carnegie Mellon University · United States · OpenReview

Research areas

natural language processing · machine learning · large language models · agents

EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

OpenAgentSafety: A Comprehensive Framework For Evaluating Real-World AI Agent Safety

ICLR 2026 · Poster

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

ICLR 2026 · Poster

Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering

ICLR 2026 · Poster

Prompt-MII: Meta-Learning Instruction Induction for LLMs

ICLR 2026 · Poster

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

ICLR 2026 · Desk Rejected

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

ICLR 2026 · Poster

How can we assess human-agent interactions? Case studies in software agent design

ICLR 2026 · Rejected

PPTArena: A Benchmark for Computer-Use Agents on PowerPoint Tasks

ICLR 2026 · Rejected

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

ICLR 2026 · Rejected

Go-Browse: Training Web Agents with Structured Exploration

ICLR 2026 · Poster

MIP-Bench: Can LLMs Implicitly Personalize Responses Using Long-Term Memory?

ICLR 2026 · Rejected

TOM-SWE: User Mental Modeling For Software Engineering Agents

ICLR 2026 · Rejected

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

ICLR 2026 · Poster

Accumulating Context Changes the Beliefs of Language Models

ICLR 2026 · Rejected

Shepherd: Pattern-Guided Trajectory Selection for Coding Agents on SWE-Bench

ICLR 2026 · Rejected

TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

ICLR 2026 · Withdrawn

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

ICLR 2026 · Withdrawn

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

ICLR 2026 · Rejected

Midtraining Bridges Pretraining and Posttraining Distributions

ICLR 2026 · Rejected

Better Instruction-Following Through Minimum Bayes Risk

ICLR 2025 · Spotlight

Checklists Are Better Than Reward Models For Aligning Language Models

NeurIPS 2025 · Spotlight

Overtrained Language Models Are Harder to Fine-Tune

ICML 2025 · Poster

Demystifying Long Chain-of-Thought Reasoning

ICML 2025 · Poster

M-Prometheus: A Suite of Open Multilingual LLM Judges

COLM 2025 · Poster

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

ICLR 2025 · Poster

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

ICLR 2025 · Poster

Inducing Programmatic Skills for Agentic Tasks

COLM 2025 · Poster

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

COLM 2025 · Poster

Harnessing Webpage UIs for Text-Rich Visual Understanding

ICLR 2025 · Poster

Repetition Improves Language Model Embeddings

ICLR 2025 · Poster

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

ICLR 2025 · Rejected

FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios

COLM 2025 · Poster

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

ICLR 2025 · Rejected

Agent Workflow Memory

ICLR 2025 · Rejected

RAGGED: Towards Informed Design of Scalable and Stable RAG Systems

ICML 2025 · Poster

Training Task Experts through Retrieval Based Distillation

ICLR 2025 · Withdrawn

Training Software Engineering Agents and Verifiers with SWE-Gym

ICML 2025 · Poster

Agent Workflow Memory

ICML 2025 · Poster

Beyond Browsing: API-Based Web Agents

ICLR 2025 · Withdrawn

Collaborators (20)

Zora Zhiruo Wang