Impact Index

81.3/100

Top 1.3%

Site-wide rank: #810

Papers published: 24

Average rating: 5.5

Average annual output: 8.0 papers/year

Zhe Gan

Principal Researcher @ Apple · United States · OpenReview

Research Areas

deep learning · vision and language · deep generative models

Scaling Synthetic Task Generation for Agents via Exploration

ICLR 2026 · Poster

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

ICLR 2026 · Poster

UltraCUA: Scaling Computer Use Agent through GUI and Programmatic Control

ICLR 2026 · Desk Rejected

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

ICLR 2026 · Rejected

Where Did the Reasoning Go Wrong? A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

ICLR 2026 · Withdrawn

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

ICLR 2026 · Rejected

DeepMMSearch-R1: Empowering Multimodal LLMs in Multi-Modal Web Search

ICLR 2026 · Withdrawn

Contrastive Localized Language-Image Pre-Training

ICML 2025 · Poster

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

ICLR 2025 · Poster

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

ICLR 2025 · Rejected

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

ICLR 2025 · Poster

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

COLM 2025 · Poster

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

ICLR 2025 · Poster

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

ICLR 2025 · Poster

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

ICLR 2025 · Poster

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

ICLR 2025 · Rejected

Contrastive Localized Language-Image Pre-Training

ICLR 2025 · Rejected

Improve Vision Language Model Chain-of-thought Reasoning

ICLR 2025 · Withdrawn

Pixelated Instructions: Can Multimodal Large Language Models Follow Printed Instructions in Images?

ICLR 2025 · Rejected

Collaborators (20)