Impact Index

83.7/100

Top 1.1%

Site-wide rank #680

Papers published: 28

Average rating: 5.6

Average output: 9.3 papers/year

Yinfei Yang

Researcher @ Apple · United States · OpenReview

Research Areas

Vision-Language Model · Representation Learning · Information Retrieval · Question Answering · Multilinguality

6.5

AToken: A Unified Tokenizer for Vision

ICLR 2026 · Withdrawn

Corresponding author

4.7

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

ICLR 2026 · Poster

4.5

Where Did the Reasoning Go Wrong? A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

ICLR 2026 · Withdrawn

4.0

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

ICLR 2026 · Rejected

4.0

Autoregressive Video Generation beyond Next Frames Prediction

ICLR 2026 · Withdrawn

Corresponding author

3.5

DeepMMSearch-R1: Empowering Multimodal LLMs in Multi-Modal Web Search

ICLR 2026 · Withdrawn

7.8

Contrastive Localized Language-Image Pre-Training

ICML 2025 · Poster

7.3

CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

NeurIPS 2025 · Spotlight

7.0

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

ICLR 2025 · Poster

Corresponding author

6.7

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

ICLR 2025 · Rejected

6.3

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

ICLR 2025 · Poster

6.3

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

COLM 2025 · Poster

6.0

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

ICLR 2025 · Poster

6.0

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

ICLR 2025 · Poster

Corresponding author

6.0

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

ICLR 2025 · Poster

Corresponding author

6.0

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

NeurIPS 2025 · Poster

4.5

Contrastive Localized Language-Image Pre-Training

ICLR 2025 · Rejected

4.3

Improve Vision Language Model Chain-of-thought Reasoning

Collaborators (20)

Yinfei Yang

AToken: A Unified Tokenizer for Vision

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Score Distillation of Flow Matching Models

UltraCUA: Scaling Computer Use Agent through GUI and Programmatic Control

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Where Did the Reasoning Go Wrong? A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

Autoregressive Video Generation beyond Next Frames Prediction

DeepMMSearch-R1: Empowering Multimodal LLMs in Multi-Modal Web Search

Contrastive Localized Language-Image Pre-Training

CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

Contrastive Localized Language-Image Pre-Training

Improve Vision Language Model Chain-of-thought Reasoning