Impact index

57.77/100

Top 5.4%

Site-wide rank #3,476

Papers published: 17

Average rating: 5.1

Average annual output: 5.7 papers/year

Yunhang Shen

Researcher @ Tencent · China · OpenReview

Research interests

large multimodal models · object detection · image segmentation · weakly supervised learning · semi-supervised learning

5.5

Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

ICLR 2026 Poster

5.5

VITA-E: A Dual-Model Framework for Real-Time, Interruptible, and Concurrent Human-Robot Interaction

ICLR 2026 Rejected

4.8

FlexibleLLM: Making Low-Bit Quantization for Large Language Models More Flexible and Efficient

ICLR 2026 Rejected

4.5

DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

ICLR 2026 Rejected

Third author

4.0

LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

ICLR 2026 Withdrawn

4.0

Breaking the Bias: Quantifying the Attention of Industrial Anomaly Detection

ICLR 2026 Withdrawn

4.0

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

ICLR 2026 Withdrawn

2.5

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

NeurIPS 2025 Poster

Second author

6.4

FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification

ICML 2025 Poster

Third author

6.4

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

NeurIPS 2025 Poster

6.0

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

ICLR 2025 Poster

Third author

5.5

DS-VLM: Diffusion Supervision Vision Language Model

ICML 2025 Poster

Second author

4.0

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Collaborators (20)

Yunhang Shen

Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

VITA-E: A Dual-Model Framework for Real-Time, Interruptible, and Concurrent Human-Robot Interaction

FlexibleLLM: Making Low-Bit Quantization for Large Language Models More Flexible and Efficient

DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE

LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

Breaking the Bias: Quantifying the Attention of Industrial Anomaly Detection

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Pseudo-Label Supervision in Unsupervised Industrial Anomaly Detection

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

DS-VLM: Diffusion Supervision Vision Language Model

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM