ICLR 2025
VeLAR: Vision-oriEnted Language-Attentive token Reduction for multimodal large language models
TL;DR
We propose a token reduction framework for MLLMs that reduces vision token redundancy in vision-language learning, cutting computational costs by up to 42% while maintaining, and in some cases surpassing, the original model's performance.
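Since the page provides no method details (the abstract is unavailable), the sketch below is only a generic illustration of language-attentive vision token reduction in the spirit of the title: vision tokens are scored by the cross-attention they receive from language tokens, and only the top-scoring fraction is kept. The function name `prune_vision_tokens`, the pooling choice, and the `keep_ratio` value are all assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of language-attentive vision token reduction.
# Not the paper's method; an illustrative assumption throughout.
import torch


def prune_vision_tokens(vision_tokens: torch.Tensor,
                        text_tokens: torch.Tensor,
                        keep_ratio: float = 0.58) -> torch.Tensor:
    """Keep the vision tokens that receive the most attention from text.

    vision_tokens: (num_vision, dim) vision token embeddings.
    text_tokens:   (num_text, dim) language token embeddings.
    keep_ratio:    fraction of vision tokens retained (0.58 would roughly
                   match the claimed "up to 42%" cost cut, assuming cost
                   scales linearly with token count -- an assumption).
    """
    dim = vision_tokens.size(-1)
    # Cross-attention scores: text tokens as queries, vision tokens as keys.
    scores = (text_tokens @ vision_tokens.T) / dim ** 0.5  # (num_text, num_vision)
    attn = scores.softmax(dim=-1)
    # Average over text tokens to get one relevance score per vision token.
    relevance = attn.mean(dim=0)  # (num_vision,)
    k = max(1, int(keep_ratio * vision_tokens.size(0)))
    # Keep the top-k vision tokens, restoring their original order.
    keep = relevance.topk(k).indices.sort().values
    return vision_tokens[keep]


# Example: 576 CLIP-style vision tokens, 32 text tokens, hidden dim 1024.
vision = torch.randn(576, 1024)
text = torch.randn(32, 1024)
print(prune_vision_tokens(vision, text).shape)  # torch.Size([334, 1024])
```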
Abstract
Keywords
Multi-modal Large Language Models, Token Reduction, Model Acceleration, Foundation Models, Vision-Language Learning, Instruction Tuning
Reviews and Discussion
Author Withdrawal Notice
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.