PaperHub
Overall rating: 5.0 / 10 (Poster; 4 reviewers; min 4, max 7, std 1.2)
Individual ratings: 4, 5, 4, 7
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.3
NeurIPS 2024

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Abstract

Keywords
Online video understanding; efficient modeling

Reviews and Discussion

Official Review
Rating: 4

This paper aims to comprehend long video streams with multi-modal large language models. Existing works use too few visual tokens to represent the video stream in order to guarantee efficiency, which may sacrifice visual perception performance. The authors propose to keep more visual tokens per video frame but select only the crucial ones to pass through the transformer decoder layers, reducing computation. The experiments show that the proposed mixture-of-depths architecture achieves a good trade-off between the number of visual tokens and computational efficiency.

Strengths

  1. The explored problem is meaningful. Handling long videos with large language models is promising and has wide applications.
  2. The motivation for using different depths of networks to process visual tokens is clear and makes sense.
  3. The proposed method is simple and straightforward, and can effectively speed up the long video context processing without reducing the token number.

Weaknesses

  1. The presentation of Section 3.1 and Figure 6 is quite similar to VideoLLM-online [8]. Also, in Eq. 1, why are the indicator and probability terms multiplied inside the logarithm?
  2. The quantitative performance improvement over VideoLLM-online is quite marginal. And even the full computation only results in slight performance gains on some datasets. What is the underlying reason? Does it mean preserving very few visual tokens is sufficient?
  3. Some visualizations are redundant; e.g., Figure 1 and Figure 5 are very similar. There is no visualization of the selected visual tokens at different layers; such a visualization is necessary to show what the learned LayerExpert considers crucial.

Questions

It is better to also compare with some video LLM works that compress each frame into fewer tokens, e.g., LLaMA-VID [33] in terms of both performance and efficiency.

Limitations

The authors have discussed the limitation of lacking experiments on exocentric data.

Author Response

Many thanks for your insightful feedback and valuable suggestions.

W1: The problem of the indicator.

Thanks for pointing this out. We followed VideoLLM-online here, which contains this error. The revised equation is as follows:

$$L = \frac{1}{N}\sum_{j=1}^{N}\left(-\,l_{j+1}\log P_j^{[\text{Txt}_{j+1}]} \;-\; \sigma\, s_j\log P_j^{[\text{EOS}]}\right)$$
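For readers' convenience, here is a minimal PyTorch sketch of this corrected objective; the tensor names (`txt_logprob`, `eos_logprob`, `l_next`, `s_mask`) and the interface are illustrative assumptions, not the paper's implementation.

```python
import torch

def streaming_loss(txt_logprob, eos_logprob, l_next, s_mask, sigma=1.0):
    """Hypothetical sketch of the corrected objective above.

    txt_logprob: (N,) log P_j^{[Txt_{j+1}]} at each position j
    eos_logprob: (N,) log P_j^{[EOS]} at each position j
    l_next:      (N,) indicator l_{j+1}, 1 if a text token should be predicted at j
    s_mask:      (N,) streaming indicator s_j, 1 if the model should stay silent (predict EOS)
    sigma:       weight on the streaming (EOS) term
    """
    per_pos = -l_next * txt_logprob - sigma * s_mask * eos_logprob
    return per_pos.mean()  # (1/N) * sum over the N positions
```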

W2: The reason for the marginal improvements over VideoLLM-online and the similar results under full computation.

The online narration benchmark only requires generating simple time-synchronized narrations, such as “You are riding the bike,” without needing detailed descriptions that require fine-grained visual perception. Therefore, the performance improvement with increased resolution is marginal.

However, experiments on other benchmarks, particularly those requiring fine-grained visual perception, demonstrate the necessity and benefit of increased resolution. As shown in Table 5b, for example, Ego Accuracy increased from 40.53% to 44.85% on the EgoExo4D Fine-grained Keystep Recognition benchmark, indicating the potential of our approach for high-resolution vision tasks.

W3: Some visualizations are redundant, e.g., Figure 1 and Figure 5. There lacks the visualization on the selected visual tokens in different layers. It is necessary to show what is crucial in the learned LayerExpert.

We will reorganize our visualizations in the revised version. We also visualized the selected vision tokens learned by LayerExpert, as shown in Global-Rebuttal Figure 1. LayerExpert effectively focuses on critical vision tokens, such as bike-related tokens in Fig. 1a, tokens related to slicing onions in Fig. 1b, and tokens related to a table saw in Fig. 1c; the model also attends to different tokens at different layers, as shown in Fig. 1d.

Q1: Compare with some video LLM works that compress each frame into fewer tokens, e.g., LLaMA-VID [33] in terms of both performance and efficiency.

We compare the performance and efficiency of our approach with LLaMA-VID [1], as shown in Global-Rebuttal Tables 1-4. Our approach achieves comparable performance to LLaMA-VID while requiring only 0.57x FLOPs and 0.2x training time.

| Method | TFLOPs | Training Cost (Pretrain + Finetune) | GQA | MME | POPE | SQA |
|---|---|---|---|---|---|---|
| LLaMA-VID | 9.8 | 4 hrs + 48.5 hrs | 64.3 | 1521.4 | 86.0 | 68.3 |
| VideoLLM-MoD | 5.8 | 2 hrs + 8.5 hrs | 62.8 | 1505.5 | 85.5 | 70.2 |

| Method | TFLOPs | Training Cost (Pretrain + Finetune) | MSVD-QA Acc | MSVD-QA Score | MSRVTT-QA Acc | MSRVTT-QA Score | ActivityNet-QA Acc | ActivityNet-QA Score |
|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | 40.1 | 9 hrs + 30 hrs | 69.7 | 3.7 | 57.7 | 3.2 | 47.4 | 3.3 |
| VideoLLM-MoD | 23.0 | 4 hrs + 5.5 hrs | 68.5 | 3.7 | 58.2 | 3.3 | 46.3 | 3.2 |

We highlight the advantages of our proposed approach over existing efficient vision modeling methods in the LMMs field:

  1. Utilizing Fewer, Semantic-Aware Tokens: Semantic-aware token selection typically requires cross-attention-based modeling, which is computationally expensive when processing every frame. As shown in Global-Rebuttal Tables 1-2, LLaMA-VID [1] uses text features extracted from a q-former to select semantic-aware visual tokens via context-attention, resulting in significantly higher training costs (52.5 hours vs. our 10.5 hours) with only marginal performance improvement.
  2. Efficient Inference: High training costs are a significant issue for LMMs, particularly for video-based LMMs, as video consumes the majority of tokens, as indicated in Figure 7 of our paper. Unlike existing methods [2,3] that focus on efficient inference while still requiring high training costs, we successfully reduce both training and inference costs massively while maintaining LMMs’ performance. Our approach provides a new paradigm for other vision-language tasks, especially in training video-based LLMs, and plays a role in democratizing AI by enabling broader access to trained models without the need for large resources.
  3. No Offline Spatial-Temporal Token Merging: Online video LLMs must process every incoming visual token in real time and cannot perform frame sampling or offline token merging as Chat-UniVi [4] does. We are the first to explore efficient vision modeling in context rather than merely offline pruning or merging of vision tokens.

L1: The authors have discussed the limitations in the lack of experiments on exocentric data.

In addition to extensive experiments on ego-centric datasets, we have conducted experiments on exo-centric COIN benchmarks, as shown in Paper Table 4. To further validate the effectiveness and generalization of our proposed method, we conducted experiments using the same training recipe as LLaVA/LLaMA-VID on standard image and video benchmarks, as shown in Global-Rebuttal Tables 1-4. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision-language tasks in the LMMs field.

[1] Yanwei Li et al. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. ECCV 2024.

[2] Liang Chen et al. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. ECCV 2024.

[3] Yuzhang Shang et al. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv: 2403.15388.

[4] Peng Jin et al. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. CVPR 2024.

Comment

Thanks for the author response. Some of my concerns still remain.

The objective of MOD is to address the challenge of numerous vision tokens in long-term and streaming videos. However, the experiments are not convincing.

  • On the one hand, the current experiments on streaming long videos show only marginal improvements, since these benchmarks require no fine-grained information, and thus fail to validate the effectiveness of the proposed MoD.
  • On the other hand, the proposed MoD shows higher efficiency but comparable or even worse performance than LLaMA-VID on some short video benchmarks, which are not long enough to verify the effectiveness.

I suggest the authors include results on some longer video benchmarks that require detailed understanding, such as EgoSchema [1] and Video-MME [2], for evaluation.

[1] Mangalam, Karttikeya, Raiymbek Akshulakov, and Jitendra Malik. "Egoschema: A diagnostic benchmark for very long-form video language understanding." Advances in Neural Information Processing Systems 36 (2024).

[2] Fu, Chaoyou, et al. "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv preprint arXiv:2405.21075 (2024).

Comment

Thanks for your considerable feedback and suggestions! We argue that our approach offers a general architecture that significantly reduces computational costs while maintaining performance for both training and inference in vision-language tasks, especially in scenarios that require processing a large number of vision tokens, such as long-term, dense video frame online streaming.

On the one hand, the current experiments on streaming long videos show only marginal improvements, since these benchmarks require no fine-grained information, and thus fail to validate the effectiveness of the proposed MoD.

  1. We validated the effectiveness of our approach on the Video-MME [1] benchmark. Our VideoLLM-MoD was trained using the same recipe as the initial submission, excluding the streaming loss, and with the same pretraining and finetuning data as LLaMA-VID [2]. Despite requiring 2.95x less training time than VideoLLaMA 2 (22 hrs vs. 65 hrs), our model still achieves top-tier performance and outperforms it.

    | Model | Frames | Training Cost | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
    |---|---|---|---|---|---|---|
    | VideoLLaMA 2-7B | 32 frames, 32 tokens/frame | 65 hrs (2.95x) | 47.9 | 56.0 | 45.4 | 42.1 |
    | Chat-UniVi-v1.5-7B | 64 frames, 112 tokens/frame | 53 hrs (2.41x) | 40.6 | 45.7 | 40.3 | 35.8 |
    | Video-LLaVA-7B | 8 frames, 49 tokens/frame | 60 hrs (2.73x) | 39.9 | 45.3 | 38.0 | 36.2 |
    | VideoLLM-MoD-8B | 1 fps, 10 tokens/frame | 22 hrs | 49.2 | 58.4 | 46.6 | 42.4 |

    Given the limited time available during the discussion period, we will add experiments on EgoSchema[3] once our work is released.

  2. Existing experiments on the Ego4D narration benchmark can validate the effectiveness of our proposed approach. As we claimed, our core contribution is reducing both the training and inference cost without sacrificing performance; we achieve comparable performance with a 1.7x training speedup (24 hrs -> 14 hrs), 0.6x training FLOPs, and a 1.7x longer inference context (830 s -> 1440 s).

    Besides, as claimed in Paper Lines 231-234, although Full-computation in Paper Table 1 also shows marginal improvements, we found that a larger vision resolution can indeed benefit performance, as shown in Figures 1 and 5 and in experiments that demand more detailed visual information (Tables 4 and 5).

  3. Our approach allows for a larger visual budget within the same total computation, leading to significant performance gains from additional visual tokens, as demonstrated in extensive ablations in LLaVA-NEXT[4] and LongVA[5]. Our method can be seen as a “free lunch” in increasing the vision resolution.

    Moreover, it is non-trivial to reduce computation in context under online video scenarios, as online VideoLLMs must process every incoming visual token without relying on frame sampling or offline token merging, as done in Chat-UniVi [6]. We are the first to explore efficient vision modeling in context, rather than solely focusing on offline pruning or merging of vision tokens.

Comment

On the other hand, the proposed MoD shows higher efficiency but comparable or even worse performance than LLaMA-VID on some short video benchmarks, which are not long enough to verify the effectiveness.

  1. It is worth noting that we used the same training recipe as LLaMA-VID during the previous rebuttal phase for a fair comparison. Our approach achieved comparable performance with significantly less training time and TFLOPs. However, our sparse architecture allows us to utilize far more visual tokens within the same computational budget. Specifically, we increased the number of visual tokens per frame (from CLS token + 1x1 average pooling to CLS token + 3x3 average pooling), resulting in substantial performance gains due to the higher vision resolution. Remarkably, the total training cost remained at just 0.56x that of LLaMA-VID.

    | Method | Training Cost (Pretrain + Finetune) & Speedup | MSVD-QA Acc | MSVD-QA Score | MSRVTT-QA Acc | MSRVTT-QA Score | ActivityNet-QA Acc | ActivityNet-QA Score | Correctness | Detail | Context | Temporal | Consistency |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | LLaMA-VID-7B | 9 hrs + 30 hrs | 69.7 | 3.7 | 57.7 | 3.2 | 47.4 | 3.3 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 |
    | VideoLLM-MoD-7B (1 fps, 2 tokens/frame) | 4 hrs + 5.5 hrs (0.25x) | 68.5 | 3.7 | 58.2 | 3.3 | 46.3 | 3.2 | 2.88 | 2.98 | 3.41 | 2.51 | 2.50 |
    | VideoLLM-MoD-8B (1 fps, 10 tokens/frame) | 8 hrs + 14 hrs (0.56x) | 78.5 | 3.9 | 65.3 | 3.6 | 53.4 | 3.4 | 3.12 | 3.16 | 3.75 | 2.44 | 3.65 |
  2. We demonstrate that our method generalizes well across extensive benchmarks. Here is a summary:

    4 Image Benchmarks: GQA, MME, POPE, SQA

    9 Video Benchmarks: Ego4D Narration, Ego4D LTA, EgoExo4D Fine-grained Keystep Recognition, COIN, MSVD-QA, MSRVTT-QA, ActivityNet-QA, VideoChatGPT, Video-MME

Thanks again for your feedback! We hope that our response can address your questions, and if you still have any concerns, we would be pleased to discuss them further with you.

[1] Fu, Chaoyou, et al. "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv preprint arXiv:2405.21075 (2024).

[2] Li, Yanwei, et al. "LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models." ECCV, 2024

[3] Mangalam, Karttikeya, Raiymbek Akshulakov, and Jitendra Malik. "EgoSchema: A diagnostic benchmark for very long-form video language understanding." NeurIPS 2024.

[4] Liu, Haotian, et al. "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge." (2024).

[5] Zhang, Peiyuan, et al. "Long context transfer from language to vision." arXiv preprint arXiv:2406.16852 (2024).

[6] Jin, Peng, et al."Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding." CVPR, 2024.

Comment

Dear reviewer x5JZ:

Thank you again for your thoughtful feedback! We hope the rebuttal and additional experiments we provided were helpful. If any residual concerns remain, we would be glad to discuss further. If no concerns remain, we would appreciate it if you re-evaluate our paper.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Paper10329

Comment

Dear Reviewer x5JZ,

We would like to express our sincere gratitude for the time and effort you spent reviewing our paper. As the author-reviewer discussion stage draws to a close, we are eager to hear whether our detailed response has sufficiently addressed your concerns. We would be honored to address any further questions you may have. We eagerly anticipate and highly value your re-evaluation of our paper.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Submission 10329

Official Review
Rating: 5

This paper proposes a novel layer skipping approach to reduce the computation and memory consumption in modern vision-language models. The overall performance is good on several egocentric video understanding benchmarks.

Strengths

  1. The proposed layer skipping strategy enables efficient attention computation while retaining the performance.
  2. Experimental results on multiple online/offline benchmark datasets demonstrate the effectiveness of the method.

Weaknesses

  1. This paper uses VideoLLM-online as the baseline and merely proposes a weighted layer skipping strategy. The technical contribution of this method is relatively limited.
  2. The choice of LayerExpert is not fully discussed. The authors claim that the proposed strategy can select critical visual tokens in each layer; it would be better if the authors included visualization results to support this claim. In Table 3, are there any semantic-aware token selection strategies that can be used for comparison?

Questions

  1. In Table 3, it is a little bit confusing why increasing the keep ratio to r=0.3 results in worse performance. Also, it would be good to see the performance comparison on more keep ratios.

  2. Could you please explain why the proposed method targets online video processing? It seems that LayerExpert could be adapted to a variety of vision transformers and applied to many video understanding tasks, as shown in Table 4.

Limitations

Please refer to Weaknesses and Questions.

Author Response

We sincerely appreciate your thoughtful feedback and valuable suggestions.

W1: The technical contribution of this method is relatively limited.

Our technical contributions are summarized as follows:

  1. Efficient Vision Modeling in Context: It is non-trivial to reduce computation in context under online video scenarios, as online VideoLLMs must process every incoming visual token without performing frame sampling or offline token merging, as done in Chat-UniVi [3]. We are the first to explore efficient vision modeling in context, rather than merely offline pruning or merging of vision tokens.
  2. Reduction of both Training and Inference Costs: High training costs are a significant issue for LMMs, particularly for video-based LMMs, since video consumes the majority of tokens, as indicated in Figure 7 of our paper. Unlike existing methods [1,2] that focus on efficient inference while still requiring high training costs, we are the first to successfully reduce both training and inference costs massively while maintaining LMMs’ performance. We believe our approach offers a new paradigm for other vision-language tasks, especially in training video-based LLMs.
  3. Generalization: To validate the effectiveness and generalization of our proposed method, we conducted experiments using the same training recipe as LLaVA/LLaMA-VID on standard image and video benchmarks, as shown in Global-Rebuttal Tables 1-4. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision-language tasks in the LMMs field.

The topic of sparse vision modeling in context has gained popularity, as evidenced by Google DeepMind’s recent research on MoNE [4], released a few days ago. While they adopt a similar approach, their exploration is limited to traditional vision architectures rather than the popular LMMs field. We believe our idea "the importance of different vision tokens should be considered in context, and can be modeled by computation budget" can provide valuable insights into general LMMs.

W2: The discussion on the choice of LayerExpert, the visualization to prove the claim, the semantic-aware token selection strategies for comparison.

Forcing LayerExpert to select only the important visual tokens across each transformer block can be viewed as encouraging LMMs to focus on the most useful regions, while it increases the learning difficulty. As shown in Global-Rebuttal Figure 1, we visualize the tokens selected by LayerExpert and observe that it indeed focuses on important visual regions, such as bike-related tokens in Fig. 1a, tokens related to slicing onions in Fig. 1b, and tokens related to a table saw in Fig. 1c.
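To make the mechanism concrete, below is a rough, self-contained PyTorch sketch of the routing idea (top-r token selection per layer with a residual skip for the rest); the class and all names are our assumptions for illustration, and the actual LayerExpert may differ in detail.

```python
import torch
import torch.nn as nn

class MoDVisionRouter(nn.Module):
    """Illustrative mixture-of-depths-style router for vision tokens (not the paper's code)."""

    def __init__(self, hidden_size: int, keep_ratio: float = 0.2):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # stands in for the LayerExpert scorer
        self.keep_ratio = keep_ratio

    def forward(self, vision_tokens: torch.Tensor, block: nn.Module) -> torch.Tensor:
        # vision_tokens: (B, T, D) vision tokens of a frame at the current decoder layer
        B, T, D = vision_tokens.shape
        k = max(1, int(T * self.keep_ratio))
        weights = self.score(vision_tokens).squeeze(-1)            # (B, T) importance scores
        top_w, top_idx = weights.topk(k, dim=-1)                   # keep only the top-r fraction
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)
        picked = torch.gather(vision_tokens, 1, idx)               # (B, k, D) tokens sent through the block
        processed = picked + block(picked) * top_w.sigmoid().unsqueeze(-1)  # weighted residual update
        out = vision_tokens.clone()                                # skipped tokens take the identity path
        out.scatter_(1, idx, processed)
        return out
```

In the full model, text tokens would always be processed, and this kind of routing would be applied to the vision tokens of each frame inside every decoder layer.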

Semantic-aware token selection typically requires cross-attention-based modeling, which is computationally expensive when processing every frame. As shown in Global-Rebuttal Tables 1-2, LLaMA-VID [5] utilizes text features extracted from a q-former to select semantic-aware visual tokens via context-attention, resulting in significantly higher training costs (52.5 hours vs. our 10.5 hours) with only marginal performance improvement.

Q1: Why does increasing the keep ratio to r=0.3 result in worse performance? Also, performance comparison on more keep ratios.

First, forcing LayerExpert to select only the most important visual tokens increases the learning difficulty, which can lead to better performance with fewer visual tokens. Second, there may be random variation in the LMMs’ training process. We will conduct additional trials in the revised version and report the standard deviation across all experiments.

Further ablation studies on the keep ratio are presented below.

| r | LM-PPL | TimeDiff | Fluency | LM-Correctness |
|---|---|---|---|---|
| 0.1 | 2.43 | 2.11 | 44.7% | 48.1% |
| 0.2 | 2.41 | 2.04 | 45.2% | 48.9% |
| 0.3 | 2.41 | 2.05 | 44.9% | 48.7% |
| 0.5 | 2.41 | 2.03 | 45.1% | 48.8% |
| 0.7 | 2.40 | 2.05 | 45.2% | 48.9% |
| 0.9 | 2.40 | 2.04 | 45.3% | 49.1% |
| Full | 2.40 | 2.05 | 45.3% | 49.0% |

Q2: Why does the proposed method target online video processing?

Processing video in an online scenario differs from offline processing, as LMMs must handle every incoming frame in real-time without frame sampling or token merging. This results in excessive vision token lengths, as shown in Figure 7 of the paper, and the training costs for LMMs increase exponentially with token length. Moreover, online VideoLLMs like GPT-4o have shown great potential in real-world applications, highlighting the urgent need to explore approaches that reduce both training and inference computational costs in online video scenarios.

To validate the effectiveness and generalization of our proposed method, we conducted experiments using the same training recipe as LLaVA/LLaMA-VID on standard image and video benchmarks, as shown in Global-Rebuttal Tables 1-4. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision-language tasks in the LMMs field.

[1] Liang Chen et al. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. ECCV 2024.

[2] Yuzhang Shang et al. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv: 2403.15388.

[3] Peng Jin et al. Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. CVPR 2024.

[4] Gagan Jain et al. Mixture of Nested Experts: Adaptive Processing of Visual Tokens. arXiv: 2407.19985.

[5] Yanwei Li et al. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. ECCV 2024.

Comment

Dear Reviewer qExK,

We would like to express our sincere gratitude for the time and effort you spent reviewing our paper. As the author-reviewer discussion stage draws to a close, we are eager to hear whether our detailed response has sufficiently addressed your concerns. We would be honored to address any further questions you may have.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Submission 10329

Comment

Dear reviewer qExK:

Thank you again for your thoughtful feedback! We hope the rebuttal and additional experiments we provided were helpful. As the rebuttal period draws to a close, please don't hesitate to contact us if you have any problems.

Thank you once again for your thorough review of our paper!

Best regards,

Authors of Paper10329

Official Review
Rating: 4

The document presents a novel approach called VideoLLM-MoD, which aims to efficiently scale up the vision resolution for online video large language models (VideoLLMs) without incurring high computational costs. The approach is inspired by the "mixture-of-depths" approach and learns to skip the computation for a high proportion of vision tokens. Experiments on several egocentric and instructional datasets show the method could achieve similar or even better performance with significantly less computational effort.

Strengths

  1. The proposed method reduces computational cost and saves GPU memory, so a higher vision resolution can be used for the VideoLLM.
  2. A lightweight LayerExpert is proposed to determine which vision tokens should be processed at certain layers.
  3. A streaming loss is proposed to ensure the model remains silent when no response is necessary.

Weaknesses

  1. This proposal mostly borrows the idea from "Mixture-of-Depths"; the novel part is applying it to visual tokens. The original contribution may not be enough for a top-tier conference.
  2. Experiments are mainly on egocentric and instructional datasets and lack diversity.

Questions

  1. Could the authors summarize the original ideas or insights that differ from Mixture-of-Depths, other than the difference in modality?

Limitations

The authors adequately addressed the limitations.

Author Response

We sincerely appreciate your time and efforts in reviewing our paper. We will address each of your concerns point by point.

W1/Q1: This novelty and contributions compared to Mixture-of-Depths.

We summarize the differences between our proposed framework and Mixture-of-Depths (MoD) as follows:

  1. Validation of Effectiveness on LMMs: It is non-trivial to validate the effectiveness of the proposed approach in Large Language Models (LLMs), especially in Large Multi-modal Models (LMMs). While MoD demonstrates comparable loss scaling only on language models smaller than 1B parameters, we explore efficient methods to reduce computation in context for 8B LMMs in both training and inference, particularly under online video scenarios.
  2. Causal Operations: While MoD introduces causality through auxiliary loss or auxiliary MLP predictors, we simplify this by using LayerExpert to select vision tokens within individual frames across transformer blocks. The causal attention in the LLM then learns the temporal modeling, seamlessly accommodating our online video scenario.
  3. Extensive Experiments: Unlike MoD, which validates its approach on language models smaller than 1B parameters on loss scale, we conducted extensive experiments on vision-language benchmarks, including the Ego4D narration benchmark, Ego4D LTA benchmark, EgoExo4D Fine-grained Keystep Recognition task, and COIN benchmarks. Additionally, we performed experiments on general image/video benchmarks, as shown in Global-Rebuttal Tables 1-4, demonstrating the generalization of our proposed approach to vision-language tasks.

The topic of sparse vision modeling in context has gained popularity, as evidenced by Google DeepMind’s recent research on MoNE [1], released a few days ago. While they adopt a similar approach, their exploration is limited to traditional vision architectures rather than the popular LMMs field. We believe our idea "the importance of different vision tokens should be considered in context, and can be modeled by computation budget" can provide valuable insights into general LMMs.

W2: Experiments are mainly on egocentric and instructional datasets and lack diversity.

To validate the effectiveness and generalization of our proposed method, we conducted experiments using the same training recipe as LLaVA/LLaMA-VID on standard image and video benchmarks, as shown in Global-Rebuttal Tables 1-4. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision-language tasks in the LMMs field.

[1] Gagan Jain et al. Mixture of Nested Experts: Adaptive Processing of Visual Tokens. arXiv: 2407.19985.

Comment

Dear Reviewer 2CSB,

We would like to express our sincere gratitude for the time and effort you spent reviewing our paper. As the author-reviewer discussion stage draws to a close, we are eager to hear whether our detailed response has sufficiently addressed your concerns. We would be honored to address any further questions you may have. We eagerly anticipate and highly value your re-evaluation of our paper.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Submission 10329

Comment

Dear reviewer 2CSB:

Thank you again for your thoughtful feedback! We hope the rebuttal and additional experiments we provided were helpful. If any residual concerns remain, we would be glad to discuss further. If no concerns remain, we would appreciate it if you re-evaluate our paper.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Paper10329

Official Review
Rating: 7

The core idea of this paper is to scale up the vision resolution for online video large language models. Instead of distributing FLOPs uniformly across all vision tokens in every decoder layer, the authors utilize a learnable module, LayerExpert, to dynamically allocate compute to critical vision tokens within the frame.

Strengths

Video is such a compute-heavy workload that any temporal modeling that can be performed in a streaming fashion enables several practical applications. Online video LLMs are relatively unexplored; this paper proposes a sparse vision encoder suitable for streaming applications while retaining spatial resolution. The project is timely, and the results are good.

The results on 3 egocentric benchmarks show encouraging results.

Weaknesses

Not a major weakness, but performing experiments with ActivityNet-based training and evaluating on ViDSTG would give an idea of how this MoD approach to a sparse vision encoder works on standard benchmarks.

Questions

Will you make your code available for the community?

Limitations

The method does talk about increasing spatial resolution, but it does not consider spatial grounding. How do the authors suggest the method can be extended for grounding?

Author Response

We sincerely appreciate your positive comments on our work and will address each of the issues you mentioned below.

W1: Experiments showing how this MoD approach to a sparse vision encoder works on standard benchmarks.

To validate the effectiveness and generalization of our proposed method, we conducted experiments using the same data configuration as LLaVA/LLaMA-VID on standard image and video benchmarks, as shown in Global-Rebuttal Tables 1-4.

Specifically, for video benchmarks, we trained our VideoLLM-MoD on the ActivityNet and WebVid-2.5m datasets and further evaluated it on several other benchmarks. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision tasks in the LMMs field.

Q1: Code availability.

The project code is included in the supplementary materials. We will release all the code, data, and checkpoints as soon as possible.

L1: How does the author suggest the method can be extended for grounding?

Our proposed method can process more vision tokens within the same computational budget. For spatial grounding tasks, this capability allows for larger spatial resolutions and denser frame representations. More vision tokens facilitate finer-grained representations of images and videos, capturing more detailed visual features. This is crucial for accurately locating and distinguishing small objects or intricate details within complex scenes, thereby improving the model’s precision in such scenarios.

Comment

Dear Reviewer HwD3,

We would like to express our sincere gratitude for the time and effort you spent reviewing our paper. As the author-reviewer discussion stage draws to a close, we are eager to hear whether our detailed response has sufficiently addressed your concerns. We would be honored to address any further questions you may have.

Thank you once again for your thorough review of our paper.

Best regards,

Authors of Submission 10329

Author Response

We would like to thank all reviewers for their constructive comments.

We appreciate their recognition of our motivation (HwD3, x5JZ); the novelty of the approach (qExK, 2CSB); the efficiency (HwD3, qExK, 2CSB, x5JZ); and sufficient experiments (HwD3, qExK, 2CSB).

In the uploaded PDF of the Global-Rebuttal, we visualize the selected visual tokens of LayerExpert and the generated response in Figure 1. Specifically, we trained our VideoLLM-MoD on the videollm-online-chat-ego4d-134k [1] dataset. We then used five consecutive frames from the Ego4D test set videos as inputs, aggregated and normalized the vision weights from each LayerExpert, and visualized them with an alpha mask in Figure 1(a,b,c), with per-LayerExpert visualizations in Figure 1d. Note that more transparent tokens correspond to larger vision weights.
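A minimal sketch of how such an overlay could be produced is given below, assuming the per-layer LayerExpert weights are available as (h, w) arrays on the frame's token grid and the image size is a multiple of that grid; all names are illustrative assumptions, not the authors' script.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_vision_weights(frame, layer_weights, out_path="weights_overlay.png"):
    """frame: (H, W, 3) uint8 image; layer_weights: list of (h, w) arrays,
    one per LayerExpert, holding routing weights on the frame's token grid."""
    agg = np.stack(layer_weights, axis=0).mean(axis=0)        # aggregate across LayerExperts
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)  # normalize to [0, 1]
    H, W = frame.shape[:2]
    # upsample token-grid weights to pixel resolution
    alpha = np.kron(agg, np.ones((H // agg.shape[0], W // agg.shape[1])))
    mask = np.zeros((*alpha.shape, 4))
    mask[..., 3] = 0.6 * (1.0 - alpha)                        # larger weight -> more transparent mask
    plt.imshow(frame)
    plt.imshow(mask)
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```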

We further conduct more experiments on general image/video benchmarks as shown in Tables 1-4. Our method achieved comparable performance with significantly lower training costs and FLOPs, demonstrating that the proposed sparse vision processing strategy is broadly applicable to vision-language tasks in the LMMs field.

For a fair comparison, we implemented our approach in the LLaVA codebase and trained it using the same recipe as LLaVA [2] and LLaMA-VID [3] for image and video benchmarks, respectively. We computed the FLOPs for each method using text input token lengths of 60 and one single image.
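As a back-of-the-envelope illustration of why skipping vision tokens reduces decoder FLOPs (this is not the exact accounting used for the tables), one can use the standard per-layer approximation of roughly 24*n*d^2 + 4*n^2*d FLOPs; all numbers below except the 60 text tokens are assumptions.

```python
def layer_flops(n_tokens: float, d_model: int) -> float:
    """Rough FLOPs of one decoder layer: QKVO + MLP (~24*n*d^2) plus attention scores (~4*n^2*d)."""
    return 24 * n_tokens * d_model ** 2 + 4 * n_tokens ** 2 * d_model

def decoder_flops(n_text: int, n_vision: int, keep_ratio: float,
                  d_model: int = 4096, n_layers: int = 32) -> float:
    """Crude estimate: assume only keep_ratio of the vision tokens pass each layer's computation."""
    n_eff = n_text + keep_ratio * n_vision
    return n_layers * layer_flops(n_eff, d_model)

# Illustrative comparison: 60 text tokens plus a hypothetical 1000 vision tokens.
full = decoder_flops(60, 1000, keep_ratio=1.0)
sparse = decoder_flops(60, 1000, keep_ratio=0.2)
print(f"full: {full / 1e12:.2f} TFLOPs, sparse: {sparse / 1e12:.2f} TFLOPs ({sparse / full:.2f}x)")
```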

For efficient training, we utilized DeepSpeed ZeRO-2 and FlashAttention-2. We trained the model using LoRA with a rank of 128 and a scaling factor of 256 on all linear layers of the LLM.
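A LoRA setup along these lines could look like the following sketch with the Hugging Face `peft` library; only the rank 128 and scaling factor 256 come from the text above, while the base checkpoint, dropout, and target-module names are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # assumed base LLM
lora_cfg = LoraConfig(
    r=128,               # rank stated in the rebuttal
    lora_alpha=256,      # scaling factor stated in the rebuttal
    lora_dropout=0.05,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed names for "all linear layers"
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
```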

[1] Joya Chen et al. VideoLLM-online: Online Video Large Language Model for Streaming Video. CVPR 2024.

[2] Haotian Liu et al. Visual Instruction Tuning. NeurIPS 2023.

[3] Yanwei Li et al. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. ECCV 2024.

In the following, we address each reviewer’s concerns. For each review, we address the Weaknesses, Questions, and Limitations point by point.

Comment

Dear Reviewers,

Thank you very much for your great effort in reviewing our paper. We sincerely appreciate your valuable and constructive feedback.

The author-reviewer discussion stage will be ending soon. We look forward to your reply and would like to know whether our detailed response has adequately addressed your concerns. If you have further questions, we would be honored to address them.

Thank you once again for reviewing our paper!

Best regards,

Authors of Paper10329

Comment

Dear reviewers,

As Reviewer x5JZ suggested, we further validated the effectiveness of our approach on the Video-MME [1] benchmark. Our VideoLLM-MoD was trained using the same recipe as the initial submission, excluding the streaming loss, and with the same pretraining and finetuning data as LLaMA-VID [2]. Despite requiring 2.95x less training time than VideoLLaMA 2 (22 hrs vs. 65 hrs), our model still achieves top-tier performance and outperforms it.

| Model | Frames | Training Cost | Overall (%) | Short Video (%) | Medium Video (%) | Long Video (%) |
|---|---|---|---|---|---|---|
| VideoLLaMA 2-7B | 32 frames, 32 tokens/frame | 65 hrs (2.95x) | 47.9 | 56.0 | 45.4 | 42.1 |
| Chat-UniVi-v1.5-7B | 64 frames, 112 tokens/frame | 53 hrs (2.41x) | 40.6 | 45.7 | 40.3 | 35.8 |
| Video-LLaVA-7B | 8 frames, 49 tokens/frame | 60 hrs (2.73x) | 39.9 | 45.3 | 38.0 | 36.2 |
| VideoLLM-MoD-8B | 1 fps, 10 tokens/frame | 22 hrs | 49.2 | 58.4 | 46.6 | 42.4 |

[1] Fu, Chaoyou, et al. "Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis." arXiv preprint arXiv:2405.21075 (2024).

We hope that our response can address your questions. If you have any concerns, we would be pleased to discuss them further with you.

Best regards,

Authors of Paper10329

Final Decision

The paper proposes a method to enhance vision resolution for online video large language models (VideoLLMs), by dynamically allocating computational resources to critical vision tokens using a learnable module called LayerExpert. The reviewers generally agree on the significance of this approach, particularly in reducing computation while retaining spatial resolution, making it suitable for streaming applications in video processing.

Concerns were also raised about novelty, given similarities to existing methods such as "Mixture-of-Depths", and about the marginal performance improvements. Further experimentation on more diverse datasets and a deeper exploration of LayerExpert's decision-making process are important for the revision.