Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
We enhance the model's performance in understanding long videos through hierarchical compression.
Abstract
Reviews and Discussion
Long video processing needs to exploit spatio-temporal redundancies. The core idea of this paper is hierarchical differentiation via keyframe selection and non-keyframe compression, enabling the processing of up to 10K frames on a single GPU while showing competitive performance on long-video benchmarks.
Questions for Authors
Clearly written paper, self-explanatory.
Claims and Evidence
Claims and evidence are clear.
Methods and Evaluation Criteria
Results shown on 5 different benchmarks.
Theoretical Claims
NA
Experimental Design and Analysis
I liked the way the experimental choices had been made.
Supplementary Material
I don't see Supplementary materials for this paper in the reviewer portal.
Update: I was expecting separate supplementary material, but discovered it is appended to the main paper.
Relation to Existing Literature
Exploiting spatio-temporal redundancy is key to processing long videos, whether for video compression, video classification, or video-LLMs, across any kind of model and application.
Missing Important References
The paper restricts its discussion and performance comparisons to 7-8B models, not beyond. I believe this is OK.
Other Strengths and Weaknesses
I think it is smart to conduct the frame-level and patch-level analysis on four video-LLM models on the Video-MME benchmark. The paper proposes an efficient scaling mechanism for video-LLMs on long videos, and results on different benchmarks show that the redundancy-exploitation techniques, motivated by intermediate results on Video-MME, also hold for other benchmarks.
The only shortcoming of this method is that its differential keyframe selection may not easily be extended to even longer videos. It assumes offline processing. Can the authors shed some light on a possible extension to a streaming framework?
Other Comments or Suggestions
It's a nicely written paper, easy to understand and follow.
Thank you for your valuable feedback. Here is our reply to your main concern:
Can the authors shed some light on a possible extension to a streaming framework?
We sincerely appreciate the reviewer’s insightful question. While our proposed DKS and DFM processes significantly reduce token counts, enabling ViLAMP to handle long video inputs, the current architecture does not yet support streaming videos where input length increases incrementally over time. However, we believe our method can be adapted to streaming scenarios through some adjustments. One feasible solution involves:
- Keyframe Candidate Pool: A candidate pool for keyframes is maintained. When a new frame arrives, it is identified as a keyframe based on two criteria: (1) Its similarity to the query exceeds a threshold. (2) Its similarity to existing keyframes in the pool is below a threshold. If both conditions are met, the frame is added to the pool, and an existing keyframe is evicted based on criteria such as (a) query similarity, (b) similarity to the newly inserted frame, or (c) the frame’s retention duration in the pool.
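For concreteness, below is a minimal Python sketch of one possible pool-maintenance policy along these lines. All names, thresholds, and the eviction score are hypothetical illustrations, not part of ViLAMP.

```python
import numpy as np

def update_keyframe_pool(pool, new_feat, query_feat,
                         sim_q_thresh=0.3, sim_pool_thresh=0.8, max_size=32):
    # pool: list of {"feat": unit-norm frame embedding, "q_sim": float, "age": int}
    for entry in pool:                       # existing keyframes get older
        entry["age"] += 1

    q_sim = float(new_feat @ query_feat)     # criterion (1): query relevance
    if q_sim < sim_q_thresh:
        return pool
    # criterion (2): low redundancy w.r.t. keyframes already in the pool
    if pool and max(float(new_feat @ e["feat"]) for e in pool) > sim_pool_thresh:
        return pool

    pool.append({"feat": new_feat, "q_sim": q_sim, "age": 0})
    if len(pool) > max_size:
        # evict by a combined score of (a) query similarity and (c) retention time;
        # (b) similarity to the newly inserted frame could be folded in the same way
        scores = [e["q_sim"] - 0.01 * e["age"] for e in pool]
        pool.pop(int(np.argmin(scores)))
    return pool
```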
While this approach theoretically supports streaming scenarios, several implementation challenges require further exploration. A critical issue lies in kv-cache management: Modifying historical information (e.g., updating keyframes) currently does not update the kv-cache, leading to an approximate rather than exact kv-cache during inference. This approximation may impact performance, and rigorous empirical validation is necessary to quantify its effects. We plan to address these nuances in future work, with a focus on efficient cache updates and experimental validation.
Thank you for discussing the potential extension for streaming scenarios and also thinking through the challenges. I'll be glad to see this paper in the proceedings.
This paper introduces a new long-video LLM framework called ViLAMP, which incorporates a differential distillation mechanism to compress the information of redundant video frame sequences into a more compact representation for a better trade-off between performance and computational cost. Specifically, this work designs a differential keyframe selection module and a differential feature merging module that compress video features under the principle of maximizing query relevance while minimizing context redundancy. Extensive experiments and in-depth analyses demonstrate the motivation and effectiveness of the proposed method.
Questions for Authors
Please refer to strengths and weaknesses.
Claims and Evidence
The claims are empirically supported by experiments and observations.
Methods and Evaluation Criteria
This method is useful in improving the capability of current video LLMs to process longer video inputs.
Theoretical Claims
None.
Experimental Design and Analysis
The experimental designs and analyses are intuitively reasonable.
Supplementary Material
I have read the supplementary material.
Relation to Existing Literature
None.
Missing Important References
None.
Other Strengths and Weaknesses
Strengths:
1. The paper is well-written and easy to understand.
2. Abundant experimental results and empirical analyses are included in the manuscript, which make the motivation and method design clearer. The method's basic assumption of query relevance and context uniqueness for compression is reasonable.
3. One advantage of this method is that both the frame selection and patch merging modules are learnable to some extent, since they rely on feature-based intermediate scores for operation. This is more flexible than previous hard selection strategies.
4. Experiments show that the method achieves a competitive trade-off between computational cost and model performance, and its performance in long-form video scenarios seems compelling.
Weaknesses:
1. Although this method's effectiveness has been verified on video question answering benchmarks for multimodal large language models, it is still unclear whether this method can truly preserve the necessary information for more temporally sensitive tasks such as action recognition, video grounding, or temporal reasoning. I suggest the authors add some experiments and analyses on this aspect.
2. There is much existing frame-sampling research in the video understanding domain, such as [1, 2], that can serve as reasonable baselines for comparative analyses. I suggest the authors add comparisons with these frame sampling methods.
References
[1] Adaframe: Adaptive frame selection for fast video recognition, In CVPR 2019.
[2] MGSampler: An Explainable Sampling Strategy for Video Action Recognition, In ICCV 2021.
Other Comments or Suggestions
Please refer to strengths and weaknesses.
Ethics Review Concerns
None.
Although this method's effectiveness has been verified on video question answering benchmarks for multimodal large language models, it is still unclear whether this method can truly preserve the necessary information for more temporally sensitive tasks such as action recognition, video grounding, or temporal reasoning. I suggest the authors add some experiments and analyses on this aspect.
There is much existing frame-sampling research in the video understanding domain, such as [1, 2], that can serve as reasonable baselines for comparative analyses. I suggest the authors add comparisons with these frame sampling methods.
Thank you for your valuable feedback. We have conducted additional experiments in response to your suggestions, as detailed below:
For the Video Grounding task, we performed experiments on the widely used QVHighlights[1] benchmark (267 citations). This dataset provides annotations for videos, marking the timestamps of query-relevant segments and the positions of keyframes. We evaluated ViLAMP using the Hit@k metric, which measures the probability of the ground-truth keyframe appearing in the top-k keyframes selected by our method. The results are as follows:
| Hit@1 | Hit@5 | Hit@10 | Hit@30 |
|---|---|---|---|
| 26.1 | 47.3 | 60.0 | 83.2 |
As shown, ViLAMP achieves 60% accuracy for Hit@10 and 83% for Hit@30, providing strong empirical support for its Differential Keyframe Selection (DKS) process. The current method effectively captures query-relevant video clips.
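For clarity, here is a minimal sketch of the per-sample Hit@k computation as described above; it is not our actual evaluation script, and the variable names are illustrative.

```python
def hit_at_k(selected_frames, gt_keyframes, k):
    # 1 if any ground-truth keyframe index appears among the top-k selected frames
    gt = set(gt_keyframes)
    return int(any(f in gt for f in selected_frames[:k]))

# Averaged over the benchmark, this yields the Hit@k probabilities reported above, e.g.
# hit10 = sum(hit_at_k(sel, gt, 10) for sel, gt in samples) / len(samples)
```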
For the action recognition task, we incorporated the two baselines suggested: Adaframe and MGSampler, and evaluated them alongside ViLAMP on three benchmarks: ActivityNet[2], Something-Something V2 (Sth-V2)[3], and UCF-101[4]. Since ViLAMP is a generative model, we reformulated the original classification task as a multiple-choice problem during testing. The results are as follows:
| | ActivityNet (mAP %) | Sth-V2 (Acc. %) | UCF-101 (Acc. %) |
|---|---|---|---|
| Adaframe | 71.5 | - | - |
| MGSampler | - | 60.1 | 95.2 |
| ViLAMP (Ours) | 78.6 | 86.9 | 97.6 |
The results for Adaframe and MGSampler are sourced from their papers. Notably, although ViLAMP was not trained on these datasets, it demonstrates robust performance, significantly outperforming both baselines. This may be attributed to the generalizable knowledge acquired by the large-scale pretrained model, enabling effective zero-shot transfer to action recognition tasks.
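For reference, below is a simplified sketch of how such a multiple-choice reformulation can be built from a class label set. The prompt wording, option count, and function name are assumptions rather than our exact protocol.

```python
import random

def build_multiple_choice(gt_class, all_classes, n_options=4, seed=0,
                          question="Which action is shown in the video?"):
    # Ground-truth class plus randomly sampled distractors, shuffled into lettered options.
    rng = random.Random(seed)
    distractors = rng.sample([c for c in all_classes if c != gt_class], n_options - 1)
    options = distractors + [gt_class]
    rng.shuffle(options)
    letters = "ABCDEFGH"[:n_options]
    prompt = question + "\n" + "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options)
    ) + "\nAnswer with the option letter."
    return prompt, letters[options.index(gt_class)]
```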
We will incorporate these results into our revised draft. Please let us know if you have any further questions or suggestions.
[1] Lei J, Berg T L, Bansal M. Detecting moments and highlights in videos via natural language queries.
[2] Caba Heilbron F, Escorcia V, Ghanem B, et al. ActivityNet: A large-scale video benchmark for human activity understanding.
[3] Goyal R, Ebrahimi Kahou S, Michalski V, et al. The "something something" video database for learning and evaluating visual common sense.
[4] Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild.
Thanks for the authors' detailed rebuttal and additional experimental results, which adequately addressed my previous concerns. Therefore, I have decided to upgrade my final score to 4.
This paper presents a novel video token compression strategy named Hierarchical Differential Distillation, designed for long video understanding. Specifically, the hierarchical differential distillation process is achieved through two key mechanisms: Differential Keyframe Selection (DKS) and Differential Feature Merging (DFM). Building upon this strategy, the paper introduces the VILAMP model. Experimental results demonstrate that VILAMP achieves superior performance across five common video understanding benchmarks.
Update after rebuttal: I appreciate the authors' detailed response; most of my concerns are settled. I will keep the original rating of "Accept".
Questions for Authors
- Why does DFM choose to compress non-keyframes into a single token? Would increasing the number of retained tokens during DFM improve performance?
- Is it possible to adaptively select the number of keyframes?
Claims and Evidence
Yes. This paper claimed that the DKS strategy can select keyframes by maximizing query relevance while ensuring temporal uniqueness, whereas the DFM strategy preserves query-relevant features while suppressing redundant visual information, thereby merging non-keyframes.
Methods and Evaluation Criteria
Yes. The evaluation benchmarks utilized in this paper cover the most common video understanding benchmarks, including ActivityNetQA, EgoSchema, LongVideoBench, MLVU, and Video-MME.
Theoretical Claims
This paper does not make theoretical claims.
Experimental Design and Analysis
The experimental analysis covers a comparison with state-of-the-art approaches, and ablation studies on components and keyframe numbers. More experiments are provided in the supplementary materials. I have no objection to this part.
Supplementary Material
Yes. The supplementary materials include details on the evaluation benchmarks, descriptions of the compared methods, discussions on efficiency, and ablation studies on hyperparameters.
Relation to Existing Literature
This paper presents two novel strategies for video token compression, which hold the potential to advance not only VLMs but also the broader field of video processing.
Missing Important References
Do the authors consider comparing LongVU with some training-free token compression methods[1-4]?
[1] VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024)
[2] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (2024)
[3] DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (2024)
[4] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Acceleration for VLLM Inference (2024)
Other Strengths and Weaknesses
Strengths
- The ideas behind the DKS and DFM strategies are considered novel.
- Experimental results demonstrate that VILAMP achieves superior performance across multiple benchmarks.
Weaknesses
- The DKS presets a fixed number of keyframes K, which may not be suitable for all videos. Intuitively, videos with less content variation may require fewer keyframes, while more dynamic and content-rich videos would benefit from selecting more keyframes.
- VILAMP requires model finetuning, which may limit its portability and practical value.
- The hierarchical differential distillation relies on multiple hyperparameters, and as shown in Table 4 of the supplementary material, these hyperparameters significantly impact the model’s performance.
Other Comments or Suggestions
It would be better to provide a more detailed derivation and analysis of the computational complexity of the DKS strategy.
Thank you for your time and effort. Here are our replies:
Q1: Do the authors consider comparing LongVU with some training-free token compression methods[1-4]?
Since VisionZip [1], FasterV [2], and FastV [4] are based on the LLaVA-NeXT model, which is optimized for single-image scenarios, they cannot be directly applied to video input. Due to time constraints, we only reproduced DyCoke [3], and the results are as follows:
| | EgoSchema | MLVU | Video-MME |
|---|---|---|---|
| DyCoke | 69.8 | 63.2 | 57.9 |
| LongVU | 67.6 | 65.4 | 60.6 |
We will include these results in our revised draft.
Q2: The DKS presets a fixed number of keyframes K, which may not be suitable for all videos. Intuitively, videos with less content variation may require fewer keyframes, while more dynamic and content-rich videos would benefit from selecting more keyframes.
Thank you for your valuable feedback. We would like to clarify that the parameter K serves as an upper boundary for the number of keyframes rather than a required quantity to achieve. In practical applications, when processing videos with less content variation, the DKS algorithm may automatically select fewer keyframes than the specified K value, as outlined in line 4 of Algorithm 1 (Page 5).
As demonstrated in Figure 6, which illustrates the impact of K on model performance, we observe that the model's performance tends to saturate when K exceeds 32. Taking into account both the model's performance and the length constraints of the base model, we ultimately set K to 32.
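To make the "upper bound" behavior concrete, here is a simplified greedy sketch in the spirit of DKS; it is not the exact rule in Algorithm 1, and the uniqueness threshold is an assumed value.

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, K=32, uniq_thresh=0.8):
    # frame_feats: (N, d) unit-norm frame embeddings; query_feat: (d,) query embedding
    order = np.argsort(-(frame_feats @ query_feat))      # most query-relevant first
    selected = []
    for idx in order:
        if len(selected) == K:
            break
        # keep a frame only if it is not redundant with already-selected keyframes
        if all(float(frame_feats[idx] @ frame_feats[j]) < uniq_thresh for j in selected):
            selected.append(int(idx))
    return sorted(selected)  # static videos naturally yield fewer than K keyframes
```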
Q3: VILAMP requires model finetuning, which may limit its portability and practical value.
Indeed, ViLAMP requires finetuning. This is primarily for performance considerations. Training-free methods such as FastV, PruMerge, and DyCoke can only achieve performance comparable to the base model. We finetune ViLAMP with additional data to attain superior results, enabling it to better adapt to compressed video inputs.
Q4: The hierarchical differential distillation relies on multiple hyperparameters, and as shown in Table 4 of the supplementary material, these hyperparameters significantly impact the model’s performance.
As mentioned in Q3, our method does involve finetuning that introduces certain hyperparameters. We fully acknowledge the potential impact of hyperparameter selection on model performance and have conducted systematic experiments and corresponding analyses to offer practical guidance for users. The results demonstrate that the model maintains stable performance with slight deviations from the preset optimal values, and significant performance degradation only occurs under extreme parameter configurations.
Q5: Why does DFM choose to compress non-keyframes into a single token? Would increasing the number of retained tokens during DFM improve performance?
This is an important question. In our pilot study, we explored various configurations by compressing non-keyframes into 1 token, 3 tokens, and 9 tokens, respectively.
The results show that using more tokens to represent non-keyframes does not yield performance improvements. We attribute this observation to two main reasons: (1) Useful patches are sparse within frames. As shown in Table 4, DFM performs best when its parameter is set such that the patch scores are amplified by 100 before the Softmax operation (see Eq. 10). Such sharpening concentrates the weights on a small subset of patches, suggesting that focusing on only a few critical patches is sufficient for achieving desirable results. (2) Keyframes already contain sufficient details, while non-keyframes primarily serve to complement missing information. In this case, increasing the number of tokens for non-keyframes may not lead to noticeable gains.
Therefore, for efficiency considerations, we ultimately compressed non-keyframes into a single token. We will include the discussion above in our revised draft.
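As an illustration of the sharpened-softmax merging discussed above, here is a minimal sketch; it is not the paper's exact Eq. 10, and the `sharpen` factor is only an assumption consistent with the amplification described above.

```python
import numpy as np

def merge_nonkeyframe(patch_feats, patch_scores, sharpen=100.0):
    # patch_feats: (P, d) patch features of one non-keyframe; patch_scores: (P,)
    z = patch_scores * sharpen          # sharpen scores before the softmax
    z = z - z.max()                     # numerical stability
    w = np.exp(z) / np.exp(z).sum()     # weights concentrate on a few critical patches
    return w @ patch_feats              # single merged token of shape (d,)
```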
Q6: Is it possible to adaptively select the number of keyframes?
As mentioned in Q2, our current DKS keyframe selection algorithm employs a dynamic selection mechanism, though it currently operates with a fixed upper limit K on the maximum number of keyframes. To enhance model adaptability across diverse video inputs, we plan to develop a dedicated module in subsequent experiments. This module will dynamically calculate the optimal K value by considering inter-frame relevance, content variation, and video duration, thereby enabling more intelligent adaptation.
I appreciate the authors' detailed response; most of my concerns are settled. I will keep the original rating.
This paper works on reducing the high computational cost of long-form video processing in VLMs. It receives unanimously positive reviews due to (1) the novel idea, (2) extensive experiments, and (3) good writing. There were some concerns at the beginning, including (1) missing experiments on more temporally sensitive tasks; (2) missing comparisons with some related works; (3) the fixed number of keyframes K may not be suitable for all videos; (4) the finetuning required by ViLAMP may limit its practical value; and (5) the hyperparameters. Most of these were well addressed by the rebuttal, and all reviewers are satisfied in the end. The AC finds no reason to disagree with the reviewers' recommendation and thus recommends acceptance. The authors should follow the reviewers' feedback and reflect the rebuttal in the final version.