LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Large language model for long video language understanding
Abstract
Reviews and Discussion
This paper introduces LongVU, a video-LLM designed for long video-language understanding. LongVU is built on three main components: (1) DINOv2 features, used to remove redundant frames that exhibit high similarity, (2) a text-guided cross-modal query, which enables selective frame feature reduction, and (3) spatial token reduction, which reduces spatial tokens by exploiting temporal dependencies. Experiments mainly focus on video understanding tasks, testing the model's Spatial-temporal Orientation Awareness, Video Detailed Description, Action Counting, and Hour-long Video Understanding, demonstrating the superiority of LongVU for long video understanding.
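To make the first component concrete, below is a minimal sketch of similarity-based temporal frame reduction, assuming per-frame DINOv2 features have already been extracted; the threshold `tau` and the greedy keep-last-frame rule are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def reduce_redundant_frames(dino_feats: torch.Tensor, tau: float = 0.85) -> list:
    """Greedy keyframe selection: drop a frame if it is too similar to the last kept frame.

    dino_feats: (T, D) frame-level DINOv2 features (e.g., pooled patch tokens).
    tau: hypothetical cosine-similarity threshold above which a frame is dropped.
    Returns indices of the retained (key) frames.
    """
    keep = [0]  # always keep the first frame
    for t in range(1, dino_feats.size(0)):
        sim = F.cosine_similarity(dino_feats[t], dino_feats[keep[-1]], dim=0)
        if sim.item() < tau:  # sufficiently different from the last kept frame
            keep.append(t)
    return keep

# Example: 300 frames of 1024-d features -> indices of non-redundant frames
kept = reduce_redundant_frames(torch.randn(300, 1024), tau=0.85)
```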
Questions for Authors
Would the compression strategy require re-tuning for different types of videos (e.g., static surveillance footage vs. fast-moving sports videos)?
Claims and Evidence
Claims supported by clear and convincing evidence:
Claim: The proposed LongVU significantly improves the performance of existing visual-language models (VLMs) on video understanding tasks.
Evidence: The paper provides extensive experimental results in Table 1 and Table 2, demonstrating performance gains on multiple benchmarks such as EgoSchema, MVBench, MLVU and VideoMME. The comparison with baseline models clearly shows the superiority of the LongVU.
Methods and Evaluation Criteria
Proposed Methods: The proposed method, LongVU, aims to enhance video understanding by injecting temporal information into large vision-language models (VLMs), minimizing additional computational overhead while improving performance, and generalizing across various video understanding tasks. Video understanding inherently requires capturing both spatial and temporal features, which LongVU addresses, and reducing the cost of adaptation without compromising performance is crucial for large-scale VLMs.
Theoretical Claims
N/A
Experimental Design and Analysis
Strengths
The paper provides a reasonably well-structured experimental design, with the following positive aspects:
Baseline Comparisons: The paper compares LongVU with several state-of-the-art large vision-language models (VLMs).
Multiple Datasets: The experiments are conducted on well-known datasets like EgoSchema, MVBench, and VideoMME. These datasets are widely accepted benchmarks for video understanding tasks, making the evaluation relevant and convincing.
Evaluation Metrics: The primary metric used is Accuracy (Acc), which is standard for VideoQA tasks. Although limited, the choice of this metric is reasonable for the problem setting.
Ablation Study: The paper conducts ablation studies to isolate the contribution of LongVU. This strengthens the validity of their claims.
Weaknesses
No Clear Analysis of STC's Token Reduction. The paper claims that STC helps compress video tokens, but it does not provide quantitative details on how many tokens are actually reduced. Table 3 shows that DINO + Query and DINO + Query + STC have very similar performance (only a 0.3%-0.5% difference), raising the question of whether STC meaningfully contributes to efficiency or accuracy. Without token reduction statistics, it is unclear whether STC significantly improves computational efficiency or has a negligible effect.
Limited Task Scope. The paper claims LongVU is generally applicable to various video understanding tasks. However, all experiments focus solely on VideoQA, without evaluating performance on other video understanding tasks such as video captioning (describing video content) or action recognition (classifying actions from videos). This narrow task scope weakens the claim of generalizability.
Supplementary Material
N/A
Relation to Prior Work
N/A
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths
The paper presents a novel model, LongVU, which effectively reduces video tokens while preserving crucial visual details. This is an innovative approach compared to existing uniform or dense sampling strategies. The use of cross-modal query-based compression is a unique way to adaptively reduce spatial and temporal redundancy in videos, an aspect that has not been fully explored in prior works.
Weaknesses
Lack of a Clearly Stated Contribution Section. The paper does not explicitly list its main contributions in a separate section or a clearly structured paragraph in the introduction. While the methodology is described in detail, the lack of a concise summary of contributions makes it difficult to quickly understand what is novel and significant about this work.
Other Comments or Suggestions
N/A
1. The paper claims that STC helps compress video tokens, but it does not provide quantitative details on how many tokens are actually reduced.
Please refer to Figure 4(b), where we illustrate the reduction in the number of tokens after applying STC across different video durations. On average, we observe a 40.4% reduction in tokens during STC. The primary goal of STC is to reduce tokens while preserving visual information, rather than achieving a significant performance improvement. The table below shows the retained token ratio before and after STC, demonstrating that while the number of tokens is significantly reduced, the model's performance is effectively preserved.
| Methods | Retained Ratio | EgoSchema | VideoMME | MLVU |
|---|---|---|---|---|
| w/o STC | 23.9% | 67.3 | 60.1 | 65.0 |
| w/ STC | 14.2% | 67.6 | 60.6 | 65.4 |
2. Limited Task Scope: other video understanding tasks such as video captioning.
Our video training data, sourced from VideoChat2, includes extensive captioning annotations, as shown in Table 6. Our model is capable of generating detailed video captions, as demonstrated in the qualitative results presented in Figure 3.
Recent long-video benchmarks adopt the VideoQA format for easier and more stable evaluation, since automatic caption evaluation is challenging and GPT-4-based evaluation is not cost-effective. MLVU offers subsets such as Video Summarization and Sub-Scene Captioning for captioning tasks. Below, we report MLVU video captioning results, with G-Avg representing the average performance across generation tasks.
| Models | MLVU (G-Avg) ↑ |
|---|---|
| ShareGPT4Video | 3.7 |
| VideoLLaMA2 | 3.9 |
| VideoChat2 | 3.9 |
| LongVU (Ours) | 4.1 |
3. Lack of list of main contributions.
We believe our contributions are notable, thoughtfully designed, and carefully integrated to achieve strong performance. We clarify them here.
- Vision-centric visual encoders for video understanding. We identify a key insight: vision-centric encoders trained with feature similarity objectives, e.g., DINOv2, excel at frame reduction in the visual space, while CLIP-based features, optimized for vision-language alignment, are suboptimal for this task. To our knowledge, this has not been explored before for video token reduction.
- Context-aware dynamic token compression. Our method adaptively compresses the video both spatially and temporally by dynamically adjusting the number of tokens based on the video's inherent visual complexity and redundancy, considering both the spatiotemporal nature of video and the semantic relevance to the user's query.
- We conducted extensive experiments across several video understanding benchmarks, including EgoSchema, MVBench, VideoMME, and MLVU. Our LongVU method significantly outperforms multiple recent open-source video LLMs and also demonstrates strong effectiveness at a smaller model size (3B).
This paper proposes to reduce long video context by: 1) temporal frame reduction based on DINO feature similarity to extract keyframes (DINO module), 2) a cross-modal query (Query module) to capture important tokens, and 3) Spatial Token Compression (STC module) to further reduce tokens for excessively long videos. The technical novelty of the Query and STC modules is limited. The ablation studies demonstrate that the improvements from the STC module are weak and that it sacrifices fine-grained abilities.
Given that the improvements of the DINO module and Query module are significant, I lean towards weak accept if the concerns about the STC module and the advantages over FastV are resolved.
Questions for Authors
- Since dropping / compressing tokens in MLLMs is not a new idea, the authors should demonstrate their advantages over previous methods like FastV. I think efficiency might be a selling point, as LongVU compresses tokens before feeding them to the LLM.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
No theoretical claims in this paper.
Experimental Design and Analysis
Yes
Supplementary Material
I reviewed the code in the supplementary material.
Relation to Prior Work
In terms of video understanding.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths
- For long video understanding, the authors developed 1) temporal frame reduction based on DINO feature similarity to extract keyframes (DINO module), 2) a cross-modal query (Query module) to capture important tokens, and 3) Spatial Token Compression (STC module) to further reduce tokens for excessively long videos.
- The effectiveness of all the above modules is verified by ablation studies.
Weaknesses
- The core ideas of Query and STC lack substantial novelty. Cross-modal queries have long been known to reduce sequence length [1], and sparsity in MLLMs has already been explored by FastV [2]. This paper differs from FastV in that it compresses tokens before they are fed into the LLM, but the authors do not demonstrate the advantages of this design either.
- The improvements from the STC module are not significant. From Table 3, the performance gain on EgoSchema, VideoMME, and MLVU ranges from 0.3% to 0.5%. On MLVU, simply changing the seed, resolution, or number of sampled frames brings much larger variation.
- STC harms fine-grained temporal understanding abilities. In Table 4, performance on Needle QA and Plot QA drops significantly when STC is introduced. Both subtasks target fine-grained temporal understanding, especially Needle QA. In other words, the paper sacrifices fine-grained abilities because they account for only a small portion of the benchmark.
[1] X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval. CVPR 2022.
[2] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. ECCV 2024.
Other Comments or Suggestions
N/A
1. The core ideas of Query and STC lack substantial novelty. Cross-modal queries have long been known to reduce sequence length [1], and sparsity in MLLMs has already been explored by FastV [2]. This paper differs from FastV in that it compresses tokens before they are fed into the LLM, but the authors do not demonstrate the advantages of this design either.
FastV is a token reduction method designed for image understanding in MLLMs, operating after the second layer of MLLMs by reranking visual tokens based on attention scores from remaining tokens. In DyCoke [3], FastV was compared as a baseline for long video understanding tasks. The results show that while FastV effectively compresses tokens to speed up inference, it sacrifices performance in long-video comprehension compared to the base model.
As a plug-and-play approach, we applied FastV to our LongVU model to compare its compression with our method.
| Models | TFLOPs | VideoMME |
|---|---|---|
| LongVU w/o reduction + FastV | 34.4 | 55.1 |
| LongVU w/ reduction (default) | 18.5 | 60.6 |
For our case, the limitation of FastV becomes more pronounced. Our approach performs dense sampling and reduces tokens before they are input to the LLM, ensuring they fit within the LLM's context length with lower complexity. In contrast, FastV still requires processing all video tokens (100K) in the first two layers, causing the input to far exceed the LLM's context size (8K). This discrepancy affects attention reliability for extremely long contexts that extend beyond the model's training distribution. In addition, its attention-based token selection fails to account for the temporal nature of video.
FastV is an effective token compression approach for image understanding, but it is not designed to be optimal for long-video compression. In contrast, our method compresses video tokens by considering the spatiotemporal dependencies inherent in video data, which not only reduces the number of tokens but also improves performance on long-video understanding tasks.
[3] DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models, CVPR 2025
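For readers unfamiliar with this family of methods, the schematic below illustrates the general idea of attention-score-based token dropping after an early LLM layer; it is a simplified sketch with hypothetical names, not FastV's actual implementation, and it highlights that the full (possibly 100K-token) sequence must still pass through the early layers before any pruning occurs.

```python
import torch

def prune_visual_tokens_by_attention(hidden, attn, vis_slice, keep_ratio=0.5):
    """Schematic of attention-score-based pruning after an early LLM layer.

    hidden:    (L, D) hidden states of the full sequence after layer k
               (the entire sequence has already been processed up to here).
    attn:      (L, L) attention weights from that layer, averaged over heads.
    vis_slice: Python slice marking where the visual tokens sit in the sequence.
    Keeps only the visual tokens that receive the most attention.
    """
    scores = attn[:, vis_slice].mean(dim=0)                 # attention received per visual token
    n_keep = max(1, int(keep_ratio * scores.numel()))
    keep_vis = torch.topk(scores, n_keep).indices + vis_slice.start
    non_vis = torch.tensor(
        [i for i in range(hidden.size(0)) if not (vis_slice.start <= i < vis_slice.stop)],
        dtype=torch.long,
    )
    keep = torch.sort(torch.cat([non_vis, keep_vis])).values
    return hidden[keep]                                     # shortened sequence for later layers
```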
2. Improvements of the STC module are not significant. STC harms fine-grained temporal understanding abilities.
We emphasize that STC is not primarily aimed at improving performance but at reducing video redundancy, minimizing the number of tokens fed into the LLM.
| Methods | Retained Ratio | EgoSchema | VideoMME | MLVU |
|---|---|---|---|---|
| w/o STC | 23.9% | 67.3 | 60.1 | 65.0 |
| w/ STC | 14.2% | 67.6 | 60.6 | 65.4 |
By applying STC, we achieve a 40.4% reduction in tokens, ultimately retaining only 14.2% of the original tokens. This demonstrates that while the number of tokens is significantly reduced, the model's performance is effectively preserved.
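For completeness, the quoted reduction follows directly from the two retained ratios in the table (computed from the rounded table values, hence the small discrepancy with 40.4%):

$$\text{STC reduction} = 1 - \frac{14.2\%}{23.9\%} \approx 40.6\%$$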
The paper proposes LongVU, a method that addresses the challenge of processing long videos within multimodal language models' limited context by implementing three compression approaches: reducing temporal redundancy through inter-frame similarity, leveraging cross-modal dependencies, and eliminating spatial token redundancy. This framework successfully processes hour-long videos within a standard 8k context length, outperforming competing methods across multiple benchmarks while maintaining high-quality video understanding.
Questions for Authors
Can the authors discuss each component of their method in light of previous work on similar components? What has already been done, and what is the unique contribution that differs from previous work? Is the unique contribution really a major research contribution, or is it incremental in nature? This is my only major concern right now. Other than that, the results are good and the work is well presented.
Claims and Evidence
The major claims about the method's performance on long videos are supported by results across multiple benchmark datasets. Additionally, the authors have provided detailed ablations on all their components to provide more insight into their method.
Methods and Evaluation Criteria
Yes, all major benchmark datasets are covered. However, I am concerned about the inference results reported in the supplementary material. The authors report that LLaMA-VID runs out of memory (OOM) on 20-minute videos, yet the LLaMA-VID paper reports performance on hour-long videos. Is there an error here?
Theoretical Claims
The mathematical representations are fine. No major theoretical proofs in the paper.
Experimental Design and Analysis
The experimental design and analysis are all technically sound and thorough. All major datasets concerning long video evaluations are covered and experimental analysis is provided in detail.
Supplementary Material
The supplemental material contains code that I did not run myself.
Relation to Prior Work
This is my major point of concern. The main contribution of the paper revolves around token reduction based on dino features and adaptive reduction based on the text query. These components have already been proposed in prior work for either image VLMs or other Video VLMs. For example, both [1] and [2] use DINO features alongside the CLIP features. Similarly, the idea of using text query to select optimal tokens is also covered in previous published work on videos [3, 4] that the authors do not seem to discuss.
This reduces the proposed work to more of an engineering solution that combines ideas from previous image- or video-based work. There is no doubt that the results are outstanding, but the core contributions leading to the strong performance have already been proposed in a similar form in previous work, or adapted for similar purposes in image-based models.
[1] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
[2] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
[3] Text-Conditioned Resampler For Long Form Video Understanding
[4] Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Missing Important References
I believe that there is a missing discussion on two works, that also propose a similar text-conditioned token reduction for videos.
[1] Text-Conditioned Resampler For Long Form Video Understanding
[2] Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Other Strengths and Weaknesses
Strengths:
The paper is well-motivated and well-written. The overall method achieves strong performance on all major benchmark datasets. The method is also supported by detailed ablations regarding each component of the method.
Weaknesses:
My major concern is only regarding the technical novelty of the method. As of right now, it seems that the method is a strong engineering solution to the problem of long video understanding in VLMs, but the major components have all been proposed in previous published work, whether it is the concept of combining DINO and CLIP features [1, 2], the idea of selecting tokens based on text queries [3, 4], or the concept of reducing temporal resolution before processing the video [5]. Therefore, I believe that the novelty of the method is quite limited, with all components proposed in some form in previous work.
[1] Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
[2] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
[3] Text-Conditioned Resampler For Long Form Video Understanding
[4] Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
[5] 3D CNNs with Adaptive Temporal Feature Resolutions
Other Comments or Suggestions
N/A
Ethics Review Concerns
N/A
1. LLaMA-VID OOM issue.
Thank you for raising this concern. We encountered an OOM issue when running LLaMA-VID on an 80GB A100 GPU for long videos, using the same settings as the other comparison models. To address this, we precompute the video features in advance and then load the entire model for inference, as shown in LLaMA-VID (long-video-inference). The resulting inference time for 20-minute videos is 55.3 seconds, which will be added to Table 11.
2. The main contribution has already been proposed in prior work for either image VLMs or other Video VLMs.
2.1. For example, both [1] and [2] use DINO features alongside the CLIP features.
While MoF [1] and Cambrian-1 [2] utilize both DINO and CLIP features, neither explores their combination for video training and video token compression—where careful design is crucial for practical training and good performance. Simply applying both vision features to all video frames is computationally demanding, making long-video training impractical.
Our work goes beyond merely using both encoders for video understanding—we introduce a novel strategy to reduce computation demand and conduct adaptive video token compression for MLLMs. We identify a key insight: vision-centric encoders trained with feature similarity objectives, e.g., DINOv2, excel at frame reduction in the visual space, while CLIP-based features, optimized for vision-language alignment, are suboptimal for this task. To our knowledge, this has not been explored before for video token reduction. In addition, by leveraging DINOv2 to selectively filter redundant frames before extracting SigLIP features, we significantly reduce computational costs and make long video training feasible.
2.2. Using the text query to select optimal tokens is also covered in [3, 4], which the authors do not seem to discuss.
The claim that Goldfish uses the text query to select optimal tokens is inaccurate, as its selection mechanism does not operate on visual representations. Instead, it retrieves text descriptions based on query similarity. As we have already discussed in lines 151-154, Goldfish chunks long videos into short clips, generates descriptions with a VideoLLM, and encodes them using OpenAI's text-embedding-3-small. It then retrieves the most relevant descriptions, based on the similarity between the query's embedding and the clip embeddings, for the final answer. Our method, in contrast, selects video tokens, making Goldfish's approach fundamentally different from ours.
In [3], the Text-Conditioned Resampler (TCR) employs learnable queries to resample a long sequence of video frame features into a fixed-length sequence, which is then fed into a language model. This approach is quite similar to QFormer [5] but applied in the video domain. However, TCR forces long videos to be compressed into a limited, fixed token space, limiting its ability to represent complex visual information and ignoring temporal variations, which results in redundancy in static scenes and loss of detail in complex ones.
In contrast to [3], our method dynamically compresses video tokens based on intrinsic dynamics, adapting to each video's complexity. We will integrate the above discussion in our revision.
[5] Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
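To make the contrast explicit, here is a minimal schematic of a QFormer/TCR-style fixed-query resampler (dimensions and names are hypothetical, not the cited papers' exact designs): no matter how long or complex the video is, the output length is always `num_queries`, which is the limitation discussed above and what our adaptive compression avoids.

```python
import torch
import torch.nn as nn

class FixedQueryResampler(nn.Module):
    """Schematic QFormer/TCR-style resampler: a long video token sequence is
    compressed into a fixed number of learnable query tokens via cross-attention."""

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, D) with N potentially huge for long videos
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        out, _ = self.attn(q, video_tokens, video_tokens)
        return out  # (B, num_queries, D) regardless of N or video complexity
```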
3. The unique contribution that is different from previous work?
(1) Vision-centric visual encoders for video understanding. To the best of our knowledge, no previous work has explored leveraging a vision-centric visual encoder like DINOv2 for frame reduction. While methods like MovieChat and Chat-UniVi reduce redundancy by calculating frame-level feature similarities, they rely on CLIP features. We argue that language-supervised methods like CLIP fail to capture the nuanced differences between frames.
(2) Context-aware dynamic token compression. Many token compression methods, such as TCR, VideoChat, and VideoChat2, rely on QFormer-based modules, which limit the video representation by transforming it into a fixed number of trainable queries. These approaches compress every video into the same token budget regardless of its complexity, overlooking its dynamics and informational richness. In contrast, our method adaptively compresses the video both spatially and temporally, dynamically adjusting the number of tokens based on the video's inherent visual complexity and redundancy.
(3) Stronger performance with lower cost. Compared to Chat-UniVi, which extends ToMe-based dynamic token merging for visual features in MLLMs, our method is both more efficient and more effective. As shown in Table 11, it processes over 1K video frames in 32.96 seconds, compared to Chat-UniVi’s 49.06 seconds, while achieving a substantial 17.7% accuracy gain in VideoMME long.
I thank the authors for their response. I believe most of my concerns have been addressed. I will be moving my rating up to Accept.
Dear Reviewer 7Nad,
We sincerely thank you for taking the time to carefully read our rebuttal and for acknowledging our clarifications. We’re grateful for your constructive feedback throughout the review process and are pleased to hear that your concerns have been addressed. We will integrate all of your valuable suggestions in our revision.
Best,
Paper 9623 Authors
This paper introduces a novel spatiotemporal token compression method named LongVU, designed for long video understanding. Specifically, LongVU divides the video token compression process into three stages: Temporal Reduction, Selective Feature Reduction, and Spatial Token Compression.
In the Temporal Reduction stage, an additional DINOv2 visual encoder is introduced to eliminate frame-level redundancy based on similarity. The Selective Feature Reduction stage incorporates the user’s query to identify the most relevant frames, preserving their resolution while reducing the resolution of less relevant frames. Finally, the Spatial Token Compression stage partitions the video into temporal windows and performs spatial compression at corresponding positions within each window.
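As a rough illustration of the second and third stages described above, the sketch below shows query-guided resolution reduction followed by window-based spatial compression; the relevance scoring, pooling factor, window size, and threshold are simplifying assumptions for exposition, not the paper's actual design or hyperparameters.

```python
import torch
import torch.nn.functional as F

def query_guided_reduction(frame_tokens, query_emb, keep_ratio=0.5, pool=2):
    """frame_tokens: (T, H, W, D) visual tokens of the kept frames.
    query_emb: (D,) embedding of the user's text query.
    Frames most relevant to the query keep full resolution; the rest are
    spatially average-pooled to fewer tokens."""
    T, H, W, D = frame_tokens.shape
    frame_means = frame_tokens.reshape(T, -1, D).mean(dim=1)               # (T, D)
    relevance = F.cosine_similarity(frame_means, query_emb[None], dim=-1)  # (T,)
    top = set(torch.topk(relevance, max(1, int(keep_ratio * T))).indices.tolist())

    reduced = []
    for t in range(T):
        tok = frame_tokens[t]                                              # (H, W, D)
        if t not in top:  # low query relevance -> lower spatial resolution
            tok = F.avg_pool2d(tok.permute(2, 0, 1)[None], pool)[0].permute(1, 2, 0)
        reduced.append(tok.reshape(-1, D))
    return reduced  # list of per-frame token sequences with varying lengths

def window_spatial_compression(token_seqs, window=8, tau=0.9):
    """Within each temporal window, drop a frame's spatial token when it is
    nearly identical to the token at the same position in the window's first
    frame, keeping only the tokens that changed."""
    compressed = []
    for start in range(0, len(token_seqs), window):
        ref = token_seqs[start]
        compressed.append(ref)                       # keep the window's first frame in full
        for seq in token_seqs[start + 1:start + window]:
            if seq.shape == ref.shape:
                sim = F.cosine_similarity(seq, ref, dim=-1)   # (num_tokens,)
                compressed.append(seq[sim < tau])             # keep only changed tokens
            else:                                    # resolutions differ -> keep as is
                compressed.append(seq)
    return compressed
```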
The experimental evaluation in this paper assesses the performance of the proposed LongVU on four public long video understanding benchmarks, comparing it with other state-of-the-art methods.
Update after rebuttal: I appreciate the authors' detailed response; most of my concerns have been settled. I will keep the original rating of "Accept".
Questions for Authors
I have no further questions.
Claims and Evidence
The paper claims that existing sampling schemes have drawbacks, namely that uniform sampling may cause some frames to be missed, and dense sampling may lead to the number of tokens reaching the limit prematurely, resulting in the video being truncated early. While this claim makes intuitive sense, it lacks further validation both qualitatively and quantitatively.
Methods and Evaluation Criteria
Yes. The evaluation benchmarks utilized in this paper include the most common video understanding benchmarks, including EgoSchema, MVBench, MLVU, and Video-MME.
Theoretical Claims
This paper does not make theoretical claims.
Experimental Design and Analysis
I am satisfied with most of the experiments in this paper, except for the visual encoder part.
- Since this paper introduces an additional DINOv2 encoder, it highlights the importance of the visual encoder. However, the paper lacks a thorough discussion of the visual encoders, including why DINOv2 and SigLIP were selected, beyond the brief description in Sec. 3.1; a more comprehensive analysis, such as quantitative ablation studies on different visual projectors, would be better.
- Furthermore, the discussion of DINOv2 and SigLIP in Tab. 3 is limited to their role in feature extraction during the temporal reduction stage. However, prior to the Selective Feature Reduction stage, the SVA is still employed for feature fusion (I hope I understand this correctly). Thus, a quantitative analysis of relying solely on DINOv2 or SigLIP for feature extraction throughout the pipeline is missing.
Supplementary Material
Yes. The supplementary material provides additional information on the training dataset, a comparison between SIGLIP and DINOv2, ablation experiments, and the NIAVH benchmark. It also introduces a new positional encoding method to enhance performance. Furthermore, the limitations of the approach are discussed.
Relation to Prior Work
This paper proposes several novel solutions for feature extraction in long videos, which could be beneficial not only to the field of VLMs but also to the broader, more general domain of video processing.
Missing Important References
Did the authors consider comparing LongVU with some training-free token compression methods [1-4]?
[1] VisionZip: Longer is Better but Not Necessary in Vision Language Models (2024)
[2] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster (2024)
[3] DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models (2024)
[4] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (2024)
Other Strengths and Weaknesses
Strengths
- The ideas behind the spatiotemporal adaptive compression are intuitive and useful.
- In general, the paper is easy to follow.
Weaknesses
- I understand that the tokens filtered out are considered less important compared to those retained. However, the filtered tokens may still contain some useful information. Have the authors considered providing a discussion on the potential implications of discarding these tokens?
- LongVU requires retraining the model, which may limit its portability and practical value.
Other Comments or Suggestions
- Although the authors conducted ablation experiments on the threshold and sliding window in the supplementary material, it would be more convincing to provide quantitative statistics on the number of tokens and the compression ratio after each stage of compression based on these ablation experiments.
- The description of Table 3 is somewhat unclear, as the order of experiments within the table does not align with the sequence in the paper. The unclear description of the specific experimental setup also leaves me somewhat confused.
- The impact statement is missing. It is recommended that the authors add it.
1. The drawbacks of uniform sampling and dense sampling.
Numerous studies, including LLaVA-OneVision, LongVA, and SlowFast-LLaVA, have explored the trade-offs between these approaches. Below, we present results of the baseline LongVA using either uniform sampling or dense sampling (1fps, with truncation at the end). LongVA extends the context length to 224K, yet it still cannot fully process long videos at 1 fps, requiring truncation. Performance peaks around 128-frame sampling and then decreases.
| #Frames | VideoMME (Long) | VideoMME (Avg) |
|---|---|---|
| 32 (uniform) | 45.4 | 51.8 |
| 64 (uniform) | 45.0 | 52.4 |
| 128 (uniform) | 46.2 | 52.6 |
| 384 (uniform) | 46.1 | 51.8 |
| 1fps (dense) | 44.9 | 51.5 |
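To make the two schemes in the table above concrete, here is a small sketch of how the frame indices would be chosen, assuming a hypothetical per-frame token count and context budget; the exact numbers are illustrative, not LongVA's or LongVU's settings.

```python
def uniform_indices(num_frames: int, n: int) -> list:
    """Pick n evenly spaced frame indices; long gaps between samples can skip
    brief but important moments."""
    step = num_frames / n
    return [int(i * step) for i in range(n)]

def dense_indices(num_frames: int, video_fps: float, sample_fps: float = 1.0,
                  tokens_per_frame: int = 144, context_budget: int = 8192) -> list:
    """Sample at sample_fps until the token budget is exhausted, then truncate;
    everything after the cut-off is never seen by the model."""
    stride = max(1, round(video_fps / sample_fps))
    idx = list(range(0, num_frames, stride))
    max_frames = context_budget // tokens_per_frame
    return idx[:max_frames]

# Example: a 30-minute video at 30 fps (54,000 frames)
u = uniform_indices(54_000, 128)   # 128 frames spread over the whole video
d = dense_indices(54_000, 30.0)    # 1 fps sampling truncated after ~56 frames
```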
2. Lack of a thorough discussion of the visual encoders, and of an analysis using DINOv2/SigLIP solely for feature extraction.
We appreciate the reviewer's concern regarding the discussion of visual encoders. However, we emphasize that our approach primarily focuses on adaptive video token reduction rather than an exhaustive exploration of visual encoders.
In the initial stage, we explored various vision encoders, including SAM, DINOv2, and SigLIP. Due to space constraints, we present the ablation studies conducted during the image pretraining phase, where we compare the performance of SigLIP alone with that of a SigLIP and DINOv2 combination.
| Method | GQA | MMVP | POPE | RealWorldQA |
|---|---|---|---|---|
| SigLIP | 61.9 | 31.3 | 85.6 | 59.5 |
| SigLIP + DINOv2 | 62.3 | 51.3 | 86.7 | 61.1 |
The results clearly demonstrate that combining both vision encoders leads to superior performance, resulting in a more robust image understanding model. SigLIP alone ranked as the runner-up. In contrast, DINOv2 and SAM, trained without language supervision, consistently showed an average accuracy more than 5% lower than that of SigLIP.
3. Comparing with training-free token compression methods.
Thank you for sharing these methods. We reported the results in the table below and will incorporate them into our revision.
| Model | MVBench | MLVU | VideoMME |
|---|---|---|---|
| FastV | 56.1 | 62.6 | 57.3 |
| VisionZip | 56.9 | 62.5 | 57.8 |
| DyCoke | 58.2 | 63.8 | 60.4 |
| LongVU (Ours) | 66.9 | 65.4 | 60.6 |
4. Quantitative statistics on the number of tokens and the compression ratio.
We have already illustrated the number of tokens before and after reduction for different video durations in Figure 4. Below, we provide the average compression ratio on the VideoMME dataset for each compression step.
| Temporal Frame Reduction | Query-based Token Reduction | STC |
|---|---|---|
| 45.9% | 52.1% (23.9%) | 59.6% (14.2%) |
At the temporal frame reduction stage, 45.9% of the original tokens remain. The query-based token reduction then retains 52.1% of those (23.9% cumulatively), and STC retains 59.6% of the remainder, yielding a final retained ratio of 14.2%. This breakdown shows the adaptive nature of our approach and the significant compression achieved throughout the process.
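The cumulative ratios in parentheses follow from multiplying the per-stage retained ratios:

$$0.459 \times 0.521 \approx 0.239, \qquad 0.239 \times 0.596 \approx 0.142$$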
5. The order of experiments within the table does not align with the sequence in the paper.
The last three rows of Table 3 outline the sequence of our method: first, DINOv2 for temporal reduction, followed by query-based selection, and finally, STC reduction. The table also includes ablation studies on context length, tokens per frame, and the impact of using SigLIP or DINOv2 for temporal reduction. We apologize for any confusion caused by the table’s condensed format. To enhance clarity, we will reorganize it in the revision—grouping “Uniform” together, renaming "DINO" to "Temporal Reduction (DINO)," and "SigLIP" to "Temporal Reduction (SigLIP)."
6. The impact statement is missing.
Our work introduces a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. This innovation paves the way for future research in video compression tailored for MLLM-based applications, enabling more effective long-video, media, and streaming video understanding. Additionally, we contribute to the open-source community by providing datasets, benchmarks, and models to support the advancement of AI-driven video analysis. We envision a future where MLLMs can directly process compressed video formats aligned with VLLMs that account for spatiotemporal redundancy.
We will add an impact statement section in our revision.
Thanks for the detailed rebuttal; most of my concerns have been settled. I will keep my original rating of "Accept".
Dear Reviewer ow6N,
Thank you very much for your positive recognition of our work. We sincerely appreciate your thoughtful and constructive feedback. We will incorporate all of your suggestions in our revision.
Best, Paper 9623 Authors
This paper presents an adaptive compression algorithm for long video understanding. It reveals that by compressing spatiotemporal visual information, the modern MLLMs can get more video context as input and thus improve long video understanding. This is a solid contribution to the community. Considering the unanimous acceptance across all reviewers, the AC finds no reason to overturn the decision.