PaperHub
COLM 2025 · Poster · 4 reviewers
Overall rating: 6.3/10 (min 6, max 7, std 0.4) · Individual ratings: 6, 7, 6, 6
Confidence: 3.8

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

We introduce SlowFast-LLaVA-1.5, a family of video large language models offering a token-efficient solution for long-form video understanding.

Abstract

Keywords
Multimodal LLM · Video Understanding · Video Question Answering · Long-Form Video Understanding

Reviews and Discussion

Review
Rating: 6

This paper presents a well-motivated and technically sound advance in the domain of video LLMs, targeting long-form video understanding with an emphasis on token efficiency. The paper introduces a family of models at scales ranging from 1B to 7B parameters that integrate the two-stream Slow-Fast mechanism into a LLaVA-style architecture, enabling the models to process both spatially detailed frames at a low rate (Slow pathway) and motion-rich frames at a high rate (Fast pathway) in a token-efficient manner. Training is carried out in two straightforward stages—first on image-only data for general knowledge and reasoning, and then on a joint image-video mixture drawn exclusively from publicly available datasets—ensuring both reproducibility and competitive performance. Empirical results on benchmarks such as LongVideoBench (62.5% at 7B) and MLVU (71.5% at 7B) demonstrate that even the smaller 1B and 3B variants outperform or match existing methods, particularly in long-form scenarios, while using substantially fewer tokens per video input.

Reasons to Accept

This work addresses a pressing challenge in multimodal AI—how to build Video LLMs that are both capable of understanding hour-long content and efficient enough for edge deployment. By systematically incorporating the Slow-Fast projector into a unified supervised fine-tuning pipeline, the authors strike a judicious balance between frame count and token complexity, yielding clear empirical gains across general videoQA, long-form understanding, and temporal reasoning benchmarks. The two-stage training recipe is refreshingly simple, eschewing proprietary data and elaborate multi-phase schedules in favor of publicly accessible datasets and a modestly tuned LLM backbone. The comprehensive ablation studies—ranging from design choices of Slow–Fast instantiations to video–image data ratios—underscore the robustness of the approach and offer practical guidance for future research. Importantly, the open‑source release of model weights and data mixtures will foster reproducibility and community adoption, amplifying the paper’s impact on both academic and industrial fronts.

Reasons to Reject

Despite its strengths, the paper leaves several concerns unaddressed. First, the reliance on uniform FPS sampling when videos exceed 128 frames risks overlooking semantically critical segments in longer recordings, potentially biasing model outputs; the proposed mitigation of adding timestamps remains speculative without empirical validation. Second, by freezing the visual encoder throughout training, the models may underexploit domain-specific visual features that could further boost performance, particularly on text‑rich or fine‑grained visual tasks. Third, while the Slow-Fast mechanism delivers efficiency, its comparative performance against emerging memory‑augmented or dynamic token‑merging techniques is not explored; this narrows the paper’s experimental scope. Finally, the presentation, though generally clear, is dense with acronyms and heavy math in the architecture section, which may impede accessibility for readers less familiar with prior LLaVA and SlowFast literature.

Questions to Authors

  • Have you considered augmenting the uniform frame sampler with a learned importance function or keyframe detector to address the limitation that “a ten‑minute video and a one‑hour video have the same number of input frames”? Incorporating a lightweight attention or reinforcement‑learning module could dynamically prioritize salient segments.
  • What are the inference latency and memory footprints of the 1B and 3B models on a representative edge device (e.g., a smartphone GPU)? Detailed profiling would substantiate claims of mobile‑friendliness.
Comment

Thank you for your review and helpful feedback!

Q1: Have you considered augmenting the uniform frame sampler with a learned importance function or keyframe detector to address the limitation that "a ten‑minute video and a one‑hour video have the same number of input frames"? Incorporating a lightweight attention or reinforcement‑learning module could dynamically prioritize salient segments.

A1: Great question! We plan to develop better frame samplers based on the Slow-Fast idea to further improve performance. Specifically, the model can first process the entire video using only "Fast frames" at low resolution (e.g., 1x1 token), which densely covers the long video with high efficiency. Based on the question, the model then outputs which parts of the long video need to be processed again in detail. Finally, the model "looks twice" at the "Slow frames" at high resolution to generate the final answer. This algorithm could be realized with RL in our future work.
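For concreteness, a minimal pseudocode sketch of this "look twice" idea is shown below; the method names (encode_fast, select_segments, encode_slow, generate) are hypothetical placeholders for illustration, not part of the released model:

```python
# Hypothetical sketch of the "look twice" sampler described above. All method
# names on `model` are assumed placeholders, not the released SlowFast-LLaVA-1.5 API.
def look_twice_answer(model, video, question):
    # Pass 1: densely cover the whole video with cheap, low-resolution "Fast" tokens.
    fast_tokens = model.encode_fast(video, tokens_per_frame=1)

    # The model (possibly trained with RL) predicts which temporal segments
    # need a second, detailed look for this particular question.
    segments = model.select_segments(fast_tokens, question)

    # Pass 2: re-encode only the selected segments as high-resolution "Slow" frames.
    slow_tokens = model.encode_slow(video, segments)

    # Generate the final answer from both token streams.
    return model.generate(question, fast_tokens, slow_tokens)
```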


Q2: What are the inference latency and memory footprints of the 1B and 3B models on a representative edge device (e.g., a smartphone GPU)? Detailed profiling would substantiate claims of mobile‑friendliness.

A2: Thanks for this suggestion. We report the runtime of our model on an H100 GPU in the paper's Table 6. Based on existing studies [1], 7B or smaller VLMs can be deployed to mobile devices with further optimization, such as model compression and quantization. However, please understand that deploying our model to a smartphone is a significant effort that requires substantial engineering time, so we have to leave this study for future work.

[1] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning. Zhang et al. ICLR 2025.


Q3: While the Slow-Fast mechanism delivers efficiency, its comparative performance against emerging memory‑augmented or dynamic token‑merging techniques is not explored; this narrows the paper’s experimental scope.

A3: Thanks for this suggestion. In the paper's Table 6, we compared Slow-Fast with widely used video projectors, where Q-Former uses dynamic token merging and the Perceiver Resampler can be considered an example of memory compression [2]. Please let us know if there is any particular technique that we should compare with. However, given the limited rebuttal period and the large computational resources required to train these models, we may have to include the new results in our paper revision.

[2] Long Short-Term Transformer for Online Action Detection. Xu et al. NeurIPS 2021.


Q4: The presentation, though generally clear, is dense with acronyms and heavy math in the architecture section, which may impede accessibility for readers less familiar with prior LLaVA and SlowFast literature.

A4: As suggested, we will polish our architecture section. In particular, we will move the heavy math into the appendix and give a more intuitive description of our model design.

Review
Rating: 7

The paper introduces SlowFast-LLaVA-1.5, a family of token-efficient VLMs for long-form video understanding. It incorporates the SlowFast mechanism into a streamlined training pipeline, using only publicly available datasets. The models, ranging from 1B to 7B parameters, achieve state-of-the-art performance on various video and image benchmarks, demonstrating superior efficiency and reproducibility.

Reasons to Accept

  1. The proposed VLMs have shown state-of-the-art performance on various video and image benchmarks.

  2. The proposed VLMs are highly efficient, achieving strong performance with only 1B and 3B parameters.

  3. The presentation and organization of the paper are clear and easy to follow.

Reasons to Reject

I do not have significant concerns that can lead to the rejection of this paper.

Comment

Thank you very much for your review and kind words!

Comment

Thanks for the rebuttal. I will keep my score.

Review
Rating: 6

This paper introduces a family of video large language models for long-form video understanding. The authors integrate a SlowFast Video Projector into a LLaVA-style MLLM, and train the model with a simple two-stage training recipe and a carefully curated data mixture. Experiments show the proposed models offer efficiency and strong performance.

Reasons to Accept

  • The effectiveness of the SlowFast Projector has been verified in the experiments.
  • The paper proposes a simple two-stage training recipe, avoiding complex training stages used in previous works in this field.
  • The authors carefully curate a data mixture based on public data, and the training datasets will be publicly available.
  • The proposed models show strong performance on different model scales, and the model maintains strong image performance.

Reasons to Reject

Major Weakness

  1. The SlowFast Video Projector, which is the main contribution of this paper, is exactly the same as the SlowFast projector proposed in the previous work SlowFast-LLaVA [1]. No architecture modification is applied. The authors experiment with different instantiations of the SlowFast idea termed ISF, but experiments prove that this is not effective, and they used SlowFast-LLaVA's original instantiation. In conclusion, the SlowFast Video Projector should not be the main contribution/innovation of this paper. However, the majority of the paper's content and ablation experiments are about this part.
  2. The curated data mixture is a main contribution. However, the description of this part is quite brief and does not introduce any innovative techniques for data selection or filtering. A more detailed explanation and supporting experiments are needed for this section.
  3. Comparisons between the proposed two-stage training recipe and more complex recipes (e.g., VideoLLaMA 3 [2]) should be included.

Minor Weaknesses

The models utilize the SlowFast projector for videos and an MLP projector for images. This design seems redundant, as mainstream video MLLMs (InternVL-2.5 [3], Qwen2.5-VL [4], VideoChat-Flash [5]) can process video and image with a unified projector by treating an image as a single-frame video. Discussion and experiments about this dual-projector design should be included.

[1] Xu, Mingze, et al. "Slowfast-llava: A strong training-free baseline for video large language models." arXiv preprint arXiv:2407.15841 (2024).

[2] Zhang, Boqiang, et al. "VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding." arXiv preprint arXiv:2501.13106 (2025).

[3] Chen, Zhe, et al. "Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling." arXiv preprint arXiv:2412.05271 (2024).

[4] Bai, Shuai, et al. "Qwen2.5-VL technical report." arXiv preprint arXiv:2502.13923 (2025).

[5] Li, Xinhao, et al. "Videochat-flash: Hierarchical compression for long-context video modeling." arXiv preprint arXiv:2501.00574 (2024).

Comment

Thank you for your feedback!

Q1: The SlowFast Video Projector, which is the main contribution of this paper, is exactly the same as the SlowFast projector proposed in the previous work SlowFast-LLaVA [1]. No architecture modification is applied. The authors experiment with different instantiations of the SlowFast idea termed ISF, but experiments prove that this is not effective, and they used SlowFast-LLaVA's original instantiation. In conclusion, the SlowFast Video Projector should not be the main contribution/innovation of this paper. However, the majority of the paper's content and ablation experiments are about this part.

A1: We understand your concern, but we did not claim the architecture of the SlowFast Video Projector as our main contribution, as discussed in the last paragraph of the Introduction. Instead, our focus is to incorporate the SlowFast mechanism into an SFT framework and demonstrate its effectiveness on long-form video understanding. We believe this is valuable because, although SlowFast-LLaVA [1] first proposed using slow-fast inputs in Video LLMs as a training-free model, the optimal way to implement this idea under an SFT framework had not been explored. Our paper is the first to conduct such studies and achieve competitive results. Moreover, our contribution is not limited to the SlowFast Projector. Compared to existing Video LLMs, our models achieve state-of-the-art-level performance across various benchmarks using a simple training recipe and only open-source datasets. Considering that both training Video LLMs and collecting large-scale video datasets are very costly, we believe our paper can benefit the community beyond model novelty, especially many small companies and university labs, to advance research on long-form video understanding.


Q2: The curated data mixture is a main contribution. However, the description of this part is quite brief and does not introduce any innovative techniques for data selection or filtering. A more detailed explanation and supporting experiments are needed for this section.

A2: Thanks for this good suggestion. We tune our data mixture through comprehensive empirical studies. We first build a base mixture of datasets that most existing Video LLMs use, including large-scale video datasets with general captioning and QA labels. Then, we keep adding datasets (or their subsets) that are proposed for different video tasks, such as long-form understanding, temporal reasoning, and detailed spatial and temporal perception. We mainly evaluate each new mixture on Video-MME since it tests a model from various perspectives, and we add a new dataset only if we observe non-marginal improvements; a schematic sketch of this loop is shown below. As suggested, our paper revision will describe in detail how we build our data mixture, with experimental results as support.
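The greedy selection loop described above can be sketched as follows; the dataset handling, the improvement threshold, and the train/evaluate helpers are illustrative assumptions, not the exact procedure or code from the paper:

```python
# Illustrative sketch of the greedy data-mixture tuning described above.
# `train_model` and `evaluate_on_videomme` are assumed helpers; the threshold is arbitrary.
def build_data_mixture(base_mixture, candidate_datasets, threshold=0.5):
    mixture = list(base_mixture)                      # general captioning + QA data
    best_score = evaluate_on_videomme(train_model(mixture))
    for dataset in candidate_datasets:                # long-form, temporal reasoning, ...
        score = evaluate_on_videomme(train_model(mixture + [dataset]))
        if score - best_score >= threshold:           # keep only non-marginal gains
            mixture.append(dataset)
            best_score = score
    return mixture
```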


Q3: Comparisons between the proposed two-stage training recipe and more complex recipes (e.g., VideoLLaMA 3 [2]) should be included.

A3: This is a good question. First, please understand that we do not have sufficient training resources to conduct complex training recipes, especially during the rebuttal. This limitation likely affects the video research of many other research labs as well. Second, most state-of-the-art Video LLMs, such as VideoLLaMA3, use large-scale in-house datasets to substantially improve their performance. Thus, we cannot even reproduce their recipe by following their papers.

Please kindly consider that the significance of a research paper extends beyond the introduction of novel training methods or architectures. We believe that achieving state-of-the-art performance using a two-stage training recipe and only open-source datasets is substantial, and the findings presented in the paper are insightful for the community.


Q4: The models utilize the SlowFast projector for videos and an MLP projector for images. This design seems redundant, as mainstream video MLLMs (InternVL-2.5 [3], Qwen2.5-VL [4], VideoChat-Flash [5]) can process video and image with a unified projector by treating an image as a single-frame video. Discussion and experiments about this dual-projector design should be included.

A4: To clarify, we also process images as single-frame videos. Images and videos share the same projector parameters, except that video inputs go through a slow-fast pooling before the projection. The logic is also very simple from an implementation perspective: if an input has 5 dimensions (i.e., batch size, channel, temporal, height, width), it goes through the slow-fast pooling first; otherwise, it is an image input with 4 dimensions and goes directly to the MLP layers, as sketched below. We will rewrite this part to resolve the confusion.
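For illustration, the dispatch amounts to a few lines of PyTorch-style code; `slow_fast_pool` and `mlp_projector` are placeholder module names, not the released implementation:

```python
import torch

# Minimal sketch of the shared projector dispatch described above (placeholder modules).
def project_visual_features(features: torch.Tensor, slow_fast_pool, mlp_projector):
    if features.dim() == 5:
        # Video input: (batch, channel, temporal, height, width)
        # -> slow-fast pooling first, then the shared MLP projector.
        features = slow_fast_pool(features)
    # Image input (4-D: batch, channel, height, width) is treated as a
    # single-frame video and goes directly through the same MLP projector.
    return mlp_projector(features)
```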

Comment

Dear reviewer,

Thank you again for your thoughtful review! We did our best to address all your comments in the rebuttal. Do you feel that our responses have resolved your concerns? We would appreciate it if you could consider raising your score.

Please let us know if there's anything else we can clarify.

Authors

Comment

Dear reviewer,

Just a gentle follow-up in case our previous message was missed. We would greatly appreciate your feedback on our response. Please let us know if there is anything we can clarify. Thank you for your time.

Authors

Comment

Thank you for the response. I decided to raise my score. It is highly recommended to revise the manuscript accordingly in the final version.

Comment

Thank you for raising your score!

Review
Rating: 6

This paper introduces SlowFast-LLaVA-1.5, a family of token-efficient video large language models designed for long-form video understanding. The authors employ the SlowFast mechanism to balance detailed spatial semantics and long-range temporal context, achieving state-of-the-art performance on multiple benchmarks. The streamlined training pipeline and use of publicly available datasets ensure reproducibility. Comprehensive experiments and ablation studies validate the model's effectiveness. The work is well-structured and makes contributions to the field of video understanding with LLMs.

Reasons to Accept

  1. The experiments are comprehensive across 1B, 3B, and 7B model scales, demonstrating the superiority of SlowFast-LLaVA in both image and video understanding. Additionally, the ablation study illustrates that SlowFast achieves a balanced trade-off between performance and efficiency among various projector designs, and also highlights the general effectiveness of the architectural designs (GSF, ISF).
  2. This paper provides insights into applying the slow-fast mechanism during training, in contrast to previous works where it was only used at inference time.
  3. The paper is well-written, and easy to follow.

Reasons to Reject

  1. The paper lacks novelty, as the effectiveness of the slow-fast mechanism has already been demonstrated in prior work [1] on video understanding.
  2. The use of a fixed-frame sampling architecture appears outdated and may be inadequate for addressing the demands of proactive streaming video applications [2].
  3. The contribution of the Fast pathway seems minor in Table 5, particularly when comparing Row 1 and Row 6 on the LongVideoBench validation set.

[1] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He. SlowFast Networks for Video Recognition. ICCV 2019.

[2] Junming Lin, Zheng Fang, Chi Chen, Zihao Wan, Fuwen Luo, Peng Li, Yang Liu, Maosong Sun. StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding. arXiv 2024.

Questions to Authors

  1. The authors have conducted validation on the Qwen2.5 backbone. It is recommended to further verify the effectiveness of the SlowFast mechanism on a broader range of backbones, such as LLaMA and InternLM.
  2. More benchmarks should be included in Table 5 to better demonstrate the effectiveness of both the Slow and Fast pathways.
  3. A more fair comparison policy is recommended for Table 7 when drawing conclusions about the effectiveness of mixed image-video training. Specifically, the 1:1 image-to-video ratio should be compared against the 2:0 (image-only) and 0:2 (video-only) settings.
Comment

Thank you for your feedback! We tried our best to give detailed answers with solid experimental results. We hope they address your concerns and convince you to raise the score.

Q1: The paper lacks novelty, as the effectiveness of the slow-fast mechanism has already been demonstrated in prior work [1] on video understanding.

A1: We claim two main contributions compared to the original SlowFast Network. First, our goal and motivation are fundamentally different. SlowFast Network replaces the optical flow stream with high frame rate inputs, because precomputing the optical flow fields is very costly and slow. However, our slow-fast design aims to model long-range temporal context by taking more input frames while optimizing the token efficiency. Second, our slow-fast design is tailored specifically for VLMs. In the original SlowFast Network, the primary difference between the Slow and Fast pathways lies in their use of different frame rates. In contrast, our approach differentiates the two pathways by resolution. We employ low-resolution frames at a high frame rate in the Fast pathway, capturing fine-grained temporal information within an efficient token budget. Meanwhile, the Slow pathway uses high-resolution frames to enhance spatial details.
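As a back-of-the-envelope illustration (the frame counts and per-frame token counts below are purely hypothetical, not the paper's actual configuration), the token savings of this slow-fast split can be computed as:

```python
# Hypothetical token-budget comparison; all numbers are illustrative placeholders.
slow_frames, slow_tokens_per_frame = 32, 24 * 24   # high resolution, low frame rate
fast_frames, fast_tokens_per_frame = 96, 4 * 4     # low resolution, high frame rate

slow_fast_tokens = slow_frames * slow_tokens_per_frame + fast_frames * fast_tokens_per_frame
all_high_res_tokens = (slow_frames + fast_frames) * slow_tokens_per_frame

print(slow_fast_tokens)      # 19968
print(all_high_res_tokens)   # 73728 -- roughly 3.7x more tokens for the same frame coverage
```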

More importantly, compared to existing Video LLMs, our models achieve state-of-the-art level performance across various benchmarks by using a simple training recipe and only open-source datasets. Considering that both training Video LLMs and collecting large-scale video datasets are very costly, we believe our paper can benefit the community beyond model novelty, especially for many small companies and university labs, to advance research on long-form video understanding.


Q2: The use of a fixed-frame sampling architecture appears outdated and may be inadequate for addressing the demands of proactive streaming video applications [2].

A2: We agree that a dynamic-frame sampling algorithm could improve the model performance, but we disagree that fixed-frame sampling is outdated. First, we clarify that our model mainly uses FPS sampling and only falls back to fixed-frame sampling when the video duration is longer than 128s; a sketch of this policy is given below. Second, our sampling method is adopted by most state-of-the-art Video LLMs, such as LLaVA-Video, Apollo, and VideoLLaMA3, where Apollo in particular studies the impact of using different frame samplers.
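A minimal sketch of this sampling policy; the function name, the 1 FPS default, and the 128-frame cap are assumptions for illustration rather than the exact released configuration:

```python
import numpy as np

# Illustrative sketch of FPS sampling with a fixed-frame fallback for long videos.
def sample_frame_indices(duration_s: float, total_frames: int, fps: float = 1.0, max_frames: int = 128):
    wanted = int(duration_s * fps)
    if wanted <= max_frames:
        # FPS sampling: take one frame for every 1/fps seconds of video.
        step = (total_frames / duration_s) / fps
        return [int(i * step) for i in range(wanted)]
    # Fallback: uniformly sample a fixed number of frames across the whole video.
    return np.linspace(0, total_frames - 1, max_frames).astype(int).tolist()
```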

To address the reviewer's concern about streaming video applications, we evaluate our model on the suggested StreamingBench [2] benchmark. We convert SlowFast-LLaVA-1.5-7B into a proactive streaming assistant by adopting the StreamBridge [3] technique without fine-tuning. The results are shown in the following table. We observe that SlowFast-LLaVA-1.5-7B achieves state-of-the-art performance on Real-Time Visual Understanding and Contextual Understanding, demonstrating its effectiveness for streaming video understanding. We fall behind on Omni-Source Understanding since our model has never been trained on interleaved datasets with audio. We will include more detailed results and discussions in our paper revision.

| Model | Real-Time Visual Understanding (All) | Omni-Source Understanding (All) | Contextual Understanding (All) |
| --- | --- | --- | --- |
| SOTA in StreamingBench [2] Table 2 | 71.1 | 38.4 | 35.0 |
| SlowFast-LLaVA-1.5-7B + StreamBridge | 72.5 | 33.1 | 36.0 |

[2] StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding. Lin et al. arXiv 2024.

[3] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant. Wang et al. arXiv 2025.

评论

Q3: The contribution of the Fast pathway seems minor in Table 5, particularly when comparing Row 1 and Row 6 on the LongVideoBench validation set. More benchmarks should be included in Table 5 to better demonstrate the effectiveness of both the Slow and Fast pathways.

A3: Great question. As suggested, we report results on more video benchmarks in the following table. First, we observe that Row 6 consistently outperforms Row 1, especially +2.3% on VideoMME (general video QA), +1.7% on MLVU (long-form understanding), and +1.9% on TempCompass (temporal reasoning). This demonstrates the advantage of using the Fast pathway. Second, the difference is marginal on LongVideoBench because most of its questions rely on spatial perception [3] where the Slow pathway is more important. We use this benchmark for ablation because it is the most widely used benchmark for long-form video understanding. To further clarify the contribution of the Fast pathway, we will include more benchmarks in Table 5 of our paper revision.

| Model | Video-MME (avg) | LongVideoBench | MLVU | LVBench | NextQA | PerceptionTest | TempCompass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| slow-only (Row 1 of paper's Table 5) | 52.1 | 52.4 | 60.4 | 35.1 | 76.7 | 61.1 | 52.5 |
| slow-fast (Row 6 of paper's Table 5) | 54.4 (+2.3) | 52.7 (+0.3) | 62.1 (+1.7) | 36.5 (+1.4) | 77.8 (+1.1) | 62.4 (+1.3) | 54.4 (+1.9) |

[3] Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? Feng et al. arXiv 2025.


Q4: The authors have conducted validation on the Qwen2.5 backbone. It is recommended to further verify the effectiveness of the SlowFast mechanism on a broader range of backbones, such as LLaMA and InternLM.

A4: Thanks for this suggestion. We use Qwen2.5 to fairly compare with state-of-the-art Video LLMs. Since SlowFast-LLaVA [4] is built upon LLaMA and Phi, we believe our slow-fast design should be generalizable to different LLM backbones. Please understand that changing the LLM backbone requires considerable model training time, resources, and budget, so we may not be able to complete these experiments during the rebuttal. But we will adopt the reviewer's suggestion and include these results in our future revision.

[4] SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models. Xu et al. arXiv 2024.


Q5: A more fair comparison policy is recommended for Table 7 when drawing conclusions about the effectiveness of mixed image-video training. Specifically, the 1:1 image-to-video ratio should be compared against the 2:0 (image-only) and 0:2 (video-only) settings.

A5: As suggested, we re-train our model using the 2:0 (image-only) and 0:2 (video-only) settings, and compare them with our 1:1 ratio mixture. The conclusion is consistent with our paper's, as shown in the table below. First, since the 2:0 (image-only) model is never trained on video data, we follow SlowFast-LLaVA and evaluate it as a training-free Video LLM. Both its video and image performance are lower than with mixed image-video training, demonstrating that image and video understanding can help each other. Second, for the 0:2 (video-only) model, we observe a trend similar to the 0:1 model (first row in the paper's Table 7): its video results are on par with our 1:1 ratio mixture, but the text-rich performance (e.g., OCRBench and TextVQA) drops significantly. We will update these results in our paper revision.

| Model | Video-MME (avg) | LongVideoBench | MMMU | RW-QA | OCRBench | TextVQA |
| --- | --- | --- | --- | --- | --- | --- |
| 2:0 (image-only) | 49.8 | 48.2 | 38.4 | 58.4 | 66.9 | 69.4 |
| 0:2 (video-only) | 53.5 | 52.4 | 39.3 | 56.6 | 61.9 | 64.6 |
| 1:1 (from paper's Table 7) | 53.5 | 52.1 | 39.9 | 58.5 | 68.3 | 69.5 |
Comment

I appreciate the authors' rebuttal and am inclined to raise my score to a weak accept, provided that the experimental results are updated in the final revision.

Comment

Thank you for raising your score!

Final Decision

This paper introduces SlowFast-LLaVA-1.5 (SF-LLaVA-1.5), a state-of-the-art family of Video Large Language Models tailored for token-efficient long-form video understanding. The authors aim to balance computational efficiency with detailed spatial and temporal modeling through the integration of the SlowFast mechanism into a streamlined training pipeline.

Pros:

  • The authors report comprehensive results on various video and image understanding benchmarks.
  • The integration of the two-stream SlowFast mechanism is seen as a key factor in improving long-form video understanding while maintaining high efficiency.

Cons:

  • There are concerns that uniform FPS sampling may miss semantically important frames.
  • The authors should add more discussion comparing with the previous SlowFast architecture, which would help readers understand the motivation of this paper.

[Automatically added comment] At least one review was discounted during the decision process due to quality.