SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
We introduce SlowFocus, a straightforward yet effective approach that, mirroring human cognition, first identifies the segments pertinent to the query and then completes fine-grained temporal tasks based on the temporal details it locates.
Abstract
Reviews and Discussion
This paper focuses on the video LLM (Vid-LLM) area and attempts to improve fine-grained video understanding. The authors design modules and training data to identify relevant temporal segments based on the given query and then generate outputs based on the identified segments.
Strengths
Overall, the writing is well-organized and easy to follow. The strengths can be summarized as follows:
- This work introduces a two-stage inference strategy into the Vid-LLMs area, following the coarse-to-fine mode. Such a strategy provides a new perspective for enhancing video understanding.
- This work proposes a new benchmark FineAction-CGR to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks, which provides a new perspective to evaluate Vid-LLMs' capabilities.
Weaknesses
There are some weaknesses listed as follows:
- Although the proposed coarse-to-fine reasoning approach is relatively novel in Vid-LLMs, I have concerns about its practicality. Firstly, this reasoning method nearly doubles the inference cost for the same question compared to other methods. Secondly, the accuracy of the final response heavily relies on the accuracy of event localization in the first stage. According to the benchmark validation provided by the authors, even when trained on proprietary data, SlowFocus only barely meets the passing mark in temporal grounding mIoU. This significantly impacts the accuracy of temporal reasoning in the second stage.
- The authors chose relatively weak baselines and attempted to claim the superiority of their results. I suggest comparing their method with MovieChat [1] and VideoChat2 [2] to substantiate their claims of advancement. Moreover, despite using more data than LLaMA-VID, their performance on various benchmarks is not significantly better than LLaMA-VID, raising questions about the effectiveness of this two-stage approach.
- The authors need to validate the effectiveness of their two-stage model on benchmarks involving long videos, which is crucial to substantiate their claims. For instance, they could use benchmarks like EgoSchema and MovieChat-1K for this purpose.
[1] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding. CVPR 2024
[2] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. CVPR 2024
Questions
See weaknesses above.
Limitations
N/A
Q1: I have concerns about its practicality. Firstly, this reasoning method nearly doubles the inference cost for the same question compared to other methods. Secondly, the accuracy of the final response heavily relies on the accuracy of event localization in the first stage. According to the benchmark validation provided by the authors, even when trained on proprietary data, SlowFocus only barely meets the passing mark in temporal grounding mIoU. This significantly impacts the accuracy of temporal reasoning in the second stage.
A:
- The additional cost is actually minimal, as the low-frequency visual tokens only need to be encoded once. Moreover, the relevant segment grounding is not a time-consuming task, as it only involves processing short sequence lengths. We also report our inference speed compared with baselines in the table below, showing that our method results in only a small increase in inference cost while significantly improving fine-grained video understanding.

| Method | LoRA | Inference time (s) | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|
| VTimeLLM | √ | 1.27 | 27.69 | 0.05 | 0.09 | 0.08 | 0.12 | 9.96 | 0.54 |
| LLaMA-VID | × | 1.25 | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| Ours | √ | 1.51 | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
- The temporal grounding task on FineActionCGR is challenging, as the ground truth often occurs within a very short time. Therefore, the current mIoU of 66.68 is actually sufficiently practical. We have also conducted experiments demonstrating that the current temporal grounding capability has reached saturation and is not the main limiting factor for further performance improvement, as detailed in the table below.

| Stage 2 | Temporal grounding (mIoU) | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|
| 0.1 | 25.17 | 0.24 | 0.21 | 0.19 | 0.86 | 24.88 | 1.33 |
| 0.25 | 47.83 | 0.49 | 0.38 | 0.42 | 2.12 | 44.73 | 2.28 |
| 0.5 | 61.29 | 0.64 | 0.43 | 0.65 | 3.05 | 52.59 | 2.72 |
| 1 | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |

The column "Stage 2" means the number of epochs for stage 2's fine-tuning.
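For reference, the grounding metric here is the standard temporal IoU; below is a minimal, generic sketch of how mIoU and R@tIoU are typically computed (an illustration of the standard metric, not our exact evaluation code):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_and_recall(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    miou = 100 * sum(ious) / len(ious)
    recall = {t: 100 * sum(i >= t for i in ious) / len(ious) for t in thresholds}
    return miou, recall

# A 2 s prediction overlapping a 3 s ground truth by 1.5 s already drops to IoU ~0.43,
# which illustrates why very short target segments make a high mIoU hard to reach.
print(temporal_iou((10.0, 12.0), (10.5, 13.5)))  # ~0.43
```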
Q2: The authors chose relatively weak baselines and attempted to claim the superiority of their results. I suggest comparing their method with MovieChat and VideoChat2 to substantiate their claims of advancement. Moreover, despite using more data than LLaMA-VID, their performance on various benchmarks is not significantly better than LLaMA-VID, raising questions about the effectiveness of this approach.
A: Thank you for the feedback. We have added the comparisons with MovieChat and VideoChat2 in the table below.
| Method | LLM | LoRA | Temporal grounding (mIoU) | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|
| MovieChat | Vicuna-7B | × | 1.87 | 0.23 | 0.11 | 0.15 | 0.42 | 20.81 | 1.21 |
| VideoChat2 | Vicuna-7B | × | 0.28 | 0.25 | 0.16 | 0.21 | 0.37 | 18.26 | 1.1 |
| Ours | Vicuna-7B | √ | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
We have also fine-tuned prior work using the same dataset, as detailed in the table below. The results prove that the improvement achieved by introducing the stage 3 dataset is limited, with the primary gains being attributed to the proposed algorithm. Compared to prior work, our approach requires only a limited amount of additional data to adapt to the MFS algorithm and achieve overwhelming advantages on fine-grained benchmarks.
| Method | LLM | LoRA | Stage 3 | Temporal grounding (mIoU) | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | Vicuna-7B | × | × | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| LLaMA-VID | Vicuna-7B | √ | √ | 22.38 | 0.23 | 0.2 | 0.37 | 1.03 | 24.81 | 1.26 |
| Ours | Vicuna-7B | √ | √ | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
Q3: Need to validate the effectiveness of their two-stage model on benchmarks involving long videos, which is crucial to substantiate their claims. For instance, they could use benchmarks like EgoSchema and MovieChat-1K.
A: To address the reviewer's concern, we evaluate our method on MovieChat-1K, as detailed in the table below. The results show that although our method is not specifically trained on long video benchmarks (in contrast, MovieChat has undergone targeted training for long videos), it still achieves competitive results.
| Method | LLM | LoRA | Global mode (Acc) | Global mode (Score) | Breakpoint mode (Acc) | Breakpoint mode (Score) |
|---|---|---|---|---|---|---|
| VideoChat | Vicuna-7B | × | 57.8 | 3 | 46.1 | 2.29 |
| VideoLLaMA | Vicuna-7B | × | 51.7 | 2.67 | 39.1 | 2.04 |
| Video-ChatGPT | Vicuna-7B | × | 47.6 | 2.55 | 48 | 2.45 |
| MovieChat | Vicuna-7B | × | 62.3 | 3.23 | 48.3 | 2.57 |
| Ours | Vicuna-7B | √ | 58.6 | 3.14 | 48.1 | 2.53 |
This work seems geared more towards long video understanding; otherwise, for short videos, the so-called event localization and video slicing wouldn't make much sense. However, the authors didn't provide convincing results on long video benchmarks, and I would need comparative results on EgoSchema and Video-MME to see exactly how SlowFocus performs.
Thanks for the valuable comments.
First, we would like to clarify that our work is not specifically designed for long videos but rather focuses on fine-grained video understanding, independent of video length.
Even short videos with fine-grained temporal tasks present significant challenges for existing VLMs, as demonstrated by our benchmark evaluation in Table 1 of the main paper. In addition, the ablations in Table 3 of the main paper, which show that mixed-frequency sampling brings a significant performance improvement in fine-grained video understanding, strongly support the effectiveness of our proposed approach (i.e., event localization and video slicing) on short video benchmarks.
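To make the event localization and video slicing flow concrete, below is a minimal sketch of the coarse-to-fine inference idea. The callables `ground_segment` and `answer` stand in for the grounding and answering stages of a Vid-LLM and are illustrative placeholders rather than our actual implementation; the fps values are likewise examples.

```python
from typing import Callable, List, Tuple

Frame = Tuple[float, object]  # (timestamp in seconds, frame data)

def sample(frames: List[Frame], fps: float, span: Tuple[float, float] = None) -> List[Frame]:
    """Keep roughly `fps` frames per second, optionally restricted to a time span."""
    lo, hi = span if span else (frames[0][0], frames[-1][0])
    picked, next_t = [], lo
    for t, f in frames:
        if lo <= t <= hi and t >= next_t:
            picked.append((t, f))
            next_t = t + 1.0 / fps
    return picked

def coarse_to_fine_answer(
    frames: List[Frame],
    question: str,
    ground_segment: Callable[[List[Frame], str], Tuple[float, float]],  # stage 1: locate the span
    answer: Callable[[List[Frame], List[Frame], str], str],             # stage 2: mixed-frequency QA
    low_fps: float = 1.0,
    high_fps: float = 10.0,
) -> str:
    low = sample(frames, low_fps)              # global low-frequency view, encoded once
    t0, t1 = ground_segment(low, question)     # query-relevant segment from the low-frequency view
    high = sample(frames, high_fps, (t0, t1))  # dense sampling only inside the grounded span
    return answer(low, high, question)         # answer from the mixed low/high-frequency tokens
```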
Second, we have additionally tested our method on MovieChat-1K in our last response, which is a widely used long video benchmark. We now provide the results on EgoSchema, as detailed in the table below.
| Method | Acc |
|---|---|
| FrozenBiLM | 26.9 |
| VIOLET | 19.9 |
| InternVideo | 32.1 |
| LLoVi-7B | 33.5 |
| Vamos | 36.7 |
| LangRepo-12B | 41.2 |
| Ours | 39.7 |
The results further demonstrate that, despite not being specifically trained on long video datasets, our method still achieves competitive performance. We will include more evaluation results on other long video benchmarks (such as Video-MME) in the revision.
Hope our response helps the reviewer's final recommendation. Thank you!
Dear Reviewer bGgU
Thanks again for the valuable comments and suggestions. As the discussion phase is nearing its end, we wondered if the reviewer might still have any concerns that we could address. We believe our results on long video benchmarks (including MovieChat-1K and EgoSchema) addressed the questions/concerns.
It would be great if the reviewer can kindly check our responses and provide feedback with further questions/concerns (if any). We would be more than happy to address them. Thank you!
Best wishes,
Authors
This paper focuses on fine-grained video understanding with large language models. Current works suffer from the dilemma between the per-frame token number and temporal sampling frequency to maintain an acceptable sequence length into the language model. The authors propose to sample a global view with low sampling frequency to perceive the temporal positions that are related to the question, then sample the frames within the predicted temporal window with high frequency for fine-grained video perception. The proposed dataset FineAction-CGR can well reflect the ability in temporal grounding and fine-grained question answering.
Strengths
- The exploration of the per-frame token number and temporal sampling frequency for video understanding is important and fundamental.
- The authors propose to first perceive the related temporal window and then densely sample the frames within the temporal span. The architecture makes sense and is suitable for fine-grained video captioning and question answering while maintaining computation efficiency.
- The dataset can evaluate temporal grounding and fine-grained video understanding ability and could contribute to progress in video understanding with large language models.
Weaknesses
- What is the duration of the videos used for the temporal grounding evaluation? Will the low-frequency sampling cause the model to lose much information in excessively long videos?
- In Fig. 2, the temporal grounding derives from the low-frequency sampled video features. Why does the high-frequency sampling also have an impact on the temporal grounding performance?
- The architecture of the temporal encoder is not illustrated.
- In Table 5, what does Fps=64 mean? The video dataset usually contains videos with around 30 fps, how does 64 fps work?
- The proposed method presents an overwhelming advantage on the proposed FineAction-CGR dataset, while the improvements on other benchmarks are marginal. I understand this is because the rest are comparatively easy and coarse-grained. However, it is better to provide the evaluation on other challenging video benchmarks like long video datasets, e.g., EgoSchema [1], MovieChat [2], to validate the effectiveness of the temporal grounding in long sequences for a more comprehensive comparison.
[1] Mangalam, Karttikeya, Raiymbek Akshulakov, and Jitendra Malik. "EgoSchema: A diagnostic benchmark for very long-form video language understanding." Advances in Neural Information Processing Systems 36 (2024).
[2] Song, Enxin, et al. "MovieChat: From dense token to sparse memory for long video understanding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Questions
How do you condense each frame into N_v tokens?
Limitations
The authors have discussed the limitations on the resolution.
Q1: What is the duration of the videos used for the temporal grounding evaluation? Will the low-frequency sampling cause the model to lose much information in excessively long videos?
A: The duration of the temporal grounding tasks typically ranges from 1 to 10 seconds. To explore this question, we carry out ablations on the low-frequency sampling rate (fps) in the table below. Experimental results demonstrate that an fps value of 1 is sufficient to perform the temporal grounding task.
| Low frequency | mIoU | R@0.3 | R@0.5 | R@0.7 | Acc | Score |
|---|---|---|---|---|---|---|
| Fps=0.5 | 63.47 | 81.71 | 69.44 | 53.27 | 50.13 | 2.61 |
| Fps=1 | 66.68 | 85.8 | 73.01 | 56.25 | 53.1 | 2.78 |
| Fps=2 | 67.2 | 86.28 | 73.29 | 56.71 | 54.19 | 2.83 |
Q2: In Fig. 2, the temporal grounding derives from the low-frequency sampled video features. Why does the high-frequency sampling also have an impact on the temporal grounding performance?
A: Actually, when evaluated on temporal grounding tasks, the mixed-frequency sampling algorithm is also applied. This is why high-frequency sampling impacts the temporal grounding performance. We have also conducted an ablation study on the effect of low and high-frequency features when mixed-frequency sampling is turned off. The results presented in the table below demonstrate that under this setting, the high-frequency sampling does not influence the performance.
| Low frequency | High frequency | mIoU | R@0.3 | R@0.5 | R@0.7 | Acc | Score |
|---|---|---|---|---|---|---|---|
| Fps=1 | 10 | 41.27 | 52.88 | 43.71 | 31.89 | 33.61 | 1.69 |
| Fps=1 | 20 | 41.52 | 52.06 | 44.18 | 31.64 | 33.55 | 1.68 |
| Fps=0.5 | 10 | 38.05 | 48.25 | 40.65 | 29.11 | 31.23 | 1.55 |
| Fps=2 | 10 | 42.46 | 55.13 | 46.61 | 31.74 | 34.07 | 1.69 |
Q3: The architecture of the temporal encoder is not illustrated.
A: We have described the architecture of the temporal encoder in the main paper (lines 171-177), which consists of several learnable queries corresponding to different relative sequential positions. We will illustrate this in Figure 2 in the revised version.
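For illustration, the sketch below shows one way a temporal encoder with learnable relative-position queries can be realized in PyTorch; the hyper-parameters and specific layer choices are example values, not our exact configuration.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Learnable queries, one per relative temporal position, cross-attend to frame tokens."""
    def __init__(self, dim: int = 1024, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, dim) tokens from the sampled frames of one segment
        q = self.queries.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        out, _ = self.attn(q, frame_tokens, frame_tokens)
        return out  # (B, num_queries, dim): temporally ordered summary tokens

# TemporalEncoder()(torch.randn(2, 30, 1024)).shape -> torch.Size([2, 16, 1024])
```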
Q4: In Table 5, what does Fps=64 mean? The video dataset usually contains videos with around 30 fps, how does 64 fps work?
A: FineAction is an action-centric dataset that includes some videos recorded at high fps (over 64).
For videos with an fps of 64 or lower, we dynamically adjust the sampling frequency to match the original fps, which means all frames are used as input.
Additionally, we will refer to the term fps as max fps in the revised version for clarity.
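In other words, the reported value acts as an upper bound on the sampling rate. A tiny sketch of this rule (the variable names are ours, for illustration):

```python
def effective_fps(native_fps: float, max_fps: float = 64.0) -> float:
    """Sample at the video's native frame rate unless it exceeds the cap."""
    return min(native_fps, max_fps)

assert effective_fps(30.0) == 30.0    # a 30 fps video keeps every frame
assert effective_fps(120.0) == 64.0   # a 120 fps video is capped at max fps = 64
```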
Q5: The proposed method presents an overwhelming advantage on the proposed FineAction-CGR dataset, while the improvements on other benchmarks are marginal. I understand this is because the rest are comparatively easy and coarse-grained. However, it is better to provide the evaluation on other challenging video benchmarks like long video datasets, e.g., EgoSchema [1], MovieChat [2], to validate the effectiveness of the temporal grounding in long sequences for a more comprehensive comparison.
A: To address the reviewer's concern, we evaluate our method on MovieChat-1K, as detailed in the table below. The results show that although our method is not specifically trained on long video benchmarks (in contrast, MovieChat has undergone targeted training for long videos), it still achieves competitive results.
| Method | LLM | LoRA | Global mode (Acc) | Global mode (Score) | Breakpoint mode (Acc) | Breakpoint mode (Score) |
|---|---|---|---|---|---|---|
| VideoChat | Vicuna-7B | × | 57.8 | 3 | 46.1 | 2.29 |
| VideoLLaMA | Vicuna-7B | × | 51.7 | 2.67 | 39.1 | 2.04 |
| Video-ChatGPT | Vicuna-7B | × | 47.6 | 2.55 | 48 | 2.45 |
| MovieChat | Vicuna-7B | × | 62.3 | 3.23 | 48.3 | 2.57 |
| Ours | Vicuna-7B | √ | 58.6 | 3.14 | 48.1 | 2.53 |
Q6: How do you condense each frame into N_v tokens?
A: We condense the features of each frame using spatial grid pooling. Specifically, after passing through the visual encoder, each frame is represented by a set of patch tokens. These tokens are then further condensed into a smaller grid, with the resulting number of tokens per frame (e.g., 64 in our main experiments) depending on the experimental setting.
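For illustration, a minimal sketch of such spatial grid pooling on ViT-style patch tokens is given below; the 24×24 input patch grid and 8×8 output grid are example values, not fixed properties of our model.

```python
import torch
import torch.nn.functional as F

def spatial_grid_pool(patch_tokens: torch.Tensor, grid: int = 8) -> torch.Tensor:
    """Pool (B, H*W, C) patch tokens from the visual encoder down to a grid x grid layout."""
    b, n, c = patch_tokens.shape
    side = int(n ** 0.5)                               # assumes a square patch grid
    x = patch_tokens.transpose(1, 2).reshape(b, c, side, side)
    x = F.adaptive_avg_pool2d(x, grid)                 # average the patches inside each grid cell
    return x.flatten(2).transpose(1, 2)                # (B, grid*grid, C)

# e.g. 24x24 = 576 patch tokens per frame -> 8x8 = 64 tokens per frame
print(spatial_grid_pool(torch.randn(1, 576, 1024)).shape)  # torch.Size([1, 64, 1024])
```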
Thanks for the author response. Some of my concerns still remain.
- The duration of temporal grounding tasks typically ranges from 1 to 10 seconds. Is it the temporal length of the target grounding segment? What is the duration of the whole video? And what is the ratio of the target clip to the input video?
- The improvements on the MovieChat benchmark seem marginal. I wonder how you deal with the breakpoint mode. Did you use the grounding ability to select the related segments that help answer the breakpoint mode questions?
Thanks for the valuable comments. We address the remaining concerns as follows:
Q1: The duration of temporal grounding tasks typically ranges from 1 to 10 seconds. Is it the temporal length of the target grounding segment? What is the duration of the whole video? And what is the ratio of the target clip to the input video?
A: Yes, it is the temporal length of the target grounding segment. The duration of the whole video varies from 30 s to 15 min. According to our statistics, the average ratio of the target clip to the total video length is 4.28%.
Q2: The improvements on MovieChat benchmark seem marginal. I wonder how you deal with the breakpoint mode. Did you use the grounding ability to select the related segments that help answer the breakpoint mode questions?
A: Given that the official repository of MovieChat does not provide a general evaluation code for methods not designed for breakpoint mode (e.g., VideoLLaMA and LLaMA-VID), we adapt by converting the time mentioned in the breakpoint mode questions into discrete values (normalized between 000 and 999). These values are then incorporated into the questions, such as "What might happen next at 154?".
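For illustration, a minimal sketch of this timestamp discretization is shown below; the exact rounding convention and helper names are illustrative choices rather than our precise implementation.

```python
def discretize_time(t_sec: float, video_len_sec: float) -> str:
    """Map an absolute breakpoint time to a discrete value in [000, 999]."""
    idx = min(999, max(0, round(999 * t_sec / video_len_sec)))
    return f"{idx:03d}"

def breakpoint_question(question: str, t_sec: float, video_len_sec: float) -> str:
    return f"{question.rstrip('?')} at {discretize_time(t_sec, video_len_sec)}?"

# e.g. a breakpoint at 925 s of a 6000 s movie (hypothetical numbers):
print(breakpoint_question("What might happen next?", 925, 6000))  # "What might happen next at 154?"
```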
While directly feeding the keyframe information into the question is less favorable for our method (particularly because MovieChat uses ground-truth keyframe features as input), we still observe that our performance remains competitive, especially in breakpoint mode (lagging behind by only 0.2 in Acc).
We actually leverage target temporal grounding to assist in selecting relevant segments, and observe a significant performance improvement compared to when the target temporal grounding is disabled, as detailed in the table below.
| Method | LLM | Target temporal grounding | Global mode (Acc) | Global mode (Score) | Breakpoint mode (Acc) | Breakpoint mode (Score) |
|---|---|---|---|---|---|---|
| Ours | Vicuna-7B | × | 52.4 | 2.77 | 41.8 | 2.19 |
| Ours | Vicuna-7B | √ | 58.6 | 3.14 | 48.1 | 2.53 |
Thanks for the response. Given that the model is not trained on long video data, the performance is acceptable and the ablation is convincing. I maintain weak accept and agree with other reviewers that the authors are expected to supplement the experiments on long videos in the revised version to make this work more significant.
Dear Reviewer 6ATY
We appreciate the reviewer's time for reviewing and thanks again for the valuable comments. We will include the experiments on long videos and refine the paper as suggested in the revision.
Best wishes
Authors
This work proposes “SlowFocus” to improve the balance between tokens per frame and frames per video used in Video LLMs. SlowFocus identifies video segments relevant to a given query and samples the selected segments at high frequencies. The high-frequency tokens are mixed with low-frequency global video tokens. The authors propose suitable training strategies to learn the newly introduced layers. A new evaluation benchmark is also proposed by the authors.
Strengths
- Motivation for work is well established
- Clear explanation of proposed methodology
- Evaluation benchmark contribution
NOTE: Is the proposed dataset released publicly?
Weaknesses
- Motivation not justified: The idea is to improve video LLM performance at a fixed compute budget. However, there are no results evaluating the inference speed of the proposed SlowFocus against prior works. Hence, the benefit of this approach over uniform sampling at a higher resolution is unclear.
- Table 1 Unfair: It appears that prior works are zero-shot while SlowFocus is trained on this FineActionCGR dataset. This makes the comparison unfair.
- Table 2 Missing Details: Is SlowFocus re-trained on this data? How many frames do the prior works use? What is their inference compute requirement in comparison? Also, can a simple VLM + LLM baseline like LLoVi [1] be added for comparison? This will give a better idea of the usefulness of the proposed method.
- Missing Ablations: How does this compare against simply feeding the selected high-res frames and the global low-res frames (arranged in temporal order) into an existing Vid-LLM (e.g. LLoVi)?
- [Minor] Related work [2] uses spatial coordinates for the same video tasks.
[1] LLoVi: https://arxiv.org/pdf/2312.17235
[2] Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs, CVPR 2024
Questions
N/A
Limitations
Addressed
Q1: Motivation not justified: 1) Evaluate the inference speed. 2) The benefit of this approach over uniform sampling at a higher resolution is unclear.
A: The additional inference cost is actually minimal, as the low-frequency visual tokens only need to be encoded once. Moreover, the relevant segment grounding is not a time-consuming task, as it only involves processing short sequence lengths.
We report our inference speed compared to the baselines in the table below, demonstrating that our method results in only a slight increase in inference cost while significantly enhancing fine-grained video understanding.
| Method | Inference time (s) | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|
| VTimeLLM | 1.27 | 27.69 | 0.05 | 0.09 | 0.08 | 0.12 | 9.96 | 0.54 |
| LLaMA-VID | 1.25 | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| Ours | 1.51 | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
The phrase "at a higher resolution" may be confusing.
We hypothesize that it refers to more tokens per frame, and we compare our approach with uniform sampling that adopts more token numbers per frame, as shown in the table below.
| Method | Fps | Tokens per frame | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | 1 | 64 | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| LLaMA-VID | 1 | 256 | 0.29 | 0.23 | 0.16 | 0.23 | 0.77 | 18.91 | 1.13 |
| Ours | 1 | 64 | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
After increasing the number of tokens per frame to 256, the performance improves only minimally, indicating the limitation of simply increasing frame resolution.
Q2: Table 1 Unfair: SlowFocus is trained on this FineActionCGR dataset.
A: FineActionCGR is employed in stage 3's fine-tuning to align the LLM with our mixed-frequency sampling (MFS) method. We carefully split the dataset to ensure no overlap in scenes or activities between the training and testing sets. Detailed ablation studies, as shown in Table 3 of the main paper (lines 280-284), demonstrate that the observed performance gap is primarily due to the proposed MFS approach.
Additionally, to address the reviewer's concern, we also fine-tune the prior work (LLaMA-VID) using stage 3. The results presented in the table below prove that the improvement achieved by introducing the stage 3 dataset is limited, with the primary gains being attributed to the proposed algorithm.
| Method | Stage 3 | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | × | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| LLaMA-VID | √ | 22.38 | 0.23 | 0.2 | 0.37 | 1.03 | 24.81 | 1.26 |
| Ours | √ | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
Q3: Table 2 Missing Details: 1) Is SlowFocus re-trained on this data? 2) How many frames do the prior works use? What is their inference compute requirement in comparison? 3) Also, can a simple VLM + LLM baseline like LLoVi be added for comparison?
A:
- No, MSVD-QA, MSRVTT-QA, and video-based generative performance benchmarks are zero-shot benchmarks. Our method is not re-trained on these datasets.
- To address the reviewer's concern, we have updated Table 2 of the main paper to include more implementation details, such as the number of frames and the computational requirements for inference, as detailed in the table below.
- For a fair comparison, we also implement LLoVi using LLaVA-1.5 as the video captioner and Vicuna-7B as the LLM, also shown in the table below.

| Method | Sampling strategy | Tokens per frame | ActivityNet-QA (Acc) |
|---|---|---|---|
| FrozenBiLM | N_L=10 | 256 | 24.7 |
| VideoLLaMA | N_L=8 | 256 | 12.4 |
| VideoChat | Fps=1 | 256 | 26.5 |
| Video-ChatGPT | N_L=100 | 576 | 35.2 |
| LLaMA-VID | Fps=1 | 64 | 47.4 |
| LLoVi | Fps=1 | 256 | 45.2 |
| Ours | Fps=1 | 64 | 48.4 |
Q4: Missing Ablations: Compare against simply feeding the selected high-res frames and the global low-res frames into an existing Vid-LLM (e.g. LLoVi).
A: First, we clarify that LLoVi is not a multi-modal Video-LLM, but rather a language-based combination of a visual captioner and an LLM.
Second, our paper does not introduce the concept of high-res and low-res frames. We hypothesize that these refer to high- and low-frequency sampling. It would be unfair to directly compare our method to one that feeds selected high-frequency frames into LLoVi, as this selection process inherently reveals the ground truth.
To ensure a fair comparison under this setting, we feed LLoVi (same implementation as Q3) with the video frames sampled at the same frequency, as detailed in the table below. The results demonstrate that although the video captioner provides a substantial amount of language descriptions to the LLM (this process is quite time-consuming and less practical), the video captioner may still miss important temporal details, leading to potential errors in subsequent LLM inference.
| Method | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|
| LLoVi | 10.83 | 0.4 | 0.27 | 0.43 | 0.94 | 32.27 | 1.61 |
| Ours | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
Several issues were raised by reviewers, and post-rebuttal most of these remain insufficiently addressed by the authors.
- Motivation not justified: The first contribution, to quote the paper, is "to resolve the prevalent trade-off in existing Vid-LLMs...". This trade-off is only valid at a fixed inference compute budget - the original paper had no discussion or evaluations regarding inference-time compute. The authors provide two sets of new tables including inference time and tokens per frame in the rebuttal, however omitting important details on what dataset these results are on, the compute (GPUs) used to benchmark inference times, how the baselines were implemented, and why the baselines are not at all competitive (baseline numbers significantly lower). More importantly, the concern of "benefit of this approach over uniform sampling at a higher resolution" remains unaddressed. Uniform sampling at a higher resolution (i.e. larger FPS) should be able to achieve the same effect as the authors' method and possibly perform at similar compute for the kind of short videos used in evaluation tasks. This is not discussed at all, and the motivation for using the authors' proposal over this default (the current norm in papers cited in the original review) is unclear.
- Unfair results comparisons: The new results in rebuttal (LLaMA-VID mIoU increases from 0.35 -> 22.38 with fine-tuning) further support how baselines used as comparison in the paper were unfair. The authors introduced a new task and dataset (that baselines were not built for) and compared a fine-tuned version of their method with the zero-shot baselines. While the authors newly introduce results where baselines are fine-tuned in the rebuttal, details of baseline fine-tuning are unclear (their results may be suboptimal due to unsuitable fine-tuning of those methods). Also, only a single baseline is provided under these slightly fairer fine-tuning settings for comparison. The unfair results comparison issue still remains unaddressed.
- Video QnA results: The results on ActivityNet provide a better comparison. However, the baselines used are not the best currently. See https://paperswithcode.com/sota/zeroshot-video-question-answer-on-activitynet or https://arxiv.org/pdf/2403.04640 (ECCV '24). Also, agreeing with the sentiments of reviewer bGgU on the need for some long-video QnA benchmarks.
- Missing ablations: The authors avoid providing the requested ablations. In the original review, frequency = temporal frequency (i.e. frame rate). The original request was not for "one that feeds selected high-frequency frames" into a baseline, but for feeding all frames at that high frequency (and concern 1 is how the authors' method is better than such a baseline).
NOTE: Is the proposed dataset publicly released? This is unclear.
The idea and method of the authors is interesting and could be valuable. However, in the current form, the paper lacks sufficient evaluation to verify its effectiveness and usefulness. Hence, voting to reject this paper.
Thanks for the valuable comments. Our code and dataset will be publicly released. We address other concerns as follows.
Q1: Motivation not justified. 1) omitting important details on what dataset these results are on, compute used (GPUs) to benchmark inference times, how the baselines were implemented, and why the baselines are not at all competitive (baseline numbers significantly lower). 2) More important, the concern of "benefit of this approach over uniform sampling at a higher resolution" remains unaddressed.
A: 1) Due to character limitations in the rebuttal, we were unable to include all the details in the table. We now provide the requested details as follows:
- what dataset these results are on: FineAction-CGR benchmark.
- compute used (GPUs): single V100.
- how the baselines were implemented: we just follow the official implementation provided by each method.
- why the baselines are not at all competitive: In fact, we have provided comprehensive explanations in the main paper (lines 261-264) that existing VLMs struggle with accurately predicting temporal segments and capturing fine-grained temporal details due to their lack of sensitivity to precise time boundaries.
2) The term "at a higher resolution" in the reviewer's original request is ambiguous.
Typically, resolution refers to the size of the image/video or the number of spatial tokens per frame.
Therefore, in our last response we responded based on that interpretation.
We now provide comparisons with uniform sampling at a higher frequency as requested, as detailed in table below.
| Method | LoRA | Fps | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | × | 1 | 0.35 | 0.16 | 0.12 | 0.11 | 0.23 | 15.65 | 0.87 |
| LLaMA-VID | × | 2 | 0.32 | 0.18 | 0.17 | 0.3 | 0.83 | 20.19 | 1.1 |
| LLaMA-VID | √ | 2 | 29.27 | 0.42 | 0.28 | 0.48 | 1.26 | 30.13 | 1.55 |
| Ours | √ | 1 | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
For LLaMA-VID, increasing the fps to 2 (which also doubles the computational cost) resulted in a performance improvement. However, even with this adjustment, its performance was still significantly lower than ours. This indicates that simply increasing the sampling fps yields only limited benefits while consuming significant computational resources.
Further increasing the fps is impractical, as it surpasses the maximum token length supported by existing open-source VLMs. This limitation underscores the advantage of our proposed method, which enhances effective fps without incurring additional computational cost.
Q2: Unfair results comparisons. 1) The new results in rebuttal further support how baselines used as comparison in the paper were unfair. 2) While the authors newly introduce results where baselines are fine-tuned in the rebuttal, details of baseline fine-tuning is unclear. 3) Also, only a single baseline is provided under these slightly fairer fine-tuning settings for comparison.
A: 1) We have discussed the reasons for the subpar performance of the baselines in the main paper (lines 261-264): existing VLMs struggle with accurately predicting temporal segments and capturing fine-grained temporal details due to their lack of sensitivity to precise time boundaries. The data used in stage 3 includes tasks focused on temporal grounding and is specifically fine-grained. This is why the baseline's performance improves after fine-tuning with stage 3's data. We believe this further validates the importance of our benchmark and the rationale behind our stage 3 fine-tuning. Moreover, the remaining performance gap (44.3 in mIoU and 28.29 in Acc) further demonstrates the effectiveness of our proposed method.
2) To ensure fairness, the implementation details of baseline fine-tuning (such as learning rate and LoRA rank) are just kept consistent with our method.
3) Our method is easily plug-and-play and transferable to other baseline models. We have additionally implemented LLaVA-Next as the baseline. The fine-tuning details are kept consistent for fair comparison. The results on FineAction-CGR are shown in the table below, demonstrating that the LLaVA-Next baseline performs better than LLaMA-VID, but still lags behind our method. We will also consider adding more baseline comparisons in the revision.
| Method | LoRA | Stage 3 | mIoU | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-VID | √ | √ | 22.38 | 0.23 | 0.2 | 0.37 | 1.03 | 24.81 | 1.26 |
| LLaVA-NeXT | √ | √ | 25.96 | 0.25 | 0.25 | 0.41 | 1.2 | 26.93 | 1.39 |
| Ours (LLaVA-Next) | √ | √ | 67.73 | 0.68 | 0.41 | 0.68 | 3.31 | 53.9 | 2.8 |
Q3: Video QnA results. 1) The baselines used are not the best currently. 2) Also, agreeing with the sentiments of reviewer bGgU on need for some long-video QnA benchmarks.
A: 1) Thanks, in fact our method is easily plug-and-play and transferable to other baseline models. We have additionally implemented LLaVA-Next as the baseline. The results on ActivityNet-QA are in the table below. By incorporating LLaVA-Next as the baseline, our method achieves advanced results on ActivityNet-QA. We will also consider adding more baseline comparisons in the revision.
| Method | LoRA | Acc | Score |
|---|---|---|---|
| LLaVA-NeXT | √ | 50.2 | 3.3 |
| Ours | √ | 50.4 | 3.3 |
2) Actually, we have provided the evaluation results of our method on MovieChat-1K (in the response to reviewer bGgU), which is a widely used long video benchmark. We now provide the results on EgoSchema, as detailed in the table below:
| Method | Acc |
|---|---|
| FrozenBiLM | 26.9 |
| VIOLET | 19.9 |
| InternVideo | 32.1 |
| LLoVi-7B | 33.5 |
| Vamos | 36.7 |
| LangRepo-12B | 41.2 |
| Ours | 39.7 |
The results further demonstrate that, despite not being specifically trained on long video datasets, our method still achieves competitive performance.
Q4: Missing ablations. The authors avoid providing the requested ablations.
A: First, we would like to clarify that we made every effort to provide the requested ablation experiments. However, the original request was somewhat unclear.
Specifically, the terms "high-res" and "low-res" frames were confusing, as our paper does not introduce these concepts (could the reviewer be more specific?).
Consequently, in our last response we conducted the ablations according to the request to "simply feed the selected high-res frames and the global low-res frames", which differs significantly from the reviewer's new comment of feeding all frames at that high frequency.
Second, feeding all frames at such a high frequency is impractical. The valid fps for high-frequency frames averages around 10. Maintaining an fps of 10 with 64 tokens per frame is computationally infeasible; for example, a 1-minute video under this setting would result in 38,400 visual tokens, heavily challenging current open-source VLMs. This underscores a key advantage of our proposed method: it enhances the valid fps while avoiding increased computational cost.
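A quick back-of-the-envelope check of these numbers (the ~10 s grounded span used for the mixed-frequency estimate is an illustrative assumption, not a fixed property of the method):

```python
tokens_per_frame = 64
video_sec = 60

# Naive dense sampling: feed every frame at 10 fps to the LLM.
dense_tokens = video_sec * 10 * tokens_per_frame                # 60 * 10 * 64 = 38,400

# Mixed-frequency sampling: 1 fps global view plus 10 fps only inside a ~10 s grounded span.
span_sec = 10
mixed_tokens = video_sec * 1 * tokens_per_frame + span_sec * 10 * tokens_per_frame  # 3,840 + 6,400

print(dense_tokens, mixed_tokens)  # 38400 10240
```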
Hope our response helps the reviewer's final recommendation. Thank you!
I appreciate the authors' detailed response and new experimental results. Most concerns raised are clarified, hence I am updating my review to accept. However, the current writing of the paper needs major modification based on the rebuttal. Please ensure all new information / updates provided in the rebuttal are reflected in the final version.
- Highlight compute / inference costs to support motivation: Please include the results provided in the rebuttal to support the authors' claim that the proposed method "enhances valid fps while avoiding increased computational cost", i.e. discuss clearly how a naive baseline would have increased costs, provide these numbers, and compare to the authors' method which achieves similar / better performance at a fraction of the costs. This is an important ablation to justify the motivation.
- Explain more clearly why baselines are weak on the CGR benchmark and explain the Stage 3 fine-tuning done on baselines to avoid unfair comparisons (while highlighting those results). Results on the newer LLaVA-NeXT baseline also further strengthen comparisons - please update tables with these results as well. Also use the new ActivityNet results from the rebuttal.
- The new results on long video benchmarks (particularly EgoSchema) are highly insightful. Include these results in the main paper, and if possible provide visualizations (possibly in the appendix) for a few EgoSchema QnA examples of the frames selected by Stage 1 of the proposed model. This would further validate the generality of the model's stage 1 setup on the somewhat out-of-domain long videos in EgoSchema.
Apologies for the unclear wording on resolution in the original review, and thank you again for the extensive efforts by authors in providing a detailed rebuttal.
Dear Reviewer uGD4
We appreciate the reviewer's time for reviewing and thanks again for the valuable comments and the improved score! We will revise and refine the paper as suggested in the revision.
Best wishes
Authors
This paper designs a SlowFocus mechanism to allow Vid-LLM's input signals to combine both high frame rate and low frame rate inputs simultaneously. This addresses the issue of maintaining the effectiveness of input information within a limited context window in LLMs. Low frame rate inputs contain global information, while high frame rate inputs contain local details. Additionally, this paper proposes a Multiple-Frequency Mixing Attention mechanism to better integrate these inputs and a Temporal Relationship Modeling mechanism to preserve temporal relationships in the temporal dimension.
Strengths
- The SlowFocus mechanism, which combines high frame rate and low frame rate inputs, is highly reasonable and efficient.
- The Multiple-Frequency Mixing Attention mechanism and Temporal Relationship Modeling are appropriate technical designs.
- The proposed three-stage training and corresponding dataset construction are rational.
- The use of temporal grounding data for the second stage of training, followed by video-level instruction fine-tuning data for the third stage, is highly instructive.
- The FineAction-CGR benchmark fills the gap in the community for fine-grained video understanding evaluation.
- The experiments in this paper are comprehensive, effectively demonstrating the validity of each design point.
- The writing is clear and easy to understand.
Weaknesses
- Line 126: Using the same letter "L" to represent both "Low" and "LLM" might appear confusing to readers.
- Table 5: There is some confusion regarding whether this pertains to the low frame rate parts or the high frame rate parts.
- Line 449:
- What is the data ratio processed by GPT-4V and the Video Recaptioner Model?
- Do their results show significant distribution differences?
- Are you using GPT-4V to generate frame-by-frame captions or video-level captions?
- The analysis on the effectiveness and rationality of the FineAction-CGR benchmark evaluation is insufficient.
Questions
- Why is the visual encoder turned off in all stages? Some recent works also start to open the gradients of the CLIP encoder; would the results improve in your model if it were turned on?
- Is the use of LoRA for pre-training due to resource constraints? Would full-rank training result in better performance?
Limitations
It is recommended to include comparisons with closed-source models such as Gemini Pro, GPT-4V, etc., to provide reference scores for FineAction-CGR benchmark in Table 1.
Q1: Line 126: Using the same letter "L" to represent both "Low" and "LLM" might appear confusing to readers
A: Thanks, we will make a clearer representation in the revision.
Q2: Table 5: There is some confusion regarding whether this pertains to the low frame rate parts or the high frame rate parts.
A: Thanks, the fps term mentioned in Table 5 of the main paper refers to the low-frequency part.
The high-frequency part maintains a fixed number of sampled frames.
We will provide a clearer representation in the revised version.
Q3: Line 449: 1) What is the data ratio processed by GPT-4V and the Video Recaptioner Model? 2) Do their results show significant distribution differences? 3) Are you using GPT-4V to generate frame-by-frame captions or video-level captions?
A:
- GPT-4V and the Video Recaptioner Model are both utilized during the process of generating captions for each video. GPT-4V is utilized to generate captions for the entire video by sampling 10 frames uniformly from each video. The Video Recaptioner Model is utilized to generate captions for the segmented video clips by uniformly sampling 8 frames from each clip.
- The caption of the entire video generated by GPT-4V tends to summarize coarse-grained and general content, while the Video Recaptioner Model generates captions for segmented video clips that contain more details and action descriptions.
- The videos in FineAction may contain multiple objects, so it is better not to generate frame-by-frame captions. Taking the InternVID dataset as an example, frame-by-frame captions are not applicable when the objects in preceding frames are inconsistent with those in following frames. Therefore, GPT-4V is utilized to generate a caption by sampling 10 frames uniformly from a video so that the different objects are recognized. This method not only saves cost but also provides global information for the QA generation of downstream tasks.
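For reference, a minimal sketch of the uniform frame sampling used by both captioners (10 frames for GPT-4V over the whole video, 8 frames per clip for the recaptioner); the index arithmetic is an illustrative choice, not our exact implementation:

```python
def uniform_indices(num_frames: int, num_samples: int) -> list:
    """Pick `num_samples` frame indices spread evenly over a video or clip."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_indices(300, 10))  # 10 evenly spaced indices for a whole 300-frame video
print(uniform_indices(96, 8))    # 8 evenly spaced indices for a 96-frame clip
```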
Q4: The analysis on the effectiveness and rationality of the FineAction-CGR benchmark evaluation is insufficient.
A: The evaluation protocol and metrics for FineAction-CGR align with widely-accepted benchmarks, such as ActivityNet-QA, ensuring the effectiveness of the benchmark evaluation. For other methods, we adhere to the official implementation guidelines. We will add more discussions and case studies on the benchmark evaluation in the revision.
Q5: Why is the visual encoder turned off in all stages? Some recent works also start to open the gradients of the CLIP encoder; would the results improve in your model if it were turned on?
A: Turning off the visual encoder and fine-tuning the multi-modal projector is a common practice among the baselines we compare, such as VideoLLaMA, Video-ChatGPT, and LLaMA-VID. To ensure a fair comparison, we adhere to this approach.
However, we acknowledge that turning on the visual encoder could be beneficial, and it would be a promising direction to explore methods for fine-tuning the visual encoder to enhance the unified representation of images and videos. Nonetheless, this aspect is not the primary focus of our work.
Q6: Is the use of LoRA for pre-training due to resource constraints? Would full-rank training result in better performance?
A: Yes, we use LoRA to fine-tune the LLM primarily due to resource constraints. We have conducted a comparison with full-rank training, as shown in Table 2 of the main paper, which demonstrates that the performance degradation when using the LoRA approach is minimal.
Q7: It is recommended to include comparisons with closed-source models such as Gemini Pro, GPT-4V, etc., to provide reference scores for FineAction-CGR benchmark in Table 1.
A: Thanks for the suggestion. We employ GPT-4V to evaluate on the benchmark, and the evaluation of Gemini Pro will be added in the revision. The results are presented in the table below. Due to token limitations, each video is sampled at 10 frames, and GPT-4V is tasked with answering questions based on these sampled frames. We observe that GPT-4V performs well on captioning tasks but exhibits suboptimal performance on tasks such as temporal grounding and reasoning, which require fine-grained temporal cues.
| Method | Temporal grounding (mIoU) | B | M | R | C | Acc | Score |
|---|---|---|---|---|---|---|---|
| GPT-4V | 9.28 | 0.59 | 0.53 | 0.65 | 2.74 | 19.39 | 1.1 |
| Ours | 66.68 | 0.66 | 0.41 | 0.7 | 3.27 | 53.1 | 2.78 |
The authors have addressed most of my concerns. The ideas presented in this paper are quite ingenious. I do not believe that the combination of high and low frame rate inputs poses a computational efficiency issue. The authors' strategy effectively addresses the challenge of long video understanding. I stand by my rating.
Dear Reviewer et83
We appreciate the reviewer's time for reviewing and thanks again for the valuable comments.
Best wishes
Authors
We sincerely thank all the reviewers for the valuable feedback, with consistent recognition for the motivation and innovation. We are pleased that the reviewers recognized the strengths of our paper:
- This paper is well motivated (et83, uGD4, 6ATY, bGgU).
- The designed architecture makes sense and is suitable for fine-grained video understanding (et83, 6ATY).
- Evaluation benchmark contribution (et83, uGD4, 6ATY, bGgU).
- The writing is well-organized and easy to follow (et83, uGD4, bGgU).
The following is our response to the major concerns raised by the reviewers, supported by further empirical investigations detailed in the respective individual responses:
- Inference time details: Both reviewers uGD4 and bGgU express concerns about the inference time of our method. To address these concerns, we evaluate the inference latency of our approach and observe only a modest increase in inference cost (20%), while achieving significant improvements in fine-grained video understanding. The primary reason for this efficiency is that low-frequency visual tokens only need to be encoded once, and the relevant segment grounding is not a time-consuming task, as it only involves processing short sequence lengths.
- Additional comparisons with more advanced counterparts: In response to the suggestions from reviewers uGD4 and bGgU, we conduct further experiments comparing our method with the suggested methods (LLoVi, MovieChat, and VideoChat2). Our results indicate that our method continues to achieve superior performance in fine-grained video understanding.
- Additional experiments on long video benchmarks: Both reviewers 6ATY and bGgU highlight the necessity of evaluating on long video benchmarks. In response, we evaluate our method on the MovieChat-1K dataset. The results demonstrate that, even though our method is not specifically trained on long video data, it still achieves competitive results.
We then address the specific concerns raised by the reviewers in our individual rebuttal responses.
Overall, we look forward to engaging in further fruitful discussions in the coming weeks to enhance our work.
For video-language modeling, this work aims to address the challenge of retaining both high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video) for video representation. SlowFocus is designed to identify the query-related temporal segment based on the posed question and then perform dense sampling on this segment to extract local high-frequency features. A fine-grained video benchmark is also provided.
Reviewers appreciated the motivation, the reasonably designed architecture, the contributed evaluation benchmark, and the writing, while raising concerns about the lack of evaluation on long video benchmarks, inference time comparisons, weak baselines, etc. The rebuttal addressed the concerns from Reviewers et83, uGD4, and 6ATY, with Reviewer uGD4 raising the score from 3 to 6. Reviewer bGgU expected result comparisons on the long video benchmarks EgoSchema and Video-MME to see how SlowFocus performs. The authors provided results on EgoSchema, which show competitive performance relative to the compared methods.
Considering the positive feedback from most reviewers, the AC recommends acceptance but agrees with the reviewers and strongly suggests that the authors include the comparisons on long video benchmarks and incorporate the reviewers' suggestions for comprehensiveness and clarity.