Understanding Long Videos with Multimodal Language Models
Investigates the effect of LLM strength on long-video QnA tasks. Introduces the Multimodal Video Understanding (MVU) framework, which incorporates object-centric data from pre-trained models and sets a new state of the art on long-video tasks.
Abstract
Reviews and Discussion
This paper proposes to inject video-specific information into an LLM-based framework for video understanding. Specifically, the authors utilize off-the-shelf vision tools to extract three object-centric information modalities from videos and then leverage natural language as a medium to fuse the information. I think the proposed method seems novel and interesting, but the paper lacks comparisons with recent papers and the performance is not good enough.
Strengths
The proposed method seems novel, and I think it makes sense for video understanding. However, I believe key-frame selection plays the most important role in video understanding, and it seems the authors do not propose a new method for this.
Weaknesses
1. Figure 1 caption: “(left-right)” should be “(left-bottom)”.
2. The authors appear to mix \cite{} and \citep{} in their writing.
3. Missing comparison with recent methods, for example papers published in 2024. I think that is necessary to strengthen the paper.
4. The performance seems not so good: only a marginal improvement over the counterparts, even without comparing against recent methods. I think the performance is not good.
5. How are the most relevant frames selected? Is there any novelty in doing that? I think this is the most important part of video understanding.
Questions
See the fifth point of the weaknesses.
We thank the reviewer for the encouraging positive comments and all the suggestions to improve our work further. We address the points raised by the reviewer below.
W1,W2: “Typos”
Thank you for pointing these out. We will correct the Figure 1 caption and fix our \cite{} vs \citep{} usage.
W3: “Compare to 2024 papers”
Several recent works such as MoreVQA [1], VideoAgent [2], VideoTree [3], LifelongMemory [4], LangRepo [5], Tarsier [6], InternVideo2 [7], LVNet [8] achieve strong performance on long video understanding tasks by utilizing strong pre-trained LLM and MLM models. We will update Table 2 and Table 3 to include comparison to these works. We will also discuss these in our related work section. We would be highly grateful if the reviewer has pointers for any additional papers we may have missed.
W4: “Performance not good”
Our MVU is a framework that can be integrated with other works. We combine MVU with state-of-the-art LVNet (following their open-source implementation) to achieve additional improvements.
| Method | EgoSchema (full-set) |
|---|---|
| VideoTree | 61.1 |
| LifelongMemory | 62.1 |
| LangRepo | 41.2 |
| Tarsier | 61.7 |
| InternVideo2 | 60.2 |
| MoreVQA | 51.7 |
| VideoAgent | 54.1 |
| LVNet | 61.1 |
| LVNet + MVU (ours) | 61.3 |
This LVNet + MVU variant achieves strong performance, competitive with more recent works. It also highlights how the MVU framework can be easily integrated with existing works to achieve additional performance gains.
W5: “How to select the frames”
Frame selection is very important for long videos but somewhat orthogonal to our work. This means that we can integrate our MVU framework with existing frame-selection methods easily. LVNet is one such frame-selection approach. Our LVNet + MVU variant combines their orthogonal strengths to further improve performance.
| Method | EgoSchema |
|---|---|
| LVNet | 61.1 |
| LVNet + MVU (ours) | 61.3 |
Secondly, MVU extracts object information from multiple frames, which can be viewed as a complement to frame selection: our object information extraction indirectly acts as an information bottleneck similar to frame selection. For frames without objects of interest, no information is extracted. For multiple frames containing the same object (identified by our object tracker), the repetitive information is removed, resembling the idea of selecting useful information from many frames (a conceptual sketch of this de-duplication is included after the ablation). Within MVU, this process acts as a secondary form of information selection across many frames. We provide an ablation to highlight this below:
| Method | Frames | EgoSchema-S | Time (s) |
|---|---|---|---|
| MVU | 16 | 60.3 | 2.42 |
| MVU | 32 | 60.4 | 2.48 |
| MVU | 64 | 60.4 | 2.60 |
| MVU | 128 | 61.2 | 2.81 |
Note how our lightweight object detector and tracker allows us to increase the number of frames while maintaining good inference speed.
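To make this de-duplication concrete, we include a conceptual Python sketch below. The detector/tracker interfaces and the output format are hypothetical placeholders for illustration, not our released implementation: detections are grouped by track identity, so frames without objects contribute nothing and repeated detections of the same object collapse into one compact natural-language entry.

```python
# Conceptual sketch of the object-level information bottleneck described above.
# `detect_objects` and `track_id` are hypothetical placeholders for any
# lightweight detector/tracker; this is not the paper's released code.
from typing import Callable, Dict, List

def extract_object_info(frames: List,
                        detect_objects: Callable,
                        track_id: Callable) -> List[str]:
    """Collect one compact natural-language entry per tracked object instance."""
    tracks: Dict[int, dict] = {}
    for t, frame in enumerate(frames):
        for obj in detect_objects(frame):      # frames with no objects contribute nothing
            tid = track_id(obj, t)             # same object across frames -> same track id
            entry = tracks.setdefault(tid, {"label": obj["label"], "centers": []})
            entry["centers"].append(obj["center"])  # keep motion info, drop repeated detections
    return [f"{v['label']}: appears in {len(v['centers'])} frames, "
            f"moving from {v['centers'][0]} to {v['centers'][-1]}"
            for v in tracks.values()]

# Dummy usage: the same cup detected in two frames collapses into one entry.
frames = [0, 1]
dets = {0: [{"label": "cup", "center": (10, 20)}],
        1: [{"label": "cup", "center": (12, 22)}]}
print(extract_object_info(frames,
                          detect_objects=lambda f: dets[f],
                          track_id=lambda obj, t: hash(obj["label"])))
```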
Finally, we thank the reviewer again for these valuable comments that help improve our paper further. We are working on updating our manuscript with these changes.
References
[1] Min, Juhong, et al. "MoReVQA: Exploring Modular Reasoning Models for Video Question Answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Wang, Xiaohan, et al. "VideoAgent: Long-form video understanding with large language model as agent." European Conference on Computer Vision. Springer, Cham, 2024.
[3] Wang, Ziyang, et al. "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos."
[4] Wang, Ying, et al. "LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos."
[5] Kahatapitiya, Kumara, et al. "Language repository for long video understanding."
[6] Wang, Jiawei, et al. "Tarsier: Recipes for training and evaluating large video description models."
[7] Wang, Yi, et al. "InternVideo2: Scaling video foundation models for multimodal video understanding."
[8] Park, Jongwoo, et al. "Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA."
We thank the reviewer for their response, but further emphasize the following 3 points about our work:
MVU improves on 20+ datasets
- MVU consistently achieves improved performance across 5 video QnA datasets (EgoSchema, NextQA, IntentQA, LongVideoBench, ActivityNet) as well as 16 robotics domain datasets.
- While the performance gains are minimal in some cases, the improvements are consistent across 20 different datasets, validating the effectiveness and generality of our approach.
- LVNet is a state-of-the-art method for EgoSchema over which we obtain an additional improvement. Multiple very recent prior works, such as [1] and [2] from CVPR'24 and ECCV'24, underperform this method by up to 10 percentage points. We believe it is unlikely that random fluctuations would yield an improvement over such a method. In contrast, MVU improves over LVNet, even if by a smaller margin, and shows such improvements consistently across multiple datasets.
Improvements over LVNet are consistent
We further experiment with LVNet on the NextQA and IntentQA datasets, in addition to EgoSchema, where MVU achieves consistent improvements.
| Method | EgoSchema | NextQA | IntentQA |
|---|---|---|---|
| LVNet | 61.1 | 72.9 | 71.1 |
| LVNet + MVU (ours) | 61.3 | 73.3 | 71.5 |
This further validates how MVU is contributing to these improved results.
Core Issues of LLMs for video understanding
Our experiments and prior work suggest that core issues of LLMs are slow inference speed and handling long contexts.
- MVU achieves faster inference for MCQ tasks with likelihood selection
- MVU tackles long contexts with efficient information extraction
The inference speed-up is highlighted in Tables 1, 6, and 7. The better handling of context length is highlighted in Table A.10 and Appendix K.
We apologize if these points were not highlighted clearly earlier and would highly appreciate if the reviewer is able to take these into consideration. We are also updating the manuscript to better highlight these points.
Dear reviewer NaUs,
We are highly thankful for the time and effort devoted by the reviewer to our paper and the following discussions. We are wondering if our last response addressed the concerns raised.
Thank you!
1. Compared to LVNet, the proposed method only increases the baseline by 0.2, which is not convincing to verify the effectiveness. I even suspect that the gain might simply be attributed to random fluctuations.
2. I believe this paper fails to address the core issues of using LLMs for video understanding, which makes its value fall below the acceptance threshold for ICLR. I choose to maintain my original score.
Dear reviewer NaUs,
We are wondering if you were able to review our response to the concerns raised.
Given the main concern being about results, we again highlight the following:
1) MVU improves over SOTA baseline LVNet across 3 datasets
Our LVNet+MVU shows improvements consistent across three datasets now. We also compare against two prior works to highlight how improving over this LVNet baseline is not easy.
| Method | EgoSchema | NextQA | IntentQA |
|---|---|---|---|
| MoreVQA (CVPR'24) | 51.7 | 69.2 | - |
| VideoAgent (ECCV'24) | 54.1 | 71.3 | - |
| LVNet (NeurIPS'24) | 61.1 | 72.9 | 71.1 |
| LVNet + MVU (ours) | 61.3 | 73.3 | 71.5 |
These results show the clear benefit of proposed MVU in multi-frame VLM settings with LVNet and how it outperforms SOTA prior works.
2) More multi-frame baselines on LongVideoBench
We integrate MVU with three different multi-frame baselines and evaluate on LongVideoBench dataset. MVU leads to performance improvements in each case.
| Method | Phi-3-Vision-Instruct | LLaVA-Next | VideoLLava |
|---|---|---|---|
| Baseline | 49.7 | 47.0 | 37.6 |
| Baseline + MVU | 50.4 | 47.8 | 39.2 |
We look forward to the reviewer's response.
Thank you!
This paper proposes a training-free approach to understanding long-form videos by extracting explicit image-/object-level information. The extracted information is translated into natural language descriptions to make the LLM ‘see’ the visual input. Experimental results on several video QA benchmarks demonstrate its superiority.
Strengths
- The proposed likelihood selection approach offers a good way to speed up the inference in autoregressive LLMs.
- Going beyond existing video datasets, this paper further evaluates the generalization ability on Open X-Embodiment.
- The paper is well written and easy to follow.
Weaknesses
- It seems that using likelihood as a selection criterion still focuses on the exact match between the generated text and the answer candidates in a per-logit manner, without considering the semantic meaning. For example, ‘C is washing plates’ vs ‘C is cleaning dishes’.
- Unlike prior approaches where all answer candidates are fed together to the language model, the proposed likelihood selection method organizes the Q-A pairs in a batch dimension. In this way, it seems that LLM fails to analyze the relationship between answer candidates, increasing the difficulty of QA.
- If the frames are uniformly sampled across the entire long video, how can you ensure the consistent occurrence of objects? In certain cases, the objects appearing in each frame are completely different. Another related question is whether using X_{OSL} and X_{OMT} extracted from densely sampled frames (i.e., with better object/trajectory consistency) would lead to performance gains.
- The authors are encouraged to evaluate the model on more long-video QA benchmarks, especially those designed to mitigate the language bias of existing long-video QA benchmarks (e.g., EgoSchema).
- In Tables 2 and 3, a series of recent state-of-the-art approaches are not compared. [1][2]
[1] Juhong Min et al. MoReVQA: Exploring Modular Reasoning Models for Video Question Answering, CVPR 2024.
[2] Xiaohan Wang et al. VideoAgent: Long-form Video Understanding with Large Language Model as Agent, ECCV 2024.
Questions
Please refer to the weaknesses.
We thank the reviewer for the positive comments as well as all the valuable feedback. We address all concerns raised below.
W1: “Likelihood selection focuses on exact match”
Likelihood selection scores each candidate using a likelihood measure, i.e., the likelihood (probability) of the model generating the given sentence, rather than an exact match against generated text. We apologize for any lack of clarity in our original paper regarding this and will highlight the following two points better in our paper.
- In likelihood selection, we use the likelihood measure that is also used as the training loss when training LLMs. Given how LLMs trained with this loss (i.e., all decoder-based LLMs such as LLaMA, Gemini, GPT) are highly effective at handling semantic meaning, it follows that this measure can capture semantic meaning.
- This likelihood measure is calculated within the LLM latent space and is equivalent to the probability (or likelihood) of that answer being generated by the LLM conditioned on the input question. We derive this in detail in Appendix F. Relating to the reviewer's example, the likelihood is an estimate of how likely the model would be to predict ‘C is washing plates’, as opposed to requiring that exact match. Predictions close in meaning to the target, such as ‘C is cleaning dishes’, would therefore also receive high likelihood values.
In fact, we validate this second point through a toy example. We provide an LLM with the question "X is cleaning dishes in the kitchen. What is X doing? a) washing plates, b) cleaning laundry, c) painting dishes. The correct choice is:" and calculate the likelihood for each of the 3 responses. The calculated likelihoods are 0.996, 0.006, 0.007 for a, b, c respectively (highest is selected), despite response (a) having no common words with the original statement unlike (b) and (c). This illustrates the ability of likelihood selection to capture semantic meanings.
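For concreteness, below is a minimal Python sketch (using HuggingFace Transformers with an arbitrary placeholder model) of how such candidate likelihoods can be computed with any off-the-shelf causal LM. This is an illustrative sketch rather than our exact implementation; our prompt template, model choice, and normalization details differ, and the printed scores will depend on the underlying model.

```python
# Illustrative sketch only: score each candidate answer by the log-likelihood a
# causal LM assigns to its tokens conditioned on the question prompt, then pick
# the highest-scoring candidate. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = ("X is cleaning dishes in the kitchen. What is X doing? "
          "a) washing plates, b) cleaning laundry, c) painting dishes. "
          "The correct choice is:")
candidates = [" washing plates", " cleaning laundry", " painting dishes"]

def candidate_log_likelihood(prompt: str, answer: str) -> float:
    """Sum of log-probabilities of the answer tokens, given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits               # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    # Position i of log_probs predicts token i+1; assumes the prompt tokens
    # form a prefix of the prompt+answer tokenization.
    answer_tokens = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[0, p, t].item() for p, t in zip(positions, answer_tokens))

scores = {c: candidate_log_likelihood(prompt, c) for c in candidates}
print(scores, "->", max(scores, key=scores.get))      # highest likelihood is selected
```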
We apologize for including less detail and lack of clarity regarding this in the main text and will update this more clearly in our final manuscript.
W2: “LLM fails to analyze the relationship between answer candidates”
Thank you for this interesting observation. Our likelihood implementation actually follows prior approaches in feeding all answer candidates together to the language model, in addition to organizing the Q-A pairs along the batch dimension. This is illustrated in Table A.4; we will provide more examples to highlight this better during revision. Taking one simplified toy example, given a question “Where is the dog?” and answers “mat, table, bench”, we use three queries along the batch dimension (a short code sketch follows the example):
- Where is the dog? Select the correct response from: a) mat, b) table, c) bench. The correct response is a) mat
- Where is the dog? Select the correct response from: a) mat, b) table, c) bench. The correct response is b) table
- Where is the dog? Select the correct response from: a) mat, b) table, c) bench. The correct response is c) bench
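For completeness, a short illustrative sketch of this prompt construction is shown below (variable names are placeholders, not our exact implementation). Note that every query contains all answer candidates, and only the trailing completion differs, so the queries can be scored along the batch dimension with the likelihood measure discussed in our response to W1.

```python
# Illustrative sketch: build one full prompt per answer candidate. Every prompt
# lists all candidates; only the appended completion differs between prompts.
question = "Where is the dog?"
options = ["mat", "table", "bench"]
labels = ["a", "b", "c"]

option_str = ", ".join(f"{l}) {o}" for l, o in zip(labels, options))
prompts = [
    f"{question} Select the correct response from: {option_str}. "
    f"The correct response is {l}) {o}"
    for l, o in zip(labels, options)
]
for p in prompts:
    print(p)   # these prompts are scored in parallel along the batch dimension
```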
In fact, applying likelihood selection without such prompting leads to significantly lower performance on some datasets. We show this in Table 6, which we repeat below:
| Prompt setting | ES-S | NQA-T |
|---|---|---|
| No answer candidates in prompt | 58.2 | 35.8 |
| With answer candidates in prompt | 60.3 | 55.4 |
We apologize for any lack of clarity in our paper and will update to highlight this better.
W3a: “Ensure consistent occurrence of objects”
This is an important direction for long video understanding. However, our work is orthogonal to it. In fact, taking a prior work like LVNet, which directly addresses such issues, we can integrate our MVU framework to further improve its performance. We include these results below.
| Method | EgoSchema (full set) |
|---|---|
| LVNet | 61.1 |
| LVNet + MVU (ours) | 61.3 |
W3b: “Performance on more densely sampled frames”
We run the following ablation focused on densely sampled frames; dense sampling improves performance.
| Method | Frames | EgoSchema-S | Time (s) |
|---|---|---|---|
| MVU | 16 | 60.3 | 2.42 |
| MVU | 32 | 60.4 | 2.48 |
| MVU | 64 | 60.4 | 2.60 |
| MVU | 128 | 61.2 | 2.81 |
In fact, the lightweight object detector and tracker used in MVU allow scaling the number of frames with only a small increase in inference time.
W4: “Results on datasets beyond EgoSchema that mitigate language bias”
We evaluate on the LongVideoBench dataset and provide these results below. We select this Phi-3-Vision-Instruct baseline since it is the best performing model we can replicate within our compute budget.
| Method | LongVideoBench |
|---|---|
| Phi-3-Vision-Instruct | 49.7 |
| Phi-3-Vision-Instruct + MVU | 50.4 |
It is clear how MVU combined with this baseline can improve performance.
W5: “Recent state-of-the-art approaches”
We thank the reviewer for pointing these works to us. We are updating Tables 2 and 3 to include these comparisons. We also repeat some results below for quick reference.
| Method | EgoSchema (full-set) |
|---|---|
| MoreVQA [1] | 51.7 |
| VideoAgent [2] | 54.1 |
| VideoTree [3] | 61.1 |
| LifelongMemory [4] | 62.1 |
| LangRepo [5] | 41.2 |
| Tarsier [6] | 61.7 |
| InternVideo2 [7] | 60.2 |
| LVNet [8] | 61.1 |
| LVNet + MVU (ours) | 61.3 |
Here we apply our method on top of LVNet given its state-of-the-art performance, open-source implementation, and publicly available pre-computed captions (computing long-video frame captions with closed-source models can be expensive).
We thank the reviewer again for all valuable feedback helpful for improving our paper further.
References
[1] Min, Juhong, et al. "MoReVQA: Exploring Modular Reasoning Models for Video Question Answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Wang, Xiaohan, et al. "VideoAgent: Long-form video understanding with large language model as agent." European Conference on Computer Vision. Springer, Cham, 2024.
[3] Wang, Ziyang, et al. "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos."
[4] Wang, Ying, et al. "LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos."
[5] Kahatapitiya, Kumara, et al. "Language repository for long video understanding."
[6] Wang, Jiawei, et al. "Tarsier: Recipes for training and evaluating large video description models."
[7] Wang, Yi, et al. "InternVideo2: Scaling video foundation models for multimodal video understanding."
[8] Park, Jongwoo, et al. "Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA."
With these detailed clarifications, most of my concerns have been resolved. I will consider increasing my score.
We thank the reviewer for their positive comments and appreciate the positive rating of our paper.
We thank Reviewer a8nc once again for the valuable feedback and positive comments. We have now updated the PDF with all proposed changes in the comments and addressed all concerns raised by the reviewer.
While we thank the reviewer for the positive rating, we inquire if there are any remaining concerns regarding the paper that we could address. We would be highly grateful for any additional feedback that could further strengthen our paper and improve its score.
Thank you again for your time and effort devoted towards reviewing our paper.
We thank the reviewer once again for all their feedback and support in improving our paper. We present details for two more sets of experiments (using multi-frame baselines) to further highlight the generality of the MVU framework.
LVNet + Ours:
We integrate MVU with the LVNet baseline and evaluate on three benchmarks to establish performance improvements. MVU achieves consistent improvements across all three datasets. We also include two recent prior works for comparison.
| Method | EgoSchema | NextQA | IntentQA |
|---|---|---|---|
| MoreVQA | 51.7 | 69.2 | - |
| VideoAgent | 54.1 | 71.3 | - |
| LVNet | 61.1 | 72.9 | 71.1 |
| LVNet + MVU (ours) | 61.3 | 73.3 | 71.5 |
These results show the clear benefit of proposed MVU in multi-frame VLM settings.
LongVideoBench Evaluation:
We also run more evaluations with MVU on the LongVideoBench dataset, which is designed to mitigate language biases in video QnA. We integrate MVU over three different baselines and show consistent performance improvements in each case.
| Method | Phi-3-Vision-Instruct | LLaVA-Next | VideoLLava |
|---|---|---|---|
| Baseline | 49.7 | 47.0 | 37.6 |
| Baseline + MVU | 50.4 | 47.8 | 39.2 |
We thank the reviewer again for the suggestions to explore these directions. We hope these additional results help resolve any remaining concerns regarding our paper.
This paper introduces two baselines and a novel approach called the Multimodal Video Understanding (MVU) framework for video understanding tasks. The baselines explore using either a single frame or no visual input at all. In contrast, MVU aggregates multimodal information relevant to the video to enhance understanding.
Strengths
- It is interesting that this work introduces three key attributes essential for video understanding: Global Object Information, Object Spatial Location, and Object Motion Trajectory. These attributes contribute significantly to a more comprehensive analysis of video content.
Weaknesses
- Although this paper focuses on long-video understanding, it lacks specific design elements to address long-video scenarios. Challenges such as context length limiting input frames and modeling long-range temporal information are not directly addressed. The introduced attributes, Global Object Information (GOI), Object Spatial Location (OSL), and Object Motion Trajectory (OMT), are not tailored to tackle these issues in long-video understanding.
- In Table 1, the comparisons may be weaker due to the differing training setups and base models used across experiments. As such, the superior performance of SF-LLM over the state-of-the-art does not necessarily imply that the number of frames is irrelevant for video understanding.
- Since the model is based on LLaVA-v1.5-13B, an important baseline is missing: the use of multiple frames as input for LLaVA-v1.5-13B.
- Certain details are lacking, such as the method for using LLaVA-v1.5-13B for frame sampling, as mentioned in Figure 3.
- Likelihood selection is widely used in MCQ benchmarks as an additional track. For a fairer comparison, other methods should also incorporate this strategy in the reported results.
- The benchmarks in this paper only include mid-length videos, roughly under three minutes. To more convincingly demonstrate MVU's capabilities in long-video understanding, it would be beneficial to evaluate on benchmarks like VideoMME and LongVideoBench.
Questions
As mentioned in the weaknesses.
We thank the reviewer for their feedback and address all concerns below.
W1a: “Context length limiting input frames not addressed”
It is unclear what is meant by context length here. Assuming it refers to the LLM context length (since we use an LLM in our work), we reiterate that this is specifically a key issue addressed in our work. Instead of naively collecting all information within a frame, our MVU only collects object-centric spatial and motion information, allowing us to process more frames within a fixed context length. As described in our method and empirically validated in our ablations (see L518-521), our work directly addresses this issue.
In fact, when comparing the average token length against a similarly performing baseline (implemented under identical settings using a common LLM), we use fewer tokens (i.e., a shorter context) to achieve similar results.
| Method | Average Tokens | ES-F (%) |
|---|---|---|
| LLoVI | 1940 | 33.5 |
| MVU | 1124 | 37.6 |
This highlights how MVU acts as an information bottleneck to extract useful information - a key design element to support long video processing.
W1b: “Modeling long-range temporal information not addressed”
We rely on the power of the LLM to capture temporal information. We provide our video as a sequence of tokens so that the temporal information can be captured by the LLM. This is motivated by prior work such as LLoVi. Fusing our three information types and providing them to the LLM, which then performs all temporal modeling, is another key design element enabling long video understanding.
Applying additional techniques for more long-range modeling is orthogonal to our work. In fact, we apply MVU over one such work, LVNet [1], which is capable of processing up to 900 frames. Its extracted information is combined with MVU information to make final predictions, leading to additional performance improvements as shown:
| Method | EgoSchema |
|---|---|
| LVNet [1] | 61.1 |
| LVNet + MVU (ours) | 61.3 |
W4: “Table 1 comparisons may be weaker due to the differing training setups and base models”
All numbers in Table 1 are from experiments that we run using identical (pre-trained, publicly available) base models and evaluation settings. There is no training of the pre-trained models in our framework or the baselines used in Table 1. Our reproduced numbers for baselines are also comparable with those in other works [2], validating our implementations. We have already given careful attention to ensure no such discrepancies occur.
W5: “Multi-Frame LLaVa Baseline”
Directly adding more frames to the LLaVA baseline we use (that is trained only on images) does not result in improvements for our tasks.
| Method | Frames | ES-S |
|---|---|---|
| LLaVA | 1 | 55.8 |
| LLaVA | 8 | 53.4 |
| LLaVA | 16 | 46.2 |
| LLaVA | 32 | 40.2 |
This reduction of performance by naively adding more information (as frames) is consistent with observations in prior work [3].
W6: “Frame Selection Details” As mentioned in L248, L297, L916, and L955, we perform uniform frame sampling. We apologize for any clarity issues in Figure 3 and will revise it accordingly.
W7: “Likelihood Selection is widely used in the MCQ benchmark as an additional track”
Adapting likelihood selection to long video is part of our contribution and we show that it leads to better performance in our ablations. Although this is used widely in NLP, it is much less used in images and we are not aware of any video based works using this. We respectfully request the reviewer to provide any references to prior video QnA work that utilizes likelihood selection in any capacity.
W8: “Only three minute long videos”
We focus on established long video benchmarks used as the key evaluations in numerous prior works [1,2,4,5,6,7,8,9].
Video QnA traditionally involves short (10-20 second) clips (see Maaz et al. (2023)); we evaluate on such datasets as well (see Table A.5). Our focus in this work is on longer videos averaging 1-2 minutes, while also showing results on the traditional 10-20 second clips.
Nevertheless, we explore even longer videos by evaluating our method on LongVideoBench. Results are provided below:
| Method | LongVideoBench |
|---|---|
| Phi-3-Vision-Instruct | 49.7 |
| Phi-3-Vision-Instruct + MVU | 50.4 |
We select Phi-3-Vision-Instruct as the baseline since it is the best-performing model we can replicate within our compute budget. Larger models using significantly longer context lengths are difficult to replicate within academic compute constraints.
References
[1] Park, Jongwoo et al. “Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA.”
[2] Kahatapitiya, Kumara, et al. "Language repository for long video understanding."
[3] Mangalam, Karttikeya, Raiymbek Akshulakov, and Jitendra Malik. "Egoschema: A diagnostic benchmark for very long-form video language understanding." Advances in Neural Information Processing Systems 36 (2023).
[4] Min, Juhong, et al. "MoReVQA: Exploring Modular Reasoning Models for Video Question Answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5] Wang, Xiaohan, et al. "VideoAgent: Long-form video understanding with large language model as agent." European Conference on Computer Vision. Springer, Cham, 2024.
[6] Wang, Ziyang, et al. "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos."
[7] Wang, Ying, et al. "LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos."
[8] Wang, Jiawei, et al. "Tarsier: Recipes for training and evaluating large video description models."
[9] Wang, Yi, et al. "InternVideo2: Scaling video foundation models for multimodal video understanding."
Dear Reviewer V8Po,
Thank you again for your feedback and time/effort reviewing our paper. Since the rebuttal period is ending soon, please let us know if our responses have addressed your concerns. We are happy to engage in further discussion to provide more clarifications if needed.
Thank you!
Thank you for your thoughtful response, which solves most of my concerns.
I appreciate the contribution of this paper, particularly its identification of three key types of information—Global Object Information, Object Spatial Location, and Object Motion Trajectory—that enhance zero-shot video understanding. These insights are valuable for improving single-frame models in video understanding.
However, when comparing LVNet + MVU with LVNet in the multi-frame setting (72.9 vs. 73.3), the impact of these contributions appears less pronounced, suggesting that the benefits may be more limited for multi-frame models.
That said, I acknowledge the overall significance of this paper's contributions and have decided to increase my score accordingly.
We thank the reviewer again for their time and effort, as well as all the valuable feedback that further strengthened our paper. We are also grateful for the positive rating of our paper.
We thank the reviewer again for all their feedback and the positive rating of our paper. To address the remaining concern regarding the LVNet+MVU setup, we run additional experiments using several multi-frame baselines.
IntentQA results with LVNet
We present results on the IntentQA dataset as well. Our LVNet+MVU shows improvements consistent across three datasets now. We also compare against two prior works to highlight how improving over this LVNet baseline is not easy.
| Method | EgoSchema | NextQA | IntentQA |
|---|---|---|---|
| MoreVQA | 51.7 | 69.2 | - |
| VideoAgent | 54.1 | 71.3 | - |
| LVNet | 61.1 | 72.9 | 71.1 |
| LVNet + MVU (ours) | 61.3 | 73.3 | 71.5 |
These results show the clear benefit of proposed MVU in multi-frame VLM settings with LVNet.
More multi-frame baselines on LongVideoBench
We integrate MVU with three different multi-frame baselines and evaluate on LongVideoBench dataset. MVU leads to performance improvements in each case.
| Method | Phi-3-Vision-Instruct | LLaVA-Next | VideoLLava |
|---|---|---|---|
| Baseline | 49.7 | 47.0 | 37.6 |
| Baseline + MVU | 50.4 | 47.8 | 39.2 |
We thank the reviewer again for their valuable feedback and all time devoted towards reviewing and follow-up discussion of our paper. We hope these additional evaluations help resolve any remaining concerns about our paper.
We first thank all reviewers for the positive feedback: "likelihood selection approach offers a good way to speed up the inference" and "paper further evaluates the generalization ability" (R-a8nc); "proposed method seems novel and I think that makes sense for video understanding" (R-NaUs).
Next, we reiterate our key contributions:
- Highlighting performance of modality constrained baselines
- Adapting likelihood selection to long video tasks
- Video QnA framework MVU
We highlight how the proposed MVU is a framework that can be combined with existing approaches. Our main comparison was to LLoVi, where integrating MVU achieves both accuracy and inference speed improvements. These results cover multiple datasets including in the robotics domain establishing generality of MVU.
In our rebuttal, we provide evaluations for MVU combined with a more recent approach (LVNet) that involves explicit frame selection and achieves much stronger results. The combined LVNet+MVU variant further improves performance. This highlights how MVU can be integrated with existing methods for additional improvement.
We also highlight how MVU, especially with more dense sampling of frames, implicitly acts as an information bottleneck similar to frame selection works.
We provide more details on these as we respond to each reviewer. We hope the rebuttal is able to better establish the contributions of our paper and how it can be useful to the computer vision community.
Updated Manuscript: We have uploaded a new PDF including changes (colored in green) to address all concerns raised by reviewers. Key changes include:
- Added newer baselines from 2024 and integrated MVU over those baselines (updates to Related Work & Experiments sections, Table 2, Table 3)
- Clarified the nature and role of Likelihood Selection (updates to Sec 3.2, Appendix F.2, F.3)
- Evaluations on datasets containing very long (> 3 min) videos (Appendix H)
- Detailed discussion on LLM context length (Appendix K)
- Additional ablations and baselines (Appendix I, L)
NOTE: Several of the newly added papers (from 2024) are concurrent work (i.e. published after July 1st, the date set in ICLR guidelines). In fact, some of these works even cite and include results from our paper.
This paper introduces the Multimodal Video Understanding (MVU) framework, which leverages object-centric information (Global Object Information, Object Spatial Location, and Object Motion Trajectory) to enhance video understanding by translating visual data into natural language for use in large language models (LLMs). While the novel approach demonstrates marginal improvements in mid-length video QA benchmarks, reviewers highlight several limitations. Key concerns include the lack of focus on long-video-specific challenges, such as modeling long-range temporal dependencies, and the absence of comparisons with recent state-of-the-art methods. Additionally, the methodology lacks innovation in critical areas, such as frame selection, and the performance gains are only marginal. The paper is generally well-written but would benefit from evaluations on more competitive long-video QA benchmarks and more substantial comparisons to recent advancements. Overall, the work is promising but falls short in execution and impact.
Additional Comments from Reviewer Discussion
The reviewers acknowledge the paper's contribution in identifying three key types of information—Global Object Information, Object Spatial Location, and Object Motion Trajectory—that enhance zero-shot video understanding, particularly for single-frame models. However, concerns remain about the limited performance improvement (e.g., a marginal gain of 0.2 when comparing LVNet + MVU to LVNet) and the lack of convincing evidence for its effectiveness in multi-frame settings. One reviewer appreciates the insights and increases their score, while another criticizes the paper for failing to address core challenges in applying LLMs to video understanding, maintaining a negative stance. The contributions are noted, but the limited impact and unresolved issues keep opinions mixed. The meta-reviewer read the authors' feedback carefully and found the paper still deserves exposure.
Accept (Poster)