PaperHub
Overall rating: 5.3/10 · Rejected (3 reviewers)
Individual ratings: 5, 5, 6 (min 5, max 6, std 0.5)
Average confidence: 3.7 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

CinePile: A Long Video Question Answering Dataset and Benchmark

OpenReview · PDF
Submitted: 2024-09-24 · Updated: 2025-02-05

Abstract

Keywords
Datasets and benchmarking, Video understanding, Multi-modal learning, Visual question answering, Long-form video, Metrics and benchmarks

Reviews and Discussion

Official Review
Rating: 5

This paper introduces CinePile, a new large-scale dataset for long-form video question answering (VQA). The authors address the limitations of existing VQA datasets, many of which are easily solvable by analyzing only a few frames. CinePile aims to remedy this by focusing on questions requiring genuine long-form comprehension, encompassing temporal understanding, human-object interactions, and event reasoning. The dataset creation leverages a novel pipeline combining readily available audio descriptions (ADs) of movies, automatic speech recognition, and large language models (LLMs) like GPT-4 and Gemini, for automated question generation. A significant effort is dedicated to filtering out low-quality questions through an adversarial refinement process. The paper also presents a benchmark evaluating various open-source and commercial video LLMs on CinePile, revealing the strengths and weaknesses of current models in long-form video understanding and highlighting the potential for improvement through fine-tuning on the CinePile training set.

Strengths

  1. This paper presents a large-scale and challenging long-form video VQA dataset. The automated pipeline for question generation is a major strength: the use of readily available ADs and LLMs offers a cost-effective and scalable approach to annotating long videos. The paper also includes detailed analyses of various aspects of the dataset, including question type distribution, vision reliance, and question difficulty. The authors demonstrate a deep understanding of the challenges involved in creating and evaluating such a dataset.
  2. The evaluation of numerous open-source and commercial LLMs provides a new benchmark, offering insights into the current state-of-the-art in long-form video understanding. The fine-tuning experiments showcasing the effectiveness of CinePile for improving model performance are compelling and demonstrate the practical utility of the dataset.

Weaknesses

  1. The description of the adversarial refinement process could be more detailed. Specific examples of how questions were modified and the criteria for determining success would be beneficial.
  2. While the authors acknowledge potential biases in the LLMs used for question generation and the geographical limitations of the movie clips, a more in-depth discussion of potential biases (e.g. countries, languages, etc.) in the dataset itself and mitigation strategies would strengthen the paper. How well does the dataset generalize to videos outside the specific sources used?
  3. While the paper highlights the performance gap between open-source and commercial models, a deeper dive into why this gap exists would be valuable. Is it solely due to model architecture, training data, or other factors? Besides, human evaluation is primarily performed by the authors themselves. While they acknowledge potential biases, a larger-scale, more diverse human evaluation would significantly enhance the credibility of difficulty and quality assessments.

Questions

  1. Can the authors provide a more detailed breakdown of the types of biases identified and mitigated during the dataset creation process?
  2. What are the computational costs associated with the automated question generation and adversarial refinement pipelines?
  3. Can the authors provide a more in-depth analysis of the failure modes of the open-source models?
  4. How does the performance of the models on CinePile correlate with their performance on other video understanding benchmarks?
Comment

Can the authors provide a more in-depth analysis of the failure modes of the open-source models?

Thank you for the question. We previously discussed one of the reasons why (some) OSS models lag so far behind in Section IV of the main paper: for extremely poorly performing models (below 20% overall performance), the gap is partly due to their inability to follow instructions, a failure mode we discussed both qualitatively and quantitatively in Fig. 7a of the main paper and Appendix H (Table 8).

We have now updated Section P with additional discussion of the failure modes and performance trends of open-source models along the following key aspects:

  • Does Scale (In Parameter Space) Alone Lead to Better Performance?

    • There is a lot of focus on model scale these days, so we were curious whether scale alone leads to better performance (ignoring architecture, training data, etc.). We computed the Pearson correlation between model scale and overall performance and found it to be only weakly positive (r = 0.157). There are, of course, many confounders across different models (training data, architecture, etc.), so this does not mean that scale cannot significantly improve performance; rather, scale alone is not enough. If we control for everything else by analyzing a single model family, i.e., InternVL, we see a positive correlation of 0.72. (A short sketch of this correlation computation is included after this list.)
  • Poor ability to utilize visual information; and overdependence on LLM-priors

    • Another possible reason for the performance gap in open-source models could be their weaker use of visual information and over-reliance on language priors [1, 2]. In our experiments (Appendix Section K) examining the effect of the number of sampled frames on model performance, we observe that while models improve with additional frames, the extent of this improvement correlates with the model's overall performance. Specifically, better-performing models tend to utilize visual information more effectively, showing greater performance gains with more frames, whereas weaker models exhibit minimal to no improvement.
  • Gap with closed-source models

    • The performance advantage of closed-source models likely stems from a combination of factors rather than a single artifact. State-of-the-art models like Gemini-1.5-Pro and GPT-4o operate at scales of hundreds of billions of parameters, significantly outpacing the 7B-26B parameter range of the best open-source models we evaluated. Additionally, while these closed-source models do not disclose details about their training data mixtures or the GPU hours spent, it is reasonable to assume they adhere to scaling laws [3, 4] and are trained on datasets that are substantially larger and more diverse than those available to open-source models. The lack of transparency from closed-source models also means there are no ablation studies to pinpoint the optimal combinations of data mixtures or architectural choices contributing to their performance, which makes it challenging to draw precise comparisons. Despite these gaps, open-source models are rapidly catching up, with only an approximately 10% performance difference in our evaluations. We are optimistic that this gap will continue to shrink in the coming months, and CinePile's training set can help advance the capabilities of open-source models.
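
For reference, the scale-vs-performance correlation mentioned above can be computed with a few lines of SciPy. The sketch below is illustrative only: the model names and accuracy numbers are hypothetical placeholders, not the paper's reported results.

```python
# Illustrative sketch: Pearson correlation between model scale and accuracy.
# Model names and numbers below are hypothetical, not the paper's results.
from scipy.stats import pearsonr

# (parameter count in billions, overall accuracy in %)
results = {
    "model-a-7b":  (7.0, 38.2),
    "model-b-8b":  (8.0, 45.1),
    "model-c-13b": (13.0, 41.5),
    "model-d-26b": (26.0, 47.9),
}

scales = [scale for scale, _ in results.values()]
accuracies = [acc for _, acc in results.values()]

r, p = pearsonr(scales, accuracies)
print(f"Pearson r between scale and accuracy: {r:.3f} (p = {p:.3f})")
```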

[1] Shengbang Tong et al. "Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs." In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9568–9578.

[2] Zhiqiu Lin et al. "Revisiting the Role of Language Priors in Vision-Language Models." arXiv preprint arXiv:2306.01879, 2023.

[3] Jared Kaplan et al. "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361, 2020.

[4] Jordan Hoffmann et al. "Training Compute-Optimal Large Language Models." arXiv preprint arXiv:2203.15556, 2022.

Comment

What are the computational costs associated with the automated question generation and adversarial refinement pipelines?

We have updated Appendix Section-C and Section-N to include a detailed discussion on the monetary costs associated with our automated question generation and adversarial refinement pipelines, respectively. Utilizing state-of-the-art foundation models via APIs, we provide the following cost breakdowns:

Question Generation Pipeline

We employ GPT-4o through the OpenAI API for question generation. The cost per scene is calculated as follows:

  • Input Tokens:

    • Base prompt (instructions for question-answer generation and templates): 1,167 tokens (fixed)
    • Movie scene (subtitles and visual descriptions): 465 tokens (average; varies across scenes)
    • Total Input Tokens per Scene: 1,632 tokens
    • Cost per Input Token: $2.50 per 1M tokens
    • Input Cost per Scene: (1,632/1,000,000) * 2.50 = $0.00408
  • Output Tokens:

    • Average output tokens: 1,582 tokens (average; varies across scenes)
    • Cost per Output Token: $10.00 per 1M tokens
    • Output Cost per Scene: (1,582/1,000,000) * 10.00 = $0.01582
  • Total Cost per Scene: $0.00408 + $0.01582 = $0.0199

Adversarial Refinement Pipeline

For adversarial refinement, we use GPT-4o for question rephrasing and the free tier of the LLaMA 3.1 70B API provided by Groq. The cost per question fix therefore depends only on the GPT-4o rephrasing and can be calculated as follows:

  • Input Tokens:

    • Base prompt (instructions for fixing the question): 709 tokens
    • Movie scene (subtitles and visual descriptions): 465 tokens (average; varies across scenes)
    • Deaf-blind LLM response and rationale: 102 tokens
    • Total Input Tokens per Attempt: 1,276 tokens
    • Cost per Input Token: $2.50 per 1M tokens
    • Input Cost per Attempt: (1,276/1,000,000) * 2.50 = $0.00319
  • Output Tokens:

    • Fixed question: 74 tokens
    • Cost per Output Token: $10.00 per 1M tokens
    • Output Cost per Attempt: (74/1,000,000) * 10.00 = $0.00074
  • Total Cost per Attempt: $0.00319 + $0.00074 = $0.00393

  • Number of Attempts per Question Fix: Up to 5 (manually set limit)

  • Total Cost per Question Fix: $0.00393 * 5 = $0.01965
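
The arithmetic above can be summarized in a few lines; the sketch below simply re-derives the per-scene and per-fix costs from the stated token counts and GPT-4o prices, and is not the authors' released code.

```python
# Re-deriving the cost estimates above from the stated token counts and prices.
INPUT_PRICE = 2.50 / 1_000_000    # dollars per input token (GPT-4o, as stated)
OUTPUT_PRICE = 10.00 / 1_000_000  # dollars per output token (GPT-4o, as stated)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

cost_per_scene = call_cost(1_632, 1_582)   # question generation: ~$0.0199 per scene
cost_per_fix = 5 * call_cost(1_276, 74)    # refinement, worst case of 5 attempts: ~$0.0197

print(f"Question generation: ${cost_per_scene:.4f} per scene")
print(f"Adversarial refinement: ${cost_per_fix:.4f} per question fix (5 attempts)")
```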

Comment

Can the authors provide a more detailed breakdown of the types of biases identified and mitigated during the dataset creation process? How well does the dataset generalize to videos outside the specific sources used?

Thank you for your question. We have updated Section M of the Appendix to include a detailed discussion and new results on the types of biases, as well as the adaptation of CinePile to videos from different domains. Specifically, we address the biases in our dataset creation process at the following levels:

Source Data Level: Temporal Coverage and Demographic Representation:

Our dataset spans films from 1930 to 2023, covering 86 unique genres, including horror, comedy, and science fiction. This broad temporal and genre diversity ensures a rich mix of cultural contexts and representations across different eras. While we currently focus on English-language films, this strategic choice leverages resource availability and provides a solid foundation for future expansion into other languages and cultures—a significant area for future research. The flexibility of our pipeline supports this scalability, promoting broader inclusivity. Additionally, we carefully exclude films that are culturally problematic or known for perpetuating stereotypes, racism, or other biases during data collection.

Automated Data Level: Question Generation Bias:

We acknowledge that LLMs like GPT-4, used for question generation, may inherit biases from their training data. To mitigate this, we use prompt tuning with structured templates and prototypical questions. Each template is manually reviewed to ensure it is free from harmful stereotypes. Overly specific templates, such as “Pre-wedding Dilemmas: What complicates character Z’s plans to propose marriage to their partner?”, were removed or merged into coarser templates to avoid over-specificity and enhance generalizability. This process ensures that the templates in CinePile are versatile and can be extended to other domains.

"How well does the dataset generalize to videos outside the specific sources used?"

We primarily used movie clips as the source for generating question-answer pairs in our dataset; however, as alluded to above, given the generality and overall flexibility of our pipeline, we can easily extend it to other domains. To demonstrate this, we conducted an experiment with three longer videos from domains very different from the movie clips used originally:

  • Survive 100 Days Trapped, Win $500,000 (1620 seconds long video, YouTube Challenge-Reward),
  • How Hansi Flick's Tactics Are Revolutionizing Barcelona (540 seconds long video, soccer tactical analysis),
  • Eminem - Stan (Long Version) ft. Dido (480 seconds long video, music video).

These videos, vastly different from CinePile’s movie clips, were transcribed using Whisper, with key visual descriptions annotated by the authors. Additionally, we slightly revised the question generation prompt to shift the emphasis from movie-clip-specific analysis to general video analysis (e.g., changing “Create diverse and challenging questions based on the film analysis…” to “Create diverse and challenging questions based on the video analysis…”). We utilized the same question template bank (86 total templates) without adding or removing any. Feeding this "video scene information" into our pipeline generated high-quality questions. For instance, “What are the strong points of conflict between the characters in the video?” (video: Survive 100 Days Trapped, Win $500,000) required analyzing the entire clip to identify key conflicts and select the correct option. Similarly, “How does the video develop the theme of Barcelona's tactical variations in attack from start to finish?” (video: How Hansi Flick's Tactics Are Revolutionizing Barcelona) involved identifying and mapping out the tactical variations discussed throughout the video. Additionally, evaluating models on these questions revealed significant performance drops. For instance, Gemini-Pro-1.5, which achieved 60% on CinePile, scored 41% on these longer video questions, with similar ~20% performance drops observed across other models. The question-answer examples, along with detailed results in the updated Section M of the Appendix, demonstrate the pipeline's ability to generalize effectively across different video sources and contexts. The full set of generated question-answer pairs has been added to the supplementary material.
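
For concreteness, here is a minimal sketch of the transcription step mentioned above, using the open-source whisper package; the model size and file name are placeholders rather than the exact configuration used.

```python
# Minimal transcription sketch using the open-source `whisper` package.
# Model size and file name are placeholders, not the exact setup used.
import whisper

model = whisper.load_model("base")
result = model.transcribe("survive_100_days_trapped.mp4")

subtitles = result["text"]      # full transcript, fed into the QA pipeline
segments = result["segments"]   # list of timestamped {start, end, text} chunks
print(f"{len(segments)} segments, {len(subtitles.split())} words transcribed")
```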

Comment

Thank you for reviewing our paper, Reviewer dTLF. We are glad to know that you appreciate the cost-effective and scalable approach of our automated QA pipeline, the detailed analysis of various aspects of the dataset, the comprehensive model evaluation suite, and the fine-tuning experiments that highlight the practical utility of CinePile. Below, we address your concerns in detail.

Description of the adversarial refinement process could be more detailed. Specific examples of how questions were modified and the criteria for determining success would be beneficial.

Thank you for the suggestion. Due to space constraints in the main paper, we have provided a detailed discussion on adversarial refinement in Appendix Section-N. However, we are happy to make further adjustments and incorporate key points into the main paper if you believe it would improve clarity. Specifically, we have made the following changes:

  • Terminology Update: We made a minor change to the terminology, referring to the LLM used for answering questions without any context (subtitle or visual descriptions) as a “deaf-blind LLM.”
  • Illustrative Figure: An illustrative figure has been added to demonstrate the workflow of “deaf-blind LLMs” in the context of adversarial refinement within the overall question generation and filtration pipeline.
  • Code Inclusion: The code for performing adversarial refinement has been added to the supplementary file for reproducibility.
  • Additional Details: We have included details on how we adjust for chance performance to determine success, along with statistics such as the average number of attempts required to refine a question and the monetary cost of running the adversarial refinement pipeline.
  • Specific Examples: We have now added multiple examples illustrating the refinement process in Appendix Section-N. This includes examples of the “weak QA,” the deaf-blind LLM’s response and rationale, the refined question, and the LLM’s response to the revised question. Below is one of the examples from the scene Arthur 2: On the Rocks (1988) - It's Up to You Scene (additional examples can be found in Appendix Section-N):

Original Question

Q: How does Arthur physically react to Hobson’s revelation about his son?

A) Sighs deeply

B) Smiles faintly

C) Looks down

D) His eyes bulge [Correct answer; Chosen by the deaf-blind LLM in all 5 permutations of the option order.]

E) Turns away

Rationale by Deaf-Blind LLM: The phrase “Hobson’s revelation about his son” implies shock or surprise, and option D, “His eyes bulge,” is a common physical reaction to such emotions.

Rephrased Question

Q: How does Arthur’s demeanor change when Hobson mentions having seen Arthur’s son?

A) He frowns slightly

B) He smiles faintly [Chosen by the deaf-blind LLM in all 5 permutations of the option order.]

C) He looks confused

D) His eyes bulge [Correct answer]

E) He laughs nervously

Rationale by Deaf-Blind LLM: Question implies Hobson’s mention of Arthur’s son has a positive impact on Arthur, as it is likely a pleasant memory or a topic that brings him joy, thus a faint smile is the most probable reaction.

This example demonstrates how rephrasing through our adversarial refinement pipeline shifts the question’s emphasis, causing the deaf-blind LLM to answer incorrectly 5/5 times. This underscores the effectiveness of the pipeline in identifying and refining weak questions.
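
To make the workflow concrete, here is a minimal sketch of the adversarial refinement loop described above. The `ask_llm` and `rephrase` helpers are hypothetical stand-ins for the underlying API calls (the deaf-blind LLM and the GPT-4o rephrasing, respectively); this illustrates the logic rather than reproducing the released code.

```python
import random

def answered_blind(question, options, correct_idx, ask_llm, n_perms=5):
    """Return True if the 'deaf-blind' LLM (no subtitles or visual descriptions)
    picks the correct option more often than chance across shuffled option orders."""
    hits = 0
    for _ in range(n_perms):
        order = random.sample(range(len(options)), len(options))
        pick = ask_llm(question, [options[i] for i in order])  # index into shuffled list
        hits += int(order[pick] == correct_idx)
    return hits / n_perms > 1.0 / len(options)   # adjust for chance performance

def adversarial_refine(question, options, correct_idx, ask_llm, rephrase, max_attempts=5):
    """Rephrase a question until the deaf-blind LLM can no longer answer it
    without context, or the attempt budget is exhausted (then discard it)."""
    for _ in range(max_attempts):
        if not answered_blind(question, options, correct_idx, ask_llm):
            return question          # question now requires the video context
        question = rephrase(question, options, correct_idx)
    return None                      # could not be fixed within the budget
```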

How does the performance of the models on CinePile correlate with their performance on other video understanding benchmarks?

Thank you for the question. We have added a detailed discussion on the correlation between model performance on CinePile and other video understanding benchmarks in Appendix O. Specifically, we compute the Spearman rank correlation (ρ) between model ranks on CinePile and their ranks on other datasets, including Video-MME, MV-Bench, and EgoSchema. For each dataset, we use the model ranks provided in their official publications and calculate correlations based on the ranks of models common to both CinePile and the respective dataset. Our results show strong correlations: ρ = 0.964 for Video-MME (7 common models), ρ = 1.000 for MV-Bench (3 common models), and ρ = 1.000 for EgoSchema (2 common models).

While CinePile evaluates 26 state-of-the-art models, the number of models evaluated by other benchmarks is often smaller, with limited overlap. For example, MV-Bench assesses only 6 models, of which 3 overlap with CinePile, making some correlations less robust. However, these strong correlations suggest that models performing well on CinePile also perform well on manually curated benchmarks, underscoring CinePile’s validity as a reliable test set. That said, performance levels naturally vary due to differences in dataset characteristics and task difficulty. For instance, Gemini-1.5 Pro achieves 81.3% on Video-MME but only 60% on CinePile.
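
For reference, this kind of rank-correlation analysis can be reproduced with SciPy as sketched below; the rank vectors are hypothetical placeholders standing in for the ranks of the models shared between CinePile and an external benchmark.

```python
# Illustrative Spearman rank correlation between two benchmarks' model rankings.
# The ranks below are placeholders, not the actual model rankings.
from scipy.stats import spearmanr

cinepile_ranks = [1, 2, 3, 4, 5, 6, 7]   # ranks of the shared models on CinePile
other_ranks    = [1, 2, 4, 3, 5, 6, 7]   # ranks of the same models on another benchmark

rho, p = spearmanr(cinepile_ranks, other_ranks)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```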

Comment

Dear Reviewer dTLF,

We sincerely appreciate your dedicated time and effort in reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has further strengthened the paper.

Comment

Dear Reviewer dTLF,

We sincerely appreciate your dedicated time and effort in reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has further strengthened the paper.

Comment

Dear Reviewer dTLF,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

With only a few hours remaining to upload the revised PDF, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has helped to strengthen the paper further.

Comment

Dear Reviewer dTLF,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since December 2nd is the final date for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Comment

Dear Reviewer dTLF,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since there are only a few hours remaining for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Official Review
Rating: 5

The paper presents CinePile, a novel benchmark dataset consisting of 305,000 multiple-choice questions designed to challenge models with authentic long-form video understanding, moving beyond the capability of analyzing just a few frames. CinePile incorporates a diverse range of question types, emphasizing temporal comprehension, multimodal reasoning, and narrative understanding. Created through an innovative process that utilizes large language models and human-generated data, CinePile offers a comprehensive measure of model performance. The paper also includes an evaluation of various video-centric large language models, indicating significant room for improvement compared to human performance, and discusses the potential biases and ethical considerations inherent in such datasets. CinePile aims to serve as a scalable and comprehensive benchmark for assessing and advancing the capabilities of video understanding models.

Strengths

  1. The paper introduces a detailed and effective pipeline for generating a high-quality question-answering dataset, CinePile, which is specifically tailored for long-form video understanding.

  2. To ensure the quality of the automatically labeled data, the authors have developed an adversarial refinement process that rigorously evaluates and improves the dataset's accuracy.

  3. Additionally, the paper includes an extensive testing suite that assesses the performance of the latest open-source models against the most capable commercial models, providing a comprehensive benchmark.

  4. The authors also offer a detailed analysis, shedding light on the strengths and weaknesses of different models and their ability to understand and reason about video content.

Weaknesses

  1. The authors' dataset statistics table compares CinePile with a variety of VideoQA datasets. However, there are two minor issues. Firstly, CinePile predominantly consists of movie clips, and a separate analysis comparing it with other movie QA datasets would be beneficial for a more targeted evaluation. Secondly, the videos in CinePile are only 160 seconds long, which doesn't offer a significant advantage in terms of length over other datasets.

  2. While the authors' use of templates to construct QA pairs facilitates rapid generation, it severely compromises the diversity of the questions. This often results in a distribution of generated QA pairs that significantly deviates from the distribution of real human prompts. Consequently, the constructed training set lacks diversity, offering limited benefits for model improvement. Additionally, the metrics reflected by the test set may not provide a sufficiently reliable reference value due to this disparity.

Questions

How can the authors ensure that the proposed test set has high reference value for VideoQA, rather than merely evaluating templated MCQs in the movie domain?

Comment

This often results in a distribution of generated QA pairs that significantly deviates from the distribution of real human prompts.

We take several steps to ensure that the questions generated using our templates are natural, plausible, and not force-fitted. This process involves extensive prompt tuning with state-of-the-art large language models (LLMs) such as GPT-4 and Gemini, along with multiple qualitative passes to refine the prompts. These iterative improvements continue until we achieve high confidence in the naturalness and quality of the generated questions. Additionally, during our human studies, we solicited explicit feedback on the naturalness of the generated question-answer pairs, and no concerns were raised. If the reviewer has specific examples, in our paper or in the dataset released in the supplementary materials, that appear to deviate significantly from the distribution of human-generated prompts, we would welcome the opportunity to address them thoroughly.

The constructed training set lacks diversity, offering limited benefits for model improvement.

We would like to emphasize that CinePile’s training set provides substantial benefits for model improvement. As demonstrated in Section 4, fine-tuning on CinePile’s training set results in a significant performance boost, with accuracy increasing from 25.72% to 44.16%—an approximate 71% improvement. These gains are consistently observed across all question subcategories, highlighting the diverse and impactful nature of the training data. Additionally, we have updated the language in Section 4 and the Introduction to more clearly emphasize the contributions of CinePile’s training set in enhancing model performance.

Additionally, the metrics reflected by the test set may not provide a sufficiently reliable reference value due to this disparity.

To address concerns about the reference value of our proposed test set, we conducted an additional experiment to evaluate how well model performance on CinePile correlates with other established video understanding benchmarks. This analysis, detailed in Appendix O, computes the Spearman rank correlation (ρ) between model ranks on CinePile and their ranks on other datasets, including Video-MME, MV-Bench, and EgoSchema. For each dataset, we use the model ranks provided in their official publications and calculate correlations based on the ranks of models common to both CinePile and the respective dataset. Our results show strong correlations: ρ = 0.964 for Video-MME (7 common models), ρ = 1.000 for MV-Bench (3 common models), and ρ = 1.000 for EgoSchema (2 common models).

While CinePile evaluates 26 state-of-the-art models, the number of models evaluated by other benchmarks is often smaller, with limited overlap. For example, MV-Bench assesses only 6 models, of which 3 overlap with CinePile, making some correlations less robust. However, these strong correlations suggest that models performing well on CinePile also perform well on manually curated benchmarks, underscoring CinePile’s validity as a reliable test set. That said, performance levels naturally vary due to differences in dataset characteristics and task difficulty. For instance, Gemini-1.5 Pro achieves 81.3% on Video-MME but only 60% on CinePile.


[1] Tapaswi, Makarand, et al. "Movieqa: Understanding stories in movies through question-answering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[2] Lei, Jie, et al. "Tvqa: Localized, compositional video question answering." arXiv preprint arXiv:1809.01696 (2018).

Comment

While the authors' use of templates to construct QA pairs facilitates rapid generation, it severely compromises the diversity of the questions.

We would like to clarify that our use of templates does not compromise question diversity for the following two reasons. Firstly, rather than applying fixed templates for every video, we automatically select relevant ones from a diverse bank of 86 templates tailored to various aspects, such as Character Reaction Insight, Event Sequence Ordering, and Moral Dilemma Exploration. Thus, different videos receive different templates, ensuring diversity across the dataset. Secondly, the question generation process is guided by detailed prompts that incorporate both the chosen template and the specific video clip context. As a result, even when the same template is used, the questions vary significantly based on the unique characters, actions, and environments in each video. For example, the questions “How does the decision to buy the coffee machine and the Harry Potter collection lead to a significant consequence in the video?” and “What early tactical trait of Barcelona hinted at their ultimate attacking strategy?” both stem from the "Causal Chain Analysis" template but differ greatly in wording and focus due to the distinct video contexts. This approach contrasts with other datasets relying on human annotators, which often limit template categories (e.g., Perception Test uses four template areas) for human labeling feasibility.

To quantify question diversity, we conducted an experiment to measure the average semantic diversity of questions both within a video clip and across different video clips in our dataset.

Within-Video Diversity:

For a video clip $v_i$ with $j$ questions $\{q_{i1}, q_{i2}, \dots, q_{ij}\}$, we encoded each question with an embedding model (dunzhang/stella_en_1.5B_v5, ranked in the top 3 on the Massive Text Embedding Benchmark) and measured semantic similarity via the cosine similarity $\mathrm{cosim}(q_{ik}, q_{il})$ for all pairs with $1 \leq k, l \leq j$ and $k \neq l$. Since question diversity is inversely related to similarity, we used the pairwise cosine distance $1 - \mathrm{cosim}(q_{ik}, q_{il})$. The within-video diversity score for a clip $v_i$ is then the expected pairwise cosine distance:

$$D_{\mathrm{within}}(v_i) = \mathbb{E}_{q_{ik}, q_{il} \sim v_i}\left[1 - \mathrm{cosim}(q_{ik}, q_{il})\right]$$

We aggregated this across the dataset by sampling clips $v_i \sim D$, where $D$ denotes the distribution of video clips in CinePile:

$$D_{\mathrm{within}} = \mathbb{E}_{v_i \sim D}\left[D_{\mathrm{within}}(v_i)\right]$$

Across-Video Diversity:

To measure diversity across different video clips, we considered the pairwise cosine distances between questions from different videos. For two distinct clips $v_i$ and $v_j$ ($i \neq j$), with associated questions $\{q_{ik}\}$ and $\{q_{jl}\}$, we computed $1 - \mathrm{cosim}(q_{ik}, q_{jl})$. The across-video diversity score is the expected pairwise cosine distance between questions from different videos:

$$D_{\mathrm{across}} = \mathbb{E}_{v_i, v_j \sim D}\left[\mathbb{E}_{q_{ik} \sim v_i,\, q_{jl} \sim v_j}\left[1 - \mathrm{cosim}(q_{ik}, q_{jl})\right]\right], \quad i \neq j$$

Combined Diversity Score:

To obtain an overall measure of diversity, we computed the F1-score of the within-video and across-video diversity scores:

$$\text{Combined Diversity Score} = 2 \times \frac{D_{\mathrm{within}} \times D_{\mathrm{across}}}{D_{\mathrm{within}} + D_{\mathrm{across}}}$$

The F1-score is appropriate in this context because it balances both aspects of diversity by emphasizing the smaller of the two values, ensuring that neither within-video nor across-video diversity disproportionately influences the combined score. We compute the diversity score on 50 randomly sampled video clips and report the results in the table below. CinePile achieves a diversity score of 0.45. For context, we computed the same metric on other datasets: Video-MME: 0.45, MV-Bench: 0.42, and IntentQA: 0.37. These comparisons show that the semantic diversity of questions in CinePile is greater than or on par with that of other (even purely human-curated) datasets.

Dataset   | Within-Video Diversity | Across-Video Diversity | Diversity Score
CinePile  | 0.55                   | 0.38                   | 0.45
Video-MME | 0.53                   | 0.40                   | 0.45
MVBench   | 0.57                   | 0.33                   | 0.42
IntentQA  | 0.45                   | 0.32                   | 0.37
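
A minimal sketch of this diversity computation is given below, with sentence-transformers used as a stand-in embedding model (the rebuttal uses dunzhang/stella_en_1.5B_v5); the data layout is an assumption rather than CinePile's released format.

```python
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in embedding model

def mean_cosine_distance(a, b=None):
    """Mean (1 - cosine similarity) within one embedding set or across two sets."""
    def dist(u, v):
        return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    if b is None:
        return float(np.mean([dist(a[i], a[j]) for i, j in combinations(range(len(a)), 2)]))
    return float(np.mean([dist(u, v) for u in a for v in b]))

def diversity_scores(questions_per_clip):
    """questions_per_clip: list of question-string lists, one list per video clip."""
    embs = [encoder.encode(qs) for qs in questions_per_clip]
    d_within = np.mean([mean_cosine_distance(e) for e in embs if len(e) > 1])
    d_across = np.mean([mean_cosine_distance(embs[i], embs[j])
                        for i, j in combinations(range(len(embs)), 2)])
    combined = 2 * d_within * d_across / (d_within + d_across)   # F1-style combination
    return d_within, d_across, combined
```
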
Comment

Thank you for reviewing our paper, Reviewer N1VZ. We are pleased that you found our pipeline for generating CinePile—a high-quality question-answering dataset—detailed and effective. We also appreciate your recognition of the adversarial refinement process for enhancing dataset accuracy, our comprehensive benchmark comparing various open-source and commercial models, and the detailed analysis highlighting model strengths and weaknesses. Below, we address your concerns.

CinePile predominantly consists of movie clips, and a separate analysis comparing it with other movie QA datasets would be beneficial for a more targeted evaluation

Thank you for your suggestion! We have updated Table 1 to include a new “domain” column and expanded the discussion in Section 5 (Related Work) to highlight the domains of videos across various datasets. CinePile features a diverse collection of movie clips (films from 1930 to 2023, spanning 86 unique genres) that capture pivotal moments, shifts in character dynamics, and key thematic elements. This provides a rich testbed for evaluating the long-form contextual understanding capabilities of Video-LLMs. Indeed, past benchmarks like MovieQA [1] and TVQA [2], which also focus on movies and TV series, have been instrumental in advancing the field of video understanding. That said, CinePile’s adaptable pipeline allows automatically generating high-quality question answers for other domains, as we’ll address in the next response.

Videos in CinePile are only 160 seconds long, which doesn't offer a significant advantage in terms of length over other datasets.

  1. Larger Average Video Length For Similar Scale: CinePile has a significantly larger average video length compared to datasets of similar scale. For example, while MSRVTT-QA (243k samples) and TVQA (152k samples) have average video lengths of 15 and 76 seconds respectively, CinePile's 160-second average makes it the longest video dataset of its scale.
  2. Substantial Dataset Size: CinePile sets itself apart from benchmarks with longer videos, such as EgoSchema and Video-MME, by offering a dataset of 300k samples—60 times larger than these benchmarks. The smaller size of these benchmarks makes them less suitable for fine-tuning on longer video content, thereby limiting their utility for enhancing the capabilities of current Video-LLMs. CinePile’s unique combination of longer video length and large scale is particularly advantageous for fine-tuning, as demonstrated in Section 4, where it significantly improves video understanding performance.
  3. Adaptability and Scalability: CinePile's pipeline is designed for easy adaptation, enabling the efficient scaling of datasets to longer videos across diverse domains. While videos around 100 seconds are typically considered long in the video understanding community, as current models still struggle with this length, future models with improved temporal resolution will require even longer videos for training and evaluation. To meet this need, CinePile was developed not only as a dataset and benchmark but also as a reproducible, scalable, and efficient pipeline for curating long-form video datasets. We demonstrated this adaptability by experimenting with three longer videos from diverse domains:
  • Survive 100 Days Trapped, Win $500,000 (1620 seconds long video, YouTube Challenge-Reward),
  • How Hansi Flick's Tactics Are Revolutionizing Barcelona (540 seconds long video, soccer tactical analysis),
  • Eminem - Stan (Long Version) ft. Dido (480 seconds long video, music video)

These videos, vastly different from CinePile’s movie clips, were transcribed using Whisper, with key visual descriptions annotated by the authors. Additionally, we slightly revised the question generation prompt to shift the emphasis from movie-clip-specific analysis to general video analysis (e.g., changing “Create diverse and challenging questions based on the film analysis…” to “Create .... based on the video analysis…”). We utilized the same question template bank (86 total templates) without adding or removing any. Feeding this "video scene information" into our pipeline generated high-quality questions. For instance, “What are the strong points of conflict between the characters in the video?” (Survive 100 Days Trapped, Win $500,000) required analyzing the entire clip to identify key conflicts and select the correct option. Similarly, “How does the video develop the theme of Barcelona's tactical variations in attack from start to finish?” (How Hansi Flick's Tactics Are Revolutionizing Barcelona) involved identifying and mapping out the tactical variations discussed throughout the video. The QA examples, along with detailed results in the updated Section M of the Appendix, demonstrate the pipeline's ability to generalize effectively across different video sources and contexts. The full set of generated question-answer pairs has been added to the supplementary material.

Comment

Dear Reviewer N1VZ,

We sincerely appreciate your dedicated time and effort in reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has further strengthened the paper.

Comment

Dear Reviewer N1VZ,

We sincerely appreciate your dedicated time and effort in reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has further strengthened the paper.

Comment

Dear Reviewer N1VZ,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

With only a few hours remaining to upload the revised PDF, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has helped to strengthen the paper further.

Comment

Dear Reviewer N1VZ,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since December 2nd is the final date for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Comment

Dear Reviewer N1VZ,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since there are only a few hours remaining for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Official Review
Rating: 6

This paper presents CinePile, a novel large-scale dataset and benchmark for long-form video understanding, containing 305,000 multiple-choice questions across 9,396 movie clips. The authors introduce a data creation pipeline that combines human-created audio descriptions with LLM-generated question templates, followed by quality filtering to ensure questions require genuine video understanding. The dataset's effectiveness is demonstrated through evaluations of 24 video-language models, where the best commercial model (Gemini 1.5 Pro) achieved 60.12% accuracy and the top open-source model (LLaVA-OV) reached 49.34%, both still significantly below human performance of 73.21%. The authors also show that fine-tuning existing models on CinePile's training set can substantially improve performance, as demonstrated by Video-LLaVa's improvement from 25.72% to 44.16%. This work addresses critical limitations in existing video QA datasets by requiring true multimodal comprehension rather than allowing models to succeed through either visual or dialogue understanding alone.

Strengths

  • Scale and Efficiency: The paper presents an innovative automated pipeline that created 305,000 high-quality QA pairs - which is 70-75x larger than comparable datasets like MoVQA (21,953) or Video-MME (2,700).
  • Clear Benchmark Value: The significant performance gap between humans (73.21%) and best models (60.12% for Gemini 1.5 Pro) demonstrates that this dataset effectively challenges current state-of-the-art systems and provides meaningful room for improvement.

Weaknesses

  • The paper lacks rigorous statistical analysis of the dataset's properties: for example, there are no inter-annotator agreement scores for the human evaluation and no analysis of the language content of the questions or answers. For instance, with LLM-generated answers, the correct answer is often longer than the wrong answers, which becomes an issue.
  • The "Setting and Technical Analysis" (STA) questions may not truly test long-form understanding since they could potentially be answered by analyzing just a few frames, contradicting the paper's core claim about requiring temporal understanding.
  • The paper lacks ablation studies on the number of sampled frames, making it unclear how model performance varies with different temporal sampling strategies and what the minimum required context is.

Questions

What are the statistics on the length of the answers in the dataset? More generally, further exploration of the content of the questions and answers would be helpful.

Comment

The paper lacks ablation studies on the number of sampled frames.

Thank you for the question. We address this in Appendix K, where we initially provided ablation studies by evaluating model performance when video frames are entirely removed, leaving only subtitles (transcribed audio). In this setting, we observed a significant performance drop across all models except Video-ChatGPT.

To further investigate the impact of temporal sampling, we have now updated this section with additional ablation results, examining model performance when varying the number of sampled frames: [1,8,16,32] on a subset of CinePile questions. Due to the high cost of running these ablations on closed-source models like Gemini, we focused primarily on open-source models from our earlier experiments, adding a new model, MiniCPM-V 2.6.

Our findings show that model performance consistently improves as the number of frames increases, except for Video-ChatGPT, which shows no consistent gains. The improvement is proportional to the model’s overall ranking in our benchmarks. MiniCPM-V 2.6 shows the most significant performance gains with additional frames, followed by VideoLLaMa2, while Video-ChatGPT’s performance remains relatively unchanged, underscoring its limited reliance on visual inputs. These results highlight the necessity of integrating visual information (along with the transcribed audio) to improve performance on CinePile.

Num Frames | VideoLLaMa2 | Video-ChatGPT | MiniCPM-V 2.6
1          | 34.73       | 14.26         | 34.11
8          | 38.44       | 12.73         | 42.03
16         | 43.88       | 14.81         | 48.31
32         | 46.94       | 15.81         | 54.11
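
For illustration, the sketch below shows the kind of uniform frame sampling used in such an ablation, assuming the decord library for video decoding; the evaluation wrapper `evaluate_model` is a hypothetical placeholder, not our actual harness.

```python
# Uniformly sample a fixed number of frames per video for the frame-count ablation.
import numpy as np
from decord import VideoReader

def sample_frames(video_path: str, num_frames: int) -> np.ndarray:
    """Return `num_frames` frames spread uniformly across the video, as (N, H, W, 3)."""
    vr = VideoReader(video_path)
    indices = np.linspace(0, len(vr) - 1, num=num_frames, dtype=int)
    return vr.get_batch(indices.tolist()).asnumpy()

# for n in [1, 8, 16, 32]:                              # frame budgets in the ablation
#     acc = evaluate_model(model, questions,            # `evaluate_model` is hypothetical
#                          frame_loader=lambda path, n=n: sample_frames(path, n))
#     print(n, acc)
```
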
Comment

Thank you for reviewing our paper, Reviewer QWy8. We are pleased to note that you appreciate the scalability and efficiency of our automated pipelines and the effectiveness of our dataset/benchmark in challenging state-of-the-art models where they still underperform humans by a significant margin. Below, we address all your concerns.

What are the statistics on the length of the answers in the dataset? With LLM-generated answers, the correct answer is often longer than the wrong answers, which becomes an issue.

Thank you for the question. We have added detailed statistics about the language content of the generated questions and answers in Appendix Section-O. Specifically, we provide the distribution of the correct answer across different option markers: “A” (18.72%), “B” (21.35%), “C” (20.18%), “D” (20.26%), and “E” (19.49%), indicating no significant position bias. To address the concern about correct answers being consistently longer than distractor options, we analyzed the word count distribution. The correct answer is the longest option in only 14.18% of the questions, indicating that this occurs in a minority of cases. Similarly, the correct answer is the shortest option in just 5.14% of the questions, demonstrating that no reverse bias exists either. On average, correct answers have a length of 4.84 words, while distractor options average 4.59 words. Appendix Section-O also includes the distribution plots for correct and distractor options, along with additional statistics on question-answer length across different question-categories, etc.
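
A minimal sketch of how such option-level statistics can be computed is shown below; the per-question dictionary layout (`options`, `answer_idx`) is an assumption for illustration, not CinePile's released schema.

```python
from collections import Counter
import numpy as np

def option_statistics(questions):
    """questions: iterable of dicts with 'options' (list of 5 strings) and 'answer_idx' (int)."""
    positions = Counter()
    longest = shortest = 0
    correct_lens, distractor_lens = [], []
    for q in questions:
        lengths = [len(opt.split()) for opt in q["options"]]   # option lengths in words
        idx = q["answer_idx"]
        positions["ABCDE"[idx]] += 1
        correct_lens.append(lengths[idx])
        distractor_lens.extend(l for i, l in enumerate(lengths) if i != idx)
        longest += int(lengths[idx] == max(lengths))
        shortest += int(lengths[idx] == min(lengths))
    n = len(correct_lens)
    return {
        "answer_position_distribution": {k: v / n for k, v in sorted(positions.items())},
        "correct_is_longest_fraction": longest / n,
        "correct_is_shortest_fraction": shortest / n,
        "mean_correct_answer_words": float(np.mean(correct_lens)),
        "mean_distractor_words": float(np.mean(distractor_lens)),
    }
```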

There are no inter-annotator agreement scores for human evaluation.

We report human evaluation results in Table 2, based on assessments conducted by 25 graduate student volunteers from our department. These studies underscore the challenging nature of video understanding, with human evaluators occasionally making errors (discussed in detail in Section-I of the Appendix). To address this, the authors conducted a separate study where they ensured high response quality by dedicating time to carefully answer the questions and minimizing casual errors. Due to resource constraints, we prioritized maximizing coverage across question categories and movie clips, so the same questions were not shown to multiple participants.

At a higher level, the primary goal of this work was to develop a scalable, efficient, and adaptable pipeline for generating video understanding datasets, given the prohibitive cost of human annotation for long videos. As such, we focused on automated question filtration and refinement. For the test set, we manually verified all questions to ensure its reliability. The train set, due to its scale, was not manually verified, but finetuning experiments in Section 4 demonstrate that models trained on it improve performance on the verified test set. Additionally, we computed correlations between model performance on CinePile and other smaller, human-curated benchmarks (as requested by Reviewer dTLF). The high correlations (e.g., 0.96 with Video-MME) further validate CinePile’s value as a reliable benchmark, with consistent performance trends across datasets.

The "Setting and Technical Analysis" (STA) questions may not truly test long-form understanding since they could potentially be answered by analyzing just a few frames, contradicting the paper's core claim about requiring temporal understanding.

In the paper, we emphasize the importance of maintaining a diverse set of question categories rather than focusing solely on perceptual questions, such as those in the "Setting and Technical Analysis" (STA) category. This diversity, facilitated by question-category labeling tags for each question, enables researchers to more effectively analyze their models' failure modes across understanding, reasoning, and perceptual dimensions. Even within the STA category, it is not always the case that questions can be answered by analyzing only a few frames. For instance, a seemingly straightforward question like “Where is the group during the helicopter’s second pass?” (from the video: Area 51 (2015) - Sneaking Onto the Base Scene (4/10) | Movieclips) actually demands watching a significant portion of the clip as the Video-LLM must count how many helicopter passes have occurred and then analyze the surrounding location during the second pass, testing its ability to maintain temporal awareness and spatial reasoning over an extended sequence.

Comment

Dear Reviewer QWy8,

We sincerely appreciate your dedicated time and effort in reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has further strengthened the paper.

Comment

Dear Reviewer QWy8,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

As the author-reviewer discussion period draws to a close, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has helped to further strengthen the paper.

Comment

Dear Reviewer QWy8,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

With only a few hours remaining to upload the revised PDF, we kindly ask whether our additional clarifications and new results have sufficiently addressed your main concerns or if there are any remaining questions we can address further.

Thank you once again for your valuable feedback. We believe that incorporating these clarifications and experiments has helped to strengthen the paper further.

Comment

Dear Reviewer QWy8,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since December 2nd is the final date for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Comment

Dear Reviewer QWy8,

We sincerely appreciate the time and effort you have dedicated to reviewing our paper.

Since there are only a few hours remaining for reviewers to post messages to authors, we kindly ask if our additional clarifications and new results have sufficiently addressed your main concerns, or if there are any remaining questions we can further address.

Thank you once again for your valuable feedback. Incorporating these clarifications and experiments has helped strengthen the paper further.

Comment

We thank all reviewers for thoroughly reviewing our paper and providing constructive feedback. We are pleased that the reviewers acknowledge the “scalability” and “efficiency” of our automated pipeline [Reviewer QWy8, Reviewer N1VZ, Reviewer dTLF], which allowed us to develop a large-scale benchmark that “effectively challenges current state-of-the-art models” [Reviewer QWy8] as well as a dataset that is “compelling” and “of practical utility” in improving model performance via fine-tuning [Reviewer dTLF]. The reviewers also highlighted the “comprehensiveness” of our evaluation suite, which assesses “numerous open-source and commercial” Video-LLMs [Reviewer N1VZ, Reviewer dTLF], “shedding light on the strengths and weaknesses of different models” [Reviewer N1VZ].

The reviewers posed several insightful questions, prompting us to conduct additional experiments to address them. This exchange further validated the efficiency and adaptability of our pipeline, as well as the quality of our generated data along key aspects such as diversity and the integration of visual information. Below, we briefly summarize the key changes we make in the revised submission. We have updated both the main paper and supplementary material, with all changes highlighted in blue. The appendices are included in both the supplementary material ZIP file and the main PDF file, following the bibliography for easier access.

  • Additional Details on Characteristics of the Dataset. [Appendix Section O]

    • As suggested by Reviewer QWy8, we have updated Appendix Section O to include additional details on the characteristics of the generated dataset, such as the distribution of correct answers across different option markers and length comparisons between the correct choices and distractor options. We observed no substantial differences in the characteristics of the correct answers and the distractors.
  • Comparisons with Other Datasets. [Appendix Section O]

    • In response to Reviewer N1VZ’s questions regarding the use of templates, we have updated Appendix Section O to include additional experiments. These experiments quantitatively analyze the within-video and across-video diversity of the questions generated in CinePile, comparing them with other datasets such as Video-MME, MV-Bench, and IntentQA. We find that CinePile outperforms most other datasets in this respect. Furthermore, as requested by Reviewer dTLF and in response to Reviewer N1VZ’s query about the “reference value for VideoQA,” we have added a correlation analysis of model rankings on CinePile and other datasets. This analysis reveals strong positive correlations (~0.96) between CinePile and other datasets, including fully manually created ones.
  • Adapting CinePile to Longer and Different Videos. [Appendix Section M]

    • As requested by Reviewer dTLF and in response to Reviewer N1VZ’s question on the advantages of CinePile, we have updated Appendix Section M to include experiments on generating question-answer pairs for significantly longer videos (up to 1620 seconds) and across different domains (e.g., YouTube challenge videos, soccer tactical analysis, music videos) compared to the original CinePile dataset (160-second movie clips). We found that our pipeline can generate meaningful questions across these variations in length and domain, with only minor changes to the generation prompt and while utilizing the same question template bank.
  • Ablation on Varying Frame Rate [Appendix Section K]

    • As requested by Reviewer QWy8, we have added an ablation study to investigate the impact of varying frame rates on the performance of different Video-LLM models. Our findings indicate that this effect is more pronounced in better-performing models; that is, models with higher overall performance tend to leverage more visual information, while lower-performing models exhibit minimal variation.
  • Additional Discussion on Adversarial Refinement [Appendix Section N]

    • As requested by Reviewer dTLF, we have provided additional details on the adversarial refinement process. This includes an illustrative figure demonstrating the workflow, an explanation of how we adjust for chance performance to determine success, and statistics such as the average number of attempts required to refine a question and the monetary cost of running the adversarial refinement pipeline. We have also included multiple specific examples, such as instances of "weak QA," the deaf-blind LLM’s response and rationale, the refined question, and the LLM’s response to the revised question. Additionally, the code files for conducting adversarial refinement have been added to the supplementary materials.
  • Additional Details on Monetary Cost Involved [Appendix Section C and N]

    • As requested by Reviewer dTLF, we have included estimates of the monetary cost for running the question generation process in Appendix Section C and for running the adversarial refinement process in Section N.
Comment

(cont. from Part 1.)

  • Additional Discussion on Failure of Open Source Models [Appendix Section P]
    • As requested by Reviewer dTLF, we have added further analysis of the failures of open-source models, including a correlation analysis of the role of model scale in performance, an assessment of their ability to integrate visual information, and a discussion of possible reasons for the gap with better-performing closed-source models.

We hope our rebuttal and additional experiments provide the necessary clarity and can serve as a valuable reference for both the reviewers and the area chair(s) in alleviating concerns. We look forward to engaging with the reviewers should any concerns remain.

AC Meta-Review

The paper received comments from three reviewers, who lean toward rejecting this submission because of several critical concerns. First, the paper lacks appropriate statistical analysis to show its significant differences from, and importance relative to, some existing related benchmarks. Second, the training and evaluation sets constructed for the benchmark do not seem diverse enough, which may raise significant issues when the benchmark is used for model training and testing. Lastly, some of the ablation studies conducted do not effectively demonstrate the targeted points. Based on these relatively negative comments, the AC recommends rejection at this time.

Additional Comments from Reviewer Discussion

The reviewers requested clarification about the statistical analysis of the benchmark, more details of the benchmark itself, and further experimental analysis of the evaluated baseline models. They were not fully satisfied with the authors' rebuttal.

Final Decision

Reject