Understanding Complexity in VideoQA via Visual Program Generation
We propose a data-driven method to assess question complexity in Video Question Answering (VideoQA) based on code primitives, and use it to create a new benchmark, which is nearly twice as difficult for models compared to existing datasets.
Abstract
Reviews and Discussion
This paper explores a data-driven method for evaluating question complexity for Video Question Answering (VQA) by collecting the visual programs generated by ViperGPT. The authors then analyze the code outputs and leverage them to assess the difficulty of the questions in a VQA triplet. Given the output code from the Visual Programming module (ViperGPT), they propose CodePlexity, an algorithm that parses the generated code via Abstract Syntax Trees (AST) and identifies valid subtrees. These subtrees are then scored by correlating them with difficult questions -- subroutines that hurt VQA model performance. The authors use NExT-QA as the benchmark to analyze their proposed metric. Given this metric, they also propose a pipeline to generate difficult question-answer pairs for a set of videos from MOMA, ActivityNet (+Entities/Captions), Action Genome, and Charades. Finally, they compare the results of the SeViLA-ZS, InternVideo, and VIOLET models on their proposed new dataset (CodePlex-QA) against NExT-QA.
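For illustration, here is a minimal sketch of how a generated visual program could be decomposed into coarse AST node signatures with Python's `ast` module. The toy `execute_command` program and the signature definition are assumptions made for this example only; the paper's exact notion of a "valid subtree" may differ.

```python
import ast

def subtree_signatures(source: str):
    """Collect a coarse signature (node type plus direct child types) for every
    node of a generated program's AST. This only approximates the subtree
    extraction described above; the paper's exact procedure may differ."""
    tree = ast.parse(source)
    signatures = []
    for node in ast.walk(tree):
        signatures.append((type(node).__name__,
                           tuple(type(child).__name__ for child in ast.iter_child_nodes(node))))
    return signatures

# Toy ViperGPT-style program (hypothetical API names, used only to exercise the parser).
program = """
def execute_command(video):
    clip = video.trim(0, 10)
    person = clip.find("person")
    return person.exists()
"""
print(subtree_signatures(program)[:3])
```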
Strengths
- This paper is easy to read, the method is easy to follow, and the qualitative examples help to illustrate the motivation of the work.
- Leveraging Visual Programming outputs as a proxy for assessing the difficulty of a given task is an underexplored domain -- in software engineering, the correlation between code complexity and task difficulty is an active area of study. This is an open problem, and this paper proposes an interesting generalization of this task.
- Decomposing the output code from the VP module via ASTs, and identifying subtrees that decrease performance via scoring, may give an interesting angle for building an interpretable pipeline to analyze subprocesses that might be hurting the models' performance. In principle, this looks like an interesting approach and a viable metric for VP-based methods.
Weaknesses
- Visual Programming (VP) is an interpretable framework for several visio-linguistic tasks; however, VP may not be reliable for this purpose: for a quite complex task, the LLM might regard it as simple and output very simple code, yielding false positive or false negative hard cases. In addition, VP falls short against end-to-end models (e.g., ViperGPT acc. is 60% vs. SeViLA (finetuned) acc. is 73.8%) -- given its underperformance, it is hard to justify the use of its outputs as a proxy for evaluating complexity. It is also hard to justify using VP for measuring task complexity and then evaluating on end-to-end models.
- Disregarding visual inputs: "Text-based metrics (above) perform worse than the code-based ones (below), and our approach demonstrates the highest correlation with the models’ performance." -- this completely ignores the visual modality, which seems problematic.
- The authors claim: "In summary, we discovered that VideoQA methods struggle with fine-grained temporal reasoning and lack spatio-temporal, object-centric representations. This is in accord with prior studies" [3] -- however, those studies consider both the visual and language modalities when assessing temporal reasoning in VideoQA, giving high importance to the video part.
- Human evaluations (Figure 6): "We ask human annotators to provide the relative ordering of three provided questions according to the estimated complexity of answering the question about an unseen video" -- how can the complexity of the task be correctly assessed without looking at the video?
- Experimental setup: Baselines [1] focus on grammar and text only. BERT and GPT-4 also focus on text only.
- Generalizability: ViperGPT is the VP module used in this paper; however, VP has progressed significantly. Different methods leverage multiple LLMs, and there has been extensive problem decomposition and iteration that impacts the code outputs. Further experiments with other VP approaches might be required to ensure generalization. Similarly, the proposed metric and dataset are only compared against NExT-QA. Other benchmarks might be necessary to validate them (e.g., MVBench [2] compiles a collection of datasets, with subsets of samples from a diverse range of sources, covering spatio-temporal analysis, spatial action, object, position, scene, count, attribute, and cognition).
- For dataset generation, the authors compare their pipeline with [4] -- however, an important step for EgoSchema is the manual curation of the whole dataset in the final stage, and details of this step are not provided in this paper. Furthermore, as a comparison, EgoSchema consists of over 5000 human-curated multiple-choice question-answer pairs spanning over 250 hours of real video data -- significantly larger than the 1981 samples here. What is the length of each video sample?
[1] Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Brian MacWhinney, and Chris Dyer. Learning the curriculum with bayesian optimization for task-specific word representation learning. In ACL, 2016.
[2] Li, Kunchang, et al. "Mvbench: A comprehensive multi-modal video understanding benchmark." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the "Video" in video-language understanding. In CVPR, 2022.
[4] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023.
Questions
- SeViLA "leverages a single image-language model (BLIP2) to tackle both temporal keyframe localization and question answering on videos". How do the findings in this paper generalize to models that take multiple video frames for VQA (e.g., VideoChat2 [5], Video-LLaVA [6])?
- NExT-QA uses videos from VidOR [7] -- CodePlex-QA uses videos from MOMA, ActivityNet (+Entities/Captions), Action Genome, and Charades. Why not use the same videos, or at least the same source video dataset (VidOR)?
[5] Li, Kunchang, et al. "Mvbench: A comprehensive multi-modal video understanding benchmark." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[6] Lin, Bin, et al. "Video-llava: Learning united visual representation by alignment before projection." arXiv preprint arXiv:2311.10122 (2023).
[7] Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user generated videos. In ICMR, pages 279–287, 2019
Thank you for your detailed review and suggestions. We address your comments individually below (comment 2 of 2).
Q1: How do the findings in the paper generalize to models that take multiple frames as input?
SeViLA takes multiple video frames as input; the only model that does not is ATP. SeViLA additionally performs frame selection, which the other models do not. We have additionally evaluated Tarsier [1], a recent multi-frame Video LLM based on LLaMA, which is the SoTA model on NExT-QA and is architecturally equivalent to Video-LLaVA. As shown in the table below, Tarsier also struggles with questions identified as complex by CodePlexity. We have updated Table 1 in the manuscript with this result. Additionally, Tarsier has been added to the results in Table 2 and Figure 3 of the manuscript, which also show that the findings in the paper generalize to Tarsier: namely, that code-based metrics outperform their text-based counterparts, and that our approach, CodePlexity, further improves upon them.
| | SeViLA | ViperGPT | ATP | VIOLET | HGA | SeViLA ZS | InternVideo | Tarsier |
|---|---|---|---|---|---|---|---|---|
| Type | Train | Train | Train | Train | Val | Val | Val | Val |
| Dependency Tree Depth | 12.9 | 7.9 | 11.1 | 15.9 | 7.4 | 13.5 | 17.7 | 10.1 |
| GPT-4 | 9.6 | 8.9 | 11.6 | 5.8 | 7.8 | 14.6 | 13.9 | 10.8 |
| BERT | 12.5 | 6.0 | 18.3 | 17.3 | 7.7 | 14.3 | 21.1 | 10.8 |
| Lines of Code | 16.4 | 15.3 | 14.2 | 12.0 | 9.9 | 16.2 | 17.5 | 14.4 |
| Cyclomatic Complexity | 18.2 | 14.2 | 18.7 | 15.9 | 8.9 | 17.2 | 24.2 | 16.7 |
| CodePlexity (Ours) | 26.7 | 21.3 | 21.0 | 15.8 | 14.1 | 25.6 | 26.6 | 24.9 |
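For reference, the naive baselines in the table above could be computed roughly as sketched below. The tooling choices (radon for code metrics, spaCy for dependency parsing) are assumptions for this illustration and are not necessarily what the authors used.

```python
# Hedged sketch of the baseline metrics; tooling choices are assumptions.
import spacy
from radon.complexity import cc_visit   # cyclomatic complexity per code block
from radon.raw import analyze           # raw line counts

nlp = spacy.load("en_core_web_sm")      # assumes the small English pipeline is installed

def lines_of_code(source: str) -> int:
    return analyze(source).loc

def cyclomatic_complexity(source: str) -> int:
    return sum(block.complexity for block in cc_visit(source))

def dependency_tree_depth(question: str) -> int:
    # Depth of the syntactic dependency tree of the question text.
    def depth(token):
        return 1 + max((depth(child) for child in token.children), default=0)
    doc = nlp(question)
    return max(depth(sent.root) for sent in doc.sents)
```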
Q2: Why not use VidOR instead of MOMA+ActivityNet+Charades for CodePlex-QA?
Thank you for raising this point. While VidOR provides a valuable dataset for video-related tasks, we chose not to use it for the following reasons:
- Limited variety: Videos in VidOR lack diversity, both in terms of visual content and scenarios. This limits the range of complex, challenging questions that can be generated.
- Short video duration: Most videos in VidOR are relatively short, which makes it difficult to create and evaluate questions requiring fine-grained temporal reasoning or extended contextual understanding.
- Older source material: VidOR is sourced from YFCC, a collection of videos from Flickr dating back to 2016. While useful, these older videos may not represent the richness or variability seen in more contemporary datasets.
To address these limitations, we opted to use videos from three different sources: MOMA, ActivityNet, and Action Genome. This allows us to leverage a wider variety of video types, durations, and content, ensuring that our benchmark better captures the complexity and diversity of real-world scenarios.
[1] Wang, J., Yuan, L., Zhang, Y., & Sun, H. (2024). Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634.
[2] Ge, J., Subramanian, S., Shi, B., Herzig, R., & Darrell, T. Recursive Visual Programming. In ECCV, 2024.
Thank you for your detailed review and suggestions. We address your comments individually below (comment 1 of 2)
W1: CodePlexity is not the perfect metric for evaluating question complexity in VideoQA
We wholeheartedly agree. Our question complexity metric based on visual program analysis has limitations, including not always predicting the correct complexity score (indeed, it achieves an mPEG score of only ~25/100). It is, however, the best automatic metric that exists at the moment, and is also significantly more accurate than humans (see analysis in Section 4.2). Importantly, it provides human-interpretable and actionable insights into the key sources of complexities of existing models, allowing us to automatically construct a new challenging VideoQA benchmark. While we acknowledge that our approach does not address all challenges of VideoQA, we believe it represents a significant and valuable step forward for the field.
W2: Our analysis does not consider video information
We acknowledge that our approach focuses on analyzing question complexity independently of video information. However, we emphasize that asking complex questions about videos is a long-standing challenge. The first benchmark for video-language reasoning was introduced in Rohrbach et al., CVPR'15, and was followed by dozens of attempts to manually design questions that cannot be answered from a single frame, which were largely unsuccessful [4, 21, 33]. Our method takes a data-driven approach instead, discovering sources of complexity directly from the data, which is both novel and complementary to existing methods.
Additionally, as discussed in Section 9.1 of the appendix, video and text-based complexities are composable. This validates the study of question complexity in isolation. Combining CodePlexity with accurate video complexity metrics could yield an even more comprehensive evaluation in future work. Hence, our study of question complexity in VideoQA is still a valuable and valid research direction.
W3: Both human and algorithmic baselines also only consider text information
It is indeed true that our baselines, including human evaluations, focus solely on text information. This is a deliberate choice, as our goal is to evaluate question complexity in isolation from the video. While humans may not perfectly estimate video question complexity based on text alone, their judgments still provide informative estimates. Comparing baselines with access to the same information ensures fair evaluations of our metric.
W4: Please report more recent VP methods and more recent datasets
We reran the analysis with the recommended RVP model [2] and show that code-based complexity functions based on it also correlate negatively with model performance. Results are shown in Figure 19 of the Appendix and demonstrate that using a more advanced CodeGen model enhances the predictive power of code-based metrics for estimating question complexity. This highlights the extensibility of our method and its potential for further improvement as code generation models continue to evolve.
In addition, we are now re-running our experiments on MVBench. Please appreciate the effort involved in this, as it essentially requires re-running most of the experiments in the paper on a new, large-scale dataset. We expect the experiments to finish before 26th of November and will post the results here as soon as possible.
W5: Please provide additional details and statistics of the proposed CodePlexQA benchmark
We provide the requested details here and added them to Section 8 of the Appendix. The resulting dataset has an average of 2.40 questions per video. The duration of each video ranges from approximately 3 seconds to 10 minutes, with an average video duration of about 1.5 minutes. This diverse range of video lengths is desirable as it is conducive to generating a wide variety of questions.
I thank the authors for the detailed explanations.
One big concern shared by other reviewers is the robustness of the proposed approach. In particular, results on the NExT-QA ATP-Hard split help to shed some light on this issue.
My question regarding using videos from MOMA, ActivityNet, ActivityNet-Entities, and ActivityNet-Captions comes from the fact that comparing the NExT-QA benchmark and the proposed CodePlex-QA might be unfair. As the authors pointed out, VidOR has several limitations, including: limited variety, short video duration, and older source material. However, NExT-QA is composed of 6,000 VidOR videos. It seems that the benefits from CodePlex-QA may come from the video sources and not the proposed approach/metric. Generating CodePlex-QA from VidOR would yield a slightly better/fairer comparison.
Dear Reviewer Hrka,
We sincerely appreciate your thoughtful feedback and are glad that you found our explanations helpful. We address your remaining concerns below.
Q1: Robustness of our approach
We have already demonstrated that our approach is robust to the CodeGen method being used in Section 10.3, as well as to a variety of VideoQA models in Table 1. Following the reviewer's request, we now report results on MVBench below, as well as in Section 10.4 in the updated manuscript.
Firstly, as you can see from Figure 20 in the manuscript, ViperGPT code complexity correlates strongly with question complexity for a variety of VideoQA models on this dataset. Secondly, as shown in the table below and in Table 5 of the manuscript, our proposed CodePlexity metric is far superior to the naive code complexity metrics on MVBench.
These results complete the robustness analysis of our approach, clearly demonstrating that it can scale to new methods and datasets, and validating its broader applicability. We hope this additional analysis resolves the reviewer’s concern and would be happy to answer additional questions in the remaining discussion period.
| | InternVideo | SeViLA ZS | Tarsier | VideoChat2 | LLaVA-NeXT |
|---|---|---|---|---|---|
| Lines of Code | 0.0441471 | 0.0950738 | 0.125403 | 0.10273 | 0.0919019 |
| Cyclomatic Complexity | 0.129945 | 0.0963568 | 0.0614128 | 0.0521084 | 0.0447285 |
| CodePlexity (Ours) | 0.413361 | 0.330262 | 0.444389 | 0.299086 | 0.274255 |
Q2: Comparison of NextQA and CodePlexQA is not fair due to different video sources
Thank you for clarifying this point. We acknowledge that multiple factors contribute to CodePlex-QA being more challenging than NExT-QA, and we agree that differing video sources play a role. We do provide a more detailed analysis of some of these factors in Figure 18 in the manuscript, which demonstrates that selecting samples based solely on their CodePlexity scores from NExT-QA already produces a benchmark that is significantly more challenging. This highlights the efficacy of our metric, independent of the diversity introduced by alternative data sources and automatic question generation.
We would like to reiterate the key difference between NExT-QA and CodePlex-QA: the former is 100% manually constructed, whereas our pipeline is automatic, only requiring a single manual filtering stage to remove incorrectly generated question-answer pairs. In our opinion, it would have been impressive if such an automated pipeline simply matched a human-expert-curated benchmark in terms of complexity. As our results show, it significantly surpasses the widely adopted NExT-QA.
We will include results with VidOR as a data source for constructing CodePlex-QA in the camera ready version of the paper to provide a stricter comparison (getting them by the discussion deadline is, unfortunately, not feasible). However, we believe that the broader diversity and temporal richness of the datasets we selected more effectively demonstrate the generalizability and scalability of our approach.
I really appreciate the additional experiments and explanations.
However, the main issue with this work is that it is built on the assumption that the proposed CodePlexity metric is a good proxy to evaluate video-language data complexity. In Table 5 for example, the baselines are lines of code and cyclomatic complexity, which seems insufficient. While this paper provides some hypotheses that visual program generation code quality might correlate with the complexity of VideoQA datasets, all metrics are dependent on ViperGPT, and the generated CodePlex-QA is built from more challenging sets of videos. Furthermore, prior work has shown that, in general, LLMs struggle when generating code that can effectively interpret visual elements [1] and text-only instructions [2]. If the proxy is not reliable, is it valid to use it as an oracle to evaluate for complexity? This is a very promising direction, but in light of the current state of this paper and the lack of stronger empirical evidence, I am keeping my original rating of 5 and lowering my confidence to 3.
[1] Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. 2024. MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 736–783, Miami, Florida, USA. Association for Computational Linguistics.
[2] Jiang, Juyong, et al. "A Survey on Large Language Models for Code Generation." arXiv preprint arXiv:2406.00515 (2024).
Dear Reviewer Hrka,
Thank you for your thoughtful response and for engaging in this discussion with an open mind. We deeply value your feedback and your willingness to reconsider aspects of our work.
We would like to emphasize a few key points from our rebuttal that directly address your remaining concerns:
- We do show that code-based metrics generalize beyond ViperGPT in Figure 19 by replacing it with the very recent RVP [1] visual programming model. This experiment demonstrates that using a more advanced visual programming model improves the predictive power of code-based metrics, reinforcing the argument that code serves as a robust proxy for question complexity in VideoQA.
- We do explicitly evaluate the effect of the correctness of the generated code on the predictive power of our metric in Table 4. While the correlation between models’ performance and metric value improves when code is correct, CodePlexity consistently outperforms baselines in robustness to code generation errors. This underscores its practicality even when errors occur.
- We do evaluate how data sources influence dataset complexity in Figure 18 and have committed to including your specific suggestion in the camera-ready version. Importantly, this is a secondary issue that does not challenge our central hypothesis: code complexity is a strong proxy for question complexity in VideoQA.
- We are running non-code-based baselines from Table 1 on MVBench right now and will report the results as soon as possible, but wanted to post this discussion earlier to give the Reviewer time to respond.
Together, these and other results in our paper substantiate the central claim of this work: code complexity correlates strongly with question complexity in VideoQA. Intuitively, code complexity is effective at this task, because it is a proxy of the complexity of the algorithm that is implicit in the question. Crucially, our work makes no claim that code-based metrics are optimal or that they capture all aspects of complexity in this domain. Rather, we position CodePlexity as a promising and novel approach that broadens the scope of how complexity can be measured. We do compare to all other text-based complexity metrics in the literature and results speak for themselves.
We kindly ask the reviewer to consider that papers proposing new and unexpected hypotheses—even if not yet flawless—often have the greatest potential to drive meaningful progress within the community, in contrast to works that merely confirm widely held beliefs. We hope the reviewer will give our paper the chance to contribute in this way.
[1] Ge, J., Subramanian, S., Shi, B., Herzig, R., & Darrell, T. Recursive Visual Programming. In ECCV, 2024.
I really appreciate the discussion and summary of the updates and revised version. Thank you for pointing out the results on Figure 19, I think the authors have a valid point. I also agree with the authors regarding new research directions, and this indeed is an interesting one. Thus, I'm increasing my overall rating.
Dear Reviewer Hrka,
Thank you once again for your thoughtful feedback and for your willingness to engage with our work in such depth. We truly appreciate your recognition of the potential impact of our approach and your increased rating of our submission.
As promised, we are providing results for the non-code-based baselines from Table 1 evaluated on MVBench below. In particular, we report the heuristic-based Dependency-Tree-Depth and data-driven BERT baselines. Evaluating the GPT-4 baseline on MVBench is taking more time, but we commit to including it in the camera ready version of the paper. Overall, we observe similar trends to those in Table 1. Here are the main findings:
- Data-driven metrics outperform heuristic-based ones: As anticipated, metrics that leverage data to model complexity (BERT and CodePlexity) show a clear advantage over manually designed heuristics, which lack the flexibility to capture the challenges posed by VideoQA.
- Our CodePlexity metric outperforms all alternatives: The results affirm that CodePlexity, as a data-driven, code-based metric, is not only robust but also provides the most accurate estimation of question complexity. Its ability to correlate strongly with model performance across diverse datasets further validates its utility.
| | InternVideo | SeViLA ZS | Tarsier | LLaVA-NeXT | VideoChat |
|---|---|---|---|---|---|
| Dependency Tree Depth | 6.5 | 6.2 | 19.7 | 12.2 | 16.5 |
| BERT | 25.6 | 11.5 | 18.8 | 21.5 | 21.1 |
| Lines of Code | 4.4 | 9.5 | 12.5 | 9.2 | 10.3 |
| Cyclomatic Complexity | 13.0 | 9.6 | 6.1 | 4.5 | 5.2 |
| CodePlexity (Ours) | 41.3 | 33.0 | 44.4 | 27.4 | 29.9 |
We are encouraged by your acknowledgment of the novelty and promise of this research direction. Thank you for giving our work the opportunity to be evaluated fairly and for supporting contributions that aim to broaden the scope of methodologies in this field.
In this paper, the authors claim that the questions generally considered difficult differ from the questions that are actually difficult for current VideoQA models. Therefore, the authors propose a new metric, CodePlexity, to evaluate the difficulty of a VideoQA question. This metric is based on recent visual programming methods: a visual program is generated for each question and its complexity is evaluated. Based on this metric, the authors find that most models struggle with questions where multiple frames are involved and frame order needs to be taken into consideration. Then, based on the metric, the authors propose CodePlex-QA and claim that it is 1.9 times harder than existing benchmarks.
Strengths
- The paper proposes a novel metric to evaluate the complexity of the VideoQA problem and also proposes a CodePlex-QA that is more difficult for current VideoQA models. The idea of leveraging code to develop a new metric for question difficulty analysis is interesting.
- The paper conducts thorough experiments on testing the correlation of the metric.
Weaknesses
- The criteria for what makes a question complex or challenging for models—especially compared to human intuition—seem speculative. Without more rigorous validation, it’s unclear whether the proposed complexity measures truly capture inherent question difficulty or just reflect the limitations of current VideoQA models. Also, the idea of identifying questions that are “easy for humans but hard for machines” is ambiguous. It seems plausible that any difference in difficulty may be more a result of model architecture and training rather than the intrinsic complexity of the question itself.
- Visual programming is a natural way (and probably the best way) to address the CodePlex-QA task. The authors didn't report how well recent visual programming methods (VisProg, ViperGPT, CodeVQA), especially those addressing complex logical questions (e.g., RVP [1]), perform on the task.
- The comparison between NExT-QA and CodePlex-QA (Table 2) is not convincing enough, as previous works have shown that NExT-QA contains easy questions [2]. How does CodePlex-QA compare with the NExT-QA ATP-Hard split?
[1] Recursive Visual Programming. [2] Revisiting the "Video" in Video-Language Understanding.
Questions
- What are the results of the recent visual programming methods on this task?
- How complex is CodePlex-QA compared with the NExT-QA ATP-T/ATP-C splits?
Thank you for your detailed review and suggestions. We address your comments individually below.
W1: CodePlexity doesn't capture ‘the inherent complexity of the question’
Please note that the concept of "inherent question complexity" is not well-defined and that our work does not claim to capture it. Instead, our primary objective is to propose a practical and operationalizable metric, CodePlexity, to evaluate the relative difficulty of VideoQA questions for existing models. Our findings highlight a crucial gap: questions that appear straightforward to humans can be disproportionately challenging for current VideoQA architectures due to their inherent design limitations. CodePlexity addresses this gap by serving as a bridge between human intuition and model-specific difficulty.
W2: Visual programming methods are not evaluated in the paper
We thank the reviewer for pointing this out. We are currently evaluating ViperGPT on CodePlex-QA and will post the results as soon as the evaluation finishes.
W3: Comparison with NextQA-ATP-Hard split is not provided
This is a great point! We provide the requested evaluation below, as well as in Table 2 in the manuscript. Critically, ATP-Hard uses ground-truth labels in NExT-QA to select questions on which a proportion of models from an ensemble of 10 VideoQA models fail. In contrast, our approach only requires the question itself to determine its complexity.
Despite this oracle-like nature of ATP-Hard, CodeplexQA is substantially more challenging for the top performing models and approximately equally hard for the least effective VIOLET baseline. ATP-Hard is somewhat more challenging than CodeplexQA for InternVideo because this baseline is finetuned from CLIP and the samples in ATP-Hard were selected by seeing where CLIP-based models fail (i.e. this dataset represents an upper bound in terms of complexity for CLIP-based models). These results strongly support the effectiveness of both our complexity metric and of our automatic approach for generating challenging VideoQA benchmarks.
| Dataset | Tarsier | SeViLA ZS | InternVideo | VIOLET | Random |
|---|---|---|---|---|---|
| NExT-QA | 70.9% | 64.2% | 50.9% | 37.7% | 20.0% |
| CodeplexQA | 52.5% | 43.7% | 29.9% | 27.6% | 20.0% |
| ATP-Hard | 59.8% | 54.9% | 24.6% | 25.4% | 20.0% |
Dear Reviewer EhtB,
As promised above, we now report the results of the evaluation of ViperGPT on CodePlex-QA below, as well as in Table 2 in the updated manuscript. As you can see, our dataset is indeed significantly more challenging than NExT-QA for visual programming methods, as well as for more traditional methods. This result further demonstrates the generality of our approach. We look forward to continued discussion and would be happy to answer any additional questions.
| Dataset | Tarsier | SeViLA ZS | Viper | InternVideo | VIOLET | Random |
|---|---|---|---|---|---|---|
| NExT-QA | 70.9% | 64.2% | 60.0% | 50.9% | 37.7% | 20.0% |
| ATP-Hard | 59.8% | 54.9% | 51.8% | 24.6% | 25.4% | 20.0% |
| CodeplexQA | 52.5% | 43.7% | 45.8% | 29.9% | 27.6% | 20.0% |
Dear Reviewer EhtB,
We hope this message finds you well. We wanted to kindly remind you that the reviewer response deadline is at midnight today. We would greatly appreciate it if you could let us know whether your concerns have been addressed or if there are any additional questions or suggestions we can assist with.
Thank you for the experiments on visual programming based approaches. It addresses some of my concerns. However, I am not convinced of having a set of questions targeted as "complex for existing models" and emphasizing them for future research. This distinction could potentially reflect biases inherent in current research methods rather than offering new insights. More importantly, focusing on these questions might inadvertently introduce new biases rather than addressing the truly challenging problems in video understanding. For this reason, I will maintain my current score.
Dear Reviewer EhtB,
Thank you for your detailed feedback. We are glad that our response addressed some of your concerns, but regret that you did not leave us enough time for a thorough discussion. We would like to use this opportunity to clarify a potential misunderstanding of our contributions and address your final concern.
Our work is not about creating a fixed metric or dataset but about introducing a data-driven framework for systematically identifying and analyzing model-specific challenges in VideoQA. Rather than proposing a universal measure of "inherent complexity," we focus on uncovering common failure modes across diverse state-of-the-art models. These shared failure patterns often reflect fundamental challenges in the field and offer critical insights into where progress is most needed.
While you suggest that focusing on questions "complex for existing models" risks introducing biases, we argue that identifying such failure modes is essential to addressing any biases already present in current methods. By highlighting where models falter—particularly in areas like temporal reasoning and multi-frame dependencies—our approach provides actionable insights to guide model development and dataset design.
Ultimately, our framework is adaptable and iterative: as new models and datasets emerge, it can be reapplied to surface fresh challenges, automatically generate benchmarks that incorporate these challenges, and sustain the momentum of progress. This adaptability makes it more future-proof than any heuristic-based definition of “true complexity” crafted by human experts. Indeed, all the efforts to propose such a definition have not stood the test of time [4, 21, 33]. We believe that our approach provides a novel perspective on question complexity analysis in VideoQA and will ultimately help address the fundamental challenges in this domain.
This paper proposes a novel approach to analyzing and measuring question complexity in VideoQA by leveraging generated code complexity. The authors demonstrate that code-based metrics correlate better with model performance compared to human judgment and introduce CodePlexity, an algorithm for estimating question complexity. They use this to identify challenging patterns in VideoQA and automatically generate a new benchmark dataset, CodePlex-QA, which proves to be significantly more challenging than existing datasets.
Strengths
- The idea to measuring question complexity using code generation is well-motivated
- The ablation studies and analysis are well-designed
Weaknesses
- All experiments are conducted on a single dataset (NExT-QA) for analysis, which leads to limited evaluation across different types of VideoQA tasks.
- The authors should discuss more how errors of code generation might affect the complexity metrics
- The human evaluation study uses only 150 questions, which is a relatively small sample
- We are concerned that the merging of subtrees that "always co-occur" could potentially miss important patterns in edge cases
- The manual filtering process for the generated dataset (removing 12% of questions) could introduce selection bias
Questions
- Why the authors do not validate the proposed approach on other VideoQA datasets besides NExT-QA?
- How do errors in code generation impact the complexity metrics?
- Are there any important patterns that might be missed by your current subtree merging approach?
- What specific criteria were used to manually filter out the 12% of questions?
Thank you for your detailed review and suggestions. We address your comments individually below.
W1: All evaluations are conducted on NextQA
We thank the reviewer for pointing out this limitation. To evaluate the generalizability of our method, following the suggestion of Reviewer Hrka, we are now running the analysis on the very recent MVBench dataset. Please appreciate the effort involved in this, as it essentially requires re-running most of the experiments in the paper on a new, large-scale dataset. We expect the experiments to finish before the 26th of November and will post the results here as soon as possible.
W2: How do errors in code generation affect CodePlexity?
We thank the reviewer for bringing up this under-explored aspect of our method. We have analyzed the performance of CodePlexity on questions which ViperGPT answers correctly (i.e. the generated code is correct), vs. the questions on which ViperGPT fails and show the results here and in Table 4 in the updated manuscript. As you can see, while the correlation between models’ performance and predicted metric value is consistently higher when the code is correct for all code-based metrics, CodePlexity is a lot more robust to code generation errors than the baselines. Please also see our response to Reviewer RWYy above, where we show that using more advanced code generation models improves the predictive power of code-based complexity metrics.
| Metric | Result | SeViLA | ViperGPT | ATP | VIOLET | HGA | SeViLA ZS | InternVideo | Tarsier |
|---|---|---|---|---|---|---|---|---|---|
| Lines of Code | Correct | 0.1373 | --- | 0.1747 | 0.1455 | 0.0712 | 0.1654 | 0.2022 | 0.1475 |
| | Incorrect | 0.1245 | --- | 0.0540 | 0.0656 | 0.0735 | 0.0756 | 0.0831 | 0.0696 |
| Cyclomatic Complexity | Correct | 0.1702 | --- | 0.2128 | 0.1930 | 0.0649 | 0.1739 | 0.2825 | 0.1634 |
| | Incorrect | 0.1351 | --- | 0.1118 | 0.0881 | 0.0664 | 0.0973 | 0.1388 | 0.1071 |
| CodePlexity | Correct | 0.2608 | --- | 0.3128 | 0.3178 | 0.0867 | 0.2095 | 0.2877 | 0.1950 |
| | Incorrect | 0.2810 | --- | 0.2041 | 0.2542 | 0.1087 | 0.1839 | 0.1700 | 0.1857 |
W3: The sample size for the human study is relatively small.
Thank you for your observation. While we agree that a larger sample size is generally preferable, we argue that our sample size of 150 questions is consistent with established practices in the literature. For instance, in the original BLEU metric study by Papineni et al. (2002), human evaluations were conducted on 250 sentence pairs [1]. Similarly, in machine learning research, Ranzato et al. (2015) assessed their reinforcement learning model on 100-200 test cases for specific NLP tasks [2]. In linguistics, Koehn et al. (2007) conducted human evaluations of translations on sets of 150-300 sentences [3]. Given this precedent, we believe our sample size is reasonable and sufficient to draw meaningful conclusions.
W4: Does merging co-occurring subtrees risk missing important patterns?
We thank the reviewer for pointing out this important detail. To clarify, subtrees are only merged when they always co-occur in cases where one is a descendant of the other. This means their one-hot encodings are identical across all programs, ensuring that merging does not lose unique patterns. Since the co-occurrence is absolute, any pattern found in one subtree will also exist in the other. We have updated Section 7.1 of the Appendix to include this discussion.
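To make the merging rule concrete, here is a minimal sketch under the simplifying assumption that features are stored as a binary programs-by-subtrees matrix; the additional ancestor/descendant check mentioned above is omitted, so this is illustrative rather than a faithful reimplementation.

```python
import numpy as np

def merge_cooccurring(features: np.ndarray, names: list):
    """Merge subtree features whose indicator columns are identical across all
    programs, i.e. subtrees that always co-occur. Sketch only: the paper
    additionally restricts merging to ancestor/descendant pairs."""
    groups = {}
    for j, name in enumerate(names):
        key = features[:, j].tobytes()          # identical columns share a key
        groups.setdefault(key, []).append((j, name))
    merged_names = ["+".join(n for _, n in group) for group in groups.values()]
    merged_cols = [features[:, group[0][0]] for group in groups.values()]
    return np.stack(merged_cols, axis=1), merged_names
```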
W5: Does manual filtering of generated questions introduce a selection bias?
Our filtering process is very straightforward and as bias-free as possible. In particular, we only remove questions where the automatically generated answer is wrong, or the question cannot be answered from the video (for example, when the LLM refers to an actor by their annotated ID instead of a visual attribute). There is no manual filtering of questions beyond that.
[1] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 311–318. https://doi.org/10.3115/1073083.1073135
[2] Ranzato, M., Chopra, S., Auli, M., & Zaremba, W. (2016). Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732. https://doi.org/10.48550/arXiv.1511.06732
[3] Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT), 48–54. https://doi.org/10.3115/1073445.1073462
Thank you to the authors for their responses. Some of my concerns have been addressed. However, I share other reviewers' concern about the generalizability of the proposed approach. The authors should validate their proposed approach on other popular VideoQA datasets.
Dear Reviewer 8Wrw,
We sincerely appreciate your thoughtful feedback and are glad that some of your concerns have been addressed by our response. Please note that we have already demonstrated that our approach generalizes to different CodeGen methods in Section 10.3, as well as to a variety of VideoQA models in Table 1. Following your request and Reviewer Hrka's recommendation, we now report results on MVBench below, as well as in Section 10.4 in the updated manuscript.
Firstly, as you can see from Figure 20 in the manuscript, ViperGPT code complexity correlates strongly with question complexity for a variety of VideoQA models on this dataset. Secondly, as shown in the table below and Table 5 of the manuscript, our proposed CodePlexity metric is far superior to the naive code complexity metrics on MVBench.
These results complete the robustness analysis of our approach, clearly demonstrating that it can scale to new methods and datasets, and validating its broader applicability. We hope this additional analysis resolves the reviewer’s concern and would be happy to answer additional questions in the remaining discussion period.
| | InternVideo | SeViLA ZS | Tarsier | VideoChat2 | LLaVA-NeXT |
|---|---|---|---|---|---|
| Lines of Code | 0.0441471 | 0.0950738 | 0.125403 | 0.10273 | 0.0919019 |
| Cyclomatic Complexity | 0.129945 | 0.0963568 | 0.0614128 | 0.0521084 | 0.0447285 |
| CodePlexity (Ours) | 0.413361 | 0.330262 | 0.444389 | 0.299086 | 0.274255 |
Thank you to the authors for their responses. I really appreciate the additional experiments on MVBench. Therefore, I am slightly leaning towards acceptance and will increase my score.
Dear Reviewer 8Wrw,
Thank you for your follow-up and for being open to change your initial impression of our paper. We are glad that our responses addressed all of your major concerns. Your questions and suggestions helped us improve the manuscript and clarify important aspects of our work. Please feel free to reach out if any additional feedback comes to mind.
This paper introduces an approach to analyzing and generating complex questions for Video QA. The authors propose the use of generated code complexity as a metric to evaluate question difficulty. They introduce "CodePlexity," a novel algorithm that analyzes generated code's structure and content to identify patterns that make questions challenging for ML models. This allows the creation of a new VideoQA dataset named "CodePlex-QA" which features more complex questions than existing datasets without relying on human expertise.
Strengths
- The paper introduces a novel and interesting approach using code complexity to evaluate question complexity.
- The paper offers interesting insights into the differences in human evaluation, text-based and code-based evaluation.
- The experiment results show clear empirical evidence to support the claims.
Weaknesses
Please see the question section below
Questions
- The paper represents subtrees in question code using simple one-hot encoding without additional processing, which might ignore the frequency of subtrees and their structural relationships. Additionally, this representation can be very sparse. How does this affect CodePlexity's performance?
- To what extent does the choice of code generation model affect the method's results? Would using different code generation models lead to identifying different patterns of complexity, and how might this impact the method's consistency?
- Since the method evaluates difficulty without considering visual information, there might be cases where code generation is challenging due to question ambiguity, but the task would be straightforward with visual context. Is CodePlexity truly measuring question difficulty, or is it primarily measuring code generation complexity?
- Most models evaluated in the paper are large video-language models built upon language model architectures. Since these models share similar language model foundations with the code generation approach, they might inherently struggle with the same types of questions that are difficult for code generation. Does this architectural similarity explain their correlated performance? How well would the method generalize to fundamentally different architectures that might employ different reasoning mechanisms?
Thank you for your detailed review and suggestions. We respond to your question individually below.
Q1: Do the one-hot subtree representations limit the metric's performance?
Using one-hot encoding for subtree representations does not significantly limit metric performance because it is effectively similar to using counts in most practical scenarios. The likelihood of a subtree appearing multiple times in a program’s AST is minimal (except for leaf nodes). For a subtree to appear more than once, the same code logic would need to be repeated, which is uncommon. Therefore, the difference is negligible, and one-hot effectively captures the necessary structural information without compromising metric quality.
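To make this point concrete, the following small sketch (with hypothetical subtree ids) contrasts the one-hot (presence) encoding with a count-based one; the two coincide whenever no subtree repeats within a program.

```python
from collections import Counter
import numpy as np

def encode(program_subtrees, vocabulary, binary=True):
    """Encode each program as a vector over the subtree vocabulary.
    binary=True gives the one-hot (presence) encoding discussed above;
    binary=False keeps counts, which rarely differ in practice because a
    subtree seldom repeats within one program's AST."""
    matrix = np.zeros((len(program_subtrees), len(vocabulary)), dtype=int)
    index = {s: j for j, s in enumerate(vocabulary)}
    for i, subtrees in enumerate(program_subtrees):
        for subtree, count in Counter(subtrees).items():
            if subtree in index:
                matrix[i, index[subtree]] = 1 if binary else count
    return matrix

# Hypothetical subtree ids for two programs.
programs = [["If>Compare", "Call:find", "Call:find"], ["For>Call:trim"]]
vocab = ["If>Compare", "Call:find", "For>Call:trim"]
print(encode(programs, vocab, binary=True))
print(encode(programs, vocab, binary=False))
```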
Q2: How does the code generation model affect the results?
Thank you for this excellent question. The choice of code generation model is indeed an important factor in our methodology. To investigate this, we re-ran our experiments using the recent RVP [1] CodeGen approach as a substitute for ViperGPT. Preliminary results, included in Figure 19 of the updated manuscript, demonstrate that using a more advanced CodeGen model enhances the predictive power of code-based metrics for estimating question complexity.
These results suggest that the performance of CodePlexity can benefit from advancements in code generation models, as they can provide richer and more accurate program representations. While this introduces some variability depending on the underlying model, the general approach remains robust and adaptable. We are in the process of completing a detailed analysis and will include the finalized results in the camera-ready version of the paper.
This highlights the extensibility of our method and its potential for further improvement as code generation models continue to evolve.
Q3: Is CodePlexity truly measuring question difficulty?
We acknowledge that CodePlexity is not a definitive complexity metric for VideoQA and has limitations, including its indirect consideration of video content. Our key observation is that generated code complexity (not “code generation complexity”) is strongly correlated with underlying question complexity for existing VideoQA models. In particular, we experimentally demonstrate in Section 4.2 that CodePlexity exhibits superior predictive power compared to other automatic metrics and even human evaluations. Hence, our approach is not "primarily measuring code generation complexity”.
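As one hedged way to operationalize "predictive power" (not necessarily the evaluation protocol of Section 4.2 or the mPEG score mentioned above), the relationship between a complexity metric and per-question model correctness can be checked directly with a rank correlation:

```python
import numpy as np
from scipy.stats import spearmanr

def predictive_power(metric_scores, model_correct):
    """Rank correlation between a question-complexity metric and binary model
    correctness; a more negative value means higher predicted complexity goes
    with more model failures. Illustrative only -- the paper's evaluation
    protocol may differ."""
    rho, _ = spearmanr(np.asarray(metric_scores), np.asarray(model_correct))
    return rho

# Toy example with hypothetical scores and outcomes.
print(predictive_power([0.2, 0.9, 0.5, 0.7], [1, 0, 1, 0]))
```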
Q4: How would our method generalize to fundamentally different VideoQA models?
We thank the reviewer for this insightful observation. It is true that most modern VideoQA approaches leverage large visual-language models in one way or another. That said, the models evaluated in our work are quite diverse: ATP only trains a frame selection module and is based on CLIP; ViperGPT is a Visual Programming approach that leverages a suite of modules (CLIP, GPT-3, BLIP, XVLM); VIOLET is a multimodal transformer trained from scratch on the Merlot dataset; and SeViLA is a 2-stage video model fine-tuned from a VLM. In response to the reviewer's suggestion, we have added two additional models.
- HGA [2]: based on Graph Neural Networks, this model combines visual and textual features (from BERT) using a Graph Reasoning Network. It was the state of the art on NExT-QA before ATP.
- Tarsier [3]: the current state of the art on NExT-QA, based on Video-LLaVA [4]. The results are reported below, as well as in Table 1 in the manuscript. As you can see from the results, our question complexity metric indeed generalizes well to these disparate models.
If the reviewer has a concrete suggestion of a VideoQA approach which is fundamentally different from all the models listed above we would be happy to evaluate it as well.
| | SeViLA | ViperGPT | ATP | VIOLET | HGA | SeViLA ZS | InternVideo | Tarsier |
|---|---|---|---|---|---|---|---|---|
| Dependency Tree Depth | 12.9 | 7.9 | 11.1 | 15.9 | 7.4 | 13.5 | 17.7 | 10.1 |
| GPT-4 | 9.6 | 8.9 | 11.6 | 5.8 | 7.8 | 14.6 | 13.9 | 10.8 |
| BERT | 12.5 | 6.0 | 18.3 | 17.3 | 7.7 | 14.3 | 21.1 | 10.8 |
| Lines of Code | 16.4 | 15.3 | 14.2 | 12.0 | 9.9 | 16.2 | 17.5 | 14.4 |
| Cyclomatic Complexity | 18.2 | 14.2 | 18.7 | 15.9 | 8.9 | 17.2 | 24.2 | 16.7 |
| CodePlexity (Ours) | 26.7 | 21.3 | 21.0 | 15.8 | 14.1 | 25.6 | 26.6 | 24.9 |
[1] Ge, J., Subramanian, S., Shi, B., Herzig, R., & Darrell, T. Recursive Visual Programming. In ECCV, 2024.
[2] Jiang, P., & Han, Y. (2020, April). Reasoning with heterogeneous graph alignment for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence
[3] Wang, J., Yuan, L., Zhang, Y., & Sun, H. (2024). Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634.
[4] Lin, Bin, et al. "Video-llava: Learning united visual representation by alignment before projection." arXiv preprint arXiv:2311.10122(2023).
Dear Reviewer RWYy,
We hope this message finds you well. We wanted to kindly remind you that the reviewer response deadline is at midnight today. We would greatly appreciate it if you could let us know whether your concerns have been addressed or if there are any additional questions or suggestions we can assist with.
We sincerely thank the reviewers for their thoughtful feedback and valuable suggestions. We have carefully addressed each comment individually and would like to take this opportunity to summarize the contributions of our work and clarify its scope.
It is hard to define what constitutes a good review in a few words, but one of the best definitions we have heard goes as follows: “A good review does not judge a paper based on what it is NOT (i.e. what a reviewer wishes it was), but rather based on what it IS (i.e. the value it brings to the community)”. We hope the reviewers and the AC can agree with this simple definition.
With this in mind, here are a few things our work is NOT:
- Our metric does not capture the ‘inherent complexity of questions’ in VideoQA (Reviewer EhtB). Instead, it quantifies question complexity for existing models.
- Our metric does not represent all the aspects of complexity in VideoQA (Reviewers RWYy, Hrka), focusing solely on question complexity instead.
We never claim any of these contributions in the paper.
In contrast, here are a few things our work IS:
- It provides a novel perspective on question complexity analysis to the community that has struggled to achieve progress [4, 21, 33], and is in need of fresh insights.
- It proposes a practical and operationalizable metric for estimating question complexity in VideoQA, which outperforms all the alternatives and indeed is more effective than humans at this task (see our analysis in Section 4.2).
- It provides an automatic pipeline for generating challenging questions based on the proposed metric, which is used to construct a new VideoQA benchmark.
We humbly argue that these contributions make our work a meaningful addition to the VideoQA research community.
Dear Reviewers,
We deeply appreciate your thoughtful feedback and constructive engagement, which helped to improve our paper. Below, we summarize our core contributions and highlight the results provided during the rebuttal process that validate and expand upon these contributions.
Key Contributions:
Our work introduces a data-driven framework for estimating question complexity in VideoQA, consisting of:
- CodePlexity Metric: A practical, interpretable, and adaptable metric that assesses question complexity using visual program generation.
- Insights Into Model Limitations: Identification of shared failure modes across state-of-the-art VideoQA models, offering actionable insights to guide future research.
- Benchmark Generation: A methodology for creating datasets, such as CodePlex-QA, that systematically surface questions challenging for current VideoQA models.
Our framework does not propose a fixed definition of “true complexity” but instead offers a systematic, adaptable tool for evaluating model-specific challenges in VideoQA. By identifying failure modes common to diverse models, we provide actionable insights to guide future research. As new datasets and models emerge, our framework can be reapplied to identify fresh challenges, ensuring sustained progress in the field.
Key Insights from Rebuttal Results:
During the rebuttal period, we validated the robustness and generalizability of our approach:
- Diverse Code Generation Models: We evaluated our framework with the recent RVP CodeGen model in place of ViperGPT. This analysis demonstrated that using a more advanced visual programming model improves the predictive power of code-based metrics, reinforcing the argument that code serves as a robust proxy for question complexity in VideoQA.
- Wide Range of VideoQA Architectures: By testing CodePlexity on diverse VideoQA models such as HGA and Tarsier, we confirmed its applicability across a wide range of architectural approaches. These results reaffirm that our metric captures shared challenges across fundamentally different model designs.
- Generalization to New Datasets: We extended our evaluation to MVBench, a recent dataset with unique characteristics. On MVBench, CodePlexity showed strong correlation with question complexity across multiple models, including LLaVA-NeXT and VideoChat2 (state-of-the-art on MVBench). Our metric outperformed a wide array of code-based and language-based complexity metrics, demonstrating its effectiveness across diverse data sources.
- Robustness to Code Generation Errors: We further analyzed the impact of code generation errors, showing that CodePlexity remains robust and outperforms other metrics even when the generated code contains inaccuracies. This highlights its practicality in real-world scenarios where code generation may not be flawless.
Our rebuttal reinforces the central claim of this work: CodePlexity is a robust, generalizable, and practical tool for advancing VideoQA research. By systematically identifying shared challenges and providing a foundation for generating new benchmarks, our framework offers a scalable path toward tackling foundational challenges in VideoQA.
We hope the reviewers will recognize the strengthened contributions and validated applicability of our work in guiding future advancements in this domain.
This paper introduces a new method for evaluating question complexity in Video Question Answering by leveraging the programs generated by ViperGPT. The authors analyze these code outputs and use them to assess the difficulty of questions in a VQA triplet. To achieve this, they propose CodePlexity, an algorithm that processes the generated code through Abstract Syntax Trees (AST), identifies valid subtrees, and scores them based on their correlation with challenging questions. While the reviewers agree that the paper is well-written and that using ASTs to evaluate visual program code is a promising direction, they raised several weaknesses of the paper: in particular, the core point of using ViperGPT as the proxy for evaluating question complexity, given ViperGPT's relatively low performance, as well as the robustness of the proposed approach.
Additional Comments from the Reviewer Discussion
Reviewer Hrka had a fruitful discussion on their raised points of generalizability beyond ViperGPT and its use to compare models, which led to them increasing their score to a marginal acceptance with low confidence. Reviewer EhtB raised points in particular on the core issue of confounding: questions that are currently challenging for models might only reflect limitations of current models/ViperGPT rather than being more challenging in general. While the authors recognise this and state (also in a direct message to the AC) that their main contribution is a "data-driven framework for systematically identifying and analyzing model-specific challenges in VideoQA", the AC agrees with the reviewer that this is an issue -- as benchmarks are taken as-is, this might lead to a static dataset that is limited not by inherent difficulty but by the ViperGPT/RVP model.
Reject