PaperHub
Overall rating: 5.8 / 10
Poster, 3 reviewers
Ratings: 3, 3, 5 (average 3.7; min 3, max 5, std. dev. 0.9)
Novelty: 2.7 | Quality: 3.0 | Clarity: 2.7 | Significance: 2.7
NeurIPS 2025

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

OpenReview | PDF
Submitted: 2025-05-09 | Updated: 2025-10-29
TL;DR

To address the long video understanding task, we propose the Deep Video Discovery agent, which iteratively reasons over and gathers information from video content via an agentic search and tool-use strategy.

Abstract

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the $Deep \ Video \ Discovery \ (DVD)$ agent to leverage an $agentic \ search$ strategy over segmented video clips. Different from previous video agents that manually design a rigid workflow, our approach emphasizes the autonomous nature of agents. Provided with a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of the LLM to plan based on its current observation state and strategically selects tools to orchestrate adaptive workflows for different queries in light of the gathered information. We perform comprehensive evaluations on multiple long video understanding benchmarks that demonstrate our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of $74.2\%$, which substantially surpasses all prior works, and further improves to $76.0\%$ with transcripts.
Keywords

agent, computer vision, long video understanding

Reviews and Discussion

Official Review
Rating: 3

This paper presents a novel approach for comprehending long-form videos, which is a complex and challenging task due to the vast temporal-spatial complexity and dense information. The proposed solution, DVD, leverages an autonomous, agentic search strategy across segmented video clips. Unlike previous systems that rely on predefined workflows, DVD uses modular search tools at multiple granularity levels to enhance video understanding.

This paper demonstrates the effectiveness of DVD through extensive evaluations on multiple long-form video benchmarks, achieving state-of-the-art results. Compared to existing methods, DVD shows significant performance improvements, particularly on the LVBench dataset, surpassing previous best approaches by a large margin.

The key contributions of this paper include: (1) An agentic search mechanism that adapts based on the query and context. (2) The design and integration of search-centric tools for efficient video understanding. (3) State-of-the-art performance on long video benchmarks. (4) Insights into reasoning model behavior and tool usage, aiding the development of future video agents.

Strengths and Weaknesses

Strengths:

(1) Long-form video understanding is crucial for a variety of applications, including media analysis, surveillance, and educational content. By addressing the problem of reasoning over large video datasets, this paper has the potential to impact numerous industries where video analysis is vital.

(2) This paper offers experimental validation of its approach. It presents comprehensive evaluations on long-form video benchmarks, such as LVBench, LongVideoBench, and EgoSchema. DVD demonstrates substantial improvements over previous methods, showing its efficacy in real-world tasks.

Weaknesses:

(1) The DVD system significantly outperforms existing methods, but it may introduce higher computational costs due to the iterative reasoning process and multiple tool invocations.

(2) While the system is flexible, providing case studies or examples of diverse queries would help illustrate the effectiveness and versatility of the integrated tools.

(3) The video processing itself (i.e., segmentation into clips, captioning) relies on existing methods and may not push the boundaries of video processing techniques.

Questions

Question 1: This paper primarily focuses on benchmark datasets, but it would be helpful to see more discussion on how well the approach generalizes to other types of video content or less structured environments. How well does the system adapt to videos with different types of content (e.g., documentaries vs. user-generated videos)? Please show some results.

Question 2: Please provide detailed use cases or examples where these tools are iteratively employed to answer complex queries on long video. How does the system balance between global context and fine-grained details when addressing a single query?

Question 3: The paper mentions that the DVD system introduces higher computational overhead due to the iterative reasoning process and multiple tool invocations. Please provide some results of computational overhead compared to other agent-based approaches.

Limitations

Yes.

Final Justification

This paper demonstrates the effectiveness of DVD through extensive evaluations on multiple long-form video benchmarks, but I think that conducting comparisons with inconsistent numbers of model calls is unfair. Considering the existence of this issue, I will maintain my initial score.

Formatting Issues

No formatting issues in this paper.

Author Response

W1: Higher computational costs and computational overhead comparison with other agent-based approaches.

A1: Thanks for the comment. Though this work focuses on advancing the boundaries of long-form video understanding capabilities via test-time scaling, we acknowledge the importance of computational efficiency. We compare our DVD agent with the representative VideoAgent [1] in terms of efficiency. For a fair comparison, we reproduced VideoAgent [1] in our codebase with all the same models, denoted as VideoAgent-fashion. We then conducted experiments on LVBench without transcripts. The results demonstrate that our DVD requires significantly fewer search steps while achieving superior accuracy compared to VideoAgent-fashion.

Under 15 max steps, DVD achieves 74.2% accuracy with only 7.3 average steps, while VideoAgent-fashion requires 11.3 average steps to reach 64.1% accuracy, demonstrating both higher accuracy and better efficiency.

| Max Step Limit | 8 | 12 | 15 |
| --- | --- | --- | --- |
| Our DVD, AVG. step | 6.7 | 7.2 | 7.3 |
| Our DVD, Accuracy (%) | 72.3 | 73.8 | 74.2 |
| VideoAgent-fashion, AVG. step | 6.5 | 9.7 | 11.3 |
| VideoAgent-fashion, Accuracy (%) | 60.0 | 62.0 | 64.1 |

W2: Lack of case studies to illustrate the effectiveness and versatility of the integrated tools.

A2: Thank you for your suggestion. We have included the case study in Appendix B.2, wherein we present representative examples for five distinct behavioral patterns: (1) Global Browse Only, (2) Simple Action, (3) Iterative Search, (4) Frame Inspect Trap, and (5) Clip Search Trap. The Iterative Search case demonstrates the agent’s use of various tools and query formulations in pursuit of the correct answer. In contrast, the Frame Inspect Trap and Clip Search Trap illustrate scenarios in which the agent encounters difficulties: when initial clip searches yield limited relevant information, the agent repeatedly invokes the same tools, ultimately leading to incorrect answers.

We are committed to incorporating more qualitative visualizations to facilitate a more nuanced analysis of the agent's behavior in our revised version.


W3: The video processing itself relies on existing methods and may not push the boundaries of video processing techniques.

A3: We appreciate your feedback and would like to clarify our core contribution. While our framework indeed employs standard video processing techniques (e.g., clip segmentation, caption generation), our primary innovation lies in the agentic search workflow orchestration that fundamentally changes how the search strategy is utilized for video understanding.

Unlike previous video agents that rely on fixed, pre-defined workflows, our framework empowers the agent to dynamically construct query-adaptive workflows. The agent intelligently selects and composes different search tools based on each query's unique requirements, resulting in more efficient exploration of video content and significantly improved understanding accuracy.

To illustrate this conceptual difference:

  • Our DVD (adaptive workflow): Loop[ ToolSelect( Answer | GlobalBrowse | ClipSearch | FrameInspect ) ]
  • VideoAgent [1] (fixed workflow): Loop[ PredictAnswer → SelfReflect → FindMissInfo → Search → Caption → Merge ]

This adaptive approach represents a paradigm shift from traditional hand-crafted, fixed video processing workflows. Rather than pushing the boundaries through novel low-level video processing techniques, we demonstrate that agentic orchestration of existing tools can achieve substantial performance gains: a 10.1% absolute improvement over VideoAgent-fashion, as shown in the table in A1.
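
For readers unfamiliar with the two designs, the following is a minimal, hypothetical Python sketch of the adaptive loop summarized above. The tool names follow the paper, but the `select_tool` / `force_answer` interfaces, the argument format, and the stopping rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an adaptive tool-selection loop (illustrative only;
# the interfaces below are assumptions, not the paper's released code).
from dataclasses import dataclass, field

@dataclass
class AgentState:
    query: str
    observations: list = field(default_factory=list)  # evidence gathered so far

def dvd_style_agent(query, llm, tools, max_steps=15):
    """At each step the orchestrator LLM picks ONE tool (or answers) given all observations."""
    state = AgentState(query=query)
    for _ in range(max_steps):
        # The LLM reasons over the query plus gathered observations and returns a
        # tool name with arguments, e.g. ("clip_search", {"text": "man in red jacket"}).
        tool_name, args = llm.select_tool(state.query, state.observations,
                                          available=list(tools))
        if tool_name == "answer":
            return args["final_answer"]
        result = tools[tool_name](**args)   # global_browse / clip_search / frame_inspect
        state.observations.append((tool_name, args, result))
    # Step budget exhausted: force a best-effort answer from what was gathered.
    return llm.force_answer(state.query, state.observations)
```

In contrast, a fixed workflow in the VideoAgent style would hard-code the sequence of stages regardless of the query, which is the distinction the table above is drawing.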


Q1: Generalization on different types of video content.

A1_: Thanks for the suggestion. We are committed to adding more analysis on video content to the revised paper. Here we provide the performance on different video content categories of LVBench. It shows that our DVD agent achieves strong performance across all categories, including ego-centric self-media, cartoons with massive scene changes, documentaries with long shots, and carefully edited TV shows. This demonstrates the generalization capability of our framework on diverse video content types.

| Category | #Queries | OpenAI o3 (VLM) | Our DVD agent |
| --- | --- | --- | --- |
| Cartoon | 283 | 56.9 | 80.2 |
| Documentary | 197 | 44.7 | 73.6 |
| Live | 262 | 61.5 | 75.2 |
| Self Media | 317 | 55.5 | 78.9 |
| Sport | 240 | 65.4 | 74.2 |
| TV show | 250 | 56.8 | 72.0 |
| Average | 1549 | 57.1 | 76.0 |

Q2: Detailed use cases of tool use.

A2_: Thanks for the question. Please refer to our response to W2. We provide a specific case study of the search workflow orchestrated by our agent in Appendix B.2. For quantitative analysis, as shown in Figure 3 of our main body, with OpenAI o3 as the orchestrator model, for 2.8% of queries the agent only uses the Global Browse Tool and then draws the final answer. For most queries, the agent uses the Global Browse Tool to find relevant events and subjects, then uses the Clip Search Tool to localize specific time ranges, and finally uses the Frame Inspect Tool to inspect frames for more fine-grained information to confirm the answer. We are committed to adding more qualitative visualizations to facilitate a more nuanced analysis of the agent's behavior in our revised version.
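
To make the multi-granular database and the Browse → Search → Inspect pattern above more concrete, here is a rough, hypothetical sketch. The field names, embedding handling, and `clip_search` ranking are assumptions for illustration only; the paper's actual data layout and retrieval code may differ.

```python
# Hypothetical layout of a multi-granular video database (illustrative only).
import math
from dataclasses import dataclass, field

@dataclass
class Clip:
    clip_id: int
    start_s: float            # clip start time in seconds
    end_s: float              # clip end time in seconds
    caption: str              # pre-generated clip-level caption
    embedding: list           # embedding of the caption, used for retrieval

@dataclass
class VideoDatabase:
    global_summary: str                         # video-level summary (Global Browse)
    clips: list = field(default_factory=list)   # clip-level entries (Clip Search)
    frame_dir: str = ""                         # raw-frame store (Frame Inspect)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

def clip_search(db, query_embedding, top_k=5):
    """Rank clips by similarity between the query embedding and clip-caption embeddings."""
    ranked = sorted(db.clips,
                    key=lambda c: cosine(query_embedding, c.embedding),
                    reverse=True)
    return ranked[:top_k]   # the agent can then call Frame Inspect on these time ranges
```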


Q3: Computational overhead comparison with other agent-based approaches.

A3_: Thanks for the question. Please refer to our response to W1.

Reference:

[1] VideoAgent: Long-form Video Understanding with Large Language Model as Agent, ECCV 2024

Comment

Dear reviewer dHsw,

Thanks for your insightful comments on our submission. We hope our rebuttal has addressed your concerns. As the discussion deadline approaches, please let us know if you have any further questions. We would be happy to provide any needed clarification.

Comment

Dear Reviewer dHsw,

Thank you for helping in reviewing this paper!

Please read the authors' rebuttal and let us know your thoughts

Thanks, AC

Comment

Thank you for your detailed answers. For W1, although the authors compared the performance and efficiency of VideoAgent and DVD under the same models, the original performance of VideoAgent is very low (29.3%). Comparing performance and efficiency based on such a low baseline may not be very convincing. I would like to know the performance and efficiency comparison of methods with inherently better performance (e.g., MR. Video) under the same models. Additionally, the number of steps cannot directly reflect computational costs. I hope to see more direct metrics such as running time, GPU peak memory, etc.

Comment

Thanks for your comment. Since MR.Video[1]'s code has not been open-sourced, we reproduced MR.Video and evaluated it on LVBench without transcripts under the same models. The results are shown in the following table. The peak GPU memory is estimated on Qwen3-32B using the peak token number.

| Method | Acc. (%) | Running Time Per Query (s) | Peak Token Number | Peak GPU Memory |
| --- | --- | --- | --- | --- |
| Ours | 74.2 | 153.3 | 20803 | 86.49 GB |
| MR. Video | 65.5 | 326.2 | 51468 | 123.92 GB |
| VideoAgent | 64.1 | 200.5 | 19876 | 85.36 GB |

MR.Video incurs the highest computational cost in both running time and peak GPU memory usage. This is primarily due to its Segment Intention Analysis component (described in Section 3.3.1 of its paper), which requires querying with extensive scene captions for all video segments. Our DVD demonstrates superior performance compared to MR.Video while maintaining peak GPU memory usage comparable to VideoAgent.

[1] MR.Video: “MapReduce” is the Principle for Long Video Understanding, 2025.04 Arxiv


Please let us know if you have any further concerns. We would be happy to provide any needed clarification.

Comment

Thank you for your response. According to my understanding of [1], the method in this paper only calls two models. I think that conducting comparisons with inconsistent numbers of model calls is unfair. Considering the existence of this issue, I will maintain my initial score.

[1] MR.Video: “MapReduce” is the Principle for Long Video Understanding, 2025.04 Arxiv

Comment

Thank you for your reply. We want to point out that this understanding does not match the original MR.Video paper. MR.Video indeed requires significantly more than two model calls, as stated in its original paper.

According to the original paper, MR.Video consists of four components, each potentially involving multiple model calls. For instance, the "Map: Segment Intention Analysis" component (Section 3.3.1) triggers one model call for every single segment. As stated in the MR.Video paper: "Within each segment, we instruct MR. Video to focus on ... and generate a paragraph of analysis". This means that if the video has 30 segments (as exemplified in its own paper), this component alone triggers 30 model calls. This design is the primary source of the higher runtime shown in our comparison table. Our DVD demonstrates superior efficiency compared to MR.Video, achieving lower runtime, reduced peak GPU memory usage, and much higher accuracy.

Official Review
Rating: 3

The paper introduces Deep Video Discovery (DVD), an agentic framework designed for long-form video understanding. The key idea is to leverage a multi-granular video database and a set of search-centric tools (Global Browse, Clip Search, Frame Inspect) to iteratively retrieve and reason over video content using an LLM as the central orchestrator. The agent autonomously plans, refines queries, and synthesizes answers through an observe-reason-act loop.

Strengths and Weaknesses

Strengths:

  1. Strong Empirical Results

The method achieves SOTA on multiple benchmarks, demonstrating clear improvements over existing approaches.

  2. Comprehensive Ablations

The paper includes detailed ablation studies on tool contributions and model choices, offering insights into system behavior.

Weaknesses:

  1. Limited Novelty in Video Retrieval & Search

The core idea of segmenting videos into clips and using an LLM for iterative search closely resembles prior works like VideoAgent. While DVD improves performance, the fundamental approach (clip-level retrieval + LLM reasoning) is not significantly differentiated. The tools (Global Browse, Clip Search, Frame Inspect) are intuitive extensions of existing retrieval strategies. A deeper discussion on how DVD uniquely advances agentic video search is needed.

  2. Inefficient Model Inference

Constructing the multi-granular database requires processing the entire video and generating captions for each clip, leading to substantial computational costs, especially for long-form videos. In contrast, VideoAgent processes a much smaller subset of frames, making it more efficient.

  3. Narrow Model Evaluation

The experiments only test GPT-4.1 and o3 as the backbone models. Given the rapid progress in VLMs, the paper should evaluate proprietary models (Gemini 1.5/2.5) to assess generalization and open-source models (Qwen/Qwen VL, LLaVA, InternVL, DeepSeek R1) to ensure reproducibility and accessibility. Without broader model comparisons, it’s unclear whether DVD’s gains stem from the framework or simply from using a stronger base LLM (o3).

  4. Heavy Reliance on Zero-Shot LLM Capabilities

The system currently relies on the in-context learning and reasoning capabilities of an LLM. It does not fine-tune the LLM to tailor it for video understanding. This raises concerns about scalability (performance may degrade with weaker LLMs) and adaptability (task-specific fine-tuning could further improve results). A comparison with a fine-tuned variant would better demonstrate the framework’s robustness.

  5. Lack of Qualitative Analysis

The reviewer noticed that the authors include a few text-only results in the appendix. It is recommended that the authors include visualizations of success/failure cases in the main paper to make it easier to assess where DVD excels (e.g., complex temporal reasoning) and where it struggles (e.g., fine-grained spatial details, ambiguous queries).

Questions

Please refer to the weakness section.

Limitations

Yes.

Final Justification

I have read the rebuttal and would like to keep my score.

Formatting Issues

No.

Author Response

Q1: Limited Novelty in Video Retrieval & Search

A1: Thanks for the comment. We want to clarify that our key innovation distinguishing our work from previous approaches like VideoAgent [1] lies in our agentic search workflow orchestration through selective combination of different search tools.

Unlike previous hand-crafted, fixed workflows, our framework empowers the reasoning model to dynamically compose workflows tailored to each specific user query, resulting in flexible and query-adaptive search strategies.

  • Our DVD (adaptive workflow): Loop[ ToolSelect( Answer | GlobalBrowse | ClipSearch | FrameInspect ) ]
  • VideoAgent [1] (fixed workflow): Loop[ PredictAnswer → SelfReflect → FindMissInfo → Search → Caption → Merge ]

To empirically demonstrate the effectiveness of our agentic search workflow orchestration, we conducted an ablation study replacing our adaptive workflow with a hand-crafted fixed workflow.

As shown in the table below, the hand-crafted approach results in a significant 7.5% absolute accuracy drop while requiring more steps. Even when scaling up to 11.1 average steps (52% more than our DVD), the hand-crafted workflow achieves only 70.2% accuracy, still 4.0% below our adaptive approach. This comparison highlights the superior search efficiency of our dynamic workflow design.

| Max Step Limit | 8 | 12 | 15 |
| --- | --- | --- | --- |
| Ours, Avg. steps | 6.7 | 7.2 | 7.3 |
| Ours, Accuracy (%) | 72.3 | 73.8 | 74.2 |
| Ours w/ hand-crafted workflow, Avg. steps | 5.0 | 8.3 | 11.1 (↑52%) |
| Ours w/ hand-crafted workflow, Accuracy (%) | 48.4 | 66.3 | 70.2 (-4.0%) |

Q2: Inefficient Model Inference

A2: We appreciate the efficiency concern regarding our framework. We discuss the efficiency of our DVD from two perspectives: (1) the trade-off between pre-building and online-generating captions, and (2) overall efficiency comparison with VideoAgent.

(1) Trade-off between pre-building and online-generating captions:

The comparison with VideoAgent reflects an efficiency trade-off between two database construction strategies: pre-building captions versus online-generating captions. This trade-off is not specific to any single video understanding agent but rather represents a setting that can be applied to various frameworks, including our DVD. These strategies serve different scenarios with different online/offline efficiency profiles:

Offline preprocessing efficiency: VideoAgent precomputes CLIP embeddings for all frames, while DVD generates captions and embeddings using video clips as basic units. During this phase, pre-building captions incurs a higher computational cost than online-generating, though the wall-clock time is similar, benefiting from parallel processing across clips.

Online inference-time efficiency: Without pre-generated captions, they must be generated during inference. Here, online-generating becomes substantially more expensive than pre-building, with latency accumulating across iterative reasoning steps.

The choice between these approaches depends on usage patterns:

  • For videos queried once, online-generating may be more efficient.
  • For videos queried multiple times, pre-building significantly reduces cumulative inference cost. For example, a video uploaded to YouTube may be queried by thousands of users with different questions, where pre-building can significantly reduce inference cost and improve overall efficiency.

While DVD employs pre-building in this submission, it can also operate in online-generating mode. To evaluate this alternative, we conducted an experiment replacing our original pre-building database with an online-generating approach. The results show that this replacement leads to a slight performance drop. This further clarifies that the two strategies serve different purposes: online-generating is well-suited for scenarios where videos are queried once with relatively simple queries, while pre-building is better suited for scenarios where videos receive multiple queries and achieving optimal performance is prioritized.

| DVD Database Construction | AVG. Steps | Accuracy (%) |
| --- | --- | --- |
| Pre-building | 7.3 | 74.2 |
| Online-generating | 7.5 | 71.7 |
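
As a rough illustration of the two construction modes compared above, the sketch below contrasts paying the captioning cost offline versus lazily at query time. The `captioner.describe` interface and the caching logic are hypothetical, not the released implementation.

```python
# Hypothetical sketch: pre-building captions offline vs. generating them lazily
# at query time. The `captioner` interface is an assumption for illustration.

def prebuild_database(clips, captioner):
    # Offline mode: caption every clip once (parallelizable); the cost is paid up front
    # and amortized when the same video is queried many times.
    return {c.clip_id: captioner.describe(c) for c in clips}

class LazyCaptionDatabase:
    """Online-generating mode: caption a clip only when the agent first needs it."""
    def __init__(self, clips, captioner):
        self.clips = {c.clip_id: c for c in clips}
        self.captioner = captioner
        self.cache = {}

    def caption(self, clip_id):
        if clip_id not in self.cache:                # cost is paid during inference
            self.cache[clip_id] = self.captioner.describe(self.clips[clip_id])
        return self.cache[clip_id]
```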

(2) Overall efficiency comparison with VideoAgent:

We reproduce VideoAgent using the same LLM and VLM in our codebase (denoted as VideoAgent-fashion). Our DVD framework achieves superior performance with fewer iterations, demonstrating our agent design's efficiency:

| Max Iteration | 8 | 12 | 15 |
| --- | --- | --- | --- |
| Our DVD, AVG. step | 6.7 | 7.2 | 7.3 |
| Our DVD, Accuracy (%) | 72.3 | 73.8 | 74.2 |
| VideoAgent-fashion, AVG. step | 6.5 | 9.7 | 11.3 |
| VideoAgent-fashion, Accuracy (%) | 60.0 | 62.0 | 64.1 |

Under 15 max steps, DVD achieves 74.2% accuracy with only 7.3 average steps, while VideoAgent-fashion requires 11.3 average steps to reach 64.1% accuracy—demonstrating both higher accuracy and better efficiency.


Q3: Narrow Model Evaluation

A3: Thanks for the suggestion. Due to limited access to proprietary models, we evaluated our framework across three open-source models with varying architectures and sizes to demonstrate generalization capability. We tested DeepSeek-V3 (671B MoE without reasoning), DeepSeek-R1 (671B MoE with reasoning), and Qwen3-32B-thinking (32B dense model with reasoning) as orchestrator models on LVBench without transcripts.

All three models significantly outperform the OpenAI o3 Vision-Language Model baseline, validating our framework's effectiveness across different model architectures and scales. Notably, our DVD framework with Qwen3-32B-thinking is already on par with the original OpenAI o3, which is a top-tier commercial reasoning vision-language model. When we equip our DVD with OpenAI o3 as the orchestrator model, we achieve a 17.0% absolute accuracy improvement over the original OpenAI o3. The consistent improvements demonstrate that DVD's gains stem from the framework design itself rather than simply from using stronger models. We will include these results with further analysis in our revised paper.

| Method | Accuracy (%) |
| --- | --- |
| Ours w/ OpenAI o3 | 74.1 |
| Ours w/ DeepSeek-R1 | 68.5 |
| Ours w/ GPT-4o | 62.3 |
| Ours w/ DeepSeek-V3 | 57.5 |
| Ours w/ Qwen3-32B-thinking | 57.3 |
| OpenAI o3 (Commercial VLM) | 57.1 |

Q4: Heavy Reliance on Zero-Shot LLM Capabilities and Generalization on Small Model

A4: Thanks for the suggestion. We agree that fine-tuning LLMs for a video understanding agent is a promising and effective approach to improving performance. However, it is non-trivial to collect training data for long video understanding or to build a reinforcement learning environment for fine-tuning a smaller LLM. We plan to explore this direction in future work.

Here we focus on evaluating the generalization ability of our DVD with smaller LLMs. We conducted an experiment with the latest Qwen3-32B as the orchestrator model in our DVD. It shows that even with only 4.7% of DeepSeek-R1's parameters, DVD still achieves impressive performance, surpassing the OpenAI o3 Vision-Language Model. This demonstrates that our framework works well even with a smaller LLM's zero-shot capability.

| Method | Parameters | Accuracy (%) |
| --- | --- | --- |
| Ours w/ OpenAI o3 | - | 74.1 |
| Ours w/ Qwen3-32B-thinking | 32B | 57.3 |
| OpenAI o3 (Commercial VLM) | - | 57.1 |

Q5: Lack of Qualitative Analysis

A5: Thanks for the suggestion. We will add more qualitative visualizations for better agent behavior analysis in our revised version to supplement the textual case study in Appendix B.2. As quantitative evidence, we want to highlight Figure 3 of our main body, which shows common error patterns including:

  • Repeated tool calls ("Clip-Search Trap" and "Frame-Inspect Trap"): The agent loops on the same tool, wasting budget and lowering accuracy by 8.0 percentage points.

  • Behavioral collapse: when OpenAI o3 is replaced with GPT-4o, the agent adopts the same simple action strategy for most queries; the policy converges to a short “search-query-answer” routine covering 54% of queries but with sub-average accuracy.


Reference:

[1] VideoAgent: Long-form Video Understanding with Large Language Model as Agent, ECCV 2024

Comment

Dear reviewer JPc5,

Thanks for your insightful comments regarding our submission. We hope our rebuttal has addressed your concerns. We noticed that you have submitted the "Mandatory Acknowledgement", but it appears there may have been an issue, as we have not yet received a response. If you have any further questions, we would be happy to provide any needed clarification.

Comment

Dear Reviewer JPc5,

Thank you for helping in reviewing this paper!

Please read the authors' rebuttal and let us know your thoughts

Thanks, AC

Official Review
Rating: 5

The paper introduces the DVD agent, which operates a set of search-centric tools over a video database. The core idea is a modular search tool operating at different granularities: global, clip, and frame level. Leveraging this strategy, the agent performs autonomous reasoning and iterative decision-making.

Strengths and Weaknesses

The multi-granular database construction is intuitive. Across all 4 datasets, the proposed method outperforms the base VLM. The ablation study shows the value of the tool design.

Questions

The difference between the proposed method and other video LLMs is that the proposed method utilizes iterative reasoning. What is the trade-off between the boost in accuracy and the additional cost? What is the value of the step limit N? What is the smallest N > 1 that shows better performance than the baseline? Why does a longer reasoning chain make the results worse within the same model?

Limitations

The difference between the proposed method and other video LLMs is that the proposed method utilizes iterative reasoning. What is the trade-off between the boost in accuracy and the additional cost?

Final Justification

I read all the responses to all the questions from all the reviewers. Speed/memory comparison with other agentic systems such as Mr. Video and Video Agent helped put things into perspective better. I keep my original rating "accept".

Formatting Issues

N/A

Author Response

Q1: The trade-off between the boost in accuracy vs. additional cost.

A1: Thanks for the valuable comment. Here we provide our evaluation results under different max step numbers with OpenAI o3 on LVBench without transcripts.

| Max Step Limit | 4 | 6 | 7 | 8 | 12 | 15 | - |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours, AVG. step | 3.9 | 4.7 | 5.7 | 6.7 | 7.2 | 7.3 | - |
| Ours, Accuracy (%) | 57.6 | 63.1 | 68.4 | 72.3 | 73.8 | 74.2 | - |
| OpenAI o3 (VLM Baseline), Accuracy (%) | - | - | - | - | - | - | 57.1 |

We are committed to adding these results with further analysis in our revision.


Q2: Detailed setting of max step number.

A2: In our paper, the max step (iteration) number is set to 15 for the best performance via test-time scaling. As shown in the table in response A1, the smallest step limit with better performance than our baseline Vision-Language Model (VLM) is 4. The accuracy scales effectively, reaching 74.2% at 15 max steps, a 17.1% absolute improvement over the strong baseline.


Q3: Why longer reasoning chain makes the results within the same model even worse?

A3: Thanks for the valuable question. For the same query, the growth of iteration steps generally improves accuracy, as shown in the table of response 1. However, across different queries, the agent may follow different reasoning modes depending on the video and query content, where we observe that queries involving longer reasoning chains often correspond to lower accuracy.

Supported by our case study (Appendix B.2) and quantitative data (Figure 3), we identified two typical failure modes that result in longer iterations with a decrease in accuracy, including:

  • Frame Inspect Trap: The agent repeatedly utilizes the Frame Inspect Tool to pose various queries about the same temporal segment of the video. This behavior suggests a cognitive fixation, wherein the agent presumes that crucial information must exist within that specific time range. This fixation prevents it from exploring alternative video content, resulting in a 140.7% increase in iteration steps and an 11.1 percentage point absolute drop in accuracy.

  • Clip Search Trap: The agent repeatedly employs the Clip Search Tool to find information with similar queries. This pattern typically occurs when there is little information in the video database corresponding to the user question. This behavior leads to a 152.6% increase in iteration steps and a 16.1 percentage point absolute decline in accuracy.

Comment

Dear reviewer vnLA,

Thanks for your constructive suggestions regarding our submission. We hope our rebuttal has addressed your concerns. As the discussion deadline approaches, please let us know if you have any further questions. We would be happy to provide any needed clarification.

Comment

Dear Reviewer vnLA,

Thank you for reviewing the submission. Even though your initial rating for the paper is positive, please acknowledge the rebuttal and let us know if you have any further questions.

Thanks, AC

Comment

Thank you for providing empirical analysis on step size and improvement and setup for the max step size. Now I can see the same question about cost has also been posed by other reviewers. I couldn't find any answer to that. Additionally, the number of steps cannot directly reflect computational costs. Running time or GPU peak memory could be direct metrics rather than number of steps.

Comment

Thanks for your comment. We compare our DVD agent with previous video agents, including MR.Video[1] and VideoAgent[2]. Since MR.Video[1]'s code has not been open-sourced, we reproduced MR.Video and evaluated it on LVBench without transcripts under the same models. The results are shown in the following table. The peak GPU memory is estimated on Qwen3-32B using the peak token number per query.

| Method | Acc. (%) | Running Time Per Query (s) | Peak Token Number | Peak GPU Memory |
| --- | --- | --- | --- | --- |
| Ours | 74.2 | 153.3 | 20803 | 86.49 GB |
| MR. Video | 65.5 | 326.2 | 51468 | 123.92 GB |
| VideoAgent | 64.1 | 200.5 | 19876 | 85.36 GB |

MR.Video incurs the highest computational cost in both running time and peak GPU memory usage. This is primarily due to its Segment Intention Analysis component (described in Section 3.3.1 of its paper), which requires querying with extensive scene captions for all video segments. Our DVD demonstrates superior performance compared to MR.Video while maintaining peak GPU memory usage comparable to VideoAgent.

[1] MR.Video: “MapReduce” is the Principle for Long Video Understanding, 2025.04 Arxiv
[2] VideoAgent: Long-form Video Understanding with Large Language Model as Agent, ECCV 2024


Please let us know if you have any further concerns. We would be happy to provide any needed clarification.

Final Decision

Summary

This paper proposes Deep Video Discovery (DVD), an agentic framework designed for long-form video understanding.

Reasons to accept the paper

  • Strong performance on various benchmarks
  • Comprehensive ablations

Reasons to reject the paper

  • Limited novelty in video retrieval & search.
  • Higher computational costs / inefficient model inference.

Discussion

During the rebuttal, the authors respond to the novelty concern, arguing that the framework is more adaptive than existing ones. The authors also provide additional comparisons on efficiency and additional base models. The AC finds the responses reasonable and believes the current merits (strong empirical results) outweigh the concerns; thus the AC recommends accepting the paper. The paper was discussed with the SAC, who agrees with the decision.