(1) From a technical standpoint, the innovation is limited and can be categorized as incremental. Specifically, the query-adaptive visual cluster and the coarse-to-fine strategy for identifying keyframes have been explored in previous work.

(2) In the Introduction Section, the authors mention one of the motivations for VideoTree as the Inability to Capture the Coarse-to-Fine Video Structure. However, this motivation is unconvincing because capturing the coarse-to-fine video structure is merely a method for identifying key frames, rather than a true challenge faced in the field of long video understanding. It seems that the authors are somewhat justifying their proposed approach rather than addressing a broader, established challenge.

(3) In Section 3.1, the authors write that for a cluster , they identify the keyframe closest to the centroid vector and consider it as the keyframe of the th cluster. However, this straightforward method of converting the image to a caption can result in a significant loss of information relevant to the query, leading to potential error propagation. Additionally, this image-level captioning operation overlooks substantial motion information, making it inadequate for addressing queries related to temporal dynamics.

(4) The experimental results on Video-MME are not inspiring and insufficient, and do not convincingly demonstrate the effectiveness of VideoTree. For instance, why not comparisons with similar methods (e.g., LLoVi, VideoAgent) ? Additionally, it would be beneficial to see results that incorporate subtitles, as this could provide further insight into how VideoTree performs relative to existing approaches and under different input conditions.