Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
Abstract
Reviews and Discussion
This work proposes Visual Semantic-Logical Search (VSLS) for the keyframe search problem in long-video understanding. VSLS starts by defining four common visual semantic logics present in text queries: spatial co-occurrence, attribute dependency, temporal proximity, and causal order. It then runs an iterative sampling process that leverages an object detector and assigns numerical scores to frames according to the four logics to finally derive the keyframes. Empirical experiments demonstrate that VSLS achieves improved search accuracy with minimally increased latency (compared with naive uniform sampling).
Strengths and Weaknesses
Strengths
- The paper is well-written, with all technical details clearly and thoroughly presented.
- VSLS sounds like a reasonable and valid strategy for keyframe search. I also appreciate the fact that important aspects (beyond just accuracy), including demonstrating efficiency (Table 1) and proving the sufficient coverage of semantic-logic relations (Sections 4.1, 4.3), have been addressed in detail.
Weaknesses
- Although there are distribution updates in the process, I would imagine that the initial frame sampling (which is uniform) could heavily impact the performance in certain cases. For example, in extreme cases where all the keyframes fall within a certain short period, can VSLS (in its current form) update to that area within a reasonable number of iterations? Having a visualization of such a case, similar to Figure 3, would be quite helpful. If not, what could be the improved design regarding this aspect?
- The four semantic logics could be sufficient for general queries and natural images. Yet I would imagine that in specialized applications (e.g., medical applications), suitable logics could be different. Are there any guidelines/recommendations to derive such logics for different applications/scenarios? I'd like to see some discussion on this, which would make VSLS more general.
Questions
Please see weaknesses.
Limitations
yes
Final Justification
My questions/concerns are sufficiently addressed. I've also read through other reviews, and the rebuttal seems to address them well too. In general, this work presents an interesting, effective, and extendable framework for extracting keyframes, which may inspire future works. I'm thus raising my score from 4 to 5.
Formatting Issues
The in-text references do not seem to be in the correct form. I believe the authors were using the wrong command.
We sincerely appreciate your constructive feedback and recognition of our well-written paper, sound VSLS strategy, and comprehensive evaluation. We are very grateful for your positive evaluation, which is highly encouraging for our team. Below, we address your concerns and incorporate your suggestions.
1. Concerns about keyframes clustered within a short period. Thank you for your insightful and constructive feedback. We appreciate you pointing out the potential impact of the initial uniform sampling, particularly in the extreme case where keyframes are clustered within a short period. This is a crucial aspect of the method's robustness.
We are confident that VSLS, in its current form, can effectively handle such "short period" cases. This capability stems from two key characteristics of our framework designed for robust exploration:
- First, our distribution update mechanism maintains the potential to explore new frames in every iteration (Sec. 2.3.4). While the algorithm exploits high-confidence regions by increasing their sampling probability, the global distribution update ensures that all frames retain a non-zero chance of being selected. This prevents the search from becoming permanently stuck in a suboptimal region and preserves its ability to discover the critical cluster even if it was missed in the initial samples.
- Second, the search is broadened by observing not just the primary target but also relevant "Cue Objects" and scenes (Sec. 2.2). If a sampled frame contains a contextual cue, the algorithm increases the sampling probability in that temporal area. Moreover, our relation-based scoring mechanism further prioritizes nearby frames if they satisfy certain logical relations, so the final search results tend to converge to the frames in a small time window with the highest scores. This strategy increases the probability of reaching the ground-truth frames by leveraging related contextual information, guiding the search toward the correct region even before the main target is found. A minimal sketch of this update scheme is given below.
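For concreteness, the sketch below illustrates the flavor of this update rule in Python. It is a simplified illustration rather than our actual implementation; the function name, suppression radius, and floor value are chosen here only for exposition. It combines a score-based boost around sampled frames, temporal suppression of already-visited frames, and a uniform floor that keeps every frame reachable.

```python
import numpy as np

def update_sampling_distribution(probs, sampled_idx, frame_scores,
                                 floor=1e-3, radius=5, suppress=0.2):
    """Illustrative per-iteration update of the frame sampling distribution.

    probs        : current probability over all frames (sums to 1)
    sampled_idx  : frame indices sampled in this iteration
    frame_scores : semantic-logical scores of those frames (>= 0)
    """
    probs = probs.copy()
    n = len(probs)

    # Exploit: raise probability in the temporal neighborhood of frames whose
    # target/cue objects and logical relations scored well.
    for i, score in zip(sampled_idx, frame_scores):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        probs[lo:hi] *= (1.0 + score)

    # Suppress the frames just visited so the search keeps exploring new areas.
    probs[np.asarray(sampled_idx)] *= suppress

    # Explore: mix in a uniform floor so no frame ever has zero probability.
    probs = probs / probs.sum()
    probs = (1.0 - floor) * probs + floor / n
    return probs / probs.sum()
```

Because of the uniform floor, even a keyframe cluster confined to a short window retains a non-zero chance of being sampled in every iteration, and once a single cue frame inside that window is hit, the neighborhood boost quickly concentrates probability mass there.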
We agree that a visualization similar to Figure 3 would be the most effective way to demonstrate this. In preparation for this rebuttal, we have analyzed such extreme cases and generated the corresponding visualizations, which confirm this rapid convergence. Unfortunately, as per NeurIPS rebuttal guidelines, we are not permitted to include external links or new figures. Therefore, to illustrate this capability, we would like to present the results in a table format below.
| video_id | question | options | keyframe_time_stamps | num_iterations | ground_truth frame position | Initial Uniform Sample | Iterations when other targets are found |
|---|---|---|---|---|---|---|---|
| Ip9DbdOtqF4 | A little girl appears in front of a wall. She is holding two whips and is dressed in a light-colored top and dark-colored trousers. There is a piece of white paper hanging on the wall in front of her. What did the little girl do the first time she appeared on the scene? | A) Grabbed the white paper B) Raised both hands C) Used a tool to imprint a design on the white paper D) Bent down and bowed E) Waved her hands left and right | [201.0, 64.0, 92.0, 236.0, 194.0, 33.0, 179.0, 200.0] | 16 | 4800 (timestamp: 200.2) | [0, 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225]; target "girl" found, target "paper" missed | "paper" found at iteration 3; sampled frames [28, 25, 192, 235, 40, 17, 229, 231, 27, 43, 19, 9, 31, 226, 13, 29] |
As illustrated in the table, during the initial uniform sampling in iteration 1, only the object “girl” is detected, while the critical object “paper” is missed. Instead of continuing to explore frames uniformly at random, our algorithm leverages an object-relation scoring mechanism to guide the next round of sampling more strategically. This targeted exploration allows our method to focus on promising, previously unvisited frames. As a result, our approach successfully identifies the ground truth keyframe in later iterations, whereas the uniform baseline fails to do so.
Furthermore, we appreciate your forward-thinking question about potential design improvements. To further enhance the robustness of VSLS, we have considered two promising future directions:
- Hierarchical Sampling: A coarse-to-fine approach where we first identify high-potential video segments before applying our fine-grained iterative frame search.
- Heuristic Initialization: Using cues like scene-change detection or audio energy to create a weighted, non-uniform initial sampling distribution.
2. Concerns about the semantic logics in specialized applications. Thank you for this excellent question regarding the generalizability of our framework. You have raised a very important point about the need for domain-specific logics in specialized applications like medicine.
We completely agree. The four logics we presented—spatial, temporal, attribute, and causal—are intended as a foundational set designed to cover a broad range of general queries. We see VSLS not as a system with a fixed set of logics, but as an extensible framework that can be adapted to various domains.
Regarding your request for guidelines on deriving new logics for different scenarios, we propose the following systematic, three-step approach:
- Domain Knowledge Elicitation: The first step is to collaborate with domain experts to identify the critical relationships and events they analyze. For a medical application like surgical video analysis, this would involve identifying key instrument-tissue interactions (e.g., ‘cutting’, ‘suturing’, ‘retracting’) which are far more specific than our general ‘causal’ or ‘spatial’ relations. This process translates expert knowledge into a set of target logical relations.
- Operationalization into Verifiable Rules: The conceptual relationship must then be translated into a computable, verifiable rule that the VSLS search can execute. This involves defining the specific visual and temporal evidence required. For example, a new `suturing` logic could be operationalized as follows (see the sketch after this list):
  - (i) Detecting a needle-holder instrument and suture material.
  - (ii) Observing the instrument in periodic contact with tissue edges.
  - (iii) Verifying that the tissue edges become approximated in subsequent frames.
- Modular Integration: Finally, this new, operationalized logic can be integrated as a new module into the VSLS framework. Our design allows new logic-checking functions to be added alongside the existing four. The query parser would be extended to recognize domain-specific keywords (e.g., "suture") and trigger the corresponding verification function during the iterative search process.
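To make steps 2 and 3 concrete, the sketch below shows one possible way to register such a domain-specific logic as a plug-in check alongside the existing four. The registry, field names, and thresholds are illustrative placeholders, not our actual code.

```python
from typing import Callable, Dict, List

# Hypothetical per-frame detection record: label, confidence, bounding box.
Detection = Dict[str, object]

# Registry of logic-checking functions keyed by the keyword the query parser emits.
LOGIC_CHECKS: Dict[str, Callable[[List[List[Detection]]], bool]] = {}

def register_logic(keyword: str):
    def wrap(fn):
        LOGIC_CHECKS[keyword] = fn
        return fn
    return wrap

@register_logic("suture")
def check_suturing(frame_window: List[List[Detection]]) -> bool:
    """Operationalized 'suturing' logic over a short window of consecutive frames:
    (i) a needle-holder and suture material are detected,
    (ii) the instrument repeatedly co-occurs with tissue edges,
    (iii) repeated contact across the window proxies for edge approximation."""
    has_instrument = any(d["label"] == "needle_holder" for f in frame_window for d in f)
    has_suture = any(d["label"] == "suture" for f in frame_window for d in f)
    contact_frames = sum(
        1 for f in frame_window
        if any(d["label"] == "needle_holder" for d in f)
        and any(d["label"] == "tissue_edge" for d in f)
    )
    return has_instrument and has_suture and contact_frames >= 2

# During the iterative search, the parser would map a query keyword such as
# "suture" to LOGIC_CHECKS["suture"] and call it on the sampled frame window.
```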
This process ensures that the VSLS framework can be effectively adapted to specialized fields, making it more general and powerful. We believe this discussion significantly strengthens our paper's contribution. We will add a section covering the framework's extensibility and these guidelines for deriving new logics to the final version of the paper. Thank you again for this valuable suggestion.
Once again, we would like to express our sincerest gratitude. We will incorporate all of your suggestions into the final manuscript. Thank you for your contributions to the improvement of our paper and the advancement of the field.
This paper presents a semantics-driven framework for keyframe selection in long videos, addressing the challenge of coarse frame sampling and the lack of logical alignment between textual queries and visual content. By modeling four types of semantic-logical relations—spatial co-occurrence, temporal proximity, attribute dependency, and causal order—the method iteratively refines sampling distributions to identify query-relevant frames. The approach achieves new state-of-the-art results on keyframe selection benchmarks and significantly improves performance on downstream video QA tasks such as LONGVIDEOBENCH and VIDEO-MME, effectively bridging the gap between text queries and visual-temporal reasoning.
Strengths and Weaknesses
Strengths: The proposed VSLS algorithm is both novel and effective. It leverages YOLO-World to detect objects in video frames and constructs a relevance scoring system based on individual object presence and the co-occurrence of related objects. This approach significantly improves the efficiency of video processing. VSLS achieves strong performance on both LONGVIDEOBENCH and VIDEO-MME benchmarks. The ablation study in Table 4 demonstrates the effectiveness of modeling object relationships in videos, which is insightful.
Weaknesses: 1. The notation in Equation 1 is somewhat confusing; I believe there is a clearer way to present it. 2. How do the speed and size of the detector affect video understanding in terms of both efficiency and performance? Would a better detector further improve performance? Exploring more types of detectors could bring more insights.
Questions
Please refer to the weaknesses for questions.
Limitations
Please refer to the weaknesses for limitations.
Final Justification
Maintain the original evaluation
Formatting Issues
no
We sincerely appreciate your constructive feedback and recognition of the efficiency gains and novelty of our approach. We are very grateful for your positive evaluation, which is highly encouraging for our team. Below, we address your concerns and incorporate your suggestions:
1. Concerns about the notation in Equation 1. Thank you very much for your detailed suggestion. We will revise Equation 1 according to your advice. If you have a preferred or better formulation, we would be grateful if you could share it with us. We will then update Equation 1 to use a more standard and explicit format, e.g.,

$$\mathcal{F}_t = \mathrm{Sample}\big(\{1, \dots, N\},\ P_t\big),$$

where $\mathrm{Sample}(\cdot)$ is a function that draws frame indices from the population $\{1, \dots, N\}$ based on the probabilities specified in $P_t$.
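For completeness, this sampling step can also be written in a few lines of Python; the variable names below are generic placeholders rather than the notation used in the paper.

```python
import numpy as np

def sample_frame_indices(probs: np.ndarray, k: int, rng=None) -> np.ndarray:
    """Draw k distinct frame indices from {0, ..., N-1} according to probs."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(probs), size=k, replace=False, p=probs / probs.sum())

# Example: sample 8 frames per iteration from a 1000-frame video, starting uniform.
probs = np.full(1000, 1.0 / 1000)
indices = sample_frame_indices(probs, k=8)
```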
2. Concerns about how the detector affects video understanding. Thank you very much for your insightful and professional suggestion. We fully agree that the choice and configuration of the object detector can influence both the efficiency and performance of long video understanding. Based on your feedback, we summarize the effects into three key aspects:
- Number of frames per detector input (grid size). The number of frames packed into a single grid affects the trade-off between speed and accuracy. As shown below, packing more frames per detector input can improve efficiency by reducing search steps, but it also degrades QA accuracy due to reduced detection precision:
| Num. of images in grids | Search cost (steps) | QA performance |
|---|---|---|
| 1 | 770 | 56.7 |
| 4 | 160 | 55.5 |
| 8 | 48 | 53.5 |
| 12 | 730 | 53.2 |
| 16 | 860 | 52.7 |
- Confidence threshold for detection. Higher confidence thresholds improve precision in object detection, which helps QA performance, but they also slow down the search process due to fewer confident detections per iteration.
| Confidence threshold | Search cost (steps) | QA performance |
|---|---|---|
| 0.3 | 18 | 50.2 |
| 0.5 | 42 | 54.1 |
| 0.7 | 49 | 55.3 |
| 0.8 | 162 | 56.4 |
- Type of detector model. We also conducted additional experiments using OWL-ViT, a transformer-based open-world detector, and compared it with the YOLO-OpenWorld model used in our main experiments. While OWL-ViT offers broader concept coverage, its lower detection accuracy slightly affects downstream QA performance, as shown below:
| Detector model | Detection performance (AP) | Search cost (steps) | QA performance |
|---|---|---|---|
| OWL-ViT | 34.1% | 67 | 51.2 |
| YOLO-OpenWorld | 39.0% | 49 | 55.3 |
These results demonstrate that while a better detector (e.g., one with higher AP) generally leads to stronger QA performance, the overall effectiveness of long video understanding also depends on balancing detection precision, sampling strategy, and efficiency constraints. We appreciate your suggestion, which helped us deepen the analysis and further confirm the robustness of our framework. Exploring a wider range of detectors remains an exciting future direction.
Once again, we would like to express our sincerest gratitude. We will incorporate all of your suggestions into the final manuscript. Thank you for your contributions to the improvement of our paper and the advancement of the field.
Thank you for the detailed response. Table 1 shows that as the grid size increases, the number of search steps first decreases and then increases, resulting in a non-monotonic trend. Could this behavior be attributed to the distribution or characteristics of the current dataset? If the experiments were performed on a different dataset, might the trend differ?
Thank you for your insightful question. We have indeed observed a similar non-monotonic trend on other datasets as well, suggesting that this behavior is not specific to the current dataset.
This pattern can be explained by the trade-off between detection efficiency and accuracy when increasing the grid size. Initially, merging a small number of images into a grid does not significantly affect YOLO's detection capability. On the contrary, it allows the model to process multiple images in parallel, thereby reducing the total number of search steps. However, as the number of images per grid continues to increase, YOLO's detection accuracy begins to decline due to the increased visual clutter. This makes it more difficult to detect the target objects derived from the VLM's query decomposition, which in turn causes the VSLS method to continue iterating until the targets are correctly identified, resulting in an increase in the number of steps.
Therefore, while the trend may vary slightly depending on the dataset, the overall non-monotonic behavior is likely to persist due to this underlying mechanism.
We will include the relevant discussion and experiments in the revised version of the paper. Your feedback is truly encouraging for our team. Thank you again for your thoughtful and constructive comments!
The paper introduces Visual Semantic-Logical Search (VSLS), a semantics-driven framework for efficient keyframe extraction in long videos. Unlike conventional methods that rely on dense frame sampling or end-to-end feature selection, VSLS reformulates keyframe selection by modeling four logical dependencies—spatial co-occurrence, temporal proximity, attribute dependency, and causal order—to dynamically refine sampling distributions. This approach enables precise, query-aware frame identification while sampling only 1.4% of video frames (EGO4D), leading to an 8.7% improvement in GPT-4O’s long-form video QA accuracy.
Compared to existing techniques, VSLS offers superior efficiency by eliminating the need for training or dense captioning, while its explicit modeling of logical relationships enhances reasoning consistency beyond simple object detection. The framework is designed as a plug-and-play module, ensuring seamless integration with existing VLM pipelines. Experiments demonstrate state-of-the-art performance on LONGVIDEOBENCH, VIDEO-MME, and HAYSTACK-LVBENCH, validating its effectiveness in bridging the gap between textual queries and visual-temporal reasoning. While the paper is well-structured, some figures and tables could be enlarged for better readability.
Strengths and Weaknesses
Strengths:
- Clear and Effective Writing - The paper is exceptionally well-written, with the introduction providing a particularly lucid overview of the key contributions. The presentation makes the technical innovations accessible while maintaining scholarly rigor.
- Remarkable Efficiency Gains - The proposed method achieves an impressive 8.7% accuracy improvement in GPT-4O-based video QA while sampling only 1.4% of frames, demonstrating superior computational efficiency compared to dense sampling approaches.
- Advanced Logical Modeling - Unlike conventional target detection methods, VSLS innovatively models four fundamental logical relationships (spatial, temporal, attribute, and causal) within queries. This semantic-logical framework significantly enhances reasoning consistency throughout the visual-temporal analysis process.
Weaknesses:
- Insufficient Model Validation: The experiments only tested a limited set of models. To enhance generalizability, it is recommended to include additional open-source models such as Qwen2-VL and LLaVA-Video for validation.
- Limitations in Visual Semantic Logic Detection: The current approach, particularly in Attribute Relation, relies solely on bounding box overlap (IoU) to determine attribute associations, which is logically weak. For instance, cases like "o1 is part of o2" (e.g., "wheel" and "car") and "o1 and o2 coincidentally overlap" (e.g., "pedestrian near a car") are treated equally. This raises questions about its connection to graph-based frame relation modeling methods. A discussion in related work or the main text would help clarify the method's uniqueness.
- High Parameter Sensitivity: Many parameters appear to be manually set without justification. Moreover, the distribution update depends on frame scores (Ct), which themselves rely on initial detection and simple relation-based scoring. If the initial detection is erroneous (e.g., missing critical objects), the sampling process may get trapped in a "low-quality frame loop," compromising robustness.
Questions
1. Readability Decline in Method Section: While the Introduction is well-written and highly readable, the Method section appears less polished in terms of clarity and logical flow. It is recommended to refine the writing in this section to match the high standard of the Introduction, ensuring better comprehension.
2. The current approach relies heavily on manually tuned parameters, and detection errors (e.g., missed objects) may propagate and degrade performance. Could error-correction mechanisms (e.g., confidence-based dynamic adjustment, error propagation suppression) be introduced to reduce human intervention and improve robustness?
Limitations
This work appears to have minimal potential for negative societal impact.
Final Justification
They have addressed my concerns; however, my confidence is low.
Formatting Issues
No major formatting issues are found in this paper.
We sincerely appreciate your constructive feedback and recognition of our clear writing, remarkable efficiency gains, and advanced logical modeling. Your insights have been invaluable in strengthening our work. Below, we address your concerns and incorporate your suggestions:
1. Additional Open-source Models. Thank you for raising the important point regarding model validation. Initially, our primary focus was on the keyframe search component, and thus, we limited our comparisons to a select set of downstream QA VLMs. However, we fully acknowledge your concern about the generalizability of our proposed VSLS method.
To address this concern, we conducted additional experiments using open-source models, specifically Qwen2-VL and LLaVA-Video, as you recommended. The newly added experiments clearly demonstrate that VSLS consistently improves performance across these diverse VLMs, further validating its effectiveness and generalizability. Our results illustrate that, by employing efficient keyframe selection strategies, VSLS significantly enhances long-video understanding capabilities of various downstream VLMs.
| Method | Frames | LVB-Long | LVB-Med | LVB-Short | LVB-Average | VidMME-Long | VidMME-Med | VidMME-Short | VidMME-Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA+T* | 8 | 39.58% | 50.00% | 48.00% | 44.43% | 37.50% | 40.36% | 37.16% | 38.30% |
| LLaVA+VSLS (ours) | 8 | 42.26% | 46.92% | 50.00% | 44.73% | 38.46% | 38.45% | 38.53% | 38.47% |
| LLaVA+T* | 32 | 40.18% | 44.23% | 50.00% | 42.57% | 35.76% | 39.64% | 37.67% | 37.66% |
| LLaVA+VSLS (ours) | 32 | 41.67% | 48.08% | 54.00% | 45.20% | 36.71% | 38.99% | 39.53% | 38.42% |
| QwenVL+T* | 8 | 41.96% | 47.69% | 54.00% | 45.20% | 40.28% | 50.00% | 54.90% | 48.43% |
| QwenVL+VSLS (ours) | 8 | 45.83% | 49.23% | 54.00% | 47.83% | 43.18% | 49.64% | 60.80% | 51.36% |
| QwenVL+T* | 32 | 38.69% | 41.92% | 40.00% | 40.09% | 34.90% | 45.64% | 55.24% | 45.34% |
| QwenVL+VSLS (ours) | 32 | 38.69% | 42.31% | 54.00% | 41.33% | 37.94% | 50.00% | 55.78% | 48.00% |
2. Concerns about Limitations in Visual Semantic Logic Detection. Thank you for this excellent point. We agree that relying solely on bounding box overlap for attribute association is a heuristic with logical limitations, and we appreciate the opportunity to clarify our framework's design and scope. The primary role of our Visual Semantic-Logical Search (VSLS) is to function as a highly efficient candidate retrieval mechanism, not as the final reasoning engine. The goal of this heuristic is to rapidly filter a vast number of frames down to a small, manageable set that is highly likely to contain the answer. We then delegate the more complex task of logical disambiguation to a powerful downstream Vision-Language Model (VLM). For instance, given the query "What color is the shirt of the person in the car?", our search might retrieve keyframes of both a person sitting in the car and a person who happens to be walking past the car. Both sets of frames are passed to the VLM, which has the foundational reasoning capability to distinguish the correct context and provide the right answer.
Our method is therefore distinct from, and complementary to, graph-based frame relation modeling. VSLS is designed as an efficient, query-guided temporal search module that serves as a crucial pre-processing step. A more complex method, such as graph-based reasoning, would be a downstream task that operates on the concise set of keyframes our method selects. This two-stage approach allows for both efficiency over long videos and robust reasoning on the retrieved candidates.
To make this clearer, we will add a discussion to the Limitations section acknowledging the potential risks of this heuristic and will further clarify the distinction between our search framework and downstream reasoning tasks in the Related Work section of our revised manuscript.
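To make the bounding-box heuristic discussed above concrete, it can be sketched as a simple IoU test; the function names and threshold below are illustrative placeholders rather than the exact values used in VSLS, and, as noted, the final disambiguation is delegated to the downstream VLM.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def attributes_associated(box_obj, box_attr, iou_thresh=0.3):
    """Heuristic attribute association: treat the attribute as belonging to the
    object if their boxes overlap sufficiently. This deliberately does not
    distinguish 'part-of' from coincidental overlap; that disambiguation is
    left to the downstream VLM operating on the retrieved keyframes."""
    return box_iou(box_obj, box_attr) >= iou_thresh
```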
3. Concerns about erroneous initial detections. Thank you for raising this important concern. Such concerns are entirely valid, as relying on initial detections can pose a risk of the sampling process falling into a “low-quality frame loop” if critical objects are missed early on. However, we have implemented specific measures in our method to address this issue, as described in the paper:
- Confidence-based exploration control: We introduce a confidence threshold for detected objects. If a detection’s confidence is below the threshold, the corresponding frame’s sampling probability remains unchanged—allowing the algorithm to continue exploring other regions of the video. Additionally, after each sampling step, we apply a temporal suppression window that lowers the scores of neighboring frames. This encourages exploration of untouched video segments and prevents the algorithm from repeatedly focusing on already detected high-score regions. As a result, the system avoids being trapped in a narrow subset of frames, even when early detections are incomplete or noisy.
- Redundancy via cue-object modeling: Our method not only searches for key query objects but also leverages semantically relevant cue objects to guide exploration. This redundancy ensures that even if a key object is initially missed, the presence of related cue objects can still lead the search toward the correct temporal region. Furthermore, distribution updates are not based on a single frame’s score, but are aggregated across sampled frames, which reduces the risk of being misled by outliers.
As illustrated in Figure 3 (bottom), although VSLS assigns low confidence to the second keyframe in iteration 1, the model does not get stuck near the first peak. Instead, it continues exploring and successfully identifies the correct second keyframe near the end of the video—demonstrating the algorithm’s ability to escape from local optima and avoid the low-quality frame loop.
| video_id | question | options | keyframe_time_stamps | num_iterations | ground_truth frame position | Initial Uniform Sample | Iterations when other targets are found |
|---|---|---|---|---|---|---|---|
| Ip9DbdOtqF4 | A little girl appears in front of a wall. She is holding two whips and is dressed in a light-colored top and dark-colored trousers. There is a piece of white paper hanging on the wall in front of her. What did the little girl do the first time she appeared on the scene? | A) Grabbed the white paper B) Raised both hands C) Used a tool to imprint a design on the white paper D) Bent down and bowed E) Waved her hands left and right | [201.0, 64.0, 92.0, 236.0, 194.0, 33.0, 179.0, 200.0] | 16 | 4800 (timestamp: 200.2) | [0, 15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225]; target "girl" found, target "paper" missed | "paper" found at iteration 3; sampled frames [28, 25, 192, 235, 40, 17, 229, 231, 27, 43, 19, 9, 31, 226, 13, 29] |
The table above shows another example where the initial uniform sampling detects only “girl” but misses “paper.” Thanks to our spline-based smoothing, the algorithm avoids repeatedly sampling similar frames and instead explores new ones. By iteration 3, it successfully finds “paper” and retrieves the correct ground truth frame, avoiding the low-quality frame loop.
4. Concerns about Parameter Sensitivity. We thank the reviewer for this professional and insightful feedback. We agree that analyzing parameter sensitivity is crucial. While several parameters in our framework are manually set, they are designed to be interpretable and correspond to clear, logical trade-offs, allowing users to configure the system based on their specific needs for efficiency versus accuracy.
| num of images in grids | search cost (steps) | performance |
|---|---|---|
| 1 | 770 | 56.7 |
| 4 | 160 | 55.5 |
| 8 | 48 | 53.5 |
| 12 | 730 | 53.2 |
| 16 | 860 | 52.7 |
| confidence threshold | search cost (steps) | performance |
|---|---|---|
| 0.3 | 18 | 50.2 |
| 0.5 | 42 | 54.1 |
| 0.7 | 49 | 55.3 |
| 0.8 | 162 | 56.4 |
For example, two key parameters are the confidence threshold and the num of images in grids. The confidence threshold allows a user to balance search cost against performance; as shown in our analysis, increasing the threshold from 0.3 to 0.8 improves performance from 50.2 to 56.4, while the required search cost increases from 18 to 162 steps. Similarly, the num of images in grids parameter controls the granularity of the search. A smaller grid size (e.g., 1) yields the highest performance (56.7) at a high search cost (770 steps), whereas a medium grid size (e.g., 8) minimizes the search cost to just 48 steps, offering a more efficient search. This demonstrates that the parameters offer predictable control over the search behavior rather than being arbitrary settings.
5. Concerns about Readability Decline in the Method Section. Thank you for your valuable feedback. We appreciate your positive comments on the Introduction and fully acknowledge your suggestion regarding the Method section. We will carefully revise this part to improve its clarity, structure, and logical flow, ensuring it matches the standard of the Introduction and enhances overall readability for the reader.
Once again, we would like to express our sincerest gratitude. We will incorporate all of your suggestions into the final manuscript. Thank you for your contributions to the improvement of our paper and the advancement of the field.
Thank you for the detailed response, which has addressed my concerns.
Thank you so much for your thoughtful engagement and for acknowledging that your concerns have been addressed. We are truly grateful for your detailed review, constructive suggestions, and your willingness to follow up after our rebuttal.
Given that the points you raised have now been clarified and supported with additional experiments and analyses, we kindly and respectfully ask whether you might consider a positive adjustment to your score. As you are aware, even a small change can have a meaningful impact on the outcome of our submission, and your support would be greatly appreciated.
Once again, we are very thankful for your time and contributions to improving our work.
This paper presents Visual Semantic-Logical Search (VSLS), a novel framework that can efficiently search key frames in long videos based on the logical relationships between queries and visual elements. VSLS first decomposes the input query into four types of pre-defined semantic-logical relations. It then samples frames and checks whether the objects in the sampled frames satisfy the four pre-defined object relations. Based on the relevance scores, VSLS iteratively samples and updates query-relevant frames. The authors evaluate VSLS on LV-HayStack, LongVideoBench, and Video-MME. VSLS shows strong performance, outperforming multiple baselines.
Strengths and Weaknesses
Strengths:
- The writing is clear and easy to follow.
- Based on the T* framework, when scoring the frames, the authors propose to model higher-order object relations instead of only focusing on the existence of objects. The motivation is clear and makes sense.
- The experiments are extensive and the performance of the proposed method is strong. VSLS outperforms the T* baseline on LV-HayStack, LongVideoBench and Video-MME with similar efficiency.
Weaknesses:
- On Video-MME, which has more diverse question types than LongVideoBench and LV-HayStack, the proposed method leads to marginal improvements or sometimes even worse performance, especially for GPT-4o. In addition, in Table 3, the authors only report GPT-4o performance under the 32-frame setting. This is inconsistent with the 8-frame setting, in which both GPT-4o and InternVL are reported. These results raise concerns about the generalizability of the proposed method.
- The proposed method relies on LLM/VLM to extract potential logical relations. However, it is unclear to me whether the LLM/VLM can accurately extract the logical relations.
- What are the benefits of stacking multiple frames into one image as a grid? Although it is more efficient, it might also lead to worse object detection performance.
- What are the effects of multiple iterations? From Figure 3, it seems that iteration 1 is already good enough. More iterations seem to introduce more noise into frame selection.
- In Table 4, GPT-4O + VSLS (ours) 32 frames setting, 54.2 is not the best performance but is bolded.
Questions
Please see the above ‘Weaknesses’ section. If the authors could provide more analysis of the generalizability of the proposed method, the reviewer will consider raising the score.
Limitations
yes
Final Justification
Thanks for the rebuttal. My main concerns have been addressed, especially for the generalizability of the proposed method. The new analysis about the effects of the grid size is also interesting. Therefore, I will raise my score.
Formatting Issues
no
We sincerely appreciate your constructive feedback and recognition of our clear writing style, extensive experimental validation, and novel idea. Your insights have been invaluable in strengthening our work. Below, we will address your concerns and incorporate your suggestions:
1. Concern about the generalizability of the proposed method. Thank you very much for raising these insightful questions.
We would like to clarify that the reason we did not report InternVL performance in the 32-frame setting is due to a limitation of the InternVL version available at the time, which does not support 32-frame inputs as it exceeds the maximum token length. Therefore, we included only GPT-4o for that specific comparison.
Regarding generalizability, we have conducted additional comprehensive experiments to more clearly understand the behavior of our method across varying frame counts and VLMs. The results of these experiments are provided below for your reference:
| Method | Frames | LVB-Long | LVB-Med | LVB-Short | LVB-Average | VidMME-Long | VidMME-Med | VidMME-Short | VidMME-Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA+T* | 8 | 39.58% | 50.00% | 48.00% | 44.43% | 37.50% | 40.36% | 37.16% | 38.30% |
| LLaVA+VSLS (ours) | 8 | 42.26% | 46.92% | 50.00% | 44.73% | 38.46% | 38.45% | 38.53% | 38.47% |
| LLaVA+T* | 32 | 40.18% | 44.23% | 50.00% | 42.57% | 35.76% | 39.64% | 37.67% | 37.66% |
| LLaVA+VSLS (ours) | 32 | 41.67% | 48.08% | 54.00% | 45.20% | 36.71% | 38.99% | 39.53% | 38.42% |
| QwenVL+T* | 8 | 41.96% | 47.69% | 54.00% | 45.20% | 40.28% | 50.00% | 54.90% | 48.43% |
| QwenVL+VSLS (ours) | 8 | 45.83% | 49.23% | 54.00% | 47.83% | 43.18% | 49.64% | 60.80% | 51.36% |
| QwenVL+T* | 32 | 38.69% | 41.92% | 40.00% | 40.09% | 34.90% | 45.64% | 55.24% | 45.34% |
| QwenVL+VSLS (ours) | 32 | 38.69% | 42.31% | 54.00% | 41.33% | 37.94% | 50.00% | 55.78% | 48.00% |
The results clearly demonstrate that our proposed VSLS method consistently achieves better accuracy compared to the baseline (T*) across various benchmarks and settings. Notably, on LVB and VideoMME datasets, VSLS consistently outperforms T* in most settings, especially under the 8-frame configuration, where the improvements are substantial (e.g., 45.83% vs. 41.96% on LVB-Long and 51.36% vs. 48.43% on VidMME-Average with QwenVL). Even under the more challenging 32-frame scenario—where noise often obscures key information—our method still generally achieves superior or competitive accuracy, underscoring its robustness and capability to effectively identify critical semantic-logical frames.
As you observed, while our method sometimes underperforms in the 32-frame setting, particularly on Video-MME, the main reason is that introducing more frames can result in increased noise, overshadowing the few critical frames (typically only 1-3 per query). This issue becomes prominent in models that are sensitive to redundant or noisy information.
| Method | Frame | LVB-Long | LVB-Med | LVB-Short | LVB-Average | VidMME-Long | VidMME-Med | VidMME-Short | VidMME-Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Video-7B | 4 | 39.3% | 48.4% | 54.0% | 44.1% | 38.43% | 39.71% | 38.4% | 38.8% |
| LLaVA-Video-7B + T* | 4 | 41.1% | 46.5% | 50.0% | 43.9% | 38.4% | 39.1% | 39.4% | 38.9% |
| LLaVA-Video-7B + Ours (cluster) | 4 | 45.20% | 50.00% | 62.0% | 48.4% | 43.8% | 47.9% | 56.7% | 49.8% |
| LLaVA-Video-7B + T* | 8 | 39.58% | 50.00% | 48.00% | 44.43% | 37.50% | 40.36% | 37.16% | 38.30% |
| LLaVA-Video-7B + T* | 32 | 41.67% | 48.08% | 54.00% | 42.57% | 35.76% | 39.64% | 37.67% | 37.66% |
The table above presents results from our newly added hierarchical clustering approach, where highly similar frames are removed from the selected frames, retaining only 4 diverse and semantically distinct frames as input to the VLM. Despite the significant reduction in total frame count, our method (“+ Ours (cluster)”) achieves notable improvements over the T* baseline, particularly on long videos. For example, the LVB average accuracy improves from 43.9% to 48.4%, and the VidMME average from 38.9% to 49.8%. These results demonstrate that by eliminating redundant information and focusing on key semantic-logical frames, we can not only maintain but even enhance performance in long video understanding tasks, while significantly reducing computational cost.
Our core optimization objective is to minimize the number of frames required to achieve strong QA performance, thereby reducing computational cost at the VLM inference stage. To further validate this claim, we conducted additional experiments where selected frames were clustered, reducing the frame count significantly (e.g., to only 4 frames). In these more compressed scenarios, VSLS consistently demonstrates more significant improvements over baseline methods, highlighting its effectiveness in efficiently capturing critical visual-semantic information.
The two experiments above demonstrate that our method outperforms the baseline, regardless of the choice of VLMs or the number of input frames. Notably, after removing redundant frames, VSLS with only 4 frames even surpasses the performance of the baseline using 32 frames. We will incorporate these additional findings in the revised manuscript to better demonstrate the generalizability and robustness of VSLS.
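To illustrate the clustering step mentioned above, a simplified sketch is shown below; the use of generic frame embeddings and scikit-learn's agglomerative clustering is an illustrative assumption rather than a description of our exact pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_diverse_frames(frame_embeddings: np.ndarray, frame_indices: list, k: int = 4) -> list:
    """Reduce a set of selected keyframes to k semantically distinct representatives.

    frame_embeddings : (M, D) array of embeddings for the M selected frames
    frame_indices    : the M frame indices in the original video
    """
    if len(frame_indices) <= k:
        return frame_indices
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(frame_embeddings)
    kept = []
    for c in range(k):
        members = np.where(labels == c)[0]
        centroid = frame_embeddings[members].mean(axis=0)
        # Keep the member closest to its cluster centroid as the representative.
        best = members[np.argmin(np.linalg.norm(frame_embeddings[members] - centroid, axis=1))]
        kept.append(frame_indices[best])
    return sorted(kept)
```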
2. Concern about the accuracy of extracting the logical relations. Thank you for your professional suggestion. We agree that evaluating the LLM's ability to accurately extract logical relations is a crucial point for validating our framework's reliability.
To quantitatively address this, we conducted a verification study. We randomly sampled 500 query instances from each question category across the Video-MME and Long VideoBench datasets. We then used LLMs to perform the logical relation and object extraction as described in our paper. Subsequently, to ensure the quality of the results, we performed a rigorous manual audit of all extracted logical relations and objects. Our analysis yielded the following high accuracy rates for the extraction task:
- GPT-4o: 92%
- Qwen-VL 72B: 88%
These results demonstrate that state-of-the-art LLMs are highly proficient at this task, confirming that the logic extraction ability does not serve as a performance bottleneck for our framework at this stage. Due to committee guidelines, we are unable to provide a direct link to this data in the rebuttal. However, we are committed to transparency and will release the full set of our manually audited data and LLM outputs as part of our public code release.
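For reference, the extraction interface can be sketched as follows; the prompt wording, output fields, and the `call_llm` helper are illustrative placeholders, not the exact prompts used in the audit above.

```python
import json

EXTRACTION_PROMPT = (
    "Decompose the following video question into JSON with three fields:\n"
    '  "target_objects": objects that must appear in the answer frames,\n'
    '  "cue_objects": contextual objects that hint at the relevant moment,\n'
    '  "relations": a list of items, each with "type" (one of spatial, temporal,\n'
    '               attribute, causal), "subject", and "object".\n'
    "Return strict JSON only.\n\n"
    "Question: {question}"
)

def extract_semantic_logic(question: str, call_llm) -> dict:
    """call_llm is any callable mapping a prompt string to the model's text
    response (e.g., a thin wrapper around GPT-4o or Qwen-VL); it is a placeholder."""
    raw = call_llm(EXTRACTION_PROMPT.format(question=question))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Degrade gracefully to an empty decomposition if parsing fails.
        return {"target_objects": [], "cue_objects": [], "relations": []}
```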
3. Benefits of stacking multiple frames into one image as multiple grids. Thank you for raising this important point. Our rationale for this design is that it offers a deliberate trade-off between search efficiency and downstream task accuracy. The grid size is a configurable parameter allowing users to balance these two competing factors based on their needs. The table below shows a supplementary experiment we conducted. It presents the required steps to complete the keyframe search and the performance of the downstream QA task with GPT-4o. As demonstrated, increasing the grid size substantially decreases the search cost (from 770 steps down to as few as 48 steps). However, it simultaneously leads to a moderate reduction in performance (from 56.7 down to 52.7). Notably, increasing the number of images in grids leads to a decline in detector accuracy, which in turn increases the overall search cost.
| num of images in grids | search cost (steps) | performance |
|---|---|---|
| 1 | 770 | 56.7 |
| 4 | 160 | 55.5 |
| 8 | 48 | 53.5 |
| 12 | 730 | 53.2 |
| 16 | 860 | 52.7 |
Therefore, this stacking approach presents a clear efficiency-performance trade-off, offering users the flexibility to choose an optimal balance based on their specific requirements or resource constraints. Users prioritizing detection accuracy may prefer smaller grid sizes or individual frames, while those with limited computational resources or stringent runtime constraints may opt for larger grid configurations to achieve greater efficiency.
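As an illustration of what the grid packing involves, a minimal sketch is shown below; the tiling layout and the use of Pillow are illustrative choices, not necessarily those of our implementation.

```python
import math
from PIL import Image

def pack_frames_into_grid(frames, cols=None):
    """Tile a list of equally sized PIL frames into one grid image.

    A single detector pass over the grid replaces len(frames) separate passes,
    trading some detection precision for fewer search steps."""
    n = len(frames)
    cols = cols or math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    w, h = frames[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid

# Detections on the grid can be mapped back to the source frame via the tile
# that contains each bounding box's center.
```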
4. Why are multiple iterations needed? We thank the reviewer for this insightful question. You are correct in observing that for simpler cases, such as the one in the top plot of Figure 3, our model can identify keyframes in the very first iteration. However, since each iteration only samples 8 frames to estimate the keyframe distribution, a single pass is often insufficient for more complex queries where visual cues are not immediately obvious, a frequent challenge in long video analysis. The bottom plot of Figure 3 illustrates this necessity, showing a scenario where the keyframes are only successfully located after the search distribution is progressively refined over 15 iterations.
We will fix the bolding issue in Table 4 noted in your review. Once again, we would like to express our sincerest gratitude. We will incorporate all of your suggestions into the final manuscript. Thank you for your contributions to the improvement of our paper and the advancement of the field.
Thanks for the rebuttal. My main concerns have been addressed, especially for the generalizability of the proposed method. Therefore, I will raise my score.
Dear Reviewers,
We are sincerely grateful for the time and effort you have dedicated to reviewing our work and providing invaluable feedback.
As the deadline for the rebuttal phase is approaching, we are keen to ensure that we have addressed all of your concerns as comprehensively as possible. Your insights are crucial to us for improving the quality of our manuscript.
Should you have any further questions or additional concerns after reading our rebuttal, please do not hesitate to let us know. Our team is on standby and fully committed to addressing them promptly.
Best regards, The Authors
Dear Reviewers and Area Chair,
On behalf of all authors, we would like to express our deepest gratitude for the time, effort, and expertise you have dedicated to reviewing our submission “Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding” (Paper #14140).
Throughout the review and rebuttal process, we have greatly appreciated the constructive feedback, which has allowed us to further refine and strengthen our work. We are truly encouraged by the recognition from all reviewers (Reviewer BFmk, Reviewer CCKP, Reviewer rCAy, Reviewer Rfu2) of the novelty of our Visual Semantic-Logical Search (VSLS) framework and the thoroughness of our experimental validation. In particular:
- Innovation & Originality – Reviewers acknowledged that our explicit modeling of four fundamental semantic-logical relations (spatial, temporal, attribute, causal) represents a clear and meaningful advancement over existing keyframe search approaches, effectively bridging the gap between textual queries and visual-temporal reasoning.
- Experimental Rigor – Multiple reviewers highlighted the comprehensiveness of our experiments, covering diverse datasets, frame counts, detectors, and VLM architectures, as well as detailed ablation studies, efficiency–accuracy trade-off analyses, and additional verification experiments.
- Practical Impact & Generalizability – We are grateful for the recognition that VSLS can serve as a plug-and-play, efficient, and extensible framework, with potential applicability to both general-purpose and domain-specific long-video understanding tasks.
We also appreciate that Reviewer BFmk explicitly raised their score after our rebuttal, acknowledging that their main concerns were fully addressed. Likewise, Reviewer CCKP stated that their concerns had been resolved, even though their score has not yet been updated. Such acknowledgements further affirm the value and robustness of our work.
We have greatly valued the thoughtful questions and suggestions regarding generalizability, parameter sensitivity, detection strategies, domain adaptation, and methodological clarity. Addressing these points has strengthened not only our manuscript but also our understanding of the broader research challenges in this field.
Finally, we would like to sincerely thank all Reviewers and the Area Chair for your professionalism, constructive engagement, and commitment to advancing the field of vision-language research. Your feedback has been instrumental in improving our work, and your dedication contributes significantly to the growth and rigor of our research community.
With our warmest regards and deepest appreciation,
The Authors
This paper introduces a novel semantics-driven framework, VSLS, for efficient keyframe extraction in long videos. VSLS reformulates keyframe selection by explicitly modeling four logical dependencies: spatial co-occurrence, temporal proximity, attribute dependency, and causal order. The authors claim this approach improves video question-answering by enabling context-aware identification of semantically critical frames.
The reviews were generally positive, praising the paper's clear writing, novel approach, and extensive experiments. In the initial reviews, concerns were raised about the framework's generalizability and its sensitivity to parameters. The authors' rebuttal successfully addressed these concerns with new experiments, demonstrating the framework's broad applicability to different VLMs and clarifying parameter trade-offs. This response resolved the reviewers' concerns, leading to positive final sentiment (4, 4, 4, 5).
AC recommended acceptance as a poster.