VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Abstract
Reviews and Discussion
This paper addresses explainable video anomaly detection with Hierarchical Granularity-Aware Tree in a training-free manner. The authors first use a generic event boundary detector to detect event boundaries and apply TreeInit algorithm to build an initial tree consisting of segments. The tree is then stratified by Adaptive Node Stratification that divides a video into coarse segments as parent nodes, and each coarse segment has fine segments as child nodes. Finally, the authors use Intra-cluster node refinement to compute the initial anomaly score of a node with its top K similar nodes and Inter-cluster node correlation to estimate the final anomaly scores. The authors have conducted experiments on UCF-Crime and XD-Violence to show the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths:
- The concept of clustering video frames into coarse and fine segments with Hierarchical Granularity-aware Tree is interesting
- Intra-cluster Node Refinement and Inter-cluster Node Correlation wisely leverage the tree structure for estimating anomaly scores.
- The experimental results are promising. In addition, the ablation study shows that each proposed component is effective.
Weaknesses:
- Lack of Clarity in the Proposed Method
- In Equation 2, the output is a set of tuples, each containing an index and a confidence score. How do the authors perform K-means on those two values to obtain the coarse and fine boundaries in Equation 4?
- The min and max operations in the second row of Equation 4 confuse me as well, because the set contains both indexes and scores.
- Both “Adaptive Node Stratification” and “Intra-cluster Node Refinement” use the term “cluster”, but the idea is different, which may confuse readers. In my understanding, the former generates coarse and fine clusters using K-means and the latter uses cosine similarity to find clusters.
- It is unclear to me how the final frame-wise anomaly scores are obtained. Did the authors determine the final anomaly score for each frame with the anomaly score of each child node?
- Concerns about the Proposed Method
- Equation 3 may generate choppy peaks/predictions. How can the method deal with this case?
- Inter-cluster Node Correlation does not seem very effective. Specifically, Table 3 shows 1.7% improvement with correlation and Table 10 shows 2.5% improvement with both correlation and HGTree Fine Cluster. I am curious what the performance will be if the authors just calculate the average over the parent node and child nodes to obtain the anomaly score.
- More qualitative results
- The authors mention that existing explainable VAD methods fail to explain complex events such as burglaries and arrests in line 51-52. It would be good if the authors provided more qualitative results for those activities generated by the proposed method and other baselines.
Questions
Important questions and concerns for the authors to address. Responses will strongly influence the final rating
- Address all concerns listed under 'Lack of Clarity in the Proposed Method' in the weaknesses section.
- Explain how the method deals with choppy predictions in equation 3.
- Prove the effectiveness of Inter-cluster Node Correlation. It would be better if the authors reformulated Equation 10 as an average function over the parent node and its child nodes for comparison.
Optional questions and concerns for the authors to address. Good responses can increase the final rating
- Provide qualitative results for burglaries and arrests
Please find the relevant details for some questions in the weakness section. The framework based on Hierarchical Granularity-Aware Tree is interesting and has impact on the VAD community. However, there are some concerns that the authors need to address. Therefore, my current rating is borderline and leans slightly toward accept. I will carefully read the rebuttal to decide my final rating.
Limitations
yes
Final Justification
The authors have addressed my concerns in their rebuttal and follow-up response. This paper demonstrates (1) a solid and novel model design, (2) promising results, and (3) a detailed ablation study. I encourage the authors to include the ablation study on β selection in the main paper or supplementary material if accepted. Overall, I will raise my final rating.
Formatting Issues
no major formatting issues
Q1: Lack of Clarity in the Proposed Method.
Q1-1&2: In Equation 2, the output is a set of tuples, each containing an index and a confidence score. How do the authors perform K-means on those two values to obtain the coarse and fine boundaries in Equation 4? The min and max operations in the second row of Equation 4 confuse me as well, because the set contains both indexes and scores.
A1-1&2: Thanks for your comments. Regarding Eq. 2, a confidence score is computed for each candidate boundary, with its index serving as an inherent attribute of that boundary. In Eq. 4, K-means clustering operates on the one-dimensional confidence values to produce the coarse and fine categories, while the index attribute associated with each boundary is preserved throughout this process. The min and max operations in Eq. 4 are likewise performed only on these one-dimensional values. Importantly, the index information plays a crucial role both in constructing the hierarchical tree structure and in facilitating the subsequent deduplication and completion operations. We will clarify these potential ambiguities in the revised version.
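To make this concrete, the following minimal Python sketch reflects our reading of the clustering step (not the authors' code): K-means runs on the one-dimensional confidence values while each boundary keeps its frame index. The boundary tuples, the "higher-confidence cluster = coarse" rule, and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical boundary set in the spirit of Eq. 2: (frame_index, confidence) tuples.
boundaries = [(12, 0.91), (48, 0.35), (77, 0.62), (130, 0.88), (164, 0.41)]

conf = np.array([c for _, c in boundaries]).reshape(-1, 1)  # 1-D confidence values
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(conf)

# The higher-confidence cluster yields the coarse boundaries; the remaining
# boundaries are the additional fine ones. Frame indexes ride along untouched,
# so the subsequent min/max never mixes indexes with scores.
high = int(conf[labels == 1].mean() > conf[labels == 0].mean())
coarse = sorted(i for (i, _), l in zip(boundaries, labels) if l == high)
fine_extra = sorted(i for (i, _), l in zip(boundaries, labels) if l != high)
print("coarse boundary indexes:", coarse)
print("additional fine boundary indexes:", fine_extra)
```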
Q1-3: Both “Adaptive Node Stratification” and “Intra-cluster Node Refinement” use the term “cluster”, but the idea is different, which may confuse readers. In my understanding, the former generates coarse and fine clusters using K-means and the latter uses cosine similarity to find clusters.
A1-3: Thanks for your comments. To clarify, Adaptive Node Stratification generates clusters of different granularities (Coarse and Fine) through K-means clustering, while Intra-cluster Node Refinement further optimizes the anomaly scores of nodes within each of these clusters (either Coarse or Fine). The two operations serve distinct purposes in our hierarchical framework, and we will make this distinction clearer in the revised manuscript to avoid potential confusion.
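As an illustration of the second use of "cluster", the sketch below refines a node's anomaly score with its top-K most similar nodes (cosine similarity) inside the same coarse or fine cluster. The 50/50 weighting, feature dimensions, and names are illustrative assumptions, not the paper's exact formulation of Intra-cluster Node Refinement.

```python
import numpy as np

def refine_scores(feats: np.ndarray, scores: np.ndarray, k: int = 3) -> np.ndarray:
    """Refine each node's anomaly score using its top-k most similar nodes
    (cosine similarity) within the same cluster. Illustrative sketch only."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                  # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)                 # a node never votes for itself
    refined = np.empty_like(scores, dtype=float)
    for i in range(len(scores)):
        top = np.argsort(sim[i])[-k:]              # k most similar neighbours
        refined[i] = 0.5 * scores[i] + 0.5 * scores[top].mean()
    return refined

rng = np.random.default_rng(0)
node_feats = rng.normal(size=(5, 8))               # stand-in for visual node features
init_scores = rng.uniform(size=5)                  # stand-in for initial LLM scores
print(refine_scores(node_feats, init_scores, k=3))
```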
Q1-4: It is unclear to me how the final frame-wise anomaly scores are obtained. Did the authors determine the final anomaly score for each frame with the anomaly score of each child node ?
A1-4: Thanks for your comments. The final anomaly score for each frame is determined based on the Correlation anomaly scores of its corresponding child nodes (video segments). Following a strategy similar to LAVAD and VERA, the score of each child node is directly assigned as the frame-level anomaly score within that segment. We will clarify the above procedure in the revised version.
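For illustration, a minimal sketch of the frame-assignment step described in A1-4: each child node's score is broadcast to every frame of its segment. The interval values and names are made up, and segment ends are treated as exclusive.

```python
import numpy as np

def segment_scores_to_frames(num_frames, child_nodes):
    """Assign each child node's anomaly score to every frame it covers
    (a minimal sketch; `child_nodes` holds (start, end, score) tuples)."""
    frame_scores = np.zeros(num_frames, dtype=float)
    for start, end, score in child_nodes:
        frame_scores[start:end] = score
    return frame_scores

# A 300-frame video split into three child nodes with their node-level scores.
scores = segment_scores_to_frames(300, [(0, 120, 0.1), (120, 210, 0.8), (210, 300, 0.2)])
print(scores[115:125])   # the jump from 0.1 to 0.8 at the node boundary
```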
Q2: Concerns about the Proposed Method
Q2-1: Equation 3 may generate choppy peaks/predictions. How can the method deal with this case?
A2-1: Thanks for your comments. As shown by the Boundary Confidence curve at the top of Fig. 2, unstable peaks potentially generated by Eq. 3 typically occur in extremely low-confidence intervals (quantitatively supported by Table S1). Our DFS (Depth-First Search) strategy effectively mitigates this issue by initiating from the highest-confidence boundary while applying a minimum confidence threshold (γ_min) to eliminate low-quality peaks. In addition, the deduplication operation removes redundant video representations, and our hierarchical structure design combined with the Inter-cluster Node Correlation mechanism provides additional robustness against unstable peaks. As evidenced in Table S1, the number of peaks used in VADTree is 8697. Compared with the single-grained VADTree-Coarse, VADTree introduces additional peaks that enhance video granularity diversity, thereby improving overall detection performance. In contrast to VADTree-Fine, VADTree reuses a subset of existing peaks to construct the Coarse Cluster without introducing new peaks, resulting in identical peak counts for these two variants. We will clarify this mitigation strategy in our revised manuscript.
Table S1: Peak count distribution on the UCF-Crime dataset under different confidence thresholds
| Threshold | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | VADTree-Fine | VADTree-Coarse | VADTree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Peak Counts | 149531 | 94512 | 57041 | 37416 | 27653 | 22047 | 18266 | 14802 | 9310 | 8697 | 6329 | 8697 |
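To make the mitigation in A2-1 concrete, here is a minimal Python sketch of confidence-threshold filtering plus deduplication over detected peaks. `gamma_min`, `min_gap`, and the peak list are illustrative stand-ins, not the actual implementation.

```python
def select_peaks(peaks, gamma_min=0.4, min_gap=10):
    """Suppress choppy, low-confidence peaks before HGTree construction.
    `peaks` holds (frame_index, confidence) pairs; `gamma_min` stands in for the
    minimum-confidence threshold and `min_gap` for a deduplication window."""
    kept = [(i, c) for i, c in peaks if c >= gamma_min]        # threshold filter
    kept.sort(key=lambda p: p[1], reverse=True)                # visit high-confidence boundaries first
    dedup = []
    for i, c in kept:                                          # drop near-duplicate boundaries
        if all(abs(i - j) >= min_gap for j, _ in dedup):
            dedup.append((i, c))
    return sorted(dedup)

print(select_peaks([(12, 0.91), (15, 0.89), (48, 0.12), (130, 0.55), (131, 0.54)]))
```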
Q2-2: Inter-cluster Node Correlation does not seem very effective. Specifically, Table 3 shows 1.7% improvement with correlation and Table 10 shows 2.5% improvement with both correlation and HGTree Fine Cluster. I am curious what the performance will be if the authors just calculate the average over the parent node and child nodes to obtain the anomaly score.
A2-2: Thanks for your comments. Inter-cluster Node Correlation extensively integrates multi-granularity anomaly detection scores from the hierarchical structure, significantly enhancing the overall stability of the VAD system. Consequently, as demonstrated in Table 3, Appendix Table 10, and Table S2 below, this module consistently improves AUC performance across all three datasets. On the latest multi-scenario MSAD benchmark, our method also surpasses the weakly supervised SOTA VAD (88.68 AUC, CVPR 2025) in AUC.
Table S2. Influence of Inter-cluster Node Correlation on the UCF-Crime, XD-Violence and MSAD (NeurIPS'24) datasets
| Method/Datasets | UCF-Crime(AUC%) | XD-Violence(AUC%) | MSAD(AUC%) |
|---|---|---|---|
| VADTree-Coarse | 82.81 | 89.36 | 87.01 |
| VADTree-Fine | 83.05 | 90.04 | 86.71 |
| (+Correlation) VADTree | 84.71 | 90.44 | 89.32 |
Eq. 10 defines the parameter β with a value range between -1 and 1. Notably, when β = 0, the correlation operation degenerates to a simple average of the anomaly scores from the parent and child nodes. Experimental results for different β values on UCF-Crime, XD-Violence and MSAD are provided in Appendix Table 9 and Table S3 below, respectively. When taking the average directly, the performance slightly decreases on all three datasets (by 0.16 on UCF-Crime, 0.07 on XD-Violence and 0.06 on MSAD).
Table S3. Influence of the Inter-cluster Node Correlation control coefficient β on the XD-Violence and MSAD datasets. The best values (ours) are marked in bold.
| β | -1.0 | -0.8 | -0.6 | -0.4 | -0.2 | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| XD-Violence (AUC%) | 89.88 | 90.03 | 90.15 | 90.24 | 90.32 | 90.37 | 90.42 | 90.42 | 90.44 | 90.41 | 90.37 |
| MSAD (AUC%) | 88.22 | 89.16 | 89.28 | 89.32 | 89.30 | 89.26 | 89.21 | 89.05 | 88.87 | 88.58 | 88.01 |
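For reference, one parameterization consistent with the degenerate-average behaviour described above is sketched below, in which β = 0 yields the plain parent/child average. This is an illustrative assumption and may differ from the exact form of Eq. 10 in the paper.

```python
def correlate(child_score: float, parent_score: float, beta: float) -> float:
    """One parameterization consistent with the described behaviour:
    beta in [-1, 1], and beta = 0 reduces to the plain parent/child average.
    (An illustrative assumption; Eq. 10 may differ in detail.)"""
    assert -1.0 <= beta <= 1.0
    return 0.5 * (1 + beta) * child_score + 0.5 * (1 - beta) * parent_score

# Sweep beta for a fine (child) node scored 0.9 whose coarse parent scored 0.5.
for beta in (-1.0, -0.4, 0.0, 0.6, 1.0):
    print(f"beta={beta:+.1f} -> {correlate(0.9, 0.5, beta):.2f}")
```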
Q3: More qualitative results.
Q3-1: Provide qualitative results for burglaries and arrests.
A3-1: Thanks for your comments. For complex anomalies such as arrests and burglaries, the fixed temporal sampling (e.g., 10-second windows) used in LAVAD and its frame-level caption aggregation can lead to missed long-range semantic dependencies, resulting in inaccurate or incomplete interpretations of abnormal events. To show the superiority of our VADTree, we conduct a qualitative analysis using two video samples from the UCF-Crime dataset.
Table S4. Comparison of abnormal explanations of arrests by different methods
| Method/Video | Arrest024_x264.mp4 Key Explanation |
|---|---|
| LAVAD | A man runs across a street in front of cars, and a motorcycle drives down a highway with other cars and motorcycles...soldiers shoot at a car in the street. |
| VADTree(Ours) | The scene described is a busy street with traffic, then a person in dark clothes running across the street, being chased by another person... The environment seems tense, possibly involving law enforcement because there's a person in a uniform approaching the car. |
Table S5. Comparison of abnormal explanations of burglary by different methods
| Method/Video | Burglary079_x264.mp4 Key Explanation |
|---|---|
| LAVAD | A man walks through a parking garage for approximately 10 seconds, as shown by a security camera...A man is shown walking in a parking garage on a security camera for several minutes before he exits the garage and walks down the sidewalk. |
| VADTree(Ours) | The debris on the ground adds to the suspicious vibe because it might be from an attempted break-in or something else illegal...Putting it all together: an open vehicle, someone with a tool, another person inside moving around, nighttime, debris, tense atmosphere. These factors add up to a scene that's quite suspicious. |
As shown in Table S4 and S5, LAVAD’s frame-level semantic aggregation often leads to hallucinations and struggles to accurately identify long-range abnormal events. In contrast, VADTree excels at detecting sub-events (e.g., "chased by another person," "attempted break-in") while also synthesizing long-range contextual clues (e.g., "putting it all together"). The above qualitative results will be added to our revision.
I thank the authors for their detailed responses. They have clarified and addressed most of my concerns. Here are my remaining concerns:
- How can the authors choose the optimal β for each dataset? Is there a general solution to do that? Table S3 shows that the performance remains consistently strong across different datasets over a range of β values. Although the design of Inter-cluster Node Correlation is reasonable, it requires manual tuning of the optimal β, which compromises the method's flexibility.
1. Conclusion
Thanks for your reply. We agree with your point about limited flexibility. In addition to the existing experimental results, we have tried to design some heuristic rules, from the perspective of dataset attributes or statistical characteristics, to achieve automated β selection (see the examples in paragraphs 2 to 4). However, the experiments show that our inter-cluster correlation design is currently the most effective solution. Therefore, we maintain that the current design of tuned inter-cluster correlation remains the empirically optimal solution. Its consistent superiority (Table S3) validates its robustness across diverse datasets, balancing flexibility with performance. We acknowledge that hyperparameter tuning is a limitation but emphasize that it is a one-time, dataset-specific optimization. Future work may explore adaptive mechanisms, yet our ablation confirms that manual tuning already achieves near-peak efficacy.
2. Design Basis for the Automated β Selection Method
We discussed the relationship between different dataset characteristics and correlation methods. First, we calculated the average video length, the average GT abnormal segment length, and its proportion for the different datasets. As shown in Table C1, MSAD is mainly composed of short videos, which is its biggest difference from XD-Violence and UCF-Crime. This difference is much greater than the difference in abnormal segment length (280 vs. 435 or 540), because the length of abnormal events in reality follows the same distribution, unlike the artificially controlled video length. Second, by comparing the average video length and the average GT abnormal segment length, we found that each abnormal segment in the MSAD dataset occupies a high proportion of the entire video length (0.45), much higher than in XD-Violence (0.15) and UCF-Crime (0.14). This means that VADTree should focus more on nodes within the Coarse Cluster on MSAD data, as these contextual distances are longer and give higher coverage of high-proportion abnormal segments. In contrast, XD-Violence and UCF-Crime should focus more on nodes within the Fine Cluster, which is consistent with our experimental results on β. In summary, we believe the meaningful and easily obtainable data attributes include: 1. the video length in the dataset; 2. the length of the current abnormal segment.
Table C1. Data attributes and experimental results of the heuristic-rule-based node correlation method
| Attribute/Datasets | XD-Violence | UCF-Crime | MSAD |
|---|---|---|---|
| Average video length (Frames) | 2921 | 3831 | 624 |
| Average GT abnormal segment length (Frames) | 435 | 540 | 280 |
| Proportion of average GT abnormal segment length | 0.15 | 0.14 | 0.45 |
| Normalized predicted anomaly level (measurable) | 0.24 | 0.10 | 0.11 |
| VADTree(ours) AUC(%) | 90.44 | 84.71 | 89.32 |
| VADTree(Heuristic rules) AUC(%) | 90.43 | 84.78 | 89.23 |
3. Heuristic Rule for β Selection
Based on this design basis, we first divided the datasets into long-video data (UCF-Crime and XD-Violence) and short-video data (MSAD) according to prior information about video length. However, the length of the anomalous segments is not measurable at test time (the ground truth is unknown), making it unsuitable for designing heuristic rules. To address this issue, we defined a measurable data attribute based on the VADTree prediction results. For the prediction results of the Coarse Cluster nodes of a video V, we defined a weighted sum of each node's length and its corresponding predicted anomaly score to measure the amount of anomaly information in the system's prediction for V. Then, by normalizing with the total video length, we obtained the normalized predicted anomaly information level, which indirectly expresses the length of the anomalous segments in the video.
Finally, we designed a simple heuristic rule based on these attributes: for long-video data, β is biased toward the Fine Cluster; for short-video data, β is biased toward the Coarse Cluster.
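A minimal sketch of how such a rule could be implemented is shown below. The constants, thresholds, and the exact way video length is combined with the measurable attribute are illustrative assumptions rather than the rule actually used for Table C1.

```python
def predicted_anomaly_level(coarse_nodes, total_frames):
    """Measurable attribute from the rebuttal: weighted sum of node length and
    predicted anomaly score, normalized by total video length (our reading;
    node tuples are (start, end, score))."""
    return sum((end - start) * score for start, end, score in coarse_nodes) / total_frames

def heuristic_beta(avg_video_len, level, long_threshold=1000):
    """Toy version of the rule: long-video datasets lean toward the Fine
    Cluster, short-video datasets toward the Coarse Cluster, with the
    measurable `level` nudging the magnitude. All constants are placeholders."""
    if avg_video_len > long_threshold:          # e.g. UCF-Crime, XD-Violence
        return min(1.0, 0.4 + level)
    return max(-1.0, -0.5 + level)              # e.g. MSAD

level = predicted_anomaly_level([(0, 200, 0.7), (200, 540, 0.3)], total_frames=540)
print(round(level, 3), round(heuristic_beta(avg_video_len=3831, level=level), 3))
```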
4. Experimental Findings
Table C1 shows that the heuristic rules cannot outperform manual β tuning, with marginally lower performance on XD-Violence (-0.01%) and MSAD (-0.09%), and a statistically insignificant difference on UCF-Crime (+0.07%).
This indicates that the relationship between dataset attributes and inter-cluster correlation is more complex than linear heuristics can capture. While attributes like video length provide useful priors, their direct mapping to β cannot fully exploit the temporal dependencies modeled by VADTree.
Overall, under the training-free setting, the heuristic rules are limited by their inability to capture complex data features. In addition, considering that the semantic-level information between nodes is not utilized, lightweight meta-learning solutions can be tried in the future to incorporate more diversified information for optimization.
Thank you again for your detailed response. The ablation study on β selection is clear and reasonable. I agree that manual tuning yields strong performance and that increasing the method's flexibility is an important direction for future work.
Overall, the authors’ rebuttal and follow-up response are clear and convincing. I will raise my final rating accordingly.
We sincerely appreciate your constructive feedback on this article and your recognition of the work. Thank you for providing valuable guidance throughout the entire process.
This paper proposes VADTree, a training-free and interpretable framework for video anomaly detection. The key innovation lies in its use of a hierarchical granularity-aware tree to adaptively segment videos, addressing the limitations of traditional fixed-window sampling methods, which struggle to handle anomalies of varying durations.
Strengths and Weaknesses
Strengths:
1. This is the first work to introduce an event tree structure into training-free video anomaly detection (VAD), using the HGTree to adaptively handle anomalies of varying durations.
2. Compared to fixed sliding window sampling, it achieves higher mIoU on two benchmark video anomaly datasets.
3. A comprehensive comparison is conducted across two major benchmark datasets, covering supervised, weakly-supervised, unsupervised, and training-free methods, where the proposed method achieves state-of-the-art AUC among training-free approaches.
4. The framework can generate event-level textual descriptions and anomaly scores, making it broadly applicable to real-world scenarios.
Weaknesses:
1. The quality of the GEBD model directly affects event boundary detection.
2. The generalization ability of the VLM/LLM component is not validated (e.g., performance with non-Qwen models remains unclear).
3. The inference time of the LLM is not reported, raising concerns about real-time performance.
Questions
- How do boundary-detection errors from GEBD in complex scenes—such as crowded areas or dynamic backgrounds—affect the HGTree structure?
- Could you test other large language models (LLMs) beyond Qwen to show how well the method generalizes?
- By leaving out micro-expressions and audio, is the system missing certain types of anomalies? Would adding multimodal data help?
- Please report the time required for HGTree construction, anomaly scoring, refinement, and final inference so we can judge whether the method is efficient enough for scenarios like real-time monitoring.
- It would be helpful to provide an analysis of some failure cases.
Limitations
- The method relies only on visual information and ignores other important anomaly cues like audio (e.g., the XD-Violence dataset includes audio, but it is not used).
- It does not discuss cross-domain generalization — for example, the GEBD model is trained on Kinetics, which differs significantly from surveillance videos.
- The impact of GEBD boundary detection failures is not thoroughly analyzed.
Final Justification
The rebuttal has largely addressed my concerns; however, I still believe that VADTree requires excessive preprocessing time. Therefore, I have decided to maintain my original score.
Formatting Issues
N/A
Q1: How do boundary-detection errors from GEBD in complex scenes—such as crowded areas or dynamic backgrounds—affect the HGTree structure? It does not discuss cross-domain generalization — for example, the GEBD model is trained on Kinetics, which differs significantly from surveillance videos.
A1: Thanks for your comments. Taking the video Normal_Videos_758_x264.mp4 from the UCF-Crime dataset as an example, this footage captures a scene from a roadside market surveillance camera with relatively more people appearing in the frame. It can be observed that in such complex scenes lacking a clear dominant subject, GEBD produces event boundaries with relatively low confidence scores. The model only generates high-confidence valid event boundaries when pedestrians enter or exit near the frame edges. The event boundaries are distributed uniformly yet sparsely throughout the video sequence, resulting in HGTree nodes of comparable durations that effectively shift the sampling approach toward fixed-length temporal window sampling.
In dynamic background scenarios, HGTree node generation adapts to motion dynamics by tracking foreground changes in stable scenes. When foreground and background demonstrate comparable motion intensities, node generation becomes influenced by background dynamics. The multi-granularity approach improves event fragment coverage compared to single segmentation methods.
The quality of GEBD boundaries has only a limited impact on VAD performance. To achieve this robustness, we suppress the negative impact of low-quality GEBD boundaries in several ways: (1) a DFS strategy that prioritizes high-confidence boundaries while filtering unreliable ones via the γ_min threshold; (2) deduplication operations that eliminate redundant representations; and (3) a hierarchical structure design enhanced by the Inter-cluster Node Correlation mechanism, which compensates for boundary imperfections. In addition, we select different GEBD models trained on the Kinetics-GEBD dataset [59] to demonstrate the stability of the above strategies. Table S1 reveals a direct correlation between their VAD performance on UCF-Crime and their results on the original Kinetics-GEBD dataset. This confirms two key findings: (1) the quality of the GEBD model remains an influential factor, as improved GEBD implementations consistently yield better performance; (2) our architectural innovations demonstrate robust adaptability to boundary quality variations.
From a domain-shift perspective, the four datasets exhibit significant domain divergence: UCF-Crime's surveillance footage captures constrained real-world anomalies, XD-Violence features edited cinematic content with artificial effects, and MSAD covers controlled multi-scenario environments, while Kinetics-GEBD's in-the-wild YouTube videos offer open-world diversity and naturalistic human activities. Compared with UCF-Crime, XD-Violence and MSAD, Kinetics-GEBD therefore exhibits greater generality, characterized by open-world diversity, unrestricted motion patterns, and naturalistic interactions. Consequently, our GEBD-based pre-training framework offers comparable advantages for video understanding tasks, particularly in cross-domain anomaly detection scenarios. This paradigm mirrors established transfer-learning practice in computer vision: the conventional strategy of employing large-scale pre-trained backbones (e.g., ResNet-50 on ImageNet) to address domain shifts has proven effective across various downstream tasks, including object detection and image classification.
Table S1 Results of VADTree based on different GEBD models on UCF-Crime dataset
| GEBD Model | Kinetics-GEBD-Val F1(%) | Kinetics-GEBD-Test F1(%) | UCF-Crime AUC(%) |
|---|---|---|---|
| BasicGEBD-ResNet50 | 73.70 | 76.80 | 82.85 |
| EfficientGEBD-ResNet18-L4 | 78.20 | - | 84.70 |
| EfficientGEBD-ResNet50-L4 (Ours) | 78.64 | 78.7 | 84.71 |
Q2: Could you test other large language models (LLMs) beyond Qwen to show how well the method generalizes?
A2: Thanks for your comments. To comprehensively evaluate generalizability, we conduct rigorous testing with alternative model architectures. Specifically, we select InternVL2_5-8B as an additional VLM with distinct input specifications of 32 frames at 448×448 resolution, in contrast with LLaVA-NeXT-Video-7B's 64 frames at 336×336. Furthermore, we incorporate t5gemma-9B-2B as an additional LLM variant featuring a unique 9B-encoder and 2B-decoder configuration. The experimental results in Table S2 demonstrate consistent outperformance of all three combinations over existing training-free VAD benchmarks, including LAVAD's 80.28 AUC and EventVAD's 82.03 AUC. This cross-architectural superiority provides conclusive evidence of our framework's robust generalization capabilities.
Table S2 Comparison of performance of VADTree under different VLM and LLM on UCF-Crime dataset
| VLM | LLM | AUC(%) |
|---|---|---|
| LLaVA-NeXT-Video-7B | DeepSeek-R1-Distill-Qwen-14B | 84.71 |
| InternVL2_5-8B | DeepSeek-R1-Distill-Qwen-14B | 83.74 |
| LLaVA-NeXT-Video-7B | t5gemma-9B-2B | 84.00 |
Q3: By leaving out micro-expressions and audio, is the system missing certain types of anomalies? Would adding multimodal data help?
A3: Thanks for your comments. Ignoring the audio modality would indeed lead to missing visually occluded but audibly salient anomalies like gunshots. To address this issue, we design a simple training-free video anomaly detection framework incorporating audio by employing Kimi-Audio-7B-Instruct as our AudioLLM with the prompt "Please generate the audio caption" to extract textual audio descriptions for each HGTree video node, thereby unifying visual and audio semantics into the text domain for joint LLM reasoning. Our XD-Violence experiments (Table S3) using LLaVA-NeXT-Video-7B (VLM) and t5gemma-9B-2B (LLM) show consistent performance gains with audio integration. For fair comparison with existing baselines, we restrict our evaluation to visual information only. The analysis of the audio modality will be comprehensively addressed in the revised manuscript.
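For illustration, a minimal sketch of this late-fusion idea, with audio and visual captions unified in the text domain before LLM scoring; the prompt wording and captions are made-up examples, not the exact prompts used.

```python
def build_node_prompt(visual_caption: str, audio_caption: str) -> str:
    """Unify visual and audio semantics in the text domain for joint LLM
    reasoning (a minimal sketch; the exact prompt in the system may differ)."""
    return (
        "Visual description: " + visual_caption + "\n"
        "Audio description: " + audio_caption + "\n"
        "Based on both descriptions, rate how anomalous this video segment is "
        "from 0 (normal) to 1 (clearly anomalous) and explain why."
    )

print(build_node_prompt("A crowd suddenly runs away from a building entrance.",
                        "Two loud bangs followed by screaming."))
```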
Regarding micro-expressions, our analysis of current datasets reveals significant detection challenges: 1) prevailing surveillance conditions (long distances, suboptimal angles) rarely capture micro-expressions effectively; 2) no dedicated micro-expression VAD benchmark exists for proper methodological evaluation. These limitations will be explicitly discussed in the revised manuscript.
Table S3 Comparison of performance of VADTree under different audio modality configurations on XD-Violence dataset
| Methods | AudioLLM | AUC(%) | PR(%) |
|---|---|---|---|
| VADTree-Coarse | - | 87.59 | 62.07 |
| VADTree-Fine | - | 87.81 | 63.00 |
| VADTree | - | 88.28 | 63.22 |
| VADTree-Coarse | Kimi-Audio-7B-Instruct | 88.52 | 63.20 |
| VADTree-Fine | Kimi-Audio-7B-Instruct | 88.94 | 64.69 |
| VADTree | Kimi-Audio-7B-Instruct | 89.42 | 64.96 |
Q4: Please report the time required for HGTree construction, anomaly scoring, refinement, and final inference so we can judge whether the method is efficient enough for scenarios like real-time monitoring.
A4: Thanks for your comments. Current training-free VAD methods are not optimized for real-time inference, as our primary objective is to develop a system requiring neither training-data collection nor model training. The computational bottlenecks primarily stem from the sequential processing pipeline involving video caption generation and anomaly-score reasoning.
We calculate the total inference time (GPU hours) of LAVAD and VADTree on an NVIDIA GeForce RTX 3090 GPU in Table S4. The time consumption of the VLM Caption, LLM Summary and LLM Scoring parts of LAVAD is estimated. As shown in Table S4, our proposed VADTree requires slightly higher total GPU hours than LAVAD. Even VADTree-Coarse (82.81 AUC) achieves training-free SoTA performance on the UCF-Crime dataset while consuming 16.5 GPU hours, outperforming both LAVAD (80.28) and the recently published EventVAD (82.03, ACM MM 2025).
Table S4 Component-level inference time consumption analysis of VADTree and LAVAD on UCF-Crime
| Methods | HGTree construction | Video/Text Encoding | VLM Caption | LLM Summary | LLM Scoring | Total(GPU hours) | AUC(%) |
|---|---|---|---|---|---|---|---|
| LAVAD | - | 5.1h | 20h | 7.7h×2 | 7.7h×2 | 55.9 | 80.28 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 2.8h×2 | 16.5 | 82.81 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 7.9h×2 | 46.1 | 83.05 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 10.7h×2 | 62.3 | 84.71 |
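As a reading aid for Table S4, the "h×2" entries indicate that the corresponding stage contributes twice to the total. The short arithmetic check below (values copied from the table) reproduces the reported totals under that reading; it is only a sanity check, not part of the method.

```python
# Each total equals: HGTree + encoding + caption*(cap_mult) + 2*summary + 2*scoring.
# For LAVAD the caption column is already a total (cap_mult=1); for VADTree it is "×2".
rows = {
    #                 tree  enc   cap  cap_mult summary scoring
    "LAVAD":          (0.0, 5.1, 20.0, 1,       7.7,    7.7),
    "VADTree-Coarse": (0.3, 0.2,  5.2, 2,       0.0,    2.8),
    "VADTree-Fine":   (0.3, 0.4, 14.8, 2,       0.0,    7.9),
    "VADTree":        (0.3, 0.6, 20.0, 2,       0.0,   10.7),
}
for name, (tree, enc, cap, cap_mult, summ, score) in rows.items():
    total = tree + enc + cap * cap_mult + 2 * summ + 2 * score
    print(f"{name}: {total:.1f} GPU hours")   # 55.9, 16.5, 46.1, 62.3
```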
Q5: It would be helpful to provide an analysis of some failure cases.
A5: Thanks for your comments. For some videos with continuous lag and frame-skipping artifacts (e.g., Arrest039_x264.mp4 in UCF-Crime), GEBD mistakenly identifies skipped frames as event boundaries. This segmentation error pushes VADTree toward fixed-length window sampling, while the artificially high confidence at frame-skipping moments obscures genuine event boundaries.
For subtle anomalies like lighter explosions (Explosion027_x264.mp4), VLM is prone to ignore these details when conducting inference, and focuses on prominent scene elements (e.g., people, table). An important research direction in the future is to design VAD methods that can independently discover these subtle anomalies. These case studies will be expanded in the revised manuscript.
Dear Reviewer,
Thank you again for the great efforts and valuable comments. We have carefully addressed the main concerns in detail. We hope you might find the response satisfactory. As the discussion phase is about to close, we are very much looking forward to hearing from you about any further feedback. We will be very happy to clarify any further concerns (if any).
Best, Authors
This paper introduces VADTree, a novel, training-free, and explainable method for video anomaly detection (VAD). The method builds a hierarchical tree structure of normal events with different granularities, enabling detection and explanation of anomalies without any prior training. VADTree computes node-level similarities using spatial-temporal features (CLIP and ViT) and performs anomaly detection by evaluating abnormal paths in the tree. The method is evaluated on VAD datasets (UCF-Crime, XD-Violence,), outperforming many training-based methods and offering transparent explanations.
Strengths and Weaknesses
Strengths
- Novel Hierarchical Sampling: Introduces HGTree to overcome the rigid fixed-length windowing problem in prior works; allows adaptive temporal granularity.
- Training-free & Explainable: Does not require training on target datasets and provides clear language-based explanations of anomalies.
- Strong Performance: Outperforms other training-free methods and even some weakly-supervised baselines on two challenging VAD datasets.
Weaknesses
- High Computational Complexity: Uses multiple pre-trained large models with potentially high inference costs—though no detailed runtime analysis is given.
- Limited Modality Scope: Only visual information is considered—no audio or multi-modal extensions despite availability in some datasets.
- Limited Generalization Proof: Only tested on two datasets—generalizability to other video domains (e.g., traffic, medical) remains unproven.
- Subjective Evaluation of Explanations: While textual explanations are provided, there is no quantitative evaluation of their accuracy or usefulness.
Questions
- What is the computational cost (in FLOPs or GPU hours) of VADTree during inference compared to LAVAD or other training-free methods?
- How does VADTree perform on datasets beyond surveillance and violence detection, such as driving videos or sports anomaly datasets?
- How sensitive is the method to the quality of GEBD boundaries?
Limitations
- Limited Modality Use: Does not exploit audio cues, which are relevant for datasets like XD-Violence.
- Potential Overfitting to GEBD: Performance heavily depends on the capability of the pre-trained GEBD model; domain-shifted videos may degrade performance.
Final Justification
The work and rebuttal are good. I have increased the rating to Borderline Accept. There are some concerns that the performance improvement and generalization are contributed by large models such as 7B, 14B and larger models; if smaller models were used, the benefits might not be as obvious. I think that can be included in future work.
Formatting Issues
Figures:
- Figure 1 is informative but cluttered; arrows and stages could be more clearly separated.
Q1: What is the computational cost (in FLOPs or GPU hours) of VADTree during inference compared to LAVAD or other training-free methods?
A1: Thanks for your comments. We compare the parameter counts of each module when LAVAD and VADTree reach their highest AUC on the UCF-Crime dataset in Table S1. According to the performance report of LAVAD, its VLM Caption module integrates the results of five BLIP-2 models. The total parameter count used by LAVAD (79.6B) is 3.6 times that of our VADTree (22.2B).
Table S1 Component-level parameter analysis of VADTree and LAVAD
| Methods | HGTree construction | Video/Text Encoding | VLM Caption | LLM Summary | LLM Scoring | Total |
|---|---|---|---|---|---|---|
| LAVAD | - | ImageBind_Huge-1.2B | OPT-6.7B×2 + FLAN-T5-XL-3B×2 + FLAN-T5-XXL-33B | Llama-2-13B-chat | Llama-2-13B-chat | 79.6B |
| VADTree | ResNet50-25.6M | ImageBind_Huge-1.2B | LLaVA-NeXT-Video-7B | - | DeepSeek-R1-Distill-Qwen-14B | 22.2B |
We calculate the total inference time (GPU hours) of LAVAD and VADTree on an NVIDIA GeForce RTX 3090 GPU in Table S2. The time consumption of the VLM Caption, LLM Summary and LLM Scoring parts of LAVAD is estimated. As shown in Table S2, our proposed VADTree requires slightly higher total GPU hours than LAVAD. Even VADTree-Coarse (82.81 AUC) achieves training-free SoTA performance on the UCF-Crime dataset while consuming 16.5 GPU hours, outperforming both LAVAD (80.28) and the recently published EventVAD (82.03, ACM MM 2025).
Table S2 Component-level inference time consumption analysis of VADTree and LAVAD on UCF-Crime
| Methods | HGTree construction | Video/Text Encoding | VLM Caption | LLM Summary | LLM Scoring | Total(GPU hours) | AUC(%) |
|---|---|---|---|---|---|---|---|
| LAVAD | - | 5.1h | 20h | 7.7h×2 | 7.7h×2 | 55.9 | 80.28 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 2.8h×2 | 16.5 | 82.81 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 7.9h×2 | 46.1 | 83.05 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 10.7h×2 | 62.3 | 84.71 |
Q2: How does VADTree perform on datasets beyond surveillance and violence detection, such as driving videos or sports anomaly datasets?
A2: Thanks for your comments. Besides the driving videos in datasets UCF-Crime and XD-Violence, we conduct comprehensive experiments on the multi-scenario MSAD dataset (NeurIPS 2024), which includes 14 diverse scenarios with varying camera views, motion patterns, and challenging conditions like different lighting and weather. The dataset covers various anomaly types including traffic accidents (train/car traffic accidents) and abnormal human behaviors (falls, drowning), providing a robust benchmark for cross-domain evaluation. Our experimental results (Table S3) demonstrate that VADTree outperforms all current weakly supervised methods in AUC metrics, including the latest published VAD (88.68, CVPR 2025), while maintaining competitive performance across other evaluation metrics.
Table S3 Results on the MSAD dataset
| Methods | Supervision | AUC(%) | | | |
|---|---|---|---|---|---|
| RTFM | Weakly Supervised | 86.65 | - | - | - |
| MGFN | Weakly Supervised | 84.96 | - | - | - |
| TEVAD | Weakly Supervised | 86.82 | - | - | - |
| UR-DMU | Weakly Supervised | 85.78 | 67.95 | 67.35 | 75.30 |
| VAD (CVPR 2025) | Weakly Supervised | 88.68 | 71.25 | 71.26 | 77.86 |
| VADTree(Ours) | Training-Free | 89.32 | 67.85 | 71.41 | 75.49 |
Q3: How sensitive is the method to the quality of GEBD boundaries? Domain-shifted videos may degrade performance.
A3: Thanks for your comments. The quality of GEBD boundaries has only a limited impact on VAD performance. To achieve this robustness, we suppress the negative impact of low-quality GEBD boundaries in several ways: (1) a DFS strategy that prioritizes high-confidence boundaries while filtering unreliable ones via the γ_min threshold; (2) deduplication operations that eliminate redundant representations; and (3) a hierarchical structure design enhanced by the Inter-cluster Node Correlation mechanism, which compensates for boundary imperfections. In addition, we select different GEBD models trained on the Kinetics-GEBD dataset [59] to demonstrate the stability of the above strategies. Table S4 reveals a direct correlation between their VAD performance on UCF-Crime and their results on the original Kinetics-GEBD dataset. This confirms two key findings: (1) the quality of the GEBD model remains an influential factor, as improved GEBD implementations consistently yield better performance; (2) our architectural innovations demonstrate robust adaptability to boundary quality variations.
From a domain-shift perspective, the four datasets exhibit significant domain divergence: UCF-Crime's surveillance footage captures constrained real-world anomalies, XD-Violence features edited cinematic content with artificial effects, and MSAD covers controlled multi-scenario environments, while Kinetics-GEBD's in-the-wild YouTube videos offer open-world diversity and naturalistic human activities. Compared with UCF-Crime, XD-Violence and MSAD, Kinetics-GEBD therefore exhibits greater generality, characterized by open-world diversity, unrestricted motion patterns, and naturalistic interactions. Consequently, our GEBD-based pre-training framework offers comparable advantages for video understanding tasks, particularly in cross-domain anomaly detection scenarios. This paradigm mirrors established transfer-learning practice in computer vision: the conventional strategy of employing large-scale pre-trained backbones (e.g., ResNet-50 on ImageNet) to address domain shifts has proven effective across various downstream tasks, including object detection and image classification.
Table S4 Results of VADTree based on different GEBD models on UCF-Crime dataset
| GEBD Model | Kinetics-GEBD-Val F1(%) | Kinetics-GEBD-Test F1(%) | UCF-Crime AUC(%) |
|---|---|---|---|
| BasicGEBD-ResNet50 | 73.70 | 76.80 | 82.85 |
| EfficientGEBD-ResNet18-L4 | 78.20 | - | 84.70 |
| EfficientGEBD-ResNet50-L4 (Ours) | 78.64 | 78.7 | 84.71 |
Q4: While textual explanations are provided, there is no quantitative evaluation of their accuracy or usefulness.
A4: Thanks for your comments. To quantitatively assess explanation quality, we measure the semantic similarity between anomaly category labels and video captions using the paraphrase-multilingual-MiniLM-L12-v2 model, which encodes text into 384-dimensional dense vectors for cosine similarity computation. Our evaluation protocol on UCF-Crime calculates the mean maximum similarity across all anomaly instances, with caption processing standardized by splitting text at sentence boundaries to prevent the key information from being overwhelmed by long sentences. To ensure fair comparison, we employ LAVAD's cleaned summaries as captions for their method while utilizing VLM-generated descriptions as captions for our VADTree. As evidenced in Table S5, our approach demonstrates superior semantic alignment with anomaly labels compared to LAVAD across multiple VLM configurations. Also, LAVAD's cleaning process itself shows measurable improvements over unprocessed outputs. This systematic evaluation framework objectively validates the textual explanations' relevance to the detected anomalies.
Table S5 Quantitative analysis of the accuracy of abnormal interpretation using different methods
| Methods | mean max similarity |
|---|---|
| LAVAD-unclean | 0.3598 |
| LAVAD | 0.4099 |
| VADTree-LLaVA-NeXT-Video-7B | 0.4820 |
| VADTree-InternVL2_5-8B | 0.5251 |
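A minimal sketch of this evaluation protocol, assuming the sentence-transformers library; the label and caption below are made-up examples, and only the encoder name comes from the rebuttal.

```python
from sentence_transformers import SentenceTransformer, util

# Sketch of the mean-max-similarity protocol from A4.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

label = "burglary"
caption = ("A person pries open a car door at night. Debris lies on the ground. "
           "The scene looks like an attempted break-in.")
sentences = [s.strip() for s in caption.split(".") if s.strip()]   # split at sentence boundaries

label_emb = model.encode(label, convert_to_tensor=True)            # 384-d dense vector
sent_embs = model.encode(sentences, convert_to_tensor=True)
max_sim = util.cos_sim(label_emb, sent_embs).max().item()          # best-matching sentence
print(f"max similarity for this instance: {max_sim:.4f}")
# The reported score averages this per-instance maximum over all anomaly instances.
```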
Q5: Only visual information is considered—no audio or multi-modal extensions despite availability in some datasets.
A5: Thanks for your comments. Ignoring the audio modality would indeed lead to missing visually occluded but audibly salient anomalies like gunshots. To address this issue, we design a simple training-free video anomaly detection framework incorporating audio by employing Kimi-Audio-7B-Instruct as our AudioLLM with the prompt "Please generate the audio caption" to extract textual audio descriptions for each HGTree video node, thereby unifying visual and audio semantics into the text domain for joint LLM reasoning. Our XD-Violence experiments (Table S6) using LLaVA-NeXT-Video-7B (VLM) and t5gemma-9B-2B (LLM) show consistent performance gains with audio integration. For fair comparison with existing baselines, we restrict our evaluation to visual information only. The analysis of the audio modality will be comprehensively addressed in the revised manuscript.
Table S6. Comparison of performance of VADTree under different audio modality configurations on XD-Violence dataset
| Methods | AudioLLM | AUC(%) | PR(%) |
|---|---|---|---|
| VADTree-Coarse | - | 87.59 | 62.07 |
| VADTree-Fine | - | 88.28 | 63.22 |
| VADTree | - | 88.28 | 63.22 |
| VADTree-Coarse | Kimi-Audio-7B-Instruct | 88.52 | 63.20 |
| VADTree-Fine | Kimi-Audio-7B-Instruct | 88.94 | 64.94 |
| VADTree | Kimi-Audio-7B-Instruct | 89.42 | 64.94 |
Appreciate your rebuttal. So in conclusion, the proposed approach is more computationally expensive than LAVAD. Besides, the generalization capability of Kinetics-GEBD is evaluated on these datasets, but it is still difficult to validate.
Clarification Regarding Table S6 in Rebuttal:
Regarding Table S6 in our rebuttal, we identified three data entry errors during final verification. The corrected table should read:
Table S6. Comparison of performance of VADTree under different audio modality configurations on XD-Violence dataset (corrected)
| Methods | ALM | AUC(%) | PR(%) |
|---|---|---|---|
| VADTree-Coarse | - | 87.59 | 62.07 |
| VADTree-Fine | - | 87.81 | 63.00 |
| VADTree | - | 88.28 | 63.22 |
| VADTree-Coarse | Kimi-Audio-7B-Instruct | 88.52 | 63.20 |
| VADTree-Fine | Kimi-Audio-7B-Instruct | 88.94 | 64.69 |
| VADTree | Kimi-Audio-7B-Instruct | 89.42 | 64.96 |
These errors pertain only to the entries noted above. All other results in Table S6 and throughout the rebuttal remain accurate. This correction does not affect the overall findings or conclusions presented in the rebuttal regarding the effectiveness of VADTree and the Kimi-Audio-7B-Instruct module.
Thank you again for the great efforts and valuable comments. We have carefully addressed the main concerns in detail. We hope you might find the response satisfactory. As the discussion phase is about to close, we are very much looking forward to hearing from you about any further feedback. We will be very happy to clarify any further concerns (if any).
The computational analysis of our VADTree and LAVAD
Thanks again for your insightful discussion. As a training-free approach, the inference time of our proposed VADTree correlates positively with the size of the pre-trained LLMs and VLMs it uses. As shown in Table C1, multiple variants of our method achieve superior performance compared to LAVAD with significantly reduced inference time.
A closer examination of Table S2 and Table C1 reveals the following key observations: 1) VADTree-Coarse requires less than 30% of LAVAD's GPU hours (16.5 vs. 55.9) while achieving a 2.53% higher AUC (82.81 vs. 80.28) on UCF-Crime. This confirms that our method achieves a better trade-off between computational efficiency and detection accuracy compared to LAVAD. 2) Our VADTree framework is highly flexible, with the core HGTree construction process being computationally efficient. Both the VLM and LLM components are modular, allowing adjustments based on computational constraints. Importantly, high-cost inference models are not essential for VADTree's effectiveness. 3) The default VADTree's inference time is primarily driven by the reasoning phase of DeepSeek-R1-Distill-Qwen-14B-think in the LLM scoring module. Replacing it with faster variants (e.g., DeepSeek-R1-Distill-Qwen-14B-no-think or t5gemma-9B-2B) significantly reduces inference time. We intentionally preserve the "thinking" process because it generates valuable intermediate reasoning steps that significantly enhance anomaly interpretation. Even without this phase, our variants outperform LAVAD in AUC while maintaining lower inference times in all cases.
Table C1 Component-level inference time consumption analysis of VADTree and LAVAD on UCF-Crime. Bold indicates that the total time consumption is lower than LAVAD
| Methods | HGTree construction | Video/Text Encoding | VLM Caption | LLM Summary | LLM Scoring | Total(GPU hours) | AUC(%) |
|---|---|---|---|---|---|---|---|
| LAVAD | - | 5.1h | 20h | 7.7h×2 | 7.7h×2 | 55.9 | 80.28 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 2.8h×2 (DeepSeek-R1-Distill-Qwen-14B-think) | 16.5 | 82.81 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 7.9h×2 (DeepSeek-R1-Distill-Qwen-14B-think) | 46.1 | 83.05 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 10.7h×2 (DeepSeek-R1-Distill-Qwen-14B-think) | 62.3 | 84.71 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 0.6h×2 (DeepSeek-R1-Distill-Qwen-14B-no think) | 12.1 | 82.83 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 1.2h×2 (DeepSeek-R1-Distill-Qwen-14B-no think) | 32.7 | 82.72 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 1.8h×2 (DeepSeek-R1-Distill-Qwen-14B-no think) | 44.5 | 84.65 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 0.1h×2 (t5gemma-9B-2B) | 11.1 | 82.21 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 0.2h×2 (t5gemma-9B-2B) | 30.0 | 82.19 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 0.3h×2 (t5gemma-9B-2B) | 41.5 | 84.00 |
In the following two comments, we conduct two experiments to address the concerns about the generalization capability of Kinetics-GEBD: (1) a comprehensive evaluation using multiple generalization metrics, and (2) comparative benchmarking against other scene detection methods (video segmentation).
To directly evaluate the effectiveness of different GEBD models in generalized video segmentation, we propose three key metrics: (1) Number of Segments (NoS), measuring computational efficiency by counting sampled segments (lower is better); (2) Mean Intersection over Union (mIoU), assessing event coverage quality through maximum temporal IoU between segments and ground-truth events (higher is better); (3) Mean Intersection Frames (mIF), quantifying the average overlapping frames between top-matched segments and events. Together, these metrics comprehensively evaluate segmentation efficiency (NoS), precision (mIoU), and granularity (mIF) for VAD tasks.
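For clarity, a minimal sketch of how mIoU and mIF can be computed from segment and event intervals; the interval values are illustrative and this is not the exact evaluation code.

```python
def miou_and_mif(segments, events):
    """For each ground-truth event, take the best-matching segment by temporal
    IoU; mIoU averages that IoU over events and mIF averages the overlapping
    frame count. Intervals are (start, end) frame pairs."""
    ious, ifs = [], []
    for es, ee in events:
        best_iou, best_if = 0.0, 0
        for ss, se in segments:
            inter = max(0, min(ee, se) - max(es, ss))
            union = (ee - es) + (se - ss) - inter
            iou = inter / union if union > 0 else 0.0
            if iou > best_iou:
                best_iou, best_if = iou, inter
        ious.append(best_iou)
        ifs.append(best_if)
    return sum(ious) / len(ious), sum(ifs) / len(ifs)

# One 170-frame anomaly against three sampled segments.
print(miou_and_mif(segments=[(0, 100), (100, 260), (260, 400)], events=[(80, 250)]))
```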
As shown in Tables C2, C3, and C4, the NoS generated by the GEBD model in VADTree is significantly smaller than that of LAVAD and VERA, which rely on a sliding TW (temporal window); this reduces the inference redundancy of the VLM and LLM. At the same time, the mIoU and mIF metrics show that the video segments generated by VADTree match the intervals of abnormal events more effectively, thereby providing a more complete event context to support subsequent anomaly perception and inference. In particular, VADTree-Initialization, the native video representation produced directly by the GEBD model, is consistently ahead of the sliding-TW-based methods in mIoU and mIF, which directly demonstrates the good generalization of the GEBD model on current mainstream VAD data.
Table C2. Effectiveness analysis of different GEBD models based on Kinetics GEBD for video sampling on the UCF-Crime dataset.
| GEBD model | Methods | TW Length | Stride | NoS↓ | mIoU↑ | mIF↑ |
|---|---|---|---|---|---|---|
| - | sliding TW | 5s | 5s | 7558 | 0.41 | 122 |
| - | sliding TW | 10s | 10s | 3852 | 0.40 | 191 |
| - | sliding TW | 20s | 20s | 1994 | 0.33 | 265 |
| - | sliding TW | 10s | 16f (LAVAD, VERA) | 69634 | 0.51 | 210 |
| EfficientGEBD-ResNet50-L4 | VADTree-Coarse | - | - | 2248 | 0.37 | 369 |
| EfficientGEBD-ResNet50-L4 | VADTree-Fine | - | - | 6365 | 0.40 | 233 |
| EfficientGEBD-ResNet50-L4 | VADTree | - | - | 8613 | 0.47 | 343 |
| EfficientGEBD-ResNet50-L4 | VADTree-Initialization | - | - | 12440 | 0.52 | 456 |
| BasicGEBD-ResNet50 | VADTree-Coarse | - | - | 1506 | 0.32 | 443 |
| BasicGEBD-ResNet50 | VADTree-Fine | - | - | 4669 | 0.38 | 283 |
| BasicGEBD-ResNet50 | VADTree | - | - | 6175 | 0.44 | 409 |
| BasicGEBD-ResNet50 | VADTree-Initialization | - | - | 9048 | 0.48 | 459 |
| EfficientGEBD-ResNet18-L4 | VADTree-Coarse | - | - | 2906 | 0.37 | 344 |
| EfficientGEBD-ResNet18-L4 | VADTree-Fine | - | - | 8077 | 0.39 | 212 |
| EfficientGEBD-ResNet18-L4 | VADTree | - | - | 10983 | 0.48 | 312 |
| EfficientGEBD-ResNet18-L4 | VADTree-Initialization | - | - | 15864 | 0.56 | 469 |
Table C3. Effectiveness analysis of different GEBD models based on Kinetics GEBD for video sampling on the XD-Violence dataset.
| GEBD model | Methods | TW Length | Stride | NoS↓ | mIoU↑ | mIF↑ |
|---|---|---|---|---|---|---|
| - | sliding TW | 5s | 5s | 19965 | 0.40 | 87 |
| - | sliding TW | 10s | 10s | 10199 | 0.36 | 133 |
| - | sliding TW | 20s | 20s | 5302 | 0.30 | 189 |
| - | sliding TW | 10s | 16f (LAVAD, VERA) | 146449 | 0.44 | 144 |
| EfficientGEBD-ResNet50-L4 | VADTree-Coarse | - | - | 19984 | 0.42 | 197 |
| EfficientGEBD-ResNet50-L4 | VADTree-Fine | - | - | 38016 | 0.41 | 117 |
| EfficientGEBD-ResNet50-L4 | VADTree | - | - | 58000 | 0.50 | 191 |
| EfficientGEBD-ResNet50-L4 | VADTree-Initialization | - | - | 75232 | 0.64 | 409 |
| BasicGEBD-ResNet50 | VADTree-Coarse | - | - | 16933 | 0.42 | 219 |
| BasicGEBD-ResNet50 | VADTree-Fine | - | - | 33528 | 0.41 | 127 |
| BasicGEBD-ResNet50 | VADTree | - | - | 50461 | 0.50 | 214 |
| BasicGEBD-ResNet50 | VADTree-Initialization | - | - | 66256 | 0.62 | 407 |
| EfficientGEBD-ResNet18-L4 | VADTree-Coarse | - | - | 21200 | 0.42 | 187 |
| EfficientGEBD-ResNet18-L4 | VADTree-Fine | - | - | 39775 | 0.40 | 113 |
| EfficientGEBD-ResNet18-L4 | VADTree | - | - | 60975 | 0.49 | 183 |
| EfficientGEBD-ResNet18-L4 | VADTree-Initialization | - | - | 78750 | 0.63 | 409 |
Table C4. Effectiveness analysis of different GEBD models based on Kinetics GEBD for video sampling on the MSAD dataset.
| GEBD model | Methods | TW Length | Stride | NoS↓ | mIoU↑ | mIF↑ |
|---|---|---|---|---|---|---|
| - | sliding TW | 5s | 5s | 1076 | 0.46 | 112 |
| - | sliding TW | 10s | 10s | 599 | 0.52 | 175 |
| - | sliding TW | 20s | 20s | 380 | 0.6 | 247 |
| - | sliding TW | 10s | 16f (LAVAD, VERA) | 9150 | 0.68 | 192 |
| EfficientGEBD-ResNet50-L4 | VADTree-Coarse | - | - | 561 | 0.58 | 208 |
| EfficientGEBD-ResNet50-L4 | VADTree-Fine | - | - | 1008 | 0.52 | 152 |
| EfficientGEBD-ResNet50-L4 | VADTree | - | - | 1569 | 0.62 | 197 |
| EfficientGEBD-ResNet50-L4 | VADTree-Initialization | - | - | 1776 | 0.73 | 265 |
| BasicGEBD-ResNet50 | VADTree-Coarse | - | - | 426 | 0.60 | 222 |
| BasicGEBD-ResNet50 | VADTree-Fine | - | - | 641 | 0.56 | 191 |
| BasicGEBD-ResNet50 | VADTree | - | - | 1067 | 0.62 | 221 |
| BasicGEBD-ResNet50 | VADTree-Initialization | - | - | 1042 | 0.70 | 272 |
| EfficientGEBD-ResNet18-L4 | VADTree-Coarse | - | - | 644 | 0.57 | 189 |
| EfficientGEBD-ResNet18-L4 | VADTree-Fine | - | - | 1208 | 0.52 | 145 |
| EfficientGEBD-ResNet18-L4 | VADTree | - | - | 1852 | 0.61 | 188 |
| EfficientGEBD-ResNet18-L4 | VADTree-Initialization | - | - | 2176 | 0.75 | 270 |
To further validate the generalization capability of GEBD models, we conduct comparative experiments with SceneDetect, a widely adopted open-source video scene detection framework that operates without Kinetics-GEBD pretraining. Under identical experimental conditions except for the scene detection method, our evaluation demonstrates that VADTree equipped with a Kinetics-GEBD pre-trained model achieves superior performance, as evidenced by the significantly higher AUC scores in Table C5. These empirical results indicate that GEBD models pretrained on dynamic datasets exhibit enhanced transferability to VAD tasks compared to conventional scene detection approaches, thereby reinforcing the effectiveness of our proposed methodology.
Table C5 Results of VADTree based on different scene detection methods on UCF-Crime dataset
| Methods | VADTree-Coarse | VADTree-Fine | VADTree |
|---|---|---|---|
| SceneDetect | 79.64 | 76.96 | 80.00 |
| BasicGEBD-ResNet50 | 81.85 | 81.93 | 82.85 |
| EfficientGEBD-ResNet18-L4 | 82.09 | 83.04 | 84.70 |
| EfficientGEBD-ResNet50-L4 | 82.81 | 83.05 | 84.71 |
This paper proposes VADTree, a training-free video anomaly detection framework that leverages hierarchical granularity-aware trees and pre-trained models to achieve adaptive sampling and multi-granularity reasoning for anomaly detection.
Strengths and Weaknesses
Strengths:
1. The paper presents a novel and well-structured framework that effectively addresses the limitations of fixed-length window sampling in video anomaly detection.
2. The HGTree design is innovative and demonstrates strong adaptability to varying anomaly durations.
3. The integration of pre-trained models provides a practical and scalable solution without requiring additional training data.
Weaknesses:
1. The HGTree construction process appears computationally intensive. Are there any optimizations to reduce inference time?
2. The paper relies solely on late fusion for multimodal integration. Have you explored more sophisticated fusion methods like cross-attention or transformer-based approaches that might capture deeper modality interactions?
3. Some implementation details are missing. For example, how is the γ_min threshold determined? What's the impact of this parameter on performance?
Questions
see weakness
Limitations
see weakness
Final Justification
All my concerns have been addressed. I improve my rating.
Formatting Issues
n/a
Q1: The HGTree construction process appears computationally intensive. Are there any optimizations to reduce inference time?
A1: Thanks for your comments. The HGTree construction process achieves significant performance gains while introducing only a minimal and acceptable computational overhead.
The complete HGTree construction involves three sequential stages: (1) Segmentation Confidence Sequence generation, (2) Generic Event Node Initialization, and (3) Adaptive Node Stratification. The first stage employs a pre-trained GEBD model for initial boundary detection. For computational complexity analysis, let V, F, P, and S represent the number of videos, frames per video, detected peaks, and final segments respectively. The latter two stages exhibit a combined complexity of O(V×F) + O(V×P²) + O(V×S²), which simplifies to O(V×F) as the dominant term given the hierarchical relationship F ≫ P > S. To optimize efficiency, we downsample input videos from 30FPS to 10FPS to reduce computational load, and employ a confidence threshold to eliminate low-quality peaks during processing.
Table S1 Time-consuming analysis of HGTree construction and its components on three datasets
| Dataset | Total duration of all videos | Segmentation Confidence Sequence | Generic Event Node Initialization and Adaptive Node Stratification | HGTree construction total time-consuming |
|---|---|---|---|---|
| XD-Violence | 27.03h | 1.14h | 7.23s | 1.14h+7.23s |
| UCF-Crime | 10.29h | 0.30h | 2.2s | 0.30h+2.2s |
| MSAD | 1.39h | 0.25h | 0.43s | 0.25h+0.43s |
As shown in Table S1, experimental results on an AMD EPYC 9654 CPU and NVIDIA RTX 3090 GPU confirm that HGTree construction introduces only modest computational overhead relative to its performance benefits. We also calculate the total inference time (GPU hours) of LAVAD and VADTree under the same configuration and present the comparison in Table S2. The time consumption of the VLM Caption, LLM Summary and LLM Scoring parts of LAVAD is estimated. As shown in Table S2, our proposed VADTree requires slightly higher total GPU hours than LAVAD. Even VADTree-Coarse (82.81 AUC) achieves training-free SoTA performance on the UCF-Crime dataset while consuming 16.5 GPU hours, outperforming both LAVAD (80.28) and the recently published EventVAD (82.03, ACM MM 2025).
Table S2 Component-level inference time consumption analysis of VADTree and LAVAD on UCF-Crime
| Methods | HGTree construction | Video/Text Encoding | VLM Caption | LLM Summary | LLM Scoring | Total(GPU hours) | AUC(%) |
|---|---|---|---|---|---|---|---|
| LAVAD | - | 5.1h | 20h | 7.7h×2 | 7.7h×2 | 55.9 | 80.28 |
| VADTree-Coarse | 0.3h | 0.2h | 5.2h×2 | - | 2.8h×2 | 16.5 | 82.81 |
| VADTree-Fine | 0.3h | 0.4h | 14.8h×2 | - | 7.9h×2 | 46.1 | 83.05 |
| VADTree | 0.3h | 0.6h | 20.0h×2 | - | 10.7h×2 | 62.3 | 84.71 |
To reduce the inference time of GEBD, we could implement parallel video-processing pipelines to accelerate GEBD inference, and could also modify the GEBD pre-training protocol to filter low-confidence segments directly during inference rather than in post-processing.
Q2: The paper relies solely on late fusion for multimodal integration. Have you explored more sophisticated fusion methods like cross-attention or transformer-based approaches that might capture deeper modality interactions?
A2: Thanks for your comments. This paper integrates inter-cluster node correlation based on anomaly scores. Though advanced fusion techniques such as cross-attention or transformer-based approaches could potentially capture richer cross-modal interactions, their inherent dependence on learnable parameters renders them incompatible with our training-free architecture. Moreover, incorporating such methods would compromise the fairness of comparative evaluations against other training-free baselines. Future research directions may investigate the development of lightweight VAD systems with fewer parameters, where carefully designed attention mechanisms could be employed to more effectively leverage the complementary information embedded in both fine-grained and coarse-grained cluster scores, thereby enhancing multi-scale anomaly detection performance.
Q3: Some implementation details are missing. For example, how is the γ_min threshold determined? What's the impact of this parameter on performance?
A3: Thanks for your comments. The determination of the γ_min threshold is validated through the experiments in Table 4 (UCF-Crime dataset) and Table S3 (MSAD dataset). Our experimental results demonstrate that the selection of γ_min significantly impacts model performance. When γ_min is set too high (e.g., 0.5), the model tends to overlook subtle event boundaries, resulting in excessively long video segments that may miss important anomaly transitions, whereas an overly low value (e.g., 0.3) generates an excessive number of video segments, which not only increases computational overhead but also disrupts the semantic continuity of long-range anomaly patterns.
For other parameter settings, we provide detailed experimental results for the remaining hyperparameters in Appendix Tables 7 and 8, and for β in Appendix Table 9.
Table S3. Results of VADTree on the MSAD dataset with different γ_min values.
| γ_min | VADTree-Coarse | VADTree-Fine | VADTree |
|---|---|---|---|
| 0.3 | 87.01 | 86.71 | 89.32 |
| 0.4 | 86.79 | 86.19 | 88.56 |
| 0.5 | 84.18 | 82.97 | 85.33 |
Dear Reviewer,
I hope this message finds you well. As the discussion period is nearing its end with less than two days remaining, I wanted to ensure we have addressed all your concerns satisfactorily. If there are any additional points or feedback you'd like us to consider, please let us know. Your insights are invaluable to us, and we're eager to address any remaining issues to improve our work.
Thank you for your time and effort in reviewing our paper.
This paper introduces VADTree, a novel, training-free, and explainable method for video anomaly detection (VAD). The method builds a hierarchical tree structure of normal events with different granularities, enabling detection and explanation of anomalies without any prior training. VADTree computes node-level similarities using spatial-temporal features (CLIP and ViT) and performs anomaly detection by evaluating abnormal paths in the tree. The method is evaluated on VAD datasets.
Strengths:
+ This paper introduces HGTree to overcome the rigid fixed-length windowing problem in prior works, which allows adaptive temporal granularity.
+ The proposed training-free model is interesting and practical, and provides clear language-based explanations of anomalies.
+ Impressive performance is achieved, outperforming some weakly-supervised baselines on two challenging VAD datasets.
Weaknesses:
- The model introduces multiple pre-trained large models with potentially high inference costs.
- Limited modality scope: only visual information is considered; no audio or multi-modal extensions despite availability in some datasets.
- The experiments are evaluated on two datasets, and the generalizability seems unclear.
- Although textual explanations are provided, the subjective evaluation of explanations lacks a quantitative assessment of their accuracy or usefulness.
After the rebuttal, some concerns were addressed, such as VADTree's inference time and computational complexity, more efficient and flexible alternatives to the inter-cluster node correlation module, the impact of different VLM or LLM components on VADTree, and the generalizability of VADTree.
The ratings are 2 Borderline Accept and 2 Accept after the rebuttal. All reviewers think the rebuttals address the core concerns they raised.