PaperHub
Overall score: 6.4 / 10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0) · Average confidence: 4.5
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer

OpenReview · PDF
Submitted: 2025-04-20 · Updated: 2025-10-29
TL;DR

Generalized Video Anomaly Detection via Agentic AI Engineer

Abstract

Keywords
Video Anomaly Detection · AI Agent · Agentic AI Engineer · Multimodal Large Language Model

Reviews & Discussion

Review (Rating: 4)

The paper proposes a training-free model for VAD called PANDA, in which an LLM acts as the core reasoner. A four-stage pipeline is introduced: scene-specific rule snippets are first pulled from text, then a vision-language model evaluates videos against those rules; next, a reflection module invokes external tools such as super-resolution when confidence is low and re-runs the evaluation; finally, a memory system is implemented to guide later decisions.

Strengths & Weaknesses

Strengths:

  • The framework design, combining planning, reasoning, external tool use, and memory for generalist VAD, is a compelling idea that extends LLM agents to the video domain.

  • There is a comprehensive empirical study with four datasets and ablations on design choices.

  • The paper is very clear and easy to follow.

Weakness:

  • My main concern is the real-time performance of the model. VAD inherently relies on real-time systems, and catching an anomaly within a reasonable time frame is an important aspect of VAD. The authors report average inference speeds of 0.53–0.86 FPS across four datasets: the raw video stream is first down-sampled to one frame per second, and inference is carried out on 5-frame clips drawn from that 1 FPS stream, with the supplemental table reporting effective throughputs of 0.53–0.86 analysed frames per second. To my understanding, at 0.86 FPS the agent processes each sampled frame in about 1.16 s, so the backlog grows slowly; in practice the display or decision stream will drift behind real time by roughly one second per minute of video. At the lower end, 0.53 FPS approximately doubles that drift. For time-critical response scenarios the agent would require further optimization. If the authors could clarify this or provide a comparison with other models mentioned in the paper, it would strengthen the paper's promise.

  • Another concern is that the model still depends on a manually curated anomaly-rule knowledge base. The authors need to clarify the effectiveness and cost of building and maintaining this base, and its effect on generalisation to unseen anomaly categories. I understand that the domain seems to be anomalous "actions" rather than unseen or novel anomalies, so I would appreciate clarification on this point.

  • There are some minor typos such as "planing" or "Kowleage" in Table 3. It would be great if the authors could go over the paper for a final clean-up.

  • One last minor concern is that the model relies mostly on already developed models, so the contribution is limited to the design of the system.

Questions

Please refer to the weaknesses part, but here are the main questions:

1- The first question I have is regarding the real-time performance of the model. Please refer to the first weakness.

2- I would appreciate it if the authors could clarify the handling of unseen anomalies and the effect of varying the number of rules. Evaluating PANDA on anomaly categories absent from the knowledge base to reveal its generalisation behaviour would be very helpful.

3- Please estimate the annotation time per anomaly category or provide an automated rule-generation procedure, and discuss how the knowledge base scales to hundreds of categories. Clear evidence that the rule set can be created with minimal manual effort, or that it can be learned automatically, would help the decision.

4- Please polish the presentation of the paper and fix spelling issues etc.

Limitations

The societal-impact section acknowledges surveillance use but does not analyse privacy risks or potential misuse. A brief discussion would help.

Final Justification

While most of my concerns were properly addressed by the authors, considering the other reviewers' comments I would still like to keep my initial assessment.

Formatting Issues

No major concerns.

Author Response

We sincerely thank the reviewer for the positive assessment and thoughtful comments. We are glad you found the PANDA framework compelling and clearly presented. Below, we address each concern in detail.


Q1: Real-time performance concern.

We thank the reviewer for this detailed analysis and fully agree that real-time inference is vital for practical VAD systems. We provide a detailed clarification on the real-time performance below.

  1. Measured throughput and drift analysis
    As shown in Supp. Tab. 1c, PANDA achieves the following average inference speeds across different datasets:

| Dataset | Average Inference Speed (FPS) |
|---|---|
| UCF-Crime | 0.82 |
| XD-Violence | 0.86 |
| UBnormal | 0.79 |
| CSAD | 0.53 |

    Notably, CSAD is an extreme stress-test benchmark, composed entirely of visually degraded videos. This is not reflective of typical real-world anomaly distributions, but is used to evaluate robustness under worst-case conditions.
    Across the other three realistic and diverse datasets, PANDA achieves an average speed of 0.823 FPS, meaning each sampled frame is processed in approximately 1.22 seconds, resulting in a mild temporal drift under the 1 FPS sampling setting — as the reviewer insightfully observed.

    In practice, this level of delay is often acceptable for real-world anomaly detection, as abnormal events typically unfold over several seconds. PANDA is often able to raise an anomaly alert during the early stages of most anomalies, which satisfies the latency requirements of many surveillance and safety-critical scenarios. Moreover, we plan to further reduce PANDA’s computational cost by introducing multi-threaded tool pipelines and adopting lighter-weight LLM/VLM backbones in future work.
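For intuition, the drift arithmetic above can be checked in a few lines (a standalone sketch using the FPS figures from the table; this loop is not part of PANDA's code):

```python
# Per-frame latency and backlog growth under 1 FPS sampling.
for fps in (0.82, 0.86, 0.79, 0.53):
    per_frame = 1.0 / fps                   # seconds spent per sampled frame
    drift_per_min = 60 * (per_frame - 1.0)  # excess seconds per video-minute
    print(f"{fps:.2f} FPS -> {per_frame:.2f}s/frame, "
          f"~{drift_per_min:.0f}s drift per minute of video")
```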

  2. Comparison with existing training-free LLM/VLMs-based methods

    Real-time VAD using LLM/VLMs is inherently challenging due to their reasoning complexity. Existing SOTA training-free MLLM-based methods like LAVAD and AnomalyRuler involve manual intervention (e.g., pre-processing prompts or post-filtering predictions) and require offline access to global video information, making them unsuitable for online deployment. PANDA, to our knowledge, is the first system to support online, training-free VAD in an agentic manner.

    To illustrate this comparison, we measured the average inference FPS on the UCF-Crime dataset for existing SOTA training-free LLM/VLM-based VAD baselines, all evaluated on the same hardware setup using a single A6000 GPU. As shown in the table below, PANDA outperforms previous methods by 6–10× in speed, while also supporting training-free, manual-free, and fully autonomous online detection.

| Method | Avg. FPS | Online |
|---|---|---|
| LAVAD (CVPR 2024) | 0.08 | ✗ |
| AnomalyRuler (ECCV 2024) | 0.13 | ✗ |
| PANDA | 0.82 | ✓ |

    Note: For LAVAD and AnomalyRuler, the inference process involves multiple sequential stages, where each stage must be completed before the next can begin. For example, LAVAD first generates captions for all video frames, then performs caption cleaning and summarization, followed by initial scoring and anomaly score refinement based on the captions. Therefore, in our evaluation, we report the average FPS across all inference stages to provide a fair and comprehensive comparison.

This discussion and table will be included in the revision.


Q2: Concerns regarding reliance on an anomaly-rule knowledge base, generalization to unseen anomalies, and rule number sensitivity

We sincerely thank the reviewer for these insightful questions. Below, we respond to each aspect in detail:

1. On the construction and cost of the anomaly-rule knowledge base

We would like to clarify that PANDA does not require any manual annotation or dataset-specific rule design. As shown in Supp. Fig. 1 (top), the anomaly knowledge base is automatically constructed at runtime based on the user’s specified anomaly detection needs.

  • This process is driven by MLLMs (e.g., Gemini 2.0 Flash), using dynamically composed prompts based on the user requirement;
  • For example, generating 20 rules for each of 13 anomaly categories on the UCF-Crime dataset took approximately 95 seconds in total — demonstrating low overhead and strong scalability, even when extended to hundreds of categories.

Importantly, PANDA does not rely rigidly on the initial rule base. It performs:

  • Scene-aware knowledge retrieval in the perception stage, selecting only context-relevant rules;
  • Dynamic rule augmentation during reflection, where new rules may be added or revised based on ambiguous cases and context.

This ensures that PANDA’s reasoning process remains adaptive, even when the initial knowledge base lacks coverage for the current anomaly type.

2. On generalization to unseen anomalies

Since PANDA is fully training-free, it starts with no pre-learned notion of known anomaly categories. The only guidance comes from the user requirement at runtime.

Because anomaly is inherently context-dependent — the same action (e.g., “running”) may be normal in a park but abnormal in a hospital — PANDA reflects this principle by building the anomaly rule base on demand, according to the specific detection goal and scene context.

Moreover, PANDA’s adaptive mechanisms — self-adaptive scene perception, tool-augmented self-reflection, and chain-of-memory — allow it to handle anomalies not explicitly defined in the initial rule base.

To further verify this, we conducted an open-set evaluation on the UBnormal dataset:

  • The anomaly rule base was constructed only using categories from the training set;
  • The test set included completely disjoint anomaly types;
  • As shown in the table below, PANDA’s performance only dropped ~1.5% AP, and still outperformed all SOTA methods — demonstrating robust generalization to unseen anomalies.
| Method | Setting | AP (%) |
|---|---|---|
| ZS CLIP (ICML 2021) | Training-free | 46.20 |
| LLAVA-1.5 (CVPR 2024) | Training-free | 53.71 |
| LAVAD (CVPR 2024) | Training-free | 64.23 |
| AnomalyRuler (ECCV 2024) | Training-free | 71.79 |
| PANDA | Training-free (w/o test anomaly rules) | 74.23 |
| PANDA | Training-free (w/ test anomaly rules) | 75.78 |

3. On rule number effect

We have analyzed this factor in Tab. 2(b) of the main paper. When too few rules are retrieved (e.g., k = 1), the system lacks diverse contextual cues to support robust reasoning, resulting in performance degradation. Conversely, setting k too high (e.g., k = 9) may introduce noisy or irrelevant rules that dilute reasoning quality. Finally, PANDA achieves optimal performance when setting k = 5.

| Rule Number k | UCF-Crime (AUC%) |
|---|---|
| 1 | 82.79 |
| 5 | 84.89 |
| 9 | 83.92 |

Q3: Clarification on contribution scope beyond model reuse

Thank you for this remark. While it is true that PANDA leverages existing MLLMs and VLMs, we respectfully argue that its contribution lies in the novel agentic AI paradigm and system-level architecture designed to unlock training-free, manual-free, and generalist video anomaly detection — a capability not achieved by prior work.

1. Paradigm-level innovation

PANDA is not a naive wrapper over existing models. It introduces a deliberately designed four-stage reasoning stack —
Perceive → Reason → Reflect → Memory(Learn) — mimicking a human detective’s workflow. Each module equips PANDA with essential abilities:

  • Self-adaptive scene-aware strategy planning → addresses open-scene adaptation;
  • Goal-driven heuristic reasoning → supports goal-driven, explainable, and reliable decisions;
  • Tool-augmented self-reflection → enhances robustness under visual degradation;
  • Chain-of-memory → enables temporal abstraction and learning from prior experience.

This transforms MLLMs from static responders into situationally aware, reflective agents that operate sequentially over video data — enabling a new class of VAD agents.

2. First to unify training-free + online VAD

PANDA is, to our knowledge, the first training-free MLLM-based VAD system that supports online detection, addressing a critical practical need. In contrast:

  • ZS CLIP, LLAVA-1.5, LAVAD, AnomalyRuler — all require offline processing and manually tuned or learned prompts;
  • None support self-reflection, dynamic rule retrieval, or temporal memory;
  • None operate without scene-specific annotations or tuning.

3. System-level design as contribution

PANDA unifies RAG, tools, reasoning, and memory into a dynamic and autonomous agent, shifting from static pipelines to reasoning-driven learning. This design is non-trivial and central to PANDA’s contribution.

We will further emphasize these points in the revised Introduction and Section 3.


Q4: Typos.

We sincerely thank the reviewer for pointing out the typos. We will correct all typos such as "planing" → "planning" and "Kowleage" → "Knowledge". We will also thoroughly revise and proofread the entire paper and supplementary materials to ensure a polished and professional presentation in the final version.


Q5: Privacy and societal impact.

Thank you for the reminder. While PANDA does not collect personal identifiers, we acknowledge potential risks if used inappropriately (e.g., unauthorized surveillance, biased rules).

We will update the societal impact section to discuss:

  • The importance of transparent and auditable rule design;
  • Ethical deployment boundaries;
  • Bias mitigation through automated rule auditing.
Comment

Dear Reviewer wQDY,

I would be grateful if you could kindly review the authors' rebuttal and let them know if you have any further questions or concerns. Your feedback will help ensure a constructive and productive discussion.

Best regards,

Comment

I thank the authors for their comprehensive response. My only remaining concern is the FPS. I understand that this work has potential, and I would therefore like to see this aspect further improved. While the authors propose solutions, they could respond with a more scientific approach and discuss concrete steps towards achieving near-real-time speed.

Comment

FPS Optimization Approach


We sincerely thank you for the positive and thoughtful feedback. Achieving real-time performance for generalist video anomaly detection is indeed one of our long-term goals. While PANDA already outperforms existing SOTA LLM/VLM-based methods by 6–10× in inference speed, we fully agree that there is still a gap to close before reaching true real-time capability.

We believe that PANDA's inference speed can be improved through further optimization at three levels: preprocessing, model, and engineering.

1. Preprocessing-level optimization via dynamic spatiotemporal cropping

One major bottleneck in PANDA's inference speed is not only the complexity of models but also the volume and resolution of video input, which directly affects token count and thus inference latency.

To mitigate this, we can adopt an adaptive spatiotemporal input reduction strategy, motivated by characteristics of surveillance footage: most surveillance videos come from static-angle cameras, resulting in large portions of frames that are visually redundant (e.g., static background). In contrast, foreground changes (i.e., moving objects) typically signal informative or anomalous behavior.

Therefore, we can design a dynamic spatiotemporal cropping module that adjusts both temporal and spatial input on the fly (a code sketch follows below):

  • Adaptive frame rate control:

If no motion is detected in the foreground, the sampling rate is dynamically reduced (e.g., 0.1–0.5 FPS);

When foreground motion is detected, the rate increases to 1 FPS or higher.

  • Spatial region cropping:

A simple frame-difference-based method can estimate motion regions, which are then used to crop the frame to its informative area.

  • Scene context caching:

Since the background scene is often stable, its global information can be summarized once (e.g., via captions) and reused across clips, further reducing redundant processing.

This dynamic spatiotemporal cropping will allow PANDA to focus on the most informative content while significantly reducing video token count, thereby directly speeding up model inference.
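As a concrete illustration of the frame-difference idea above, a minimal sketch could look as follows (assuming OpenCV; the function names, thresholds, and rates are our own choices, not a released module):

```python
import cv2

def motion_roi(prev_gray, cur_gray, thresh=25, min_pixels=500):
    """Frame-difference motion estimate: returns a bounding box
    (x, y, w, h) around changed pixels, or None for a static scene."""
    diff = cv2.absdiff(prev_gray, cur_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)  # merge nearby motion blobs
    pts = cv2.findNonZero(mask)
    if pts is None or len(pts) < min_pixels:
        return None
    return cv2.boundingRect(pts)

def next_sample_rate(roi, idle_fps=0.2, active_fps=1.0):
    # Adaptive frame-rate control: slow down when nothing moves,
    # return to 1 FPS (or higher) once foreground motion appears.
    return active_fps if roi is not None else idle_fps
```

The returned bounding box can then be used to crop each frame to its informative region before it is tokenized by the VLM.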

2. Model-level optimization

PANDA’s current VLM can be replaced with more efficient, quantized variants that offer competitive performance with significantly lower latency. For instance, we can replace Qwen2.5-VL-7B-Instruct with Qwen2.5-VL-7B-Instruct-GGUF, or other GGUF-format models optimized for fast inference on resource-constrained hardware.

3. Engineering-level optimization

We can implement asynchronous, multi-threaded execution pipelines for tool invocation and memory update, to minimize blocking calls in the reflection loop. Tool outputs will be prefetched or cached whenever possible to reduce latency.
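A minimal sketch of such a non-blocking tool pipeline, using Python's standard thread pool (the tool functions here are illustrative stand-ins, not PANDA's actual tools):

```python
from concurrent.futures import ThreadPoolExecutor

def run_deblur(clip):             # stand-in for a real deblurring tool
    return f"deblurred({clip})"

def run_object_detection(clip):   # stand-in for a real detector
    return f"objects({clip})"

TOOLS = {"deblur": run_deblur, "object_detection": run_object_detection}

def invoke_tools_async(clip, tool_names):
    """Run the requested tools concurrently so that one slow call
    does not block the whole reflection loop."""
    with ThreadPoolExecutor(max_workers=max(1, len(tool_names))) as pool:
        futures = {n: pool.submit(TOOLS[n], clip) for n in tool_names}
        return {n: f.result() for n, f in futures.items()}
```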

Overall, we believe that with joint optimization at the preprocessing, model, and engineering levels, PANDA's inference speed can be further improved to approach real-time performance. While PANDA is not yet fully real-time, it is, to the best of our knowledge, the first LLM/VLM-based framework to support online, training-free, and manual-free generalist VAD. Its strong performance across multiple datasets further demonstrates its practical value and application potential.

We once again sincerely thank the reviewer for the positive and constructive feedback. If you have any additional questions or suggestions, we are open to continuing the discussion.

Comment

Dear Reviewer wQDY,

We sincerely thank you for the time and effort in reviewing our work, and we truly appreciate your support.

With less than one day remaining before the discussion deadline, we would like to kindly ask whether our response has addressed your only remaining concern regarding FPS. Please also let us know if you have any other questions or concerns, and we will be glad to provide further clarification.

Best regards,

Authors of Submission3267

Review (Rating: 4)

This paper presents a solution to make multimodal large language models (MLLMs) behave like a detective with 4 modules: (1) a RAG module that gives them strategy-planning ability, (2) a heuristic prompt strategy for goal-driven heuristic reasoning, (3) a progressive reflection mechanism for self-reflection, and (4) a chain-of-memory mechanism for self-improvement. The paper tests the method on representative datasets including UCF-Crime, XD-Violence, UBnormal, and CSAD.

Strengths & Weaknesses

Strengths:

  1. The paper is well-motivated to make AIs behave like detectives in video anomaly detection. The paper includes additional modules to equip the model with the desired ability, and the selection of them is reasonable.
  2. The paper tests the proposed method in various datasets and presents a comprehensive result for study.

Weaknesses: The main concern for this paper is that it doesn't illustrate the necessity of these modules; in fact, the modules could actually be redundant.

  1. First, the paper doesn't give concrete reasons why those techniques must be used to model the four abilities of a detective, and whether they are an optimal design is not discussed. The current solution reads more like a combination of trendy techniques without illustrating why they are essential to the design.
  2. Second, these modules may not be necessary because an important baseline is missing. This paper is categorized as a training-free method, but it doesn't compare with another method that does not fine-tune LLMs, published in CVPR'25 (VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models). The proposed method uses many additional modules to achieve VAD; however, based on the performance on XD-Violence and UCF-Crime, it doesn't outperform VERA, which simply uses good prompts obtained from verbalized learning between LLMs. These results could indicate that all the introduced modules in this paper are unnecessary and that a good prompt could outperform them. The paper needs a more thoughtful discussion of models that do not use fine-tuning, including works such as VERA.
  3. Third, the additional computation overhead of including these modules is not discussed in this paper. This is an obvious trade-off of including those modules and needs detailed discussion.

Questions

  1. Why must those modules be used to model the desired abilities? Are they optimal? What are the other solutions?
  2. Are these modules really necessary? VERA reaches an AUC of ~86% on UCF-Crime and just needs good prompts for MLLMs.
  3. What additional computation overhead is brought by these modules?

Limitations

yes

Final Justification

I'd like to raise the score because the rebuttal includes a discussion of the comparison with recent methods such as VERA. This discussion will be useful for illustrating the strengths and weaknesses of the paper. I raise my rating for this reason and hope the authors will include the discussion in the future version.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful and critical review. We appreciate your recognition of the motivation and comprehensiveness of our proposed PANDA. We address your concerns in detail below.


Q1: Why must those modules be used to model the desired abilities? Are they optimal? What are the other solutions?

1. Why those modules are necessary.

Thank you for the insightful question. We would like to clarify that the four modules in PANDA — namely: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) a self-improving chain-of-memory — are not arbitrary choices, but are deliberately designed to achieve generalist VAD that is fully training-free and manual-free even when facing novel scenes or anomaly types — a capability not supported by existing VAD methods. Next, we elaborate on the necessity of those modules below.

(1) Functional Necessity

Each module equips PANDA with a key cognitive capacity essential for generalist VAD:

  • Self-adaptive scene-aware strategy planning (via RAG): Real-world anomalies are scene-dependent and thus require context-specific rules. The RAG module lets PANDA dynamically retrieve only the relevant rules from an anomaly knowledge base based on the perceived context. Static prompts or fixed rules (used in prior work like VERA or AnomalyRuler) cannot generalize.

  • Goal-driven heuristic reasoning: Existing methods that rely on manually constructed or tuned fixed prompt templates for MLLM reasoning are brittle and scale poorly. PANDA instead synthesizes reasoning prompts adaptively from the environment, user goals, and retrieved rules.

  • Tool-augmented self-reflection: When facing ambiguous or degraded input, PANDA “knows what it doesn’t know” and selectively invokes tools (e.g., object detection, web search). This reflects a meta-cognitive ability essential to generalization.

  • Chain-of-Memory: Many anomalies involve temporal cues (e.g., shoplifting, riot). PANDA maintains both short-term reasoning traces and long-term historical experiences, allowing it to refine decisions across clips — which neither VERA nor other baselines support.

(2) Empirical Justification

  • As shown in Table 2 (main paper), PANDA outperforms existing SOTA methods.
  • Table 3 (main paper) demonstrates that the performance of PANDA improves consistently and substantially as each module is incrementally added.
  • As addressed in Q2, even when compared to VERA — a recently proposed method trained with coarse-grained prior labels — PANDA still achieves superior results on the XD-Violence and CSAD datasets.

This empirical evidence indicates that these modules are not redundant but instead contribute orthogonally to generalization and robustness in VAD.

2. Are these modules optimal? What are the other solutions?

Thank you for the insightful question. While we do not claim these modules represent the “optimal” design, they provide a minimally sufficient and extensible structure to achieve training-free, manual-free, and generalist VAD across diverse benchmarks.

We acknowledge that these modules proposed in PANDA represent just one plausible design path toward generalist VAD. Another feasible solution is to design reinforcement learning (RL)-based agents — for example, using RL to train a policy that decides how to analyze videos (e.g., which tools to invoke or how to interpret anomaly signals). However, since real-world anomalies are rare and highly context-dependent, this poses significant challenges for reward function design and generalization to complex, dynamic environments.

In summary, while other solutions exist, we emphasize that PANDA offers a training-free, manual-free, and generalist design that balances flexibility and practicality. We believe it provides a compelling trade-off between generalization, interpretability, and ease of deployment in real-world VAD scenarios.


Q2: The paper lacks comparison with VERA (CVPR 2025), which uses learned prompts to achieve strong results without fine-tuning. Does this mean the proposed modules are unnecessary?

Thank you for raising this important point. We fully acknowledge that VERA is a strong, recent method in the MLLM-based VAD paradigm. However, we respectfully clarify several key distinctions and provide both theoretical and empirical evidence supporting the necessity of PANDA’s modular design:

1. PANDA is Fully Training-Free and Manual-Free — VERA Requires Dataset-Specific Training and Human Intervention

A core distinction is that PANDA is a generalist that is fully training-free and manual-free, requiring no training phase, no data annotation, no optimization of prompt parameters, and no human intervention. The entire pipeline is globally automated — from rule generation to detection — making PANDA highly suitable for deployment in novel environments with no prior data or supervision.

In contrast, VERA is a specialist model that requires:

  • Coarsely labeled training videos (weakly-supervised);
  • An optimizer-learner loop to iteratively tune prompt questions;
  • Manual inspection for convergence and selection of prompt questions.

This makes PANDA significantly more lightweight and deployment-ready in unseen, zero-data environments, while VERA remains more aligned with weakly-supervised prompt learning and dataset-specific adaptation.

2. Evaluation Metric Consistency — Our AUC and AP on XD-Violence is Higher than VERA

VERA reports AUC on the XD-Violence dataset, while our main paper follows the original XD-Violence paper [HL-Net, Wu et al., ECCV 2020] in reporting AP, which is more robust for class-imbalanced datasets as highlighted by XD-Violence authors.

To ensure a fair comparison, we additionally report PANDA’s AUC on the XD-Violence dataset. Furthermore, using the official VERA codebase — including the released video features, initial scores, and AP computation scripts — we reproduce VERA’s AP on XD-Violence (which was not reported in their paper). As shown in the comparison table, PANDA outperforms VERA on both AUC and AP metrics. This demonstrates that PANDA’s modular design does not sacrifice accuracy despite being fully training-free and manual-free, whereas VERA requires training on coarsely labeled video data.

| Method | Supervision | XD-Violence (AUC%) | XD-Violence (AP%) | CSAD (AUC%) |
|---|---|---|---|---|
| VERA | Weakly-supervised | 88.26 | 70.11 | 64.52 |
| PANDA | Training-free | 89.35 (+1.09) | 70.16 (+0.05) | 73.12 (+8.60) |
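To make the metric distinction concrete, the following sketch with synthetic scores (illustrative numbers only, assuming scikit-learn; not data from the paper) shows how the two metrics are computed and why they can diverge on imbalanced data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(50), np.zeros(950)])   # 5% anomalies
scores = np.concatenate([rng.normal(0.7, 0.2, 50),      # anomaly scores
                         rng.normal(0.4, 0.2, 950)])    # normal scores
print("AUC:", roc_auc_score(labels, scores))            # rank-based, imbalance-insensitive
print("AP :", average_precision_score(labels, scores))  # precision-weighted, imbalance-sensitive
```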

3. PANDA Outperforms VERA in Complex and Degraded Scenarios (e.g., CSAD)

We further evaluate VERA using its released codebase on our proposed CSAD benchmark — a complex-scene anomaly detection set characterized by poor lighting, low resolution, and long-range temporal anomalies.

As shown in the table above, VERA achieves only 64.52% AUC on CSAD, whereas PANDA significantly outperforms it with 73.12%. This highlights that prompt-based adaptation alone is insufficient for handling real-world degradation, where PANDA’s self-adaptive scene-aware, tool-augmented self-reflection, and chain-of-memory modules play a critical role in restoring visual clarity and improving reasoning confidence.

In summary, while we commend VERA as an elegant prompt-learning method, we emphasize that:

  • PANDA is completely training-free, manual-free, and generalizable — capable of handling arbitrary new scenes and anomaly types, whereas VERA requires coarse-grained annotations and training on specific datasets;
  • PANDA surpasses VERA on complex and unseen scenes without dataset-specific tuning;
  • PANDA achieves superior accuracy on XD-Violence under a fair comparison on the same evaluation metric;
  • PANDA offers a fully training-free, manual-free, and globally automated solution, which is essential for real-world generalist anomaly detection systems.

Finally, we thank the reviewer again for highlighting this important comparison. We will include detailed discussion on VERA in the revised version of our paper.


Q3: What additional computation overhead is brought by these modules?

Thank you for pointing out this important concern. We agree that a discussion on computation cost is necessary, and now provide additional clarification and quantitative analysis.

1. Inference Speed Across Datasets

We have reported the inference speed of PANDA on each dataset in the supp. material (Section C.3, Table 1c):

| Dataset | Average Inference Speed (FPS) |
|---|---|
| UCF-Crime | 0.82 |
| XD-Violence | 0.86 |
| UBnormal | 0.79 |
| CSAD | 0.53 |

Among them, CSAD is a purposefully extreme benchmark composed entirely of visually degraded scenes, and thus its speed reflects a worst-case condition. For the other three datasets — UCF-Crime, XD-Violence, and UBnormal — which represent real-world surveillance distributions and open-set scenarios, PANDA achieves an average speed of 0.823 FPS, which closely aligns with the 1 FPS processing rate and supports timely anomaly detection in practical deployments (1 FPS sampling settings).

2. Trade-off Justification

We acknowledge that PANDA introduces additional reasoning and reflection steps compared to fixed-prompt baselines. However, this is a conscious trade-off:

  • PANDA trades marginal compute for robustness under uncertainty, scene adaptivity, and open-set generalization;
  • Unlike training-dependent systems, PANDA avoids all training costs;
  • Unlike VERA, PANDA does not require pre-optimized prompts or dataset-specific training to adapt.

We believe that this modest overhead is justified by PANDA’s real-world usability and improved generalization in complex and unseen scenarios. We also plan to reduce PANDA’s computational load and further improve its inference speed in future work by incorporating multithreaded tool invocation and adopting lighter LLM/VLM backbones.

Comment

Dear Reviewer cvqx,

I would be grateful if you could kindly review the rebuttal to see whether it addresses your concerns.

In particular, the authors have now included a comparison with VERA using AUC (previously, only AP was reported). Do you still believe the proposed method underperforms compared to VERA? If so, could you kindly clarify why?

Additionally, do you agree with the authors’ assertion that VERA is not truly training-free, as it relies on coarse-grained annotations?

We look forward to a thoughtful and constructive discussion between the authors and reviewers.

Best regards, AC

Comment

Dear Authors, thank you for the response! I think the new results and discussion presented here will be helpful and important for making the paper solid. Please include them in the final version. The score is raised based on this consideration.

Comment

Dear Reviewer cvqx,

Thank you very much for your valuable feedback and for raising the score. We truly appreciate your recognition of our efforts to address your concerns. We will ensure that all discussions and new results are incorporated into the final version.

Best regards, Authors of Submission3267

Review (Rating: 4)

This paper proposes PANDA, a new paradigm for video anomaly detection (VAD) targeting general scenarios and open anomaly types. PANDA leverages multimodal large models (MLLMs) and simulates a detective-like “perceive-reason-reflect-learn” decision process. Through adaptive scene perception, goal-driven reasoning, tool-augmented reflection, and self-improving memory chains, PANDA achieves anomaly detection that requires no training or manual adjustments, adapting to different scenes and anomaly types. Extensive experiments on multiple datasets (UCF-Crime, XD-Violence, UBnormal, CSAD) covering diverse, open, and complex scenarios show that PANDA achieves SOTA-level performance without any training.

Strengths & Weaknesses

Strengths:

  1. The paper introduces a “detective agent” paradigm, incorporating RAG (retrieval-augmented generation), chain-of-thought self-reflection, tool-chain support, and memory chain mechanisms into VAD, achieving anomaly detection without training or manual prompt engineering.
  2. The approach breaks free from the heavy dependence of traditional VAD methods on specific scenes, event types, rules, or models.

Weaknesses & Questions:

  1. The extensive use of RAG, repeated reasoning, and tool-chain calls in PANDA incurs significant inference latency and computational cost; preliminary estimates suggest the FPS may even be below 0.1. The authors should provide detailed inference timings and discuss this issue, as the high computational cost could significantly limit practical use.
  2. The novelty is limited; the core idea is to replace the prompts in existing MLLM-based VAD methods (such as SUVAD, LAVAD, VERA, etc.) with RAG and chain-of-thought techniques.
  3. During the lengthy reasoning process, model hallucinations may be amplified at each step. The authors should conduct case studies and thoroughly explain whether severe hallucinations occur and how much they affect result correctness.
  4. Technical details are lacking; the authors should provide detailed implementations of RAG and tool invocation.

Questions

Please respond to each item of the weaknesses mentioned above.

Limitations

Yes

Final Justification

After reading the responses from the authors, I believe that PANDA demonstrates a certain degree of innovation and practical significance in the direction of “training-free, manual-free, and generalizable VAD for open scenarios.” Although there are still some shortcomings, such as considerable engineering complexity and limited depth in ablation studies of certain modules, the overall contribution and experimental support are now at an acceptable level. Therefore, I decide to raise my score.

I hope the authors will further improve the implementation details and failure case analysis in the final version, making this work even more valuable to the community.

Formatting Issues

This paper is generally well written, and I did not find obvious formatting issues.

Author Response

Thank you for your thoughtful and detailed feedback. We appreciate your recognition of PANDA’s “detective agent” paradigm and its advantages of operating without training or manual prompt engineering. Below, we address each of your concerns point by point:


Q1: Inference latency and FPS concerns.

We respectfully clarify that PANDA’s actual speed is significantly higher than the initial estimate (< 0.1 FPS), and is broadly sufficient to meet the practical requirements under 1 FPS sampling settings, which aligns with our experimental setup.

As shown in the supplementary material (Sec. C.3, Tab. 1c), the measured average inference speed across datasets is:

| Dataset | Inference Speed (FPS) |
|---|---|
| UCF-Crime | 0.82 |
| XD-Violence | 0.86 |
| UBnormal | 0.79 |
| CSAD | 0.53 |

Among them, CSAD is a purposefully extreme benchmark composed entirely of visually degraded scenes, and thus its speed reflects a worst-case condition. For the other three datasets — UCF-Crime, XD-Violence, and UBnormal — which represent real-world surveillance distributions and open-set scenarios, PANDA achieves an average speed of 0.823 FPS, which closely aligns with the 1 FPS processing rate and supports timely anomaly detection in practical deployments (1 FPS sampling settings).

Importantly, PANDA is a training-free and manual-free VAD framework that supports both online and offline inference, while existing SOTA training-free LLM/VLM-based methods (e.g., LAVAD, AnomalyRuler) rely on multi-stage processing involving manual prompt design, external visual parsing, or result aggregation, making them unsuitable for online operation.

To illustrate this comparison, we measured the average inference FPS on the UCF-Crime dataset for existing SOTA training-free LLM/VLM-based VAD baselines, all evaluated on the same hardware setup using a single A6000 GPU. As shown in the table below, PANDA outperforms previous methods by 6–10× in speed, while also supporting training-free, manual-free, and fully autonomous online detection.

| Method | Inference Speed (FPS) |
|---|---|
| LAVAD (CVPR 2024) | 0.08 |
| AnomalyRuler (ECCV 2024) | 0.13 |
| PANDA | 0.82 |

Note: For LAVAD and AnomalyRuler, the inference process involves multiple sequential stages, where each stage must be completed before the next can begin. For example, LAVAD first generates captions for all video frames, then performs caption cleaning and summarization, followed by initial scoring and anomaly score refinement based on the captions. Therefore, in our evaluation, we report the average FPS across all inference stages to provide a fair and comprehensive comparison.

We will include these clarifications and the comparison table in the revised version.


Q2: The novelty is limited; the core idea is to replace the prompts in existing MLLM-based VAD methods (such as SUVAD, LAVAD, VERA, etc.) with RAG and chain-of-thought techniques.

We respectfully clarify that PANDA is not a simple replacement of prompts with RAG and CoT techniques, nor an incremental extension of existing VAD methods. Instead, it is a fundamentally new agentic AI framework, built from scratch using LangGraph, to enable generalist VAD — meaning it can automatically adapt to new scenes and novel anomaly types in a training-free and manual-free manner. In contrast, existing methods typically require retraining or manual intervention when faced with such open-world scenarios.

Specifically, PANDA introduces a paradigm-level shift from training and manual dependence to a training-free, manual-free, and generalist paradigm. It is built as a detective-like agent and is empowered with the following cooperating abilities:

  • Self-adaptive scene perception with environment-aware rule retrieval;
  • Goal-driven heuristic reasoning that aligns anomaly rules with real-time vision-language observations;
  • Tool-augmented self-reflection, allowing the PANDA to detect uncertainty and improve its own perception pipeline dynamically;
  • Chain-of-memory experience alignment, enabling context tracking across sliding windows and continual learning from historical experience;
  • Online processing ability, where the system makes decisions sequentially without accessing future frames — whereas existing methods, such as SUVAD, LAVAD, and VERA, require access to the entire video for global context and can only operate in an offline manner.

In summary, prior works like SUVAD, LAVAD, and VERA rely on handcrafted or fine-tuned prompts and require offline access to global video information; they lack four critical capacities: ① self-adaptive scene perception, ② self-reflection under degradation, ③ long-term memory across temporal sequences, and ④ online processing — all of which are essential for generalist VAD in complex, real-world settings.

We believe PANDA is the first agentic VAD framework that not only performs well across open-set and degraded scenes, but also does so fully autonomously, without training or handcrafted tuning.


Q3: Model hallucinations concern.

Indeed, hallucination remains one of the most widely recognized challenges when deploying large models (LLMs/VLMs), especially in lengthy reasoning processes. In PANDA, we fully acknowledge this concern and have explicitly designed several key components to mitigate hallucination propagation throughout the pipeline. We elaborate below:

1. Scene-Aware Planning with RAG Mitigation

During the strategy planning phase, PANDA integrates scene perception with RAG. Instead of relying on free-form hallucinated outputs, the system uses explicit environmental cues (e.g., scene type, weather condition, potential anomalies) to retrieve relevant anomaly rules from a curated knowledge base. This constraint dramatically reduces the likelihood of hallucinated, scene-irrelevant planning responses.

2. Goal-Driven VLM Reasoning with Multi-State Output

During Reasoning, PANDA explicitly passes potential anomaly types, inferred from scene-aware planning, into the reasoning prompt — thus narrowing the decision space. Moreover, instead of forcing binary outputs (“normal” vs. “anomaly”), we introduce an "Insufficient" reasoning state, allowing the model to abstain from uncertain predictions rather than hallucinate a confident but incorrect answer.

3. Tool-Augmented Reflection to Ground Ambiguities

If a clip is deemed “insufficient” by the VLM, PANDA triggers tool-augmented self-reflection, which includes reliable enhancement tools (e.g., image deblurring, brightness enhancement, object detection) that provide objective, grounded information. These tools act as external signal correctors, significantly reducing the reliance on speculative internal reasoning.

4. Chain-of-Memory for Decision Consistency

PANDA maintains both ShortCoM and LongCoM memory. These modules serve as explicit external memories, enabling the system to reference prior decisions and reflection rationales — reducing the chance of making inconsistent judgments across similar cases, a common hallucination symptom.

Overall, these mechanisms above enable PANDA to effectively mitigate model hallucinations, as further supported by the component ablation study in Table 3 of the main paper. Nonetheless, hallucination remains a stubborn challenge for large models. Despite our efforts to alleviate it through multiple strategies, we still observed occasional hallucination cases during the final experiments. We will add hallucination case studies in the revised version to illustrate remaining failure cases and reflect on potential future improvements. Notably, despite these challenges, PANDA — as a training-free and manual-free generalist model — still significantly outperforms existing methods across diverse scenario datasets, as shown in Table 1 (main paper), indicating its overall reliability and effectiveness.


Q4: Missing implementation details — RAG and tool invocation.

Thank you for pointing this out. We appreciate the opportunity to clarify the technical implementation behind PANDA's core modules, and we will add further elaboration in the revised version.

1. RAG Details

PANDA encodes both the pre-built anomaly rule base and environment information using the all-MiniLM-L6-v2 sentence embedding model.

  • The rule base is indexed using FAISS for fast similarity retrieval.

  • During the scene-aware planning phase, the environment information is embedded and used as a query to retrieve the top-k most relevant rules, which are then fed into the MLLM for reasoning.
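A minimal sketch of this retrieval step, assuming the sentence-transformers and faiss packages (the rule texts below are dummy examples, not entries from our knowledge base):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
rules = ["A person climbing over a fence into a closed area is abnormal.",
         "Loitering near an ATM late at night may indicate robbery."]
rule_emb = model.encode(rules, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(rule_emb.shape[1])  # inner product = cosine here
index.add(rule_emb)

env = "Night-time street scene; one pedestrian lingering near an ATM."
query = model.encode([env], normalize_embeddings=True).astype("float32")
_, ids = index.search(query, 2)               # top-k retrieval (k = 2)
retrieved = [rules[i] for i in ids[0]]        # rules fed to the MLLM
```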

2. Tool Invocation Details

PANDA integrates a modular toolset, where all tool functions are pre-defined with their input/output formats and functional descriptions. These descriptions are made available to the MLLM during the reflection stage — as illustrated in Supp. Fig. 6 ({tool_description_text}).

During reflection, the MLLM considers the VLM’s output, historical context, and available tool descriptions to determine whether additional information is needed. It then outputs a structured field tools_to_use, specifying:

  • The tool name to invoke

  • Optional parameters (e.g., query strings)

Each tool name directly maps to a callable function in PANDA’s toolset. Finally, PANDA parses this tools_to_use field and executes the corresponding tool function with the provided parameters.
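For illustration, the parse-and-dispatch step could be sketched as follows (the tool implementations are stand-ins; only the tools_to_use structure mirrors the description above):

```python
import json

def run_object_detection(clip, query=None):   # illustrative stand-in
    return {"objects": ["person", "backpack"]}

def run_web_search(clip, query=None):         # illustrative stand-in
    return {"results": [f"top hit for {query!r}"]}

TOOLSET = {"object_detection": run_object_detection,
           "web_search": run_web_search}

def execute_tool_plan(mllm_output: str, clip):
    """Parse the structured tools_to_use field emitted during
    reflection and call each named tool with its parameters."""
    plan = json.loads(mllm_output)            # e.g. {"tools_to_use": [...]}
    results = {}
    for call in plan.get("tools_to_use", []):
        tool_fn = TOOLSET.get(call["tool_name"])
        if tool_fn is not None:
            results[call["tool_name"]] = tool_fn(clip, query=call.get("query"))
    return results

# Example: execute_tool_plan(
#     '{"tools_to_use": [{"tool_name": "web_search", "query": "riot signs"}]}',
#     clip="clip_042")
```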

Comment

Dear Reviewer B3X7,

I would be grateful if you could kindly review the authors' rebuttal and let them know if you have any further questions or concerns. Your feedback will help ensure a constructive and productive discussion.

Best regards, AC

Comment

Dear Reviewer B3X7,

Thank you sincerely for your insightful comments and valuable suggestions on our paper. We have carefully addressed each of your points in our detailed response. As the discussion phase concludes, we remain open to any further questions or feedback that could help improve our work. Your insights have been immensely helpful, and we deeply appreciate your time and effort.

Best regards,

Authors of Submission3267

Comment

Thanks to the authors for their comprehensive rebuttal, which addresses each of my concerns in detail. The additional experiments on inference speed provide a clear clarification of PANDA’s efficiency advantages in practical scenarios and offer an objective comparison with current mainstream training-free baseline methods. Furthermore, the authors have provided further clarification regarding the model’s novelty, hallucination issues, and implementation details, which enhances the transparency and completeness of the work.

Comment

Dear Reviewer B3X7,

We sincerely thank you for your positive and valuable feedback, as well as for recognizing our efforts to address all concerns in detail. Your constructive comments have been instrumental in improving the quality of this paper. Thank you again for your time and thoughtful review.

Best regards, Authors of Submission3267

Review (Rating: 4)

This work aims to develop a generalisable video anomaly detection method that should handle any scene and any anomaly type without training data or human involvement. Self-adaptive scene-aware strategy planning, goal-driven heuristic reasoning, tool-augmented self-reflection, and a self-improving chain-of-memory are the four main components of the proposed method. Extensive experiments demonstrate that the proposed method (dubbed PANDA) achieves state-of-the-art performance in multi-scenario, open-set, and complex-scenario settings.

优缺点分析

Strengths: This work proposes PANDA, a training-free VAD method that can automatically handle any scene and any anomaly type without training data or human involvement. PANDA is evaluated on four VAD benchmarks representing three distinct settings: multi-scenario, open-set, and complex-scenario.

Weaknesses:

  1. In the implementation details, it is mentioned that “the maximum number of reflection rounds r is set to 3.” The paper suggests that PANDA is able to assign a normal or anomalous score to all previously insufficient samples within these rounds. However, intuitively, there may be hard cases where even after three reflection rounds, PANDA might still fail to confidently assign a label. Could the authors clarify how such unresolved or ambiguous cases are handled, if any remain after the maximum number of reflection rounds?
  2. All the methods compared in Table 1 perform anomaly detection in an offline setting. As stated in the supplementary material (L19–L20), "If future information is accessed, the method is considered offline; otherwise, it falls under the online setting." However, PANDA online setting still outperforms offline, training-free methods. This seems counterintuitive. Could you clarify why PANDA achieves such strong performance even in the online setting?
  3. An analysis of how frequently the reflection mechanism is invoked, i.e., the number of clips requiring reflection rounds, would be valuable. Additionally, it would be insightful to report how many clips transition from "insufficient" to either "normal" or "anomalous" after each reflection round. Finally, please clarify how many clips, if any, remain in the "insufficient" category even after the maximum of three reflection rounds.
  4. Figure 2 lacks clarity and does not accurately reflect the proposed method. According to the methodology, when the VLM reasoning status is marked as “insufficient,” the MLLM reflection module is invoked. However, this is not clearly depicted in the figure. Instead, the arrow from the VLM reasoning block appears to lead directly into another instance of VLM reasoning, which is misleading. To improve clarity, the figure should explicitly illustrate the transition from VLM reasoning to MLLM reflection in the case of an “insufficient” outcome. Additionally, incorporating step numbers into the diagram would help convey the sequential flow of the pipeline more effectively. As it stands, the figure may confuse readers rather than aiding in their understanding of the method.
  5. Since LongCoM evolves over time, it initially lacks any prior information about insufficient cases. How does the Experience-Driven Reflection module handle such scenarios where no historical context is available?
  6. L241-242 is ambiguous “In offline reasoning mode, the perception phase is sampling M = 300 frames uniformly for the whole video, while only the initial M = 10 frames are sampled in offline mode.” please clarify this statement.
  7. PANDA incorporates several tools to enhance video content analysis. Could you elaborate on how these specific tools were selected, and what criteria guided the choice of each component?
  8. Is the anomaly knowledge subjective to each dataset?
  9. In Supplementary Figure 1, which visualizes the detailed PANDA pipeline, the Short-CoM component appears to be missing. For completeness and clarity, it would be helpful to include Short-CoM in the diagram as well.
  10. In Supp. Figure 5, where does the "{formatted_enhancement_prompt}" come from?
  11. How does the proposed method compare to unsupervised VAD methods such as "Generative cooperative learning for unsupervised video anomaly detection" (CVPR 2023)?

Questions

Please respond to the weaknesses mentioned above.

Limitations

Failure cases are not discussed in the paper.

Final Justification

Q2: The accuracy values shown in Table 1 on the CSAD dataset are not convincing. The proposed method has a gap of 15.86% AUC over the current best training-free method, LAVAD. Note that the CSAD dataset is sampled from the existing datasets UCF, XD, and UB, and the average performance improvement on these datasets is less than 8%. The authors were not able to justify this gap in the rebuttal. The authors have not disclosed the details of the CSAD dataset, so this reviewer could not confirm which types of videos were included. In Q12 the authors accept that their method does not work well when resolution is low and the case has long-term dependencies. The CSAD dataset is all about low resolution and long-term dependencies. Therefore the performance of SOTA methods as well as the proposed method on CSAD is not convincing. A detail of the video numbers included in CSAD would be helpful.

Q4, Q9, Q5, Q10: The figures and the text need improvements, and the authors have promised to incorporate them. Such major changes are often beyond the scope of a conference paper because reviewers cannot enforce that the authors incorporate the promised changes.

Q8: The response to Q8 is not convincing. It appears the algorithm requires dataset-dependent anomaly class names, which contradicts the claim of a generic algorithm that can be applied to any field or domain without modification. The candidate rule set is constructed based on the user's goals, which vary from dataset to dataset. Therefore, for a test video, the possible anomaly class names and the user's goals must also be available. Without this information, the proposed algorithm would not be able to predict anomalies in a given test video.

Q11: A comparison with unsupervised anomaly detection methods would require the proposed algorithm to predict anomalies in a given video without knowing the possible anomaly class names or any other user input. The authors need to present results of their method under unsupervised settings, if applicable. The authors' response to this comment is not convincing.

Additionally, the capability of the proposed method to handle long-term dependencies is not convincing. A breakdown of the performance on UCF-Crime over different anomaly categories would be helpful. The slow speed is also a problem, because most abnormal actions are fast and take only a few seconds; identifying anomalies by looking at 2 or 3 frames far apart would be quite difficult.

For the time being, the rating of the paper is not changed.

Formatting Issues

No major formatting issues.

Author Response

We thank the reviewer for their insightful comments and thoughtful questions. We are pleased that you found PANDA’s design, scope, and generality valuable. Your excellent rating on the quality of our work is truly encouraging and deeply appreciated. Below we respond point-by-point to the concerns raised.


Q1. How does PANDA handle unresolved “insufficient” cases after the maximum 3 reflection rounds?

As stated in the main paper (L196–197):

“If after r rounds the result is still Insufficient, PANDA skips the current segment and continues to the next timestep.”

Thus, if a segment remains “Insufficient” after the maximum r = 3 reflection rounds, PANDA assigns a default anomaly score corresponding to the "Insufficient" status. This strategy is a deliberate design choice to balance computational efficiency with performance stability.

As shown in Tab. 2(a) of the main paper, increasing r from 3 to 5 brings minor gains, resolving some hard cases but significantly increasing inference time, making r = 3 a reasonable trade-off in practice.
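For clarity, the round-limited loop described above can be sketched as follows (vlm_reason and reflect_and_enhance are hypothetical stand-ins for PANDA's VLM reasoning and MLLM reflection calls, and the default score value is an assumption):

```python
def vlm_reason(clip, prompt):
    # Stub standing in for the VLM call; returns (status, score).
    return ("insufficient", None) if "enhanced" not in prompt else ("normal", 0.12)

def reflect_and_enhance(clip, prompt):
    # Stub standing in for MLLM reflection plus tool invocation.
    return prompt + " [enhanced evidence]"

def detect_with_reflection(clip, max_rounds=3, insufficient_score=0.5):
    prompt = "goal-driven reasoning prompt"
    for _ in range(max_rounds):
        status, score = vlm_reason(clip, prompt)
        if status != "insufficient":
            return status, score
        prompt = reflect_and_enhance(clip, prompt)  # tool-augmented retry
    # Unresolved after r rounds: assign the default 'Insufficient' score.
    return "insufficient", insufficient_score
```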


Q2. Why PANDA outperforms other offline methods in online mode?

We agree that a method typically performs better in the offline setting due to access to future information. As shown in Tab. 1 (main paper), PANDA’s offline mode consistently outperforms its online counterpart. However, due to fundamental design differences, online and offline methods are often not directly comparable — a well-designed online system may outperform rigid offline ones.

Our agentic AI framework, PANDA, is deliberately designed to function robustly in the online setting, and notably, it outperforms existing training-free offline methods, which we highlight as one of our key contributions. Below, we further explain PANDA’s advantage by analyzing the capabilities enabled by its core components.

  • Self-adaptive scene perception: PANDA uses RAG to retrieve context-aware anomaly rules based on current scene information (e.g., scene overview, potential anomalies), aligning reasoning goals with environment.
  • Goal-driven heuristic reasoning: PANDA performs CoT-style planning guided by retrieved rules, enhancing grounding between task objectives and VLM predictions.
  • Tool-augmented self-reflection: Upon insufficient reasoning confidence, PANDA autonomously invokes enhancement tools (e.g., object detection, image deblurring) to improve clarity before re-reasoning.
  • Chain-of-memory: PANDA leverages memory mechanisms (ShortCoM and LongCoM) to accumulate prior context, enabling temporally consistent decisions in streaming settings.

These mechanisms enable PANDA to make robust anomaly detection — even without access to future frames — much like a human expert acting with present information and past experience. In contrast, SOTA methods like LAVAD and AnomalyRuler rely on static prompts and lack adaptation, reflection, or memory, limiting their reasoning robustness even in offline mode.


Q3. Statistics about reflection mechanism.

Thank you for this detailed suggestion. We analyzed PANDA’s reflection behavior and summarize key statistics below (full results will be added to supp. material).

1. Frequency of reflection

| Dataset | Total Clips | Clips Triggering Reflection | Trigger Rate (%) |
|---|---|---|---|
| UCF-Crime | 7416 | 858 | 11.6 |
| XD-Violence | 19382 | 2956 | 15.3 |
| UBnormal | 619 | 82 | 13.2 |
| CSAD | 1202 | 429 | 35.7 |
| Avg. | | | 19.0 |

As shown in the table above, the average reflection trigger rate across all four datasets is 19.0%. Notably, the CSAD dataset exhibits a trigger rate nearly 2–3 times higher than the other three. This aligns with our expectations, as CSAD is composed entirely of complex scenes, where PANDA is more likely to encounter uncertainty during reasoning and therefore invokes reflection more frequently.

2. Transition outcomes

Among clips initially classified as “Insufficient”, their transitions after each round are as follows (UCF-Crime dataset):

| Reflection Round | Clips Resolved to Normal or Abnormal (of 858) | Clips Remaining Insufficient | Transition Rate (%) |
|---|---|---|---|
| After Round 1 | 334 | 524 | 38.9 |
| After Round 2 | 599 | 259 | 69.8 |
| After Round 3 | 752 | 106 | 87.6 |

This shows that over 87.6% of uncertain clips are resolved within 3 rounds. We also observed diminishing returns after the second round, which supports our default choice of r = 3 as a trade-off between effectiveness and computational overhead.


Q4. Fig. 2 lacks clarity in reflection flow.

Thank you for this constructive feedback. To improve clarity, we will revise Fig. 2 in the following ways:

  • Explicitly depict the “Insufficient” reasoning path, with a clearly labeled arrow from the VLM block to the MLLM reflection module;
  • Visually separate normal reasoning and reflection flows via distinct line styles;
  • Number key steps (1–6) to clearly present PANDA’s sequential reasoning process: perception → planning → initial reasoning → reflection trigger → tool-enhanced retry → memory update.

Q5. How does LongCoM behave at cold-start (no history)?

Thank you for the insightful question. At the start of a video, LongCoM is empty by design, and PANDA relies on ShortCoM’s local window memory for initial reasoning and reflection. As more clips are processed, LongCoM gradually accumulates traces, supporting memory-consistent reasoning and reflection planning. This progressive warm-up is intentional to ensure robustness under cold-start conditions. We will clarify this in Sec. 3.4.
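To make the cold-start behavior concrete, here is a minimal sketch of the two memories and the fallback logic. The class names mirror the paper's terminology, but the window size, retrieval rule, and data layout are illustrative assumptions:

```python
from collections import deque

class ShortCoM:
    """Sliding-window memory over the most recent clip-level reasoning traces."""
    def __init__(self, window: int = 5):   # window size is an assumption
        self.buffer = deque(maxlen=window)
    def add(self, trace):
        self.buffer.append(trace)
    def recent(self):
        return list(self.buffer)

class LongCoM:
    """Accumulates reflection cases over the whole stream; empty at cold start."""
    def __init__(self):
        self.cases = []
    def add(self, case):
        self.cases.append(case)
    def retrieve(self, query, k: int = 3):
        # hypothetical similarity search; returns [] when no history exists yet
        return self.cases[-k:] if self.cases else []

def build_memory_context(short_mem: ShortCoM, long_mem: LongCoM, query):
    # Cold start: LongCoM is empty, so reasoning is grounded in ShortCoM alone.
    context = {"recent": short_mem.recent()}
    past_cases = long_mem.retrieve(query)
    if past_cases:                          # warm phase: add long-range experience
        context["experience"] = past_cases
    return context
```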


Q6. L241–242 ambiguity.

Thank you for catching this. We will revise this sentence in the final version. The correct statement should be:

“In offline reasoning mode, the perception phase uniformly samples M = 300 frames from the whole video, while only the initial M = 10 frames are sampled in online mode.”


Q7. Tool selection.

As detailed in Sec. 3.3 of main paper and Supp. Fig. 1, PANDA selects tools based on two signals:

  • VLM reasoning feedback: When the VLM returns “Insufficient”, it provides a natural language rationale (e.g., “too blurry”, “too dark”), which helps identify the root cause.
  • Memory-guided experience: PANDA retrieves similar past “Insufficient” cases and their successful tool usage from LongCoM, informing effective tool reuse.

These cues are combined into a reflection prompt, along with all tool descriptions. The MLLM then selects the most appropriate tools, which are invoked accordingly. We will clarify this more explicitly in the final version.
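As a rough illustration of how these cues could be combined, the sketch below assembles a reflection prompt from the VLM rationale, retrieved memory cases, and tool descriptions. The tool registry and template wording are assumptions for illustration, not the exact prompts used in PANDA:

```python
# Hypothetical tool registry; names and descriptions are illustrative.
TOOLS = {
    "super_resolution": "Upscale low-resolution frames.",
    "image_deblurring": "Sharpen motion-blurred frames.",
    "object_detection": "Localize and name salient objects.",
    "image_retrieval": "Fetch semantically related past frames from memory.",
}

def build_reflection_prompt(vlm_rationale: str, similar_cases: list[str]) -> str:
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    cases = "\n".join(similar_cases) or "None."
    return (
        "The VLM returned 'Insufficient' with rationale:\n"
        f"{vlm_rationale}\n\n"
        "Similar past cases and the tools that resolved them:\n"
        f"{cases}\n\n"
        "Available tools:\n"
        f"{tool_list}\n\n"
        "Select the most appropriate tool(s) to resolve the uncertainty."
    )

print(build_reflection_prompt("Frame too blurry to identify the held object.",
                              ["Blurry night scene -> image_deblurring"]))
```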


Q8. Is the anomaly knowledge subjective to each dataset?

We would like to clarify that PANDA’s anomaly knowledge base is not tied to specific datasets, but is dynamically built based on:

  • User-defined anomaly categories
  • Scene context perceived at test time.

Since anomalies are inherently context-dependent, PANDA embraces this by adapting the rule base per instance:

  • It first constructs a candidate rule set based on the user’s goals;
  • Then, during the scene perception stage, it retrieves context-relevant rules using environmental cues (e.g., scene overview, potential anomalies);
  • During reflection, it further refines or expands rules based on reasoning ambiguities or failures.

This design ensures the knowledge base is context-grounded, adaptive, and generalizable, rather than statically tailored to any dataset — enabling open world anomaly detection.
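A minimal sketch of the context-relevant rule-retrieval step, assuming an off-the-shelf sentence-embedding model (`all-MiniLM-L6-v2`); the example rules and top-k setting are illustrative, not PANDA's actual knowledge base:

```python
from sentence_transformers import SentenceTransformer, util

# Tiny illustrative rule base; real rules come from user goals + reflection.
rules = [
    "Running is abnormal inside a hospital corridor.",
    "Running is normal on a sidewalk or athletics track.",
    "Concealing store merchandise in clothing suggests shoplifting.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
rule_emb = model.encode(rules, convert_to_tensor=True)

def retrieve_rules(scene_overview: str, top_k: int = 2):
    query_emb = model.encode(scene_overview, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, rule_emb)[0]       # cosine sim per rule
    top = scores.topk(min(top_k, len(rules)))
    return [rules[i] for i in top.indices.tolist()]

print(retrieve_rules("Indoor retail store, customer lingering near shelves"))
```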


Q9. Missing ShortCoM in Supp. Fig. 1.

Thank you for pointing this out. While ShortCoM is implicitly integrated into the construction of both the Reasoning Input and Prompt and the Reflection Prompt, we agree it is not explicitly shown in Supp. Fig. 1. We will revise Supp. Fig. 1 to depict ShortCoM clearly.


Q10. Supp. Fig. 5: where does {formatted_enhancement_prompt} come from?

Thank you for raising this point. {formatted_enhancement_prompt} refers to the enhanced prompt generated after the reflection stage, as described in Lines 148-152 and 188–193 of the main paper. It integrates three components: {Text Enhancement Info, New Anomaly Rule, New Heuristic Prompt}, where Text Enhancement Info summarizes the outputs of invoked tools (e.g., object detection, web search). We will clarify this in the caption of Supp. Fig. 5.
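As a hedged illustration, the three components could be assembled as follows; the field labels and function name are assumptions for exposition, not the exact format shown in Supp. Fig. 5:

```python
def format_enhancement_prompt(tool_outputs: dict[str, str],
                              new_rule: str,
                              new_heuristic: str) -> str:
    # Text Enhancement Info summarizes the outputs of the invoked tools.
    text_enhancement_info = "\n".join(
        f"[{tool}] {summary}" for tool, summary in tool_outputs.items()
    )
    return (
        "Text Enhancement Info:\n" + text_enhancement_info + "\n"
        "New Anomaly Rule: " + new_rule + "\n"
        "New Heuristic Prompt: " + new_heuristic
    )

prompt = format_enhancement_prompt(
    {"object_detection": "Two persons, one holding a crowbar near a door."},
    "Forcing a locked door with a tool indicates burglary.",
    "Focus on hand-object interactions around entry points.",
)
print(prompt)
```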


Q11. PANDA vs. “Generative cooperative learning for unsupervised VAD (GCL)” (CVPR 2023)

Thank you for highlighting this reference. PANDA differs from GCL in both its goal and capabilities. While GCL focuses on cooperative training and learning from unlabeled videos, PANDA is designed to generalize to unseen scenes and anomaly types without any training, which GCL cannot handle due to its training dependency. Specifically:

  • PANDA adopts an agentic paradigm with scene perception, rule-based retrieval (RAG), heuristic reasoning, and tool-augmented reflection, enabling zero-shot and training-free VAD.
  • In contrast, GCL requires fixed model structures and offline training on large unlabeled data, limiting adaptability when deployed in open-world, complex environments.

Quantitatively, on the shared evaluation benchmark UCF-Crime, PANDA achieves an AUC of 84.89%, significantly outperforming GCL’s reported 71.04%, while remaining training-free and requiring no training data. We will add this comparison to the revised version.


Q12. Failure case analysis.

Thank you for the helpful suggestion. We will add a failure case visualization in Supp. Sec. C.4.

The case from UCF-Crime (Shoplifting033_x264.mp4) involves a man discreetly stealing a watch from a store table. Due to low resolution and subtle visual cues, PANDA fails to detect the anomaly — even after 3 reflection rounds, the clip remains in the “Insufficient” status and is skipped.

This case shows that PANDA may struggle with fine-grained, low-visibility anomalies. We will explore stronger visual enhancement tools and finer-grained reasoning to address such cases in future work.

Comment

Q2: The accuracy values shown in Table 1 on the CSAD dataset are not convincing. The proposed method has a gap of 15.86% AUC over the current best training-free method, LAVAD. Note that the CSAD dataset is sampled from the existing datasets UCF, XD, and UB, and the average performance improvement on those datasets is less than 8%. The authors were not able to justify this gap in the rebuttal. The authors have not disclosed the details of the CSAD dataset, so this reviewer could not confirm which types of videos were included. In Q12 the authors accept that their method does not work well with low resolution and long-term dependencies. The CSAD dataset is all about low resolution and long-term dependencies. Therefore the performance of SOTA as well as the proposed method on CSAD is not convincing. A detailed list of the videos included in CSAD would be helpful.

Q4, Q9, Q5, Q10: The figures and the text need improvements, and the authors have promised to incorporate them. Such major changes are often beyond the scope of a conference paper because reviewers cannot compel the authors to incorporate the promised changes.

Q8: The response to Q8 is not convincing. It appears the algorithm requires dataset-dependent anomaly class names, which contradicts the claim of a generic algorithm applicable to any field or domain without modification. The candidate rule set is constructed based on the user’s goals, which vary from dataset to dataset. Therefore, for a test video, the possible anomaly class names and the user’s goals must also be available. Without this information, the proposed algorithm would not be able to predict anomalies in a given test video.

Q11: Comparison with unsupervised anomaly detection methods would require the proposed algorithm to predict anomalies in a given video without knowing the possible anomaly class names or any other user input. The authors need to present results of their method under unsupervised settings, if applicable. The authors’ response to this comment is not convincing.

Additionally, the capability of the proposed method for handling long-term dependencies is not convincing. A breakdown of the performance on UCF-Crime over different anomaly categories would be helpful. Also, slow speed is a problem because most abnormal actions are fast and take only a few seconds; identifying anomalies by looking at 2 or 3 frames far apart would be quite difficult.

Comment

4. Q11: Comparison with Unsupervised VAD Methods


We appreciate the reviewer’s concern and respectfully clarify a key distinction in design philosophy.

Our goal is to develop a training-free, manual-free, and generalizable VAD framework that adapts dynamically to user goals and scene context — enabling open-set deployment without retraining or manual engineering. Unsupervised methods do not require anomaly class names or user input, but they rely on large volumes of unlabeled data for training. Their understanding of anomaly is limited to the appearance and motion patterns learned from training data, which restricts generalization — especially for context-dependent anomalies. For example, “running” may be normal on a sidewalk but abnormal in a hospital, a distinction unsupervised models may miss if it was not seen during training.

In contrast, PANDA makes no assumptions about available training data, and instead builds anomaly reasoning on-the-fly, leveraging real-time scene perception and user goals. It supports both specific queries and general anomaly detection (e.g., “find any abnormal events”) and uses retrieved rule sets and contextual cues to reason about anomalies, even in previously unseen settings.

Therefore, we argue that PANDA complements — rather than competes with — unsupervised methods, addressing their limitations by eliminating the dependence on offline training and enabling truly flexible, generalist anomaly detection.


5. New Q1: Clarification on PANDA’s capability in handling long-term dependencies

We respectfully clarify that PANDA is capable of handling long-term temporal dependencies, as evidenced by its strong performance on CSAD. As noted in our response to Q2, CSAD includes many challenging degraded scenes and long-range anomaly types, such as shoplifting, arson, and burglary, which inherently require reasoning over extended temporal contexts.

On this dataset, PANDA outperforms all existing methods by a large margin, including training-free baselines like LAVAD. We believe this empirical result provides strong evidence of PANDA’s effectiveness in addressing long-term dependency challenges, even under adverse visual conditions.


6. New Q2: Inference speed

As shown in the suppl. material, the measured average inference speed across datasets is:

| Dataset | Inference Speed (FPS) |
|---|---|
| UCF | 0.82 |
| XD | 0.86 |
| UB | 0.79 |
| CSAD | 0.53 |

Among them, CSAD is a purposefully extreme benchmark composed entirely of visually degraded scenes, so its speed reflects a worst-case condition. For the other three datasets — UCF, XD, and UB — which represent real-world surveillance distributions and open-set scenarios, PANDA achieves an average speed of 0.823 FPS, closely matching the 1 FPS sampling rate and supporting timely anomaly detection in practical deployments.

Importantly, PANDA is a training-free and manual-free VAD framework that supports both online and offline inference, while existing SOTA training-free LLM/VLM-based methods (e.g., LAVAD, AnomalyRuler) rely on multi-stage processing involving manual prompt design, external visual parsing, or result aggregation, making them unsuitable for online operation.

To illustrate this comparison, we measured the average inference FPS on the UCF dataset for existing SOTA training-free LLM/VLM-based VAD baselines, all evaluated on the same hardware setup using a single A6000 GPU. As shown in the table below, PANDA is 6–10× faster than previous methods, while also supporting training-free, manual-free, and fully autonomous online detection.

| Method | Inference Speed (FPS) |
|---|---|
| LAVAD (CVPR 2024) | 0.08 |
| AnomalyRuler (ECCV 2024) | 0.13 |
| PANDA | 0.82 |

Note: For LAVAD and AnomalyRuler, the inference process involves multiple sequential stages, where each stage must be completed before the next can begin. For example, LAVAD first generates captions for all video frames, then performs caption cleaning and summarization, followed by initial scoring and anomaly score refinement based on the captions. Therefore, in our evaluation, we report the average FPS across all inference stages to provide a fair and comprehensive comparison.

In addition, PANDA processes the video at 1 FPS, a standard practice in VAD that balances efficiency and completeness. This strategy drops redundant frames but does not introduce large gaps between frames. As most real-world anomalies (e.g., fighting, theft) span multiple seconds, this frame rate is typically sufficient to capture key moments. We also highlight that PANDA’s strong performance on multiple datasets supports its practical applicability for detecting both rapid and long-duration anomalies.
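For reference, the 1 FPS sampling strategy can be sketched with OpenCV as below. The non-overlapping clip construction and the clip-length parameter are assumptions for illustration, not necessarily PANDA's exact implementation:

```python
import cv2

def sample_clips(video_path: str, clip_len: int = 5):
    """Downsample a video to ~1 FPS and group frames into fixed-length clips."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(fps)), 1)            # keep ~1 frame per second
    frames, clips, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                   # 1 FPS sampling
            frames.append(frame)
            if len(frames) == clip_len:       # emit non-overlapping clips
                clips.append(frames)
                frames = []
        idx += 1
    cap.release()
    return clips
```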


We genuinely hope that our additional response helps resolve your concerns. Should you have any further questions or require more information, we would be more than happy to provide additional clarification.

Comment

We thank the reviewer for the feedback and provide the following clarification.


1. Q2: Justification of PANDA’s performance on CSAD

As stated in Sec. 4.1, CSAD is a stress-test benchmark we constructed to evaluate robustness under degraded conditions. It contains 100 videos (50 normal, 50 abnormal) sampled from UCF, XD, and UB, featuring low resolution, poor lighting, high noise, and long-term temporal anomalies. Examples include:

  • UCF: Shoplifting010_x264.mp4, Burglary032_x264.mp4, Burglary035_x264.mp4, Arson007_x264.mp4 etc.
  • XD: v=iHuggczItBk__#00-01-00_00-02-45_label_B6-0-0.mp4, v=IWzI9V3WSnc__#1_label_B4-0-0.mp4, v=Q8K7roZu3WU__#1_label_B1-0-0.mp4 etc.
  • UB: normal_scene_11_scenario_2_fog.mp4, normal_scene_13_scenario_1_fog.mp4, normal_scene_8_scenario_2_fog.mp4 etc.

These publicly available videos confirm that CSAD is not simply a subset of existing datasets but a curated benchmark focusing on hard-case complexity.

LAVAD depends heavily on VLM-generated captions for scoring. In complex scenes, we observed that these captions are often inaccurate, and despite post-processing (denoising, smoothing), poor caption quality severely degrades LAVAD’s performance. In contrast, PANDA dynamically invokes external tools (e.g., image super-resolution, image deblurring, image retrieval) to enhance low-quality inputs when reflection is triggered. As shown in the table in Q3, PANDA’s reflection module is activated on 35.7% of clips on CSAD, demonstrating its ability to adaptively recover from ambiguity — a capacity absent in LAVAD.

Additionally, while we present a failure case in Q12, it serves only to illustrate areas where PANDA can be further improved. As VAD remains a highly challenging task, PANDA — despite achieving state-of-the-art performance — cannot yet handle all hard cases perfectly, just as existing VAD methods have their own limitations. Nonetheless, its performance on CSAD demonstrates notable robustness in visually degraded and complex scenes.


2. Regarding Q4, Q5, Q9, and Q10

We sincerely appreciate these valuable suggestions regarding the figures and text. While we believe they do not affect the core contributions or conclusions of the paper, we wholeheartedly reaffirm our commitment to thoughtfully incorporating all recommended improvements into the final version.


3. Q8: Clarification on Anomaly Knowledge and Generalization

We appreciate the reviewer’s follow-up and welcome the opportunity to clarify PANDA’s design philosophy.

First, anomalies in the real world are inherently context-dependent — the same behavior (e.g., “running”) may be normal in one environment (a sidewalk) but clearly anomalous in another (a hospital). From a practical deployment perspective, relying solely on fixed prior knowledge — either from training data (in training-based methods) or from pre-trained LLM/VLM models (in training-free methods like LAVAD) — can lead to biased or brittle decisions in new or evolving environments.

This is evidenced by our results on the UB dataset. Compared to the other datasets (except CSAD), PANDA achieves its largest margin over LAVAD on UB. The reason is that UB defines behaviors such as running, jumping, and shuffling on roads as anomalies — behaviors that are often regarded as normal in most LLM/VLM priors. Consequently, methods like LAVAD that rely entirely on frozen priors struggle to detect these user-defined anomalies. In contrast, PANDA allows users to specify their detection goals (e.g., “identify people running on roads”), and it dynamically retrieves relevant rules to reinterpret behavior accordingly. This adaptability enables PANDA to succeed where static, prior-based methods fail.

We believe this flexibility is essential for any truly generalist anomaly detection system. As new types of anomalies emerge in real-world applications (e.g., new forms of theft, attacks, or social behaviors), models must be able to adapt detection criteria — something that neither fixed training-based methods nor prior-only LLM/VLM systems can achieve reliably. PANDA achieves this through user goals and RAG-enhanced reasoning, allowing it to update its knowledge base and prompts on the fly, making it more robust to open-set or shifting domains.

Importantly, PANDA’s only input beyond the video stream is a user query, which can vary in granularity. While the query may specify target anomalies (e.g., “detect shoplifting or loitering”), it can also be open-ended (e.g., “detect anything abnormal”), in which case PANDA reverts to common anomaly patterns encoded in its pre-trained LLM/VLM. However, in doing so, the detection space is bounded by the LLM/VLM priors, which may miss task-specific or emerging events — hence goal conditioning is optional but strongly beneficial.

In summary, PANDA is not dataset-dependent; it is goal-adaptive, context-aware, and dynamically extensible, aligning closely with the demands of real-world generalist VAD systems.

Comment
  1. Keeping in view that the user has to set a query containing the domain knowledge and the type of anomaly they expect to see, your claim becomes null and void: "Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement." There is human involvement; therefore it is not fully automatic. Please explain.

  2. However, you can overcome this objection by using a fixed user query for all datasets and all videos. I was not able to find this result in the paper, the supplementary document, or the rebuttal. If you were optimising user input at the video level, as in the CSAD dataset, then a performance improvement is expected.

  3. Please disclose all 100 video names in CSAD dataset and the corresponding user input for each video.

  4. A theoretical explanation how PANDA can handle long term dependencies is required.

Comment

Q14. Please disclose all 100 video names in CSAD dataset and the corresponding user input for each video.

We would like to respectfully clarify that, as stated in our response to Q13, PANDA does not optimize user input at the video level. For the CSAD dataset, all 100 videos share the same user query, defined to reflect a broad set of target anomaly types. The query is:

"Please help me detect the following types of abnormal events: Abuse, Arson, Burglary, Explosion, Fighting, Road Accidents, Robbery, Shooting, Shoplifting, Stealing, Vandalism, Riot, Running, Jumping, Shuffling, Having a seizure."

Below, we list the complete set of video names included in the CSAD dataset for full transparency:


A.Beautiful.Mind.2001__#00-40-52_00-42-01_label_A.mp4
A.Beautiful.Mind.2001__#01-14-30_01-16-59_label_A.mp4
abnormal_scene_13_scenario_2_fog.mp4
abnormal_scene_13_scenario_3_fog.mp4
abnormal_scene_1_scenario_7.mp4
abnormal_scene_1_scenario_9.mp4
abnormal_scene_9_scenario_3.mp4
abnormal_scene_9_scenario_4_fog.mp4
About.Time.2013__#00-23-50_00-24-31_label_A.mp4
About.Time.2013__#00-30-50_00-32-31_label_A.mp4
About.Time.2013__#00-40-52_00-42-31_label_A.mp4
Arson007_x264.mp4
Arson009_x264.mp4
Arson010_x264.mp4
Before.Sunrise.1995__#00-03-00_00-04-05_label_A.mp4
Before.Sunrise.1995__#00-04-20_00-05-35_label_A.mp4
Before.Sunrise.1995__#00-23-50_00-24-31_label_A.mp4
Be.with.You.2018__#00-04-20_00-05-35_label_A.mp4
Be.with.You.2018__#00-23-50_00-24-31_label_A.mp4
Black.Hawk.Down.2001__#01-42-58_01-43-58_label_G-0-0.mp4
Bullet.in.the.Head.1990__#00-23-31_00-24-40_label_G-0-0.mp4
Burglary005_x264.mp4
Burglary018_x264.mp4
Burglary032_x264.mp4
Burglary035_x264.mp4
Burglary079_x264.mp4
Explosion011_x264.mp4
Explosion028_x264.mp4
GoldenEye.1995__#01-17-05_01-19-57_label_B2-B1-0.mp4
normal_scene_11_scenario_2_fog.mp4
normal_scene_11_scenario_4_fog.mp4
normal_scene_13_scenario_1_fog.mp4
normal_scene_8_scenario_2_fog.mp4
normal_scene_9_scenario_4_fog.mp4
Normal_Videos_006_x264.mp4
Normal_Videos_010_x264.mp4
Normal_Videos_015_x264.mp4
Normal_Videos_018_x264.mp4
Normal_Videos_019_x264.mp4
Normal_Videos_024_x264.mp4
Normal_Videos_025_x264.mp4
Normal_Videos_041_x264.mp4
Normal_Videos_048_x264.mp4
Normal_Videos_100_x264.mp4
Normal_Videos_129_x264.mp4
Normal_Videos_189_x264.mp4
Normal_Videos_251_x264.mp4
Normal_Videos_345_x264.mp4
Normal_Videos_452_x264.mp4
Normal_Videos_686_x264.mp4
Normal_Videos_725_x264.mp4
Normal_Videos_745_x264.mp4
Normal_Videos_828_x264.mp4
Normal_Videos_831_x264.mp4
Normal_Videos_870_x264.mp4
Normal_Videos_872_x264.mp4
Normal_Videos_876_x264.mp4
Normal_Videos_881_x264.mp4
Normal_Videos_882_x264.mp4
Normal_Videos_883_x264.mp4
Normal_Videos_885_x264.mp4
Normal_Videos_888_x264.mp4
Normal_Videos_891_x264.mp4
Normal_Videos_892_x264.mp4
Normal_Videos_897_x264.mp4
Normal_Videos_901_x264.mp4
Normal_Videos_904_x264.mp4
Normal_Videos_906_x264.mp4
Normal_Videos_908_x264.mp4
Operation.Red.Sea.2018__#01-20-58_01-22-00_label_B5-0-0.mp4
RoadAccidents001_x264.mp4
RoadAccidents009_x264.mp4
RoadAccidents017_x264.mp4
RoadAccidents123_x264.mp4
RoadAccidents128_x264.mp4
RoadAccidents131_x264.mp4
RoadAccidents132_x264.mp4
Robbery050_x264.mp4
Robbery106_x264.mp4
Shooting002_x264.mp4
Shooting007_x264.mp4
Shooting013_x264.mp4
Shooting021_x264.mp4
Shoplifting005_x264.mp4
Shoplifting010_x264.mp4
Shoplifting016_x264.mp4
Shoplifting029_x264.mp4
Shoplifting031_x264.mp4
Shoplifting033_x264.mp4
Spectre.2015__#01-08-58_01-09-20_label_B1-B2-0.mp4
Stealing062_x264.mp4
Vandalism028_x264.mp4
v=BQjKQbYgUBA__#1_label_B1-0-0.mp4
v=Ia9ATKNeUbY__#00-04-19_00-05-11_label_B6-0-0.mp4
v=Ia9ATKNeUbY__#00-06-19_00-06-50_label_B6-0-0.mp4
v=IWzI9V3WSnc__#1_label_B4-0-0.mp4
v=iHuggczItBk__#00-01-00_00-02-45_label_B6-0-0.mp4
v=lpkL0Y1MhA8__#1_label_B4-0-0.mp4
v=MqrNCb2N5to__#1_label_B1-0-0.mp4
v=Q8K7roZu3WU__#1_label_B1-0-0.mp4


We sincerely hope that the additional responses have resolved your concerns. If you have any further questions, please feel free to let us know. Thank you for your time and thoughtful review.

Comment

We thank the reviewer for the feedback. We would like to offer further clarification regarding your concerns.


Q12. Clarification regarding “without training data or human involvement”.

We would like to respectfully clarify the intent behind the statement: "Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement." Our definition of “without training data or human involvement” refers specifically to the execution pipeline when PANDA is applied to new scenes or new anomaly types. As illustrated in Fig. 1 (main paper), PANDA differs from existing training-based and training-free methods in the following key ways:

  • It requires no new data collection or annotation when encountering a new scene;
  • It requires no model retraining or finetuning to adapt to novel anomaly types;
  • It involves no handcrafted pre- or post-processing, unlike LAVAD, which must first generate captions for all video frames, then perform caption cleaning and summarization, followed by initial scoring and anomaly-score refinement based on the captions.

Once PANDA is set up with its user goals, the inference pipeline operates fully automatically — including perception, planning, reasoning, and reflection — without any retraining or additional manual intervention. In practice, if the deployment environment requires only common anomaly types (e.g., fighting, robbery, shoplifting), these can all be predefined once during system setup. In such cases, PANDA requires only the input video stream, with no user query input during runtime. The system remains training-free and manual-free, which is not achievable by prior methods.


Q13. However, you can overcome this objection by using a fixed user query for all datasets and all videos. I was not able to find this result in the paper or supplementary document or in the rebuttal. If you were optimising user input at video level as in CSAD dataset, then performance improvement is expected.

We would like to respectfully clarify that PANDA does not optimize the user input at the video level. Instead, the user query is defined based on a set of anomaly categories relevant to the user’s goals — rather than to individual videos.

As illustrated in Supp. Fig. 1, for example, all videos in the UCF dataset share the same user query:

"Please help me detect the following types of abnormal events: Abuse, Arrest, Arson, Assault, Burglary, Explosion, Fighting, Road Accidents, Robbery, Shooting, Shoplifting, Stealing, Vandalism."

For the other datasets (XD, UB, and CSAD), the queries follow the same template structure and differ only in the list of anomaly types. We did not tune or optimize user queries per video. Additionally, following your suggestion, we can replace the anomaly categories in the user query with the union of anomaly classes across all datasets, so that all datasets share a single unified user query.

We hope this clears up the misunderstanding. PANDA’s performance reflects its generalist capability using high-level user queries at the scene level — rather than the result of fine-grained, video-level query customization.
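For clarity, the scene-level query construction amounts to filling a single template; the sketch below mirrors the UCF query quoted above, with only the helper-function name being an assumption:

```python
# Template wording taken from the queries quoted in this response.
QUERY_TEMPLATE = "Please help me detect the following types of abnormal events: {}."

def build_user_query(anomaly_classes: list[str]) -> str:
    return QUERY_TEMPLATE.format(", ".join(anomaly_classes))

ucf_query = build_user_query([
    "Abuse", "Arrest", "Arson", "Assault", "Burglary", "Explosion", "Fighting",
    "Road Accidents", "Robbery", "Shooting", "Shoplifting", "Stealing", "Vandalism",
])
print(ucf_query)
```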


Q15. A theoretical explanation how PANDA can handle long term dependencies is required.

We appreciate the reviewer’s request for a theoretical clarification. PANDA is designed with several core components that explicitly support reasoning over long temporal contexts. Below, we outline two key mechanisms that enable this:

1. Tool-Augmented Self-Reflection

When reasoning about long-duration anomalies, if PANDA encounters uncertainty, it triggers the reflection mechanism. During reflection, PANDA can dynamically invoke external tools to enhance or gather additional information. In particular, the image-retrieval tool enables PANDA to query past frames from the historical memory buffer based on semantic cues (a retrieval query) generated during reflection. These retrieved frames are then combined with the current segment and fed back into the VLM for re-reasoning. This mechanism explicitly allows PANDA to incorporate non-local temporal context from earlier moments in the video, which is crucial for detecting long-term anomaly patterns.
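As a hedged sketch (not PANDA's actual implementation), such an image-retrieval tool could be realized with CLIP-style text-to-frame matching over the historical buffer; the model choice and buffer layout are assumptions:

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

frame_buffer = []   # list of (timestamp, PIL.Image) pairs accumulated at 1 FPS
frame_embs = []     # cached CLIP embeddings, one per buffered frame

def add_frame(ts, pil_image):
    """Buffer a sampled frame together with its normalized CLIP embedding."""
    frame_buffer.append((ts, pil_image))
    with torch.no_grad():
        emb = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device))
    frame_embs.append(emb / emb.norm(dim=-1, keepdim=True))

def retrieve_past_frames(retrieval_query: str, k: int = 4):
    """Return the k buffered frames most similar to the text query."""
    with torch.no_grad():
        text = clip.tokenize([retrieval_query]).to(device)
        q = model.encode_text(text)
        q = q / q.norm(dim=-1, keepdim=True)
    sims = torch.cat(frame_embs) @ q.T          # cosine similarity per frame
    top = sims.squeeze(1).topk(min(k, len(frame_buffer)))
    return [frame_buffer[i] for i in top.indices.tolist()]
```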

2. LongCoM

PANDA maintains a long-term memory chain that records past reflection cases and their resolution strategies. During reflection triggered by insufficient reasoning in long-term dependency scenarios, PANDA retrieves similar past cases from LongCoM. These historical resolutions serve as heuristic priors, guiding current decision-making and reducing reliance on immediate temporal information alone.

Together, these two components — retrieval-enhanced reflection and long-term memory reuse — allow PANDA to reason across extended time horizons, addressing the challenges of long-term temporal dependencies in real-world video anomaly detection.

Comment

Dear Reviewer AEpx,

We sincerely thank you for the time and effort in reviewing our work, and we truly appreciate your support.

With less than one day remaining before the discussion deadline, we would like to kindly ask whether our response has addressed your concerns. Please also let us know if you have any other questions or concerns, and we will be glad to provide further clarification.

Best regards,

Authors of Submission3267

Final Decision

The reviewers and the AC concur that the paper demonstrates sufficient novelty, provides detailed implementation, and presents strong results. This paper went through extensive discussion. The AC recommends acceptance and requests that the authors incorporate all reviewer queries in the final version.