PaperHub
5.5 / 10
Poster · 4 reviewers
Ratings: 7, 6, 6, 3 (lowest 3, highest 7, std dev 1.5)
Confidence: 4.5
Correctness: 3.3
Contribution: 3.0
Presentation: 3.5
NeurIPS 2024

HAWK: Learning to Understand Open-World Video Anomalies

OpenReview · PDF
Submitted: 2024-05-06 · Updated: 2024-12-19
TL;DR

A large vision-language model for understanding open-world video anomalies.

Abstract

Keywords
Video Anomalies Understanding

Reviews and Discussion

Review
Rating: 7

This work proposes a novel video-language framework, HAWK, aiming at understanding video anomalies, which incorporates motion modality to enhance its capability. The work generates rich language descriptions for seven different video anomaly datasets, and also generates question-answer pairs to tackle potential user inquiries. The proposed framework demonstrates SOTA performance for video anomaly understanding and question-answering across multiple scenarios, which will advance the open-world anomaly understanding field.

Strengths

The work proposes a novel vision-language framework to address open-world video anomaly understanding, which is a very different pipeline from previous classification-based anomaly detection pipelines. To build a vision-language model for open-world video anomaly understanding, the work adopts seven different video anomaly datasets for training, where rich language descriptions and question-answer pairs are generated. Experiments for video anomaly understanding and question-answering across multiple scenarios demonstrate the effectiveness of the proposed framework. I believe the work will advance the field of open-world video anomaly understanding.

Weaknesses

  1. The work adopts Gunnar Farneback’s algorithm to obtain the motion modality. Is this algorithm efficient?
  2. The Ubnormal dataset consists of virtual anomaly videos. Do the authors consider the gap between real and virtual anomaly videos, which is mentioned in previous works [1, 2]?

[1] Ubnormal: New benchmark for supervised open-set video anomaly detection, CVPR 2022

[2] Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping, CVPR 2023

Questions

See the Weakness part.

Limitations

The paper has discussed the limitations and potential impacts of the work.

Author Response

We extend our profound gratitude for your esteemed recognition of our paper. Furthermore, we sincerely appreciate your constructive feedback and suggestions, which have proven to be instrumental in enhancing our manuscript.

We address each of the questions you raised below:


Response to Q1: The Efficiency of Gunnar Farneback’s Algorithm for Motion Modality Extraction

Firstly, Gunnar Farneback's algorithm demonstrates remarkable efficiency in generating video optical flows—even on CPU platforms. For each frame, the efficiency of this algorithm surpasses that of other widely deployed methods, as illustrated in the table below:

| Methods | Gunnar Farneback [1] | LDOF [2] | FlowNetS [3] | FlowNetC [3] |
| --- | --- | --- | --- | --- |
| Seconds per frame | 0.02 (CPU) | 65 (CPU) & 2.5 (GPU) | 0.08 (GPU) | 0.15 (GPU) |
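As a point of reference, the per-frame cost can be measured with the standard OpenCV implementation of Gunnar Farneback's algorithm. The sketch below is illustrative only; the video path and parameter values are assumptions, not our exact benchmarking setup:

```python
import time
import cv2

# Minimal sketch: dense optical flow via Gunnar Farneback's algorithm (OpenCV).
# "example_video.mp4" and the parameter values below are illustrative placeholders.
cap = cv2.VideoCapture("example_video.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

timings = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    start = time.time()
    # args: prev, next, flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    timings.append(time.time() - start)
    prev_gray = gray

cap.release()
print(f"average seconds per frame: {sum(timings) / max(len(timings), 1):.4f}")
```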

Secondly, for a given video, motion extraction needs to be performed only once, no matter how many rounds of dialogue follow. Consequently, the motion extraction time for one video (about 0.72 seconds) is significantly shorter than the time required to generate a single round of dialogue (about 1.5 seconds). This cost is acceptable for a practical anomaly detection system.

Lastly, we concur that future research into more efficient methods for optical flow extraction, including end-to-end optical flow extraction strategies, will likely further augment the efficiency of our system.

[1] Two-frame motion estimation based on polynomial expansion. In Image Analysis: 13th Scandinavian Conference, SCIA 2003.

[2] Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 2011.

[3] FlowNet: Learning Optical Flow with Convolutional Networks. ICCV 2015.


Response to Q2: The gap between real and virtual anomaly videos within the Ubnormal dataset.

Firstly, we acknowledge that some existing works have considered the distinction between real-world and virtual anomalies. In our paper, to mitigate potential discrepancies and bridge the gap between these two types of data, we mixed all datasets to train a single open-world model on both synthetic and real-world data, as shown in Appendix D (Data Distribution). This strategy ensures that our model is versatile and performs well across different data types, as shown in Appendix F.

Besides, while our current work does not delve into how synthetic data can be leveraged to enhance the detection of real-world anomalies, as studied in [2], we recognize the importance of this aspect. The potential of synthetic data to boost the performance of anomaly detection systems is substantial, and we agree that exploring it further could yield significant advancements.

[1] Ubnormal: New benchmark for supervised open-set video anomaly detection, CVPR 2022

[2] Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping, CVPR 2023

Comment

Thanks for the authors' detailed response. I have read all reviewers' comments and the corresponding responses. I think most of my concerns have been adequately addressed. I am willing to increase my rating.

Review
Rating: 6

The manuscript proposes a new framework that uses an interactive large-scale visual language model (VLM) to accurately explain video anomalies. The framework can explicitly integrate motion modalities to enhance anomaly identification. The manuscript also constructs an auxiliary consistency loss to guide the video branch to focus on motion modalities. The manuscript annotates more than 8,000 anomaly videos with language descriptions, enabling effective training in different open-world scenarios, and creates 8,000 question-answer pairs for users' open-world questions. The final results show that HAWK achieves SOTA performance, surpassing existing baselines in both video description generation and question answering.

Strengths

  1. This manuscript proposes a novel video-language framework, HAWK, which aims to understand video anomalies and combines motion modalities to enhance its capabilities.
  2. This manuscript collects seven video anomaly datasets from different scenarios and generates rich language descriptions for each video. At the same time, considering the diversity of open-world questions, question-answer pairs are generated to address potential user queries.
  3. The proposed method achieves state-of-the-art performance on three public anomaly detection datasets.

Weaknesses

  1. There are already many studies focusing on the background information in abnormal events [1, 2, 3]. Have the authors considered describing the background information when describing the action information?
  2. The dataset proposed in the manuscript also supports question answering in open scenes, but the method does not show how these questions and answers are used to improve the model's anomaly detection and understanding capabilities.
  3. When constructing the dataset, generating a description for every second may produce many repeated descriptions. To avoid this redundancy, have the authors considered using keyframes instead of second-by-second descriptions to reduce the complexity and computational cost of data processing?
  4. Most previous studies pay equal attention to the various parts of a video, such as background, motion information, and character appearance, whereas this paper focuses on motion information: Section 4.3 mainly extracts motion-related language descriptions, and it is unclear whether background information is considered. Moreover, the seven datasets used in the manuscript contain many different abnormal scenes, so I am curious why scene information is not attended to.

[1] Scene-aware context reasoning for unsupervised abnormal event detection in videos. ACM Multimedia 2020.

[2] Few-shot scene-adaptive anomaly detection. ECCV 2020.

[3] Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. ACM Multimedia 2022.

Questions

As mentioned above.

Limitations

Yes, the authors address possible limitations of their study.

Author Response

We are deeply grateful for your acknowledgment of our efforts, especially for highlighting our contributions to the framework, dataset, and experimental outcomes. Additionally, we are thankful for the precious suggestions you have offered, which will significantly aid in the enhancement of our manuscript.

In the following, we systematically respond to the queries and concerns you have raised.


Response to Q1 and Q4: Missing background information and other parts?

Firstly, although we have significantly enhanced motion information in Sections 4.2 and 4.3, this enhancement does not preclude our core task from maintaining vital scene information, which includes background details and appearances.

As delineated in Line 227 (Optimization Goal) of the main paper, our primary task is established upon all parts of the video, controlled via $t_0 = 1$. Motion information only serves as an augmentation for anomaly information, with its optimization weights set to $t_1 = 0.1$ and $t_2 = 0.1$. Thus, our optimization goal does not neglect the comprehension of background information or other parts.
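For illustration, this weighting can be read as an overall objective of the following form, where the term names are placeholders for the primary description loss and the two motion-related auxiliary losses (the exact formulation is given in the main paper):

$$\mathcal{L} = t_0\,\mathcal{L}_{\text{description}} + t_1\,\mathcal{L}_{\text{video-motion consistency}} + t_2\,\mathcal{L}_{\text{motion-language matching}}, \qquad t_0 = 1,\; t_1 = t_2 = 0.1.$$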

Moreover, our focus on motion information stems from the observation, in other video understanding baselines (e.g., Video-LLaMA), that an excessive focus on background information leads to a diminished capability to grasp video anomalies (as shown in Figure 1 and Table 2(A) in the main paper). Therefore, to enhance the understanding of video anomalies, it is essential to strengthen the representation of motion information within the network.

About Reference Papers. Regarding the works you referenced, [1] and [3] also utilize object bounding boxes to explicitly describe the objects involved in anomalies and employ a dual-path strategy to amplify the representation of these objects within the network. These objects, typically humans or vehicles, are also in motion during anomalies, whereas background information is usually static.

Our Comment. We agree that leveraging background information can provide a robust prior for understanding video anomalies, as [1,2,3] indicated. We will try to integrate information related to scenes, backgrounds, and objects in a large-model-based video anomaly understanding model.

[1] Scene-aware context reasoning for unsupervised abnormal event detection in videos. ACM Multimedia 2020.

[2] Few-shot scene-adaptive anomaly detection. ECCV 2020.

[3] Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos. ACM Multimedia 2022.


Response to Q2: Use generated questions and answers to improve the model's anomaly detection and understanding capabilities?

While we introduce the generation of question-answer (QA) pairs, it is crucial to clarify that the primary objective of generating these pairs was NOT to serve as data augmentation for enhancing the model's capability to understand anomalies. Instead, the improvement in our model's understanding of anomalies comes ONLY from the accurate descriptions of anomaly videos and the motion-aware network.

Distinctively, the generation of QA pairs aims to cover potential inquiries from users in an open-world setting, thereby facilitating improved user interaction with the network. Therefore, as shown in Table 1 (B), the Anomaly Video Question-Answering is evaluated as a separate task. This evaluation is specifically designed to simulate the performance of the system when utilized in the open-world environment.


Response to Q3: Why not use keyframes to reduce the complexity and computational cost of data processing?

Our Consideration. First, our current methodology for dataset construction samples the video at consistent one-second intervals. This choice ensures that all possible anomalies within the videos are comprehensively captured (some anomalies last only 1-2 seconds), thereby significantly reducing the likelihood of missing critical events. While we realize that this may lead to a degree of caption redundancy, we prioritize thorough annotation to ensure that all anomalies are covered.
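For illustration, the one-second sampling described above could be implemented as in the following sketch (the file path and the way sampled frames are consumed are hypothetical, not our exact annotation pipeline):

```python
import cv2

# Sketch: sample roughly one frame per second from a video for captioning.
# "anomaly_clip.mp4" is a hypothetical path; the fps fallback handles missing metadata.
cap = cv2.VideoCapture("anomaly_clip.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
step = max(int(round(fps)), 1)  # number of frames between samples (~1 second)

sampled, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        sampled.append(frame)  # these frames would be passed to the captioning stage
    idx += 1

cap.release()
print(f"sampled {len(sampled)} frames at ~1 frame per second")
```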

In addition, we have leveraged the capabilities of GPT-4 for generating captions, especially for anomalous events (refer to Figure 2 Prompt). Due to GPT-4's advanced text generation and summarization abilities, it serves as an effective tool in minimizing redundancy, ensuring that the extracted captions are both high-quality and succinct.

After the initial processing with GPT-4, we also undertake a manual checking process. This step is crucial for further reducing any residual redundancy and correcting possible errors within the captions, thereby ensuring the quality and accuracy of our dataset.

Future Work. Certainly, we agree that using keyframes is an effective strategy, especially for much longer videos, and we believe it has the potential to significantly enhance data annotation efficiency. We leave this for future work.

Comment

The authors' response has resolved my doubts to some extent, but I hope the corresponding clarifications will be added to the revised manuscript. I maintain my previous score.

Review
Rating: 6

The paper proposes a new variant of video anomaly detection. Prior methods were vanilla classification methods; this paper instead proposes descriptive anomaly explanations together with question answering. The paper first presents the dataset creation strategy and then introduces the motion model and the video architecture that produce a text description of the anomaly. The authors also include baselines for comparison against recent methods.

Strengths

  1. The problem setup is novel and fills the gap in current VAD methods.

  2. The dataset creation strategy is very intuitive and the quality looks good from the examples provided.

  3. The evaluation protocol is correctly explained.

Weaknesses

My main concern is regarding the experimentation and the lack of some crucial baselines.

  1. The proposed method benefits from the training data, whereas all the baselines are zero-shot. It is therefore difficult to evaluate the contributions of the motion model and the training data separately. It is essential to show how baseline methods like VideoChat behave when they are fine-tuned with the training data the authors propose. Without these baselines, it is difficult to evaluate the effectiveness.

  2. The GPT-4 output shown in Fig. 2 (a car with red lights has lost traction...) is already pretty good. Because of this, the authors must show that the trained model performs better than the basic dataset creation strategy. It is understood that the dataset is created using off-the-shelf models and is expected to perform worse than the trained video and motion model, but a baseline that uses the pseudo-dataset creation strategy is important (or an appropriate justification of why it cannot be added).

Questions

Please answer the question regarding the experimental details. Overall, I am positive about the paper and hence, will increase the rating once the clarification on the experiments is provided.

Minor comment

Guieded -> Guided in Tab 4

Limitations

Yes, limitations are discussed and they correctly reflect the shortcomings that I see.

Author Response

We sincerely appreciate the valuable time you have dedicated to providing critical reviews and also are grateful for your recognition of the problem setup, dataset, and experimental evaluation in our work.

In response to your concerns, we offer the following clarifications.


Response to Q1: Demonstrating the effectiveness of our motion model through fine-tuning other baselines with our proposed dataset

To demonstrate the significance of the motion model, we have conducted ablation studies; the results are presented in Table 3 (A) and (B) of the main paper. Compared with the "w/o motion" setting, the results demonstrate that the motion modality, video-motion consistency, and motion-language matching can significantly enhance the overall performance of the framework. Additionally, this experiment is a fair comparison because the training and testing procedures are identical.

To further demonstrate the effectiveness of the motion model, we fine-tuned other frameworks using the same dataset we introduced.

NOTICE: The results are depicted in Tables (A) and (B) in the global rebuttal mentioned ABOVE.

Based on the data presented in Tables (A) and (B), when facing various baselines and maintaining identical fine-tuning data, the incorporation of motion information can enhance the model's effectiveness in both Anomaly Video Description Generation and Anomaly Video Question-Answering.

[1] Video-ChatGPT: Towards detailed video understanding via large vision and language models. ACL 2024.

[2] VideoChat: Chat-Centric Video Understanding. CVPR 2024.

[3] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. EMNLP 2023 Demo.

[4] LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. ICLR 2024.

[5] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv 2024.


Response to Q2: Demonstrating that the trained model outperforms GPT-4-based data generation.

While GPT-4 can assist in video labeling, "Manual Checking" remains a crucial step in the data annotation process. As depicted in Figure 2 of the main paper, we show a case with high-quality labeling, which was also subject to "Manual Checking".

In line 496, "Manual Checking" involves correcting errors in the video's textual descriptions. This primarily includes rectifying inaccuracies in the descriptions, eliminating redundant information, and adding correct details regarding objects, motions, and scenes. The table below provides an example of before and after manual checking:


| Before Manual Checking (GPT-4) | After Manual Checking |
| --- | --- |
| The video depicts different scenes of people walking and sitting in front of a building and a train station. There are also shots of a woman with a red bag sitting on the ground, a man walking by with a shopping bag, and a young boy in a red jacket standing outside a building. There are also several shots of people standing at a bus stop, and a sign with a glowing red hand. | The video shows scenes in front of the subway entrance, with students and pedestrians coming and going or entering and exiting the subway. In addition, two school security guards patrol through, and at the subway entrance, two students are chatting and discussing. |

In this example, the caption generated by GPT-4 included hallucinations (such as "a man walking by with a shopping bag" or "a young boy in a red jacket" ), which were corrected after Manual Checking. Additionally, GPT-4's description was inaccurate (for instance, "bus stop" should have been more accurately described as "subway entrance").

Based on this, only using the results generated by GPT-4 for video understanding is inaccurate and could lead to erroneous descriptions and hallucinations. Therefore, we invested human resources (a total of 150 hours) in Manual Checking for each video's annotations to ensure the dataset is of higher quality.

Furthermore, to demonstrate that our model has better detection capabilities than GPT-4, we compared our model's results with the unchecked labels generated by GPT-4 on the same testset, as shown in the following table:


| Methods | Backbones | LLM Size | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data Engineering Pipeline | GPT-4 [1] | - | 0.188 | 0.098 | 0.056 | 0.034 | 0.189 | 0.313 | 0.158 |
| Ours | LLaMA-2 7B | 7B | 0.270 | 0.139 | 0.074 | 0.043 | 0.283 | 0.320 | 0.218 |

Clearly, the results indicate that our model can better assist in understanding video anomalies compared to GPT-4, achieving state-of-the-art (SOTA) performance in both text-level and GPT-guided evaluations.
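For reference, the BLEU-n columns are standard text-level metrics. A minimal sketch of computing them with NLTK is shown below (the example sentences are invented, and this may differ from our exact evaluation script):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Sketch: cumulative BLEU-1..4 between a reference caption and a generated one.
# The sentences are invented examples, not items from the actual test set.
reference = "two students are chatting at the subway entrance".split()
hypothesis = "two people are talking near the subway entrance".split()

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights for cumulative BLEU-n
    score = sentence_bleu([reference], hypothesis, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```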

[1] GPT-4 Technical Report. arXiv 2023.


Response to minor comment.

We will revise the manuscript to address the typo and will conduct a comprehensive check in the final version to correct any remaining minor issues.

Comment

I would like to thank the authors for their response.

The experiments for both my concerns are now making the comparisons stronger. Regarding Q2, it is good to see that the training improves the naive weakly supervised training set. Similarly, in Q1, the fine-tuning significantly improves the numbers. These are the numbers that should have been ideally reported.

I am updating my rating to weak accept since these new experiments finally show the effectiveness of the method. Please update the paper with these new numbers.

Review
Rating: 3

This paper presents HAWK, a framework that uses large Visual Language Models (VLMs) to accurately interpret video anomalies. HAWK integrates motion and video modalities through a dual-branch framework, enhanced by an auxiliary consistency loss to focus on motion-related features. The authors annotated over 8,000 anomaly videos with language descriptions and created 8,000 question-answer pairs to train the model across diverse scenarios. HAWK demonstrates state-of-the-art performance.

Strengths

  1. The framework benefits from extensive annotations and question-answer pairs, improving training quality.
  2. HAWK achieves superior results in video description generation and question-answering tasks, outperforming existing baselines.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. In Table 1, it would be better to indicate the LLM backbone of the methods for a fair comparison.
  2. From the ablation study in Table 3, it appears that even without the motion modality, the baseline model achieves comparable performance to other methods without motion modality. This may be because the high-quality dataset and strong LLM backbone contribute more to the performance, which weakens the perceived technical contribution of the motion modality.

Questions

See the weaknesses.

Limitations

See the weaknesses.

Author Response

Thank you for your meticulous and thoughtful review of our work. We greatly appreciate your affirmation of our work's framework, experimental results, and writing quality.

We are also thankful for the profound questions you raised regarding our paper and will provide detailed explanations and clarifications in the following sections.


Response to Q1: What is the LLM backbone of the baselines?

To ensure the fairness of our experiments, we employed large language models (LLMs) of the same size (7B parameters) as the backbone for the baselines, as illustrated in the table below:

| Methods | Backbones | LLM Size |
| --- | --- | --- |
| Video-ChatGPT [1] | LLaMA 7B | 7B |
| VideoChat [2] | Vicuna 7B | 7B |
| Video-LLaMA [3] | LLaMA-2 7B | 7B |
| LLaMA-Adapter [4] | LLaMA-2 7B | 7B |
| Video-LLaVA [5] | Vicuna 7B | 7B |
| Ours | LLaMA-2 7B | 7B |

This information will be incorporated into the experimental section of our work.

[1] Video-ChatGPT: Towards detailed video understanding via large vision and language models. ACL 2024.

[2] VideoChat: Chat-Centric Video Understanding. CVPR 2024.

[3] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. EMNLP 2023 Demo.

[4] LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. ICLR 2024.

[5] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection. arXiv 2024.


Response to Q2: Technical contribution of integrating motion modality.

To further substantiate the contribution of motion information to the network architecture, we conducted tests using our provided data on baseline models that do not incorporate motion information.

NOTICE: The results are depicted in Tables (A) and (B) in the global rebuttal mentioned ABOVE.

Given the absence of motion modality information in these baselines, the experimental results affirm that integrating motion information into our framework significantly helps enhance model performance.

Moreover, in other studies, such as VideoChatGPT [1] and VideoChat [2], the construction of datasets is regarded as a significant contribution, whereas the model architecture is not highly emphasized. Thus, the dataset construction aspect of our work also constitutes a core contribution of this paper.

Certainly, we also agree that developing better frameworks remains a promising research direction. Our future research can delve deeper to explore and unlock more possibilities.

[1] Video-ChatGPT: Towards detailed video understanding via large vision and language models. ACL 2024.

[2] VideoChat: Chat-Centric Video Understanding. CVPR 2024.

Comment

Dear reviewer,

The authors have responded to your comments. Please check whether your concerns have been adequately addressed. You can ask the authors for further clarification if required.

Best, Your AC

Comment
  1. The significant improvements seen when fine-tuning other frameworks with your dataset suggest that the performance boost is primarily due to the high quality of the dataset itself, reinforcing my assumption that the dataset plays a major role in the observed enhancements.
  2. In response to your claim that “the experimental results can affirm that integrating motion information into our framework significantly enhances model performance,” I would like to correct this interpretation. The effectiveness of the motion modality module should be rigorously evaluated through an ablation study within the same framework, rather than by comparing your framework with motion modality against different frameworks without it. Based on your ablation study, the motion modality provides limited enhancement.
  3. Video-ChatGPT and VideoChat, as early works, contribute significantly to multimodal dialogue systems for video understanding, beyond just dataset creation. In contrast, the motion modality in your work appears to be more of an incremental improvement rather than a groundbreaking innovation. Similarly, while your dataset is valuable, it doesn’t seem to offer substantial advancements over existing datasets.

Given these considerations, I am updating my rating to reject.

Comment

Thank you once again for your comments on our paper. We appreciate your concerns regarding the effectiveness of the model architecture, and we will provide further clarification on this matter.


Firstly, we acknowledge that the proposed dataset can significantly enhance the effectiveness of the model in understanding video anomalies.

However, it is unreasonable to conclude that the motion model has limited significance based on this.

Here are some reasons:


Firstly, to demonstrate the effectiveness of motion information, we have conducted a rigorous ablation study as in Table 3 and 4 of the main paper. The experimental results, including quantitative metrics and qualitative analysis, indicate a significant improvement when motion information is integrated, compared to "w/o motion information", across ALL Text-level and GPT-guided metrics. (However, Video-ChatGPT and VideoChat only utilize a single type of GPT-guided metric.)

In the magnitude of performance improvement, integrating motion information resulted in an average performance increase of around 2.4% in GPT-guided metrics. In comparison, the performance increase in Video-ChatGPT compared to the SOTA baseline was around 1.8% in their paper. Therefore, the performance improvement observed in the ablation study is significant.

Hence, the reviewer's assertion that the improvement in model performance is merely "comparable" is both unreasonable and insufficiently supported.


Secondly, to further demonstrate the effectiveness of motion information, we compared the results of fine-tuning other baselines on the same training data. This type of experiment serves as a valid form of evidence (acknowledged by Reviewer K7jv and previously used in VideoChat).

When compared to other existing baselines, our framework significantly outperformed after fine-tuning on the same training and testing data. Since other baselines do not incorporate motion information, this comparison showcases the advantage of our framework in leveraging motion for video anomaly understanding.

Additionally, it is worth noting that the performance of our base model ("w/o motion information" in Table 3 (A)) was initially weaker than Video-LLaMA and LLaMA-Adapter on the same training data. However, after integrating motion-related information and motion-related loss functions, the performance saw a significant enhancement. This demonstrates that the performance improvement of the model is indeed derived from motion information.


Thirdly, the problem that this paper aims to address is to enhance the system's ability to understand anomalous information in videos. Although our method is built upon the framework of video understanding, it would be unfair to directly assess the novelty of our approach based on previous general video understanding frameworks. Instead, our improvements are significant in the field of understanding abnormal information in videos. This contribution has been acknowledged and praised by Reviewers K7jv, Cy7X, and kzaw in the Strengths section.


Fourthly, we have noticed that in the follow-up comments, the first point acknowledges our significant contribution to the dataset (even considering it the sole contribution), whereas the third point denies the contribution of certain parts of the dataset (stating that it "doesn't seem to offer substantial advancements over existing datasets"). This raises doubts about whether the comments are made from a consistent and fair perspective. We look forward to further discussion to resolve these points.


Thank you again for your comments; we look forward to your response.

Comment

One more suggestion to make your ablation study truly rigorous: for the contribution claim as a framework, it's important to conduct ablation experiments on the LLM backbone itself. For example, testing different LLMs and varying parameter sizes (e.g., different versions of LLaMA, different parameter sizes) would help justify why you chose LLaMA-2 7B for your framework:)

Comment
  1. The author states, “Although our method is built upon the framework of video understanding, it would be unfair to directly assess the novelty of our approach based on previous general video understanding frameworks.” However, video anomaly understanding is a subtask of video understanding, which justifies comparing your method with standard video understanding baselines. Moreover, motion modality is not unique to anomaly videos; it applies to other video types as well. In your Related Work section, you compare your method with existing video understanding frameworks, noting that the primary difference is the addition of motion modality.

  2. In large visual language model fine-tuning, it is common practice to provide a high-quality dataset to facilitate downstream tasks. This is why I emphasize that the dataset plays a major role in the observed enhancements. However, the way your dataset was created does not demonstrate significant differentiation or contribution compared to datasets for other downstream tasks.

  3. Considering the limited technical contribution of introducing motion modality to existing video understanding baselines, coupled with the common practice of creating datasets tailored for downstream tasks, I believe the novelty and contribution of your work may not meet the threshold required for publication at NeurIPS.

Comment

Your further comments are crucial for us to refine our work, and we deeply appreciate them.

In response to the issues you raised, we will provide detailed clarification to ensure our work is more accurate and comprehensive.


About Unfair Assessment

We have never denied that understanding video anomalies is a subtask of video understanding.

However, the essence of our task lies in understanding "anomalies"; hence, we believe the core issue is whether the proposed motion modality can significantly help the understanding of anomalies, rather than whether the motion module can assist in general video understanding.

Therefore, the "Unfair" we emphasize is from the perspective of problem definition. To argue the novelty of this paper, we start from the understanding of the "video anomaly" and seek reasonable strategies to enhance its understanding capability.

Furthermore, we have NOT avoided direct comparison with other general video understanding baselines (including zero-shot and fine-tuned), as can be seen from our global rebuttal and Table 1 in the main paper.

At the same time, we do not deny that the motion modality exists in other videos, and integrating motion information can certainly serve as a means to further enhance video performance in other specific scenarios. However, this does not negate our core idea that the proposed motion module is useful for understanding anomalies.


About Dataset Building Pipeline

In the dataset, our main contribution is the annotated descriptions for seven unannotated datasets across seven different domains. This data itself will support our task and promote future advancements in this field.

We do NOT emphasize the creation method of the dataset as the core contribution of this paper. Certainly, selecting an appropriate method for creation is necessary.

We believe there might have been a misunderstanding by the reviewer regarding this point.


About Different Backbones in Ablation Study

In the task of video anomaly understanding, the primary objective of this study is to ascertain the role of the motion modality within the overall model framework. In the ablation study, we have validated its significance while maintaining a consistent backbone architecture, as detailed in Table 3 and the preceding analysis within the main paper.

Furthermore, we acknowledge that the selection of alternative frameworks might achieve higher performance. However, this consideration is NOT aligned with the current focus of our research. Our focus lies in exploring ways to improve the performance of video anomaly detection within one reasonable framework.

Nonetheless, we are open to integrating new frameworks to extend the applicability of our study, which will be a direction for our future work.

Author Response (Global Rebuttal)

Firstly, we would like to express our heartfelt gratitude to all the reviewers for their insightful and constructive suggestions. These suggestions have been immensely helpful in refining our paper and have provided valuable guidance for our research direction.


Restating Our Contributions

The principal contribution of our work is the development of a novel framework aimed at understanding video anomalies. This framework, by integrating motion modality information, significantly enhances the capability to comprehend abnormal events.

Moreover, we have built a comprehensive dataset containing over 8,000 annotated video anomalies across various scenarios. The creation of this dataset enriches the training materials for the model, enabling it to better adapt to the needs of open-world anomaly understanding.

We believe that, with the valuable feedback from the reviewers to refine the paper, this research will bring innovative insights and practical solutions to the field of video anomaly understanding.


Supplementary experiment for Reviewer b4dX and Reviewer K7jv

To further demonstrate the effectiveness of the motion model, we fine-tuned other frameworks using the same dataset we introduced.


(A) Anomaly Video Description Generation

| Methods | Backbones | LLM Size | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-ChatGPT [1] | LLaMA 7B | 7B | 0.240 | 0.121 | 0.062 | 0.021 | 0.159 | 0.177 | 0.132 |
| VideoChat [2] | Vicuna 7B | 7B | 0.123 | 0.101 | 0.048 | 0.019 | 0.198 | 0.271 | 0.144 |
| Video-LLaMA [3] | LLaMA-2 7B | 7B | 0.250 | 0.135 | 0.069 | 0.038 | 0.279 | 0.319 | 0.195 |
| LLaMA-Adapter [4] | LLaMA-2 7B | 7B | 0.253 | 0.136 | 0.070 | 0.040 | 0.240 | 0.190 | 0.172 |
| Video-LLaVA [5] | Vicuna 7B | 7B | 0.150 | 0.072 | 0.031 | 0.010 | 0.101 | 0.201 | 0.118 |
| Ours | LLaMA-2 7B | 7B | 0.270 | 0.139 | 0.074 | 0.043 | 0.283 | 0.320 | 0.218 |

(B) Anomaly Video Question-Answering

| Methods | Backbones | LLM Size | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Reasonability | Detail | Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-ChatGPT [1] | LLaMA 7B | 7B | 0.188 | 0.109 | 0.063 | 0.042 | 0.612 | 0.565 | 0.527 |
| VideoChat [2] | Vicuna 7B | 7B | 0.282 | 0.143 | 0.090 | 0.049 | 0.712 | 0.663 | 0.567 |
| Video-LLaMA [3] | LLaMA-2 7B | 7B | 0.168 | 0.089 | 0.051 | 0.030 | 0.624 | 0.527 | 0.449 |
| LLaMA-Adapter [4] | LLaMA-2 7B | 7B | 0.239 | 0.142 | 0.088 | 0.050 | 0.673 | 0.572 | 0.532 |
| Video-LLaVA [5] | Vicuna 7B | 7B | 0.123 | 0.098 | 0.069 | 0.047 | 0.620 | 0.528 | 0.533 |
| Ours | LLaMA-2 7B | 7B | 0.319 | 0.179 | 0.112 | 0.073 | 0.840 | 0.794 | 0.753 |

Final Decision

The manuscript received ratings of two weak accepts, one accept, and one reject. Reviewers appreciated the proposed benchmark, approach, and experimental results, but also raised concerns such as the respective impact of the motion modality and the strong LLM backbone. The authors provided a detailed rebuttal, and three reviewers remained positive. Given the reviewers' feedback, rebuttal, and reviewer discussions, the recommendation is accept. The authors are strongly encouraged to take into account all the reviewer feedback, and to publicly release the dataset, code, and models, when preparing the final draft.