

Deep Learning for Video Anomaly Detection: A Review

Peng Wu, Chengyu Pan, Yuting Yan, Guansong Pang, Peng Wang, and Yanning Zhang, Senior Member, IEEE

Abstract—Video anomaly detection (VAD) aims to discover behaviors or events deviating from the normality in videos. As a long-standing task in the field of computer vision, VAD has witnessed much good progress. In the era of deep learning, with the explosion of architectures of continuously growing capability and capacity, a great variety of deep learning based methods are constantly emerging for the VAD task, greatly improving the generalization ability of detection algorithms and broadening the application scenarios. Therefore, such a multitude of methods and a large body of literature make a comprehensive survey a pressing necessity. In this paper, we present an extensive and comprehensive research review, covering the spectrum of five different categories, namely, semi-supervised, weakly supervised, fully supervised, unsupervised and open-set supervised VAD, and we also delve into the latest VAD works based on pre-trained large models, remedying the limitations of past reviews in terms of only focusing on semi-supervised VAD and small model based methods. For the VAD task with different levels of supervision, we construct a well-organized taxonomy, profoundly discuss the characteristics of different types of methods, and show their performance comparisons. In addition, this review involves the public datasets, open-source codes, and evaluation metrics covering all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.

Index Terms—Video anomaly detection, anomaly detection, video understanding, deep learning.

(Peng Wu, Chengyu Pan, Yuting Yan, Peng Wang, and Yanning Zhang are with the School of Computer Science, Northwestern Polytechnical University, China. E-mail: xdwupeng@[Link]; {[Link], ynzhang}@[Link]. Guansong Pang is with the School of Computing and Information Systems, Singapore Management University, Singapore. E-mail: pangguansong@[Link]. Manuscript received April 19, 2021; revised August 16, 2021. Corresponding author: Guansong Pang. Chengyu Pan and Yuting Yan contributed equally.)

Fig. 1. Publications on VAD. Left: Google Scholar; Right: IEEE Xplore.

Fig. 2. Performance development for semi/weakly supervised VAD tasks.

I. INTRODUCTION

ANOMALY represents something that deviates from what is standard, normal, or expected. There are myriads of normalities, and anomalies are considerably scarce. However, when anomalies do appear, they often have a negative impact. Anomaly detection aims to discover these rare anomalies built on top of machine learning, thereby reducing the cost of manual judgment. Anomaly detection has widespread application across various fields [1], such as financial fraud detection, network intrusion detection, industrial defect detection, and human violence detection. Among these, video anomaly detection (VAD) occupies an important place, in which anomaly indicates the abnormal events in the temporal or spatial dimensions. VAD not only plays a vital role in intelligent security (e.g., violence, intrusion, and loitering detection) but is also widely used in other scenarios, such as online video content review and traffic anomaly prediction in autonomous driving [2]. Owing to its significant potential for applications across different fields, VAD has attracted considerable attention from both industry and academia.

In the pre-deep learning era, the routine way is to separate feature extraction and classifier design, which forms a two-stage process, and then combine them together during the inference stage. First, there is a feature extraction process to convert the original high-dimensional raw videos into compact hand-crafted features based on prior knowledge of experts. Although hand-crafted features lack robustness and are difficult to use for capturing effective behavior expressions in the face of complex scenarios, these pioneer works deeply enlighten subsequent deep learning based works.

The rise of deep learning has made traditional machine learning algorithms fall out of favor over the last decade. With the rapid development of computer hardware and the massive data in the Internet era, we have witnessed great progress in developing deep learning based methods for VAD in recent years. For example, ConvAE [3], the first work using deep auto-encoders based on convolutional neural networks (CNNs) for capturing the regularities in videos; FuturePred [4], the first work making use of U-Net for forecasting future anomalies; and DeepMIL [5], the first endeavor exploring the deep multiple instance learning (MIL) framework for real-world anomalies. In order to more intuitively manifest the research enthusiasm for the VAD task in the era of deep learning, we conduct a statistical survey on the number of publications related to VAD over the past decade (the era driven by the rise of deep learning based methods) through Google Scholar and IEEE Xplore¹.

TABLE I
ANALYSIS AND COMPARISON OF RELATED REVIEWS.

Reference | Year | Main Focus | Main Categorization | UVAD | WVAD | SVAD | FVAD | OVAD | LVAD | IVAD
Ramachandra et al. [8] | 2020 | Semi-supervised single-scene VAD | Methodology | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
Santhosh et al. [9] | 2020 | VAD applied on road traffic | Methodology | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗
Nayak et al. [10] | 2021 | Deep learning driven semi-supervised VAD | Methodology | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
Tran et al. [11] | 2022 | Semi & weakly supervised VAD | Architecture | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
Chandrakala et al. [12] | 2023 | Deep model-based one & two-class VAD | Methodology & Architecture | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Liu et al. [13] | 2023 | Deep models for semi & weakly supervised VAD | Model Input | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗
Our survey | 2024 | Comprehensive VAD taxonomy and deep models | Methodology, Architecture, Refinement, Model Input, Model Output | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓

UVAD: Unsupervised VAD, WVAD: Weakly supervised VAD, SVAD: Semi-supervised VAD, FVAD: Fully supervised VAD, OVAD: Open-set supervised VAD, LVAD: Large-model based VAD, IVAD: Interpretable VAD

We select five related topics, i.e., video anomaly detection, abnormal event detection, abnormal behavior detection, anomalous event detection, and anomalous behavior detection, and showcase the publication statistics in Figure 1. It is not hard to see that the number of related publications counted from both sources exhibits a steady and rapid growth trend, demonstrating that VAD has garnered widespread attention. Moreover, we also demonstrate the detection performance trends of annual state-of-the-art methods on commonly used datasets under two common supervised manners, and present performance trends in Figure 2. The detection performance shows a steady upward trend across all datasets, without displaying any performance bottleneck. For instance, the performance of semi-supervised methods on CUHK Avenue [6] has experienced a significant surge, rising from 70.2% AUC [3] to an impressive 90.1% AUC [7] over the past seven-year period. Moreover, for the subsequently proposed weakly supervised VAD, significant progress has been achieved as well. This indicates the evolving capability of deep learning methods under developing architectures, and also showcases the ongoing exploration enthusiasm of deep learning methods for the VAD task.

The above statistics clearly demonstrate that deep learning driven VAD is a hot area of current research. Therefore, there is an urgent necessity for a systematic taxonomy and comprehensive summary of existing works, to facilitate newcomers as a guide and provide references for existing researchers. Based on this, we first collect some high-profile reviews on VAD from the past few years, which are shown in Table I. Ramachandra et al. [8] mainly focused on semi-supervised VAD in the single-scene scenario, lacking in discussions of cross scenes. Santhosh et al. [9] reviewed VAD methods focusing on entities in road traffic scenarios. Their reviews lack sufficient in-depth analysis and center on pre-2020 methodologies, resulting in the neglect of recent advances. Nayak et al. [10] comprehensively surveyed deep learning based methods for semi-supervised VAD, but did not take into account weakly supervised VAD methods. The follow-up work of Tran et al. [11] introduced a review of the emerging weakly supervised VAD, but the focus is not only on videos but also on image anomaly detection, resulting in a lack of systematic organization of the VAD task. More recently, both Chandrakala et al. [12] and Liu et al. [13] constructed an organized taxonomy covering a variety of VAD tasks, e.g., unsupervised VAD, semi-supervised VAD, weakly supervised VAD, and fully supervised VAD, and also surveyed deep learning based methods for most supervised VAD tasks. However, they restrict their scope to the conventional closed-set scenario, and fail to cover the latest research in the field of open-set supervised VAD, without introducing a brand-new pipeline based on pre-trained large models and interpretable learning.

To address this gap comprehensively, we present a thorough survey of VAD works in the deep learning era. Our survey covers several key aspects to provide a comprehensive analysis of VAD studies. To be specific, we perform an in-depth investigation into the development trends of the VAD task in the era of deep learning, and then propose a unified framework that integrates different VAD tasks together, filling the gaps in the existing reviews in terms of taxonomy. We then collect the most comprehensive open sources, including benchmark datasets, evaluation metrics, open-source codes, and performance comparisons, to help researchers in this field avoid detours and improve efficiency. Further, we systematically categorize various VAD tasks, dividing existing works into different categories and establishing a clear and structured taxonomy system to provide a coherent and organized overview of various VAD paradigms. In addition to this taxonomy, we conduct a comprehensive analysis of each paradigm. Furthermore, throughout this survey, we spotlight influential works that have significantly contributed to the research advancement in VAD.

The main contributions of this survey are summarized in the following three aspects:
• We provide a comprehensive review of VAD, covering five tasks based on different supervision signals, i.e., semi-supervised VAD, weakly supervised VAD, fully supervised VAD, unsupervised VAD, and open-set supervised VAD. The research focus has expanded from traditional single-task semi-supervised VAD to a broader range of multiple VAD tasks.
• Staying abreast of the research trends, we review the latest studies on open-set supervised VAD. Moreover, we also revisit the most recent VAD methods based on pre-trained large models and interpretable learning. The emergence of these methods elevates both the performance and application prospects of VAD. To our knowledge, this is the first comprehensive survey of open-set supervised VAD and pre-trained large model based VAD methods.
• For different tasks, we systematically review existing deep learning based methods, and more importantly, we introduce a unified taxonomy framework categorizing the methods from various VAD paradigms based on various aspects, including model input, architecture, methodology, model refinement, and output. This meticulous scientific taxonomy enables a comprehensive understanding of the field.

¹ [Link] [Link]

Fig. 3. Comparisons of five supervised VAD tasks, i.e., fully supervised, semi-supervised, weakly supervised, unsupervised, and open-set supervised VAD.

II. BACKGROUND

A. Notation and Taxonomy

As aforementioned, the studied problem, VAD, can be formally divided into five categories based on supervision signals. Different supervised VAD tasks aim to identify anomalous behaviors or events, but with different training and testing setups. We demonstrate these different VAD tasks in Figure 3. The general VAD problem is presented as follows. Suppose we are given a set of training samples X = {x_i}_{i=1}^{N+A} and corresponding labels Y, where X_n = {x_i}_{i=1}^{N} is the set of normal samples and X_a = {x_i}_{i=N+1}^{N+A} is the set of abnormal samples. Each sample x_i is accompanied by a corresponding supervision label y_i in Y. During the training phase, the detection model Φ(θ) takes X as input and generates anomaly predictions; it is then optimized according to the following objective,

l = L(Φ(θ, X), X or Y)    (1)

where L(·) is employed to quantify the discrepancy between the predictions and the ground-truth labels or original samples. During inference, the detection model is expected to locate the abnormal behaviors or events in videos based on the generated anomaly predictions. Depending on the input to L, VAD can be categorized into one of the following five task settings.

Semi-supervised VAD assumes that only normal samples are available during the training stage, that is, X_a = ∅. This task aims to learn the normal patterns based on the training samples and consider the test samples which fall outside the learned patterns as anomalies. Pros and Cons: Only normal samples are required for training; hence, there is no need to painstakingly collect scarce abnormal samples. However, any unseen test sample may be recognized as abnormal, leading to a higher false positive rate.

Weakly supervised VAD has more sufficient training samples and supervision signals than semi-supervised VAD. Both normal and abnormal samples are provided during the training stage, but the precise location stamps of anomalies in these untrimmed videos are unknown. In other words, only coarse video-level labels are available (i.e., inexact supervision). Formally, Y = {y_i}_{i=1}^{N+A} with y_i ∈ {0, 1}, where y_i = 0 indicates that x_i is normal, and y_i = 1 indicates that x_i is abnormal. Pros and Cons: Compared to fully supervised annotation, it can significantly reduce labeling costs. However, it places higher demands on algorithm design and may lead to situations of blind guessing.

Fully supervised VAD, as its name implies, comprises the complete supervision signals, meaning that each abnormal sample has precise annotations of anomalies. This task can be viewed as a standard video or frame classification problem. Due to the scarcity of abnormal behaviors and intensive manual labeling in reality, there has been little research on the fully supervised VAD task. It is noteworthy that video violence detection can be regarded as fully supervised VAD; hence, we denote violence detection as a fully supervised VAD task in this paper. Formally, each video x_i in X_a is accompanied by a corresponding supervision label y_i = {(t_j^s, t_j^e)}_{j=1}^{U_i}, where t_j^s and t_j^e denote the start and end time of the j-th violence event, and U_i indicates the total number of anomalies present in the video. Pros and Cons: In contrast to weakly supervised VAD, with fully supervised information, the detection performance of the algorithms would be remarkable. However, the corresponding drawback is the high requirement for intensive manual annotations.

Unsupervised VAD aims to discover anomalies directly from fully unlabeled videos in an unsupervised manner. Thus, unsupervised VAD no longer requires labeling normal and abnormal videos to build the training set. It can be expressed formally as follows: X = X_test, and Y = ∅, in which X_test denotes the set of test samples. Pros and Cons: No time-consuming effort is needed to collect training samples, avoiding the heavy labeling burden. Besides, this assumption also expands the application fields of VAD, implying that the detection system can continuously retrain without human intervention. Unfortunately, due to the lack of labels, the detection performance is relatively poor, leading to a higher rate of false positives and false negatives.

Open-set supervised VAD is devised to discover unseen anomalies that are not presented in the training set. Unlike semi-supervised VAD, open-set supervised VAD includes abnormal samples in the training set, which are referred to as seen anomalies. Specifically, for each x_i in X_a, its corresponding label y_i ∈ C_base, where C_base represents the set of base (seen) anomaly categories, and C_base ⊂ C, with

Fig. 4. The taxonomy of semi-supervised VAD. We provide a hierarchical taxonomy that organizes existing deep semi-supervised VAD models by model input, methodology, network architecture, refinement strategy, and model output into a systematic framework.

C = C_base ∪ C_novel. Here, C_novel and C represent the sets of novel anomaly categories unseen during training and all anomaly categories, respectively. Given a testing sample x_test, its label y_test may be either in C_base or in C_novel. Pros and Cons: Compared to the two most common tasks, i.e., semi-supervised VAD and weakly supervised VAD, open-set supervised VAD not only reduces false positives but also avoids being limited to closed-set scenarios, thus demonstrating high practical value. However, it relies on learning specialized classifiers, loss functions, or generating unknown classes to detect unseen anomalies.

B. Datasets and Metrics

Related benchmark datasets and evaluation metrics are listed at [Link]

III. SEMI-SUPERVISED VIDEO ANOMALY DETECTION

Based on our in-depth investigation of past surveys, we found that previous surveys mostly lack a scientific taxonomy: many surveys simply categorize semi-supervised VAD works into different groups based on usage approaches, such as reconstruction-based, distance-based, and probability-based approaches, and some surveys classify works according to inputs, such as image-based, optical flow-based, and patch-based approaches. It is particularly apparent that existing classification reviews are relatively simplistic and superficial, thus making it challenging to cover all methods comprehensively and effectively. To address this issue, we establish a comprehensive taxonomy, encompassing model input, methodology, architecture, model refinement, and model output. The detailed illustration is presented in Figure 4.

As aforementioned, only normal samples are available for training in the semi-supervised VAD task, rendering the supervised classification paradigm inapplicable. Common approaches involve leveraging the intrinsic information of the training samples to learn deep neural networks (DNNs) for solving a pretext task. For instance, normality reconstruction is a classic pretext task [3]. During this process, several critical aspects need consideration: selection of sample information (Model Input), design of pretext tasks (Methodology), utilization of deep networks (Network Architecture), improvement of methods (Refinement), and expression of anomaly results (Model Output). These key elements collectively contribute to the effectiveness of semi-supervised VAD solutions. In the following sections, we introduce existing deep learning based VAD methods systematically according to the aforementioned taxonomy.

A. Model Input

Existing semi-supervised VAD methods typically use the raw video or its intuitive representations as the model input. Depending on the modality, these can be categorized as follows: RGB images, optical flow, skeleton, and hybrid inputs, where the first three represent appearance, motion, and body posture, respectively.

1) RGB: RGB images are the most common input for conventional vision tasks driven by deep learning techniques, and this holds true for the VAD task as well. Unlike other modalities, RGB images do not require additional processing steps such as optical flow calculations or pose estimation algorithms. In the deep learning era, various deep models can be employed to extract compact and high-level visual features from these high-dimensional raw data. Utilizing these high-level features enables the design of more effective subsequent detection methods. Moreover, depending on the input size, RGB image based input can be categorized into three principal groups: frame level, patch level, and object level.

Frame-level RGB input provides a macroscopic view of the entire scene, encompassing both the background, which is usually unrelated to the event, and the foreground objects, where anomalies are more likely to occur. The conventional approach typically uses multiple consecutive video frames as a single input to capture temporal context within the video, as seen in methods like ConvAE [3], ConvLSTM-AE [14], and STAE [15]. On the other hand, several research studies focus on using single-frame RGB as input, aiming to detect anomalies at the spatial level, such as AnomalyGAN [16] and AMC [17].

Patch-level RGB input involves segmenting the frame-level RGB input spatially or spatio-temporally, which focuses on local regions, effectively separating the foreground from the background.

The primary advantage of patch-level input is its ability to significantly reduce the interference from the dominant background, which is usually less relevant to the anomalies. This segmentation helps in isolating areas that are more likely to contain anomalies, thereby enhancing the detection accuracy. For example, AMDN [18], [19], DeepOC [20] and Deep-cascade [21] took spatio-temporal patches as the input, while S2-VAE [22] and GM-VAE [23] only employed an image patch from a single video frame as the model input.

Object-level RGB input has emerged in recent years alongside the advancement of object detection approaches, and focuses exclusively on foreground objects. Compared to patch-level input, it entirely disregards background information and neglects to consider the relationship between objects and backgrounds. As a result, it demonstrates impressive performance in identifying anomalous events within complex scenes. Hinami et al. [24] first proposed an object-centric approach, FRCN, based on object inputs; a follow-up work, ObjectAE [25], introduced training object-centric auto-encoders on detected objects; and subsequently, a series of works focusing on object-level inputs emerged, e.g., HF2-VAD [26], HSNBM [27], BDPN [28], ER-VAD [29], and HSC [30].

2) Optical Flow: A video is not merely a sequence of stacked RGB images; it encapsulates the temporal dimension and crucial temporal context. Therefore, extracting temporal context is vital for understanding video content, with motion information playing an irreplaceable role. Optical flow represents the motion information between consecutive video frames and is commonly used as a model input for the VAD task. It typically does not appear alone but is paired with the corresponding RGB image as input in a two-stream network. Therefore, it also encompasses frame [4], [31]–[35], patch [20], [36], [37], and object [26], [29], [35], [38] levels.

3) Skeleton: In recent years, with the remarkable success of deep learning technologies in the field of pose estimation, VAD methods based on skeleton input have emerged. Skeleton input solely focuses on the human body, making it more specialized than object-level RGB input. It demonstrates impressive performance in human-centric VAD, marking it as a significant research interest in recent years within the VAD domain. Morais et al. [39] first endeavored to learn the normal patterns of human movements using dynamic skeletons, where pose estimation is utilized to independently detect skeletons in each video frame. Then, GEPC [40], MTTP [41], NormalGraph [42], HSTGCNN [43], TSIF [44], STGCAE-LSTM [45], STGformer [46], STG-NF [47], MoPRL [48], MoCoDAD [49], and TrajREC [50] specialized in human-related VAD with the skeleton input.

4) Hybrid: Hybrid input from different modalities often proves more advantageous for the VAD task compared to unimodal input due to their complementary nature. In existing deep learning driven VAD methods, hybrid input is a common practice. Typical hybrid inputs include frame-level RGB combined with optical flow [4], patch-level RGB combined with optical flow [20], and object-level RGB combined with optical flow [26]. More recently, several research studies have explored hybrid input based on RGB combined with skeleton [51].

B. Methodology

For semi-supervised VAD, only normal samples are provided during the training phase, rendering conventional supervised classification methods inapplicable. The current approach involves designing a pretext task based on the properties inherent in normal samples themselves to encapsulate a paradigm that encompasses all normal events, referred to as the normal paradigm or normal pattern. Through extensive research on existing works, we categorize three major approaches for learning the normal paradigm: self-supervised learning, one-class learning, and interpretable learning.

1) Self-supervised Learning:

“If intelligence is a cake, the bulk of the cake is self-supervised learning.” —— Yann LeCun

Self-supervised learning primarily leverages auxiliary tasks (pretext tasks) to derive supervisory signals directly from unsupervised data. Essentially, self-supervised learning operates without external labeled data, as these labels are generated from the input data itself. For the semi-supervised VAD task, which lacks explicit supervisory signals, self-supervised learning naturally becomes essential for learning normal representations and constructing the normal paradigm based on these auxiliary tasks. Consequently, self-supervised learning based methods consistently dominate the leading position in the semi-supervised VAD task. Throughout this process, a significant research focus and challenge lie in designing effective pretext tasks derived from the data itself. Here, we compile the common design principles of auxiliary tasks used in existing self-supervised learning based methods.

Reconstruction is currently the most commonly used pretext task for self-supervised learning based methods in the field of semi-supervised VAD [3], [14], [17], [22], [48], [52]–[54]. The main process involves inputting normal data into the network, performing encoding-decoding operations, and generating reconstructed data, encouraging the network to produce reconstructions that closely match the original input data. The objective can be expressed as,

l_rec = L(Φ(θ, x), x)    (2)

For convenience, in the following sections, unless otherwise specified, x represents normal data, which could be a normal video, a normal video frame, a normal feature, or similar. The above objective function measures the reconstruction error, which serves as a criterion for determining whether the test data is anomalous during the test stage. The larger the reconstruction error, the higher the probability that the data is considered anomalous. However, due to the high capacity of deep neural networks, reconstruction based methods cannot guarantee that larger reconstruction errors do necessarily happen for abnormal events.
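To make the reconstruction pretext task concrete, below is a minimal PyTorch-style sketch (our illustration, not the implementation of any surveyed method; the 64x64 single-channel input and layer sizes are arbitrary assumptions). The auto-encoder is trained only on normal frames with the objective in Eq. (2), and at test time the per-frame reconstruction error serves as the anomaly score.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Toy frame auto-encoder for the reconstruction pretext task."""
    def __init__(self):
        super().__init__()
        # Encoder: compress a 1x64x64 frame into a low-dimensional latent map.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> 32x32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 16x16
        )
        # Decoder: restore the latent map back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, frame):
    """Per-frame anomaly score = reconstruction error, cf. Eq. (2)."""
    with torch.no_grad():
        return torch.mean((model(frame) - frame) ** 2).item()

# Training on normal frames only: minimize l_rec = L(Phi(theta, x), x).
model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 1, 64, 64)          # a stand-in batch of normal frames
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), x)
loss.backward(); optimizer.step()
```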

Prediction fully leverages the temporal coherence inherent in videos, and is also a commonly used pretext task. This pretext task is based on the assumption that normal events are predictable, while abnormal ones are unpredictable. Specifically, the prediction pretext task takes historical data as input and, through encoding-decoding operations within the network, outputs the predicted data for the current moment. The network is compelled to make the predicted data similar to the actual current data. We define the optimization objective for prediction as,

l_pre = L(Φ(θ, {I_{t−Δt}, ..., I_{t−1}}), I_t)    (3)

where I_t is the actual data at the current time step t, and I_{t−Δt:t−1} represents the historical data from time step t − Δt to t − 1. FuturePred [4], as a future frame prediction framework, provided a new solution for VAD. Then, many researchers [26], [33], [46], [55]–[61] proposed other prediction based methods. Prediction alleviates, to some extent, the problem in reconstruction based methods where abnormal events can also be well reconstructed.
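The prediction objective in Eq. (3) can be sketched in the same spirit; this is again an illustrative toy rather than FuturePred itself, and stacking the history frames along the channel dimension is a simplifying assumption.

```python
import torch
import torch.nn as nn

def prediction_loss(predictor: nn.Module, clip: torch.Tensor) -> torch.Tensor:
    """Eq. (3): predict frame I_t from the history I_{t-dt}..I_{t-1}."""
    history, target = clip[:, :-1], clip[:, -1]          # split off the last frame
    b, t, c, h, w = history.shape
    pred = predictor(history.reshape(b, t * c, h, w))    # history stacked on channels
    return nn.functional.mse_loss(pred, target)

# A deliberately tiny stand-in predictor: 4 past grayscale frames -> next frame.
predictor = nn.Conv2d(4, 1, kernel_size=3, padding=1)
clip = torch.rand(2, 5, 1, 64, 64)                       # 4 history frames + 1 target
loss = prediction_loss(predictor, clip)
# At test time, a large prediction error flags the frame as anomalous,
# since the model only learned to extrapolate normal dynamics.
```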
Visual cloze test is inspired by the cloze test in natural language processing [62]–[64]. It mainly involves training multiple DNNs to infer deliberately erased data from incomplete video sequences, where the prediction task can be considered a special case of the visual cloze test task, i.e., the erased data happens to be the last frame in the video sequence. We define the objective function for completing erased data at the t-th time stamp as,

l_vct^(t) = L(Φ(θ, {I_1, ..., I_{t−1}, I_{t+1}, ...}), I_t)    (4)

Similar to the prediction task, it also leverages the temporal relationships in the video, but the difference lies in that this task can learn better high-level semantics and temporal context.

Jigsaw puzzles have recently been applied as a pretext task in semi-supervised VAD [65]–[67]. The main process involves creating jigsaw puzzles by performing temporal, spatial, or spatio-temporal shuffling, and then designing networks to predict the relative or absolute permutation in time, space, or both. The optimization function is as follows,

l_jig = Σ_i L(t_i, t̂_i)    (5)

where t_i and t̂_i denote the ground-truth and predicted positions of the i-th data in the original sequence, respectively. Unlike the previous pretext tasks, which involve high-quality image generation, jigsaw puzzles are cast as multi-label classification, enhancing computational efficiency and learning more contextual details.
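As a rough illustration of the temporal jigsaw formulation in Eq. (5), the sketch below (hypothetical: four shuffled frames, one shared classification head over positions) reduces the pretext task to predicting each frame's original index.

```python
import torch
import torch.nn as nn

class TemporalJigsaw(nn.Module):
    """Predict the original position of each frame in a shuffled clip (cf. Eq. (5))."""
    def __init__(self, n_frames=4, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(  # toy per-frame feature extractor
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim),
        )
        # Classify each frame into one of n_frames original positions.
        self.head = nn.Linear(feat_dim, n_frames)

    def forward(self, clip):                           # clip: (B, T, 1, H, W), shuffled
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))      # (B*T, feat_dim)
        return self.head(feats).view(b, t, -1)         # (B, T, T) position logits

model = TemporalJigsaw()
clip = torch.rand(2, 4, 1, 32, 32)                     # assumed already shuffled
perm = torch.stack([torch.randperm(4) for _ in range(2)])  # original index per frame
logits = model(clip)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), perm.flatten())
```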
Contrastive learning is a key approach in self-supervised learning, where the goal is to learn useful representations by distinguishing between similar and dissimilar pairs. For semi-supervised VAD, two samples are regarded as a positive pair if they originate from the same sample, and otherwise as a negative sample pair [68]. We show the contrastive loss below,

l_con = −Σ_i log [ exp(sim(x_i, x_i⁺)/τ) / Σ_k exp(sim(x_i, x_k⁻)/τ) ]    (6)

where x_i and x_i⁺ are the positive pair, x_i and x_k⁻ are the negative pair, and sim(·, ·) is the similarity function (e.g., cosine similarity). Wang et al. [69] introduced a cluster attention contrast framework for VAD, which is built on top of contrastive learning. During the inference stage, the highest similarity between the test sample and its variants is regarded as the regularity score. Lu et al. [70] further proposed a learnable locality-sensitive hashing with a contrastive learning strategy for VAD.
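The loss in Eq. (6) is essentially the InfoNCE objective; a self-contained sketch follows, with cosine similarity and a temperature of 0.1 as assumed defaults (the common practical form also keeps the positive term in the denominator).

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """Contrastive loss in the spirit of Eq. (6), for one batch.

    anchor, positive: (B, D) embeddings of two views of the same sample.
    negatives: (B, K, D) embeddings of K dissimilar samples per anchor.
    """
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)      # (B, 1)
    neg = torch.einsum('bd,bkd->bk', a, F.normalize(negatives, dim=-1))  # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    # The positive pair sits at index 0 of each row of logits.
    return F.cross_entropy(logits, torch.zeros(len(a), dtype=torch.long))

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 16, 128))
```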

Denoising [71], [72] is very similar to reconstruction, with the main difference being that noise η is added to the input data, and the network is encouraged to achieve a denoising effect for the reconstructed data. The benefit is that it can enhance the robustness of the network for VAD. The optimization objective is expressed as,

l_den = L(Φ(θ, x + η), x)    (7)

Deep sparse coding is encouraged by the success of traditional sparse reconstruction based VAD methods [73], whose upgraded versions leverage deep neural networks for semi-supervised VAD. Unlike the aforementioned reconstruction or prediction tasks, sparse coding typically uses extracted high-level representations rather than raw video image data as input. By learning from a large amount of normal representations, a dictionary of normal patterns is constructed. The total objective is listed as,

l_spa = ∥x − Bz∥²₂ + ∥z∥₁    (8)

Different normal events can be reconstructed through the dictionary B multiplied by the sparse coefficient z. For anomalies, it is hard to reconstruct them using a linear combination of elements from the normal dictionary with a sparse coefficient. To overcome the time-consuming inference and low-level hand-crafted features of traditional sparse reconstruction based methods, deep sparse coding based methods have emerged [32], [74]–[76], simultaneously leveraging the powerful representation capabilities of DNNs and sparse representation techniques to improve detection performance and efficiency.
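For intuition, the objective in Eq. (8) can be solved for a test feature by a few iterative soft-thresholding (ISTA) steps over a learned dictionary; the sketch below is generic, with a random dictionary standing in for one learned from normal representations.

```python
import torch

def sparse_code(x, B, n_iters=50, lam=0.1):
    """Approximately solve Eq. (8): min_z ||x - Bz||_2^2 + lam*||z||_1 via ISTA.

    x: (D,) feature vector; B: (D, K) normal-pattern dictionary.
    Returns the sparse coefficients z; the residual ||x - Bz|| then
    serves as the anomaly score (anomalies reconstruct poorly).
    """
    L = torch.linalg.matrix_norm(B, ord=2) ** 2    # Lipschitz constant of the gradient
    z = torch.zeros(B.shape[1])
    for _ in range(n_iters):
        grad = B.T @ (B @ z - x)                   # gradient of the quadratic term
        z = torch.nn.functional.softshrink(z - grad / L, lam / L.item())
    return z

B = torch.randn(64, 256); B /= B.norm(dim=0)       # toy dictionary of normal atoms
x = torch.randn(64)
z = sparse_code(x, B)
score = torch.norm(x - B @ z)                      # reconstruction residual
```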
Patch inpainting involves the process of reconstructing missing or corrupted regions by inferring the missing parts from the available data. This technique mainly leverages the spatial and temporal context to predict the content of the missing regions, ensuring that the reconstructed regions blend seamlessly with the surrounding regions. The optimization objective for patch inpainting can be defined to minimize the difference between the original and the reconstructed patches,

l_pat = L(Φ(θ, x ⊙ M), x ⊙ M̄)    (9)

where M denotes a mask, in which a value of 0 indicates that the position needs to be inpainted, while a value of 1 indicates that the position does not need to be inpainted, and M̄ is the reverse of M. Different from prediction and the visual cloze test, patch inpainting takes into greater account the spatial or spatio-temporal context. Zavrtanik et al. [77] regarded anomaly detection as a reconstruction-by-inpainting task, further randomly removing partial image regions and reconstructing the image from partial inpaintings. Then, Ristea et al. [78], [79] presented a novel self-supervised predictive architectural building block, a plug-and-play design that can easily be incorporated into various anomaly detection methods. More recently, a self-distilled masked auto-encoder [80] was proposed to inpaint original frames.

Multiple tasks can ease the dilemma of the single pretext task, i.e., a single task may not be well aligned with the VAD task, thus leading to sub-optimal performance. Recently, several works attempted to train VAD models jointly on multiple pretext tasks. For example, various studies exploited different self-supervised task compositions, involving reconstruction and prediction [15], [27], [39], [81], prediction and denoising [82], [83], prediction and jigsaw puzzles [84], and prediction and contrastive learning [70]. A few works [66], [67], [85], [86] strove to develop more sophisticated multiple tasks from different perspectives.

2) One-class Learning: One-class learning primarily focuses on samples from the normal class. Compared to self-supervised learning methods, it does not require the strenuous effort of designing feasible pretext tasks. It is generally divided into three categories: one-class classifier, Gaussian classifier, and adversarial classifier from the discriminators of generative adversarial networks (GAN).

One-class classifier basically includes the one-class support vector machine (OC-SVM) [87], support vector data description (SVDD) [88], and other extensions, e.g., the basic/generalized one-class discriminative subspace classifier (BODS, GODS) [89]. Specifically, OC-SVM is modeled as an extension of the SVM objective by learning a max-margin hyperplane that separates the normal from the abnormal in a dataset, which is learned by minimizing the following objective,

min_{w,b,ξ≥0} (1/2)∥w∥²₂ − b + C Σ_i ξ_i,  s.t. wᵀx_i ≥ b − ξ_i, ∀x_i ∈ X_n    (10)

where ξ_i is the non-negative slack, w and b denote the hyperplane, and C is the slack penalty. AMDN [18] is a typical OC-SVM based VAD method, which obtains low-dimensional representations through the auto-encoder, and then uses OC-SVM to classify all normal representations. Another popular variant of one-class classifiers is (Deep) SVDD [90], [91], which, instead of modeling data to belong to an open half-space (as in OC-SVM), assumes the normal samples inhabit a bounded set, and the optimization seeks the centroid c of a hypersphere of minimum radius R > 0 that contains all normal samples. Mathematically, the objective reads,

min_{c,R,ξ≥0} (1/2)R² + C Σ_i ξ_i,  s.t. ∥x_i − c∥²₂ ≤ R² + ξ_i, ∀x_i ∈ X_n    (11)

where, as in OC-SVM, the ξ_i models the slack. Based on this, Wu et al. [20] proposed an end-to-end deep one-class classifier, i.e., DeepOC, for VAD, avoiding the shortcomings of the complicated two-stage training of AMDN.
Gaussian classifier based methods [21], [23], [92] assume that, in practical applications, the data typically follows a Gaussian distribution. By using training samples, they learn the Gaussian distribution (mean µ and covariance Σ) of the normal pattern. During the testing phase, samples that deviate significantly from the mean are considered anomalies. The abnormal score is presented as,

p(x_i) = 1 / ((2π)^{k/2} |Σ|^{1/2}) · exp(−(1/2)(x_i − µ)ᵀ Σ⁻¹ (x_i − µ))    (12)
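In practice, the density in Eq. (12) is usually evaluated in log space via the Mahalanobis distance; a small sketch, assuming features have already been extracted from normal training data:

```python
import torch

def fit_gaussian(feats):
    """Estimate (mu, Sigma^-1) of normal-pattern features; feats: (N, D)."""
    mu = feats.mean(0)
    centered = feats - mu
    sigma = centered.T @ centered / (len(feats) - 1)
    sigma += 1e-5 * torch.eye(feats.shape[1])      # regularize for invertibility
    return mu, torch.linalg.inv(sigma)

def mahalanobis_score(x, mu, sigma_inv):
    """Larger distance from the normal mean => more anomalous (cf. Eq. (12))."""
    d = x - mu
    return d @ sigma_inv @ d

mu, sigma_inv = fit_gaussian(torch.randn(1000, 64))
score = mahalanobis_score(torch.randn(64), mu, sigma_inv)
```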
Adversarial classifier uses adversarial training between the generator G and discriminator D to learn the distribution of normal samples. G is aware of the normal data distribution, as the normal samples are accessible to it; therefore, D explicitly decides whether the output of G follows the normal distribution or not. The adversarial classifier can be jointly learned by optimizing the following objective,

min_G max_D E_{x_i ∼ p_t}[log(D(x_i))] + E_{x̃_i ∼ p_t + N_σ}[log(1 − D(G(x̃_i)))]    (13)

where x_i is drawn from a normal data distribution p_t and x̃_i is the sample x_i with added noise, which is sampled from a normal distribution N_σ. The final abnormal score of an input sample x is given as D(G(x)). For instance, Sabokrou et al. [93]–[95] developed a conventional adversarial network that contains two sub-networks, where the discriminator works as the one-class classifier, while the refiner supports it by enhancing the normal samples and distorting the anomalies. To mitigate the instability caused by adversarial training, Zaheer et al. [96], [97] proposed stabilizing adversarial classifiers by transforming the role of the discriminator to distinguish good and bad quality reconstructions, as well as introducing pseudo anomaly examples.

3) Interpretable Learning: While self-supervised learning and one-class learning based methods perform competitively on popular VAD benchmarks, they are entirely dependent on complex neural networks and mostly trained end-to-end. This limits their interpretability and generalization capacity. Therefore, explainable VAD emerges as a solution, which refers to techniques and methodologies used to identify and explain unusual events in videos. These techniques are designed not only to detect anomalies but also to provide clear explanations for why these anomalies are flagged, which is crucial for trust and transparency in real-world applications. For example, Hinami et al. [24] leveraged a multi-task detector as the generic model to learn generic knowledge about visual concepts, e.g., entity, action, and attribute, to describe the events in a human-understandable form, then designed an environment-specific model as the anomaly detector for abnormal event recounting and detection. Similarly, Reiss et al. [38] extracted explicit attribute-based representations, i.e., velocity and pose, along with implicit semantic representations to make interpretable anomaly decisions. Coincidentally, Doshi and Yilmaz [98] proposed a novel framework which monitors both individuals and the interactions between them, then explores the scene graphs to provide an interpretation for the context of anomalies. Singh et al. [99] started a new line for explainable VAD, a more generic model based on high-level appearance and motion features which can provide human-understandable reasons. Compared to previous methods, this work is independent of detectors and is capable of locating spatial anomalies. More recently, Yang et al. [100] proposed the first rule-based reasoning framework for semi-supervised VAD with large language models (LLMs) due to LLMs' revolutionary reasoning ability. We present some classical explainable VAD methods in Figure 5.

Fig. 5. Flowchart of four classical explainable VAD methods.

Fig. 6. Illustrations of frame-level (Top) and pixel-level (Bottom) output.
C. Network Architecture

1) Auto-encoder: An auto-encoder consists of two vital structures, namely, the encoder and the decoder, in which the encoder compresses the input sample into a latent space representation, significantly reducing the dimensionality of the sample, and the decoder restores the input sample from the latent space representation, increasing the dimensionality of the sample back to the original input size. Due to its inherent image restoration and high-level representation extraction capabilities, the auto-encoder is widely used in image restoration based pretext tasks for self-supervised learning methods, such as reconstruction [3], prediction [15], and inpainting [80]. Moreover, the auto-encoder is also used to extract features for one-class learning based methods [20], where the extracted features are used for optimizing subsequent one-class classifiers. Auto-encoder structures are highly flexible and can be based on various different base networks, such as 2D CNN [3], 3D CNN [15], recurrent neural network (RNN) [76], gated recurrent unit (GRU) [39], long short-term memory (LSTM) [14], [101], graph convolutional network (GCN) [42], [45] and Transformer [63], [80].

2) GAN: GAN has been widely adopted in various applications, including VAD, due to its powerful generative capabilities. The key idea is to use the generator and discriminator to identify abnormal samples that deviate from the learned normal distribution. Specifically, GAN, like the auto-encoder, is mainly applied to image restoration based pretext tasks [4], [102]–[105], where the generator creates restored images, and the discriminator is discarded once the training is finished. On the contrary, several one-class learning based methods [93], [96] leverage the discriminator to assess the likelihood that a new sample is real (normal) or generated (anomalous). Low likelihood scores indicate anomalies, thereby achieving end-to-end one-class classifiers.

3) Diffusion: Diffusion models have achieved state-of-the-art performance across a wide range of generative tasks and have become a promising research topic. Unlike traditional generative models like GAN or the auto-encoder, diffusion models generate samples by reversing a diffusion process that gradually destroys sample structure. This reverse process reconstructs the sample from noise in a step-by-step manner, leading to high-quality results. Therefore, diffusion models also appear in the image restoration based pretext tasks. Yan et al. [7] and Flaborea et al. [49] introduced novel diffusion-based methods to predict the features using the RGB frame and skeleton respectively for VAD.

D. Model Refinement

1) Pseudo Anomalies: In semi-supervised learning, there is usually a scarcity of real anomalous samples. To compensate for this lack, several research studies opt to generate pseudo-anomalies. Current approaches include: perturbing normal samples, i.e., applying random perturbations to normal video samples, such as adding noise, shuffling frame sequences, or adding extra patches [106]–[109]; leveraging generative models, i.e., using GAN or diffusion models to generate samples similar to normal ones but with anomalous features [97], [110]; and simulating specific anomalous behaviors, i.e., manually introducing extra anomalous samples at the image level or feature level [111], [112]. As a result, training with pseudo-anomalous samples allows the detection model to learn a variety of anomalous patterns and a broader range of anomalous features, improving its robustness and generalization abilities in real-world applications.

2) Memory Bank: Memory banks [113]–[118] are used to store feature representations of normal video samples, which serve as a reference baseline and can be dynamically updated to adapt to new normal patterns, thereby enabling the model to better capture normal patterns while improving its ability to adapt to changing environments. In specific implementations, memory banks can be combined with different network architectures, such as auto-encoder based reconstruction (or prediction) [119]–[122] and contrastive learning [123].
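A generic sketch of the memory read operation follows (the dimensions and plain softmax addressing are illustrative assumptions; MemAE-style designs additionally sparsify the addressing weights): the encoding is re-expressed as a combination of stored normal prototypes, so anomalous inputs reconstruct poorly.

```python
import torch
import torch.nn as nn

class MemoryModule(nn.Module):
    """Re-express an encoding as a weighted sum of stored normal prototypes."""
    def __init__(self, n_items=100, dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_items, dim))  # learned normal patterns

    def forward(self, z):                                # z: (B, dim) encoder output
        attn = torch.softmax(z @ self.memory.T, dim=1)   # addressing weights (B, n_items)
        return attn @ self.memory                        # memory-filtered encoding (B, dim)

# Placed between encoder and decoder, the decoder only ever sees combinations
# of normal prototypes, which enlarges the reconstruction error on anomalies.
mem = MemoryModule()
z_hat = mem(torch.randn(8, 256))
```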
E. Model Output

1) Frame-level: In frame-level output, each frame of the video is classified as normal or abnormal. This output format provides an overall view of which frames in the video contain anomalies. Such output is simple and straightforward, easy to implement and understand, and particularly effective for detecting anomalies over a broad time range.

2) Pixel-level: In pixel-level output, not only is it identified which frames contain anomalies, but also which specific pixel regions within those frames are abnormal [126]. This output format provides more granular information about the anomalies. Pixel-level output offers precise locations and extents of anomalies, providing more detailed information for further analysis of the nature and cause of the anomalies. We illustrate the different model outputs in Figure 6.

Fig. 7. The chronology of representative semi-supervised VAD research.

TABLE II
QUANTITATIVE PERFORMANCE COMPARISON OF SEMI-SUPERVISED METHODS ON PUBLIC DATASETS (AUC, %).

Method | Publication | Methodology | Ped1 | Ped2 | Avenue | ShanghaiTech | UBnormal
AMDN [18] | BMVC 2015 | One-class classifier | 92.1 | 90.8 | - | - | -
ConvAE [3] | CVPR 2016 | Reconstruction | 81.0 | 90.0 | 72.0 | - | -
STAE [15] | ACMMM 2017 | Hybrid | 92.3 | 91.2 | 80.9 | - | -
StackRNN [74] | ICCV 2017 | Sparse coding | - | 92.2 | 81.7 | 68.0 | -
FuturePred [4] | CVPR 2018 | Prediction | 83.1 | 95.4 | 85.1 | 72.8 | -
DeepOC [20] | TNNLS 2019 | One-class classifier | 83.5 | 96.9 | 86.6 | - | -
MemAE [119] | ICCV 2019 | Reconstruction | - | 94.1 | 83.3 | 71.2 | -
AnoPCN [81] | ACMMM 2019 | Prediction | - | 96.8 | 86.2 | 73.6 | -
ObjectAE [25] | CVPR 2019 | One-class classifier | - | 97.8 | 90.4 | 84.9 | -
BMAN [124] | TIP 2019 | Prediction | - | 96.6 | 90.0 | 76.2 | -
sRNN-AE [76] | TPAMI 2019 | Sparse coding | - | 92.2 | 83.5 | 69.6 | -
ClusterAE [52] | ECCV 2020 | Reconstruction | - | 96.5 | 86.0 | 73.3 | -
MNAD [120] | CVPR 2020 | Reconstruction | - | 97.0 | 88.5 | 70.5 | -
VEC [62] | ACMMM 2020 | Cloze test | - | 97.3 | 90.2 | 74.8 | -
AMMC-Net [33] | AAAI 2021 | Prediction | - | 96.6 | 86.6 | 73.7 | -
MPN [121] | CVPR 2021 | Prediction | 85.1 | 96.9 | 89.5 | 73.8 | -
HF2-VAD [26] | ICCV 2021 | Hybrid | - | 99.3 | 91.1 | 76.2 | -
BAF [111] | TPAMI 2021 | One-class classifier | - | 98.7 | 92.3 | 82.7 | 59.3
Multitask [85] | CVPR 2021 | Multiple tasks | - | 99.8 | 92.8 | 90.2 | -
F2PN [31] | TPAMI 2022 | Prediction | 84.3 | 96.2 | 85.7 | 73.0 | -
DLAN-AC [122] | ECCV 2022 | Reconstruction | - | 97.6 | 89.9 | 74.7 | -
BDPN [28] | AAAI 2022 | Prediction | - | 98.3 | 90.3 | 78.1 | -
CAFE [114] | ACMMM 2022 | Prediction | - | 98.4 | 92.6 | 77.0 | -
STJP [65] | ECCV 2022 | Jigsaw puzzle | - | 99.0 | 92.2 | 84.3 | 56.4
MPT [66] | ICCV 2023 | Multiple tasks | - | 97.6 | 90.9 | 78.8 | -
HSC [30] | CVPR 2023 | Hybrid | - | 98.1 | 93.7 | 83.4 | -
LERF [117] | AAAI 2023 | Prediction | - | 99.4 | 91.5 | 78.6 | -
DMAD [115] | CVPR 2023 | Reconstruction | - | 99.7 | 92.8 | 78.8 | -
EVAL [99] | CVPR 2023 | Interpretable learning | - | - | 86.0 | 76.6 | -
FBSC-AE [125] | CVPR 2023 | Prediction | - | - | 86.8 | 79.2 | -
FPDM [7] | ICCV 2023 | Prediction | - | - | 90.1 | 78.6 | 62.7
PFMF [112] | CVPR 2023 | Multiple tasks | - | - | 93.6 | 85.0 | -
STG-NF [47] | ICCV 2023 | Gaussian classifier | - | - | - | 85.9 | 71.8
AED-MAE [80] | CVPR 2024 | Patch inpainting | - | 95.4 | 91.3 | 79.1 | 58.5
SSMCTB [79] | TPAMI 2024 | Patch inpainting | - | - | 91.6 | 83.7 | -

F. Performance Comparison

Figure 7 presents a concise chronology of semi-supervised VAD methods. Besides, Table II provides a performance summary observed in representative semi-supervised VAD methods.

IV. WEAKLY SUPERVISED VIDEO ANOMALY DETECTION

Weakly supervised VAD is currently a highly regarded research direction in the VAD field, with its origins traceable to DeepMIL [5]. Compared to semi-supervised VAD, it is a newer research direction, and therefore existing reviews lack a comprehensive and in-depth introduction. As shown in Table I, both Chandrakala et al. [12] and Liu et al. [13] mention the weakly supervised VAD task. However, the former only briefly describes several achievements from 2018 to 2020, while the latter, although encompassing recent works, lacks a scientific taxonomy and simply categorizes them into single-modal and multi-modal based on the different modalities. Given this context, we survey related works from 2018 to the present, including the latest methods based on pre-trained large models, and we classify existing works from four aspects: model input, methodology, refinement strategy, and model output. The taxonomy of weakly supervised VAD is illustrated in Figure 8.

Compared to semi-supervised VAD, weakly supervised VAD explicitly defines anomalies during the training process,

Fig. 8. The taxonomy of weakly supervised VAD. We provide a hierarchical taxonomy that organizes existing deep weakly supervised VAD models by model input, methodology, refinement strategy, and model output into a systematic framework.
giving the detection algorithm a clear direction. However, in contrast to fully supervised VAD, the coarse weak supervision signals introduce uncertainty into the detection process. Most existing methods utilize the MIL mechanism to optimize the model. This process can be viewed as selecting the hardest regions (video clips) that appear most abnormal from normal bags (normal videos) and the regions most likely to be abnormal from abnormal bags (abnormal videos). Then, the goal is to maximize the predicted confidence difference between them (with the confidence for the hardest normal regions approaching 0 and the confidence for the most abnormal regions approaching 1), which can be regarded as a binary classification optimization. By gradually mining all normal and abnormal regions based on their different characteristics, the anomaly confidence of abnormal regions increases while that of normal regions decreases. Unfortunately, due to the lack of strong supervision signals, the detection model inevitably involves blind guessing in the above optimization process.

A. Model Input

Unlike semi-supervised VAD, the network input for weakly supervised VAD is not the raw video, such as RGB, optical flow, or skeleton. Instead, it is features extracted by pre-trained models. This approach alleviates the problems posed by the large scale of existing weakly supervised VAD datasets, the diverse and complex scenes, and the weak supervision signals. Using pre-trained features as input allows the effective utilization of off-the-shelf models' learned knowledge of appearance and motion, significantly reduces the complexity of the detection model, and enables efficient training.

1) RGB: RGB is the most common model input. The general approach divides a long video into multiple segments and uses pre-trained visual models to extract global features from each segment. As deep models continue to evolve and improve, the visual models used have also been upgraded, progressing from the initial C3D [5], [127], [128] to I3D [129]–[133], 3D-ResNet [134], [135], TSN [136]–[138], and more recently, to the popular Swin Transformer [139], [140] and CLIP [141], [142]. This continuous upgrade in visual models has led to a gradual improvement in detection performance.

2) Optical Flow: Similar to RGB, the same approach is applied to optical flow input to obtain corresponding global features. However, due to the time-consuming nature of optical flow extraction, it is less commonly used in existing methods. Common pre-trained models for optical flow include I3D [143] and TSN [137].

3) Audio: For multimodal datasets (e.g., XD-Violence) containing audio signals, audio also holds significant perceptual information. Unlike RGB images, audio is one-dimensional and is typically processed as follows: the audio is resampled, spectrograms are computed and converted to log mel spectrograms, and these features are framed into non-overlapping examples. Finally, these examples are fed into a pre-trained audio model, such as VGGish [144], to extract features [145], [146].

4) Text: More recently, several researchers [147]–[150] attempt to incorporate text descriptions related to videos to aid in VAD. These texts may be manually annotated or generated by large models. The text data is typically converted into features using a text encoder and then fed into the subsequent detection network.

5) Hybrid: Common hybrid inputs include RGB combined with optical flow [143], RGB combined with audio [151]–[153], RGB combined with optical flow and audio [154], and, more recently, RGB combined with text [155].

B. Methodology

Under weak supervision, traditional fully supervised methods are no longer adequate. To address this issue, we identify two different approaches: one-stage MIL and two-stage self-learning.

1) One-stage MIL: The one-stage MIL [5], [156]–[158] is the most commonly used approach for weakly supervised VAD. The basic idea is to first divide long videos into multiple segments, and then use a MIL mechanism to select the most representative samples from these segments. This includes selecting hard examples from normal videos that look most like anomalies and the most likely abnormal examples from the abnormal videos. The model is then optimized by lowering the anomaly confidence of hard examples and increasing the confidence of the most likely abnormal examples. Ultimately, the confidence of the model in predicting normal samples gradually decreases, while its confidence in predicting abnormal samples gradually increases, thereby achieving anomaly detection. The advantage of this method lies in its simplicity and ease of implementation. The MIL objective is showcased as,

l_mil = max(0, 1 − max_i Φ(θ, x_i^a) + max_i Φ(θ, x_i^n))    (14)

where x^a and x^n denote an abnormal video and a normal video, respectively.
ResNet [134], [135], TSN [136]–[138], and more recently, to Additionally, TopK [130] extends MIL by selecting the
the popular Swin Transformer [139], [140] and CLIP [141], top K segments with the highest prediction scores from
[142]. This continuous upgrade in visual models has led to a each video, rather than just the highest-scoring segment, for
gradual improvement in detection performance. training. Therefore, MIL can be seen as a special case of TopK.
For these TopK segments, their average prediction score is computed as the predicted probability $\hat{y}$,

$$\hat{y} = \sigma\left(\frac{1}{K}\sum_{i \in \mathrm{topk}} \Phi(\theta, x_i)\right) \qquad (15)$$

where $\sigma$ is the sigmoid activation function. The cross-entropy loss between $\hat{y}$ and the label $y$ is used to optimize the model,

$$l_{topk} = -\left(y\log(\hat{y}) + (1-y)\log(1-\hat{y})\right) \qquad (16)$$

The one-stage MIL mechanism leads to models that tend to focus only on the most significant anomalies while ignoring less obvious ones.
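To make Eqs. (14)–(16) concrete, the following PyTorch sketch implements both objectives on per-segment scores produced by a scorer Φ(θ, x). The hinge margin of 1 follows Eq. (14); the scorer itself, the batch handling, and the choice of K are simplified assumptions rather than any specific method's implementation.

    import torch
    import torch.nn.functional as F

    def mil_ranking_loss(scores_abn, scores_nor):
        # Eq. (14): hinge on the hardest (highest-scoring) segment of
        # the abnormal bag versus that of the normal bag.
        return F.relu(1.0 - scores_abn.max() + scores_nor.max())

    def topk_bce_loss(scores, video_label, k=3):
        # Eqs. (15)-(16): average the top-K segment scores, squash with
        # a sigmoid, and apply binary cross-entropy to the video label.
        topk = torch.topk(scores, k=min(k, scores.numel())).values
        y_hat = torch.sigmoid(topk.mean())
        return F.binary_cross_entropy(y_hat, torch.tensor(float(video_label)))

    # Toy usage: per-segment scores for one abnormal and one normal video.
    scores_abn = torch.randn(32, requires_grad=True)
    scores_nor = torch.randn(32, requires_grad=True)
    loss = mil_ranking_loss(scores_abn, scores_nor) \
           + topk_bce_loss(scores_abn, 1) + topk_bce_loss(scores_nor, 0)
    loss.backward()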
2) Two-stage Self-training: In contrast, the improved two-stage self-training is more complex but also more effective. This method employs a two-stage training process. First, a preliminary model is pre-trained using the one-stage MIL. During this phase, the model learns the basic principles of VAD. Then, using the pre-trained model as an initial parameter, a self-training mechanism is introduced to adaptively train the model further, enhancing its ability to recognize anomalies. Specifically, during the self-training phase, the model's predictions from the pre-training stage are used to automatically select high-confidence abnormal regions. These regions are then treated as pseudo-labeled data to retrain the model, thereby improving its ability to identify anomalies. This two-stage training approach effectively enhances the model's performance in weakly supervised VAD, further improving its generalization ability and robustness. NoiseCleaner [137], MIST [159], MSL [140], CUPL [160], and TPWNG [161] are typical two-stage self-training works.

The two-stage self-training method based on improved MIL excels in weakly supervised VAD, but it also comes with some drawbacks. High computational complexity: the two-stage training process requires more computational resources and time, since both the pre-training and self-training phases involve multiple iterations of training. Dependence on initial model quality: the self-training stage relies on the initial model generated during pre-training; if the quality of the initial model is poor, erroneous predictions may be treated as pseudo-labels, affecting subsequent training effectiveness.
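A compressed sketch of the second stage is given below: the stage-1 (MIL pre-trained) model mines confident snippets of a video as pseudo-labels and is then retrained on them. The confidence threshold and the plain binary cross-entropy retraining step are illustrative placeholders, not the specific designs of MIST, MSL, or the other works above.

    import torch
    import torch.nn.functional as F

    def self_training_step(model, feats, optimizer, tau=0.8):
        # Mine pseudo-labels with the pre-trained model.
        with torch.no_grad():
            probs = torch.sigmoid(model(feats).squeeze(-1))  # per snippet
        pseudo = (probs >= tau).float()               # 1 = likely abnormal
        keep = (probs >= tau) | (probs <= 1 - tau)    # drop uncertain ones
        if not keep.any():
            return None                               # nothing confident yet
        # Retrain the model on the confident snippets only.
        optimizer.zero_grad()
        logits = model(feats).squeeze(-1)
        loss = F.binary_cross_entropy_with_logits(logits[keep], pseudo[keep])
        loss.backward()
        optimizer.step()
        return loss.item()

    model = torch.nn.Linear(1024, 1)   # stands in for the pre-trained scorer
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    feats = torch.randn(32, 1024)      # 32 snippet features of one video
    self_training_step(model, feats, opt)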
branch to alleviate the semantic gap, which is mainly applied
for modality-missing [182] or modality-enhancing [153] sce-
C. Refinement Strategy narios.
Refinement strategies primarily focus on input features, 6) Leveraging Large Models: Large models have begun
method design, and other aspects to compensate for the to show tremendous potential and flexibility in the field of
shortcomings of weak supervision signals. We compile several VAD. They not only enhance detection capabilities through
commonly used refinement strategies and provide a detailed vision-language features, e.g., CLIP-TSA [185] and cross-
introduction in this section. modal semantic alignment, e.g., VadCLIP [183], but also lever-
1) Temporal Modeling: Temporal modeling is essential age large language models to generate explanatory texts that
for capturing the critical context information in videos. Un- improve detection accuracy, e.g., TEVAD [148], UCA [155],
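As a small illustration of local temporal modeling, the sketch below stacks dilated 1-D convolutions with residual connections over a sequence of snippet features. The feature dimension and dilation rates are arbitrary choices for this sketch, not those of any method cited above.

    import torch
    import torch.nn as nn

    class DilatedTemporalBlock(nn.Module):
        # Enlarges the temporal receptive field over snippet features.
        def __init__(self, dim=1024, dilations=(1, 2, 4)):
            super().__init__()
            self.convs = nn.ModuleList([
                nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
                for d in dilations])
            self.act = nn.ReLU()

        def forward(self, x):               # x: (batch, T, dim)
            h = x.transpose(1, 2)           # -> (batch, dim, T)
            for conv in self.convs:
                h = self.act(conv(h)) + h   # residual keeps local detail
            return h.transpose(1, 2)        # -> (batch, T, dim)

    feats = torch.randn(2, 32, 1024)        # 2 videos, 32 snippets each
    context_feats = DilatedTemporalBlock()(feats)  # same shape, enriched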
2) Spatio-temporal Modeling: Further, spatio-temporal modeling can simultaneously capture spatial relationships, highlighting anomalous spatial locations and effectively reducing noise from irrelevant backgrounds. This can be achieved by segmenting video frames into multiple patches or using existing object detectors to capture foreground objects. Then, methods like self-attention [34], [135], [171], [172] are used to learn the relationships between these patches or objects. Compared to temporal modeling, spatio-temporal modeling involves a higher computational load due to the increased number of entities being analyzed.

3) MIL-based Refinement: The traditional MIL mechanism focuses only on the segments with the highest anomaly scores, which leads to a series of issues, such as ignoring event continuity, the fixed K-value not adapting to different video scenarios, and bias towards abnormal snippets with simple contexts. Several advanced strategies [173], [174] aim to address these limitations. By incorporating unbiased MIL [175], prior information from text [149], [176], magnitude-level MIL [177], continuity-aware refinement [178], and adaptive K-values [179], the detection performance can be significantly improved.

4) Feature Metric Learning: While MIL-based classification ensures the interclass separability of features, this separability at the video level alone is insufficient for accurate anomaly detection. In contrast, enhancing the discriminative power of features through clustering similar features and isolating different ones should complement and even augment the separability achieved by MIL-based classification. Specifically, the basic principle of feature metric learning is to make similar features compact and different features distant in the feature space to improve discrimination. Several works [132], [147], [149], [162], [168], [180], [181] exploited feature metric learning to enhance the feature discrimination.
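This compact-within-class, distant-between-class principle can be instantiated with a standard triplet margin loss, as in the hedged sketch below; the margin, feature dimension, and the way triplets are formed vary across the cited works and are arbitrary here.

    import torch
    import torch.nn as nn

    triplet = nn.TripletMarginLoss(margin=1.0)

    # anchor/positive: features of two normal snippets;
    # negative: features of (likely) abnormal snippets.
    anchor = torch.randn(16, 512)
    positive = torch.randn(16, 512)
    negative = torch.randn(16, 512)

    # Pulls same-class features together while pushing abnormal features
    # at least `margin` farther from the anchor than the positives.
    loss = triplet(anchor, positive, negative)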
5) Knowledge Distillation: Knowledge distillation aims to transfer knowledge from the enriched branch to the barren branch to alleviate the semantic gap, and is mainly applied in modality-missing [182] or modality-enhancing [153] scenarios.

6) Leveraging Large Models: Large models have begun to show tremendous potential and flexibility in the field of VAD. They not only enhance detection capabilities through vision-language features, e.g., CLIP-TSA [185], and cross-modal semantic alignment, e.g., VadCLIP [183], but also leverage large language models to generate explanatory texts that improve detection accuracy, e.g., TEVAD [148], UCA [155], and VAD-Instruct50k [186]. Furthermore, they can directly use the prior knowledge of large language models for training-free VAD [187], [188], demonstrating advantages in rapid deployment and cost reduction. In addition, the superior zero-shot capabilities of these large models may be leveraged for anomaly detection in various other ways, such as AD-oriented prompts [189], [190] or generic residual learning [191]. These methods collectively advance the development of VAD technology, providing new avenues and tools for achieving more efficient and interpretable VAD.

D. Model Output

1) Frame-level: Similar to semi-supervised VAD, the output in weakly supervised VAD is typically frame-level prediction results, indicating the probability of each frame being anomalous. This type of output is intuitive and straightforward, which is why it is commonly adopted.

2) Pixel-level: Although frame-level output is intuitive, it lacks interpretability. Therefore, some works have begun to focus on achieving pixel-level detection. For instance, Liu et al. [192] used spatial-level strong supervision signals to achieve spatial localization. Wu et al. [193] took a different approach by not relying on labor-intensive annotations. Instead, they leveraged algorithms such as object detection and tracking, drawing inspiration from video spatio-temporal localization algorithms, and achieved anomaly spatio-temporal localization through spatio-temporal object tube analysis.

E. Performance Comparison

As depicted in Figure 9, several significant research works have emerged in the field's chronology. Moreover, we present an elaborate performance comparison of existing research, as detailed in Table III.

Fig. 9. The chronology of representative weakly supervised VAD research.

TABLE III: QUANTITATIVE PERFORMANCE COMPARISON OF WEAKLY SUPERVISED METHODS ON PUBLIC DATASETS.

Method | Publication | Feature | UCF-Crime AUC (%) | XD-Violence AP (%) | ShanghaiTech AUC (%) | TAD AUC (%)
DeepMIL [5] | CVPR 2018 | C3D (RGB) | 75.40 | - | - | -
GCN [137] | CVPR 2019 | TSN (RGB) | 82.12 | - | 84.44 | -
HLNet [130] | ECCV 2020 | I3D (RGB) | 82.44 | 75.41 | - | -
CLAWS [128] | ECCV 2020 | C3D (RGB) | 83.03 | - | 89.67 | -
MIST [159] | CVPR 2021 | I3D (RGB) | 82.30 | - | 94.83 | -
RTFM [163] | ICCV 2021 | I3D (RGB) | 84.30 | 77.81 | 97.21 | -
CTR [162] | TIP 2021 | I3D (RGB) | 84.89 | 75.90 | 97.48 | -
MSL [140] | AAAI 2022 | VideoSwin (RGB) | 85.62 | 78.59 | 97.32 | -
S3R [131] | ECCV 2022 | I3D (RGB) | 85.99 | 80.26 | 97.48 | -
SSRL [171] | ECCV 2022 | I3D (RGB) | 87.43 | - | 97.98 | -
CMRL [166] | CVPR 2023 | I3D (RGB) | 86.10 | 81.30 | 97.60 | -
CUPL [160] | CVPR 2023 | I3D (RGB) | 86.22 | 81.43 | - | 91.66
MGFN [177] | AAAI 2023 | VideoSwin (RGB) | 86.67 | 80.11 | - | -
UMIL [175] | CVPR 2023 | CLIP | 86.75 | - | - | 92.93
DMU [170] | AAAI 2023 | I3D (RGB) | 86.97 | 81.66 | - | -
PE-MIL [176] | CVPR 2024 | I3D (RGB) | 86.83 | 88.05 | 98.35 | -
TPWNG [161] | CVPR 2024 | CLIP | 87.79 | 83.68 | - | -
VadCLIP [183] | AAAI 2024 | CLIP | 88.02 | 84.51 | - | -
STPrompt [184] | ACM MM 2024 | CLIP | 88.08 | - | 97.81 | -
V. FULLY SUPERVISED VIDEO ANOMALY DETECTION

Fully supervised VAD refers to the task of detecting video anomalies under the condition that the dataset has detailed frame-level or video-level annotations. Here, we consider video violence detection as a fully supervised VAD task.
TABLE IV: QUANTITATIVE PERFORMANCE COMPARISON OF FULLY SUPERVISED METHODS ON PUBLIC DATASETS.

Method | Publication | Model Input | Hockey Fights Accuracy (%) | Violent-Flows Accuracy (%) | RWF-2000 Accuracy (%) | Crowd Violence Accuracy (%)
TS-LSTM [194] | PR 2016 | RGB+Flow | 93.9 | - | - | -
FightNet [195] | JPCS 2017 | RGB+Flow | 97.0 | - | - | -
ConvLSTM [199] | AVSS 2017 | Frame Difference | 97.1 | 94.6 | - | -
BiConvLSTM [200] | ECCVW 2018 | Frame Difference | 98.1 | 96.3 | - | -
SPIL [201] | ECCV 2020 | Skeleton | 96.8 | - | 89.3 | 94.5
FlowGatedNet [203] | ICPR 2020 | RGB+Flow | 98.0 | - | 87.3 | 88.9
X3D [206] | AVSS 2022 | RGB | - | 98.0 | 94.0 | -
HSCD [205] | CVIU 2023 | Skeleton+Frame Difference | 94.5 | - | 90.3 | 94.3

TABLE V: QUANTITATIVE PERFORMANCE COMPARISON OF UNSUPERVISED METHODS ON PUBLIC DATASETS.

Method | Publication | Methodology | Avenue AUC (%) | Subway Exit AUC (%) | Ped1 AUC (%) | Ped2 AUC (%) | ShanghaiTech AUC (%) | UMN AUC (%)
ADF [207] | ECCV 2016 | Change detection | 78.3 | 82.4 | - | - | - | 91.0
Unmasking [208] | ICCV 2017 | Change detection | 80.6 | 86.3 | 68.4 | 82.2 | - | 95.1
MC2ST [209] | BMVC 2018 | Change detection | 84.4 | 93.1 | 71.8 | 87.5 | - | -
DAW [210] | ACM MM 2018 | Pseudo label | 85.3 | 84.5 | 77.8 | 96.4 | - | -
STDOR [211] | CVPR 2020 | Pseudo label | - | 92.7 | 71.7 | 83.2 | - | 97.4
TMAE [212] | ICME 2022 | Change detection | 89.8 | - | 75.7 | 94.1 | 71.4 | -
CIL [213] | AAAI 2022 | Others | 90.3 | 97.6 | 84.9 | 99.4 | - | 100
LBR-SPR [214] | CVPR 2022 | Others | 92.8 | - | 81.1 | 97.2 | 72.6 | -
A. Approach Categorization

Video violence detection typically takes the appearance, motion, skeleton, audio, or a combination of these as the input. It can be categorized based on the type of input into the following types:

Appearance input mainly consists of raw RGB images, directly showcasing the visual effect of video frames. This helps the model better understand anomalies that can be directly detected from a visual perspective. Many methods [194]–[197] used RGB features extracted from raw images using pre-trained models as the model input.

Motion input mainly includes optical flow, optical flow acceleration, and frame differences. These inputs directly showcase the motion state of objects, helping to identify anomalies from the motion perspective that might be difficult to detect visually. Dong et al. [194] and Bruno Peixoto et al. [198] used optical flow and optical flow acceleration as input, while Sudhakaran et al. [199] and Hanson et al. [200] employed frame differences as model input.

Skeleton input can intuitively display the pose state of humans, allowing the model to exclude background interference and focus on human actions. This enables more intuitive and vivid recognition of violent behavior. Su et al. [201] and Singh et al. [202] conducted violence detection by studying the interaction relationships between skeletal points.

Audio input can provide additional information to aid in identifying violent events [198]. This is because certain violent incidents inevitably involve changes in sound; such variations help detect violent events, especially when RGB images are ineffective due to issues like occlusion.

Hybrid input combines the strengths of different modalities to better detect violent events. Cheng et al. [203] utilized RGB images and optical flow as input, while Shang et al. [204] combined RGB images with audio as input. Garcia et al. [205] fed skeleton and frame differences into detection models.

B. Performance Comparison

We present the performance comparison of existing fully supervised VAD research in Table IV.

VI. UNSUPERVISED VIDEO ANOMALY DETECTION

Despite the great popularity of supervised VAD, supervised methods still have shortcomings in practical applications. On the one hand, we cannot clearly define what constitutes normal behavior of real-life human activities in many cases, e.g., running in a sports ground is normal but running in a library is forbidden. On the other hand, it is impractical to know every possible normal event in advance, especially for scientific research. Therefore, VAD in unsupervised environments is of significant research value.

A. Approach Categorization

Through an in-depth investigation, we roughly classify the current unsupervised VAD methods into 3 categories: pseudo label, change detection, and others.

Pseudo label based paradigm is described as follows. Wang et al. [210] proposed a two-stage training approach where an auto-encoder is first trained with an adaptive reconstruction loss threshold to estimate normal events from unlabeled videos. These estimated normal events are then used as pseudo-labels to train an OC-SVM, refining the normality model to exclude anomalies and improve detection performance. Pang et al. [211] introduced a self-training deep ordinal regression method, starting with initial detection using classical one-class algorithms to generate pseudo-labels for anomalous and normal frames. An end-to-end anomaly score learner is then trained iteratively using a self-training strategy that optimizes the detector with newly generated pseudo labels. Zaheer et al. [215] proposed an unsupervised generative cooperative learning approach, leveraging the low-frequency nature of anomalies for cross-supervision between a generator and a discriminator, each learning from the pseudo-labels of the other. Al-lahham et al. [216] presented a coarse-to-fine pseudo-label generation framework using hierarchical divisive clustering for coarse pseudo-labels at the video level and statistical hypothesis testing for fine pseudo-labels at the segment level, training the anomaly detector with the obtained pseudo-labels.
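The sketch below condenses the spirit of the auto-encoder-plus-OC-SVM pipeline of Wang et al. [210]: frames whose reconstruction error falls below a threshold are treated as pseudo-normal and used to fit a one-class SVM. The median threshold rule and the feature choice here are simplifications for illustration, not the adaptive scheme of the original work.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def pseudo_label_ocsvm(recon_errors, features, quantile=0.5):
        # Frames with low reconstruction error become pseudo-normal events.
        thresh = np.quantile(recon_errors, quantile)
        pseudo_normal = features[recon_errors <= thresh]
        # Refine the normality model with a one-class SVM.
        ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(pseudo_normal)
        # Higher decision values mean more normal; negate for anomaly score.
        return -ocsvm.decision_function(features)

    errors = np.random.rand(1000)          # per-frame reconstruction errors
    feats = np.random.randn(1000, 128)     # per-frame latent features
    anomaly_scores = pseudo_label_ocsvm(errors, feats)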
Change detection based paradigm can be summarized as follows. Del Giorno et al. [207] performed change detection in video frames using a simple logistic regression classifier to measure deviations between the data, making comparisons time-independent by randomly ordering frames. Ionescu et al. [208] proposed a change detection framework based on the unmasking technique, determining abnormal events by observing changes in classifier accuracy between consecutive events. Liu et al. [209] linked the heuristic unmasking procedure to multiple classifier two-sample tests in statistical machine learning, aiming to improve the unmasking method. Hu et al. [212] introduced a method based on masked auto-encoders [217], where the rare and unusual nature of anomalies leads to poor prediction of changing events, allowing for anomaly detection and scoring in unlabeled videos.
to poor prediction of changing events, allowing for anomaly comparing various VAD methods under open-set conditions.
detection and scoring in unlabeled videos. Zhu et al. [221] broke through the closed-set detection limi-
Other paradigm includes the below methods. Li et al. [218] tations and developed novel techniques that can generalize to
proposed a clustering technique, where an auto-encoder is previously unseen anomalies and effectively distinguish them
trained on a subset of normal data and iterates between from normal events. Specially, a normalizing flow model is
hypothetical normal candidates based on clustering and rep- introduced to create pseudo anomaly features. Recently, Wu
resentation learning. The reconstruction error is used as a et al. [142] extended open-set VAD to the more challenging
scoring function to assess normality. Lin et al. [213] intro- open-vocabulary VAD, aiming to both detect and recognize
duced a causal inference framework to reduce the effect of anomaly categories. Centered around vision-language models,
noisy pseudo-labeling, combining long-term temporal context this task is achieved by matching videos with correspond-
with local image context for anomaly detection. Yu et al. ing textual labels. Additionally, large generative models and
[214] highlighted the effectiveness of deep reconstruction for language models are utilized to generate pseudo-anomalous
unsupervised VAD, revealing a normality advantage where samples. There are other methods for the open-set setting,
normal events have lower reconstruction loss. They integrated such as [222], [223], but they are focused on image-level
a novel self-paced refinement scheme into localization-based anomaly detection. Through various innovative approaches,
reconstruction for unsupervised VAD. such as margin learning, the development of benchmarks,
generalization strategies, and the integration of language-
B. Performance Comparison
vision models, researchers are pushing the boundaries of what
we present the performance comparison of existing unsu- is possible in VAD. These advancements are paving the way
pervised VAD research in Table V. for more robust, flexible, and practical VAD systems suitable
for a wide range of real-world applications.
VII. O PEN - SET S UPERVISED V IDEO A NOMALY
D ETECTION
It is challenging to make well-trained supervised model B. Few-shot VAD
deployed in the open world detect unseen anomalies. Unseen The goal of few-shot VAD is to detect anomalies in a
anomalies are highly likely to occur in real-world scenarios, previously unseen scene with only a few frames. Compared
thus research on open-set anomaly detection has garnered sig- to open-set VAD, the main difference lies in that a few real
nificant attention. Open-set supervised VAD is a challenging frames of unseen anomalies are provided. This task is first
task where the goal is to detect anomalous events in videos introduced by Lu et al. [224], and a meta-learning based model
that are unseen during the training phase. Unlike traditional is proposed to tackle this problem. During the test stage, the
(closed-set) VAD, where the types of anomalies are known model needs to be fine-tuned via a few provided samples from
and well-defined, open-set VAD must handle unforeseen and the new scene. In order to avoid extra fine-tuning processes
unknown anomalies. This is crucial for real-world applications, before deployment, Hu et al. [225] and Huang et al. [226]
as it is impractical to anticipate and annotate every possible adopted the metric-based adaptive network and variational
anomaly during training. Therefore, research on open-set VAD network respectively, both leveraging few normal samples as
has garnered significant attention. However, existing review the reference during the test stage without any fine-tuning.
works lack an investigation into open-set VAD. Based on Further, Aich et al. [227] presented a novel zxVAD framework,
this, we conduct an in-depth survey and make a systematic a significant advancement by enabling anomaly detection
taxonomy of existing open-set VAD works. To our knowledge, across domains without requiring target domain adaptation.
this is the first review that includes a detailed introduction In this work, a novel untrained CNN based anomaly synthesis
to open-set supervised VAD. In this section, we broadly module crafts pseudo-abnormal examples by adding foreign
categorize open-set supervised VAD into two types based on objects in normal video frames in a training-free manner.
different research directions: open-set VAD and few-set VAD. This contrasts with the above few-shot adaptive methods,
In Figure 10, we showcase six classical open-set supervised which require a few labeled data from the target domain for
VAD methods. fine-tuning. The former focuses on domain-invariant feature
extraction and unsupervised learning, ensuring robustness and generalizability, while the latter relies on few-shot learning to adapt models to new domains with minimal labeled data.

VIII. FUTURE OPPORTUNITIES

A. Creating Comprehensive Benchmarks

The current VAD benchmarks have various limitations in terms of data size, modality, and capturing views. Thus, an important future direction is to extend benchmarks along these dimensions for providing more realistic VAD test platforms.

1) Large-scale: Currently, in VAD, especially in semi-supervised VAD, the data scale is too small. For example, the UCSD Ped dataset [228] lasts only a few minutes, and even the larger ShanghaiTech dataset [14] is only a few hours long. Compared to datasets in video action recognition tasks [229], which can last hundreds or thousands of hours, VAD datasets are extremely small. This is far from sufficient for training VAD models, as training on small-scale datasets is highly prone to over-fitting in large models. While this might lead to good detection results on the small-scale test data, it can severely impact the performance of VAD models intended for real-world deployment. Therefore, expanding the data scale is a key focus of future research.

2) Multi-modal: Currently, there is limited research on multimodal VAD. Just as humans perceive the world through multiple senses [230], such as vision, hearing, and smell, effectively utilizing various modality information in the face of multi-source heterogeneous data can enhance the performance of anomaly detection. For example, using audio information can better detect anomalies such as screams and panic, while using infrared information can identify abnormal situations in dark environments.

3) Egocentric, Multi-view, 3D, etc.: Egocentric VAD involves using data captured from wearable devices or body-mounted cameras to simulate how individuals perceive their environment and identify abnormal events, such as detecting falls or aggressive behavior in real time. Creating multi-view benchmarks that leverage data from multiple viewpoints allows for comprehensive environment analysis, enabling the detection of anomalies that may not be visible from a single perspective. 3D perspectives from depth information or point clouds can offer more detailed spatial information, enabling models to better understand the structure and context of the environment, which also brings multi-modal signals.

B. Towards Open-world Task

The current research focuses on closed-set VAD, which is restricted to detecting only those anomalies that are defined and annotated during training. In applications like urban surveillance, the inability to adapt to unforeseen anomalies limits the practicality and effectiveness of closed-set VAD models. Therefore, moving towards the open-world VAD task, handling the uncertainty and variability of real-world situations, is a feasible future trend. To accomplish this task, several key approaches and their combination can be taken into account. Self-supervised learning: leveraging unlabeled data to learn discriminative representations that can distinguish between normal and abnormal events [231]; Open-vocabulary learning: developing models that can adapt to new anomalies with large models [142], pseudo anomaly synthesis, or minimal labeled examples; Incremental learning: continuously updating models with new data and anomaly types without forgetting previously learned information [232].

C. Embracing Pre-trained Large Models

Pre-trained large models have shown remarkable success in various computer vision tasks, and these models can be leveraged in VAD to enhance the understanding and detection of anomalies by integrating semantic context and improving feature representations. Here are several feasible directions. Feature extraction: pre-trained weights of large models, which have been trained on large-scale datasets, provide a strong foundation for feature extraction and reduce the need for extensive training from scratch [185]. Semantic understanding: language-vision models can be utilized to understand and incorporate contextual information from video
scenes. For instance, text descriptions associated with video frames can provide additional context that helps in identifying anomalies. In the same way, the language capabilities of these models can be leveraged to generate or understand textual descriptions of anomalies, aiding in both the detection and interpretation of the anomalies [186]. Zero-shot learning: exploit the zero-shot learning capabilities of language-vision models to detect anomalies without requiring explicit examples during training. This is particularly useful in open-set VAD scenarios where new types of anomalies can occur [190].

D. Exploiting Interpretable VAD

Interpretable VAD focuses on creating models that not only detect anomalies but also provide understandable explanations for their predictions. This is crucial for gaining trust in the system, especially in high-stakes applications like surveillance, healthcare, and autonomous vehicles. Here are several feasible directions from three different layers of a VAD system. Input: instead of directly inputting raw video data into the model, leverage existing technologies to extract key information, such as foreground objects, position coordinates, motion trajectories, and crowd relationships. Algorithm: combining algorithms from different domains can be helpful for enhanced reasoning, including: knowledge graphs, i.e., utilize knowledge graphs to incorporate contextual information and relationships between entities; intent prediction, i.e., use intent prediction algorithms to anticipate future actions and detect deviations from expected behaviors [125]; LLM reasoning, i.e., textual descriptions of detected anomalies generated by LLMs can also be used for explanation. These descriptions can explain what the model perceives as abnormal and why [186]. Output: various aspects such as the spatio-temporal changes and patterns in the video may be synthesized to explain anomalies [184].

IX. CONCLUSION

We present a comprehensive survey of video anomaly detection approaches in the deep learning era. Unlike previous reviews mainly focusing on semi-supervised video anomaly detection, we provide a taxonomy that systematically divides the existing works into five categories by their supervision signals, i.e., semi-supervised, weakly supervised, unsupervised, fully supervised, and open-set supervised video anomaly detection. For each category, we further refine the categories based on model differences, e.g., model input and output, methodology, refinement strategy, and architecture, and we demonstrate the performance comparison of various methods. Finally, we discuss several promising research directions for deep learning based video anomaly detection in the future.

REFERENCES

[1] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep learning for anomaly detection: a review," ACM Computing Surveys, vol. 54, no. 2, pp. 1–38, 2021.
[2] Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. J. Crandall, "Dota: unsupervised detection of traffic anomaly in driving videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 444–459, 2022.
[3] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[4] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection–a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[5] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[6] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 fps in matlab," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[7] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, "Feature prediction diffusion model for video anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 5527–5537.
[8] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, "A survey of single-scene video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2293–2312, 2020.
[9] K. K. Santhosh, D. P. Dogra, and P. P. Roy, "Anomaly detection in road traffic using visual surveillance: a survey," ACM Computing Surveys, vol. 53, no. 6, pp. 1–26, 2020.
[10] R. Nayak, U. C. Pati, and S. K. Das, "A comprehensive review on deep learning-based methods for video anomaly detection," Image and Vision Computing, vol. 106, p. 104078, 2021.
[11] T. M. Tran, T. N. Vu, N. D. Vo, T. V. Nguyen, and K. Nguyen, "Anomaly analysis in images and videos: a comprehensive review," ACM Computing Surveys, vol. 55, no. 7, pp. 1–37, 2022.
[12] S. Chandrakala, K. Deepak, and G. Revathy, "Anomaly detection in surveillance videos: a thematic taxonomy of deep models, review and performance analysis," Artificial Intelligence Review, vol. 56, no. 4, pp. 3319–3368, 2023.
[13] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song, "Generalized video anomaly event detection: systematic taxonomy and comparison of deep models," ACM Computing Surveys, vol. 56, no. 7, 2023.
[14] W. Luo, W. Liu, and S. Gao, "Remembering history with convolutional lstm for anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2017, pp. 439–444.
[15] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, "Spatio-temporal autoencoder for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2017, pp. 1933–1941.
[16] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe, "Abnormal event detection in videos using generative adversarial nets," in Proceedings of the IEEE International Conference on Image Processing, 2017, pp. 1577–1581.
[17] T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1273–1283.
[18] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe, "Learning deep representations of appearance and motion for anomalous event detection," in Proceedings of the British Machine Vision Conference, 2015.
[19] D. Xu, Y. Yan, E. Ricci, and N. Sebe, "Detecting anomalous events in videos by learning deep representations of appearance and motion," Computer Vision and Image Understanding, vol. 156, pp. 117–127, 2017.
[20] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2609–2622, 2019.
[21] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette, "Deep-cascade: cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes," IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1992–2004, 2017.
[22] T. Wang, M. Qiao, Z. Lin, C. Li, H. Snoussi, Z. Liu, and C. Choi, "Generative neural networks for anomaly detection in crowded scenes," IEEE Transactions on Information Forensics and Security, vol. 14, no. 5, pp. 1390–1399, 2018.
[23] Y. Fan, G. Wen, D. Li, S. Qiu, M. D. Levine, and F. Xiao, "Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder," Computer Vision and Image Understanding, vol. 195, p. 102920, 2020.
[24] R. Hinami, T. Mei, and S. Satoh, "Joint detection and recounting of abnormal events by learning deep generic knowledge," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3619–3627.
[25] R. T. Ionescu, F. S. Khan, M.-I. Georgescu, and L. Shao, "Object-centric auto-encoders and dummy anomalies for abnormal event detection in video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7842–7851.
[26] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, "A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 13588–13597.
[27] Q. Bao, F. Liu, Y. Liu, L. Jiao, X. Liu, and L. Li, "Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6103–6112.
[28] C. Chen, Y. Xie, S. Lin, A. Yao, G. Jiang, W. Zhang, Y. Qu, R. Qiao, B. Ren, and L. Ma, "Comprehensive regularization in a bi-directional predictive network for video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 230–238.
[29] C. Sun, Y. Jia, and Y. Wu, "Evidential reasoning for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 2106–2114.
[30] S. Sun and X. Gong, "Hierarchical semantic contrast for scene-aware video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 22846–22856.
[31] W. Luo, W. Liu, D. Lian, and S. Gao, "Future frame prediction network for video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.
[32] P. Wu, J. Liu, M. Li, Y. Sun, and F. Shen, "Fast sparse coding networks for anomaly detection in videos," Pattern Recognition, vol. 107, p. 107515, 2020.
[33] R. Cai, H. Zhang, W. Liu, S. Gao, and Z. Hao, "Appearance-motion memory consistency network for video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 938–946.
[34] Y. Liu, J. Liu, X. Zhu, D. Wei, X. Huang, and L. Song, "Learning task-specific representation for video anomaly detection with spatial-temporal attention," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 2190–2194.
[35] X. Huang, C. Zhao, and Z. Wu, "A video anomaly detection framework based on appearance-motion semantics representation consistency," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
[36] N. Li, F. Chang, and C. Liu, "Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes," IEEE Transactions on Multimedia, vol. 23, pp. 203–215, 2020.
[37] B. Ramachandra, M. Jones, and R. Vatsavai, "Learning a distance function with a siamese network to localize anomalies in videos," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2598–2607.
[38] T. Reiss and Y. Hoshen, "Attribute-based representations for accurate and interpretable video anomaly detection," arXiv preprint arXiv:2212.00789, 2022.
[39] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11996–12004.
[40] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, "Graph embedded pose clustering for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10539–10547.
[41] R. Rodrigues, N. Bhargava, R. Velmurugan, and S. Chaudhuri, "Multi-timescale trajectory prediction for abnormal human activity detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2626–2634.
[42] W. Luo, W. Liu, and S. Gao, "Normal graph: spatial temporal graph convolutional networks based prediction network for skeleton based video anomaly detection," Neurocomputing, vol. 444, pp. 332–337, 2021.
[43] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, "A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 200–212, 2021.
[44] Y. Yang, Z. Fu, and S. M. Naqvi, "A two-stream information fusion approach to abnormal event detection in video," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 5787–5791.
[45] N. Li, F. Chang, and C. Liu, "Human-related anomalous event detection via spatial-temporal graph convolutional autoencoder with embedded long short-term memory network," Neurocomputing, vol. 490, pp. 482–494, 2022.
[46] C. Huang, Y. Liu, Z. Zhang, C. Liu, J. Wen, Y. Xu, and Y. Wang, "Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 307–315.
[47] O. Hirschorn and S. Avidan, "Normalizing flows for human pose anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 13545–13554.
[48] S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, "Regularity learning via explicit distribution modeling for skeletal video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.
[49] A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D'Arrigo, B. Prenkaj, and F. Galasso, "Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 10318–10329.
[50] A. Stergiou, B. De Weerdt, and N. Deligiannis, "Holistic representation learning for multitask trajectory anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 6729–6739.
[51] R. Pi, P. Wu, X. He, and Y. Peng, "Eogt: video anomaly detection with enhanced object information and global temporal dependency," ACM Transactions on Multimedia Computing, Communications and Applications, 2024.
[52] Y. Chang, Z. Tu, W. Xie, and J. Yuan, "Clustering driven deep autoencoder for video anomaly detection," in Proceedings of the European Conference on Computer Vision, 2020, pp. 329–345.
[53] Z. Fang, J. Liang, J. T. Zhou, Y. Xiao, and F. Yang, "Anomaly detection with bidirectional consistency in videos," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 3, pp. 1079–1092, 2020.
[54] C. Huang, Z. Yang, J. Wen, Y. Xu, Q. Jiang, J. Yang, and Y. Wang, "Self-supervision-augmented deep autoencoder for unsupervised visual anomaly detection," IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 13834–13847, 2021.
[55] J. T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, and Y. Xiao, "Attention-driven loss for anomaly detection in video surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4639–4647, 2019.
[56] Y. Zhang, X. Nie, R. He, M. Chen, and Y. Yin, "Normality learning in multispace for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3694–3706, 2020.
[57] X. Wang, Z. Che, B. Jiang, N. Xiao, K. Yang, J. Tang, J. Ye, J. Wang, and Q. Qi, "Robust unsupervised video anomaly detection by multipath frame prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2301–2312, 2021.
[58] J. Yu, Y. Lee, K. C. Yow, M. Jeon, and W. Pedrycz, "Abnormal event detection and localization via adversarial event prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 8, pp. 3572–3586, 2021.
[59] W. Zhou, Y. Li, and C. Zhao, "Object-guided and motion-refined attention network for video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[60] K. Cheng, X. Zeng, Y. Liu, M. Zhao, C. Pang, and X. Hu, "Spatial-temporal graph convolutional network boosted flow-frame prediction for video anomaly detection," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
[61] Y. Liu, J. Liu, K. Yang, B. Ju, S. Liu, Y. Wang, D. Yang, P. Sun, and L. Song, "Amp-net: appearance-motion prototype network assisted automatic video anomaly detection system," IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 2843–2855, 2023.
[62] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft, "Cloze test helps: effective video anomaly detection via learning to complete video events," in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 583–591.
[63] Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, "Video event restoration based on keyframes for video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 14592–14601.
[64] G. Yu, S. Wang, Z. Cai, X. Liu, E. Zhu, and J. Yin, "Video anomaly detection via visual cloze tests," IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4955–4969, 2023.
[65] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, "Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles," in Proceedings of the European Conference on Computer Vision, 2022, pp. 494–511.
[66] C. Shi, C. Sun, Y. Wu, and Y. Jia, "Video anomaly detection via sequentially learning multiple pretext tasks," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 10330–10340.
[67] A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Ssmtl++: revisiting self-supervised multi-task learning for video anomaly detection," Computer Vision and Image Understanding, vol. 229, p. 103656, 2023.
[68] C. Huang, Z. Wu, J. Wen, Y. Xu, Q. Jiang, and Y. Wang, "Abnormal event detection using deep contrastive learning for intelligent video surveillance system," IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5171–5179, 2021.
[69] Z. Wang, Y. Zou, and Z. Zhang, "Cluster attention contrast for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 2463–2471.
[70] Y. Lu, C. Cao, Y. Zhang, and Y. Zhang, "Learnable locality-sensitive hashing for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 2, pp. 963–976, 2022.
[71] C. Sun, Y. Jia, H. Song, and Y. Wu, "Adversarial 3d convolutional auto-encoder for abnormal event detection in videos," IEEE Transactions on Multimedia, vol. 23, pp. 3292–3305, 2020.
[72] D. Chen, L. Yue, X. Chang, M. Xu, and T. Jia, "Nm-gan: noise-modulated generative adversarial network for video anomaly detection," Pattern Recognition, vol. 116, p. 107969, 2021.
[73] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.
[74] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked rnn framework," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 341–349.
[75] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, "Anomalynet: an anomaly detection network for video surveillance," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537–2550, 2019.
[76] W. Luo, W. Liu, D. Lian, J. Tang, L. Duan, X. Peng, and S. Gao, "Video anomaly detection with sparse coding inspired deep neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 1070–1084, 2019.
[77] V. Zavrtanik, M. Kristan, and D. Skočaj, "Reconstruction by inpainting for visual anomaly detection," Pattern Recognition, vol. 112, p. 107706, 2021.
[78] N.-C. Ristea, N. Madan, R. T. Ionescu, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Self-supervised predictive convolutional attentive block for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 13576–13586.
[79] N. Madan, N.-C. Ristea, R. T. Ionescu, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Self-supervised masked convolutional transformer block for anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 525–542, 2023.
[80] N.-C. Ristea, F.-A. Croitoru, R. T. Ionescu, M. Popescu, F. S. Khan, M. Shah et al., "Self-distilled masked auto-encoders are efficient video anomaly detectors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 15984–15995.
[81] M. Ye, X. Peng, W. Gan, W. Wu, and Y. Qiao, "Anopcn: video anomaly detection via deep predictive coding network," in Proceedings of the ACM International Conference on Multimedia, 2019, pp. 1805–1813.
[82] Y. Liu, J. Liu, J. Lin, M. Zhao, and L. Song, "Appearance-motion united auto-encoder framework for video anomaly detection," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 5, pp. 2498–2502, 2022.
[83] Y. Liu, J. Liu, M. Zhao, D. Yang, X. Zhu, and L. Song, "Learning appearance-motion normality for video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[84] C. Huang, J. Wen, Y. Xu, Q. Jiang, J. Yang, Y. Wang, and D. Zhang, "Self-supervised attentive generative adversarial networks for video anomaly detection," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 9389–9403, 2022.
[85] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, "Anomaly detection in video via self-supervised and multi-task learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 12742–12752.
[86] M. Zhang, J. Wang, Q. Qi, H. Sun, Z. Zhuang, P. Ren, R. Ma, and J. Liao, "Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17385–17394.
[87] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[88] D. M. Tax and R. P. Duin, "Support vector data description," Machine Learning, vol. 54, pp. 45–66, 2004.
[89] J. Wang and A. Cherian, "Gods: generalized one-class discriminative subspaces for anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8201–8211.
[90] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, "Deep one-class classification," in Proceedings of the International Conference on Machine Learning, 2018, pp. 4393–4402.
[91] P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K.-R. Müller, "Explainable deep one-class classification," in Proceedings of the International Conference on Learning Representations, 2021.
[92] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette, "Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes," Computer Vision and Image Understanding, vol. 172, pp. 88–97, 2018.
[93] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, "Adversarially learned one-class classifier for novelty detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
[94] M. Sabokrou, M. Pourreza, M. Fayyaz, R. Entezari, M. Fathy, J. Gall, and E. Adeli, "Avid: adversarial visual irregularity detection," in Proceedings of the Asian Conference on Computer Vision, 2018, pp. 488–505.
[95] M. Sabokrou, M. Fathy, G. Zhao, and E. Adeli, "Deep end-to-end one-class classifier," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 675–684, 2020.
[96] M. Z. Zaheer, J.-h. Lee, M. Astrid, and S.-I. Lee, "Old is gold: redefining the adversarially learned one-class classifier training paradigm," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 14183–14193.
[97] M. Z. Zaheer, J.-H. Lee, A. Mahmood, M. Astrid, and S.-I. Lee, "Stabilizing adversarially learned one-class novelty detection using pseudo anomalies," IEEE Transactions on Image Processing, vol. 31, pp. 5963–5975, 2022.
[98] K. Doshi and Y. Yilmaz, "Towards interpretable video anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 2655–2664.
[99] A. Singh, M. J. Jones, and E. G. Learned-Miller, "Eval: explainable video anomaly localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 18717–18726.
[100] Y. Yang, K. Lee, B. Dariush, Y. Cao, and S.-Y. Lo, "Follow the rules: reasoning for video anomaly detection with large language models," in Proceedings of the European Conference on Computer Vision, 2024.
[101] J. R. Medel and A. Savakis, "Anomaly detection in video using predictive convolutional long short-term memory networks," arXiv preprint arXiv:1612.00390, 2016.
[102] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, "Training adversarial discriminators for cross-channel abnormal event detection in crowds," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 1896–1904.
[103] H. Vu, T. D. Nguyen, T. Le, W. Luo, and D. Phung, "Robust anomaly detection in videos using multilevel representations," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5216–5223.
[104] H. Song, C. Sun, X. Wu, M. Chen, and Y. Jia, "Learning normal patterns via adversarial attention-based autoencoder for abnormal event detection in videos," IEEE Transactions on Multimedia, vol. 22, no. 8, pp. 2138–2148, 2019.
[106] P. Wu, W. Wang, F. Chang, C. Liu, and B. Wang, “Dss-net: dynamic self-supervised network for video anomaly detection,” IEEE Transactions on Multimedia, vol. 26, pp. 2124–2136, 2023.
[107] M. Astrid, M. Z. Zaheer, and S.-I. Lee, “Limiting reconstruction capability of autoencoders using moving backward pseudo anomalies,” in Proceedings of the International Conference on Ubiquitous Robots, 2022, pp. 248–251.
[108] M. Astrid, M. Zaheer, J.-Y. Lee, and S.-I. Lee, “Learning not to reconstruct anomalies,” in Proceedings of the British Machine Vision Conference, 2021.
[109] M. Astrid, M. Z. Zaheer, and S.-I. Lee, “Pseudobound: limiting the anomaly reconstruction capability of one-class classifiers using pseudo anomalies,” Neurocomputing, vol. 534, pp. 147–160, 2023.
[110] M. Pourreza, B. Mohammadi, M. Khaki, S. Bouindour, H. Snoussi, and M. Sabokrou, “G2d: generate to detect anomaly,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 2003–2012.
[111] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, “A background-agnostic framework with adversarial training for abnormal event detection in video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4505–4523, 2021.
[112] Z. Liu, X.-M. Wu, D. Zheng, K.-Y. Lin, and W.-S. Zheng, “Generating anomalies for video anomaly detection with prompt-based feature mapping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 24500–24510.
[113] J. Leng, M. Tan, X. Gao, W. Lu, and Z. Xu, “Anomaly warning: learning and memorizing future semantic patterns for unsupervised ex-ante potential anomaly prediction,” in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6746–6754.
[114] G. Yu, S. Wang, Z. Cai, X. Liu, and C. Wu, “Effective video abnormal event detection by learning a consistency-aware high-level feature extractor,” in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6337–6346.
[115] W. Liu, H. Chang, B. Ma, S. Shan, and X. Chen, “Diversity-measurable anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 12147–12156.
[116] Y. Liu, D. Yang, G. Fang, Y. Wang, D. Wei, M. Zhao, K. Cheng, J. Liu, and L. Song, “Stochastic video normality network for abnormal event detection in surveillance videos,” Knowledge-Based Systems, vol. 280, p. 110986, 2023.
[117] C. Sun, C. Shi, Y. Jia, and Y. Wu, “Learning event-relevant factors for video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2384–2392.
[118] L. Wang, J. Tian, S. Zhou, H. Shi, and G. Hua, “Memory-augmented appearance-motion network for video anomaly detection,” Pattern Recognition, vol. 138, p. 109335, 2023.
[119] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, “Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705–1714.
[120] H. Park, J. Noh, and B. Ham, “Learning memory-guided normality for anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 14372–14381.
[121] H. Lv, C. Chen, Z. Cui, C. Xu, Y. Li, and J. Yang, “Learning normal dynamics in videos with meta prototype network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15425–15434.
[122] Z. Yang, P. Wu, J. Liu, and X. Liu, “Dynamic local aggregation network with adaptive clusterer for anomaly detection,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 404–421.
[123] C. Cao, Y. Lu, and Y. Zhang, “Context recovery and knowledge retrieval: a novel two-stream framework for video anomaly detection,” IEEE Transactions on Image Processing, vol. 33, pp. 1810–1825, 2024.
[124] S. Lee, H. G. Kim, and Y. M. Ro, “Bman: bidirectional multi-scale aggregation networks for abnormal event detection,” IEEE Transactions on Image Processing, vol. 29, pp. 2395–2408, 2019.
[125] C. Cao, Y. Lu, P. Wang, and Y. Zhang, “A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 20392–20401.
[126] C. Huang, C. Liu, Z. Zhang, Z. Wu, J. Wen, Q. Jiang, and Y. Xu, “Pixel-level anomaly detection via uncertainty-aware prototypical transformer,” in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 521–530.
[127] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[128] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, “Claws: clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 358–376.
[129] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[130] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, “Not only look, but also listen: learning multimodal violence detection under weak supervision,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 322–339.
[131] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, “Self-supervised sparse representation for video anomaly detection,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 729–745.
[132] Y. Zhou, Y. Qu, X. Xu, F. Shen, J. Song, and H. Shen, “Batchnorm-based weakly supervised video anomaly detection,” arXiv preprint arXiv:2311.15367, 2023.
[133] S. AlMarri, M. Z. Zaheer, and K. Nandakumar, “A multi-head approach with shuffled segments for weakly-supervised video anomaly detection,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 132–142.
[134] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
[135] S. Sun and X. Gong, “Long-short temporal co-teaching for weakly supervised video anomaly detection,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2023, pp. 2711–2716.
[136] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: towards good practices for deep action recognition,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 20–36.
[137] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, “Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1237–1246.
[138] N. Li, J.-X. Zhong, X. Shu, and H. Guo, “Weakly-supervised anomaly detection in video surveillance via graph convolutional label noise cleaning,” Neurocomputing, vol. 481, pp. 154–167, 2022.
[139] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 10012–10022.
[140] S. Li, F. Liu, and L. Jiao, “Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.
[141] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning, 2021, pp. 8748–8763.
[142] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, “Open-vocabulary video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18297–18307.
[143] B. Wan, Y. Fang, X. Xia, and J. Mei, “Weakly supervised video anomaly detection via center-guided discriminative learning,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2020, pp. 1–6.
[144] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2017, pp. 131–135.
[145] W.-F. Pang, Q.-H. He, Y.-J. Hu, and Y.-X. Li, “Violence detection in videos based on fusing visual and audio information,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 2260–2264.
[146] X. Peng, H. Wen, Y. Luo, X. Zhou, K. Yu, Y. Wang, and Z. Wu, “Learning weakly supervised audio-visual violence detection in hyperbolic space,” arXiv preprint arXiv:2305.18797, 2023.
[147] Y. Pu, X. Wu, and S. Wang, “Learning prompt-enhanced context features for weakly-supervised video anomaly detection,” arXiv preprint arXiv:2306.14451, 2023.
[148] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A.-A. Khoo, “Tevad: improved video anomaly detection with captions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2023, pp. 5548–5558.
[149] C. Tao, C. Wang, Y. Zou, X. Peng, J. Wu, and J. Qian, “Learn suspected anomalies from event prompts for video anomaly detection,” arXiv preprint arXiv:2403.01169, 2024.
[150] P. Wu, J. Liu, X. He, Y. Peng, P. Wang, and Y. Zhang, “Toward video anomaly retrieval from video anomaly detection: new benchmarks and model,” IEEE Transactions on Image Processing, vol. 33, pp. 2213–2225, 2024.
[151] D.-L. Wei, C.-G. Liu, Y. Liu, J. Liu, X.-G. Zhu, and X.-H. Zeng, “Look, listen and pay more attention: fusing multi-modal information for video violence detection,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 1980–1984.
[152] D. Wei, Y. Liu, X. Zhu, J. Liu, and X. Zeng, “Msaf: multimodal supervise-attention enhanced fusion for video anomaly detection,” IEEE Signal Processing Letters, vol. 29, pp. 2178–2182, 2022.
[153] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, “Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection,” in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6278–6287.
[154] P. Wu, X. Liu, and J. Liu, “Weakly supervised audio-visual violence detection,” IEEE Transactions on Multimedia, vol. 25, pp. 1674–1685, 2022.
[155] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, “Towards surveillance video-and-language understanding: new dataset, baselines and challenges,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 22052–22061.
[156] Y. Zhu and S. Newsam, “Motion-aware feature for improved video anomaly detection,” in Proceedings of the British Machine Vision Conference, 2019.
[157] J. Zhang, L. Qing, and J. Miao, “Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection,” in Proceedings of the IEEE International Conference on Image Processing, 2019, pp. 4030–4034.
[158] Y. Liu, J. Liu, M. Zhao, S. Li, and L. Song, “Collaborative normality learning framework for weakly supervised video anomaly detection,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 5, pp. 2508–2512, 2022.
[159] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, “Mist: multiple instance self-training framework for video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14009–14018.
[160] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, “Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 16271–16280.
[161] Z. Yang, J. Liu, and P. Wu, “Text prompt with normality guidance for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18899–18908.
[162] P. Wu and J. Liu, “Learning causal temporal relation and feature discrimination for anomaly detection,” IEEE Transactions on Image Processing, vol. 30, pp. 3513–3527, 2021.
[163] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 4975–4986.
[164] T. Liu, C. Zhang, K.-M. Lam, and J. Kong, “Decouple and resolve: transformer-based models for online anomaly detection from weakly labeled videos,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 15–28, 2022.
[165] S. Chang, Y. Li, S. Shen, J. Feng, and Z. Zhou, “Contrastive attention for video anomaly detection,” IEEE Transactions on Multimedia, vol. 24, pp. 4067–4076, 2021.
[166] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, “Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 12137–12146.
[167] D. Purwanto, Y.-T. Chen, and W.-H. Fang, “Dance with self-attention: a new look of conditional random fields on anomaly detection in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 173–183.
[168] C. Huang, C. Liu, J. Wen, L. Wu, Y. Xu, Q. Jiang, and Y. Wang, “Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,” IEEE Transactions on Cybernetics, vol. 54, no. 5, pp. 3197–3210, 2022.
[169] C. Zhang, G. Li, Q. Xu, X. Zhang, L. Su, and Q. Huang, “Weakly supervised anomaly detection in videos considering the openness of events,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 21687–21699, 2022.
[170] H. Zhou, J. Yu, and W. Yang, “Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3769–3777.
[171] G. Li, G. Cai, X. Zeng, and R. Zhao, “Scale-aware spatio-temporal relation learning for video anomaly detection,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 333–350.
[172] H. Ye, K. Xu, X. Jiang, and T. Sun, “Learning spatio-temporal relations with multi-scale integrated perception for video anomaly detection,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2024, pp. 4020–4024.
[173] S. Lin, H. Yang, X. Tang, T. Shi, and L. Chen, “Social mil: interaction-aware for crowd anomaly detection,” in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2019, pp. 1–8.
[174] S. Park, H. Kim, M. Kim, D. Kim, and K. Sohn, “Normality guided multiple instance learning for weakly supervised video anomaly detection,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 2665–2674.
[175] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, “Unbiased multiple instance learning for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 8022–8031.
[176] J. Chen, L. Li, L. Su, Z.-J. Zha, and Q. Huang, “Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18319–18329.
[177] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, “Mgfn: magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 387–395.
[178] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, “Multi-scale continuity-aware refinement network for weakly supervised video anomaly detection,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[179] H. Sapkota and Q. Yu, “Bayesian nonparametric submodular video partition for robust anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 3212–3221.
[180] J. Fioresi, I. R. Dave, and M. Shah, “Ted-spad: temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 13598–13609.
[181] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, “Clustering aided weakly supervised training to detect anomalous events in surveillance videos,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2023.
[182] T. Liu, K.-M. Lam, and J. Kong, “Distilling privileged knowledge for anomalous event detection from weakly labeled videos,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
[183] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, “Vadclip: adapting vision-language models for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6074–6082.
[184] P. Wu, X. Zhou, G. Pang, Z. Yang, Q. Yan, P. Wang, and Y. Zhang, “Weakly supervised video anomaly detection and localization with spatio-temporal prompts,” in Proceedings of the ACM International Conference on Multimedia, 2024.
[185] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, “Clip-tsa: clip-assisted temporal self-attention for weakly-supervised video anomaly detection,” in Proceedings of the IEEE International Conference on Image Processing, 2023, pp. 3230–3234.
[186] H. Zhang, X. Xu, X. Wang, J. Zuo, C. Han, X. Huang, C. Gao, Y. Wang, and N. Sang, “Holmes-vad: towards unbiased and explainable video anomaly detection via multi-modal llm,” arXiv preprint arXiv:2406.12235, 2024.
[187] H. Lv and Q. Sun, “Video anomaly detection and explanation via large language models,” arXiv preprint arXiv:2401.05702, 2024.
[188] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, “Harnessing large language models for training-free video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18527–18536.
[189] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “Winclip: zero-/few-shot anomaly classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 19606–19616.
[190] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, “Anomalyclip: object-agnostic prompt learning for zero-shot anomaly detection,” in Proceedings of the International Conference on Learning Representations, 2024.
[191] J. Zhu and G. Pang, “Toward generalist anomaly detection via in-context residual learning with few-shot sample prompts,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17826–17836.
[192] K. Liu and H. Ma, “Exploring background-bias for anomaly detection in surveillance videos,” in Proceedings of the ACM International Conference on Multimedia, 2019, pp. 1490–1499.
[193] J. Wu, W. Zhang, G. Li, W. Wu, X. Tan, Y. Li, E. Ding, and L. Lin, “Weakly-supervised spatio-temporal anomaly detection in surveillance video,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2021.
[194] Z. Dong, J. Qin, and Y. Wang, “Multi-stream deep networks for person to person violence detection in videos,” in Proceedings of the Chinese Conference on Pattern Recognition, 2016, pp. 517–531.
[195] P. Zhou, Q. Ding, H. Luo, and X. Hou, “Violent interaction detection in video based on deep learning,” in Journal of Physics: Conference Series, vol. 844, no. 1, 2017, p. 012044.
[196] B. Peixoto, B. Lavi, J. P. P. Martin, S. Avila, Z. Dias, and A. Rocha, “Toward subjective violence detection in videos,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 8276–8280.
[197] M. Perez, A. C. Kot, and A. Rocha, “Detection of real-world fights in surveillance videos,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 2662–2666.
[198] B. Peixoto, B. Lavi, P. Bestagini, Z. Dias, and A. Rocha, “Multimodal violence detection in videos,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020, pp. 2957–2961.
[199] S. Sudhakaran and O. Lanz, “Learning to detect violent videos using convolutional long short-term memory,” in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2017, pp. 1–6.
[200] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, “Bidirectional convolutional lstm for the detection of violence in videos,” in Proceedings of the European Conference on Computer Vision Workshop, 2018.
[201] Y. Su, G. Lin, J. Zhu, and Q. Wu, “Human interaction learning on 3d skeleton point clouds for video violence recognition,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 74–90.
[202] A. Singh, D. Patil, and S. Omkar, “Eye in the sky: real-time drone surveillance system (dss) for violent individuals identification using scatternet hybrid deep learning network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2018, pp. 1629–1637.
[203] M. Cheng, K. Cai, and M. Li, “Rwf-2000: an open large scale video database for violence detection,” in Proceedings of the International Conference on Pattern Recognition, 2021, pp. 4183–4190.
[204] Y. Shang, X. Wu, and R. Liu, “Multimodal violent video recognition based on mutual distillation,” in Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, 2022, pp. 623–637.
[205] G. Garcia-Cobo and J. C. SanMiguel, “Human skeletons and change detection for efficient violence detection in surveillance videos,” Computer Vision and Image Understanding, vol. 233, p. 103739, 2023.
[206] J. Su, P. Her, E. Clemens, E. Yaz, S. Schneider, and H. Medeiros, “Violence detection using 3d convolutional neural networks,” in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2022, pp. 1–8.
[207] A. Del Giorno, J. A. Bagnell, and M. Hebert, “A discriminative framework for anomaly detection in large videos,” in Proceedings of the European Conference on Computer Vision, 2016, pp. 334–349.
[208] R. T. Ionescu, S. Smeureanu, B. Alexe, and M. Popescu, “Unmasking the abnormal events in video,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2895–2903.
[209] Y. Liu, C.-L. Li, and B. Póczos, “Classifier two sample test for video anomaly detections,” in Proceedings of the British Machine Vision Conference, 2018, p. 71.
[210] S. Wang, Y. Zeng, Q. Liu, C. Zhu, E. Zhu, and J. Yin, “Detecting abnormality without knowing normality: a two-stage approach for unsupervised video abnormal event detection,” in Proceedings of the ACM International Conference on Multimedia, 2018, pp. 636–644.
[211] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai, “Self-trained deep ordinal regression for end-to-end video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 12173–12182.
[212] J. Hu, G. Yu, S. Wang, E. Zhu, Z. Cai, and X. Zhu, “Detecting anomalous events from unlabeled videos via temporal masked auto-encoding,” in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[213] X. Lin, Y. Chen, G. Li, and Y. Yu, “A causal inference look at unsupervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1620–1629.
[214] G. Yu, S. Wang, Z. Cai, X. Liu, C. Xu, and C. Wu, “Deep anomaly discovery from unlabeled videos via normality advantage and self-paced refinement,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 13987–13998.
[215] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, “Generative cooperative learning for unsupervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 14744–14754.
[216] A. Al-lahham, N. Tastan, M. Z. Zaheer, and K. Nandakumar, “A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 6793–6802.
[217] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[218] T. Li, Z. Wang, S. Liu, and W.-Y. Lin, “Deep unsupervised anomaly detection,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 3636–3645.
[219] W. Liu, W. Luo, Z. Li, P. Zhao, S. Gao et al., “Margin learning embedded prediction for video anomaly detection with a few anomalies,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2019, pp. 3023–3030.
[220] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, “Ubnormal: new benchmark for supervised open-set video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 20143–20153.
[221] Y. Zhu, W. Bao, and Q. Yu, “Towards open set video anomaly detection,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 395–412.
[222] C. Ding, G. Pang, and C. Shen, “Catching both gray and black swans: open-set supervised anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 7388–7398.
[223] J. Zhu, C. Ding, Y. Tian, and G. Pang, “Anomaly heterogeneity learning for open-set supervised anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17616–17626.
[224] Y. Lu, F. Yu, M. K. K. Reddy, and Y. Wang, “Few-shot scene-adaptive anomaly detection,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 125–141.
[225] Y. Hu, X. Huang, and X. Luo, “Adaptive anomaly detection network for unseen scene without fine-tuning,” in Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, 2021, pp. 311–323.
[226] X. Huang, Y. Hu, X. Luo, J. Han, B. Zhang, and X. Cao, “Boosting variational inference with margin learning for few-shot scene-adaptive anomaly detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 6, pp. 2813–2825, 2022.
[227] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, “Cross-domain video anomaly detection without target domain adaptation,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 2579–2591.
[228] W. Li, V. Mahadevan, and N. Vasconcelos, “Anomaly detection and localization in crowded scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 18–32, 2013.
[229] Z. Sun, Q. Ke, H. Rahmani, M. Bennamoun, G. Wang, and J. Liu, “Human action recognition from various data modalities: a review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3200–3225, 2022.
[230] Y. Zhu, Y. Wu, N. Sebe, and Y. Yan, “Vision+x: a survey on multimodal learning in the light of data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[231] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: algorithms, applications, and future trends,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[232] D. Zhou, Q. Wang, Z. Qi, H. Ye, D. Zhan, and Z. Liu, “Class-incremental learning: a survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.