Video Anomaly Detection

arXiv:2409.05383v1 [[Link]] 9 Sep 2024

Abstract—[...] As a long-standing task in the field of computer vision, VAD has witnessed much good progress. In the era of deep learning, architectures of continuously growing capability are constantly emerging for the VAD task, greatly improving the detection performance. [...] fully supervised, unsupervised, and open-set supervised VAD, and we also delve into the latest VAD works based on pre-trained large models, remedying the limitations of past reviews that focus only on semi-supervised VAD and small-model based methods. For the VAD task with different levels of supervision, we construct a well-organized taxonomy, profoundly discuss the characteristics of different types of methods, and show their performance comparisons. In addition, this review covers the public datasets, open-source codes, and evaluation metrics for all the aforementioned VAD tasks. Finally, we provide several important research directions for the VAD community.

Index Terms—Video anomaly detection, anomaly detection, video understanding, deep learning.

[Fig. 1: number of VAD-related publications per year, 2016–2023, counted from Google Scholar and IEEE Xplore (y-axis: Publications).]

Fig. 2. Performance development for semi/weakly supervised VAD tasks. [Line plots of AUC/AP (%) on common benchmarks, 2016–2024.]
I. INTRODUCTION

ANOMALY represents something that deviates from what is standard, normal, or expected. There are myriads of normalities, whereas anomalies are considerably scarce. However, when anomalies do appear, they often have a negative impact. Anomaly detection aims to discover these rare anomalies on top of machine learning, thereby reducing the cost of manual judgment. Anomaly detection has widespread applications across various fields [1], such as financial fraud detection, network intrusion detection, industrial defect detection, and human violence detection. Among these, video anomaly detection (VAD) occupies an important place; here, an anomaly indicates abnormal events in the temporal or spatial dimensions. VAD not only plays a vital role in intelligent security (e.g., violence, intrusion, and loitering detection) but is also widely used in other scenarios, such as online video content review and traffic anomaly prediction in autonomous driving [2]. Owing to its significant potential for applications across different fields, VAD has attracted considerable attention from both industry and academia.

In the pre-deep-learning era, the routine way was to separate feature extraction and classifier design, which forms a two-stage process, and then combine the two during the inference stage. First, a feature extraction process converts the original high-dimensional raw videos into compact hand-crafted features based on the prior knowledge of experts. Although hand-crafted features lack robustness and are difficult to use for capturing effective behavior expressions in the face of complex scenarios, these pioneering works deeply enlightened subsequent deep learning based works.

The rise of deep learning has made traditional machine learning algorithms fall out of favor over the last decade. With the rapid development of computer hardware and the massive data of the Internet era, we have witnessed great progress in deep learning based methods for VAD in recent years. For example, ConvAE [3] was the first work to use deep auto-encoders based on convolutional neural networks (CNNs) to capture the regularities in videos; FuturePred [4] was the first work to make use of U-Net for forecasting future anomalies; and DeepMIL [5] was the first endeavor to explore a deep multiple instance learning (MIL) framework for real-world anomalies. In order to more intuitively manifest the research enthusiasm for the VAD task in the era of deep learning, we conduct a statistical survey on the number of publications related to VAD over the past decade (the era driven by the rise of deep learning based methods) through Google Scholar and IEEE Xplore¹.

Peng Wu, Chengyu Pan, Yuting Yan, Peng Wang, and Yanning Zhang are with the School of Computer Science, Northwestern Polytechnical University, China. E-mail: xdwupeng@[Link]; {[Link], ynzhang}@[Link].
Guansong Pang is with the School of Computing and Information Systems, Singapore Management University, Singapore. E-mail: pangguansong@[Link].
Manuscript received April 19, 2021; revised August 16, 2021. (Corresponding author: Guansong Pang. Chengyu Pan and Yuting Yan contributed equally.)
¹ [Link] [Link]
TABLE I
Analysis and Comparison of Related Reviews.

Reference               | Year | Main Focus                                     | Main Categorization                                              | UVAD | WVAD | SVAD | FVAD | OVAD | LVAD | IVAD
Ramachandra et al. [8]  | 2020 | Semi-supervised single-scene VAD               | Methodology                                                      |  ✗   |  ✗   |  ✓   |  ✗   |  ✗   |  ✗   |  ✗
Santhosh et al. [9]     | 2020 | VAD applied on road traffic                    | Methodology                                                      |  ✓   |  ✗   |  ✓   |  ✓   |  ✗   |  ✗   |  ✗
Nayak et al. [10]       | 2021 | Deep learning driven semi-supervised VAD       | Methodology                                                      |  ✗   |  ✗   |  ✓   |  ✗   |  ✗   |  ✗   |  ✗
Tran et al. [11]        | 2022 | Semi- & weakly supervised VAD                  | Architecture                                                     |  ✗   |  ✗   |  ✓   |  ✗   |  ✗   |  ✗   |  ✗
Chandrakala et al. [12] | 2023 | Deep model-based one- & two-class VAD          | Methodology & Architecture                                       |  ✗   |  ✓   |  ✓   |  ✓   |  ✗   |  ✗   |  ✗
Liu et al. [13]         | 2023 | Deep models for semi- & weakly supervised VAD  | Model Input                                                      |  ✓   |  ✓   |  ✓   |  ✓   |  ✗   |  ✗   |  ✗
Our survey              | 2024 | Comprehensive VAD taxonomy and deep models     | Methodology, Architecture, Refinement, Model Input, Model Output |  ✓   |  ✓   |  ✓   |  ✓   |  ✓   |  ✓   |  ✓

UVAD: Unsupervised VAD, WVAD: Weakly supervised VAD, SVAD: Semi-supervised VAD, FVAD: Fully supervised VAD, OVAD: Open-set supervised VAD, LVAD: Large-model based VAD, IVAD: Interpretable VAD.
We select five related topics, i.e., video anomaly detection, abnormal event detection, abnormal behavior detection, anomalous event detection, and anomalous behavior detection, and showcase the publication statistics in Figure 1. It is not hard to see that the number of related publications counted from both sources exhibits a steady and rapid growth trend, demonstrating that VAD has garnered widespread attention. Moreover, we also examine the detection performance trends of annual state-of-the-art methods on commonly used datasets under the two common supervision manners, and present these trends in Figure 2. The detection performance shows a steady upward trend across all datasets, without displaying any performance bottleneck. For instance, the performance of semi-supervised methods on CUHK Avenue [6] has experienced a significant surge, rising from 70.2% AUC [3] to an impressive 90.1% AUC [7] over the past seven years. Moreover, for the subsequently proposed weakly supervised VAD, significant progress has been achieved as well. This indicates the evolving capability of deep learning methods under developing architectures, and also showcases the community's ongoing exploration enthusiasm for the VAD task.

The above statistics clearly demonstrate that deep learning driven VAD is a hot area of current research. Therefore, there is an urgent necessity for a systematic taxonomy and comprehensive summary of existing works, to serve as a guide for newcomers and provide references for existing researchers. Based on this, we first collect some high-profile reviews on VAD from the past few years, which are shown in Table I. Ramachandra et al. [8] mainly focused on semi-supervised VAD in the single-scene setting, lacking discussion of cross-scene methods. Santhosh et al. [9] reviewed VAD methods focusing on entities in road traffic scenarios. These reviews lack sufficiently in-depth analysis and center on pre-2020 methodologies, resulting in the neglect of recent advances. Nayak et al. [10] comprehensively surveyed deep learning based methods for semi-supervised VAD, but did not take weakly supervised VAD methods into account. The follow-up work of Tran et al. [11] introduced a review of the emerging weakly supervised VAD, but its focus is not only on videos but also on image anomaly detection, resulting in a lack of systematic organization of the VAD task. More recently, both Chandrakala et al. [12] and Liu et al. [13] constructed an organized taxonomy covering a variety of VAD tasks, e.g., unsupervised VAD, semi-supervised VAD, weakly supervised VAD, and fully supervised VAD, and also surveyed deep learning based methods for most supervised VAD tasks. However, they restrict their scope to the conventional closed-set scenario, failing to cover the latest research in the field of open-set supervised VAD and the brand-new pipelines based on pre-trained large models and interpretable learning.

To address this gap, we present a thorough survey of VAD works in the deep learning era. Our survey covers several key aspects to provide a comprehensive analysis of VAD studies. To be specific, we perform an in-depth investigation into the development trends of the VAD task in the era of deep learning, and then propose a unified framework that integrates different VAD tasks together, filling the gaps in existing reviews in terms of taxonomy. We then collect the most comprehensive open resources, including benchmark datasets, evaluation metrics, open-source codes, and performance comparisons, to help researchers in this field avoid detours and improve efficiency. Further, we systematically categorize the various VAD tasks, dividing existing works into different categories and establishing a clear and structured taxonomy that provides a coherent and organized overview of the various VAD paradigms. In addition to this taxonomy, we conduct a comprehensive analysis of each paradigm. Furthermore, throughout this survey, we spotlight influential works that have significantly contributed to the research advancement in VAD.

The main contributions of this survey are summarized in the following three aspects:
• We provide a comprehensive review of VAD, covering five tasks based on different supervision signals, i.e., semi-supervised VAD, weakly supervised VAD, fully supervised VAD, unsupervised VAD, and open-set supervised VAD. The research focus has expanded from the traditional single-task semi-supervised VAD to a broader range of multiple VAD tasks.
• Staying abreast of research trends, we review the latest studies on open-set supervised VAD. Moreover, we also revisit the most recent VAD methods based on pre-trained large models and interpretable learning. The emergence of these methods elevates both the performance and the application prospects of VAD. To our knowledge, this is the first comprehensive survey of open-set supervised VAD and pre-trained large model based VAD methods.
• For the different tasks, we systematically review existing deep learning based methods and, more importantly, introduce a unified taxonomy framework categorizing the methods from the various VAD paradigms based on various aspects, including model input, architecture, methodology, model refinement, and output. This meticulous scientific taxonomy enables a comprehensive understanding of the field.
[Fig. 3 legend: F = frame-level label, V = video-level label, N = normal video, A = anomalous video, U = unlabeled video, A (hatched) = unseen anomalous video. Panels: (a) fully supervised VAD, (b) semi-supervised VAD, (c) weakly supervised VAD, (d) unsupervised VAD, (e) open-set supervised VAD; each panel shows the composition of the train and test sets.]
Fig. 3. Comparisons of five supervised VAD tasks, i.e., fully supervised, semi-supervised, weakly supervised, unsupervised, and open-set supervised VAD.
II. BACKGROUND

A. Notation and Taxonomy

As aforementioned, the studied problem, VAD, can be formally divided into five categories based on supervision signals. The different supervised VAD tasks all aim to identify anomalous behaviors or events, but with different training and testing setups. We illustrate these VAD tasks in Figure 3. The general VAD problem is presented as follows. Suppose we are given a set of training samples X = {x_i}_{i=1}^{N+A} and corresponding labels Y, where X_n = {x_i}_{i=1}^{N} is the set of normal samples and X_a = {x_i}_{i=N+1}^{N+A} is the set of abnormal samples. Each sample x_i is accompanied by a corresponding supervision label y_i in Y. During the training phase, the detection model Φ(θ) takes X as input and generates anomaly predictions; it is then optimized according to the following objective,

    l = L(Φ(θ, X), X or Y)    (1)

where L(·) is employed to quantify the discrepancy between the predictions and the ground-truth labels or original samples. During inference, the detection model is expected to locate the abnormal behaviors or events in videos based on the generated anomaly predictions. Depending on the input to L, VAD can be categorized into one of the following five task settings.

Semi-supervised VAD assumes that only normal samples are available during the training stage, that is, X_a = ∅. This task aims to learn the normal patterns based on the training samples and to treat test samples that fall outside the learned patterns as anomalies. Pros and Cons: Only normal samples are required for training, hence there is no need to painstakingly collect scarce abnormal samples. However, any unseen test sample may be recognized as abnormal, leading to a higher false positive rate.

Weakly supervised VAD has more sufficient training samples and supervision signals than semi-supervised VAD. Both normal and abnormal samples are provided during the training stage, but the precise location stamps of anomalies in these untrimmed videos are unknown. In other words, only coarse video-level labels are available (i.e., inexact supervision). Formally, Y = {y_i}_{i=1}^{N+A} with y_i ∈ {0, 1}, where y_i = 0 indicates that x_i is normal and y_i = 1 indicates that x_i is abnormal. Pros and Cons: Compared to fully supervised annotation, it can significantly reduce labeling costs. However, it places higher demands on algorithm design and may lead to situations of blind guessing.

Fully supervised VAD, as its name implies, comes with complete supervision signals, meaning that each abnormal sample has precise annotations of its anomalies. This task can be viewed as a standard video or frame classification problem. Due to the scarcity of abnormal behaviors and the intensive manual labeling required in reality, there has been little research on the fully supervised VAD task. It is noteworthy that video violence detection can be regarded as fully supervised VAD; hence, we treat violence detection as a fully supervised VAD task in this paper. Formally, each video x_i in X_a is accompanied by a corresponding supervision label y_i = {(t_j^s, t_j^e)}_{j=1}^{U_i}, where t_j^s and t_j^e denote the start and end time of the j-th violence event, and U_i indicates the total number of anomalies present in the video. Pros and Cons: In contrast to weakly supervised VAD, with fully supervised information the detection performance of the algorithms can be remarkable. However, the corresponding drawback is the high requirement for intensive manual annotations.

Unsupervised VAD aims to discover anomalies directly from fully unlabeled videos in an unsupervised manner. Thus, unsupervised VAD no longer requires labeling normal and abnormal videos to build the training set. It can be expressed formally as follows: X = X_test and Y = ∅, in which X_test denotes the set of test samples. Pros and Cons: No time-consuming effort is needed to collect training samples, avoiding the heavy labeling burden. Besides, this assumption also expands the application fields of VAD, implying that the detection system can continuously retrain without human intervention. Unfortunately, due to the lack of labels, the detection performance is relatively poor, leading to higher rates of false positives and false negatives.

Open-set supervised VAD is devised to discover unseen anomalies that are not present in the training set. Unlike semi-supervised VAD, open-set supervised VAD includes abnormal samples in the training set, which are referred to as seen anomalies. Specifically, for each x_i in X_a, its corresponding label y_i ∈ C_base, where C_base represents the set of base (seen) anomaly categories, and C_base ⊂ C, with C = C_base ∪ C_novel. Here, C_novel and C represent the sets of novel anomaly categories unseen during training and of all anomaly categories, respectively. Given a testing sample x_test, its label y_test may be either in C_base or in C_novel. Pros and Cons: Compared to the two most common tasks, i.e., semi-supervised VAD and weakly supervised VAD, open-set supervised VAD not only reduces false positives but also avoids being limited to closed-set scenarios, thus demonstrating high practical value. However, it relies on learning specialized classifiers, loss functions, or generating unknown classes to detect unseen anomalies.
[Fig. 4 residue: only fragments of the taxonomy diagram survive, e.g., patch inpainting, multiple tasks, hybrid inputs, and interpretable learning.]
Fig. 4. The taxonomy of semi-supervised VAD. We provide a hierarchical taxonomy that organizes existing deep semi-supervised VAD models by model input, methodology, network architecture, refinement strategy, and model output into a systematic framework.
B. Datasets and Metrics

Related benchmark datasets and evaluation metrics are listed at [Link].
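Since frame-level AUC (and AP, e.g., on XD-Violence) is the metric reported throughout the performance comparisons in this survey, a minimal computation sketch with scikit-learn follows; the scores and labels below are synthetic placeholders, not real benchmark data.

```python
# Minimal sketch: frame-level AUC/AP, the standard VAD evaluation metrics.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)            # 0 = normal frame, 1 = abnormal frame
scores = labels * 0.6 + rng.random(1000) * 0.4    # higher score = more anomalous

print("frame-level AUC:", roc_auc_score(labels, scores))
print("AP:", average_precision_score(labels, scores))
```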
III. SEMI-SUPERVISED VIDEO ANOMALY DETECTION

Based on our in-depth investigation of past surveys, we found that previous surveys mostly lack a scientific taxonomy: many simply categorize semi-supervised VAD works into different groups based on usage approaches, such as reconstruction-based, distance-based, and probability-based approaches, while some classify works according to inputs, such as image-based, optical flow-based, and patch-based approaches. It is apparent that existing classification schemes are relatively simplistic and superficial, making it challenging to cover all methods comprehensively and effectively. To address this issue, we establish a comprehensive taxonomy encompassing model input, methodology, architecture, model refinement, and model output. The detailed illustration is presented in Figure 4.
As aforementioned, only normal samples are available for training in the semi-supervised VAD task, rendering the supervised classification paradigm inapplicable. Common approaches involve leveraging the intrinsic information of the training samples to learn deep neural networks (DNNs) that solve a pretext task. For instance, normality reconstruction is a classic pretext task [3]. During this process, several critical aspects need consideration: selection of sample information (Model Input), design of pretext tasks (Methodology), utilization of deep networks (Network Architecture), improvement of methods (Refinement), and expression of anomaly results (Model Output). These key elements collectively contribute to the effectiveness of semi-supervised VAD solutions. In the following sections, we introduce existing deep learning based VAD methods systematically according to this taxonomy.

A. Model Input

Existing semi-supervised VAD methods typically use the raw video or its intuitive representations as the model input. Depending on the modality, these can be categorized as follows: RGB images, optical flow, skeleton, and hybrid inputs, where the first three represent appearance, motion, and body posture, respectively.

1) RGB: RGB images are the most common input for conventional vision tasks driven by deep learning techniques, and this holds true for the VAD task as well. Unlike other modalities, RGB images do not require additional processing steps such as optical flow calculation or pose estimation. In the deep learning era, various deep models can be employed to extract compact, high-level visual features from these high-dimensional raw data. Utilizing these high-level features enables the design of more effective subsequent detection methods. Moreover, depending on the input size, RGB-based input can be categorized into three principal groups: frame level, patch level, and object level.

Frame-level RGB input provides a macroscopic view of the entire scene, encompassing both the background, which is usually unrelated to the event, and the foreground objects, where anomalies are more likely to occur. The conventional approach typically uses multiple consecutive video frames as a single input to capture temporal context within the video, as seen in methods like ConvAE [3], ConvLSTM-AE [14], and STAE [15]. On the other hand, several studies focus on using single-frame RGB as input, aiming to detect anomalies at the spatial level, such as AnomalyGAN [16] and AMC [17].

Patch-level RGB input involves segmenting the frame-level RGB input spatially or spatio-temporally, which focuses on local regions, effectively separating the foreground from the background and differentiating between various individual entities. The primary advantage of patch-level input is [...]
The network is compelled to make the predicted data similar to the actual current data. We define the optimization objective for prediction as

    l_pre = L(Φ(θ, {I_{t−Δt}, ..., I_{t−1}}), I_t)    (3)

where I_t is the actual data at the current time step t, and I_{t−Δt}, ..., I_{t−1} represent the historical data from time step t−Δt to t−1. FuturePred [4], a future frame prediction framework, provided a new solution for VAD, and many researchers [26], [33], [46], [55]–[61] subsequently proposed other prediction based methods. Prediction alleviates, to some extent, the problem of reconstruction based methods whereby abnormal events can also be well reconstructed.
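To make the prediction objective in Eq. (3) concrete, below is a minimal PyTorch-style sketch of one training step; the tiny convolutional predictor, tensor shapes, and hyperparameters are illustrative assumptions rather than the U-Net used by FuturePred [4].

```python
# Minimal sketch of the future-frame prediction objective in Eq. (3).
import torch
import torch.nn as nn

class TinyPredictor(nn.Module):
    def __init__(self, in_frames=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_frames * 3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),   # predict the next RGB frame
        )

    def forward(self, clips):                 # clips: (B, T, 3, H, W)
        b, t, c, h, w = clips.shape
        return self.net(clips.reshape(b, t * c, h, w))

model = TinyPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

clip = torch.rand(8, 5, 3, 64, 64)            # synthetic clip of 5 frames
past, target = clip[:, :4], clip[:, 4]        # {I_{t-4},...,I_{t-1}} -> I_t
loss = nn.functional.mse_loss(model(past), target)   # l_pre
opt.zero_grad(); loss.backward(); opt.step()

# At test time, a large prediction error on a frame yields a high anomaly score.
```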
Visual cloze test is inspired by the cloze test in natural language processing [62]–[64]. It mainly involves training multiple DNNs to infer deliberately erased data from incomplete video sequences; the prediction task can be considered a special case of the visual cloze test, namely the case where the erased data happens to be the last frame in the video sequence. We define the objective function for completing the data erased at the t-th time stamp as

    l_vct^(t) = L(Φ(θ, {I_1, ..., I_{t−1}, I_{t+1}, ...}), I_t)    (4)

Similar to the prediction task, it also leverages the temporal relationships in the video, but the difference is that this task can learn better high-level semantics and temporal context.
Jigsaw puzzles have recently been applied as a pretext task in semi-supervised VAD [65]–[67]. The main process involves creating jigsaw puzzles by performing temporal, spatial, or spatio-temporal shuffling, and then designing networks to predict the relative or absolute permutation in time, space, or both. The optimization function is as follows,

    l_jig = Σ_i L(t_i, t̂_i)    (5)

where t_i and t̂_i denote the ground-truth and predicted positions of the i-th element in the original sequence, respectively. Unlike the previous pretext tasks, which involve high-quality image generation, jigsaw puzzles are cast as multi-label classification, enhancing computational efficiency and learning more contextual details.
Contrastive learning is a key approach in self-supervised learning, where the goal is to learn useful representations by distinguishing between similar and dissimilar pairs. For semi-supervised VAD, two samples are regarded as a positive pair if they originate from the same sample, and otherwise as a negative pair [68]. The contrastive loss is shown below,

    l_con = −Σ_i log [ exp(sim(x_i, x_i⁺)/τ) / Σ_k exp(sim(x_i, x_k⁻)/τ) ]    (6)

where x_i and x_i⁺ are a positive pair, x_i and x_k⁻ are negative pairs, and sim(·,·) is a similarity function (e.g., cosine similarity). Wang et al. [69] introduced a cluster attention contrast framework for VAD, which is built on top of contrastive learning; during the inference stage, the highest similarity between the test sample and its variants is regarded as the regularity score. Lu et al. [70] further proposed a learnable locality-sensitive hashing with a contrastive learning strategy for VAD.
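Eq. (6) is the InfoNCE form of the contrastive loss; a minimal sketch with in-batch negatives follows, where the random embeddings stand in for features of two augmented views of the same clips.

```python
# Minimal sketch of the InfoNCE-style contrastive loss in Eq. (6).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings; z1[i] and z2[i] form a positive pair."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                  # cosine similarities / temperature
    targets = torch.arange(z1.size(0))        # the diagonal holds positive pairs
    return F.cross_entropy(logits, targets)   # -log softmax over each row

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```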
Denoising [71], [72] is very similar to reconstruction, with the main difference being that noise η is added to the input data, and the network is encouraged to achieve a denoising effect on the reconstructed data. The benefit is that it can enhance the robustness of the network for VAD. The optimization objective is expressed as

    l_den = L(Φ(θ, x + η), x)    (7)
Deep sparse coding is encouraged by the success of traditional sparse reconstruction based VAD methods [73]; upgraded versions leverage deep neural networks for semi-supervised VAD. Unlike the aforementioned reconstruction or prediction tasks, sparse coding typically uses extracted high-level representations rather than raw video image data as input. By learning from a large number of normal representations, a dictionary of normal patterns is constructed. The total objective is listed as

    l_spa = ‖x − Bz‖₂² + ‖z‖₁    (8)

Different normal events can be reconstructed through the dictionary B multiplied by the sparse coefficient z. For anomalies, it is hard to reconstruct them using a linear combination of elements from the normal dictionary with a sparse coefficient. To overcome the time-consuming inference and low-level hand-crafted features of traditional sparse reconstruction based methods, deep sparse coding based methods have emerged [32], [74]–[76], simultaneously leveraging the powerful representation capabilities of DNNs and sparse representation techniques to improve detection performance and efficiency.
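As an illustration of Eq. (8), the following sketch scores a feature vector by its sparse-reconstruction residual, solving the Lasso-style objective with a few ISTA iterations; the dictionary B here is a random stand-in for one learned from normal representations.

```python
# Minimal sketch of sparse-coding anomaly scoring (Eq. (8)) via ISTA.
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_anomaly_score(x, B, lam=0.1, n_iter=100):
    """Approximately solve min_z 0.5*||x - Bz||^2 + lam*||z||_1 by ISTA;
    the reconstruction residual serves as the anomaly score."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2    # 1 / Lipschitz constant of the gradient
    z = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + step * B.T @ (x - B @ z), step * lam)
    return np.linalg.norm(x - B @ z) ** 2     # high residual => likely anomaly

rng = np.random.default_rng(0)
B = rng.standard_normal((64, 256))            # stand-in "normal pattern" dictionary
print(sparse_anomaly_score(rng.standard_normal(64), B))
```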
Patch inpainting involves reconstructing missing or corrupted regions by inferring the missing parts from the available data. This technique mainly leverages the spatial and temporal context to predict the content of the missing regions, ensuring that the reconstructed regions blend seamlessly with their surroundings. The optimization objective for patch inpainting can be defined to minimize the difference between the original and the reconstructed patches,

    l_pat = L(Φ(θ, x ⊙ M), x ⊙ M̄)    (9)

where M denotes a mask in which a value of 0 indicates that the position needs to be inpainted while a value of 1 indicates that it does not, and M̄ is the reverse of M. Different from prediction and the visual cloze test, patch inpainting takes greater account of the spatial or spatio-temporal context. Zavrtanik et al. [77] regarded anomaly detection as a reconstruction-by-inpainting task, randomly removing partial image regions and reconstructing the image from the partial inpaintings. Ristea et al. [78], [79] then presented a novel self-supervised predictive architectural building block, a plug-and-play design that can easily be incorporated into various anomaly detection methods. More recently, a self-distilled masked auto-encoder [80] was proposed to inpaint original frames.
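A minimal sketch of the masked-inpainting objective of Eq. (9) is given below; the small network and the fixed square mask are illustrative assumptions.

```python
# Minimal sketch of Eq. (9): the model sees x ⊙ M and is penalized only on
# the erased region x ⊙ (1 - M).
import torch
import torch.nn as nn

inpaint_net = nn.Sequential(                 # illustrative stand-in network
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

x = torch.rand(4, 3, 64, 64)                 # synthetic frames
M = torch.ones_like(x)
M[:, :, 16:32, 16:32] = 0                    # erase a 16x16 patch (mask value 0)

recon = inpaint_net(x * M)
l_pat = ((recon - x) * (1 - M)).pow(2).mean()   # loss only on the erased patch
l_pat.backward()
```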
Multiple tasks can ease the dilemma of a single pretext task, i.e., a single task may not be well aligned with the VAD task, thus leading to sub-optimal performance. Recently, several works have attempted to train VAD models jointly on multiple pretext tasks. For example, various studies exploited different self-supervised task compositions, involving reconstruction and prediction [15], [27], [39], [81], prediction and denoising [82], [83], prediction and jigsaw puzzles [84], and prediction and contrastive learning [70]. A few works [66], [67], [85], [86] strove to develop more sophisticated multiple-task combinations from different perspectives.

2) One-class Learning: One-class learning primarily focuses on samples from the normal class. Compared to self-supervised learning methods, it does not require the strenuous effort of designing feasible pretext tasks. It is generally divided into three categories: one-class classifiers, Gaussian classifiers, and adversarial classifiers derived from the discriminators of generative adversarial networks (GANs).
One-class classifiers basically include the one-class support vector machine (OC-SVM) [87], support vector data description (SVDD) [88], and other extensions, e.g., the basic/generalized one-class discriminative subspace classifiers (BODS, GODS) [89]. Specifically, OC-SVM is modeled as an extension of the SVM objective, learning a max-margin hyperplane that separates the normal from the abnormal in a dataset by minimizing the following objective,

    min_{w,b,ξ≥0} ½‖w‖₂² − b + C Σ_i ξ_i,  s.t. wᵀx_i ≥ b − ξ_i, ∀x_i ∈ X_n    (10)

where ξ_i is the non-negative slack, w and b denote the hyperplane, and C is the slack penalty. AMDN [18] is a typical OC-SVM based VAD method, which obtains low-dimensional representations through an auto-encoder and then uses OC-SVM to classify all normal representations. Another popular variant of one-class classifiers is (Deep) SVDD [90], [91], which, instead of modeling the data as belonging to an open half-space (as in OC-SVM), assumes the normal samples inhabit a bounded set, and whose optimization seeks the centroid c of a hypersphere of minimum radius R > 0 that contains all normal samples. Mathematically, the objective reads

    min_{c,R,ξ≥0} R² + C Σ_i ξ_i,  s.t. ‖x_i − c‖₂² ≤ R² − ξ_i, ∀x_i ∈ X_n    (11)

where, as in OC-SVM, ξ_i models the slack. Based on this, Wu et al. [20] proposed an end-to-end deep one-class classifier, i.e., DeepOC, for VAD, avoiding the shortcomings of the complicated two-stage training of AMDN.
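As a concrete illustration of the two-stage, OC-SVM based recipe exemplified by AMDN [18], the following sketch fits scikit-learn's OneClassSVM on stand-in "normal" features and scores test features; the synthetic data replaces real auto-encoder representations.

```python
# Minimal sketch of the feature -> OC-SVM pipeline for one-class VAD.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, size=(500, 64))   # stand-in normal features
test_feats = np.vstack([rng.normal(0.0, 1.0, size=(10, 64)),
                        rng.normal(4.0, 1.0, size=(10, 64))])  # last 10 are "anomalies"

ocsvm = OneClassSVM(kernel="rbf", nu=0.1).fit(normal_feats)
scores = -ocsvm.decision_function(test_feats)   # higher = more anomalous
print(scores[:10].mean(), scores[10:].mean())   # anomalies should score higher
```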
Gaussian classifier based methods [21], [23], [92] assume that, in practical applications, the data typically follow a Gaussian distribution. Using the training samples, they learn the Gaussian distribution (mean µ and covariance Σ) of the normal pattern. During the testing phase, samples that deviate significantly from the mean are considered anomalies. The likelihood of a sample under the normal pattern is given by

    p(x_i) = 1 / ((2π)^{k/2} |Σ|^{1/2}) · exp(−½ (x_i − µ)ᵀ Σ⁻¹ (x_i − µ))    (12)

where a low p(x_i) indicates an anomaly.
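A minimal sketch of Gaussian-classifier scoring follows; it fits µ and Σ on stand-in normal features and uses the Mahalanobis distance, which is monotonically related to the negative log of Eq. (12).

```python
# Minimal sketch of Gaussian-classifier scoring (Eq. (12)).
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 16))                 # features of normal samples
mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False) + 1e-6 * np.eye(16)   # regularized covariance
cov_inv = np.linalg.inv(cov)

def anomaly_score(x):
    d = x - mu
    return float(d @ cov_inv @ d)                   # large distance <=> low p(x)

print(anomaly_score(rng.normal(size=16)))           # typical sample
print(anomaly_score(rng.normal(size=16) + 5.0))     # shifted sample scores much higher
```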
Adversarial classifiers use adversarial training between a generator G and a discriminator D to learn the distribution of normal samples. G is aware of the normal data distribution, as the normal samples are accessible to it; D then explicitly decides whether the output of G follows the normal distribution or not. The adversarial classifier can thus be jointly learned by optimizing the following objective,

    min_G max_D  E_{x_i∼p_t}[log D(x_i)] + E_{x̃_i∼p_t+N_σ}[log(1 − D(G(x̃_i)))]    (13)

where x_i is drawn from the normal data distribution p_t and x̃_i is the sample x_i with added noise sampled from a normal distribution N_σ. The final abnormal score of an input sample x is given as D(G(x)). For instance, Sabokrou et al. [93]–[95] developed a conventional adversarial network containing two sub-networks, where the discriminator works as the one-class classifier while the refiner supports it by enhancing normal samples and distorting anomalies. To mitigate the instability caused by adversarial training, Zaheer et al. [96], [97] proposed stabilizing adversarial classifiers by transforming the role of the discriminator into distinguishing good- from bad-quality reconstructions, as well as introducing pseudo anomaly examples.
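The following is a minimal sketch of the adversarial objective in Eq. (13); the two small MLPs and the noise scale are illustrative stand-ins for the refiner/discriminator pairs of, e.g., Sabokrou et al. [93].

```python
# Minimal sketch of Eq. (13): G denoises noisy normal samples, D separates
# real normal data from G's outputs.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
D = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

for _ in range(5):                                  # a few illustrative steps
    x = torch.randn(64, 16)                         # stand-in normal features
    x_noisy = x + 0.3 * torch.randn_like(x)         # x̃ ~ p_t + N_sigma

    d_loss = (bce(D(x), torch.ones(64, 1))
              + bce(D(G(x_noisy).detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    g_loss = bce(D(G(x_noisy)), torch.ones(64, 1))  # G tries to fool D
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# At test time, D(G(x)) serves as the anomaly score of a sample x.
```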
3) Interpretable Learning: While self-supervised learning and one-class learning based methods perform competitively on popular VAD benchmarks, they are entirely dependent on complex neural networks and are mostly trained end-to-end. This limits their interpretability and generalization capacity. Explainable VAD therefore emerges as a solution; it refers to techniques and methodologies used to identify and explain unusual events in videos. These techniques are designed not only to detect anomalies but also to provide clear explanations for why these anomalies are flagged, which is crucial for trust and transparency in real-world applications. For example, Hinami et al. [24] leveraged a multi-task detector as a generic model to learn knowledge about visual concepts, e.g., entity, action, and attribute, to describe events in a human-understandable form, and then designed an environment-specific model as the anomaly detector for abnormal event recounting and detection. Similarly, Reiss et al. [38] extracted explicit attribute-based representations, i.e., velocity and pose, along with implicit semantic representations to make interpretable anomaly decisions. Coincidentally, Doshi and Yilmaz [98] proposed a novel framework that monitors both individuals and the interactions between them, then explores scene graphs to provide an interpretation of the context of anomalies. Singh et al. [99] started a new line of explainable VAD with a more generic model based on high-level appearance and motion features that can provide human-understandable reasons; compared to previous methods, this work is independent of detectors and is capable of locating spatial anomalies. More recently, Yang et al. [100] proposed the first rule-based reasoning framework for semi-supervised VAD with large language models (LLMs), owing to LLMs' revolutionary reasoning ability. We present some classical explainable VAD methods in Fig. 5.

C. Network Architecture

1) Auto-encoder: An auto-encoder consists of two vital structures, namely an encoder and a decoder, in which the encoder compresses the [...]
[Fig. 5: pipelines of representative explainable VAD methods. Panels include a multi-task Fast R-CNN based generic model with an environment-specific anomaly detector that learns normal behavior ((a) learning procedures, (b) testing procedures), and an exemplar-based pipeline that builds per-region exemplar feature vectors (appearance, angle, magnitude, and background cues) from nominal video volumes and scores test volumes by the distance to the nearest exemplar, evaluated on Ped1, Ped2, and Avenue.]
TABLE II
Quantitative Performance Comparison of Semi-Supervised Methods on Public Datasets.
F. Performance Comparison

Figure 7 presents a concise chronology of semi-supervised VAD methods. Besides, Table II provides a performance summary of representative semi-supervised VAD methods.

IV. WEAKLY SUPERVISED VIDEO ANOMALY DETECTION

Weakly supervised VAD is currently a highly regarded research direction in the VAD field, with its origins traceable to DeepMIL [5]. Compared to semi-supervised VAD, it is a newer research direction, and existing reviews therefore lack a comprehensive and in-depth introduction to it. As shown in Table I, both Chandrakala et al. [12] and Liu et al. [13] mention the weakly supervised VAD task. However, the former only briefly describes several achievements from 2018 to 2020, while the latter, although encompassing recent works, lacks a scientific taxonomy and simply categorizes them into single-modal and multi-modal methods based on the different modalities. Given this context, we survey related works from 2018 to the present, including the latest methods based on pre-trained large models, and we classify existing works from four aspects: model input, methodology, refinement strategy, and model output. The taxonomy of weakly supervised VAD is illustrated in Figure 8.
[Fig. 8 overview: Model Input (RGB: C3D, I3D, 3D-ResNet, TSN, VideoSwin, CLIP; Optical Flow: TSN, I3D; Audio: VGGish; Text: CLIP; Hybrid) → Methodology (One-Stage MIL, Two-Stage Self-Training) → Refinement (Temporal Modeling, Spatio-Temporal Modeling, Modified MIL, Metric Learning, Knowledge Distillation, Large Models) → Model Output (Frame Level, Pixel Level).]
Fig. 8. The taxonomy of weakly supervised VAD. We provide a hierarchical taxonomy that organizes existing deep weakly supervised VAD models by model input, methodology, refinement strategy, and model output into a systematic framework.
Compared to semi-supervised VAD, weakly supervised VAD explicitly defines anomalies during the training process, giving the detection algorithm a clear direction. However, in contrast to fully supervised VAD, the coarse weak supervision signals introduce uncertainty into the detection process. Most existing methods utilize the MIL mechanism to optimize the model. This process can be viewed as selecting the hardest regions (video clips) that appear most abnormal from normal bags (normal videos) and the regions most likely to be abnormal from abnormal bags (abnormal videos). The goal is then to maximize the difference in predicted confidence between them (with the confidence for the hardest normal regions approaching 0 and the confidence for the most abnormal regions approaching 1), which can be regarded as a binary classification optimization. By gradually mining all normal and abnormal regions based on their different characteristics, the anomaly confidence of abnormal regions increases while that of normal regions decreases. Unfortunately, due to the lack of strong supervision signals, the detection model inevitably involves some blind guessing in the above optimization process.
A. Model Input

Unlike semi-supervised VAD, the network input for weakly supervised VAD is not the raw video, such as RGB, optical flow, or skeleton; instead, it consists of features extracted by pre-trained models. This approach alleviates the problems posed by the large scale of existing weakly supervised VAD datasets, the diverse and complex scenes, and the weak supervision signals. Using pre-trained features as input allows the effective utilization of off-the-shelf models' learned knowledge of appearance and motion, significantly reduces the complexity of the detection model, and enables efficient training.
1) RGB: RGB is the most common model input. The general approach divides a long video into multiple segments and uses pre-trained visual models to extract global features from each segment. As deep models continue to evolve and improve, the visual models used have also been upgraded, progressing from the initial C3D [5], [127], [128] to I3D [129]–[133], 3D-ResNet [134], [135], TSN [136]–[138], and more recently to the popular Swin Transformer [139], [140] and CLIP [141], [142]. This continuous upgrade in visual models has led to a gradual improvement in detection performance.

2) Optical Flow: Similar to RGB, the same approach is applied to optical flow input to obtain corresponding global features. However, due to the time-consuming nature of optical flow extraction, it is less commonly used in existing methods. Common pre-trained models for optical flow include I3D [143] and TSN [137].

3) Audio: For multimodal datasets (e.g., XD-Violence) containing audio signals, audio also holds significant perceptual information. Unlike RGB images, audio is one-dimensional and is typically processed as follows: audio tracks are resampled, spectrograms are computed and converted to log mel spectrograms, and these features are framed into non-overlapping examples. Finally, these examples are fed into a pre-trained audio model, such as VGGish [144], to extract features [145], [146]. A minimal sketch of this front end is shown below.
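The following sketch uses librosa; the 16 kHz sample rate, 64 mel bands, and ~0.96 s framing follow the common VGGish-style setup, and the audio file path is a placeholder.

```python
# Minimal sketch of the VGGish-style audio front end: resample, log-mel
# spectrogram, then framing into non-overlapping examples.
import numpy as np
import librosa

y, sr = librosa.load("video_audio.wav", sr=16000)   # placeholder path; resample to 16 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel).T                # (time, 64)

frames_per_example = 96                             # ~0.96 s at a 10 ms hop
n = log_mel.shape[0] // frames_per_example
examples = log_mel[: n * frames_per_example].reshape(n, frames_per_example, 64)
# `examples` are then fed to a pre-trained audio model (e.g., VGGish) for features.
```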
4) Text: More recently, several researchers [147]–[150] have attempted to incorporate text descriptions related to videos to aid VAD. These texts may be manually annotated or generated by large models. The text data is typically converted into features using a text encoder and then fed into the subsequent detection network.

5) Hybrid: Common hybrid inputs include RGB combined with optical flow [143], RGB combined with audio [151]–[153], RGB combined with optical flow and audio [154], and, more recently, RGB combined with text [155].

B. Methodology

Under weak supervision, traditional fully supervised methods are no longer adequate. To address this issue, we identify two different approaches: one-stage MIL and two-stage self-training.

1) One-stage MIL: The one-stage MIL [5], [156]–[158] is the most commonly used approach for weakly supervised VAD. The basic idea is to first divide long videos into multiple segments, and then use a MIL mechanism to select the most representative samples from these segments. This includes selecting hard examples from normal videos that look most like anomalies, and the most likely abnormal examples from the abnormal videos. The model is then optimized by lowering the anomaly confidence of the hard normal examples and increasing the confidence of the most likely abnormal examples. Ultimately, the model's confidence in predicting normal samples gradually decreases, while its confidence in predicting abnormal samples gradually increases, thereby achieving anomaly detection. The advantage of this method lies in its simplicity and ease of implementation. The MIL objective is

    l_mil = max(0, 1 − max_i Φ(θ, x_i^a) + max_i Φ(θ, x_i^n))    (14)

where x^a and x^n denote an abnormal video and a normal video, respectively.

Additionally, TopK [130] extends MIL by selecting the top K segments with the highest prediction scores from each video, rather than just the highest-scoring segment, for training; MIL can therefore be seen as a special case of TopK.
For these TopK segments, their average prediction score is computed as the predicted probability ŷ,

    ŷ = σ( (1/K) Σ_{i∈topK} Φ(θ, x_i) )    (15)

where σ is the sigmoid activation function. The cross-entropy loss between ŷ and the video-level label y is then used to optimize the model,

    l_topk = −(y log ŷ + (1 − y) log(1 − ŷ))    (16)

A drawback of the one-stage MIL mechanism is that it leads to models that tend to focus only on the most significant anomalies while ignoring less obvious ones.
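To make Eqs. (14)–(16) concrete, here is a minimal sketch of both the MIL ranking loss and the TopK variant on pre-extracted segment features; the two-layer scorer and feature dimensions are illustrative assumptions.

```python
# Minimal sketch of the MIL ranking loss (Eq. (14)) and the TopK variant
# (Eqs. (15)-(16)) on pre-extracted segment features (random stand-ins).
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 1))

abn = torch.randn(32, 1024)          # 32 segments of one abnormal video
nor = torch.randn(32, 1024)          # 32 segments of one normal video
s_abn, s_nor = scorer(abn).squeeze(-1), scorer(nor).squeeze(-1)

# Eq. (14): hinge-based MIL ranking loss on the highest-scoring segments.
l_mil = torch.clamp(1.0 - s_abn.max() + s_nor.max(), min=0.0)

# Eqs. (15)-(16): average the top-K scores, then binary cross-entropy.
def topk_loss(scores, label, k=8):
    y_hat = torch.sigmoid(scores.topk(k).values.mean())
    return nn.functional.binary_cross_entropy(y_hat, torch.tensor(float(label)))

loss = l_mil + topk_loss(s_abn, 1) + topk_loss(s_nor, 0)
loss.backward()
```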
2) Two-stage Self-training: In contrast, the improved two-stage self-training is more complex but also more effective. This method employs a two-stage training process. First, a preliminary model is pre-trained using the one-stage MIL; during this phase, the model learns the basic principles of VAD. Then, using the pre-trained model as an initialization, a self-training mechanism is introduced to adaptively train the model further, enhancing its ability to recognize anomalies. Specifically, during the self-training phase, the predictions of the pre-trained model are used to automatically select high-confidence abnormal regions. These regions are then treated as pseudo-labeled data to retrain the model, thereby improving its ability to identify anomalies. This two-stage training approach effectively enhances performance in weakly supervised VAD, further improving the model's generalization ability and robustness. NoiseCleaner [137], MIST [159], MSL [140], CUPL [160], and TPWNG [161] are typical two-stage self-training works.

The two-stage self-training method based on improved MIL excels in weakly supervised VAD, but it also comes with some drawbacks. High computational complexity: the two-stage training process requires more computational resources and time; both the pre-training and self-training phases involve multiple iterations of training, leading to high computational costs. Dependence on initial model quality: the self-training stage relies on the initial model generated during pre-training; if the quality of the initial model is poor, erroneous predictions may be treated as pseudo-labels, affecting subsequent training effectiveness.
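A condensed sketch of this two-stage recipe is given below; the scorer, the synthetic features, and the 0.9 confidence threshold are illustrative, and real systems such as MIST [159] use more elaborate pseudo-label generation.

```python
# Minimal sketch of two-stage self-training: stage 1 pre-trains with MIL,
# stage 2 turns confident predictions into segment-level pseudo-labels.
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)
feats = torch.randn(100, 32, 1024)        # 100 videos x 32 segments (stand-ins)
video_labels = torch.randint(0, 2, (100,))

# Stage 1: MIL-style pre-training on video-level labels.
for v in range(100):
    y_hat = scorer(feats[v]).max()        # score of the most anomalous segment
    loss = nn.functional.binary_cross_entropy(y_hat, video_labels[v].float())
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: mine high-confidence segments as pseudo-labels and retrain.
with torch.no_grad():
    seg_scores = scorer(feats).squeeze(-1)                    # (100, 32)
pseudo = (seg_scores > 0.9) & (video_labels[:, None] == 1)    # confident abnormal segments
mask = pseudo | (video_labels[:, None] == 0)  # all segments of normal videos are normal
targets = pseudo.float()

loss = nn.functional.binary_cross_entropy(scorer(feats).squeeze(-1)[mask], targets[mask])
opt.zero_grad(); loss.backward(); opt.step()
```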
C. Refinement Strategy

Refinement strategies primarily focus on input features, method design, and other aspects to compensate for the shortcomings of weak supervision signals. We compile several commonly used refinement strategies and provide a detailed introduction in this section.

1) Temporal Modeling: Temporal modeling is essential for capturing the critical context information in videos. Unlike actions, anomalous events are complex combinations of scenes, entities, actions, and other elements, which require rich contextual information for accurate reasoning. Existing temporal modeling methods can be broadly categorized into local relationship modeling and global relationship modeling. Local modeling is typically used for online detection [162], whereas global modeling is mainly used for offline detection [163]. Techniques such as temporal convolutional networks [130], [164], dilated convolution [163], GCNs [165], [166], conditional random fields [167], and transformers [140], [168]–[170] are frequently employed to capture these temporal relationships effectively.

2) Spatio-temporal Modeling: Further, spatio-temporal modeling can simultaneously capture spatial relationships, highlighting anomalous spatial locations and effectively reducing noise from irrelevant backgrounds. This can be achieved by segmenting video frames into multiple patches or using existing object detectors to capture foreground objects; methods like self-attention [34], [135], [171], [172] are then used to learn the relationships between these patches or objects. Compared to temporal modeling, spatio-temporal modeling involves a higher computational load due to the increased number of entities being analyzed.

3) MIL-based Refinement: The traditional MIL mechanism focuses only on the segments with the highest anomaly scores, which leads to a series of issues, such as ignoring event continuity, a fixed K-value that does not adapt to different video scenarios, and bias towards abnormal snippets with simple contexts. Several advanced strategies [173], [174] aim to address these limitations. By incorporating unbiased MIL [175], prior information from text [149], [176], magnitude-level MIL [177], continuity-aware refinement [178], and adaptive K-values [179], the detection performance can be significantly improved.

4) Feature Metric Learning: While MIL-based classification ensures the interclass separability of features, this separability at the video level alone is insufficient for accurate anomaly detection. In contrast, enhancing the discriminative power of features, by clustering similar features and isolating different ones, should complement and even augment the separability achieved by MIL-based classification. Specifically, the basic principle of feature metric learning is to make similar features compact and different features distant in the feature space to improve discrimination. Several works [132], [147], [149], [162], [168], [180], [181] exploited feature metric learning to enhance feature discrimination.

5) Knowledge Distillation: Knowledge distillation aims to transfer knowledge from the enriched branch to the barren branch to alleviate the semantic gap; it is mainly applied in modality-missing [182] or modality-enhancing [153] scenarios.

6) Leveraging Large Models: Large models have begun to show tremendous potential and flexibility in the field of VAD. They not only enhance detection capabilities through vision-language features, e.g., CLIP-TSA [185], and cross-modal semantic alignment, e.g., VadCLIP [183], but also leverage large language models to generate explanatory texts that improve detection accuracy, e.g., TEVAD [148], UCA [155], and VAD-Instruct50k [186]. Furthermore, they can directly use the prior knowledge of large language models for training-free VAD [187], [188], demonstrating advantages in rapid deployment and cost reduction. In addition, the superior zero-shot capabilities of these large models may be leveraged for anomaly detection in various other ways, such as AD- [...]
TABLE III
Quantitative Performance Comparison of Weakly Supervised Methods on Public Datasets. [Only column fragments survive extraction, e.g., CRFAD (Purwanto et al.) and DMU (Zhou et al.).]

TABLE IV
Quantitative Performance Comparison of Fully Supervised Methods on Public Datasets.

TABLE V
Quantitative Performance Comparison of Unsupervised Methods on Public Datasets.
[...] detected from a visual perspective. Many methods [194]–[197] used RGB features extracted from raw images by pre-trained models as the model input.

Motion input mainly includes optical flow, optical flow acceleration, and frame differences. These inputs directly showcase the motion state of objects, helping to identify anomalies from the motion perspective that might be difficult to detect visually. Dong et al. [194] and Peixoto et al. [198] used optical flow and optical flow acceleration as input, while Sudhakaran et al. [199] and Hanson et al. [200] employed frame differences as model input.

Skeleton input can intuitively display the pose state of humans, allowing the model to exclude background interference and focus on human actions. This enables more intuitive recognition of violent behavior. Su et al. [201] and Singh et al. [202] conducted violence detection by studying the interaction relationships between skeletal points.

Audio input can provide additional information to aid in identifying violent events [198]. Certain violent incidents inevitably involve changes in sound, and such variations help detect violent events, especially when RGB images are ineffective due to issues like occlusion.

Hybrid input combines the strengths of different modalities to better detect violent events. Cheng et al. [203] utilized RGB images and optical flow as input, while Shang et al. [204] combined RGB images with audio. Garcia et al. [205] fed skeletons and frame differences into detection models.

B. Performance Comparison

We present the performance comparison of existing fully supervised VAD research in Table IV.

VI. UNSUPERVISED VIDEO ANOMALY DETECTION

Despite the great popularity of supervised VAD, supervised methods still have shortcomings in practical applications. On the one hand, we often cannot clearly define what constitutes normal behavior in real-life human activities; e.g., running on a sports ground is normal, but running in a library is forbidden. On the other hand, it is impractical to know every possible normal event in advance, especially for scientific research. Therefore, VAD in unsupervised environments is of significant research value.

A. Approach Categorization

Through an in-depth investigation, we roughly classify current unsupervised VAD methods into three categories: pseudo label based, change detection based, and others.

Pseudo label based paradigms are described as follows. Wang et al. [210] proposed a two-stage training approach where an auto-encoder is first trained with an adaptive reconstruction loss threshold to estimate normal events from unlabeled videos; these estimated normal events are then used as pseudo-labels to train an OC-SVM, refining the normality model to exclude anomalies and improve detection performance. Pang et al. [211] introduced a self-training deep ordinal regression method, starting with initial detection using classical one-class algorithms to generate pseudo-labels for anomalous and normal frames; an end-to-end anomaly score learner is then trained iteratively using a self-training strategy that optimizes the detector with newly generated pseudo-labels. Zaheer et al. [215] proposed an unsupervised generative cooperative learning approach, leveraging the low-frequency nature of anomalies for cross-supervision between a generator and a discriminator, each learning from the pseudo-labels of the other. Al-lahham et al. [216] presented a coarse-to-fine pseudo-label generation framework using hierarchical divisive clustering for coarse pseudo-labels at the video level and statistical hypothesis testing for fine pseudo-labels at the segment level [...]
Fig. 10. Flowchart of six classical open-set supervised VAD methods. Top: open-set methods; Bottom: few-shot methods.
[...] extraction and unsupervised learning, ensuring robustness and generalizability, while the latter relies on few-shot learning to adapt models to new domains with minimal labeled data.

VIII. FUTURE OPPORTUNITIES

A. Creating Comprehensive Benchmarks

The current VAD benchmarks have various limitations in terms of data size, modality, and capturing views. Thus, an important future direction is to extend benchmarks along these dimensions to provide more realistic VAD test platforms.

1) Large-scale: Currently, in VAD—especially in semi-supervised VAD—the data scale is too small. For example, the UCSD Ped dataset [228] lasts only a few minutes, and even the larger ShanghaiTech dataset [14] is only a few hours long. Compared to datasets in video action recognition [229], which can last hundreds or thousands of hours, VAD datasets are extremely small. This is far from sufficient for training VAD models, as training on small-scale datasets is highly prone to over-fitting in large models. While this might lead to good detection results on the small-scale test data, it can severely impact the performance of VAD models intended for real-world deployment. Therefore, expanding the data scale is a key focus of future research.

2) Multi-modal: Currently, there is limited research on multimodal VAD. Just as humans perceive the world through multiple senses [230], such as vision, hearing, and smell, effectively utilizing various modalities in the face of multi-source heterogeneous data can enhance the performance of anomaly detection. For example, audio information can help detect anomalies such as screams and panic, while infrared information can identify abnormal situations in dark environments.

3) Egocentric, Multi-view, 3D, etc.: Egocentric VAD involves using data captured from wearable devices or body-mounted cameras to simulate how individuals perceive their environment and identify abnormal events, such as detecting falls or aggressive behavior in real time. Creating multi-view benchmarks that leverage data from multiple viewpoints allows for comprehensive environment analysis, enabling the detection of anomalies that may not be visible from a single perspective. 3D perspectives from depth information or point clouds can offer more detailed spatial information, enabling models to better understand the structure and context of the environment, which also brings multi-modal signals.

B. Towards Open-world Task

Current research focuses on closed-set VAD, which is restricted to detecting only those anomalies that are defined and annotated during training. In applications like urban surveillance, the inability to adapt to unforeseen anomalies limits the practicality and effectiveness of closed-set VAD models. Therefore, moving towards the open-world VAD task, handling the uncertainty and variability of real-world situations, is a feasible future trend. To accomplish this, several key approaches and their combinations can be taken into account. Self-supervised learning: leveraging unlabeled data to learn discriminative representations that can distinguish between normal and abnormal events [231]. Open-vocabulary learning: developing models that can adapt to new anomalies with large models [142], pseudo anomaly synthesis, or minimal labeled examples. Incremental learning: continuously updating models with new data and anomaly types without forgetting previously learned information [232].

C. Embracing Pre-trained Large Models

Pre-trained large models have shown remarkable success in various computer vision tasks, and they can be leveraged in VAD to enhance the understanding and detection of anomalies by integrating semantic context and improving feature representations. Here are several feasible directions. Feature extraction: pre-trained weights of large models, which have been trained on large-scale datasets, provide a strong foundation for feature extraction and reduce the need for extensive training from scratch [185]. Semantic understanding: language-vision models can be utilized to understand and incorporate contextual information from video
C. Embracing Pre-trained Large Models

Pre-trained large models have shown remarkable success in various computer vision tasks, and they can be leveraged in VAD to enhance the understanding and detection of anomalies by integrating semantic context and improving feature representations. Here are several feasible directions. Feature extraction: pre-trained weights of large models, which have been trained on large-scale datasets, provide a strong foundation for feature extraction and reduce the need for extensive training from scratch [185]. Semantic understanding: language-vision models can be utilized to understand and incorporate contextual information from video scenes; for instance, text descriptions associated with video frames can provide additional context that helps in identifying anomalies. Similarly, the language capabilities of these models can be used to generate or understand textual descriptions of anomalies, aiding in both the detection and interpretation of anomalies [186]. Zero-shot learning: the zero-shot capabilities of language-vision models can be exploited to detect anomalies without requiring explicit examples during training, which is particularly useful in open-set VAD scenarios where new types of anomalies can occur [190].
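As an illustration of the zero-shot direction, the sketch below scores a single frame with a CLIP-style model [141] by comparing it against one normal and one anomalous text prompt, loosely in the spirit of [188], [189], [190]. The prompt wording and the openai/clip-vit-base-patch32 checkpoint are illustrative assumptions, and a practical system would aggregate such scores over time instead of judging isolated frames.

```python
# Zero-shot frame scoring with a CLIP-style model: the softmax weight on
# the anomalous prompt serves as the anomaly score. Prompts and checkpoint
# name are placeholders, not a prescribed recipe.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
prompts = ["a normal street scene", "an abnormal event such as a fight or accident"]

@torch.no_grad()
def anomaly_score(frame: Image.Image) -> float:
    inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # (1, 2) image-text similarities
    return logits.softmax(dim=-1)[0, 1].item()  # weight on the anomalous prompt
```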
D. Exploiting Interpretable VAD

Interpretable VAD focuses on creating models that not only detect anomalies but also provide understandable explanations for their predictions. This is crucial for gaining trust in the system, especially in high-stakes applications like surveillance, healthcare, and autonomous vehicles. Here are several feasible directions, corresponding to three different layers of a VAD system. Input: instead of directly feeding raw video data into the model, leverage existing technologies to extract key information, such as foreground objects, position coordinates, motion trajectories, and crowd relationships. Algorithm: combining algorithms from different domains can be helpful for enhanced reasoning, including knowledge graphs, i.e., utilizing knowledge graphs to incorporate contextual information and relationships between entities; intent prediction, i.e., using intent prediction algorithms to anticipate future actions and detect deviations from expected behaviors [125]; and LLM reasoning, i.e., generating textual descriptions of detected anomalies with large language models, which can explain what the model perceives as abnormal and why [186]. Output: various aspects such as the spatio-temporal changes and patterns in the video may be synthesized to explain anomalies [184].
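For the LLM-reasoning direction, one plausible interface is to serialize the detector's structured outputs into a query for a language model, which then produces the human-readable explanation. The sketch below shows such a prompt builder; all field names and the wording are hypothetical placeholders, not the interface of [186] or [187].

```python
# Illustrative prompt construction for LLM-based anomaly explanation: the
# detector's outputs (score, time span, detected objects, a frame caption)
# are serialized into a query so an LLM can justify the decision.
def build_explanation_prompt(score: float, start_s: float, end_s: float,
                             objects: list, caption: str) -> str:
    return (
        "A video anomaly detector flagged the segment "
        f"{start_s:.1f}s to {end_s:.1f}s with anomaly score {score:.2f}.\n"
        f"Detected foreground objects: {', '.join(objects)}.\n"
        f"Frame caption: {caption}\n"
        "Explain in one or two sentences what is abnormal here and why, "
        "or state that the segment appears normal."
    )

print(build_explanation_prompt(0.91, 12.0, 18.5, ["person", "motorbike"],
                               "a motorbike rides across a crowded pavement"))
```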
IX. CONCLUSION

We have presented a comprehensive survey of video anomaly detection approaches in the deep learning era. Unlike previous reviews, which mainly focus on semi-supervised video anomaly detection, we provide a taxonomy that systematically divides existing works into five categories by their supervision signals, i.e., semi-supervised, weakly supervised, unsupervised, fully supervised, and open-set supervised video anomaly detection. For each category, we further refine the classification based on model differences, e.g., model input and output, methodology, refinement strategy, and architecture, and we demonstrate the performance comparison of various methods. Finally, we discuss several promising future research directions for deep learning based video anomaly detection.
REFERENCES

[1] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep learning for anomaly detection: a review," ACM Computing Surveys, vol. 54, no. 2, pp. 1–38, 2021.
[2] Y. Yao, X. Wang, M. Xu, Z. Pu, Y. Wang, E. Atkins, and D. J. Crandall, "Dota: unsupervised detection of traffic anomaly in driving videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 444–459, 2022.
[3] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[4] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection–a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[5] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[6] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 fps in matlab," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[7] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, "Feature prediction diffusion model for video anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 5527–5537.
[8] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, "A survey of single-scene video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2293–2312, 2020.
[9] K. K. Santhosh, D. P. Dogra, and P. P. Roy, "Anomaly detection in road traffic using visual surveillance: a survey," ACM Computing Surveys, vol. 53, no. 6, pp. 1–26, 2020.
[10] R. Nayak, U. C. Pati, and S. K. Das, "A comprehensive review on deep learning-based methods for video anomaly detection," Image and Vision Computing, vol. 106, p. 104078, 2021.
[11] T. M. Tran, T. N. Vu, N. D. Vo, T. V. Nguyen, and K. Nguyen, "Anomaly analysis in images and videos: a comprehensive review," ACM Computing Surveys, vol. 55, no. 7, pp. 1–37, 2022.
[12] S. Chandrakala, K. Deepak, and G. Revathy, "Anomaly detection in surveillance videos: a thematic taxonomy of deep models, review and performance analysis," Artificial Intelligence Review, vol. 56, no. 4, pp. 3319–3368, 2023.
[13] Y. Liu, D. Yang, Y. Wang, J. Liu, J. Liu, A. Boukerche, P. Sun, and L. Song, "Generalized video anomaly event detection: systematic taxonomy and comparison of deep models," ACM Computing Surveys, vol. 56, no. 7, 2023.
[14] W. Luo, W. Liu, and S. Gao, "Remembering history with convolutional lstm for anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2017, pp. 439–444.
[15] Y. Zhao, B. Deng, C. Shen, Y. Liu, H. Lu, and X.-S. Hua, "Spatio-temporal autoencoder for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2017, pp. 1933–1941.
[16] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, and N. Sebe, "Abnormal event detection in videos using generative adversarial nets," in Proceedings of the IEEE International Conference on Image Processing, 2017, pp. 1577–1581.
[17] T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1273–1283.
[18] D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe, "Learning deep representations of appearance and motion for anomalous event detection," in Proceedings of the British Machine Vision Conference, 2015.
[19] D. Xu, Y. Yan, E. Ricci, and N. Sebe, "Detecting anomalous events in videos by learning deep representations of appearance and motion," Computer Vision and Image Understanding, vol. 156, pp. 117–127, 2017.
[20] P. Wu, J. Liu, and F. Shen, "A deep one-class neural network for anomalous event detection in complex scenes," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2609–2622, 2019.
[21] M. Sabokrou, M. Fayyaz, M. Fathy, and R. Klette, "Deep-cascade: cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes," IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1992–2004, 2017.
[22] T. Wang, M. Qiao, Z. Lin, C. Li, H. Snoussi, Z. Liu, and C. Choi, "Generative neural networks for anomaly detection in crowded scenes," IEEE Transactions on Information Forensics and Security, vol. 14, no. 5, pp. 1390–1399, 2018.
[23] Y. Fan, G. Wen, D. Li, S. Qiu, M. D. Levine, and F. Xiao, "Video anomaly detection and localization via gaussian mixture fully convolutional variational autoencoder," Computer Vision and Image Understanding, vol. 195, p. 102920, 2020.
[24] R. Hinami, T. Mei, and S. Satoh, "Joint detection and recounting of abnormal events by learning deep generic knowledge," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3619–3627.
[25] R. T. Ionescu, F. S. Khan, M.-I. Georgescu, and L. Shao, "Object-centric auto-encoders and dummy anomalies for abnormal event detection in video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7842–7851.
[26] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, "A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 13588–13597.
[27] Q. Bao, F. Liu, Y. Liu, L. Jiao, X. Liu, and L. Li, "Hierarchical scene normality-binding modeling for anomaly detection in surveillance videos," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6103–6112.
[28] C. Chen, Y. Xie, S. Lin, A. Yao, G. Jiang, W. Zhang, Y. Qu, R. Qiao, B. Ren, and L. Ma, "Comprehensive regularization in a bi-directional predictive network for video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 1, 2022, pp. 230–238.
[29] C. Sun, Y. Jia, and Y. Wu, "Evidential reasoning for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 2106–2114.
[30] S. Sun and X. Gong, "Hierarchical semantic contrast for scene-aware video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 22846–22856.
[31] W. Luo, W. Liu, D. Lian, and S. Gao, "Future frame prediction network for video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.
[32] P. Wu, J. Liu, M. Li, Y. Sun, and F. Shen, "Fast sparse coding networks for anomaly detection in videos," Pattern Recognition, vol. 107, p. 107515, 2020.
[33] R. Cai, H. Zhang, W. Liu, S. Gao, and Z. Hao, "Appearance-motion memory consistency network for video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 938–946.
[34] Y. Liu, J. Liu, X. Zhu, D. Wei, X. Huang, and L. Song, "Learning task-specific representation for video anomaly detection with spatial-temporal attention," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 2190–2194.
[35] X. Huang, C. Zhao, and Z. Wu, "A video anomaly detection framework based on appearance-motion semantics representation consistency," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
[36] N. Li, F. Chang, and C. Liu, "Spatial-temporal cascade autoencoder for video anomaly detection in crowded scenes," IEEE Transactions on Multimedia, vol. 23, pp. 203–215, 2020.
[37] B. Ramachandra, M. Jones, and R. Vatsavai, "Learning a distance function with a siamese network to localize anomalies in videos," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2598–2607.
[38] T. Reiss and Y. Hoshen, "Attribute-based representations for accurate and interpretable video anomaly detection," arXiv preprint arXiv:2212.00789, 2022.
[39] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11996–12004.
[40] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, "Graph embedded pose clustering for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10539–10547.
[41] R. Rodrigues, N. Bhargava, R. Velmurugan, and S. Chaudhuri, "Multi-timescale trajectory prediction for abnormal human activity detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2626–2634.
[42] W. Luo, W. Liu, and S. Gao, "Normal graph: spatial temporal graph convolutional networks based prediction network for skeleton based video anomaly detection," Neurocomputing, vol. 444, pp. 332–337, 2021.
[43] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, "A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 200–212, 2021.
[44] Y. Yang, Z. Fu, and S. M. Naqvi, "A two-stream information fusion approach to abnormal event detection in video," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 5787–5791.
[45] N. Li, F. Chang, and C. Liu, "Human-related anomalous event detection via spatial-temporal graph convolutional autoencoder with embedded long short-term memory network," Neurocomputing, vol. 490, pp. 482–494, 2022.
[46] C. Huang, Y. Liu, Z. Zhang, C. Liu, J. Wen, Y. Xu, and Y. Wang, "Hierarchical graph embedded pose regularity learning via spatio-temporal transformer for abnormal behavior detection," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 307–315.
[47] O. Hirschorn and S. Avidan, "Normalizing flows for human pose anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 13545–13554.
[48] S. Yu, Z. Zhao, H. Fang, A. Deng, H. Su, D. Wang, W. Gan, C. Lu, and W. Wu, "Regularity learning via explicit distribution modeling for skeletal video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.
[49] A. Flaborea, L. Collorone, G. M. D. Di Melendugno, S. D'Arrigo, B. Prenkaj, and F. Galasso, "Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 10318–10329.
[50] A. Stergiou, B. De Weerdt, and N. Deligiannis, "Holistic representation learning for multitask trajectory anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 6729–6739.
[51] R. Pi, P. Wu, X. He, and Y. Peng, "Eogt: video anomaly detection with enhanced object information and global temporal dependency," ACM Transactions on Multimedia Computing, Communications and Applications, 2024.
[52] Y. Chang, Z. Tu, W. Xie, and J. Yuan, "Clustering driven deep autoencoder for video anomaly detection," in Proceedings of the European Conference on Computer Vision, 2020, pp. 329–345.
[53] Z. Fang, J. Liang, J. T. Zhou, Y. Xiao, and F. Yang, "Anomaly detection with bidirectional consistency in videos," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 3, pp. 1079–1092, 2020.
[54] C. Huang, Z. Yang, J. Wen, Y. Xu, Q. Jiang, J. Yang, and Y. Wang, "Self-supervision-augmented deep autoencoder for unsupervised visual anomaly detection," IEEE Transactions on Cybernetics, vol. 52, no. 12, pp. 13834–13847, 2021.
[55] J. T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, and Y. Xiao, "Attention-driven loss for anomaly detection in video surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4639–4647, 2019.
[56] Y. Zhang, X. Nie, R. He, M. Chen, and Y. Yin, "Normality learning in multispace for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3694–3706, 2020.
[57] X. Wang, Z. Che, B. Jiang, N. Xiao, K. Yang, J. Tang, J. Ye, J. Wang, and Q. Qi, "Robust unsupervised video anomaly detection by multipath frame prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2301–2312, 2021.
[58] J. Yu, Y. Lee, K. C. Yow, M. Jeon, and W. Pedrycz, "Abnormal event detection and localization via adversarial event prediction," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 8, pp. 3572–3586, 2021.
[59] W. Zhou, Y. Li, and C. Zhao, "Object-guided and motion-refined attention network for video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[60] K. Cheng, X. Zeng, Y. Liu, M. Zhao, C. Pang, and X. Hu, "Spatial-temporal graph convolutional network boosted flow-frame prediction for video anomaly detection," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2023, pp. 1–5.
[61] Y. Liu, J. Liu, K. Yang, B. Ju, S. Liu, Y. Wang, D. Yang, P. Sun, and L. Song, "Amp-net: appearance-motion prototype network assisted automatic video anomaly detection system," IEEE Transactions on Industrial Informatics, vol. 20, no. 2, pp. 2843–2855, 2023.
[62] G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft, "Cloze test helps: effective video anomaly detection via learning to complete video events," in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 583–591.
[63] Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, "Video event restoration based on keyframes for video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 14592–14601.
[64] G. Yu, S. Wang, Z. Cai, X. Liu, E. Zhu, and J. Yin, "Video anomaly detection via visual cloze tests," IEEE Transactions on Information Forensics and Security, vol. 18, pp. 4955–4969, 2023.
[65] G. Wang, Y. Wang, J. Qin, D. Zhang, X. Bao, and D. Huang, "Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles," in Proceedings of the European Conference on Computer Vision, 2022, pp. 494–511.
[66] C. Shi, C. Sun, Y. Wu, and Y. Jia, "Video anomaly detection via sequentially learning multiple pretext tasks," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 10330–10340.
[67] A. Barbalau, R. T. Ionescu, M.-I. Georgescu, J. Dueholm, B. Ramachandra, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Ssmtl++: revisiting self-supervised multi-task learning for video anomaly detection," Computer Vision and Image Understanding, vol. 229, p. 103656, 2023.
[68] C. Huang, Z. Wu, J. Wen, Y. Xu, Q. Jiang, and Y. Wang, "Abnormal event detection using deep contrastive learning for intelligent video surveillance system," IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5171–5179, 2021.
[69] Z. Wang, Y. Zou, and Z. Zhang, "Cluster attention contrast for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 2463–2471.
[70] Y. Lu, C. Cao, Y. Zhang, and Y. Zhang, "Learnable locality-sensitive hashing for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 2, pp. 963–976, 2022.
[71] C. Sun, Y. Jia, H. Song, and Y. Wu, "Adversarial 3d convolutional auto-encoder for abnormal event detection in videos," IEEE Transactions on Multimedia, vol. 23, pp. 3292–3305, 2020.
[72] D. Chen, L. Yue, X. Chang, M. Xu, and T. Jia, "Nm-gan: noise-modulated generative adversarial network for video anomaly detection," Pattern Recognition, vol. 116, p. 107969, 2021.
[73] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.
[74] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked rnn framework," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 341–349.
[75] J. T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, and R. S. M. Goh, "Anomalynet: an anomaly detection network for video surveillance," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2537–2550, 2019.
[76] W. Luo, W. Liu, D. Lian, J. Tang, L. Duan, X. Peng, and S. Gao, "Video anomaly detection with sparse coding inspired deep neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 1070–1084, 2019.
[77] V. Zavrtanik, M. Kristan, and D. Skočaj, "Reconstruction by inpainting for visual anomaly detection," Pattern Recognition, vol. 112, p. 107706, 2021.
[78] N.-C. Ristea, N. Madan, R. T. Ionescu, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Self-supervised predictive convolutional attentive block for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 13576–13586.
[79] N. Madan, N.-C. Ristea, R. T. Ionescu, K. Nasrollahi, F. S. Khan, T. B. Moeslund, and M. Shah, "Self-supervised masked convolutional transformer block for anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 525–542, 2023.
[80] N.-C. Ristea, F.-A. Croitoru, R. T. Ionescu, M. Popescu, F. S. Khan, M. Shah et al., "Self-distilled masked auto-encoders are efficient video anomaly detectors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 15984–15995.
[81] M. Ye, X. Peng, W. Gan, W. Wu, and Y. Qiao, "Anopcn: video anomaly detection via deep predictive coding network," in Proceedings of the ACM International Conference on Multimedia, 2019, pp. 1805–1813.
[82] Y. Liu, J. Liu, J. Lin, M. Zhao, and L. Song, "Appearance-motion united auto-encoder framework for video anomaly detection," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 5, pp. 2498–2502, 2022.
[83] Y. Liu, J. Liu, M. Zhao, D. Yang, X. Zhu, and L. Song, "Learning appearance-motion normality for video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[84] C. Huang, J. Wen, Y. Xu, Q. Jiang, J. Yang, Y. Wang, and D. Zhang, "Self-supervised attentive generative adversarial networks for video anomaly detection," IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 11, pp. 9389–9403, 2022.
[85] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, "Anomaly detection in video via self-supervised and multi-task learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 12742–12752.
[86] M. Zhang, J. Wang, Q. Qi, H. Sun, Z. Zhuang, P. Ren, R. Ma, and J. Liao, "Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17385–17394.
[87] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[88] D. M. Tax and R. P. Duin, "Support vector data description," Machine Learning, vol. 54, pp. 45–66, 2004.
[89] J. Wang and A. Cherian, "Gods: generalized one-class discriminative subspaces for anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8201–8211.
[90] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, "Deep one-class classification," in Proceedings of the International Conference on Machine Learning, 2018, pp. 4393–4402.
[91] P. Liznerski, L. Ruff, R. A. Vandermeulen, B. J. Franks, M. Kloft, and K.-R. Müller, "Explainable deep one-class classification," in Proceedings of the International Conference on Learning Representations, 2021.
[92] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, and R. Klette, "Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes," Computer Vision and Image Understanding, vol. 172, pp. 88–97, 2018.
[93] M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli, "Adversarially learned one-class classifier for novelty detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3379–3388.
[94] M. Sabokrou, M. Pourreza, M. Fayyaz, R. Entezari, M. Fathy, J. Gall, and E. Adeli, "Avid: adversarial visual irregularity detection," in Proceedings of the Asian Conference on Computer Vision, 2018, pp. 488–505.
[95] M. Sabokrou, M. Fathy, G. Zhao, and E. Adeli, "Deep end-to-end one-class classifier," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 675–684, 2020.
[96] M. Z. Zaheer, J.-h. Lee, M. Astrid, and S.-I. Lee, "Old is gold: redefining the adversarially learned one-class classifier training paradigm," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 14183–14193.
[97] M. Z. Zaheer, J.-H. Lee, A. Mahmood, M. Astrid, and S.-I. Lee, "Stabilizing adversarially learned one-class novelty detection using pseudo anomalies," IEEE Transactions on Image Processing, vol. 31, pp. 5963–5975, 2022.
[98] K. Doshi and Y. Yilmaz, "Towards interpretable video anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 2655–2664.
[99] A. Singh, M. J. Jones, and E. G. Learned-Miller, "Eval: explainable video anomaly localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 18717–18726.
[100] Y. Yang, K. Lee, B. Dariush, Y. Cao, and S.-Y. Lo, "Follow the rules: reasoning for video anomaly detection with large language models," in Proceedings of the European Conference on Computer Vision, 2024.
[101] J. R. Medel and A. Savakis, "Anomaly detection in video using predictive convolutional long short-term memory networks," arXiv preprint arXiv:1612.00390, 2016.
[102] M. Ravanbakhsh, E. Sangineto, M. Nabi, and N. Sebe, "Training adversarial discriminators for cross-channel abnormal event detection in crowds," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2019, pp. 1896–1904.
[103] H. Vu, T. D. Nguyen, T. Le, W. Luo, and D. Phung, "Robust anomaly detection in videos using multilevel representations," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 5216–5223.
[104] H. Song, C. Sun, X. Wu, M. Chen, and Y. Jia, "Learning normal patterns via adversarial attention-based autoencoder for abnormal event detection in videos," IEEE Transactions on Multimedia, vol. 22, no. 8, pp. 2138–2148, 2019.
[105] X. Feng, D. Song, Y. Chen, Z. Chen, J. Ni, and H. Chen, "Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection," in Proceedings of the ACM International Conference on Multimedia, 2021, pp. 5546–5554.
[106] P. Wu, W. Wang, F. Chang, C. Liu, and B. Wang, "Dss-net: dynamic self-supervised network for video anomaly detection," IEEE Transactions on Multimedia, vol. 26, pp. 2124–2136, 2023.
[107] M. Astrid, M. Z. Zaheer, and S.-I. Lee, "Limiting reconstruction capability of autoencoders using moving backward pseudo anomalies," in Proceedings of the International Conference on Ubiquitous Robots, 2022, pp. 248–251.
[108] M. Astrid, M. Zaheer, J.-Y. Lee, and S.-I. Lee, "Learning not to reconstruct anomalies," in Proceedings of the British Machine Vision Conference, 2021.
[109] M. Astrid, M. Z. Zaheer, and S.-I. Lee, "Pseudobound: limiting the anomaly reconstruction capability of one-class classifiers using pseudo anomalies," Neurocomputing, vol. 534, pp. 147–160, 2023.
[110] M. Pourreza, B. Mohammadi, M. Khaki, S. Bouindour, H. Snoussi, and M. Sabokrou, "G2d: generate to detect anomaly," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 2003–2012.
[111] M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, "A background-agnostic framework with adversarial training for abnormal event detection in video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4505–4523, 2021.
[112] Z. Liu, X.-M. Wu, D. Zheng, K.-Y. Lin, and W.-S. Zheng, "Generating anomalies for video anomaly detection with prompt-based feature mapping," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 24500–24510.
[113] J. Leng, M. Tan, X. Gao, W. Lu, and Z. Xu, "Anomaly warning: learning and memorizing future semantic patterns for unsupervised ex-ante potential anomaly prediction," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6746–6754.
[114] G. Yu, S. Wang, Z. Cai, X. Liu, and C. Wu, "Effective video abnormal event detection by learning a consistency-aware high-level feature extractor," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6337–6346.
[115] W. Liu, H. Chang, B. Ma, S. Shan, and X. Chen, "Diversity-measurable anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 12147–12156.
[116] Y. Liu, D. Yang, G. Fang, Y. Wang, D. Wei, M. Zhao, K. Cheng, J. Liu, and L. Song, "Stochastic video normality network for abnormal event detection in surveillance videos," Knowledge-Based Systems, vol. 280, p. 110986, 2023.
[117] C. Sun, C. Shi, Y. Jia, and Y. Wu, "Learning event-relevant factors for video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2384–2392.
[118] L. Wang, J. Tian, S. Zhou, H. Shi, and G. Hua, "Memory-augmented appearance-motion network for video anomaly detection," Pattern Recognition, vol. 138, p. 109335, 2023.
[119] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, "Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705–1714.
[120] H. Park, J. Noh, and B. Ham, "Learning memory-guided normality for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 14372–14381.
[121] H. Lv, C. Chen, Z. Cui, C. Xu, Y. Li, and J. Yang, "Learning normal dynamics in videos with meta prototype network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15425–15434.
[122] Z. Yang, P. Wu, J. Liu, and X. Liu, "Dynamic local aggregation network with adaptive clusterer for anomaly detection," in Proceedings of the European Conference on Computer Vision, 2022, pp. 404–421.
[123] C. Cao, Y. Lu, and Y. Zhang, "Context recovery and knowledge retrieval: a novel two-stream framework for video anomaly detection," IEEE Transactions on Image Processing, vol. 33, pp. 1810–1825, 2024.
[124] S. Lee, H. G. Kim, and Y. M. Ro, "Bman: bidirectional multi-scale aggregation networks for abnormal event detection," IEEE Transactions on Image Processing, vol. 29, pp. 2395–2408, 2019.
[125] C. Cao, Y. Lu, P. Wang, and Y. Zhang, "A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 20392–20401.
[126] C. Huang, C. Liu, Z. Zhang, Z. Wu, J. Wen, Q. Jiang, and Y. Xu, "Pixel-level anomaly detection via uncertainty-aware prototypical transformer," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 521–530.
[127] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[128] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, "Claws: clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection," in Proceedings of the European Conference on Computer Vision, 2020, pp. 358–376.
[129] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[130] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: learning multimodal violence detection under weak supervision," in Proceedings of the European Conference on Computer Vision, 2020, pp. 322–339.
[131] J.-C. Wu, H.-Y. Hsieh, D.-J. Chen, C.-S. Fuh, and T.-L. Liu, "Self-supervised sparse representation for video anomaly detection," in Proceedings of the European Conference on Computer Vision, 2022, pp. 729–745.
[132] Y. Zhou, Y. Qu, X. Xu, F. Shen, J. Song, and H. Shen, "Batchnorm-based weakly supervised video anomaly detection," arXiv preprint arXiv:2311.15367, 2023.
[133] S. AlMarri, M. Z. Zaheer, and K. Nandakumar, "A multi-head approach with shuffled segments for weakly-supervised video anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 132–142.
[134] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
[135] S. Sun and X. Gong, "Long-short temporal co-teaching for weakly supervised video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2023, pp. 2711–2716.
[136] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: towards good practices for deep action recognition," in Proceedings of the European Conference on Computer Vision, 2016, pp. 20–36.
[137] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1237–1246.
[138] N. Li, J.-X. Zhong, X. Shu, and H. Guo, "Weakly-supervised anomaly detection in video surveillance via graph convolutional label noise cleaning," Neurocomputing, vol. 481, pp. 154–167, 2022.
[139] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: hierarchical vision transformer using shifted windows," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 10012–10022.
[140] S. Li, F. Liu, and L. Jiao, "Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.
[141] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in Proceedings of the International Conference on Machine Learning, 2021, pp. 8748–8763.
[142] P. Wu, X. Zhou, G. Pang, Y. Sun, J. Liu, P. Wang, and Y. Zhang, "Open-vocabulary video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18297–18307.
[143] B. Wan, Y. Fang, X. Xia, and J. Mei, "Weakly supervised video anomaly detection via center-guided discriminative learning," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2020, pp. 1–6.
[144] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "Cnn architectures for large-scale audio classification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2017, pp. 131–135.
[145] W.-F. Pang, Q.-H. He, Y.-j. Hu, and Y.-X. Li, "Violence detection in videos based on fusing visual and audio information," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2021, pp. 2260–2264.
[146] X. Peng, H. Wen, Y. Luo, X. Zhou, K. Yu, Y. Wang, and Z. Wu, "Learning weakly supervised audio-visual violence detection in hyperbolic space," arXiv preprint arXiv:2305.18797, 2023.
[147] Y. Pu, X. Wu, and S. Wang, "Learning prompt-enhanced context features for weakly-supervised video anomaly detection," arXiv preprint arXiv:2306.14451, 2023.
[148] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A.-A. Khoo, "Tevad: improved video anomaly detection with captions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2023, pp. 5548–5558.
[149] C. Tao, C. Wang, Y. Zou, X. Peng, J. Wu, and J. Qian, "Learn suspected anomalies from event prompts for video anomaly detection," arXiv preprint arXiv:2403.01169, 2024.
[150] P. Wu, J. Liu, X. He, Y. Peng, P. Wang, and Y. Zhang, "Toward video anomaly retrieval from video anomaly detection: new benchmarks and model," IEEE Transactions on Image Processing, vol. 33, pp. 2213–2225, 2024.
[151] D.-L. Wei, C.-G. Liu, Y. Liu, J. Liu, X.-G. Zhu, and X.-H. Zeng, "Look, listen and pay more attention: fusing multi-modal information for video violence detection," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2022, pp. 1980–1984.
[152] D. Wei, Y. Liu, X. Zhu, J. Liu, and X. Zeng, "Msaf: multimodal supervise-attention enhanced fusion for video anomaly detection," IEEE Signal Processing Letters, vol. 29, pp. 2178–2182, 2022.
[153] J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, "Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection," in Proceedings of the ACM International Conference on Multimedia, 2022, pp. 6278–6287.
[154] P. Wu, X. Liu, and J. Liu, "Weakly supervised audio-visual violence detection," IEEE Transactions on Multimedia, vol. 25, pp. 1674–1685, 2022.
[155] T. Yuan, X. Zhang, K. Liu, B. Liu, C. Chen, J. Jin, and Z. Jiao, "Towards surveillance video-and-language understanding: new dataset baselines and challenges," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 22052–22061.
[156] Y. Zhu and S. Newsam, "Motion-aware feature for improved video anomaly detection," in Proceedings of the British Machine Vision Conference, 2019.
[157] J. Zhang, L. Qing, and J. Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in Proceedings of the IEEE International Conference on Image Processing, 2019, pp. 4030–4034.
[158] Y. Liu, J. Liu, M. Zhao, S. Li, and L. Song, "Collaborative normality learning framework for weakly supervised video anomaly detection," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 5, pp. 2508–2512, 2022.
[159] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, "Mist: multiple instance self-training framework for video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14009–14018.
[160] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, "Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 16271–16280.
[161] Z. Yang, J. Liu, and P. Wu, "Text prompt with normality guidance for weakly supervised video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18899–18908.
[162] P. Wu and J. Liu, "Learning causal temporal relation and feature discrimination for anomaly detection," IEEE Transactions on Image Processing, vol. 30, pp. 3513–3527, 2021.
[163] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 4975–4986.
[164] T. Liu, C. Zhang, K.-M. Lam, and J. Kong, "Decouple and resolve: transformer-based models for online anomaly detection from weakly labeled videos," IEEE Transactions on Information Forensics and Security, vol. 18, pp. 15–28, 2022.
[165] S. Chang, Y. Li, S. Shen, J. Feng, and Z. Zhou, "Contrastive attention for video anomaly detection," IEEE Transactions on Multimedia, vol. 24, pp. 4067–4076, 2021.
[166] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, "Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 12137–12146.
[167] D. Purwanto, Y.-T. Chen, and W.-H. Fang, "Dance with self-attention: a new look of conditional random fields on anomaly detection in videos," in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 173–183.
[168] C. Huang, C. Liu, J. Wen, L. Wu, Y. Xu, Q. Jiang, and Y. Wang, "Weakly supervised video anomaly detection via self-guided temporal discriminative transformer," IEEE Transactions on Cybernetics, vol. 54, no. 5, pp. 3197–3210, 2022.
[169] C. Zhang, G. Li, Q. Xu, X. Zhang, L. Su, and Q. Huang, "Weakly supervised anomaly detection in videos considering the openness of events," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 11, pp. 21687–21699, 2022.
[170] H. Zhou, J. Yu, and W. Yang, "Dual memory units with uncertainty regulation for weakly supervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3769–3777.
[171] G. Li, G. Cai, X. Zeng, and R. Zhao, "Scale-aware spatio-temporal relation learning for video anomaly detection," in Proceedings of the European Conference on Computer Vision, 2022, pp. 333–350.
[172] H. Ye, K. Xu, X. Jiang, and T. Sun, "Learning spatio-temporal relations with multi-scale integrated perception for video anomaly detection," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2024, pp. 4020–4024.
[173] S. Lin, H. Yang, X. Tang, T. Shi, and L. Chen, "Social mil: interaction-aware for crowd anomaly detection," in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2019, pp. 1–8.
[174] S. Park, H. Kim, M. Kim, D. Kim, and K. Sohn, "Normality guided multiple instance learning for weakly supervised video anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 2665–2674.
[175] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, "Unbiased multiple instance learning for weakly supervised video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 8022–8031.
[176] J. Chen, L. Li, L. Su, Z.-j. Zha, and Q. Huang, "Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18319–18329.
[177] Y. Chen, Z. Liu, B. Zhang, W. Fok, X. Qi, and Y.-C. Wu, "Mgfn: magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 387–395.
[178] Y. Gong, C. Wang, X. Dai, S. Yu, L. Xiang, and J. Wu, "Multi-scale continuity-aware refinement network for weakly supervised video anomaly detection," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[179] H. Sapkota and Q. Yu, "Bayesian nonparametric submodular video partition for robust anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 3212–3221.
[180] J. Fioresi, I. R. Dave, and M. Shah, "Ted-spad: temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection," in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 13598–13609.
[181] M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, "Clustering aided weakly supervised training to detect anomalous events in surveillance videos," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2023.
[182] T. Liu, K.-M. Lam, and J. Kong, "Distilling privileged knowledge for anomalous event detection from weakly labeled videos," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2023.
[183] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, "Vadclip: adapting vision-language models for weakly supervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 6074–6082.
[184] P. Wu, X. Zhou, G. Pang, Z. Yang, Q. Yan, P. Wang, and Y. Zhang, "Weakly supervised video anomaly detection and localization with spatio-temporal prompts," in Proceedings of the ACM International Conference on Multimedia, 2024.
[185] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, "Clip-tsa: clip-assisted temporal self-attention for weakly-supervised video anomaly detection," in Proceedings of the IEEE International Conference on Image Processing, 2023, pp. 3230–3234.
[186] H. Zhang, X. Xu, X. Wang, J. Zuo, C. Han, X. Huang, C. Gao, Y. Wang, and N. Sang, "Holmes-vad: towards unbiased and explainable video anomaly detection via multi-modal llm," arXiv preprint arXiv:2406.12235, 2024.
[187] H. Lv and Q. Sun, "Video anomaly detection and explanation via large language models," arXiv preprint arXiv:2401.05702, 2024.
[188] L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, "Harnessing large language models for training-free video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18527–18536.
[189] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, "Winclip: Zero-/few-shot anomaly classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 19606–19616.
[190] Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, "Anomalyclip: Object-agnostic prompt learning for zero-shot anomaly detection," in The Twelfth International Conference on Learning Representations, 2024.
[191] J. Zhu and G. Pang, "Toward generalist anomaly detection via in-context residual learning with few-shot sample prompts," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17826–17836.
[192] K. Liu and H. Ma, "Exploring background-bias for anomaly detection in surveillance videos," in Proceedings of the ACM International Conference on Multimedia, 2019, pp. 1490–1499.
[193] J. Wu, W. Zhang, G. Li, W. Wu, X. Tan, Y. Li, E. Ding, and L. Lin, "Weakly-supervised spatio-temporal anomaly detection in surveillance video," in Proceedings of the International Joint Conference on Artificial Intelligence, 2021.
[194] Z. Dong, J. Qin, and Y. Wang, "Multi-stream deep networks for person to person violence detection in videos," in Proceedings of the Chinese Conference on Pattern Recognition, 2016, pp. 517–531.
[195] P. Zhou, Q. Ding, H. Luo, and X. Hou, "Violent interaction detection in video based on deep learning," in Journal of Physics: Conference Series, vol. 844, no. 1, 2017, p. 012044.
[196] B. Peixoto, B. Lavi, J. P. P. Martin, S. Avila, Z. Dias, and A. Rocha, "Toward subjective violence detection in videos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 8276–8280.
[197] M. Perez, A. C. Kot, and A. Rocha, "Detection of real-world fights in surveillance videos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2019, pp. 2662–2666.
[198] B. Peixoto, B. Lavi, P. Bestagini, Z. Dias, and A. Rocha, "Multimodal violence detection in videos," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020, pp. 2957–2961.
[199] S. Sudhakaran and O. Lanz, "Learning to detect violent videos using convolutional long short-term memory," in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2017, pp. 1–6.
[200] A. Hanson, K. Pnvr, S. Krishnagopal, and L. Davis, "Bidirectional convolutional lstm for the detection of violence in videos," in Proceedings of the European Conference on Computer Vision Workshop, 2018, pp. 0–0.
[201] Y. Su, G. Lin, J. Zhu, and Q. Wu, "Human interaction learning on 3d skeleton point clouds for video violence recognition," in Proceedings of the European Conference on Computer Vision, 2020, pp. 74–90.
[202] A. Singh, D. Patil, and S. Omkar, "Eye in the sky: real-time drone surveillance system (dss) for violent individuals identification using scatternet hybrid deep learning network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2018, pp. 1629–1637.
[203] M. Cheng, K. Cai, and M. Li, "Rwf-2000: an open large scale video database for violence detection," in Proceedings of the International Conference on Pattern Recognition, 2021, pp. 4183–4190.
[204] Y. Shang, X. Wu, and R. Liu, "Multimodal violent video recognition based on mutual distillation," in Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, 2022, pp. 623–637.
[205] G. Garcia-Cobo and J. C. SanMiguel, "Human skeletons and change detection for efficient violence detection in surveillance videos," Computer Vision and Image Understanding, vol. 233, p. 103739, 2023.
[206] J. Su, P. Her, E. Clemens, E. Yaz, S. Schneider, and H. Medeiros, "Violence detection using 3d convolutional neural networks," in Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance, 2022, pp. 1–8.
[207] A. Del Giorno, J. A. Bagnell, and M. Hebert, "A discriminative framework for anomaly detection in large videos," in Proceedings of the European Conference on Computer Vision, 2016, pp. 334–349.
[208] R. Tudor Ionescu, S. Smeureanu, B. Alexe, and M. Popescu, "Unmasking the abnormal events in video," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2895–2903.
[209] Y. Liu, C.-L. Li, and B. Póczos, "Classifier two sample test for video anomaly detections," in Proceedings of the British Machine Vision Conference, 2018, p. 71.
[210] S. Wang, Y. Zeng, Q. Liu, C. Zhu, E. Zhu, and J. Yin, "Detecting abnormality without knowing normality: a two-stage approach for unsupervised video abnormal event detection," in Proceedings of the ACM International Conference on Multimedia, 2018, pp. 636–644.
[211] G. Pang, C. Yan, C. Shen, A. v. d. Hengel, and X. Bai, "Self-trained deep ordinal regression for end-to-end video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 12173–12182.
[212] J. Hu, G. Yu, S. Wang, E. Zhu, Z. Cai, and X. Zhu, "Detecting anomalous events from unlabeled videos via temporal masked auto-encoding," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2022, pp. 1–6.
[213] X. Lin, Y. Chen, G. Li, and Y. Yu, "A causal inference look at unsupervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1620–1629.
[214] G. Yu, S. Wang, Z. Cai, X. Liu, C. Xu, and C. Wu, "Deep anomaly discovery from unlabeled videos via normality advantage and self-paced refinement," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 13987–13998.
[215] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, "Generative cooperative learning for unsupervised video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 14744–14754.
[216] A. Al-lahham, N. Tastan, M. Z. Zaheer, and K. Nandakumar, "A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2024, pp. 6793–6802.
[217] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[218] T. Li, Z. Wang, S. Liu, and W.-Y. Lin, "Deep unsupervised anomaly detection," in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2021, pp. 3636–3645.
[219] W. Liu, W. Luo, Z. Li, P. Zhao, S. Gao et al., "Margin learning embedded prediction for video anomaly detection with a few anomalies," in Proceedings of the International Joint Conference on Artificial Intelligence, 2019, pp. 3023–3030.
[220] A. Acsintoae, A. Florescu, M.-I. Georgescu, T. Mare, P. Sumedrea, R. T. Ionescu, F. S. Khan, and M. Shah, "Ubnormal: new benchmark for supervised open-set video anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 20143–20153.
[221] Y. Zhu, W. Bao, and Q. Yu, "Towards open set video anomaly detection," in Proceedings of the European Conference on Computer Vision, 2022, pp. 395–412.
[222] C. Ding, G. Pang, and C. Shen, "Catching both gray and black swans: Open-set supervised anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 7388–7398.
[223] J. Zhu, C. Ding, Y. Tian, and G. Pang, "Anomaly heterogeneity learning for open-set supervised anomaly detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 17616–17626.
[224] Y. Lu, F. Yu, M. K. K. Reddy, and Y. Wang, "Few-shot scene-adaptive anomaly detection," in Proceedings of the European Conference on Computer Vision, 2020, pp. 125–141.
[225] Y. Hu, X. Huang, and X. Luo, "Adaptive anomaly detection network for unseen scene without fine-tuning," in Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, 2021, pp. 311–323.
[226] X. Huang, Y. Hu, X. Luo, J. Han, B. Zhang, and X. Cao, "Boosting variational inference with margin learning for few-shot scene-adaptive anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 6, pp. 2813–2825, 2022.
[227] A. Aich, K.-C. Peng, and A. K. Roy-Chowdhury, "Cross-domain video anomaly detection without target domain adaptation," in Proceedings