Abstract: At present, much anomaly detection research focuses on two problems: first, anomalies cannot be accurately localized at the pixel level; second, the training data cannot include anomalies. To address these problems, we introduce the "Stereotype Deepening" algorithm, which uses transitive learning while training a tree-like teacher-student network structure to deepen the "Stereotype". As a result, in abnormal areas the descriptors given by a student deviate from the descriptors given by its teacher. Additionally, peer bias is taken into account as an anomaly score term. Experiments were conducted on different types of datasets to demonstrate the effectiveness of this algorithm for anomaly detection and anomaly localization. In comparison with existing methods, the proposed approach has significant advantages on texture data.
Keywords: Stereotype deepening; Transitive learning; Knowledge distillation; Anomaly detection; Anomaly localization
To solve this problem, Hinton et al. [24] used knowledge distillation to transfer knowledge from bulky models to small models that are more suitable for deployment to a large number of users. The knowledge distillation model consists of two parts: a teacher network and a student network. The teacher network has a complicated structure and numerous parameters, while the student network has a simple structure and few parameters. During training, the student network learns the knowledge extracted by the teacher network.
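For orientation, the sketch below shows a standard distillation objective in this spirit, written in PyTorch. It is not the loss used later in this paper (which regresses descriptors rather than class probabilities), and the temperature and weighting values are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Classic soft-target distillation in the sense of Hinton et al. [24].

    The student matches the teacher's softened class probabilities while
    also fitting the ground-truth labels; temperature and alpha are
    illustrative choices, not values from this paper.
    """
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce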
4 Algorithm
In this section, the proposed "Stereotype Deepening" algorithm is described in detail. While training the network, transitive learning is used to deepen the "Stereotype", so that the discrepancy between descriptors is enlarged. The student networks are updated by minimizing a mixed loss. In the evaluation, inference bias, delivery bias, and peer bias are used to measure the effect of anomaly detection and localization.

4.1 Network Structure

As shown in Figure 3, masters and bachelors are set up as students in the tree-like teacher-student structure, and the networks are trained one by one through transitive learning. For convenience, we give the students of each layer different names. The students obtained after the first round of transitive learning are called masters; the network then carries out a second round of transitive learning, which yields the bachelors. Bachelors are obtained by studying the knowledge of masters, and we divide the bachelors into classes according to which master is their teacher. Masters therefore act as both teachers and students. We train all the students on the given training data U = {u1, u2, ..., un}, which contains only anomaly-free images. Each network except the doctor takes the descriptor of the previous network as its regression target; for example, the regression targets for bachelors are the feature descriptors output by the masters. After training, we use both abnormal and normal images as test data, and the inference bias, delivery bias, and peer bias caused by the "Stereotype" are used as indicators for anomaly evaluation.
Figure 3: Bachelor1 and Bachelor2 belong to Class1, and Bachelor3 and Bachelor4 belong to Class2. CostInference represents the inference bias, CostDelivery represents the delivery bias, and CostPeer represents the peer bias.
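A rough sketch of the training order implied by this structure is given below, under our reading of Figure 3; train_student is a hypothetical helper that fits a student to its teacher's descriptors as described in Section 4.2.

def train_transitively(doctor, masters, bachelors_by_master, data_U, train_student):
    """Train students layer by layer on the anomaly-free data U.

    train_student(student, teacher, data) is a hypothetical helper that
    regresses the student's descriptors onto the teacher's (normalized)
    descriptors, as described in Section 4.2.
    """
    # First round of transitive learning: masters learn from the doctor.
    for master in masters:
        train_student(student=master, teacher=doctor, data=data_U)
    # Second round: each class of bachelors learns from its own master.
    for master in masters:
        for bachelor in bachelors_by_master[master]:
            train_student(student=bachelor, teacher=master, data=data_U)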
4.2 The Process of Training

This section introduces the network training in detail. The process is divided into three stages, and the training structure is shown in Figure 4.

Figure 4: b denotes the batch size, and D(I), M(I), and B(I) are descriptors. From right to left are the three stages of training: the masters are trained according to the doctor, and then the masters are used to train the bachelors.

4.2.1 Training of the Doctor Network

The input image G is randomly cut into patch-sized image regions I, and the doctor network D outputs a d-dimensional descriptor for each patch I. Because a pre-trained deep neural network has strong representation ability, it performs well in classification. Therefore, we use the pre-trained network D as the basic network of a classification network T, and the loss of the classification network can be expressed as:

Lk = −Σ y log(T(I))    (3)

where T is the classification network and y is the classification label.

4.2.2 Training of Masters and Bachelors

In this part, we use the mean square error and the improved descriptor compactness loss as a mixed loss. The network is trained to obtain the masters first, and the bachelors are then obtained through the masters; the current network always fits the descriptors of the previous network. Specifically, D first extracts patch-based descriptors for each image in the dataset U, and the masters are trained by regressing the descriptors output by D. After the masters have been trained on the dataset U, the same method is used to train the bachelors with the outputs of the masters. In particular, before training the next layer of students, we normalize the output provided by the current layer. Masters have a dual identity in the whole network structure: they are students for the front layer and teachers for the back layer.

Mean Square Error. As in the training of the doctor network, the knowledge of the previous network must be extracted into the current network when training a student network. The distance is used to measure the difference between M(I) and D(I), where M(I) is the d-dimensional descriptor given by the master network:

Ld = ||M(I) − D(I)||²    (4)

Descriptor Compactness Error. For a set of inputs I, to eliminate redundancy and minimize the correlation between descriptors, we make some improvements to the method used in [25]; the improved method ensures the accuracy of the calculation. Each patch I passing through the network is transformed into a d-dimensional descriptor within a batch, and we calculate the correlation between any two descriptors. We found that simply summing these correlation coefficients cannot accurately express the overall correlation and is affected by the batch size when minimizing the descriptor correlation: the batch size is determined by the number of patches, it determines the number of possible pairs of different descriptors, and it therefore affects the sum of correlation coefficients. To eliminate the influence of batch size on descriptor compactness, the improved loss can be expressed as:

Lp = ( Σ_{i<j} c_ij ) / ( n(n − 1)/2 )    (5)

where c_ij is the correlation coefficient between the i-th and j-th descriptors and n is the batch size.

Mixed Loss. The training loss of masters and bachelors is obtained as a weighted sum of these two terms.

Because each master captures only a one-sided view of the doctor's knowledge, bachelors are affected by this one-sided factor when training according to the masters, so there is a delivery bias when bachelors learn the descriptors of the masters. Additionally, there are peer biases between masters and peer biases among bachelors. Figure 3 identifies the three types of biases. We take the degree of deviation between the descriptors given by a master and the descriptors given by the doctor as the first anomaly score, where D(x) represents the descriptor of the doctor and Mi(x) is the descriptor given by the i-th master.
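To make the mixed loss of Section 4.2.2 concrete, the following PyTorch sketch combines the descriptor regression of Eq. (4) with a batch-size-independent compactness term in the sense of Eq. (5). The absolute value on the correlations, the frozen teacher, and the weight lambda_p are our assumptions rather than details stated above.

import torch

def compactness_loss(descriptors: torch.Tensor) -> torch.Tensor:
    """Batch-size-independent descriptor compactness, as read from Eq. (5).

    descriptors: (n, d) batch of patch descriptors. The pairwise
    correlation coefficients are summed (here in absolute value, an
    assumption) and divided by the number of pairs n(n-1)/2.
    """
    n = descriptors.shape[0]
    corr = torch.corrcoef(descriptors)                    # (n, n) correlations
    off_diag = corr - torch.eye(n, device=corr.device)    # zero the unit diagonal
    pair_sum = off_diag.abs().sum() / 2                   # each unordered pair once
    return pair_sum / (n * (n - 1) / 2)

def student_loss(student, teacher, patches, lambda_p: float = 1.0):
    """Mixed loss for masters/bachelors: descriptor regression (Eq. 4)
    plus the compactness term; the weight lambda_p is illustrative."""
    with torch.no_grad():
        target = teacher(patches)                         # teacher descriptors stay fixed
    pred = student(patches)
    l_d = (pred - target).pow(2).sum(dim=1).mean()        # ||M(I) - D(I)||^2 per patch
    return l_d + lambda_p * compactness_loss(pred)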
Table 1: The network structure for a patch size of 64. Leaky rectified linear units with slope 5 × 10⁻³ are applied as activation functions after each convolution layer.

Layer      Output Size   Kernel   Stride
Input 64×64×3 - -
Conv2d 61×61×64 4×4 1
MaxPool2d 30×30×64 2×2 2
Conv2d 27×27×32 4×4 1
MaxPool2d 13×13×32 2×2 2
Conv2d 10×10×16 4×4 1
MaxPool2d 5×5×16 2×2 2
Conv2d 2×2×8 4×4 1
Conv2d 1×1×4 2×2 1
Linear 1×1×1 - -
Flatten 1×1 - -
Linear 1×512 - -
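A minimal PyTorch sketch of this patch network follows. The convolutional stack reproduces the output sizes in Table 1; the final linear rows of the table are collapsed into a single fully connected layer that maps to the 512-dimensional descriptor, which is our interpretation of the printed layout rather than a verified detail.

import torch
import torch.nn as nn

def patch_descriptor_net(d: int = 512) -> nn.Sequential:
    """Patch-descriptor network following Table 1 (patch size 64).

    A LeakyReLU with slope 5e-3 follows each convolution; the head maps
    the 1x1 feature map to a d-dimensional descriptor (d = 512 in Table 1).
    """
    act = lambda: nn.LeakyReLU(5e-3)
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=4, stride=1),   # 64x64x3  -> 61x61x64
        act(),
        nn.MaxPool2d(2, stride=2),                   # -> 30x30x64
        nn.Conv2d(64, 32, kernel_size=4, stride=1),  # -> 27x27x32
        act(),
        nn.MaxPool2d(2, stride=2),                   # -> 13x13x32
        nn.Conv2d(32, 16, kernel_size=4, stride=1),  # -> 10x10x16
        act(),
        nn.MaxPool2d(2, stride=2),                   # -> 5x5x16
        nn.Conv2d(16, 8, kernel_size=4, stride=1),   # -> 2x2x8
        act(),
        nn.Conv2d(8, 4, kernel_size=2, stride=1),    # -> 1x1x4
        act(),
        nn.Flatten(),                                # -> 4
        nn.Linear(4, d),                             # -> d-dimensional descriptor
    )

if __name__ == "__main__":
    net = patch_descriptor_net()
    x = torch.randn(8, 3, 64, 64)     # a batch of 64x64 patches
    print(net(x).shape)               # torch.Size([8, 512])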
Table 2: Anomaly detection results. The table gives the anomaly detection accuracy of different algorithms in one-class classification; the mean values reflect their overall performance.

Dataset    Method         0     1     2     3     4     5     6     7     8     9     Mean
MNIST      DSVDD[29]      0.98  0.997 0.917 0.919 0.949 0.885 0.983 0.946 0.939 0.965 0.948
           OCGAN[30]      0.998 0.999 0.942 0.963 0.975 0.98  0.991 0.981 0.939 0.981 0.975
           CAVGA Du[31]   0.994 0.997 0.989 0.983 0.997 0.968 0.988 0.986 0.988 0.991 0.986
           LSA[32]        0.993 0.999 0.959 0.966 0.956 0.964 0.994 0.98  0.953 0.981 0.975
           Ours           0.991 0.995 0.994 0.996 0.995 0.995 0.994 0.991 0.992 0.989 0.9932
CIFAR-10   DSVDD[29]      0.617 0.659 0.508 0.591 0.609 0.657 0.677 0.673 0.759 0.731 0.648
           OCGAN[30]      0.757 0.531 0.64  0.62  0.723 0.62  0.723 0.575 0.82  0.554 0.6566
           CAVGA Du[31]   0.653 0.784 0.761 0.747 0.775 0.552 0.813 0.745 0.801 0.741 0.737
           LSA[32]        0.735 0.58  0.69  0.542 0.761 0.546 0.751 0.535 0.717 0.548 0.641
           Ours           0.834 0.852 0.748 0.761 0.801 0.762 0.901 0.841 0.887 0.828 0.8215
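Assuming the per-class numbers above follow the usual one-class protocol, in which one class is treated as normal, all remaining classes are treated as anomalous, and the anomaly scores are ranked with AUROC, the evaluation could be sketched as follows (the protocol itself is our assumption, not stated in the table):

import numpy as np
from sklearn.metrics import roc_auc_score

def one_class_auroc(scores_per_class):
    """Hedged sketch of the one-class evaluation assumed for Table 2.

    scores_per_class: list of (scores, labels) pairs, one per held-out
    normal class; scores are anomaly scores (higher = more anomalous)
    and labels mark anomalous test images with 1.
    Returns the per-class values and their mean, matching the layout of
    the "0 ... 9" and "Mean" columns.
    """
    per_class = [roc_auc_score(labels, scores) for scores, labels in scores_per_class]
    return per_class, float(np.mean(per_class))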
5.2.2 Comparisons Based on the MVTec Dataset

The MVTec dataset provides anomalies based on different entities. In addition to verifying the anomaly detection capabilities of the network, we also verified the effect of anomaly localization through experiments.
Table 3: Anomaly localization results in terms of AUROC. The table shows per-category results; the averages over the Textures and Objects categories are reported as Textures mean and Objects mean, respectively.

Columns (left to right): Carpet, Grid, Leather, Tile, Wood, Textures mean, Bottle, Cable, Capsule, Hazelnut, Metal nut, Pill, Screw, Toothbrush, Transistor, Zipper, Objects mean, Mean

STAD[6]:       0.695  0.819  0.819   0.921  0.725  0.7958  0.918   0.865  0.916  0.937  0.842? no
CAVGA Du[31]:  0.73   0.75   0.71    0.7    0.85   0.748   0.89    0.63   0.83   0.84   0.67   0.88    0.77   0.91   0.73   0.87    0.802   0.784
CAVGA Ru[31]:  0.78   0.78   0.75    0.72   0.88   0.782   0.91    0.67   0.87   0.87   0.71   0.91    0.78   0.97   0.75   0.94    0.838   0.819
CAVGA Dw[31]:  0.8    0.79   0.8     0.81   0.89   0.818   0.93    0.86   0.89   0.9    0.81   0.93    0.79   0.96   0.8    0.95    0.882   0.861
Ours:          0.958  0.955  0.9633  0.922  0.925  0.9447  0.9533  0.876  0.906  0.972  0.842  0.8086  0.906  0.74   0.67   0.8714  0.8731  0.885
We considered the inference bias of each module in the anomaly region. Besides the inference bias, we also considered the delivery bias between masters and bachelors, which makes the descriptors given by masters and bachelors differ in the same abnormal area. Students of the same grade received different initial settings of the networks and did not use abnormal images for training; therefore, different students also express significant differences when encountering abnormal areas. Figure 5 shows anomaly maps, including the anomaly maps given by the comprehensive evaluation, in which color differences reflect the degree of anomaly. According to Figure 5, the three biases behave differently in abnormal areas, and the comprehensive evaluation result is the most distinct. Anomaly detection and localization using only the delivery bias is more effective than using only the inference bias, which shows that the "Stereotype" is reasonable and usable for anomaly detection and localization.
Figure 5: Anomaly maps and anomaly scores at all levels. The various biases produced in the process of transitive learning are evident in the abnormal areas.
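As an illustration of how the three biases could be turned into per-pixel anomaly maps, the sketch below assumes each network produces a dense descriptor map of shape (B, C, H, W) and combines the biases by a simple sum; the pairing of bachelors with masters and the combination rule are assumptions, not the paper's exact formulas.

import torch

def anomaly_scores(doctor, masters, bachelors, x):
    """Hedged sketch of the three bias terms used as anomaly scores.

    doctor(x), masters[i](x), bachelors[j](x) are assumed to return
    descriptor maps of shape (B, C, H, W), so every score below is a
    per-pixel anomaly map of shape (B, H, W).
    """
    dx = doctor(x)
    m_out = [m(x) for m in masters]
    b_out = [b(x) for b in bachelors]

    # Inference bias: deviation of each master from the doctor.
    inference = torch.stack(
        [(mo - dx).pow(2).mean(dim=1) for mo in m_out]).mean(dim=0)

    # Delivery bias: deviation of each bachelor from its master
    # (here bachelor i is paired with master i // 2, as in Figure 3).
    delivery = torch.stack(
        [(bo - m_out[i // 2]).pow(2).mean(dim=1) for i, bo in enumerate(b_out)]
    ).mean(dim=0)

    # Peer bias: disagreement among students of the same grade,
    # measured here as the variance of their descriptors.
    peer = torch.stack(m_out).var(dim=0).mean(dim=1) + \
           torch.stack(b_out).var(dim=0).mean(dim=1)

    # Comprehensive score: a simple (assumed) sum of the three biases.
    return inference + delivery + peer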
Table 3 shows the anomaly localization results on the MVTec dataset. The experiments show that our method has outstanding advantages in the texture categories: every single-category AUROC on textures is above 0.92, and for Textures, "Stereotype Deepening" is superior to the other baselines. The average results in the Objects categories reach a level comparable to other methods. In general, this method is better than most of the algorithms we compared.

6 Conclusion

In this paper, an algorithm based on a tree-like teacher-student structure, called "Stereotype Deepening", was proposed for anomaly detection and localization. A descriptor compactness loss that is independent of batch size was used during training, and transitive learning was applied to train the student networks. In the experiments, the anomaly detection performance of the network was tested on the MNIST and CIFAR-10 datasets. Afterward, the "Stereotype" generated by the network during the training process was used to complete the anomaly localization. Peer bias and delivery bias verified the effectiveness of the "Stereotype" from two dimensions. The experiments proved that our method has a significant effect on texture types.
References

[1] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. abs/1901.03407, 2019.
[2] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. abs/1703.05921:146–157, 2017.
[3] Christoph Baur, Benedikt Wiestler, Shadi Albarqouni, and Nassir Navab. Deep autoencoding models for unsupervised anomaly segmentation in brain MR images. abs/1804.04488:161–169, 2018.
[4] Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, Jie Chen, Zhaogang Wang, and Honglin Qiao. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 187–196, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee.
[5] Mohammad Sabokrou, Mohsen Fayyaz, Mahmood Fathy, and Reinhard Klette. Deep-cascade: Cascading 3D deep neural networks for fast anomaly detection and localization in crowded scenes. IEEE Transactions on Image Processing, 26(4):1992–2004, 2017.
[6] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4182–4191, June 2020.
[7] Mohammadreza Salehi, Niousha Sadjadi, Soroosh Baselizadeh, Mohammad Hossein Rohban, and Hamid R. Rabiee. Multiresolution knowledge distillation for anomaly detection. pages 14902–14912, June 2021.
[8] M. Almgren and E. Jonsson. Using active learning in intrusion detection. In Proceedings of the 17th IEEE Computer Security Foundations Workshop, pages 88–98, Los Alamitos, CA, USA, June 2004. IEEE Computer Society.
[9] Yang Li and Li Guo. An active learning based TCM-KNN algorithm for supervised network intrusion detection. Computers & Security, 26(7):459–467, 2007.
[10] Jay Stokes, John Platt, Joseph Kravis, and Michael Shilman. Aladin: Active learning of anomalies to detect intrusions. Technical Report MSR-TR-2008-24, March 2008.
[11] Shekhar R. Gaddam, Vir V. Phoha, and Kiran S. Balagani. K-Means+ID3: A novel method for supervised anomaly detection by cascading k-means clustering and ID3 decision tree learning methods. IEEE Transactions on Knowledge and Data Engineering, 19(3):345–354, 2007.
[12] Vilen Jumutc and Johan A. K. Suykens. Multi-class supervised novelty detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2510–2523, 2014.
[13] Jindong Gu, Matthias Schubert, and Volker Tresp. Semi-supervised outlier detection using generative and adversary framework, 2018.
[14] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli. Adversarially learned one-class classifier for novelty detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3379–3388, June 2018.
[15] Pramuditha Perera and Vishal M. Patel. Learning deep features for one-class classification. IEEE Transactions on Image Processing, 28(11):5450–5463, 2019.
[16] Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Prabhat, and Chris Pal. Extreme weather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events. In Advances in Neural Information Processing Systems, volume 30, 2017.
[17] Hao Wu and Saurabh Prasad. Semi-supervised deep learning using pseudo labels for hyperspectral image classification. IEEE Transactions on Image Processing, 27(3):1259–1270, 2018.
[18] Bo Zong, Qi Song, Martin Renqiang Min, Wei Cheng, Cristian Lumezanu, Daeki Cho, and Haifeng Chen. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.
[19] Weining Lu, Yu Cheng, Cao Xiao, Shiyu Chang, Shuai Huang, Bin Liang, and Thomas Huang. Unsupervised sequential outlier detection with deep architectures. IEEE Transactions on Image Processing, 26(9):4321–4330, 2017.
[20] Valentin Leveau and Alexis Joly. Adversarial autoencoders for novelty detection. Research report, Inria - Sophia Antipolis, 2017.
[21] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML '16, pages 1100–1109. JMLR.org, 2016.
[22] Ashish Mishra, Shiva Krishna Reddy, Anurag Mittal, and Hema A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2269–22698, June 2018.
[23] Wallace Lawson, Esube Bekele, and Keith Sullivan. Finding anomalies with generative adversarial networks for a patrolbot. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 484–485, 2017.
[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network, 2015.
[25] Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6128–6136, 2017.
[26] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[27] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[28] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[29] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4393–4402, 10–15 Jul 2018.
[30] Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. OCGAN: One-class novelty detection using GANs with constrained latent representations. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2893–2901, 2019.
[31] Shashanka Venkataramanan, Kuan-Chuan Peng, Rajat Vikram Singh, and Abhijit Mahalanobis. Attention guided anomaly localization in images. In Computer Vision – ECCV 2020, pages 485–503. Springer International Publishing, 2020.
[32] Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Latent space autoregression for novelty detection. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 481–490, 2019.