Spam Email Using Machine Learning
All content following this page was uploaded by Dr.Madhu Viswanatham V. on 08 August 2018.
Abstract: Spam emails have become a growing problem for all web users. These
unsolicited messages waste network resources unnecessarily. Machine learning
techniques are customarily adopted for filtering email spam. This article examines the capabilities of the
extreme learning machine (ELM) and the support vector machine (SVM) for classifying spam
emails according to the class label (d). ELM is an efficient model based on a single hidden layer
feed-forward neural network whose hidden layer weights are chosen randomly. The support vector
machine is a strong technique grounded in statistical learning theory and is frequently used for
classification. The performance of ELM has been compared with that of SVM. The comparative study
examines accuracy, precision, recall, false positives and true positives. Moreover, a sensitivity
analysis has been performed for ELM and SVM on spam email classification.
Introduction
Recently, the quantity of unwanted bulk email messages, i.e. spam messages, has increased in web
communication. These unwanted messages degrade the reliability and authenticity of genuine
email messages [1]. Spam has affected organisations heavily. As a result, the data-carrying
capacity of the internet comes under strain, which causes high financial losses for companies. Many
business models that depend on commercial spam benefit from the fact that sending an email
costs very little and that emails can be sent in large numbers. These reasons have forced
a few nations to make legislative changes [2]. In the literature, two types of spam filtering
techniques can be found: machine learning and non-machine learning (SMTP) methods. However,
machine learning techniques are more popular than non-machine learning approaches [3].
Machine learning strategies can be further subdivided into content based and non-content
based. Plenty of research has been carried out on machine learning based spam filters [1][3]. False
positives are the main concern, as they can have serious implications when filtering spam:
there may be circumstances where ham messages are classified as spam. Although
such situations can be reduced somewhat by applying several different classification techniques
simultaneously, this continues to be a matter of debate [4][5]. In addition, the
performance of a classifier also depends on the training set. The behaviour of spammers and the
ways in which spam payloads are delivered have also produced diverse sets of test emails, which call for
innovative methods to remain effective in the spam filtering domain. The majority of spam
filters are developed on the basis of text classification, and achieving high accuracy with any spam
classification model remains a matter of concern. The Bayesian method of spam detection is the most popular
technique [6]. H. Drucker et al. proposed an excellent support vector machine strategy for filtering spam in
1999 [7]. Other than machine learning, there exists another approach, the SMTP
(Simple Mail Transfer Protocol) technique mentioned earlier [8]. Route verification,
authenticity of the information exchange and checking of SMTP [9] are managed by the SMTP
approach. In this work, we adopted two effective machine learning techniques for spam
detection, namely the extreme learning machine (ELM) and the support vector machine (SVM). Both
adopted methods have shown good performance on test email instances.
International Journal of Engineering Research in Africa Vol. 22 153
ELM is regarded as a single hidden layer feedforward neural network (SLFN) [11]. The
extreme learning machine shows that the hidden neurons can be generated independently
of the application. ELM can perform classification and can also be adopted for function approximation
[10]. It establishes a relationship between neural networks, matrix theory and regression models,
especially ridge regression. It has a fast learning speed and is easy to implement.
It also needs little human intervention [11]. Therefore, for large scale data, ELM is a
potential alternative to other machine learning techniques; it supports a large number of activation
functions, and some piecewise continuous functions can also be applied in ELM. The feedforward neural
network is a popular neural network algorithm that can map complex non-linear
functions, but it suffers from long training times. ELM overcomes this and can perform better than
a conventional feedforward neural network. Many authors have worked on the universal approximation
capability of feedforward neural networks.
The second adopted method is the support vector machine, a theoretically well-founded learning
technique that can analyse data and find patterns in it [12]. SVM maps the training inputs to
a high dimensional feature space, where it finds the maximum margin of separation
between the classes. Applications of SVM and other classification techniques can be found in the
literature [13][18][19]. In this work, the accuracy of spam detection, actual
vs predicted graphs and confusion matrices are obtained for both ELM and SVM. A comparative study
of the two methods has also been carried out and is shown in Table 4 and Table 5.
Data set
In this work, the spam data set named spambase was collected from the UCI machine learning
repository for the experiments [14]. The ELM technique uses a basic feed-forward neural
network for classification and regression [11]. The total number of email instances is 4601, of
which 1813 (39.4%) are actually spam emails and the remaining
2788 (60.6%) are non-spam. The total number of features is 58, of which 57 attributes are
continuous and one is the nominal class label (d). The different types of spam email datasets
are as follows,
Table 1. Different types of spam corpora [16][17]
U5Spam   982   82   90

Descriptive statistics of the class label (d)
mean       0.3940448
sd         0.4886977
skewness   0.4338114
kurtosis   1.187404
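The class counts quoted for spambase (1813 spam, 2788 non-spam) are enough to recompute the statistics listed for the class label d. The sketch below is illustrative only: the function name is hypothetical, and the choice of estimators (sample standard deviation, sample-adjusted skewness, and sample-adjusted kurtosis reported without subtracting 3) is an assumption that is not stated in the text, although it appears to agree with the listed figures.

```python
import numpy as np

def describe_binary_label(n_spam, n_ham):
    """Descriptive statistics of a 0/1 class label built from class counts.

    Assumption: sample-adjusted estimators are used for sd, skewness and
    kurtosis; the source does not say which estimators produced its figures.
    """
    d = np.concatenate([np.ones(n_spam), np.zeros(n_ham)])  # 1 = spam, 0 = non-spam
    n = d.size
    mean = d.mean()                      # equals the spam fraction
    sd = d.std(ddof=1)                   # sample standard deviation
    dev = d - mean
    m2, m3, m4 = (np.mean(dev ** p) for p in (2, 3, 4))
    g1 = m3 / m2 ** 1.5                  # population skewness
    g2 = m4 / m2 ** 2 - 3.0              # population excess kurtosis
    skewness = g1 * np.sqrt(n * (n - 1)) / (n - 2)
    kurtosis = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0) + 3.0
    return {"mean": float(mean), "sd": float(sd),
            "skewness": float(skewness), "kurtosis": float(kurtosis)}

stats = describe_binary_label(n_spam=1813, n_ham=2788)
for name, value in stats.items():
    print(f"{name:9s} {value:.7f}")
```

The mean is simply the spam fraction 1813/4601, which is why it matches the 39.4% quoted for the data set.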
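$$\sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x_j) = t_j, \quad j = 1, \ldots, N \qquad (1)$$
The number of instances is N and the number of nodes in the hidden layer is L. Here, $\beta_i$ is the
output weight of the ith hidden node, and $(a_i, b_i)$ are the parameters of the ith hidden node. The
input to the ELM is given as x, which comprises all the attributes of the 'spambase' dataset excluding
the decision attribute. The input can be written as $x = \{x_1, \ldots, x_N\}$, and t is the decision
attribute. Equation (1) can be rewritten as
$$H\beta = T \qquad (2)$$
and H can be described as
$$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_L \end{bmatrix}, \quad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}$$
$$\beta = H^{\dagger} T \qquad (3)$$
where $H^{\dagger}$ denotes the Moore-Penrose generalised inverse of H.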
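The ELM training step described by the equations above can be sketched in a few lines of NumPy: the hidden node parameters are drawn at random and never trained, the hidden layer output matrix H is formed, and the output weights are obtained by solving the linear system H beta = T in the least-squares sense. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid activation, the Gaussian initialisation, the number of hidden nodes and the synthetic two-class data are all chosen here for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    """Fit the output weights of a single hidden layer network with random hidden nodes."""
    A = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (assumed Gaussian)
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = sigmoid(X @ A + b)                       # hidden layer output matrix, shape (N, L)
    beta = np.linalg.pinv(H) @ T                 # least-squares solution of H beta = T
    return A, b, beta

def predict_elm(X, A, b, beta):
    return sigmoid(X @ A + b) @ beta

# Synthetic stand-in data (NOT spambase): two Gaussian blobs, label 1 vs label 0.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(100, 5)),
               rng.normal(-2.0, 1.0, size=(100, 5))])
T = np.concatenate([np.ones(100), np.zeros(100)])

A, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
pred = (predict_elm(X, A, b, beta) >= 0.5).astype(int)  # threshold the real-valued output
train_accuracy = float(np.mean(pred == T))
print("training accuracy:", train_accuracy)
```

Because only the output weights are fitted, training reduces to one pseudoinverse computation, which is the source of the fast learning speed attributed to ELM.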
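Below, several graphs show false positive vs true positive, precision vs recall
and specificity vs sensitivity when ELM is applied to the test spam emails. These graphs, given in
figures 3, 4 and 5, are shown to examine the performance of the adopted ELM. The performance measures are
accuracy, precision, recall, positive predicted values, negative predicted values and the McNemar test (P
value). With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives
and false negatives, accuracy, precision and recall are defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN}.$$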
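As a worked illustration of these measures, the sketch below computes them from the four cells of a binary confusion matrix, with spam taken as the positive class. The counts are hypothetical and are not results from this study. The McNemar helper is likewise an assumption: it uses the continuity-corrected chi-square form with one degree of freedom, where `b` and `c` are the two discordant counts, since the text does not say which variant or which pairing was used.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Standard measures derived from a binary confusion matrix (spam = positive)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # positive predicted value
        "recall": tp / (tp + fn),               # sensitivity, true positive rate
        "specificity": tn / (tn + fp),
        "negative_predicted_value": tn / (tn + fn),
        "false_positive_rate": fp / (fp + tn),
    }

def mcnemar_p_value(b, c):
    """Continuity-corrected McNemar test on the two discordant counts b and c."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name:26s} {value:.3f}")

statistic, p_value = mcnemar_p_value(b=10, c=30)
print("McNemar statistic:", statistic, "p-value:", p_value)
```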
The training data are given as
$$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N) \qquad (4)$$
where $x_i \in R^n$ is a vector of dimension n and every instance belongs to either
spam or non-spam (ham). In this work, our intention is to find a generalised classifier
that can separate the two classes, i.e. spam and non-spam, $y_i \in \{-1, +1\}$, given the training data set. The
separator is simply a linear hyperplane of the form
$$f(x) = w \cdot x + k = 0 \qquad (5)$$
where w is the weight vector of the separating hyperplane and $k \in R$ is a bias. In this work, the
hyperplanes corresponding to spam emails and non-spam emails can be written as
$$w \cdot x_i + k \geq 1 \quad (y_i = 1) \qquad (6)$$
$$w \cdot x_i + k \leq -1 \quad (y_i = -1) \qquad (7)$$
Equations (6) and (7) can be combined as
$$y_i (w \cdot x_i + k) \geq 1 \qquad (8)$$
A slack variable $\xi_i$ is often added to allow for misclassification; equation (8)
can therefore be updated as
$$y_i (w \cdot x_i + k) \geq 1 - \xi_i \qquad (9)$$
For the spam class, the perpendicular distance from the origin to the hyperplane $w \cdot x + k = 1$ is
$\frac{|1 - k|}{\|w\|}$, and for the non-spam class, with hyperplane $w \cdot x + k = -1$, it is
$\frac{|-1 - k|}{\|w\|}$. The margin between the two hyperplanes can therefore be written as
$$\rho(w, k) = \frac{2}{\|w\|} \qquad (10)$$
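The soft-margin constraints above can be illustrated with a small linear SVM trained in plain NumPy. Note the substitution: instead of solving the constrained quadratic programme, this sketch minimises the equivalent regularised hinge loss by subgradient descent, which is a different (and simpler) solver than an SVM package would use. The regularisation strength `lam`, the learning rate, the epoch count, the function name and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on (lam/2)*||w||^2 + mean(max(0, 1 - y*(w.x + k)))."""
    n, d = X.shape
    w = np.zeros(d)
    k = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + k)
        violators = margins < 1.0            # instances that would need positive slack
        grad_w = lam * w - (y[violators][:, None] * X[violators]).sum(axis=0) / n
        grad_k = -y[violators].sum() / n
        w -= lr * grad_w
        k -= lr * grad_k
    return w, k

# Synthetic stand-in data (NOT spambase): +1 plays the role of spam, -1 of non-spam.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(100, 2)),
               rng.normal(-2.0, 0.5, size=(100, 2))])
y = np.concatenate([np.ones(100), -np.ones(100)])

w, k = train_linear_svm(X, y)
pred = np.where(X @ w + k >= 0.0, 1.0, -1.0)   # side of the hyperplane w.x + k = 0
accuracy = float(np.mean(pred == y))
margin = 2.0 / float(np.linalg.norm(w))        # distance between w.x + k = +1 and -1
print("training accuracy:", accuracy, "margin:", margin)
```

A larger `lam` favours a wider margin at the cost of more slack, while a smaller `lam` penalises constraint violations more heavily.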
Below are the graphs (figures 6, 7, 8 and 9) which show the actual vs predicted plot, false positive vs
true positive, precision vs recall, sensitivity vs specificity and accuracy plots obtained when SVM is
applied to the test spam emails.
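Curves such as false positive vs true positive or precision vs recall are obtained by sweeping a decision threshold over the classifier's real-valued outputs and recomputing the confusion matrix at each threshold. The sketch below shows that sweep on a handful of hypothetical scores and labels (1 = spam, 0 = non-spam); the values are invented for illustration and are not outputs of the models in this study. Precision is set to 1.0 by convention when nothing is predicted as spam.

```python
import numpy as np

def curve_points(scores, labels, thresholds):
    """Return (threshold, false positive rate, true positive rate, precision) tuples."""
    points = []
    for t in thresholds:
        pred = scores >= t                           # predicted spam at this threshold
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        tn = int(np.sum(~pred & (labels == 0)))
        tpr = tp / (tp + fn)                         # recall / sensitivity
        fpr = fp / (fp + tn)                         # 1 - specificity
        precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
        points.append((float(t), fpr, tpr, precision))
    return points

# Hypothetical classifier scores and true labels, for illustration only.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 0, 1, 0, 1, 0, 0])

for t, fpr, tpr, precision in curve_points(scores, labels, [0.0, 0.5, 1.0]):
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}  precision={precision:.2f}")
```

Plotting TPR against FPR over a fine grid of thresholds gives the false positive vs true positive curve, and plotting precision against TPR gives the precision vs recall curve.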
spam 45 328
Conclusion
This article inspects the capability of ELM and SVM for the classification of spam email. The adopted
ELM and SVM classify emails based on the class label (d). The performances of ELM and SVM are
encouraging, as both have achieved a high accuracy rate in spam email detection. The performance
of ELM has been compared with that of SVM. The comparative study examines accuracy, precision,
recall, false positives, true positives and a sensitivity analysis of ELM and SVM for spam email
classification. The experimental results establish the effectiveness and efficiency of the ELM and
SVM models. This article therefore concludes that ELM and SVM can be used for solving different
classification problems in computer engineering.
References
[1] B. Zhou, Y. Yao, J. Luo, Cost-sensitive three-way email spam filtering, Journal of Intelligent
Information Systems, Vol. 42, No. 1, pp. 19-45, (2014). doi:10.1007/s10844-013-0254-7
[2] D. Mochrie, Canada's Anti-spam/anti-spyware: An Overview, International Journal of
Franchising Law, Vol. 12, No. 4, (2014).
[3] T. A. Almeida, J. M. G. Hidalgo and A. Yamakami, Contributions to the study of SMS spam
filtering: new collection and results, In Proceedings of the 11th ACM symposium on Document
engineering, pp. 259-262, (2011). doi:10.1145/2034691.2034742
[4] Y. Meng and L. F. Kwok, Enhancing email classification using data reduction and disagreement-based
semi-supervised learning, In Communications (ICC), 2014 IEEE International Conference
on, pp. 622-627, (2014). doi:10.1109/icc.2014.6883388
[5] S. S. Roy, S. Charaborty, S. Sourav and A. Abraham, Rough set theory approach for filtering
spams from boundary messages in a chat system, In Intelligent Systems Design and Applications
(ISDA), 2013 13th International Conference on, pp. 28-34, IEEE, (2013). doi:10.1109/isda.2013.6920763
[6] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk
e-mail, In Learning for Text Categorization: Papers from the 1998 workshop, Vol. 62, pp. 98-105,
(1998).
[7] H. Drucker, S. Wu and V. N. Vapnik, Support vector machines for spam
categorization, Neural Networks, IEEE Transactions on, Vol. 10, No. 5, pp. 1048-1054, (1999). doi:10.1109/72.788645
[8] E. Blanzieri and A. Bryl, A survey of learning-based techniques of email spam filtering,
Artificial Intelligence Review, Vol. 29, No. 1, pp. 63-92, (2008). doi:10.1007/s10462-009-9109-6
[9] G. Caruana, M. Li, A survey of emerging approaches to spam filtering, ACM Computing
Surveys (CSUR), Vol. 44, No. 2, p. 9, (2012). doi:10.1145/2089125.2089129
[10] E. Cambria, G. B. Huang, L. L. C. Kasun, H. Zhou, C. M. Vong, J. Lin, J. Liu, Extreme learning
machines [trends & controversies], Intelligent Systems, IEEE, Vol. 28, No. 6, pp. 30-59, (2013). doi:10.1109/mis.2013.140
[11] G. B. Huang, Q. Y. Zhu, C. K. Siew, Extreme learning machine: theory and applications,
Neurocomputing, Vol. 70, No. 1, pp. 489-501, (2006). doi:10.1016/j.neucom.2005.12.126
[12] C. Cortes, V. Vapnik, Support-vector networks, Machine learning, Vol. 20, No. 3, pp. 273-297, (1995).
[13] A. Basu, S. S. Roy and A. Abraham, A Novel Diagnostic Approach Based on Support Vector
Machine with Linear Kernel for Classifying the Erythemato-Squamous Disease, In Computing
Communication Control and Automation (ICCUBEA), 2015 International Conference on, pp. 343-347,
IEEE, (2015). doi:10.1109/iccubea.2015.72
[14] M. Hopkins, E. Reeber, G. Forman, J. Suermondt, Spambase, UCI Machine Learning Repository, (1999).
[15] R. Viswanathan, Pijush Samui, Determination of rock depth using artificial intelligence
techniques, Geoscience Frontiers, (2015). doi:10.1016/j.gsf.2015.04.002
[16] J. Clark, I. Koprinska and J. Poon, A neural network based approach to automated e-mail
classification, p. 702, IEEE, (2003). doi:10.1109/wi.2003.1241300
[17] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA:
University of California, School of Information and Computer Science, (2013).
[18] S. S. Roy, V. M. Viswanatham, P. V. Krishna, N. Saraf, A. Gupta and R. Mishra, Applicability of
Rough Set Technique for Data Investigation and Optimization of Intrusion Detection System, In
Quality, Reliability, Security and Robustness in Heterogeneous Networks, pp. 479-484, Springer
Berlin Heidelberg, (2013). doi:10.1007/978-3-642-37949-9_42
[19] S. S. Roy, D. Mittal, A. Basu and A. Abraham, Stock Market Forecasting Using LASSO
Linear Regression Model, In Afro-European Conference for Industrial Advancement, pp. 371-381,
Springer International Publishing, (2015). doi:10.1007/978-3-319-13572-4_31
International Journal of Engineering Research in Africa Vol. 22
10.4028/www.scientific.net/JERA.22