Classifying Spam Emails Using Artificial Intelligent Techniques

Article  in  International Journal of Engineering Research in Africa · February 2016


DOI: 10.4028/www.scientific.net/JERA.22.152

International Journal of Engineering Research in Africa
ISSN: 1663-4144, Vol. 22, pp 152-161
doi:10.4028/www.scientific.net/JERA.22.152
© 2016 Trans Tech Publications, Switzerland
Submitted: 2015-09-20, Revised: 2015-12-30, Accepted: 2016-01-04, Online: 2016-02-29

Classifying Spam Emails Using Artificial Intelligent Techniques


Sanjiban Sekhar Roy^{1,a,*}, V Madhu Viswanatham^{1,b}
^1 School of Computing Science and Engineering, VIT University, Vellore, India
{s.roy^a, vmadhuviswanatham^b}@vit.ac.in

Keywords: Spam, emails, extreme learning machine, support vector machine, classification.

Abstract: Spam emails have become a growing problem for all web users. These unsolicited messages waste network resources unnecessarily. Machine learning techniques are customarily adopted for filtering email spam. This article examines the capabilities of the extreme learning machine (ELM) and the support vector machine (SVM) for classifying spam emails using the class label (d). ELM is an efficient model based on a single-hidden-layer feed-forward neural network whose hidden-layer weights are chosen randomly. SVM is a method grounded in statistical learning theory that is frequently used for classification. The performance of ELM is compared with that of SVM in terms of accuracy, precision, recall, false positives and true positives. Moreover, a sensitivity analysis has been performed with ELM and SVM for spam email classification.

Introduction
Recently, the quantity of unwanted bulk email messages, i.e. spam, has increased in web communication. These unwanted messages degrade the reliability and authenticity of genuine email messages [1]. Spam has affected organisations heavily: it strains the data-carrying capacity of the internet, which causes high financial losses for companies. Many business models in the spam commercialization sector benefit because sending an email costs little and emails can be sent in large numbers. These reasons have forced a few nations to make legislative changes [2]. In the literature, two types of spam filtering techniques can be found: machine learning methods and non-machine learning (SMTP) methods. Machine learning techniques are more popular than non-machine learning approaches [3], and they can be further subdivided into content-based and non-content-based strategies. Plenty of research has been carried out on machine learning based spam filters [1][3]. False positives are the main concern because of their consequences when filtering spam: there may be circumstances in which ham messages are classified as spam. Such situations can be reduced somewhat by applying several dissimilar classification techniques simultaneously, but this remains a matter of debate [4][5]. In addition, the performance of a classifier depends on the training set. The behaviour of spammers and the ways in which spam payloads are delivered also produce diverse sets of test emails, which call for innovative methods if a spam filter is to remain effective. The majority of spam filters are developed on the basis of text classification, and achieving high accuracy with any spam classification model is a matter of concern. The Bayesian method of spam detection is the most popular technique [6]. In 1999, H. Drucker et al. proposed a support vector machine strategy to filter spam [7]. Other than machine learning, there exists another approach, the SMTP (Simple Mail Transfer Protocol) technique mentioned earlier [8]. The SMTP approach handles route verification, the authenticity of the information exchange, and SMTP checking [9]. In this work, we adopted two effective machine learning techniques for spam detection, namely the extreme learning machine (ELM) and the support vector machine (SVM). Both methods have shown good performance on test email instances.
ELM is regarded as a single-hidden-layer feedforward neural network (SLFN) [11]. In the extreme learning machine, the hidden neurons can be generated independently of the application. ELM can perform classification and can also be adopted for function approximation [10]. It draws a connection between neural networks, matrix theory and regression models, especially ridge regression. It learns fast, is easy to implement and needs little human intervention [11]. Therefore, ELM is a potential alternative to other machine learning techniques for large-scale data. It supports a large number of activation functions, including some piecewise continuous functions. The feedforward neural network is a popular neural network algorithm that can map complex non-linear functions, but its training is slow. ELM overcomes this limitation and can perform better than a conventionally trained feedforward neural network. Many authors have worked on the universal approximation capability of feedforward neural networks.
The second adopted method is the support vector machine, a theoretically well-founded learning technique that can analyse data and find patterns in it [12]. SVM maps the training inputs to a high-dimensional feature space and finds the maximum margin of separation between the classes. Applications of SVM and other classification techniques can be found in the literature [13][18][19]. In this work, the accuracy of spam detection, actual vs predicted graphs and confusion matrices are obtained for both ELM and SVM. A detailed comparative study of the two methods is presented in Table 4 and Table 5.

Data set
In this work, the spam data set spambase was collected from the UCI machine learning repository for the experiments [14]. The ELM technique uses a basic feed-forward neural network for classification and regression [11]. The data set contains 4601 email instances, of which 1813 (39.4%) are spam and the remaining 2788 (60.6%) are non-spam. There are 58 features in total: 57 continuous attributes and one nominal class label (d). Different spam email corpora are listed in Table 1.
Table 1. Different types of spam corpora [16][17]

Data Set     No. of Emails   Spam Emails   Legitimate Emails
PU1          1099            481           618
Ling Spam    2893            481           2412
U5Spam       982             82            90
UCI          4601            1813          2788



Figure 1. Empirical and Cumulative distribution of data set


Table 2. Statistical output of the data set namely spambase

Variable   mean        sd          skewness    kurtosis
d          0.3940448   0.4886977   0.4338114   1.187404
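
The statistics in Table 2 can be checked from the class counts alone, since d takes only the values 1 (spam) and 0 (non-spam). The sketch below is a hedged illustration: the paper does not say which estimators were used, so the sample standard deviation and the bias-corrected skewness and kurtosis formulas are assumptions, chosen because they appear to reproduce the tabulated values.

```python
import math

# Class label d: 1 = spam, 0 = non-spam, using the counts stated in the text.
N_SPAM, N_HAM = 1813, 2788
d = [1] * N_SPAM + [0] * N_HAM
n = len(d)

mean = sum(d) / n

def central_moment(k):
    """k-th central moment of d with the 1/n normalisation."""
    return sum((v - mean) ** k for v in d) / n

m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(m2 * n / (n - 1))

# Bias-corrected skewness and kurtosis (assumed estimators, not stated in the paper).
skewness = math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5
excess = m4 / m2 ** 2 - 3.0
kurtosis = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * excess + 6.0) + 3.0

print(f"mean={mean:.7f} sd={sd:.7f} skewness={skewness:.7f} kurtosis={kurtosis:.6f}")
```

The mean is simply the spam fraction 1813/4601, which is why it matches the 39.4% quoted for the data set.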

Figure 2. Cullen and Frey Graph applied to data set



The Extreme Learning Machine Technique of Spam Classification


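ELM is regarded as a single-hidden-layer feed-forward neural network (SLFN) [11]. The association between output t and input x is given as [15]

$$\sum_{i=1}^{L} \beta_i \, G(a_i, b_i, x_j) = t_j, \qquad j = 1, \ldots, N \tag{1}$$

where N is the number of instances and L is the number of hidden nodes. Here, $\beta_i$ is the output weight of the ith hidden node and $(a_i, b_i)$ are the parameters of the ith hidden node. The input to the ELM is given as x, which comprises all the inputs of the 'spambase' dataset, excluding the decision attribute. The input can be written as $x = \{x_1, \ldots, x_n\}$ and t is the decision attribute. Equation (1) can be re-written as

$$H\beta = T \tag{2}$$

and H can be described as

$$H = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_L \end{bmatrix}, \qquad T = \begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix}$$

From equation (2) the value of $\beta$ can be calculated as

$$\beta = H^{\dagger} T \tag{3}$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H.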
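
The training procedure behind equations (1) to (3) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid form of G, the uniform sampling range for the hidden parameters, the toy data and all function names are hypothetical, and the output weights are obtained from ridge-regularised normal equations as a numerically stable stand-in for the generalized inverse.

```python
import math
import random

def hidden_output(a, b, x):
    """G(a_i, b_i, x_j): sigmoid hidden node (activation choice is assumed)."""
    return 1.0 / (1.0 + math.exp(-(sum(ai * xi for ai, xi in zip(a, x)) + b)))

def hidden_matrix(X, nodes):
    """N x L matrix H of Eq. (2): row j holds G(a_i, b_i, x_j) for i = 1..L."""
    return [[hidden_output(a, b, x) for a, b in nodes] for x in X]

def solve(A, y):
    """Solve the square system A z = y by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def train_elm(X, t, L=20, ridge=1e-3, seed=0):
    """Draw (a_i, b_i) at random, then solve for the output weights beta."""
    rng = random.Random(seed)
    dim = len(X[0])
    nodes = [([rng.uniform(-1, 1) for _ in range(dim)], rng.uniform(-1, 1))
             for _ in range(L)]
    H = hidden_matrix(X, nodes)
    # Ridge-regularised least squares: (H^T H + ridge * I) beta = H^T t.
    HtH = [[sum(row[i] * row[j] for row in H) + (ridge if i == j else 0.0)
            for j in range(L)] for i in range(L)]
    Htt = [sum(row[i] * tj for row, tj in zip(H, t)) for i in range(L)]
    return nodes, solve(HtH, Htt)

def predict_elm(model, X):
    nodes, beta = model
    return [sum(b * h for b, h in zip(beta, row)) for row in hidden_matrix(X, nodes)]

# Hypothetical toy data (not spambase): label 1 when x1 + x2 > 0.85.
X = [(i / 10, j / 10) for i in range(10) for j in range(10)]
t = [1.0 if x1 + x2 > 0.85 else 0.0 for x1, x2 in X]

model = train_elm(X, t)
labels = [1.0 if out > 0.5 else 0.0 for out in predict_elm(model, X)]
accuracy = sum(p == tj for p, tj in zip(labels, t)) / len(t)
print(f"training accuracy on the toy data: {accuracy:.2f}")
```

Only the output weights are fitted; the hidden parameters stay at their random values, which is what makes the training step a single linear solve.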
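Several graphs are shown below for ELM applied to the test spam emails: false positive vs true positive, precision vs recall, and specificity vs sensitivity (Figures 3, 4 and 5). These graphs are used to examine the performance of the adopted ELM. Accuracy, precision, recall, positive predictive value and negative predictive value, together with the McNemar test (P value), are defined as follows, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Positive predictive value} = \frac{TP}{TP + FP}$$

$$\text{Negative predictive value} = \frac{TN}{TN + FN}$$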
McNemar test (P value): $\chi^2 = \dfrac{(f - g)^2}{f + g}$, where f is the number of outcomes with a positive first test and a negative second test, and g is the number of outcomes with a negative first test and a positive second test. The detailed comparisons between ELM and SVM are shown in Table 4 and Table 5, where the values of all the above parameters have been obtained.
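
The measures defined above can be evaluated directly from the counts reported in Table 5. The sketch below rests on two assumptions that the paper does not state: "nonspam" is treated as the positive class, and the McNemar statistic is computed with a continuity correction, $(|f - g| - 1)^2 / (f + g)$. Both are chosen because they appear to reproduce the values in Table 4. The function names are hypothetical.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision (= positive predictive value), recall and NPV."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "npv": tn / (tn + fn),
    }

def mcnemar(f, g, continuity=True):
    """McNemar chi-squared statistic (1 degree of freedom) and its P value."""
    diff = abs(f - g) - (1.0 if continuity else 0.0)
    chi2 = diff * diff / (f + g)
    # Upper tail of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# ELM counts from Table 5, with "nonspam" as the positive class (assumption).
elm = classification_metrics(tp=539, fp=23, fn=45, tn=328)

# f and g are the two off-diagonal (disagreement) counts of each matrix.
chi2_elm, p_elm = mcnemar(f=23, g=45)
chi2_svm, p_svm = mcnemar(f=126, g=54)

for name, value in elm.items():
    print(f"ELM {name}: {value:.4f}")
print(f"McNemar P value: ELM {p_elm:.5f}, SVM {p_svm:.3g}")
```

Passing `continuity=False` gives the uncorrected statistic instead, which yields smaller P values for the same counts.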

Figure 3. False positive vs True positive

Figure 4. Precision vs Recall

Figure 5. Specificity vs Sensitivity



The Support Vector Machine Technique of Spam Classification


Support vector machine is a technique from statistical learning theory that is frequently used for the classification of data. SVM was introduced by Vapnik [12] and has gained popularity for its attractive features. In this work, spam email classification is limited to two classes: spam and non-spam. The data set consists of the training set S, whose instances belong to two different classes, spam or ham,

$$S = \{(x_1, y_1), \ldots, (x_N, y_N)\}, \qquad x_i \in \mathbb{R}^n, \; y_i \in \{-1, +1\} \tag{4}$$

where $x_i \in \mathbb{R}^n$ is a vector of dimension n and every instance belongs to either spam or non-spam (ham). Our intention is to find a general classifier that can separate the two classes, i.e. spam (+1) and non-spam (-1), given the training data set. The separator is a linear hyperplane of the form

$$f(x) = w \cdot x + k = 0 \tag{5}$$

where w is the weight vector of the separating hyperplane and $k \in \mathbb{R}$ is a bias. The hyperplanes corresponding to spam emails and non-spam emails can be written as

$$w \cdot x_i + k \geq 1 \quad (y_i = 1) \tag{6}$$

$$w \cdot x_i + k \leq -1 \quad (y_i = -1) \tag{7}$$

Equations (6) and (7) can be combined as

$$y_i (w \cdot x_i + k) \geq 1 \tag{8}$$

A slack variable $\xi_i$ is often added to allow for misclassification, so equation (8) can be updated as

$$y_i (w \cdot x_i + k) \geq 1 - \xi_i \tag{9}$$

For the spam class, the perpendicular distance from the origin to the hyperplane $w \cdot x + k = 1$ is $\dfrac{|1 - k|}{\|w\|}$, and for the non-spam class the hyperplane can be written as $w \cdot x + k = -1$, with perpendicular distance $\dfrac{|-1 - k|}{\|w\|}$.

Moreover, the margin $\rho(w, k)$ between the planes is given as

$$\rho(w, k) = \frac{2}{\|w\|} \tag{10}$$

To maximize this margin, the following optimization problem is formed:

$$\text{Minimize} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \tag{11}$$

where C is the capacity factor.
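
The optimization problem (11) can be illustrated with a small linear example. The sketch below is not the kernel SVM used in the experiments: it minimises the equivalent hinge-loss objective by plain subgradient descent, and the toy points, learning rate, epoch count and function names are all hypothetical choices made for illustration.

```python
import math

def train_linear_svm(X, y, C=5.0, lr=0.001, epochs=5000):
    """Subgradient descent on 0.5*||w||^2 + C*sum(max(0, 1 - y_i (w.x_i + k)))."""
    w, k = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_k = list(w), 0.0          # gradient of 0.5 * ||w||^2
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + k
            if yi * score < 1.0:               # constraint (8) violated, slack > 0
                grad_w = [g - C * yi * xj for g, xj in zip(grad_w, xi)]
                grad_k -= C * yi
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        k -= lr * grad_k
    return w, k

def predict(w, k, x):
    """Sign of f(x) = w.x + k from Eq. (5): +1 for spam, -1 for non-spam."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + k >= 0 else -1

# Hypothetical, linearly separable toy points (not taken from spambase).
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
y = [1, 1, 1, -1, -1, -1]

w, k = train_linear_svm(X, y)
margin = 2.0 / math.sqrt(sum(wj * wj for wj in w))   # Eq. (10)
predictions = [predict(w, k, x) for x in X]
print("w =", w, "k =", k, "margin =", margin)
```

Each slack variable in (11) is replaced here by $\max(0, 1 - y_i(w \cdot x_i + k))$, the smallest value that satisfies constraint (9), which turns the constrained problem into an unconstrained one.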
Table 3. Detailed outputs obtained by SVM

SVM parameter                 Value
epsilon                       0.1
cost C                        5
Kernel                        Gaussian radial basis kernel function
sigma                         0.05
Number of support vectors     2264
Objective function value      -1757.389
Training error                0.081035
Cross-validation error        0.066512
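
For reference, a Gaussian radial basis kernel with the sigma value from Table 3 can be written as below. The parameterisation $\exp(-\sigma \|x - z\|^2)$ is an assumption; the paper does not state which convention its sigma follows, and the function name is hypothetical.

```python
import math

def rbf_kernel(x, z, sigma=0.05):
    """Gaussian RBF kernel exp(-sigma * ||x - z||^2) (assumed convention)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)

# Identical points give 1; the value decays as the points move apart.
same = rbf_kernel((1.0, 2.0), (1.0, 2.0))
near = rbf_kernel((1.0, 0.0), (0.0, 0.0))
far = rbf_kernel((10.0, 0.0), (0.0, 0.0))
print(same, near, far)
```

With a kernel in place of the dot product, the separating surface is linear in the induced feature space but non-linear in the original attributes.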

Figures 6, 7, 8 and 9 below show the actual vs predicted plot, false positive vs true positive, precision vs recall, sensitivity vs specificity and accuracy plots obtained when SVM is applied to the test spam emails.

Figure 6. Actual vs Predicted

Figure 7. False positive vs True positive



Figure 8. Precision vs Recall

Figure 9. Accuracy graph with cut-offs

Simulation Results and Comparative Study of ELM and SVM


Table 4 examines the performance of ELM and SVM in terms of accuracy, confidence interval (CI), kappa, McNemar's test (P value), no information rate, sensitivity, specificity, positive predictive value, negative predictive value, prevalence, detection rate, detection prevalence and balanced accuracy.
Table 4. Detailed comparisons of ELM and SVM

                          ELM                 SVM
Accuracy                  0.9273              0.9217
95% CI                    (0.9087, 0.9431)    (0.91, 0.9324)
Kappa                     0.8468              0.8353
Mcnemar's Test P-Value    0.01088             1.21e-07
No Information Rate       0.6246              0.597
P-Value [Acc > NIR]       < 2e-16             < 2.2e-16
Sensitivity               0.9229              0.9607
Specificity               0.9345              0.8641
Pos Pred Value            0.9591              0.9128
Neg Pred Value            0.8794              0.9368
Prevalence                0.6246              0.5970
Detection Rate            0.5765              0.5735
Detection Prevalence      0.6011              0.6283
Balanced Accuracy         0.9287              0.9124
The confusion matrices obtained for ELM and SVM are shown in Table 5.

Table 5. Confusion Matrices of ELM and SVM

Model   Prediction   nonspam   spam
ELM     nonspam      539       23
ELM     spam         45        328
SVM     nonspam      1319      126
SVM     spam         54        801
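
Several rows of Table 4 follow directly from the counts in Table 5. The sketch below recomputes them as a consistency check. It assumes that the rows of Table 5 are predictions, the columns are the actual classes, and "nonspam" is the positive class; the paper does not state this, but the assumption appears to reproduce the tabulated values. The function name is hypothetical.

```python
def confusion_summary(tp, fp, fn, tn):
    """Agreement and prevalence statistics from a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "kappa": (accuracy - expected) / (1.0 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": (tp + fn) / n,
        "detection_rate": tp / n,
        "detection_prevalence": (tp + fp) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
    }

# Counts from Table 5: tp = nonspam predicted nonspam, fp = spam predicted nonspam,
# fn = nonspam predicted spam, tn = spam predicted spam.
elm = confusion_summary(tp=539, fp=23, fn=45, tn=328)
svm = confusion_summary(tp=1319, fp=126, fn=54, tn=801)

for name in elm:
    print(f"{name:22s} ELM {elm[name]:.4f}   SVM {svm[name]:.4f}")
```

The matrix totals are 935 instances for ELM and 2300 for SVM, so under this reading the two models were evaluated on test sets of different sizes.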

Conclusion
This article examines the capability of ELM and SVM for the classification of spam email. The adopted ELM and SVM classify emails based on the class label (d). The performance of both methods is encouraging, as each achieved a high accuracy rate in spam email detection. The performance of ELM has been compared with that of SVM. The comparative study examines accuracy, precision, recall, false positives, true positives and a sensitivity analysis of ELM and SVM for spam email classification. The experimental results establish the effectiveness and efficiency of the ELM and SVM models. This article therefore concludes that ELM and SVM can be used for solving different classification problems in computer engineering.

References
[1] B. Zhou, Y. Yao, J. Luo, Cost-sensitive three-way email spam filtering, Journal of Intelligent Information Systems, Vol. 42, No. 1, pp. 19-45, (2014). doi:10.1007/s10844-013-0254-7
[2] D. Mochrie, Canada's Anti-spam/anti-spyware: An Overview, International Journal of Franchising Law, Vol. 12, No. 4, (2014).
[3] T. A. Almeida, J. M. G. Hidalgo and A. Yamakami, Contributions to the study of SMS spam filtering: new collection and results, In Proceedings of the 11th ACM symposium on Document engineering, pp. 259-262, (2011). doi:10.1145/2034691.2034742
[4] Y. Meng and L. F. Kwok, Enhancing email classification using data reduction and disagreement-based semi-supervised learning, In Communications (ICC), 2014 IEEE International Conference on, pp. 622-627, (2014). doi:10.1109/icc.2014.6883388
[5] S. S. Roy, S. Charaborty, S. Sourav and A. Abraham, Rough set theory approach for filtering spams from boundary messages in a chat system, In Intelligent Systems Design and Applications (ISDA), 2013 13th International Conference on, pp. 28-34, IEEE, (2013). doi:10.1109/isda.2013.6920763
[6] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, In Learning for Text Categorization: Papers from the 1998 workshop, Vol. 62, pp. 98-105, (1998).
[7] H. Drucker, S. Wu and V. N. Vapnik, Support vector machines for spam categorization, Neural Networks, IEEE Transactions on, Vol. 10, No. 5, pp. 1048-1054, (1999). doi:10.1109/72.788645
[8] E. Blanzieri and A. Bryl, A survey of learning-based techniques of email spam filtering, Artificial Intelligence Review, Vol. 29, No. 1, pp. 63-92, (2008). doi:10.1007/s10462-009-9109-6
[9] G. Caruana, M. Li, A survey of emerging approaches to spam filtering, ACM Computing Surveys (CSUR), Vol. 44, No. 2, p. 9, (2012). doi:10.1145/2089125.2089129
[10] E. Cambria, G. B. Huang, L. L. C. Kasun, H. Zhou, C. M. Vong, J. Lin, J. Liu, Extreme learning machines [trends & controversies], Intelligent Systems, IEEE, 28(6), 30-59, (2013). doi:10.1109/mis.2013.140
[11] G. B. Huang, Q. Y. Zhu, C. K. Siew, Extreme learning machine: theory and applications, Neurocomputing, 70(1), 489-501, (2006). doi:10.1016/j.neucom.2005.12.126
[12] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning, 20(3), 273-297, (1995).
[13] A. Basu, S. S. Roy and A. Abraham, A Novel Diagnostic Approach Based on Support Vector Machine with Linear Kernel for Classifying the Erythemato-Squamous Disease, In Computing Communication Control and Automation (ICCUBEA), 2015 International Conference on, pp. 343-347, IEEE, (2015). doi:10.1109/iccubea.2015.72
[14] M. Hopkins, E. Reeber, G. Forman, J. Suermondt, Spambase, UML Repository, (1999).
[15] R. Viswanathan, Pijush Samui, Determination of rock depth using artificial intelligence techniques, Geoscience Frontiers, (2015). doi:10.1016/j.gsf.2015.04.002
[16] J. Clark, I. Koprinska and J. Poon, A neural network based approach to automated e-mail classification, p. 702, IEEE, (2003). doi:10.1109/wi.2003.1241300
[17] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, (2013).
[18] S. S. Roy, V. M. Viswanatham, P. V. Krishna, N. Saraf, A. Gupta and R. Mishra, Applicability of Rough Set Technique for Data Investigation and Optimization of Intrusion Detection System, In Quality, Reliability, Security and Robustness in Heterogeneous Networks, pp. 479-484, Springer Berlin Heidelberg, (2013). doi:10.1007/978-3-642-37949-9_42
[19] S. S. Roy, D. Mittal, A. Basu and A. Abraham, Stock Market Forecasting Using LASSO Linear Regression Model, In Afro-European Conference for Industrial Advancement, pp. 371-381, Springer International Publishing, (2015). doi:10.1007/978-3-319-13572-4_31
