Mining
Jyh-Jian Sheu1, Yin-Kai Chen2, Ko-Tsung Chu3, Jih-Hsin Tang4, Wei-Pang Yang2
Abstract
In this paper, we propose an efficient spam filtering method based on the decision tree data mining
technique. We analyze the association rules about spams and apply these rules to develop a
systematized spam filtering method. Our method possesses the following three major superiorities:
(1) It checks only the e-mail’s header section to avoid the low operating efficiency of scanning the
e-mail’s content; moreover, the accuracy of filtering is enhanced simultaneously. (2) It establishes a
reversing mechanism to help the classification of unknown e-mails, so that the overall accuracy of
our filtering method is increased. (3) It is equipped with a re-learning mechanism, which utilizes the
supervised machine learning method to collect and analyze each misjudged e-mail. Therefore, the
revision information learned from the analysis of misjudged e-mails is incrementally fed back to our
method, improving its ability to identify spams.
Correspondence
Department of Finance, Minghsin University of Science and Technology, Taiwan
E-mail: ktc1009@[Link]
1. Introduction
With the advance of Internet technologies, e-mail has become one of the major communication
channels in modern society. Due to its low cost and convenience, e-mail has become an important
media for spreading advertisements, viruses and detrimental information. These unsolicited e-mails,
also called spams, occupy network bandwidth, decrease people’s work efficiency and even leak
personal information. According to a Hong Kong Anti-SPAM Coalition report dated 2004 [15], it
was estimated that $9 billion was needed to deal with impact brought about by spam on an annual
basis. Furthermore, the Symantec spam report dated May 2014 stated that the global spam rate was
60.6 percent for that month [34]. In other words, almost 61 in 100 e-mails were spams, which is a
serious problem.
Various mechanisms have been proposed to filter out the spammy e-mails, including
White/Black Listing, Grey Listing [4], Rule Learning [29, 30], and the methods based on Text
Classification, such as Naïve Bayes [1, 12, 13, 38], Support Vector Machine, SVM [7] and Boosting
Trees [3, 5], Multi-Agent [17, 31] and Genetic Algorithm [32]. Other approaches combine two
mechanisms or users’ experiences to increase the filtering accuracy such as collaborative filtering
techniques [11].
Among these mechanisms, content-based filtering methods, which scan e-mail’s content, are
widely used and characterized by their effectiveness [1]. However, the process of scanning content
will increase complexity and reduce operation efficiency. Recently, some efficient header-based
methods analyzing only the e-mail’s header section have been proposed [20, 29, 30, 36]. Although
these header-based methods perform better in efficiency, they may be poor at maintaining their accuracy.
The machine learning methods are to collect existing data (denoted as “training data”) and
choose useful attributes of the data to generate meaningful rules or models, which can be applied to
predict the newly arrived data [19, 28, 35]. Machine learning techniques can be divided into two
categories: unsupervised and supervised. In unsupervised learning no labels are used on the training
data to be classified. On the other hand, the supervised learning methods learn the classification by
using a set of man-made examples [18]. Supervised learning algorithms are now applied frequently
[14], such as support vector machines (SVMs) [7], random forests [2], and decision trees [20, 24, 25,
29, 30].
Machine learning methods have two major phases: (1) the training phase and (2) the
classification phase [18, 19, 28, 35]. In the training phase, the prior estimates are captured by
building a model from the training data. The model built using training data is then applied by a
classifier to classify the unknown data in the second phase (i.e., the classification phase). However,
at the end of the training phase, the model or rules are learned from previous data, whose knowledge
may be outdated. If spammers design new spamming techniques, the classifier may fail to detect them.
This study aims to propose an efficient spam filtering mechanism based on machine learning
technique. We will apply the uncomplicated decision tree data mining algorithm to find association
rules about spams from the training e-mails. Based on these association rules, we propose a
systematized three-phase spam filtering method with the following major superiorities:
(1) Checking only e-mail’s header section in order to avoid the low operating efficiency in
scanning e-mail’s content. Moreover, the accuracy of filtering will be enhanced
simultaneously.
(2) In order that probable misjudgments can be “reversed”, we establish a reversing mechanism,
which will calculate a supplementary score to help the classification of each unknown e-mail.
(3) A re-learning mechanism is designed to incrementally improve our method. We utilize the
supervised machine learning to collect and analyze each misjudged e-mail resulting from our
method. Therefore, the revision information learned from the analysis of misjudged e-mails
can incrementally give feedback to our method and improve its ability of identifying spams.
The remainder of this paper is organized as follows. Section 2 discusses the decision tree data
mining algorithm. Section 3 presents the descriptions of our proposed mechanism. The experimental
results of our method are shown in Section 4. Section 5 concludes this paper.
2. The Decision Tree Data Mining Algorithm
A decision tree is one of the data mining methods built upon the tree data structure. General
statistical methods usually can only analyze the surface distribution of data, whereas decision
tree algorithms can find the potential association rules among the important attributes of the
existing data. Moreover, the classification of unknown data can be further predicted from these rules.
An example of a tree is illustrated in Figure 1. Each tree has a start node called the “root node”.
If a node has any nodes beneath it, the bottom nodes are the “children nodes” of the one above, and
the above one is the “parent node” of the bottom ones. For example, node A in Figure 1 is the
root node of the tree, B and C are children nodes of A, and A is the parent node of B.
Moreover, each node without children is called a “leaf node”, such as C, D and E in Figure 1.
The Iterative Dichotomiser 3 (called ID3 for short) is one of the most well-known and effective
decision tree algorithms [24, 25]. In 1999, Stark and Pfeiffer [33] studied the behavior of ID3 and
pointed out that ID3 was better than other decision tree methods, such as C4.5, CHAID, and CART.
As compared with the improved methods of ID3 (for example, C4.5), Ohmann et al. demonstrated
that the quantity of association rules computed by ID3 was not as numerous as that of C4.5 [21]. In
other words, considering the simplicity of rule quantity, ID3 algorithm possessed the superior feature.
Let “Target Attribute” be the attribute which is the concerned objective of our research. For
example, we suppose that the attribute “e-mail type” (“S” means it is spammy; “L” means it is
legitimate) is the Target Attribute in this study. And let “Critical Attributes” be the other important
attributes which interest us in this research. The construction process of decision tree will start from
root node. Note that all of the data instances are initially contained in the root node. ID3 algorithm
will select an unselected critical attribute with the maximum “Information Gain” (the detailed
process will be described later). Then, ID3 algorithm will divide all data instances into children
nodes according to their values of the selected Critical Attribute. Subsequently, each child node
repeats the same process for its own data instances.
In ID3 algorithm, there are two conditions to end the construction process of decision tree: (1)
all of the Critical Attributes are selected; (2) the Target Attribute’s values of all data instances in the
child node are exactly the same. If either of the two conditions is satisfied, the child node will be
signified as a leaf node. Given a leaf node C, it will be labeled by the value of the Target Attribute
possessed by the majority of data instances in C, which is denoted as Label(C). And let
|Label(C)| be the number of data instances in C whose Target Attribute’s value is equal to Label(C).
Then we calculate C’s degree of purity (denoted as Purity(C)) and degree of support (denoted as
Support(C)), and end this node’s execution of the ID3 algorithm. The formulas of Purity(C) and
Support(C) are defined as follows:

Purity(C) = (|Label(C)| / |C|) × 100%
Support(C) = (|C| / N) × 100%

where |C| is the number of data instances contained in node C and N is the number of total data
instances.
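For illustration, the two formulas can be computed directly from a node’s label counts. The following minimal Python sketch (not part of the original system; the data are hypothetical) derives Label(C), Purity(C), and Support(C) for one leaf node:

```python
from collections import Counter

def purity_and_support(node_labels, total_instances):
    """Compute Label(C), Purity(C), and Support(C) (as percentages).

    node_labels: list of Target Attribute values ("S"/"L") of the instances in C.
    total_instances: N, the number of all training instances.
    """
    counts = Counter(node_labels)
    label, majority = counts.most_common(1)[0]          # Label(C) and |Label(C)|
    purity = majority / len(node_labels) * 100          # (|Label(C)| / |C|) * 100%
    support = len(node_labels) / total_instances * 100  # (|C| / N) * 100%
    return label, purity, support

# A node of 4 e-mails (3 spams, 1 legitimate) out of N = 40 instances.
label, purity, support = purity_and_support(["S", "S", "S", "L"], 40)
```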
The detailed process of the ID3 algorithm is summarized as follows. Note that we have modified Step
4 of the ID3 algorithm by adding a stop condition in order to avoid inordinate branching. The
variables P_lower, P_upper, and S_lower indicate the threshold values for stopping the computation.
Step 1. If the Target Attribute’s values of all data instances in node C are exactly the same, then
set C to be a leaf node, let Label(C) be that common value, and stop;
Step 2. If all Critical Attributes are “selected”, then set C to be a leaf node, let Label(C) be the
value of the Target Attribute possessed by the majority of data instances in C, and stop;
Step 3. Compute the Information Gain G(A) for each unselected Critical Attribute A, and select
the one with maximum Information Gain. Divide all data instances contained in node C
into disjoint children nodes according to their values of the selected attribute A;
Step 4. Treat each child node branched in Step 3 as node C. If Purity(C) ≤ P_lower or
Purity(C) ≥ P_upper or Support(C) ≤ S_lower, then stop; else, continue the algorithm.
Considering a certain Critical Attribute A on node C, its Information Gain G( A) concerns the
“Entropy” of node C, which is denoted as E(C) and calculated by the following formula:

E(C) = − Σ_{i=1}^{t} (p_i / n) × log2(p_i / n)

where t is the number of Target Attribute’s values, p_i is the total number of data instances
corresponding to the i-th value of the Target Attribute in C, and n is the number of data instances in C.
Then, the Information Gain G(A) of Critical Attribute A is calculated by using the following
formulas:

G(A) = E(C) − E(A)
E(A) = Σ_{j=1}^{k} (n_j / n) × E(C_j)

where k is the number of values of Critical Attribute A, C_j is the child node
including the data instances corresponding to the j-th value of Critical Attribute A, and n_j is the
number of data instances in C_j.
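To make these formulas concrete, the following sketch computes E(C) and G(A) for a toy set of labeled instances. The attribute name `kw` and the data are hypothetical, chosen only to illustrate the computation:

```python
import math
from collections import Counter

def entropy(labels):
    """E(C) = -sum_i (p_i/n) * log2(p_i/n) over the t Target Attribute values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """G(A) = E(C) - E(A): entropy of the node minus the weighted entropy
    of the children nodes C_j obtained by splitting on Critical Attribute A."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    e_a = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - e_a

# Toy example: a binary Critical Attribute that separates the classes perfectly,
# so G(A) equals the full entropy E(C).
rows = [{"kw": 1}, {"kw": 1}, {"kw": 0}, {"kw": 0}]
labels = ["S", "S", "L", "L"]
gain = information_gain(rows, labels, "kw")
```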
Finally, each leaf node in the resulted decision tree will be labeled as a value of the Target
Attribute. And each path constructed from the root node to a leaf node will form an association rule. In
other words, all of the internal nodes on the path construct a row of “if” judgments on several
Critical Attributes, with the “then” result presented by the labeled value (i.e., Label(C)) of the leaf
node.
3. System Architecture
In this paper, we propose a systematized three-phase spam filtering method based on decision
tree data mining technique. In the method, we construct a reversing mechanism to avoid the
misjudgment of unknown e-mails and design a re-learning mechanism to incrementally improve the
filter’s ability to identify spams accurately. As shown in Figure 2, our method can be divided into the
following three phases:
(1) Training Phase: The purpose of this phase is to find association rules about spams by
analyzing only the header sections of training e-mails. And the association rules will be
applied to classify unknown e-mails in the second phase. There exist two major modules in
the Training Phase: Rule Constructing Module and Reversing Mechanism’s Setup Module.
The Rule Constructing Module will check the Critical Attributes of e-mails and apply the
decision tree algorithm ID3 to compute the potential association rules of “if-then” pattern,
which will be stored into the Rule-Database. And the Reversing Mechanism’s Setup
Module will initialize the Reversing-Database used by the reversing mechanism.
(2) Classification Phase: This phase is to classify the unknown e-mails. Each unknown e-mail
will be scored by applying Rule-Database and the reversing mechanism together. According
to the computed score, each unknown e-mail can be classified as either a legitimate e-mail
or a spam.
(3) Re-learning Phase: This phase will incrementally learn the revision information by
analyzing misjudged e-mails resulting from the Classification Phase to improve our filtering
method. Thus, those misjudgments will give feedback to our filtering method and strengthen
its ability to identify spams.
Occasionally, the header-based method may unavoidably suffer from a deficiency of information
in identifying unknown e-mails. In order that misjudgments can be “reversed”, we establish a
reversing mechanism with the Reversing-Database in our filtering system. Our reversing mechanism
operates as follows:
(1) In the Training Phase, the Reversing Mechanism’s Setup Module will initialize the
Reversing-Database, which will be applied to compute an auxiliary score for each unknown
e-mail in the Classification Phase.
(2) In the Classification Phase, each unknown e-mail is first examined and scored by applying
the Rule-Database to compute its original score. Note that the original score of an unknown
e-mail implies its tendency to be identified as a spam. Obviously, if the computed original
score of a legitimate e-mail is high, it should be decreased. On the other hand, if the
computed original score of a spam is low, it should be increased. Therefore, we
arrange that each unknown e-mail should be examined once again to compute its additional
score by checking the corresponding reversing table in the Reversing-Database. Then the
classification of this unknown e-mail will be judged according to the sum of the original score
and the additional score. Thus, the latent misjudgment of this unknown e-mail is likely to be
“reversed” by the effect of the additional score.
(3) The Re-Learning Phase utilizes the supervised machine learning method to analyze each
misjudged e-mail resulting from the Classification Phase. According to the revision
information learned from the analysis of those misjudged e-mails, our re-learning mechanism
will improve the Reversing-Database. Hence, the misjudged e-mails can incrementally give
feedback to our filtering method.
Then, we will introduce the detailed procedures of Training Phase, Classification Phase, and
Re-learning Phase. Note that each training e-mail or unknown e-mail is handled first by the
pre-processing procedure. In the pre-processing procedure, the header section of each e-mail will be
examined. First, the meaningless stop words will be removed from the header section. Then, the
Porter stemming algorithm [23] is applied to strip suffixes from English words. Thus, the noise in
the fields of the header section can be reduced.
In the Training Phase, numerous e-mails collected in advance are taken as the training data for
our spam filtering method. The main purpose of this phase is to seek for association rules between
the Critical Attributes and the Target Attribute of training e-mails. Then these rules will be applied to
classify unknown e-mails in the Classification Phase. As shown in Figure 3, the Training Phase
contains two major modules: Rule Constructing Module and Reversing Mechanism’s Setup Module,
which are described as follows.
We set the attribute “e-mail type” to be the Target Attribute in this study. If the training e-mail is
spammy, its e-mail type will be denoted as “S”. On the other hand, if the e-mail is legitimate, its
e-mail type will be denoted as “L”. Moreover, as shown in Table 1, nine Critical Attributes of binary
values are defined by surveying the important fields of e-mail’s header section and referring to the
related researches [29, 30, 36]. These 9 Critical Attributes are divided into three categories of
“Sender”, “Title”, and “Time and size”. We will apply the ID3 decision tree algorithm to analyze the
associative rules between the 9 Critical Attributes and the Target Attribute of the training e-mails.
The detailed process of the Rule Construction Module is described by the following stages.
In this stage, each training e-mail will be checked to capture the values of all necessary Critical
Attributes. For each training e-mail, the Target Attribute’s value depends on its type (“S” means it is
spammy; “L” means it is legitimate). And the 9 Critical Attributes are defined as shown in Table 1,
whose values are decided by checking the corresponding fields of header section and looking up the
spam keywords table (if necessary). The table of spam keywords contains suspicious keywords
found frequently in spams. In this study, we will take the spam keywords table proposed by
This stage employs the decision tree data mining algorithm ID3 to look for the association rules
between the Target Attribute and Critical Attributes. The captured attributes of training e-mails
mentioned above will be input into the algorithm ID3 to build decision tree, which will bring out the
potential association rules of “if-then” pattern between the 9 Critical Attributes and the Target
Attribute.
Then, we will score each rule by using formulas based on the values of its degree of support
and degree of purity. Given an association rule R, we assume that C is its leaf node, n is the number
of e-mails whose Target Attribute’s value is “S” in node C, and Support(R) records the degree
of support of this rule. We compute the values of degree of support for all rules, and denote the
maximum one as Support_MAX and the minimum one as Support_MIN. Let Support(C), Purity(C),
and Label(C) be defined as mentioned earlier. Before describing the scoring formula of rules, we
define the function SpamTendency(R), which records the tendency of rule R toward spam. It
is defined as follows:

SpamTendency(R) = Purity(C) if Label(C) is “S” (spam);
SpamTendency(R) = (n / |C|) × 100% otherwise.
The function W(R) records the weighted value of rule R, which is computed as follows:

W(R) = (Support(C) / (Support_MAX − Support_MIN)) × 100%.

Assume that W_MAX is the maximum and W_MIN is the minimum of the weighted values of all
rules computed by the above formula. Then, the function S(Support(R)) will record the score of
the degree of support of rule R:

S(Support(R)) = ((W(R) − W_MIN) / (W_MAX − W_MIN)) × 100%.
Now, we can compute the score of rule R, which is recorded by the function RuleScore(R). It
is composed of SpamTendency(R) and S(Support(R)) in a ratio of 7:3, which is defined as
follows:

RuleScore(R) = SpamTendency(R) × 0.7 + S(Support(R)) × 0.3.

After computing the scores, all of the rules are stored into the Rule-Database, which keeps the
extracted association rules and will be accessed by the Classification Phase to classify unknown
e-mails. Moreover, we choose the minimum rule score among the rules whose SpamTendency(R) is
more than 80%, and set it as the threshold for judging whether an unknown e-mail is spam.
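Under our reading of the scoring formulas above, a rule’s score can be sketched as follows. The function and parameter names are reconstructions of the garbled notation, and the numbers used are hypothetical:

```python
def rule_score(purity, label, n_spam, node_size, support,
               support_max, support_min, w_max, w_min):
    """Sketch of RuleScore(R) for a rule R with leaf node C.

    purity, support: Purity(C) and Support(C), in percent.
    label: Label(C), "S" (spam) or "L" (legitimate).
    n_spam: number of spams in C; node_size: |C|.
    support_max/min, w_max/min: extremes over all rules.
    """
    # SpamTendency(R): Purity(C) when the leaf is labeled spam, otherwise
    # the raw fraction of spams in the leaf.
    if label == "S":
        spam_tendency = purity
    else:
        spam_tendency = n_spam / node_size * 100

    # W(R) weights the rule by its support; S(Support(R)) then normalizes
    # W(R) into [0, 100] using the min-max of all rules' weights.
    w = support / (support_max - support_min) * 100
    s_support = (w - w_min) / (w_max - w_min) * 100

    # RuleScore(R): the two components blended in a 7:3 ratio.
    return 0.7 * spam_tendency + 0.3 * s_support

# Hypothetical rule: purity 90%, spam leaf, support 10% (extremes 5%-30%),
# weighted-value extremes 0 and 100.
score = rule_score(90.0, "S", 3, 4, 10.0, 30.0, 5.0, 100.0, 0.0)
```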
The Reversing Mechanism’s Setup Module will then initialize the Reversing-Database,
which will be applied by the Classification Phase to calculate an additional score for each unknown
e-mail. Thus, the latent misjudgment of an unknown e-mail is likely to be “reversed” by the effect
of the additional score.
In the Reversing-Database, we construct a reversing table for each rule of the Rule-Database.
Each reversing table records the 9 items of Critical Attributes as mentioned in Table 1. Moreover,
each item in this table has two parameters: Plus-Value and Minus-Value, which record the adjustment
values that will be added to the scores of unknown e-mails. The Plus-Value records a positive
integer value, which implies a supplement to the score of an unknown e-mail. Moreover, the
Minus-Value records a negative integer value, which implies a subtraction from the score of an
unknown e-mail. An example of reversing table is shown in Table 2. Given a rule Ri , we denote its
corresponding reversing table as RT(Ri). Moreover, we denote the nine items of the reversing table
as RT(Ri)[1] to RT(Ri)[9], and the two parameters of item RT(Ri)[j] as RT(Ri)[j].plus and
RT(Ri)[j].minus.
By using the scored rules of Rule-Database to examine attributes of training e-mails, we can
classify them and collect the misjudged ones. The algorithm RT_Initial will initialize the reversing
tables by applying these misjudged training e-mails. The process of algorithm RT_Initial is
illustrated in Figure 4. Note that each initial value of RT(Ri)[j].plus and RT(Ri)[j].minus for
1 ≤ j ≤ 9 is set as zero before performing RT_Initial. Moreover, the parameters I+ and I− are
two positive integers, which are the basic units to adjust the values of RT(Ri)[j].plus and
RT(Ri)[j].minus, respectively.
Step 1. Check this misjudged training e-mail to capture the values of its nine Critical Attributes.
Step 2. According to the values of Critical Attributes, this training e-mail will dovetail with some
rule, say Ri, in the Rule-Database. Then the corresponding reversing table
associated with this dovetailed rule Ri will be chosen from the Reversing-Database.
Step 3. If this misjudged training e-mail is “False-Positive” (a legitimate e-mail judged as a
spam), do the following operations:
For 1 ≤ j ≤ 9, check whether this misjudged training e-mail satisfies the statement of
RT(Ri)[j] (“True” or “False”):
If True, then do
RT(Ri)[j].minus ← RT(Ri)[j].minus − I−.
Step 4. If this misjudged training e-mail is “False-Negative” (a spam judged as a legitimate
e-mail), do the following operations:
For 1 ≤ j ≤ 9, check whether this misjudged training e-mail satisfies the statement of
RT(Ri)[j] (“True” or “False”):
If True, then do
RT(Ri)[j].plus ← RT(Ri)[j].plus + I+.
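The adjustment loop can be sketched with a dict-based stand-in for a reversing table. The default units follow our reading of the optimized values (I+, I−) = (1, 12) reported in Section 4; the table representation itself is hypothetical:

```python
def rt_initial_update(table, satisfied, misjudged_as, i_plus=1, i_minus=12):
    """Sketch of the RT_Initial adjustments for one misjudged training e-mail.

    table: list of 9 dicts with "plus" and "minus" values, standing in for RT(Ri).
    satisfied: list of 9 booleans, whether the e-mail satisfies RT(Ri)[j].
    misjudged_as: "FP" (legitimate judged as spam) or "FN" (spam judged as legitimate).
    """
    for j in range(9):
        if not satisfied[j]:
            continue
        if misjudged_as == "FP":
            # False positive: deepen the (negative) Minus-Value by I-.
            table[j]["minus"] -= i_minus
        elif misjudged_as == "FN":
            # False negative: raise the Plus-Value by I+.
            table[j]["plus"] += i_plus

# A fresh reversing table (all zeros), updated by one false positive that
# satisfies only the first Critical Attribute item.
table = [{"plus": 0, "minus": 0} for _ in range(9)]
rt_initial_update(table, [True] + [False] * 8, "FP")
```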
The task of this phase is to classify each unknown e-mail to be either a legitimate e-mail or a
spam according to the association rules learned in the Training Phase. The first step is to extract the
Critical Attributes of each unknown e-mail, and find the dovetailed association rule in Rule-Database
to compute the original score for this e-mail. Then, this unknown e-mail’s attributes will be
examined once again to compute its additional score by checking the corresponding reversing table
in Reversing-Database. Finally, this unknown e-mail can be classified according to the total score
composed of original and additional scores. We describe the process of this phase in the following
stages:
In this stage, each unknown e-mail will be examined to capture the values of nine Critical
Attributes as shown in Table 1. The values of these nine Critical Attributes are decided by checking
the corresponding fields of unknown e-mail’s header section and looking up the spam keywords table
(if necessary).
According to the values of Critical Attributes, this unknown e-mail will dovetail with some
association rule, say Ri, in the Rule-Database built in the Training Phase. And we will set the
original score of the unknown e-mail to be RuleScore(Ri), which is as mentioned previously in the
Training Phase. Assume that the original score of this unknown e-mail is named scoreA.
Assume that the additional score of this unknown e-mail is named scoreB. In this stage, we
will access the corresponding reversing table of the dovetailed rule Ri. First, the reversing table
RT(Ri) will be retrieved from the Reversing-Database. Then this unknown e-mail’s attributes will be
examined again to compute its additional score (i.e., scoreB) by executing the following steps:
Step 1. Initialize scoreB as zero.
Step 2. For 1 ≤ j ≤ 9, check whether this unknown e-mail satisfies the statement of RT(Ri)[j]
(“True” or “False”):
If True, then add RT(Ri)[j].plus and RT(Ri)[j].minus to scoreB.
Now the total score of this unknown e-mail can be obtained by adding up scoreA and scoreB.
Then this unknown e-mail will be classified as a spam if scoreA + scoreB is not less than the
threshold set in the Training Phase, and as a legitimate
e-mail otherwise. In this research, we apply the supervised machine learning method to collect the
misjudged e-mails for further analysis in Re-learning Phase. Therefore, in this stage, the supervisor
will monitor the classification results of all unknown e-mails and collect the misjudged ones.
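Putting the two scores together, the classification decision can be sketched as follows. Summing both the Plus- and Minus-Values of the satisfied items is our reading of the additional-score step, and the threshold and table values used are hypothetical:

```python
def classify(score_a, reversing_table, satisfied, threshold):
    """Sketch of the Classification Phase decision for one unknown e-mail.

    score_a: original score from the dovetailed rule, RuleScore(Ri).
    reversing_table: list of 9 dicts with "plus" and "minus" values (RT(Ri)).
    satisfied: list of booleans, whether the e-mail satisfies each RT(Ri)[j].
    threshold: spam threshold chosen in the Training Phase.
    """
    # scoreB sums the Plus- and Minus-Values of every satisfied item; the
    # Minus-Values are negative, so they pull the total score down.
    score_b = sum(
        item["plus"] + item["minus"]
        for item, hit in zip(reversing_table, satisfied)
        if hit
    )
    return "spam" if score_a + score_b >= threshold else "legitimate"
```

For example, with scoreA = 70 and one satisfied item holding plus = 5 and minus = −2, the total is 73, so the e-mail is judged spam against a threshold of 72 but legitimate against 74.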
During the Re-learning Phase, the misjudged e-mails collected in the above phase (the Classification
Phase) will be used by algorithm RT_Modify to modify the reversing tables in Reversing-Database.
The process of RT_Modify is similar to RT_Initial but not exactly the same. In RT_Modify, the
parameters M+ and M− are the basic units to adjust the values of RT(Ri)[j].plus and
RT(Ri)[j].minus, respectively. Its steps are as follows:
Step 1. Check this misjudged e-mail to capture the values of its nine Critical Attributes.
Step 2. According to the values of Critical Attributes, this misjudged e-mail will dovetail with
some rule, say Ri, in the Rule-Database. Then choose the corresponding reversing table
RT(Ri) from the Reversing-Database.
Step 3. If this misjudged e-mail is “False-Positive” (a legitimate e-mail judged as a spam),
do the following operations:
For 1 ≤ j ≤ 9, check whether this misjudged e-mail satisfies the statement of RT(Ri)[j]
(“True” or “False”):
If True, then do
RT(Ri)[j].minus ← RT(Ri)[j].minus − M−.
Step 4. If this misjudged e-mail is “False-Negative” (a spam judged as a legitimate e-mail),
do the following operations:
For 1 ≤ j ≤ 9, check whether this misjudged e-mail satisfies the statement of RT(Ri)[j]
(“True” or “False”):
If True, then do
RT(Ri)[j].plus ← RT(Ri)[j].plus + M+.
4. Experimental Results
In this section, we perform experiments to confirm the accuracy and efficiency of our spam
filtering method. We employ two spam datasets as experimental data: SpamAssassin [27] and
Enron-Spam [8], which are commonly applied in research papers about spams. The dataset of
SpamAssassin consists of 6,827 e-mails (4,894 legitimate e-mails and 1,933 spams), and the dataset
of Enron-Spam consists of 52,076 e-mails (19,088 legitimate e-mails and 32,988 spams). Moreover,
we have collected 10,502 e-mails (4,401 legitimate e-mails and 6,101 spams) in the recent period as
supplements to the above two datasets. Therefore, the total number of experimental e-mails is 69,405
(28,383 legitimate e-mails and 41,022 spams). We denote these e-mails as the third dataset:
Mixed-Set.
The experiments proceed through the following steps. First, we introduce the
efficacy assessment indexes used in this paper. Next, we optimize the parameters (I+, I−, M+, M−)
used in the algorithms RT_Initial and RT_Modify. Then, we conduct a series of experiments to
confirm the efficiency of our filtering method.
To evaluate the performance for our spam filtering method proposed in this paper, we employ
the following efficacy assessment indexes: “Precision”, “Recall”, and “F-measure”, which are
commonly used for document classification. The decision confusion matrix, as shown in Table 3, is
used to explain the calculation equations listed as follows [6, 9, 10]. Note that all four cases A, B,
C, and D denote the numbers of e-mails falling into the corresponding cells of the matrix.
1. Accuracy: the percentage of total e-mails that are correctly recognized. It is defined by the
following formula:
Accuracy = (A + D) / (A + B + C + D).
2. Precision: it calculates the ratio of the e-mails classified correctly among the e-mails judged as
the certain category, representing filter’s capabilities of classifying correctly such category
of e-mails. In this study, we calculate the “Spam Precision” from the perspective of
identifying spams, and the “Legitimate Precision” from the perspective of identifying
legitimate e-mails. And the value of “Precision” is set as the mean of Spam Precision and
Legitimate Precision:

Spam Precision = A / (A + B);
Legitimate Precision = D / (C + D).
3. Recall: it refers to the ratio of the e-mails classified correctly. The “Spam Recall” is defined
as the probability of classifying spammy e-mails correctly as spams, and the “Legitimate
Recall” is defined as the probability of classifying legitimate e-mails correctly. Then, we set
the “Recall” value as the mean of the Spam Recall and Legitimate Recall. The formulas are
listed as follows:
Spam Recall = A / (A + C);
Legitimate Recall = D / (B + D).
4. F-measure: the harmonic mean of the Precision and Recall with equation listed as follows:
F-measure = (2 × Precision × Recall) / (Precision + Recall).
5. FP-rate and FN-rate: FP-rate defines the ratio of misjudging legitimate e-mails as
spammy, and FN-rate defines the ratio of misjudging spams as legitimate. The formulas are
listed as follows:
FP-rate = B / (B + D);
FN-rate = C / (A + C).
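The six indexes can be computed directly from the confusion-matrix counts, with A, B, C, and D mapped as the formulas above imply (A: spams judged spam, B: legitimate judged spam, C: spams judged legitimate, D: legitimate judged legitimate). The counts in the example are hypothetical:

```python
def efficacy_indexes(a, b, c, d):
    """Compute the efficacy assessment indexes from confusion-matrix counts.

    a: spams correctly judged as spam       b: legitimate judged as spam
    c: spams judged as legitimate           d: legitimate correctly judged
    """
    accuracy = (a + d) / (a + b + c + d)
    precision = (a / (a + b) + d / (c + d)) / 2  # mean of Spam/Legitimate Precision
    recall = (a / (a + c) + d / (b + d)) / 2     # mean of Spam/Legitimate Recall
    f_measure = 2 * precision * recall / (precision + recall)
    fp_rate = b / (b + d)                        # legitimate misjudged as spam
    fn_rate = c / (a + c)                        # spams misjudged as legitimate
    return accuracy, precision, recall, f_measure, fp_rate, fn_rate

# Hypothetical run: 40 true spams, 10 false positives, 10 false negatives,
# 40 true legitimates.
results = efficacy_indexes(40, 10, 10, 40)
```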
Before performing the experiments, we must optimize the important parameters used in this
research. First, we optimize the parameters (P_lower, P_upper, S_lower), which indicate the
threshold values used in Step 4 of the ID3 algorithm. By performing many experiments, we found that
the size of the decision tree built by the ID3 algorithm can be pruned acceptably if we set
(P_lower, P_upper, S_lower) as (20%, 90%, 2.5%). Therefore, we use these threshold values as the stop
conditions in Step 4 of the ID3 algorithm. That is, Step 4 can be interpreted as follows: “If
Purity(C) ≤ 20% or Purity(C) ≥ 90% or Support(C) ≤ 2.5%, then stop.”
Next, we optimize the parameters (I+, I−, M+, M−). Note that I+ and
I− are used by the algorithm RT_Initial in the Training Phase, and M+ and M− are used by
RT_Modify in the Re-learning Phase. We set the initial values of (I+, I−, M+, M−) as (1, 1, 1, 1).
Then, we adopted all 6,827 e-mails of the SpamAssassin dataset as testing data to observe the
variation of the Accuracy of our spam filtering system (using both the reversing mechanism and the
re-learning mechanism). First, we fixed the other three parameters and
changed M+ to investigate the variation of the Accuracy of our method. The results were shown in
Table 4. From the experimental results, we observed that the higher the value of M+, the higher the
value of Accuracy. When M+ reached 10, the Accuracy could not be improved further. Thus, we
chose 10 as the optimal value of M+.
Similarly, we manipulated M− and fixed the other three parameters (I+, I−, M+) at (1, 1, 10)
to observe the variation of Accuracy, which was shown in Table 5. Obviously, the Accuracy could
not be improved further once M− reached 7. Therefore, we chose 7 as the optimal value of
M−.
By applying the same approach, we chose 12 as the optimal value of I−. Then we tried to optimize
the value of I+. However, the Accuracy would decrease as soon as we began to adjust I+. Hence, we
kept I+ as 1 and finished the optimization work. Finally, we set the optimal values of the parameters
(I+, I−, M+, M−) as (1, 12, 10, 7), and we adopt these optimized parameters in the experiments of the
next subsection.
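The one-parameter-at-a-time search described above can be sketched generically as follows. Here `evaluate` is a hypothetical stand-in for measuring the filter's Accuracy on the testing data, and the candidate range is illustrative:

```python
def tune(params, order, evaluate, candidates=range(1, 21)):
    """Coordinate-wise tuning sketch: for each parameter in `order`, hold the
    others fixed, sweep the candidate values, and keep the setting with the
    best evaluation score before moving to the next parameter."""
    best = dict(params)
    for name in order:
        best_acc = evaluate(best)
        for value in candidates:
            trial = dict(best, **{name: value})
            acc = evaluate(trial)
            if acc > best_acc:
                best_acc, best = acc, trial
    return best

# Hypothetical objective with a known optimum, used only to demonstrate the
# search; the real evaluation would run the filter and measure Accuracy.
def evaluate(p):
    return -(abs(p["m_plus"] - 10) + abs(p["m_minus"] - 7) + abs(p["i_minus"] - 12))

best = tune({"i_plus": 1, "i_minus": 1, "m_plus": 1, "m_minus": 1},
            ["m_plus", "m_minus", "i_minus"], evaluate)
```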
In this subsection, we perform the following three experiments to confirm the efficiency of our
spam filtering method proposed in this paper: (A) using SpamAssassin as the experimental dataset;
(B) using Enron-Spam as the experimental dataset; and (C) using Mixed-Set as the experimental dataset.
Note that the ratio of legitimate e-mails to spams used in the Training Phase was different from
that in the Classification Phase. In the Training Phase, we took randomly 1000 training e-mails in the
ratio 1:1 (500 legitimate e-mails and 500 spams) from the experimental dataset adopted in each
experiment. Moreover, in the Classification Phase, we would continuously increase the amount of
testing data (unknown e-mails) to observe the performances of our filtering method. Note that those
unknown e-mails were taken randomly from the adopted experimental dataset without any
predefined proportion of legitimate e-mails to spams.
different methods which are combinations of the mechanisms proposed in our filtering system: (I)
Using neither reversing mechanism nor re-learning mechanism; (II) Using only reversing mechanism;
(III) Using both reversing mechanism and re-learning mechanism. Note that (III) is actually
equivalent to the whole filtering system proposed in this paper. The experimental results are
discussed as follows.
In experiment (A), we adopted the SpamAssassin dataset to verify the performance
of our filtering system. The result of this experiment was shown in Figure 5. Obviously, the curve of
Accuracy of method (I), which used neither reversing mechanism nor re-learning mechanism, was
lower than those of methods (II) and (III). With the assistance of reversing mechanism in classifying
unknown e-mails, the curve of Accuracy of method (II) was better than that of method (I). Moreover,
with both of reversing mechanism and re-learning mechanism, the method (III) obtained the most
outstanding accuracy. After applying the re-learning mechanism increasingly, the Accuracy of
method (III) kept improving.
The performances of method (III) evaluated in various efficacy assessment indexes were shown
in Figure 6. Each curve of the four indexes had outstanding exhibition. Among these four curves, the
curve of Recall obtained the highest values, which implied that our filtering system would classify
the unknown e-mails correctly.
In experiment (B), we adopted the Enron-Spam dataset, which
collected a larger number of e-mails as compared with SpamAssassin. The experimental results were
shown in Figure 7. Obviously, the considerable quantities of testing data had made a great impact on
behaviors of the methods (I), (II), and (III). With the support of re-learning mechanism, the method
(III) incrementally learned knowledge from numerous unknown e-mails and obtained the best
accuracy. As compared with experiment (A), the curve of Accuracy of methods (III) in this
experiment was lifted up and reached 0.9765.
The performances of method (III) recorded in various efficacy assessment indexes were shown
in Figure 8. Compared with Figure 6, the curves of the four indexes in this experiment had better
exhibitions. After re-learning from plenty of unknown e-mails, all curves became stable and
reached the desirable values. Moreover, the ratio of Recall eventually reached the highest value of
0.9992, which indicated that our system classified the unknown e-mails perfectly.
The previous two experiments had confirmed that performance of method (III), the filtering
system proposed in this paper, was excellent. In this experiment, we applied only the method (III) to
observe exhibitions of various efficacy assessment indexes. To emphasize the influence of a great
quantity of testing data, we adopted the Mixed-Set as experimental dataset and chose a double
amount of training e-mails in this experiment. Precisely, in the Training Phase, we randomly took
2000 training e-mails from the Mixed-Set without any predefined proportion of legitimate e-mails to
spams. Then, in the Classification Phase, we continuously increased the amount of unknown e-mails
taken randomly from Mixed-Set to observe the performances of our spam filtering method.
The results of this experiment were shown in Figure 9. By taking more training data from a
plentiful dataset, the ratio of each assessment index already exceeded 0.99 at the beginning of this
experiment. After applying a large number of testing data, all curves of the four assessment indexes
converged and reached extremely desirable numerical values.
Moreover, the results of FP-rate (the ratio of misjudging legitimate e-mails as spammy) and FN-rate
(the ratio of misjudging spams as legitimate) in method (III) of the three experiments were shown in
Table 6. Obviously, both the FP-rate and FN-rate were ideal even though the proposed method
applied a small experimental dataset. For example, the FP-rate in experiment (A) was only 0.0014,
which showed that our filtering method infrequently misjudged the legitimate e-mails (that are of
vital importance) as spammy ones. Moreover, the considerable quantities of data instances made a
remarkable improvement on the misjudgment rates, which implied that the filtering method benefited greatly from learning on large datasets.
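The six efficacy assessment indexes reported in these experiments follow the standard confusion-matrix definitions, with spam taken as the positive class (consistent with the FP-rate and FN-rate descriptions above). A minimal sketch of the computation:

```python
def assessment_indexes(tp, fp, fn, tn):
    """Compute the six efficacy assessment indexes from confusion counts.
    tp: spams judged spam, fp: legitimate e-mails judged spam,
    fn: spams judged legitimate, tn: legitimate e-mails judged legitimate."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)           # fraction of spams correctly caught
    precision = tp / (tp + fp)        # fraction of flagged e-mails that are spam
    f_measure = 2 * precision * recall / (precision + recall)
    fp_rate = fp / (fp + tn)          # legitimate e-mails misjudged as spammy
    fn_rate = fn / (fn + tp)          # spams misjudged as legitimate
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision,
            "F-measure": f_measure, "FP-rate": fp_rate, "FN-rate": fn_rate}
```

For example, `assessment_indexes(8, 2, 2, 8)` yields 0.8 for Accuracy, Recall, Precision, and F-measure, and 0.2 for both FP-rate and FN-rate.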
Table 7 showed the comparison between our method and some spam filtering methods proposed in the literature. Note
that some of these methods had to check the whole content of an e-mail, whereas our method would
check the e-mail's header only. Obviously, our method revealed better Accuracy and Recall. We could
observe that the Precision rate of method (III) in experiment (B) was inferior to those of other
methods. However, the Precision rate in experiment (C) reached an almost perfect numerical value,
which implied that our method could incrementally learn classification knowledge under a great quantity of testing data.
5. Conclusions
In this research, we proposed an efficient spam filtering method based on the decision tree data
mining technique, analyzed the association rules about spams, and applied these rules to
develop a systematized spam filtering method. Different from content checking, we classified
e-mails simply by analyzing their basic header data only. Our method possessed the following three
major superiorities: (1) Checking only e-mail’s header section to avoid the low operating efficiency
in scanning e-mail’s content. Moreover, the accuracy of filtering was enhanced simultaneously. (2) In
order that the probable misjudgment in identifying an unknown e-mail could be “reversed”, we had
constructed a reversing mechanism to help the classification of unknown e-mails. Thus, the overall
accuracy of our filtering method would be increased. (3) Our method was equipped with a re-learning
mechanism, which utilized the supervised machine learning method to collect and analyze each
misjudged e-mail. Therefore, the revision information learned from the analysis of misjudged e-mails
incrementally gave feedback to our method, and its ability of identifying spams would be improved.
The results of experiments with a large number of testing data showed that the ratios of
assessment indexes (Accuracy, Recall, Precision, F-measure, FP-rate, and FN-rate) converged and
reached extremely desirable numerical values, which implied that the filtering method proposed in
this paper possessed outstanding performance. Note that one of the advantages of our method was the
reduction of calculation cost. Therefore, the method proposed in this paper can classify unknown
e-mails precisely without consuming too many system resources, which is extremely useful for
meeting the present requirement of judging a large number of unknown e-mails.
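The decision tree at the core of the proposed method is grown, as in Quinlan's ID3/C4.5 family [24, 25], by repeatedly splitting on the attribute with the highest information gain. The following self-contained sketch shows that gain computation for binary attributes; it is an illustrative re-derivation, not the paper's implementation:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels (e.g. 'spam'/'legit')."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def information_gain(samples, attr):
    """Gain of splitting (features, label) samples on binary attribute attr:
    parent entropy minus the weighted entropy of the two child subsets."""
    labels = [lab for _, lab in samples]
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for feat, lab in samples if feat[attr] == value]
        if subset:
            gain -= len(subset) / len(samples) * entropy(subset)
    return gain
```

An attribute that perfectly separates spams from legitimate e-mails attains the maximal gain (equal to the parent entropy), so the tree-building procedure would select it first.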
Acknowledgements
This work is partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under
References
3. Carreras X, Marquez L. Boosting trees for anti-spam email filtering. 4th International
Conference on Recent Advances in Natural Language Processing (RANLP), Bulgaria, Sep. 5-7,
2001; 58-64.
4. Cook D, Hartnett J, Manderson K, Scanlan J. Catching spam before it arrives: domain specific
dynamic blacklists. In Proceedings of the 2006 Australasian Workshops on Grid Computing and
E-Research 2006; 54: 193-203.
5. DeBarr D, Wechsler H. Spam detection using random boost. Pattern Recognition Letters 2012;
33(10): 1237-1244.
6. Delany SJ, Cunningham P, Tsymbal A, Coyle L. A case-based technique for tracking concept
drift in spam filtering. Knowledge-Based Systems 2005; 18: 187-195.
9. Fdez-Riverola F, Iglesias EL, Díaz F, Méndez JR, Corchado JM. Applying lazy learning
algorithms to tackle concept drift in spam filtering. Expert Systems with Applications 2007; 33(1):
36-48.
10. Fdez-Riverola F, Iglesias EL, Díaz F, Méndez JR, Corchado JM. Spamhunting: an
instance-based reasoning system for spam labelling and filtering. Decision Support Systems 2007;
43(3): 722-736.
11. Golbeck J, Hendler J. Reputation network analysis for email filtering. In Proceedings of the First
Conference on Email and Anti-Spam (CEAS), 2004.
12. Guo Y, Zhou L, He K, Gu Y, Sun Y. Bayesian spam filtering mechanism based on decision tree of
attribute set dependence in the mapreduce framework. Open Cybernetics & Systemics Journal
2014; 8: 435-441.
13. Han J, Kamber M. Data Mining Concepts and Techniques. USA: Morgan Kaufmann, 2001;
284-287.
14. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer: New York, 2001.
15. Hong Kong Anti-SPAM Coalition (HKASC). Legislation: One of The Key Pillars in The Fight
Against SPAM. White Paper, 2004.
16. Hsiao WF, Chang TM. An incremental cluster-based approach to spam filtering. Expert Systems
with Applications 2008; 34: 1599-1608.
17. Islam MR, Zhou W, Guo M, Xiang Y. An innovative analyser for multi-classifier e-mail
classification based on grey list analysis. Journal of Network and Computer Applications 2009;
32: 357-366.
18. Jayaraj A, Venkatesh T, Murthy CSR. Loss classification in optical burst switching networks
using machine learning techniques: improving the performance of tcp. IEEE Journal on Selected
Areas in Communications 2008; 26(6): 45-54.
19. Lai CC. An empirical study of three machine learning methods for spam filtering.
Knowledge-Based Systems 2007; 20(3): 249-254.
20. Liu YN, Han Y, Zhu XD, He F, Wei LY. An expanded feature extraction of e-mail header for
spam recognition. Advanced Materials Research 2013; 846:1672-1675.
22. Pölzlbauer G, Lidy T, Rauber A. Decision manifolds - a supervised learning algorithm based on
self-organization. IEEE Transactions on Neural Networks 2008; 19(9): 1518-1530.
23. Porter MF. An algorithm for suffix stripping. Program 1980; 14: 130-137.
24. Quinlan JR. Induction of decision trees. Machine Learning 1986; 1(1):81-106.
25. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann, 1993.
26. Sanpakdee U, Walairacht A, Walairacht S. Adaptive spam mail filtering using genetic algorithm.
The 8th International Conference on Advanced Communication Technology, Phoenix Park, Korea,
Feb. 20-22, 2006; 441-445.
28. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys 2002;
34(1): 1-47.
29. Sheu JJ. An efficient two-phase spam filtering method based on e-mails categorization.
International Journal of Network Security 2009; 8(3): 334-343.
30. Sheu JJ, Chu KT. An efficient spam filtering method by analyzing e-mail's header session only.
International Journal of Innovative Computing, Information and Control 2009; 5(11):
3717-3731.
31. Shih DH, Chiang HS, Lin B. Collaborative spam filtering with heterogeneous agents. Expert
Systems with Applications 2008; 35(4): 1555-1566.
32. Shrivastava JN, Bindu MH. E-mail spam filtering using adaptive genetic algorithm.
International Journal of Intelligent Systems and Applications (IJISA) 2014; 6(2): 54-60.
33. Stark KD, Pfeiffer DU. The application of non-parametric techniques to solve classification
problems in complex data sets in veterinary epidemiology - an example. Intelligent Data
Analysis 1999; 3(1):23-35.
35. Tretyakov K. Machine learning techniques in spam filtering. Technical report, Institute of
Computer Science, University of Tartu, 2004.
36. Wang CC, Chen SY. Using header session messages to anti-spamming. Computers & Security
2007; 26: 381-390.
37. Yang Y. A novel framework based on rough set, ant colony optimization and genetic algorithm
for spam filtering. International Journal of Advancements in Computing Technology 2012; 4(14):
516-525.
38. Zhou B, Yao Y, Luo J. Cost-sensitive three-way email spam filtering. Journal of Intelligent
Information Systems 2014; 42(1): 19-45.
Figures:
[Figure: overview of the proposed spam filtering system. Labels recovered from the diagram: Training e-mail; Training Phase (Pre-processing Module, Rule Construction Module, Reversing Mechanism's Setup); Re-Learning Phase; Unknown e-mail; Legitimate e-mail]
[Figure: architecture of the proposed spam filtering system. Labels recovered from the diagram: Training e-mails; Training Phase (Pre-processing Module: Capturing Critical Attributes, Scoring the rules; Rule Construction Module: RM Generate algorithm, Rule-Database); Reversing Mechanism (Reversing-Database); Re-Learning Phase (Re-Learning algorithm, Misjudged e-mails); Unknown e-mails]
[Figure: flowchart of the re-learning mechanism. Misjudged e-mail → Checking Critical Attributes → Selecting reversing table RT(R) → Checking each of the 9 items in RT(R) → Type of misjudgement: False-Positive → adjusting parameters by reducing I-; False-Negative → adjusting parameters by adding I+; then storing RT(R) back to Reversing-Database]
Figure 5. The result of experiment (A)
Figure 6. The performance of method (III) in experiment (A)
Figure 7. The result of experiment (B)
Figure 8. The performance of method (III) in experiment (B)
Figure 9. The result of experiment (C)
Tables:
Table 1: The 9 Critical Attributes of e-mail
Attribute category | Critical Attribute | Value
Sender | Length of sender's name is abnormal | If the length of sender's name is more than 9 characters, it is set at 1 (True), otherwise 0 (False)
Sender | Either sender's name or address is abnormal | If any of sender's name and address is blank or contains an abnormal symbol, it is set at 1 (True), otherwise 0 (False)
Sender | Spam keyword is found in sender's name or address | If any spam keyword is found, it is set at 1 (True), otherwise 0 (False)
Title | E-mail's title is abnormal | If the title is blank or contains more than 3 wrong (or unknown) words, it is set at 1 (True), otherwise 0 (False)
Title | E-mail's title includes spam keyword (Type I) | If e-mail's title has a spam keyword, it is set at 1 (True), otherwise 0 (False)
Title | E-mail's title includes spam keyword (Type II) | If e-mail's title has at least three spam keywords, it is set at 1 (True), otherwise 0 (False)
Other | Sending date and receiving date are abnormal | If the date of sending distinctly differs from the date of receiving, it is set at 1 (True), otherwise 0 (False)
Other | E-mail's size is abnormal | If e-mail's size is equal to or larger than 8k, it is set at 1 (True), otherwise 0 (False)
Other | E-mail's format | If this e-mail's format is HTML or contains attachments, it is set at 1 (True), otherwise 0 (False)
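As a rough illustration of how the 9 binary attributes in Table 1 could be computed from header fields, the sketch below uses a placeholder spam-keyword list and a simplified "abnormal symbol" check (both are assumptions, since the paper's exact keyword list and symbol rules are not reproduced here); the title's wrong-word count is also omitted for brevity:

```python
import re

# Illustrative placeholder keyword list; the paper's actual list is not given here.
SPAM_KEYWORDS = ["free", "winner", "viagra", "offer"]

def critical_attributes(sender_name, sender_addr, title, size_bytes,
                        sent_date, recv_date, is_html_or_attach):
    """Compute the 9 binary critical attributes of Table 1 (1 = True, 0 = False)."""
    text = (sender_name + " " + sender_addr).lower()
    kw_in_title = sum(1 for k in SPAM_KEYWORDS if k in title.lower())
    return [
        int(len(sender_name) > 9),                         # 1 name length abnormal
        int(not sender_name or not sender_addr
            or bool(re.search(r"[^\w@.\- ]", text))),      # 2 name/address abnormal (assumed symbol check)
        int(any(k in text for k in SPAM_KEYWORDS)),        # 3 keyword in sender name/address
        int(not title.strip()),                            # 4 title abnormal (blank only; word check omitted)
        int(kw_in_title >= 1),                             # 5 title keyword (Type I)
        int(kw_in_title >= 3),                             # 6 title keyword (Type II)
        int(sent_date != recv_date),                       # 7 sending/receiving dates abnormal
        int(size_bytes >= 8 * 1024),                       # 8 size abnormal (>= 8k)
        int(is_html_or_attach),                            # 9 format is HTML or has attachments
    ]
```

The resulting 0/1 vector is exactly the kind of header-only feature set the decision tree and the reversing table operate on.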
Table 2: An example of reversing table RT(R)
No. | Item (True/False) | Plus-Value | Minus-Value
1 | Length of sender's name is abnormal | +4 | -4
2 | Either sender's name or address is abnormal | +4 | -2
3 | Spam keyword is found in sender's name or address | +4 | -4
4 | E-mail's title is abnormal | +4 | -2
5 | E-mail's title includes spam keyword (Type I) | +4 | -2
6 | E-mail's title includes spam keyword (Type II) | +8 | -4
7 | Sending date and receiving date are abnormal | +10 | 0
8 | E-mail's size is abnormal | +4 | -3
9 | E-mail's format | +4 | -2
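The precise way the reversing mechanism uses these revised values is specified by the algorithm earlier in the paper; purely as an illustration, the sketch below assumes a simple additive rule, in which each of the 9 items contributes its Plus-Value when it evaluates True and its Minus-Value when it evaluates False, with the resulting score compared against a threshold M+ (the parameter tuned in Table 4). Both the additive rule and the role of M+ here are assumptions for illustration.

```python
# Table 2 reversing table RT(R): (Plus-Value, Minus-Value) for each of the 9 items.
RT = [(4, -4), (4, -2), (4, -4), (4, -2), (4, -2),
      (8, -4), (10, 0), (4, -3), (4, -2)]

def reversing_score(attributes, table=RT):
    """Assumed additive scoring: sum Plus-Values of True items and
    Minus-Values of False items (illustrative, not the paper's exact rule)."""
    return sum(plus if a else minus
               for a, (plus, minus) in zip(attributes, table))

def reverse_to_spam(attributes, m_plus=10):
    """Reverse a 'legitimate' judgment to spam when the assumed score
    reaches the threshold M+ (here defaulted to 10, a value Table 4 tunes)."""
    return reversing_score(attributes) >= m_plus
```

Under these assumptions, an e-mail whose 9 attributes are all True would score 46 and be reversed to spam, while an all-False e-mail scores -23 and is left as judged.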
Table 3: Four cases of judgment
Table 4: Optimization of parameter M+ with (I+, I-, M-) = (1, 1, 1)
M+ | Accuracy
9 | 0.7515
10 | 0.9054
11 | 0.9054
… | …
Table 5: Optimization of parameter M- with (I+, I-, M+) = (1, 1, 10)
M- | Accuracy
7 | 0.9518
8 | 0.9518
9 | 0.9518
… | …
Table 6: The experimental results of FP-rate and FN-rate
Table 7: Comparison of filtering methods