An Intelligent Three-Phase Spam Filtering Method Based on Decision Tree Data Mining

Jyh-Jian Sheu1, Yin-Kai Chen2, Ko-Tsung Chu3, Jih-Hsin Tang4, Wei-Pang Yang2

1. College of Communication, National Chengchi University, Taiwan


2. Department of Information Management, National Dong Hwa University, Taiwan
3. Department of Finance, Minghsin University of Science and Technology, Taiwan
4. Department of Information Management, National Taipei University of Business, Taiwan

Abstract

In this paper, we propose an efficient spam filtering method based on the decision tree data mining technique: we analyze the association rules about spams and apply these rules to develop a systematized spam filtering method. Our method possesses the following three major advantages: (1) It checks only the e-mail’s header section, avoiding the low operating efficiency of scanning e-mail content while simultaneously enhancing filtering accuracy. (2) So that a probable misjudgment in identifying an unknown e-mail can be “reversed”, we construct a reversing mechanism that assists the classification of unknown e-mails, increasing the overall accuracy of our filtering method. (3) Our method is equipped with a re-learning mechanism, which uses the supervised machine learning method to collect and analyze each misjudged e-mail. The revision information learned from this analysis is incrementally fed back to our method, improving its ability to identify spams.

Keywords: Spam, Data Mining, Decision Tree


Correspondence
Department of Finance, Minghsin University of Science and Technology, Taiwan
E-mail: ktc1009@[Link]
1. Introduction

With the advance of Internet technologies, e-mail has become one of the major communication

channels in modern society. Due to its low cost and convenience, however, e-mail has also become an important medium for spreading advertisements, viruses, and detrimental information. These unsolicited e-mails, known as spams, occupy network bandwidth, decrease people’s work efficiency, and even leak personal information. According to a Hong Kong Anti-SPAM Coalition report dated 2004 [15], an estimated $9 billion was needed annually to deal with the impact of spam. Furthermore, the Symantec spam report of May 2014 stated that the global spam rate was 60.6 percent for that month [34]. In other words, almost 61 in 100 e-mails were spams, which is a serious problem.

Various mechanisms have been proposed to filter out spammy e-mails, including White/Black Listing, Grey Listing [4], Rule Learning [29, 30], and methods based on Text Classification, such as Naïve Bayes [1, 12, 13, 38], Support Vector Machine (SVM) [7], Boosting Trees [3, 5], Multi-Agent systems [17, 31], and Genetic Algorithms [32]. Other approaches combine two mechanisms or users’ experiences to increase filtering accuracy, such as collaborative filtering techniques [11].

Among these mechanisms, content-based filtering methods, which scan an e-mail’s content, are widely used and noted for their effectiveness [1]. However, scanning content increases complexity and reduces operating efficiency. Recently, some efficient header-based methods that analyze only an e-mail’s header section have been proposed [20, 29, 30, 36]. Although header-based methods perform better in efficiency, they may struggle to maintain accuracy, since an e-mail’s header carries less information than its content.

Machine learning methods collect existing data (denoted as “training data”) and choose useful attributes of the data to generate meaningful rules or models, which can be applied to predict newly arrived data [19, 28, 35]. Machine learning techniques can be divided into two categories: unsupervised and supervised. In unsupervised learning, no labels are used on the training data to be classified. In contrast, supervised learning methods learn the classification from a set of man-made examples [18]. Supervised learning algorithms are now applied frequently [14]; examples include support vector machines (SVMs) [7], random forests [2], and decision trees [20, 24, 25, 29, 30].

Machine learning methods have two major phases: (1) the training phase and (2) the classification phase [18, 19, 28, 35]. In the training phase, a model is built from the training data to capture prior estimates. In the second phase (i.e., the classification phase), a classifier applies this model to classify unknown data. However, the model or rules learned in the training phase come from previous data, whose knowledge may be outdated. If spammers devise new spamming techniques, the classifier may fail to detect and filter these novel spams.

This study aims to propose an efficient spam filtering mechanism based on machine learning techniques. We apply a simple decision tree data mining algorithm to find association rules about spams from the training e-mails. Based on these association rules, we propose a systematized three-phase spam filtering method with the following major advantages:

(1) It checks only the e-mail’s header section, avoiding the low operating efficiency of scanning e-mail content while simultaneously enhancing filtering accuracy.

(2) A reversing mechanism is constructed to counter misjudgment in identifying unknown e-mails. So that a probable misjudgment can be “reversed”, the mechanism calculates a supplementary score to help classify each unknown e-mail, increasing the overall accuracy of our filtering method.

(3) A re-learning mechanism is designed to incrementally improve our method. We use supervised machine learning to collect and analyze each e-mail misjudged by our method. The revision information learned from this analysis is incrementally fed back to our method, improving its ability to identify spams.

The remainder of this paper is organized as follows. Section 2 discusses the decision tree data

mining algorithm. Section 3 presents the descriptions of our proposed mechanism. The experimental

results of our method are shown in Section 4. Section 5 concludes this paper.

2. Decision tree data mining algorithm

The decision tree is a data mining method built upon the tree data structure. General statistical methods usually analyze only the surface distribution of data, whereas decision tree algorithms can find potential association rules among the important attributes of existing data. Moreover, the classification of unknown data can then be predicted by comparing the values of their related attributes against these association rules.

An example of a tree is illustrated in Figure 1. Each tree has a start node called the “root node”. If a node has nodes beneath it, the lower nodes are its “child nodes” and it is their “parent node”. For example, node A is the root node of the tree in Figure 1, B and C are child nodes of A, and A is the parent node of B. Moreover, each node without children is called a “leaf node”, such as C, D, and E in Figure 1.

Iterative Dichotomiser 3 (ID3 for short) is one of the most well-known and effective decision tree algorithms [24, 25]. In 1999, Stark and Pfeiffer [33] studied the behavior of ID3 and pointed out that it outperformed other decision tree methods, such as C4.5, CHAID, and CART. Comparing ID3 with its improved variants (for example, C4.5), Ohmann et al. demonstrated that the quantity of association rules computed by ID3 was not as numerous as that of C4.5 [21]. In other words, in terms of simplicity of rule quantity, the ID3 algorithm is superior. Hence, we choose ID3 as the data mining technique in this research.

Let the “Target Attribute” be the attribute that is the objective of our research. For example, we take the attribute “e-mail type” (“S” means spammy; “L” means legitimate) as the Target Attribute in this study. And let the “Critical Attributes” be the other important attributes of interest in this research. The construction of the decision tree starts from the root node, which initially contains all of the data instances. The ID3 algorithm selects an unselected Critical Attribute with the maximum “Information Gain” (the detailed process is described later). Then, the algorithm divides all data instances into child nodes according to their values of the selected Critical Attribute. Subsequently, each child node repeats the same process on its own data instances.

In the ID3 algorithm, there are two conditions that end the construction of the decision tree: (1) all of the Critical Attributes have been selected; (2) the Target Attribute’s values of all data instances in the child node are exactly the same. If either condition is satisfied, the child node is marked as a leaf node. Given a leaf node C, it is labeled by the value of the Target Attribute possessed by the majority of data instances in C, denoted as Label(C). Let |Label(C)| be the number of data instances in C whose Target Attribute’s value equals Label(C). Then we calculate C’s degree of purity (denoted as Purity(C)) and degree of support (denoted as Support(C)), and end this node’s execution of the ID3 algorithm. The formulas of Purity(C) and Support(C) are defined as follows:

Purity(C) = (|Label(C)| / |C|) × 100%

Support(C) = (|C| / N) × 100%

where |C| is the number of data instances contained in node C and N is the total number of data instances.

The detailed process of the ID3 algorithm is summarized as follows. Note that we have modified step 4 of the ID3 algorithm by adding a stop condition to avoid inordinate branching. The variables P_lower, P_upper, and S_lower indicate the threshold values for stopping computation.

Step 1. If the Target Attribute’s values of all data instances in node C are exactly the same, then set C to be a leaf node, compute Purity(C) and Support(C), and stop;

Step 2. If all Critical Attributes are “selected”, then set C to be a leaf node, let Label(C) be the value of the Target Attribute possessed by the majority of data instances in C, compute Purity(C) and Support(C), and stop;

Step 3. Compute the Information Gain G(A) for each unselected Critical Attribute A, and select the one with maximum Information Gain. Divide all data instances contained in node C into disjoint child nodes according to their values of the selected attribute A;

Step 4. Treat each child node branched in Step 3 as node C. If Purity(C) ≤ P_lower, or Purity(C) ≥ P_upper, or Support(C) ≤ S_lower, then stop; else, continue the algorithm recursively from Step 1.
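As an illustration, the steps above can be sketched in Python. This is a minimal sketch, not the authors’ implementation: the threshold values, the dictionary-based instance representation, and names such as `build` and `info_gain` are assumptions made for the example.

```python
import math
from collections import Counter

# Illustrative threshold values (percent); the paper leaves P_lower, P_upper,
# and S_lower as tunable parameters.
P_LOWER, P_UPPER, S_LOWER = 10.0, 90.0, 1.0

def entropy(instances):
    # E(C) over the Target Attribute "type"
    n = len(instances)
    counts = Counter(i["type"] for i in instances)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(instances, attr):
    # G(A) = E(C) minus the weighted entropy of the children split on attr
    n = len(instances)
    values = Counter(i[attr] for i in instances)
    weighted = sum((c / n) * entropy([i for i in instances if i[attr] == v])
                   for v, c in values.items())
    return entropy(instances) - weighted

def make_leaf(instances, n_total):
    # Label(C), Purity(C), and Support(C) of a leaf node
    label, count = Counter(i["type"] for i in instances).most_common(1)[0]
    return {"leaf": True, "label": label,
            "purity": 100.0 * count / len(instances),
            "support": 100.0 * len(instances) / n_total}

def build(instances, unselected, n_total):
    node = make_leaf(instances, n_total)
    # Steps 1-2: identical Target values, or no Critical Attribute left
    if node["purity"] == 100.0 or not unselected:
        return node
    # Step 4: stop when purity or support crosses a threshold
    if (node["purity"] <= P_LOWER or node["purity"] >= P_UPPER
            or node["support"] <= S_LOWER):
        return node
    # Step 3: branch on the unselected attribute with maximum Information Gain
    best = max(unselected, key=lambda a: info_gain(instances, a))
    children = {v: build([i for i in instances if i[best] == v],
                         unselected - {best}, n_total)
                for v in {i[best] for i in instances}}
    return {"leaf": False, "attribute": best, "children": children}
```

For instance, four training e-mails with two binary attributes would produce a tree whose root branches on whichever attribute best separates “S” from “L” instances.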

Considering a certain Critical Attribute A on node C, its Information Gain G(A) concerns the “Entropy” of node C, which is denoted as E(C) and calculated by the following formula:

E(C) = −Σ_{i=1}^{t} (p_i / n) log2(p_i / n)

where t is the number of the Target Attribute’s values, p_i is the total number of data instances corresponding to the i-th value of the Target Attribute in C, and n is the number of data instances in C. Then, the Information Gain G(A) of Critical Attribute A is calculated by using the following formulas:

G(A) = E(C) − E′(A)

E′(A) = Σ_{j=1}^{k} (n_j / n) E(C_j)

where k is the number of values of Critical Attribute A, C_j with 1 ≤ j ≤ k is the subset of C including the data instances corresponding to the j-th value of Critical Attribute A, and n_j is the total number of data instances contained in C_j.
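To make the formulas concrete, the following sketch computes E(C) and G(A) from raw counts; the function names and the count-list representation are this example’s own assumptions.

```python
import math

def entropy(counts):
    """E(C) = -sum_i (p_i/n) * log2(p_i/n); counts holds the number of
    instances per value of the Target Attribute (zero counts are skipped)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent_counts, splits):
    """G(A) = E(C) - sum_j (n_j/n) * E(C_j); splits holds one count list
    per value of Critical Attribute A."""
    n = sum(parent_counts)
    weighted = sum((sum(s) / n) * entropy(s) for s in splits)
    return entropy(parent_counts) - weighted

# A node with 3 spams and 1 legitimate e-mail:
print(round(entropy([3, 1]), 3))                       # 0.811
# A binary attribute that separates them perfectly recovers all the entropy:
print(round(info_gain([3, 1], [[3, 0], [0, 1]]), 3))   # 0.811
```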

Finally, each leaf node in the resulting decision tree is labeled with a value of the Target Attribute, and each path from the root node to a leaf node forms an association rule. In other words, all of the internal nodes on the path constitute a chain of “if” judgments over several Critical Attributes; together with the “then” result given by the labeled value (i.e., Label(C)) of the leaf node, an association rule of the “if-then” pattern is constructed.

3. System Architecture

In this paper, we propose a systematized three-phase spam filtering method based on the decision tree data mining technique. In this method, we construct a reversing mechanism to reduce the

misjudgment of unknown e-mails and design a re-learning mechanism to incrementally improve the

filter’s ability to identify spams accurately. As shown in Figure 2, our method can be divided into the

following three phases:

(1) Training Phase: The purpose of this phase is to find association rules about spams by analyzing only the header sections of training e-mails; these rules are then applied to classify unknown e-mails in the second phase. There are two major modules in the Training Phase: the Rule Constructing Module and the Reversing Mechanism’s Setup Module. The Rule Constructing Module checks the Critical Attributes of e-mails and applies the decision tree algorithm ID3 to compute potential association rules of the “if-then” pattern, which are stored into the Rule-Database. The Reversing Mechanism’s Setup Module is designed to initialize the parameters of our reversing mechanism.

(2) Classification Phase: This phase classifies the unknown e-mails. Each unknown e-mail is scored by applying the Rule-Database and the reversing mechanism together. According to the computed score, each unknown e-mail can be classified as either a legitimate e-mail or a spam.

(3) Re-learning Phase: This phase incrementally learns revision information by analyzing the misjudged e-mails resulting from the Classification Phase, improving our filtering method. Those misjudgments thus give feedback to our filtering method and strengthen its ability to classify unknown e-mails accurately.

Occasionally, the header-based method may unavoidably suffer from a deficiency of information in identifying unknown e-mails. So that such misjudgments can be “reversed”, we establish a reversing mechanism with the Reversing-Database in our filtering system. Our reversing mechanism performs the following three major tasks:

(1) In the Training Phase, the Reversing Mechanism’s Setup Module will initialize the

Reversing-Database, which will be applied to compute an auxiliary score for each unknown

e-mail in the Classification Phase.

(2) In the Classification Phase, each unknown e-mail is first examined and scored by applying

the Rule-Database to compute its original score. Note that the original score of an unknown

e-mail implies its tendency to be identified as a spam. Obviously, if the computed original

score of a legitimate e-mail is high, it should be decreased. On the other hand, if the

computed original score of a spammy e-mail is low, it should be increased. Therefore, we

arrange that each unknown e-mail should be examined once again to compute its additional

score by checking the corresponding parameters of the Reversing-Database. The final classification of this unknown e-mail is judged according to the sum of the original score and the additional score. Thus, a latent misjudgment of this unknown e-mail is likely to be “reversed” by the effect of the additional score.

(3) The Re-Learning Phase utilizes the supervised machine learning method to analyze each

misjudged e-mail resulting from the Classification Phase. According to the revision

information learned from analysis of those misjudged e-mails, our re-learning mechanism

will improve the Reversing-Database. Hence, the misjudged e-mails can incrementally give

feedback to strengthen our reversing mechanism.

Next, we introduce the detailed procedures of the Training Phase, Classification Phase, and Re-learning Phase. Note that each training or unknown e-mail is first handled by a pre-processing procedure, which examines the header section of the e-mail. First, meaningless stop words are removed from the header section. Then, the Porter stemming algorithm [23] is applied to strip suffixes from English words. Thus, the noise in the fields of the e-mail’s header section is reduced.
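A toy sketch of this pre-processing step is shown below. The stop-word list and the crude suffix rules are illustrative stand-ins only; the paper removes a full stop-word list and applies the actual Porter stemmer [23].

```python
# Toy header pre-processing: stop-word removal followed by suffix stripping.
# STOP_WORDS and SUFFIXES below are illustrative stand-ins, not the paper's.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "re", "fw"}
SUFFIXES = ("ing", "edly", "ed", "es", "s")  # checked in this order

def strip_suffix(word):
    # Crude stand-in for a stemmer: drop the first matching suffix,
    # but only if a reasonably long stem remains.
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess_subject(subject):
    # Lower-case, drop stop words, then strip suffixes from the rest.
    tokens = [t.lower() for t in subject.split()]
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]

print(preprocess_subject("Winning the amazing offers"))
# ['winn', 'amaz', 'offer']
```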

3.1 Training Phase

In the Training Phase, numerous e-mails collected in advance are taken as the training data for our spam filtering method. The main purpose of this phase is to search for association rules between the Critical Attributes and the Target Attribute of the training e-mails.

classify unknown e-mails in the Classification Phase. As shown in Figure 3, the Training Phase

contains two major modules: Rule Constructing Module and Reversing Mechanism’s Setup Module,

which will be introduced as follows.

3.1.1 Rule Construction Module

We set the attribute “e-mail type” to be the Target Attribute in this study. If the training e-mail is

spammy, its e-mail type will be denoted as “S”. On the other hand, if the e-mail is legitimate, its

e-mail type will be denoted as “L”. Moreover, as shown in Table 1, nine Critical Attributes of binary

values are defined by surveying the important fields of e-mail’s header section and referring to the

related research [29, 30, 36]. These nine Critical Attributes are divided into three categories:

“Sender”, “Title”, and “Time and size”. We apply the ID3 decision tree algorithm to analyze the association rules between the nine Critical Attributes and the Target Attribute of the training e-mails. The detailed process of the Rule Construction Module is described in the following stages.

Stage 1. Capturing Critical Attributes

In this stage, each training e-mail will be checked to capture the values of all necessary Critical

Attributes. For each training e-mail, the Target Attribute’s value depends on its type (“S” means it is

spammy; “L” means it is legitimate). The nine Critical Attributes are defined as shown in Table 1; their values are decided by checking the corresponding fields of the header section and looking up the spam keywords table (if necessary). The table of spam keywords contains suspicious keywords

found frequently in spams. In this study, we adopt the spam keywords table proposed by Sanpakdee et al. [26].

Stage 2. Constructing the decision tree

This stage employs the decision tree data mining algorithm ID3 to look for the association rules between the Target Attribute and the Critical Attributes. The captured attributes of the training e-mails mentioned above are input into the ID3 algorithm to build the decision tree, which brings out the potential association rules of the “if-then” pattern between the nine Critical Attributes and the Target Attribute.

Stage 3. Scoring the rules

Then, we score each rule using formulas based on its degree of support and degree of purity. Given an association rule R, let C be its leaf node, let n be the number of e-mails in node C whose Target Attribute’s value is “S”, and let Support(R) record the degree of support of this rule. We compute the degrees of support of all rules, and denote the maximum one as Support_MAX and the minimum one as Support_MIN. Let Support(C), Purity(C), and Label(C) be defined as mentioned earlier. Before describing the scoring formula of rules, we introduce the following three important functions: SpamTendency(R), W(R), and S(Support(R)).

The function SpamTendency(R) reflects the rule’s “intensity” for classifying e-mails as spams, and is defined as follows:

SpamTendency(R) = Purity(C) if Label(C) = “spam”;
SpamTendency(R) = (n / |C|) × 100% otherwise,

where |C| is the number of e-mails contained in leaf node C.

The function W(R) records the weighted value of rule R, which is computed as follows:

W(R) = Support(C) / (Support_MAX − Support_MIN) × 100%.

Let W_MAX be the maximum and W_MIN the minimum of the weighted values of all rules computed by the above formula. Then the function S(Support(R)) records the score of Support(R), which is relative to the ranking of the weighted value of rule R:

S(Support(R)) = (W(R) − W_MIN) / (W_MAX − W_MIN) × 100%.

Now we can compute the score of rule R, which is recorded by the function Score(R). It combines SpamTendency(R) and S(Support(R)) in a ratio of 7:3, and is defined as follows:

Score(R) = (0.7 × SpamTendency(R) + 0.3 × S(Support(R))) × 100.

After computing the scores, all of the rules are stored into the Rule-Database, which keeps the extracted association rules and will be accessed by the Classification Phase to classify unknown e-mails. Moreover, we take the minimum score among the rules whose SpamTendency(R) exceeds 80%, and set it as the threshold θ for judging whether an unknown e-mail is spam.
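Under the assumption that purities and supports are expressed as fractions in [0, 1] (the paper writes them as percentages), the scoring above can be sketched as follows; the dictionary keys and the function name are illustrative.

```python
def score_rules(rules):
    """Score every rule and return the threshold. Each rule is a dict with
    'label' ("S"/"L"), 'purity', 'support', and 'spam_frac' = n/|C| (the
    fraction of spams in the leaf), all as fractions in [0, 1]. Assumes the
    rule supports are not all equal (otherwise W(R) is undefined)."""
    supports = [r["support"] for r in rules]
    spread = max(supports) - min(supports)
    for r in rules:
        # SpamTendency(R): purity for spam-labelled leaves, n/|C| otherwise
        r["tendency"] = r["purity"] if r["label"] == "S" else r["spam_frac"]
        # W(R): support normalised by the spread of all rule supports
        r["w"] = r["support"] / spread
    w_max = max(r["w"] for r in rules)
    w_min = min(r["w"] for r in rules)
    for r in rules:
        # S(Support(R)): ranking of the weighted value, rescaled to [0, 1]
        r["s"] = (r["w"] - w_min) / (w_max - w_min)
        # Score(R): tendency and support score combined in a 7:3 ratio
        r["score"] = (0.7 * r["tendency"] + 0.3 * r["s"]) * 100.0
    # Threshold: minimum score among rules with SpamTendency(R) above 80%
    return min(r["score"] for r in rules if r["tendency"] > 0.8)
```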

3.1.2 Reversing Mechanism’s Setup Module

As mentioned earlier, this module is designed to initialize the parameters of the Reversing-Database, which will be applied by the Classification Phase to calculate an additional score for each unknown e-mail. Thus, a latent misjudgment of an unknown e-mail is likely to be “reversed” by the effect of the additional score.

In the Reversing-Database, we construct a reversing table for each rule of the Rule-Database. Each reversing table records the nine items of Critical Attributes listed in Table 1. Moreover, each item in the table has two parameters, Plus-Value and Minus-Value, which record adjustment values to be added to the scores of unknown e-mails. The Plus-Value records a non-negative integer, which implies a supplement to the score of an unknown e-mail; the Minus-Value records a non-positive integer, which implies a subtraction from the score of an unknown e-mail. An example of a reversing table is shown in Table 2. Given a rule R_i, we denote its corresponding reversing table as RT(R_i), and the nine items of RT(R_i) as RT(R_i)[j] for 1 ≤ j ≤ 9, whose two parameters Plus-Value and Minus-Value are named RT(R_i)[j].plus and RT(R_i)[j].minus, respectively.

By using the scored rules of the Rule-Database to examine the attributes of training e-mails, we can classify them and collect the misjudged ones. The algorithm RT_Initial initializes the reversing tables using these misjudged training e-mails; its process is illustrated in Figure 4. Note that each initial value of RT(R_i)[j].plus and RT(R_i)[j].minus for 1 ≤ j ≤ 9 is set to zero before performing RT_Initial. Moreover, the parameters I+ and I− are two positive integers, which serve as the basic units for adjusting the values of RT(R_i)[j].plus and RT(R_i)[j].minus, respectively.

The detailed process of RT_Initial is summarized as follows.

Step 1. Check the Critical Attributes of this misjudged training e-mail.

Step 2. According to the values of its Critical Attributes, this training e-mail will dovetail with some rule, say R_i, in the Rule-Database. Then the corresponding reversing table RT(R_i) associated with this dovetailed rule R_i is chosen from the Reversing-Database.

Step 3. If this misjudged training e-mail is a “False-Positive” (a legitimate e-mail judged as a spam), do the following operations:

For 1 ≤ j ≤ 9, check whether this misjudged training e-mail satisfies the statement of RT(R_i)[j] (“True” or “False”):

If True, then do
    If RT(R_i)[j].plus ≥ I+, then RT(R_i)[j].plus = RT(R_i)[j].plus − I+;
Else (False), then do
    RT(R_i)[j].minus = RT(R_i)[j].minus − I−.

Step 4. If this misjudged training e-mail is a “False-Negative” (a spam judged as a legitimate e-mail), do the following operations:

For 1 ≤ j ≤ 9, check whether this misjudged training e-mail satisfies the statement of RT(R_i)[j] (“True” or “False”):

If True, then do
    RT(R_i)[j].plus = RT(R_i)[j].plus + I+;
Else (False), then do
    If RT(R_i)[j].minus + I− ≤ 0, then RT(R_i)[j].minus = RT(R_i)[j].minus + I−.

Step 5. Store RT(R_i) back to the Reversing-Database.
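The update steps above can be sketched as a single function. The list-of-dicts table layout, the unit values, and the function name are this example’s assumptions.

```python
I_PLUS, I_MINUS = 1, 1  # basic adjustment units I+ and I- (illustrative values)

def rt_initial_update(table, satisfied, false_positive):
    """Apply Steps 3-4 of RT_Initial for one misjudged training e-mail.
    table: nine dicts with 'plus' (>= 0) and 'minus' (<= 0);
    satisfied: nine booleans, one per Critical Attribute statement;
    false_positive: True if a legitimate e-mail was judged as spam."""
    for j in range(9):
        if false_positive:
            # Push future scores of similar e-mails downwards
            if satisfied[j]:
                if table[j]["plus"] >= I_PLUS:
                    table[j]["plus"] -= I_PLUS
            else:
                table[j]["minus"] -= I_MINUS
        else:
            # False negative: push future scores upwards
            if satisfied[j]:
                table[j]["plus"] += I_PLUS
            else:
                if table[j]["minus"] + I_MINUS <= 0:
                    table[j]["minus"] += I_MINUS
```

Note how the guards keep every Plus-Value non-negative and every Minus-Value non-positive, so repeated updates cannot flip the sign of either parameter.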

3.2 Classification Phase

The task of this phase is to classify each unknown e-mail as either a legitimate e-mail or a spam according to the association rules learned in the Training Phase. The first step is to extract the Critical Attributes of each unknown e-mail and find the dovetailed association rule in the Rule-Database to compute the original score for this e-mail. Then, this unknown e-mail’s attributes are examined once again to compute its additional score by checking the corresponding reversing table in the Reversing-Database. Finally, this unknown e-mail can be classified according to the total score composed of the original and additional scores. We describe the process of this phase in the following stages:

Stage 1. Capturing Critical Attributes

In this stage, each unknown e-mail is examined to capture the values of the nine Critical Attributes shown in Table 1. These values are decided by checking the corresponding fields of the unknown e-mail’s header section and looking up the spam keywords table (if necessary).

Stage 2. Computing the original score

According to the values of its Critical Attributes, this unknown e-mail will dovetail with some association rule, say R_i, in the Rule-Database built in the Training Phase, and we set the original score of the unknown e-mail to be Score(R_i), as defined previously. Assume that this original score is named score_A.

Stage 3. Computing the additional score

Assume that the additional score of this unknown e-mail is named score_B. In this stage, we access the corresponding reversing table of the dovetailed rule R_i. First, the reversing table RT(R_i) is picked from the Reversing-Database. Then, this unknown e-mail’s attributes are examined again to compute its additional score (i.e., score_B) by executing the following steps:

Step 1. Check the Critical Attributes of this unknown e-mail.

Step 2. For 1 ≤ j ≤ 9, check whether this unknown e-mail satisfies the statement of RT(R_i)[j] (“True” or “False”):

If True, then score_B = score_B + RT(R_i)[j].plus;
Else (False), then score_B = score_B + RT(R_i)[j].minus.

Stage 4. Judging the classification of this unknown e-mail

Now the total score of this unknown e-mail can be obtained by adding up score_A and score_B. This unknown e-mail is classified as a spam if score_A + score_B ≥ θ, and as a legitimate e-mail otherwise. In this research, we apply the supervised machine learning method to collect the misjudged e-mails for further analysis in the Re-learning Phase. Therefore, in this stage, the supervisor monitors the classification results of all unknown e-mails and collects the misjudged ones.
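Stages 3 and 4 amount to only a few lines of code; in this sketch the threshold value and all the names are illustrative assumptions.

```python
THETA = 60.0  # the spam threshold learned in the Training Phase (example value)

def classify(score_a, table, satisfied, theta=THETA):
    """Add the reversing-table adjustments (score_B) to the original rule
    score (score_A) and compare the total against the threshold.
    table: nine dicts with 'plus'/'minus'; satisfied: nine booleans."""
    score_b = sum(table[j]["plus"] if satisfied[j] else table[j]["minus"]
                  for j in range(9))
    return "S" if score_a + score_b >= theta else "L"
```

A borderline original score can thus be pushed over or under the threshold by the reversing table, which is exactly the “reversing” effect described above.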

3.3 Re-learning Phase

During the Re-learning Phase, the misjudged e-mails collected in the previous phase (the Classification Phase) are used by the algorithm RT_Modify to modify the reversing tables in the Reversing-Database. The process of RT_Modify is similar to RT_Initial but not exactly the same. In RT_Modify, the parameters M+ and M− are the basic units for adjusting the values of RT(R_i)[j].plus and RT(R_i)[j].minus, respectively. The detailed process of RT_Modify is summarized as follows.

Step 1. Check the Critical Attributes of each misjudged e-mail.

Step 2. According to the values of its Critical Attributes, this misjudged e-mail will dovetail with some rule, say R_i, in the Rule-Database. Then choose the corresponding reversing table RT(R_i) associated with this dovetailed rule R_i from the Reversing-Database.

Step 3. If this misjudged e-mail is a “False-Positive” (a legitimate e-mail judged as a spam), do the following operations:

For 1 ≤ j ≤ 9, check whether this misjudged e-mail satisfies the statement of RT(R_i)[j] (“True” or “False”):

If True, then do
    If RT(R_i)[j].plus ≥ M+, then RT(R_i)[j].plus = RT(R_i)[j].plus − M+;
Else (False), then do
    RT(R_i)[j].minus = RT(R_i)[j].minus − M−.

Step 4. If this misjudged e-mail is a “False-Negative” (a spam judged as a legitimate e-mail), do the following operations:

For 1 ≤ j ≤ 9, check whether this misjudged e-mail satisfies the statement of RT(R_i)[j] (“True” or “False”):

If True, then do
    RT(R_i)[j].plus = RT(R_i)[j].plus + M+;
Else (False), then do
    If RT(R_i)[j].minus + M− ≤ 0, then RT(R_i)[j].minus = RT(R_i)[j].minus + M−.

Step 5. Store RT(R_i) back to the Reversing-Database.

4. Experimental Results

In this section, we perform experiments to confirm the accuracy and efficiency of our spam filtering method. We employ two spam datasets as experimental data: SpamAssassin [27] and Enron-Spam [8], which are commonly used in research on spam. The SpamAssassin dataset consists of 6,827 e-mails (4,894 legitimate e-mails and 1,933 spams), and the Enron-Spam dataset consists of 52,076 e-mails (19,088 legitimate e-mails and 32,988 spams). Moreover, we have collected 10,502 e-mails (4,401 legitimate e-mails and 6,101 spams) in the recent period as supplements to the above two datasets. Therefore, the total number of experimental e-mails is 69,405 (28,383 legitimate e-mails and 41,022 spams). We denote these e-mails as the third dataset, Mixed-Set, which is adopted in this section.

The experiments proceed through the following steps. First, we introduce the efficacy assessment indexes used in this paper. Next, we optimize the parameters (I+, I−, M+, M−) used in the algorithms RT_Initial and RT_Modify. Then, we conduct a series of experiments to measure the filtering performance of the proposed method.

4.1 Assessment indexes used in this study

To evaluate the performance for our spam filtering method proposed in this paper, we employ

the following efficacy assessment indexes: “Precision”, “Recall”, and “F-measure”, which are

commonly used for document classification. The decision confusion matrix, as shown in Table 3, is

used to explain the calculation equations listed as follows [6, 9, 10]. Note that the four cases A, B, C, and D in Table 3 are all counts of e-mails.

1. Accuracy: the percentage of total e-mails that are correctly recognized. It is defined by the following formula:

Accuracy = (A + D) / (A + B + C + D).

2. Precision: the ratio of correctly classified e-mails among those judged as a certain category, representing the filter's capability of correctly identifying that category. In this study, we calculate the "Spam Precision" from the perspective of identifying spams and the "Legitimate Precision" from the perspective of identifying legitimate e-mails, and set "Precision" to the mean of the two. The formulas are listed as follows:

Spam Precision = A / (A + B);

Legitimate Precision = D / (C + D);

Precision = (Spam Precision + Legitimate Precision) / 2.

3. Recall: the ratio of e-mails of a category that are classified correctly. The "Spam Recall" is the probability of correctly classifying spams as spams, and the "Legitimate Recall" is the probability of correctly classifying legitimate e-mails. The "Recall" value is the mean of the two. The formulas are listed as follows:

Spam Recall = A / (A + C);

Legitimate Recall = D / (B + D);

Recall = (Spam Recall + Legitimate Recall) / 2.

4. F-measure: the harmonic mean of Precision and Recall:

F-measure = (2 × Precision × Recall) / (Precision + Recall).

5. FP-rate and FN-rate: the FP-rate is the ratio of misjudging legitimate e-mails as spammy, and the FN-rate is the ratio of misjudging spams as legitimate. The formulas are listed as follows:

FP-rate = B / (B + D);

FN-rate = C / (A + C).

4.2 Optimization of parameters

Before performing the experiments, we must optimize the important parameters used in this research. First, we optimize the parameters (P_lower, P_upper, S_lower), the threshold values used in Step 4 of the ID3 algorithm. Through extensive experiments, we found that the size of the decision tree built by the ID3 algorithm can be pruned acceptably if we set (P_lower, P_upper, S_lower) to (20%, 90%, 2.5%). Therefore, we use these threshold values as the stop conditions in Step 4 of the ID3 algorithm. That is, Step 4 can be interpreted as follows: “If Purity(C) ≤ 20% or Purity(C) ≥ 90% or Support(C) ≤ 2.5%, then stop”.
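Under these settings, the stop test of Step 4 reduces to a simple predicate. A small sketch with purity and support expressed as fractions (threshold names ours):

```python
# Optimized thresholds (P_lower, P_upper, S_lower) = (20%, 90%, 2.5%).
P_LOWER, P_UPPER, S_LOWER = 0.20, 0.90, 0.025

def stop_splitting(purity, support):
    """Return True if an ID3 node C should not be split further:
    it is either pure enough, too impure, or too small."""
    return purity <= P_LOWER or purity >= P_UPPER or support <= S_LOWER
```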

Then, we must optimize the four important parameters (I+, I-, M+, M-). The parameters I+ and I- are used by the algorithm RT_Initial in the Training Phase, and M+ and M- are used by RT_Modify in the Re-learning Phase. We set the initial values of (I+, I-, M+, M-) to (1, 1, 1, 1). Then, we adopted all 6,827 e-mails of the SpamAssassin dataset as testing data to observe the variation of the Accuracy of our spam filtering system (using both the reversing mechanism and the re-learning mechanism). We fixed the three parameters (I+, I-, M-) at (1, 1, 1) and gradually changed M+ to investigate the variation of Accuracy. The results are shown in Table 4: the higher the value of M+, the higher the Accuracy, but once M+ reached 10, the Accuracy could not be improved further. Thus, we chose 10 as the optimal value of M+.

Similarly, we varied M- while fixing the other three parameters (I+, I-, M+) at (1, 1, 10) to observe the variation of Accuracy, shown in Table 5. The Accuracy could not be improved further once M- reached 7, so we chose 7 as the optimal value of M-.

Applying the same procedure, we chose 12 as the optimal value of I-. We then tried to optimize the value of I+; however, the Accuracy decreased as soon as we began to adjust it, so we kept I+ at 1 and finished the optimization. Finally, we set the optimal values of (I+, I-, M+, M-) to (1, 12, 10, 7) and adopt these optimized parameters in the experiments of the next subsection.
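The search just described, varying one parameter at a time while holding the others fixed and keeping the best value found, is a form of coordinate-wise tuning. A generic sketch, assuming a callable `score` that maps a parameter assignment to the measured Accuracy (all names illustrative):

```python
def coordinate_tune(score, init, candidates):
    """Optimize parameters one at a time, as in the (I+, I-, M+, M-) search.

    score: callable mapping a parameter dict to an accuracy value.
    init: dict of starting values.
    candidates: dict mapping each parameter name to the values to try,
    in the order the parameters should be tuned.
    """
    best = dict(init)
    for name, values in candidates.items():
        best_acc = score(best)                 # accuracy with current values
        for v in values:
            trial = dict(best, **{name: v})    # change only this parameter
            acc = score(trial)
            if acc > best_acc:                 # keep strictly better values
                best_acc, best[name] = acc, v
    return best
```

In our setting, `score` would require re-running the filtering system over the testing data for each trial value, which is why each parameter is swept only once.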

4.3 Experimental analysis

In this subsection, we perform the following three experiments to confirm the efficiency of the proposed spam filtering method: (A) using SpamAssassin as the experimental dataset; (B) using Enron-Spam as the experimental dataset; (C) using Mixed-Set as the experimental dataset.

Note that the ratio of legitimate e-mails to spams used in the Training Phase differed from that in the Classification Phase. In the Training Phase, we randomly took 1,000 training e-mails in the ratio 1:1 (500 legitimate e-mails and 500 spams) from the experimental dataset adopted in each experiment. In the Classification Phase, we continuously increased the amount of testing data (unknown e-mails) to observe the performance of our filtering method. These unknown e-mails were taken randomly from the adopted experimental dataset without any predefined proportion of legitimate e-mails to spams.

In each experiment, the performance of our spam filtering system is verified with different methods, which are combinations of the mechanisms proposed in our filtering system: (I) using neither the reversing mechanism nor the re-learning mechanism; (II) using only the reversing mechanism; (III) using both the reversing mechanism and the re-learning mechanism. Note that (III) is equivalent to the whole filtering system proposed in this paper. The experimental results are discussed as follows.

4.3.1 The result of experiment (A)

In this experiment, we chose SpamAssassin as the experimental dataset to observe the performance of our filtering system. The result is shown in Figure 5. The Accuracy curve of method (I), which used neither the reversing mechanism nor the re-learning mechanism, was lower than those of methods (II) and (III). With the assistance of the reversing mechanism in classifying unknown e-mails, the Accuracy curve of method (II) was better than that of method (I). Moreover, with both the reversing mechanism and the re-learning mechanism, method (III) obtained the most outstanding accuracy: as the re-learning mechanism was applied repeatedly, the Accuracy of method (III) reached 0.9675, an excellent result.

The performance of method (III) evaluated with the various efficacy assessment indexes is shown in Figure 6. Each of the four index curves performed well. Among them, the Recall curve obtained the highest values, which implies that our filtering system correctly classifies each category (spammy or legitimate) of e-mails.

4.3.2 The result of experiment (B)

In this experiment, we chose Enron-Spam as the experimental dataset. The Enron-Spam dataset contains far more e-mails than SpamAssassin. The experimental results are shown in Figure 7. The considerable quantity of testing data had a great impact on the behavior of methods (I), (II), and (III). With the support of the re-learning mechanism, method (III) incrementally learned knowledge from numerous unknown e-mails and obtained the best accuracy. Compared with experiment (A), the Accuracy curve of method (III) in this experiment was lifted and reached 0.9765.

The performance of method (III) recorded with the various efficacy assessment indexes is shown in Figure 8. Compared with Figure 6, the four index curves in this experiment performed better. After re-learning from plenty of unknown e-mails, all curves became stable and reached desirable values. Moreover, the Recall eventually reached the highest value of 0.9992, which indicates that our system classified the unknown e-mails almost perfectly.

4.3.3 The result of experiment (C)

The previous two experiments confirmed that the performance of method (III), the filtering system proposed in this paper, is excellent. In this experiment, we applied only method (III) to observe the behavior of the various efficacy assessment indexes. To emphasize the influence of a great quantity of testing data, we adopted Mixed-Set as the experimental dataset and doubled the amount of training e-mails. Precisely, in the Training Phase, we randomly took 2,000 training e-mails from Mixed-Set without any predefined proportion of legitimate e-mails to spams. Then, in the Classification Phase, we continuously increased the amount of unknown e-mails taken randomly from Mixed-Set to observe the performance of our spam filtering method.

The results of this experiment are shown in Figure 9. By taking more training data from a plentiful dataset, each assessment index reached a very high value, above 0.99, at the beginning of this experiment. After a large amount of testing data was applied, all four index curves approached one another and reached extremely desirable values. Moreover, the FP-rate (the ratio of misjudging legitimate e-mails as spammy) and FN-rate (the ratio of misjudging spams as legitimate) of method (III) in the three experiments are shown in Table 6. Both rates were ideal even when the proposed method was applied to a small experimental dataset. For example, the FP-rate in experiment (A) was only 0.0014, which shows that our filtering method rarely misjudged legitimate e-mails (which are of vital importance) as spammy ones. Moreover, the considerable quantity of data instances greatly improved the misjudgment rates, which implies that the filtering method proposed in this paper possesses outstanding performance.

Table 7 compares our method with several spam filtering methods proposed in the literature. Note that some of those methods must check the whole content of an e-mail, whereas our method checks only the e-mail's header. Our method reveals better Accuracy and Recall. The Precision of method (III) in experiment (B) was inferior to those of the other methods; however, the Precision in experiment (C) reached an almost perfect value, which implies that our method can incrementally learn classification knowledge from a great quantity of unknown e-mails to improve itself.

5. Conclusions

In this research, we proposed an efficient spam filtering method based on decision tree data mining techniques: we analyzed the association rules about spams and applied these rules to develop a systematized spam filtering method. Unlike content checking, we classify e-mails simply by analyzing their basic header data. Our method possesses three major advantages: (1) Checking only the e-mail's header section avoids the low operating efficiency of scanning the e-mail's content, while the filtering accuracy is enhanced simultaneously. (2) So that a probable misjudgment in identifying an unknown e-mail can be “reversed”, we constructed a reversing mechanism to assist the classification of unknown e-mails, which increases the overall accuracy of our filtering method. (3) Our method is equipped with a re-learning mechanism, which uses supervised machine learning to collect and analyze each misjudged e-mail. The revision information learned from the analysis of misjudged e-mails is fed back incrementally to our method, improving its ability to identify spams.

The results of experiments with a large number of testing data showed that the assessment indexes Accuracy, Recall, Precision, F-measure, FP-rate, and FN-rate approached one another and reached extremely desirable values, which implies that the filtering method proposed in this paper possesses outstanding performance. Note that one advantage of our method is its reduced calculation cost. Therefore, the method proposed in this paper can classify unknown e-mails precisely without consuming too many system resources, which is extremely useful for meeting today's requirement of judging a large number of unknown e-mails.

Acknowledgements

This work is partially supported by the Ministry of Science and Technology, Taiwan, R.O.C. under

Grant no. MOST 103-2410-H-004-112.

References

1. Androutsopoulos I, Koutsias J, Chandrinos KV, Spyropoulos CD. An experimental comparison of naïve bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2000; 160-167.

2. Breiman L. Random forests. Machine Learning 2001; 45: 5-32.

3. Carreras X, Marquez L. Boosting trees for anti-spam email filtering. 4th International
Conference on Recent Advances in Natural Language Processing (RANLP), Bulgaria, Sep. 5-7,
2001; 58-64.

4. Cook D, Hartnett J, Manderson K, Scanlan J. Catching spam before it arrives: domain specific
dynamic blacklists. In Proceedings of the 2006 Australasian Workshops on Grid Computing and
E-Research 2006; 54: 193-203.

5. DeBarr D, Wechsler H. Spam detection using random boost. Pattern Recognition Letters 2012;
33(10): 1237-1244.

6. Delany SJ, Cunningham P, Tsymbal A, Coyle L. A case-based technique for tracking concept

drift in spam filtering. Knowledge-Based Systems 2005; 18: 187-195.

7. Drucker H, Wu D, Vapnik V. Support vector machines for spam categorization. IEEE


Transactions on Neural Networks 1999; 10(5): 1048-1054.

8. Enron-SPAM Datasets: [Link]

9. Fdez-Riverola F, Iglesias EL, Díaz F, Méndez JR, Corchado JM. Applying lazy learning
algorithms to tackle concept drift in spam filtering. Expert Systems with Applications 2007; 33(1):
36-48.

10. Fdez-Riverola F, Iglesias EL, Díaz F, Méndez JR, Corchado JM. Spamhunting: an
instance-based reasoning system for spam labelling and filtering. Decision Support Systems 2007;
43(3): 722-736.

11. Golbeck J, Hendler J. Reputation network analysis for email filtering. In Proceedings of the First
Conference on Email and Anti-Spam (CEAS), 2004.

12. Guo Y, Zhou L, He K, Gu Y, Sun Y. Bayesian spam filtering mechanism based on decision tree of
attribute set dependence in the mapreduce framework. Open Cybernetics & Systemics Journal
2014; 8: 435-441.

13. Han J, Kamber M. Data Mining Concepts and Techniques. USA: Morgan Kaufman, 2001;
284-287.

14. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference,
and Prediction. Springer: New York, 2001.

15. Hong Kong Anti-SPAM Coalition (HKASC). Legislation: One of The Key Pillars in The Fight
Against SPAM. White Paper, 2004.

16. Hsiao WF, Chang TM. An incremental cluster-based approach to spam filtering. Expert Systems
with Applications 2008; 34: 1599-1608.

17. Islam MR, Zhou W, Guo M, Xiang Y. An innovative analyser for multi-classifier e-mail
classification based on grey list analysis. Journal of Network and Computer Applications 2009;
32: 357-366.

18. Jayaraj A, Venkatesh T, Murthy CSR. Loss classification in optical burst switching networks
using machine learning techniques: improving the performance of tcp. IEEE Journal on Selected
Areas in Communications 2008; 26(6): 45-54.

19. Lai CC. An empirical study of three machine learning methods for spam filtering.
Knowledge-Based Systems 2007; 20(3): 249-254.

20. Liu YN, Han Y, Zhu XD, He F, Wei LY. An expanded feature extraction of e-mail header for
spam recognition. Advanced Materials Research 2013; 846:1672-1675.

21. Ohmann C, Moustakis V, Yang Q, Lang K. Evaluation of automatic knowledge acquisition


techniques in the diagnosis of acute abdominal pain. Artificial Intelligence in Medicine 1996;
8(1): 23-36.

22. Pölzlbauer G, Lidy T, Rauber A. Decision manifolds - a supervised learning algorithm based on
self-organization. IEEE Transactions on Neural Networks 2008; 19(9): 1518-1530.

23. Porter MF. An algorithm for suffix stripping. Program 1980; 14: 130-137.

24. Quinlan JR. Induction of decision trees. Machine Learning 1986; 1(1):81-106.

25. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann, 1993.

26. Sanpakdee U, Walairacht A, Walairacht S. Adaptive spam mail filtering using genetic algorithm.
The 8th International Conference Advanced Communication Technology, pp. 441-445, Phoenix
Park, Korea, Feb. 20-22, 2006.

27. Schwartz A. SpamAssassin. O’Reilly, 2004.

28. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys 2002;
34(1): 1-47.

29. Sheu JJ. An efficient two-phase spam filtering method based on e-mails categorization.
International Journal of Network Security 2009; 8(3): 334-343.

30. Sheu JJ, Chu KT. An efficient spam filtering method by analyzing e-mail's header session only.
International Journal of Innovative Computing, Information and Control 2009; 5(11):
3717-3731.

31. Shih DH, Chiang HS, Lin B. Collaborative spam filtering with heterogeneous agents. Expert
Systems with Applications 2008; 35(4): 1555-1566.

32. Shrivastava JN, Bindu MH. E-mail spam filtering using adaptive genetic algorithm.
International Journal of Intelligent Systems and Applications (IJISA) 2014; 6(2): 54-60.

33. Stark KD, Pfeiffer DU. The application of non-parametric techniques to solve classification
problems in complex data sets in veterinary epidemiology - an example. Intelligent Data
Analysis 1999; 3(1):23-35.

34. Symantec Intelligence Report: May 2014. Symantec, 2014.

35. Tretyakov K. Machine learning techniques in spam filtering. Technical report, Institute of
Computer Science, University of Tartu, 2004.
36. Wang CC, Chen SY. Using header session messages to anti-spamming. Computers & Security
2007; 26: 381-390.

37. Yang Y. A novel framework based on rough set, ant colony optimization and genetic algorithm
for spam filtering. International Journal of Advancements in Computing Technology 2012; 4(14):
516-525.

38. Zhou B, Yao Y, Luo J. Cost-sensitive three-way email spam filtering. Journal of Intelligent
Information Systems 2014; 42(1): 19-45.

Figures:

Figure 1. An example of a tree

Figure 2. Architecture of the three-phase spam filtering method

Figure 3. Detailed process of the three-phase spam filtering method

Figure 4. The flow of RT_Initial

Figure 5. The result of experiment (A)

Figure 6. The performance of method (III) in experiment (A)

Figure 7. The result of experiment (B)

Figure 8. The performance of method (III) in experiment (B)

Figure 9. The result of experiment (C)

Tables:
Table 1: The 9 Critical Attributes of e-mail

| Category | Critical Attribute | Value |
|----------|--------------------|-------|
| Sender | Length of sender's name is abnormal | If the length of sender's name is more than 9 characters, set to 1 (True); otherwise 0 (False) |
| Sender | Either sender's name or address is abnormal | If either the sender's name or address is blank or contains an abnormal symbol, set to 1 (True); otherwise 0 (False) |
| Sender | Spam keyword is found in sender's name or address | If any spam keyword is found, set to 1 (True); otherwise 0 (False) |
| Title | E-mail's title is abnormal | If the title is blank or contains more than 3 wrong (or unknown) words, set to 1 (True); otherwise 0 (False) |
| Title | E-mail's title includes spam keyword (Type I) | If the title has a spam keyword, set to 1 (True); otherwise 0 (False) |
| Title | E-mail's title includes spam keyword (Type II) | If the title has at least three spam keywords, set to 1 (True); otherwise 0 (False) |
| Other | Sending date and receiving date are abnormal | If the sending date distinctly differs from the receiving date, set to 1 (True); otherwise 0 (False) |
| Other | E-mail's size is abnormal | If the e-mail's size is equal to or larger than 8k, set to 1 (True); otherwise 0 (False) |
| Other | E-mail's format | If the format is HTML or the e-mail contains attachments, set to 1 (True); otherwise 0 (False) |

Table 2: An example of reversing table RT(R)

| No. | Item (True/False) | Plus-Value | Minus-Value |
|-----|-------------------|------------|-------------|
| 1 | Length of sender's name is abnormal | +4 | -4 |
| 2 | Either sender's name or address is abnormal | +4 | -2 |
| 3 | Spam keyword is found in sender's name or address | +4 | -4 |
| 4 | E-mail's title is abnormal | +4 | -2 |
| 5 | E-mail's title includes spam keyword (Type I) | +4 | -2 |
| 6 | E-mail's title includes spam keyword (Type II) | +8 | -4 |
| 7 | Sending date and receiving date are abnormal | +10 | 0 |
| 8 | E-mail's size is abnormal | +4 | -3 |
| 9 | E-mail's format | +4 | -2 |

Table 3: Four cases of judgment

| | Spammy (in reality) | Legitimate (in reality) |
|---|---|---|
| Judged as spam | A | B |
| Judged as legitimate e-mail | C | D |

Table 4: Optimization of parameter M+ with (I+, I-, M-) = (1, 1, 1)

| M+ | Accuracy of method (III) |
|----|--------------------------|
| 4 | 0.7486 |
| 8 | 0.7438 |
| 9 | 0.7515 |
| 10 | 0.9054 |
| 11 | 0.9054 |
| … | … |

Table 5: Optimization of parameter M- with (I+, I-, M+) = (1, 1, 10)

| M- | Accuracy of method (III) |
|----|--------------------------|
| 4 | 0.9250 |
| 6 | 0.9442 |
| 7 | 0.9518 |
| 8 | 0.9518 |
| 9 | 0.9518 |
| … | … |

Table 6: The experimental results of FP-rate and FN-rate

| Experiment | FP-rate | FN-rate |
|------------|---------|---------|
| (A) | 0.0014 | 0.1112 |
| (B) | 0.000786 | 0.0367 |
| (C) | 0.00099539 | 0.000454711 |

Table 7: Comparison of filtering methods

| Method | Recall (%) | Precision (%) | Accuracy (%) | Checks the whole content of e-mail |
|--------|------------|---------------|--------------|------------------------------------|
| Bayes classifier proposed by Tretyakov [35] | 87.44 | 100 | 94.49 | Yes |
| Incremental clustering-based classification (ICBC) proposed by Hsiao and Chang [16] | 96.1 | 95.76 | 94.73 | Yes |
| Genetic algorithm proposed by Sanpakdee et al. [26] | 75.71 | 89.83 | 85.53 | Yes |
| Decision tree method proposed by Sheu [29] | 96.35 | 96.67 | 96.5 | No |
| Decision tree method proposed by Sheu et al. [30] | 94.00 | 97.96 | 96.17 | No |
| This paper: method (III) in experiment (B) | 99.92 | 94.05 | 97.65 | No |
| This paper: experiment (C) | 99.90 | 99.92 | 99.93 | No |
