Chapter 1 - Probabilistic Learning - Classification Using Naive Bayes
This chapter covers the Naive Bayes algorithm, which uses probabilities in much the same way as a
weather forecast.
Typically, Bayesian classifiers are best applied to problems in which the information from
numerous attributes should be considered simultaneously in order to estimate the overall
probability of an outcome. Bayesian methods utilize all available evidence to subtly change the
predictions. This implies that even if a large number of features have relatively minor effects, their
combined impact in a Bayesian model could be quite large.
Bayesian methods provide insights into how the probability of these events can be estimated from
observed data.
Understanding Probability
The probability of an event is estimated from observed data by dividing the number of trials in which the event occurred by the total number of trials.
The complement of event A is typically denoted Aᶜ or A'. Additionally, the shorthand notation P(¬A) can be used to denote the probability of event A not occurring, as in P(¬spam) = 0.80. This notation is equivalent to P(Aᶜ).
We know that 20 percent of all messages were spam (the left circle), and 5 percent of all messages
contained the word "Viagra" (the right circle). We would like to quantify the degree of overlap
between these two proportions. In other words, we hope to estimate the probability that both
P(spam) and P(Viagra) occur, which can be written as P(spam ∩ Viagra). The upside-down "U"
symbol signifies the intersection of the two events; the notation A ∩ B refers to the event in which
both A and B occur.
Calculating P(spam ∩ Viagra) depends on the joint probability of the two events, or how the
probability of one event is related to the probability of the other. If the two events are totally
unrelated, they are called independent events. This is not to say that independent events cannot
occur at the same time; event independence simply implies that knowing the outcome of one event
does not provide any information about the outcome of the other. For instance, the outcome of a
heads result on a coin flip is independent from whether the weather is rainy or sunny on any given
day.
If all events were independent, it would be impossible to predict one event by observing another. In
other words, dependent events are the basis of predictive modeling. Just as the presence of
clouds is predictive of a rainy day, the appearance of the word "Viagra" is predictive of a spam
email.
Calculating the probability of dependent events is a bit more complex than for independent
events.
This is known as conditional probability since the probability of A is dependent (that is,
conditional) on what happened with event B.
Bayes' theorem states that the best estimate of P(A|B) is the proportion of trials in which A occurred
with B, out of all the trials in which B occurred.
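In symbols, Bayes' theorem can be written as:
P(A|B) = P(A ∩ B) / P(B) = P(B|A) * P(A) / P(B)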
P(spam), the probability that any prior message was spam, is an estimate known as the prior probability.
By applying Bayes' theorem to this evidence, we can compute a posterior probability that
measures how likely a message is to be spam.
To calculate the components of Bayes' theorem, it helps to construct a frequency table (likelihood
table).
The likelihood table reveals that P(Viagra=Yes | spam) = 4/20 = 0.20, indicating that there is a 20%
probability a message contains the term "Viagra" given that the message is spam.
The Naive Bayes algorithm is named as such because it makes some "naive" assumptions about
the data. In particular, Naive Bayes assumes that all of the features in the dataset are equally
important and independent. These assumptions are rarely true in most real-world applications.
However, in most cases, even when these assumptions are violated, Naive Bayes still
performs fairly well. This is true even in circumstances where strong dependencies are found
among the features. Due to the algorithm's versatility and accuracy across many types of
conditions, particularly with smaller training datasets, Naive Bayes is often a reasonable
baseline candidate for classification learning tasks.
The exact reason why Naive Bayes works well despite its faulty assumptions has been the subject
of much speculation. One explanation is that it is not important to obtain a precise estimate of
probability so long as the predictions are accurate. For instance, if a spam filter correctly
identifies spam, does it matter whether it was 51 percent or 99 percent confident in its prediction?
As new messages are received, we need to calculate the posterior probability to determine whether
they are more likely spam or ham, given the likelihood of the words being found in the message text.
For example, suppose that a message contains the terms "Viagra" and "unsubscribe" but does not
contain either "money" or "groceries."
This formula is computationally difficult to solve. The computation becomes more reasonable if we
exploit the fact that Naive Bayes makes the naive assumption of independence among events.
Specifically, it assumes class-conditional independence, which means that events are
independent so long as they are conditioned on the same class value. The conditional
independence assumption allows us to use the probability rule for independent events, which
states that P(A ∩ B) = P(A) * P(B). This simplifies the numerator by allowing us to multiply the
individual conditional probabilities rather than computing a complex conditional joint probability.
Lastly, because the denominator does not depend on the target class (spam or ham), it is treated as
a constant value and can be ignored for the time being. This means that the conditional probability
of spam can be expressed as:
And the probability that the message is ham can be expressed as:
Note that the equals symbol has been replaced by the proportional-to symbol (similar to a
sideways, open-ended "8") to indicate the fact that the denominator has been omitted.
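For reference, the two omitted expressions can be reconstructed in the notation used above, with ¬ marking the words absent from the message:
P(spam | Viagra ∩ ¬money ∩ ¬groceries ∩ unsubscribe) ∝ P(Viagra | spam) * P(¬money | spam) * P(¬groceries | spam) * P(unsubscribe | spam) * P(spam)
P(ham | Viagra ∩ ¬money ∩ ¬groceries ∩ unsubscribe) ∝ P(Viagra | ham) * P(¬money | ham) * P(¬groceries | ham) * P(unsubscribe | ham) * P(ham)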
Using the values in the likelihood table, we can start filling numbers in these equations. The overall likelihood of spam is then: (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012
While the likelihood of ham is: (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002
Because 0.012/0.002 = 6, we can say that this message is six times more likely to be spam than
ham. However, to convert these numbers to probabilities, we need one last step to reintroduce the
denominator that had been excluded. Essentially, we must re-scale the likelihood of each outcome
by dividing it by the total likelihood across all possible outcomes.
In this way, the probability of spam is equal to the likelihood that the message is spam divided by the likelihood that the message is either spam or ham: 0.012 / (0.012 + 0.002) = 0.857
Similarly, the probability of ham is equal to the likelihood that the message is ham divided by the likelihood that the message is either spam or ham: 0.002 / (0.012 + 0.002) = 0.143
Given the pattern of words found in this message, we expect that the message is spam with 85.7
percent probability, and ham with 14.3 percent probability. Because these are mutually exclusive
and exhaustive events, the probabilities sum up to one.
The Naive Bayes classification algorithm used in the preceding example can be summarized by the following formula. The probability of level L for class C, given the evidence provided by features F1 through Fn, is equal to the product of the probabilities of each piece of evidence conditioned on the class level, the prior probability of the class level, and a scaling factor 1/Z, which converts the likelihood values to probabilities. This is formulated as:
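In symbols, with CL denoting class level L and F1 through Fn the features, this is:
P(CL | F1, ..., Fn) = (1/Z) * p(CL) * p(F1 | CL) * p(F2 | CL) * ... * p(Fn | CL)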
Although this equation seems intimidating, as the spam filtering example illustrated, the series of
steps is fairly straightforward. Begin by building a frequency table, use this to build a likelihood
table, and multiply out the conditional probabilities with the "naive" assumption of independence.
Finally, divide by the total likelihood to transform each class likelihood into a probability. After
attempting this calculation a few times by hand, it will become second nature.
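As a quick check, the arithmetic from the spam filtering example can be reproduced in a few lines of R (a sketch; the exact posterior differs slightly from 0.857 because the text rounds the ham likelihood to 0.002):
# likelihoods for the message containing Viagra and unsubscribe but not money or groceries
spam_likelihood <- (4/20) * (10/20) * (20/20) * (12/20) * (20/100)   # 0.012
ham_likelihood  <- (1/80) * (66/80) * (71/80) * (23/80) * (80/100)   # about 0.002
# rescale by the total likelihood to obtain posterior probabilities
spam_likelihood / (spam_likelihood + ham_likelihood)   # about 0.85
ham_likelihood  / (spam_likelihood + ham_likelihood)   # about 0.15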
Now suppose we received another message, this time containing all four terms: Viagra, groceries, money, and unsubscribe. The likelihood of spam is then: (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0
And the likelihood of ham is: (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005
This problem arises when an event never occurs for one or more levels of the class, so its conditional probability is zero. Because that zero is multiplied into the chain of probabilities, it wipes out all the other evidence, and the message would be declared ham with certainty.
A solution to this problem involves using something called the Laplace estimator. The Laplace
estimator adds a small number to each of the counts in the frequency table, which ensures that
each feature has a non-zero probability of occurring with each class. Typically, the Laplace
estimator is set to one, which ensures that each class-feature combination is found in the data at
least once.
Let's see how this affects our prediction for this message. Using a Laplace value of one, we add one
to each numerator in the likelihood function. Then, we need to add four to each conditional
probability denominator to compensate for the four additional values added to the numerator. The
likelihood of spam is therefore: (5/24) * (11/24) * (1/24) * (13/24) * (20/100) = 0.0004
And the likelihood of ham is: (2/84) * (15/84) * (9/84) * (24/84) * (80/100) = 0.0001
By computing 0.0004 / (0.0004 + 0.0001), we find that the probability of spam is 80 percent and therefore the probability of ham is about 20 percent. This is a more plausible result than the P(spam) = 0 computed when the term "groceries" alone determined the result.
Although the Laplace estimator was added to the numerator and denominator of the likelihoods, it
was not added to the prior probabilities - the values of 20/100 and 80/100. This is because our best
estimate of the overall probability of spam and ham remains 20% and 80% given what was
observed in the data.
One easy and effective solution is to discretize numeric features, which simply means that the
numbers are put into categories known as bins. For this reason, discretization is also sometimes
called binning. This method works best when there are large amounts of training data.
The most common way is to explore the data for natural categories or cut points in the distribution.
For example, suppose that you added a feature to the spam dataset that recorded the time of day or
night the email was sent, from zero to 24 hours past midnight. Depicted using a histogram, the time
data might look something like the following diagram:
In the early hours of the morning, message frequency is low. Activity picks up during business hours
and tapers off in the evening. This creates four natural bins of activity, as partitioned by the dashed
lines. These indicate places where the numeric data could be divided into levels to create a new
categorical feature, which could then be used with Naive Bayes.
The choice of four bins was based on the natural distribution of data and a hunch about how the
proportion of spam might change throughout the day. We might expect that spammers operate in
the late hours of the night, or they may operate during the day, when people are likely to check their
email. That said, to capture these trends, we could have just as easily used three bins or twelve.
If there are no obvious cut points, one option is to discretize the feature using quantiles. You could
divide the data into three bins with tertiles, four bins with quartiles, or five bins with quintiles.
One thing to keep in mind is that discretizing a numeric feature always results in a reduction of
information, as the feature's original granularity is reduced to a smaller number of categories. It is
important to strike a balance. Too few bins can result in important trends being obscured. Too many
bins can result in small counts in the Naive Bayes frequency table, which can increase the
algorithm's sensitivity to noisy data.
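As a sketch of how such binning might look in R, using a hypothetical vector of send times (the cut points 6, 12, and 20 are assumptions for illustration, not values from the SMS data used later):
# hypothetical hours at which a handful of messages were sent
hour <- c(2, 9, 11, 14, 19, 23)
# discretize into four bins at the assumed cut points
hour_bin <- cut(hour, breaks = c(0, 6, 12, 20, 24),
                labels = c("night", "morning", "afternoon", "evening"),
                include.lowest = TRUE)
table(hour_bin)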
Example - filtering mobile phone spam with the Naive Bayes algorithm
Step 1 - collecting data
To develop the Naive Bayes classifier, we will use data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.
Note: In the R programming language, the dollar sign ($) is used as an operator to access
components of a data structure, typically within a list or a data frame. It allows you to refer to
specific variables or elements within an object.
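For example, assuming the collection has been saved as a CSV file named sms_spam.csv with columns type and text (the file name is an assumption here), the data could be loaded and accessed as follows:
# read the raw SMS data; keep the text as character strings
sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE)
# use $ to access the type column and convert it to a factor
sms_raw$type <- factor(sms_raw$type)
table(sms_raw$type)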
1/The first step in processing text data involves creating a corpus, which is a collection of text
documents. In our case, the corpus will be a collection of SMS messages.
1. To create a corpus, we'll use the VCorpus() function in the tm package, which creates a volatile corpus. It is called volatile because it is stored in memory (the computer's RAM) and is therefore temporary, lasting only for the current R session. In contrast, the PCorpus() function creates a permanent corpus, saved to a database or to the computer's hard drive or other long-term storage. The VCorpus() function requires us to specify the source of documents for the corpus, which could be a computer's file system, a database, the web, or elsewhere.
Note: The choice between VCorpus() and PCorpus() depends on whether you need the corpus to be temporary and stored in memory for the current session (volatile), or permanent and stored on disk for long-term use.
Since we already loaded the SMS message text into R, we'll use the VectorSource() reader function to create a source object from the existing sms_raw$text vector, which can then be supplied to VCorpus() as follows:
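A minimal sketch of this command, assuming the tm package is installed and loaded:
library(tm)
# build a volatile, in-memory corpus from the vector of message texts
sms_corpus <- VCorpus(VectorSource(sms_raw$text))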
By specifying an optional readerControl parameter, the VCorpus() function can be used to import text from sources such as PDFs and Microsoft Word files. To learn more, examine the Data Import section in the tm package vignette using the vignette("tm") command.
Now, because the tm corpus is essentially a complex list, we can use list operations to select documents in the corpus. The inspect() function shows a summary of the result. For example, the following command will view a summary of the first and second SMS messages in the corpus:
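A sketch of this command, using the corpus object created above:
# show a summary of the first and second SMS messages
inspect(sms_corpus[1:2])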
To view the actual message text, the as.character() function must be applied to the desired messages. To view one message, use the as.character() function on a single list element, noting that the double-bracket notation is required:
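For instance, to print the text of the first message:
# double brackets select a single document from the corpus list
as.character(sms_corpus[[1]])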
To view multiple documents, we'll need to apply as.character() to several items in the sms_corpus object. For this, we'll use the lapply() function, which is part of a family of R functions that applies a procedure to each element of an R data structure. These functions, which include apply() and sapply() among others, are one of the key idioms of the R language. Experienced R coders use these much like the way for or while loops are used in other programming languages, as they result in more readable (and sometimes more efficient) code. The lapply() function for applying as.character() to a subset of corpus elements is as follows:
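A sketch of this command:
# apply as.character() to the first two messages in the corpus
lapply(sms_corpus[1:2], as.character)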
As noted earlier, the corpus contains the raw text of 5,559 text messages. To perform our analysis,
we need to divide these messages into individual words.
1/ We need to clean the text to standardize the words and remove punctuation characters that clutter the result. For example, we would like the strings Hello!, HELLO, and hello to be counted as instances of the same word.
The tm_map() function provides a method to apply a transformation (also known as a mapping) to
a tm corpus. We will use this function to clean up our corpus using a series of transformations and
save the result in a new object called corpus_clean.
Our first transformation will standardize the messages to use only lowercase characters. To this
end, R provides a tolower() function that returns a lowercase version of text strings. In order to
apply this function to the corpus, we need to use the tm wrapper function content_transformer()
to treat tolower() as a transformation function that can be used to access the corpus. The full
command is as follows:
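A sketch of this command, saving the result in a new object as described (the name corpus_clean follows the text above):
# convert all messages to lowercase
corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))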
Let's inspect the first message in the original corpus and compare it to the same message in the transformed corpus:
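For instance:
# original versus transformed version of the first message
as.character(sms_corpus[[1]])
as.character(corpus_clean[[1]])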
As expected, uppercase letters in the clean corpus have been replaced by lowercase versions of
the same.
The content_transformer() function can be used to apply more sophisticated text processing
and cleanup processes like grep pattern matching and replacement.
2/Let's continue our cleanup by removing numbers from the SMS messages.
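A sketch of this step:
# remove all digits from the messages
corpus_clean <- tm_map(corpus_clean, removeNumbers)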
Note that the preceding code did not use the content_transformer() function. This is because removeNumbers() is built into tm along with several other mapping functions that do not need to be wrapped.
3/Our next task is to remove filler words such as to, and, but, and or from the SMS messages.
These terms are known as stop words.
Rather than define a list of stop words ourselves, we'll use the stopwords() function provided by the tm package. This function allows us to access sets of stop words from various languages. By default, common English language stop words are used.
The stop words alone are not a transformation. What we need is a way to remove any words that appear in the stop words list. The solution lies in the removeWords() function, which is a transformation included with the tm package. As we have done before, we'll use the tm_map() function to apply this mapping to the data, providing the stopwords() function as a parameter to indicate exactly the words we would like to remove. The full command is as follows:
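A sketch of this command:
# remove common English stop words such as to, and, but, and or
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())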
4/Continuing our cleanup process, we can also eliminate any punctuation from the text messages using the built-in removePunctuation() transformation:
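For instance:
# strip punctuation characters from the messages
corpus_clean <- tm_map(corpus_clean, removePunctuation)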
To work around the default behavior of removePunctuation(), which strips punctuation characters entirely and can merge adjacent words together, create a custom function that replaces rather than removes punctuation characters:
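One way to write such a function (a sketch; the regular expression matches any punctuation-class character):
# substitute punctuation with a blank space rather than deleting it outright
replacePunctuation <- function(x) {
  gsub("[[:punct:]]+", " ", x)
}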
This uses R's gsub() function to substitute any punctuation characters in x with a blank space. This replacePunctuation() function can then be used with tm_map() as with other transformations.
5/Another common standardization for text data involves reducing words to their root form in a process called stemming, which transforms variants such as learned, learning, and learns into their base form, learn.
The tm package provides stemming functionality via integration with the SnowballC package.
The SnowballC package provides a wordStem() function, which, for a character vector, returns the same vector of terms in its root form. For example, the function correctly stems the variants of the word learn as described previously.
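For example, assuming the SnowballC package is installed:
library(SnowballC)
# all four variants reduce to the root form "learn"
wordStem(c("learn", "learned", "learning", "learns"))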
Note: To apply the wordStem() function to an entire corpus of text documents, the tm package includes a stemDocument() transformation. We apply this to our corpus with the tm_map() function exactly as before:
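A sketch of this command:
# stem every document in the corpus
corpus_clean <- tm_map(corpus_clean, stemDocument)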
6/After removing numbers, stop words, and punctuation, and also performing stemming, the text messages are left with the blank spaces that once separated the now-missing pieces. Therefore, the final step in our text cleanup process is to remove additional whitespace using the built-in stripWhitespace() transformation:
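For instance:
# collapse the extra blank spaces left behind by the earlier steps
corpus_clean <- tm_map(corpus_clean, stripWhitespace)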
The following table shows the first three messages in the SMS corpus before and after the cleaning process. The messages have been limited to the most interesting words, and punctuation and capitalization have been removed.
As you might assume, the tm package provides functionality to tokenize the SMS message corpus. The DocumentTermMatrix() function takes a corpus and creates a data structure called a document-term matrix (DTM) in which rows indicate documents (SMS messages) and columns indicate terms (words).
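Creating the DTM from the cleaned corpus is then a single command (a sketch using the corpus_clean object built above):
# tokenize the cleaned corpus into a document-term matrix
sms_dtm <- DocumentTermMatrix(corpus_clean)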
Note: A term-document matrix (TDM) is simply a transposed DTM, in which the rows are terms and the columns are documents. A TDM is sometimes used for short texts spread across many documents.
Note: The DTM is a sparse matrix; the vast majority of its cells are filled with zeros.
On the other hand, if we hadn't already performed the preprocessing, we could do so here by
providing a list of control parameter options to override the defaults. For example, to create a DTM
directly from the raw, unprocessed SMS corpus, we can use the following command:
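A sketch of such a command; the control options mirror the cleanup steps performed earlier:
sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))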
This applies the same preprocessing steps to the SMS corpus in the same order as done earlier. However, comparing sms_dtm to sms_dtm2, we see a slight difference in the number of terms in the matrix:
To force the two prior DTMs to be identical, we can override the default stop words function with our
own that uses the original replacement function. Simply replace stopwords = TRUE with the
following:
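That is, within the control list, something along these lines:
# apply removeWords() with the default stop word list instead of stopwords = TRUE
stopwords = function(x) { removeWords(x, stopwords()) }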
We'll divide the data into two portions: 75 percent for training and 25 percent for testing. Since the SMS messages are sorted in a random order, we can simply take the first 4,169 for training and leave the remaining 1,390 for testing. Thankfully, the DTM object acts very much like a data frame and can be split using the standard [row, col] operations. As our DTM stores SMS messages as rows and words as columns, we must request a specific range of rows and all columns for each:
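A sketch of the split; the label vectors are taken from the raw data frame, and the names sms_train_labels and sms_test_labels are assumptions reused in the later steps:
# first 4,169 messages for training, remaining 1,390 for testing
sms_dtm_train <- sms_dtm[1:4169, ]
sms_dtm_test  <- sms_dtm[4170:5559, ]
# corresponding class labels from the raw data frame
sms_train_labels <- sms_raw[1:4169, ]$type
sms_test_labels  <- sms_raw[4170:5559, ]$type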
To confirm that the subsets are representative of the complete set of SMS data, let's compare the
proportion of spam in the training and test data frames:
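For example:
# proportion of ham and spam in each subset
prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))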
The wordcloud package provides a simple R function to create this type of diagram.
A word cloud can be created directly from a tm corpus object using the syntax:
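A minimal sketch, assuming the wordcloud package is installed; the min.freq value of 50 is an arbitrary starting point:
library(wordcloud)
# plot words appearing at least 50 times, most frequent words near the center
wordcloud(corpus_clean, min.freq = 50, random.order = FALSE)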
This will create a word cloud from our prepared SMS corpus. Since we specified random.order = FALSE, the cloud will be arranged in non-random order, with higher-frequency words placed closer to the center. If we do not specify random.order, the cloud will be arranged randomly by default.
You might get a warning message noting that R was unable to fit all of the words on the figure. If so, try increasing min.freq to reduce the number of words in the cloud. It might also help to use the scale parameter to reduce the font size.
Let's use R's subset() function to take a subset of the sms_raw data by the SMS type. First, we'll
create a subset where the type is spam:
The scale parameter allows us to adjust the maximum and minimum font size for words in the
cloud. This is illustrated in the following code:
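Putting the subsetting and the scale parameter together, a sketch might look like this (max.words = 40 is an assumption for illustration):
# split the raw messages by type
spam <- subset(sms_raw, type == "spam")
ham  <- subset(sms_raw, type == "ham")
# clouds of the 40 most common words, with font sizes between 0.5 and 3
wordcloud(spam$text, max.words = 40, scale = c(3, 0.5))
wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))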
Finding frequent words requires the use of the findFreqTerms() function in the tm package. This function takes a DTM and returns a character vector containing the words that appear at least a minimum number of times. For instance, the following command displays the words appearing at least five times in the sms_dtm_train matrix:
The result of the function is a character vector, so let's save our frequent words for later:
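A sketch of these two steps:
# save the words appearing in at least five training messages
sms_freq_words <- findFreqTerms(sms_dtm_train, 5)
str(sms_freq_words)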
A peek into the contents of the vector shows us that there are 1,139 terms appearing in at least five SMS messages:
We now need to filter our DTM to include only the terms appearing in the frequent word vector. As before, we'll use data frame style [row, col] operations to request specific sections of the DTM, noting that the columns are named after the words the DTM contains. We can take advantage of this fact to limit the DTM to specific words. Since we want all rows but only the columns representing the words in the sms_freq_words vector, our commands are:
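A sketch of these commands; the names sms_dtm_freq_train and sms_dtm_freq_test are assumed here and carried through the next step:
# keep all rows but only the columns for the frequent words
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
sms_dtm_freq_test  <- sms_dtm_test[ , sms_freq_words]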
The training and test datasets now include 1,139 features, which correspond to words appearing in
at least five messages.
The Naive Bayes classifier is usually trained on data with categorical features. This poses a
problem, since the cells in the sparse matrix are numeric and measure the number of times a word
appears in a message. We need to change this to a categorical variable that simply indicates yes or
no, depending on whether the word appears at all. The following defines a convert_counts()
function to convert counts to Yes or No strings:
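A sketch of such a function:
# convert numeric word counts to a categorical "Yes"/"No" indicator
convert_counts <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
  return(x)
}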
By now, some of the pieces of the preceding function should look familiar. The first line defines the function. The statement ifelse(x > 0, "Yes", "No") transforms the values in x such that if the value is greater than 0, then it will be replaced with "Yes"; otherwise, it will be replaced by a "No" string. Lastly, the newly transformed vector x is returned.
We now need to apply convert_counts() to each of the columns in our sparse matrix. You may be able to guess the R function to do exactly this. The function is simply called apply() and is used much like lapply() was used previously.
The apply() function allows a function to be used on each of the rows or columns in a matrix. It uses a MARGIN parameter to specify either rows or columns. Here, we'll use MARGIN = 2, since we're interested in the columns (MARGIN = 1 is used for rows). The commands to convert the training and test matrices are as follows:
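A sketch of these commands, using the frequent-word matrices created above:
# convert the columns of the training and test DTMs to "Yes"/"No" values
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)
sms_test  <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)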
The result will be two character-type matrices, each with cells indicating "Yes" or "No" for whether
the word represented by the column appears at any point in the message represented by the row.
Many machine learning approaches are implemented in more than one R package, and Naive Bayes is no exception. One other option is NaiveBayes() in the klaR package, which is nearly identical to the one in the e1071 package.
Unlike the k-NN algorithm we used for classification in the previous chapter, training a Naive Bayes
learner and using it for classification occur in separate stages. Still, as shown in the following table,
these steps are fairly straightforward:
To build our model on the sms_train matrix, we'll use the following command:
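A sketch of the training step, assuming the e1071 package and the label vector created during the split:
library(e1071)
# estimate P(word | spam) and P(word | ham) for every frequent word
sms_classifier <- naiveBayes(sms_train, sms_train_labels)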
The sms_classifier variable now contains a naiveBayes classifier object that can be used to make predictions.
The predict() function is used to make the predictions. We will store these in a vector named
sms_test_pred. We simply supply this function with the names of our classifier and test dataset as
shown:
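For example:
# classify each message in the test set as spam or ham
sms_test_pred <- predict(sms_classifier, sms_test)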
To compare the predictions to the true values, we'll use the CrossTable() function in the gmodels
package, which we used previously. This time, we'll add some additional parameters to eliminate
unnecessary cell proportions, and use the dnn parameter (dimension names) to relabel the rows
and columns as shown in the following code:
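A sketch of the command, using the assumed sms_test_labels vector as the true values:
library(gmodels)
# cross-tabulate predicted versus actual classes, suppressing extra proportions
CrossTable(sms_test_pred, sms_test_labels,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))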
Looking at the table, we can see that a total of only 6 + 30 = 36 of 1,390 SMS messages were incorrectly classified (2.6 percent). Among the errors were six out of 1,207 ham messages that were misidentified as spam, and 30 of 183 spam messages that were incorrectly labeled as ham.
Considering the little effort that we put into the project, this level of performance seems quite
impressive. This case study exemplifies the reason why Naive Bayes is so often used for text
classification: directly out of the box, it performs surprisingly well.
On the other hand, the six legitimate messages that were incorrectly classified as spam could
cause significant problems for the deployment of our filtering algorithm, because the filter could
cause a person to miss an important text message. We should investigate to see whether we can
slightly tweak the model to achieve better performance.
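One simple tweak is to retrain the model with a Laplace estimator, as discussed earlier in the chapter. A sketch of this step, evaluated in the same way as before:
# retrain with a Laplace estimator of 1 and classify the test messages again
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)
sms_test_pred2 <- predict(sms_classifier2, sms_test)
CrossTable(sms_test_pred2, sms_test_labels,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))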
Adding the Laplace estimator reduced the number of false positives (ham messages erroneously
classified as spam) from six to five, and the number of false negatives from 30 to 28. Although this
seems like a small change, it's substantial considering that the model's accuracy was already quite
impressive. We'd need to be careful before tweaking the model too much more, as it is important to
maintain a balance between being overly aggressive and overly passive when filtering spam. Users
would prefer that a small number of spam messages slip through the filter rather than an alternative
in which ham messages are filtered too aggressively.
Summary
In this chapter, we learned about classification using Naive Bayes. This algorithm constructs tables
of probabilities that are used to estimate the likelihood that new examples belong to various
classes. The probabilities are calculated using a formula known as Bayes' theorem, which specifies
how dependent events are related. Although Bayes' theorem can be computationally expensive, a
simplified version that makes so-called "naive" assumptions about the independence of features is
capable of handling much larger datasets.
The Naive Bayes classifier is often used for text classification. To illustrate its effectiveness, we
employed Naive Bayes on a classification task involving spam SMS messages. Preparing the text
data for analysis required the use of specialized R packages for text processing and visualization.
Ultimately, the model was able to classify over 97 percent of all the SMS messages correctly as
spam or ham.
In the next chapter, we will examine two more machine learning methods. Each performs
classification by partitioning data into groups of similar values.